Corpus (pl Corpora) and TEFL

+01 424 645 5957

+39 347 378 8169

Linguistics, Technology & TEFL

A corpus (plural corpora) is a large collection of written texts or transcribed speech, used in computational linguistics to analyze how language is actually used. Corpora are most often analyzed using a concordancer.
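To give a rough idea of what a concordancer does, here is a minimal sketch in Python. It is not the code of any particular concordancing tool: the function name `concordance`, the `width` parameter and the sample text are all made up for illustration. It simply lists every occurrence of a keyword with a few words of context on either side.

```python
import re

def concordance(text, keyword, width=2):
    """Return keyword-in-context lines: up to `width` words either side of each hit."""
    # Very simple tokenizer (an assumption): runs of letters and apostrophes.
    tokens = re.findall(r"[A-Za-z']+", text)
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines

# Made-up sample text, not taken from a real corpus.
sample = "The team is winning. The team are happy. Our team is small."
for line in concordance(sample, "team"):
    print(line)
```

Real concordancers work on millions of words and usually respect sentence boundaries, which this sketch ignores.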

Types of Corpora

A corpus can be one or more of the following:

monolingual
multilingual
general texts
texts on a specific subject or genre, e.g. scientific papers, Shakespeare's plays, or children's essays
texts from a specific variety of English, e.g. American English or British English

Analysis of a corpus will bring to light patterns of language use within the group it represents. For example, it may well show that scientific papers use the passive voice far more often than newspapers do, or that certain words are used only among certain groups of speakers.
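A comparison like this usually relies on normalized frequencies, since corpora differ in size. The sketch below is only an illustration: the two tiny "corpora" are invented sentences, and the helper name `per_thousand` is an assumption rather than part of any corpus tool.

```python
import re
from collections import Counter

def per_thousand(text, word):
    """Occurrences of `word` per 1,000 tokens of `text` (case-insensitive)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return 1000 * Counter(tokens)[word] / len(tokens)

# Invented stand-ins for two sub-corpora; real ones would hold millions of words.
science = "the sample was heated and the result was recorded"
news = "the minister said the plan was ready"

for name, text in [("science", science), ("news", news)]:
    print(name, round(per_thousand(text, "was"), 1))
```

Normalizing by corpus size is what makes the two figures comparable. Raw counts alone would favor whichever corpus happens to be larger.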

Methods of Analysis

Corpora are generally searched and analyzed using computers which are able to search and compare millions of text strings in virtually no time. However, computer analysis does sometimes have drawbacks. For example, take these two sentences:

Time flies like an arrow.
Fruit flies like a banana.

Whilst a human can easily distinguish between the two uses of the words flies and like, a computer working only with raw text strings cannot reliably do so. To get around this, corpora are often tagged or annotated. Typically this would involve human operators giving part-of-speech tags to words before they are processed and compared by the computer, thus:

Time [noun] flies [verb] like [preposition] an [determiner] arrow [noun].
Fruit [adjective] flies [noun] like [verb] a [determiner] banana [noun].

This allows, for example, a concordancer to analyze all uses of like as a verb as opposed to like as a preposition.
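As a minimal sketch of how tags make this kind of search possible, the Python below stores each sentence as a list of (word, tag) pairs and returns only the sentences where a word carries a given tag. The data layout, the tag names and the function name `find_tagged` are assumptions made for this example, not the format of any real tagged corpus.

```python
# Each sentence is a list of (word, tag) pairs. In this sketch, "like" in the
# first sentence is tagged as a preposition; the tag names are illustrative.
tagged = [
    [("Time", "noun"), ("flies", "verb"), ("like", "preposition"),
     ("an", "determiner"), ("arrow", "noun")],
    [("Fruit", "adjective"), ("flies", "noun"), ("like", "verb"),
     ("a", "determiner"), ("banana", "noun")],
]

def find_tagged(sentences, word, tag):
    """Return the sentences (as plain strings) in which `word` carries `tag`."""
    hits = []
    for sentence in sentences:
        if any(w.lower() == word and t == tag for w, t in sentence):
            hits.append(" ".join(w for w, _ in sentence))
    return hits

print(find_tagged(tagged, "like", "verb"))
print(find_tagged(tagged, "flies", "verb"))
```

Without the tags, a plain text search for like or flies would return both sentences every time.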

In the Classroom

Students can work with corpora in the classroom, for example by using a concordancer, under the guidance of a teacher. This allows them to see how language is used by native speakers in everyday situations. As a teacher, you may be asked questions like, “Do we say the team is or the team are?” If this happens and you have access to the internet, you can have your students find out for themselves and work out which is more appropriate and when.

Incidentally, an online search of the BNC (British National Corpus) shows 109 occurrences of the team is and just 37 occurrences of the team are. Without going into further analysis, this should tell your students that the team is is roughly three times more common than the team are in this corpus, so, given the choice, it is the safer option.
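The arithmetic behind that comparison is simple enough for students to check themselves. A short sketch using the two counts quoted above (the variable names are just for illustration):

```python
# Counts quoted above from an online search of the BNC.
team_is = 109
team_are = 37

ratio = team_is / team_are                  # how many times more frequent
share_is = team_is / (team_is + team_are)   # proportion of all hits

print(f"ratio: {ratio:.2f}")     # ratio: 2.95
print(f"share: {share_is:.1%}")  # share: 74.7%
```

So the team is accounts for about three quarters of the hits. The counts alone say nothing about the contexts in which each form appears.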

Notable Corpora

The British National Corpus is a 100-million-word text corpus of samples of written and spoken English from a wide range of sources. The corpus covers British English of the late twentieth century from a wide variety of genres with the intention that it be a representative sample of spoken and written British English of that time.

Of the two parts to the 10-million word spoken corpus, one is a demographic part, containing transcriptions of spontaneous natural conversations made by members of the public and the other a context-governed part, containing transcriptions of recordings made at specific types of meetings and events. All the original recordings transcribed for inclusion in the BNC have been deposited at the National Sound Archives of the British Library.

The corpus is marked up following the recommendations of the Text Encoding Initiative and includes full linguistic annotation and contextual information. The most recent edition, from March 2007, is distributed in XML format along with the XAIRA software. It is freely available under a license and is very widely distributed.
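To show roughly what working with an XML-annotated corpus involves, here is a minimal Python sketch that reads word, lemma and part-of-speech information from a small XML fragment. The fragment is invented for illustration: the element names (`s`, `w`, `c`) and attribute names (`hw`, `pos`) are assumptions loosely modeled on word-level markup and should not be taken as the exact BNC schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: one sentence, each word wrapped in a <w> element
# carrying a lemma ("hw") and a part-of-speech label ("pos").
fragment = (
    '<s n="1">'
    '<w pos="SUBST" hw="time">Time </w>'
    '<w pos="VERB" hw="fly">flies </w>'
    '<w pos="PREP" hw="like">like </w>'
    '<w pos="ART" hw="an">an </w>'
    '<w pos="SUBST" hw="arrow">arrow</w>'
    '<c>.</c>'
    '</s>'
)

sentence = ET.fromstring(fragment)
# Collect (word, lemma, part of speech) for every <w> element.
words = [(w.text.strip(), w.get("hw"), w.get("pos")) for w in sentence.iter("w")]

for word, lemma, pos in words:
    print(word, lemma, pos)
```

Because the annotation travels with the text, any standard XML library can read it and no special software is strictly required.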

The BNC can be searched online for specific words or phrases.

The American National Corpus is a paid membership-based collaboration with the aim of creating an electronic text corpus of American English. The collection will include text and transcripts of spoken data produced from 1990, with the goal of a 100 million word corpus.

ANC Consortium members include publishers, software companies, and academic members. Consortium members have exclusive access throughout the development period and for five years after the first installment of the corpus. The First Release of the American National Corpus (ANC) was made available in mid-fall, 2003. The data includes approximately 11 million words of American English, including written and spoken data and a variety of text types annotated for part of speech and lemma. The corpus is provided in XML format conformant to the XML Corpus Encoding Standard (XCES).
