Simply put, Corpus Linguistics is the study of language using computer programs that analyze millions of lines of text held in a corpus (pl. corpora).
To begin with, masses of language samples are collected from newspapers, books, transcripts of the spoken word, and so on. These can then be marked up, that is, tagged to show the part of speech of each word. For example:
Life + is + a + long + song.
[noun] + [verb] + [article] + [adjective] + [noun] *
* Note that this is just one way of marking up a sentence; other tagging schemes exist.
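The markup step above can be sketched in a few lines of Python. This is only a toy illustration: the lexicon is a hypothetical, hand-made table covering this one sentence, and the names TOY_LEXICON and tag_sentence are invented for the example. Real taggers rely on far larger resources and on context.

```python
# Hypothetical hand-made lexicon: covers only the words of the example sentence.
TOY_LEXICON = {
    "life": "noun",
    "is": "verb",
    "a": "article",
    "long": "adjective",
    "song": "noun",
}


def tag_sentence(sentence):
    """Return (word, tag) pairs; words missing from the lexicon are tagged 'unknown'."""
    words = sentence.rstrip(".!?").split()
    return [(word, TOY_LEXICON.get(word.lower(), "unknown")) for word in words]


tagged = tag_sentence("Life is a long song.")
# Show the tags in the same bracketed layout as the example above.
print(" + ".join(f"[{tag}]" for _, tag in tagged))
```

A lookup table like this cannot cope with words that change part of speech depending on context (long can also be a verb, as in to long for something), which is one reason there is more than one way to mark up a sentence.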
Special software programs called concordancers can then be used to search through the corpus to find patterns. These patterns are then used to describe the language.
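As a rough idea of what a concordancer does, here is a minimal sketch that lists every occurrence of a word with a little context on either side (often called a keyword-in-context view). The three-line corpus and the function name concordance are made up for illustration; real concordancers work on far larger corpora and offer many more search options.

```python
import re


def concordance(lines, keyword, width=20):
    """Return one row per hit: left context, the keyword in brackets, right context."""
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    rows = []
    for line in lines:
        for match in pattern.finditer(line):
            left = line[max(0, match.start() - width):match.start()].rstrip()
            right = line[match.end():match.end() + width].lstrip()
            # Right-align the left context so the keywords line up in a column.
            rows.append(f"{left:>{width}} [{match.group()}] {right}")
    return rows


# Invented sample lines standing in for a real corpus.
corpus = [
    "Life is a long song.",
    "It was a long and winding road.",
    "How long is the song?",
]

for row in concordance(corpus, "long"):
    print(row)
```

Lining the keyword up in a column makes it easy to scan what tends to come immediately before and after it.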
As a very simple example, a concordancer could search through a corpus for occurrences of adjectives and where they appear in relation to other parts of speech. It would soon find that they usually come directly before a noun:
[adjective] + [noun]
This, then, could be suggested as a rule of how language works… until an exception turns up (for example, The song is long, where the adjective follows the verb) and the rule has to be tweaked to fit the new findings.
Of course, this kind of painstaking search through millions of lines of text is only practical with computer power. However, remember that corpus linguistics is not really about collecting data but about interpreting and analyzing that data and the searches made on it.
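The search for a pattern, and the discovery of an exception to it, can be sketched as follows. The tiny tagged corpus is hand-made and hypothetical, as is the function name tags_after; the idea is simply to count which tag follows each adjective.

```python
from collections import Counter

# Hypothetical hand-tagged sentences standing in for a real tagged corpus.
TAGGED_CORPUS = [
    [("Life", "noun"), ("is", "verb"), ("a", "article"),
     ("long", "adjective"), ("song", "noun")],
    [("She", "pronoun"), ("sang", "verb"), ("a", "article"),
     ("sad", "adjective"), ("song", "noun")],
    [("The", "article"), ("song", "noun"), ("is", "verb"),
     ("long", "adjective")],
]


def tags_after(tagged_sentences, target_tag):
    """Count the tag that immediately follows each word carrying target_tag."""
    counts = Counter()
    for sentence in tagged_sentences:
        for i, (_, tag) in enumerate(sentence):
            if tag == target_tag:
                if i + 1 < len(sentence):
                    counts[sentence[i + 1][1]] += 1
                else:
                    counts["end of sentence"] += 1
    return counts


print(tags_after(TAGGED_CORPUS, "adjective"))
```

In this toy data two adjectives are followed by a noun and one ends its sentence, so the [adjective] + [noun] pattern describes most cases but not all of them. That is exactly the kind of finding that leads to a rule being refined.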
Corpus Linguistics & TEFL
Corpus Linguistics has affected TEFL in a number of ways. Most notably, it has provided a set of real-life rules (although "rules" seems rather a strong word for something which is often contradicted) which tell us how language works. These rules then make their way into grammar books, dictionaries, TEFL coursebooks and so on.
However, closer to home, software and corpora have become available online, and now anyone with an internet connection can search through millions of lines of text and come up with their own rules. This is incredibly useful for students of English, who can work out for themselves how language works (and, needless to say, an answer that someone finds for themselves rather than being told is much more firmly embedded and useful to them).
Corpus (pl Corpora) and Language Learning – about the corpora being used in corpus linguistics
CALL – Computer Assisted Language Learning – a general look at using computers to teach English
Concordancers – the software used to search the corpora
n-grams and TEFL – looking at corpora