Computational Linguistics Knowledge Base
Learn about the statistical methods, formulas, and concepts used in corpus linguistics research.
KWIC Concordance
What is it?
KWIC (Key Word In Context) is a display format for concordance lines: each occurrence of a search term appears in a centered column, with its surrounding context to the left and right. It's one of the most fundamental tools in corpus linguistics.
Use Cases
- Examining word usage patterns in authentic contexts
- Identifying collocations and co-occurrence patterns
- Studying semantic prosody (positive/negative associations)
- Analyzing grammatical constructions
- Comparing language use across registers or time periods
How to Interpret
When analyzing KWIC lines:
- Left context: Words that typically precede your search term
- Right context: Words that typically follow your search term
- Sorting: Sort by left/right context to reveal patterns
- Frequency: Look for recurring patterns across multiple lines
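The KWIC format above can be sketched in a few lines of Python. This is a minimal illustration (the `kwic` function name and the toy sentence are mine, not from any particular tool):

```python
def kwic(tokens, keyword, width=4):
    """Return KWIC lines as (left context, keyword, right context) tuples."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

tokens = "the cat sat on the mat and the cat slept".split()
for left, kw, right in kwic(tokens, "cat", width=2):
    # right-align the left context so the keyword column lines up
    print(f"{left:>12} | {kw} | {right}")
```

Sorting the returned tuples by their first or third element implements the left/right-context sorting described above.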
Collocation Analysis
What is it?
Collocation analysis identifies words that frequently co-occur. Statistical measures help distinguish meaningful associations from chance co-occurrence.
Statistical Measures
1. Mutual Information (MI)
Formula:
MI = log₂(O₁₂ / E₁₂)
Where O₁₂ = observed frequency, E₁₂ = expected frequency by chance
Interpretation:
- MI ≥ 3: Strong association (at MI = 3, the pair occurs 8× more often than expected by chance)
- MI 0-3: Weak to moderate association
- MI < 0: Negative association (words avoid each other)
Best for: Finding exclusive collocations, technical terms, idioms
Limitation: Favors low-frequency pairs; unreliable for very rare words
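A minimal sketch of the MI calculation, with the expected frequency derived from the individual word frequencies under independence. The counts below are hypothetical:

```python
import math

def mutual_information(f_xy, f_x, f_y, n):
    """Pointwise MI: log2(observed / expected co-occurrence frequency)."""
    expected = f_x * f_y / n  # E12 under the independence assumption
    return math.log2(f_xy / expected)

# hypothetical counts: word x occurs 100 times, word y 50 times,
# and they co-occur 40 times in a 1,000,000-token corpus
mi = mutual_information(40, 100, 50, 1_000_000)
print(round(mi, 2))  # far above the MI >= 3 threshold
```

Note how the "MI = 3 means 8× more than expected" rule of thumb falls straight out of the base-2 logarithm.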
2. T-score
Formula:
t = (O₁₂ - E₁₂) / √O₁₂
Interpretation:
- t ≥ 2.0: Statistically significant association (95% confidence)
- t ≥ 3.0: Highly significant (99.7% confidence)
- Higher values = stronger evidence of non-random co-occurrence
Best for: General collocations, frequent word pairs
Advantage: More reliable for high-frequency pairs than MI
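The t-score uses the same observed and expected counts as MI but normalizes the difference by √O, so frequent pairs are rewarded rather than penalized. A sketch with hypothetical counts:

```python
import math

def t_score(f_xy, f_x, f_y, n):
    """t = (O - E) / sqrt(O), the standard corpus-linguistics approximation."""
    expected = f_x * f_y / n
    return (f_xy - expected) / math.sqrt(f_xy)

# same hypothetical counts as the MI discussion above
print(round(t_score(40, 100, 50, 1_000_000), 2))  # well above the 2.0 threshold
```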
3. Dice Coefficient
Formula:
Dice = 2 × f(x,y) / (f(x) + f(y))
Where f(x,y) = co-occurrence frequency, f(x) and f(y) = individual frequencies
Interpretation:
- Range: 0 to 1 (0 = never co-occur, 1 = always co-occur)
- 0.7-1.0: Very strong association
- 0.4-0.7: Moderate association
- 0-0.4: Weak association
Best for: Symmetrical associations, comparing collocation strength across different corpus sizes
Advantage: Easy to interpret (percentage-like scale)
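The Dice coefficient needs only the three raw frequencies, which makes the sketch trivial (counts again hypothetical):

```python
def dice(f_xy, f_x, f_y):
    """Dice = 2 * f(x,y) / (f(x) + f(y)), in the range 0..1."""
    return 2 * f_xy / (f_x + f_y)

# 40 co-occurrences, individual frequencies 100 and 50
print(round(dice(40, 100, 50), 3))  # moderate association on the 0-1 scale
```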
4. Log-likelihood (LL)
Formula:
LL = 2 × Σ(O × ln(O/E))
Sum over all cells in the contingency table
Interpretation:
- LL ≥ 15.13: Highly significant (p < 0.0001)
- LL ≥ 10.83: Very significant (p < 0.001)
- LL ≥ 6.63: Significant (p < 0.01)
- LL ≥ 3.84: Marginally significant (p < 0.05)
Best for: Statistical significance testing, large corpora
Advantage: More reliable than chi-square for corpus linguistics
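The log-likelihood statistic sums O·ln(O/E) over all four cells of the 2×2 contingency table. A self-contained sketch (cell labels are mine; o11 is the co-occurrence count):

```python
import math

def log_likelihood(o11, o12, o21, o22):
    """LL = 2 * sum(O * ln(O/E)) over the four cells of a 2x2 table."""
    row1, row2 = o11 + o12, o21 + o22
    col1, col2 = o11 + o21, o12 + o22
    n = row1 + row2
    ll = 0.0
    for o, e in [(o11, row1 * col1 / n), (o12, row1 * col2 / n),
                 (o21, row2 * col1 / n), (o22, row2 * col2 / n)]:
        if o > 0:  # 0 * ln(0) is taken as 0 by convention
            ll += o * math.log(o / e)
    return 2 * ll

# a perfectly independent table scores 0; a skewed one scores high
print(round(log_likelihood(40, 60, 10, 890), 2))
```

Compare the result against the critical values listed above (e.g., 15.13 for p < 0.0001).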
Choosing the Right Measure
- MI: Use for finding exclusive, specialized collocations (technical terms, idioms)
- T-score: Use for general purpose collocation analysis with frequent words
- Dice: Use when you need an intuitive 0-1 scale or comparing across corpora
- Log-likelihood: Use for statistical hypothesis testing or very large corpora
N-grams
What are they?
N-grams are contiguous sequences of n items (words) from a text. They capture multi-word patterns and phraseological units.
Types
- Bigrams (2-grams): Two-word sequences (e.g., "corpus linguistics", "data analysis")
- Trigrams (3-grams): Three-word sequences (e.g., "in order to", "on the other hand")
- 4-grams and beyond: Longer phrases (e.g., "at the end of the day")
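Extracting n-grams is a simple sliding window over the token list. A minimal sketch, with frequency counting via the standard library:

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-token sequences, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "on the other hand on the other side".split()
bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts.most_common(2))
```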
Applications
- Identifying formulaic language and fixed expressions
- Studying phraseology and lexical bundles
- Analyzing grammatical patterns (e.g., "it is important to")
- Register and genre analysis (different texts use different n-grams)
- Language teaching (common phrases students should learn)
Lexical Diversity Metrics
What is it?
Lexical diversity (also called lexical richness or vocabulary diversity) measures how varied the vocabulary is in a text. Higher diversity indicates more varied word choice.
1. Type-Token Ratio (TTR)
Formula:
TTR = (Number of unique words / Total words) × 100
Interpretation:
- Range: 0-100%
- Higher TTR = more diverse vocabulary
- 60-80%: High diversity (academic writing, literary texts)
- 40-60%: Moderate diversity (news, conversation)
- Below 40%: Lower diversity (repetitive texts)
Limitation: Heavily influenced by text length (longer texts = lower TTR)
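TTR is a one-liner, which also makes its length sensitivity easy to see: as a text grows, tokens accumulate faster than new types. A sketch:

```python
def ttr(tokens):
    """Type-token ratio as a percentage: unique words / total words * 100."""
    return len(set(tokens)) / len(tokens) * 100

print(ttr("the cat sat on the mat".split()))  # 5 types over 6 tokens
```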
2. Standardized TTR (STTR)
Method:
Calculate TTR for consecutive chunks (e.g., every 1000 words), then average
Advantage: More stable across different text lengths than simple TTR
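The chunk-and-average method can be sketched as follows. One assumption here: the incomplete final chunk is discarded, which is how common corpus tools typically handle it:

```python
def sttr(tokens, chunk=1000):
    """Mean TTR (as %) over consecutive full chunks of `chunk` tokens;
    the incomplete final chunk is discarded."""
    ttrs = [len(set(tokens[i:i + chunk])) / chunk * 100
            for i in range(0, len(tokens) - chunk + 1, chunk)]
    return sum(ttrs) / len(ttrs) if ttrs else 0.0
```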
3. Measure of Textual Lexical Diversity (MTLD)
Method:
Measures average length of sequential word strings that maintain a criterion TTR (default 0.72)
Interpretation:
- Higher values = more diverse vocabulary
- 80-100+: High diversity
- 50-80: Moderate diversity
- Below 50: Lower diversity
Advantage: Length-independent; reliable for comparisons
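The factor-counting idea behind MTLD can be sketched as a single directional pass. Note this is a simplification: the published MTLD averages a forward and a backward pass, and the partial-factor handling below follows the usual description:

```python
def mtld_forward(tokens, threshold=0.72):
    """One directional MTLD pass: text length / number of factors, where a
    factor ends each time the running TTR drops to the threshold."""
    factors = 0.0
    types, count = set(), 0
    for tok in tokens:
        types.add(tok)
        count += 1
        if len(types) / count <= threshold:
            factors += 1          # a full factor is complete; reset
            types, count = set(), 0
    if count > 0:                 # credit the leftover segment partially
        running_ttr = len(types) / count
        factors += (1 - running_ttr) / (1 - threshold)
    return len(tokens) / factors if factors else 0.0
```

More varied texts complete factors more slowly, so they score higher, matching the interpretation table above.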
4. HD-D (Hypergeometric Distribution D)
Method:
Uses probability of encountering new word types based on hypergeometric distribution
Range: 0 to 1
Advantage: Mathematically sound; handles text length variation well
5. Yule's K
Method:
Measures the chance of two randomly selected words being the same
Interpretation:
- LOWER values = MORE diverse (inverse of other metrics)
- Below 100: Very high diversity
- 100-200: High diversity
- 200-300: Moderate diversity
- Above 300: Lower diversity
Use case: Authorship attribution, register analysis
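Yule's K is usually computed from the frequency-of-frequencies spectrum: K = 10⁴ × (Σ m²·Vₘ − N) / N², where Vₘ is the number of word types occurring exactly m times and N is the token count. A sketch:

```python
from collections import Counter

def yules_k(tokens):
    """Yule's K = 10^4 * (sum(m^2 * V_m) - N) / N^2, where V_m counts the
    word types that occur exactly m times."""
    n = len(tokens)
    freq_of_freqs = Counter(Counter(tokens).values())
    s2 = sum(m * m * vm for m, vm in freq_of_freqs.items())
    return 10_000 * (s2 - n) / (n * n)

# all-distinct tokens give K = 0 (maximal diversity); pure repetition scores high
print(yules_k(list("abcdefghij")), yules_k(["a"] * 10))
```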
Applications
- Assessing writing quality and sophistication
- Comparing vocabulary richness across genres
- Tracking language development in L2 learners
- Authorship attribution and stylometry
- Detecting simplified or controlled language
Readability Indices
What are they?
Readability indices estimate how difficult a text is to read, often expressed as a grade level (U.S. education system). They use surface features like sentence length and word complexity.
1. Flesch Reading Ease
Formula:
206.835 - 1.015(words/sentences) - 84.6(syllables/words)
Interpretation:
- 90-100: Very easy (5th grade)
- 60-70: Standard (8th-9th grade)
- 30-50: Difficult (college level)
- 0-30: Very difficult (graduate level)
Note: Higher score = easier to read
2. Flesch-Kincaid Grade Level
Formula:
0.39(words/sentences) + 11.8(syllables/words) - 15.59
Interpretation:
Output is U.S. grade level (e.g., 8.0 = 8th grade, 13.0 = college freshman)
3. Gunning Fog Index
Formula:
0.4 × [(words/sentences) + 100(complex words/words)]
Complex words = 3+ syllables
Interpretation:
- 6: Easy (6th grade)
- 12: High school senior
- 17+: College graduate level
4. SMOG Index
Formula:
1.0430 × √(polysyllables × 30/sentences) + 3.1291
Polysyllables = words with 3+ syllables
Best for: Health materials, consumer documents
5. Coleman-Liau Index
Formula:
0.0588L - 0.296S - 15.8
L = average letters per 100 words, S = average sentences per 100 words
Advantage: Uses characters instead of syllables (easier to compute)
6. Automated Readability Index (ARI)
Formula:
4.71(characters/words) + 0.5(words/sentences) - 21.43
Use: Originally designed for real-time readability on electric typewriters
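Three of the formulas above, sketched as functions over pre-computed counts. One caveat: syllable counting needs its own heuristic (or a pronunciation dictionary), so these take word, sentence, syllable, and character counts as inputs:

```python
def flesch_reading_ease(words, sentences, syllables):
    """Higher score = easier text (90-100 very easy, 0-30 very difficult)."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words, sentences, syllables):
    """Output is a U.S. grade level."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def ari(characters, words, sentences):
    """Automated Readability Index, also a U.S. grade level."""
    return 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43

# hypothetical text: 100 words, 5 sentences, 150 syllables, 450 characters
print(round(flesch_reading_ease(100, 5, 150), 2))   # "standard" difficulty band
print(round(flesch_kincaid_grade(100, 5, 150), 2))  # roughly high-school grade
```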
Applications
- Ensuring content matches target audience reading level
- Simplifying complex documents (health, legal, educational materials)
- Comparing text complexity across genres or time periods
- Quality control for plain language initiatives
Keyness Analysis
What is it?
Keyness analysis identifies words that are statistically more frequent in one corpus compared to another reference corpus. These "key words" characterize the distinctive vocabulary of a text or genre.
1. Chi-square (χ²)
Formula:
χ² = Σ(O - E)² / E
Sum over observed vs. expected frequencies in 2×2 contingency table
Interpretation:
- χ² ≥ 10.83: Highly significant (p < 0.001)
- χ² ≥ 6.63: Very significant (p < 0.01)
- χ² ≥ 3.84: Significant (p < 0.05)
Limitation: Less reliable for low-frequency words
2. Log-likelihood for Keyness
Advantage: More reliable than chi-square for corpus comparison
Critical values:
- LL ≥ 15.13: p < 0.0001 (extremely significant)
- LL ≥ 10.83: p < 0.001 (highly significant)
- LL ≥ 6.63: p < 0.01 (very significant)
3. Effect Size (Log Ratio)
Formula:
Log Ratio = log₂(freq_target / freq_reference)
Where the frequencies are relative (normalized) frequencies, e.g., per million words
Interpretation:
- Positive values: Overused in target corpus
- Negative values: Underused in target corpus
- ±3: Word is 8× more/less frequent
- ±2: Word is 4× more/less frequent
Why use it: Shows practical significance, not just statistical significance
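A sketch of the Log Ratio calculation, normalizing each raw frequency by its corpus size before taking the base-2 logarithm (counts hypothetical):

```python
import math

def log_ratio(freq_target, size_target, freq_reference, size_reference):
    """log2 of the ratio of normalized (per-token) frequencies."""
    rel_target = freq_target / size_target
    rel_reference = freq_reference / size_reference
    return math.log2(rel_target / rel_reference)

# a word occurring 80 times per million in the target corpus
# but only 10 times per million in the reference corpus
print(round(log_ratio(80, 1_000_000, 10, 1_000_000), 2))  # 8x overuse -> +3
```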
Applications
- Identifying characteristic vocabulary of genres, registers, or authors
- Comparing language use across time periods (diachronic analysis)
- Discovering terminology specific to specialized domains
- Detecting style markers for authorship attribution
NLP Concepts
Part-of-Speech (POS) Tagging
What is it:
Automatically labeling words with grammatical categories (noun, verb, adjective, etc.)
Common Tags:
- NOUN: Common nouns (cat, house, freedom)
- PROPN: Proper nouns (London, Shakespeare)
- VERB: Verbs (run, think, analyze)
- ADJ: Adjectives (beautiful, red, difficult)
- ADV: Adverbs (quickly, very, however)
- ADP: Prepositions (in, on, at, by)
Applications: Grammatical analysis, feature extraction, information retrieval
Lemmatization
What is it:
Reducing words to their dictionary form (lemma)
Examples:
- running, ran, runs → run
- better, best → good
- was, is, are → be
Use: Grouping related word forms for frequency analysis
Advantage over stemming: Produces actual dictionary words
Named Entity Recognition (NER)
What is it:
Identifying and classifying named entities (people, places, organizations, etc.)
Common Entity Types:
- PERSON: People's names (Shakespeare, Marie Curie)
- GPE: Geopolitical entities (London, France, California)
- ORG: Organizations (Google, United Nations, MIT)
- DATE: Dates and time expressions (Monday, 2025, 18th century)
- MONEY: Monetary values ($100, €50)
Applications: Content extraction, knowledge base construction, document classification
Dependency Parsing
What is it:
Analyzing grammatical relationships between words (subject, object, modifier, etc.)
Use: Syntactic analysis, relation extraction, question answering
Lexical Dispersion
What is it?
Lexical dispersion measures how evenly a word is distributed throughout a text or corpus. High dispersion means the word appears throughout; low dispersion means it's concentrated in specific sections.
Interpretation
A dispersion plot shows:
- Vertical lines: Each occurrence of the search term
- Even distribution: Term is used consistently throughout text
- Clustered lines: Term appears in specific sections (topic-specific usage)
- Gaps: Sections where the term doesn't appear
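The raw material for a dispersion plot is just the normalized position of each occurrence. A minimal sketch (the function name is mine):

```python
def dispersion_positions(tokens, keyword):
    """Normalized 0-1 positions of each occurrence, ready for plotting
    as vertical lines on a dispersion plot."""
    n = len(tokens)
    return [i / n for i, tok in enumerate(tokens)
            if tok.lower() == keyword.lower()]

tokens = "the cat sat on the mat and the cat slept".split()
print(dispersion_positions(tokens, "cat"))
```

Clustered values indicate topic-specific usage; values spread across 0-1 indicate even distribution.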
Applications
- Identifying theme distribution in narratives
- Comparing character mentions across a novel
- Detecting topic shifts in academic papers
- Analyzing discourse structure
Further Reading
For more in-depth coverage of these methods:
- McEnery, T., & Hardie, A. (2012). Corpus Linguistics: Method, Theory and Practice. Cambridge University Press.
- Gries, S. Th. (2009). Quantitative Corpus Linguistics with R. Routledge.
- Brezina, V. (2018). Statistics in Corpus Linguistics: A Practical Guide. Cambridge University Press.
- Biber, D., et al. (1998). Corpus Linguistics: Investigating Language Structure and Use. Cambridge University Press.