Computational Linguistics Knowledge Base

Learn about the statistical methods, formulas, and concepts used in corpus linguistics research.

KWIC Concordance

What is it?

KWIC (Key Word In Context) is a concordance display format that centers each occurrence of a search term, with its surrounding context aligned to the left and right. It is one of the most fundamental tools in corpus linguistics.

Use Cases

  • Examining word usage patterns in authentic contexts
  • Identifying collocations and co-occurrence patterns
  • Studying semantic prosody (positive/negative associations)
  • Analyzing grammatical constructions
  • Comparing language use across registers or time periods

How to Interpret

When analyzing KWIC lines:

  • Left context: Words that typically precede your search term
  • Right context: Words that typically follow your search term
  • Sorting: Sort by left/right context to reveal patterns
  • Frequency: Look for recurring patterns across multiple lines
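
The display described above can be sketched in a few lines of Python. This is a minimal illustration (the window size, column width, and sample sentence are arbitrary choices, not a standard):

```python
def kwic(tokens, keyword, window=4, width=30):
    """Return concordance lines with the node word centered."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            # right-align the left context so node words line up in a column
            lines.append(f"{left:>{width}}  {tok}  {right}")
    return lines

tokens = "the cat sat on the mat and the dog sat on the rug".split()
for line in kwic(tokens, "sat"):
    print(line)
```

Sorting the returned lines by their right context (or by the reversed left context) is the usual next step for spotting patterns.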

Collocation Analysis

What is it?

Collocation analysis identifies words that frequently co-occur. Statistical measures help distinguish meaningful associations from chance co-occurrence.

Statistical Measures

1. Mutual Information (MI)

Formula:

MI = log₂(O₁₂ / E₁₂)

Where O₁₂ = observed frequency, E₁₂ = expected frequency by chance

Interpretation:

  • MI ≥ 3: Strong association (the pair occurs at least 8× more often than expected by chance)
  • MI 0-3: Weak to moderate association
  • MI < 0: Negative association (words avoid each other)

Best for: Finding exclusive collocations, technical terms, idioms

Limitation: Favors low-frequency pairs; unreliable for very rare words
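
A minimal sketch of the calculation in Python. The counts are hypothetical (a pair observed 30 times in a 1M-token corpus, with individual frequencies 500 and 200), chosen only to illustrate the formula:

```python
import math

def mutual_information(o12, f1, f2, n):
    """MI = log2(observed / expected), where expected = f1 * f2 / n."""
    expected = f1 * f2 / n
    return math.log2(o12 / expected)

# Hypothetical counts: pair co-occurs 30 times in a 1,000,000-token
# corpus; the two words occur 500 and 200 times individually.
mi = mutual_information(30, 500, 200, 1_000_000)  # ≈ 8.2, a strong collocation
```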

2. T-score

Formula:

t = (O₁₂ - E₁₂) / √O₁₂

Interpretation:

  • t ≥ 2.0: Statistically significant association (95% confidence)
  • t ≥ 3.0: Highly significant (99.7% confidence)
  • Higher values = stronger evidence of non-random co-occurrence

Best for: General collocations, frequent word pairs

Advantage: More reliable for high-frequency pairs than MI
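
The t-score uses the same observed and expected counts as MI but scales the difference by the observed frequency. A sketch with hypothetical counts (30 co-occurrences, frequencies 500 and 200, in a 1M-token corpus):

```python
import math

def t_score(o12, f1, f2, n):
    """t = (observed - expected) / sqrt(observed)."""
    expected = f1 * f2 / n
    return (o12 - expected) / math.sqrt(o12)

t = t_score(30, 500, 200, 1_000_000)  # ≈ 5.5, well above the 2.0 threshold
```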

3. Dice Coefficient

Formula:

Dice = 2 × f(x,y) / (f(x) + f(y))

Where f(x,y) = co-occurrence frequency, f(x) and f(y) = individual frequencies

Interpretation:

  • Range: 0 to 1 (0 = never co-occur, 1 = always co-occur)
  • 0.7-1.0: Very strong association
  • 0.4-0.7: Moderate association
  • 0-0.4: Weak association

Best for: Symmetrical associations, comparing collocation strength across different corpus sizes

Advantage: Easy to interpret (percentage-like scale)
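
The formula translates directly to code. The counts below are hypothetical; note that swapping the two individual frequencies gives the same score, which is what "symmetrical" means here:

```python
def dice(f_xy, f_x, f_y):
    """Dice coefficient: 2 * joint frequency / sum of individual frequencies."""
    return 2 * f_xy / (f_x + f_y)

# Hypothetical counts: 30 co-occurrences, individual frequencies 500 and 200.
score = dice(30, 500, 200)  # ≈ 0.086, a weak association on the 0-1 scale
```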

4. Log-likelihood (LL)

Formula:

LL = 2 × Σ(O × ln(O/E))

Sum over all cells in the contingency table

Interpretation:

  • LL ≥ 15.13: Highly significant (p < 0.0001)
  • LL ≥ 10.83: Very significant (p < 0.001)
  • LL ≥ 6.63: Significant (p < 0.01)
  • LL ≥ 3.84: Marginally significant (p < 0.05)

Best for: Statistical significance testing, large corpora

Advantage: More reliable than chi-square for corpus linguistics
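
One way to sketch this for a word pair: build the 2×2 contingency table (pair vs. each word occurring without the other vs. neither), compute expected values from the marginals, and sum over the four cells. The counts are hypothetical:

```python
import math

def log_likelihood(f_xy, f_x, f_y, n):
    """LL = 2 * sum(O * ln(O/E)) over all four cells of the pair's
    2x2 contingency table in a corpus of n tokens."""
    observed = [[f_xy, f_x - f_xy],
                [f_y - f_xy, n - f_x - f_y + f_xy]]
    rows = [f_x, n - f_x]
    cols = [f_y, n - f_y]
    ll = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            e = rows[i] * cols[j] / n
            if o > 0:  # the O * ln(O/E) term is taken as 0 when O = 0
                ll += o * math.log(o / e)
    return 2 * ll

# Hypothetical counts: 30 co-occurrences, frequencies 500 and 200, 1M tokens.
ll = log_likelihood(30, 500, 200, 1_000_000)
```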

Choosing the Right Measure

  • MI: Use for finding exclusive, specialized collocations (technical terms, idioms)
  • T-score: Use for general purpose collocation analysis with frequent words
  • Dice: Use when you need an intuitive 0-1 scale or comparing across corpora
  • Log-likelihood: Use for statistical hypothesis testing or very large corpora

N-grams

What are they?

N-grams are contiguous sequences of n items (words) from a text. They capture multi-word patterns and phraseological units.

Types

  • Bigrams (2-grams): Two-word sequences (e.g., "corpus linguistics", "data analysis")
  • Trigrams (3-grams): Three-word sequences (e.g., "in order to", "on the other hand")
  • 4-grams and beyond: Longer phrases (e.g., "at the end of the day")
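
Extracting and counting n-grams from a token list is a one-liner with a sliding window. A minimal sketch (the sample sentence is invented for illustration):

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-token sequences, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "in order to see and in order to learn".split()
trigram_counts = Counter(ngrams(tokens, 3))
# trigram_counts[("in", "order", "to")] == 2
```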

Applications

  • Identifying formulaic language and fixed expressions
  • Studying phraseology and lexical bundles
  • Analyzing grammatical patterns (e.g., "it is important to")
  • Register and genre analysis (different texts use different n-grams)
  • Language teaching (common phrases students should learn)

Lexical Diversity Metrics

What is it?

Lexical diversity (also called lexical richness or vocabulary diversity) measures how varied the vocabulary is in a text. Higher diversity indicates more varied word choice.

1. Type-Token Ratio (TTR)

Formula:

TTR = (Number of unique words / Total words) × 100

Interpretation:

  • Range: 0-100%
  • Higher TTR = more diverse vocabulary
  • 60-80%: High diversity (academic writing, literary texts)
  • 40-60%: Moderate diversity (news, conversation)
  • Below 40%: Lower diversity (repetitive texts)

Limitation: Heavily influenced by text length (longer texts = lower TTR)
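
The computation itself is trivial (this sketch assumes the text is already tokenized and case-normalized):

```python
def ttr(tokens):
    """Type-token ratio as a percentage."""
    return len(set(tokens)) / len(tokens) * 100

print(ttr("the cat saw the dog".split()))  # 80.0 (4 types / 5 tokens)
```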

2. Standardized TTR (STTR)

Method:

Calculate TTR for consecutive chunks (e.g., every 1000 words), then average

Advantage: More stable across different text lengths than simple TTR
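
A sketch of the chunk-and-average procedure. One implementation choice here is to drop the trailing partial chunk; tools differ on this detail:

```python
def sttr(tokens, chunk_size=1000):
    """Mean TTR over consecutive full chunks of chunk_size tokens
    (this sketch drops the trailing partial chunk)."""
    chunks = [tokens[i:i + chunk_size]
              for i in range(0, len(tokens) - chunk_size + 1, chunk_size)]
    if not chunks:
        raise ValueError("text is shorter than one chunk")
    return sum(len(set(c)) / len(c) for c in chunks) / len(chunks) * 100
```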

3. Measure of Textual Lexical Diversity (MTLD)

Method:

Measures average length of sequential word strings that maintain a criterion TTR (default 0.72)

Interpretation:

  • Higher values = more diverse vocabulary
  • 80-100+: High diversity
  • 50-80: Moderate diversity
  • Below 50: Lower diversity

Advantage: Length-independent; reliable for comparisons
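
The procedure can be sketched as a single pass: keep extending the current word string until its running TTR falls to the criterion, count that as one "factor", reset, and give partial credit for the unfinished factor at the end. Note this is a simplified one-directional pass; the published measure averages a forward and a backward pass:

```python
def mtld_forward(tokens, threshold=0.72):
    """One-directional MTLD pass (simplified sketch)."""
    factors = 0.0
    types, count = set(), 0
    for tok in tokens:
        count += 1
        types.add(tok)
        if len(types) / count <= threshold:
            factors += 1          # a full factor is complete
            types, count = set(), 0
    if count:
        # partial credit for the unfinished factor at the end
        factors += (1 - len(types) / count) / (1 - threshold)
    return len(tokens) / factors if factors else float(len(tokens))

repetitive = ["a", "b"] * 200
diverse = [f"w{i}" for i in range(400)]
```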

4. HD-D (Hypergeometric Distribution D)

Method:

Uses probability of encountering new word types based on hypergeometric distribution

Range: 0 to 1

Advantage: Mathematically sound; handles text length variation well

5. Yule's K

Method:

Measures the probability that two tokens drawn at random from the text belong to the same word type

Interpretation:

  • LOWER values = MORE diverse (inverse of other metrics)
  • Below 100: Very high diversity
  • 100-200: High diversity
  • 200-300: Moderate diversity
  • Above 300: Lower diversity

Use case: Authorship attribution, register analysis
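
The text above does not give the formula; a common formulation is K = 10⁴ × (Σ m²·Vₘ − N) / N², where Vₘ is the number of types occurring exactly m times and N is the token count. A sketch under that formulation:

```python
from collections import Counter

def yules_k(tokens):
    """K = 10^4 * (sum(m^2 * V_m) - N) / N^2, where V_m is the number
    of types occurring exactly m times and N is the token count."""
    n = len(tokens)
    freq_of_freqs = Counter(Counter(tokens).values())
    s2 = sum(m * m * v for m, v in freq_of_freqs.items())
    return 10_000 * (s2 - n) / (n * n)
```

A text where every token is unique gives K = 0 (maximum diversity); a text repeating one word gives a very high K, consistent with the inverse scale above.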

Applications

  • Assessing writing quality and sophistication
  • Comparing vocabulary richness across genres
  • Tracking language development in L2 learners
  • Authorship attribution and stylometry
  • Detecting simplified or controlled language

Readability Indices

What are they?

Readability indices estimate how difficult a text is to read, often expressed as a grade level (U.S. education system). They use surface features like sentence length and word complexity.

1. Flesch Reading Ease

Formula:

206.835 - 1.015(words/sentences) - 84.6(syllables/words)

Interpretation:

  • 90-100: Very easy (5th grade)
  • 60-70: Standard (8th-9th grade)
  • 30-50: Difficult (college level)
  • 0-30: Very difficult (graduate level)

Note: Higher score = easier to read

2. Flesch-Kincaid Grade Level

Formula:

0.39(words/sentences) + 11.8(syllables/words) - 15.59

Interpretation:

Output is U.S. grade level (e.g., 8.0 = 8th grade, 13.0 = college freshman)
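
Both Flesch formulas translate directly to code. This sketch takes the three counts as inputs; syllable counting itself requires a pronunciation dictionary or heuristic not shown here, and the example counts are invented:

```python
def flesch_reading_ease(words, sentences, syllables):
    """Higher score = easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words, sentences, syllables):
    """Output is a U.S. grade level."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# A hypothetical text with 100 words, 5 sentences, and 150 syllables:
ease = flesch_reading_ease(100, 5, 150)    # ≈ 59.6 ("standard")
grade = flesch_kincaid_grade(100, 5, 150)  # ≈ 9.9
```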

3. Gunning Fog Index

Formula:

0.4 × [(words/sentences) + 100(complex words/words)]

Complex words = 3+ syllables

Interpretation:

  • 6: Easy (6th grade)
  • 12: High school senior
  • 17+: College graduate level

4. SMOG Index

Formula:

1.0430 × √(polysyllables × 30/sentences) + 3.1291

Polysyllables = words with 3+ syllables

Best for: Health materials, consumer documents

5. Coleman-Liau Index

Formula:

0.0588L - 0.296S - 15.8

L = average letters per 100 words, S = average sentences per 100 words

Advantage: Uses characters instead of syllables (easier to compute)

6. Automated Readability Index (ARI)

Formula:

4.71(characters/words) + 0.5(words/sentences) - 21.43

Use: Originally designed for real-time readability on electric typewriters
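
Because Coleman-Liau and ARI use character counts rather than syllables, they are the easiest indices to compute exactly. A sketch taking the raw counts as inputs (the example counts are invented):

```python
def coleman_liau(letters, words, sentences):
    l = letters / words * 100     # average letters per 100 words
    s = sentences / words * 100   # average sentences per 100 words
    return 0.0588 * l - 0.296 * s - 15.8

def ari(characters, words, sentences):
    return 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43

# A hypothetical text with 500 letters, 100 words, 5 sentences:
cli_grade = coleman_liau(500, 100, 5)  # ≈ 12.1
ari_grade = ari(500, 100, 5)           # ≈ 12.1
```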

Applications

  • Ensuring content matches target audience reading level
  • Simplifying complex documents (health, legal, educational materials)
  • Comparing text complexity across genres or time periods
  • Quality control for plain language initiatives

Keyness Analysis

What is it?

Keyness analysis identifies words that are statistically more frequent in one corpus compared to another reference corpus. These "key words" characterize the distinctive vocabulary of a text or genre.

1. Chi-square (χ²)

Formula:

χ² = Σ(O - E)² / E

Sum over observed vs. expected frequencies in 2×2 contingency table

Interpretation:

  • χ² ≥ 10.83: Highly significant (p < 0.001)
  • χ² ≥ 6.63: Very significant (p < 0.01)
  • χ² ≥ 3.84: Significant (p < 0.05)

Limitation: Less reliable for low-frequency words
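
For keyness, the 2×2 table contrasts the word against all other words, in the target versus the reference corpus. A sketch with hypothetical counts:

```python
def chi_square_keyness(freq_t, size_t, freq_r, size_r):
    """Chi-square over the 2x2 table: word vs. all other words,
    target corpus vs. reference corpus."""
    total = size_t + size_r
    word_total = freq_t + freq_r
    observed = [freq_t, size_t - freq_t, freq_r, size_r - freq_r]
    expected = [size_t * word_total / total,
                size_t * (total - word_total) / total,
                size_r * word_total / total,
                size_r * (total - word_total) / total]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical: 120 hits in a 100k-token target vs. 40 in a 100k reference.
chi2 = chi_square_keyness(120, 100_000, 40, 100_000)  # ≈ 40, p < 0.001
```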

2. Log-likelihood for Keyness

Advantage: More reliable than chi-square for corpus comparison

Critical values:

  • LL ≥ 15.13: p < 0.0001 (extremely significant)
  • LL ≥ 10.83: p < 0.001 (highly significant)
  • LL ≥ 6.63: p < 0.01 (very significant)

3. Effect Size (Log Ratio)

Formula:

Log Ratio = log₂(freq_target / freq_reference)

Where freq_target and freq_reference are the word's normalized (relative) frequencies in each corpus (e.g., per million words)

Interpretation:

  • Positive values: Overused in target corpus
  • Negative values: Underused in target corpus
  • ±3: Word is 8× more/less frequent
  • ±2: Word is 4× more/less frequent

Why use it: Shows practical significance, not just statistical significance
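
A sketch that normalizes by corpus size before taking the ratio. The counts are hypothetical; note that a zero frequency in either corpus needs smoothing, which this sketch does not handle:

```python
import math

def log_ratio(freq_t, size_t, freq_r, size_r):
    """log2 of the ratio of the word's relative frequencies
    in the target and reference corpora."""
    return math.log2((freq_t / size_t) / (freq_r / size_r))

# A word with 80 hits per million in the target and 10 per million
# in the reference is 8x more frequent:
lr = log_ratio(80, 1_000_000, 10, 1_000_000)  # 3.0
```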

Applications

  • Identifying characteristic vocabulary of genres, registers, or authors
  • Comparing language use across time periods (diachronic analysis)
  • Discovering terminology specific to specialized domains
  • Detecting style markers for authorship attribution

NLP Concepts

Part-of-Speech (POS) Tagging

What is it:

Automatically labeling words with grammatical categories (noun, verb, adjective, etc.)

Common Tags:

  • NOUN: Common nouns (cat, house, freedom)
  • PROPN: Proper nouns (London, Shakespeare)
  • VERB: Verbs (run, think, analyze)
  • ADJ: Adjectives (beautiful, red, difficult)
  • ADV: Adverbs (quickly, very, however)
  • ADP: Prepositions (in, on, at, by)

Applications: Grammatical analysis, feature extraction, information retrieval

Lemmatization

What is it:

Reducing words to their dictionary form (lemma)

Examples:

  • running, ran, runs → run
  • better, best → good
  • was, is, are → be

Use: Grouping related word forms for frequency analysis

Advantage over stemming: Produces actual dictionary words

Named Entity Recognition (NER)

What is it:

Identifying and classifying named entities (people, places, organizations, etc.)

Common Entity Types:

  • PERSON: People's names (Shakespeare, Marie Curie)
  • GPE: Geopolitical entities (London, France, California)
  • ORG: Organizations (Google, United Nations, MIT)
  • DATE: Dates and time expressions (Monday, 2025, 18th century)
  • MONEY: Monetary values ($100, €50)

Applications: Content extraction, knowledge base construction, document classification

Dependency Parsing

What is it:

Analyzing grammatical relationships between words (subject, object, modifier, etc.)

Use: Syntactic analysis, relation extraction, question answering

Lexical Dispersion

What is it?

Lexical dispersion measures how evenly a word is distributed throughout a text or corpus. High dispersion means the word appears throughout; low dispersion means it's concentrated in specific sections.

Interpretation

A dispersion plot shows:

  • Vertical lines: Each occurrence of the search term
  • Even distribution: Term is used consistently throughout text
  • Clustered lines: Term appears in specific sections (topic-specific usage)
  • Gaps: Sections where the term doesn't appear
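
The raw material of a dispersion plot is just the position of each hit, normalized to a 0-1 scale so texts of different lengths are comparable. A minimal sketch (the sample sentence is invented):

```python
def dispersion_positions(tokens, term):
    """Normalized (0-1) position of each occurrence of term."""
    n = len(tokens)
    return [i / (n - 1) for i, tok in enumerate(tokens)
            if tok.lower() == term.lower()]

tokens = "the cat sat and the dog ran as the sun set".split()
positions = dispersion_positions(tokens, "the")  # [0.0, 0.4, 0.8]
```

Evenly spaced positions indicate consistent use throughout the text; clusters and gaps indicate section-specific usage.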

Applications

  • Identifying theme distribution in narratives
  • Comparing character mentions across a novel
  • Detecting topic shifts in academic papers
  • Analyzing discourse structure

Further Reading

For more in-depth coverage of these methods:

  • McEnery, T., & Hardie, A. (2012). Corpus Linguistics: Method, Theory and Practice. Cambridge University Press.
  • Gries, S. Th. (2009). Quantitative Corpus Linguistics with R. Routledge.
  • Brezina, V. (2018). Statistics in Corpus Linguistics: A Practical Guide. Cambridge University Press.
  • Biber, D., et al. (1998). Corpus Linguistics: Investigating Language Structure and Use. Cambridge University Press.