Computational Linguistics Knowledge Base

Learn about the statistical methods, formulas, and concepts used in corpus linguistics research.

KWIC Concordance

What is it?

KWIC (Key Word In Context) is a concordance display format that centers each occurrence of a search term, with its surrounding context aligned to the left and right. It is one of the most fundamental tools in corpus linguistics.

Use Cases

  • Examining word usage patterns in authentic contexts
  • Identifying collocations and co-occurrence patterns
  • Studying semantic prosody (positive/negative associations)
  • Analyzing grammatical constructions
  • Comparing language use across registers or time periods

How to Interpret

When analyzing KWIC lines:

  • Left context: Words that typically precede your search term
  • Right context: Words that typically follow your search term
  • Sorting: Sort by left/right context to reveal patterns
  • Frequency: Look for recurring patterns across multiple lines
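
The display described above can be sketched in a few lines of Python. This is a minimal illustration (the window size, column width, and sample sentence are arbitrary choices, not a standard):

```python
def kwic(tokens, keyword, window=4, width=30):
    """Return concordance lines with the node word centered."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            # right-align the left context so node words line up in a column
            lines.append(f"{left:>{width}}  {tok}  {right}")
    return lines

tokens = "the cat sat on the mat and the dog sat on the rug".split()
for line in kwic(tokens, "sat"):
    print(line)
```

Sorting the returned lines by their right context (or by the reversed left context) is the usual next step for spotting patterns.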

Collocation Analysis

What is it?

Collocation analysis identifies words that frequently co-occur. Statistical measures help distinguish meaningful associations from chance co-occurrence.

Statistical Measures

1. Mutual Information (MI)

Formula:

MI = log₂(O₁₂ / E₁₂)

Where O₁₂ = observed frequency, E₁₂ = expected frequency by chance

Interpretation:

  • MI ≥ 3: Strong association (the pair occurs at least 8× more often than expected by chance)
  • MI 0-3: Weak to moderate association
  • MI < 0: Negative association (words avoid each other)

Best for: Finding exclusive collocations, technical terms, idioms

Limitation: Favors low-frequency pairs; unreliable for very rare words
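
A minimal sketch of the calculation in Python. The counts are hypothetical (a pair observed 30 times in a 1M-token corpus, with individual frequencies 500 and 200), chosen only to illustrate the formula:

```python
import math

def mutual_information(o12, f1, f2, n):
    """MI = log2(observed / expected), where expected = f1 * f2 / n."""
    expected = f1 * f2 / n
    return math.log2(o12 / expected)

# Hypothetical counts: pair co-occurs 30 times in a 1,000,000-token
# corpus; the two words occur 500 and 200 times individually.
mi = mutual_information(30, 500, 200, 1_000_000)  # ≈ 8.2, a strong collocation
```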

2. T-score

Formula:

t = (O₁₂ - E₁₂) / √O₁₂

Interpretation:

  • t ≥ 2.0: Statistically significant association (95% confidence)
  • t ≥ 3.0: Highly significant (99.7% confidence)
  • Higher values = stronger evidence of non-random co-occurrence

Best for: General collocations, frequent word pairs

Advantage: More reliable for high-frequency pairs than MI
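
The t-score uses the same observed and expected counts as MI but scales the difference by the observed frequency. A sketch with hypothetical counts (30 co-occurrences, frequencies 500 and 200, in a 1M-token corpus):

```python
import math

def t_score(o12, f1, f2, n):
    """t = (observed - expected) / sqrt(observed)."""
    expected = f1 * f2 / n
    return (o12 - expected) / math.sqrt(o12)

t = t_score(30, 500, 200, 1_000_000)  # ≈ 5.5, well above the 2.0 threshold
```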

3. Dice Coefficient

Formula:

Dice = 2 × f(x,y) / (f(x) + f(y))

Where f(x,y) = co-occurrence frequency, f(x) and f(y) = individual frequencies

Interpretation:

  • Range: 0 to 1 (0 = never co-occur, 1 = always co-occur)
  • 0.7-1.0: Very strong association
  • 0.4-0.7: Moderate association
  • 0-0.4: Weak association

Best for: Symmetrical associations, comparing collocation strength across different corpus sizes

Advantage: Easy to interpret (percentage-like scale)
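
The formula translates directly to code. The counts below are hypothetical; note that swapping the two individual frequencies gives the same score, which is what "symmetrical" means here:

```python
def dice(f_xy, f_x, f_y):
    """Dice coefficient: 2 * joint frequency / sum of individual frequencies."""
    return 2 * f_xy / (f_x + f_y)

# Hypothetical counts: 30 co-occurrences, individual frequencies 500 and 200.
score = dice(30, 500, 200)  # ≈ 0.086, a weak association on the 0-1 scale
```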

4. Log-likelihood (LL)

Formula:

LL = 2 × Σ(O × ln(O/E))

Sum over all cells in the contingency table

Interpretation:

  • LL ≥ 15.13: Highly significant (p < 0.0001)
  • LL ≥ 10.83: Very significant (p < 0.001)
  • LL ≥ 6.63: Significant (p < 0.01)
  • LL ≥ 3.84: Marginally significant (p < 0.05)

Best for: Statistical significance testing, large corpora

Advantage: More reliable than chi-square for corpus linguistics
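
One way to sketch this for a word pair: build the 2×2 contingency table (pair vs. each word occurring without the other vs. neither), compute expected values from the marginals, and sum over the four cells. The counts are hypothetical:

```python
import math

def log_likelihood(f_xy, f_x, f_y, n):
    """LL = 2 * sum(O * ln(O/E)) over all four cells of the pair's
    2x2 contingency table in a corpus of n tokens."""
    observed = [[f_xy, f_x - f_xy],
                [f_y - f_xy, n - f_x - f_y + f_xy]]
    rows = [f_x, n - f_x]
    cols = [f_y, n - f_y]
    ll = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            e = rows[i] * cols[j] / n
            if o > 0:  # the O * ln(O/E) term is taken as 0 when O = 0
                ll += o * math.log(o / e)
    return 2 * ll

# Hypothetical counts: 30 co-occurrences, frequencies 500 and 200, 1M tokens.
ll = log_likelihood(30, 500, 200, 1_000_000)
```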

Choosing the Right Measure

  • MI: Use for finding exclusive, specialized collocations (technical terms, idioms)
  • T-score: Use for general purpose collocation analysis with frequent words
  • Dice: Use when you need an intuitive 0-1 scale or comparing across corpora
  • Log-likelihood: Use for statistical hypothesis testing or very large corpora

N-grams

What are they?

N-grams are contiguous sequences of n items (words) from a text. They capture multi-word patterns and phraseological units.

Types

  • Bigrams (2-grams): Two-word sequences (e.g., "corpus linguistics", "data analysis")
  • Trigrams (3-grams): Three-word sequences (e.g., "in order to", "on the other hand")
  • 4-grams and beyond: Longer phrases (e.g., "at the end of the day")
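
Extracting and counting n-grams from a token list is a one-liner with a sliding window. A minimal sketch (the sample sentence is invented for illustration):

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-token sequences, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "in order to see and in order to learn".split()
trigram_counts = Counter(ngrams(tokens, 3))
# trigram_counts[("in", "order", "to")] == 2
```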

Applications

  • Identifying formulaic language and fixed expressions
  • Studying phraseology and lexical bundles
  • Analyzing grammatical patterns (e.g., "it is important to")
  • Register and genre analysis (different texts use different n-grams)
  • Language teaching (common phrases students should learn)

Lexical Diversity Metrics

What is it?

Lexical diversity (also called lexical richness or vocabulary diversity) measures how varied the vocabulary is in a text. Higher diversity indicates more varied word choice.

1. Type-Token Ratio (TTR)

Formula:

TTR = (Number of unique words / Total words) × 100

Interpretation:

  • Range: 0-100%
  • Higher TTR = more diverse vocabulary
  • 60-80%: High diversity (academic writing, literary texts)
  • 40-60%: Moderate diversity (news, conversation)
  • Below 40%: Lower diversity (repetitive texts)

Limitation: Heavily influenced by text length (longer texts = lower TTR)
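
The computation itself is trivial (this sketch assumes the text is already tokenized and case-normalized):

```python
def ttr(tokens):
    """Type-token ratio as a percentage."""
    return len(set(tokens)) / len(tokens) * 100

print(ttr("the cat saw the dog".split()))  # 80.0 (4 types / 5 tokens)
```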

2. Standardized TTR (STTR)

Method:

Calculate TTR for consecutive chunks (e.g., every 1000 words), then average

Advantage: More stable across different text lengths than simple TTR
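
A sketch of the chunk-and-average procedure. One implementation choice here is to drop the trailing partial chunk; tools differ on this detail:

```python
def sttr(tokens, chunk_size=1000):
    """Mean TTR over consecutive full chunks of chunk_size tokens
    (this sketch drops the trailing partial chunk)."""
    chunks = [tokens[i:i + chunk_size]
              for i in range(0, len(tokens) - chunk_size + 1, chunk_size)]
    if not chunks:
        raise ValueError("text is shorter than one chunk")
    return sum(len(set(c)) / len(c) for c in chunks) / len(chunks) * 100
```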

3. Measure of Textual Lexical Diversity (MTLD)

Method:

Measures average length of sequential word strings that maintain a criterion TTR (default 0.72)

Interpretation:

  • Higher values = more diverse vocabulary
  • 80-100+: High diversity
  • 50-80: Moderate diversity
  • Below 50: Lower diversity

Advantage: Length-independent; reliable for comparisons
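
The procedure can be sketched as a single pass: keep extending the current word string until its running TTR falls to the criterion, count that as one "factor", reset, and give partial credit for the unfinished factor at the end. Note this is a simplified one-directional pass; the published measure averages a forward and a backward pass:

```python
def mtld_forward(tokens, threshold=0.72):
    """One-directional MTLD pass (simplified sketch)."""
    factors = 0.0
    types, count = set(), 0
    for tok in tokens:
        count += 1
        types.add(tok)
        if len(types) / count <= threshold:
            factors += 1          # a full factor is complete
            types, count = set(), 0
    if count:
        # partial credit for the unfinished factor at the end
        factors += (1 - len(types) / count) / (1 - threshold)
    return len(tokens) / factors if factors else float(len(tokens))

repetitive = ["a", "b"] * 200
diverse = [f"w{i}" for i in range(400)]
```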

4. HD-D (Hypergeometric Distribution D)

Method:

Uses probability of encountering new word types based on hypergeometric distribution

Range: 0 to 1

Advantage: Mathematically sound; handles text length variation well

5. Yule's K

Method:

Measures the probability that two tokens drawn at random from the text belong to the same word type

Interpretation:

  • LOWER values = MORE diverse (inverse of other metrics)
  • Below 100: Very high diversity
  • 100-200: High diversity
  • 200-300: Moderate diversity
  • Above 300: Lower diversity

Use case: Authorship attribution, register analysis
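
The text above does not give the formula; a common formulation is K = 10⁴ × (Σ m²·Vₘ − N) / N², where Vₘ is the number of types occurring exactly m times and N is the token count. A sketch under that formulation:

```python
from collections import Counter

def yules_k(tokens):
    """K = 10^4 * (sum(m^2 * V_m) - N) / N^2, where V_m is the number
    of types occurring exactly m times and N is the token count."""
    n = len(tokens)
    freq_of_freqs = Counter(Counter(tokens).values())
    s2 = sum(m * m * v for m, v in freq_of_freqs.items())
    return 10_000 * (s2 - n) / (n * n)
```

A text where every token is unique gives K = 0 (maximum diversity); a text repeating one word gives a very high K, consistent with the inverse scale above.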

Applications

  • Assessing writing quality and sophistication
  • Comparing vocabulary richness across genres
  • Tracking language development in L2 learners
  • Authorship attribution and stylometry
  • Detecting simplified or controlled language

Readability Indices

What are they?

Readability indices estimate how difficult a text is to read, often expressed as a grade level (U.S. education system). They use surface features like sentence length and word complexity.

1. Flesch Reading Ease

Formula:

206.835 - 1.015(words/sentences) - 84.6(syllables/words)

Interpretation:

  • 90-100: Very easy (5th grade)
  • 60-70: Standard (8th-9th grade)
  • 30-50: Difficult (college level)
  • 0-30: Very difficult (graduate level)

Note: Higher score = easier to read

2. Flesch-Kincaid Grade Level

Formula:

0.39(words/sentences) + 11.8(syllables/words) - 15.59

Interpretation:

Output is U.S. grade level (e.g., 8.0 = 8th grade, 13.0 = college freshman)
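
Both Flesch formulas translate directly to code. This sketch takes the three counts as inputs; syllable counting itself requires a pronunciation dictionary or heuristic not shown here, and the example counts are invented:

```python
def flesch_reading_ease(words, sentences, syllables):
    """Higher score = easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words, sentences, syllables):
    """Output is a U.S. grade level."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# A hypothetical text with 100 words, 5 sentences, and 150 syllables:
ease = flesch_reading_ease(100, 5, 150)    # ≈ 59.6 ("standard")
grade = flesch_kincaid_grade(100, 5, 150)  # ≈ 9.9
```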

3. Gunning Fog Index

Formula:

0.4 × [(words/sentences) + 100(complex words/words)]

Complex words = 3+ syllables

Interpretation:

  • 6: Easy (6th grade)
  • 12: High school senior
  • 17+: College graduate level

4. SMOG Index

Formula:

1.0430 × √(polysyllables × 30/sentences) + 3.1291

Polysyllables = words with 3+ syllables

Best for: Health materials, consumer documents

5. Coleman-Liau Index

Formula:

0.0588L - 0.296S - 15.8

L = average letters per 100 words, S = average sentences per 100 words

Advantage: Uses characters instead of syllables (easier to compute)

6. Automated Readability Index (ARI)

Formula:

4.71(characters/words) + 0.5(words/sentences) - 21.43

Use: Originally designed for real-time readability on electric typewriters
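
Because Coleman-Liau and ARI use character counts rather than syllables, they are the easiest indices to compute exactly. A sketch taking the raw counts as inputs (the example counts are invented):

```python
def coleman_liau(letters, words, sentences):
    l = letters / words * 100     # average letters per 100 words
    s = sentences / words * 100   # average sentences per 100 words
    return 0.0588 * l - 0.296 * s - 15.8

def ari(characters, words, sentences):
    return 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43

# A hypothetical text with 500 letters, 100 words, 5 sentences:
cli_grade = coleman_liau(500, 100, 5)  # ≈ 12.1
ari_grade = ari(500, 100, 5)           # ≈ 12.1
```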

Applications

  • Ensuring content matches target audience reading level
  • Simplifying complex documents (health, legal, educational materials)
  • Comparing text complexity across genres or time periods
  • Quality control for plain language initiatives

Keyness Analysis

What is it?

Keyness analysis identifies words that are statistically more frequent in one corpus compared to another reference corpus. These "key words" characterize the distinctive vocabulary of a text or genre.

1. Chi-square (χ²)

Formula:

χ² = Σ(O - E)² / E

Sum over observed vs. expected frequencies in 2×2 contingency table

Interpretation:

  • χ² ≥ 10.83: Highly significant (p < 0.001)
  • χ² ≥ 6.63: Very significant (p < 0.01)
  • χ² ≥ 3.84: Significant (p < 0.05)

Limitation: Less reliable for low-frequency words
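
For keyness, the 2×2 table contrasts the word against all other words, in the target versus the reference corpus. A sketch with hypothetical counts:

```python
def chi_square_keyness(freq_t, size_t, freq_r, size_r):
    """Chi-square over the 2x2 table: word vs. all other words,
    target corpus vs. reference corpus."""
    total = size_t + size_r
    word_total = freq_t + freq_r
    observed = [freq_t, size_t - freq_t, freq_r, size_r - freq_r]
    expected = [size_t * word_total / total,
                size_t * (total - word_total) / total,
                size_r * word_total / total,
                size_r * (total - word_total) / total]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical: 120 hits in a 100k-token target vs. 40 in a 100k reference.
chi2 = chi_square_keyness(120, 100_000, 40, 100_000)  # ≈ 40, p < 0.001
```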

2. Log-likelihood for Keyness

Advantage: More reliable than chi-square for corpus comparison

Critical values:

  • LL ≥ 15.13: p < 0.0001 (extremely significant)
  • LL ≥ 10.83: p < 0.001 (highly significant)
  • LL ≥ 6.63: p < 0.01 (very significant)

3. Effect Size (Log Ratio)

Formula:

Log Ratio = log₂(freq_target / freq_reference)

Where freq_target and freq_reference are the word's normalized (relative) frequencies in each corpus (e.g., per million words)

Interpretation:

  • Positive values: Overused in target corpus
  • Negative values: Underused in target corpus
  • ±3: Word is 8× more/less frequent
  • ±2: Word is 4× more/less frequent

Why use it: Shows practical significance, not just statistical significance
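
A sketch that normalizes by corpus size before taking the ratio. The counts are hypothetical; note that a zero frequency in either corpus needs smoothing, which this sketch does not handle:

```python
import math

def log_ratio(freq_t, size_t, freq_r, size_r):
    """log2 of the ratio of the word's relative frequencies
    in the target and reference corpora."""
    return math.log2((freq_t / size_t) / (freq_r / size_r))

# A word with 80 hits per million in the target and 10 per million
# in the reference is 8x more frequent:
lr = log_ratio(80, 1_000_000, 10, 1_000_000)  # 3.0
```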

Applications

  • Identifying characteristic vocabulary of genres, registers, or authors
  • Comparing language use across time periods (diachronic analysis)
  • Discovering terminology specific to specialized domains
  • Detecting style markers for authorship attribution

NLP Concepts

Part-of-Speech (POS) Tagging

What is it:

Automatically labeling words with grammatical categories (noun, verb, adjective, etc.)

Common Tags:

  • NOUN: Common nouns (cat, house, freedom)
  • PROPN: Proper nouns (London, Shakespeare)
  • VERB: Verbs (run, think, analyze)
  • ADJ: Adjectives (beautiful, red, difficult)
  • ADV: Adverbs (quickly, very, however)
  • ADP: Prepositions (in, on, at, by)

Applications: Grammatical analysis, feature extraction, information retrieval

Lemmatization

What is it:

Reducing words to their dictionary form (lemma)

Examples:

  • running, ran, runs → run
  • better, best → good
  • was, is, are → be

Use: Grouping related word forms for frequency analysis

Advantage over stemming: Produces actual dictionary words

Named Entity Recognition (NER)

What is it:

Identifying and classifying named entities (people, places, organizations, etc.)

Common Entity Types:

  • PERSON: People's names (Shakespeare, Marie Curie)
  • GPE: Geopolitical entities (London, France, California)
  • ORG: Organizations (Google, United Nations, MIT)
  • DATE: Dates and time expressions (Monday, 2025, 18th century)
  • MONEY: Monetary values ($100, €50)

Applications: Content extraction, knowledge base construction, document classification

Dependency Parsing

What is it:

Analyzing grammatical relationships between words (subject, object, modifier, etc.)

Use: Syntactic analysis, relation extraction, question answering

Lexical Dispersion

What is it?

Lexical dispersion measures how evenly a word is distributed throughout a text or corpus. High dispersion means the word appears throughout; low dispersion means it's concentrated in specific sections.

Interpretation

A dispersion plot shows:

  • Vertical lines: Each occurrence of the search term
  • Even distribution: Term is used consistently throughout text
  • Clustered lines: Term appears in specific sections (topic-specific usage)
  • Gaps: Sections where the term doesn't appear
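
The raw material of a dispersion plot is just the position of each hit, normalized to a 0-1 scale so texts of different lengths are comparable. A minimal sketch (the sample sentence is invented):

```python
def dispersion_positions(tokens, term):
    """Normalized (0-1) position of each occurrence of term."""
    n = len(tokens)
    return [i / (n - 1) for i, tok in enumerate(tokens)
            if tok.lower() == term.lower()]

tokens = "the cat sat and the dog ran as the sun set".split()
positions = dispersion_positions(tokens, "the")  # [0.0, 0.4, 0.8]
```

Evenly spaced positions indicate consistent use throughout the text; clusters and gaps indicate section-specific usage.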

Applications

  • Identifying theme distribution in narratives
  • Comparing character mentions across a novel
  • Detecting topic shifts in academic papers
  • Analyzing discourse structure

Further Reading

For more in-depth coverage of these methods:

  • McEnery, T., & Hardie, A. (2012). Corpus Linguistics: Method, Theory and Practice. Cambridge University Press.
  • Gries, S. Th. (2009). Quantitative Corpus Linguistics with R. Routledge.
  • Brezina, V. (2018). Statistics in Corpus Linguistics: A Practical Guide. Cambridge University Press.
  • Biber, D., et al. (1998). Corpus Linguistics: Investigating Language Structure and Use. Cambridge University Press.