CorpusCraft Documentation

Complete guide to using CorpusCraft for corpus linguistics research

1. Getting Started

Creating an Account

CorpusCraft uses magic link authentication for secure, passwordless access:

  1. Click "Get Started" on the homepage
  2. Enter your email address
  3. Check your email for the login link
  4. Click the link to access your account

Creating Your First Corpus

  1. From your dashboard, click "Create New Corpus"
  2. Enter a name and optional description
  3. Click "Create Corpus"
  4. Start uploading documents

2. Pricing & Academic Verification

Pricing Tiers

CorpusCraft offers five pricing tiers designed for different research needs:

Free Plan - $0

  • 10 documents (50,000 tokens)
  • Full-text search & KWIC concordance
  • Basic frequency analysis
  • Perfect for learning and small projects

Academic Plan - $99/year

Verification Required
  • 500,000 tokens corpus limit
  • 25 AI analyses/month (20 GPT-4o-mini + 5 GPT-5.1)
  • Full NLP processing (spaCy)
  • All statistical analysis tools
  • Requires academic verification (see below)

Researcher Plan - $39/month

  • 500,000 tokens corpus limit
  • 55 AI analyses/month (50 GPT-4o-mini + 5 GPT-5.1)
  • Full NLP processing
  • 3 collaborators

Professional Plan - $99/month

  • 2,000,000 tokens corpus limit
  • 220 AI analyses/month (200 GPT-4o-mini + 20 GPT-5.1)
  • Priority processing
  • 10 collaborators

Institution Plan - $249/month

  • 10,000,000 tokens corpus limit
  • 1,100 AI analyses/month (1000 GPT-4o-mini + 100 GPT-5.1)
  • Full REST API access
  • Unlimited collaborators
  • Admin dashboard access

Academic Verification Process

To access the Academic plan, you must verify your academic status. We offer three verification methods:

1. Email Domain Verification (Automatic)

If you register with an academic email address, you'll be automatically verified. Supported domains include:

  • .edu (United States)
  • .ac.uk (United Kingdom)
  • .edu.au (Australia)
  • .ac.nz (New Zealand)
  • .edu.cn (China)
  • .ac.jp (Japan)
  • And 100+ more international academic domains

2. ORCID Verification (Automatic)

Link your ORCID iD to verify your researcher status:

  1. Visit the verification page
  2. Enter your ORCID iD (e.g., 0000-0002-1825-0097)
  3. We verify your ORCID via the public API
  4. Instant verification if your ORCID is active

3. Manual Review (1-2 business days)

If you don't have an academic email or ORCID, submit documentation:

  • University ID card
  • Letter from department
  • Academic staff profile page
  • Recent publication with institutional affiliation

Our team will review within 1-2 business days and notify you via email.

Getting Started: Visit your dashboard and click "Verify Now" or go to /verify-academic to begin the verification process.

Verification Status Badge

Once verified, you'll see a green verification badge on your dashboard and profile page showing:

  • Verified Status: Green badge with checkmark
  • Verification Method: Email Domain, ORCID, or Manual Review
  • Pending Review: Yellow badge while awaiting manual review

3. Document Management

Supported File Formats

  • TXT - Plain text files
  • CSV - Comma-separated values (with text column)
  • JSONL - JSON Lines format for structured data
  • PDF - Portable Document Format
  • DOCX - Microsoft Word documents

Uploading Documents

Navigate to the "Upload" tab in your corpus and either:

  • Click to browse and select files
  • Drag and drop files into the upload area

The language of each uploaded document is detected automatically.

Create from Text

Create documents directly by writing or pasting text:

  1. Navigate to your corpus page
  2. Find the "Create from Text" section
  3. Enter a document title
  4. Write or paste your text content
  5. Click "Create Document" to add it to your corpus

This is perfect for adding quick notes, manually transcribed content, or pasting text from other sources.

4. Search & KWIC Concordance

Full-Text Search

CorpusCraft uses SQLite FTS5 for powerful full-text search with the following operators:

  • guerra - Find exact word
  • "guerra civil" - Phrase search
  • guerra OR paz - Either word
  • guerra NOT civil - Exclude term
  • guerr* - Prefix matching (guerra, guerras, etc.)
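
How these operators behave can be sketched against a plain SQLite database, since FTS5 ships with most SQLite builds. The table and column names below are illustrative, not CorpusCraft's internal schema:

```python
import sqlite3

# In-memory database with an FTS5 virtual table (illustrative schema)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(title, body)")
conn.executemany(
    "INSERT INTO docs VALUES (?, ?)",
    [
        ("doc1", "la guerra civil comenzó en 1936"),
        ("doc2", "guerra y paz es una novela"),
        ("doc3", "un tratado de paz duradera"),
    ],
)

def search(query):
    """Run an FTS5 MATCH query and return matching document titles."""
    rows = conn.execute(
        "SELECT title FROM docs WHERE docs MATCH ? ORDER BY title", (query,)
    ).fetchall()
    return [title for (title,) in rows]

print(search('"guerra civil"'))    # phrase search
print(search("guerra OR paz"))     # either word
print(search("guerra NOT civil"))  # exclude term
print(search("guerr*"))            # prefix matching
```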

KWIC Concordance

Key Word in Context (KWIC) displays search results with surrounding context:

  • Left Context - Words before the keyword
  • Keyword - Your search term (highlighted)
  • Right Context - Words after the keyword
  • Source - Document title

Click column headers to sort by left context, keyword, right context, or source.

Context Window

Adjust the number of words shown before and after the keyword (5-20 words).
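
The layout above can be sketched in a few lines; the whitespace tokenizer and window handling here are simplified assumptions, not CorpusCraft's implementation:

```python
def kwic(text, keyword, window=5):
    """Return (left, keyword, right) rows with a +/- `window`-word context."""
    tokens = text.split()
    rows = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            rows.append((left, token, right))
    return rows

text = "the war ended but the war never really left their memories"
for left, kw, right in kwic(text, "war", window=3):
    print(f"{left:>20} | {kw} | {right}")
```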

Regex Search (Advanced)

Enable regex mode in Advanced Search for pattern-based searching:

  • \b(war|battle|conflict)\b - Word boundaries for alternative words
  • \d{4} - Match 4-digit numbers (e.g., years)
  • colou?r - Optional characters (color or colour)
  • [A-Z][a-z]+ - Capitalized words
  • \b\w{10,}\b - Words with 10+ characters

Note: Check the "Use Regular Expression (Regex) mode" checkbox in Advanced Search to enable regex patterns.
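
The patterns listed above behave as they would in any standard regex engine; shown here with Python's `re` module against a made-up sample text:

```python
import re

text = "The Battle of Colour Hill in 1936 saw extraordinary confrontations."

alternatives = re.findall(r"\b([Ww]ar|[Bb]attle|[Cc]onflict)\b", text)  # word boundaries
years = re.findall(r"\d{4}", text)             # 4-digit numbers (e.g., years)
spellings = re.findall(r"[Cc]olou?r", text)    # optional characters: color or colour
long_words = re.findall(r"\b\w{10,}\b", text)  # words with 10+ characters
print(alternatives, years, spellings, long_words)
```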

Batch Operations (Bulk Editing)

Perform operations on multiple documents at once:

  • Bulk Delete - Remove multiple documents in one action
  • Metadata Update - Apply same metadata value to selected documents
  • Bulk Export - Export selected documents as TXT or CSV

How to use: Select documents using the checkboxes, then click the "Batch Operations" button to access bulk editing tools.

5. Frequency Analysis

Word Frequency

Analyze how often words appear in your corpus:

  • Set minimum frequency threshold to filter rare words
  • View frequency counts and percentages
  • Results are sortable by rank, word, or frequency
  • Export results to PDF, Word, Excel, or CSV
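
The ranked list with a minimum-frequency threshold can be sketched as follows; the crude tokenizer is an assumption for illustration only:

```python
from collections import Counter

def word_frequencies(text, min_freq=2):
    """Rank words by frequency, dropping words below `min_freq`."""
    tokens = [t.lower().strip(".,;:!?\"'") for t in text.split()]
    counts = Counter(t for t in tokens if t)
    total = sum(counts.values())
    return [
        (rank, word, n, round(100 * n / total, 2))  # rank, word, count, percent
        for rank, (word, n) in enumerate(counts.most_common(), start=1)
        if n >= min_freq
    ]

text = "la guerra y la paz la guerra continua"
for row in word_frequencies(text):
    print(row)
```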

N-gram Analysis

Analyze multi-word sequences:

  • Bigrams - 2-word sequences (e.g., "guerra civil")
  • Trigrams - 3-word sequences
  • 4-grams+ - Longer sequences
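
Extracting and counting such sequences is a sliding-window operation, sketched here over a pre-tokenized sample:

```python
from collections import Counter

def ngrams(tokens, n):
    """Count all contiguous n-word sequences."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = "la guerra civil y la guerra fría".split()
print(ngrams(tokens, 2).most_common(3))  # bigrams
print(ngrams(tokens, 3).most_common(3))  # trigrams
```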

Collocation Analysis

Find words that frequently appear together with a target word using multiple statistical measures:

  • MI score - Mutual Information, measures association strength
  • t-score - Frequency-sensitive, good for common words
  • Dice coefficient - Symmetrical association (0-1 scale)
  • Log-Likelihood - Statistical significance measure

Tip: Use multiple measures together for robust collocation analysis, and export the results using the export button.
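
The first two measures can be sketched from their textbook definitions; this simplified version omits the window-size correction that full implementations apply:

```python
import math

def mi_and_t(o, f_node, f_coll, n):
    """MI score and t-score for a node/collocate pair.

    o: observed co-occurrences, f_node/f_coll: individual word
    frequencies, n: corpus size in tokens.
    """
    expected = f_node * f_coll / n     # co-occurrences expected by chance
    mi = math.log2(o / expected)       # association strength
    t = (o - expected) / math.sqrt(o)  # frequency-sensitive significance
    return mi, t

mi, t = mi_and_t(o=8, f_node=40, f_coll=50, n=100_000)
print(round(mi, 2), round(t, 2))
```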

6. Statistical Analysis

These statistical analysis features require no AI and provide objective metrics for your corpus.

Readability Indices (Free)

Assess text complexity using multiple readability formulas:

  • Flesch Reading Ease - 0-100 scale, higher = easier
  • Flesch-Kincaid Grade Level - US grade level required
  • Gunning Fog Index - Years of education needed
  • SMOG Index - Simple Measure of Gobbledygook
  • Coleman-Liau Index - Character-based grade level
  • Automated Readability Index (ARI) - Character-based formula

How to use: Navigate to the Statistics tab and click "Calculate Readability".
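
As an illustration, the published Flesch Reading Ease formula is 206.835 - 1.015 x (words/sentences) - 84.6 x (syllables/words). A rough sketch using a naive vowel-group syllable heuristic (real implementations use proper syllabification):

```python
import re

def count_syllables(word):
    """Naive syllable estimate: count vowel groups (a rough heuristic)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """Flesch Reading Ease: 206.835 - 1.015*(words/sents) - 84.6*(sylls/words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

score = flesch_reading_ease("The cat sat on the mat. It was warm.")
print(round(score, 1))  # higher score = easier text
```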

Lexical Diversity (Researcher+)

Measure vocabulary richness using advanced metrics:

  • Type-Token Ratio (TTR) - Simple vocabulary diversity measure
  • Standardized TTR (STTR) - Less affected by text length
  • MTLD - Measure of Textual Lexical Diversity, most stable metric
  • HD-D - Hypergeometric Distribution D, probabilistic approach
  • Yule's K - Vocabulary concentration measure
  • Hapax Legomena - Words appearing only once

Interpretation: Higher TTR, STTR, MTLD, and HD-D values indicate more diverse vocabulary. Lower Yule's K suggests greater diversity.
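
Two of these metrics, TTR and hapax legomena, can be sketched directly from their definitions (a minimal illustration over a pre-tokenized sample):

```python
from collections import Counter

def lexical_diversity(tokens):
    """TTR and hapax legomena: a minimal sketch of two of the metrics."""
    counts = Counter(tokens)
    ttr = len(counts) / len(tokens)  # distinct types / total tokens
    hapax = sorted(w for w, n in counts.items() if n == 1)
    return ttr, hapax

tokens = "the war and the peace and one truce".split()
ttr, hapax = lexical_diversity(tokens)
print(round(ttr, 3), hapax)
```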

Keyness Analysis (Professional+)

Compare two corpora to identify statistically significant differences:

  • Chi-square test - Statistical significance testing
  • Log-likelihood - More reliable for sparse data
  • Effect size - Practical significance (log ratio)
  • Normalized frequencies - Per 10,000 words comparison

Use case: Identify which words are distinctively more common in one corpus compared to another (e.g., comparing historical periods, genres, or authors).
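
The log-likelihood measure can be sketched from the standard two-corpus contingency formula (a simplified version; the sign convention that distinguishes overuse from underuse is omitted):

```python
import math

def log_likelihood(a, b, c, d):
    """Two-corpus keyness log-likelihood.

    a, b: word frequency in corpus 1 / corpus 2; c, d: corpus sizes.
    """
    e1 = c * (a + b) / (c + d)  # expected frequency in corpus 1
    e2 = d * (a + b) / (c + d)  # expected frequency in corpus 2
    ll = 0.0
    if a:
        ll += a * math.log(a / e1)
    if b:
        ll += b * math.log(b / e2)
    return 2 * ll

# A word occurring 120 times in a 50k-token corpus vs. 30 times in
# another 50k-token corpus is strongly "key" in the first.
print(round(log_likelihood(120, 30, 50_000, 50_000), 2))
```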

7. AI-Powered Analysis (18 Features)

Note: AI features use GPT-4o-mini and GPT-5.1 models. Check your plan's AI query limits on the dashboard.

Document Analysis

Auto-Classification

Automatically categorize documents by type and theme

Discover Themes

Identify main themes across your corpus

Smart Summary

Generate concise or detailed summaries

Entity Extraction

Extract people, places, dates, and organizations

Search & Discovery

Natural Language Query

Ask questions in plain language

Semantic Similarity

Find documents with similar meanings

Keyword Extraction

Extract significant keywords automatically

Pattern Suggestions

Discover trends and patterns in your data

Stylistic Analysis

Writing Style

Compare writing styles between authors

Readability

Analyze text complexity and reading level

Register Detection

Identify formality levels in language

Sentiment Analysis

Detect emotions and tone across corpus

Linguistic Analysis

Discourse Markers

Identify discourse markers and functions

Semantic Fields

Map semantic relationships

Contextual Definition

Get historical/linguistic context for terms

Quote Extraction

Extract important quotations

Comparative Analysis

Compare Corpora

Compare two corpora side-by-side

Diachronic Change

Track language evolution over time

8. NLP Processing

How to Use NLP Processing

  1. Navigate to your corpus page
  2. Click the NLP tab in the analysis navigation
  3. Select a document from the dropdown menu
  4. Click "Analyze Document" to process with spaCy

Note: NLP processing analyzes individual documents, not the entire corpus. Language is auto-detected based on document metadata.

Supported Languages

NLP processing is available for 8 languages:

  • English - en_core_web_sm (advanced pipeline with lemmatizer, NER, parser)
  • Spanish - es_core_news_sm (optimized for Spanish text)
  • Russian - ru_core_news_sm (Cyrillic text support)
  • French - fr_core_news_sm (complete French pipeline)
  • German - de_core_news_sm (German morphology support)
  • Chinese - zh_core_web_sm (simplified Chinese with word segmentation)
  • Japanese - ja_core_news_sm (Japanese tokenization and analysis)
  • Arabic - pyarabic + NLTK (tokenization, stemming, stopwords)

NLP Features

  • Complete Token Analysis - Lemmatization, POS tagging, and dependency parsing for all tokens with stopword identification
  • Sentence Segmentation - Automatic sentence extraction with statistics (total count, average/min/max length in tokens)
  • Noun Chunks - Extract noun phrases with root words and dependency relations
  • Morphological Features - Detailed grammatical analysis including gender, case, number, tense, person, mood, and aspect
  • Lemma Frequencies - Ranked frequency distribution of base forms excluding stopwords
  • Stopword Filtering - Identify and filter function words to focus on content words
  • Named Entity Recognition - Identify people (PERSON), places (GPE, LOC), organizations (ORG), dates, and more
  • Dependency Parsing - Analyze grammatical structure and relationships between words
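
For instance, the lemma-frequency step reduces to counting base forms after stopword filtering. The sketch below assumes (lemma, is_stopword) pairs as a pipeline such as spaCy would produce; the sample data is made up:

```python
from collections import Counter

# Hypothetical output of an NLP pipeline: (lemma, is_stopword) pairs
analyzed = [
    ("guerra", False), ("ser", True), ("largo", False),
    ("guerra", False), ("y", True), ("paz", False),
]

# Rank lemmas by frequency, excluding stopwords
counts = Counter(lemma for lemma, is_stop in analyzed if not is_stop)
total = sum(counts.values())
for rank, (lemma, n) in enumerate(counts.most_common(), start=1):
    print(rank, lemma, n, f"{100 * n / total:.1f}%")
```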

Analysis Results

The NLP tab displays comprehensive results including:

  • Document Statistics - Total tokens, tokens without stopwords, language information
  • Sentence Statistics - Total sentences, average/min/max sentence length with full sentence list
  • Noun Chunks - All extracted noun phrases with root words and dependencies
  • Lemma Frequencies - Complete ranked frequency list excluding stopwords with counts and percentages
  • Morphological Features - All tokens with grammatical features (gender, case, tense, etc.)
  • POS Distribution - All part-of-speech tags with counts
  • Named Entities - All detected entities with their labels
  • Complete Token Analysis - All tokens showing text, lemma, POS tag, fine-grained tag, dependency relation, and stopword status
  • Professional Exports - Export complete analysis to PDF, Word, Excel, or CSV formats

9. Visualizations

Available Visualizations

  • Word Clouds - Visual representation of word frequencies
  • Frequency Distribution - Bar charts showing word frequencies
  • Lexical Dispersion - Track word usage across documents

Customization

Most visualizations allow you to adjust parameters like word count, colors, and filtering.

10. Export & Saving Results

Export Formats

All analysis results can be exported in multiple formats:

  • PDF - Professional reports with formatted tables
  • Word (DOCX) - Editable documents for further work
  • Excel (XLSX) - Spreadsheets for data analysis
  • CSV - Import into other tools

Auto-Save Feature

All analysis results are automatically saved for later export:

  • KWIC concordance results
  • Frequency analysis
  • All 18 AI analysis features
  • NLP processing results

Exports Page

Access all your saved outputs from the Exports page:

  • Browse all saved analysis results
  • Filter by corpus or analysis type
  • Export individual results in any format
  • Bulk export: select multiple outputs and combine into one document

Corpus Export

Export your entire corpus as JSONL for backup or sharing.
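
JSONL stores one JSON object per line, which makes corpus backups easy to stream and diff. A sketch of writing and reading such a file (the field names are illustrative, not CorpusCraft's exact export schema):

```python
import json

docs = [
    {"title": "doc1", "text": "la guerra civil", "metadata": {"year": 1936}},
    {"title": "doc2", "text": "un tratado de paz", "metadata": {"year": 1948}},
]

# Write: one JSON object per line
with open("corpus_export.jsonl", "w", encoding="utf-8") as f:
    for doc in docs:
        f.write(json.dumps(doc, ensure_ascii=False) + "\n")

# Read it back line by line
with open("corpus_export.jsonl", encoding="utf-8") as f:
    restored = [json.loads(line) for line in f]
print(len(restored))
```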

11. Collaboration

Adding Collaborators

Share your corpus with other researchers:

  1. Open the corpus you want to share
  2. Go to the "Collaborators" section
  3. Add collaborators by email
  4. Set permissions (view, edit, admin)

Annotation Layers

Create annotation layers for collaborative markup and analysis.

12. Version Control (Snapshots)

Creating Snapshots

Save versions of your corpus at different stages:

  1. Go to the "Export & Snapshots" section
  2. Click "Create Snapshot"
  3. Add a tag (e.g., "v1.0", "pre-cleanup")
  4. Review your snapshot history

Restoring Snapshots

Snapshots can be used to restore your corpus to a previous state if needed.

Tips & Best Practices

Research Workflow Tips

  1. Start with metadata - Define your schema before uploading documents
  2. Use snapshots - Create snapshots before major changes
  3. Filter searches - Use document filters to narrow your analysis
  4. Export often - Save your results regularly for your research notes
  5. Try AI features - AI analysis can reveal unexpected patterns
  6. Check token limits - Monitor your AI usage on the dashboard

Performance Tips

  • Use minimum frequency filters to reduce noise in frequency analysis
  • Filter by specific documents when analyzing large corpora
  • AI features automatically sample large texts to stay within token limits
  • Export to CSV for custom analysis in R, Python, or Excel

Need More Help?

Check out these additional resources: