CorpusCraft Documentation
Complete guide to using CorpusCraft for corpus linguistics research
1. Getting Started
Creating an Account
CorpusCraft uses magic link authentication for secure, passwordless access:
- Click "Get Started" on the homepage
- Enter your email address
- Check your email for the login link
- Click the link to access your account
Creating Your First Corpus
- From your dashboard, click "Create New Corpus"
- Enter a name and optional description
- Click "Create Corpus"
- Start uploading documents
2. Pricing & Academic Verification
Pricing Tiers
CorpusCraft offers five pricing tiers designed for different research needs:
Free Plan - $0
- 10 documents (50,000 tokens)
- Full-text search & KWIC concordance
- Basic frequency analysis
- Perfect for learning and small projects
Academic Plan - $99/year
- 500,000 tokens corpus limit
- 25 AI analyses/month (20 GPT-4o-mini + 5 GPT-5.1)
- Full NLP processing (spaCy)
- All statistical analysis tools
- Requires academic verification (see below)
Researcher Plan - $39/month
- 500,000 tokens corpus limit
- 55 AI analyses/month (50 GPT-4o-mini + 5 GPT-5.1)
- Full NLP processing
- 3 collaborators
Professional Plan - $99/month
- 2,000,000 tokens corpus limit
- 220 AI analyses/month (200 GPT-4o-mini + 20 GPT-5.1)
- Priority processing
- 10 collaborators
Institution Plan - $249/month
- 10,000,000 tokens corpus limit
- 1,100 AI analyses/month (1000 GPT-4o-mini + 100 GPT-5.1)
- Full REST API access
- Unlimited collaborators
- Admin dashboard access
Academic Verification Process
To access the Academic plan, you must verify your academic status. We offer three verification methods:
1. Email Domain Verification (Automatic)
If you register with an academic email address, you'll be automatically verified. Supported domains include:
- .edu (United States)
- .ac.uk (United Kingdom)
- .edu.au (Australia)
- .ac.nz (New Zealand)
- .edu.cn (China)
- .ac.jp (Japan)
- And 100+ more international academic domains
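Automatic domain verification reduces to a suffix check on the email's domain. A minimal Python sketch; the function name and the (abbreviated) suffix list are illustrative, not CorpusCraft's actual implementation:

```python
# Illustrative subset of academic domain suffixes; the real list is much longer.
ACADEMIC_SUFFIXES = (".edu", ".ac.uk", ".edu.au", ".ac.nz", ".edu.cn", ".ac.jp")

def is_academic_email(email: str) -> bool:
    """Return True if the email's domain ends with a known academic suffix."""
    if "@" not in email:
        return False
    domain = email.rsplit("@", 1)[1].lower()
    return domain.endswith(ACADEMIC_SUFFIXES)
```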
2. ORCID Verification (Automatic)
Link your ORCID iD to verify your researcher status:
- Visit the verification page
- Enter your ORCID iD (e.g., 0000-0002-1825-0097)
- We verify your ORCID via the public API
- Instant verification if your ORCID is active
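Before any API call, an ORCID iD can be sanity-checked locally: its last character is an ISO 7064 MOD 11-2 check digit computed from the first 15 digits. A sketch of that standard checksum (function names are illustrative):

```python
def orcid_check_digit(base_digits: str) -> str:
    """ISO 7064 MOD 11-2 check digit for the first 15 digits of an ORCID iD."""
    total = 0
    for d in base_digits:
        total = (total + int(d)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)

def is_valid_orcid(orcid: str) -> bool:
    """Validate the format and checksum of an iD like 0000-0002-1825-0097."""
    digits = orcid.replace("-", "")
    if len(digits) != 16 or not digits[:15].isdigit():
        return False
    return orcid_check_digit(digits[:15]) == digits[15].upper()
```

A passing checksum only means the iD is well-formed; the public API lookup still confirms the record actually exists.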
3. Manual Review (1-2 business days)
If you don't have an academic email or ORCID, submit documentation:
- University ID card
- Letter from department
- Academic staff profile page
- Recent publication with institutional affiliation
Our team will review within 1-2 business days and notify you via email.
Getting Started: Visit your dashboard and click "Verify Now" or go to /verify-academic to begin the verification process.
Verification Status Badge
Once verified, you'll see a green verification badge on your dashboard and profile page showing:
- Verified Status: Green badge with checkmark
- Verification Method: Email Domain, ORCID, or Manual Review
- Pending Review: Yellow badge while awaiting manual review
3. Document Management
Supported File Formats
- TXT - Plain text files
- CSV - Comma-separated values (with text column)
- JSONL - JSON Lines format for structured data
- PDF - Portable Document Format
- DOCX - Microsoft Word documents
Uploading Documents
Navigate to the "Upload" tab in your corpus and either:
- Click to browse and select files
- Drag and drop files into the upload area
The language of each uploaded document is detected automatically.
Create from Text
Create documents directly by writing or pasting text:
- Navigate to your corpus page
- Find the "Create from Text" section
- Enter a document title
- Write or paste your text content
- Click "Create Document" to add it to your corpus
This is perfect for adding quick notes, manually transcribed content, or pasting text from other sources.
4. Search & KWIC Concordance
Full-Text Search
CorpusCraft uses SQLite FTS5 for powerful full-text search with operators:
- guerra - Find exact word
- "guerra civil" - Phrase search
- guerra OR paz - Either word
- guerra NOT civil - Exclude term
- guerr* - Prefix matching (guerra, guerras, etc.)
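Because search is backed by SQLite FTS5, these operators behave exactly as in a raw FTS5 MATCH query. A self-contained demo with invented sample documents, assuming your Python build's sqlite3 was compiled with FTS5 (true of most modern distributions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(title, body)")
conn.executemany(
    "INSERT INTO docs VALUES (?, ?)",
    [
        ("doc1", "la guerra civil empezó en 1936"),
        ("doc2", "guerra y paz es una novela"),
        ("doc3", "las guerras del siglo veinte"),
    ],
)

def search(query: str) -> list[str]:
    """Return titles of documents matching an FTS5 query string."""
    rows = conn.execute(
        "SELECT title FROM docs WHERE docs MATCH ? ORDER BY title", (query,)
    )
    return [r[0] for r in rows]
```

For example, `search('guerra NOT civil')` returns only doc2, while the prefix query `search('guerr*')` matches all three documents.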
KWIC Concordance
Key Word in Context (KWIC) displays search results with surrounding context:
- Left Context - Words before the keyword
- Keyword - Your search term (highlighted)
- Right Context - Words after the keyword
- Source - Document title
Click column headers to sort by left context, keyword, right context, or source.
Context Window
Adjust the number of words shown before and after the keyword (5-20 words).
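Under the hood, a KWIC line is just a windowing operation over the token stream. A minimal Python sketch (the function name and signature are illustrative, not CorpusCraft internals):

```python
def kwic(tokens: list[str], keyword: str, window: int = 5) -> list[tuple[str, str, str]]:
    """Return (left context, keyword, right context) for each match."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits
```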
Regex Search (Advanced)
Enable regex mode in Advanced Search for pattern-based searching:
- \b(war|battle|conflict)\b - Word boundaries for alternative words
- \d{4} - Match 4-digit numbers (e.g., years)
- colou?r - Optional characters (color or colour)
- [A-Z][a-z]+ - Capitalized words
- \b\w{10,}\b - Words with 10+ characters
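These patterns follow standard regular-expression syntax, so you can prototype them locally before running them against a corpus. A quick check with Python's re module on an invented sentence:

```python
import re

text = ("The Civil War began in 1861. Colour and color differ; "
        "the battle was extraordinarily devastating.")

terms = re.findall(r"\b(war|battle|conflict)\b", text, re.IGNORECASE)  # alternatives
years = re.findall(r"\d{4}", text)                                     # 4-digit numbers
spellings = re.findall(r"colou?r", text, re.IGNORECASE)                # optional character
long_words = re.findall(r"\b\w{10,}\b", text)                          # 10+ characters
```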
Batch Operations & Bulk Editing
Perform operations on multiple documents at once:
- Bulk Delete - Remove multiple documents in one action
- Metadata Update - Apply same metadata value to selected documents
- Bulk Export - Export selected documents as TXT or CSV
5. Frequency Analysis
Word Frequency
Analyze how often words appear in your corpus:
- Set minimum frequency threshold to filter rare words
- View frequency counts and percentages
- Results are sortable by rank, word, or frequency
- Export results to PDF, Word, Excel, or CSV
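The core of a frequency list is a ranked count with a minimum-frequency cutoff. A minimal sketch in Python (names are illustrative; CorpusCraft's tokenization may differ):

```python
from collections import Counter
import re

def word_frequencies(text: str, min_freq: int = 1) -> list[tuple[str, int, float]]:
    """Rank words by frequency, filtering rare words; returns (word, count, pct)."""
    words = re.findall(r"\w+", text.lower())
    total = len(words)
    counts = Counter(words)
    return [
        (w, c, round(100 * c / total, 2))
        for w, c in counts.most_common()
        if c >= min_freq
    ]
```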
N-gram Analysis
Analyze multi-word sequences:
- Bigrams - 2-word sequences (e.g., "guerra civil")
- Trigrams - 3-word sequences
- 4-grams+ - Longer sequences
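An n-gram extractor is a sliding window of size n over the token list. A minimal sketch (illustrative, not CorpusCraft's implementation):

```python
def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """Slide a window of size n over the token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```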
Collocation Analysis
Find words that frequently appear together with a target word using multiple statistical measures:
- MI score - Mutual Information, measures association strength
- t-score - Frequency-sensitive, good for common words
- Dice coefficient - Symmetrical association (0-1 scale)
- Log-Likelihood - Statistical significance measure
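The first and third measures have compact standard formulas: MI is the base-2 log of observed over expected co-occurrence, and Dice is twice the joint frequency over the sum of the individual frequencies. A sketch of both (argument names are illustrative; exact windowing conventions vary by tool):

```python
import math

def mi_score(f_pair: int, f_node: int, f_colloc: int, corpus_size: int) -> float:
    """Mutual Information: log2 of observed vs. expected co-occurrence."""
    expected = f_node * f_colloc / corpus_size
    return math.log2(f_pair / expected)

def dice(f_pair: int, f_node: int, f_colloc: int) -> float:
    """Dice coefficient: symmetrical association on a 0-1 scale."""
    return 2 * f_pair / (f_node + f_colloc)
```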
6. Statistical Analysis
Powerful statistical analysis features that don't require AI, providing objective metrics for your corpus.
Readability Indices (Free)
Assess text complexity using multiple readability formulas:
- Flesch Reading Ease - 0-100 scale, higher = easier
- Flesch-Kincaid Grade Level - US grade level required
- Gunning Fog Index - Years of education needed
- SMOG Index - Simple Measure of Gobbledygook
- Coleman-Liau Index - Character-based grade level
- Automated Readability Index (ARI) - Character-based formula
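As an illustration, Flesch Reading Ease combines average sentence length and average syllables per word with fixed constants (206.835, 1.015, 84.6). The sketch below uses a crude vowel-group syllable heuristic; production tools use dictionaries or better heuristics, so expect small differences:

```python
import re

def count_syllables(word: str) -> int:
    """Rough heuristic: count vowel groups (demo quality only)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    """206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / sentences) - 84.6 * (syllables / len(words))
```

Note the raw formula is not clamped, so very simple text can score above 100.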
Lexical Diversity (Researcher+)
Measure vocabulary richness using advanced metrics:
- Type-Token Ratio (TTR) - Simple vocabulary diversity measure
- Standardized TTR (STTR) - Less affected by text length
- MTLD - Measure of Textual Lexical Diversity, most stable metric
- HD-D - Hypergeometric Distribution D, probabilistic approach
- Yule's K - Vocabulary concentration measure
- Hapax Legomena - Words appearing only once
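The two simplest of these metrics fit in a few lines: TTR is distinct forms over total tokens, and hapax legomena are the words with frequency one. A minimal sketch (illustrative names):

```python
from collections import Counter

def ttr(tokens: list[str]) -> float:
    """Type-Token Ratio: distinct word forms divided by total tokens."""
    return len(set(tokens)) / len(tokens)

def hapax_legomena(tokens: list[str]) -> list[str]:
    """Words that appear exactly once, in order of first occurrence."""
    counts = Counter(tokens)
    return [w for w, c in counts.items() if c == 1]
```

Because raw TTR falls as texts grow, length-robust metrics like STTR and MTLD are preferred when comparing documents of different sizes.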
Keyness Analysis (Professional+)
Compare two corpora to identify statistically significant differences:
- Chi-square test - Statistical significance testing
- Log-likelihood - More reliable for sparse data
- Effect size - Practical significance (log ratio)
- Normalized frequencies - Per 10,000 words comparison
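For reference, the log-likelihood keyness statistic (Dunning's G2) compares a word's observed frequency in each corpus against its expected frequency given the pooled rate. A sketch under the usual two-corpus contingency setup (argument names are illustrative):

```python
import math

def log_likelihood(freq_a: int, freq_b: int, size_a: int, size_b: int) -> float:
    """Log-likelihood (G2) for a word's frequency in corpus A vs. corpus B."""
    expected_a = size_a * (freq_a + freq_b) / (size_a + size_b)
    expected_b = size_b * (freq_a + freq_b) / (size_a + size_b)
    ll = 0.0
    if freq_a:
        ll += freq_a * math.log(freq_a / expected_a)
    if freq_b:
        ll += freq_b * math.log(freq_b / expected_b)
    return 2 * ll
```

A word used at the same rate in both corpora scores 0; larger values indicate a stronger, more significant difference (3.84 corresponds to p < 0.05).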
7. AI-Powered Analysis (18 Features)
Note: AI features use GPT-4o-mini and GPT-5.1 models. Check your plan's AI query limits on the dashboard.
Document Analysis
Auto-Classification
Automatically categorize documents by type and theme
Discover Themes
Identify main themes across your corpus
Smart Summary
Generate concise or detailed summaries
Entity Extraction
Extract people, places, dates, and organizations
Search & Discovery
Natural Language Query
Ask questions in plain language
Semantic Similarity
Find documents with similar meanings
Keyword Extraction
Extract significant keywords automatically
Pattern Suggestions
Discover trends and patterns in your data
Stylistic Analysis
Writing Style
Compare writing styles between authors
Readability
Analyze text complexity and reading level
Register Detection
Identify formality levels in language
Sentiment Analysis
Detect emotions and tone across corpus
Linguistic Analysis
Discourse Markers
Identify discourse markers and functions
Semantic Fields
Map semantic relationships
Contextual Definition
Get historical/linguistic context for terms
Quote Extraction
Extract important quotations
Comparative Analysis
Compare Corpora
Compare two corpora side-by-side
Diachronic Change
Track language evolution over time
8. NLP Processing
How to Use NLP Processing
- Navigate to your corpus page
- Click the NLP tab in the analysis navigation
- Select a document from the dropdown menu
- Click "Analyze Document" to process with spaCy
Supported Languages
NLP processing is available for 8 languages:
- English - en_core_web_sm (advanced pipeline with lemmatizer, NER, parser)
- Spanish - es_core_news_sm (optimized for Spanish text)
- Russian - ru_core_news_sm (Cyrillic text support)
- French - fr_core_news_sm (complete French pipeline)
- German - de_core_news_sm (German morphology support)
- Chinese - zh_core_web_sm (simplified Chinese with word segmentation)
- Japanese - ja_core_news_sm (Japanese tokenization and analysis)
- Arabic - pyarabic + NLTK (tokenization, stemming, stopwords)
NLP Features
- Complete Token Analysis - Lemmatization, POS tagging, and dependency parsing for all tokens with stopword identification
- Sentence Segmentation - Automatic sentence extraction with statistics (total count, average/min/max length in tokens)
- Noun Chunks - Extract noun phrases with root words and dependency relations
- Morphological Features - Detailed grammatical analysis including gender, case, number, tense, person, mood, and aspect
- Lemma Frequencies - Ranked frequency distribution of base forms excluding stopwords
- Stopword Filtering - Identify and filter function words to focus on content words
- Named Entity Recognition - Identify people (PERSON), places (GPE, LOC), organizations (ORG), dates, and more
- Dependency Parsing - Analyze grammatical structure and relationships between words
Analysis Results
The NLP tab displays comprehensive results including:
- Document Statistics - Total tokens, tokens without stopwords, language information
- Sentence Statistics - Total sentences, average/min/max sentence length with full sentence list
- Noun Chunks - All extracted noun phrases with root words and dependencies
- Lemma Frequencies - Complete ranked frequency list excluding stopwords with counts and percentages
- Morphological Features - All tokens with grammatical features (gender, case, tense, etc.)
- POS Distribution - All part-of-speech tags with counts
- Named Entities - All detected entities with their labels
- Complete Token Analysis - All tokens showing text, lemma, POS tag, fine-grained tag, dependency relation, and stopword status
- Professional Exports - Export complete analysis to PDF, Word, Excel, or CSV formats
9. Visualizations
Available Visualizations
- Word Clouds - Visual representation of word frequencies
- Frequency Distribution - Bar charts showing word frequencies
- Lexical Dispersion - Track word usage across documents
Customization
Most visualizations allow you to adjust parameters like word count, colors, and filtering.
10. Export & Saving Results
Export Formats
All analysis results can be exported in multiple formats:
- PDF - Professional reports with formatted tables
- Word (DOCX) - Editable documents for further work
- Excel (XLSX) - Spreadsheets for data analysis
- CSV - Import into other tools
Auto-Save Feature
All analysis results are automatically saved for later export:
- KWIC concordance results
- Frequency analysis
- All 18 AI analysis features
- NLP processing results
Exports Page
Access all your saved outputs from the Exports page:
- Browse all saved analysis results
- Filter by corpus or analysis type
- Export individual results in any format
- Bulk export: select multiple outputs and combine into one document
Corpus Export
Export your entire corpus as JSONL for backup or sharing.
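JSONL stores one JSON object per line, which makes corpus backups easy to write and re-read with any tooling. A minimal round-trip sketch; the "title"/"text" fields here are assumptions, not CorpusCraft's actual export schema:

```python
import json

def export_corpus(docs: list[dict], path: str) -> None:
    """Write one JSON object per line (JSONL)."""
    with open(path, "w", encoding="utf-8") as f:
        for doc in docs:
            f.write(json.dumps(doc, ensure_ascii=False) + "\n")

def import_corpus(path: str) -> list[dict]:
    """Read a JSONL file back into a list of documents."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```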
11. Collaboration
Adding Collaborators
Share your corpus with other researchers:
- Open the corpus you want to share
- Go to the "Collaborators" section
- Add collaborators by email
- Set permissions (view, edit, admin)
Annotation Layers
Create annotation layers for collaborative markup and analysis.
12. Version Control (Snapshots)
Creating Snapshots
Save versions of your corpus at different stages:
- Go to the "Export & Snapshots" section
- Click "Create Snapshot"
- Add a tag (e.g., "v1.0", "pre-cleanup")
- Review your snapshot history
Restoring Snapshots
Snapshots can be used to restore your corpus to a previous state if needed.
Tips & Best Practices
Research Workflow Tips
- 1. Start with metadata - Define your schema before uploading documents
- 2. Use snapshots - Create snapshots before major changes
- 3. Filter searches - Use document filters to narrow your analysis
- 4. Export often - Save your results regularly for your research notes
- 5. Try AI features - AI analysis can reveal unexpected patterns
- 6. Check token limits - Monitor your AI usage on the dashboard
Performance Tips
- Use minimum frequency filters to reduce noise in frequency analysis
- Filter by specific documents when analyzing large corpora
- AI features automatically sample large texts to stay within token limits
- Export to CSV for custom analysis in R, Python, or Excel
Need More Help?
Check out these additional resources: