CorpusCraft Documentation

Complete guide to using CorpusCraft for corpus linguistics research

1. Getting Started

Creating an Account

CorpusCraft uses magic link authentication for secure, passwordless access:

  1. Click "Get Started" on the homepage
  2. Enter your email address
  3. Check your email for the login link
  4. Click the link to access your account

Creating Your First Corpus

  1. From your dashboard, click "Create New Corpus"
  2. Enter a name and optional description
  3. Click "Create Corpus"
  4. Start uploading documents

2. Pricing & Academic Verification

Pricing Tiers

CorpusCraft offers five pricing tiers designed for different research needs:

Free Plan - $0

  • 10 documents (50,000 tokens)
  • Full-text search & KWIC concordance
  • Basic frequency analysis
  • Perfect for learning and small projects

Academic Plan - $99/year

Verification Required
  • 500,000 tokens corpus limit
  • 25 AI analyses/month (20 GPT-4o-mini + 5 GPT-5.1)
  • Full NLP processing (spaCy)
  • All statistical analysis tools
  • Requires academic verification (see below)

Researcher Plan - $39/month

  • 500,000 tokens corpus limit
  • 55 AI analyses/month (50 GPT-4o-mini + 5 GPT-5.1)
  • Full NLP processing
  • 3 collaborators

Professional Plan - $99/month

  • 2,000,000 tokens corpus limit
  • 220 AI analyses/month (200 GPT-4o-mini + 20 GPT-5.1)
  • Priority processing
  • 10 collaborators

Institution Plan - $249/month

  • 10,000,000 tokens corpus limit
  • 1,100 AI analyses/month (1000 GPT-4o-mini + 100 GPT-5.1)
  • Full REST API access
  • Unlimited collaborators
  • Admin dashboard access

Academic Verification Process

To access the Academic plan, you must verify your academic status. We offer three verification methods:

1. Email Domain Verification (Automatic)

If you register with an academic email address, you'll be automatically verified. Supported domains include:

  • .edu (United States)
  • .ac.uk (United Kingdom)
  • .edu.au (Australia)
  • .ac.nz (New Zealand)
  • .edu.cn (China)
  • .ac.jp (Japan)
  • And 100+ more international academic domains

2. ORCID Verification (Automatic)

Link your ORCID iD to verify your researcher status:

  1. Visit the verification page
  2. Enter your ORCID iD (e.g., 0000-0002-1825-0097)
  3. We verify your ORCID via the public API
  4. Instant verification if your ORCID is active

3. Manual Review (1-2 business days)

If you don't have an academic email or ORCID, submit documentation:

  • University ID card
  • Letter from department
  • Academic staff profile page
  • Recent publication with institutional affiliation

Our team will review within 1-2 business days and notify you via email.

Getting Started: Visit your dashboard and click "Verify Now" or go to /verify-academic to begin the verification process.

Verification Status Badge

Once verified, you'll see a green verification badge on your dashboard and profile page showing:

  • Verified Status: Green badge with checkmark
  • Verification Method: Email Domain, ORCID, or Manual Review
  • Pending Review: Yellow badge while awaiting manual review

3. Document Management

Supported File Formats

  • TXT - Plain text files
  • CSV - Comma-separated values (with text column)
  • JSONL - JSON Lines format for structured data
  • PDF - Portable Document Format
  • DOCX - Microsoft Word documents

Uploading Documents

Navigate to the "Upload" tab in your corpus and either:

  • Click to browse and select files
  • Drag and drop files into the upload area

The language of each uploaded document is detected automatically.

Create from Text

Create documents directly by writing or pasting text:

  1. Navigate to your corpus page
  2. Find the "Create from Text" section
  3. Enter a document title
  4. Write or paste your text content
  5. Click "Create Document" to add it to your corpus

This is perfect for adding quick notes, manually transcribed content, or pasting text from other sources.

4. Search & KWIC Concordance

Full-Text Search

CorpusCraft uses SQLite FTS5 for powerful full-text search with the following operators:

  • guerra - Find exact word
  • "guerra civil" - Phrase search
  • guerra OR paz - Either word
  • guerra NOT civil - Exclude term
  • guerr* - Prefix matching (guerra, guerras, etc.)
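
How these operators behave can be sketched against a plain SQLite database, since FTS5 ships with most SQLite builds. The table and column names below are illustrative, not CorpusCraft's internal schema:

```python
import sqlite3

# In-memory database with an FTS5 virtual table (illustrative schema)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(title, body)")
conn.executemany(
    "INSERT INTO docs VALUES (?, ?)",
    [
        ("doc1", "la guerra civil comenzó en 1936"),
        ("doc2", "guerra y paz es una novela"),
        ("doc3", "un tratado de paz duradera"),
    ],
)

def search(query):
    """Run an FTS5 MATCH query and return matching document titles."""
    rows = conn.execute(
        "SELECT title FROM docs WHERE docs MATCH ? ORDER BY title", (query,)
    ).fetchall()
    return [title for (title,) in rows]

print(search('"guerra civil"'))    # phrase search
print(search("guerra OR paz"))     # either word
print(search("guerra NOT civil"))  # exclude term
print(search("guerr*"))            # prefix matching
```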

KWIC Concordance

Key Word in Context (KWIC) displays search results with surrounding context:

  • Left Context - Words before the keyword
  • Keyword - Your search term (highlighted)
  • Right Context - Words after the keyword
  • Source - Document title

Click column headers to sort by left context, keyword, right context, or source.

Context Window

Adjust the number of words shown before and after the keyword (5-20 words).
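
The layout above can be sketched in a few lines; the whitespace tokenizer and window handling here are simplified assumptions, not CorpusCraft's implementation:

```python
def kwic(text, keyword, window=5):
    """Return (left, keyword, right) rows with a +/- `window`-word context."""
    tokens = text.split()
    rows = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            rows.append((left, token, right))
    return rows

text = "the war ended but the war never really left their memories"
for left, kw, right in kwic(text, "war", window=3):
    print(f"{left:>20} | {kw} | {right}")
```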

Regex Search (Advanced)

Enable regex mode in Advanced Search for pattern-based searching:

  • \b(war|battle|conflict)\b - Word boundaries for alternative words
  • \d{4} - Match 4-digit numbers (e.g., years)
  • colou?r - Optional characters (color or colour)
  • [A-Z][a-z]+ - Capitalized words
  • \b\w{10,}\b - Words with 10+ characters

Note: Check the "Use Regular Expression (Regex) mode" checkbox in Advanced Search to enable regex patterns.
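
The patterns listed above behave as they would in any standard regex engine; shown here with Python's `re` module against a made-up sample text:

```python
import re

text = "The Battle of Colour Hill in 1936 saw extraordinary confrontations."

alternatives = re.findall(r"\b([Ww]ar|[Bb]attle|[Cc]onflict)\b", text)  # word boundaries
years = re.findall(r"\d{4}", text)             # 4-digit numbers (e.g., years)
spellings = re.findall(r"[Cc]olou?r", text)    # optional characters: color or colour
long_words = re.findall(r"\b\w{10,}\b", text)  # words with 10+ characters
print(alternatives, years, spellings, long_words)
```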

Batch Operations (Bulk Editing)

Perform operations on multiple documents at once:

  • Bulk Delete - Remove multiple documents in one action
  • Metadata Update - Apply same metadata value to selected documents
  • Bulk Export - Export selected documents as TXT or CSV

How to use: Select documents using the checkboxes, then click the "Batch Operations" button to access bulk editing tools.

5. Frequency Analysis

Word Frequency

Analyze how often words appear in your corpus:

  • Set minimum frequency threshold to filter rare words
  • View frequency counts and percentages
  • Results are sortable by rank, word, or frequency
  • Export results to PDF, Word, Excel, or CSV
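
The ranked list with a minimum-frequency threshold can be sketched as follows; the crude tokenizer is an assumption for illustration only:

```python
from collections import Counter

def word_frequencies(text, min_freq=2):
    """Rank words by frequency, dropping words below `min_freq`."""
    tokens = [t.lower().strip(".,;:!?\"'") for t in text.split()]
    counts = Counter(t for t in tokens if t)
    total = sum(counts.values())
    return [
        (rank, word, n, round(100 * n / total, 2))  # rank, word, count, percent
        for rank, (word, n) in enumerate(counts.most_common(), start=1)
        if n >= min_freq
    ]

text = "la guerra y la paz la guerra continua"
for row in word_frequencies(text):
    print(row)
```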

N-gram Analysis

Analyze multi-word sequences:

  • Bigrams - 2-word sequences (e.g., "guerra civil")
  • Trigrams - 3-word sequences
  • 4-grams+ - Longer sequences
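
Extracting and counting such sequences is a sliding-window operation, sketched here over a pre-tokenized sample:

```python
from collections import Counter

def ngrams(tokens, n):
    """Count all contiguous n-word sequences."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = "la guerra civil y la guerra fría".split()
print(ngrams(tokens, 2).most_common(3))  # bigrams
print(ngrams(tokens, 3).most_common(3))  # trigrams
```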

Collocation Analysis

Find words that frequently appear together with a target word using multiple statistical measures:

  • MI score - Mutual Information, measures association strength
  • t-score - Frequency-sensitive, good for common words
  • Dice coefficient - Symmetrical association (0-1 scale)
  • Log-Likelihood - Statistical significance measure

Tip: Use multiple measures together for robust collocation analysis, and export the results using the export button.
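
The first two measures can be sketched from their textbook definitions; this simplified version omits the window-size correction that full implementations apply:

```python
import math

def mi_and_t(o, f_node, f_coll, n):
    """MI score and t-score for a node/collocate pair.

    o: observed co-occurrences, f_node/f_coll: individual word
    frequencies, n: corpus size in tokens.
    """
    expected = f_node * f_coll / n     # co-occurrences expected by chance
    mi = math.log2(o / expected)       # association strength
    t = (o - expected) / math.sqrt(o)  # frequency-sensitive significance
    return mi, t

mi, t = mi_and_t(o=8, f_node=40, f_coll=50, n=100_000)
print(round(mi, 2), round(t, 2))
```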

6. Statistical Analysis

These statistical analysis features require no AI and provide objective metrics for your corpus.

Readability Indices (Free)

Assess text complexity using multiple readability formulas:

  • Flesch Reading Ease - 0-100 scale, higher = easier
  • Flesch-Kincaid Grade Level - US grade level required
  • Gunning Fog Index - Years of education needed
  • SMOG Index - Simple Measure of Gobbledygook
  • Coleman-Liau Index - Character-based grade level
  • Automated Readability Index (ARI) - Character-based formula

How to use: Navigate to the Statistics tab and click "Calculate Readability".
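
As an illustration, the published Flesch Reading Ease formula is 206.835 - 1.015 x (words/sentences) - 84.6 x (syllables/words). A rough sketch using a naive vowel-group syllable heuristic (real implementations use proper syllabification):

```python
import re

def count_syllables(word):
    """Naive syllable estimate: count vowel groups (a rough heuristic)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """Flesch Reading Ease: 206.835 - 1.015*(words/sents) - 84.6*(sylls/words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

score = flesch_reading_ease("The cat sat on the mat. It was warm.")
print(round(score, 1))  # higher score = easier text
```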

Lexical Diversity (Researcher+)

Measure vocabulary richness using advanced metrics:

  • Type-Token Ratio (TTR) - Simple vocabulary diversity measure
  • Standardized TTR (STTR) - Less affected by text length
  • MTLD - Measure of Textual Lexical Diversity, most stable metric
  • HD-D - Hypergeometric Distribution D, probabilistic approach
  • Yule's K - Vocabulary concentration measure
  • Hapax Legomena - Words appearing only once

Interpretation: Higher TTR, STTR, MTLD, and HD-D values indicate more diverse vocabulary. Lower Yule's K suggests greater diversity.
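
Two of these metrics, TTR and hapax legomena, can be sketched directly from their definitions (a minimal illustration over a pre-tokenized sample):

```python
from collections import Counter

def lexical_diversity(tokens):
    """TTR and hapax legomena: a minimal sketch of two of the metrics."""
    counts = Counter(tokens)
    ttr = len(counts) / len(tokens)  # distinct types / total tokens
    hapax = sorted(w for w, n in counts.items() if n == 1)
    return ttr, hapax

tokens = "the war and the peace and one truce".split()
ttr, hapax = lexical_diversity(tokens)
print(round(ttr, 3), hapax)
```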

Keyness Analysis (Professional+)

Compare two corpora to identify statistically significant differences:

  • Chi-square test - Statistical significance testing
  • Log-likelihood - More reliable for sparse data
  • Effect size - Practical significance (log ratio)
  • Normalized frequencies - Per 10,000 words comparison

Use case: Identify which words are distinctively more common in one corpus compared to another (e.g., comparing historical periods, genres, or authors).
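
The log-likelihood measure can be sketched from the standard two-corpus contingency formula (a simplified version; the sign convention that distinguishes overuse from underuse is omitted):

```python
import math

def log_likelihood(a, b, c, d):
    """Two-corpus keyness log-likelihood.

    a, b: word frequency in corpus 1 / corpus 2; c, d: corpus sizes.
    """
    e1 = c * (a + b) / (c + d)  # expected frequency in corpus 1
    e2 = d * (a + b) / (c + d)  # expected frequency in corpus 2
    ll = 0.0
    if a:
        ll += a * math.log(a / e1)
    if b:
        ll += b * math.log(b / e2)
    return 2 * ll

# A word occurring 120 times in a 50k-token corpus vs. 30 times in
# another 50k-token corpus is strongly "key" in the first.
print(round(log_likelihood(120, 30, 50_000, 50_000), 2))
```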

7. AI-Powered Analysis (18 Features)

Note: AI features use GPT-4o-mini and GPT-5.1 models. Check your plan's AI query limits on the dashboard.

Document Analysis

Auto-Classification

Automatically categorize documents by type and theme

Discover Themes

Identify main themes across your corpus

Smart Summary

Generate concise or detailed summaries

Entity Extraction

Extract people, places, dates, and organizations

Search & Discovery

Natural Language Query

Ask questions in plain language

Semantic Similarity

Find documents with similar meanings

Keyword Extraction

Extract significant keywords automatically

Pattern Suggestions

Discover trends and patterns in your data

Stylistic Analysis

Writing Style

Compare writing styles between authors

Readability

Analyze text complexity and reading level

Register Detection

Identify formality levels in language

Sentiment Analysis

Detect emotions and tone across corpus

Linguistic Analysis

Discourse Markers

Identify discourse markers and functions

Semantic Fields

Map semantic relationships

Contextual Definition

Get historical/linguistic context for terms

Quote Extraction

Extract important quotations

Comparative Analysis

Compare Corpora

Compare two corpora side-by-side

Diachronic Change

Track language evolution over time

8. NLP Processing

How to Use NLP Processing

  1. Navigate to your corpus page
  2. Click the NLP tab in the analysis navigation
  3. Select a document from the dropdown menu
  4. Click "Analyze Document" to process with spaCy

Note: NLP processing analyzes individual documents, not the entire corpus. Language is auto-detected based on document metadata.

Supported Languages

NLP processing is available for 8 languages:

  • English - en_core_web_sm (advanced pipeline with lemmatizer, NER, parser)
  • Spanish - es_core_news_sm (optimized for Spanish text)
  • Russian - ru_core_news_sm (Cyrillic text support)
  • French - fr_core_news_sm (complete French pipeline)
  • German - de_core_news_sm (German morphology support)
  • Chinese - zh_core_web_sm (simplified Chinese with word segmentation)
  • Japanese - ja_core_news_sm (Japanese tokenization and analysis)
  • Arabic - pyarabic + NLTK (tokenization, stemming, stopwords)

NLP Features

  • Complete Token Analysis - Lemmatization, POS tagging, and dependency parsing for all tokens with stopword identification
  • Sentence Segmentation - Automatic sentence extraction with statistics (total count, average/min/max length in tokens)
  • Noun Chunks - Extract noun phrases with root words and dependency relations
  • Morphological Features - Detailed grammatical analysis including gender, case, number, tense, person, mood, and aspect
  • Lemma Frequencies - Ranked frequency distribution of base forms excluding stopwords
  • Stopword Filtering - Identify and filter function words to focus on content words
  • Named Entity Recognition - Identify people (PERSON), places (GPE, LOC), organizations (ORG), dates, and more
  • Dependency Parsing - Analyze grammatical structure and relationships between words
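
For instance, the lemma-frequency step reduces to counting base forms after stopword filtering. The sketch below assumes (lemma, is_stopword) pairs as a pipeline such as spaCy would produce; the sample data is made up:

```python
from collections import Counter

# Hypothetical output of an NLP pipeline: (lemma, is_stopword) pairs
analyzed = [
    ("guerra", False), ("ser", True), ("largo", False),
    ("guerra", False), ("y", True), ("paz", False),
]

# Rank lemmas by frequency, excluding stopwords
counts = Counter(lemma for lemma, is_stop in analyzed if not is_stop)
total = sum(counts.values())
for rank, (lemma, n) in enumerate(counts.most_common(), start=1):
    print(rank, lemma, n, f"{100 * n / total:.1f}%")
```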

Analysis Results

The NLP tab displays comprehensive results including:

  • Document Statistics - Total tokens, tokens without stopwords, language information
  • Sentence Statistics - Total sentences, average/min/max sentence length with full sentence list
  • Noun Chunks - All extracted noun phrases with root words and dependencies
  • Lemma Frequencies - Complete ranked frequency list excluding stopwords with counts and percentages
  • Morphological Features - All tokens with grammatical features (gender, case, tense, etc.)
  • POS Distribution - All part-of-speech tags with counts
  • Named Entities - All detected entities with their labels
  • Complete Token Analysis - All tokens showing text, lemma, POS tag, fine-grained tag, dependency relation, and stopword status
  • Professional Exports - Export complete analysis to PDF, Word, Excel, or CSV formats

9. Visualizations

Available Visualizations

  • Word Clouds - Visual representation of word frequencies
  • Frequency Distribution - Bar charts showing word frequencies
  • Lexical Dispersion - Track word usage across documents

Customization

Most visualizations allow you to adjust parameters like word count, colors, and filtering.

10. Export & Saving Results

Export Formats

All analysis results can be exported in multiple formats:

  • PDF - Professional reports with formatted tables
  • Word (DOCX) - Editable documents for further work
  • Excel (XLSX) - Spreadsheets for data analysis
  • CSV - Import into other tools

Auto-Save Feature

All analysis results are automatically saved for later export:

  • KWIC concordance results
  • Frequency analysis
  • All 18 AI analysis features
  • NLP processing results

Exports Page

Access all your saved outputs from the Exports page:

  • Browse all saved analysis results
  • Filter by corpus or analysis type
  • Export individual results in any format
  • Bulk export: select multiple outputs and combine into one document

Corpus Export

Export your entire corpus as JSONL for backup or sharing.
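
JSONL stores one JSON object per line, which makes corpus backups easy to stream and diff. A sketch of writing and reading such a file (the field names are illustrative, not CorpusCraft's exact export schema):

```python
import json

docs = [
    {"title": "doc1", "text": "la guerra civil", "metadata": {"year": 1936}},
    {"title": "doc2", "text": "un tratado de paz", "metadata": {"year": 1948}},
]

# Write: one JSON object per line
with open("corpus_export.jsonl", "w", encoding="utf-8") as f:
    for doc in docs:
        f.write(json.dumps(doc, ensure_ascii=False) + "\n")

# Read it back line by line
with open("corpus_export.jsonl", encoding="utf-8") as f:
    restored = [json.loads(line) for line in f]
print(len(restored))
```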

11. Collaboration

Adding Collaborators

Share your corpus with other researchers:

  1. Open the corpus you want to share
  2. Go to the "Collaborators" section
  3. Add collaborators by email
  4. Set permissions (view, edit, admin)

Annotation Layers

Create annotation layers for collaborative markup and analysis.

12. Version Control (Snapshots)

Creating Snapshots

Save versions of your corpus at different stages:

  1. Go to the "Export & Snapshots" section
  2. Click "Create Snapshot"
  3. Add a tag (e.g., "v1.0", "pre-cleanup")
  4. Review your snapshot history

Restoring Snapshots

Snapshots can be used to restore your corpus to a previous state if needed.

Tips & Best Practices

Research Workflow Tips

  1. Start with metadata - Define your schema before uploading documents
  2. Use snapshots - Create snapshots before major changes
  3. Filter searches - Use document filters to narrow your analysis
  4. Export often - Save your results regularly for your research notes
  5. Try AI features - AI analysis can reveal unexpected patterns
  6. Check token limits - Monitor your AI usage on the dashboard

Performance Tips

  • Use minimum frequency filters to reduce noise in frequency analysis
  • Filter by specific documents when analyzing large corpora
  • AI features automatically sample large texts to stay within token limits
  • Export to CSV for custom analysis in R, Python, or Excel

Need More Help?

Check out these additional resources: