Quantitative analysis of sequential token-family constraints across 16 natural-language comparators. April 2026.
| Transcriber | Tokens | C→Q | A→Q | P/S | Bucket |
|---|---|---|---|---|---|
| Currier | 16,453 | 2.48× | 0.42× | 0.85 | SYMM-HIGH |
| FSG (Friedman) | 28,811 | 2.40× | 0.45× | 0.84 | SYMM-HIGH |
| Takahashi | 30,426 | 2.41× | 0.47× | 0.86 | SYMM-HIGH |
| Grove | 7,657 | 2.01× | 0.21× | 0.99 | SYMM-HIGH |
| ZL (baseline) | 31,608 | 2.63× | 0.50× | 0.99 | SYMM-HIGH |
The Currier hand annotation in the Zandbergen–Landini metadata splits the corpus across five scribal hands. Pooling tokens across hands could produce a manuscript-wide signature even if individual hands behave very differently. The decomposition below tests each hand independently. Bidirectional self-clustering (SYMM-HIGH) holds independently in Hands 1, 2, and 3 (≈94% of the corpus). The CHEDY→QOK transition rule, in contrast, is concentrated in Hands 2 and 3 and is essentially absent in Hand 1.
| Hand | Tokens | Prefix SC | Suffix SC | P/S | Bucket | CHEDY→QOK | n (obs) | AIIN% |
|---|---|---|---|---|---|---|---|---|
| Hand 1 | 8,997 | 1.59x | 1.39x | 1.14 | SYMM-HIGH | 1.42x | 13 | 13.1% |
| Hand 2 | 9,154 | 1.15x | 1.24x | 0.93 | SYMM-HIGH | 2.15x | 374 | 8.7% |
| Hand 3 | 11,389 | 1.31x | 1.32x | 0.99 | SYMM-HIGH | 2.28x | 222 | 13.0% |
| Hand 4 | 683 | 0.47x | 1.26x | 0.37 | SUFFIX-DOM | 7.34x | 2 | 4.1% |
| Hand 5 | 890 | 1.13x | 2.01x | 0.56 | SUFFIX-DOM | 1.06x | 6 | 6.3% |
results/per_scribe_results.json.
Five A→B→C cascade chains were tested. For each chain we computed Wilson 95% intervals on the conditional probability of B→C suffix agreement under A≡B versus A≠B, then a conservative composite CI on the cascade Δ. All five survive Benjamini–Hochberg FDR at α=0.05.
| Chain | n agree | n disagree | Δ (pp) | Conservative 95% CI | BH-FDR |
|---|---|---|---|---|---|
| CHEDY→OTHER→CHEDY | 13 | 119 | +81.0 | [+48.3, +93.9] | pass |
| QOK→OTHER→QOK | 26 | 61 | +44.0 | [+13.1, +67.2] | pass |
| QOK→QOK→QOK | 27 | 16 | +33.0 | [−9.0, +62.7] | pass |
| CHEDY→QOK→CHEDY | 26 | 45 | +56.0 | [+23.8, +77.2] | pass |
| OT→OTHER→OT | 18 | 37 | +19.0 | [−11.7, +50.3] | pass |
results/cascade_uncertainty_results.json.
The previous "productive paradigms" finding rested on r = 0.42–0.71 between log-frequency and edit-distance-1 variant counts within each family's top 50. Ten replicates of a character-trigram null — synthetic corpora matching Voynich's bigram statistics but containing no morphology — produced comparable or higher correlations. Voynich's correlations do not exceed the null's 95th percentile in any of the three tested families. Chaucer's Middle English at the same measurement gives r = 0.203, lower than both Voynich and the null. The correlation is a Zipfian-edit-graph combinatorial property, not evidence of productive morphology.
| Family | Voynich r | Null mean | Null 95%ile | Real / null | Exceeds null? |
|---|---|---|---|---|---|
| QOK | 0.602 | 0.481 | 0.607 | 1.25× | no |
| CHEDY | 0.384 | 0.358 | 0.420 | 1.07× | no |
| AIIN | 0.397 | 0.429 | 0.512 | 0.93× | no |
| Chaucer (NL reference) | 0.203 | — | — | — | — |
results/paradigm_null_results.json.
A synthetic constructed corpus (~32,000 tokens, 4,000 lines, 4 sections) was generated with deliberately designed grammatical rules — line-bounded resets, A→B class attraction at 2.5×, 70% within-class suffix clustering, ≈15% filler density — and run through the project's own measurement pipeline. The result satisfies 5 of 7 testable MVE items by construction. Items 5 (bidirectional symmetry) and 7 (open vocabulary) were not achieved by this first attempt.
| MVE item | Outcome | Key value |
|---|---|---|
| 1. Line-bounded transition reset | satisfied | within 2.22× / cross 0.56× |
| 2. Specific class-level transition | satisfied | A→B 2.01× |
| 3. Suffix agreement | satisfied | B→B 2.24× |
| 4. Agreement cascade A→O→A | satisfied | +49.8 pp |
| 5. Bidirectional self-clustering | not satisfied | P/S 0.819 / 1.527 — SUFFIX-DOM |
| 6. Section-stable grammar with shifting lexicon | satisfied | Jaccard 0.165 |
| 7. Open vocabulary | not satisfied | 15.5% hapax (target >50%) |
results/constructed_control_results.json.
Corpus: Zandbergen-Landini EVA transliteration via AncientLanguages/Voynich (Hugging Face). 4,197 lines, 31,608 tokens, 184 pages.
Families: QOK (prefix qok-), OK (prefix ok- not qok-), OT (prefix ot-), CHEDY (contains chedy/shedy/chey/shey), AIIN (contains aiin/ain). All others = OTHER.
Comparison languages: 16 natural-language comparators (13 Leipzig Wikipedia 100K including Swahili, Georgian, Tagalog, and Mandarin; 2 Gutenberg literary; 1 Ottoman Turkish UD treebank) and 1 shuffled-token control.
Statistical tests: Permutation tests (5,000–10,000 iterations), bootstrap CIs, KS tests, Chi-squared, split-half reliability. All transition ratios = observed/expected under independence.
Known limitations: Family definitions are EVA-specific and may not correspond to paleographic character boundaries. Self-clustering values are method-sensitive (pooled vs page-level). Non-IE comparison texts are modern, not medieval. Ottoman Turkish tested with small UD corpus (16,890 words): SYMM-LOW, not a match. Larger corpus needed.
• "Only compatible class" framing retired. A first-pass synthetic constructed control satisfies 5 of 7 MVE items by design; only items 5 (bidirectional symmetry) and 7 (open vocabulary) currently discriminate constructed systems from encoded natural language. See results/constructed_control_results.json.
• "Productive paradigms" (Finding 1.8) retired. Voynich's log-freq vs edit-1 variant correlations (r = 0.42–0.71) do not exceed a character-trigram null containing no morphology. Removed from the MVE checklist (formerly item 5 of 8). See results/paradigm_null_results.json.
• CHEDY→QOK reframed as scribe-specific. Manuscript-wide 2.625× is concentrated in Hands 2 and 3 (2.15× and 2.28×). Hand 1 produces the rule only at 1.42× on n=13 observations. The earlier "holds across both scribal hands" framing was overstated. See results/per_scribe_results.json.
• Cascade flagship effect re-reported with uncertainty. CHEDY→OTHER→CHEDY +81 pp point estimate stands, but rests on n=13 agreement / n=119 disagreement trials with conservative 95% CI [+48, +94] pp. Five tested chains all survive Benjamini–Hochberg FDR at α=0.05; two have CIs crossing zero. See results/cascade_uncertainty_results.json.
• "Grammar" wording narrowed. The CHEDY→QOK class-level constraint is "collocational preference with class-level structure" rather than predictive grammar — family-bigram prediction gives ~0% lift over the baseline majority class.
• Comparator count corrected from 15 to 16. Mandarin added; Voynich is the target system, not a comparator. 18 systems total in the prefix/suffix table.
• Self-clustering: 0.929x (page-level) to 1.451x (pooled backbone). Method-sensitive; the bidirectional ratio (0.99) is stable across methods.
• CHEDY→QOK page agreement corrected from 92% to 78%. Earlier figure inflated by selection bias.
• AIIN→QOK page agreement corrected from 91% to 67%. Same selection bias.
• Biological self-clustering initially reported as 'not significant.' Rerun shows p < 0.001.
• Carry-through values shift ~15% between runs due to token parsing differences. Directions are stable; exact decimals are approximate.
Transition Grammar of the Voynich Manuscript: Sequential Constraints and Bidirectional Self-Clustering Symmetry
Amy Laird · Independent Researcher · April 2026
The paper reports all findings documented on this dashboard plus additional analyses of line-bounded grammar, suffix-agreement cascades, per-scribe decomposition, and glyph-layer architecture. It concludes with a minimum viable explanation checklist: seven measured properties that any proposed explanation of the Voynich text must account for.