Voynich Manuscript

Transition Grammar Analysis

Quantitative analysis of sequential token-family constraints across 16 natural-language comparators. April 2026.

Core Findings

Verified findings
Non-random sequential structure
Chi² = 1407.8, p ≈ 0. Significantly non-random.
AIIN at 15% across Currier A/B
KS p = 0.742. Invariant across Currier languages A and B.
CHEDY→QOK = 2.625x attraction
Split-half 95% range: [2.34x, 2.67x]
AIIN→QOK = 0.504x repulsion
Bidirectional. QOK is blocked.
Rules are class-distributed
77% of CHEDY tokens participate, 369 unique pairs — but predictively weak (family-bigram lift ~0%; concentrated in Hands 2 and 3)
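The chi-squared figure above tests whether the family of a token is independent of the family that precedes it. A minimal sketch of that statistic follows, assuming the input is a table of bigram counts with rows as the preceding family and columns as the following family; the function name and table layout are illustrative and not taken from the project code.

```python
def chi_square_independence(table):
    """Pearson chi-squared statistic and degrees of freedom for a count table.

    table[i][j] is assumed to hold the number of bigrams whose first token is
    in family i and whose second token is in family j (hypothetical layout).
    Every row and column total must be non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if first and second family were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof
```

Converting the statistic to a p-value needs the chi-squared survival function, which is not in the Python standard library, so that step is left out here.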

What the Data Rules Out

Hypotheses excluded by the current evidence
Random / gibberish generation
Shuffled-token control tests SYMM-LOW (P/S 0.92×/0.97×); χ² per cell 0.7 vs Voynich's 30.5
Simple letter-substitution ciphers
Would preserve source language's suffix dominance; cannot produce SYMM-HIGH from any tested NL source
Template-based generation
Whole-line family templates recur at 1.04× above shuffled control — lines are not template-based
Bidirectional-morphology natural languages alone
Swahili, Ottoman Turkish, Georgian, Tagalog — all four selected for prefix+suffix morphology — test SYMM-LOW
Medieval scribal abbreviation
Even at 90%/95% abbreviation rates, Latin stays SUFFIX-DOM at ratio 0.29 — abbreviation is not a path to SYMM-HIGH
What is NOT ruled out: Voynich is the only tested system with symmetric-high self-clustering (prefix 1.52×, suffix 1.54×, ratio 0.99). All natural-language comparators with positive SC are suffix-dominant. The data rules out random generation, simple substitution ciphers, and template-based mechanisms — but does not rule out sophisticated constructed or engineered systems. A first-pass synthetic constructed control satisfies 5 of 7 MVE items; only items 5 and 7 currently discriminate it from encoded natural language.
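The shuffled-token control cited in the list above can be sketched as follows. This is a minimal version that assumes tokens are permuted across the whole corpus while each line keeps its original length; whether the project shuffles globally or within pages is not stated here, and the function name is hypothetical.

```python
import random

def shuffle_tokens_keep_lines(lines, seed=0):
    """Return a copy of `lines` with all tokens permuted across the corpus.

    `lines` is a list of token lists. Line lengths and the overall token
    multiset are preserved; only the sequential order is destroyed.
    """
    rng = random.Random(seed)
    flat = [tok for line in lines for tok in line]
    rng.shuffle(flat)
    shuffled, pos = [], 0
    for line in lines:
        shuffled.append(flat[pos:pos + len(line)])
        pos += len(line)
    return shuffled
```

Running the same measurement pipeline on the shuffled copy gives the baseline that any sequential statistic should beat if it reflects real ordering.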

Family Density Map

Verified family distribution across the manuscript

Cross-Transcription Stability

Findings are transcription-agnostic — not EVA artifacts
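| Transcriber | Tokens | C→Q (CHEDY→QOK) | A→Q (AIIN→QOK) | P/S | Bucket |
|---|---|---|---|---|---|
| Currier | 16,453 | 2.48× | 0.42× | 0.85 | SYMM-HIGH |
| FSG (Friedman) | 28,811 | 2.40× | 0.45× | 0.84 | SYMM-HIGH |
| Takahashi | 30,426 | 2.41× | 0.47× | 0.86 | SYMM-HIGH |
| Grove | 7,657 | 2.01× | 0.21× | 0.99 | SYMM-HIGH |
| ZL (baseline) | 31,608 | 2.63× | 0.50× | 0.99 | SYMM-HIGH |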
Four independent transcriptions, made decades apart with different word-boundary judgments, produce the same structural signature as the ZL baseline. The findings are therefore unlikely to be a tokenization artifact.

Minimum Viable Explanation

7 measured properties any Voynich theory must account for
1. Line-bounded transition reset — CHEDY→QOK 2.54× within-line vs 0.85× across line breaks
2. Specific class-level transition — CHEDY→QOK 2.63× while CHEDY→OK 0.83×, CHEDY→OT 0.80× (concentrated in Hands 2 and 3)
3. Suffix agreement — adjacent-word ending-class match rate 1.18–1.75× above chance
4. Agreement cascades — five 3-token chains all survive Benjamini–Hochberg FDR at α=0.05
5. Bidirectional self-clustering — P/S = 0.99, unique among 18 tested systems; holds independently in Hands 1, 2, 3
6. Section-stable coarse grammar with shifting lexicon — within-family Jaccard overlap 0.09–0.25
7. Open vocabulary — 71.4% hapax, type/token ratio 0.23
A first-pass synthetic constructed control satisfies 5 of 7. Items 5 and 7 are the only ones that currently discriminate sophisticated constructed systems from encoded natural language. Productive paradigms (formerly item 5 of 8) was retired — see paradigm null model.
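Items 6 and 7 of the checklist rest on three simple vocabulary statistics. The sketch below shows one way to compute them, assuming that the hapax figure is the share of word types occurring exactly once (an assumption; it could also be defined over tokens) and that Jaccard overlap is taken between the type sets of two sections. Function names are illustrative.

```python
from collections import Counter

def hapax_share(tokens):
    """Share of word types that occur exactly once (assumed definition)."""
    counts = Counter(tokens)
    return sum(1 for c in counts.values() if c == 1) / len(counts)

def type_token_ratio(tokens):
    """Number of distinct types divided by the number of tokens."""
    return len(set(tokens)) / len(tokens)

def jaccard(types_a, types_b):
    """Jaccard overlap between two sets of word types."""
    a, b = set(types_a), set(types_b)
    return len(a & b) / len(a | b)
```

For a family-level comparison, `jaccard` would be called on the types of one family in two different manuscript sections.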

Full Transition Matrix

Obs/Expected ratios. Green = attraction, Red = repulsion.

Key Rules

Strength relative to chance baseline

Carry-Through

AIIN blocks QOK but passes other families
AIIN acts as a transparent connector for OK, OT, CHEDY — but actively blocks QOK in both directions.

AIIN at 15% Across Currier A/B

The cleanest finding in the entire analysis
Currier A (n=102 pages): 15.0%
Currier B (n=72 pages): 15.0%
KS test: p = 0.742
Bootstrap CI for difference: [−2.0%, +2.0%]
Every other family differs significantly between A and B (all p < 0.005). AIIN is the only invariant.
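The bootstrap interval for the A/B difference can be sketched as a percentile bootstrap on per-page AIIN shares. The sketch assumes each page contributes one proportion and that pages are resampled with replacement within each Currier language; the resampling unit and iteration count used by the project are not specified in this section, so treat both as placeholders.

```python
import random

def bootstrap_diff_ci(shares_a, shares_b, n_boot=5000, level=0.95, seed=0):
    """Percentile bootstrap CI for mean(shares_a) - mean(shares_b).

    shares_a and shares_b are lists of per-page proportions (hypothetical
    inputs). Pages are resampled with replacement within each group.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        resample_a = [rng.choice(shares_a) for _ in shares_a]
        resample_b = [rng.choice(shares_b) for _ in shares_b]
        diffs.append(sum(resample_a) / len(resample_a)
                     - sum(resample_b) / len(resample_b))
    diffs.sort()
    lo = diffs[int((1 - level) / 2 * n_boot)]
    hi = diffs[int((1 + level) / 2 * n_boot) - 1]
    return lo, hi
```

An interval that straddles zero, as reported above, is consistent with no difference in AIIN density between the two languages.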

Function Word Behavior

AIIN does NOT self-cluster — like real function words
Real function words (the, et, di) don't cluster with themselves. Neither does AIIN. Every other backbone family does.

Caveat: Section Variance

AIIN varies by section — invariance is specifically across language modes

Self-Clustering: Handle with Care

This metric is the most method-sensitive in the study
Pooled (all classes, includes OTHER): 1.384x
Pooled (backbone): 1.451x, CI [1.34, 1.52]
Page-level mean (most conservative): 0.929x
Why the gap matters: Pooled values let extreme small pages dominate. Page-level averaging is fairer but shows weaker effects. The cross-linguistic comparison uses pooled (apples-to-apples), but absolute values carry uncertainty.

Prefix / Suffix Self-Clustering

Voynich is the only system with symmetric-high clustering

The Grammar Test

Is CHEDY→QOK a few fixed phrases or a class-level rule?
77% of CHEDY tokens attract QOK (53/69 with ≥5 occurrences)
74% of QOK tokens are attracted by CHEDY (35/47 with ≥5 occurrences)
69% of AIIN tokens repel QOK (36/52 with ≥10 occurrences)
Verdict: 369 unique token pairs, top 5 cover only 13.3%. The rule is DISTRIBUTED — class-level collocational structure, not fixed phrases. (Predictively weak as syntax: family-bigram lift ~0% over baseline.)
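The "distributed versus fixed phrases" check in the verdict reduces to counting unique token pairs and asking how much of the total the most frequent ones cover. A minimal sketch, assuming the input is the list of adjacent (CHEDY token, QOK token) pairs already extracted from the corpus; the function name is hypothetical.

```python
from collections import Counter

def top_k_coverage(pairs, k=5):
    """Return (number of unique pairs, share of occurrences in the top k).

    `pairs` is a list of (first_token, second_token) tuples, one entry per
    observed adjacency.
    """
    counts = Counter(pairs)
    total = sum(counts.values())
    top = sum(c for _, c in counts.most_common(k))
    return len(counts), top / total
```

A low top-5 share over many unique pairs, as in the verdict above, indicates the attraction is spread across the class rather than carried by a few stock phrases.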

Per-Scribe Decomposition

CHEDY→QOK is a property of Hands 2 and 3 — not a uniform manuscript-wide rule

The Currier hand annotation in the Zandbergen–Landini metadata splits the corpus across five scribal hands. Pooling tokens across hands could produce a manuscript-wide signature even if individual hands behave very differently. The decomposition below tests each hand independently. Bidirectional self-clustering (SYMM-HIGH) holds independently in Hands 1, 2, and 3 (≈94% of the corpus). The CHEDY→QOK transition rule, in contrast, is concentrated in Hands 2 and 3 and is essentially absent in Hand 1.

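| Hand | Tokens | Prefix SC | Suffix SC | P/S | Bucket | CHEDY→QOK | n (obs) | AIIN% |
|---|---|---|---|---|---|---|---|---|
| Hand 1 | 8,997 | 1.59x | 1.39x | 1.14 | SYMM-HIGH | 1.42x | 13 | 13.1% |
| Hand 2 | 9,154 | 1.15x | 1.24x | 0.93 | SYMM-HIGH | 2.15x | 374 | 8.7% |
| Hand 3 | 11,389 | 1.31x | 1.32x | 0.99 | SYMM-HIGH | 2.28x | 222 | 13.0% |
| Hand 4 | 683 | 0.47x | 1.26x | 0.37 | SUFFIX-DOM | 7.34x | 2 | 4.1% |
| Hand 5 | 890 | 1.13x | 2.01x | 0.56 | SUFFIX-DOM | 1.06x | 6 | 6.3% |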
Verdict — partial support: three of three qualifying hands (≥3,000 tokens) are SYMM-HIGH; two of three show CHEDY→QOK above 1.5×. Hand 1 produces the rule only at 1.42× on n=13 observations. Hands 4 and 5 are too small (<1,000 tokens each) to classify reliably. Driven by results/per_scribe_results.json.

Cascade Uncertainty (Wilson CI + Benjamini–Hochberg FDR)

5/5 three-token chains survive FDR correction at α=0.05; flagship effect is real but thinly sampled

Five A→B→C cascade chains were tested. For each chain we computed Wilson 95% intervals on the conditional probability of B→C suffix agreement under A≡B versus A≠B, then a conservative composite CI on the cascade Δ. All five survive Benjamini–Hochberg FDR at α=0.05.

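| Chain | n agree | n disagree | Δ (pp) | Conservative 95% CI | BH-FDR |
|---|---|---|---|---|---|
| CHEDY→OTHER→CHEDY | 13 | 119 | +81.0 | [+48.3, +93.9] | pass |
| QOK→OTHER→QOK | 26 | 61 | +44.0 | [+13.1, +67.2] | pass |
| QOK→QOK→QOK | 27 | 16 | +33.0 | [−9.0, +62.7] | pass |
| CHEDY→QOK→CHEDY | 26 | 45 | +56.0 | [+23.8, +77.2] | pass |
| OT→OTHER→OT | 18 | 37 | +19.0 | [−11.7, +50.3] | pass |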
The flagship CHEDY→OTHER→CHEDY cascade rests on only 13 agreement trials and 119 disagreement trials. The point estimate (+81 pp) is real, but the conservative 95% CI of [+48, +94] pp reflects substantial sampling uncertainty. Two of the five chains have conservative CIs that cross zero. Driven by results/cascade_uncertainty_results.json.
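The two statistical tools named in this section, Wilson intervals and the Benjamini–Hochberg step-up procedure, can be sketched in a few lines of standard-library Python. The composite interval on Δ shown here (lower bound of one proportion minus upper bound of the other, and vice versa) is one conservative construction and is an assumption; the section does not say exactly how the project combines the two intervals.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))

def conservative_delta_ci(k_agree, n_agree, k_disagree, n_disagree):
    """Assumed composite CI for p_agree - p_disagree from two Wilson intervals."""
    lo_a, hi_a = wilson_interval(k_agree, n_agree)
    lo_d, hi_d = wilson_interval(k_disagree, n_disagree)
    return (lo_a - hi_d, hi_a - lo_d)

def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        # Step-up rule: keep the largest rank whose p-value is under its threshold.
        if p_values[i] <= alpha * rank / m:
            cutoff = rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff:
            rejected[i] = True
    return rejected
```

Because the composite interval stacks two worst cases, it can cross zero even when the underlying test passes FDR correction, which matches the pattern seen in two of the five chains.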

Paradigm Null Model — Finding 1.8 retired

Voynich's log-frequency vs edit-1 variant correlations do not exceed a character-trigram null

The previous "productive paradigms" finding rested on r = 0.42–0.71 between log-frequency and edit-distance-1 variant counts within each family's top 50. Ten replicates of a character-trigram null — synthetic corpora matching Voynich's bigram statistics but containing no morphology — produced comparable or higher correlations. Voynich's correlations do not exceed the null's 95th percentile in any of the three tested families. Chaucer's Middle English at the same measurement gives r = 0.203, lower than both Voynich and the null. The correlation is a Zipfian-edit-graph combinatorial property, not evidence of productive morphology.

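| Family | Voynich r | Null mean | Null 95%ile | Real / null | Exceeds null? |
|---|---|---|---|---|---|
| QOK | 0.602 | 0.481 | 0.607 | 1.25× | no |
| CHEDY | 0.384 | 0.358 | 0.420 | 1.07× | no |
| AIIN | 0.397 | 0.429 | 0.512 | 0.93× | no |
| Chaucer (NL reference) | 0.203 | | | | |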
Verdict for all three families: "correlation explained by Zipfian combinatorics (trigram null)". The "productive paradigms" item has been retired from the Minimum Viable Explanation checklist (formerly item 5 of 8). Driven by results/paradigm_null_results.json.
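The null-model comparison can be illustrated with a small sketch: build a character-level model from the real tokens, sample a synthetic corpus with no morphology, and compute the same log-frequency versus edit-1-neighbour correlation on both. Everything below is an assumption-laden illustration rather than the project's code: the model conditions each character on the two preceding ones, `^` and `$` are padding symbols assumed absent from the transliteration, and all function names are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict

def trigram_null(tokens, n_tokens, seed=0):
    """Sample synthetic words from a character-trigram model of `tokens`."""
    rng = random.Random(seed)
    model = defaultdict(list)            # two-char context -> observed next chars
    for w in tokens:
        padded = "^^" + w + "$"
        for i in range(len(padded) - 2):
            model[padded[i:i + 2]].append(padded[i + 2])
    out = []
    while len(out) < n_tokens:
        ctx, word = "^^", ""
        while True:
            ch = rng.choice(model[ctx])
            if ch == "$":                # end-of-word symbol sampled
                break
            word += ch
            ctx = ctx[1] + ch
        if word:
            out.append(word)
    return out

def edit1_neighbor_count(word, vocab, alphabet):
    """Count vocabulary items at edit distance exactly 1 from `word`."""
    candidates = set()
    for i in range(len(word) + 1):
        for ch in alphabet:
            candidates.add(word[:i] + ch + word[i:])           # insertion
        if i < len(word):
            candidates.add(word[:i] + word[i + 1:])            # deletion
            for ch in alphabet:
                candidates.add(word[:i] + ch + word[i + 1:])   # substitution
    candidates.discard(word)
    return sum(1 for c in candidates if c in vocab)

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0

def freq_neighbor_r(tokens, top_n=50):
    """Correlate log-frequency with edit-1 neighbour count over the top types."""
    counts = Counter(tokens)
    vocab = set(counts)
    alphabet = {ch for w in vocab for ch in w}
    top = counts.most_common(top_n)
    xs = [math.log(c) for _, c in top]
    ys = [edit1_neighbor_count(w, vocab, alphabet) for w, _ in top]
    return pearson(xs, ys)
```

In use, `freq_neighbor_r` would be called once on the tokens of a real family and once on each `trigram_null` replicate of the same size, and the real value compared against the replicate distribution.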

Constructed-System Control — 5 of 7 MVE items satisfied

A first-pass synthetic generator satisfies most of the checklist by construction; only items 5 and 7 currently discriminate

A synthetic constructed corpus (~32,000 tokens, 4,000 lines, 4 sections) was generated with deliberately designed grammatical rules — line-bounded resets, A→B class attraction at 2.5×, 70% within-class suffix clustering, ≈15% filler density — and run through the project's own measurement pipeline. The result satisfies 5 of 7 testable MVE items by construction. Items 5 (bidirectional symmetry) and 7 (open vocabulary) were not achieved by this first attempt.

| MVE item | Outcome | Key value |
|---|---|---|
| 1. Line-bounded transition reset | satisfied | within 2.22× / cross 0.56× |
| 2. Specific class-level transition | satisfied | A→B 2.01× |
| 3. Suffix agreement | satisfied | B→B 2.24× |
| 4. Agreement cascade A→O→A | satisfied | +49.8 pp |
| 5. Bidirectional self-clustering | not satisfied | prefix 0.819 / suffix 1.527 (SUFFIX-DOM) |
| 6. Section-stable grammar with shifting lexicon | satisfied | Jaccard 0.165 |
| 7. Open vocabulary | not satisfied | 15.5% hapax (target >50%) |
Items 1–4 and 6 are satisfied directly by design — they do not discriminate constructed systems from natural language. Items 5 and 7 are the only items that currently distinguish encoded natural language from a sophisticated constructed control. The earlier "encoded NL is the only compatible class" framing has been retired. Driven by results/constructed_control_results.json.

Methodology

Data, tools, and reproducibility

Corpus: Zandbergen-Landini EVA transliteration via AncientLanguages/Voynich (Hugging Face). 4,197 lines, 31,608 tokens, 184 pages.

Families: QOK (prefix qok-), OK (prefix ok- not qok-), OT (prefix ot-), CHEDY (contains chedy/shedy/chey/shey), AIIN (contains aiin/ain). All others = OTHER.
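The family definitions above translate into a short classifier. The sketch below applies the rules in the order they are listed; that precedence is an assumption, since a token such as one starting with qok- and containing aiin matches more than one rule and the text does not say which wins. The function name is hypothetical.

```python
def classify_family(token):
    """Map an EVA token to a family label (rule precedence is an assumption)."""
    t = token.lower()
    if t.startswith("qok"):
        return "QOK"
    if t.startswith("ok"):       # cannot also start with "qok"
        return "OK"
    if t.startswith("ot"):
        return "OT"
    if any(s in t for s in ("chedy", "shedy", "chey", "shey")):
        return "CHEDY"
    if "aiin" in t or "ain" in t:
        return "AIIN"
    return "OTHER"
```

Changing the order of the checks would move ambiguous tokens between families, so any replication should fix the precedence before comparing counts.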

Comparison languages: 16 natural-language comparators (13 Leipzig Wikipedia 100K including Swahili, Georgian, Tagalog, and Mandarin; 2 Gutenberg literary; 1 Ottoman Turkish UD treebank) and 1 shuffled-token control.

Statistical tests: Permutation tests (5,000–10,000 iterations), bootstrap CIs, KS tests, Chi-squared, split-half reliability. All transition ratios = observed/expected under independence.
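The observed/expected definition of a transition ratio can be made concrete with a short sketch. It assumes bigrams are counted within lines only and that the expected count comes from the marginal frequencies of the first and second bigram positions; the project may compute marginals differently, so this is an illustration of the definition rather than a reproduction of the pipeline.

```python
from collections import Counter

def transition_ratios(lines):
    """Return {(a, b): observed / expected} for adjacent family labels.

    `lines` is a list of lists of family labels, one inner list per
    manuscript line. Bigrams never span a line break.
    """
    pair_counts = Counter()
    first_counts = Counter()
    second_counts = Counter()
    for line in lines:
        for a, b in zip(line, line[1:]):
            pair_counts[(a, b)] += 1
            first_counts[a] += 1
            second_counts[b] += 1
    total = sum(pair_counts.values())
    ratios = {}
    for (a, b), observed in pair_counts.items():
        # Expected count if the two positions were independent.
        expected = first_counts[a] * second_counts[b] / total
        ratios[(a, b)] = observed / expected
    return ratios
```

With real data, `ratios[("CHEDY", "QOK")]` would be the attraction figure and `ratios[("AIIN", "QOK")]` the repulsion figure; pairs that never occur are simply absent from the result.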

Known limitations: Family definitions are EVA-specific and may not correspond to paleographic character boundaries. Self-clustering values are method-sensitive (pooled vs page-level). Non-IE comparison texts are modern, not medieval. Ottoman Turkish tested with small UD corpus (16,890 words): SYMM-LOW, not a match. Larger corpus needed.

Corrections Log

• "Only compatible class" framing retired. A first-pass synthetic constructed control satisfies 5 of 7 MVE items by design; only items 5 (bidirectional symmetry) and 7 (open vocabulary) currently discriminate constructed systems from encoded natural language. See results/constructed_control_results.json.

• "Productive paradigms" (Finding 1.8) retired. Voynich's log-freq vs edit-1 variant correlations (r = 0.42–0.71) do not exceed a character-trigram null containing no morphology. Removed from the MVE checklist (formerly item 5 of 8). See results/paradigm_null_results.json.

• CHEDY→QOK reframed as scribe-specific. Manuscript-wide 2.625× is concentrated in Hands 2 and 3 (2.15× and 2.28×). Hand 1 produces the rule only at 1.42× on n=13 observations. The earlier "holds across both scribal hands" framing was overstated. See results/per_scribe_results.json.

• Cascade flagship effect re-reported with uncertainty. CHEDY→OTHER→CHEDY +81 pp point estimate stands, but rests on n=13 agreement / n=119 disagreement trials with conservative 95% CI [+48, +94] pp. Five tested chains all survive Benjamini–Hochberg FDR at α=0.05; two have CIs crossing zero. See results/cascade_uncertainty_results.json.

• "Grammar" wording narrowed. The CHEDY→QOK class-level constraint is "collocational preference with class-level structure" rather than predictive grammar — family-bigram prediction gives ~0% lift over the baseline majority class.

• Comparator count corrected from 15 to 16. Mandarin added; Voynich is the target system, not a comparator. 18 systems total in the prefix/suffix table.

• Self-clustering now reported as a range: 0.929x (page-level mean) to 1.451x (pooled backbone). The value is method-sensitive; the bidirectional ratio (0.99) is stable across methods.

• CHEDY→QOK page agreement corrected from 92% to 78%. Earlier figure inflated by selection bias.

• AIIN→QOK page agreement corrected from 91% to 67%. Same selection bias.

• Biological self-clustering initially reported as 'not significant.' Rerun shows p < 0.001.

• Carry-through values shift ~15% between runs due to token parsing differences. Directions are stable; exact decimals are approximate.

Research Paper

Full write-up with methods, results, and discussion

Transition Grammar of the Voynich Manuscript: Sequential Constraints and Bidirectional Self-Clustering Symmetry
Amy Laird · Independent Researcher · April 2026

The paper reports all findings documented on this dashboard plus additional analyses of line-bounded grammar, suffix-agreement cascades, per-scribe decomposition, and glyph-layer architecture. It concludes with a minimum viable explanation checklist: seven measured properties that any proposed explanation of the Voynich text must account for.

Citation: Laird, A. (2026). Transition Grammar of the Voynich Manuscript: Sequential Constraints and Bidirectional Self-Clustering Symmetry. Available at github.com/amy2213/Voynich-Transition-Grammar