Voynich Manuscript — Transition Grammar Analysis

Core Findings

Verified findings

Non-random sequential structure

χ² = 1407.8, p ≈ 0. Significantly non-random.

AIIN at 15% across Currier A/B

KS p = 0.742: no detectable difference in AIIN density between Currier A and B.

CHEDY→QOK = 2.625x attraction

Split-half 95% range: [2.34x, 2.67x]

AIIN→QOK = 0.504x repulsion

Bidirectional: QOK is disfavored on both sides of AIIN (reduced, not eliminated).

Rules are class-distributed

77% of CHEDY word types participate across 369 unique pairs, but the rule is predictively weak (family-bigram lift ~0%) and concentrated in Hands 2 and 3

What the Data Rules Out

Hypotheses excluded by the current evidence

✗

Random / gibberish generation

Shuffled-token control tests SYMM-LOW (P/S 0.92×/0.97×); χ² per cell 0.7 vs Voynich's 30.5

✗

Simple letter-substitution ciphers

Would preserve source language's suffix dominance; cannot produce SYMM-HIGH from any tested NL source

✗

Template-based generation

Whole-line family templates recur at 1.04× above shuffled control — lines are not template-based

✗

Bidirectional-morphology natural languages alone

Swahili, Ottoman Turkish, Georgian, Tagalog — all four selected for prefix+suffix morphology — test SYMM-LOW

✗

Medieval scribal abbreviation

Even at 90%/95% abbreviation rates, Latin stays SUFFIX-DOM at ratio 0.29 — abbreviation is not a path to SYMM-HIGH

What is NOT ruled out: Voynich is the only tested system with symmetric-high self-clustering (prefix 1.52×, suffix 1.54×, ratio 0.99). All natural-language comparators with positive SC are suffix-dominant. The data rules out random generation, simple substitution ciphers, and template-based mechanisms — but does not rule out sophisticated constructed or engineered systems. A first-pass synthetic constructed control satisfies 5 of 7 MVE items; only items 5 and 7 currently discriminate it from encoded natural language.

Family Density Map

Verified family distribution across the manuscript

Cross-Transcription Stability

Findings are transcription-agnostic — not EVA artifacts

Transcriber	Tokens	C→Q	A→Q	P/S	Bucket
Currier	16,453	2.48×	0.42×	0.85	SYMM-HIGH
FSG (Friedman)	28,811	2.40×	0.45×	0.84	SYMM-HIGH
Takahashi	30,426	2.41×	0.47×	0.86	SYMM-HIGH
Grove	7,657	2.01×	0.21×	0.99	SYMM-HIGH
ZL (baseline)	31,608	2.63×	0.50×	0.99	SYMM-HIGH

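Column key: C→Q = CHEDY→QOK, A→Q = AIIN→QOK, P/S = prefix/suffix self-clustering ratio. The four independent transcriptions (Currier, FSG, Takahashi, Grove), made decades apart with different word-boundary judgments, all reproduce the structural signature of the ZL baseline: C→Q attraction, A→Q repulsion, and a SYMM-HIGH bucket. A tokenization artifact is therefore an unlikely explanation for these findings.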

Minimum Viable Explanation

7 measured properties any Voynich theory must account for

1. Line-bounded transition reset — CHEDY→QOK 2.54× within-line vs 0.85× across line breaks

2. Specific class-level transition — CHEDY→QOK 2.63× while CHEDY→OK 0.83×, CHEDY→OT 0.80× (concentrated in Hands 2 and 3)

3. Suffix agreement — adjacent-word ending-class match rate 1.18–1.75× above chance

4. Agreement cascades — five 3-token chains all survive Benjamini–Hochberg FDR at α=0.05

5. Bidirectional self-clustering — P/S = 0.99, unique among 18 tested systems; holds independently in Hands 1, 2, 3

6. Section-stable coarse grammar with shifting lexicon — within-family Jaccard overlap 0.09–0.25

7. Open vocabulary — 71.4% hapax, type/token ratio 0.23

A first-pass synthetic constructed control satisfies 5 of 7. Items 5 and 7 are the only ones that currently discriminate sophisticated constructed systems from encoded natural language. Productive paradigms (formerly item 5 of 8) was retired — see paradigm null model.
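Items 6 and 7 rest on simple vocabulary statistics. The sketch below shows one way they could be computed. It assumes the hapax figure is a share of word types rather than of tokens (a 71.4% token share would be incompatible with a type/token ratio of 0.23), and the function names are illustrative, not taken from the project's pipeline.

```python
from collections import Counter

def vocabulary_stats(tokens):
    """Return (hapax share of types, type/token ratio) for a token list."""
    freq = Counter(tokens)
    # A hapax is a word type that occurs exactly once in the corpus.
    hapax = sum(1 for count in freq.values() if count == 1)
    return hapax / len(freq), len(freq) / len(tokens)

def jaccard(vocab_a, vocab_b):
    """Jaccard overlap between two vocabularies (any iterables of tokens)."""
    a, b = set(vocab_a), set(vocab_b)
    return len(a & b) / len(a | b)

# Toy usage with a handful of EVA-style tokens (not real corpus counts).
hapax_share, ttr = vocabulary_stats(["daiin", "daiin", "chedy", "qokeedy"])
overlap = jaccard(["daiin", "chedy"], ["chedy", "qokeedy"])
```

For item 6, `jaccard` would be applied to the vocabularies of one family in two different manuscript sections.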

Full Transition Matrix

Obs/Expected ratios. Green = attraction, Red = repulsion.

Key Rules

Strength relative to chance baseline

Carry-Through

AIIN blocks QOK but passes other families

AIIN acts as a transparent connector for OK, OT, CHEDY — but actively blocks QOK in both directions.

AIIN at 15% Across Currier A/B

The cleanest finding in the entire analysis

15.0%

Currier A (n=102 pages)

15.0%

Currier B (n=72 pages)

KS test: p = 0.742

Bootstrap CI for difference: [−2.0%, +2.0%]

Every other family differs significantly between A and B (all p < 0.005). AIIN is the only invariant.
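A bootstrap interval of the kind reported above can be approximated with a percentile bootstrap over per-page AIIN densities. This is a minimal sketch, assuming each page contributes one density value (AIIN tokens divided by all tokens) and that pages are resampled with replacement within each Currier language. The function name and the toy values are hypothetical, not the project's code or data.

```python
import random

def bootstrap_diff_ci(a, b, n_boot=5000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(a) - mean(b)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample each group independently, keeping its original size.
        resampled_a = [rng.choice(a) for _ in a]
        resampled_b = [rng.choice(b) for _ in b]
        diffs.append(sum(resampled_a) / len(a) - sum(resampled_b) / len(b))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Toy per-page densities (invented): both groups share the same distribution.
pages_a = [0.10, 0.15, 0.20] * 10
pages_b = [0.10, 0.15, 0.20] * 10
lower, upper = bootstrap_diff_ci(pages_a, pages_b)
```

An interval that straddles zero, as in the toy case, is what the A/B comparison for AIIN reports; it indicates no detectable difference rather than proof of equality.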

Function Word Behavior

AIIN does NOT self-cluster — like real function words

Real function words (the, et, di) don't cluster with themselves. Neither does AIIN. Every other backbone family does.

Caveat: Section Variance

AIIN varies by section — invariance is specifically across language modes

Self-Clustering: Handle with Care

This metric is the most method-sensitive in the study

1.384x

Pooled (all classes)

Includes OTHER

1.451x

Pooled (backbone)

CI: [1.34, 1.52]

0.929x

Page-level mean

Most conservative

Why the gap matters: Pooled values let extreme small pages dominate. Page-level averaging is fairer but shows weaker effects. The cross-linguistic comparison uses pooled (apples-to-apples), but absolute values carry uncertainty.

Prefix / Suffix Self-Clustering

Voynich is the only system with symmetric-high clustering

The Grammar Test

Is CHEDY→QOK a few fixed phrases or a class-level rule?

77%

of CHEDY word types attract QOK

(53/69 with ≥5 occurrences)

74%

of QOK word types are attracted by CHEDY

(35/47 with ≥5 occurrences)

69%

of AIIN word types repel QOK

(36/52 with ≥10 occurrences)

Verdict: 369 unique token pairs, top 5 cover only 13.3%. The rule is DISTRIBUTED — class-level collocational structure, not fixed phrases. (Predictively weak as syntax: family-bigram lift ~0% over baseline.)
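The "top 5 cover only 13.3%" figure is a concentration measure: the share of all CHEDY→QOK pair occurrences accounted for by the most frequent token pairs. A minimal sketch of that calculation follows; the function name and the toy counts are assumptions for illustration.

```python
from collections import Counter

def top_k_share(pair_counts, k=5):
    """Fraction of all pair occurrences covered by the k most frequent pairs."""
    counts = Counter(pair_counts)
    total = sum(counts.values())
    top = sum(count for _, count in counts.most_common(k))
    return top / total

# Toy counts (invented): three distinct CHEDY -> QOK token pairs.
toy_pairs = {
    ("chedy", "qokeedy"): 5,
    ("shedy", "qokain"): 3,
    ("chey", "qokal"): 2,
}
share_top1 = top_k_share(toy_pairs, k=1)
```

A low top-k share across many unique pairs is what separates a distributed class-level pattern from a handful of fixed phrases.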

Per-Scribe Decomposition

CHEDY→QOK is a property of Hands 2 and 3 — not a uniform manuscript-wide rule

The Currier hand annotation in the Zandbergen–Landini metadata splits the corpus across five scribal hands. Pooling tokens across hands could produce a manuscript-wide signature even if individual hands behave very differently. The decomposition below tests each hand independently. Bidirectional self-clustering (SYMM-HIGH) holds independently in Hands 1, 2, and 3 (≈94% of the corpus). The CHEDY→QOK transition rule, in contrast, is concentrated in Hands 2 and 3 and is essentially absent in Hand 1.

Hand	Tokens	Prefix SC	Suffix SC	P/S	Bucket	CHEDY→QOK	n (obs)	AIIN%
Hand 1	8,997	1.59x	1.39x	1.14	SYMM-HIGH	1.42x	13	13.1%
Hand 2	9,154	1.15x	1.24x	0.93	SYMM-HIGH	2.15x	374	8.7%
Hand 3	11,389	1.31x	1.32x	0.99	SYMM-HIGH	2.28x	222	13.0%
Hand 4	683	0.47x	1.26x	0.37	SUFFIX-DOM	7.34x	2	4.1%
Hand 5	890	1.13x	2.01x	0.56	SUFFIX-DOM	1.06x	6	6.3%

Verdict — partial support: three of three qualifying hands (≥3,000 tokens) are SYMM-HIGH; two of three show CHEDY→QOK above 1.5×. Hand 1 produces the rule only at 1.42× on n=13 observations. Hands 4 and 5 are too small (<1,000 tokens each) to classify reliably. Driven by results/per_scribe_results.json.

Cascade Uncertainty (Wilson CI + Benjamini–Hochberg FDR)

5/5 three-token chains survive FDR correction at α=0.05; flagship effect is real but thinly sampled

Five A→B→C cascade chains were tested. For each chain we computed Wilson 95% intervals on the conditional probability of B→C suffix agreement under A≡B versus A≠B, then a conservative composite CI on the cascade Δ. All five survive Benjamini–Hochberg FDR at α=0.05.

Chain	n agree	n disagree	Δ (pp)	Conservative 95% CI	BH-FDR
CHEDY→OTHER→CHEDY	13	119	+81.0	[+48.3, +93.9]	pass
QOK→OTHER→QOK	26	61	+44.0	[+13.1, +67.2]	pass
QOK→QOK→QOK	27	16	+33.0	[−9.0, +62.7]	pass
CHEDY→QOK→CHEDY	26	45	+56.0	[+23.8, +77.2]	pass
OT→OTHER→OT	18	37	+19.0	[−11.7, +50.3]	pass

The flagship CHEDY→OTHER→CHEDY cascade rests on only 13 agreement trials and 119 disagreement trials. The point estimate (+81 pp) is real, but the conservative 95% CI of [+48, +94] pp reflects substantial sampling uncertainty. Two of the five chains have conservative CIs that cross zero. Driven by results/cascade_uncertainty_results.json.
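For reference, the two standard procedures named in this section can be written in a few lines. The sketch below implements the textbook Wilson score interval and the Benjamini–Hochberg step-up rule. It does not reproduce the project's "conservative composite CI" on Δ, whose construction is not described here, and the example inputs are hypothetical.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where the hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0  # largest rank k whose sorted p-value is <= (k / m) * alpha
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            cutoff = rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff:
            rejected[i] = True
    return rejected

# Hypothetical inputs: 5 successes in 10 trials, and four made-up p-values.
low, high = wilson_interval(5, 10)
flags = benjamini_hochberg([0.03, 0.001, 0.036, 0.2])
```

The Wilson interval is wide at small n, which is why a cell with only 13 trials carries so much uncertainty. The BH rule is a step-up procedure, so a p-value can be rejected even when it misses its own rank threshold, as long as a larger p-value meets a later one.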

Paradigm Null Model — Finding 1.8 retired

Voynich's log-frequency vs edit-1 variant correlations do not exceed a character-trigram null

The previous "productive paradigms" finding rested on r = 0.42–0.71 between log-frequency and edit-distance-1 variant counts within each family's top 50. Ten replicates of a character-trigram null — synthetic corpora matching Voynich's bigram statistics but containing no morphology — produced comparable or higher correlations. Voynich's correlations do not exceed the null's 95th percentile in any of the three tested families. Chaucer's Middle English at the same measurement gives r = 0.203, lower than both Voynich and the null. The correlation is a Zipfian-edit-graph combinatorial property, not evidence of productive morphology.

Family	Voynich r	Null mean	Null 95%ile	Real / null	Exceeds null?
QOK	0.602	0.481	0.607	1.25×	no
CHEDY	0.384	0.358	0.420	1.07×	no
AIIN	0.397	0.429	0.512	0.93×	no
Chaucer (NL reference)	0.203	—	—	—	—

Verdict for all three families: "correlation explained by Zipfian combinatorics (trigram null)". The "productive paradigms" item has been retired from the Minimum Viable Explanation checklist (formerly item 5 of 8). Driven by results/paradigm_null_results.json.
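The measured quantity is a Pearson correlation between each word type's log frequency and its number of edit-distance-1 variants. The sketch below is one possible reading of that measurement. It assumes variants are counted among all word types in the family while the correlation is taken over the top 50 by frequency; the helper names are illustrative and the project's implementation may differ.

```python
import math
from collections import Counter

def is_edit1(a, b):
    """True if a and b differ by exactly one insertion, deletion, or substitution."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) string
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # one substitution at position i
    return a[i:] == b[i + 1:]  # one extra character in b at position i

def pearson(xs, ys):
    """Plain Pearson correlation; assumes both inputs have non-zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def logfreq_variant_correlation(tokens, top_n=50):
    """Pearson r between log frequency and edit-1 variant count for the top_n types."""
    freq = Counter(tokens)
    types = list(freq)
    top = [t for t, _ in freq.most_common(top_n)]
    log_freq = [math.log(freq[t]) for t in top]
    variants = [sum(is_edit1(t, u) for u in types) for t in top]
    return pearson(log_freq, variants)

# Toy family (invented counts) in which frequent types have more variants.
toy = ["chedy"] * 5 + ["shedy"] * 3 + ["chey"] * 2 + ["qokal"]
r = logfreq_variant_correlation(toy)
```

Running the same function on corpora sampled from a character-level null model gives the baseline that the Voynich value has to exceed before it can count as evidence of morphology.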

Constructed-System Control — 5 of 7 MVE items satisfied

A first-pass synthetic generator satisfies most of the checklist by construction; only items 5 and 7 currently discriminate

A synthetic constructed corpus (~32,000 tokens, 4,000 lines, 4 sections) was generated with deliberately designed grammatical rules — line-bounded resets, A→B class attraction at 2.5×, 70% within-class suffix clustering, ≈15% filler density — and run through the project's own measurement pipeline. The result satisfies 5 of 7 testable MVE items by construction. Items 5 (bidirectional symmetry) and 7 (open vocabulary) were not achieved by this first attempt.

MVE item	Outcome	Key value
1. Line-bounded transition reset	satisfied	within 2.22× / cross 0.56×
2. Specific class-level transition	satisfied	A→B 2.01×
3. Suffix agreement	satisfied	B→B 2.24×
4. Agreement cascade A→O→A	satisfied	+49.8 pp
5. Bidirectional self-clustering	not satisfied	prefix 0.819× / suffix 1.527× (P/S 0.54), SUFFIX-DOM
6. Section-stable grammar with shifting lexicon	satisfied	Jaccard 0.165
7. Open vocabulary	not satisfied	15.5% hapax (target >50%)

Items 1–4 and 6 are satisfied directly by design — they do not discriminate constructed systems from natural language. Items 5 and 7 are the only items that currently distinguish encoded natural language from a sophisticated constructed control. The earlier "encoded NL is the only compatible class" framing has been retired. Driven by results/constructed_control_results.json.
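To show why items 1 and 2 are easy to satisfy by design, here is a toy generator with a line-bounded A→B attraction. It is a sketch under stated assumptions, not the project's control: the class names, base weights, and fixed line length are invented, and it covers only the line reset, the class attraction, and a roughly 15% filler weight, not sections or suffix clustering.

```python
import random

def generate_corpus(n_lines=4000, line_len=8, boost=2.5, seed=0):
    """Generate lines of class labels with a line-bounded A->B attraction."""
    rng = random.Random(seed)
    classes = ["A", "B", "FILLER", "OTHER"]
    # Assumed base weights; FILLER at 0.15 mimics the ~15% filler density.
    base = {"A": 0.20, "B": 0.20, "FILLER": 0.15, "OTHER": 0.45}
    lines = []
    for _ in range(n_lines):
        line, prev = [], None  # prev is reset at the start of every line
        for _ in range(line_len):
            weights = dict(base)
            if prev == "A":
                # Boost the raw weight of B; after normalisation the realised
                # probability gain is smaller than the boost factor.
                weights["B"] *= boost
            prev = rng.choices(classes, weights=[weights[c] for c in classes])[0]
            line.append(prev)
        lines.append(line)
    return lines

corpus = generate_corpus(n_lines=2000, seed=1)
```

Because the previous class is forgotten at each line start, any measured A→B attraction in such a corpus is confined to within-line pairs, which is exactly the pattern item 1 tests for.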

Methodology

Data, tools, and reproducibility

Corpus: Zandbergen-Landini EVA transliteration via AncientLanguages/Voynich (Hugging Face). 4,197 lines, 31,608 tokens, 184 pages.

Families: QOK (prefix qok-), OK (prefix ok-, excluding qok-), OT (prefix ot-), CHEDY (contains chedy/shedy/chey/shey), AIIN (contains aiin/ain). All others = OTHER.
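The family definitions translate directly into a small classifier. The sketch below is one reading of them. The definitions do not say which family wins when a token matches more than one rule, so the precedence used here (prefix rules first, then CHEDY, then AIIN) is an assumption, as is the function name.

```python
def classify_family(token):
    """Map one EVA token to a family label using the definitions above."""
    t = token.lower()
    # Assumption: prefix families are checked before substring families,
    # so a token such as "okaiin" is counted as OK rather than AIIN.
    if t.startswith("qok"):
        return "QOK"
    if t.startswith("ok"):
        return "OK"
    if t.startswith("ot"):
        return "OT"
    if any(stem in t for stem in ("chedy", "shedy", "chey", "shey")):
        return "CHEDY"
    if "aiin" in t or "ain" in t:
        return "AIIN"
    return "OTHER"

labels = [classify_family(t) for t in ["qokeedy", "okaiin", "otedy", "chedy", "daiin", "chol"]]
```

A different precedence would move some tokens between families, so any replication should fix the order explicitly before comparing ratios.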

Comparison languages: 16 natural-language comparators (13 Leipzig Wikipedia 100K including Swahili, Georgian, Tagalog, and Mandarin; 2 Gutenberg literary; 1 Ottoman Turkish UD treebank) and 1 shuffled-token control.

Statistical tests: Permutation tests (5,000–10,000 iterations), bootstrap CIs, KS tests, Chi-squared, split-half reliability. All transition ratios = observed/expected under independence.
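As an illustration of "observed/expected under independence", the sketch below computes transition ratios from lines of family labels. It uses the usual contingency-table expectation (row total × column total / grand total) and counts only pairs within a line, in keeping with the line-bounded finding. Both choices are assumptions about the pipeline, and the toy input is invented.

```python
from collections import Counter

def transition_ratios(lines):
    """Return {(a, b): observed / expected} for adjacent labels within each line."""
    pair_counts = Counter()
    first_counts = Counter()
    second_counts = Counter()
    for labels in lines:
        # zip over a line and its own tail, so pairs never span a line break.
        for a, b in zip(labels, labels[1:]):
            pair_counts[(a, b)] += 1
            first_counts[a] += 1
            second_counts[b] += 1
    total = sum(pair_counts.values())
    ratios = {}
    for a in first_counts:
        for b in second_counts:
            expected = first_counts[a] * second_counts[b] / total
            ratios[(a, b)] = pair_counts[(a, b)] / expected
    return ratios

# Toy input: four short lines of family labels.
toy_lines = [
    ["CHEDY", "QOK", "OTHER"],
    ["CHEDY", "QOK"],
    ["AIIN", "OTHER"],
    ["OTHER", "AIIN"],
]
ratios = transition_ratios(toy_lines)
```

A ratio above 1 indicates attraction and a ratio below 1 indicates repulsion. A permutation test would recompute the same ratios after shuffling tokens to obtain the null distribution.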

Known limitations: Family definitions are EVA-specific and may not correspond to paleographic character boundaries. Self-clustering values are method-sensitive (pooled vs page-level). Non-IE comparison texts are modern, not medieval. Ottoman Turkish was tested only on a small UD corpus (16,890 words), where it falls in SYMM-LOW and does not match Voynich; a larger corpus is needed to confirm this.

Corrections Log

• "Only compatible class" framing retired. A first-pass synthetic constructed control satisfies 5 of 7 MVE items by design; only items 5 (bidirectional symmetry) and 7 (open vocabulary) currently discriminate constructed systems from encoded natural language. See results/constructed_control_results.json.

• "Productive paradigms" (Finding 1.8) retired. Voynich's log-freq vs edit-1 variant correlations (r = 0.42–0.71) do not exceed a character-trigram null containing no morphology. Removed from the MVE checklist (formerly item 5 of 8). See results/paradigm_null_results.json.

• CHEDY→QOK reframed as scribe-specific. Manuscript-wide 2.625× is concentrated in Hands 2 and 3 (2.15× and 2.28×). Hand 1 produces the rule only at 1.42× on n=13 observations. The earlier "holds across both scribal hands" framing was overstated. See results/per_scribe_results.json.

• Cascade flagship effect re-reported with uncertainty. CHEDY→OTHER→CHEDY +81 pp point estimate stands, but rests on n=13 agreement / n=119 disagreement trials with conservative 95% CI [+48, +94] pp. Five tested chains all survive Benjamini–Hochberg FDR at α=0.05; two have CIs crossing zero. See results/cascade_uncertainty_results.json.

• "Grammar" wording narrowed. The CHEDY→QOK class-level constraint is "collocational preference with class-level structure" rather than predictive grammar — family-bigram prediction gives ~0% lift over the baseline majority class.

• Comparator count corrected from 15 to 16. Mandarin added; Voynich is the target system, not a comparator. 18 systems total in the prefix/suffix table.

• Self-clustering now reported as a range, from 0.929x (page-level) to 1.451x (pooled backbone). The value is method-sensitive; the bidirectional ratio (0.99) is stable across methods.

• CHEDY→QOK page agreement corrected from 92% to 78%. Earlier figure inflated by selection bias.

• AIIN→QOK page agreement corrected from 91% to 67%. Same selection bias.

• Self-clustering in the Biological section was initially reported as "not significant." A rerun shows p < 0.001.

• Carry-through values shift ~15% between runs due to token parsing differences. Directions are stable; exact decimals are approximate.

Research Paper

Full write-up with methods, results, and discussion

Transition Grammar of the Voynich Manuscript: Sequential Constraints and Bidirectional Self-Clustering Symmetry
Amy Laird · Independent Researcher · April 2026

The paper reports all findings documented on this dashboard plus additional analyses of line-bounded grammar, suffix-agreement cascades, per-scribe decomposition, and glyph-layer architecture. It concludes with a minimum viable explanation checklist: seven measured properties that any proposed explanation of the Voynich text must account for.


Citation: Laird, A. (2026). Transition Grammar of the Voynich Manuscript: Sequential Constraints and Bidirectional Self-Clustering Symmetry. Available at github.com/amy2213/Voynich-Transition-Grammar