Substitution Cipher Based on The Voynich Manuscript

Here’s a fun paper: “The Naibbe cipher: a substitution cipher that encrypts Latin and Italian as Voynich Manuscript-like ciphertext”:

Abstract: In this article, I investigate the hypothesis that the Voynich Manuscript (MS 408, Yale University Beinecke Library) is compatible with being a ciphertext by attempting to develop a historically plausible cipher that can replicate the manuscript’s unusual properties. The resulting cipher, a verbose homophonic substitution cipher I call the Naibbe cipher, can be done entirely by hand with 15th-century materials, and when it encrypts a wide range of Latin and Italian plaintexts, the resulting ciphertexts remain fully decipherable and also reliably reproduce many key statistical properties of the Voynich Manuscript at once. My results suggest that the so-called “ciphertext hypothesis” for the Voynich Manuscript remains viable, while also placing constraints on plausible substitution cipher structures.

Posted on December 8, 2025 at 7:04 AM • 10 Comments

Comments

Clive Robinson December 8, 2025 11:01 AM

@ Mexaly,

“Who did they expect to read it?”

That is the Million Dollar question…

Because “if we know that…”, it allows us to come up with,

“Probable plain text words”

Because statistics alone won’t break it.

The reason for this is one of those quirks in life that I’ve mentioned before in a similar context.

Think of two basic ciphers,

1, A “One Time Pad”(OTP)

https://en.wikipedia.org/wiki/One-time_pad

2, A “straddling checkerboard”

https://en.wikipedia.org/wiki/Straddling_checkerboard

Which combined would have made the “VIC Cipher”,

https://en.wikipedia.org/wiki/VIC_cipher

Effectively unbreakable, rather than just unbroken for a considerable period, since an actual OTP was not used but a lagged Fibonacci generator was.

As you probably know, the OTP has two basic qualities,

1, It’s considered to have “Perfect Secrecy” because its unicity distance is longer than the sent message text.

2, Its output is, over any message length, close to statistically flat under various tests, unlike other stream ciphers or substitution ciphers.

The downside of the second quality is that it makes the use of an OTP much more recognisable when the ciphertext is examined.

The downside of this is that, because of “resource limitations”, cryptographers are likely not to make an attempt to break it if other, more “statistically promising” ciphertext is available.

As someone who would rather such adversaries wasted their time trying to crack an OTP, you thus have to change the ciphertext statistics to look like those of another type of stream cipher or substitution cipher.
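As a rough illustration of that recognisability point, here is a small Python sketch (my own toy example, nothing from the paper): a monoalphabetic substitution keeps the plaintext’s skewed letter frequencies, whereas anything with the uniform output distribution of an OTP comes out nearly flat.

```python
import secrets
import string
from collections import Counter

def top_frequencies(s: str, k: int = 5):
    """Return the k most common characters of s with their relative frequencies."""
    n = len(s)
    return [(ch, round(c / n, 3)) for ch, c in Counter(s).most_common(k)]

plain = ("the downside of the second quality is that it makes the use of an "
         "otp much more recognisable when the ciphertext is examined ") * 20
plain = "".join(ch for ch in plain if ch in string.ascii_lowercase)

# Monoalphabetic substitution: the ciphertext inherits the skewed letter
# frequencies of the plaintext, so it looks "statistically promising".
shuffled = sorted(string.ascii_lowercase, key=lambda _: secrets.randbelow(10**9))
sub_cipher = plain.translate(str.maketrans(string.ascii_lowercase, "".join(shuffled)))

# A letter OTP with a uniform key gives uniformly distributed output; drawing
# uniform random letters produces the same ciphertext distribution.
otp_cipher = "".join(secrets.choice(string.ascii_lowercase) for _ in plain)

print("substitution:", top_frequencies(sub_cipher))
print("one-time pad:", top_frequencies(otp_cipher))
```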

This is where the “straddling checkerboard” comes in. As the Wikipedia article notes,

“A straddling checkerboard is a device for converting an alphanumeric plaintext into digits whilst simultaneously achieving fractionation (a simple form of information diffusion) and data compression relative to other schemes using digits.”

Put simply, it replaces eight of the “high frequency letters”, given by one or other of the two memorable sentences,

1, “a sin to er”
2, “eat on irish”

With “Single Digit Numbers”, and all the other letters plus a couple of symbols with “Two Digit Numbers”. Which numbers they get is most often decided by using a plaintext sentence from a newspaper, journal or book.

The result is technically called “fractionation”,

https://en.wikipedia.org/wiki/Transposition_cipher#Fractionation

Or sometimes “flattening the statistics”. It also gives a little compression, thus additionally hiding the plaintext message length.

The thing is, it’s used on the plaintext, so it has to be reversible.

Now consider using it on the ciphertext instead, applying the de-straddling to the ciphertext you get out of the OTP encryption.

The result will be that the flat statistics of the OTP ciphertext look more like the statistics of another type of stream cipher or substitution cipher.

Thus cryptographers will see it as “breakable” rather than “unbreakable” and will waste time and very valuable resources on it.
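To make the mechanics concrete, here is a minimal Python sketch of the combination described above. The checkerboard layout (single digits for “a sin to er”, rows 2 and 6 for everything else) and every name in it are my own illustrative choices, not a historical key: fractionate the plaintext into digits, encrypt with a digit one-time pad, then run the ciphertext digits backwards through the checkerboard so the result reads like an ordinary letter ciphertext rather than an obviously flat digit stream.

```python
import secrets

# An assumed straddling-checkerboard layout (illustrative, not a historical key):
# the eight high-frequency letters of "A SIN TO ER" get single digits, and the
# two unused digits (2 and 6) become row prefixes for two-digit codes covering
# the remaining 18 letters plus '.' and '/'.
SINGLE = {"A": "0", "S": "1", "I": "3", "N": "4", "T": "5", "O": "7", "E": "8", "R": "9"}
ROWS = ("2", "6")
OTHERS = "BCDFGHJKLMPQUVWXYZ./"

TABLE = dict(SINGLE)
for i, ch in enumerate(OTHERS):
    TABLE[ch] = ROWS[i // 10] + str(i % 10)
DECODE = {code: ch for ch, code in TABLE.items()}

def straddle(text: str) -> str:
    """Fractionate plaintext symbols into a digit string."""
    return "".join(TABLE[ch] for ch in text.upper() if ch in TABLE)

def unstraddle(digits: str) -> str:
    """Reverse the fractionation: row-prefix digits start two-digit codes."""
    out, i = [], 0
    while i < len(digits):
        if digits[i] in ROWS:
            if i + 1 == len(digits):
                break                       # dangling row prefix: drop it
            out.append(DECODE[digits[i:i + 2]])
            i += 2
        else:
            out.append(DECODE[digits[i]])
            i += 1
    return "".join(out)

def add_mod10(a: str, b: str) -> str:
    return "".join(str((int(x) + int(y)) % 10) for x, y in zip(a, b))

plaintext = "ATTACK AT DAWN"
digit_text = straddle(plaintext)                              # fractionation
pad = "".join(str(secrets.randbelow(10)) for _ in digit_text) # one-time pad
cipher_digits = add_mod10(digit_text, pad)                    # flat digit stream
disguised = unstraddle(cipher_digits)                         # dressed up as letters

print(disguised)
# The receiver reverses the steps: straddle(disguised) recovers the cipher
# digits, subtracting the pad mod 10 recovers digit_text, and unstraddle()
# then gives back the plaintext letters.
```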

Many people have wasted many resources on the “Voynich Manuscript” as they have on other historic ciphertexts, and I suspect they will continue to do so if for nothing more than “bragging rights”.

Which you can see displayed in the extract / quote given at the top of this thread,

“My results suggest that the so-called “ciphertext hypothesis” for the Voynich Manuscript remains viable, while also placing constraints on plausible substitution cipher structures.”

I have demonstrated, using cipher types that were known at the time and place of the alleged creation of the “Voynich Manuscript”, that just as plausibly it cannot be decrypted…

As the old saying has it,

“Pays yer money, takes your choice!”

Me, I’d rather read a good book, as at least I will get something from so doing 😉

KC December 8, 2025 11:50 AM

Michael Greshko gives a fun and graphical overview of his Naibbe cipher here.

From what I understand, the system breaks the plaintext into unigrams and bigrams …

So ‘HELLO WORLD’ could be ‘H EL L OW OR L D’

Every individual character then maps to one of three positions in an encryption table, as either (1) a unigram, (2) a bigram prefix, or (3) a bigram suffix.

‘H’ would map to the unigram encryption. ‘E’ to the bigram prefix. The following ‘L’ to the bigram suffix.

But we don’t have just one encryption table; we have six. And Michael uses playing cards to randomly select the encryption table.

From the video: ‘A unigram can be represented 6 different ways. A bigram can be represented 36 different ways.’

So in effect a plaintext could produce very different ciphertexts. All with 15th century technology. Really neat!
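If it helps to see the shape of it, here is a toy Python sketch of that structure as I read the description above. The glyph strings and tables below are invented placeholders, not the actual Naibbe tables from the paper; the dice split is simplified to a coin flip, and the prefix and suffix tables are drawn independently, which is what gives the 6 × 6 = 36 representations per bigram.

```python
import random

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
N_TABLES = 6
rng = random.Random(408)          # fixed seed so the toy tables are reproducible

def make_tables(role: str):
    """One placeholder homophone per letter per table for the given role."""
    return [{ch: f"{role}{t}_{ch.lower()}" for ch in ALPHABET}
            for t in range(N_TABLES)]

UNIGRAM = make_tables("u")        # a lone letter has 6 possible representations
PREFIX = make_tables("p")         # first letter of a bigram: 6 choices
SUFFIX = make_tables("s")         # second letter of a bigram: 6 choices -> 36 per bigram

def split_tokens(text: str):
    """Randomly cut the plaintext into unigrams and bigrams (dice in the paper)."""
    letters = [c for c in text.upper() if c in ALPHABET]
    tokens, i = [], 0
    while i < len(letters):
        take = 2 if (i + 1 < len(letters) and rng.random() < 0.5) else 1
        tokens.append("".join(letters[i:i + take]))
        i += take
    return tokens

def encrypt(text: str) -> str:
    words = []
    for tok in split_tokens(text):
        t = rng.randrange(N_TABLES)          # playing card selects the table
        if len(tok) == 1:
            words.append(UNIGRAM[t][tok])
        else:
            t2 = rng.randrange(N_TABLES)     # suffix table drawn independently
            words.append(PREFIX[t][tok[0]] + SUFFIX[t2][tok[1]])
    return " ".join(words)

print(encrypt("HELLO WORLD"))
print(encrypt("HELLO WORLD"))     # same plaintext, a different-looking ciphertext
```

Decryption works in this toy only because each placeholder glyph encodes its role and letter; in the real cipher, decipherability instead relies on the tables being designed so the glyphs parse unambiguously.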

Clive Robinson December 8, 2025 12:30 PM

@ KC,

You forgot to mention that the splitting of the plaintext into one or two letters was done randomly with a couple of dice.

Ray Dillinger December 8, 2025 2:36 PM

A few years ago someone claimed to have solved the Voynich MS and pestered the cryptography list with his claims. Basically his “decryption” of the MS is a mishmosh of words from a half-dozen languages thrown together in “sentences” which, although he ignores the fact, have no coherent grammar.

I wanted to at least know what I was talking about when I replied, so I went and gave it a semi-serious look.

Voynichese has a very strong bias toward letters appearing in cycles, in more-or-less the same order within each cycle – to the point where it seems like a lot of cycles come down to just one or two choices besides just deciding which majority of the “alphabet” is to be left out. It reminded me of an old-fashioned stenogram, where the court reporter presses keys for a word on two chorded keyboards all at once: the (approximate) vowels of a word, corresponding to the left-hand keyboard, appear on the left half of the page, and the (approximate) consonants of the same word, corresponding to the right-hand keyboard, appear on the right. The Voynichese letter order, though, wasn’t quite that completely regular. I could not detect any longer patterns which might have been about grammar rather than about individual words.

The looping sequence bias continues across spaces and line breaks, so whitespace appears not to be meaningful. When I talk about “words” I’m basically pointing at the loops in an almost-repeating cycle that runs, with variations, seven to fifteen letters per loop.

The sequence bias is represented fairly well (much better than a stenogram) with the tables of “prefixes” and “suffixes” that this paper uses to both encode information and bias the letter order within each loop. But the cipher uses spaces between iterations of the loop rather than spaces inserted more-or-less randomly as the MS does.

The most common letter in Voynichese is about two-thirds as frequent as English ‘E’ IIRC, but does not combine left or right with a wide variety of things the way vowels do. It does not appear to be used as a vowel in a substitution cipher. Neither, in fact, does any other letter in Voynichese appear to be used as a vowel in a substitution cipher. Each letter can appear after a very few things, and before a very few other things.

Finally, the very strong sequence bias in Voynichese results in a ridiculous conditional probability table. If you see a particular two-letter sequence, you can predict, with probability greater than 50%, the four or six letters that follow.

There is a set of letters bigger than the others – you might guess that they are presentation forms like capital letters – which seem to appear only on the top line of each page and as the rightmost character of other lines. But they don’t appear to be part of the repeating cycle, substituting for anything else the way presentation forms would. This paper doesn’t attempt to reproduce their page positions, treating them simply as letters of somewhat more random placement and lower probability. So I strongly suspect this cipher must have been constructed from a “linearized” version of the manuscript, where the positions of page and line breaks wouldn’t be represented.

With all the highly regular statistical patterns based on sequence in the main loops, combined with a few letters that appear to follow a completely different but equally regular set of rules based on page position, it follows that if Voynichese represented a human language, it would have to be a very low-efficiency encoding – maybe around a third of a bit of entropy per letter. English writing, by comparison, achieves about one and a quarter or thereabouts. And that information density accords reasonably well with this cipher, which adds several random bits to each word by choosing randomly among encodings. Using mutually prefix-free Huffman encodings preserves the ability to decrypt, while the randomness stretches the meaningful entropy out over more ciphertext characters – a clever way to do it.
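For anyone who wants to sanity-check figures like that, a rough order-k character model is enough to get ballpark per-letter numbers. Here is a hedged Python sketch (the function name, the order-3 choice and the smoke-test string are mine; a meaningful comparison needs a large English sample and a transliteration of the manuscript such as EVA):

```python
import math
from collections import Counter

def conditional_entropy(text: str, k: int = 3) -> float:
    """Estimate H(next char | previous k chars) in bits from raw text.
    Small or repetitive samples understate the true entropy, so treat the
    result as a ballpark figure only."""
    text = "".join(text.split()).lower()
    ctx = Counter(text[i:i + k] for i in range(len(text) - k))
    joint = Counter(text[i:i + k + 1] for i in range(len(text) - k))
    total = sum(joint.values())
    h = 0.0
    for gram, count in joint.items():
        p_joint = count / total
        p_cond = count / ctx[gram[:k]]
        h -= p_joint * math.log2(p_cond)
    return h

# Smoke test only; feed it large samples of English text and of a Voynich
# transliteration to compare per-letter figures like those quoted above.
print(round(conditional_entropy("abracadabra arbadacarba " * 40), 3))
```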

All told, it’s a nice piece of work, and keeps the hope of would-be decryptors alive.

That said, I’m still personally convinced that Voynichese is mainly or entirely meaningless. Probably the macguffin of a hoax somebody pulled way back in the wayback, or possibly the asemic product of someone’s untreated graphorrhea disorder in the same time frame. It’s not a modern hoax, but IMO it’s also probably not language.

The thing that argues against the latter (and therefore favors the ‘hoax’ theory) is the quality of the materials used. That is fine vellum and good mineral-pigment ink that would cost decades of a good craftsman’s earnings at the time, and a disordered mind in the throes of graphorrhea wouldn’t have access to such materials unless they were very wealthy or working for or with a very wealthy institution such as the Church. In which case the author, as a noble, royal, or other notable person having such a very specific disability, would likely be at least as well-known as Hildegard von Bingen. And no such person is known.

KC December 8, 2025 3:04 PM

@ Clive

Hmm, yes, interesting that Michael presents two dice respacing schemes, Simplified and Standard (with the Standard yielding a slight preference towards bigrams).

I’m still reading through the paper, and also see this:

Playing cards, previously noted for their usefulness in analog keystream generation (Schneier 1999), are especially convenient.

Clive Robinson December 8, 2025 5:35 PM

@ Ray Dillinger,

With regards,

“asemic” and “graphorrhea”

Two words I would not have expected on this blog… Unless talking about LLM output 😉

But your point about insufficient layers in the statistics further suggests that it is not “natural language”, and makes me think that, as I noted earlier, even if it is actual language it cannot or will not be decoded.

But as you note, the quality of the materials used is “beyond the common man”, or lord for that matter, in terms of available means.

Which kind of points to it having been made by those in a religious order where hand illuminating etc was still being practiced.

So there is a small possibility it is actually a “sampler” used as a test piece to show competence at the copying / illuminating craft.

Clive Robinson December 8, 2025 5:57 PM

@ KC,

The thing with regards,

“Playing cards, previously noted for their usefulness in analog keystream generation (Schneier 1999), are especially convenient.”

Whilst semi-random “shuffling” was known back then, the Schneier and similar uses of maximal “sequence” generation were quite a bit less so. In short, they work on two almost unrelated principles: the first is a random cut / interleaving process, the second a deterministic process based on ideas similar to a lagged generator. The latter, from memory, was first thought up around 200 BC in India but did not appear in Europe until around 1200 AD, when the Italian mathematician Leonardo of Pisa noticed the “self similar” nature of what is in effect a power series or fractal.
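For anyone unfamiliar with the deterministic side of that, here is a minimal Python sketch of the lagged-Fibonacci “chain addition” idea the VIC cipher actually used in place of a true pad (the seed and lag here are illustrative, not the full historical VIC procedure):

```python
def chain_addition(seed: str, length: int) -> str:
    """Lagged Fibonacci digit generator ("chain addition"): each new digit is
    the sum, mod 10, of two earlier digits a fixed lag back in the stream.
    Entirely deterministic from the seed, unlike a true one-time pad."""
    digits = [int(d) for d in seed]
    lag = len(digits)                      # needs a seed of at least two digits
    while len(digits) < length:
        digits.append((digits[-lag] + digits[-lag + 1]) % 10)
    return "".join(str(d) for d in digits)

print(chain_addition("69696", 30))
```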

fib December 9, 2025 1:56 PM

I’m not a cryptographer myself, so bear with me. Something that strikes me as curious here (and this thread is a good example) is that discussions of classical ciphers almost never frame things in group-theoretic terms, even when the structure seems to invite it.

‘A unigram can be represented 6 different ways. A bigram can be represented 36 different ways.’

KC’s 6-way / 36-way observation immediately suggests S₃ and S₃×S₃ symmetry, but classical-cipher discussions usually stick to vocabulary like “tables,” “homophones,” “prefix/suffix classes,” and so on.

That’s not a criticism — just an interesting cultural divide. I’d enjoy reading anything in that direction.

Givon December 15, 2025 7:44 AM

Truly fun fact. Now to implement it. So, if you use it for a book code, that would be confounding. Use a book not in the library.
