Lee September 23, 2021 8:55 AM

One of the challenges of using Unicode out of context is that not all symbols are shown properly on all devices, although the ROT8000 set should be reliable.

Following encryption, I typically encode in base-62 for maximum portability but would like to use a big set of Unicode symbols to reduce the number of characters (not bytes, obvs.) – but it’s really difficult to get a definitive set that includes many obscure symbols yet would display properly on most devices.
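
To put rough numbers on the trade-off, here is a minimal Python sketch of how many characters a payload needs for a given alphabet size. The 256-bit payload and the alphabet sizes are illustrative assumptions, not values from any particular scheme, and `chars_needed` is a hypothetical helper name.

```python
def chars_needed(payload_bits: int, alphabet_size: int) -> int:
    """Smallest n such that alphabet_size**n >= 2**payload_bits."""
    target = 1 << payload_bits
    n, capacity = 0, 1
    # Exact integer arithmetic, so there is no floating-point rounding.
    while capacity < target:
        capacity *= alphabet_size
        n += 1
    return n


if __name__ == "__main__":
    # Assumed example: a 256-bit ciphertext block.
    for size in (62, 1024, 32768):
        print(f"alphabet of {size:>5} symbols: {chars_needed(256, size)} chars")
```

Under these assumptions base-62 needs 43 characters, while an alphabet of 32768 symbols (15 bits per character) needs 18. The hard part remains picking symbols that render everywhere.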

Bruce Schneier September 23, 2021 9:19 AM


“But wouldn’t anyone immediately see that the Chinese characters are repeated too often?”

Of course. It’s not actually meant to be secure.

Clive Robinson September 23, 2021 9:32 AM

@ Bruce, ALL,

What’s clever about it is that normal English looks like Chinese, and not like ciphertext

“looks” being the operative word.

It would not fool a human censor and it most certainly would not fool even the simplest program that did frequency analysis.
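
A minimal sketch of why this is so, in Python: a one-to-one substitution renames the symbols but leaves the shape of the frequency distribution untouched. ROT13 is used here only as a convenient stand-in for any such substitution; the sample text and the `frequency_profile` helper are illustrative assumptions.

```python
import codecs
from collections import Counter


def frequency_profile(text: str) -> list:
    """Symbol counts from most to least common, ignoring which symbol is which."""
    return sorted(Counter(text).values(), reverse=True)


plain = "attack at dawn, retreat at dusk"
# ROT13 stands in for any one-to-one substitution, including a Unicode rotation.
cipher = codecs.encode(plain, "rot_13")

if __name__ == "__main__":
    print(cipher)
    print(frequency_profile(plain) == frequency_profile(cipher))
```

The profiles match exactly, so the most common ciphertext symbols line up with the most common plaintext ones, whatever script they happen to be drawn from.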


Of course. It’s not actually meant to be secure.

Or even unreadable… There are people out there who can read ROT13 to get the punch line for a “not suitable for work” joke.

How long before they do the same for ROT8000,

籛籲籰籱籽 籼籪籲籭 籏类籮籭簷簷簷

Z.Lozinski September 23, 2021 10:08 AM

@Snarki, child of Loki

What’s the equivalent ROT for EBCDIC?

The way ROT-13 is defined you can do it on EBCDIC characters just as easily as ASCII. The really cool thing is that on a mainframe you can implement ROT-13, or even EBCDIC-to-ASCII, with just a single instruction. (Translate or Translate Extended for those who have a zSeries handy .. )

And because you can set up translations easily, you could have hours of endless fun inventing reversible transforms ..
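
For those without a mainframe handy, a rough Python analogue of the table-driven approach: build one 256-byte table and apply it in a single call. This is a sketch, not the mainframe instruction itself. It assumes Python's `cp037` codec as the EBCDIC variant, and `build_rot13_table` is a hypothetical helper.

```python
import codecs
import string


def build_rot13_table(encoding: str) -> bytes:
    """256-byte table that applies ROT13 to letters in the given single-byte encoding."""
    table = bytearray(range(256))  # start from the identity mapping
    for ch in string.ascii_letters:
        src = ch.encode(encoding)[0]
        dst = codecs.encode(ch, "rot_13").encode(encoding)[0]
        table[src] = dst
    return bytes(table)


# Assumption: cp037 is used as a representative EBCDIC code page.
EBCDIC_ROT13 = build_rot13_table("cp037")
ASCII_ROT13 = build_rot13_table("ascii")

if __name__ == "__main__":
    ebcdic = "Hello".encode("cp037")
    rotated = ebcdic.translate(EBCDIC_ROT13)  # one table lookup per byte
    print(rotated.decode("cp037"))
```

Because the table is built letter by letter, it does not matter that the EBCDIC alphabet is split into non-contiguous runs. Applying the same table twice gives back the original bytes.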

Ah, my mis-spent youth ..

SpaceLifeForm September 23, 2021 3:37 PM

@ Z.Lozinski, Clive

In the olden daze, there was no assist from the cpu, so one had to use 256 byte arrays to do translation.

My application was file transfer between ASCII machines (Tandems) and IBM Mainframes. Either direction. This is before the FTP Protocol existed.

My critical rule was all of the data had to be human-readable. No binary numbers, no packed decimal allowed.

How did you deal with the ASCII tilde character (~) which does not exist in EBCDIC?

I translated it to the EBCDIC hook-bar which does not exist in ASCII.

My application could print from the Tandem machine to the IBM Line Printer.

If the devs wanted to print out something from the Tandem to the MF Line Printer, they were educated that the tilde would appear as the hook-bar.

There were a few more exceptions that were dealt with in a similar manner.
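
A minimal Python sketch of that kind of table with an exception list. It makes several assumptions: `cp037` stands in for whatever EBCDIC variant was in use (cp037 itself does have a tilde, so the override here only illustrates the substitution), the "hook-bar" is taken to be the not sign (U+00AC), and the helper name is hypothetical.

```python
def build_ascii_to_ebcdic(overrides=None) -> bytes:
    """256-byte ASCII-to-EBCDIC table with optional per-character overrides."""
    unknown = "?".encode("cp037")[0]  # used for bytes outside 7-bit ASCII
    table = bytearray(256)
    for b in range(256):
        if b < 128:
            table[b] = bytes([b]).decode("ascii").encode("cp037")[0]
        else:
            table[b] = unknown
    for src, dst in (overrides or {}).items():
        table[src] = dst
    return bytes(table)


# Assumption: the "hook-bar" is the not sign, U+00AC.
HOOK_BAR = "\u00ac".encode("cp037")[0]
ASCII_TO_EBCDIC = build_ascii_to_ebcdic({ord("~"): HOOK_BAR})

if __name__ == "__main__":
    sent = b"TOTAL~42".translate(ASCII_TO_EBCDIC)
    print(sent.decode("cp037"))  # the tilde shows up as the not sign
```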

I do not recall them off of the top of my head, this was over 30 years ago.

Braces or Brackets I think. I would have to refresh my bits and bytes in my head, but that will have to wait.

As to the real workload, dealing with pure business application data, besides the ‘must be human readable’ rule, the various business apps had to make sure that such problematic characters never made it into the data stream. It turned out that was not a major problem.

All of the translation happened in the Mainframe side, as it was faster.

There was actually binary data involved, but that was part of the protocol, not the payload. Think headers, counters, and VLIs (Variable Length Indicators).

It took me a year to develop. Maybe a bit longer.

One of the interesting tricks for the JOB on the Mainframe side was to resubmit its own JCL to the internal reader, but modify its JCL on the fly to set up for the next transfer with the correct DD statements. All datasets on the MF were accessed via relative generation numbers.

The actual hardware communication was via a channel called TIL. On the MF side, it looked like a 9-track tape drive.

Oh yeah, mis-spent youth. I had no idea that some would think this was a fool’s errand, but I like a challenge. It worked.

Jesse Thompson September 23, 2021 8:09 PM


[I] would like to use a big set of Unicode symbols to reduce the number of characters (not bytes, obvs.)

Bear in mind though, on any platform which encodes your text in UTF-16 or above, using a denser set of code points will net you a savings over ascii encodings, around 200% and up in fact! 😀
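
A quick way to check the byte counts in Python. The example strings are assumptions: 43 ASCII characters versus 18 BMP characters, roughly what a 256-bit payload would need in base-62 and in a 32768-symbol alphabet respectively. The `encoded_sizes` helper is hypothetical.

```python
def encoded_sizes(text: str) -> dict:
    """Character count plus byte counts under UTF-8 and UTF-16 (no BOM)."""
    return {
        "chars": len(text),
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }


ascii_form = "A" * 43      # stand-in for a base-62 string
cjk_form = "\u7c5b" * 18   # stand-in for a string of CJK code points

if __name__ == "__main__":
    print(encoded_sizes(ascii_form))
    print(encoded_sizes(cjk_form))
```

With these inputs the dense form is smaller in UTF-16 (36 bytes against 86) but larger in UTF-8 (54 bytes against 43), so the saving depends entirely on the encoding in use.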

Clive Robinson September 23, 2021 8:33 PM

@ SpaceLifeForm, Z.Lozinski,

so one had to use 256 byte arrays to do translation.

But where did you put the start pointer(s)?

The argument for EBCDIC was that Hollerith punch cards would not be as weak as they would using ASCII… Make of that what you will.

However back in the early 60’s most I/O devices for human use were “electromechanical”, which meant reprogramming needed a “hack saw”, “4lb lump hammer” and the ever handy “OxyCet Cutting torch” if the “shaped charge” failed to have the desired effect[1].

So it’s been argued IBM was “too lazy” to do the mechanical redesign to support ASCII (mind you they do use ASCII in AIX and their version of Linux, and that “skunkworks box” we now call PCs… But I’m told z/OS still clings to the old ways, or goes to 8bytes/char these days).

Back when I had to write a translator you needed three byte arrays… One for the so called “fixed” chars, one for the supposedly fixed but “got moved a bit” chars, oh and one of maybe 30 or more “special” chars that were sort of language or application dependent… The trick was to “fall through” using x00/NUL in an array “pigeon hole” where there was not a char and catch it in one of the subsequent arrays or just drop it or white space it depending on how the user wanted it done.
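
A minimal Python sketch of the fall-through idea, under stated assumptions: the table contents are a handful of illustrative ASCII-to-EBCDIC values rather than a real code page, and `make_table`, `translate` and `on_miss` are hypothetical names.

```python
NO_MAPPING = 0x00  # a NUL entry means "not in this table, try the next one"


def make_table(mapping: dict) -> bytes:
    """256-entry table, NUL everywhere except the given source bytes."""
    table = bytearray(256)
    for src, dst in mapping.items():
        table[src] = dst
    return bytes(table)


def translate(data: bytes, tables, on_miss=None) -> bytes:
    """Try each table in order; drop unmapped bytes, or emit on_miss if given."""
    out = bytearray()
    for b in data:
        for table in tables:
            if table[b] != NO_MAPPING:
                out.append(table[b])
                break
        else:
            if on_miss is not None:
                out.append(on_miss)
    return bytes(out)


# Illustrative values only, not a complete or authoritative mapping.
FIXED = make_table({0x41: 0xC1, 0x42: 0xC2})   # "A", "B"
MOVED = make_table({0x5B: 0xBA})               # "["
SPECIAL = make_table({0x7E: 0x5F})             # "~"
TABLES = [FIXED, MOVED, SPECIAL]

if __name__ == "__main__":
    print(translate(b"AB[~", TABLES).hex())
    print(translate(b"A\x01B", TABLES, on_miss=0x40).hex())  # 0x40: EBCDIC space
```

One consequence of using NUL as the "empty" marker is that NUL itself can never be a translation target in this scheme.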

As you may remember “every byte counts” was the 8bit mantra, closely followed by “calls crawl” and the more general “gotos considered harmful”. So being “young” I spent rather more night hours than I should have done trying to come up with an algorithm that took fewer bytes than the one using arrays… Nope, every time I thought I’d cracked it someone would find a table for some IBM peripheral that broke it…

[1] For my “real sins” when wearing the green someone thought I should do a couple of “tele-mech” courses to know how to keep the old-style teleprinters and punch tape systems up and sprightly… Some here may still play around with mechanical devices and know what a “feeler gauge” is and what it was used to do (set / measure gaps). Well, a thing of note: the teleprinter could have up to a quarter horsepower motor, and one requirement was to adjust its position with respect to the “mesh gap” on a cog wheel… You could do this the “slow and safe” way with the power off and unplugged from the electrical outlet, or the “quick and rip your finger off” way with it powered up…

If you chose the quick way you had to be aware of “snatch” where if you over adjusted your feeler gauge would get ripped from your grasp and thrashed against other way more delicate mechanics… The solution was to “rough set” with “Rizla blue papers” (a particular brand and type of cigarette paper). If you over adjusted all that would happen is the paper gets turned to dust… Unless you made the mistake of “tickling” with the gummed edge, in which case, if it was a little humid, it would “stick like glue” and take even longer than the “slow and safe” adjustment method to clean up…

Ken September 24, 2021 6:48 AM

This reminds me of an episode of the Kojak TV show. The police had some seemingly secret messages in Greek which no one could figure out. It turned out that they were English messages typed on an IBM Selectric typewriter with a Greek alphabet ball in place of the English ball. It was a simple matter to retype them on a Greek keyboard typewriter with an English ball in place.

Adrian September 24, 2021 8:12 AM

Not every sequence of Unicode code points from the BMP is valid Unicode. For example, the surrogate pairs are valid only in UTF-16 and must be paired properly. Their ROT8000 equivalents appear to be legitimate code points in the main CJK section.

I wonder whether anyone will find a way to construct a valid string that looks like a ROT8000-encoded message but that “decodes” into an invalid sequence of code points designed to trigger Unicode handling bugs in software to achieve “interesting” results.

I’m pretty confident all the mainstream browsers would handle an invalid Unicode string appropriately. By their nature, they require a lot of attention to text handling for international coverage, and I’m sure somebody has fuzz-tested them thoroughly.

But there’s a lot of other software out there: email clients, text editors, custom fonts, etc.

I could imagine a system (like a server) that validates input strings only at a trust boundary. If the system could be tricked into “decoding” a text after the input has been validated, it would be possible to get your invalid string past the checkpoint.
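
A minimal Python sketch of that failure mode. Note the assumptions: `naive_rotate` is a deliberately simplistic "add 0x8000 within the BMP" transform, not the real ROT8000 mapping, and both function names are hypothetical. It shows a string that passes validation before the transform and fails it afterwards.

```python
def naive_rotate(text: str) -> str:
    """Deliberately naive: add 0x8000 modulo 0x10000 to each BMP code point."""
    return "".join(chr((ord(c) + 0x8000) % 0x10000) for c in text)


def is_well_formed(text: str) -> bool:
    """True if the string can be encoded as UTF-8, i.e. has no lone surrogates."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    incoming = "\u5800"                  # an ordinary CJK code point
    print(is_well_formed(incoming))      # passes the checkpoint
    decoded = naive_rotate(incoming)     # becomes U+D800, a lone surrogate
    print(is_well_formed(decoded))       # would fail if checked again
```

The safe pattern is to validate again after any such transform rather than only at the trust boundary.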

Clive Robinson September 24, 2021 8:49 AM

@ Ken,

This reminds me of an episode of the Kojak TV show.

Yes, Telly Savalas and the lollipops…

But someone got there many centuries before. A lad by the name of Al-Kindi[1] wrote a little treatise on what we now call “Frequency Analysis”, that is, the basic method used to crack simple substitution ciphers.

Since then somebody else worked out how to extend it to polyalphabetic ciphers…

And the designers and analysts of codes and ciphers have been fighting their arms race ever since.


anonnyMouse September 29, 2021 12:33 PM

Guys, this looks like the original
ht tps:// (now lost in the abyss).

so, why can’t we simply add an OTP layer,
according to the Vernam OTP rules of Frank Miller (FM),
(to preserve deniably equal odds for each output)?

I have a working version called “SHRINKXOR”,
but CSIS has thugs stalking me it seems,
and ergo I am effectively a street person in a library, (technically) and…
if you have a securedrop I could use… MAYBE… if I get a mobile LTE router…

Anyway, you need:



  • A Deep SEIF[1] Faraday cage for the processor of the following:
  • noise generators and other precautions such as those used by Julian Assange’s tech, (Alex Mugune? vis; mullhulland foundation)
  • a list of N-Grams to shrink your text, like Norvigs N-gram lists (fractured: ht tps:// or like LDC’s corpora breakouts:
    ( ht tps:// or one you slurp yourself from a large corpus of text from your trade (hint: would love if Bruce would post an 5-GRAM LIST of SOS :p)
  • a list of randomly selected FM numbers in the correct range,
  • a subroutine to convert those ngrams into those FM numbers
  • a subroutine to XOR (or Modulus add) your keys to that output.
  • a QRCODE output from THAT xored output,
  • a cell phone to read the QRCODE from the off-line airgapped caged SEIF PC,
    so you can ride the public transit for an hour and transmit the message somewhere along the way from your phone…
  • a security system that can stop HIGHLY motivated thugs who will now wet the bed because they can’t crack your text messages anymore, and will therefore choose HOME VISITATION as the solution of choice,
  • a bunk in some straw bales in the bush someplace now that you are being badmouthed EVERYWHERE as a communist subversive arsonist pedo, by hyperventilating psychos on the home team, because highly funded no-necks from nearly every nation are suddenly wanting to hire you as a uni-bomber.

    DANG, is the text message to your girlfriend really THAT covert?
    ELSE: stand on your rooftop and holler the message in plain local vernacular. Regardless, you will wake up with a microchip embedded in your body if you use this… DONT.

    Also see, ht tps://
    FYI, “QR Code” is a registered trademark of DENSO WAVE
    [1]A SEIF is a secure em-isolated information facility.

    Thanks again Bruce. Cheers.

