Fixing Unicode

The Unicode community is working on fixing the security vulnerabilities I talked about here and here. They have a draft technical report that they're looking for comments on. A solution to these security problems will take some concerted efforts, since there are many different kinds of issues involved. (In some ways, the "" hack is one of the simpler cases.)

Posted on March 13, 2005 at 9:31 AM • 4 Comments


Israel Torres • March 14, 2005 9:20 AM

As long as something is maintained to be "backwards-compatible" it will have a weak link that will usually be the first to break.

Israel Torres

Johannes • March 15, 2005 5:34 AM

Opera Software was at first reluctant to fix this "bug", since they had in fact only implemented it as described. In current betas it is fixed in the following way:
- Some domains, which are considered "safe", will render IDN URLs fine (for instance .no domains, where the only allowed characters are the Latin alphabet plus the Norwegian characters æøå).
- Other domains (such as .com, which will allow any Unicode URL) will render IDNs encoded, such as instead of
An acceptable solution, rendering some phishing attacks difficult, IMO.
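
A rough sketch of that display policy, for illustration only. The whitelist contents and the names `SAFE_TLDS`, `to_ascii_label`, and `render_host` are assumptions made up for this example, not Opera's actual code, and the encoding step skips the normalization that real IDNA processing performs before Punycode.

```python
# Hypothetical whitelist of TLDs whose registries restrict allowed characters.
SAFE_TLDS = {"no"}

def to_ascii_label(label):
    """Return the label unchanged if ASCII, else a Punycode form with the IDNA prefix."""
    if label.isascii():
        return label
    # Simplification: real IDNA also normalizes the label before encoding.
    return "xn--" + label.encode("punycode").decode("ascii")

def render_host(host):
    """Show Unicode for whitelisted TLDs, the encoded form for everything else."""
    labels = host.split(".")
    if labels[-1].lower() in SAFE_TLDS:
        return host
    return ".".join(to_ascii_label(label) for label in labels)

print(render_host("bl\u00e5b\u00e6r.no"))   # whitelisted TLD: shown as typed
print(render_host("p\u0430ypal.com"))       # Cyrillic "a": shown encoded
```

The point of the design is that the browser does not need to understand every script. It only needs to know which registries police their own character sets.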

There are hacks with Unicode that do not depend on backward-compatibility issues. An example is the paypal hack, which exploits the fact that there are two "a"s in the Unicode space that look exactly the same.
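
To make the two "a"s concrete, here is a small sketch using Python's standard `unicodedata` module. It assumes the look-alike in question is the Cyrillic small letter a (U+0430); the variable names are just for illustration.

```python
import unicodedata

latin_a = "\u0061"     # the ordinary ASCII "a"
cyrillic_a = "\u0430"  # renders identically in most fonts

for ch in (latin_a, cyrillic_a):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

real = "paypal"
spoof = "p" + cyrillic_a + "ypal"

# Same length, same appearance on screen, but different strings to a computer.
print(real == spoof)  # False
```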

Chung Leong • March 15, 2005 1:06 PM

It's really impossible to "fix" Unicode as a whole. There are so many writing systems and few people--if any--have an understanding of all of them.

The first step to solving the security problem with IDN, I think, is to define zones of allowable codepoints. Each zone would encompass a writing system--Latin, Cyrillic, Arabic, etc. A domain name would only have characters within a given zone. Names in a zone would be governed by a zone-specific set of rules, to be determined by people from countries that use the script.

For example, the Basic Latin zone could be defined as containing only letters used in the major European languages. When the browser encounters a name with a Cyrillic letter, it'll see that it does not fall in that zone and will look at the definitions of other zones.

The Basic Cyrillic zone might be defined as containing letters used in modern East Slavic languages plus the basic Latin set. A rule in that zone might stipulate that a Cyrillic letter cannot be immediately next to a Latin one. A name like the one used in the Secunia exploit would thus fail, and the browser would move on to look for another possible zone to place the name. After it has tested the name against all the zones it knows, it would give up and display the name in punycode, and maybe throw up an alert message.

The idea is to break a large problem into smaller ones, which we can then solve one by one, incrementally.
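
The zone-by-zone check described above might look something like the sketch below. Everything here is an assumption for illustration: the zone definitions are far looser than real ones would be (any Latin letter is accepted, not just those of major European languages), the script of a character is guessed from the first word of its Unicode name because Python's standard library has no script property, and `script_of`, `ZONES`, and `display_form` are made-up names.

```python
import unicodedata

def script_of(ch):
    """Heuristic: guess the script from the first word of the character name."""
    if ch in "0123456789-":
        return "COMMON"
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"

def latin_zone(label):
    # Accept only Latin letters, digits, and hyphens.
    return all(script_of(c) in {"LATIN", "COMMON"} for c in label)

def cyrillic_zone(label):
    scripts = [script_of(c) for c in label]
    if any(s not in {"CYRILLIC", "LATIN", "COMMON"} for s in scripts):
        return False
    # Zone rule: a Cyrillic letter may not sit directly next to a Latin one.
    return all({a, b} != {"CYRILLIC", "LATIN"}
               for a, b in zip(scripts, scripts[1:]))

# Zones are tried in order; more could be appended one at a time.
ZONES = [("Basic Latin", latin_zone), ("Basic Cyrillic", cyrillic_zone)]

def display_form(label):
    """Return (text to show, zone name), falling back to punycode with no zone."""
    for zone_name, accepts in ZONES:
        if accepts(label):
            return label, zone_name
    return "xn--" + label.encode("punycode").decode("ascii"), None

print(display_form("paypal"))
print(display_form("p\u0430ypal"))  # Cyrillic "a" wedged between Latin letters
```

Because each zone is just a name and a predicate, a new writing system can be added without touching the others, which is the incremental approach the comment argues for.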
