Exploiting Mistyped URLs

Interesting research: “Hyperlink Hijacking: Exploiting Erroneous URL Links to Phantom Domains”:

Abstract: Web users often follow hyperlinks hastily, expecting them to be correctly programmed. However, it is possible those links contain typos or other mistakes. By discovering active but erroneous hyperlinks, a malicious actor can spoof a website or service, impersonating the expected content and phishing private information. In “typosquatting,” misspellings of common domains are registered to exploit errors when users mistype a web address. Yet, no prior research has been dedicated to situations where the linking errors of web publishers (i.e. developers and content contributors) propagate to users. We hypothesize that these “hijackable hyperlinks” exist in large quantities with the potential to generate substantial traffic. Analyzing large-scale crawls of the web using high-performance computing, we show the web currently contains active links to more than 572,000 dot-com domains that have never been registered, what we term ‘phantom domains.’ Registering 51 of these, we see 88% of phantom domains exceeding the traffic of a control domain, with up to 10 times more visits. Our analysis shows that these links exist due to 17 common publisher error modes, with the phantom domains they point to free for anyone to purchase and exploit for under $20, representing a low barrier to entry for potential attackers.

Posted on June 10, 2024 at 7:08 AM

Comments

Bob Bishop June 10, 2024 8:12 AM

This is why it’s important for the world to get used to expecting an Extended Validation certificate on anything remotely sensitive.

What price common sense June 10, 2024 9:01 AM

@Bruce Schneier
@ALL

“Yet, no prior research has been dedicated to situations where the linking errors of web publishers (i.e. developers and content contributors) propagate to users.”

This surprised me when I read it, but thinking about it I don’t remember seeing anything research-wise.

But if you think about it, there are three basic ways this can happen:

  1. By deliberate human intent by an insider or an outsider who gets access.
  2. By accidental human agency, such as typos, incomplete cut-and-paste, and similar. One example is “end of line” conversion when long URLs are pasted into a text editor or buffer.
  3. Programmatically, by generation in applications or tools. We’ve seen this in the past when sites get moved, updated, or upgraded. But there is a new kid in town in the form of LLM or ML AI, which, if certain people have their way, will sit permanently between the user and the presented URL.

The “AI in between” is the real worry; we’ve already seen issues with Alexa and the like misunderstanding voice prompts.

Apple has been slipping in AI for a while, Amazon wants to grab a big slice, and Google and Microsoft are desperate to use AI as the next surveillance tool, on user-device interfaces or further down the stack.

Think of it not so much as a “Co-Pilot” but a “Compel-Bot”.

Uthor June 10, 2024 9:16 AM

I was wondering recently if companies bought these domains just to get mistypers to their intended site. Like, does Amazon own Amazn and Amzaon?

The Geogre June 10, 2024 9:36 AM

In the very early days, we referred to “errorspace engineering.” There were certainly malicious, or at least noxious or parasitic, sites that exploited typographical errors. The most famous of these was espncom, where the missing “dot” sent users to a parasite.

Additionally, there were sites, like the much commented upon whitehouse dot com, that exploited the misunderstanding of top level domains.

These seem like innocent days, now, since they required — forgive the phrase — conscious misprision. Today, URL redirects are embedded in phishing attacks, inter alia.

I’m only shocked that no one has performed an academic study of the traffic diversion (and therefore revenue) from “errorspace.”

Adam June 10, 2024 10:11 AM

EV certs are useless now, have been for years. No major browser exposes the extended information.

Pete June 10, 2024 10:54 AM

I understand that this sort of thing also occurs with incorrectly printed labels that carry customer-support telephone numbers, so that the miscreant can spoof the legitimate vendor when the user calls the number shown.

What price common sense? June 10, 2024 11:07 AM

@Adam
@Bob Bishop
@ALL

“[Extended Validation certificates] are useless now, have been for years. No major browser exposes the extended information.”

It’s not just the lack of information, or users not understanding them.

The reality is they only solved a subset of certificate problems, and they did not solve those problems very well. Worse, as with all partial solutions, they quickly stopped solving the problems that people needed solving; such is the speed of evolution in cyber-attacks.

The real issue, and one nobody has talked about for around a third of a century, is that “certificates” are themselves a problem in many ways that cannot be solved in a sensible, let alone graceful, way.

The simplest of the myriad major certificate issues is that the system is not just hierarchical; because of that, certificates are “a weapon of denial”. That vests power in hands that cannot be seen by users, who would not want to understand even if they could see it.

As smart primates we have in effect totally failed to find an equitable trust model, and thus have gone with what was known to be not just a bad but a failed model long before Church, Gödel, and Turing did their seminal work in the early 1930s.

Even the Founding Fathers knew hierarchical human structures enabled corruption and much much worse, yet here we are still avoiding the known issues centuries later…

It’s why I find the NIST PQC competition little more than a joke: as far as “Key EXchange” (KEX) is concerned, we are running around putting sticking plasters on broken bones.

Back in the early 1990s our host @Bruce Schneier commented that algorithm competitions were not addressing the real problems of “Key Management” (KeyMan), and here we are, effectively a full working life later, and we’ve not moved forward in the slightest.

In fact it’s easy to argue that overall progress on KeyMan has been effectively negative, because other aspects have moved forward, in some cases significantly.

Even what should apparently be simple problems such as “Real Time Multi-Party E2EE without a center” turn out not to have reliable solutions. Likewise “Anonymous Rendezvous Protocols” and similar. The list is long, and ordinary users assume that there are easy solutions and “they want product now”.

Winter June 10, 2024 11:57 AM

Isn’t a password safe supposed to catch this type of attack?

At least when it comes to capturing logins and credentials?

I hardly ever type in passwords or passphrases nowadays.

polpo colpevole June 10, 2024 1:02 PM

I thought “Mistyped URLs” was a 2016 Czechoslovakian New New Wave coming-of-age comedy film?

Levi B. June 10, 2024 8:01 PM

@What price common sense,

I don’t know whether to count the following as a “fourth way”, or just a variant of number 3…

Those who are not familiar with the term “bit-squatting” should look that up; it’s referenced in the paper. The computer itself—its random access memory—has a hardware error that flips a bit. For example, windowsupdate.com and 7indowsupdate.com are one bit apart. Probably nobody would ever mis-type that, but it’s available to register; and with more than a billion active Windows installations, there’s a good chance that whoever registers it will get traffic (assuming Windows still gets updates from windowsupdate.com). They’ll be able to get a valid certificate, too, so it’s good that Microsoft’s update system doesn’t rely on TLS for security.
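
To make the bit-flip idea concrete, here is a rough Python sketch (my own illustration, not from the paper) that enumerates the single-bit-flip neighbours of a domain name, keeping only those that still consist of valid hostname characters:

    import string

    # Characters that can legally appear in a hostname label (plus the dot).
    VALID = set(string.ascii_lowercase + string.digits + "-.")

    def bitsquat_variants(domain: str) -> set[str]:
        """Return every string that differs from `domain` by a single bit flip
        and still consists only of valid hostname characters."""
        variants = set()
        for i, ch in enumerate(domain):
            for bit in range(8):
                flipped = chr(ord(ch) ^ (1 << bit)).lower()
                if flipped in VALID and flipped != ch:
                    variants.add(domain[:i] + flipped + domain[i + 1:])
        return variants

    # "7indowsupdate.com" (the 'w'/'7' flip mentioned above) is among the results.
    for v in sorted(bitsquat_variants("windowsupdate.com")):
        print(v)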

Web servers would usually have error-correcting (ECC) memory, in which case they’re unlikely to create such links themselves. But if the content management software involves sending the full text between authors and editors, an overheating laptop could do it.

Another thing referenced in the paper (a combination of your points 1 and 3) is post-expiration takeovers. Have you ever clicked a link from somewhere and found the domain was dead? If you registered it, the other people clicking that link could be sent wherever you wanted them to be sent.

One thing that I don’t see mentioned is just making sure the top Google result points to you rather than the “expected” location. I’ve seen people just search the name of their bank or e-mail provider, click the top result, and start typing their name and password. Terrifying. (There’s a reason the button for that is called “I’m feeling lucky”.)

Extended validation was always kind of useless. Nevermind that the major browsers, around that time, seemed to be changing their security-notification interfaces annually (should I look for the lock icon? No, that’s gone, because scammers started using a lock-shaped “favicon”. Can I look for “https”? Not if the browsers are showing “simplified” addresses by default. Does a yellow bar mean anything? Maybe!). Wikipedia mentions that a researcher was able to incorporate a business called Stripe in Kentucky, and got an EV cert for “Stripe, Inc.”. Who here knew that the more famous “Stripe, Inc.”—the credit-card processor—was located in Delaware, and how to check that the EV cert referred to that one? I have no idea whether Stripe of Delaware used EV itself; certainly many major banks did not. And the system didn’t scale, was expensive and cumbersome for little benefit, and was only doing what the certificate authorities were supposed to have been doing all along.

Gert-Jan June 11, 2024 7:16 AM

It’s always interesting to theorize why an incorrect link was published. This research puts some numbers on the topic.

But when I read the countermeasures section, I’m not impressed. I’m not seeing any proper solution, just some non-scalable band-aid solutions here and there.

Having said that, I don’t think it is a big problem in everyday life. The fact that so many of these “fake” domains can still be registered tells me that criminals do not think they can make money off of it. I haven’t read the paper thoroughly, but I don’t think it answers the question of why criminals haven’t jumped onto this. I have no doubt criminals did this research before the authors did.

What price common sense? June 11, 2024 7:30 AM

@Levi B.

“Those who are not familiar with the term “bit-squatting” should look that up”

Are you sure you want to go down that rabbit hole?

It’s an instance of a general class of problems that are never going to go away.

And why in

“Web servers would usually have error-correcting (ECC) memory, in which case they’re unlikely to create such links themselves.”

The key word is “unlikely” or more formally “low probability”.

Because it’s down to the fundamentals of the universe and the failings of logic and reason as we formally use them. Which in turn is why, from at least as early as the ancient Greeks through to the 20th century, some of those thinking about it in its various guises have gone mad and some have committed suicide.

To understand why, you need to understand why things like “Error Correcting Codes” (ECC) will never be 100% effective, and why deterministic encryption systems, especially stream ciphers, will always be vulnerable.

And why it also relates to the post @slashed zero has just made over on the current Friday Squid Page,

https://www.schneier.com/blog/archives/2024/06/friday-squid-blogging-squid-catch-quotas-in-peru.html/#comment-438257

Which is about why something like 12% of easily preventable medical-related deaths actually have nothing whatsoever to do with medicine, but are due to fundamental “information issues” that appear to be easily solvable, thus avoidable, but are actually not.

And is also in part why

“Computers can count but can not do mathematics, and never will be able to in a finite universe.”

And it was an issue sufficiently well known to Georg Cantor, Alan Turing, Kurt Gödel, Claude Shannon, John von Neumann, and others, which led up to the birth of information theory in the late 1940s.

And it’s still causing issues today: it is part of why LLM and ML AI hallucinate, and why Roger Penrose has been given a lot of undeserved criticism.

Enough of a build up?

There is an old riddle that actually shows the problem tangentially

“If a rooster lays an egg on a church steeple which way will it fall?”

The obvious and incorrect answer is “roosters don’t lay eggs”. It’s incorrect because you have to allow for the failing of Juvenal’s “black swan metaphor”. So the actual answer has to be based on reasoning about “if such an egg existed”, and the answer is “we can not know” because it’s “undecidable”.

To see why this applies, let’s start with a system even simpler than ECC: the “error detecting but not correcting” system called “parity checking”. When all is said and done, the parity check bit is

“The least significant bit of a binary count of the defined bit states in a finite data set.”

So if you count, say, all the set bits in your data set, the result will be either odd or even. And depending on how you set it up, an odd count could indicate an error has been found.

That is we know at least one bit has been flipped, but what if two bits have been flipped? Then the count is even indicating no error found, which is incorrect. So “parity” only detects some bit flip errors and allows others to pass undetected.

The price you pay for this partial error detection is half the range of values your data set could potentially hold. So “there is a trade” in that error detection takes significant information bandwidth for an imperfect result.

But it actually gets worse: what if the bit flip errors are in the error detection bits rather than the data bits? That gives the opposite type of error, in that the actual data is correct but the check code is not, so correct data is rejected as opposed to incorrect data being accepted.

No matter what you do, all error checking systems have both false positive and false negative results. All you can do is tailor the system to the more probable errors.
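
A toy Python sketch (purely illustrative, computing even parity over a whole byte string) shows both the detection of a single flip and the silent miss on a double flip:

    def parity_bit(data: bytes) -> int:
        """Even-parity bit: 0 if the number of set bits in `data` is even."""
        return sum(bin(b).count("1") for b in data) & 1

    original = b"hello"
    check = parity_bit(original)

    # One flipped bit changes the parity, so the error is detected.
    one_flip = bytearray(original)
    one_flip[0] ^= 0b00000001
    print(parity_bit(one_flip) != check)   # True: detected

    # Two flipped bits leave the parity unchanged, so the error passes unseen.
    two_flips = bytearray(original)
    two_flips[0] ^= 0b00000011
    print(parity_bit(two_flips) != check)  # False: undetected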

But there are other underlying issues: bit flips happen in memory by deterministic processes that apparently happen by chance. Back in the early 1970s, when putting computers into space became a reality, it was known that computers were affected by radiation. Initially it was assumed the radiation had to be of sufficient energy to be ‘ionizing’, but it later turned out that with low-energy CMOS chips any EM radiation, such as from the antenna of a hand-held two-way radio, would do.

This was due to metastability. In practice the logic gates we use are very high gain analog amplifiers that are designed to “crash into the rails”. Some logic such as ECL was actually kept linear to get speed advantages but these days it’s all a bit murky.

The point is that as the level at a simple logic gate input changes, it goes through a transition region where the relationship between the gate input and output is indeterminate. Thus an inverter might or might not invert, or might even oscillate, while the input is in the transition zone.

I won’t go into the reasons behind it, but it’s down to two basic issues: firstly the universe is full of noise, and secondly it’s full of quantum effects. The two can be difficult to differentiate in even very long term measurements, and engineers tend to lump it all under a first approximation of a Gaussian distribution as “Additive White Gaussian Noise” (AWGN), which has nice properties such as averaging predictably to zero with time and “the root of the mean squared”. However the universe tends not to play that way when you get up close, so instead “Phase Noise in a measurement window” is often used, along with the Allan Deviation:

https://www.phidgets.com/docs/Allan_Deviation_Guide
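
For what it’s worth, a minimal sketch of the non-overlapping Allan deviation described in that guide (assuming NumPy and evenly spaced fractional-frequency samples) looks something like this:

    import numpy as np

    def allan_deviation(y: np.ndarray, tau0: float, m: int) -> tuple[float, float]:
        """Non-overlapping Allan deviation.

        y    : fractional frequency samples, one every tau0 seconds
        m    : number of samples averaged into each measurement window
        """
        n = len(y) // m
        # average the samples inside each window of length m * tau0
        y_avg = y[: n * m].reshape(n, m).mean(axis=1)
        # Allan variance is half the mean squared difference of successive windows
        avar = 0.5 * np.mean(np.diff(y_avg) ** 2)
        return float(np.sqrt(avar)), m * tau0

    # White noise averages down as the window grows; drift or 1/f noise does not,
    # which is the "measurement window" point made below.
    rng = np.random.default_rng(0)
    white = rng.normal(size=100_000)
    for m in (1, 10, 100, 1000):
        adev, tau = allan_deviation(white, tau0=1.0, m=m)
        print(f"window {tau:6.0f} s  ADEV {adev:.4f}")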

The important point to note is “measurement window”: it tells you there are things you cannot know because they happen too fast (high frequency noise), and likewise because they happen too slowly (low frequency noise). But what it does not indicate is what the noise amplitude trend is at any given time, or whether it’s predictable, chaotic, or random. There are things we can not know because they are unpredictable or beyond our ability to measure.

But also beyond a deterministic system to calculate.

Computers only know “natural numbers” or “unsigned integers” within a finite range. Everything else is approximated, or as others would say “faked”. Between every pair of natural numbers there are other numbers; some can be found as ratios of natural numbers and others cannot. What drove philosophers and mathematicians mad was the realisation that the likes of “root two” and pi exist, and that there is an infinity of such numbers we can never know. Another issue was the spacing caused by integer multiplication: the smaller the integers, the smaller the spaces between their multiples. Eventually it was realised that there was an advantage to this in that it scaled, and the result in computers is floating point numbers. They work well for many things, but not for the addition and subtraction of small values with large values.
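
A quick illustration of that last point, using the IEEE-754 doubles found in practically every modern machine:

    # IEEE-754 doubles carry roughly 15-16 significant decimal digits, so a
    # small term added to a much larger one can simply vanish.
    big = 1.0e16
    print(big + 1.0 == big)        # True: the 1.0 is absorbed

    # Adding 1.0 a thousand times to the large value loses every one of them...
    total = 1.0e16
    for _ in range(1000):
        total += 1.0
    print(total - 1.0e16)          # 0.0

    # ...whereas summing the small values first keeps them.
    print(sum([1.0] * 1000) + 1.0e16 - 1.0e16)   # 1000.0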

As has been mentioned, LLMs are in reality no different from “Digital Signal Processing” (DSP) systems in their fundamental algorithms, one of which is “Multiply and ADd” (MAD) using integers. These have issues in that values disappear or cannot be calculated. With continuous signals the errors can be integrated in with little distortion. In LLMs they can cause errors that are part of what has been called “hallucinations”: that is, where something with meaning to a human, such as the name of the Pokemon trading card character “Solidgoldmagikarp”, gets mapped to an entirely unrelated word, “distribute”; mayhem resulted on GPT-3.5, and much hilarity once this became widely known.
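
As a purely illustrative toy (not how any particular DSP or LLM is actually implemented), a fixed-point multiply-and-add with only 8 fractional bits shows how small contributions can vanish entirely:

    # Toy fixed-point multiply-and-add with 8 fractional bits.
    FRAC_BITS = 8
    SCALE = 1 << FRAC_BITS

    def to_fixed(x: float) -> int:
        return int(round(x * SCALE))

    def mad(acc: int, a: float, b: float) -> int:
        # multiply in fixed point, then truncate back to the working scale
        return acc + (to_fixed(a) * to_fixed(b)) // SCALE

    acc = 0
    for _ in range(1000):
        acc = mad(acc, 0.01, 0.05)     # each true product is 0.0005

    print(acc / SCALE)                 # 0.0 -- every contribution truncated away
    print(1000 * 0.01 * 0.05)          # 0.5, the exact answer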

But as noted, these problems cause deaths in the medical setting, many of them easily avoidable. They happen because information gets hidden from view by old-fashioned AI “Expert Systems”. The result is either wrong intervention choices or inaction, and death follows.

This is a problem I’ve had some involvement in from back in the 1980s through until more recently, and there is a whole department involved with it at Queen Mary’s Uni in East London. Sadly there is no one answer, and as such it’s an unsolvable problem with deterministic systems of the types we currently have. The issue starts with the “User Interface” (UI): equipment screens have limited area and no visible depth, unlike the old-fashioned “whiteboards” and “medical files”. So the medical systems cannot display all information, so selection choices have to be made, and it’s that which kills, plain and simple.

One obvious one is “most recent first” as a selection criterion. But is that recent as of when a test was requested, when it was started, or when the results came back? A critical test might not be displayed because a half dozen simple observations or other tests have happened since and the system displays them in preference. Even if it is flagged in some way, there are limits to what the UI can do; all basic selection criteria have this issue, and current technology is not going to change that. In fact any changes you make are likely to make the problem worse…

But there are other selection processes. One is the electronic “British National Formulary” (BNF) and the “National Institute for Clinical Excellence” (NICE) guidelines, both seen as “Bibles” to be obeyed without question. But which takes precedence and why? The answer is complicated, and the results can kill people, for example with issues to do with iron supplements, anticoagulants, antibiotics, PPIs and NSAIDs.

Then the “Expert Systems” are rule-based systems built on “currently” presented conditions. They are never up-to-date, never complete, and once walking a path are difficult to get to change. Anyone who has experienced the newer Internet search engines, like those from Microsoft and Google, will have a feel for this issue.

Unofficially something like 12% of avoidable deaths in a medical setting are down to “Small UI” issues…

lurker June 11, 2024 2:37 PM

@ Levi B.
“I’ve seen people just search the name of their bank or e-mail provider, click the top result,…”

I must be a bitter and twisted cynic. There seems no limit to human laziness, so people who use Google as their address book get all they deserve. FGS, how hard is it to use your own bookmarks or contacts list?

Levi B. June 11, 2024 3:36 PM

@lurker,

“how hard is it to use your own bookmarks or contacts list?”

Well, one has to know that bookmarks exist, and how to create them. With browsers deliberately blurring the interface between URLs, partial domain names (automatically adding .com in some cases), searches, bookmarks, and history, that’s not a given. Someone typing “facebook” into an address bar could get the expected result from any of those sources; it’ll work like 99% of the time, so what’s the incentive to learn about the “bookmark” feature?

As for contact lists, why bother when you can type a name and your e-mail software will auto-complete it? Careful, though; Outlook does “blurring” of its own, by searching both the address book and the “To” and (sender-selected) “From” headers of every saved message. I’ve had confidential data accidentally sent to my personal account that way. (I’d e-mailed my manager to say I was taking a sick day. From then on, whenever they’d type my name, they’d get two results—the other from the corporate address book, which apparently takes no priority—and would occasionally click the wrong one. It didn’t stop till they deleted my sick-day message and their own “get well” response.) Imagine what someone trying to get confidential data could do.

Sec June 13, 2024 12:49 AM

Bookmarks are the best solution from the client side. Just add any website you register to into your bookmarks and never use the address bar to reach it again.
When dealing with money, always check the URL and the certificate of the website before entering financial information.
Googling is also an adequate solution; I believe Google deliberately removes those fake websites from its results, especially the first page. Though there is a privacy issue here, so perhaps a privacy-focused search engine is better.
Disabling execution of any scripts not manually whitelisted by the user also greatly weakens this attack vector, but requires more hassle when visiting new websites.
