Detecting Phishing Sites with Machine Learning

Really interesting article:

A trained eye (or even a not-so-trained one) can discern when something phishy is going on with a domain or subdomain name. There are search tools that allow humans to specifically search through the massive pile of certificate log entries for sites that spoof certain brands or functions common to identity-processing sites. But it's not something humans can do in real time very well -- which is where machine learning steps in.

StreamingPhish and the other tools apply a set of rules against the names within certificate log entries. In StreamingPhish's case, these rules are the result of guided learning -- a corpus of known good and bad domain names is processed and turned into a "classifier," which (based on my anecdotal experience) can then fairly reliably identify potentially evil websites.
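To make the "corpus in, classifier out" idea concrete, here is a minimal sketch in Python. It is not StreamingPhish's actual model or feature set, which the excerpt does not describe. As an assumption for illustration, it uses character trigrams with a naive Bayes score, and every name in the training corpus is invented and sits under the reserved `.example` TLD.

```python
import math
from collections import Counter


def ngrams(name, n=3):
    """Split a domain name into overlapping character n-grams."""
    padded = f"^{name.lower()}$"  # mark the start and end of the name
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Toy two-class naive Bayes over character n-grams (illustrative only)."""

    def __init__(self, n=3):
        self.n = n
        self.counts = {"good": Counter(), "bad": Counter()}
        self.docs = Counter()
        self.vocab = set()

    def fit(self, names, labels):
        for name, label in zip(names, labels):
            grams = ngrams(name, self.n)
            self.counts[label].update(grams)
            self.docs[label] += 1
            self.vocab.update(grams)
        return self

    def log_odds_bad(self, name):
        """Log-odds that a name is 'bad'; positive means more likely bad."""
        score = math.log(self.docs["bad"] / self.docs["good"])  # class prior
        v = len(self.vocab)
        bad_total = sum(self.counts["bad"].values())
        good_total = sum(self.counts["good"].values())
        for gram in ngrams(name, self.n):
            # Add-one smoothing so unseen n-grams do not zero out the score.
            p_bad = (self.counts["bad"][gram] + 1) / (bad_total + v)
            p_good = (self.counts["good"][gram] + 1) / (good_total + v)
            score += math.log(p_bad) - math.log(p_good)
        return score

    def predict(self, name):
        return "bad" if self.log_odds_bad(name) > 0 else "good"


# Hypothetical labelled corpus; a real one would hold many thousands of names.
GOOD = ["mail.example", "shop.example", "news.example",
        "docs.example", "blog.example", "wiki.example"]
BAD = ["paypal-login-verify.example", "appleid-secure-account.example",
       "netflix-account-update.example", "secure-login-bank.example"]

model = NgramNaiveBayes().fit(GOOD + BAD, ["good"] * len(GOOD) + ["bad"] * len(BAD))

print(model.predict("paypal-secure-login.example"))
print(model.predict("maps.example"))
```

With this toy corpus the first name scores as "bad" and the second as "good". A name made of patterns the corpus never contained still receives a label, so the result is only as good as the training data.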

Posted on August 9, 2018 at 6:17 AM • 11 Comments


AndrewJ • August 9, 2018 6:39 AM

Will be interesting to see the effectiveness longer term given Let's Encrypt are now issuing wildcard certs - surprised they weren't mentioned in the article at all.

(required) • August 9, 2018 1:50 PM

"During the two hours I spent investigating this Apple phish, another 1,678 suspicious sites have popped up—spoofing brands including Apple, PayPal, Netflix, Instagram, and Bank of America. It will be nearly two days before SingleHop responds about the initial Apple one: "We were in touch with the management of the allegedly abused server, and after discussion the reported problem is claimed to be resolved."

"That sort of interaction can't scale very well" - humans had to / have to be involved.
I don't think we're ready for AI to be pulling the plug on domains "all by itself" yet.

But if it matches xyz criteria known only to be malware kits, isn't that the next step?
Automatically detect and sinkhole domains based on their signatures before the crime?

captain_obvious • August 9, 2018 9:23 PM

@ (required)
Maybe this site has had its plug pulled by AI "all by itself"?

Ismar • August 9, 2018 10:02 PM

I hear you, Danny, and I think there are legitimate concerns here (especially about the metadata side of things), especially given where the funding is coming from.
I think that Signal might be a good platform for the US Gov to keep testing their eavesdropping abilities. Something similar to having a backdoor in encryption where only they can listen in on the conversations.
However, for less advanced adversaries, including some countries, Signal may be a good option. So maybe a good choice for a dissident in Iran or , but not for a dissident in any of the 5 Eyes countries or even countries like Israel or Russia.
Not sure about places like UAE or Saudi Arabia, as they may have enough money to buy the data they need from the US Gov directly.
Have you given any thought to using or reviewing Telegram?
@Bruce please feel free to give my email details to Danny if he wants to discuss this further off this blog

Steve • August 10, 2018 1:45 AM

If you train machine learning on "a corpus of known good and bad"... then it only learns previously-known good and bad patterns from that existing corpus... it does not help with entirely new kinds of patterns that match neither, which are fairly easy to invent with some regularity... the machine learning won't know how to classify those.

Clive Robinson • August 10, 2018 8:30 AM

@ Danny, Ismar,

Care to comment on this?

My response, since before Signal, WhatsApp etc etc etc, has always been,

    They are not secure, nor can they be, due to failures in design in the underlying systems, that prevent effective endpoint security beyond the communication end point.

Valid as the other arguments are, they are completely and utterly irrelevant as long as the above problem exists. It's something every Signals Intelligence Agency world wide knows. Likewise any Law Enforcement Organisation with any technical ability. So these apps can not in any way protect you from Police States, Tyrants, Dictators, or Representational Democracies currently.

In fact even halfway technical Private Investigators know this, and so do most cyber criminals or anyone else who can think or reason for more than five minutes and has a little prerequisite knowledge of Security Based Shannon Channels.

The consequence is these apps are totally and utterly insecure no matter what clever crypto algorithms, modes, methods or other security functionality their "Oh So Clever" designers might decide to come up with and include.

As long as the User Interface with Plaintext can be subjected to an "end run attack" then it's game over no matter what these "Oh So Clever" designers come up with.

Both Moxie and Bruce and the usual suspects on this blog are very well aware of this issue and even know solutions to it.

Do you understand my argument and reasons for my viewpoint?

Now you should be asking why what I say and Bruce says are apparently at odds, even though they are not...

Actually, Bruce and myself are answering two very different questions, due to a significant underlying issue.

The problem is a very human one, and the two documents you need to read up on are,

1, Why Johnny can't encrypt.
2, Why Johnny still can't encrypt.

Historically they are a good starting point as to why "People don't do OpSec".

Bruce has further indicated in the past that a major reason security does not work in businesses is that it is detrimental to the workers' ability to get promotion or even retain their jobs.

That is, security for the main part impairs the workers' ability to meet their assigned work targets. The workers then make a quite sensible trade off: they figure that they will get fired for a security breach but they will also get fired for not meeting targets. As both outcomes are a given, the workers make a "value judgement". That is, which is most likely to happen to them? Which is "get fired for missing targets"... Thus they act quite rationally and work around or do not use any security mechanism that gets in the way of them "making numbers".

So why does @Bruce apparently recommend/endorse the use of these Apps? Well, until he actually comes out and says why, you have to try and determine his likely reasoning.

However it's a little complicated but you can work it out, if you start from the right point. But firstly you need some background knowledge.

Contrary to what many people think, information is not a physical entity. The probable reason for the incorrect view point is that we are "physical entities in a physical reality" and only interact by physical means.

The only reason we can interact with information is that it is impressed or modulated onto energy or matter to,

1, Communicate it.
2, Store it.
3, Process it.

Some people get this when told but others take a while or need further explanation (which I've given a few times in the past on this blog).

So on the "you get it" assumption the question arises as to what these applications can do, and do, to protect user information. Most obviously they protect it when information is communicated between devices, provided any intermediary node does not have access to the message plaintext. They can also protect "information at rest" when stored on the devices, provided it's not in plaintext and the keys are adequately protected.

What they cannot currently do is protect information when it is being processed. Because like communicating it from the device directly to the user, processing information currently requires the information to be in plaintext...

You need to realise that the real issue is that inadequate device security negates the application security. The easy technical fix is to add an extra encryption layer beyond the end point of the device, between it and the user. But users will just not do that.

Important to understanding the reasoning is that in some cases the weakest link in the security chain, "Device Security", is improving in leaps and bounds.

Apple is the current leader in Consumer Off the Shelf Technology (COST), and clearly some users are buying their products because of it.

However, unfortunately there is no reason to assume that this trend in device security will continue. It's not really happened on other devices, and in fact the faux "Going Dark" argument is all about stopping it.

But currently pushing the faux "going dark" mantra is becoming more and more of an uphill struggle. The FBI and DoJ lost a great deal of credibility when they took Apple to court to try and set a legal precedent. Because it would now appear that the DoJ/FBI very deliberately lied to the Magistrate involved. The fact they were going to fail as Apple fought back hard meant they had little choice but to "pull the rip cord" so they could try again at a later date as well as saving face.

However, unlike with "Crypto Wars One", which was fought and won on almost entirely ideological grounds, consumer awareness of the need for security is actually growing. No doubt helped by the Ed Snowden Document Trove Revelations, but also the more recent political manipulation scandals. Especially the criminal investigations in the UK into very suspect individuals in the US with their illegally funding and gathering information. At least one "Silicon Valley Name" --Peter Thiel-- keeps coming up either directly or indirectly through association with some very unpleasantly viewed individuals involved in Hedge Funds with traceable finances to the illegal activity involved with Cambridge Analytica. It just so happens that Mr Thiel also has possibly the world's largest private intelligence database and analytics organisation, Palantir. Another "name" is of course "Mr Z/Zleazee" of Facebook. Who were intimately involved with Cambridge Analytica's illegal election activity not just in the UK but, it appears, the US as well as several other places. Recent stockmarket activity suggests that even the corporate rats are bailing out from that "leaky bucket" that Facebook has become.

Oh and also the "prosecution of a ham sandwich" issue currently going on in the US has just revealed to many that current device security with regards to storage is decidedly lacking, as the "Special Prosecutor" apparently accesses previous messages with total ease (whether by technical or human means is an open question). Which is probably causing quite a degree of consternation in parts of the FBI senior hierarchy. As it reveals the "faux" nature of the "going dark" issue more and more. Making them ever increasingly appear as not just charlatans but in effect perjurers... In fact I suspect that, like the GOP, certain elements in the FBI might well wish the Special Prosecutor to "cease and desist" with certain of his and his team's activities with rather more alacrity than "Pretty Darn Quick".

After all, who knows what might happen when people start feeling threatened. We might see others starting to run "interference" against the SP, for which no doubt the Russians will be blamed ;-)

Thus clearly "Device Security" is becoming a "Hot Button Issue" with increasing numbers of voters, with the attendant "push back" against the faux "going dark" FUD.

Thus there is a reasonable chance, if hardware manufacturers sort out the Spectre, Meltdown and other "bubbling up" attacks, that device security will continue to improve. Potentially to the point it becomes such that the IC and LEO community will have to consider other methods than "end run attacks" to harvest data from "persons of interest" and also the more general population subject to "Hoover it up" industrialised "collect it all" behaviour out of sight to end users on the Internet backbone.

There is little doubt that the "let's encrypt" message got through to many sysadmins and the like. They responded, and plaintext on the Internet dropped a lot as a consequence. Thus causing the IC entities resource, and thus priority, issues. Which in turn in effect increased all users' privacy and security from the IC entities as well as cyber-criminals.

However getting this point over to the general public even the technically inclined members is a hard sell, but a very necessary one.

More importantly, you can not strengthen all links in the security chain to maximal values overnight. It's a process that is ongoing and always will be, due to the "unknown unknowns" etc issue.

It is thus an ongoing process, which Bruce is encouraging in part. But there is a more important part many do not realise, which really is of great concern.

There is a lot of crap talked about "the power of the free market" and it's all based on a series of false assumptions. One of which arises from the physical world of the "Distance Costs" idea.

Put simply, the assumptions are that the further you are from the point of production the more a good costs to transport. Which in turn gives producers closer to the consumer an economic advantage, thus providing "market competition". It's fairly easy to see that it's not of importance in the physical world, as it's way too minor compared to other economic factors.

But it's a joke in the online world where the cost of information transport is just like plant equipment costs. Thus it has no "distance cost" element at all any longer (there used to be one, imposed in the first century of telecommunications). So the Internet is depending on your viewpoint either a "first to market" or "winner take all" environment.

Thus due to "user issues" from the security aspect, if the most secure application is not "first to market" then it never will be in the market.

Thus a little "qualified" comment now about Signal etc may have significant security implications long long into the future. I know this and I suspect Bruce knows it as well. Thus Bruce saying he uses Signal and he thinks it's the most secure of the apps is quite true, and quite different to saying it's secure to some impossible level due to other current issues. The fact that others miss the context and the reason for it is another issue entirely.

I hope that gives you a slightly different perspective on the subject.

echo • August 10, 2018 9:08 AM

StreamingPhish seems like a good tool for language experts studying spelling and etymology.

A Nonny Bunny • August 11, 2018 2:17 PM


    If you train machine learning on "a corpus of known good and bad"... then it only learns previously-known good and bad patterns from that existing corpus... it does not help with entirely new kinds of patterns that match neither, which are fairly easy to invent with some regularity... the machine learning won't know how to classify those.

And despite not knowing how to classify them, it still will (just not necessarily correctly).

Additionally, the bad guys can (potentially) train a similar network, and use it to find a pattern that will be misclassified.

Nonetheless, recognizing existent "bad" patterns and stamping them out is already a big improvement. You remove the low hanging fruit, and make the bad guy work a little bit harder.
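The evasion idea above can be sketched in a few lines of Python. Everything here is a stand-in: the "surrogate" is a made-up keyword counter rather than a trained network, and the token list, look-alike table, and threshold are all hypothetical. The point is only the search loop, where an attacker tries spellings against their own copy of a detector until one slips through.

```python
from itertools import product

# Hypothetical features an attacker's surrogate model might have learned.
SUSPICIOUS_TOKENS = ("paypal", "login", "secure", "verify")

# Hypothetical look-alike substitutions: original letter first, then its swap.
LOOKALIKES = {"a": "a4", "l": "l1", "o": "o0", "i": "i1", "e": "e3"}


def surrogate_score(name):
    """Count how many suspicious tokens appear in the name."""
    return sum(token in name for token in SUSPICIOUS_TOKENS)


def is_flagged(name, threshold=1):
    """The surrogate flags a name once its score reaches the threshold."""
    return surrogate_score(name) >= threshold


def variants(name):
    """Yield every spelling made by swapping letters for look-alikes."""
    choices = [LOOKALIKES.get(ch, ch) for ch in name]
    for combo in product(*choices):
        yield "".join(combo)


def find_evasion(name):
    """Return the first variant the surrogate does not flag, or None."""
    for candidate in variants(name):
        if candidate != name and not is_flagged(candidate):
            return candidate
    return None


print(is_flagged("paypal-login"))
print(find_evasion("paypal-login"))
```

A real classifier would be harder to probe than this keyword counter, but the shape of the attack is the same, and it also shows why retraining on newly seen spellings is a recurring cost rather than a one-off.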

Clive Robinson • August 11, 2018 3:07 PM

@ A Nonny Bunny, Steve,

    You remove the low hanging fruit, and make the bad guy work a little bit harder.

Possibly, but to what result?

If you think back to the late 1990s and early noughties, Banks used to keep chopping out the "low hanging fruit" in their online systems. But because the Banks never made the gap between their old and new security systems large enough, all they ended up doing was "training" the crackers to become even better crackers...

Thus you need to consider if such system manipulation will cause one of those ECM ECCM ECCCM battles where each step gets alternately defeated by the opposition. With each step costing a positive power more than the last, until either the cost becomes prohibitive and not even a Pyrrhic victory is possible, or one side takes what used to be called "A technology quantum leap" and thus resets the cycle, to start over again.

It's kind of like riding "the hamster wheel of pain": the only sensible option is to "get out whilst you still can". There used to be the old joke "Whatever the question, the answer is not Microsoft"; just replace Microsoft with AI to bring it up to date ;-)



Schneier on Security is a personal website. Opinions expressed are not necessarily those of IBM Resilient.