Deep Learning to Find Malicious Email Attachments

Google presented its system of using deep-learning techniques to identify malicious email attachments:

At the RSA security conference in San Francisco on Tuesday, Google's security and anti-abuse research lead Elie Bursztein will present findings on how the new deep-learning scanner for documents is faring against the 300 billion attachments it has to process each week. It's challenging to tell the difference between legitimate documents in all their infinite variations and those that have specifically been manipulated to conceal something dangerous. Google says that 63 percent of the malicious documents it blocks each day are different than the ones its systems flagged the day before. But this is exactly the type of pattern-recognition problem where deep learning can be helpful.

[...]

The document analyzer looks for common red flags, probes files if they have components that may have been purposefully obfuscated, and does other checks like examining macros -- the tool in Microsoft Word documents that chains commands together in a series and is often used in attacks. The volume of malicious documents that attackers send out varies widely day to day. Bursztein says that since its deployment, the document scanner has been particularly good at flagging suspicious documents sent in bursts by malicious botnets or through other mass distribution methods. He was also surprised to discover how effective the scanner is at analyzing Microsoft Excel documents, a complicated file format that can be difficult to assess.

This is the sort of thing that's pretty well optimized for machine-learning techniques.

Posted on February 28, 2020 at 11:57 AM • 16 Comments

Comments

Paul • February 28, 2020 1:08 PM

Thank you for reminding me why I will never have a Google email account. My email (like my normal mail) is for my eyes only. It would be a criminal offence for Google to read my normal mails. Why is this not true of my emails?

Rick Burris • February 28, 2020 1:34 PM

Just remember, people sign up for the breach of privacy. It's not really different than letting someone house sit for you, and telling them to open mail and keep an eye out for a specific letter. I don't like it, but at least one party in the conversation apparently said it could be read by having agreed to the EULA.

EvilKiru • February 28, 2020 1:47 PM

@Paul: Because you agree to allow Google to scan your email when you sign up for the Gmail service.

JonKnowsNothing • February 28, 2020 4:22 PM

If you are concerned about someone scanning your emails and attachments best to return to snail mail and get off the web.

It's not just Google that can do this, pretty much anyone can do it. Certainly any LEO worth a grain of salt is doing it. Your email is appreciated in pretty much every country on the planet. If it's not a LEO, it's an ISP, if not an ISP, it's any 3rd party provider of services.

MHayden said (paraphrased), it wasn't illegal because "no human" looked at the metadata.

fwiw: Luddites are becoming more common now. Not only is the internet insecure, it's also incredibly expensive. This is a side aspect, but lots of folks cannot log on, cannot connect, cannot update, cannot send, text or video because their gear is too old, too out of date, dead drivers, obsolete parts.

Snail mail is looking like a bargain.

.55 USD per letter vs $60-100 (plus more $$) just for a connection. You can send 100+ letters for the cost of paying-for-your-surveillance.

However, note that nearly every postal service is highly computerized too. Every item is scanned front back and sides and logged. Parcels and letters can be opened too (old school steamers). It's not as convenient to steam open stuff but there's a great photo of the NSA steaming open a CISCO Telecom Router to install their "custom app".

Clive Robinson • February 28, 2020 10:31 PM

@ ALL,

I was once a major user of Email in 7bit ASCII no attachments allowed. Just a few terse comments that mostly were not even really meta-data, certainly no data though.

I used other methods for moving data and most often it was encrypted regardless of if it needed to be (get into good habits by never breaking them).

Few remember the "Ollie North and Fawn Hall" show, that was once so engaging. And I guess few ever get taught these days the big lesson of why you should not use Email at all for anything even remotely confidential (store and forward nodes have backups).

In my personal life I don't do,

1, Email.
2, Social media.
3, Secure messaging apps.
4, Any security app on a comms end point.
5, Have cookies enabled.
6, Have Javascript enabled.
7, Use HTML5.

People will no doubt find out the hard way why these are all most ill advised and I've said why in the past.

I've also got beyond the point of saying "secrecy" it's become an unclean word to use, especially as "Privacy" is realistically the actual requirement and most people understand it in an everyday way.

It is quite possible to communicate electronically with good "Privacy" if you wish to, all it takes is a little thought and learning.

But until people are prepared to do that little amount of thought and learning and instead follow the "sheep press" route they will either have no privacy or what little privacy they have taken away from them.

The advent of sufficient computing power to make AI a "sufficient filter" should be a big big red flag waving in every thinking person's sight.

What stops universal surveillance is one thing and one thing only "resources". Limited resources of an adversary means they have to prioritize which introduces a floor on their abilities. If you operate below that floor then you are off their general surveillance scope. Keep your activities circumspect and don't have contact with those who attract attention then your privacy is not likely to be intruded upon.

The problem with the "sheep press" route is it's an "industrial process" that treats all sheep the same. Thus avoiding that well traveled route is your first step to privacy retained.

Another advantage of not going through that "sheep press" is that you don't have to run at the same speed as the other sheep. Which means you can use older methods older hardware some going back to the late 1980's to get things done.

Not only does that save you money, it's also way more difficult for others to attack you, unless they go to a lot of trouble to specifically target you.

Oh and for all its quaintness old hardware was comparatively way more expensive in its day, thus hardware of the 1990's tends to be quite a bit more robust thus keep working.

It's up to you, do you want to run with the rest of the sheep and be sheared by machine, or be more like a wild goat, free to go your own way at your own pace, in private?

Number Six • February 29, 2020 4:45 AM

@Clive Robinson (28 février, 10h31) : Just one word (sorry, it's a French one) : merci ! Yours is rather difficult a path (as is the Prisoner's path in the old TV series), but it's the safest one and even more important : the only one worth taking if one really cares about one's own dignity.
We may be sheep controlled by armed shepherds and armed dogs, but we don't have to consent.
Just for the fun of it, yet : Alphonse Daudet wrote a classic (at least here) short story about a goat wanting to go wild while having a benevolent master providing her a safe life (it's "La chèvre de monsieur Seguin" in *Les lettres de mon moulin*). She succeeds in escaping the authority of her master and then enjoys freedom for a time, assuming full responsibility for this. At the end, the newly free animal gets eaten by the wolf. ;-)
Still running a 1997 Pentium II with Novell DOS 7 and Win98SE. ;-)

ilsatyd • February 29, 2020 8:48 AM

@Clive Robinson

> In my personal life I don't do,
>
> 7, Use HTML5.

This webpage is HTML 5.

Tatütata • February 29, 2020 1:32 PM

> Snail mail is looking like a bargain.
>
> .55 USD per letter vs $60-100 (plus more $$) just for a connection. You can send 100+ letters for the cost of paying-for-your-surveillance.

I'm sorry, but I must disappoint you...

Ron Nixon, New York Times, 3 July 2013 : "U.S. Postal Service Logging All Mail for Law Enforcement"

WASHINGTON — Leslie James Pickering noticed something odd in his mail last September: a handwritten card, apparently delivered by mistake, with instructions for postal workers to pay special attention to the letters and packages sent to his home.

“Show all mail to supv” — supervisor — “for copying prior to going out on the street,” read the card. It included Mr. Pickering’s name, address and the type of mail that needed to be monitored. The word “confidential” was highlighted in green.

“It was a bit of a shock to see it,” said Mr. Pickering, who with his wife owns a small bookstore in Buffalo. More than a decade ago, he was a spokesman for the Earth Liberation Front, a radical environmental group labeled eco-terrorists by the Federal Bureau of Investigation. Postal officials subsequently confirmed they were indeed tracking Mr. Pickering’s mail but told him nothing else.

As the world focuses on the high-tech spying of the National Security Agency, the misplaced card offers a rare glimpse inside the seemingly low-tech but prevalent snooping of the United States Postal Service.

Mr. Pickering was targeted by a longtime surveillance system called mail covers, a forerunner of a vastly more expansive effort, the Mail Isolation Control and Tracking program, in which Postal Service computers photograph the exterior of every piece of paper mail that is processed in the United States — about 160 billion pieces last year. It is not known how long the government saves the images.

Together, the two programs show that postal mail is subject to the same kind of scrutiny that the National Security Agency has given to telephone calls and e-mail.

Mail sorting equipment has long been taking images of items and analyzing them. Storing this info, and compiling metadata from it, once a major undertaking, has become perfectly trivial, especially in view of the decline of the volume of sorted mail. This is now even offered as a service.

Just a few weeks before that NYT item, cartoonist Jeff Danziger propagated the same misconception on the privacy of the US postal system.

Clive Robinson • February 29, 2020 4:34 PM

@ ,

> This webpage is HTML 5.

"Compatible" like most web pages are. But the browser I'm using is most certainly not.

The thing is HTML5 contains earlier HTML standards and tags for backwards compatibility. Thus you can generate a page that is functional with an HTML3 browser even though the server dishing it out might claim it's HTML5.

I have come across some pages that do use HTML5 in what are inadvisable and probably illegal[1] ways. Most of the ill advised extensions in HTML5 that cause such issues exist for the benefit of just a handful of major corporations like Google that effectively blackmailed them in... The W3C effectively just bent over, and a certain person just mumbled inaudible objections at best.

The big issue though is HTML is now a complete mess if not nightmare and it needs to be canned pronto not further extended in hapless ways. If you were designing a sensible and robust protocol these days HTML would be a long long way from the starting point.

[1] Disability legislation lays a legal requirement on all ICT service suppliers to have correctly functioning products for disabled people be it sight, hearing, or physical interaction.

JonKnowsNothing • February 29, 2020 6:22 PM

@Tatütata

re: U.S. Postal Service Logging All Mail for Law Enforcement

You might have missed where I indicated this:

> Every item is scanned front back and sides and logged. Parcels and letters can be opened too (old school steamers).

But some clarification about "paying-for-your-surveillance" is in order.

When the USPS scans and opens your stuff for LEOs, in theory they require a warrant but these days, that isn't certain. They likely pass their scan logs to the NSA which parses it and passes it around the other Federal LEOs, State and Locals with or without warrants. The "what you don't know, you can't complain about" sort of warrant.

When it comes to paying for this surveillance at the USPS, it is done by various funding mechanisms including stamps. The USPS is always short on income.

When using an ISP, connections start at @$60 USD and rise upwards. Depending on speeds, services and type of carriers, it is not uncommon for people to spend $400+ USD per month.

So, we do pay for the surveillance. The choice is "55 cents per view" or "$400 per month for their unlimited 7/24/365 access".

Theoretically, folks like FB, Google, Amazon et al, do not get copies of the USPS scans but given their deeper connections to the USGov, maybe now they do.

ht tps://en.wikipedia.org/wiki/United_States_Postal_Service#Law_enforcement_agencies

The United States Postal Inspection Service (USPIS) is one of the oldest law enforcement agencies in the U.S. Founded by Benjamin Franklin, its mission is to protect the Postal Service, its employees, and its customers from crime and protect the nation's mail system from criminal misuse.
(url fractured to prevent autorun)

gordo • February 29, 2020 6:56 PM

People can sign up to preview their USPS-delivered mail before it arrives:

Informed Delivery® by USPS®

Digitally preview your mail and manage your packages scheduled to arrive soon! Informed Delivery allows you to view greyscale images of the exterior, address side of letter-sized mailpieces and track packages in one convenient location.*


* Images are only provided for letter-sized mailpieces that are processed through USPS' automated equipment

Sign Up for Free

ht tps://informeddelivery.usps.com/box/pages/intro/start.action#/

Givon Zirkind • March 1, 2020 5:40 AM

My 2c:

Anyone who manages multiple email accounts will notice that there is a sudden flurry of the same spam/scam, phishing message. What I don't hear Google is doing, which a large email provider like Google could do (or even small sysadmins), is to detect a burst of the same message (content and origin) sent to an inordinately large number of users. While this requires some AI & categorization too, it isn't hard. It's what AI is all about.

Ex. There may be millions of emails a day whose body is "LOL". Not spam. But, there is also a flurry of emails that appear, "To My Gracious Dear...My husband died in the former regime...millions of dollars in a bank account...to transfer to your bank account..." on or about the same day.

All the spam filters I have seen, target the content of an individual user, which definitely has value and is predictive. However, when the same message appears in thousands of accounts, the probability of it being spam is much greater. Or, in more technical terms, the prediction is higher.
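The cross-account comparison described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not how Gmail or any real filter works: the function names, the `(sender, recipient, body)` tuple shape, and both thresholds are hypothetical, and exact hashing of normalized text stands in for whatever fuzzy matching a real system would need.

```python
import hashlib
import re
from collections import defaultdict


def normalize(body: str) -> str:
    """Lowercase and collapse whitespace so trivial variations hash alike."""
    return re.sub(r"\s+", " ", body.strip().lower())


def fingerprint(body: str) -> str:
    """Stable fingerprint of a message body (exact match after normalizing)."""
    return hashlib.sha256(normalize(body).encode("utf-8")).hexdigest()


def find_bursts(messages, min_recipients=1000, max_senders=5):
    """Return fingerprints of bodies that reached many accounts from few origins.

    messages: iterable of (sender, recipient, body) tuples (hypothetical shape).
    The thresholds are illustrative assumptions, not tuned values.
    """
    recipients = defaultdict(set)
    senders = defaultdict(set)
    for sender, recipient, body in messages:
        fp = fingerprint(body)
        recipients[fp].add(recipient)
        senders[fp].add(sender)
    return {
        fp
        for fp in recipients
        if len(recipients[fp]) >= min_recipients and len(senders[fp]) <= max_senders
    }
```

Counting distinct senders is what separates the "LOL" case from the scam letter: "LOL" reaches many inboxes but from many unrelated origins, while the burst comes from a handful.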

I don't know how Google's spam filter works. Somehow, I get the feeling this isn't being done. But, IMHO, this should be part of their algorithm.

JonKnowsNothing • March 1, 2020 10:12 AM

@Givon Zirkind
re:

> compare a known burst of the same message (content and origin) to an inordinate large amount of user

Many ISPs do this. If you have a "mailing list" or "newsletter" and this newsletter is sent to a subscriber's list of "many" (~100), and your send function pumps too fast you will find your newsletter blocked and locked.

It's tricky to get some of the send functions to work without getting a lock. Timing, frequency and size of batches. For a small newsletter (~3,000) this is a nightmare. Sometimes you have to queue the send up and have it run over a number of days. If all you can send is 100 per hour and you have 3,000 that's a 30 hour send. If your newsletter has coupons or time-limited information on sales the info is stale fast.
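The arithmetic behind that 30 hour figure can be checked with a small sketch. The function names and the one-batch-per-hour model are assumptions made for illustration; real providers apply their own, usually unpublished, limits.

```python
import math


def send_duration_hours(subscribers: int, per_hour_limit: int) -> int:
    """Hours needed to reach every subscriber at a fixed hourly cap."""
    return math.ceil(subscribers / per_hour_limit)


def batch_schedule(subscribers: int, per_hour_limit: int):
    """Yield (hour_offset, start_index, end_index) for each hourly batch."""
    for hour, start in enumerate(range(0, subscribers, per_hour_limit)):
        yield hour, start, min(start + per_hour_limit, subscribers)


# The case from the comment: 3,000 subscribers at 100 per hour.
hours = send_duration_hours(3000, 100)  # 30
```

Under this model the last subscriber on the list gets the newsletter 29 hours after the first, which is why time-limited offers go stale.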

Big senders run their own systems and have their own pipes. They are not restricted.

One problem in the USA, is that some aspects of the content are "free speech" meaning you cannot block it legally. We are in a current election cycle in the USA and a recent MSM report on the throttling of subscribed political emails indicated in some cases 60% were being rejected by the carriers, way before they get to the inbox.

The key difference between spam and not-spam is "subscribed". afaik, ISPs have no way to tell if the item is "subscribed" or "faked subscribed".

JonKnowsNothing • March 1, 2020 10:32 AM

@Clive @All

re: HTML 5
disclosure: My own equipment is so old, even spiders refuse to build webs in it.

When HTML 5 was being reviewed by MSM Tech Sites, there wasn't anything in it I wanted. I don't want "auto running video" with no controls. I don't need a DRM system that parses my hard drive either.

One thing I would like to remove permanently are EMOJIs. All of them from every device. They are the worst for spam ridden text replacement spear phishing attacks in emails and text msgs. They are a complete menace.

I block as many as I can and use TEXT ONLY, but you know, the spammers find a way to send me a lot of cute emoji icons in different fonts with a variety of character encodings purporting to be from Google, Apple, Amazon, EBay, PayPal, Banks and Lost Princes.

ht tps://en.wikipedia.org/wiki/Emoji
(url fractured to prevent autorun)

Clive Robinson • March 1, 2020 1:39 PM

@ JonKnowsNothing,

> One thing I would like to remove permanently are EMOJIs. All of them from every device.

Unfortunately according to some they have "Become a language in their own right"... So much so that London's SOAS[1] has had lectures on them...

My personal feeling is that when an industry insider creates a day for it[2] then it's almost the same as "Hallmark Greetings Cards" starting "Nurses day" so they could create a "new market" to push product in. Or worse yet as an exercise in "job creation" and "self promotion".

Unfortunately "language" or not, it's claimed that something like a half billion or so messages get sent with them every day...

So who to blame... Well you could say it's the fault of the UN ITU, which actually predates the UN for decisions made a little over a century ago. But it actually goes back nearly two hundred years to the "Cooke and Wheatstone Telegraph" from around the 1830s. It was the first telegraph system to be put into commercial service. Unfortunately it was a "needle telegraph" which required multiple wires in parallel which made it expensive to operate and had issues with insulation, cross talk and putting wires in lead pipes and burying them. It was invented by English "serial tinkerer and entrepreneur" William Cooke who sought advice from the master of magnetism and electric forces Michael Faraday, and later the assistance of the academic and code developer Charles Wheatstone. Another of its failings was that it had "a limited non extensible alphabet", the five needles and thus six wires only allowed for twenty symbols[3]. Whilst many saw this as a limitation, Charles Wheatstone knew from his other work it was not.

That is two symbols sent in a known sequence can form the side indices of a square thus just six different values for those two symbols gives you thirty six "pigeon holes" in your square sufficient for all the letters and numerals in common use. There is some indication that such a square was originally intended but the limitations of the electromechanics of the time gave the five needles in a row arrangement.
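The square idea can be sketched in a few lines. This is purely illustrative: the order in which the 36 characters are laid into the grid is an assumption made here, not a reconstruction of any layout Wheatstone used.

```python
import string

# 26 letters + 10 digits = 36 symbols, one per cell of a 6 x 6 square.
# The layout (A-Z then 0-9, filled row by row) is an assumption for this sketch.
SYMBOLS = string.ascii_uppercase + string.digits


def encode(text: str):
    """Turn each character into a (row, column) pair, each value in 0..5."""
    pairs = []
    for ch in text.upper():
        idx = SYMBOLS.index(ch)  # raises ValueError for characters outside the square
        pairs.append(divmod(idx, 6))
    return pairs


def decode(pairs) -> str:
    """Rebuild the text from (row, column) pairs."""
    return "".join(SYMBOLS[row * 6 + col] for row, col in pairs)
```

Each character costs two transmitted symbols instead of one, but the sender only ever needs six distinct signal values rather than thirty six.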

Others such as Samuel Morse and Émile Baudot extended the idea of using codes in sequences to convey information down single wires. Whilst Baudot's code had fixed length it allowed for extending the idea of symbols in serial sequences to increase the size of an alphabet, hence the "letters" and "figures shift" keys to change alphabets.
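A toy decoder shows how a shift code lets one fixed-length code serve two alphabets. The code values and the two tiny tables below are invented for this sketch and are not the real Baudot or ITA2 assignments.

```python
# Hypothetical code points: the same value means a letter or a figure
# depending on which shift was seen last.
LETTERS = {1: "A", 2: "B", 3: "C"}
FIGURES = {1: "1", 2: "2", 3: "3"}
SHIFT_TO_LETTERS = 30  # invented value, stands in for the "letters shift" key
SHIFT_TO_FIGURES = 31  # invented value, stands in for the "figures shift" key


def decode(codes) -> str:
    """Decode a code sequence, switching tables whenever a shift code appears."""
    table = LETTERS  # assume the line starts in letters mode
    out = []
    for code in codes:
        if code == SHIFT_TO_LETTERS:
            table = LETTERS
        elif code == SHIFT_TO_FIGURES:
            table = FIGURES
        else:
            out.append(table[code])
    return "".join(out)


message = decode([1, 2, SHIFT_TO_FIGURES, 1, 2, SHIFT_TO_LETTERS, 3])  # "AB12C"
```

The price of the trick is state: a single corrupted shift code garbles everything up to the next shift, which is one reason later codes moved to larger fixed alphabets.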

The problem that arose was that the Roman alphabet, now called the Latin alphabet, is particular to European languages. Thus many languages could not be sent via ITU or later ANSI or CCITT transmission codes. Thus the process of "Romanization" of languages began, where a language with non-Roman/Latin symbols either had "glyph" replacements on keyboard and print head or, if the alphabet was larger, had each native glyph replaced by a sequence of two or more keys.

Thus the likes of Chinese, Japanese, and other oriental languages where the glyphs represent words or parts of words and thus alphabets of two thousand or more glyphs are used became "Romanised" to the point where quite a few people could remember them and type them in.

Thus it is unsurprising that the far east where Romanization is an everyday practice would give rise to emojis.

Thus Emojis are the inheritors of the ideas about transmitting information down single wires by serialisation and the extension into dual alphabets for letters and figures, then Romanization codes to the runaway madness of UTF sequences which has arisen like a boil on the face of the transfer of electronic information...

Something tells me that much like you and I would gladly see the back of emojis they have like so many "faux markets" become a fact of life, and that we would thus be considered by emoji aficionados as "luddites" and any attempt we make to have our preferences be given acknowledgment as the act of "throwing a clog[4] in the wheels of progress".


[1] SOAS was originally called the "School of Oriental and African Studies" when formed back in 1916 as one of the last "vestiges of Empire", it was a place to "inform Government" via Civil Servants etc. It has since expanded its areas of study including the Middle East and some of the "-stans". It now forms part of a conglomeration of universities and places of Higher Education under the title of "University of London",

https://www.soas.ac.uk/about/

[2] Yes there is a "World Emoji Day" and yes it was created by an industry insider who calls himself the premier Emoji expert and Emoji Historian and has set up an independent "-pedia" you just know that there is something wrong,

https://en.m.wikipedia.org/wiki/World_Emoji_Day

[3] You can see an original two needle six wire telegraph in the ground floor hall of the London Science museum behind the hall on rockets. For all its importance to mankind as the first practical method of sending information electronically, and also the first that did not require the user to learn any codes, nearly everyone fails to notice it. Often they ignore it for the nearby solid silver scale model of the Forth Rail Bridge (that was also a first) due I guess to its bullion value.

[4] Whilst in English we would say "putting the boot in" which has much wider meaning. The French word for a peasant wooden shoe is "sabot" hence the word "sabotage" arose from peasant weavers and the like. Which is not that unconnected from computing via the work of French inventor Basile Bouchon, and later improvements that gave rise to the Jacquard Loom,

https://www.computerhope.com/jargon/j/jacquard-loom.htm

