Natural Language Shellcode

In this paper we revisit the assumption that shellcode need be fundamentally different in structure than non-executable data. Specifically, we elucidate how one can use natural language generation techniques to produce shellcode that is superficially similar to English prose. We argue that this new development poses significant challenges for inline payloadbased inspection (and emulation) as a defensive measure, and also highlights the need for designing more efficient techniques for preventing shellcode injection attacks altogether.

Tags: academic papers, camouflage, computer security, concealment, filtering, steganography

Posted on March 25, 2010 at 7:16 AM • 27 Comments

Comments

GreenSquirrel • March 25, 2010 8:00 AM

Interesting if hard (for me anyway) to follow at times.

I am unconvinced by their conclusion though. It doesnt seem to “conclude” anything from the material before it it just states an introduction.

However, my (weak, as mentioned) understanding of the paper supports the idea that there are significant challenges for inline inspection and there is a need for more efficient prevention techniques.

I just wish it had been written a touch better.

Clive Robinson • March 25, 2010 8:25 AM

“Specifically, we elucidate how one can use natural language generation techniques to produce shellcode that is superficially similar to English prose.”

Great throw in a bit of nuro linquistic programing and phone a “droid” up to do your bidding 😉

Now you know why wispering “sweet nothings” in your significant other’s ear is so important…

Could this be just one step short of hypnosis for computers 😉

BF Skinner • March 25, 2010 8:29 AM

Infect computers with the same fuzzy and lack of critical thinking humans display? Only to have to do it very very fast? Poor things.

Reminds me of Vonneguts (think it was Vonnegut) Robot Monk (a believing machine) who was sat down and plugged into the TV.

Arancaytar • March 25, 2010 8:41 AM

This looks like a fascinating subject. I have to find out some more about natural language generation.

“Vonneguts (think it was Vonnegut) Robot Monk (a believing machine)”

Douglas Adams, actually – Dirk Gently’s Holistic Detective Agency.

Gweihir • March 25, 2010 9:23 AM

Read this papers some time ago. Really impressive.

They write dual-interpretation code, which is both executable assembly (x86) and sort-of english language.

Gives the threat of executable code in non-executable containers a whole new dimension.

clvrmnky • March 25, 2010 9:48 AM

As long as their generated code is more natural than their own prose, I’ll be satisfied.

Ross Snider • March 25, 2010 10:03 AM

This is one of those things where you slap your head and ask, “Why didn’t I think of this (I certainly had 10 years to do so)?”

That’s brilliant. They did a really good job.

peri • March 25, 2010 10:12 AM

I have to agree with Ross Snider that I also had a “why didn’t I think of this” moment as well. From other comments I got the impression nobody is really paying attention to this so I thought I would summarize so you don’t miss out on a great paper.

In this article the authors are searching for useful executable IA32 machine code that is also statistically similar to plain English.

Figure 9 shows that the string “There is a major” disassembles into:

push %esp
push $20657265
imul %esi,20(%ebx),$616D2061
push $6F
jb short $22

I haven’t had time to fully read the paper yet but the authors cleverly use jumps by finding runs ending in jump codes which allows them quite a bit of space to fill in the skipped regions more easily with statistically plausible English.

moo • March 25, 2010 10:29 AM

I can see why non-programmers might have difficulty reading this paper, but I found it pretty clear. The bottom line is that complete shellcode can be encoded in ways that are statistically identical to natural english prose, and thus escape detection by any technique that relies on statistical analysis of byte or instruction frequencies etc. Also, in contrast with less advanced ASCII encodings (e.g. alphanum ones) a shellcode encoded like this might also escape notice by humans who give it a passing glance.

Consider e.g. a blogspam post, which is semantically nonsense but at least as readable as a speech by Clintov/Gingov, and which can actually be directly executed as x86 code for nefarious effect, and can not be detected by any specific marker or pattern, or by the usual statistical methods.

Gweihir • March 25, 2010 11:16 AM

It is a good paper, and yes, I had the “Why did I not think of this” moment as well.

It will be difficult to understand without some assembler knowledge, but that is the same with most low-level malware papers, as a lot of malware works on assembler level.

Joe Buck • March 25, 2010 12:23 PM

This reminds me to a joke “virus email” that told people to first mail a copy of the message to all their friends, then run a command that would remove all of their files.

paul • March 25, 2010 12:42 PM

So this is, in a sense, the inverse of the work from the early/mid 90s on mimic ciphers which showed it was possible to make ciphertexts that were statistically indistinguishable from english text, and to use limited vocabularies that even made the text readable.

bob • March 25, 2010 1:29 PM

@Joe Buck

It’s the “honor system virus”. See the Parodies heading in this article:

http://en.wikipedia.org/wiki/Virus_hoax

The shellcode article reminds me of work I once did making on a Forth system, where it mistakenly executed the word name due to a bad pointer. It took me a long time to find that bug, because executing the ASCII was doing something non-fatal and then continuing execution by returning to the caller.

Dennis Gamayunov • March 25, 2010 1:46 PM

Nice, but:
1) No loop support. Shellcode polymorphism and ecryption, bye-bye.
2) Antispam engines handle English text classification into malicious/benign quite successfully, especially in end-point mail filtering solutions.

I wonder what kinds of shellcode could be generated using UTF-8 simplified Chinese encoding.

Clive Robinson • March 25, 2010 1:55 PM

For those saying “why didn’t I think of that” well…

It is a consiquence of having redundancy in the underlying instruction set.

This issue is one that occures with CISC systems but considerably less with RISC systems.

It would be interesting for example to see if the same can be achived with the Sun SPARC architecture for instance.

Also it works because of the iX86 architecture (von Neumann) it would be considerably more difficult (bordering on impossible) to do on a strict Harvard architecture.

With regards the “statistical properties” unfortunatly whilst it may be statisticaly the same as English at the charecter and word level it does produce “stilted prose” which would be subject to statistical analysis of the type used to find Stenography (which in effect is what this is). A version of which was done by Bacon back in the time of William Shakespeare (in fact see the “tales of Riverbank” to see an interesting connection)

All that being said they have done what is a very very hard step, which is to take what is theoreticaly possible and turn it into usable everyday fact… so much qudos for that.

However I suspect they are being a little coy on it’s potential, as it can also be used for a number of other things (think control channel and code for updating a bot net for instance).

And for those having trouble reading the paper, I sympathise, they have tried to cram a lot in to a few pages, whilst keeping a lot of detail.

It would be better written as the chapter of a book at between 20-40 pages going into things in the same depth but a little more gently.

Nathan Tuggy • March 25, 2010 2:09 PM

@Dennis Gamayunov:
The authors address this in section 4.2, with the self-modifying decoder (hint: arbitrary instructions are generated by patching). Section 5 also notes that the payload is polymorphically encoded.

Dennis Gamayunov • March 25, 2010 2:49 PM

@Nathan Tuggy:
Oh, indeed, sorry for the first remark.

Ross Snider • March 25, 2010 6:04 PM

@Clive Robinson

‘For those saying “why didn’t I think of that” well…

It is a consequence of having redundancy in the underlying instruction set.’

Well this certainly isn’t why I didn’t think of it. I didn’t think of it because it’s very creative.

I figure updating a botnet is a pretty solved task. Not sure how much you’d get out of encoding all your opcodes into quasi-English first…

I could not find the “tales of Riverbank” you were referring to after a short bout of Googling (unless you were talking about a children’s cartoon?).

Aside:
A curious bit of information you mentioned. I didn’t have any clue x86 was related to Von Neumann (other than the obvious theoretical underpinnings that would apply equally well to his contemporaries). Can you speak more about this?

Ross Snider • March 25, 2010 6:04 PM

@Clive Robinson

‘For those saying “why didn’t I think of that” well…

It is a consequence of having redundancy in the underlying instruction set.’

Well this certainly isn’t why I didn’t think of it. I didn’t think of it because it’s very creative.

I figure updating a botnet is a pretty solved task. Not sure how much you’d get out of encoding all your opcodes into quasi-English first…

I could not find the “tales of Riverbank” you were referring to a short bout of Googling (unless you were talking about a children’s cartoon?).

Ross Snider • March 25, 2010 6:11 PM

@Clive again

Oh. That was stupid. You just meant Von Neumann architecture. Got ahead of myself.

Jay • March 25, 2010 7:24 PM

@Clive: It’s a bit of a moot point, whether this would work on a strict Harvard architecture. There aren’t any – at least outside of very small embedded systems (like the PIC or AVR).

They’re barely compatible with C (have you seen AVR C? Pointers and pointers to ROM constants are incompatible types!), definitely incompatible with any sort of general OS, and doomed to stay in the niche they’re in.

Oh, and they’re not completely secure either – Return-Oriented Programming did for that…

peri • March 25, 2010 8:01 PM

I have now had a chance to really read the paper and it seems I was missing a few steps: initialization, unpacking, decoding and automation. I am even more impressed now.

I am a bit disappointed because I feel like English is overkill. Base64 encoded data is similarly considered non-executable, simpler to both decode and encode by orders of magnitude and it has about the same number of codes as English with punctuation. I have a feeling somebody can write a single line (76 bytes compared to 2054 bytes) self decoding base64 decoder.

peri • March 25, 2010 8:49 PM

@Clive Robinson

I think discovering this works in any other instruction set would be only mildly interesting: it should only take an afternoon to plan a base64 version in whatever machine codes you are familiar with.

Your Harvard architecture comment strikes me as a red herring because this paper had nothing to say about actually getting self-modifying shellcode running. It was only addressing the belief that shellcode detection is a viable strategy.

Clive Robinson • March 26, 2010 2:05 AM

@ Ross Snider,

“Well this certainly isn’t why I didn’t think of it. I didn’t think of it because it’s very creative.”

Ok, my first comment on redundancy, was perhaps a little to short (I’ll come back to it later). And as I said they have done a rather niffty job of turning the theoretical to the practical.

“I figure updating a botnet is a pretty solved task. Not sure how much you’d get out of encoding all your opcodes into quasi-English first…”

It is very far from a solved problem, the explanation is a bit long so I don’t mind if you don’t read it…

Put simply the “obvious botnets” are those that are seen due to the network traffic they create.

The solution to botnets currently is once a botnet is seen the control channel is identified and attacked and so the bot net remains but gets no commands so is kind of out of action (but this is not a reliable process as was demonstrated to the MWG when the botnet operator got control of part of the botnet back).

So the next step for botnet operators is to go from their “obvious botnet” used to do “DoS/Spam” to “covert” botnets used for information gathering and targeted “insider attacks”.

We have already seen a modified version of the ZeuS botnet being used (badly) for intel gathering in that it was designed to go for .mil & .gov hosts and slurp up any user viewable documents and PDF’s. It was seen due to being detected by a small number of AV systems and the resulting network traffic of it sending the documents etc out.

ZeuS’s designer has just recently added “shell capability” to the available code base so “insider attacks” can be done. One consiquence of which is where the attacker gets your computer to transfer your money into their buffer account. As it’s your computer with your credentials on it etc it’s as though they have sat and typed at your keyboard. I wish you good luck trying to convince a bank it was not you… And moving from the obvious to the less obvious but even more interesting tricks such as manipulating your share trades which I suspect we will hear more about in the next few months (then think on a 1/2year after that and what you could do with a pump and dump stock if you could actually make a 1/2 million people buy 100USD of it).

That is botnet operators are starting to think and get inventive on how to better capitalize on “their investment” in the botnet…

As a result they will become more covert, and the way to get at them will be via the control channel. However what if the control channel is not done via one or more hosts that can be taken off line?

I’ve already described how to do this “decoupled control channel” using open blog pages and a search engine like Google.

So the question about how to deal with bot nets will move to the infection vector. Now depending on who you belive Adobe PDF’s have overtaken all MS file formats for a primary infection vector.

This system potentialy opens up any “text source” into an “attack vector” so put a little thought into how it might be used via RSS feeds or IRC or NewsNet or Blog web pages or potentialy even “tweets”…

Back to your other points,

‘I could not find the “tales of Riverbank” you were referring to’

Sorry “Riverbank” is where a lab was in the early part of the last century. The owner of which had a fixation for “Bible codes” in the works of Shakespeare and hired a group of people to “prove his idea”. One of whom was William Friedman, who was employed for other work. But he rather fancied a young lady Elizabeth who was working on the “shakespeare codes”.

The result was they got married and William in his late twenties published a series of reports (Riverbank Publication 22 being most notable). And the lab was more cryptographicaly capable than anything the US Gov had at the time.

Friedman is recognised as taking cryptography from art to science and is the father of the “index of coincidence” or “chi test” method. David Khan has written several articals and a book or two about William Friedman. Saddly Elizabeth who was quite outstanding in her own right kind of got totaly eclipsed by her husband.

William ended up as an administrator and got embroiled with the “politics of cryptography” and many “ills” have been laid at his door, the truth of which is not likley to be resolved.

Try googling [“William friedman” “Riverbank publication”]

Clive Robinson • March 26, 2010 5:25 AM

@ Jay,

With respect to “strict Harvard” architectures and your comments,

“They’re barely compatible with C…”

True but from the security asspect that is an advantage 😉

Although I don’t subscribe to the “C is the root of all Malware” viewpoint, I would chearfully excuse anyone who did think it based on oh 25-35 years of Malware (yup it all started back in 1972).

“definitely incompatible with any sort of general OS,”

Yes and no. Yes if you mean the likes of MS OS’s and *nix OS’s but not true of other OS’s.

The main issue is to do with “loading into memory” the von Neumann architecture more closely resembles the Turing “one tape” machine which is the true theoretical ancestor of all silicon computers that we use today.

However people incorectly make the assumption that a Harvard architecture programe is “hard coded” into the “program memory”. This is not actually true (well it is on many hardware platforms but that is a CPU specific design not architecture issue).

“and doomed to stay in the niche they’re in.”

I rather hope not. The reason the Harvard architecture was depreciated is a little historic in nature and goes back to when 1K bit of RAM cost you 20USD or more. It was not cost effective to have two sets of RAM (prog & data) in early machines. The nature of “C” is intimately tied to the von Neumann combined memory architecture as that was the “accepted/available architecture” back when K&R where doing their thing…

With regard to,

“Oh, and they’re not completely secure either – Return-Oriented Programming did for that…”

It is a complicated subject and involves a bit of a fiddle, that can be avoided with appropriate coding practices (however that in turn brings up other issues…).

For those that are going huh?

Have a look at the original paper,

http://arxiv.org/pdf/0901.3482v1

And “a gentle guide, that was started just a few days ago,

http://blog.zynamics.com/2010/03/12/a-gentle-introduction-to-return-oriented-programming/

RH • March 29, 2010 1:45 PM

This reminds me of the virus test vector (I wont repeat it here, for fear of falsely triggering your antivirus softwares). The basic goal was to declare SOME string of bits to be a virus, agreed upon by all major virus scanners. That way you could test settings without bringing in an actual malicious virus.

The vector is all printable characters, and is both human readable and executable on a x86 computer. This is the next logical step of hiding the data not only as printable characters, but as words.

I wonder if there’s some formalism of this: unless you know the payload, noise and are indistinguishable, or something along those lines?

misinformed • December 5, 2012 4:19 PM

Im a bit lost though; to use the already stated example:

[////////
Figure 9 shows that the string “There is a major” disassembles into:

push %esp
push $20657265
imul %esi,20(%ebx),$616D2061
push $6F
jb short $22
////////]

Do they only use linear sweep or recursive traversal on the opcodes [There is a major]?

Natural Language Shellcode

Comments

Leave a comment Cancel reply