Schneier on Security
A blog covering security and security technology.
September 1, 2009
Hacking Swine Flu
So how many bits are in this instance of H1N1? The raw number of bits, by my count, is 26,022; the actual number of coding bits is approximately 25,054 -- I say approximately because the virus does the equivalent of self-modifying code to create two proteins out of a single gene in some places (pretty interesting stuff actually), so it’s hard to say what counts as code and what counts as incidental non-executing NOP sleds that are required for self-modifying code.
So it takes about 25 kilobits -- 3.2 kbytes -- of data to code for a virus that has a non-trivial chance of killing a human. This is more efficient than a computer virus, such as MyDoom, which rings in at around 22 kbytes.
It’s humbling that I could be killed by 3.2 kbytes of genetic data. Then again, with 850 Mbytes of data in my genome, there’s bound to be an exploit or two.
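The arithmetic behind those figures is easy to check. The sketch below only re-derives the sizes from the bit counts quoted above, assuming 8 bits per byte and 1 kbyte = 1000 bytes; the constant and function names are made up for illustration.

```python
# Figures quoted in the post above; nothing here is measured independently.
RAW_BITS = 26_022        # raw bit count for this H1N1 instance
CODING_BITS = 25_054     # approximate count of coding bits
MYDOOM_KBYTES = 22       # quoted size of MyDoom


def bits_to_kbytes(bits: int) -> float:
    """Convert bits to kilobytes, assuming 8 bits/byte and 1 kbyte = 1000 bytes."""
    return bits / 8 / 1000


raw_kb = bits_to_kbytes(RAW_BITS)
coding_kb = bits_to_kbytes(CODING_BITS)

print(f"raw:    {raw_kb:.2f} kbytes")
print(f"coding: {coding_kb:.2f} kbytes")
print(f"MyDoom is about {MYDOOM_KBYTES / raw_kb:.1f}x the raw size")
```

Under these assumptions the raw count works out to roughly 3.25 kbytes and the coding count to roughly 3.13 kbytes, so the "3.2 kbytes" figure sits closer to the raw count than to the 25 kilobits of coding sequence.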
Posted on September 1, 2009 at 1:13 PM
• 50 Comments
Darn, my chance to be the first to comment, and I have nothing to say.
Of course, I am sitting here battling this damn flu as we speak/type, so I could be hallucinating the whole thing.
Actually, 22KB is too much for a virus - there has been stuff as small as ~1500 bytes (the Slammer worm, of which the actual virus was about 400B and the rest was padding and the overflow).
Still, 3.2KB is pretty good, even though it has one simple effect (killing people, not mind-controlling them, for example).
This from Andrew "bunnie" Huang's personal blog. He can hack an Xbox AND swine flu! He's full of Awesome!
"Still, 3.2KB is pretty good, even though it has one simple effect (killing people, not mind-controlling them, for example)."
Killing people is a bug, not a feature. The purpose of the virus is to replicate itself enough to remain in the wild indefinitely. Killing people does not help it with that -- in fact it just attracted the particular attention of an intelligent well-motivated opponent.
If the virus had included code to shrug and call it a day when it finds itself in a host who's at risk of dying from the flu, it would have been much more likely to spread unseen and uncared-for through the population and achieve species survival.
This is a very interesting way of looking at the flu- or any organic virus.
But can you reverse hack the H1N1 to have it uninfect a victim?
As Cmos said: the killing is a stupid feature because:
1) People will know about you -> they will try to make drug against you.
2) The most powerful illness is just normal flu... or maybe some kind of virus that we didn't find yet.
"The purpose of the virus is to replicate itself enough to remain in the wild indefinitely."
If it takes one mutation to generate H5N1, and, by the math in the document, one in tens of thousands of viruses gets the mutation, we'd see a lot more H5N1.
Which begs the question of what prevents it from happening. Perhaps an H5N1 located in the throat area simply has a hard time finding a suitable place to reproduce amidst a full blown H1N1 infection since it really prefers a posh pad in the deeper parts of our lungs? H5N1 would actually get selected against?
If this is the case, then let's give a round of applause for the human body's design - isolating the throat from the lungs with mucus and cilia!
The article really should say that it takes no more than 25 kilobits. This just gives an upper bound, not a minimum.
"The purpose of the virus is to replicate itself enough to remain in the wild indefinitely."
Read "The Selfish Gene" by Richard Dawkins. If you don't believe in that kind of thing, then check Genesis 1:20-25.
Your calculation is likely to be off by a couple of orders of magnitude. You can't graft straight information theory onto a primary sequence because of the self-editing properties of RNA (among other things). The nucleic acid is not only an information courier, it's also an enzyme that modifies itself. (Look up Altman and Cech, if you're interested in the biochemistry.)
To use a programming metaphor, the code in viruses, including RNA viruses such as the Orthomyxoviridae, is self-unrolling code; as a result, the information contained is substantially greater than a straight transcription of the primary sequence.
The same is true for almost any other nucleic acid sequence, including the human genome.
Bunnie has hit on several of the wonderful things about viruses: They include self mutating elements that accelerate their evolution. It's like Darwinism only turbocharged.
They have been around longer than us, and evolve faster than us. Sobering.
"Then again, with 850 Mbytes of data in my genome, there’s bound to be an exploit or two."
Then again, unlike some systems that ship with well more than 1GB of code, at least humans ship with some anti-virus code, and our bundled immune system is quite a lot better than anything Norton or Symantec sell as an add-on!
@Scott: How can RNA have more information than information theory predicts? You seem to be claiming that it's the analog of a compressed file rather than an executable, but how would that change the information payload?
Anyone want to start a code review on the human genome? Kind of rough reviewing code that hasn't had a code review in 2-4 billion years of cellular evolution!
To really compare information vs representation...
You should compare the gzip or 7zip of the sequence (or something similar) to the gzip of all the source code in the program. This would be more comparable, in that it would reduce some of the overestimation of complexity for both.
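A rough way to try this idea is to run a sequence through a general-purpose compressor and compare the result with the naive two-bits-per-base count. The sketch below uses Python's `zlib` (DEFLATE, the same family as gzip) on two made-up sequences, not on real H1N1 data, and the helper name `compressed_bits` is only illustrative. Compressed size is an upper bound on information content, not an exact measure.

```python
import random
import zlib


def compressed_bits(data: bytes) -> int:
    """Size in bits of the zlib-compressed input (an upper-bound estimate)."""
    return 8 * len(zlib.compress(data, 9))


# Two hypothetical 10,000-base sequences, for illustration only.
rng = random.Random(0)
random_seq = "".join(rng.choice("ACGT") for _ in range(10_000))
repetitive_seq = "ACGT" * 2_500

for name, seq in [("random", random_seq), ("repetitive", repetitive_seq)]:
    naive = 2 * len(seq)  # two bits per base
    estimate = compressed_bits(seq.encode("ascii"))
    print(f"{name}: naive {naive} bits, zlib estimate {estimate} bits")
```

A highly repetitive sequence compresses to far fewer bits than the naive count, while a random one does not compress below it. To apply the same yardstick to a program, the same function could be pointed at its source or binary.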
This isn't exactly a fair comparison, as computer viruses aren't using nearly the entire code space (i.e. could be seriously compressed if there was a built-in mechanism -- as there is for RNA -- for decompressing on the other side).
But mostly it exposes architectural differences between bodies and computers. The cells that reproduce biological viruses don't (generally) have strong recognition/rejection methods: you latch onto a receptor and in you go. Computer viruses would probably be built differently (and possibly smaller) if the internet depended entirely on probabilistic detection of malware in transit.
yeah, but how many mutations before H1N1 gives birth to H1Z1, the zombie virus? :)
With regard to the "killing the host" problem you need to consider where the particular virus came from (the clue is in the name).
Bird and swine flu are mutations from another species; it is a bit like porting code from one computer architecture to another, you expect a few bugs in translation.
The important thing is that in neither case does it kill all hosts; it can therefore mutate slightly in each host until a (sub)optimal mutation that works is found. Further mutation will then optimise the mutation by force of numbers, not by any kind of choice.
Each year we see a different flu virus take prevalence, which is why sometimes your flu shot works and sometimes it does not. This is because we humans have to second-guess probability (take a gamble) on which virus is going to have the best chance in the sweepstake this coming year, and like betting on horses we do it "on form" and "lineage".
For those looking for the "ultimate" virus etc, you first have to say what your criteria are...
Because it is difficult to make a claim about anything unless you know what you are measuring and how and for what duration.
For instance, from an infective agent's point of view, being of use to a host would appear to be a way to ensure its own future. But would it?
How about an infective agent that doubled the human life span?
Initially it would appear to be a good idea, until you start thinking about the consequences and outcomes it would have on the hosts.
But have a look at the (not so) common cold and smallpox, for ideas.
Oh and then consider what is less than a virus but still infects humans, and just how small they can get.
Another problem with the comparison of the size of a biological virus and a computer virus is that the environments are entirely different. Many things needed for the virus are found in the cell it infects. Computer viruses need to carry a lot more functions themselves, I think.
In the right environment a virus can be as simple as "copy this and pass it on".
"But can you reverse hack the H1N1 to have it uninfect a victim?"
In theory yes with another virus.
However there are a number of problems.
Not least of which is what actually kills you...
In some cases the virus provokes a reaction that causes your body to attack itself, as was seen in the 1918 flu.
And it is this, not the virus, that kills you, so by the time you know you have the virus it would be too late to unwind what it had done with another virus.
The usual sign of this type of virus is the death rate in "economically productive" people (fit adults of working age) being greater than in the "non economically productive" people (children and the old and in some cases the infirm).
This obviously has significant social implications that far outlive the virus and can last five or more human generations.
>> So how many bits are in this instance of H1N1
So much stupidity in a single sentence. Living things don't operate on the concept of bits or information theory. This is the very reason why no computer, no matter how powerful, can correctly model even the simple behavior of a living creature. Not even insects.
An attempt to measure "how many bits there are in DNA" is as stupid as an attempt to measure voltage in meters.
An amino acid is not a "finite automaton" or state machine. We don't know (yet) how to predictably program those things (that's what genetic research is for). And results of attempts to apply CS principles to model living creatures are miserable failures in the best case.
"And results of attempts to apply CS principles to model living creatures are miserable failures in the best case."
I think you are muddling up the effects of complexity with determinism.
It is quite possible for something that is purely deterministic in nature (and therefore can be modeled mathematically), when grouped with others, to exhibit complex behaviour states that are effectively beyond calculation within any finite time period.
An example of such is weather forecasting: the mathematics involved is actually fairly simple for any given point at any given time. The problem is making the mathematics of a point at a known point in time with known inputs work across a volume of many points into the "unknown" future, with each point providing input to each other point.
We are currently at the point where we can predict the weather within a short time period with a great deal of accuracy. However it becomes more approximate the further we move into the future, for good reason.
There are two other things to consider that most definitely limit what can be determined from even a deterministic system: these are "noise" and "sensitivity".
The definition of noise is such that it is a random input to a deterministic system. A small number of systems are designed to multiply the noise for various reasons; most systems however are designed to be as insensitive to noise as possible.
However when you have multiple dependent systems acting in a network even though designed to be insensitive to noise, the noise can quickly mount up and swamp the predictable behaviour of the system with time.
The usual engineering solution to this is to use error correcting codes and adaptive behaviour to anomalies in expected behaviour.
We are finding that most biological systems have the same sort of safeguards built in against random mutations that exceed certain limits.
As "computer science" is the foundation of the use of computers, the programming of which is perhaps the most complex thing man currently does, it is perhaps not unnatural for people to take the concepts that are known from one branch of science to another. This process has been going on since before Renaissance man was first recognised.
Somebody above mentioned using gzip compression to get a better idea of the real information content of the genome. That’s lossless compression. How much further could you get with lossy compression?
How to define lossy compression on a genome? Start with single-nucleotide polymorphisms (SNPs). Some variations seem to be quite commonplace and harmless: where one individual has an A, someone else might have, say, a C. So out of the four possible base pairs, if only one is valid, then that represents 2 bits, if 2 can occur, then you only need 1 bit at that position, if 3 possibilities can happen, that brings it down to about 0.4 of a bit, and if all 4 are acceptable, then that’s 0 bits.
Of course, there will probably be correlations between variations in different base-pair positions. But this is amenable to standard probability analysis, which straightforwardly converts to an information measure in the usual way.
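The per-position numbers above follow from one formula: if k of the four bases are acceptable at a position, it takes log2(4/k) bits to pin that position down. The sketch below assumes the acceptable bases are fully interchangeable and ignores the correlations just mentioned. The function name and the example list of positions are hypothetical.

```python
import math


def bits_at_position(allowed_bases: int) -> float:
    """Bits needed for one position when `allowed_bases` of the four
    bases are interchangeable: log2(4 / allowed_bases)."""
    if not 1 <= allowed_bases <= 4:
        raise ValueError("allowed_bases must be between 1 and 4")
    return math.log2(4 / allowed_bases)


# Hypothetical stretch of six positions: number of acceptable bases at each.
tolerated = [1, 1, 2, 4, 3, 1]
total = sum(bits_at_position(k) for k in tolerated)
print(f"lossless: {2 * len(tolerated)} bits, lossy estimate: {total:.2f} bits")
```

With three acceptable bases the formula gives about 0.415 bits, which matches the "about 0.4 of a bit" figure above.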
You are completely wrong on all counts; you are hardly in a position to accuse others of stupidity.
What possible measure could you use for DNA/RNA other than number of bits or some comparable measure (such as codons)?
Are you claiming that actual DNA or RNA contains something other than what we can find through gene sequencing?
DNA/RNA is pretty much pure information. How they are interpreted is indeed fairly complicated, but we know much more about how the process works than you appear to assume. The main difficulty in simulating biological processes on a computer is the absolutely massive parallelism occurring in a living organism.
I have a problem with this "number of bits" stuff. While it may be accurate in a way; it can be very misleading. If a virus is a short bit of computer code, then it gets the OS libraries to do a hell of a lot of work.
@Bernie " gets the OS libraries to do a hell of a lot of work"
Human viruses also exploit the way our organisms work for their own benefit, although this comparison may not be entirely appropriate.
> even though it has one simple effect
> (killing people,
You watch too much TV. Swine flu, like any other flu, usually doesn't even make a person seriously ill, much less kill them. Not one infection in a thousand results in death. A much more common result is relatively minor annoyance: some nausea, some aches, maybe a bit of vertigo, and a couple of days of lost work. You know, the flu.
> People will know about you ->
> they will try to make drug against you.
This is a flu virus we're talking about. The medical community has absolutely no idea where to even get started in creating a drug against it. They just treat symptoms and wait until the person's immune system gets a handle on the situation.
>> What possible measure could you use for DNA/RNA other than number of bits or some comparable measure
So, before the concept of "voltage" was developed, was it OK to measure the power of electricity in meters? The answer, of course, is no.
The DNA of an insect has roughly the same amount of information as yours. This doesn't make it possible for a fly to put nonsense on the internet.
This implies that there's "something other". Maybe the process of gene sequencing is not enough, don't you think?
Your notion about deterministic and random systems simply doesn't apply here. Living creatures are not exactly stochastic systems as defined in theory.
Weather (for a short period) can be _modeled_ as such a system in the same sense that gravity theory can model your body's movement during a walk. With all of computer science, we don't have a robot that can walk as reliably as a 4-year-old child.
The same principles apply at the gene level too. Nobody can say _why_ some of your cells suddenly go nuts and become cancer cells. There are theories. Many theories. None explains all known cases as a proper theory should.
Science yet to explain "how life works". DNA is fun, but it's not enough. You can't build it "from scratch". You can't create DNA of simplest organism from stone, oxygen and water.
When you succeed - we will return to conversation about bits.
Darwinian selection means the "purpose" of a virus is to reproduce enough to survive so far - it says nothing about "indefinitely"...
"It’s humbling that I could be killed by 3.2 kbytes of genetic data."
Um...how many bits/bytes of data are in NaOH? Enough of that in one place will take your body apart quite nicely.... :-)
If you reduce it to a subdanowin nanocycle, the reduction can actually exceed the ternial bit rate, in terms of genomic participles. So you're down to bit deviations responsible for the infection. It becomes remarkable that our bodies last as long as they do when you think of it in those terms.
"You can't create DNA of simplest organism from stone, oxygen and water."
Maybe / maybe not I'm not up with the latest research on the subject. But you can create amino acids with little more than "comet gas" and an electric arc.
It is an experiment that has been performed a number of times, and can be done by kids at home if they want to.
The experiment showed that the building blocks of life could be made by just the chemicals and processes you would expect in pre-life Earth and on even very small interplanetary/stellar bodies such as comets.
We have in the lab made artificial genes, and we have spliced genes from various species into other species, and we have produced clones.
We have also produced self replicating chemical structures.
So I'm actually not sure what point you are trying to argue.
If it is "we have not yet made life" then as far as I'm aware you are currently correct, partly because we really don't have a good argument as to what constitutes life.
Will we be able to in the future? I suspect that yes, we will be able to develop the technology. But "will we create life?" I suspect that question is more one of ethics than technology.
1) Unlike a computer virus, influenza and other natural viruses pack some of their own hardware, such as RNA polymerases and other enzymes. How many bits are those worth?
2) Computer viruses are exploiting all the information packed into both the hardware and software already installed.
3) I would argue that the information packing of living systems is orders of magnitude more efficient than in a computer. The most sophisticated computer cannot assemble itself from a box of parts, let alone make the parts from raw materials, and all that information is right there in the genome (and the germ cells).
4) For a wonderful education on viruses, go to:
It is wonderfully fun to listen to the podcast.
The comments criticizing the amount of information being measured in nucleic acids are accurate. Nucleic acids (DNA/RNA) encode the information in a sequential order. The unit of information is the codon (three letters) of the four letter nucleic acid alphabet. But this is an oversimplification. It was pointed out that two genes could overlap. Alternatively, the RNA transcript could be edited several different ways by cellular machinery (post-transcriptional RNA editing, splicing, insertion of uracil or cytosine bases within the transcript, capping, adding a tail, etc.). Then, you have translation of the message by the ribosomes, post-translational modification of the protein (cleavage of the protein, export to the cellular membrane, addition of sugars and lipids to specific amino acid residues). There are even proteins that may help the protein to fold correctly.
Up to half of the amino acid sequence of a protein is filler. The sequences are there to correctly align the amino acids that make up the critical residues necessary for catalysis or modification by a group such as a phosphate. While analogous to NOP sequences, they still may carry information such as whether the region is water loving or water hating. There are critical portions of the protein that can not be mutated or function is crippled or lost. Structure and function may be restored by a compensating mutation at another part of the sequence that changes an amino acid that interacts with the first mutated amino acid. The problem is that proteins and other cellular machinery are fuzzy systems. There are levels of redundancy and processing at multiple stages. Information can be added and removed by the cell itself, by a viral protein, or both. We haven't even taken into account antiviral measures by the cell, or the immune system's response to infection.
The closest analogy via computer science is a program and its compiler. How much information is encoded within the compiler which converts the source code into an executable program? How many header files and which libraries are necessary for the program to compile? How much information is within those files? How much information is encoded within the hardware that that program executes on? So, the estimate is an extreme lower bound, and is not likely accurate.
@Yosi "When you succeed - we will return to conversation about bits."
@Paul: "The cells that reproduce biological viruses don't (generally) have strong recognition/rejection methods: you latch onto a receptor and in you go."
I thought this was the case until I read http://en.wikipedia.org/wiki/... . Utterly ELEGANT method of infection. The body recognizes that the thing latching on is a virus (probably due to something simple like the analog of a very loud port scan) and pinches off a portion of its cell wall around the virus, turning it into a lysosome. The cell then proceeds to happily dose whatever's in the lysosome with acid until it denatures. HA acts like a sort of grappling hook, unfolding halfway through the digestion process and latching onto the lysosome wall. As digestion continues, it reels in the hook until it merges the membranes of the lysosome and its own body, spilling out its genes.
By metaphor, it found a privilege escalation bug in the smart firewall, relying on port scanning to ensure it gets "caught." As Ian Malcolm said in Jurassic Park, "Nature finds a way."
Obviously our system call library is far too rich.
When I personally talk about the bits, I'm counting each pair as two bits (four possible values), and counting stretches of RNA or DNA that have no obvious use. The creation of larger and more complicated-looking structures in the cell is analogous to compressing a program or image on a computer. Interfacing with cell structures is analogous to using already installed programs and OSes. It doesn't take much compressed source code to do some very complicated things. On the other hand, the result can't have more information, in a very specific sense, than the input. It can be transformed, and can be very impressive, but it doesn't carry more information. I can write a program to print out pi to any desired accuracy, but I can't do it for any arbitrary real number (the number of possible programs, and hence deterministic outcomes, is countable, and the set of reals isn't).
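To make the two-bits-per-base counting concrete, here is a minimal sketch of packing a sequence four bases to the byte and unpacking it again. The particular mapping of bases to bit patterns is an arbitrary choice for illustration, and the example sequence is made up.

```python
# Arbitrary 2-bit code for the four RNA bases (an assumption for illustration).
BASES = "ACGU"
CODE = {base: value for value, base in enumerate(BASES)}


def pack(seq: str) -> bytes:
    """Pack a base sequence into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        byte <<= 2 * (4 - len(chunk))  # zero-pad a final partial byte
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, n_bases: int) -> str:
    """Recover the first `n_bases` bases from packed bytes."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 0b11])
    return "".join(bases[:n_bases])


seq = "AUGGCCAUU"  # hypothetical 9-base fragment
packed = pack(seq)
print(len(seq), "bases ->", len(packed), "bytes")
assert unpack(packed, len(seq)) == seq
```

The packing is lossless and reversible, which is the sense in which the sequence itself carries two bits per base. What the cell later does with it is a separate question of interpretation, as with any compressed program.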
I also believe I have reason to consider this without actually being able to build a gene from raw materials, in much the same way that I consider it useful to use and discuss physical conservation laws while not being able to scratch-build a universe.
I'm currently doing research in evolutionary genetics (specifically phylogenetic methods.)
Two bits per base-pair is a very sensible measure of the information in a virus. The only way you'd get more information is in 'epigenetic factors', such as if the virus carried a prion along with it. Epigenetic factors are rare, and wouldn't add many bits anyhow. (I haven't heard of any with viruses, but I am not active in viruses or epigenetics, so this doesn't carry much weight.)
Someone mentioned multiple genes from the same sequence. This does happen, but we've accounted for this information by counting our information at the DNA (or RNA) level rather than the amino acid sequence level.
Two bits per base pair is already somewhat conservative: if the virus has a strong composition bias (AT rich or GC rich) then the information content goes down a bit.
Basically, if you have 10000 bases to play with, you can only make 2^(2*10000) possible viruses: i.e. 20000 bits of information. (Of course, nearly all those 'possible viruses' won't reproduce.)
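The composition-bias point can be illustrated with the Shannon entropy of the single-base frequencies. The sketch below is a zeroth-order estimate only: it looks at how often each base occurs and ignores any correlation between neighbouring positions. The two example sequences are invented, not real viral data.

```python
import math
from collections import Counter


def entropy_bits_per_base(seq: str) -> float:
    """Shannon entropy of the base frequencies, in bits per base."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Hypothetical sequences: one with uniform composition, one AT-rich.
uniform_seq = "ACGT" * 100
at_rich_seq = "A" * 40 + "T" * 40 + "C" * 10 + "G" * 10

for name, seq in [("uniform", uniform_seq), ("AT-rich", at_rich_seq)]:
    h = entropy_bits_per_base(seq)
    print(f"{name}: {h:.3f} bits/base, about {h * len(seq):.0f} bits total")
```

With uniform composition the estimate is exactly 2 bits per base, the same as the 2^(2*N) counting argument. With 80% A/T it drops to about 1.72 bits per base, which is the "goes down a bit" effect described above.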
TS: Thank you. I just ordered it from Amazon.
"When you succeed - we will return to conversation about bits."
Which leaves us with what in the interim? Should we be talking about magic elves?
We know, with a high degree of certainty, that DNA and its interactions with cellular machinery account for the heritable variation we observe in biological organisms. We know how the "digital" structure of DNA varies, and we have repeatable processes that allow us to observe that variation of structure. To wit, two different scientists, with different apparatus, can examine the genetic material of a given organism and come up with the same sequence of "A"s, "C"s, "G"s, and "T"s. Does anyone reading Schneier not accept this?
If the critique is specific to the precise amount of information conveyed in the DNA message and one is worried about what RNA and the rest of the cell actually does with that message, one is confusing the message with the method of encoding. If Alice and Bob agree on the meaning of a 1-bit message beforehand, then one bit is all it takes to differentiate between "stop by the store and pick up sugar" and "get some fuel for the car", or between any other pair of complex alternatives. We still say the message contains one bit.
Perhaps one accepts that since DNA is a method for recording information, that we measure information in bits, and that therefore bits may measure DNA, but one wishes to observe that "nature vs. nurture" is a false dichotomy? Well, OK, we can agree on that. Yes the cell is an amazing place where occur wondrous things. So is the processor core of a computer. You might say that the cell reads more into DNA than is written there. When a computer plays an mp3, isn't the same thing happening? We still don't say that a 1 MB mp3 file is "really" 5 MB.
"It’s humbling that I could be killed by 3.2 kbytes of genetic data."
Actually, it's about as interesting as the perspective of looking at our bodies in terms of info security.
I guess we're either a giant program, or a giant OS.
Voltage is nothing more than the electrical potential difference. It takes no physicist to measure it in meters.
It's not particularly surprising to me that a flu virus is more efficient than MyDoom - after all, MyDoom has only gone through a coupla dozen generations of mutations. Intelligently designed mutations, it's true, but I'll plump for the coupla *trillion* generations of random mutations and natural selection over the selection efforts of a few spotty teenagers any day.
Viral genomes are astonishingly compact, making clever use of multiple reading frames to maximise information content. HIV-1 is described by a sequence of only 10^4 base pairs, which equates to ~2.5kB. Shorter than the average piece of unfiltered spam and yet responsible for about 2 million deaths annually.
I've seen lots of creative methods like this of breaking down swine flu. On another blog, someone even made a song using the calculations in the breakdown.
Interesting, that we can do this..