Schneier on Security
A blog covering security and security technology.
August 30, 2005
Unintended Information Revelation
Here's a new Internet data-mining research program with a cool name: Unintended Information Revelation:
Existing search engines process individual documents based on the number of times a key word appears in a single document, but UIR constructs a concept chain graph used to search for the best path connecting two ideas within a multitude of documents.
To develop the method, researchers used the chapters of the 9/11 Commission Report to establish concept ontologies – lists of terms of interest in the specific domains relevant to the researchers: aviation, security and anti-terrorism issues.
"A concept chain graph will show you what's common between two seemingly unconnected things," said Srihari. "With regular searches, the input is a set of key words, the search produces a ranked list of documents, any one of which could satisfy the query.
"UIR, on the other hand, is a composite query, not a keyword query. It is designed to find the best path, the best chain of associations between two or more ideas. It returns to you an evidence trail that says, 'This is how these pieces are connected.'"
The hope is to develop the core algorithms exposing veiled paths through documents generated by different individuals or organisations.
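The article doesn't publish the algorithms, but the concept-chain-graph idea described above can be sketched with a toy example: link ontology terms that co-occur in the same document, then search for the shortest chain of associations between two concepts. Everything below — the sample documents, the term names, and the use of a plain breadth-first search as the path finder — is an illustrative assumption of mine, not the researchers' actual method.

```python
from collections import defaultdict, deque

def build_concept_graph(documents, ontology):
    """Link two ontology concepts whenever they co-occur in a document."""
    graph = defaultdict(set)
    for doc in documents:
        terms = [t for t in ontology if t in doc]
        for i, a in enumerate(terms):
            for b in terms[i + 1:]:
                graph[a].add(b)
                graph[b].add(a)
    return graph

def concept_chain(graph, start, goal):
    """BFS for the shortest chain of associations between two concepts.

    Returns the list of concepts forming the 'evidence trail', or None
    if the two concepts cannot be connected through any document.
    """
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph[path[-1]] - seen:
            seen.add(nxt)
            queue.append(path + [nxt])
    return None

# Hypothetical documents, each reduced to the set of terms it mentions.
docs = [
    {"flight school", "visa"},
    {"visa", "wire transfer"},
    {"wire transfer", "charity front"},
]
ontology = {"flight school", "visa", "wire transfer", "charity front"}

g = build_concept_graph(docs, ontology)
print(concept_chain(g, "flight school", "charity front"))
# -> ['flight school', 'visa', 'wire transfer', 'charity front']
```

No single document connects "flight school" to "charity front", but the chain of co-occurrences does — which is exactly the kind of veiled path the researchers describe, and also exactly where false positives can creep in: every edge is mere co-occurrence, not a verified relationship.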
I'm a big fan of research, and I'm glad to see it being done. But I hope there is a lot of discussion and debate before we deploy something like this. I want to be convinced that the false positives don't make it useless as an intelligence-gathering tool.
Posted on August 30, 2005 at 12:53 PM
I wonder if this is at all conceptually related to google sets? Given, the trails and underlying documents aren't shown.
It's just "six degrees of information".... :)
"snippets of information – that by themselves appear to be innocent – may be linked together to reveal highly sensitive data"
I see a theme here, among the snippets of information, as to what the authorities have on their minds:
"U.S. authorities remain concerned because, as one official said, even seemingly innocuous information, when pulled together from various sources, can yield useful intelligence to an adversary...."
I remember an early speech by Clinton on this subject, about the need for better information synthesis, since the availability of data may exceed our individual analytic abilities when we need to act quickly.
But that does not seem to be the case, if you follow the post-9/11 reports. Instead the gaps were in procedure and basic familiarity with policy. Perhaps the most famous was this case:
"After phoning the local FBI office four times, a flight instructor finally reached the right FBI agent, relayed the suspicions on Moussaoui and bluntly warned, 'Do you realize that a 747 loaded with fuel can be used as a bomb?' [...] Minnesota-based FBI agents notified the CIA and the FBI liaison in Paris, seeking further information; French intelligence sources reported that Moussaoui was 'a known terrorist who had been on their watch list for three years.'"
If Unintended Information Revelation is a post-9/11 reaction, I am curious whether it would have helped.
As far as I can tell the problem was apparently rooted in the fact that the FBI misinterpreted the law and did not agree with the French definition of terrorism ("Chechen rebels were not a recognized terrorist group under US law at that time").
So the problem in this case was NOT that too much or too little information was available, but that the agency simply suffered from a lack of resources, which led to unfamiliarity with the law and shoddy analysis of a wealth of information that should have been sufficient to arrest the terrorist(s)...
Maybe I'm misunderstanding the concept here, but I don't see how "false positives" would matter. In fact, I don't see how false positives are even possible. You're specifying the two things you want connected, and this just comes up with some way to connect them. It's not a guarantee that it will be a solid connection, it just gives you direction.
This is mosaic analysis.
The problem isn't really (or just) false positives.
The problem is that when people assemble all the little bits of information into a picture, they tend to fit them into whatever image was in their heads to begin with.
As a Star Trek fan I immediately related this idea to the episode "The Voyager Conspiracy". The synopsis for that episode is at http://www.startrek.com/startrek/view/library/...
Basically, Seven of Nine, a former Borg drone, analyses connecting threads of data and derives patterns from the threads. These patterns are initially extremely useful; however, they are extended until, in the end, Seven begins to draw incorrect conclusions based on assumed intentions.
I pretty much echo Damien Hamwood. For example, if a sheet of paper is copied many times (the sheets are indistinguishable from each other, to make it easier), torn, ripped, cut, etc. many times, and random pieces are picked up with the rest thrown out, you may get enough to recreate the original document. But you have to be careful that you have a good random sampling, or else your reconstruction is no better than a million monkeys at a million typewriters hacking away randomly for a million years. This should be kept as a tool, not a yardstick.
There's a fairly technical paper on this available at http://www.cedar.buffalo.edu/info_revelation/ under the "Publications" link. I haven't had time to do more than skim it yet, but it looks like the "core algorithms" are well developed and ready for implementation.
By "false positive" do you mean a situation where this method of analysis generates connections and links where none in fact exist?
Hmm... Let me clarify: what I mean is, for example, when patterns are found in data that are not actual or intentional patterns, but merely an accident of a random configuration. I hope this makes sense.
If this can be avoided (not sure how) then this methodology will potentially be an extremely powerful research tool.
This seems like it would rapidly become the magic 8-ball of conspiracy theories.
"Now that's a fascinating site"
Indeed. I continuously find that the data I get back is wholly dependent on the quality of the taxonomy I feed the site, which supports Cheburashka's point above. It reminds me of the witch hunts in the US not too long ago that branded Robert Oppenheimer (no relation) and Albert Einstein as government subversives (http://www.amnh.org/exhibitions/einstein/global/mccarthy.php).
It also reminds me of the "associations" used to profile activists in the South during the 1960s.
Now that I think about it, how crazy is it that Oppenheimer was stripped of his security clearance based only on some very loose affiliations, while Rove (who has been accused of an actual, documented act of treason) does not have to give up his clearance? But I digress...
All this really proves is that during the next US witch hunt, citizens need to be strategic (intentional) about all information if they care about the authorities finding the proper scent, instead of letting investigators select some random smell. In other words, you can choose either the hard road of complete anonymity or you can take the ages-old path of active participation in areas that you ultimately hope will define you by association.
The remaining issue, therefore, is who gets to control the taxonomy of good/bad that drives this "revelation" machine.
This smells like a normal clustering algorithm. It may be useful to know that site ABC and site DEF contain information about 'horses', even though DEF calls them 'h0rses' for some reason (based on correlations between ABC and DEF). However, I believe that the most common use of search engines is to find information about a specific topic. That's accomplished with the current technology. I don't really care about what other sites may be related to the site I've found, because I don't really care about what site I've found-- I care about a set of information found within that site.
I suppose this could improve accuracy of results by filtering out outlying sites (perhaps defeating sites that include a bunch of keywords to skew the search results).
Confucius say, what you seek, that you will find.
Schneier.com is a personal website. Opinions expressed are not necessarily those of BT.