Schneier on Security
A blog covering security and security technology.
January 5, 2006
Data Mining and Amazon Wishlists
Data Mining 101: Finding Subversives with Amazon Wishlists.
Now, imagine the false alarms and abuses that are possible if you have lots more data, and lots more computers to slice and dice it.
Of course, there are applications where this sort of data mining makes a whole lot of sense. But finding terrorists isn't one of them. It's a needle-in-a-haystack problem, and piling on more hay doesn't help matters much.
Posted on January 5, 2006 at 6:15 AM
• 37 Comments
Another reminder to not put data out on a public database if you don't want people snooping it.
If I'm truly subversive, the kind the government really wants to keep an eye on, not just some college freshman who took his first poli-sci course, I probably wouldn't be using Amazon. And I certainly wouldn't keep the things on a wishlist.
This may not be a useful approach for finding actual terrorists and preventing actual terrorist attacks. Then again, who is actually trying to do that? The numerous feel-good security plans that are derided here suggest that actual solutions aren't even on the table.
The beauty of institutionalized security theatre is that an approach like this could be used to find subversives that can be paraded as potential terrorists. The appearance of addressing the problem is political capital and has the side effect of making people more fearful of being interested in politically incorrect topics.
When all you have is a terrorist smashing hammer, everyone's a terrorist.
A good example of how casting too wide a net can go wrong, and on the need to watch for / remove false positives from your result set (for example, the map showing "you" had me over 1,600 miles to the west of my actual location.)
Again, Bruce, if this sort of operation doesn't make sense when looking for terrorists (and I am not necessarily disagreeing with you here), then what does? You feel that intelligence is where we need to expend more of our anti-terror resources (and I also agree that we need to spend the right amount doing the right thing) -- but what constitutes the right thing?
I would *love* to see an essay from you on this topic.
Well, to those who wonder why this data is public - duh, it's a wishlist - making it private would defeat its purpose.
I took that article to be satire, but maybe that's because I saw it on boing!boing! first, and assume most posts there of this vein are satire.
Amazon's wishlist data is indeed public but the guy should not really have been able to pull down amazonian forests of data to play with. Try this trick with google: they have software which looks for excessive downloads and presents one of those "type the following letters" challenges. Amazon's data is just too open. Luckily it's not really confidential but my point is that even so it should be guarded that little bit better.
@ Joe Buck
"making it private would defeat its purpose."
Amazon claims you can make your list private, and "invite" (for lack of a better word) people to view it.
"Well, to those who wonder why this data is public - duh, it's a wishlist - making it private would defeat its purpose."
There are degrees of public and private. I would very much like my friends to see my wishlist, but I don't want strangers to see it. And I certainly don't want the government to see it.
@ Ed T.
> Again, Bruce, if this sort of operation doesn't make sense when
> looking for terrorists (and I am not necessarily disagreeing with you
> here), then what does?
Let me throw some words in Bruce's mouth (btw, Bruce, I agree with Ed -> this would make a great essay topic).
Data mining is good for producing statistical trends (i.e., following mob activity). If you want to analyze a population to make predictions with regard to trends, data mining is a good way to do it. You can project (for example) how many copies of the NYT bestseller list you'll sell to your customer base in the two weeks before Christmas. If you have good data to mine and a reasonably sized customer base, you probably won't be too far off the mark.
Data mining is next to useless in predicting individual behavior. Look at FBI profiling, for example. A serial killer is very likely to have certain characteristics in common with other serial killers (white, male, 18-35, etc.). This information is really only useful if you *already* have a list of suspects. With 10 suspects, you can compare them to the profile, eliminate 10-N of the suspects, and focus your resources on the N that fit the profile. (Note that this doesn't mean you ignore the other 10-N suspects). Using the profile to compile a list of suspects is next to useless, unless you want to start investigating 40% of the adult population.
Bruce's analogy of a needle in a haystack is spot-on -> in this type of data mining, you're approaching the problem precisely ass-backwards. Instead of saying, "We're not interested in pieces of hay longer than 3'" and sorting them out to make a smaller pile to examine, you're gathering the biggest pile of hay you can get your hands on, and making some undoubtedly wild guesses as to what sort of hay might be bad hay.
You're unlikely to find terrorists by sifting data on a computer. You're much more likely to find terrorists by putting more human agents into the field, bribing suspected terrorists to roll over, etc.
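The profiling arithmetic in the comment above can be made concrete. This is a minimal sketch, assuming the comment's 40% match rate and its 10-person suspect list; the figure of 200 million adults is an invented round number used only for illustration, and `profile_matches` is a hypothetical helper, not anything from the thread.

```python
def profile_matches(pool_size: int, match_rate: float) -> int:
    """Expected number of people in a pool who fit the profile."""
    return round(pool_size * match_rate)


MATCH_RATE = 0.40                # the comment's "40% of the adult population"
SUSPECT_LIST = 10                # the comment's pre-existing list of suspects
ADULT_POPULATION = 200_000_000   # assumed round number, illustration only

shortlist = profile_matches(SUSPECT_LIST, MATCH_RATE)
dragnet = profile_matches(ADULT_POPULATION, MATCH_RATE)

print(f"Profile applied to {SUSPECT_LIST} suspects leaves {shortlist} to prioritize")
print(f"Profile applied to the population leaves {dragnet:,} to investigate")
```

The same filter that trims a short list to a handful of names produces a list of tens of millions when it is used to generate suspects from scratch.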
In addition to online DB data mining in the UK, there have been persistent rumours that the UK Government wants unrestricted access to databases held for "Store Loyalty cards", credit cards and other "what you bought" DBs, so that people can be profiled by their purchases etc.
I really cannot see how this will help with anti-terrorist activities (as their accounts are generally short lived), but for tax collection, health care and other things it is going to be really useful...
So in a couple of years you could go to your GP (doctor) to talk about a treatment you need just to get told "I'm sorry but we see you used to buy a lot of beer / candy bars / red meat / etc. You must agree to give them up for at least six months or else you cannot have the treatment you so urgently and desperately need".
Before you think this is a bit of fantasy thinking, please remember that in the UK there has been a recent outcry over people being told they cannot have urgent operations because they smoke, and in the US it has been known for a store to use its loyalty DB to show that a customer who had had an accident in store used to buy a lot of beer.
wider publication of this article could be the death knell for wishlists right there. i don't understand the reason for amazon wishlists. as a regular, frequent patron of both online and bricks-and-mortar bookstores, if i want a specific book, i can get it for myself much faster than i could expect some other swinging dick to look at my list and grant my wish.
of course government and enterprise are going to mine any available data, that's what they've been doing since the pleistocene. far too much commentary is wasted on nugatory themes like "the fbi shouldn't be doing this" instead of the effective remedy, which is educating each unit of the populace not to disclose information unnecessarily, particularly on the internet. the young'uns are shameless in this regard, uploading all kinds of personal information to the various myspaces, blogging their personal lives as if a prospective employer five years down the pike might be favorably impressed that they had a one night stand with a...a trombone player. those that fail to reach enlightenment the easy way, through instruction, risk reaching it the hard way, through experience.
The 'You' for me netted: "You are not located in the United States and this system cannot map you."
Apparently Los Angeles County seceded from the Union and I missed it on the news.
On a similar train of thought about mining: a few days ago, I was wondering what if someone mines del.icio.us (social bookmark site) data. Couldn't this give a clue for building an attack vector? As an example, if someone bookmarked his/her online banking URL.
Amazon lets you add items to your wish list (or did recently) without logging in. Thus if you were on a public terminal using Amazon, I can add d___-growing and b___-making books to your wishlist.
Data mining is just a tool toward a goal. What is not clear is the methodology behind that goal.
Are there any measurable parameters for success or failure of that methodology?
It is surprising and disconcerting that these issues are not being discussed in public forums.
In the absence of a coherent process you end up with fishing expeditions which can be used to "show" that something is being done. Note that most of the foreign individuals who were arrested in the US after 9/11 were deported for visa issues, not for terrorist activities.
One major question is do we know what we are looking for? Will we recognize it if we come across it?
While I don't disagree with several of the points you raised, I do have a comment on the following:
"... You're much more likely to find terrorists by putting more human agents into the field, bribing suspected terrorists to roll over, etc."
Unfortunately, I am not sure this is so. Terrorists are similar to some of the urban gangs in this respect, that it is extremely difficult to infiltrate their ranks, because they tend to be very closed societies (some of the gangs that originated outside the USA are even more so) and suspicious of 'outsiders'. Since they are fanatics, bribing them is also unlikely to succeed -- how many $$$ do we have as a bounty for UBL, and how long has this bounty been out there?
If I recall, we also had an abysmal record penetrating the Axis powers with human agents during WWII -- with the notable exception of the German espionage service, which rolled almost to a man. Penetrating any sort of totalitarian organization, especially when it has a functional internal security organization, is highly problematic.
While data mining isn't really useful for predicting individual actions, as you noted it is useful for following trends. And, selective data mining (for example, closer examination of significant interest in the security arrangements of specific ports of entry, or the air defense systems of a country, or weather patterns combined with interest in certain types of biological and/or chemical agents) may well be able to allow us to 'connect the dots' -- something we aren't all that good at (or at least we didn't demonstrate proficiency at prior to Sept 11.)
Data mining, like interrogations, is but one tool available -- and it is not the best tool in every instance.
Every good TSA agent knows you can only find terrorists by strip searching old ladies and toddlers before they board an airplane. Your basic terrorist won't have his mother going onto Amazon.com to pick out the Anarchist's Cookbook for him, anyway.
If you don't want your wishlist to be completely public, don't post it in a completely public database.
Wouldn't there be some kind of inverse effect to this -- anyone who's seriously interested in specific topics because they're planning some terrorist activity would try to *hide* that from public view, thus if you find books about bomb-making or whatever on a particular individual's public list you'd use that fact to *reduce* your interest in that individual. I guess you might miss any really naive terrorists by doing that, but there probably aren't too many of those.
This is disturbing. The person posting above with the name "Joe Buck" is not me; I regularly comment on Bruce's blog and I've used that name in net discussions going back to the mid-80s.
This is my actual name, it is not a pseudonym. Who is the other "Joe Buck"? Is that your name also? If it isn't, please choose another handle.
@ Joe Buck(s)
Anonymity is dangerous!
@ Ed T.
I admit I don't have the necessary data to back that statement up, but my educated guess would be that I'm correct.
It is certainly difficult to directly infiltrate certain organizations, but it isn't impossible. Also, if the investigation technique fails, it's not going to lead you to start pursuing a large percentage of false positives, which can be a huge drain on resources.
Whereas this sort of data mining I would imagine is fraught with a high false positive rate. A student buys a copy of the Koran (for his comparative religions class) and a copy of The Anarchist's Cookbook (because he's a teenage male) and we spend resources checking him out?
Data mining would be more likely to be useful as a secondary tool. If you have possible terrorist suspects, gathered through traffic analysis, known associates lists, or other investigative techniques, data mining on those suspects would allow you to set a threat level to each suspect and spend your investigative resources appropriately.
> selective data mining may well be able to allow us to
> 'connect the dots'
When you're looking at a large population of targets, only a tiny percentage of which would classify as a suspect, this assumes that you know which dots to look at prior to an event, which I would imagine is highly unlikely. You're passing into Bruce's "movie plot threat" world, where committees of agents try to come up with possible plots and then try to come up with possible data points that might indicate the possible plot is being put into action, so that they can mine the data for those possible data points.
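The false-positive worry in the comment above is the base-rate problem, and a few lines of arithmetic show its shape. This is a sketch under invented assumptions: one million wishlists, ten genuine targets among them, and a classifier that is right 99% of the time in both directions. None of these numbers come from the thread, and `flagged_breakdown` is a hypothetical helper.

```python
def flagged_breakdown(population, true_targets, hit_rate, false_alarm_rate):
    """Return (true hits, false alarms, share of flags that are real)."""
    true_hits = true_targets * hit_rate
    false_alarms = (population - true_targets) * false_alarm_rate
    precision = true_hits / (true_hits + false_alarms)
    return true_hits, false_alarms, precision


# All four inputs are assumptions chosen for illustration.
true_hits, false_alarms, precision = flagged_breakdown(
    population=1_000_000,
    true_targets=10,
    hit_rate=0.99,
    false_alarm_rate=0.01,
)

print(f"Expected real targets flagged: {true_hits:.1f}")
print(f"Expected innocents flagged:    {false_alarms:.1f}")
print(f"Share of flags that are real:  {precision:.4%}")
```

Under these assumptions roughly one flag in a thousand points at a real target, even though the classifier is 99% accurate. Starting from a pre-existing suspect list, as the comment suggests, shrinks the population term and is what moves that ratio.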
"Well, to those who wonder why this data is public - duh, it's a wishlist - making it private would defeat its purpose."
I hate people like you, who feel the need to pretend that they are superior while attempting (yet, ironically, often failing) to make a point.
Some sites have PRIVATE wishlists. It's just a way to keep track of what you plan to buy in the future.
Other sites let you keep the wishlist selectively private, only allowing certain others to view it.
But the concept of a "wishlist" itself has no direct implication of public or private status.
Quote: "You're much more likely to find terrorists by putting more human agents into the field, bribing suspected terrorists to roll over, etc.
Posted by: Pat Cahalan at January 5, 2006 10:37 AM"
What exactly are you basing this guess on? Just how well have the million-dollar bounties gone down in some of the poorest areas on earth to date?
Human penetration is a very long shot. Banking on religious fanatics rolling over for cash is an even longer one.
Funny. I like people (or should I say letters?) like you.
Here's another odd story about a less technical form of data monitoring.
"An airline passenger with the words 'suicide bomber' written in his journal was arrested when his plane arrived in San Jose, California, on Wednesday, but the words appeared to refer to music and he was later released, officials said."
Apparently someone was peering into his journal and profiling him based on the fact that he had a backpack and was "acting a little suspiciously".
So I have three questions:
1) Was there anything else written next to the words "suicide bomber"? Something like "my favorite band is" or maybe it actually said "Suicidal Tendencies"? Which brings me to my next question:
2) Doesn't every word under the sun have some kind of band or song attached to it? WMD is a new punk band, right? They were a spin-off of Uncle Karl's Yellow Cake if I remember correctly. It's like a cipher for the next generation. Wait, cipher is the name of a band too...
3) And finally, since when is carrying a backpack and being under the influence of drugs or alcohol considered "suspicious" behavior for people headed to California?
"You're much more likely to find terrorists by putting more human agents into the field, bribing suspected terrorists to roll over, etc."
That is, in fact, what history has shown to be the case. The US Special Forces, to pick one example, have proven a fairly well-thought-out method of integration and infiltration of local populations, where the concept of "bribing" can often mean simple gestures of goodwill, fixing roads, building schools, etc. that expedite the objective of finding reliable information on threats and eliminating them. Technology on its own simply isn't intelligent enough yet to know good from evil, right from wrong, no matter how big the database...
"Human penetration is a very long shot. Banking on religious fanatics rolling over for cash is an even longer one."
I hope you're talking about what I think you are...
Why is it less of a long shot than the accumulation of mounds of information? What does it mean when your notebook says "Suicide Bomber"?
Don't forget that meaning must still be assigned by humans peering into the data and assigning value, which means their effectiveness depends on their familiarity with the actual environment that the data was collected in, which brings you full circle to needing people in the field...
The book "Under and Alone" about an ATF agent's infiltration of the Mongols comes to mind.
All the folks saying that data mining doesn't work are talking out of their asses. Hands up all of you who actually tried it? Yeah, that's what I thought.
Well, I have actually tried it, and I know why it is big business today -- if done skillfully, the results are almost magic. (Of course, like anything, there are outfits who do it badly, and also outfits that will charge you $20 thousand for software any bright college IT major could write in a couple of days.) Here's what happened to me (because of confidentiality I will disguise a few facts). My company had a big shipment of an expensive product stolen. For the sake of argument, let us say that it was a middling expensive but popular perfume.
Next week while reading sales reports from our distributors, a coworker happened to notice some odd patterns. So I started running this data through various statistical analyses, and began to form a suspicion, which I admit became an obsession, and even resulted in a reprimand at work for neglecting my regular duties. I studied the theories of data mining and more advanced techniques such as PCA and cluster analysis, and the patterns were made visible: there were 4 sources illegally distributing our product. Maybe 3 or 5, but almost certainly not 2 or 6. 3 of the 4 sources I could pin down to particular neighborhoods. One I could pin down to about 2 miles, and determine that this guy was likely only selling our product on Thursday nights.
I managed to convince the company that I was on to something, and a PI was hired; after about 3 Thursday nights of bellyaching in different bars about the cost of anniversary gifts he got offered some fine perfume remarkably cheap, the cops were called, we caught our man, he sang like a canary, and we recovered most of our product.
Data mining is not about looking for the obvious like "people who order suspicious books from Amazon". That sort of thing is just called "SQL" and has many uses but not catching criminals, except maybe the real dumb ones. Data mining is much more subtle, and where it really shines is finding unexpected patterns in a big stack of hay. For example your data mining algorithm might find that there exists a strongly separated cluster of entities which first created Amazon wish lists at one of 4 time clusters a few minutes long but weeks apart, all from within a particular region but different ISPs, with names which aren't registered as motor vehicle owners in that region, and all those wish lists include books from "people who bought 'The Rabbi's Cat' by Joann Sfar also bought...", and with each having at least one item paid for from IP addresses in the south east of South America, and that you can reject the null hypothesis "this pattern formed at random" with a significance level of 10^-6. Your mining algorithm will find many, many such odd relations. Most of them will be meaningless to you, and probably to the people in them, too. But if you later raid a bomb factory in Sydney, Australia, and arrest someone who turns out to be in that cluster, is it worthwhile looking at the rest of the cluster? Most definitely it is. Not a proof of guilt, but definitely a good lead to start applying some human skull sweat to. Probably you will soon realize why they registered in that strange pattern, and what South America has got to do with it. Maybe you will never understand what the hell Joann Sfar has to do with it, but the algorithm spotted the pattern anyway.
Is this a bad thing for the cops to do? Hell, no. Collecting information and finding out the snippets that point to guilt is what we used to call "investigation". It's what we pay them to do. It's only a bad thing if they get private information illegally (which isn't the case here) or if they're wasting their time and our money, and I don't believe they are.
I enjoyed that. Especially the part where you come up with a hunch, hired an investigator and had him go out and cop a confession. Couldn't your data just stand on its own two feet? Heh.
And as you point out in the start of your comment big business can mean snake oil, so it doesn't prove much about the veracity of the products.
I don't argue with the theory of data mining, since automation is often desirable even with a technology-based version of what we consider intelligence. I just think that highly intelligent operators are still mandatory for any kind of reasonably accurate results from data...unless you have some kind of amazing artificial intelligence product you want to tell us about.
Even then you might find out that if you give your contraption enough data all the thing will do is answer "42".
FWIW, I didn't say that data mining doesn't work. Trend analysis is certainly a useful tool when judging population behaviors.
Your given example actually illustrates why data mining like this is a bad method for uncovering terrorists.
In order for your analysis to succeed, you needed a large volume of pre-existing trend data to examine. By comparing sales distributions post-theft to sales distributions pre-theft you were able to notice a significant change in trends. Moreover, you're looking at one specific set of trend data (your sales), and you're able to safely exclude a large volume of related data (which, coincidentally, you don't possess anyway) such as the sales of other brands of perfume in the area.
This enables you to pinpoint a particular change in traffic in a particular area that corresponds precisely to a period of time (the event of the theft) -> all data that you have already accumulated. In other words, you have a reasonable set of initial conditions upon which to base your analysis.
Terrorist activity doesn't work that way. As I pointed out in my January 5, 2006 05:54 PM post, we *don't know* what sort of trends we need to analyze to predict terrorist activity. Do sales of "The Anarchist Cookbook" have anything to do with terrorist activity? How about sales of fertilizer? Theft of high explosives? Changes in travel frequency? Degrees in Biochemistry? Religious orientation? Socioeconomic background? We don't know what our initial conditions are.
In your specific example, imagine instead that you had the sales figures and trends for *every brand* of perfume across the nation, and you didn't know which type of perfume was in the shipment that was stolen, or the location of the original theft. Your analysis would result in garbage -> the results of the theft in sales trends would be lost in the "noise" of a million other things that affect sales trends... a snowstorm in New York that cuts down on foot traffic to stores, a plane crash in Houston that prevented a block of shipments from reaching the stores, a shoplifting ring that is currently operating in Seattle cutting down on sales figures there, etc. etc. etc. ad nauseam.
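One way to picture the "lost in the noise" worry is to ask how large a fixed sales pulse looks against the random variation of a total as more series are added to it. The sketch below assumes independent noise with the same standard deviation in every series, so the noise of the sum grows with the square root of the number of series; the pulse size and sigma are invented. It models only the crude aggregate and says nothing about whether a finer-grained analysis could still isolate the pulse, which is the point disputed later in the thread.

```python
import math


def pulse_z_score(pulse: float, per_series_sigma: float, n_series: int) -> float:
    """Size of a fixed pulse in units of the aggregate's standard deviation.

    Assumes n_series independent series with the same sigma, so the
    standard deviation of their sum is sigma * sqrt(n_series).
    """
    return pulse / (per_series_sigma * math.sqrt(n_series))


PULSE = 500.0    # assumed extra units sold because of the theft
SIGMA = 100.0    # assumed week-to-week noise of a single product line

for n in (1, 100, 10_000):
    print(f"{n:>6} series: z = {pulse_z_score(PULSE, SIGMA, n):.2f}")
```

With one series the assumed pulse stands five standard deviations clear of the noise; summed with ten thousand others it is a small fraction of one.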
When examining the entire population of the US (or the entire population of Amazon customers), we don't have any way to assign meaningful probabilities to what may or may not be suspicious behavior.
The only way we can get a successful data mining result would be if we made a large volume of guesses (thousands of guesses) about what sorts of trends we ought to be examining which actually happened to be correct. Even then, the likelihood of false positives will be astronomical.
The likelihood of this giving any sort of reasonable result is infinitesimal.
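The "many guesses" concern is the multiple-comparisons problem. As a rough sketch, suppose a miner evaluates ten million candidate patterns (an invented figure) and reports any pattern that clears the 10^-6 significance level quoted earlier in the thread. Treating the tests as independent, which is an assumption, the number of purely random patterns expected to clear the bar can be computed directly; both function names are hypothetical.

```python
import math


def expected_spurious(n_patterns: int, alpha: float) -> float:
    """Expected count of patterns that pass the threshold by chance alone."""
    return n_patterns * alpha


def prob_at_least_one_spurious(n_patterns: int, alpha: float) -> float:
    """P(at least one chance pattern passes), assuming independent tests.

    Computes 1 - (1 - alpha) ** n_patterns via log1p/expm1 so that tiny
    alpha values do not lose precision.
    """
    return -math.expm1(n_patterns * math.log1p(-alpha))


N_PATTERNS = 10_000_000   # assumed number of candidate patterns examined
ALPHA = 1e-6              # significance level quoted earlier in the thread

print(expected_spurious(N_PATTERNS, ALPHA))
print(prob_at_least_one_spurious(N_PATTERNS, ALPHA))
```

With these inputs about ten chance patterns are expected to pass, so a very small significance level on any single pattern does not by itself make that pattern meaningful.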
Now, on the other hand, if we can provide ourselves with a reasonable set of initial conditions, data mining could be incredibly useful in narrowing a pre-existing suspect list down to a manageable size for assigning investigative priority.
That's not what we're talking about on this thread -> we're talking about grabbing wholesale the entire purchasing and wish list activity of hundreds of thousands of people, essentially just to play with. The activity may be educational, and having real data to play with may be useful for intelligence officials to teach themselves advanced analysis techniques which may in the future help them in different scenarios, but on the face of it the activity in and of itself is going to produce bupkis results.
> What exactly are you basing this guess on?
The fact that I know enough about statistics to know that statistical analysis is only useful in certain applications that contain a reasonable set of initial conditions and a reasonable set of behavior to examine, with results aimed at a particular time frame.
This is why running analysis of weather data only gives you a useful prediction window of about 5 days. What we're looking at here is trying to predict precisely how many hurricanes will hit the east coast of the U.S. in the year 2006, their exact point of landfall, and the intensity of the storm... on January 6th. Not only are we going to guess completely wrong (which is bad enough in that we're wasting time and resources for null results), we're going to cause more harm than good -> any posted results will have a huge economic cost while the wrong cities get ready for hurricanes that aren't going to hit, and other wrong cities go like lambs to the slaughter.
> Just how well the $$million dollar bounties have gone down in some of
> the poorest areas on earth to date?
I don't know, but if even one has worked (and I imagine more than one has worked given that the deck of cards is smaller than it used to be) it's much more useful than this sort of garbage. Civil liberties issues aside (which are worth an entire discussion on their own), this is just a huge waste of resources.
Note also that offering million dollar bounties has a null fiscal cost for a false positive. There is, of course, still a civil liberties issue.
Somebody brings you Osama, and it's not the Osama you're looking for, you don't pay. It's probably a good idea to apologize to the non-Osama, however.
There are companies which will sell you a list of phone numbers called by the cell phone number you give them. Apparently this practice today isn't entirely illegal. My guess is it would be illegal for US spy agencies to purchase this service, while it would be okay for spy agencies of other countries to purchase it. How goofy is that?
With wishlists such as those used by Amazon.com, there is much opportunity for fiddling with the items on the list. One can tailor his or her "Amazon public face" by selecting items not because they are wanted but because they fit an image. If somebody buys you the item, no big problem. (But it raises the interesting question of why the person bought you the item.)
This is an electronic version of people decorating their homes by getting leatherbound books, most of them in a foreign language, because they produced a desired look.
The Amazon.com profiles not displayed to the public can get interesting and sometimes quite funny. After buying myself a microbiology textbook for a course and buying knitting books for my wife, Amazon.com sent me an email with special offers. It was to the effect of "People interested in microbiology and knitting....." That might score a few hits on some US homeland security data mining tool. Should have tried to get books on Zen and some on motorcycle maintenance. Oh well.
"Should have tried to get books on Zen and some on motorcycle maintenance. Oh well."
Indeed, and we can only hope that DHS is familiar with the book's gift to the world of policy-makers. Amazing how Pirsig's writing just increases in relevance and value with age -- easy to find clear insight into why large ambitious technology projects often fail, just like a show-room motorcycle can fail when the rider is oblivious to the required maintenance and failure warning signs of a thing s/he rides at death-defying speeds.
"I enjoyed that. Especially the part where you come up with a hunch, hired an investigator and had him go out and cop a confession. Couldn't your data just stand on its own two feet? Heh."
I am really not sure of your point here. Of course the remote data analysis alone does not solve the crime by itself. It is a tool to assist in the investigation, but it was an extremely valuable and powerful tool, that enabled 99% of the work to be done with a few hours' work from the comfort of my desk, instead of an investigator spending hundreds of hours rubbing shoulders with criminals at $500 a day plus expenses or whatever the hell they get paid. Plus I would not call significance at 0.1% "a hunch". It is not enough for a conviction by itself, but it is more reliable than most eyewitnesses.
"And as you point out in the start of your comment big business can mean snake oil, so it does not prove much about the veracity of the products."
I did not use the size of the business to validate the methodology; I said its sometimes astonishing effectiveness explains WHY it is big business, despite the existence of some snake oil merchants and over-inflated claims. In case you are interested, the following is an example of a data mining business which is definitely not selling snake oil:
SAS is the largest completely privately owned software company in the world, which means that unlike public megacorps corporate policy reflects the personal ethics of the CEO:
"...I just think that highly intelligent operators are still mandatory for any kind of reasonably accurate results from data..."
I agree absolutely. Data mining is a very powerful tool, but it needs a perspicacious mind to drive it. It does not always work, but since it can be a lot faster and cheaper than people seem to think, it is very often worth a try.
"FWIW, I didn't say that data mining doesn't work. Trend analysis is certainly a useful tool when judging population behaviors"
OK, but note that data mining is not the same thing as trend analysis. Some of your criticisms depend on this misconception.
"Terrorist activity doesn't work that way. As I pointed out in my January 5, 2006 05:54 PM post, we *don't know* what sort of trends we need to analyze to predict terrorist activity. Do sales of "The Anarchist Cookbook" have anything to do with terrorist activity?..."
You seem to misunderstand what data mining fundamentally IS, or at least what I mean by the term :). Data mining is a set of complex, powerful and flexible statistical tools, mostly designed for operation on extremely large, weakly structured data sets, and which can be thought of as an extreme generalization of descriptive statistics. Like regular descriptive statistics, it is for finding the patterns which underlie a large heap of data, and thereby enabling that data (or parts of it) to be more succinctly summarized. However unlike most "conventional" descriptive stats which output only a few parameters, data mining algorithms may output complex functional relationships between many dimensions of the data. Interpretation of those relations is still left up to a human intellect.
Importantly, when applied intelligently it can be useful for such things as DISCOVERING "what sort of trends we need to analyze to predict terrorist activity".
Now suppose it generated a result such as (using your example): 7% of non-farmers who buy "The Anarchist Cookbook" and then bulk fertilizer will build a bomb. In that case it would not be particularly useful, because that is a hypothesis you could postulate from simple reasoning, and then test if it was valid by well known methods. What is really neat with data mining is when it finds totally unexpected relations, as it does surprisingly often. Sometimes the relations are obvious with the advantage of 20:20 hindsight. Sometimes, there is a Ph.D. waiting for someone who can figure out WHY it occurs, but that does not actually matter to the investigator. And this unexpected stuff is particularly useful in a law enforcement scenario because smart criminals are likely to try to obscure the obvious relations (e.g. do not buy fertilizer and the book with the same cc), but will not even be able to guess what "unexpected relations" they should try to obscure.
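To make the form of a rule like "7% of people who buy X and then Y go on to do Z" concrete, here is a minimal sketch of how the support and confidence of one such association rule could be computed. The baskets and item names are invented purely for illustration, and `rule_stats` is a hypothetical helper, not part of any particular data mining package.

```python
# Hypothetical purchase baskets, invented for illustration only.
baskets = [
    {"cookbook", "fertilizer"},
    {"cookbook"},
    {"fertilizer", "seeds"},
    {"cookbook", "fertilizer", "seeds"},
    {"seeds"},
]

def rule_stats(baskets, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent.

    support    = fraction of all baskets containing both sides of the rule
    confidence = fraction of baskets containing the antecedent that also
                 contain the consequent
    """
    total = len(baskets)
    with_antecedent = sum(1 for b in baskets if antecedent <= b)
    with_both = sum(1 for b in baskets if (antecedent | consequent) <= b)
    support = with_both / total if total else 0.0
    confidence = with_both / with_antecedent if with_antecedent else 0.0
    return support, confidence

support, confidence = rule_stats(baskets, {"cookbook"}, {"fertilizer"})
print(f"support={support:.2f} confidence={confidence:.2f}")
```

A real mining run would search over many candidate rules rather than test one chosen in advance, which is where the unexpected relations come from.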
"In your specific example, imagine instead that you had the sales figures and trends for *every brand* of perfume across the nation, and you didn't know which type of perfume was in the shipment that was stolen, or the location of the original theft. Your analysis would result in garbage..."
That is almost the whole point: you do not really KNOW that. You are guessing, and probably on the strength of very little knowledge about perfume sales (no offence intended). Initially no one bothered to look at our sales figures, if they thought about it at all, because it was "obvious" that it could not show up through the noise. What the data mining demonstrated was that if you processed the data just the right way, the pulses DID show clearly through the noise.
If I only had the weaker data set you propose, I would still be a fool not to try to mine it. Now I know how it is done it would probably take me 3 to 4 hours to set up, then run it in machine idle time. The next morning Pat would still be saying "Cannot be done, cannot be done", while I could be saying either "darn, Pat was right" or "Officer, they are in Cleveland now, call the FBI!" or maybe even "the thieves could be anywhere, but I know the shippers lied: it was 'Armani Le Parfum', not 'Bvlgari Pour Femme'!"
This is my main point, I guess. People seem to think that data mining is expensive, inflexible or ineffective, or would be stymied at finding a needle in a haystack of irrelevant data. But actually data mining is flexible and powerful, can be pretty cheap if you are smart about it, and can sift immense amounts of hay, sometimes finding a needle, sometimes a diamond ring you did not know was lost, and sometimes just hay :)
> note that data mining is not the same thing as trend analysis.
> You seem to misunderstand what data mining fundamentally IS
No, I know what it is :) But you're right, my previous post sounds more like a description of why trend analysis has difficulties than of why data mining does. Note, though, that data mining just gives you trends to analyze (but you and Davi have already pointed out that there is a huge human element).
One of my main points, which I didn't illustrate well, is that any data analysis requires some sort of pre-existing data, and to be most effective, some sort of framework around which to analyze the data (you may find relationships you don't expect, but you're still looking for specific relationships). Again, you have specific data you were analyzing, and you have specific questions you want to ask, regarding an event that already occurred. In other words, you already know what crime (specifically) you're investigating, where it took place, when, how, etc. You are then trying to find the effects of this event in the data set you're analyzing. You're performing data forensics, essentially... geek CSI.
However, the use case that this thread is about isn't a case of forensics. The feds aren't combing through the Amazon database looking for information that might be helpful in discovering who blew up the Chrysler Building on Jan 23, 2003 (made up event). They're looking for information that may be useful in discovering who may blow up something, somewhere, sometime in the future. This isn't forensics, it's investigating thought crime - things that haven't happened yet.
> Importantly, when applied intelligently it can be
> useful for such things as DISCOVERING "what sort of
> trends we need to analyze to predict terrorist activity".
I doubt this is actually the case (admittedly I'm not a DBA or a DA and it's been a long time since I've done real math) -> at least in an *investigative* sense. To expand on that a little -> we can pump numbers into a cluster and run data mining algorithms against it and we may get all sorts of useful information regarding terrorist activity profiles. For example, you could discover that car bombing events are much more likely to occur in certain sets of environmental conditions. This sort of data may be highly useful in spending resource dollars in preventing car bombings (or at least give some sort of mathematical reason to elevate the terror level). Individual terrorist acts are so infrequent in the U.S., however, that a lot of our data and our assumptions are going to be based upon data gathered from other countries. Finding out that car bombings are most likely to occur on Friday isn't terribly useful if the reason they occur on Friday is because it's the Jewish Sabbath and most car bombings take place in Israel ;)
> That is almost the whole point: you do not really KNOW that.
> You are guessing, and probably on the strength of very little
> knowledge about perfume sales (no offence intended).
Oh, sure, admittedly I'm guessing. Since you weren't providing much in the way of details, I'm assuming that the event you're discussing may not have anything to do with the perfume industry at all. No offense taken.
> If I only had the weaker data set you propose, I would still be a
> fool not to try to mine it. Now I know how it is done it would
> probably take me 3 to 4 hours to set up, then run it in machine
> idle time.
This is going to come out offensive, but it's not meant that way. You sound like you've got a classic case of geek overempowerment. You've found a new set of tools, and you're in the relative early stages of figuring out what they can do, and as a result you see opportunities everywhere. I've found that in most cases, people suffering from geek overempowerment forget that there are costs everywhere, too :)
You actually may be a really big fool to try and mine it. Probably not, in your particular case. If you're blowing 4 hours of your individual work time tracing a single event, that's a pretty good tradeoff if you have even a small chance to recover a single stolen shipment. A great number of associated costs for your activity are already covered.
However, look at it from the stance of building this sort of thing from scratch (again, what the Federal government needs to do to implement this sort of thing).
First, they need to build an infrastructure to house all the data and crunch it, securely. Building a cluster for this sort of activity requires millions of dollars just in facility costs, let alone machine purchasing, maintenance, etc. (and while I may not know much about the perfume industry, I know how many foot-pounds of coolant you need to shove into a room to chill 1024 processors to operational temperature) :) You certainly don't want to run this sort of analysis on people's desktops overnight, since there will be a lot of sensitive data and access control will be critical. Plus, you're talking about a data mine that contains thousands of mini-mines like the one your perfume company has, so you're going to need thousands of machines to crunch the numbers.
They have to acquire large blocks of data not just from Amazon, but from thousands of vendors across the U.S. Each vendor has a different database system, running a different version of the database. Some of it is encrypted, some of it isn't. All of it needs to be ordered to be processed. The largest, most useful databases are probably normalized to some extent, but they're all customized to meet the needs of the individual vendors. You'll need to re-normalize all of the databases in the same fashion, compile the data into a single aggregated system, and denormalize it for performance. Moreover, you'll need to do this on a moving data set, because new data is always being generated and will need to be injected into the system.
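As a rough illustration of the re-normalization step described above, here is a minimal sketch that maps records from two differently shaped vendor schemas onto one common layout. The vendor names, field names, and the `FIELD_MAPS` table are all hypothetical; a real system would also have to deal with encryption, type conversion, deduplication, and data that keeps arriving.

```python
# Hypothetical per-vendor mappings from source field name to common field name.
FIELD_MAPS = {
    "vendor_a": {"cust_id": "customer", "sku": "item", "ts": "purchased_at"},
    "vendor_b": {
        "CustomerNumber": "customer",
        "ProductCode": "item",
        "OrderDate": "purchased_at",
    },
}

def to_common(vendor, record):
    """Rename one vendor record's fields to the common schema."""
    mapping = FIELD_MAPS[vendor]
    common = {target: record[source] for source, target in mapping.items()}
    common["source"] = vendor  # remember where the row came from
    return common

def aggregate(feeds):
    """Flatten {vendor: [records]} into one list of common-schema rows."""
    rows = []
    for vendor, records in feeds.items():
        rows.extend(to_common(vendor, r) for r in records)
    return rows

feeds = {
    "vendor_a": [{"cust_id": 17, "sku": "B001", "ts": "2006-01-05"}],
    "vendor_b": [
        {"CustomerNumber": 42, "ProductCode": "X9", "OrderDate": "2006-01-04"}
    ],
}
rows = aggregate(feeds)
print(len(rows), "rows in the aggregated set")
```

Every new vendor means another hand-written mapping, which is part of why the acquisition step is expensive at the scale of thousands of vendors.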
If you're doing this sort of work, you probably make between $80 and $120K a year, so with benefits, time off, etc., and with intradepartmental relationship effects, you cost your company's IT department probably around a quarter of a million dollars to employ. We'll have to find quite a few of you to staff this monstrosity, all of whom have security clearance, know a lot about criminal forensics, and have spook training (otherwise we don't have highly intelligent operators). We need someone to manage all these guys, someone to manage the hardware and software uptime, and someone to do hardware repair and system maintenance, all of whom again need security clearance (although they may not all need spook training). In this particular case, the guy who swaps out broken hard drives probably makes about as much money as you do.
Let's say you can get useful performance out of a 2048 node supercomputer. A room to house that monkey is going to cost about 3 or 4 million dollars, counting backup generators, heat exchangers, fire suppression, etc. The computer itself, with networking equipment, high speed storage, etc., is going to be another 8 million. You may spend several million just on storage for all of the data you've gathered. Putting all of this stuff into a building with offices, power, desktop computers, phones, etc. is another $10 million.
Suddenly, you're talking about a facility that costs tens of millions of dollars to build, data that costs millions of dollars to acquire, and has an operational cost of tens of millions of dollars a year. Not to mention the fact that you've got a hardware replacement cycle of about 36 months, so you're rebuilding the supercomputer every three years, which means you need to upgrade facility power, coolant, etc., every three years...
To make guesses about data looking for crimes that haven't been committed yet? The fact that we're arguing over the efficacy of this sort of system means that there's at least some possibility that you'll get no practical results from this endeavor at all...
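The back-of-envelope figures above can be tallied in a few lines. This is only a sketch: the capital items use the numbers quoted in the comment, the "several million" for bulk storage is left out because no figure is given, and the headcount of 40 is a purely hypothetical assumption chosen to show how the staffing term scales.

```python
# All amounts are in millions of US dollars, as (low, high) estimates.
capital_items = {
    "machine room": (3.0, 4.0),              # "about 3 or 4 million"
    "computer and networking": (8.0, 8.0),   # "another 8 million"
    "building and offices": (10.0, 10.0),    # "another $10 million"
}
# Bulk data storage ("several million") is not quantified, so it is excluded.

def capital_range(items):
    """Sum the low and high estimates across all capital items."""
    low = sum(lo for lo, _ in items.values())
    high = sum(hi for _, hi in items.values())
    return low, high

def annual_staff_cost(headcount, loaded_cost=0.25):
    """Yearly staffing cost at roughly a quarter million per person."""
    return headcount * loaded_cost

def annual_refresh_cost(computer_cost=8.0, cycle_years=3):
    """Spread the computer's cost over its 36-month replacement cycle."""
    return computer_cost / cycle_years

low, high = capital_range(capital_items)
staff = annual_staff_cost(headcount=40)  # 40 is an assumed headcount
refresh = annual_refresh_cost()
print(f"capital: {low:.1f} to {high:.1f}M, staff: {staff:.1f}M/yr, "
      f"refresh: {refresh:.2f}M/yr")
```

Even with storage, data acquisition, power, and cooling left out, the quoted items alone put the build in the tens of millions.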
Schneier.com is a personal website. Opinions expressed are not necessarily those of BT.