Schneier on Security
A blog covering security and security technology.
May 31, 2006
Data Mining Software from IBM
In the long term, corporate data mining efforts are more of a privacy risk than government data mining efforts. And here's an off-the-shelf product from IBM:
IBM Entity Analytic Solutions (EAS) is unique identity disambiguation software that provides public sector organizations or commercial enterprises with the ability to recognize and mitigate the incidence of fraud, threat and risk. This IBM EAS offering provides insight on demand, and in context, on "who is who," "who knows who," and "anonymously."
This industry-leading, patented technology enables enterprise-wide identity insight, full attribution and self-correction in real time, and scales to process hundreds of millions of entities -- all while accumulating context about those identities. It is the only software in the market that provides in-context information regarding non-obvious and obvious relationships that may exist between identities and can do it anonymously to enhance privacy of information.
For most businesses and government agencies, it is important to figure out when a person is using more than one identity package (that is, name, address, phone number, social insurance number and other such personal attributes) intentionally or unintentionally. Identity resolution software can help determine when two or more different-looking identity packages are describing the same person, even if the data is inconsistent. For example, by comparing names, addresses, phone numbers, social insurance numbers and other personal information across different records, this software might reveal that three customers calling themselves Tom R., Thomas Rogers, and T. Rogers are really just the same person.
It may also be useful for organizations to know with whom such a person associates. Relationship resolution software can process resolved identity data to find out whether people have worked for some of the same companies, for example. This would be useful to an organization that tracks down terrorists, but it can also help businesses such as banks, for example, to see whether the Hope Smith who just applied for a loan is related to Rock Smith, the account holder with a sterling credit rating.
Posted on May 31, 2006 at 6:52 AM
• 31 Comments
"In the long term, corporate data mining efforts are more of a privacy risk than government data mining efforts"
I would have to disagree with that one.
In the UK the Government are (discreetly) putting into place access to all Credit Card and Loyalty card DBs in the UK.
The purpose is supposedly to look for the usual suspects (Terror/Drugs/whatever sounds good this week).
However, as they have also approached the SAP people for specialised profiling software, it would appear the real reason is to collect Tax in one form or another.
So you have a Government with access to all of these DBs and specialised mining software -v- a large Corporate with only access to one or two DBs using less specialised mining software. Guess which one has the greater scope for harm to Joe Public....
My concern with the private sector doing all of this is that there are no good controls on the data they collect. Since they are in business to make money, what's to stop them from selling all of this information?
In your articles about NSA et al. data mining, you hypothesize that the number of false positives will make data mining useless. Why is data mining by private corporations different?
Given the commercial and financial success of data mining in the corporate sphere -- certainly credit-card companies use it to reduce fraud -- arguments about NSA data mining should focus on privacy issues, not efficacy issues.
What's to stop the Government from selling the data?
In the UK some government departments have been made to be "cost neutral" to the Government, i.e. Met Office (Weather), HMSO (government documents, laws etc) and other Depts responsible for Maps / Charts etc.
There is now considerable concern in the UK that they are effectively competing unfairly with the private sector. Especially as it's almost impossible to refer to the Government-run regulators to complain...
That software sounds quite useful as an anti-fraud measure, particularly for government organisations dealing with entitlements.
BTW you've provided no basis for your assertion that such commercial data mining efforts result in an increased privacy risk.
Either they have your personal data to mine in the first place or they don't. Whether a corporation chooses to buy some data mining software Thursday, Friday, next week or never at all to utilise the customer records they already have has no bearing on whether your privacy is infringed.
Frankly all this is a little ridiculous in 2006. Are we working on the basis that everyone pretends credit cards don't constitute a default financial instrument for the majority of all first world citizens?
That when everyone gets one you don't fill out a personal questionnaire where you declare all static personal details that anyone could ever obtain?
That the financial provider of the card doesn't get informed on all your transactions, where you shop, how frequently and in what financial amounts?
Right, so there's some other company that we should care about that can buy some software to mine some personal data and financial data that comes nowhere near that which credit card companies have always had and supplied directly by you.
Honestly what possible concern is it if some different company decides to mine its data?
Since they are in business to make money, what's to stop them from selling all of this information?
Posted by: Andy at May 31, 2006 07:43 AM
Same thing that stops every business selling anything. A market.
If your answer is there is a market for your personal information then congratulations, you figured out where they got it in the first place and them selling it to someone else really hasn't changed anything.
The only thing that should keep you up at night is that nagging question in the back of your mind.... "WTF is new about any of this?"
The premise underlying IBM's EAS (and competing products in future) is that any personal information acquired by a business or government agency becomes their resource, no longer the property of the subject, and that they are free to exploit it for whatever purpose they choose.
Perhaps it's appropriate to quote the principles in the UK's Data Protection Act*:
1. Personal data shall be processed fairly and lawfully [the Act specifies what that means].
2. Personal data shall be obtained only for one or more specified and lawful purposes, and shall not be further processed in any manner incompatible with that purpose or those purposes.
3. Personal data shall be adequate, relevant and not excessive in relation to the purpose or purposes for which they are processed.
4. Personal data shall be accurate and, where necessary, kept up to date.
5. Personal data processed for any purpose or purposes shall not be kept for longer than is necessary for that purpose or those purposes.
6. Personal data shall be processed in accordance with the rights of data subjects under this Act.
7. Appropriate technical and organisational measures shall be taken against unauthorised or unlawful processing of personal data and against accidental loss or destruction of, or damage to, personal data.
8. Personal data shall not be transferred to a country or territory outside the European Economic Area unless that country or territory ensures an adequate level of protection for the rights and freedoms of data subjects in relation to the processing of personal data.
Principle 2 would make use of IBM's product unlawful (unless, of course, when obtaining the information they put the usual 4-point type saying "filling in this form constitutes consent for my details to be used in any way that makes money for the company").
* Data Protection Act 1998, schedule 1: http://www.opsi.gov.uk/acts/acts1998/19980029.htm
Oh, and "IBM Anonymous Resolution allows you to 'disguise' sensitive data before you share it with others for purposes of identity resolution and relationship detection." Is this, or is this not, bullshit? If sensitive data is 'disguised', you can't use it for identity and relationship!
"arguments about NSA data mining should focus on privacy issues, not efficacy issues"
I do not think that anybody doubts the efficacy of Data Mining for some problems. However, equating the problem space faced by the NSA and the one faced by the banking/credit industry, and thus assuming that the same solutions are applicable on both, is misleading.
The basic difference between these two scenarios (by no means the only one) is that in one case we are looking at *retrospective analysis* while in the other one we are looking at *forecasting*. In very simple terms the difference between these is the same one that exists between connecting the dots and figuring out which dots to connect in the first place. In one case an expert system or a pattern matcher will excel; in the second one some clustering or topological analysis algorithm will yield better results.
The problems are different, the algorithms are different, the metrics used to measure their performance are different and thus comparisons between them are difficult.
I agree with you that we must focus on the privacy issues. However, we must be aware of the technical issues as well in order to be able to make informed cost/benefit decisions.
“Why did they do it?” asks one major section of the government report.
There are no clear answers—nothing in the report that explains why one morning these British men blew themselves up and killed dozens of commuters and injured hundreds more.
According to the report, the men were serious about their religion—but then so are thousands of other members of the very same community. The men spoke out about politics at times but, of course, plenty of people do that.
Some evidence suggests that a local gym the young men attended attracted people with radical views. A local bookstore was rumored to stock radical writings and DVDs. The men liked to go on camping trips—leading to speculation that the trips were training programs. The report finds little significance in any of these things. The men had visited Pakistan with their families. Again, though, many Britons make the very same trip.
The report reaches some chilling conclusions. “The case demonstrates,” it says, “the real difficulty for law enforcement agencies and local communities in identifying potential terrorists.” There was “little in the backgrounds” of the London bombers to “mark them out as particularly vulnerable to radicalization.” On the whole, the men were “well integrated into British society.” While they may have experienced moments of “instability” there was nothing “extraordinary” about their life circumstances.
Posted by Jason Mazzone at 09:12 PM | Comments (0) | TrackBack (0)
Gee, that sounds like an old project of theirs from the early 20th century in Germany...mass data collection of individuals, storage and analysis of said data...yeah, this definitely creeps me out a bit.
Okay, I'll come clean: stuff like this scares me s**tless.
I've posted on this before - let's not forget that IBM has been in the data mining business a very long time - and look who they were doing business with: http://www.ibmandtheholocaust.com/ . I can also say from having worked for IBM that they DO have very interesting customers for the data they mine - three letter groups included.
@Andy: "Since they are in business to make money, what's to stop them from selling all of this information?
Posted by: Andy at May 31, 2006 07:43 AM"
@Tank: "Same thing that stops every business selling anything. A market.". Posted by: Tank at May 31, 2006 07:58 AM
In the UK, the Data Protection Act and the Human Rights Act both limit what can be done with personal information by public bodies.
The Data Protection Act states that information can only be used for the purpose for which it was collected, and must be disposed of when no longer required for the original purpose. It is also illegal to transport personal information outside the EU, except where there are "safe-harbour" agreements, of which there are a few with American companies, and probably some others too. Our information commissioner (the person ultimately responsible for enforcement of the DPA) has teeth too - you don't want to get on the wrong side of him.
The Human Rights Act, article 8 guarantees the right to privacy and freedom from interference in family life.
A few years back, there was a case in the UK (Robertson vs. Wakefield council) where the council tried to sell electoral roll information to private companies (for junk mailing). Robertson complained, and won. They had to change all their systems.
Surely this becomes worse when the information these tools generate then becomes lost. 26.5 million records is one thing; networks displaying how those 26.5 million people are connected is another, possibly a whole lot more valuable?
"In the UK, the Data Protection Act and the Human Rights Act both limit what can be done with personal information by public bodies."
This is hardly reassuring to those of us in the US where questions of privacy of personal information seem to generate large amounts of lip service and little else.
The information is available, so people are going to use it. I don't think it's practical to try to limit the access to information. There are always ways to work around the laws, whether it's taking the data to a country with less restrictive laws, a government agency contracting the task out to a private company, or simply claiming another legitimate use for the data.
The companies that collect and use this data do so because it's profitable. They aren't concerned with anyone's views on privacy. Trying to phrase this as a privacy discussion means that we're not talking in terms that the companies can understand. The government seems less interested in privacy, because the response to any concerns is "NATIONAL SECURITY!! TERRORISTS!!!!"
I don't think there's a solution that would satisfy a majority of people. It seems that we can discuss it all we want, but it's going to keep getting worse.
In response to John Stanning --
IBM Anonymous Resolution transforms information to be shared into strings of alphanumeric characters that are mathematically impossible to convert back to their original form by using an industry-standard one-way hash. The hashed data can then be shared with another party while the actual information remains anonymous. When a match is found, Anonymous Resolution immediately generates a real-time alert and notifies the agreed-upon party. Information indicates that a match has been found between specific records. IBM Anonymous Resolution provides no information about the specific data in each record. The actual content is anonymous; instead it sends the data holder a "pointer" back to the original data.
More here: http://www-306.ibm.com/software/data/db2/eas/...
What I find most technically interesting is their claim that they can match anonymous data. At first glance it offers a solution to the big brother database since it's possible to set up the big database with no identifying information in it. Presented with a subpoena, the matching could be done and those contributing the hashed data would then reveal the real identities.
I can only speculate how this works. We're certainly not talking about cryptographic hashing here, since cryptographic hashing is designed NOT to match unless the input is identical. They are somehow reducing the information into characteristics with something akin to a Soundex transformation, losing enough information on the way to prevent going backwards, but allowing matching.
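To make that speculation concrete, here is a minimal Python sketch of one way a lossy, Soundex-style reduction could be combined with a one-way hash so that similar-sounding names still match. This is an illustration of the guess above, not IBM's actual method, which is not described here; the function names `soundex` and `phonetic_token` are made up for the example.

```python
import hashlib

# Classic American Soundex digit groups.
_CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        _CODES[ch] = digit


def soundex(name: str) -> str:
    """Return the 4-character Soundex code, or '' if the name has no letters."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    out = [letters[0]]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            out.append(code)
        prev = code  # a vowel resets prev, so a repeated code after it counts
    return ("".join(out) + "000")[:4]


def phonetic_token(name: str) -> str:
    """Hash the lossy phonetic code, not the name itself (hypothetical scheme)."""
    return hashlib.sha256(soundex(name).encode("utf-8")).hexdigest()


# Different spellings collapse to the same token; unrelated names do not.
same = phonetic_token("Smith") == phonetic_token("Smyth")
different = phonetic_token("Smith") != phonetic_token("Jones")
```

The information is discarded before hashing, so the hash only needs to compare equal inputs. Note that a four-character code has very few possible values, which makes the token easy to enumerate.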
Where this appears to fall down, however, is that if one has a second database, perhaps not as rich, obtained from public sources, then the same technology that ferrets out matches in the anonymized data could then match the cluster of records of interest in the second database, thereby revealing the identities. That's the big problem with databases; records that can't be linked in one database can be linked with the help of information from another.
One can resolve duplicate individuals, given a rich set of information, with a very high degree of accuracy. Asking the question "who is the same as this other person?" will rarely give a false positive with good matching software. Asking the same question about everybody in the country will return lots of false positives.
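A rough way to see why the two questions behave so differently is to count comparisons. The Python sketch below does only that arithmetic; the population size and the per-comparison false-positive rate are assumed values chosen for illustration, not measurements of any real product.

```python
def single_query_comparisons(n: int) -> int:
    """Comparing one person against everyone else in a population of n."""
    return n - 1


def all_pairs_comparisons(n: int) -> int:
    """Comparing everybody against everybody: n choose 2."""
    return n * (n - 1) // 2


def expected_false_matches(comparisons: int, false_positive_rate: float) -> float:
    """Expected number of wrong matches if each comparison errs independently."""
    return comparisons * false_positive_rate


POPULATION = 300_000_000      # assumed population, for illustration only
FALSE_POSITIVE_RATE = 1e-9    # assumed: one wrong match per billion comparisons

one_person = expected_false_matches(
    single_query_comparisons(POPULATION), FALSE_POSITIVE_RATE)
everybody = expected_false_matches(
    all_pairs_comparisons(POPULATION), FALSE_POSITIVE_RATE)
```

Under these assumptions the single query expects well under one false match, while the all-pairs question expects tens of millions, because the number of comparisons grows with the square of the population.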
@John Stanning -- Oh, and "IBM Anonymous Resolution allows you to 'disguise' sensitive data before you share it with others for purposes of identity resolution and relationship detection." Is this, or is this not, bullshit? If sensitive data is 'disguised', you can't use it for identity and relationship!
You can if it is "disguised" in a methodical, repeatable way, e.g., hashing, such that each instance of a name or address component is transformed consistently into something unrecognizable. String-matching is content insensitive, so whether your name is "Smith" or "*^&%HJRGDF," as long as it's always the same, you can match and link relationships. Of course, if you really want to know who the person is, you have to go back to the undisguised source.
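As a minimal sketch of that idea in Python: two parties disguise their names the same way, compare only the disguised values, and get back record IDs that point into their own undisguised data. The `bank` and `casino` record sets and the function names are made up for the example, and plain SHA-256 over a normalized string is an assumption, not a description of IBM's product.

```python
import hashlib


def disguise(value: str) -> str:
    """Normalize, then hash, so the same name always yields the same token."""
    normalized = " ".join(value.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def shared_tokens(records):
    """Map each disguised name to the owner's local record ID (the 'pointer')."""
    return {disguise(name): rec_id for rec_id, name in records}


# Hypothetical record sets held by two different organizations.
bank = [(101, "Thomas Rogers"), (102, "Hope Smith")]
casino = [("A7", "thomas  rogers"), ("B2", "Rock Smith")]

bank_tokens = shared_tokens(bank)
casino_tokens = shared_tokens(casino)

# Only tokens are compared; each side learns which of its own records matched.
matches = [(bank_tokens[t], casino_tokens[t])
           for t in bank_tokens.keys() & casino_tokens.keys()]
```

Here `matches` pairs bank record 101 with casino record "A7". Exact hashing only links values that normalize to the same string, so "Tom R." would not match "Thomas Rogers" without some further reduction step.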
I would like to add, based on first-hand experience at a company that collated public information, that the privacy policies of corporations which collect this information, in certain circumstances, are not worth the paper they're not printed on. In particular, the company I worked for went belly-up, and its data was bought in a fire sale. The acquiring company was not bound by the dead company's privacy policies, and was basically free to do whatever it wanted.
"...mathematically impossible to convert back to their original form by using an industry-standard one-way hash."
A few points:
The industry-standard hashes - MD5, SHA1, et al, have not been proven secure in the mathematical sense. For some, quite the opposite.
Even when the hash's security has been proven mathematically, there's always brute force.
In this case, brute force can be made particularly efficient. The keyspace is small, and dictionaries (e.g. phonebooks) are available.
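The last point can be sketched in a few lines of Python. This assumes an unsalted, unkeyed hash applied directly to a name, which is a guess at the simplest possible scheme and not a statement about how IBM's product works; the tiny `phonebook` list and the function names are made up for the example.

```python
import hashlib


def one_way(value: str) -> str:
    """Unsalted one-way hash of a name (assumed scheme, for illustration)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def build_lookup(dictionary):
    """Hash every known name once, giving a hash -> name reverse table."""
    return {one_way(entry): entry for entry in dictionary}


# Stand-in for a public phonebook; a real one would have millions of entries.
phonebook = ["Thomas Rogers", "Hope Smith", "Rock Smith"]
lookup = build_lookup(phonebook)

# A "disguised" value received from another party.
leaked = one_way("Hope Smith")

# No need to invert the hash: just look it up.
recovered = lookup.get(leaked)
```

The hash is never broken here. The attack works because the set of plausible inputs is small enough to enumerate, so any name that appears in the dictionary is recovered by a table lookup.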
"In the long term, corporate data mining efforts are more of a privacy risk than government data mining efforts."
I haven't seen a corporation throw someone in jail or kill them.
Some members of corporations, I'm sure, have stolen things, but the government operates completely on theft everywhere and at all times. For reference, check your pay stub.
I wouldn't worry so much about IBM data mining efforts, I'd worry much more about the government taking it by pressure or doing it themselves at your expense (as they already are).
"Since they are in business to make money, what's to stop them from selling all of this information?"
Nothing. Our legal system is so bad.
There is no competition in legal conduct - hence everyone has to accept a one-size-fits-all solution. The government default stance on limited liability removes the market for risk assessment - hence it allows companies to do what they wish in regard to consumer info.
The government does not mind. It wants information, tax revenue, and all around control. The amount of money made from selling consumer info is less than any of the wonderful government programs slowly, but surely, destroying civilization.
"Our information commissioner (the person ultimately responsible for enforcement of the DPA) has teeth too - you don't want to get on the wrong side of him."
Sadly, German "Datenschutzbeauftragte" (data protection officers, probably our flavour of information commissioners) have no teeth at all. Ignoring them is a favorite sport of companies and all government agencies; and they compete ambitiously in that game...
"The companies that collect and use this data do so because it's profitable. They aren't concerned with anyone's views on privacy. ... The government seems less interested in privacy ..."
Conclusion: Companies and governments have compatible interests, and combine their forces to violate our interests which are the exact opposite.
We have the most powerful, unscrupulous, intransparent and organized enemies on earth, and are weak, intimidated, wholesale surveilled and unorganized ourselves. Guess who will win?
My understanding is that the three-letter agencies work closely with big corporations and always have. There are probably several reasons. First of all, if the NSA were to outright own a major part of the network infrastructure or to provide a major service, such as storing people's e-mails and journals and so on, then the average person would be understandably suspicious. But if a corporation does it, well, that's less conspicuous to the average consumer. Also, I believe there might be legal loopholes that make the corporation-approach desirable to the three-letter agencies. Technically, there are rules about the government collecting information on citizens. However, these rules do not apply to private corporations. So the three-letter agencies can skirt around the law by letting the private companies collect the data and then buying it from the private companies.
I can think of a legitimate and useful application of this technology: medical records. If a complete set of medical records for an individual who had visited multiple providers is desired, a tool such as this one would be a great help.
(Of course, such a collection of records can have beneficial and malicious uses. Also, it would be a tool, not a panacea.)
Re: business and government data availability
Too Late - they won the war a long time ago.
Another chronicle of how it happened and the current state (book):
No Place to Hide: Behind the Scenes of Our Emerging Surveillance Society
by Robert O'Harrow
(login required, but I am sure you can find it elsewhere)
I.B.M. Software Aims to Provide Security Without Sacrificing Privacy
By STEVE LOHR
Published: May 24, 2005
"... The technology for anonymous data-matching has been under development by S.R.D. (Systems Research and Development), a start-up company that I.B.M. acquired this year.
Much of the company's early financial backing came from In-Q-Tel, a venture capital firm financed by the Central Intelligence Agency that invests in companies whose technologies have government security uses.
S.R.D., now I.B.M.'s Entity Analytics unit, has worked for years on specialized software for quickly detecting relationships within vast storehouses of data. Its early market was in Las Vegas, where casinos used the company's technology to help prevent fraud or employee theft. The matching software might sift through databases of known felons, for example, to find any links to casino employees.
By the late 1990's, United States intelligence agencies had discovered S.R.D. and the potential to use its technology for winnowing leads in pursuing terrorists or spies. After 9/11, the government's interest increased, and today most of the company's business comes from government contracts.
The new product goes beyond finding relationships in different sets of data. The software, which I.B.M. calls DB2 Anonymous Resolution, enables companies or government agencies to share personal information on customers or citizens without identifying them..."
One nice application of the technology that SRD has talked about in a Black Hat keynote is NORA - Non-Obvious Relationship Analyzer (I think that's the right decode of the acronym.) Simple example is unearthing collusion between employees, between employees and suppliers, between employees and known bad actors... Fascinating stuff and a fascinating story of taking a powerful idea to market. Google for:
(with quotes) – I think that the audio/video may be online in the Black Hat archives that show up in this query but I have not looked. The SRD website seems to have disappeared from the Web – even from the usual archives…
I first became aware of SRD and NORA when trying to clean up the data in a rather large internal application. SRD wouldn't talk price until they had someone come visit, which meant 5+ digit price tag. Our main application involved electronic prescriptions and we had no end of mangled addresses and phone numbers with all the area code splits: and that is just the doctors (I vaguely remember the number 600k of doctors, nurses, dentists, etc who were legally authorized to write prescriptions) and pharmacies (70k in the US). The actual prescription details weren't going to be touched.
Here is a paper describing some of the history involved.
Historically, you'd see this written up as "patient matching" back in the 60s, and since the late 90s you'd read about this sort of stuff in gene sequencing. I think the latest work in this area is in algorithms based on BLAST (which is *the* algorithm for matching genes). His dissertation provides most of the historical information about this sort of problem.
Schneier.com is a personal website. Opinions expressed are not necessarily those of BT.