Searching Google for Unpublished Data

We all know that Google can be used to find all sorts of sensitive data, but here’s a new twist on that:

A Spanish astronomer has admitted he accessed internet telescope logs of another astronomer’s observations of a giant object orbiting beyond Neptune, but denies doing anything wrong.

Jose-Luis Ortiz of the Institute of Astrophysics of Andalusia in Granada told New Scientist that it was “perfectly legitimate” because he found the logs on a publicly available website via a Google search. But Mike Brown, the Caltech astronomer whose logs Ortiz uncovered, claims that accessing the information was at least “unethical” and may, if Ortiz misused the data, have crossed the line into scientific fraud.

Posted on September 23, 2005 at 1:43 PM

Comments

ac September 23, 2005 2:00 PM

Does lack of security imply the right to unfettered public access?

Of course not. This is basically the same issue as the Republican staffer who stole Democratic strategy notes off an unsecured computer. Of course it’s unethical. The fact that it involves technology just clouds the issue.

To use everyone’s favorite car analogy: if I leave my keys in my car and leave the doors unlocked, is it unethical for you to steal it? Would I be unjustified in calling the cops?

If you want to put something into the public domain, you have to specifically say so.

Chris Walsh September 23, 2005 2:05 PM

Ed Felten had a write-up on this a couple of weeks ago.

The situation is not as clear as the quotation from Mike Brown would lead one to believe. Ed’s comments provide some useful links, as subsequent comments here no doubt will.

Joe Buck September 23, 2005 2:20 PM

Academic honesty would require that the source of all data used be clearly documented in any publication. It appears that Ortiz gave no credit originally, implying that he was relying only on his own observations, and that he only confessed when confronted with evidence in the form of logs. Even if it is true that he merely used the data on the web to confirm his own results, he did not report it, which is unethical. Academic honesty requires that all sources used be accurately reported.

Stephen September 23, 2005 3:20 PM

“If you want to put something into the public domain, you have to specifically say so.”

The server said he could view the document. His browser asked, and the server said, “Here you go, enjoy.”

ac September 23, 2005 3:39 PM

@Stephen

The browser did indeed ask if it could view the document, and the server did indeed say “Here you go, enjoy”. But if the owner of that document hadn’t specifically intended for that document to be read, then the web server was “speaking” without authority. Web servers don’t have legal rights that allow them to override the will of document authors. I suppose you think if you put the key in the car and the car agrees to start, that gives you the right to drive it?

@Ari

Good question. If you find something on the web, how do you know whether it was published intentionally or leaked accidentally? Most people don’t have to worry about this: they just read the information they find, but don’t create derivative works. If you’re a researcher, presumably you could do some research to figure the matter out. Some questions can’t be resolved by an automated process; sometimes you need to pick up a phone.

publicinfo September 23, 2005 4:16 PM

@ac

This is more about ethics than the fact that the information was publicly available.

A better physical analogy (than the car) would be an instructor who leaves the answers to a test on the classroom desk, out in the open, in plain view, to be read by any and all students who happen to pass by.

While there is no question that the information is in the “public domain”, the only issue is the ethics of those passing by who can view the information, and what they choose to do with it.

havvok September 23, 2005 4:22 PM

This is not an issue of ‘if the key is in the ignition’, it is an issue of ‘I left a paper in public view’.

Let’s say one day I am at the library and by dumb luck I discover a mechanism for quickly and easily factoring arbitrarily large numbers 🙂

I am so impressed with myself that I grab my papers and run back to the lab to start writing my research paper. When I get there, I decide to hold off on my announcement; why just release a paper on factoring large numbers when I could announce a far more noticeable (at least to the media) break of RSA? Little do I know that I left a piece of scrap paper on the desk where I was working.

Some other brilliant researcher finds my notes and is inspired by them and figures out the factoring method as well. Being far less concerned about funding and fame, this student announces the finding to the appropriate forum.

Although I was the first to discover it, the person who announces it will be credited. Such credit has occasionally been reversed later in the game, but this is not usually the case. The reason for these practices is to support the foundation for IP: if you are not the first person to announce and protect the idea, then you cannot patent it. This is why ideas cannot be patented after they are in the public domain, or if there is prior art (of course, in theory :P)

This is not the first time, and will not be the last, that a scientist withheld information in order to complete research before announcing a finding, and then lost the opportunity to take credit for it.

Whether or not the Spanish scientists did anything unethical is open to debate, but a release of all the pertinent information, including access logs and the like, should clarify whether or not this was an unethical action.

I am more inclined to think that the Spanish team had discovered it as well, and then panicked and rushed to announce when they discovered that someone else knew about the object.

Saar Drimer September 23, 2005 4:29 PM

Perhaps robots.txt should exclude everything by default and then the user would need to edit it to allow directories the bots are allowed to go to (and learn what robots.txt is and use it correctly.)

This would not, of course, prevent people from getting to these directories/resources if they knew where they were, but they wouldn’t be able to get there by Googling or using regular search engines.
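A default-deny robots.txt along those lines might look like the following. This is just a sketch: the /public/ path is an illustration, and the file only keeps out crawlers that choose to honor it.

```text
# Applies to all compliant crawlers
User-agent: *

# Open only what is explicitly meant to be indexed.
# ("Allow" is an extension supported by the major engines,
# not part of the original 1994 robots.txt convention.)
Allow: /public/

# Everything else is off-limits by default.
Disallow: /
```

As the follow-up comments point out, this does nothing against a visitor who already knows the URL; it only stops the content from showing up in search results.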

publicinfo September 23, 2005 5:07 PM

@Saar

“Perhaps robots.txt should exclude everything by default and then the user would need to edit it to allow directories the bots are allowed to go to (and learn what robots.txt is and use it correctly.)”

Of course this only works if the search engine crawler follows the robots.txt settings.

If one puts information in a public location (i.e., on a publicly accessible web server), then one can only assume that someone will eventually find it. Putting up a sign that says “don’t read this information over here, but it’s OK to read that information over there” only works if those reading this publicly accessible information are ethical enough to read and follow the signs.

If one doesn’t want information to be found or read, then don’t put it in a public place to begin with. Or enforce simple authentication. In the example, instead of using robots.txt to stop the search-engine crawlers, why not just require authentication for the information you don’t want found or read?

Roy Owens September 23, 2005 5:32 PM

Stephen had it right. If the document and its directory have wide-open read permissions, then the world gets to read it, period.

On the honesty issue, Ortiz should have credited Brown for the information. There’s no excuse for uncredited use.

targetpractice September 23, 2005 6:11 PM

@publicinfo

A better analogy than a professor leaving the notes on a desk would be the professor leaving the notes with a TA, and one of the students asking the TA for the notes during the test, and receiving them.

Saar Drimer September 23, 2005 6:13 PM

@publicinfo,
Agreed. I was suggesting a solution that could be applied now (without enforcing much) to solve the problem described: google indexing “unintended” information.
Obviously, more sophisticated measures would work much better at protecting information.
AFAIK, “regular search engines” adhere to robots.txt rules, others may not, as I mentioned.

Raluca Musaloiu-E. September 23, 2005 6:47 PM

More than that: with the Google cache, you can sometimes access even protected data. For example, when I was looking for a paper from Usenix ’05, “A Tool for Automated iptables Firewall Analysis”, I was pointed to a pdf file which requires a password: http://www.usenix.org/events/usenix05/tech/freenix/full_papers/marmorstein/marmorstein.pdf

I didn’t have it so I tried to google the url http://www.usenix.org/events/usenix05/tech/freenix/full_papers/marmorstein/marmorstein.pdf and I got the document from the Google cache:

http://www.google.com/search?q=cache:zdyG8ROXSdQJ:www.usenix.org/events/usenix05/tech/freenix/full_papers/marmorstein/marmorstein.pdf+&hl=en

Bruce Schneier September 23, 2005 8:56 PM

“If you find something through google how do you know if it’s meant for public use or not?”

Exactly. The problem is that Google finds things that are available for public use, regardless of how they were meant. You could take a “buyer beware” sort of position and say that if the data owner makes his data searchable then that’s his problem, but I don’t think that’s fair or right.

Sam September 24, 2005 4:30 AM

The car with the keys in it analogy does not apply.

The only practical assumption is that information posted on public servers is meant for public consumption. Period. If it’s not public, you must require some sort of authentication. Then if someone hacks the authentication, you can legitimately cry foul. And robots.txt only affects robots. A human could still click through to the information, and it would be perfectly all right.

Of course, the use made of the legitimately obtained public information is another matter. In this case, Ortiz should have been upfront about revealing that he had seen Brown’s postings.

Jungsonn September 24, 2005 6:30 AM

mmm…

Why would an astronomer post logs of his observations in a directory on his server?
What is put on the Internet is basically a free-for-all. There is almost no way to establish priority, online or offline, except by searching patent databases, and those are so enormous that it is virtually unthinkable to press such a claim; maybe someone already found something like it and already has a patent on it. Where do you start looking?

Protecting webservers is one thing; putting sensitive information on them is another. The best way to protect that info is to not put it on a webserver at all, and keep it in your head.

my 2 cents.

David Harmon September 24, 2005 8:11 AM

The issue of “how do you know it’s for public use” is difficult. It’s also irrelevant to this case, but just to get it out of the way:

1) The vast majority of users are flatly not competent to manage security on the machines they use. This is not going to change anytime soon.

2) Computers don’t “do” privacy, ethics, or even “security” as such. Computers are not persons, and they cannot carry responsibility! In general, they act like “infrastructure”, rather than “agents”. By analogy, consider a file-cabinet. You can lock a cabinet, but neither the cabinet nor the lock cares if: (a) the lock is trivially pickable, (b) all cabinets of the type have the same key, (c) the back of the cabinet is made of cardboard.

In this case, the issue is not security, but professional (scientific) ethics. The usual custom of science is to share information freely, but also to give credit for shared information. The main exceptions to information sharing are precisely those cases where credit, or publication priority, may be at issue, because those are the “prizes” of the endeavor, and it’s unrealistic to expect perfect behavior from humans when prizes are in contention. As usual for any human group, the detail standards of ethics are ultimately decided by the community as a whole. Various councils, boards, or task groups may issue declarations, but those have authority only as granted by the community in question.

Now consider that scientists, by definition, are trained to make the most of small fragments of information — say, the targeting logs for a heavily computerized telescope. The ethical standards held by scientific communities naturally take account of that training!

Those logs were widely available because there are many perfectly legitimate uses for the information — but by the nature of things, there are also less-legitimate uses. Using them the way the Spaniards did here, is at least dubious in terms of scientific ethics. I, personally, consider it unethical, but ultimately the question will be decided by the astronomical community.

jammit September 24, 2005 2:42 PM

I have no excuse because this is one of my favorite sites: http://johnny.ihackstuff.com/index.php?module=prodreviews
I’ve always been curious about the legalities of Google just happening to find something on the intarweb. It’s not like Google is sniffing packets and running password crackers. I remember when Google kept getting cease-and-desist letters from various places just because they had a site (most likely a pirate site) indexed. Google then replaced the indexed site with a page linking to the C&D order, which had the “bad” webpage listed; a simple cut-and-paste was all that was needed, and the C&D provided it. What Mr. (Professor?) Ortiz did was wrong, and he should be beaten (or scolded harshly) for what he’s done, but Google isn’t to blame for having it, and Mr. (Professor?) Brown needs to be a little more careful.

Vance September 24, 2005 11:14 PM

How long before astronomers start “accidentally” leaving sets of coordinates on public servers which point to nothing but empty sky?

another_bruce September 25, 2005 11:55 AM

duuude! don’t put your secret data on a publicly accessible server! don’t leave your keys in the ignition! don’t run with scissors!

anonymous September 26, 2005 2:09 AM

don’t put your secret data on a publicly accessible server!

another_bruce has the best solution of all

Arturo Quirantes September 26, 2005 4:55 AM

Here is my view, FWIW. I happen to know the Spanish astronomer; we shared classes at the university! I have seen him in action; he’s the kind of guy who will not move until he has checked and re-checked everything. If I know him, he was just doing that: making sure he had all the data straight, not stealing anybody’s first. He has had his firsts, too. Remember the Shoemaker-Levy impacts on Jupiter years ago? He was the guy who made the pictures, here in Spain.

I’m confident that he acted in good faith. Of course, I know him, so I might be biased towards him. Hey, I might even be him, disguised under a fake name. Well, talking about paranoia…

Phillip Hofmeister September 26, 2005 11:51 AM

@Kevin Davidson

“When I am given access to restricted content on a web site, I routinely search Google to see if it’s indexed–and it sometimes is.”

Which is why God, through the members of the Apache foundation, in His infinite Wisdom, created .htaccess and .htpasswd files.

If people don’t want their content accessed, they should USE them. Don’t rely on robots.txt files or meta tags. Set up a small amount of security and it will keep spiders (and the public) out!

(Am I the only one who thinks this is a DUH case?)

BTW, I do agree with those who say academic ethics require the disclosure of all sources used. This is a case of academic dishonesty, but not a crime.
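For the record, a minimal setup of the kind described above might look like this. It’s only a sketch: the realm name and file paths are made up, and the server’s Apache configuration must permit overrides (AllowOverride AuthConfig) for the directory.

```text
# .htaccess placed in the directory to be protected
AuthType Basic
AuthName "Private observation logs"
# Password file, kept outside the web root (hypothetical path);
# create it with: htpasswd -c /home/observer/.htpasswd username
AuthUserFile /home/observer/.htpasswd
Require valid-user
```

Basic auth sends credentials essentially in the clear, so without SSL it is only a speed bump, but it is enough to keep out both spiders and casual Googlers.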

Chewie November 2, 2005 8:58 AM

If you put a blackboard on the façade of your house with sensitive data written on it, you can’t expect people walking down the street not to stop and read it, whether or not you “meant” to put it there.

If those people later take advantage of that data without giving credit to the blackboard’s owner, that’s another matter.

Computational February 16, 2006 6:22 PM

It’s really amazing how many quality sites are still unindexed by Google even though they’re more than six months old; some of them have more quality content than the listed ones.

IMU February 21, 2006 6:29 AM

Simple: security is the obligation of the administrator of the server. No need to look further. If I drive by your house and the door is open, and I happen to be running a video camera that sees into your house, is that my fault or yours?
