## Third Parties Controlling Information

Wine Therapy is a web bulletin board for serious wine geeks. It's been active since 2000, and its database of back posts and comments is a wealth of information: tasting notes, restaurant recommendations, stories and so on. Late last year someone hacked the board software, got administrative privileges and deleted the database. There was no backup.

Of course the board's owner should have been making backups all along, but he has been very sick for the past year and wasn't able to. And the Internet Archive has been only somewhat helpful.

More and more, information we rely on -- either created by us or by others -- is out of our control. It's out there on the internet, on someone else's website and being cared for by someone else. We use those websites, sometimes daily, and don't even think about their reliability.

Bits and pieces of the web disappear all the time. It's called "link rot," and we're all used to it. A friend saved 65 links in 1999 when he planned a trip to Tuscany; only half of them still work today. In my own blog, essays and news articles and websites that I link to regularly disappear -- sometimes within a few days of my linking to them.

It may be because of a site's policies -- some newspapers keep only a couple of weeks of articles on their websites -- or it may be more random: Position papers disappear off a politician's website after he changes his mind on an issue, corporate literature disappears from the company's website after an embarrassment, etc. The ultimate link rot is "site death," where entire websites disappear: Olympic and World Cup events after the games are over, political candidates' websites after the elections are over, corporate websites after the funding runs out and so on.

Mostly, we ignore the issue. Sometimes I save a copy of a good recipe I find, or an article relevant to my research, but mostly I trust that whatever I want will be there next time. Were I planning a trip to Tuscany, I would rather search for relevant articles today than rely on a nine-year-old list anyway. Most of the time, link rot and site death aren't really a problem.

This is changing in a Web 2.0 world, with websites that are less about information and more about community. We help build these sites, with our posts or our comments. We visit them regularly and get to know others who also visit regularly. They become part of our socialization on the internet and the loss of them affects us differently, as Greatest Journal users discovered in January when their site died (http://barry095.vox.com/library/post/greatest-journal-death.html).

Few, if any, of the people who made Wine Therapy their home kept backup copies of their own posts and comments. I'm sure they didn't even think of it. I don't think of it, when I post to the various boards and blogs and forums I frequent. Of course I know better, but I think of these forums as extensions of my own computer -- until they disappear.

As we rely on others to maintain our writings and our relationships, we lose control over their availability. Of course, we also lose control over their security, as MySpace users learned last month when a 17-GB file of half a million supposedly private photos was uploaded to a BitTorrent site.

In the early days of the web, I remember feeling giddy over the wealth of information out there and how easy it was to get to. "The internet is my hard drive," I told newbies. It's even more true today; I don't think I could write without so much information so easily accessible. But it's a pretty damned unreliable hard drive.

The internet is my hard drive, but only if my needs are immediate and my requirements can be satisfied inexactly. It was easy for me to search for information about the MySpace photo hack. And it will be easy to look up, and respond to, comments to this essay, both on Wired.com and on my own blog. Wired.com is a commercial venture, so there is advertising value in keeping everything accessible. My site is not at all commercial, but there is personal value in keeping everything accessible. By that analysis, all sites should be up on the internet forever, although that's certainly not true. What is true is that there's no way to predict what will disappear when.

Unfortunately, there's not much we can do about it. The security measures largely aren't in our hands. We can save copies of important web pages locally, and copies of anything important we post. The Internet Archive is remarkably valuable in saving bits and pieces of the internet. And recently, we've started seeing tools for archiving information and pages from social networking sites. But what's really important is the whole community, and we don't know which bits we want until they're no longer there.

And about Wine Therapy, I think it started in 2000. It might have been 2001. I can't check, because someone erased the archives.

This essay originally appeared on Wired.com.

scripted lynx user • February 27, 2008 6:42 AM

I decided, two years ago, to store a copy of every text I read on the internet. I browse without images, and use

script -q -c bin/.scripted.lynx -a ~/log.of.lynx

where the file bin/.scripted.lynx contains:
#! /bin/sh
exec 3>&1
TERM=vt100 /usr/bin/lynx -nopause -preparsed -trace -tlog 2>&1 >&3 3>&- | tr '\233\012a' '\012a\233' | sed 's/aWriting:a\(POST[^a]*\)a\(\(.[^a]*\\ra\)*\)\\ra\([^a]*\)a-----/aHTAccess: loading document \1 with \4a/' | tr '\012a\233' '\233\012a' | grep '^HTAccess: loading document' 3>&-
wait
exec 3>&-
wait

I have 200 MB/year of internet logs. I look at them on average twice a month.

Warning: I use an enhanced version, because this script cannot be parallelised. From time to time the screen has to be refreshed manually.

This also worries me about things like webmail (where I can either leave messages there or download them, but not both) and tax preparation (Ohio has a tax form website, but it's all stored on their system; if -they- lose the info, -I- am the one in the dock).

And pinging off on a tangent, why do I have to pay some 3rd party (TurboTax, TaxCut, etc.) in order to save the IRS money by electronically filing my taxes?

davetweed • February 27, 2008 8:13 AM

@scripted lynx user:

That looks interesting, but "I browse without images" is a potential problem for me since most math blogs/wiki entries/etc contain formulae as inline images. I wonder if somehow grabbing the rendered pages and saving them as DjVu would be both readable and not take up too much space? Maybe I'll experiment...

Grumpy Physicist • February 27, 2008 8:34 AM

The European Dark Ages weren't (of course) literally "dark", but the term is apt because records of those times either weren't produced, or if produced, rarely survived to the present era.

Welcome to the new, improved Digital Dark Ages, where even more information is produced, but it's in digital form.

And the backups, if kept, are unreadable.

...maybe I can modify a dot-matrix printer to print on clay tablets. The bit density sucks, but the longevity is good.

@davetweed

Maybe the Page Saver plugin for Firefox will do what you want.

Regarding the question of saving posts or comments. Does this remind anyone else of Anne and Lynn Wheeler's garlic.com?

So, Bruce... how often is the archive for *this* blog backed up? :)

Nomen Publicus • February 27, 2008 1:55 PM

Recently I got a message from a former user of one of our computer systems. Apparently the user had somehow managed to lose their entire web site through a combination of hardware failures.

Did we by chance have any backups of the site when it was hosted on our system?

It was unlikely, but I checked to make sure and no, all the backups from that time had been overwritten.

The best I could do was to point the user to the Wayback machine which did have a snapshot of most of the lost site.

But the situation got me thinking. We've upgraded the storage since we hosted the lost site and deleted all the old, "unwanted" data. But, the new storage is over 10 times the size of the old, so we could have kept all the old data in less than 10% of the new storage.

I've now recommended that should we repeat the expansion, we should not delete the old data even if there seems to be no reason to keep it.

I estimate that given the cost reduction of disk storage and the rate of growth of data, we could afford to keep final copies of _everything_ over the 20 years of the service in about 20% of the current storage.
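The estimate above can be sketched as a geometric series. Assuming, hypothetically, that every upgrade multiplies capacity by the same factor r and that each retired system was completely full, the share of the newest storage needed to keep all earlier generations is 1/r + 1/r^2 + ..., which never exceeds 1/(r - 1). The function name and the constant growth factor are illustrative assumptions, not figures from the comment.

```python
def retained_fraction(growth_factor, generations):
    """Upper bound on the share of the newest storage needed to keep
    the full contents of every earlier generation.

    Assumes (hypothetically) that capacity grows by `growth_factor`
    at each upgrade and that each retired system was completely full.
    """
    # The system retired k upgrades ago had 1 / growth_factor**k
    # of today's capacity, so that is the most it can occupy now.
    return sum(growth_factor ** -k for k in range(1, generations + 1))


# Compare two assumed growth factors over 20 generations.
for factor in (10, 6):
    print(factor, round(retained_fraction(factor, 20), 3))
```

With a factor of 10 per upgrade, one retired generation takes 10% of the new storage, and no number of further generations pushes the total past about 11%. A factor of 6 gives a bound of 20%, in the same range as the estimate above, though the comment does not say what growth it assumed.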

Kara McNair • February 27, 2008 2:01 PM

Which is why I love that Macs come with built-in "Print to PDF" support. A few bits on my drive gives me so much joy & peace of mind...

Lesser Silence • February 27, 2008 2:26 PM

I know of at least two old guys who keep masssssssive archives of everything they laid their eyes on, from newsgroups to image libraries (;

Anonymous • February 27, 2008 4:42 PM

"we should not delete the old data even if there seems to be no reason to keep it"

Precisely why (Google|Yahoo|MS|*) scares the !#@K outta me...

Clive Robinson • February 27, 2008 4:45 PM

@Paul Harison

Links in Bruce's blog do disappear quite rapidly, and at a rate considerably greater than on other blogs.

The reason is simple and understandable.

As an example I posted a link to a PDF of a book chapter on explosives. It was one of the best I had found.

It was taken down due to the old "aiding and abetting the enemy" principle.

The simple fact is that links posted on this site tend to be of that type, and due to the popularity of the blog the linked pages get a sudden increase in hits.

Perhaps we should call it "The Bruce effect". I think it might just catch on 8)

Phil M • February 27, 2008 7:53 PM

This wasn't a problem when all the discussions that now happen on any one of thousands of Web forums were instead conducted on Usenet.

I can only dream that someday the major Web forum programs become front ends for NNTP.

I'm sorry, but doesn't it bother you more when crucial information is purposefully destroyed, than when web communities disappear?

I find it more frightening that the whitehouse.gov webmaster can now do more in minutes than the entire Ministry of Truth did in weeks.

AmbroseChapel • February 27, 2008 9:12 PM

I have a script which grabs new discussion thread posts from a forum website, every hour on the hour, and puts them into a database.

I thought I was being overly obsessive, but it turns out there are a couple of other people who do the same -- basically we were all sick of having stuff deleted or lost to errors and failures.
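A minimal sketch of this kind of hourly archiver, in Python with the standard-library `sqlite3` module. This is not AmbroseChapel's actual script: the table layout, the `open_archive` and `store_posts` names, and the shape of each post (a dict with `post_id`, `thread`, `author` and `body`) are all assumptions, and fetching and parsing the forum pages is left out because it depends entirely on the forum software.

```python
import sqlite3


def open_archive(path=":memory:"):
    """Open (or create) the archive database. The schema is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS posts (
               post_id TEXT PRIMARY KEY,  -- the forum's own id for the post
               thread  TEXT NOT NULL,
               author  TEXT NOT NULL,
               body    TEXT NOT NULL,
               fetched TEXT DEFAULT CURRENT_TIMESTAMP
           )"""
    )
    return conn


def count_posts(conn):
    """Return the number of archived posts."""
    return conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]


def store_posts(conn, posts):
    """Archive posts not seen before; return how many were new.

    `posts` is a list of dicts produced by a (hypothetical) parser.
    Rows whose post_id is already present are skipped, not overwritten.
    """
    before = count_posts(conn)
    conn.executemany(
        "INSERT OR IGNORE INTO posts (post_id, thread, author, body) "
        "VALUES (:post_id, :thread, :author, :body)",
        posts,
    )
    conn.commit()
    return count_posts(conn) - before
```

Run from cron once an hour (for example with a `0 * * * *` schedule), each pass would hand the freshly parsed posts to `store_posts`. Because existing rows are never overwritten, a post that is later edited or deleted on the forum stays in the archive as first seen.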

DoClueMe • February 27, 2008 10:17 PM

Please forgive the imposition of a question, posed while there are likely to be active linkers about.

Are links in these comments made through HTML tag coding by the comment authors, or is there an automatic process?

There's the bit from Marx (Lenin? sorry, I only know the
gist) that says "Workers must control the means of production." I'm
not a Communist, but I firmly believe that this has a 21st century
equivalent: that publishers *must* control the means of publication.
That's why I run my own web and mail server: because once those
functions are outside my control, I'm dependent on the good will of
loves his job. So there's that, too. :-)

This is only part of the problem, of course; you point out the number
of resources we depend on that are outside our control, and @scripted
lynx user has the right solution for that, and my ISP might shut me
down for any number of reasons. But what *I* publish on my silly blog
is *mine* to control, and *that* part of problem is something I can solve.

Duncan Kinder • February 27, 2008 11:49 PM

What all this boils down to - and as the Wine Therapy debacle demonstrates - is that Internet information is more like Beaujolais than Burgundy.

I'm afraid that mass saving of pages makes one vulnerable to copyright infringement lawsuits. Consider the case where a notebook's drive is searched at customs and a lot of files that look like copyrighted works copied from the Internet are found on it.

I have always preferred mailing lists to web-based forums. Now I have an even better-reasoned basis for the preference: I can keep whatever part of the exchanges on the list I like, under my own control, with access through my own indexing and search software.

ZaD MoFo • February 28, 2008 2:03 AM

Wouldn't it be a nice idea to have, some day, a virtual flea market of old chat sessions or vintage .html pages (the... no script nor Java)? Better still, an entire era recreated by volunteers to help our kids see what the Web was before browser head implants...

I must admit, I have "recorded" webpages since 1995 (those worth reading) and my very first email "circa 1984".
I must be nostalgic :-)

Eric Crampton • February 28, 2008 2:55 AM

Problems with linkrot? Use Furl. It saves copies of any page you like, gives you a searchable archive of your saved pages, and you can download your archive any time you like.

Alan Jenkins • February 28, 2008 2:56 AM

NickB: me too.

Unfortunately, at least for reading, I find real newsgroup access a bit of a hassle compared to Google Groups.

For web forums, email subscriptions go some of the way. I regularly use a forum where new messages on threads I've posted to or explicitly subscribed to are emailed to me. That doesn't include my posts though, and once it's sent an email, it doesn't send another for the same thread until you re-visit the thread.

OT @ DoClueMe: "Are links in these comments made through HTML tag coding by the comment authors, or is there an automatic process?"

The comments are not automatically linkified, and no kind of markup is supported, AFAIK. If you see clickable URLs, you are most likely using some kind of browser plugin that does it for you.
The only exception is the URL field, which converts your name into a clickable link to the URL that you supply.

Ian Eiloart • February 28, 2008 4:19 AM

I used to subscribe to the UK's Consumer Association web site. They're consumer advocates, and publish "Which?" magazine.

They had a useful collection of historical articles, which I paid £5 per month to see - in preference to paying a little bit more to get the magazine. I regularly used to check back to see how a company rated through time - for example to find an ISP who had a consistently good reputation.

One day, they revamped their web site. All the historic articles were removed. I felt like they'd been in my living room and stolen my magazine collection.

SteveJ • February 28, 2008 5:48 AM

@Bruce: "recently, we've started seeing tools for archiving information and pages from social networking sites. But what's really important is the whole community"

This isn't new. If my local pub closes, I lose a community in almost exactly the same way as if a social networking site goes down. We've learned how to live with that - either get the phone numbers of the people you regularly hang out with, or be prepared to make new friends in another pub.

What is new is that we don't keep copies of information that we're interested in or care about. People put their photos online instead of retaining their own copies.

Part of the new problem we know how to solve. To replace your boxes of photos and shelves of CDs, buy a bigger hard drive, and back it up properly. It's probably cheaper, and certainly makes it less work to move house.

What we don't know how to solve is Ian's problem immediately above. If you rent access to online information instead of buying a copy (either printed or to download), then you can't legally keep it yourself. This needs to be considered when subscribing to that kind of service, and consumers should refuse to pay more than pennies if they're offered temporary access to information where they really want permanent access.

The script from comment #1 looks very useful; I wish something like that existed as a Firefox plugin. Not only do links vanish frequently, but I also often forget where I read various bits of information... Currently I bookmark a lot of links I find interesting at del.icio.us, but very often it takes some time to realize that information I read was important, and by then I have already forgotten where it came from. Having that archived in searchable form on my PC would be very convenient IMHO, and not only for the vanishing-links problem.

Alan Porter • February 28, 2008 8:46 AM

> This is changing in a Web 2.0 world, with websites
> that are less about information and more about
> community.

Nope. Not me. I don't frequent blogs, and I never

Link rot and site death, certainly, but I've also been bitten by forgotten formats. The vendors control when the old app won't run on the new OS, not me, making the old files unavailable. (Hint: VMware is your friend.) I have also been burned by uncorrected and undetected data errors. I realize the failure rate is very low, but undetected errors are still errors, they still happen, and I nearly lost a dear photo. If it's important, back it up. Backups must be many and various and geographically dispersed.

@C

A newspaper is copyrighted; but it's not a violation to actually have the newspaper in your hand, to take clippings and put them in your scrapbook, make copies of articles for your own records. Making copies of articles and selling them, that's a violation of copyright.

DylanMorgan • February 28, 2008 9:31 AM

I think the most sinister aspect of link rot and the transitory nature of data on the internet is the memory hole effect. Consider the example of a position paper removed from a politician's website. What if it was removed not because of a reconsideration but out of a dishonest wish to alter the politician's appearance? If it had not been for mirrored archives of Usenet postings that in turn were copies of a physical newsletter, there would be no public record of the Ron Paul newsletter, and the racist essays published within.

Sociologically speaking, we seem to be moving beyond oral or written history, into something in between.

lynx script user • February 28, 2008 4:06 PM

@davetweed "since most math blogs/wiki entries/etc contain formulae as inline images."

I looked at a random page from planetmath.org, http://planetmath.org/encyclopedia/Argument.html, and found that all formulas carry the LaTeX code of their content ($f : {\mathbb{C}} \rightarrow {\mathbb{C}}$ for example) in their ALT attribute (in the HTML source), which lynx relies on. I reached the same conclusion for the first Google results for "math wiki entries".

But I agree that this is specific to mathematics. ALT tags are not so common in other communities, and the name of the image file is often not enough. Lynx is useless for tables and for plenty of applications like Google Maps. Still, lynx uses the same readable black bold font for all sites, cannot show flashing advertisements, and its plugins extract text from .pdf and .ps. This is why I chose to live with lynx. Someone else may replace Lynx with W3m to get images, or with Links to show tables properly.

If you insert in your comment a word starting with the seven characters h t t p : / / and ending with a space, it will become a link (experiment with the Preview button).

@nrq "I wish something like that existed as a Firefox plugin."

Look more carefully. Or switch to an HTTP proxy with full logging ability. Careful, your log will grow way faster than 200 MB/year.

@Alan Jenkins "I find real newsgroup access a bit of a hassle."

Switch to a newsreader with better ergonomics (flrn, ...).

script lynx user • February 28, 2008 4:15 PM

If I distribute the md5sum of a copyrighted work without the consent of the copyright holder, is it a copyright violation? If not, there should be a volunteer community that publishes md5sums (and other checksums) of every major newspaper article and major web page, just in case.
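The checksum idea could be sketched as follows, in Python with the standard-library `hashlib` module. The `fingerprint` name and the whitespace normalization are assumptions made for illustration; a real registry would have to agree on exactly which bytes of an article get hashed. A published digest cannot restore a vanished page, but it would let anyone holding a private copy show that it matches what was originally published.

```python
import hashlib


def fingerprint(text, algorithms=("md5", "sha256")):
    """Return hex digests of an article's text, keyed by algorithm name.

    Runs of whitespace are collapsed first (an assumption of this sketch)
    so that re-wrapped copies of the same text hash identically.
    """
    normalized = " ".join(text.split())
    data = normalized.encode("utf-8")
    return {name: hashlib.new(name, data).hexdigest() for name in algorithms}


def matches(text, published):
    """Check a saved copy against previously published digests."""
    current = fingerprint(text, algorithms=tuple(published))
    return current == published
```

SHA-256 is computed alongside MD5 here because practical collisions are known for MD5, so an MD5 digest alone is weak evidence that a copy is authentic.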

@Pierre THIERRY: "researchers already gave a thought to this problem"

This is a nice paper from 1997. Freenet is now deployed and does part of the job.

"Unfortunately, there's not much we can do about it. The security measures largely aren't in our hands. We can save copies of important web pages locally, and copies of anything important we post. The Internet Archive is remarkably valuable in saving bits and pieces of the internet. And recently, we've started seeing tools for archiving information and pages from social networking sites. But what's really important is the whole community, and we don't know which bits we want until they're no longer there."

Which is why you should just save everything, automatically. I use the Slogger Firefox extension to do this (just Google Slogger).

The result is subdirectories holding copies of every webpage I visited. Compress those into Zip files every few weeks, and index using something like DTSearch. Storage needs come out to about 3 GB/month.
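The periodic compression step might look like the sketch below, in Python using only the standard library. The directory layout (one subdirectory per batch of saved pages under a root folder) and the `compress_saved_pages` name are assumptions, not how Slogger actually organizes its output; the originals are left in place, and `archive_dir` is assumed to live outside `root`.

```python
import shutil
from pathlib import Path


def compress_saved_pages(root, archive_dir):
    """Zip each subdirectory of `root` into `archive_dir`.

    Returns the paths of the archives written. Nothing under `root`
    is deleted; removing the originals is left to the caller.
    """
    root, archive_dir = Path(root), Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    archives = []
    for sub in sorted(p for p in root.iterdir() if p.is_dir()):
        # make_archive appends ".zip" to the base name by itself and
        # stores file names relative to root_dir.
        made = shutil.make_archive(str(archive_dir / sub.name), "zip", root_dir=sub)
        archives.append(Path(made))
    return archives
```

An indexer would then be pointed at `archive_dir`; whether a given search tool can look inside Zip files is something to check for that tool.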

This is part of a bigger change which we're just starting to be able to see.

The definition of "history", in terms of a written record that can be reviewed in the future, is very different now from what it was even 30 years ago. The internet is only the latest piece of this -- though in sheer volume easily the largest -- but television, radio, printing, and other recording technologies have fundamentally altered how and what things are "retained" for historical purposes.

There are organizations out there trying to preserve digital material. The Internet Archive has been mentioned, but especially for scholarly purposes authors can archive webpages "on-demand" using WebCite (http://www.webcitation.org).

The primary problem is not that information is out of our control, it's that people don't do backups. I can't believe nobody has ever mentioned Subversion (SVN) in the last forty-something comments.

I make sure that my blog (http://gunther-eysenbach.blogspot.com/) and the comments associated with it are preserved by adding a widget (a dynamic link) to my blog which says "Cite this page!" and points to the WebCite archiving form (www.webcitation.org/archive).
WebCite is a member of the International Internet Preservation Consortium (of which the Internet Archive and many libraries are members). Whenever somebody wants to cite my blog, it is automatically archived for "eternity", and the citing author knows that whatever he cited will be available to the reader exactly the way he saw it.

Reimar • March 15, 2008 6:17 AM

@datenritter:
Actually, on this topic I would mention git (http://git.or.cz/). I use it to grab a full copy of every project I work on, including full history. The disadvantage is that it may take a few hours for large projects, but I like knowing that I (and probably others as well) have a full backup of something I spend a lot of my time on.
Fascinatingly, it usually even uses less disk space than an ordinary SVN checkout without any history...

"In my own blog, essays and news articles and websites that I link to regularly disappear -- sometimes within a few days of my linking to them."

The question is: Would they have disappeared had you not posted them? Maybe you're (involuntarily) playing an active role in destroying the Internet. ;-)