The ever-expanding job of preserving the internet's backpages

Within the walls of a beautiful former church in San Francisco’s Richmond district, racks of computer servers hum and blink with activity. They contain the internet. Well, a very large amount of it.

The Internet Archive, a non-profit, has been collecting web pages since 1996 for its famed and beloved Wayback Machine. In 1997, the collection amounted to 2 terabytes of data. That was colossal back then; today it would fit on a $50 thumb drive.

Today, the archive’s founder Brewster Kahle tells me, the project is on the brink of surpassing 100 petabytes – approximately 50,000 times larger than in 1997. It contains more than 700bn web pages.

The work isn’t getting any easier. Websites today are highly dynamic, changing with every refresh. Walled gardens like Facebook are a source of great frustration to Kahle, who worries that much of the political activity that has taken place on the platform could be lost to history if not properly captured. In the name of privacy and security, Facebook (and others) make scraping difficult.

News organisations’ paywalls (such as the FT’s) are also “problematic”, Kahle says. News archiving used to be taken extremely seriously, but changes in ownership or even just a site redesign can mean disappearing content. The technology journalist Kara Swisher recently lamented that some of her early work at The Wall Street Journal has “gone poof”, after the paper declined to sell the material to her several years ago.

As we start to explore the possibilities of the metaverse, the Internet Archive’s work is only going to get more complex. Its mission is to “provide universal access to all knowledge”, by archiving audio, video, video games, books, magazines and software. Currently, it is working to preserve the work of independent news organisations in Iran and is storing Russian TV news broadcasts. Sometimes keeping things online can be an act of justice, protest or accountability.

Yet some challenge whether the Internet Archive has the right to provide the material at all. It is currently being sued by several major book publishers over its “Open Library” lending platform, which allows users to borrow a limited number of ebooks for up to 14 days. The publishers argue it is hurting revenue.

Kahle rejects that argument. He likes to describe the task of the archive as being no different from that of a traditional library. But while a book doesn’t disappear from a shelf if the publisher goes out of business, digital content is more vulnerable. You can’t own a Netflix show. News articles are there for only as long as publishers want them to be. Even songs we pay to download are rarely ours; they’re simply licensed.

Set up so that it doesn’t rely on anyone else, the Internet Archive has created its own server infrastructure, much of it housed within the church, rather than use a third-party host such as Amazon or Google. All this comes at a cost of $25mn a year. A bargain, Kahle says, pointing out that San Francisco’s public library system alone costs $171mn.

Unless we think today’s first draft of history isn’t worth preserving, the internet’s disappearing acts should trouble us all. Consider how hollow coverage of Queen Elizabeth’s death would have been had it not been illustrated with profound archival material.

Can we say with any confidence that the journalism produced around her death will be as accessible even 20 years from now? And what of all the social media posts made by everyday people? We will come to regret not competently preserving “everyday” life on the internet.

Dave Lee is an FT correspondent in San Francisco
