
Web archiving: the international arena (IIPC 2011)

10 May 2011

Last month we reviewed the Dutch web archiving landscape (blog in Dutch), and yesterday the international web archiving community descended on The Hague for the IIPC (International Internet Preservation Consortium) annual General Assembly. So, Dutch fans, for once a blog in English, to report on the kick-off, an open conference entitled ‘Out of the Box: Building and Using Web Archives’, which was held at the KB in The Hague on 9 May. Despite cuts in travel budgets, some 100 professionals attended the General Assembly, from all over the world. The day was divided into three sessions: Collection treasures; Web archiving short stories; and The use of web archives. Let me share some highlights with you.

Collection treasures – and how to organize the international treasure hunt

Early in the day, Gildas Ilien of the Bibliothèque nationale de France (BnF) articulated some fundamental questions facing (national) web archiving institutions: ‘The Web challenges our national boundaries and policies. What’s in? What’s out? There is a need to define consistent selection and scoping criteria while projects and events develop. And: who should take care of what is in between nations, or everywhere? What is the risk? What is the value?’ Ilien referred specifically to event harvesting. The challenges there are emergency, scope and collaboration.

Ilien’s case was illustrated by the 2011 revolutions in Tunisia and Egypt, which sparked a buzz of event archiving activity. In Egypt, the internet was first shut down completely on 25 January, but the moment the ban was lifted, on 9 February, the Library of Alexandria started crawling the web for information, pictures and video to document the Egyptian revolution. Across the ocean, the Library of Congress realized something was going on which deserved capturing. Abbie Grotke of the LoC (standing in for Kris Carpenter of the Internet Archive) reported how the LoC used the informal IIPC network to call around and find out who was capturing what. This led to an impromptu collaborative effort between the Library of Congress, the Internet Archive, the BnF, the British Library, the American University in Cairo and Stanford University to capture the events in the Middle East. They archived not only websites, but blogs and social media as well. As Arthur Thomas would say later in the day: ‘It is more accurate to talk about collections of stuff from the Internet.’

As inspiring as this impromptu collaboration surrounding events in the Middle East was, the fundamental questions posed by Ilien are yet to be answered: who does what, and how do we collaborate to capture a web that is fundamentally international and interinstitutional, and where, moreover, events develop faster than our institutional decision-making procedures can keep up with (emergency!)? We will no doubt continue the debate on these issues at the upcoming Aligning national approaches to digital preservation conference in Tallinn on 23-24 May. The slide to the right (by Carpenter, Grotke, Ilien) frames the important questions to be debated, including the many legal restrictions applicable to sharing data from web archiving activities – and the question of how we can organize central operators (such as the Internet Archive, maybe) and a possible World Watch.


The many faces of web archiving were demonstrated by two young archivists from the Czech National Library, Zuzanna Kratochvilova (left) and Lukas Gruber (right). After the turmoil in the Middle East, they reported that web archiving allows a national library to capture not only the traditional media, which specialize in bad news, but also many smaller websites that concentrate on good news, such as Czech citizens sharing their hobbies on the internet. ‘Content is more important for us than format’, Lukas told his audience, and so the team organized a ‘blog of the year’ contest to allow the public at large to help decide which blogs to include in the national collection. And Zuzanna could not help but smile when she conceded that memories from their own childhoods had prompted the team to select a site on a popular Czech children’s programme for archiving. It is part of the national cultural history.


Tjarda de Haan of the Amsterdam Museum enthusiastically recounted all the efforts that are going into recreating a milestone website, ‘The Digital City’ (De Digitale Stad), which was online from 1994 to 2001 and which was designed to help the general public learn to use the new possibilities of the internet. A huge cultural treasure, which tells the story of internet adoption in the Netherlands, was lost when the commercial interests which by then owned the site simply pulled the plug in 2001. The Amsterdam Museum is now doing everything it can to find whatever content has survived, along with period hardware and anything else that can be recovered. Their efforts include a true Gravediggers’ Party, to be held on 13 May in Amsterdam.

To conclude the ‘treasures’ section, Gillian Lee of the National Library of New Zealand reported on a small-scale project to capture the Canterbury Earthquake of 22 February, and Mark Williamson of Hanzo, a commercial web archiving company, showed how Hanzo archives Coca-Cola’s marketing history.

Web Archiving Short Stories – a parade of what we’ve got

In seven five-minute mini presentations, a wide array of web archiving projects was presented to the audience. A worthwhile show for anyone still doubting the need for web archiving. The web is where our lives are happening, the web is where our lives are being documented, the web is worth preserving. The parade included the Swiss, Croatian, Dutch, British and American national libraries and the Rotterdam Municipal Archive, and was concluded by a great film documenting an Internet Archive K12 project to teach students and teachers about the need for and use of web archiving. Memorable quotes from a high school teenager: ‘We don’t usually realize that we’ll be history in like … two years.’ and: ‘Fifty years from now someone will read a textbook and it might be about us.’

Use of Web Archives – is what we’ve got good enough for researchers?

Once all of this wonderful stuff has been captured from the web, the question remains how these treasures are being used. More often than not, copyright restrictions keep national libraries from putting the material online so it can benefit the general public. Which means that all that work is being done to serve researchers. What do they think about these collections? The organizers had invited social scientists to debate the question, beginning with Ralph Schroeder, Arthur Thomas and Eric Meyer (picture at right) from the Oxford Internet Institute. They recently completed a draft of a report entitled Web Archives: the Future(s). One of their conclusions: present web archives are harvesting content, especially html/http, while that part of the web is becoming less significant with every passing day. Increasingly the web is dominated by social platforms, location-based services and other very complex two-way and peer-to-peer mechanisms, and semantic web/linked data, which are extremely complicated in a technical sense and thus are hardly being archived at all. Meyer: ‘Virtual worlds are our social life.’ Social scientists want to study how the networks function and who is interacting with whom about what, and that information is not being recorded.

So much for taking pride in our treasures 😉

Anne Helmond and Esther Weltevrede, of the University of Amsterdam (in the first picture of this blog above), demonstrated the type of research the Oxford team refers to. They analyzed the Dutch blogosphere from 1999 to 2009. How did the network evolve? What platforms are being used by bloggers? (Increasingly Dutch platforms!) Etcetera. These are studies of the workings of the internet itself rather than the content it offers.

The Oxford team had a number of recommendations for web archives: try to archive server logs; try to archive traffic itself (an ambitious project!); allow researchers to trigger (the frequency of) crawls; work on mechanisms to organize the archiving of small and big events (let machines trigger the harvesting, perhaps); think in terms of collections and collections of collections rather than individual sites. Quite an agenda! But Eric Meyer had more down-to-earth advice as well for institutions wishing to promote use of their web resources: ‘The best way to promote use of your digital resources is to provide examples of meaningful uses by others.’ and: ‘Institutions must put more effort into advertising what they have got.’ Gildas Ilien also recommended keeping in touch with the general public. ‘Not many people will use the web archive, but they will pay taxes for it if they know about it.’

Inevitably, the question of selection came up in the discussions. How do researchers feel about selection? The Oxford team: researchers increasingly say: collect everything, because we do not know what the future will need. The fact that we cannot presently analyze and make meaningful use of this material should not stop us from collecting. Arthur Thomas drew a comparison with his old field: biology. It all started with people, often amateurs, collecting specimens. These collections were incomplete but they often generated great ideas, such as evolution and classification. Gildas Ilien of the BnF agreed with this philosophy: ‘We should collect as much as possible – do not wait until you lose it.’

‘Can we please everyone?’ asked Helen Hockx-Yu of the British Library rhetorically when describing a BL project to encourage use of the web archives and get feedback from researchers. As yet, the question remains unanswered. But help is on the way, as represented by Paul Girard of the largest French social sciences lab (Sciences Po). He described an open source software project, the ‘Hypertext Corpus Initiative’, in which a community of researchers and web experts collaborates to develop mechanisms that will enable researchers to sieve the massive amounts of data from the internet and find exactly what they need, without needing advanced technical skills. In so doing they hope to deal with the quantity/quality dilemma. [NB: a corpus consists of web entities.]

The panel: ‘Document everything about your web archive’

The concluding panel (photo at the top of this blog), ably chaired by Martha Anderson of the Library of Congress, included the researchers mentioned above and two web archivists: Birgit Nordsmark Henriksen of the Danish National Library and Cathy Hartman of the University of North Texas. The Danish KB carries out both an annual .dk domain harvest (without quality checks) and more selective web archiving projects with better quality assurance – researchers specifically ask for experimental sites to be harvested. The University of North Texas started harvesting government websites when nobody else was doing it (that is vision for you), and in so doing it gained an important position in the US web archiving community. Cathy explained that their web archive has now become an important part of the university’s identity: it serves the curriculum and attracts students.

In the discussions it transpired that what researchers need most from web archives is to know What is In the Box. How was the collection assembled? What was included and what not, for what reasons? Which harvests failed? Which breakdowns occurred at what times? Even curation activities and preservation actions should be carefully documented. All in the interest of sound research. Preferably, also, researchers should be enabled to use their own tools to analyse the data. And Paul Girard dreams of an index of all web archives in the world.


Dark Archive or Online Access?

As mentioned before, most web archives are either closed entirely or restricted to on-site access because of copyright and privacy regulations. The institutions are not happy with this situation, but they are reluctant to break laws, no matter how old-fashioned those laws may be. But Martha Anderson had some encouraging news for the group. The Library of Congress has kept meticulous records of the use of its web archives over the past ten years, and now it seems the lawyers are beginning to agree that opening up the archives may not be such a dangerous thing to do. We hope to hear from the LoC soon.

And the Dutch KB had good news as well: at the end of the day it opened up its web archive for the first time. As not all technical and legal problems have been sorted out yet, access is limited to 700 websites which have passed (manual!) quality control. Another 2,300 websites will be added as they too pass quality checks. In about three years’ time, 10,000 Dutch websites should be harvested regularly. As for on-line access: the KB will strive to provide as much on-line access as possible in due course.

The IIPC General Assembly continues this week with a number of technical workshops – I hope that someone else, who is more knowledgeable when it comes to technical matters, will report on those.

(See also next blog.)

