Emerald Group Publishing Limited
Copyright © 2004, Emerald Group Publishing Limited
Archive of Web sites
Compiled by Monica Blake
Little attention has been paid to the long-term preservation of Web sites. With the life of an average Web site estimated to be around 44 days (about the same lifespan as a housefly), there is a danger that invaluable scholarly, cultural and scientific resources will be lost to future generations. To address this problem, a consortium of six leading UK institutions is working collaboratively on a project to develop a test-bed for selective archiving of UK Web sites.
The UK Web Archive Consortium (UKWAC) – comprising The British Library, Joint Information Systems Committee of the Higher and Further Education Councils (JISC), The National Archives, The National Library of Wales, the National Library of Scotland and the Wellcome Trust – will run for an initial period of two years, during which approximately 6,000 Web sites will be collected and archived.
Consortium members will obtain the permission of Web site owners to archive selected sites whilst working collaboratively to explore how to develop compatible selection policies and to investigate the complex technical challenges involved in collecting and archiving Web material.
Each consortium member will select and “capture” content relevant to its subject and/or domain. For example, the British Library will archive sites reflecting national culture and events of historical importance. These could include Web pages focusing on key events in national life, museum Web pages, e-theses, selected blogs to support research material and Web-based literary and creative projects by British subjects. The Wellcome Trust will preserve a record of medicine on the Web, whilst The National Archives will focus on archiving selected materials from six main clusters of government departments. The Scottish and Welsh national libraries will collect material reflecting the culture and history of Scotland and Wales, and JISC will preserve Web sites from leading-edge, innovative ICT projects in UK higher and further education.
Infrastructure costs, such as software, hardware and ongoing technical development and support, will be shared equally amongst the consortium members. UKWAC will use HTTrack, an open source Web crawler, to acquire files for storage. The software to carry out the archiving processes, the PANDORA Digital Archiving System (PANDAS), has already been developed and tested by the National Library of Australia and its partners for archiving Australian Web sites and making them accessible through PANDORA, the Australian national Web Archive (see: http://pandora.nla.gov.au/index.html). PANDAS can be set to automatically tag, gather and prepare pages for public display. If pages are not suitable for immediate public access for commercial, cultural or privacy reasons, PANDAS can manage appropriate access restrictions.
UKWAC members have selected Magus Research Limited to help extend the PANDAS software for UK needs and provide the shared hardware and technical support they require.
David Thomas, Head of Government and Technology at The National Archives, said: “From government organizations posting travel advice to newlyweds putting their wedding photos online, Web sites provide a unique insight into the political and social world we live in today. Through collaboration in the UKWAC, The National Archives is taking steps to ensure that government Web sites are preserved for future generations.”
Lynne Brindley, Chair of the Digital Preservation Coalition and Chief Executive of The British Library, said: “The launch of UKWAC is an essential step in helping us to understand the scope of the UK Web space and how we can set about developing a selective yet useful national Web archive. Initially this will be on a voluntary basis, although it is anticipated that secondary legislation will, in due course, allow the BL – and the other legal deposit libraries – to collect Web materials. Working with other UKWAC members, we can make real progress in developing complementary selection policies, exploring the best ways to collect and archive Web materials and refining how we work together.”