UK Web Archive blog

Information from the team at the UK Web Archive, the Library's premier resource of archived UK websites


Introduction

News and views from the British Library’s web archiving team and guests. Posts about the public UK Web Archive and, since April 2013, about web archiving as part of non-print legal deposit. Editor-in-chief: Jason Webber.

19 December 2012

Digital Humanities and the Study of the Web and Web Archives


In early December 2012, I attended a PhD seminar on Digital Humanities and the Study of the Web and Web Archives. It was organised by netLab, a research project for the study of Internet materials affiliated to the Centre for Internet Studies at Aarhus University, Denmark.

Eighteen PhD candidates from different parts of the world attended the seminar. They are all at different stages of their research, but together represent a new generation of researchers who have embraced the Internet as a means of studying society and culture, because, in the words of the organisers, it “holds the most multifaceted material documenting contemporary social, cultural and political life”. The seminar drew specific attention to web archives. This is not surprising, as its organisers, Niels Ole Finnemann and Niels Brügger, were not only closely involved in the conception and development of the Danish national web archive (NetArchive.dk), but also use web archives as a key source in their own research into the history of the Internet. The purpose of the seminar was two-fold: to explore relevant digital research tools and methods, and to introduce the students to web archives as a corpus for research, together with their characteristics and their analytical and methodological consequences.

Presentations from the students painted a diverse picture of research topics and disciplines. Two things struck me: the already creative use of various digital research methods, and the (almost indispensable) role of social networks such as Twitter and Facebook. Adrian Bertoli of the University of Copenhagen, for example, who studies the online diabetes community, is also investigating how that community relates online to medical professionals, pharmaceutical companies and governments. He produced a hyperlink map to illustrate the interactions between the various actors. Another example is Jacob Ørmen, also of the University of Copenhagen, who investigates the interplay between established media and social media in the coverage of worldwide “media events”, such as the Diamond Jubilee or the 2012 London Olympics, research for which social media data about the events is fundamental.

Over time, users of web archives such as those at the seminar will increasingly need the means to collect or assemble individual research corpora. From our point of view, that of a web archiving service provider whose main users are academic researchers, broad national web archive collections, which often have limited accessibility for legal and technical reasons, may not meet the dispersed needs of individual researchers, and are in danger of providing a “one-size-fits-nobody” solution. Archiving and providing access to individual historical web resources is the basic “must-have” of a web archive. To add value beyond that, we should collect and store those web resources in a way that allows individual researchers to organise, and then continually reassemble, their own research corpora. We also need to provide the tools for processing and manipulating them using various digital methods.

One of the difficulties in studying web archives, highlighted by Niels Brügger, is the problematic interoperability between web archives with different scopes and geographical coverage. What we need is a research infrastructure capable of supporting the study of the history of the Internet across web archives in different countries, collected according to different principles and holding content in different languages. A funding bid to develop this is under consideration by the EU.

Helen Hockx-Yu, December 2012  

 

04 December 2012

Capturing the police authorities


For almost half a century, Police Authorities in England and Wales fulfilled their role of ensuring that the public had an efficient and effective local police force. This system was, however, replaced by a single elected individual (a Police and Crime Commissioner) following the Police Reform and Social Responsibility Act 2011.

Thursday 15th November saw elections for the new Police and Crime Commissioners in the 41 police force areas in England and Wales outside London (the Mayor of London, Boris Johnson, has held the equivalent role for the Metropolitan Police since January).

We in the British Library Web Archiving Team were concerned that with the abolition of the Police Authorities and the disappearance of their websites significant documentary material would be lost. Information on the Authority websites typically includes annual reports, statements of accounts, policing plans, public consultations, strategy and delivery plans and newsletters, all of which serve to inform the public of the work of the Authorities and to enable Authority members to scrutinise the constabulary and hold the Chief Constable accountable.

In light of this we contacted the Police Authorities asking for permission to archive their current websites before the Authorities were replaced by the PCCs on 20 November. Some responded immediately, whereas others required further information, and (after a little chasing) we achieved a 100% positive response rate. This is certainly something to be pleased about: the usual response rate is between 25 and 30%, so for the first time we have been able to capture a nationwide administrative change comprehensively.

Between two and four snapshots of each website were taken and reviewed individually for quality and completeness before being submitted to the archive. Typical issues included adding supplementary seeds to capture linked documents and style sheets hosted outside the main server, applying filters to prevent crawler traps, and probing crawl logs to identify the reasons for missing content. The final snapshots were taken on 20th November in case of any last-minute changes. See the whole collection.
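For readers curious about what those filters look like in practice, here is a minimal sketch, in illustrative Python rather than our actual Heritrix crawler configuration, of how supplementary seeds and trap-avoiding exclusion rules might be combined when deciding whether a discovered URL is in scope. All the URLs and patterns are hypothetical.

```python
import re

# Hypothetical scope definition for one Police Authority crawl.
seeds = [
    "http://www.examplepoliceauthority.org.uk/",
    # Supplementary seed: documents served from a separate host.
    "http://docs.examplepoliceauthority.org.uk/reports/",
]

# Exclusion filters to avoid common crawler traps, e.g. calendar
# pages that generate an endless series of month-by-month links.
exclusions = [
    re.compile(r"/calendar/\d{4}/\d{2}"),  # infinite calendar pages
    re.compile(r"[?&]sessionid="),         # session-id URL variants
    re.compile(r"(/[^/]+)\1{3,}"),         # repeating path segments
]

def in_scope(url: str) -> bool:
    """Accept a discovered URL if it falls under one of the seeds
    and matches none of the exclusion filters."""
    if not any(url.startswith(seed) for seed in seeds):
        return False
    return not any(pattern.search(url) for pattern in exclusions)

print(in_scope("http://www.examplepoliceauthority.org.uk/calendar/2012/11"))      # False: trap
print(in_scope("http://docs.examplepoliceauthority.org.uk/reports/annual.pdf"))   # True: supplementary seed
```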

29 November 2012

Monarchy and New Media: bookings open


Bookings are now open for this one-day conference, in London, on Thursday 7 February 2013. We at the UK Web Archive are joint organisers, with the Institute of Historical Research (University of London), and the Royal Archives.

The end of the Diamond Jubilee year affords an opportunity to look back and examine a neglected aspect of the history of the monarchy: the engagement with new forms of media. The event will include reflections on royal engagement with successive new technologies: telegraphy, radio, newsreel and television.

The event will also see the formal launch of our own jubilee collection, with reflections on our experience of creating it in collaboration with the Royal Archives and the IHR, and one historian’s engagement with the collection itself (our very own Peter Webster).

Booking costs a very reasonable £10, and further details, including a programme and a booking form, may be found here.

22 November 2012

Upgrading the Wayback Machine


We're very shortly to upgrade our deployment of the Open Source Wayback Machine, the software made openly available by the Internet Archive to enable browsing of timed snapshots of an archived site. (See it in action in the UK Web Archive.) We're deploying a new version made available by the Internet Archive earlier this year.

Users will immediately see some enhancements. The banner at the top will now include more information about the number of instances of each site that are available, and an easier way of navigating between them. The information will be available in Welsh, in recognition of the archive's remit for the whole of the UK; there is also a handy Help link. For now, however, it will no longer be possible to minimise the banner and then reveal it again; once minimised, the page must be reloaded to show the banner once more.

Behind the scenes, the new version reads directly from our Hadoop Distributed File System (HDFS) which is more cost-effective, simpler to administer, more robust, and easier to scale up to cater for growing levels of usage.
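For the technically curious, the core of such a replay lookup is simple: a CDX index line maps a URL and harvest timestamp to a byte offset and record length within a (W)ARC container file, and the software reads just that slice, whether the container sits on local disk or on HDFS. Below is a minimal sketch in Python under those assumptions; the 11-field CDX layout is a common convention, and the sample line, file names and paths are invented for illustration rather than taken from our deployment.

```python
# Minimal sketch: resolve a capture via a CDX index line and read the
# corresponding record from a WARC container. Field layout and paths
# are assumptions for illustration, not our actual configuration.

def parse_cdx_line(line: str) -> dict:
    """Parse an 11-field CDX line of the form:
    urlkey timestamp original mimetype statuscode digest redirect
    meta length offset filename"""
    fields = line.split()
    return {
        "urlkey": fields[0],
        "timestamp": fields[1],
        "original": fields[2],
        "length": int(fields[8]),
        "offset": int(fields[9]),
        "filename": fields[10],
    }

def read_record(cdx_entry: dict, container_root: str) -> bytes:
    """Seek straight to the record's offset and read only its bytes.
    With HDFS, the same seek-and-read goes through an HDFS client
    rather than the local filesystem."""
    path = f"{container_root}/{cdx_entry['filename']}"
    with open(path, "rb") as warc:
        warc.seek(cdx_entry["offset"])
        return warc.read(cdx_entry["length"])

entry = parse_cdx_line(
    "uk,example)/ 20121122093000 http://example.uk/ text/html 200 "
    "AAAA - - 5120 1048576 EXAMPLE-20121122-00001.warc.gz"
)
print(entry["filename"], entry["offset"], entry["length"])
# record = read_record(entry, "/data/warcs")  # hypothetical mount point
```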

15 November 2012

Non-Print Legal Deposit Regulations 2013: what will they say?


Next year we anticipate that regulations will come into force providing for legal deposit of non-print works, mirroring the longstanding arrangements for printed works. The final draft of the regulations, to be laid before Parliament, has recently been published.
Here is a summary of their impact in relation to web archiving.

What will we be collecting?

The regulations cover four deposit models, of which the most relevant for the web archiving team are that:

(i) a deposit library may copy UK publications from the open web, including websites, open access journals and books, government publications, etc.;
(ii) a deposit library may collect other, password-protected material by harvesting, subject to giving the publisher at least one month’s written notice to provide access credentials (with some limited exemptions).

The regulations apply to any digital or other non-print publication, except:

(i) film and recorded sound where the audio-visual content predominates [but, for example, web pages containing video clips alongside text or images are within scope];
(ii) private intranets and emails;
(iii) personal data on social networking sites, or data that is only available to restricted groups.

The regulations apply to online publications:

(i) that are issued from a .uk or other UK geographic top-level domain; or
(ii) where part of the publishing process takes place in the UK;
(iii) but excluding any which are only accessible to audiences outside the UK.

What will the Library be able to do with it?

Deposited material may not be used for at least seven days after it is deposited or harvested.

After that, deposit libraries may:

(i) transfer, lend, copy and share deposited material with each other;
(ii) use deposited material for their own research;
(iii) copy deposited material, including in different formats, for preservation.

What will users be able to do with it?

Users may only access deposited material while on “library premises controlled by a deposit library”.

Users may print only one copy of a restricted amount of any deposited material, for non-commercial research or other defined ‘fair dealing’ purposes such as court proceedings, statutory enquiry, criticism and review, or journalism.

No more than one user in each deposit library may access the same material at the same time.

Users may not make any digital copies, except by specific and explicit licence of the publisher.

What restrictions may publishers request?

The publisher or other rights holders may request, at any time, an embargo of up to three years, and may renew such a request as many times as necessary. The requested embargo must be granted if the deposit library is satisfied on reasonable grounds that providing access would conflict with the publisher’s or rights holders’ normal exploitation of the work and unreasonably prejudice the legitimate interests of the publisher.

These conditions remain in force forever, including after all intellectual property rights in the deposited material have expired [“perpetual copyright”].

08 November 2012

Web archiving at LIKE39: what, why and how


[A guest post from Marja Kingma, curator of Dutch language collections at the British Library, and one of the leading lights of LIKE, the London Information and Knowledge Exchange.]

With a captivated audience of information professionals before him, Peter Webster (British Library) kicked off the new season of LIKE events at LIKE39. Peter had just moved to the British Library to take up his new role as Web Archiving Engagement & Liaison Officer and LIKE had just moved its meeting place to a new venue: The Castle, sister pub of The Crown Tavern. The upstairs room has state-of-the-art technical facilities and, more importantly, its own bar! And it is even closer to Farringdon Station than the Crown.

Peter is a contemporary historian and has worked with digital information in previous jobs. He now works with the UK Web Archive to raise people’s awareness of it and to encourage them to submit sites to the Archive. LIKE39 provided an excellent platform for this, because attendees know about the general issues around the ‘Digital Black Hole’ and the ephemeral nature of the Web, but they were not all familiar with the UK Web Archive.

Archiving the web, i.e. harvesting websites based in the UK on a regular basis, is just an extension of what the BL, TNA and other participants do with print material. In the past much of what was printed has been lost, and the same now threatens to happen with electronic material and websites. Websites either disappear completely or are abandoned, leaving no trace of a contact; Peter called these sites ‘orphans’. An example is the City Information Group, where the idea for LIKE was born just as CIG folded. Its site is still on the web, but the ‘Contact’ page is no longer available. In this case there is a good chance that a contact can be found, but this is much more problematic in other cases.

Professionals started to see a Digital Black Hole appearing, and something had to be done. In 2003 the Legal Deposit Libraries Act was passed, establishing a legal deposit requirement for publishers of electronic material, but this Act has still to be implemented. Fortunately the BL, TNA and the Wellcome Trust did not wait for that to happen and started the UK Web Archive, which is selective and permission-based. After ten years of setting up and establishing partnerships, it is now ‘business as usual’, just in time for the implementation of the LDLA, which now seems to be going forward in earnest next year. This should make redundant the current practice of asking web owners' permission to capture their sites, although permission would still be needed to make the archived copy publicly available. It should also speed up ingest of content into the Archive, through systematic crawling of the UK domain. Alongside this method, curators will continue to create thematic collections by actively bringing together websites from within the larger dataset.

It is important that all professionals dealing with websites in one way or another prepare for the preservation of their site(s), as part of the records management life-cycle. Archiving your website preserves it for future access; researchers as well as the general public will always be able to see what your site looked like in the past. Anyone can nominate sites for inclusion in the UK Web Archive, using the simple online form.

LIKEnews.org.uk is being processed for the UK Web Archive, and that is a good thing, because we like to think of LIKE as the first networking group for information professionals founded on, and managed using, social media tools. It would be a shame if future historians could not track its development from the start!

30 October 2012

How good is good enough? – Quality Assurance of harvested web resources


Quality Assurance is an important element of web archiving. It refers to the evaluation of harvested web resources to determine whether pre-defined quality standards have been attained.

So the first step is to define quality, which should be a straightforward task considering the aim of web harvesting is to capture or copy resources as they are on the live web. Getting identical copies seems to be the ultimate quality standard.

Current harvesting technology unfortunately does not deliver 100% replicas of web resources; one could draw up a long list of known technical issues in web preservation: dynamic scripts, streaming media, social networks, database-driven content… The definition of quality quickly turns into a statement of what is acceptable, or of how good is good enough. Web curators and archivists regularly look at imperfect copies of web resources and make trade-off decisions about their validity as archival copies.

We use four aspects to define quality:

1. Completeness of capture: whether the intended content has been captured as part of the harvest.

2. Intellectual content: whether the intellectual content (as opposed to styling and layout) can be replayed in the Access Tool.

3. Behaviour: whether the harvested copy can be replayed including the behaviour present on the live site, such as the ability to browse between links interactively.

4. Appearance: the look and feel of a website.

When applying these quality criteria, more emphasis is placed on the intellectual content than on the appearance or behaviour of a website. As long as most of the content of a website has been captured and can be replayed reasonably well, the harvested copy is submitted to the archive for long-term preservation, even if the appearance is not 100% accurate.

Example of a "good enough" copy of a web page, despite two missing images

We also have a list of what is “not good enough”, which helps separate the “bad” from the “good enough”. An example is so-called “live leakage”, a common problem in replaying archived resources, which occurs when links in an archived resource resolve to the current copy on the live site instead of to the archival version within the web archive. This is a particular concern when the leakage is to a payment gateway, which could confuse users into making payments for items that they did not intend to purchase or that do not exist. There are certain remedial actions we can take to address the problem, but there is as yet no global fix. Suppressing the relevant page from the web archive is often a last resort.
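To illustrate how such leakage can be spotted automatically, here is a minimal sketch in Python: it extracts link targets from a replayed page and flags any absolute URL that has not been rewritten into the archive's own URL space. The replay prefix is hypothetical, and real replay tools handle far more cases (JavaScript, CSS, redirects) than this toy detector.

```python
from html.parser import HTMLParser

# Hypothetical replay prefix; each archive has its own URL space.
ARCHIVE_PREFIX = "http://www.webarchive.org.uk/wayback/archive/"

class LinkExtractor(HTMLParser):
    """Collect the href/src targets found in a replayed page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)

def find_leaks(replayed_html: str) -> list:
    """Return absolute links that were not rewritten into the
    archive's URL space, i.e. candidates for live leakage."""
    parser = LinkExtractor()
    parser.feed(replayed_html)
    return [
        link for link in parser.links
        if link.startswith(("http://", "https://"))
        and not link.startswith(ARCHIVE_PREFIX)
    ]

page = '<a href="http://example.com/checkout">Pay now</a>'
print(find_leaks(page))  # ['http://example.com/checkout'] - leaks to the live web
```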

Quality assurance in web archiving currently relies heavily on visual comparison of the harvested and live versions of a resource, together with review of previous harvests and crawl logs. This is time-consuming and does not scale up. For large web archive collections, especially those based on national domains, it is impossible to carry out the selective approach described above; quality assurance, if undertaken at all, often relies on sampling. Some automatic solutions have been developed in recent years, which, for example, examine HTTP status codes to identify missing content. Automatic quality assurance is an area where more development would be welcome.
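As a flavour of what such an automatic check might look like, here is a minimal sketch that tallies status codes in a crawl log and lists failed fetches. It assumes a whitespace-separated, Heritrix-style log with the fetch status in the second column and the URL in the fourth; treat both the layout and the file name as assumptions.

```python
from collections import Counter

def scan_crawl_log(path: str):
    """Tally fetch status codes and collect the URLs of failed
    fetches (4xx/5xx responses, plus crawler-level errors, which
    Heritrix-style logs record as negative codes)."""
    tallies = Counter()
    failures = []
    with open(path) as log:
        for line in log:
            fields = line.split()
            if len(fields) < 4:
                continue  # skip malformed or blank lines
            status, url = fields[1], fields[3]
            tallies[status] += 1
            if status.startswith(("4", "5", "-")):
                failures.append((status, url))
    return tallies, failures

tallies, failures = scan_crawl_log("crawl.log")  # hypothetical log path
print(tallies.most_common(5))
for status, url in failures[:10]:
    print(status, url)
```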

Helen Hockx-Yu, Head of Web Archiving, British Library

26 October 2012

Ambassador, with these websites, you're really spoiling us


[A special guest post from Stella Wisdom, Digital Curator at the British Library.]

A little help from our friends for the video games collection

In my post last month I discussed the challenges of obtaining permission from website owners for the sites I’m selecting for the new video games collection. I’ve decided to try a new approach: recruiting an ambassador who is well known and respected in the video game industry to champion the collection, and who can help explain and promote the benefits of web archiving to site owners. I would hereby like to introduce Ian Livingstone, Life President of Eidos, the company behind the success of Lara Croft: Tomb Raider. Ian brings so much experience to the party, we don’t even expect him to bring chocolates!

Ian’s long history in the gaming industry started in 1975 when he co-founded Games Workshop, launching Dungeons & Dragons in Europe, then building a nationwide retail chain and publishing White Dwarf magazine. In 1982, with Games Workshop co-founder Steve Jackson, he created the Fighting Fantasy role-playing gamebook series, which has sold over 16 million copies to date. Fighting Fantasy is 30 years old this year, and to celebrate the anniversary Ian has written a new gamebook, Blood of the Zombies, which has also recently launched as an app on iOS and Android.

In 1984, Ian moved into computer games, designing Eureka, the first title released by publisher Domark. He then oversaw a merger that created Eidos Interactive, where he was Chairman for seven years. At Eidos he helped bring to market some of its most famous titles including Lara Croft: Tomb Raider. Ian became Life President of Eidos for Square Enix, which bought the publisher in 2009, and he continues to have creative input in all Eidos-label games.

Ian is known for actively supporting up-and-coming games talent as an advisor to, and investor in, indie studios such as Playdemic, Mediatonic and Playmob. He is vice chair of the trade body UKIE, a trustee of the industry charity GamesAid, chair of the Video Games Skills Council, chair of Next Gen Skills, a member of the Creative Industries Council and an advisor to the British Council.

In 2010 he was asked by Ed Vaizey, the UK Minister for Culture, Communications and Creative Industries, to become a government skills champion and was tasked with producing a report reviewing the UK video games industry. The NextGen review, co-authored with Alex Hope of visual effects firm Double Negative, was published by NESTA in 2011, recommending changes in education policy, chief among them bringing computer science into the schools' National Curriculum as an essential discipline.

With this wealth of experience and connections, I can’t think of anyone better to work with and I’m hopeful the collection will successfully grow with Ian’s support. This is the first time the UK Web Archive has appointed an ambassador for a collection, so it will be interesting to follow its progress. If the use of a champion is successful, then other collections may benefit from the same approach.

I’ve also been doing some advocacy work of my own this week, talking about the video games collection at the GameCity7 festival, meeting many interesting people there, discussing video game heritage and engaging them in web archiving.

As ever, I’m still seeking nominations, so if you know of any sites that you think should be included, then please get in touch (at [email protected] or via Twitter @miss_wisdom) or use the nomination form.