UK Web Archive blog

Introduction

News and views from the British Library’s web archiving team and guests. Posts about the public UK Web Archive, and since April 2013, about web archiving as part of non-print legal deposit. Editor-in-chief: Jason Webber.

13 September 2012

Web Archives and Chinese Literature


The following is a guest post by Professor Michel Hockx, School of Oriental and African Studies, University of London, who explains how doing research on internet literature differs from doing research on printed literature, and how web archives help.


In July of this year, Brixton-based novelist Zelda Rhiando won the inaugural Kidwell-e Ebook Award. The award was billed as “the world’s first international e-book award.” It may have been the first time that e-writers in English from all over the world had been invited to compete for an award, but for e-writers in Chinese such awards have been around for well over a decade. This might sound surprising, since the Chinese Internet is most frequently in the news here for the way in which it is censored, i.e. for what does not appear on it. What people often forget, however, is that the environment for print-publishing in China is much more restricted and much more heavily censored. Therefore, those with literary interests and ambitions have gone online in huge numbers. Reading and writing literature is consistently ranked among the top-ten reasons why Chinese people spend time online.

 

I have been following the development of Chinese internet literature almost since its inception and I am currently finalizing a monograph on the subject, simply titled Internet Literature in China and due to be published by Columbia University Press. (That scholars of literature feel compelled to publish their research outcomes on topics like this in the form of printed books shows how poorly attuned the humanities world still is to the new technologies.) Doing research on internet literature is substantially different from doing research on printed literature, most importantly because born-digital literary texts are not stable. Printed novels may come in different editions, but generally the assumption of literature scholars who do research on the same novel is that they have all read the same text. For internet literature there can be no such assumption, because “the text” often evolves over time and usually looks different depending on user interaction: what you see depends on when you visited it and what you did with it. So one of the methods I employ is to present my interpretations of such texts at different moments in time. For traditional literature scholars, this is unusual: they don’t normally tell you in their research “when I read this text in 2011, I interpreted it like this, but when I read it again in 2012, I interpreted it like that.” Using this method relies on the availability of the material, and on the possibility of preserving it so that other scholars can reproduce my readings. And that is where web archives come in.

 

As far as I know, there is no Chinese equivalent of the UK Web Archive. In the area of preservation of born-digital material, China is very far behind the UK (instead it devotes huge resources to the digitization and preservation of its printed cultural heritage). Some literary websites in China have their own archives. In the case of popular genre fiction sites these archives can be huge, and they can be searchable by author, genre, popularity (number of hits or comments), and so on. Genre fiction (romance fiction, martial arts fiction, erotic fiction, and so on) is hugely popular on the Chinese Internet, because it faces relatively few legal restrictions compared to print publishing. Readers subscribe to novels they like and they then receive regular new instalments, often on a daily basis. However, no matter how large the archives, there tends to be a cut-off point after which works are taken offline. When I first started my research in 2002, I was blissfully unaware of such potential problems. As a result, roughly 90% of the URLs mentioned in the footnotes to my first scholarly articles on the topic are no longer accessible. Fortunately, when I began to rework some of my earlier articles for my book, I found that the Internet Archive had preserved a substantial number of the links, so in many cases my footnotes now refer to the Internet Archive. Although the Internet Archive does not preserve images and other visual material (which can play an important role in online literature), having the texts as I saw them in 2002 is definitely better than having nothing at all, and will convince my fellow scholars that I am not just making them all up!

 

During my later research, I took care to save pages, and sometimes entire sites or parts of sites, to my own computer to ensure preservation of what I had seen. But archiving material on my computer does not make it any more accessible to others. That is why I use the services of the Digital Archive for Chinese Studies (DACHS, with one server in Heidelberg, and one in Leiden), where scholars in my field can store copies of online material they refer to in footnotes to publications. DACHS also has another important function: it preserves copies of online material from China that is in danger of disappearing, because it is political or ephemeral, or both. DACHS also invites scholars to introduce such materials and place them in context, as in Nicolai Volland’s collection of online documents pertaining to “Control of the Media in the People’s Republic of China”, or Michael Day’s annotated collection of Chinese avant-garde poetry websites.

 

In order for online Chinese-language literature to be preserved, its cultural value needs to be appreciated not just by foreign enthusiasts like myself, but more generally by scholars and critics in China itself. The first decade or so of Chinese writing on the Internet will probably never be restored in any detail, although a relatively complete picture might still emerge if existing partial archives were merged. Meanwhile, I hope that new archiving options for later material will become available soon. 

05 September 2012

How to Make Websites More Archivable?


I was contacted by an organisation which is going to be disbanded in a couple of months. When the organisation disappears, so will its website. Fortunately we have already archived a few instances of their website in the UK Web Archive.

The lady who contacted me, however, complained that the archival copies are incomplete as they do not include the “database”, and would like to deposit a copy with us. On examination it turned out that a section called “events”, which has a calendar interface, was not copied by our crawler. I also found out that 2 other sections, whose content is pulled dynamically from an underlying database, seem to be accessible only via a search interface. These would have been missed by the crawler too.

The above situation reflects some common technical challenges in web archiving. The calendar is likely to send the crawler into a so-called “crawler trap”, as the crawler would follow the hyperlinked dates on the calendar endlessly. For that reason, the “events” section was excluded from our previous crawls. The database-driven search interface presents content based on searches or interactions, which the crawler cannot perform. Archiving crawlers are generally capable of capturing explicitly referenced content which can be served by requesting a URL, but cannot deal with URLs which are not explicit in the HTML but are embedded in JavaScript or Flash presentations or generated dynamically.

We found out the earliest and latest dates related to the events in the organisation’s database and used these to limit the date range the crawler should follow. We then successfully crawled the “events” section without trapping our crawler. For the other 2 sections, we noticed that the live website also has a map interface which provides browseable lists of projects per region. Unfortunately only the first pages are available because the links to subsequent pages are broken on the live site. The crawler copied the website as it was, including the broken links.
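The date-range restriction described above can be sketched in a few lines of Python. This is only an illustration and not our crawler's actual configuration: the `/events` path, the `year` and `month` query parameters, and the two boundary dates are all hypothetical, chosen to show how bounding the calendar stops the crawler from following dates forever.

```python
from datetime import date
from urllib.parse import urlparse, parse_qs

# Hypothetical bounds: earliest and latest event dates found in the database.
EARLIEST = date(2005, 1, 1)
LATEST = date(2012, 12, 31)


def in_scope(url: str) -> bool:
    """Return True if the crawler should follow this URL."""
    parsed = urlparse(url)
    if not parsed.path.startswith("/events"):
        return True  # the rule only restricts the calendar section
    query = parse_qs(parsed.query)
    if "year" not in query or "month" not in query:
        return True  # the calendar landing page carries no date parameters
    try:
        year, month = int(query["year"][0]), int(query["month"][0])
    except ValueError:
        return False  # malformed dates are not worth following
    if not 1 <= month <= 12:
        return False
    # Compare (year, month) pairs so the calendar is bounded at both ends.
    return (EARLIEST.year, EARLIEST.month) <= (year, month) <= (LATEST.year, LATEST.month)
```

With these assumed bounds, a link to June 2010 is followed while a link to January 2099 is dropped, so the crawl of the calendar terminates.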

A few basic measures, if taken into account when a website is designed, will make it a lot more archivable. These measures help ensure preservation and avoid information loss if, for any reason, a website has to be taken offline.

1. Make sure important content is also explicitly referenced.
This requirement is not in contradiction with having cool, interactive features. All we ask you to do is to provide an alternative, crawler-friendly way of access, using explicit or static URLs. A rule of thumb is that each page should be reachable from at least one static URL.

2. Have a site map
Use a site map to list the pages of your website accessible to crawlers or human users, in XML or in HTML.

3. Make sure all links work on your website.
If your website contains broken links, copies of your website will also have broken links.
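As an illustration of point 2, here is a minimal sketch of generating an XML site map with Python's standard library. It follows the element names of the sitemaps.org protocol (`urlset`, `url`, `loc`); the function name and the example URLs are made up for the illustration, and a real site would normally produce the list of pages from its own database or content management system.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap XML string listing one <url><loc> entry per URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url  # special characters are escaped on output
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical pages, including one that is otherwise only reachable via a calendar.
    print(build_sitemap([
        "http://example.org/",
        "http://example.org/events?year=2010&month=6",
    ]))
```

Listing database-driven pages in this way gives a crawler an explicit URL for content it could not otherwise discover.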

There are more things one can do to make websites archivable. Google, for example, has issued guidelines to webmasters to help find, crawl, and index websites: http://support.google.com/webmasters/bin/answer.py?hl=en&answer=35769. Many of the best practices mentioned there are also applicable to archiving crawlers. Although archiving crawlers work in a way that is very similar to search engine crawlers, it is important to understand the difference. Search engine crawlers are only interested in files which can be indexed. Archiving crawlers intend to copy all files, of all formats, belonging to a website.

Helen Hockx-Yu, Head of Web Archiving, British Library

30 August 2012

Analysing File Formats in Web Archives


Knowledge of file formats is crucial to digital preservation. Without this, it is impossible to define a preservation strategy. Andy Jackson, Web Archiving Technical Lead at the British Library, explains how to analyse formats used in archived web resources for digital preservation purposes. This post also appears on the Open Planets Foundation blog.

The UK Web Archive recently released a new suite of visualisations and datasets. Amongst these is a format profile, summarising the data formats (MIME types) in the JISC UK Web Domain Dataset (1996-2010). This contains some 2.5 billion HTTP 200 responses stretching from 1996 to 2010, neatly packed into ARC files and stored on our HDFS cluster. Storing it in HDFS allows us to run Map-Reduce tasks over the whole dataset, and analyse the results.

Given this infrastructure, my first thought was to use it to test and compare format identification processes by running multiple identification tools over the same corpus. By analysing the depth and coverage of the results, we can estimate which tools are better suited to which types of resources and collection. Furthermore, much as double re-keying can be used to establish 'ground truth' for OCR data, each tool acts as an independent opinion on the format of a resource and so permits us a little more confidence in their assertions when they are found to coincide. This allows us to focus our attention on where the tools disagree, and helps to ensure that our efforts to improve those tools will have the greatest impact.

To this end, I wrapped up Apache Tika and the DROID binary signature identifier as part of a Map-Reduce task and ran them over the entire corpus. I mapped the results of both to a formalised extended MIME type syntax, such that each PUID has a unique MIME type of the form 'application/pdf; version=1.4', and used that to compare the results of the tools.
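To make the comparison step concrete, here is a small Python sketch of how two tools' results might be compared once both are expressed in the extended MIME type syntax. It is not the Map-Reduce code that was run over the corpus: the function names and the three outcome labels are assumptions made for this illustration, and the parsing is deliberately simple.

```python
from collections import Counter


def parse_extended_mime(value: str):
    """Split 'application/pdf; version=1.4' into ('application/pdf', {'version': '1.4'})."""
    base, _, rest = value.partition(";")
    params = {}
    for part in rest.split(";"):
        if "=" in part:
            key, _, val = part.partition("=")
            params[key.strip().lower()] = val.strip().strip('"')
    return base.strip().lower(), params


def compare(tika: str, droid: str) -> str:
    """Classify how far two identification results coincide."""
    t_base, t_params = parse_extended_mime(tika)
    d_base, d_params = parse_extended_mime(droid)
    if t_base != d_base:
        return "disagree"
    if t_params.get("version") == d_params.get("version"):
        return "agree"
    return "type_only"  # same base type, but the versions differ or one is missing


def tally(pairs):
    """Count outcomes over an iterable of (tika_result, droid_result) pairs."""
    return Counter(compare(tika, droid) for tika, droid in pairs)
```

The "disagree" and "type_only" buckets are where attention is best spent, since those are the resources on which the tools do not corroborate each other.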

Of course, as well as establishing trust in the tools, this kind of data helps us start to explore the way format usage has changed over time, and is a necessary first step in understanding the nature of format obsolescence. As a taster, here is a chart showing the usage of different versions of HTML over time:

As you can see, each version rises to dominance and then fades away, but the fade slows down each time. Across the 2010 time-slice, all the old versions of HTML are still turning up in the crawl. You can find some more information and results on the UK Web Archive site.

Finally, as well as exporting the format identifiers, I also used Apache Tika to extract any information it found about the software or hardware platform the resource was created on. All of this information was combined with the MIME type declared by the server and then aggregated by year to produce a rich and complex longitudinal multi-tool format profile for this collection.
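The aggregation itself is conceptually simple, and a rough Python sketch may help when exploring the dataset. This is an in-memory illustration rather than the Map-Reduce job used on the cluster; the record layout (year, server-declared MIME type, Tika result, DROID result) and the function names are assumptions for the example.

```python
from collections import Counter


def build_format_profile(records):
    """Count responses per (year, served MIME type, Tika result, DROID result)."""
    profile = Counter()
    for year, served, tika, droid in records:
        profile[(year, served, tika, droid)] += 1
    return profile


def counts_for_year(profile, year):
    """Collapse the profile to Tika-identified format counts for a single year."""
    totals = Counter()
    for (rec_year, _served, tika, _droid), n in profile.items():
        if rec_year == year:
            totals[tika] += n
    return totals
```

Collapsing the profile along different dimensions in this way is how a view such as format usage per year can be derived from the combined data.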

[Chart: usage of different versions of HTML over time]

If this is of interest to you, please go and download the dataset and start exploring it. Please let me know if you find this dataset useful, and please share any interesting results you dig out of the dataset.

22 August 2012

Visualising the UK Web Domain


The UK Web Archive is a selective archive containing Websites selected and preserved by the British Library and partners since 2004.

“.uk” is one of the largest country-code top level domains in the world, with 10 million registrations in March 2012. Selective archiving has many advantages but is costly and fails to capture a comprehensive picture of the national domain. The Legal Deposit Libraries in the UK will be able to collect Web resources at scale when the non-print Legal Deposit legislation is in place, expected sometime in 2013.

The benefits of archived Web resources can only be realised when these are actively used, for research, learning and teaching. This was the impetus for us to work with the Joint Information Systems Committee (JISC) and the Internet Archive on a collaborative project which extracted a copy of UK Websites from the Internet Archive’s collection. This research dataset, supported by JISC funding, contains Websites crawled between 1996 and 2010 by the Internet Archive and is the largest historical dataset of the UK domain in existence. One of the objectives of the project is to develop visualisations and services to demonstrate how large scale Web archive collections can be used for analytics, showing embedded trends and patterns which would not have been possible by just consulting historical copies of Websites individually.

The visualisations and secondary datasets are now released on the UK Web Archive: http://www.webarchive.org.uk/ukwa/visualisation. The N-gram search is a phrase-usage visualisation tool which charts the monthly occurrence of user-defined search terms or phrases over time, as found in the JISC UK Web domain dataset (1996-2010). The link visualisation shows the relationship between domain suffixes over time. The format profile is a visualisation of the format analysis, summarising the data formats (MIME types) contained within all of the HTTP 200 OK responses. We have also released two downloadable secondary datasets which can be used to develop further applications: a list of MIME types and a postcode index.
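The idea behind the N-gram search can be shown with a short Python sketch. This is not the code behind the released tool, which works over an index of the whole dataset; the function name and the input shape (pairs of crawl date and extracted text) are assumptions for the illustration.

```python
import re
from collections import Counter
from datetime import date


def monthly_phrase_counts(documents, phrase):
    """Count case-insensitive occurrences of a phrase per crawl month.

    `documents` is an iterable of (crawl_date, text) pairs.
    """
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    counts = Counter()
    for crawl_date, text in documents:
        hits = len(pattern.findall(text))
        if hits:
            counts[crawl_date.strftime("%Y-%m")] += hits
    return counts


if __name__ == "__main__":
    # Made-up sample pages, purely to show the shape of the result.
    sample = [
        (date(2001, 9, 12), "A web archive is not the live web."),
        (date(2001, 9, 20), "Our Web Archive keeps copies."),
        (date(2002, 1, 5), "Nothing relevant here."),
    ]
    print(monthly_phrase_counts(sample, "web archive"))
```

Plotting the resulting month-to-count mapping for one or more phrases gives the kind of trend chart the tool presents.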

The JISC has also funded two additional projects, using the JISC UK Web domain dataset (1996-2010) to develop analytical access to large scale Web archive collections. These are Analytical Access to the Domain Dark Archive and Big Data: Demonstrating the Value of the UK Web Domain Dataset for Social Science Research. We are running a joint workshop at the Digital Research 2012 Conference: Digital Research Using Web Archives. If you would like to find out more about our projects and Web archiving in general, please come along and join us.

01 August 2012

Diamond Jubilee Collection live


We are pleased to announce that our new web collection about the Queen’s Diamond Jubilee is now live. This collection represents an important historical record of online resources which, it is hoped, will provide a lasting legacy of the event and fulfil our aim to prioritise selection of websites that feature political, cultural, social and economic events of national importance.

The collection, comprising over 130 titles, was initiated in late 2011 by the British Library in collaboration with the Royal Archives and the Institute of Historical Research. Content has been selected by subject specialists from a variety of sources including the Twittervane tool developed by the British Library which enables curators to identify sites frequently shared on social media relevant to specified search terms. Websites were also selected by members of the public who submitted nominations on the UK Web Archive’s online nomination form.
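The general idea of surfacing frequently shared sites can be sketched as follows. This is not Twittervane's implementation, which is not described here; the function name, the plain-text input and the simple substring matching are assumptions for the illustration, and shortened links are not resolved.

```python
import re
from collections import Counter
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://\S+")


def rank_shared_sites(posts, terms):
    """Rank hosts by how often they are linked from posts mentioning any search term."""
    terms = [term.lower() for term in terms]
    sites = Counter()
    for text in posts:
        lowered = text.lower()
        if not any(term in lowered for term in terms):
            continue  # the post is not relevant to the specified terms
        for url in URL_RE.findall(text):
            # Drop trailing punctuation that commonly follows a link in running text.
            host = urlparse(url.rstrip(".,;:!?)")).netloc.lower()
            if host:
                sites[host] += 1
    return sites.most_common()
```

A curator could then review the highest-ranked hosts as candidates for selection.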

Archiving of websites commenced in January 2012, with a focused period of high-frequency, high-intensity crawls in the weeks directly before and after the Jubilee weekend of June 2nd – 5th. All harvested websites were checked for quality and completeness before submission to the archive. We will continue to collect websites until December 2012 in order to capture analysis and debate on the issues around the Jubilee.

The aim of the collection was to cover the event as comprehensively as possible and to reflect a multiplicity of strands and themes including official events, the economic impact, public sentiment and political and constitutional debate. Staff at the Royal Household nominated sites of official interest such as the website of the British Monarchy and the official website of The Queen’s Diamond Jubilee.

Websites of official events initiated by Buckingham Palace have been archived including the Thames Diamond Jubilee Pageant, the Queen’s Diamond Jubilee Beacons, the Big Lunch and the BBC Concert at Buckingham Palace.

The Jubilee inspired local, unofficial celebrations such as street parties and other community-based events, and a selection of their websites has been captured, for example Newry Drama Festival, the Horsted Keyes Diamond Jubilee Organising Committee and Wetherby’s Diamond Jubilee Website.

Beginning in March 2012, The Queen, accompanied by The Duke of Edinburgh, conducted a series of royal tours throughout the UK to mark the Diamond Jubilee year. We have captured samples of local press coverage to cover Her Majesty’s regional visits. See for example the Queen’s visit to Ebbw Vale, Gwent and the blog by photographer Chris Seddon capturing the Queen’s Diamond Jubilee Tour of Leicester.

As much of the UK geared up to celebrate the Diamond Jubilee, the occasion also impelled debate about the future of the monarchy. Dissenting voices and opposition to the monarchy have been captured in the archive; see for example the website of the Jubilee Protest ‘Protest at the Pageant’ and Republic: campaigning for a democratic alternative to the monarchy.

The Mass Observation Project worked with us to record online observations from members of the public about the Diamond Jubilee. The observations were hosted on a blog which has been harvested as part of the Diamond Jubilee collection.

New content will continue to be added until December 2012. The British Library would be delighted to receive your nominations for this collection via our online form.  

Nicola Johnson, Web Archivist 1st August 2012

25 July 2012

Archiving the history of the British slave trade, from the web


The following is a guest post by Dr Philip Hatfield, Curator for Canadian and Caribbean Studies at the British Library, who is curating a special collection on ‘Slavery and Abolition in the Caribbean’ for the UK Web Archive.

Exterior of an Antigua Boiling House, William Clark 1823 (BL Shelfmark: 1796.c.9). From the Library’s ‘Caribbean Views’ gallery 

When I started working as a curator (only in 2010) one thing I did not expect was how much time I would spend using the Web as part of my work. However, as Curator for Canadian and Caribbean Studies it makes sense, not least because the Internet connects me to the international audience who use the Library’s collections. Also, the Internet contributes to the Library’s collections and, through the UK Web Archive, is becoming part of them too.

This means Library curators are now trialling the development of special collections for the UK Web Archive and when I was invited I jumped at the chance, knowing exactly what required my attention. Back in 2007 the Internet was an important engagement space for museums, archives, libraries and various other institutions to relay the history of slavery and mark the bicentenary of the abolition of the British slave trade. Since then a number of websites featuring online galleries, teaching resources and other materials related to the bicentenary have disappeared from the web. Moreover, this is not the only situation in which such valuable work has been lost to the general public.

So, a special collection focussing on ‘Slavery and Abolition in the Caribbean’ seemed a suitable framework through which to preserve relevant parts of the contemporary UK Web from being lost. I’m currently in the process of selecting websites from a range of UK government, heritage institution, local history and other sites for the collection, which is developing nicely. However, there are a number of stages (permission to archive being but one) before the sites can be collected and the selection goes live. My hope is that, once it does, it will be a useful resource to specialist and general users of the UK Web Archive.

I’m also beginning to realise that there is much more material out there than even I had anticipated and this has a couple of consequences. First, the title needs to change; there are a number of sites which deserve adding to the collection but don’t quite fit with ‘Slavery and Abolition in the Caribbean’. Second, I’m increasingly aware that despite my best efforts the chances are I will miss some excellent material, so if anyone wants to suggest sites from the UK Web, please get in touch.

19 July 2012

UK Web Archive in the eyes of scholars


We commissioned IRN Research earlier this year to gather a scholarly perspective on the UK Web Archive. This work has now been completed and we have received feedback on the Archive’s perceived research value, and particularly on the content and access mechanisms which should be further developed to support research use.

The feedback came from two groups of users: those who already use the Archive for research (26%) and those who have not used the Archive (74%). The overwhelming majority are from Arts and Humanities or Social Sciences disciplines. The participants were interviewed over the telephone and a small group also undertook a second phase where they searched the Archive based on specific case studies, detailing each step of the search and results.

All participants appreciated the potential scholarly value of the Archive. Those interested in web history, statistics and digital preservation research value the Archive particularly highly. However, the selective nature of the Archive seems to affect the perception of those using it for the first time, in that they could not find content relevant to their research. This is further related to the search tool, which some saw as complex, with the presentation of the search results perceived as unstructured. In contrast, existing users are generally satisfied with the search tool, suggesting that increased familiarity with the Archive may help overcome the perceived weakness.

Special Collections were thought by all users to be useful. However, users would like to understand our selection criteria and how the themes for Special Collections are established. There is a desire to see more Special Collections and the facility to nominate themes. “UK politics” and “Contemporary British History” are the 2 broad themes which have been suggested. All users expressed the requirement for including more images and rich media, as well as more blogs.

Many first-time users are unsure about the usefulness of the visualisation tools, especially the N-gram search. However, a small group of users are extremely enthusiastic about this. Again there is more interest in visualisation tools from existing users, suggesting the need to add better explanations about the functions and features of the Archive.

The study has given us some insight on how the UK Web Archive is perceived by scholars, which will direct us through the next stage of development. Things to consider for improvement or adjustment include not only the user interface, but also the underlying search and the scope of our collection.

Many thanks to IRN and those who took part in the project.

Helen Hockx-Yu, Head of Web Archiving

 

04 July 2012

Religious Websites and the Diamond Jubilee


The following is an edited version of two posts on Peter Webster's blog: one in March before the main Jubilee weekend, and a second in June. They are mainly concerned with sites relating to the Jubilee produced by or in connection with the mainline Christian denominations in the UK.

Although we are still a couple of months away from the event itself, I thought it would be worth starting to pull together some of the various sites for the Queen's jubilee that come from within or relate to the Christian churches. This will include press sources that the UKWA don't ordinarily take. I thought I'd make a start with some of the more predictable and national ones.

Official church resources

As you would expect, the several denominations have made various preliminary statements. The Church of England's site refers to several linked ventures: the Big Jubilee Lunch, with a specially composed grace; a special service at St Paul's on June 5th; and the Big Jubilee Thankyou, where Anglicans are invited to sign a copy letter displayed in churches, all of which will then be combined and presented to the Queen - a petition, as it were, without demands. The lunch is being coordinated by HOPE, a pan-church organisation which is evangelical in origin, but has partnerships in place with most of the Protestant denominations in the UK.

See also the Bishop of London's sermon on the accession (Feb 6) in his role as Dean of the Chapels Royal.

The Catholic bishops in England and Wales have urged parishes to pray for the Queen on Sunday June 3 (which is also Trinity Sunday), as reported in the Catholic Herald. (The press release is here.)

Churches Together in England are assembling resources as they appear here, and there is a joint presidential statement from Canterbury, Westminster, the Free Churches Group, and the Lutheran church, although it is rather lost amongst references to the Olympics.

The Jubilee Churches Festival is looking to co-ordinate celebrations at a local level.

Oppositional voices

One has to dig quite deep to find many Christians voicing opinions critical of either the event or the monarchy itself. Ekklesia noted the beginnings of the campaign of protest by Republic, and complaints about the BBC's coverage, but refrained from comment. (Incidentally, Republic's position on the established church is also interesting.) However, one would expect this type of comment to appear more reactively, and nearer the event; and so watch this space for later posts.

My earlier post looked at some of the preparatory statements from official church sources, and some very early oppositional voices. Here are some examples of reportage and comment after the event.

Rowan Williams' sermon at St Paul's

Perhaps predictably, the archbishop did not allow the pieties of the situation to restrict his thinking on the subject, making some robust comments about aspects of current economic life. See the full text, and the reactions of the Daily Mail (negative) and the Guardian and Nelson Jones in the New Statesman (rather more positive).

Local events

The Church Times gave a useful digest of local events, including a street party in the nave of Ripon Cathedral and various sermons, including that of the Dean of Belfast. Events in local communities included an inter-faith Family Fun Day in Tooting, south London.

The 'real meaning' of Jubilee

A good few campaigning sites sought to draw a distinction between the biblical concept of jubilee and the pattern of the celebrations, often making a more or less explicit connection with the current climate of austerity. See Christianity Uncut, Ekklesia and Symon Hill. The work of the Jubilee Debt Campaign predates this year's events, although their site did draw attention to the connection.

Dr Peter Webster