UK Web Archive blog

Information from the team at the UK Web Archive, the Library's premier resource of archived UK websites


Introduction

News and views from the British Library’s web archiving team and guests. Posts about the public UK Web Archive and, since April 2013, about web archiving as part of non-print legal deposit. Editor-in-chief: Jason Webber.

18 October 2012

Religion, the state and the law in contemporary Britain


Another in a series of forthcoming new collections is one that I myself am curating with the working title of 'State, religion and law in contemporary Britain.'

The politics of religion in Britain looks like a much more urgent area of inquiry in 2012 than it did a decade ago. In large part due to the terrorist attacks of 9/11 and 7/7, questions about the nexus of faith and national identity have found a new urgency. At the same time, older questions about the place of faith schools and of the bishops in the House of Lords, or of abortion or euthanasia have been given new and sharper focus in a changed climate of public debate.

The period since 2001 is also marked by a massive upswing in the use of the web as a medium for religious and religio-political debate, both by the established churches and campaigning secularist organisations, and by individuals and smaller organisations, most obviously in the blogosphere.

This collection is therefore trying to capture some representative sites concerned with issues of politics, government and law that touch on the disputed role of religious symbolism, belief and practice in the public sphere in Britain.

The collection is still ongoing and suggestions are very welcome, to [email protected], or via the nomination page. So far, the collection is rather weighted towards Christian voices and organisations, and suggestions for sites from amongst other faiths would be particularly welcome.

I've attempted to capture some representative general voices, such as the blog of the human rights campaigner Peter Tatchell, which deals with religious issues; the public theology think-tank Theos; and the National Secular Society.

We have already harvested some interesting sites relating to specific issues and events, such as the official site for the 2010 Papal visit to the UK, and some of the dispute at the time about the appropriateness or otherwise of spending public money on the security arrangements for the visit, from the BBC and elsewhere.

An issue at the 2010 General Election was the place of the bishops in the House of Lords, and the Power2010 campaign pressed for that to change, as did the British Humanist Association.

An issue that has come to prominence in recent weeks is that of the appropriate time limit for abortion, and we have twelve archived instances of the site of the Society for the Protection of Unborn Children, stretching back as far as 2005.

11 October 2012

BlogForever: a new approach to blog harvesting and preservation?


[Ed Pinsent of the University of London Computer Centre writes about the BlogForever project.]

The European Commission-funded BlogForever project is developing an exciting new system to harvest, preserve, manage and reuse blog content. I'm interested not only as a supplier to the project, but also because I'm fairly familiar with the way that Heritrix copies web content, and the BlogForever spider seems to promise a different method.

The system will perform an intelligent harvesting operation which retrieves and parses hypertext as well as all other associated content (images, linked files, etc.) from blogs. It copies content not only by interrogating the RSS feed of a blog (similar to the JISC ArchivePress project), but also by copying data from the original HTML. The parsing action will be able to render the captured content into structured data, expressed in XML; it does this in accordance with the project's data model.
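
To give a rough flavour of the RSS side of this, here is a minimal sketch in Python of pulling structured fields out of a blog's feed and re-expressing them as a simple XML rendition. To be clear, this is only an illustration of the general idea, not the BlogForever spider itself (which also parses the original HTML against the project's data model); the feed URL and element names are invented.

```python
# Illustrative sketch only (not the BlogForever spider): pull structured
# fields out of a blog's RSS feed and re-express them as a small XML
# "rendition". The feed URL and output element names are invented.
import urllib.request
import xml.etree.ElementTree as ET

FEED_URL = "https://example-blog.org/feed.rss"  # hypothetical feed

def fetch_feed(url: str) -> ET.Element:
    """Download and parse an RSS 2.0 feed."""
    with urllib.request.urlopen(url) as resp:
        return ET.fromstring(resp.read())

def to_rendition(rss_root: ET.Element) -> ET.Element:
    """Carve each post into semantic entities (title, date, author, tags)."""
    rendition = ET.Element("blog")
    for item in rss_root.iterfind("./channel/item"):
        post = ET.SubElement(rendition, "post")
        for src, dst in [("title", "title"), ("link", "url"),
                         ("pubDate", "date"), ("author", "author")]:
            value = item.findtext(src)
            if value:
                ET.SubElement(post, dst).text = value.strip()
        for cat in item.iterfind("category"):
            ET.SubElement(post, "tag").text = (cat.text or "").strip()
    return rendition

if __name__ == "__main__":
    print(ET.tostring(to_rendition(fetch_feed(FEED_URL)), encoding="unicode"))
```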

The result of this parsing action will carve semantic entities out of blog content on an unprecedented micro-level. Author names, comments, subjects, tags, categories, dates, links, and many other elements will be expressed within the hierarchical XML structure. When this content is imported into the BlogForever repository (based on CERN’s Invenio platform), a public-facing access mechanism will provide a rendition of the blog which can be interrogated, queried and searched to a high degree of detail. Every rendition, and every updated version thereof, will be different, representing a different time-slice of the web, without the need to create and manage multiple copies of the same content. The resulting block of XML will be much easier to store, preserve and render than content captured by current web-archiving methods.

BlogForever are proposing to create a demonstrator system to prove that it would be possible for any organisation, or consortium of like-minded organisations, to curate aggregated databases of blog content on selected themes. If there were a collection of related blogs in fields such as scientific research, media, news, politics, the arts or education, a researcher could search across that content in very detailed ways, revealing significant connections between written content. Potentially, that's an interrogation of web content of a quality that even Google cannot match.

This interests me as it might also offer us the potential to think about web preservation in a new way. In most existing methods, the approach is to copy entire websites from URLs, replicating the folder structure. This approach tends to treat each URL as a single entity, and follows the object-based method of digital preservation, by which I mean that all digital objects in a website (images, attachments, media, stylesheets) are copied and stored. We've tended to rely on sophisticated wrapper formats to manage all that content and preserve the folder hierarchy; ARC and WARC are useful in that respect, and the BagIt approach used in California also works for websites, and is capable of moving large datasets around a network efficiently.

Conversely, the type of content going into the BlogForever repository is material generated by the spider: it’s no longer the unstructured live web. It’s structured content, pre-processed, and parsed, fit to be read by the databases that form the heart of the BlogForever system. The spider creates a “rendition” of the live web, recast into the form of a structured XML file. XML is already known to be a robust preservation format.

If these renditions of blogs were to become the target of preservation, we would potentially have a much more manageable preservation task ahead of us, with a limited range of content and behaviours to preserve and reproduce. It feels like instead of trying to preserve the behaviour, structure and dependencies of large numbers of digital objects, we would instead be preserving very large databases of aggregated content.

BlogForever (ICT No. 269963) is funded by the European Commission under the Framework Programme 7 (FP7) ICT Programme.

04 October 2012

Exploring the lost web


There has been some attention paid recently to the rate at which the web decays. A very interesting recent article by SalahEldeen and Nelson looked at the rate at which online sources shared via social media subsequently disappear. The authors concluded that 11% would disappear within the first year, and that after that content would continue to be lost at 0.02% per day (roughly another 7% a year): a startling rate of loss.

There are ways and means of doing something about it, not least through national and international web archives such as our own. And we preserve many extremely interesting sites that are already lost from the live UK web domain.

Some of them relate to prominent public figures who have either passed away, or are no longer in that public role. One example of the former is the site of the former Labour MP and foreign secretary Robin Cook, who died in 2005. One of the latter is that of his colleague Clare Short, who left parliamentary politics in 2010 after serving as secretary of state for international development.

Organisations often have limited lives too, of course, and amongst our collections is the site of the Welsh Language Board, set up by Act of Parliament in 1993 and abolished by later legislation in 2012. Perhaps more familiar was one of the major corporate casualties of recent years, Woolworths, which went into administration at the end of 2008.

Some others relate to events that have happened or campaigns that have ended. In the case of some of the more 'official' sites, we in the web archiving team can anticipate when sites are likely to be at risk, and can take steps to capture them. In other cases, we need members of the public to let us know. If you know of a site which you think is important, and that may be at risk, please let us know using our nomination form.

One such site is One and Other, Antony Gormley's live artwork on the vacant fourth plinth in Trafalgar Square. Also in the archive is David Cameron's campaign site from when he was the candidate for the constituency of Witney at the 2005 general election. Finally, there is What a difference a day makes, a remarkable blog post from someone who experienced the London terrorist attacks of 2005. All three now exist only in the web archive.

27 September 2012

Digital Research 2012, Oxford


I recently made the trip to the Digital Research 2012 gathering in Oxford with my colleagues in the web archiving team, Helen Hockx-Yu and Andy Jackson. We were taking part in a day of presentations and workshops on the theme of digital research using web archives. (See the programmes of our session and of the whole conference.)

It was an excellent opportunity to showcase a cluster of current projects, both here at the BL and in association with us, and to make connections between them. Andy demonstrated some of the forthcoming visualisation tools for the archive, some of which are already available on the UK Web Archive site (see earlier post). Helen presented some summary results from a recent survey of our users, about which she wrote in an earlier post.

Recently, the JISC very generously funded two projects to explore the use of the UK Web Domain Dataset, and there were presentations from both. Helen Margetts from the Oxford Internet Institute presented the Big Data project, which is conducting a link analysis of the whole dataset, showing its usefulness for political scientists and other social science researchers by analysing the place of government in information networks in the UK.

I myself then presented some early findings from the Analytical Access to the Domain Dark Archive project, led by the Institute of Historical Research (University of London). I reported on a series of workshops with potential users of the dataset, who raised important questions about research of this type. How far should researchers trust analytical tools inside a 'black box', presenting results generated by algorithms that are not (and often cannot be) transparent? Also, how far does research on datasets of this scale present new questions of research ethics, and who should be looking for the answers to them?

In the afternoon we discussed some of the themes raised in the morning, to do with potential users and their needs. Some of these were:

(i) that large datasets present amazing opportunities for analysis at a macro level, but at the same time many scholars will still want to use web archives as simply another resource discovery option, to find and consult individual sites. Both approaches need to be catered for.

(ii) possible interaction with Wikipedia. As more and more sites disappear from the live web over time, and UKWA increasingly becomes the repository for the only copy, we might expect UKWA to be cited as a source in Wikipedia more often. There may also be ways to aid and encourage this process.

(iii) how do we identify potential user groups? We can't safely say that scholars in Discipline A are more likely to use the archive than those in Discipline B. It may be that sub-groups within each discipline find their own uses. For instance, one wouldn't find much data about the Higgs boson in the archive, but a physicist interested in public engagement with the issue might find a great deal. One wouldn't look in UKWA for the texts of the Man Booker Prize shortlist, but a literature specialist could find a wealth of reviews and other public engagement with those texts.

Overall, it was a most successful day, which gave us much food for thought.

20 September 2012

Valuing Video Games Heritage: an update on our new video games collection


[British Library Digital Curator Stella Wisdom updates us on a forthcoming special collection, preserving the rich digital heritage of video games.]

Some of you may remember my blog post from February this year, where I explained that I was selecting websites for a new Web Archive collection that will preserve information about computer games, including resources documenting gaming culture and the impact that video games have had on education and contemporary cultural life.

Since then I’ve been busy researching several target areas for sites that I would like to add to the collection, such as:

  • Sites which illustrate the experience of playing games, e.g. walkthroughs, image galleries, videos of game play and FAQs
  • Fansites
  • Forums
  • Vulnerable sites, e.g. industry sites for companies that have ceased trading
  • Sites about popular games, i.e. the types of games played by people who do not identify themselves as "gamers"
  • Gamification, i.e. use of game features and techniques being adopted in non-game contexts
  • Educational games and sites which illustrate the progression of game development education
  • Events, e.g. game launches, game culture festivals
  • Pro and anti-video games and game culture sites
  • Sites which chart the evolution of video games
  • Game development competitions, including those that showcase student and independent game developers’ work
  • Game publishers, retailers and reviewers, including journalistic output

One of my challenges has been in obtaining permission from website owners, as not everyone within the video game industry or player community seems to value the richness of its history and heritage, or understand the concepts of digital preservation and web archiving. However, I’ve been making progress in networking, both online and in person, with those who create and play video games. So I’m hoping that this engagement activity will encourage more site owners to respond positively and give their support to the project. I’m also still seeking nominations, so if you know of any sites that you think should be included, then please get in touch (at [email protected] or via Twitter @miss_wisdom) or use the nomination form.

So far, I’ve discovered some wonderful resources and have been able to archive interesting sites, which include:

  • GameCity; an annual videogame culture festival that takes place in Nottingham
  • Dare to be Digital; a video games development competition at Abertay University for students at UK universities and art colleges
  • BAFTA Games; which presents the British Academy Games Awards and also organises a competition for 11 to 16 year olds to recognise and encourage young games designers
  • North Castle; one of the oldest fansites for the Nintendo game The Legend of Zelda   
  • The Oliver Twins; a site that tells the story of Philip & Andrew Oliver, who from the age of 12 began writing games for the UK games market and co-founded Blitz Games Studios in 1990.

13 September 2012

Web Archives and Chinese Literature


The following is a guest post by Professor Michel Hockx, School of Oriental and African Studies, University of London, who explains how doing research on internet literature differs from doing research on printed literature, and how web archives help.


In July of this year, Brixton-based novelist Zelda Rhiando won the inaugural Kidwell-e Ebook Award. The award was billed as “the world’s first international e-book award.” It may have been the first time that e-writers in English from all over the world had been invited to compete for an award, but for e-writers in Chinese such awards have been around for well over a decade. This might sound surprising, since the Chinese Internet is most frequently in the news here for the way in which it is censored, i.e. for what does not appear on it. What people often forget, however, is that the environment for print-publishing in China is much more restricted and much more heavily censored. Therefore, those with literary interests and ambitions have gone online in huge numbers. Reading and writing literature is consistently ranked among the top-ten reasons why Chinese people spend time online.

I have been following the development of Chinese internet literature almost since its inception and I am currently finalizing a monograph on the subject, simply titled Internet Literature in China and due to be published by Columbia University Press. (That scholars of literature feel compelled to publish their research outcomes on topics like this in the form of printed books shows how poorly attuned the humanities world still is to the new technologies.) Doing research on internet literature is substantially different from doing research on printed literature, most importantly because born-digital literary texts are not stable. Printed novels may come in different editions, but generally the assumption of literature scholars who do research on the same novel is that they have all read the same text. For internet literature there can be no such assumption, because “the text” often evolves over time and usually looks different depending on user interaction.  The text looks different depending on when you visited it and what you did with it. So one of the methods I employ is to present my interpretations of such texts at different moments in time. For traditional literature scholars, this is unusual: they don’t normally tell you in their research “when I read this text in 2011, I interpreted it like this, but when I read it again in 2012, I interpreted it like that.” Using this method relies on the availability of the material, and on the possibility to preserve it so that other scholars can reproduce my readings. And that is where web archives come in.

As far as I know, there is no Chinese equivalent of the UK Web Archive. In the area of preservation of born-digital material, China is very far behind the UK (instead it devotes huge resources to the digitization and preservation of its printed cultural heritage). Some literary websites in China have their own archives. In the case of popular genre fiction sites these archives can be huge, and they can be searchable by author, genre, popularity (number of hits or comments), and so on. Genre fiction (romance fiction, martial arts fiction, erotic fiction, and so on) is hugely popular on the Chinese Internet, because of the relatively few legal restrictions compared to print publishing. Readers subscribe to novels they like and they then receive regular new instalments, often on a daily basis. However, no matter how large the archives, there usually tends to be a cut-off point after which works are taken offline. When I first started my research in 2002, I was blissfully unaware of such potential problems. As a result, roughly 90% of the URLs mentioned in the footnotes to my first scholarly articles on the topic are no longer accessible. Fortunately, when I began to rework some of my earlier articles for my book, I found that the Internet Archive had preserved a substantial number of the links, so in many cases my footnotes now refer to the Internet Archive. Although the Internet Archive does not preserve images and other visual material (which can play an important role in online literature), having the texts as I saw them in 2002 is definitely better than having nothing at all, and will convince my fellow scholars that I am not just making them all up!

During my later research, I took care to save pages, and sometimes entire sites or parts of sites, to my own computer to ensure preservation of what I had seen. But archiving material on my computer does not make it any more accessible to others. That is why I use the services of the Digital Archive for Chinese Studies (DACHS, with one server in Heidelberg, and one in Leiden), where scholars in my field can store copies of online material they refer to in footnotes to publications. DACHS also has another important function: it preserves copies of online material from China that is in danger of disappearing, because it is political or ephemeral, or both. DACHS also invites scholars to introduce such materials and place them in context, as in Nicolai Volland’s collection of online documents pertaining to “Control of the Media in the People’s Republic of China”, or Michael Day’s annotated collection of Chinese avant-garde poetry websites.

In order for online Chinese-language literature to be preserved, its cultural value needs to be appreciated not just by foreign enthusiasts like myself, but more generally by scholars and critics in China itself. The first decade or so of Chinese writing on the Internet will probably never be restored in any detail, although a relatively complete picture might still emerge if existing partial archives were merged. Meanwhile, I hope that new archiving options for later material will become available soon. 

05 September 2012

How to Make Websites More Archivable?


I was contacted by an organisation which is going to be disbanded in a couple of months. When the organisation disappears, so will its website. Fortunately we have already archived a few instances of their website in the UK Web Archive.

The lady who contacted me, however, complained that the archival copies are incomplete as they do not include the “database”, and would like to deposit a copy with us. On examination it turned out that a section called “events”, which has a calendar interface, was not copied by our crawler. I also found that two other sections, whose content is pulled dynamically from an underlying database, seem to be accessible only via a search interface. These would have been missed by the crawler too.

The above situation reflects some common technical challenges in web archiving. The calendar is likely to send the crawler inadvertently into a so-called “crawler trap”, as it would follow the (hyper-linked) dates on the calendar endlessly. For that reason, the “events” section was excluded from our previous crawls. The database-driven search interface presents content based on searches or interactions, which the crawler cannot perform. Archiving crawlers are generally capable of capturing explicitly referenced content which can be served by requesting a URL, but cannot deal with URLs which are not explicit in the HTML but are embedded in JavaScript or Flash presentations, or generated dynamically.

We found out the earliest and latest dates related to the events in the organisation’s database and used these to limit the date range the crawler should follow. We then successfully crawled the “events” section without trapping our crawler. For the other two sections, we noticed that the live website also has a map interface which provides browseable lists of projects per region. Unfortunately only the first pages are available, because the links to subsequent pages are broken on the live site. The crawler copied the website as it was, including the broken links.
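
For illustration only, the scoping logic amounts to something like the sketch below. This is not our actual crawler configuration, and the calendar URL pattern and date range are invented; it simply shows the idea of following calendar links only within the known range of real content.

```python
# Rough illustration only (not our actual crawler configuration): avoid the
# calendar crawler trap by following "events" calendar links only when their
# date falls inside the known range of real content. The URL pattern and the
# date range are invented.
import re
from datetime import date

CALENDAR_URL = re.compile(r"/events/calendar\?year=(\d{4})&month=(\d{1,2})")
EARLIEST = date(2005, 1, 1)    # earliest event in the organisation's database
LATEST = date(2012, 6, 1)      # latest event in the organisation's database

def in_scope(url: str) -> bool:
    """Decide whether a discovered link should be queued for crawling."""
    m = CALENDAR_URL.search(url)
    if not m:
        return True  # not a calendar page: leave it to the normal scoping rules
    year, month = int(m.group(1)), int(m.group(2))
    if not 1 <= month <= 12:
        return False
    return EARLIEST <= date(year, month, 1) <= LATEST

# The crawler follows months containing real events, but not the endless
# links into the distant past or future that make up the trap.
assert in_scope("http://example.org/events/calendar?year=2010&month=3")
assert not in_scope("http://example.org/events/calendar?year=2099&month=1")
```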

There are a few basic things which, if taken into account when a website is designed, will make it a lot more archivable. These measures help ensure preservation and avoid information loss if, for any reason, a website has to be taken offline.

1. Make sure important content is also explicitly referenced.
This requirement is not in contradiction with having cool, interactive features. All we ask is that you provide an alternative, crawler-friendly way of access, using explicit or static URLs. A rule of thumb is that each page should be reachable from at least one static URL.

2. Have a site map
Use a site map, in XML or in HTML, to list the pages of your website for crawlers and human users; a minimal sketch of generating one follows after this list.

3. Make sure all links work on your website.
If your website contains broken links, copies of your website will also have broken links.
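
As mentioned under point 2 above, here is a minimal sketch of generating an XML site map in the sitemaps.org format. The list of pages is invented; in practice you would enumerate every page of the site, including content normally reached only through search boxes or calendars.

```python
# Minimal sketch of generating an XML site map in the sitemaps.org format.
# The list of pages is invented.
import xml.etree.ElementTree as ET

pages = [
    "http://example.org/",
    "http://example.org/about/",
    "http://example.org/events/2012/annual-meeting/",   # static URL for dynamic content
    "http://example.org/projects/region/north-west/",
]

urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
for page in pages:
    ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = page

ET.ElementTree(urlset).write("sitemap.xml", encoding="utf-8", xml_declaration=True)
```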

There are more things one can do to make websites archivable. Google, for example, has issued guidelines to webmasters to help find, crawl and index websites: http://support.google.com/webmasters/bin/answer.py?hl=en&answer=35769. Many of the best practices mentioned there are also applicable to archiving crawlers. Although archiving crawlers work in a way that is very similar to search engine crawlers, it is important to understand the difference. Search engine crawlers are only interested in files which can be indexed. Archiving crawlers intend to copy all files, of all formats, belonging to a website.

Helen Hockx-Yu, Head of Web Archiving, British Library

30 August 2012

Analysing File Formats in Web Archives


Knowledge of file formats is crucial to digital preservation. Without this, it is impossible to define a preservation strategy. Andy Jackson, Web Archiving Technical Lead at the British Library, explains how to analyse the formats used in archived web resources for digital preservation purposes. This post also appears on the Open Planets Foundation blog.

The UK Web Archive recently released a new suite of visualisations and datasets. Amongst these is a format profile, summarising the data formats (MIME types) in the JISC UK Web Domain Dataset (1996-2010). This contains some 2.5 billion HTTP 200 responses stretching from 1996 to 2010, neatly packed into ARC files and stored on our HDFS cluster. Storing it in HDFS allows us to run Map-Reduce tasks over the whole dataset and analyse the results.
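
The profile itself was produced by Map-Reduce jobs running over the ARC files in HDFS. As a much smaller, single-machine illustration of the core counting step, the sketch below reads one (W)ARC file and tallies the server-declared MIME types of its HTTP 200 responses; it uses the warcio Python library purely for convenience (this is not the code we ran over the cluster), and the input filename is invented.

```python
# Single-machine sketch of the core counting step (the real thing ran as
# Map-Reduce jobs over the ARC files in HDFS). It uses the warcio library,
# which reads both ARC and WARC files, purely for illustration; the input
# filename is invented.
from collections import Counter
from warcio.archiveiterator import ArchiveIterator

def count_mime_types(path: str) -> Counter:
    """Count the server-declared Content-Type of each HTTP 200 response."""
    counts = Counter()
    with open(path, "rb") as stream:
        for record in ArchiveIterator(stream, arc2warc=True):
            if record.rec_type != "response" or record.http_headers is None:
                continue
            if record.http_headers.get_statuscode() != "200":
                continue
            mime = record.http_headers.get_header("Content-Type") or "unknown"
            counts[mime.split(";")[0].strip().lower()] += 1
    return counts

if __name__ == "__main__":
    print(count_mime_types("sample.arc.gz").most_common(10))
```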

Given this infrastructure, my first thought was to use it to test and compare format identification processes by running multiple identification tools over the same corpus. By analysing the depth and coverage of the results, we can estimate which tools are better suited to which types of resources and collections. Furthermore, much as double re-keying can be used to establish 'ground truth' for OCR data, each tool acts as an independent opinion on the format of a resource, and so permits us a little more confidence in their assertions when they are found to coincide. This allows us to focus our attention on where the tools disagree, and helps to ensure that our efforts to improve those tools will have the greatest impact.

To this end, I wrapped up Apache Tika and the DROID binary signature identifier as part of a Map-Reduce task and ran them over the entire corpus. I mapped the results of both to a formalised extended MIME type syntax, such that each PUID has a unique MIME type of the form 'application/pdf; version=1.4', and used that to compare the results of the tools.
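
Schematically, the comparison step looks something like the sketch below. The identification calls are stubbed out as hypothetical helpers rather than real Tika or DROID invocations, and the PUID-to-MIME table is only a tiny illustrative excerpt; the point is the normalisation to the extended MIME syntax and the agreement check.

```python
# Sketch of the comparison step only. identify_with_tika / identify_with_droid
# are hypothetical stand-ins for the real tool invocations, and the PUID table
# is a tiny illustrative excerpt.
PUID_TO_MIME = {
    "fmt/18": "application/pdf; version=1.4",
    "fmt/19": "application/pdf; version=1.5",
}

def identify_with_tika(payload: bytes) -> str:
    """Placeholder: would invoke Apache Tika and return a MIME type string."""
    raise NotImplementedError

def identify_with_droid(payload: bytes) -> str:
    """Placeholder: would invoke DROID and return a PUID such as 'fmt/18'."""
    raise NotImplementedError

def compare(payload: bytes) -> tuple:
    """Return (tika_mime, droid_mime, agree?) for one archived resource."""
    tika_mime = identify_with_tika(payload).lower()
    droid_mime = PUID_TO_MIME.get(identify_with_droid(payload),
                                  "application/octet-stream")
    # Agreement means the normalised extended MIME types match exactly;
    # the disagreements are the cases worth inspecting by hand.
    return tika_mime, droid_mime, tika_mime == droid_mime
```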

Of course, as well as establishing trust in the tools, this kind of data helps us start to explore the way format usage has changed over time, and is a necessary first step in understanding the nature of format obsolescence. As a taster, here is a chart showing the usage of different versions of HTML over time:

[Chart: HTML versions in the crawl, by year]

As you can see, each version rises to dominance and then fades away, but the fade slows down each time. Across the 2010 time-slice, all the old versions of HTML are still turning up in the crawl. You can find some more information and results on the UK Web Archive site.

Finally, as well as exporting the format identifiers, I also used Apache Tika to extract any information it found about the software or hardware platform the resource was created on. All of this information was combined with the MIME type declared by the server and then aggregated by year to produce a rich and complex longitudinal multi-tool format profile for this collection.
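
Schematically, that aggregation amounts to counting tuples of (crawl year, server-declared MIME type, Tika result, DROID result). A toy sketch, with invented values:

```python
# Toy sketch of the aggregation step, assuming per-record results already
# exist: each record contributes a (year, server MIME, Tika MIME, DROID MIME)
# tuple, and counting identical tuples gives the per-year format profile.
# The two sample tuples are invented.
from collections import Counter

records = [
    ("2010", "text/html", "text/html; version=4.01", "text/html; version=4.01"),
    ("1996", "text/html", "text/html; version=2.0", "text/html; version=2.0"),
]

profile = Counter(records)
for (year, served, tika, droid), count in sorted(profile.items()):
    print(year, served, tika, droid, count)
```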


If this is of interest to you, please download the dataset and start exploring it. Let me know if you find it useful, and do share any interesting results you dig out of it.