Introduction

The UK web is one of the most important aspects of the nation’s digital record. But the web is extremely vulnerable, and websites can and do disappear frequently. Preserving them, and providing access to those preserved versions, have become matters of urgency and strategic importance.

26 October 2012

Ambassador, with these websites, you're really spoiling us

[A special guest post from Stella Wisdom, Digital Curator at the British Library.]

A little help from our friends for the video games collection

In my post last month I discussed the challenges in obtaining permission from website owners for the sites I’m selecting for the new video games collection. I’ve decided to try a new approach, recruiting an ambassador who is well known and respected in the video game industry to champion the collection and to help explain and promote the benefits of web archiving to site owners. I would hereby like to introduce Ian Livingstone, Life President of Eidos, the company behind the success of Lara Croft: Tomb Raider. Ian brings so much experience to the party that we don’t even expect him to bring chocolates!

Ian’s long history in the gaming industry started in 1975 when he co-founded Games Workshop, launching Dungeons & Dragons in Europe, then building a nationwide retail chain and publishing White Dwarf magazine. In 1982, with Games Workshop co-founder Steve Jackson, he created the Fighting Fantasy role-playing gamebook series, which has sold over 16 million copies to date. Fighting Fantasy is 30 years old this year, and Ian has written a new gamebook, Blood of the Zombies, to celebrate the anniversary; it has also recently launched as an app on iOS and Android.

In 1984, Ian moved into computer games, designing Eureka, the first title released by publisher Domark. He then oversaw a merger that created Eidos Interactive, where he was Chairman for seven years. At Eidos he helped bring to market some of its most famous titles including Lara Croft: Tomb Raider. Ian became Life President of Eidos for Square Enix, which bought the publisher in 2009, and he continues to have creative input in all Eidos-label games.

Ian is known for actively supporting upcoming games talent, as an advisor and investor in indie studios such as Playdemic, Mediatonic and Playmob. He is vice chair of trade body UKIE, a trustee of industry charity GamesAid, chair of the Video Games Skills Council, chair of Next Gen Skills, a member of the Creative Industries Council and an advisor to the British Council.

In 2010 he was asked by Ed Vaizey, the UK Minister for Culture, Communications and Creative Industries, to become a government skills champion and was tasked with producing a report reviewing the UK video games industry. The NextGen review, co-authored with Alex Hope of visual effects firm Double Negative, was published by NESTA in 2011, recommending changes in education policy, the main one being to bring computer science into the schools' National Curriculum as an essential discipline.

With this wealth of experience and connections, I can’t think of anyone better to work with and I’m hopeful the collection will successfully grow with Ian’s support. This is the first time the UK Web Archive has appointed an ambassador for a collection, so it will be interesting to follow its progress. If the use of a champion is successful, then other collections may benefit from the same approach.

I’ve also been doing some advocacy work of my own this week, talking about the video game collection at the GameCity7 festival, meeting many interesting people there, discussing video game heritage and engaging them in web archiving.

As ever, I’m still seeking nominations, so if you know of any sites that you think should be included, then please get in touch (at [email protected] or via Twitter @miss_wisdom) or use the nomination form.

Posted by Peter Webster at 9:28 AM

18 October 2012

Religion, the state and the law in contemporary Britain

Another in a series of forthcoming new collections is one that I myself am curating with the working title of 'State, religion and law in contemporary Britain.'

The politics of religion in Britain looks like a much more urgent area of inquiry in 2012 than it did a decade ago. In large part due to the terrorist attacks of 9/11 and 7/7, questions about the nexus of faith and national identity have found a new urgency. At the same time, older questions about the place of faith schools and of the bishops in the House of Lords, or of abortion or euthanasia have been given new and sharper focus in a changed climate of public debate.

The period since 2001 is also marked by a massive upswing in the use of the web as a medium for religious and religio-political debate, both by the established churches and campaigning secularist organisations, and by individuals and smaller organisations, most obviously in the blogosphere.

This collection is therefore trying to capture some representative sites concerned with issues of politics, government and law that touch on the disputed role of religious symbolism, belief and practice in the public sphere in Britain.

The collection is still ongoing and suggestions are very welcome, to [email protected], or via the nomination page. So far, the collection is rather weighted towards Christian voices and organisations, and suggestions for sites from amongst other faiths would be particularly welcome.

I've attempted to capture some representative general voices, such as the blog of the human rights campaigner Peter Tatchell, which deals with religious issues; the public theology think-tank Theos, and the National Secular Society.

We have already harvested some interesting sites relating to specific issues and events, such as the official site for the 2010 Papal visit to the UK, and some of the dispute at the time about the appropriateness or otherwise of spending public money on the security arrangements for the visit, from the BBC and elsewhere.

An issue at the 2010 General Election was the place of the bishops in the House of Lords, and the Power2010 campaign pressed for that to change, as did the British Humanist Association.

An issue that has come to prominence in recent weeks is that of the appropriate time limit for abortion, and we have twelve archived instances of the site of the Society for the Protection of the Unborn Child, stretching back as far as 2005.

Posted by Peter Webster at 2:36 PM

Tags

Collections

11 October 2012

BlogForever: a new approach to blog harvesting and preservation?

[Ed Pinsent of the University of London Computer Centre writes about the BlogForever project.]

The European Commission-funded BlogForever project is developing an exciting new system to harvest, preserve, manage and reuse blog content. I'm interested not only as a supplier to the project, but also because I'm fairly familiar with the way that Heritrix copies web content, and the BlogForever spider seems to promise a different method.

The system will perform an intelligent harvesting operation which retrieves and parses hypertext as well as all other associated content (images, linked files, etc) from blogs. It copies content by interrogating not only the RSS feed of a blog (similar to the JISC ArchivePress project), but also by copying data from the original HTML. The parsing action will be able to render the captured content into structured data, expressed in XML; it does this in accordance with the project's data model.

The result of this parsing action will carve semantic entities out of blog content on an unprecedented micro-level. Author names, comments, subjects, tags, categories, dates, links, and many other elements will be expressed within the hierarchical XML structure. When this content is imported into the BlogForever repository (based on CERN’s Invenio platform), a public-facing access mechanism will provide a rendition of the blog which can be interrogated, queried and searched to a high degree of detail. Every rendition, and updated version thereof, will be different, representing a different time-slice of the web, without the need for creating and managing multiple copies of the same content. The resulting block of XML will be much easier to store, preserve, and render than the output of current web-archiving methods.
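
As a rough illustration of the parsing step described above, the following Python sketch reads an RSS 2.0 feed and recasts each item as a structured XML record. It is not the BlogForever spider: the output element names (blog, post, author, tag and so on), the sample feed and the function name are hypothetical and do not reflect the project's actual data model, and a real harvester would also parse the original HTML.

```python
import xml.etree.ElementTree as ET

# Dublin Core namespace, commonly used in RSS feeds for the dc:creator element.
DC = "{http://purl.org/dc/elements/1.1/}"

# Invented sample feed, standing in for a live blog's RSS output.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Example blog</title>
    <item>
      <title>First post</title>
      <link>http://example.org/first-post</link>
      <pubDate>Thu, 11 Oct 2012 10:57:00 GMT</pubDate>
      <dc:creator>A. Blogger</dc:creator>
      <category>web archiving</category>
      <category>blogs</category>
    </item>
  </channel>
</rss>"""


def feed_to_records(feed_xml):
    """Recast an RSS 2.0 feed as a hypothetical structured <blog> document."""
    channel = ET.fromstring(feed_xml).find("channel")
    blog = ET.Element("blog", title=channel.findtext("title", default=""))
    for item in channel.findall("item"):
        post = ET.SubElement(blog, "post", url=item.findtext("link", default=""))
        ET.SubElement(post, "title").text = item.findtext("title", default="")
        ET.SubElement(post, "author").text = item.findtext(DC + "creator", default="")
        ET.SubElement(post, "published").text = item.findtext("pubDate", default="")
        # Each RSS category becomes a separate, individually searchable tag.
        for category in item.findall("category"):
            ET.SubElement(post, "tag").text = category.text
    return blog


records = feed_to_records(SAMPLE_FEED)
print(ET.tostring(records, encoding="unicode"))
```

Once authors, dates and tags are separate elements like this, a repository can index and query them individually, which is the kind of fine-grained searching the project describes.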

BlogForever are proposing to create a demonstrator system to prove that it would be possible for any organisation, or consortium of like-minded organisations, to curate aggregated databases of blog content on selected themes. If there were a collection of related blogs in fields such as scientific research, media, news, politics, arts or education, a researcher could search across that content in very detailed ways, revealing significant connections between written content. Potentially, that's an interrogation of web content of a quality that even Google cannot match.

This interests me as it might also offer us the potential to think about web preservation in a new way. In most existing methods, the approach is to copy entire websites from URLs, replicating the folder structure. This approach tends to treat each URL as a single entity, and follows the object-based method of digital preservation, by which I mean that all digital objects in a website (images, attachments, media, stylesheets) are copied and stored. We've tended to rely on sophisticated wrapper formats to manage all that content and preserve the folder hierarchy; ARC and WARC are useful in that respect, and in California the BagIt approach also works for websites, and is capable of moving large datasets around a network efficiently.

Conversely, the type of content going into the BlogForever repository is material generated by the spider: it’s no longer the unstructured live web. It’s structured content, pre-processed, and parsed, fit to be read by the databases that form the heart of the BlogForever system. The spider creates a “rendition” of the live web, recast into the form of a structured XML file. XML is already known to be a robust preservation format.

If these renditions of blogs were to become the target of preservation, we would potentially have a much more manageable preservation task ahead of us, with a limited range of content and behaviours to preserve and reproduce. It feels like instead of trying to preserve the behaviour, structure and dependencies of large numbers of digital objects, we would instead be preserving very large databases of aggregated content.

BlogForever (ICT No. 269963) is funded by the European Commission under Framework Programme 7 (FP7) ICT Programme

Posted by Peter Webster at 10:57 AM

04 October 2012

Exploring the lost web

There has been some attention paid recently to the rate at which the web decays. A very interesting recent article by SalahEldeen and Nelson looked at the rate at which online sources shared via social media subsequently disappear. The authors concluded that 11% would disappear in the first year, and after that there would be a loss of 0.02% per day (that's another 7.3% per year); a startling rate of loss.
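
The arithmetic behind these figures can be checked with a short Python sketch. It assumes, as a simplification, that the 0.02% daily loss is a constant fraction of the original set of resources (a linear model) and that a year has 365 days; the function name is hypothetical and is not taken from the article.

```python
FIRST_YEAR_LOSS = 0.11   # 11% lost in the first year
DAILY_LOSS = 0.0002      # 0.02% lost per day thereafter


def fraction_lost(years):
    """Estimated fraction of shared resources lost after `years` years (years >= 1)."""
    extra_days = (years - 1) * 365
    # Cap at 1.0: no more than everything can be lost.
    return min(1.0, FIRST_YEAR_LOSS + DAILY_LOSS * extra_days)


# Yearly loss implied by the daily rate once the first year has passed.
annual_loss_after_first_year = DAILY_LOSS * 365
print(f"Loss per later year: {annual_loss_after_first_year:.1%}")
for years in (1, 2, 5):
    print(f"After {years} year(s): {fraction_lost(years):.1%}")
```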

There are ways and means of doing something about it, not least through national and international web archives like ourselves. And we preserve many extremely interesting sites that are already lost from the live UK web domain.

Some of them relate to prominent public figures who have either passed away, or are no longer in that public role. One example of the former is the site of the former Labour MP and foreign secretary Robin Cook, who died in 2005. One of the latter is that of his colleague Clare Short, who left parliamentary politics in 2010 after serving as secretary of state for international development.

Organisations often have limited lives as well, of course, and amongst our collections is the site of the Welsh Language Board, set up by act of Parliament in 1993, and abolished by later legislation in 2012. Perhaps more familiar was one of the major corporate casualties of recent years, Woolworths, which went into administration in 2008.

Some others relate to events that have happened or campaigns that have ended. In the case of some of the more 'official' sites, we in the web archiving team can anticipate when sites are likely to be at risk, and can take steps to capture them. In other cases, we need members of the public to let us know. If you know of a site which you think is important, and that may be at risk, please let us know using our nomination form.

One such site is One and Other, Antony Gormley's live artwork on the vacant fourth plinth in Trafalgar Square. Also in the archive is David Cameron's campaign site from when he was a candidate for the constituency of Witney in the 2005 general election. Finally, there is What a difference a day makes, a remarkable blog post from one who experienced the London terrorist attacks of 2005. All three now exist only in the web archive.

Posted by Peter Webster at 10:00 AM

Tags

vanished site

27 September 2012

Digital Research 2012, Oxford

I recently made the trip to the Digital Research 2012 gathering in Oxford, with my colleagues in the web archiving team Helen Hockx-Yu and Andy Jackson. We were taking part in a day of presentations and workshops on the theme of digital research using web archives. (See the programmes of our session and of the whole conference.)

It was an excellent opportunity to showcase a cluster of current projects, both here at the BL and in association with us, and to make connections between them. Andy demonstrated some of the forthcoming visualisation tools for the archive, some of which are already available on the UK Web Archive site (see earlier post). Helen presented some summary results from a recent survey of our users, about which she wrote in an earlier post.

Recently, the JISC very generously funded two projects to explore the use of the UK Web Domain Dataset, and there were presentations from both. Helen Margetts from the Oxford Internet Institute presented the Big Data project, which is conducting a link analysis of the whole dataset, showing its usefulness for political scientists and other social science researchers by analysing the place of government in information networks in the UK.

I myself then presented some early findings from the Analytical Access to the Domain Dark Archive project, led by the Institute of Historical Research (University of London). I reported on a series of workshops with potential users of the dataset, who raised important questions about research of this type. How far should researchers trust analytical tools inside a 'black box', presenting results generated by algorithms that are not (and often cannot be) transparent? Also, how far does research on datasets of this scale present new questions of research ethics, and who should be looking for the answers to them?

In the afternoon we discussed some of the themes raised in the morning, to do with potential users and their needs. Some of these were:

(i) that large datasets present amazing opportunities for analysis at a macro level, but at the same time many scholars will still want to use web archives as simply another resource discovery option, to find and consult individual sites. Both approaches need to be catered for.

(ii) possible interaction with Wikipedia. As over time more and more sites disappear from the live web, and UKWA increasingly becomes the repository for the only copy, we might expect UKWA to become cited as a source more in Wikipedia. However, there may be ways to aid and encourage this process.

(iii) how do we identify potential user groups? We can't safely say that scholars in Discipline A are more likely to use the archive than those in Discipline B. It may be that sub-groups within each discipline find their own uses. For instance: one wouldn't find much data about the Higgs Boson in the archive; but a physicist interested in public engagement with the issue might find a great deal. One wouldn't look in UKWA for the texts of the Man Booker prize shortlist; but a literature specialist could find a wealth of reviews and other public engagement with those texts.

Overall, it was a most successful day, which gave us much food for thought.

Posted by Peter Webster at 9:43 AM

Tags

Web/Tech

20 September 2012

Valuing Video Games Heritage: an update on our new video games collection

[British Library Digital Curator Stella Wisdom updates us on a forthcoming special collection, preserving the rich digital heritage of video games.]

Some of you may remember my blog post from February this year, where I explained that I was selecting websites for a new Web Archive collection that will preserve information about computer games, including resources documenting gaming culture and the impact that video games have had on education and contemporary cultural life.

Since then I’ve been busy researching several target areas for sites that I would like to add to the collection, such as:

Sites which illustrate the experience of playing games, e.g. walkthroughs, image galleries, videos of game play and FAQs
Fansites
Forums
Vulnerable sites, e.g. industry sites for companies that have ceased trading
Sites about popular games, i.e. the types of games played by people who do not identify themselves as "gamers"
Gamification, i.e. use of game features and techniques being adopted in non-game contexts
Educational games and sites which illustrate the progression of game development education
Events, e.g. game launches, game culture festivals
Pro and anti-video games and game culture sites
Sites which chart the evolution of video games
Game development competitions, including those that showcase student and independent game developers’ work
Game publishers, retailers and reviewers, including journalistic output

One of my challenges has been in obtaining permission from website owners, as not everyone within the video game industry or player community seems to value the richness of its history and heritage, or understand the concepts of digital preservation and web archiving. However, I’ve been making progress in networking, both online and in person, with those who create and play video games. So I’m hoping that this engagement activity will encourage more site owners to respond positively and give their support to the project. I’m also still seeking nominations, so if you know of any sites that you think should be included, then please get in touch (at [email protected] or via Twitter @miss_wisdom) or use the nomination form.

So far, I’ve discovered some wonderful resources and have been able to archive interesting sites, which include:

GameCity; an annual videogame culture festival that takes place in Nottingham
Dare to be Digital; a video games development competition at Abertay University for students at UK universities and art colleges
BAFTA Games; who give British Academy Games Awards and also organise a competition for 11 to 16 year olds to recognise and encourage young games designers
North Castle; one of the oldest fansites for the Nintendo game The Legend of Zelda
The Oliver Twins; a site that tells the story of Philip & Andrew Oliver, who from the age of 12 began writing games for the UK games market and co-founded Blitz Games Studios in 1990.

Posted by Peter Webster at 2:58 PM

13 September 2012

Web Archives and Chinese Literature

The following is a guest post by Professor Michel Hockx, School of Oriental and African Studies, University of London, who explains the difference between doing research on internet literature and doing research on printed literature, and how web archives help.

In July of this year, Brixton-based novelist Zelda Rhiando won the inaugural Kidwell-e Ebook Award. The award was billed as “the world’s first international e-book award.” It may have been the first time that e-writers in English from all over the world had been invited to compete for an award, but for e-writers in Chinese such awards have been around for well over a decade. This might sound surprising, since the Chinese Internet is most frequently in the news here for the way in which it is censored, i.e. for what does not appear on it. What people often forget, however, is that the environment for print-publishing in China is much more restricted and much more heavily censored. Therefore, those with literary interests and ambitions have gone online in huge numbers. Reading and writing literature is consistently ranked among the top-ten reasons why Chinese people spend time online.

I have been following the development of Chinese internet literature almost since its inception and I am currently finalizing a monograph on the subject, simply titled Internet Literature in China and due to be published by Columbia University Press. (That scholars of literature feel compelled to publish their research outcomes on topics like this in the form of printed books shows how poorly attuned the humanities world still is to the new technologies.) Doing research on internet literature is substantially different from doing research on printed literature, most importantly because born-digital literary texts are not stable. Printed novels may come in different editions, but generally the assumption of literature scholars who do research on the same novel is that they have all read the same text. For internet literature there can be no such assumption, because “the text” often evolves over time and usually looks different depending on user interaction. The text looks different depending on when you visited it and what you did with it. So one of the methods I employ is to present my interpretations of such texts at different moments in time. For traditional literature scholars, this is unusual: they don’t normally tell you in their research “when I read this text in 2011, I interpreted it like this, but when I read it again in 2012, I interpreted it like that.” Using this method relies on the availability of the material, and on the possibility to preserve it so that other scholars can reproduce my readings. And that is where web archives come in.

As far as I know, there is no Chinese equivalent of the UK Web Archive. In the area of preservation of born-digital material, China is very far behind the UK (instead it devotes huge resources to the digitization and preservation of its printed cultural heritage). Some literary websites in China have their own archives. In the case of popular genre fiction sites these archives can be huge, and they can be searchable by author, genre, popularity (number of hits or comments), and so on. Genre fiction (romance fiction, martial arts fiction, erotic fiction, and so on) is hugely popular on the Chinese Internet, because of the relatively few legal restrictions compared to print publishing. Readers subscribe to novels they like and they then receive regular new instalments, often on a daily basis. However, no matter how large the archives, there usually tends to be a cut-off point after which works are taken offline. When I first started my research in 2002, I was blissfully unaware of such potential problems. As a result, roughly 90% of the URLs mentioned in the footnotes to my first scholarly articles on the topic are no longer accessible. Fortunately, when I began to rework some of my earlier articles for my book, I found that the Internet Archive had preserved a substantial number of the links, so in many cases my footnotes now refer to the Internet Archive. Although the Internet Archive does not preserve images and other visual material (which can play an important role in online literature), having the texts as I saw them in 2002 is definitely better than having nothing at all, and will convince my fellow scholars that I am not just making them all up!

During my later research, I took care to save pages, and sometimes entire sites or parts of sites, to my own computer to ensure preservation of what I had seen. But archiving material on my computer does not make it any more accessible to others. That is why I use the services of the Digital Archive for Chinese Studies (DACHS, with one server in Heidelberg, and one in Leiden), where scholars in my field can store copies of online material they refer to in footnotes to publications. DACHS also has another important function: it preserves copies of online material from China that is in danger of disappearing, because it is political or ephemeral, or both. DACHS also invites scholars to introduce such materials and place them in context, as in Nicolai Volland’s collection of online documents pertaining to “Control of the Media in the People’s Republic of China”, or Michael Day’s annotated collection of Chinese avant-garde poetry websites.

In order for online Chinese-language literature to be preserved, its cultural value needs to be appreciated not just by foreign enthusiasts like myself, but more generally by scholars and critics in China itself. The first decade or so of Chinese writing on the Internet will probably never be restored in any detail, although a relatively complete picture might still emerge if existing partial archives were merged. Meanwhile, I hope that new archiving options for later material will become available soon.

Posted by Hhockx at 9:30 AM

05 September 2012

How to Make Websites More Archivable?

I was contacted by an organisation which is going to be disbanded in a couple of months. When the organisation disappears, so will its website. Fortunately we have already archived a few instances of their website in the UK Web Archive.

The lady who contacted me, however, complained that the archival copies were incomplete as they did not include the “database”, and asked to deposit a copy with us. On examination it turned out that a section called “events”, which has a calendar interface, had not been copied by our crawler. I also found that two other sections, whose content is pulled dynamically from an underlying database, seemed to be accessible only via a search interface. These would have been missed by the crawler too.

The above situation reflects some common technical challenges in web archiving. The calendar is likely to send the crawler into a so-called “crawler trap”, as it would follow the hyperlinked dates on the calendar endlessly. For that reason, the “events” section was excluded from our previous crawls. The database-driven search interface presents content based on searches or interactions, which the crawler cannot perform. Archiving crawlers are generally capable of capturing explicitly referenced content which can be served by requesting a URL, but cannot deal with URLs which are not explicitly in the HTML but are embedded in JavaScript or Flash presentations or generated dynamically.
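
To illustrate the limitation, the Python sketch below extracts links the way a very simple crawler might, by reading the href attributes of anchor tags. It is a simplified stand-in, not a description of how any particular archiving crawler is implemented; the sample page and the class name are invented for the example. The URL assembled inside the script is never discovered.

```python
from html.parser import HTMLParser

# Invented sample page: two explicit links and one URL built by a script.
SAMPLE_PAGE = """
<html><body>
  <a href="/about.html">About us</a>
  <a href="/projects/list.html">Projects</a>
  <script>
    // This URL is built at run time, so it never appears as an explicit link.
    var next = "/events?" + "month=" + currentMonth;
    window.location = next;
  </script>
</body></html>
"""


class ExplicitLinkExtractor(HTMLParser):
    """Collect only the URLs that appear explicitly in <a href="..."> attributes."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


extractor = ExplicitLinkExtractor()
extractor.feed(SAMPLE_PAGE)
print(extractor.links)
```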

We found out the earliest and latest dates related to the events in the organisation’s database and used these to limit the date range the crawler should follow. We then successfully crawled the “events” section without trapping our crawler. For the other two sections, we noticed that the live website also has a map interface which provides browseable lists of projects per region. Unfortunately only the first pages are available because the links to subsequent pages are broken on the live site. The crawler copied the website as it was, including the broken links.
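
The date-range restriction can be sketched as a simple URL filter. This is an illustration only: the URL pattern (/events?date=YYYY-MM-DD), the two dates and the function name are assumptions made for the example, and a production crawler would express the same rule in its own scoping configuration rather than in Python.

```python
from datetime import date
from urllib.parse import urlparse, parse_qs

# Hypothetical earliest and latest event dates taken from the site's database.
EARLIEST = date(2005, 1, 1)
LATEST = date(2012, 12, 31)


def in_scope(url):
    """Return True if the crawler should follow this URL."""
    parsed = urlparse(url)
    if parsed.path != "/events":
        return True  # only the calendar section needs the date restriction
    values = parse_qs(parsed.query).get("date")
    if not values:
        return True  # the calendar's landing page carries no date
    try:
        requested = date.fromisoformat(values[0])
    except ValueError:
        return False  # malformed date: do not follow
    return EARLIEST <= requested <= LATEST


print(in_scope("http://example.org/events?date=2010-06-15"))
print(in_scope("http://example.org/events?date=2093-01-01"))
```

Because calendar links outside the known range are refused, the crawler stops following "next month" links once it passes the last real event, which is what keeps it out of the trap.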

There are a few basic things which, if taken into account when a website is designed, will make it a lot more archivable. These measures help ensure preservation and avoid information loss if, for any reason, a website has to be taken offline.

1. Make sure important content is also explicitly referenced.
This requirement is not in contradiction with having cool, interactive features. All we ask you to do is to provide an alternative, crawler-friendly way of access, using explicit or static URLs. A rule of thumb is that each page should be reachable from at least one static URL.

2. Have a site map
Use a site map to list the pages of your website accessible to crawlers or human users, in XML or in HTML.

3. Make sure all links work on your website.
If your website contains broken links, copies of your website will also have broken links.
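
As an example of the second point, a minimal XML site map can be generated with a few lines of Python. The page URLs are placeholders and the function name is hypothetical; the urlset, url and loc elements and the namespace follow the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap document listing one <url><loc> entry per page."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = page
    return ET.tostring(urlset, encoding="unicode")


# Placeholder pages, including static URLs for content that is otherwise
# reachable only through a calendar or search interface.
pages = [
    "http://example.org/",
    "http://example.org/events/2012-09.html",
    "http://example.org/projects/north-west.html",
]
sitemap_xml = build_sitemap(pages)
print(sitemap_xml)
```

Listing database-driven pages under static URLs in this way gives a crawler an explicit route to content it could not reach by following the interactive interface.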

There are more things one can do to make websites archivable. Google, for example, has issued guidelines to webmasters to help it find, crawl, and index websites: http://support.google.com/webmasters/bin/answer.py?hl=en&answer=35769. Many of the best practices mentioned there also apply to archiving crawlers. Although archiving crawlers work in a way that is very similar to search engine crawlers, it is important to understand the difference. Search engine crawlers are only interested in files which can be indexed. Archiving crawlers intend to copy all files, of all formats, belonging to a website.

Helen Hockx-Yu, Head of Web Archiving, British Library

Posted by Hhockx at 8:25 AM

Tags

Collections