UK Web Archive blog

Introduction

News and views from the British Library’s web archiving team and guests. Posts about the public UK Web Archive and, since April 2013, about web archiving as part of non-print legal deposit. Editor-in-chief: Jason Webber.

29 May 2015

Beginner’s Guide to Web Archives, Part 1


Arriving at the British Library as an intern, one of the tasks laid out before me was to create and curate a special collection for the UK Web Archive. To some readers of this blog this activity may seem fairly self-explanatory. However, before arriving at the Library I had never even heard of web archiving, let alone considered why we do it and who it could be useful for. In a short series of blog posts I will explore these questions from the novice’s point of view, both my own and that of academic researchers hoping to use the resource. I hope to convey the new user’s perceptions of the challenges and opportunities of the archive, as well as providing an introduction for interested beginners.

Spiders spinning furiously

The web is a vast resource. In 2008 Google reported that it had found 10¹² (one trillion) URLs online. It has been suggested that the web represents a rapid expansion in human knowledge; certainly it enables greater access to human knowledge for billions of people. It is also a place where a huge range of opinions are openly expressed. However, the content of the web has a very rapid turnover, with around 40% of websites changing their content within a week. Without web archiving (the practice of collecting and storing websites), much human writing is inevitably, and often accidentally, lost.

The UK Web Archive now collects almost the entire UK web space. One of the problems facing users of the archive is the astounding amount of data through which to sift. One way of getting around this problem is to create so-called ‘special collections’: groups of websites that fall under a particular theme. This enables the curator to provide the user with a set of data that is easier to sort and search.

My special collection

[Image: ‘Walk Against Warming’, https://www.flickr.com/photos/erlandh/270904893/]

As a science PhD student, I felt my special collection should be built with the aim of answering research questions related to a scientific topic. I specialise in oceanography and past climate change, and I am aware of the almost constant debate that occurs on hundreds of climate-related websites about climate science, the social impacts of climate change and the policies that should be enforced. A special collection on these issues might be useful for answering questions such as: How has the web influenced public opinion on climate change? As new science rolls in, how do viewpoints expressed on the web change? How do different organisations use the web as a platform for promoting their beliefs?

[Image: ‘Global warming in perspective’, https://www.flickr.com/photos/wheatfields/4688140998/in/photolist-2XsBdQ-92Bik-7u74nu-a55CZL-s6TSND-89gWZd-8FJLyQ]

To provide a resource for answering these questions, I plan to select webpages from organisations including environmental charities, climate-sceptic think tanks, energy companies and government, as well as pages of blogs, articles and discussion. I hope that this collection will become a useful resource for anyone interested in the climate change issue. But would this resource be something researchers might actually use? And how might they go about using it? Find out in my next post.

Peter Spooner, Science Policy Intern

 

23 April 2015

Web archiving as a challenging business


My internship in the British Library’s Web Archiving team is coming to an end, and I will try to sum up my impressions. I would say I have been struck by what a daunting task web archiving is, and how many challenges it creates for professionals.

Displaying an open collection

The British Library provides the public with an open collection of websites, accessible from anywhere. These open collections are resource-intensive, being enriched with metadata and descriptions. This work is done by web curators and web archivists. The latter are also in charge of quality assurance: they check whether the harvest was done properly by the web-crawling software. Giving open access means asking permission from the website owners. This is a very labour-intensive and slow process, which could easily absorb two or three times the resources currently available. To cope with urgent events, such as the next General Election, the selection is done now, while the permission requests have to be postponed to a less busy time. For some resources, open access is not an option at all; some news websites, for example, charge for access to their own archives.

Providing searching tools

You’d think things would have got easier since the 2013 Legal Deposit Libraries (Non-Print Works) Regulations allowed the British Library to collect and preserve UK websites without asking permission. But new issues arise: collecting a huge quantity of data, indexing it, preserving it for the long term, and dealing with the fact that an archived website may not look the same as its live version. And then all this content must be made available to users (restricted to the reading rooms for websites collected without permission).


But how does one search a web archive? Anyone who has tried probably came away with the nagging sense that there is simply too much data to deal with. One of the challenges is therefore to provide users with efficient tools that enable them to find their way through this maze of data. Users, in turn, need to learn how to use these tools, bearing in mind that their expectations may be shaped by the habit of using Google. Using a web archive for scholarly purposes is a completely different approach: a historical search engine must meet specific requirements. There is no Google-like relevance ranking here, but a chronological ordering enhanced with powerful refinement features such as event detection and timelines. This research project from the L3S Research Centre in Germany is one among several involving web archives, and it shows that tool building goes hand in hand with the researchers who use web archives as material for their work.


Being involved in web archiving today is really fascinating. It means observing, and being part of, an emerging field. This was also discussed in the opening presentation of the 2014 IIPC General Assembly.

A new job?

Web archiving is not yet really part of librarians’ training, and professionals have to learn by doing. At the moment web archiving concerns only a handful of people, mostly based in national libraries (though this is becoming less true over time, as can be seen in the composition of the IIPC).


But the issues arising with web archiving are in line with general trends for libraries, such as managing electronic journals, which are mostly bought and displayed as packages, or mass digitisation projects. The new challenge consists in dealing with matters of scale. The core business of librarians is seemingly shifting from selecting resources to highlighting them. Social media channels are one of the new librarian’s tools for doing so: most digital libraries have a Twitter account (see the often humorous @GallicaBnF), as do the web archives (@internetarchive, @UKWebArchive, @DLWebBnF).


Apart from the archiving work these teams of specialists are doing, another task is the promotion of web archives inside the libraries themselves. Reference staff may not yet be comfortable with this new material, and still very few readers use the web archive. Another challenge to come!

Clémence Agostini (intern at the BL Web Archiving team from ENSSIB)

25 March 2015

Political parties in the UK Web Archive


With only six weeks to go until the General Election, it is a good time to look back at the web sphere of previous elections, through the 2005 and 2010 General Election websites collected by the UK Web Archive.

Websites of political parties currently represented in Parliament

The Conservative Party: http://www.webarchive.org.uk/ukwa/target/101940/source/search
The Labour Party: http://www.webarchive.org.uk/ukwa/target/101311/source/search
Liberal Democrats: http://www.webarchive.org.uk/ukwa/target/102621/source/search
UKIP: http://www.webarchive.org.uk/ukwa/target/109998/source/search
Green Party: http://www.webarchive.org.uk/ukwa/target/108088/source/search


Scottish parties

Scottish National Party (SNP): http://www.webarchive.org.uk/ukwa/target/30441472/source/search
Scottish Socialist Party: http://www.webarchive.org.uk/ukwa/target/99112/source/search

Welsh parties

Plaid Cymru - The Party of Wales: http://www.webarchive.org.uk/ukwa/target/102036/source/search

Northern Ireland parties

Democratic Unionist Party (DUP): http://www.webarchive.org.uk/ukwa/target/106592/source/search
Sinn Féin: http://www.webarchive.org.uk/ukwa/target/106020/source/search
Ulster Unionist Party (UUP): http://www.webarchive.org.uk/ukwa/target/105944/source/search
Social Democratic and Labour Party (SDLP): http://www.webarchive.org.uk/ukwa/target/107880/source/search
Alliance Party of Northern Ireland: http://www.webarchive.org.uk/ukwa/target/106002/source/search


Other parties

Respect Party: http://www.webarchive.org.uk/ukwa/target/40632374/source/search
British National Party (BNP): http://www.webarchive.org.uk/ukwa/target/106040/source/search
The Liberal Party: http://www.webarchive.org.uk/ukwa/target/40632386/source/search
Socialist Labour Party: http://www.webarchive.org.uk/ukwa/target/107243/source/search


English Democrats: http://www.webarchive.org.uk/ukwa/target/29261833/source/search
The Christian Party: http://www.webarchive.org.uk/ukwa/target/43810817/source/search
Health Concern (Independent Community & Health Concern): http://www.webarchive.org.uk/ukwa/target/37617688/source/search
Monster Raving Loony Party: http://www.webarchive.org.uk/ukwa/target/110017/source/search

Candidates

You can also find former candidates’ websites in the UK Web Archive, which might be interesting for checking whether old promises have been fulfilled. Below are some examples, but you can also try any other candidate by typing their name in the quick search box: http://www.webarchive.org.uk/ukwa/subject/89/page/1

David Miliband (2010): http://www.webarchive.org.uk/ukwa/target/49905672/source/search
Nick Clegg (2010): http://www.webarchive.org.uk/ukwa/target/43188235/source/search
Nigel Farage (2010): http://www.webarchive.org.uk/ukwa/target/44695591/source/search
Caroline Lucas (2010): http://www.webarchive.org.uk/ukwa/target/44695599/source/search
David Cameron (2005): http://www.webarchive.org.uk/wayback/archive/20050524120000/http://www.votedavidcameron.com/index.html


Enjoy!

Clémence Agostini (intern at the BL Web Archiving team from ENSSIB)

 

13 March 2015

France - UK: complementary views on web archiving



Considering the nature of the web, it is practically impossible to archive all of it, and choices have to be made. Usually two strategies are combined. The first aims at being representative, collecting a sample of everything without discrimination. The second selects websites in order to build a collection, as libraries have long done with more traditional material. The UK and France both combine the two methods.

The UK recently changed its legislation (on 6 April 2013) to bring non-print resources, including websites, into the scope of legal deposit. France had already made that shift in 2006.

Both national libraries use robots to broadly crawl the national web every year. In the UK the crawling is done by the British Library. The National Archives also collects websites related to government (the UK Government Web Archive), but this is done under separate legislation, the Public Records Act. In France, INA (Institut National de l’Audiovisuel) archives all the websites related to radio and television, while the BnF (Bibliothèque nationale de France) is in charge of all the rest.

To complement this broad harvesting, both countries create collections on specific topics, made up of websites selected by curators in their areas of expertise. In doing so, the national libraries may be helped by partners: researchers, associations, but mostly other libraries. In the UK, five other legal deposit libraries participate in web archiving. In France, a similar partnership operates with the network of regional libraries, which also contribute to legal deposit.

At the BnF, the Digital Legal Deposit Department coordinates a network of correspondents in each department, where specific policies have been developed over the years. The BnF’s overall selection policy is now being updated to include websites, on the basis that they are no different from any other material, which makes sense.

Breadth vs openness

The websites collected for legal deposit purposes can only be consulted in the libraries’ reading rooms, for copyright reasons. But while all the websites collected by the BnF are accessible only in the reading rooms dedicated to researchers, the British Library gives access to part of its collections through the UK Web Archive. This showcases websites for which permission has been obtained. The process is of course very time-consuming and frustrating, as only 30% of permission requests receive a positive answer and the vast majority receive no answer at all.

Exploring the collections

The BnF offers search by URL and a guided approach through specific topics, in order to give an overview of the collections. For example, one of its remarkable selections relates to private diaries on the web. Others concern elections, sustainable development, science and many other themes.


It’s similar in the open UK Web Archive, where you can browse the archive by special collection (the Queen’s Diamond Jubilee, Northern Ireland…). As in France, the choice of topic is often related to current affairs. At the moment, a collection about Magna Carta is being developed ahead of the forthcoming exhibition, as well as one on the next General Election.

Openness seems to be a good way of highlighting the collection. The Open UK Web Archive is promoted via the British Library’s website, this blog, Twitter… It provides fine visualisation tools and, most importantly, pretty good search functionality, based on title, URL and dates. There is also a full-text index for the massive legal deposit crawl, which is quite remarkable (to give an idea of the magnitude of the task, it will take about six months to generate the index for the 2014 crawl). Then, when you run a search, you sometimes get a very large number of results, and it can be far from easy to work through them, but that is another issue.


6 March 2015, Clémence Agostini (intern at the BL Web Archiving team from ENSSIB)

06 March 2015

2015 UK General Election Web Archive Special Collections


With just over 9 weeks to go until the UK General Election, the Web Archiving team together with curators in four Legal Deposit Libraries (the British Library, The National Library of Wales, the National Library of Scotland and the Bodleian Library) have been busy archiving websites for a special collection about this significant national event. 

It is a daunting task, but we are fairly experienced in this area having put together similar collections for the two past general elections, 2005 and 2010.


Sampling approach

We cannot predict the size of the UK political web sphere; however, there are 650 parliamentary constituencies, 422 registered political parties (Electoral Commission, December 2014) and several thousand prospective parliamentary candidates standing for election in 2015.

The vast majority of parties and candidates are likely to have social media channels in addition to their ‘official’ websites. Therefore, rather than attempting the impossible task of identifying every single political website, a sampling approach has been applied. All major and minor UK parties will be collected, along with a representative sample of c. 120 candidates taken from one urban conurbation and one shire county per region. For London, we have selected constituencies covering six boroughs: three inner London and three outer London. As we covered the same constituencies for the 2005 and 2010 elections, we will have a time series that will give future researchers a sense of how the web was used by politicians across the decade.

Political landscape

In addition, the collection will comprise a large number of news, commentary, opinion poll, research centre, think tank and interest group websites, as well as some more entertaining sites such as the Bus Pass Elvis Party, aka the Church of the Militant Elvis.

Inevitably, the political landscape, as well as the world of web archiving, has changed in the ten years since we started archiving UK general elections. Firstly, the date of the 2015 General Election was fixed in advance under the Fixed-term Parliaments Act 2011, meaning that campaigning started much earlier than in previous elections. This year we started collecting in January, whereas in previous years it was a little later in the year.

Of even more significance from the web archivist’s point of view, legal deposit legislation introduced in April 2013 enables us to archive pretty much everything we want within the UK web sphere, although permission must still be sought to make content publicly accessible.

One million tweets

In terms of the content of the collection, we are certainly archiving much more social media than in previous elections. Much has been written about the uptake of social media among politicians as they increasingly try to reach voters over the internet.


MPs sent almost one million tweets in 2013, up 28 per cent on the previous year and 230 per cent on 2011. It is crucial that we work to overcome the technical and legal challenges involved in archiving social media, as it is one of the most important channels for scholars studying our times and one of the types of content most in demand among researchers.

Visual tools

The resulting collection will be available online through the UK Web Archive in the case of content for which we have permission from website publishers, and in the reading rooms of the Legal Deposit Libraries for all other material we have collected. We also hope to continue improving access to our collections by way of data-based visual tools for exploring the archive’s content, as alternatives to the standard search and browse functions.

In 2005, for example, we implemented a word cloud generator for websites belonging to key political parties, showing the most frequently used words on those websites during the 2005 election campaign.
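The counting behind such a word cloud is simple in outline. Here is a toy sketch of the idea; the tokeniser and stop-word list are illustrative placeholders, not what the 2005 tool actually used:

```python
import re
from collections import Counter

# Placeholder stop-words; the real generator's list is not documented here.
STOPWORDS = {"the", "and", "for", "that", "with", "our", "will", "are"}

def top_words(text, n=25):
    """Tally the most frequent words in a page's extracted text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if len(w) > 2 and w not in STOPWORDS)
    return counts.most_common(n)

print(top_words("We will cut taxes and protect the NHS. Taxes must fall."))
```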

Nominate

We would be delighted to hear about websites related to the UK general election and would encourage readers to submit suggestions on our nomination form at http://www.webarchive.org.uk/ukwa/info/nominate

 

Nicola Bingham, Web Archivist

02/03/15


05 March 2015

Happy Birthday Magna Carta! All the best from the Web Archive xxx



With the opening of the great exhibition here at the British Library just days away, I have been working on the Magna Carta special collection for the Web Archive.

Media Coverage

By coincidence, I started a couple of days after the magnificent discovery of a copy found in Sandwich in a bundle of Victorian documents. The media coverage was enormous, from the leading broadsheets to the satirical Daily Mash, whose headline claimed ‘Magna Carta gives England back to France’. Just looking through the headlines, it is quite interesting to see how the media use Magna Carta. The term, familiar from schooldays, is used in every possible way: the actual coverage of the 800th anniversary, the auction of a copy in the US in 2007, political analysis, legal impact, British values, TV reviews of David Starkey’s programme, and criticism of the Prime Minister, David Cameron, for his performance on David Letterman’s talk show, when he could not remember what Magna Carta actually means. If that was not bad enough, the media ‘re-printed’ Boris Johnson’s defence that the PM had ‘feigned ignorance’ on American TV.


Digital Magna Carta

More recently the media picked up on Tim Berners-Lee’s idea of a Magna Carta for the internet, and on the political idea of a new Magna Carta devolving power to the regions. The online newspapers (and other websites, including Salisbury Cathedral’s) also wrote about Jay-Z’s album ‘Magna Carta Holy Grail’. As a selector I am not sure whether to include the last three Magna Cartas (the internet, devolution and the album) in the collection. Is it going too far? If not, where to stop?

Searching for a fairly popular term always brings a sigh of relief (soooo many results – great!) and, at the same time, a sigh of worry (soooo many results – what am I going to do with all this material?!). It is also interesting to see the number of results: some publishers use the term ‘Magna Carta’ in many contexts, hoping to attract readers; some, on the contrary, just report the facts. The numbers of URLs vary, not only because of the type of audience, but simply because the open online archives of the newspapers cover different time periods. It is also good to see how much reporting is done at the local level, particularly in the cities that own copies of the historic document.

[Screenshot: a Google search for ‘Magna Carta’ – soooo many results]

The selections for the collection cover not only the media, but also social media coverage, arts and humanities, the involvement of the church and local authorities in the celebrations, higher education events, school and research programmes, the underpinning organisation Magna Carta 800th, civil rights groups, and tourist information and attractions, including the Magna Carta pub and the Magna Carta barge hotel.

There is also coverage of the Magna Carta cake, Magna Carta chutney, Magna Carta ale, a Magna Carta-inspired garden for a flower show, and celebration of the 800th anniversary with a #jelfie!

Surely there is more to come, and I am quite curious what else the online world will say about Magna Carta.


If you know of an event near you (no matter how low-key), or you have read something interesting, or just think something should be included in the collection, please nominate a site here: http://www.webarchive.org.uk/ukwa/info/nominate

Dorota Walker, Assistant Web Archivist

19 February 2015

Building a 'Historical Search Engine' is no easy thing


Over the last year the UK Web Archive has been part of the Big UK Domain Data for the Arts and Humanities project, with the ambitious goal of building a ‘historical search engine’ covering the early history of the UK web. This continues the work of the Analytical Access to the Domain Dark Archive project but at a greater scale, and moreover, with a much more challenging range of use cases. We presented the current prototype at the International Digital Curation Conference last week (written up by the DCC), and received largely positive feedback, at least in terms of how we have so far handled the scale of the collection.

What the researchers found
However, we are eagerly awaiting the results of the real test of this system, from the project’s bursary holders. Ten researchers have been funded as ‘expert users’ of the system, each with a genuine historical research question in mind. Their feedback will be critical in helping us understand the successes and failures of the system, and how it might be improved.

One of those bursary holders, Gareth Millward, has already talked about his experience, including this (somewhat mis-titled but otherwise excellent) Washington Post article “I tried to use the Internet to do historical research. It was nearly impossible.” Based on that, it seems like the results are something of a mixed bag (and from our informal conversations with the other bursary holders, we suspect that Gareth’s experiences are representative of the overall outcome). But digging deeper, it seems that this situation arises not simply because of problems with the technical solution, but because of conflicting expectations of how the search should behave.

For example, as Gareth states, if you search for RNIB using Google, the RNIB site and information about it is delivered right at the top of the results.

But does this reflect what our search engine should do?

Is a historical search engine like Google?
When Google ranks its results, it makes many assumptions: about the most important meanings of terms, the current needs of its users, and the information interests of specific users (also known as the filter bubble). What assumptions should we make? Are we even playing the same game?

One of the most important things we have learned so far is that we are not playing the same game, and that the information needs of our researchers might be very different from those of a normal search (and indeed differ between users). When a user searches for ‘iphone’, Google might guess that you care about the popular one, but a historian of technology might mean the late-1990s Internet Phone by VocalTec. Terms change their meaning over time, and we must enable our researchers to discover and distinguish the different usages. As Gareth says, “what is ‘relevant’ is completely in the eye of the beholder.”

Moreover, in a very fundamental way, the historians we have worked with are not searching for the one top document, or a small set of documents about a specific topic. They look to the web archive as a refracting lens onto the society that built it, and are using these documents as intermediaries, carrying messages from the past and about the past. In this sense, caring about the first few hits makes no sense. Every result is equally important.

How results are sorted
To help understand these whole sets of results, we have endeavoured to add appropriate filtering and sorting options that can be used to ‘slice and dice’ the data into more manageable chunks. At the most basic level (and contrary to the Washington Post article), the results are sorted, and the default is to sort by ascending harvest date. The contrast with a normal search engine is perhaps nowhere more stark than here: where Bing or Google will generally seek to bring you the most recent hits, we focus on the past, something that is very difficult to achieve using a normal search engine.
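To make this concrete, here is a minimal sketch of what a date-first query might look like against a Solr-style full-text index of the kind that backs such a prototype. The endpoint URL and the field names (`crawl_date`, `content`, `url`) are illustrative assumptions, not the actual UK Web Archive schema:

```python
import requests

# Hypothetical Solr endpoint and field names, for illustration only.
SOLR_SELECT = "http://localhost:8983/solr/webarchive/select"

params = {
    "q": 'content:"RNIB"',
    "sort": "crawl_date asc",  # oldest captures first, unlike Bing or Google
    "rows": 20,
    "wt": "json",
}
response = requests.get(SOLR_SELECT, params=params)
for doc in response.json()["response"]["docs"]:
    print(doc.get("crawl_date"), doc.get("url"))
```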

With so many search options, perhaps the biggest challenge has been to present them to our users in a comprehensible way. For example, the problem of RNIB advertisements for a talking watch polluting the search results can easily be remedied by combining the right search terms. The text of the advert is highly consistent, so it is possible to identify those advertisements precisely by searching for the phrase “in associate with the RNIB”. This means a search for RNIB can be refined to exclude those results (as you can see below).

[Screenshot: a SHINE search for ‘RNIB’ with the talking-watch adverts excluded]
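In query terms, the refinement amounts to subtracting the advert’s boilerplate phrase from the original search. A self-contained sketch in standard Lucene/Solr syntax, with the same illustrative endpoint and field names as before:

```python
import requests

SOLR_SELECT = "http://localhost:8983/solr/webarchive/select"  # illustrative

params = {
    # Match RNIB, but exclude documents containing the advert boilerplate.
    "q": 'content:"RNIB" AND NOT content:"in associate with the RNIB"',
    "sort": "crawl_date asc",
    "rows": 0,  # we only want the hit count here
    "wt": "json",
}
response = requests.get(SOLR_SELECT, params=params)
print(response.json()["response"]["numFound"], "matches after refinement")
```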


The problems are even more marked when it comes to allowing network analysis to be exploited. We already extract links from the documents, so it is possible to show how the number of sites linking to the RNIB has changed over time, but it is not yet clear how best to expose and utilise that information. At the moment, the best solution we have found is to present these network links as additional search facets. For example, here are the results for the sites that linked to rnib.org.uk in 2000, which you can contrast with those for 2010.
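A sketch of how that ‘links as facets’ idea might be queried: ask for everything crawled in a given year that links to rnib.org.uk, then facet on the linking site’s domain. The field names (`links_domains`, `crawl_year`, `domain`) are hypothetical stand-ins for whatever link-extraction fields the real index exposes:

```python
import requests

SOLR_SELECT = "http://localhost:8983/solr/webarchive/select"  # illustrative

params = {
    "q": "links_domains:rnib.org.uk",  # pages that link out to the RNIB
    "fq": "crawl_year:2000",           # swap in 2010 to compare the two years
    "rows": 0,
    "facet": "true",
    "facet.field": "domain",           # facet on the linking site's domain
    "facet.limit": 20,
    "wt": "json",
}
response = requests.get(SOLR_SELECT, params=params)
facets = response.json()["facet_counts"]["facet_fields"]["domain"]
# Solr returns facets as a flat [value, count, value, count, ...] list.
for name, count in zip(facets[::2], facets[1::2]):
    print(f"{name}: {count}")
```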

Refining searches further
Currently, we expect that refining a search on the web archive will involve a lot of this kind of operation, combining new search terms and clauses to help focus in on the documents of interest. Looking further ahead, we envisage that future iterations of this kind of service might take the research queries and curatorial annotations we collect and use that information to semi-automatically classify resources and better predict user needs.

A ‘Macroscope’ rather than a search engine
Despite the fact that it helps get the overall idea across, calling this system a ‘historical search engine’ turns out to be rather misleading. The actual experience and ‘information needs’ of our researchers are very different from that case. This is why we tend to refer to this system as a Macroscope (see here for more on macroscopes), or as a Web Observatory. Sometimes a new tool needs a new term.

Throughout all of this, the most crucial part has been to find ways of working closely with our users, so we can all work together to understand what a ‘Macroscope’ might mean. We can build prototypes, and use our users’ feedback to guide us, but at the same time those researchers have had to learn how to approach such a complex, messy dataset. Both the questions and the answers have changed over time, and all parties have had their expectations challenged. We look forward to continuing to build a better Macroscope, in partnership with that research community.

By Dr Andrew Jackson, Web Archiving Technical Lead, The British Library

30 January 2015

Collecting Data To Improve Tools


Like many other institutions, we are heavily dependent on a number of open source tools. We couldn’t function without them, and so we like to find ways to give back to those communities. We don’t have a lot of spare time or development capacity to contribute, but recently we have found another way to provide useful feedback.


Large-scale extraction

At the heart of our discovery stack lies Apache Tika, the software we use to parse the myriad data formats in our collection in order to extract the textual representation (along with any useful metadata) that goes into our search indexes. We have now run Apache Tika over many billions of distinct resources, dating from 1995 to the present day. Owing to the age and variability of the content, this often tests Tika to its limits. As well as failing to identify many formats, it sometimes simply fails, throwing an unexpected error or getting locked in an infinite loop.
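As a rough illustration of the extract-and-log pattern, here is a minimal sketch using the tika-python client. The real pipeline runs Tika at scale over archived web content, and also needs hard timeouts to defend against the infinite-loop case, which a simple try/except cannot catch:

```python
from tika import parser  # tika-python client talking to an Apache Tika server

def extract_with_logging(paths):
    """Extract text from each file, remembering which ones Tika fails on."""
    texts, failures = {}, {}
    for path in paths:
        try:
            parsed = parser.from_file(path)
            texts[path] = parsed.get("content") or ""
        except Exception as exc:
            # Record the error class alongside the resource identity,
            # rather than silently dropping the resource.
            failures[path] = type(exc).__name__
    return texts, failures
```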

Logging losses

Each of those failures represents a loss: a resource that may never be discovered because we cannot understand it. This may be because it is malformed, perhaps even damaged during download. It may also be a sign of obsolescence, in that it may indicate the presence of data formats that are poorly understood and are therefore likely to present a challenge to our discovery and access systems. So, instead of ignoring these errors, we decided to remember them. Specifically, each is logged as a facet of our full-text index, alongside the identity of the resource that caused the problem.
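A sketch of what ‘remembering’ a failure might look like in practice: index a stub document carrying a parse-error field, which can then be offered as a search facet. Again, the endpoint and the field names (`parse_error` in particular) are assumptions for illustration, not our actual schema:

```python
import requests

SOLR_UPDATE = "http://localhost:8983/solr/webarchive/update?commit=true"

# A stub record for a resource Tika could not parse: the identity of the
# resource plus the error class, stored as an indexed, facetable field.
doc = {
    "id": "20000101120000/http://example.org/broken.doc",
    "url": "http://example.org/broken.doc",
    "crawl_date": "2000-01-01T12:00:00Z",
    "parse_error": "TikaException",  # hypothetical value captured earlier
}
requests.post(SOLR_UPDATE, json=[doc])
```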

Sharing the results

We’ve been collecting this data for a while, in order to help us tell a broken bitstream from a forgotten format. However, in a recent discussion with the Apache Tika developers, they indicated that they would also find this data useful as a way of improving the coverage and robustness of their software.

This turns out to be a win-win situation. We store the data we were intending to store anyway, but also share it with the tool developers, who get to improve their software in ways we will be able to take direct advantage of as we run later versions of the tool over our archives in the future.

And it feels good to give a little something back.

– by Andy Jackson

@anjacks0n