UK Web Archive blog

Information from the team at the UK Web Archive, the Library's premier resource of archived UK websites

Introduction

News and views from the British Library’s web archiving team and guests. Posts about the public UK Web Archive and, since April 2013, about web archiving as part of non-print legal deposit. Editor-in-chief: Jason Webber.

13 August 2015

Characterisations of Climate Change

If you have read any of my previous blogs (Beginner’s Guide to Web Archives 1,2,3) you will know that as part of my work at the British Library I have been curating a special web archive collection on climate change. But why did I choose this subject?

World-changing issue

Having begun as a topic of scientific interest, the threat of climate change has developed into a potentially world-changing issue with major implications for how we live our lives. The projected impacts of climate change are profound for things like food, water and human health, and therefore for national and international policy and the ‘business as usual’ world economy. Naturally, then, the topic is heavily debated in the public arena, from the science of global warming and its associated effects to the policies designed to mitigate or adapt to it.

Screen shot of www.eci.ox.ac.uk

We might expect different individuals and organisations – as for any topic – to portray the issue in different ways. But how exactly is climate change characterised on the internet? For instance, while there are many websites that accept the current understanding of climate science and actively promote action to limit global warming, there are many others that partially or completely deny the science. How is the issue portrayed by these different groups? Or another example: how is the issue portrayed by renewable energy companies compared to fossil fuel companies, two groups with very conflicting interests? As climate change progresses, how will its online characterisation change? I wanted to build a collection that could help to answer some of these questions.

Special interest groups

The collection consists of websites from different societal groups that have an active interest in the subject: for example academics; the energy sector; policy makers; special interest groups; the media and some members of the public. Websites generally fall into one of the following categories: personal blog pages/twitter feeds, non-governmental organisations/coalitions, news, government, energy companies, religious organisations, educational websites, learned societies and university institutions. The proportion of each website devoted to climate change ranges from almost 100% (some blogs/specialist websites) to more limited coverage. Some websites may be notable for the complete absence of climate change references. For example, after discussions in Cardiff, I have included each of the main UK energy companies, even when their websites do not mention climate change. Such information was considered to be useful in terms of the questions posed above.

Screen shot of twitter.com/ClimateCabaret

The collection is an evolving beast, so if you have any suggestions regarding extra websites we could include, please fill in the online form here. We are hoping to make as many of the websites openly available as possible, but don’t forget that if you want to view the whole collection, you will need to head to your nearest legal deposit library to do so.

 Peter Spooner, Science Policy Intern


10 August 2015

Beginner’s Guide to Web Archives Part 3

Coming to the end of his short time working on web archives at the British Library, science-policy intern Peter Spooner reflects on the process of creating a web archive special collection.

Some issues with ‘Special Collections’

In my previous blog entry, I covered why we might want to create special collections. Here, I would like to examine the pros and cons of these collections in more detail.

In order for an archivist to create a special collection, he/she must come up with a subject, refine the scope of the topic to prevent the collection from becoming too large, and then collect websites. In my case (climate change) I decided to collect websites to show how climate change is portrayed across society (by charities, the energy sector, interested individuals, learned societies etc.) with a focus on the portrayal of climate science and policy. Whilst I hope such a collection will be interesting and useful, problems do exist.

Cardiff

In July, the British Library team headed to meet some environmental psychologists from Cardiff University. The major success of the meeting was to inform the researchers about web archiving and our climate change special collection. The resource was well received and was seen as being potentially useful. However, a number of issues came up before and during the discussion:

  1. Each of the five researchers who attended had slightly different research interests.
  2. How can we integrate these interests when creating archive resources?
  3. How can the climate change collection be kept relevant as the subject evolves?
  4. Who should be responsible for sustaining and updating the special collection?
  5. What kinds of research question can be asked?

Widening the net

The last of these points I addressed in a previous blog entry, but the remainder are worth commenting on here. As I highlighted above, special collections are designed to be small and easy to use. However, such limited scope may not meet the needs of different researchers. There are several approaches one could take in order to try and resolve this issue. In some cases, collections may focus on a particular event, such as a general election. The web content associated with these collections is often short-lived and after the event the collection would not need much updating. However, for collections on long-lasting themes, more involvement is required.

In one instance, thematic special collections could remain under the control of dedicated archivists. In this case, collection users could send in suggestions of websites to include when important events occur or new web material is created. Collections could be slightly expanded to be broad enough for a variety of user interests. However, the number of collections is necessarily limited by the time commitment of the web archivists.

Another possibility is that the archivists act as technical support whilst researchers create their own collections. This approach requires a greater input on the part of the researcher, but allows more collections to be created and maintained. Since they are designed by the users, each collection should be exactly fit for purpose. However, since each researcher is likely to have slightly different interests or questions in mind, the number of collections may be very large and some collections may closely mirror one another.

Listening to talks by academics involved in the British Library’s BUDDAH project, I noticed that a common starting point for research was to create a corpus: a collection of written texts – in this case websites – of interest that could then be used to inform the research question. This approach is just what I have described above. A large number of corpora created by researchers could be stored by housing different groups of collections under common themes; so the theme of climate change could contain a number of collections on different aspects of the issue.

Moving forward

Perhaps the ideal model that the British Library could adopt is something of a combination of the above ideas. The Library may want to preserve the integrity of its existing special collections, which are carefully curated and designed for a wide range of users. These ‘Special Collections’ could remain under archivist control as described above, with contributions from user feedback. Alongside this core set of special collections could exist the more specific and numerous ‘Research Collections’ - those collections created by researchers. In this way the Library could make available a variety of resources that may be of interest to different users, combining the work of researchers and archivists to accommodate the limited time of both.

One thing we need to do in order to ensure the success of this combined approach is to get more and more researchers involved with creating collections. More projects like BUDDAH and further visits to interested academics will help to increase awareness of the web archive as a research resource, to grow it and to turn it into an invaluable tool.

Peter Spooner, Science Policy Intern


05 August 2015

Viral Content in the UK Domain

https://commons.wikimedia.org/wiki/File:Virus_ordinateur.jpg

Why?

"The term 'malware' is commonly used as a catch-all phrase for unwanted software designed to infiltrate a computer...without the owner's informed consent. It includes but is not limited to viruses, Trojan horses, malware."

"Whilst highly undesirable for most contemporary web users, malware is a pervasive feature of the Internet. Many archives choose to scan harvests and identify malware but prefer not to exclude or delete them from ingest into their repositories, as exclusion threatens the integrity of a site and their prevalence across the web is a valid research interest for future users." 
DPC Technology Watch Report, March 2013

The above hopefully goes some way to illustrating our concerns regarding 'viral' content in the data we archive. If overlooked or ignored, such content has the potential to prove hazardous in the future; at the same time, it forms an integral part of the Web as we know it (Professor Stephen Hawking famously stated that he thought that "computer viruses should count as life", and who are we to argue?).

How?

Faced with such considerations, we had several options available:

  1. We could simply not store any content flagged as containing a virus. The problem here is the effect is unpredictable—what if the content in question is the front-page of a website? It effectively means that site cannot be navigated as intended.
  2. We could store the content but make it inaccessible. 
  3. We could postpone the scan for viruses until after the crawl. However, this would require amending the output files to either remove or alter infected records.
  4. We could 'nullify' the content, making it unreadable but potentially reversible such that the original data can be read if required.

The latter option was chosen. The specific implementation was that of an XOR cipher, wherein the individual bytes of the viral content are logically XOR'd with a known byte-length key. Applying the same cipher using the same key reverses the operation. Essentially this turns any record flagged as containing viral content into (theoretically safe) pseudo-gibberish.

To quickly illustrate that in Python:

key = "X"
message = "This is a secret message. Shhhhh!"

# XOR each character's byte value with the (single-byte) key.
encoded = [ord(m) ^ ord(key) for m in message]
print(encoded)

# The value of 'encoded' here is just a list of numbers; attempting to convert
# it to a string actually broke my Putty session.

# Applying the same XOR with the same key reverses the operation.
decoded = "".join([chr(e ^ ord(key)) for e in encoded])
print(decoded)

https://commons.wikimedia.org/wiki/File:Virus_Blaster.jpg

Heritrix & ClamAV

For all our crawling activities we use the Internet Archive's Heritrix crawler. Part of the ethos behind Heritrix's functionality is that content is processed and written to disk as quickly as possible; ideally you should be utilising all available bandwidth. With that in mind the options for virus-scanners were few: while there are many available, few offer any kind of API and fewer still can scan streamed content rather than files already written to disk. Given that disk-writes are often the slowest part of the process this was not ideal and left us with only one obvious choice: ClamAV.

We created a ViralContentProcessor module which interacts with ClamAV, streaming every downloaded resource to the running daemon and receiving the result. Anything which is found to contain a virus:

  1. ...is annotated with the output from ClamAV (this then appears in the log file).
  2. ...is bytewise XOR'd as previously mentioned and the amended content written to a different set of WARC files than non-viral content.
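
For anyone curious about the streaming hand-off itself, here is a minimal Python sketch of the clamd INSTREAM exchange that this relies on (the real ViralContentProcessor is a Java Heritrix module; the host, port, chunk size and EICAR test payload below are purely illustrative assumptions):

import socket
import struct

def scan_bytes(payload, host="localhost", port=3310):
    # Stream a payload to a running clamd daemon via the INSTREAM command and
    # return its verdict, e.g. "stream: OK" or "stream: Eicar-Test-Signature FOUND".
    with socket.create_connection((host, port)) as sock:
        sock.sendall(b"zINSTREAM\0")
        # Content is sent in chunks: a 4-byte big-endian length, then the data.
        for i in range(0, len(payload), 8192):
            chunk = payload[i:i + 8192]
            sock.sendall(struct.pack("!L", len(chunk)) + chunk)
        # A zero-length chunk terminates the stream.
        sock.sendall(struct.pack("!L", 0))
        return sock.recv(4096).decode().strip("\0\n")

# The EICAR test string should be flagged by any ClamAV installation.
eicar = b"X5O!P%@AP[4\\PZX54(P^)7CC)7}$EICAR-STANDARD-ANTIVIRUS-TEST-FILE!$H+H*"
print(scan_bytes(eicar))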

It is worth noting that ClamAV does, in addition to scanning for various types of malware, have the option to identify phishing attempts. However, we disabled this early on in our crawls when we discovered that it was identifying various examples of phishing emails provided by banks and similar websites to better educate their customers.

During the crawl the resources—memory usage, CPU, etc.—necessary for ClamAV are similar to those required by the crawler itself. That said, the virus-scanning is seldom the slowest part of the crawl.

WARCs

All web content archived by the British Library is stored in WARC format (ISO 28500). A WARC file is essentially a series of concatenated records, each of a specific type. For instance an average HTML page might look like this:

WARC/1.0
WARC-Type: response
WARC-Target-URI: https://www.gov.uk/licence-finder/activities?activities=158_45_196_63&sectors=183
WARC-Date: 2015-07-05T08:54:13Z
WARC-Payload-Digest: sha1:ENRWKIHIXHDHI5VLOBACVIBZIOZWSZ5L
WARC-IP-Address: 185.31.19.144
WARC-Record-ID: <urn:uuid:2b437331-684e-44a8-b9cd-9830634b292e>
Content-Type: application/http; msgtype=response
Content-Length: 23174

HTTP/1.1 200 OK
Server: nginx
Content-Type: text/html; charset=utf-8
Cache-Control: max-age=1800, public
...

<!DOCTYPE html>
...

The above essentially contains the raw HTTP transaction plus additional metadata. There is also another type of record: a conversion:

A 'conversion' record shall contain an alternative version of another record's content that was created as the result of an archival process.
ISO 28500

It's this type of record we use to store our processed viral content. A record converted as per the above might appear thusly:

WARC/1.0
WARC-Type: conversion
WARC-Target-URI: https://www.gov.uk/licence-finder/activities?activities=158_45_196_63&sectors=183
WARC-Date: 2015-04-20T11:03:11Z
WARC-Payload-Digest: sha1:CWZQY7WV4BJZRG3XHDXNKSD3WEFNBDJD
WARC-IP-Address: 185.31.19.144
WARC-Record-ID: <urn:uuid:e21f098e-18e4-45b9-b192-388239150e76>
Content-Type: application/http; encoding=bytewise_xor_with_118
Content-Length: 23174

>""&YGXGVDFFV9={
...

The two records' metadata do not differ drastically, the main differences being the specified WARC-Type and the Content-Type; in this latter field we include the encoding as part of the MIME type. The two records' content, however, appears drastically different: the former record contains valid HTML while the latter contains a seemingly random series of bytes.

Access

In order to access content stored in WARC files we typically create an index, identifying the various URLs and recording their particular offset within a given WARC file. As mentioned earlier, content identified as containing a virus is stored in a different series of files to those of 'clean' content. Currently we do not provide access to viral content, but the aforementioned separation means that, firstly, we can easily index the regular content and omit the viral and, secondly, we can, should the demand arise, easily identify and index the viral content.

The software used to replay our WARC content, OpenWayback, is capable of replaying WARCs of all types. While there would be an additional step wherein we reverse the XOR cipher, access to the content should not prove problematic.
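
As a rough sketch of that reversal step, the single-byte key can be read straight out of the conversion record's Content-Type (as in the example above, encoding=bytewise_xor_with_118) and applied back to the payload; the WARC parsing itself is omitted here:

import re

def decode_viral_payload(content_type, payload):
    # Reverse the bytewise XOR applied at crawl time, given a header such as
    # "application/http; encoding=bytewise_xor_with_118".
    match = re.search(r"bytewise_xor_with_(\d+)", content_type)
    if not match:
        return payload  # not an XOR'd conversion record; return unchanged
    key = int(match.group(1))
    return bytes(b ^ key for b in payload)

# e.g. decode_viral_payload("application/http; encoding=bytewise_xor_with_118", raw_record_bytes)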

Results

Frequent Crawls 

In addition to the annual crawl of the UK domain, we also undertake more frequent crawls of a smaller set of sites. These sites are crawled on a daily, weekly, etc. basis to capture more frequently-changing content. In the course of roughly 9,000 frequent crawls since April 2013 only 42 have encountered viral content.

2013 Domain Crawl

  • 30TB regular content.
  • 4GB viral content.

2014 Domain Crawl

  • 57TB regular content.
  • 4.7GB viral content.

Looking at the logs from the 2014 Domain Crawl, which, as mentioned earlier, contain the results from the ClamAV scan, we found 494 distinct viruses flagged. The ten most common were:

  1. Html.Exploit.CVE_2014_6342
  2. JS.Obfus-210
  3. PHP.C99-7
  4. JS.Crypt-1
  5. Exploit.URLSpoof.gen
  6. HTML.Iframe-6
  7. JS.Trojan.Iframe-6
  8. Heuristics.Broken.Executable
  9. JS.Obfus-186
  10. Html.Exploit.CVE_2014_0274-4

In total there were 40,203 positive results from ClamAV, with Html.Exploit.CVE_2014_6342, in the top spot above, accounting for over a quarter of them.
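
A minimal sketch of how such a tally might be produced from the logs is given below; it assumes (and this is an assumption about the log layout, not a description of it) that each flagged line carries the ClamAV output in its annotations, with the signature name followed by the word FOUND:

from collections import Counter
import re
import sys

# Assumed annotation form: "<signature-name> FOUND", e.g. "Html.Exploit.CVE_2014_6342 FOUND".
signature = re.compile(r"([A-Za-z][\w.-]*\.[\w.-]+) FOUND\b")

counts = Counter()
for line in sys.stdin:
    counts.update(signature.findall(line))

for name, total in counts.most_common(10):
    print(total, name)

# Usage: python top_viruses.py < crawl.log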

Roger G. Coram, Web Crawl Engineer, The British Library

24 July 2015

Geo-location in the 2014 UK Domain Crawl

In April 2013 the Legal Deposit Libraries (Non-Print Works) Regulations 2013 came into force; of particular relevance is the section which specifies which parts of that ephemeral place we call the Web are considered to be part of "the UK":

  • 18 (1) “…a work published on line shall be treated as published in the United Kingdom if:
    • “(b) it is made available to the public by a person and any of that person’s activities relating to the creation or the publication of the work take place within the United Kingdom.”

In more practical terms, resources are to be considered as being published in the United Kingdom if the server which serves said resources is physically located in the UK. Here we enter the realm of Geolocation.

"Comparison satellite navigation orbits" by Cmglee, Geo Swan - Own work. Licensed under CC BY-SA 3.0 via Wikimedia Commons

Heritrix & Geolocation

Geolocation is the practice of determining the "real world" location of something—in our case the whereabouts of a server, given its IP address.

The web-crawler we use, Heritrix, already has many of the features necessary to accomplish this. Among its many DecideRules (a series of ACCEPT/REJECT rules which determine whether a URL is to be downloaded) is the ExternalGeoLocationDecideRule. This requires:

  • A list of ISO 3166-1 country-codes to be permitted in the crawl
    • GB, FR, DE, etc.
  • An implementation of ExternalGeoLookupInterface.

This latter ExternalGeoLookupInterface is where our own work lies. This is essentially a basic framework on which you must hang your own implementation. In our case, our implementation is based on MaxMind’s GeoLite2 database. Freely available under the Creative Commons Attribution-ShareAlike 3.0 Unported License, this is a small database which translates IP addresses (or, more specifically, IP address ranges) into country (or even specific city) locations.
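
The Heritrix side of this is Java, but the lookup itself is very small; as a rough illustration, the same query against the GeoLite2 city database can be made with MaxMind's geoip2 Python bindings (the database path below mirrors our configuration, and the IP address is chosen purely for illustration):

import geoip2.database
import geoip2.errors

reader = geoip2.database.Reader("/dev/shm/geoip-city.mmdb")

def in_scope(ip, permitted=("GB",)):
    # True if the IP address geolocates to one of the permitted ISO 3166-1 codes.
    try:
        country = reader.city(ip).country.iso_code
    except geoip2.errors.AddressNotFoundError:
        return False
    return country in permitted

print(in_scope("203.0.113.7"))  # a documentation-range address; replace with a real one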

Taken from our Heritrix configuration, the below shows how this is included in the crawl:

<!-- GEO-LOOKUP: specifying location of external database. -->
<bean id="externalGeoLookup" class="uk.bl.wap.modules.deciderules.ExternalGeoLookup">
  <property name="database" value="/dev/shm/geoip-city.mmdb"/>
</bean>
<!-- ...  ACCEPT those in the UK... -->
<bean id="externalGeoLookupRule" class="org.archive.crawler.modules.deciderules.ExternalGeoLocationDecideRule">
  <property name="lookup">
    <ref bean="externalGeoLookup"/>
  </property>
  <property name="countryCodes">
    <list>
      <value>GB</value>
    </list>
  </property>
</bean>

The GeoLite2 database itself is, at around only 30MB, very small. Part of the beauty of this implementation is that the entire database can be held comfortably in memory. The above shows that we keep the database in Linux's shared memory, avoiding any disk IO when reading from the database.

Testing

To test the above we performed a short, shallow test crawl of 1,000,000 seeds. A relatively recent addition to Heritrix's DecideRules is this property:

<property name="logToFile" value="true" />

During a crawl, this will create a file, scope.log, containing the final decision for every URI along with the specific rule which made that decision. For example:

2014-11-05T10:17:39.790Z 4 ExternalGeoLocationDecideRule ACCEPT http://www.jaymoy.com/
2014-11-05T10:17:39.790Z 0 RejectDecideRule REJECT https://t.co/Sz15mxnvtQ
2014-11-05T10:17:39.790Z 0 RejectDecideRule REJECT http://twitter.com/2017Hull7

So, of the above, the last two URLs were rejected outright, while the first was ruled in-scope by the ExternalGeoLocationDecideRule.
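
A rough sketch of how those per-rule decisions can be tallied, assuming every scope.log line follows the timestamp / numeric field / rule / decision / URI layout shown above:

from collections import Counter
import sys

counts = Counter()
for line in sys.stdin:
    parts = line.split()
    if len(parts) >= 5:
        rule, decision = parts[2], parts[3]
        counts[(rule, decision)] += 1

for (rule, decision), total in counts.most_common():
    print(total, rule, decision)

# Usage: python scope_summary.py < scope.log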

Parsing the full output from our test crawl, we find:

  • 89,500,755 URLs downloaded in total.
  • 26,072 URLs which were not on .uk domains (and therefore would, ordinarily, not be in scope).
    • 137 distinct hosts.
British Isles Euler diagram 15 by TWCarlson - Own work. Licensed under CC0 via Wikimedia Commons

2014 Domain Crawl

The process for examining the output of our first Domain Crawl is largely unchanged from the above. The only real difference is the size: the scope.log file gets very large when dealing with domain-scale data. It logs not only the decision for every URL downloaded but also that for every URL not downloaded (and the reason why).

Here we can use a simple sed command (admittedly implemented slightly differently, distributed via Hadoop Streaming to cope with the scale) to parse the logs' output:

sed -rn 's@^.+ ExternalGeoLocationDecideRule ACCEPT https?://([^/]+)/.*$@\1@p' scope.log | grep -Ev "\.uk$" | sort -u

This will produce a list of all the distinct hosts which have been ruled in-scope by the ExternalGeoLocationDecideRule (excluding, of course, any .uk hosts which are considered in scope by virtue of a different part of the legislation).
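
For reference, here is a minimal Python equivalent of that sed expression, of the sort that could serve as a Hadoop Streaming mapper (the job wiring is omitted; de-duplication is assumed to happen downstream, e.g. via sort -u or in the reduce step):

import re
import sys

pattern = re.compile(r" ExternalGeoLocationDecideRule ACCEPT https?://([^/]+)/")

for line in sys.stdin:
    match = pattern.search(line)
    if match:
        host = match.group(1)
        if not host.endswith(".uk"):
            print(host)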

Run over the 2014 Domain Crawl logs, this produced a list of 2,544,426 hosts ruled in-scope by the geolocation process.

By Roger G. Coram, Web Crawl Engineer, The British Library 

17 July 2015

Curating the Election - Archiving the most complex General Election yet…

2015 General Election outcome: https://en.wikipedia.org/wiki/United_Kingdom_general_election,_2015

This year’s General Election is not only one of the closest fought in recent times; with more parties in the limelight than ever before, it is almost certainly the most complex.

As so much of the election is played out in the here-today-gone-tomorrow world of the Net and broadcast media, the archiving challenge is all the greater (many political pages disappear soon after the election results).

The Library is capturing these transient messages before they are lost. Across the Library, and across the Legal Deposit Library network, staff led by Jennie Grimshaw in Research Engagement have been working on a special web collection, to join several General Election collections we have created in the past.

Meanwhile, we have been adding extra recordings to the Broadcast News service (these are available within hours of having been broadcast). Because of the significant Scottish dimension, the TV channels STV and BBC Scotland have been added to the mix, creating a lasting archive for years to come.

Because we have archived the 2005 and 2010 elections we can also see that there were significant changes in the way the internet was used. And increasingly the web archive is showing how it can support long-term research of this kind.

Compared with the 2010 General Election, it is clear that there has been a mushrooming of campaigning on the web. In excess of 7,400 websites and webpages have been selected in 2015 compared to approximately 770 pages in the 2010 collection, and 139 in 2005.

One reason for this growth is the way prospective candidates now attempt to engage the electorate on multiple channels. In addition to setting up their own campaigning website and having a page on their party’s constituency website, they increasingly use social media channels such as Facebook and Twitter to reach out to voters. For example, a total of 951 Twitter accounts have been selected across all the subject categories, illustrating just how prominent a part social media played.

Led by Jennie Grimshaw in Research Engagement at the British Library, the team involved included curators from the three national libraries, from Northern Ireland and from the Bodleian Library, Oxford.

One element of the project was to endeavour to capture websites from the same constituencies as selected in the 2010 and 2005 crawls, in an effort to offer some comparison on how constituency and web presences evolve from one election to the next.

UK opinion polling, 2010–2015: https://en.wikipedia.org/wiki/United_Kingdom_general_election,_2015

The 2015 General Election web archive collection has harvested 32 opinion polls and 100 blogs, supplementing the comment and analysis of more traditional news websites. There are also the webpages and publications of 62 think tanks and 412 interest groups, all of which creates a rich online documentary archive around the Election, including much material which will disappear rapidly from the live web.

By Jerry Jenkins, Curator of Emerging Media at the British Library

Note by the editor: the links provided in this post link to the Open UK Web Archive, which gives access to archived webpages where permission has been granted for open access. The complete collections for all three General Elections can only be accessed in the British Library reading rooms under the terms of the non-print legal deposit legislation.

10 July 2015

UK Web Archives Forum @ BBC Broadcasting House

Friday 19th June saw the first UK Web Archives Forum at Broadcasting House. This was set up by BBC Archives as an opportunity to get the British Library, the National Archives and Channel 4 together with BBC Archives to discuss current archiving policies & practice in the ever-shifting world of web & social media archiving. Representatives from the aforementioned institutions were present, including the BBC's own Web Archives team.

The session was very well received, and everyone involved came away with lots of new ideas and potential future collaborations. Presentations and overviews of state-of-play web archiving activities were shared, and then in-depth discussions on the moving landscape of web archiving methodology and the challenges of archiving social media took place.

UK Web and Social Media Archive Forum June 2015
Of great interest was the work underway by the BBC Archives, British Library and National Archives in the archiving of Twitter communications. Other major areas of interest were around standards and practices. The BBC has, for example, adopted a number of solutions for web archiving, including crawling WARCs, generating PDFs, screencasts and document archiving, to ensure all bases are covered in preserving bbc.co.uk. It was interesting to see the scale adopted by the British Library in preserving the .uk web domain. And the National Archives also explained their challenges in archiving .gov websites, and the large array of government-funded organisations at national levels.

It was decided that we would meet again in the future to look at more collaborations, quality assurance of our archive results and how best to tackle future online distribution platforms, especially social media and mobile applications, where younger generations are now consuming content at a faster rate than ever before. There are a lot of exciting challenges in the area of web archiving, so the need for a forum to discuss and shape policies and practices is vital. We hope to work with the Digital Preservation Coalition on future workshops in this field of work, to help to provide common standards for all concerned and for those standards to be shared with the wider UK web archiving community.

Some of the tools under discussion focused on how to download and preserve web content for future use.

By Carl Davies, Archive Manager, Radio & Multiplatform, BBC Engineering

08 July 2015

Big UK Domain Data for the Arts and Humanities: working with the archive of UK web space, 1996–2013

In January 2014, the Institute of Historical Research, University of London (in partnership with the British Library, the Oxford Internet Institute and Aarhus University) was awarded funding by the Arts and Humanities Research Council for a project to explore ways in which humanities researchers could engage with web archives. The main aims of ‘Big UK Domain Data for the Arts and Humanities’ were to highlight the value of web archives for research; to develop a theoretical and methodological framework for their analysis; to explore the ethical implications of this kind of big data research; to train researchers in the use of big data; and to inform collections development and access arrangements at the British Library.

Helen Hockx-Yu showing the BUDDAH interface to people at the Being Human Festival 2014

For the past 15 months the project team have been working with 10 researchers, drawn from a range of arts and humanities disciplines, to address these issues and particularly to develop a prototype interface which will make the historical archive (1996–2013) accessible. The researchers came armed with a range of fascinating questions, from analysing Euro-scepticism on the web to studying the Ministry of Defence’s recruitment strategy, from examining the history of disability campaigning groups and charities online to looking at Beat literature in the contemporary imagination. The case studies that they have produced demonstrate some of the challenges posed by the archived web, but also its value and significance. They are available from the project website.

 

Along the way, the project has produced not only one of the largest full-text indexes of web archive (WARC) files in the world, but also a sophisticated interface which supports complex query building and gives researchers the ability to create and manipulate corpora derived from the larger dataset.

This interface is accessible as a beta version. It opens up a fascinating range of options now that you no longer need to know the URL of a vanished website in order to find it in the archive.


For those less familiar with the concept of web archives, we’ve also produced two short animations, ‘What is a Web Archive?’ and ‘What does the UK Web Archive collect?’. They’re both available under a CC-BY-NC-SA licence, so do please share!

Jane Winters
Professor of Digital History
Institute of Historical Research, School of Advanced Study, University of London
@jfwinters