THE BRITISH LIBRARY

UK Web Archive blog

09 December 2011

Archiving websites with the Web Curator Tool

Have you ever wondered how the UK Web Archive works? Wondered about the technical system that we use to collect and provide access to websites? Wondered about how we process copies of websites before we make them available to the public? Well, you're in the right place to find out more! We're planning to use this blog to publish information like this, including technical developments as well as the collection development side of our work. But before we start talking about WARCs, crawlers, SIPs and the like, we thought it might be useful to give you some background on the system at the very heart of our web archiving activities. That system is the Web Curator Tool.


The Web Curator Tool (WCT) is an open source, selective web archiving workflow management application. Initiated by the International Internet Preservation Consortium (IIPC), it was developed as a collaborative effort by the British Library and the National Library of New Zealand. We've been using it at the British Library for several years and it's also used by a number of other IIPC members worldwide.

WCT enables multiple non-technical users to manage the workflow associated with selective archiving. This includes permissions management (i.e. permission to archive websites), crawl scheduling (how often websites will be crawled, e.g. weekly, monthly or every six months), quality assurance, crawl validation, metadata management, and so forth. It’s an incredibly useful piece of software.
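To make the idea of crawl scheduling a little more concrete, here is a minimal sketch of how the next crawl date might be worked out from a chosen frequency. It is purely illustrative and is not how WCT implements scheduling: the frequency names and the `next_crawl` and `add_months` functions are our own assumptions for this example.

```python
import calendar
from datetime import date, timedelta

# Hypothetical frequency names for illustration; not WCT's own identifiers.
MONTHS_PER_FREQUENCY = {"monthly": 1, "six-monthly": 6}


def add_months(d, months):
    """Shift a date forward by whole months, clamping the day to the month's end."""
    index = d.month - 1 + months
    year = d.year + index // 12
    month = index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


def next_crawl(last_crawl, frequency):
    """Return the date on which a site would next be crawled."""
    if frequency == "weekly":
        return last_crawl + timedelta(weeks=1)
    if frequency in MONTHS_PER_FREQUENCY:
        return add_months(last_crawl, MONTHS_PER_FREQUENCY[frequency])
    raise ValueError(f"unknown frequency: {frequency!r}")
```

For instance, under these assumptions a site last crawled on 9 December 2011 with a weekly schedule would next be crawled on 16 December 2011. Clamping to the end of the month is just one possible choice for dates such as the 31st.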

WCT is written in Java and designed to run on Apache Tomcat. A typical installation has the following components:

  • The WCT Core
  • A relational database
  • The WCT Digital Asset Store (where harvested material is held prior to archiving)
  • One or more Harvest Agents (responsible for gathering web content; we use Heritrix)

WCT is the web archiving back office. Our front end to the web archive is our public website, which allows you to browse and search the archive. The archived material itself is replayed by the Internet Archive's Wayback Machine software.

WCT is freely available for anyone to install under an Apache License and can be downloaded from SourceForge.
