Archiving websites with the Web Curator Tool
Have you ever wondered how the UK Web Archive works? Wondered about the technical system that we use to collect and provide access to websites? Wondered about how we process copies of websites before we make them available to the public? Well, you're in the right place to find out more! We're planning to use this blog to publish information like this, including technical developments as well as the collection development side of our work. But before we start talking about WARCs, crawlers, SIPs and the like, we thought it might be useful to give you some background on the system at the very heart of our web archiving activities. That system is the Web Curator Tool.
The Web Curator Tool is an open source, selective web archiving workflow management application. Initiated by the International Internet Preservation Consortium (IIPC), it was developed as a collaborative effort by the British Library and the National Library of New Zealand. We've been using it at the British Library for several years and it's also used by a number of other IIPC members worldwide.
WCT enables multiple non-technical users to manage the workflow associated with selective archiving. This includes permissions management (i.e. permission to archive websites), crawl scheduling (how often websites will be crawled, e.g. weekly, monthly or every six months), quality assurance, crawl validation, metadata management, and so forth. It's an incredibly useful piece of software.
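To give a flavour of what crawl scheduling involves, here is a minimal Python sketch that works out when a site is next due to be crawled. It is purely illustrative and is not taken from WCT (which is written in Java): the function names `add_months` and `next_crawl` and the frequency labels are our own assumptions for the example.

```python
import calendar
from datetime import date, timedelta


def add_months(d, months):
    """Return the date `months` calendar months after `d`.

    The day is clamped to the last day of the target month,
    so 31 January plus one month gives the last day of February.
    """
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


def next_crawl(last_crawl, frequency):
    """Return the next crawl date for a hypothetical frequency label."""
    if frequency == "weekly":
        return last_crawl + timedelta(weeks=1)
    if frequency == "monthly":
        return add_months(last_crawl, 1)
    if frequency == "six-monthly":
        return add_months(last_crawl, 6)
    raise ValueError(f"unknown frequency: {frequency}")
```

For example, `next_crawl(date(2024, 1, 1), "weekly")` gives 8 January 2024. A real scheduler also has to deal with things this sketch ignores, such as time of day and the permissions granted for each site.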
WCT is written in Java and designed to run on Apache Tomcat. A typical installation has the following components:
- The WCT Core
- A relational database
- The WCT Digital Asset Store (where harvested material is held prior to archiving)
- One or more Harvest Agents (responsible for gathering web content; we use Heritrix)
WCT is the web archiving back office. Our front end to the web archive is our public website, which allows you to browse and search the archive. The archived material itself is replayed by the Internet Archive's Wayback Machine software.
WCT is freely available for anyone to install under the Apache License and can be downloaded from SourceForge.