Powering the UK Web Archive search with Solr
With hundreds of millions of webpages to search, what technologies do we use at the UK Web Archive to ensure the best service?
Solr
At the core of the UK Web Archive is the open source tool Apache Solr. To quote from their own website, ‘Solr is a very popular open source enterprise search platform that provides full-text and faceted searching’.
It is built using scalable and fault-tolerant technologies, providing distributed indexing, automated failover and recovery, and centralised configuration management. And lots more besides: put simply, Solr aims to cover every aspect of big data search, from indexing to querying.
Open UK Web Archive
The UKWA website provides public access to more than 200 million selected UK webpages. The selection process includes gaining permission from the website owner to publish the archived site, and you can nominate a website to archive via our Nominate a Site page.
Once a site is harvested, it is stored internally on several systems to keep the data safe. From these stores the data is ingested into the Solr service, which analyses the metadata and content, primarily to enable fast querying. Much of Solr’s speed comes from the way it indexes this data: it builds an inverted index, which maps each term to the documents that contain it.
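To make the indexing idea concrete, here is a minimal sketch of an inverted index in Python. It is an illustration only, not Solr's implementation: the function names and the three sample pages are made up for this example, and real analysis (tokenising, stemming, scoring) is far richer than splitting on whitespace.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Start from the first term's documents and intersect with the rest.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample pages, for illustration only.
docs = {
    "page1": "British Library web archive",
    "page2": "web search with Solr",
    "page3": "archive search service",
}
index = build_inverted_index(docs)
print(sorted(search(index, "web archive")))  # ['page1']
```

A query only has to look up its own terms, so the cost depends on how many documents match those terms and not on the total size of the archive.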
Capable servers
To support these archived websites and provide the UK Web Archive search, we run the service on two dual Xeon servers – an HP ProLiant DL580 G5 with 96GB of RAM and an HP ProLiant DL380 G5 with 64GB of RAM. The data is stored on a Storage Area Network (SAN) using fibre channel connections.
The Solr service itself runs under Apache Tomcat, a Java servlet container, and is split between the two physical servers in a master and slave setup: one provides the data ingest mechanism, the other provides the data querying mechanism for the public website.
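As a rough sketch of how a client might respect that split, the Python below sends updates to one host and searches to the other. The host names, port and the `domain` field are assumptions made up for this example, not our real configuration; `/update` and `/select` are Solr's standard request handlers, and `facet` and `facet.field` are its standard faceting parameters.

```python
from urllib.parse import urlencode

# Hypothetical addresses; the real servers are not named in this post.
INGEST_HOST = "http://solr-master.example:8080/solr"
QUERY_HOST = "http://solr-slave.example:8080/solr"


def update_url():
    """URL that newly harvested documents would be posted to (ingest server)."""
    return f"{INGEST_HOST}/update"


def select_url(text, facet_field=None, rows=10):
    """URL for a full-text search, optionally faceted, on the query server."""
    params = [("q", text), ("rows", rows), ("wt", "json")]
    if facet_field:
        params += [("facet", "true"), ("facet.field", facet_field)]
    return f"{QUERY_HOST}/select?{urlencode(params)}"


print(update_url())
print(select_url("web archive", facet_field="domain"))
```

Keeping writes and reads on separate machines means a heavy ingest run does not slow down searches on the public website.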
Scalability
One of the benefits of using Apache Solr is that it is fairly simple to grow a system, in terms of both speed and data capacity. Because Solr is designed from the outset as a distributed service, we can add more hardware to handle the extra load as the amount of web content increases.
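The general idea behind a distributed index can be sketched in a few lines of Python: each document is assigned to one of several shards, and a search merges the best hits from every shard. This is a generic illustration under simple assumptions (hash-based routing, score-sorted merging), not a description of how Solr or our own setup routes documents; `shard_for` and `merge_results` are hypothetical helper names.

```python
import hashlib


def shard_for(doc_id, num_shards):
    """Pick a shard for a document using a stable hash of its id."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def merge_results(per_shard_hits, rows):
    """Merge (score, doc_id) hits from all shards, highest score first."""
    merged = [hit for hits in per_shard_hits for hit in hits]
    merged.sort(key=lambda hit: hit[0], reverse=True)
    return merged[:rows]


# Hypothetical hits returned by two shards for the same query.
shard_a = [(0.9, "page1"), (0.4, "page7")]
shard_b = [(0.7, "page3")]
print(merge_results([shard_a, shard_b], rows=2))  # [(0.9, 'page1'), (0.7, 'page3')]
```

Adding hardware then amounts to adding shards, so each server holds a smaller slice of the index.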
By Gil Hoggarth, Web Archiving Technical Services Engineer, The British Library