14 May 2020
Searching eTheses for the openVirus project
This is a guest post by Andy Jackson (@anjacks0n), Technical Lead for the UK Web Archive and enthusiastic data-miner.
Introduction
The COVID-19 outbreak is an unprecedented global crisis that has prompted an unprecedented global response. I’ve been particularly interested in how academic scholars and publishers have responded:
- Academic publishers making COVID-19 and coronavirus-related publications openly available
- Academics and publishers assembling datasets of publications for analysis, like the COVID-19 Open Research Dataset and the WHO COVID-19 Database
- Various projects aiming to make these datasets of publications searchable, like Semantic Scholar, WHO, COVID Scholar, COVIDSeer, SketchEngine, Vespa.ai and fatcat COVID-19 from the Internet Archive.
It’s impressive how much has been done in such a short time! But I also saw one comment that really stuck with me:
“Our digital libraries and archives may hold crucial clues and content about how to help with the #covid19 outbreak: particularly this is the case with scientific literature. Now is the time for institutional bravery around access!”
– @melissaterras
Clearly, academic scholars and publishers are already collaborating. What could digital libraries and archives do to help?
Scale, Audience & Scope
Almost all the efforts I’ve seen so far are focused on helping scientists working on the COVID-19 response to find information from publications that are directly related to coronavirus epidemics. The outbreak is much bigger than this. In terms of scope, it’s not just about understanding the coronavirus itself. The outbreak raises many broader questions, like:
- What types of personal protective equipment are appropriate for different medical procedures?
- How effective are the different kinds of masks when it comes to protecting others?
- What coping strategies have proven useful for people in isolation?
(These are just the examples I’ve personally seen requests for. There will be more.)
Similarly, the audience is much wider than the scientists working directly on the COVID-19 response, from medical professionals wanting to know more about protective equipment to journalists looking for context and counter-arguments.
As a technologist working at the British Library, I felt there must be some way I could help with this situation: some way to help a wider audience dig out any potentially relevant material we might hold.
The openVirus Project
While looking out for inspiration, I found Peter Murray-Rust’s openVirus project. Peter is a vocal supporter of open source and open data, and had launched an ambitious attempt to aggregate information relating to viruses and epidemics from scholarly publications.
In contrast to the other efforts I’d seen, Peter wanted to focus on novel data-mining methods, and on pulling in less well-known sources of information. This dual focus on text analysis and on opening up underutilised resources appealed to me. And I already had a particular resource in mind…
EThOS
Of course, the British Library has a very wide range of holdings, but as an ex-academic scientist I’ve always had a soft spot for EThOS, which provides electronic access to UK theses.
Through the web interface, users can search the metadata and abstracts of over half a million theses. Furthermore, to support data mining and analysis, the EThOS metadata has been published as a dataset. This dataset includes links to institutional repository pages for many of the theses.
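To give a sense of how that dataset can be explored, here is a minimal sketch using pandas; the file name and column names are illustrative placeholders rather than the dataset's actual schema.

```python
# A minimal sketch of exploring the EThOS metadata dataset with pandas.
# The file name and column names are illustrative placeholders, not the
# dataset's actual schema.
import pandas as pd

df = pd.read_csv("ethos_metadata.csv")
print(len(df), "thesis records")

# Keep only the records that link to an institutional repository page
# (assuming a hypothetical column holding the landing-page URL).
with_links = df[df["repository_url"].notna()]
print(len(with_links), "records with repository links")
```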
Although doctoral theses are not generally considered to be as important as journal articles, they are a rich and underused source of information, capable of carrying much more context and commentary than a brief article[1].
The Idea
Having identified EThOS as a source of information, the idea was to see if I could use our existing UK Web Archive tools to collect and index the full text of these theses, build a simple faceted search interface, and perform some basic data-mining operations. If that worked, it would allow relevant theses to be discovered and passed to the openVirus tools for more sophisticated analysis.
Preparing the data sources
The links in the EThOS dataset point to the HTML landing page for each thesis, rather than to the full text itself. To get to the text, the best approach would be to write a crawler to find the PDFs. However, it would take a while to create something that could cope with the variety of ways the landing pages tend to be formatted. For machines, it’s not always easy to find the link to the actual thesis!
However, many of the universities involved have given the EThOS team permission to download a copy of their theses for safe-keeping. The URLs of the full-text files are only used once (to collect each thesis shortly after publication), but have nevertheless been kept in the EThOS system since then. These URLs are considered transient (i.e. likely to ‘rot’ over time) and come with no guarantees of longer-term availability (unlike the landing pages), so are not included in the main EThOS dataset. Nevertheless, the EThOS team were able to give me the list of PDF URLs, making it easier to get started quickly.
This is far from ideal: we will miss theses that have been moved to new URLs, and from universities that do not take part (which, notably, includes Oxford and Cambridge). This skew would be avoided if we were to use the landing-page URLs provided for all UK digital theses to crawl the PDFs. But we need to move quickly.
So, while keeping these caveats in mind, the first task was to crawl the URLs and see if the PDFs were still there…
Collecting the PDFs
A simple Scrapy crawler was created, one that could read the PDF URLs and download them without overloading the host repositories. The crawler itself does nothing with them, but by running behind warcprox the web requests and responses (including the PDFs) can be captured in the standardised Web ARChive (WARC) format.
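To give a flavour of this, here’s a minimal sketch of that kind of spider, assuming a plain-text file of PDF URLs and a local warcprox instance acting as the recording proxy; the file name, port and settings are illustrative rather than the exact crawler that was run.

```python
# A sketch of a polite Scrapy spider that fetches a list of PDF URLs.
# It does nothing with the responses itself: warcprox (assumed to be
# listening on localhost:8000) records the traffic into WARC files.
import scrapy

class ThesisPdfSpider(scrapy.Spider):
    name = "thesis_pdfs"
    custom_settings = {
        "DOWNLOAD_DELAY": 2.0,                  # be gentle with repositories
        "CONCURRENT_REQUESTS_PER_DOMAIN": 1,
    }

    def start_requests(self):
        with open("pdf_urls.txt") as f:         # hypothetical list of PDF URLs
            for url in f:
                yield scrapy.Request(
                    url.strip(),
                    meta={"proxy": "http://localhost:8000"},  # route via warcprox
                )

    def parse(self, response):
        # Nothing to do here: the WARC capture happens in the proxy.
        pass
```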
For 35 hours, the crawler attempted to download the 130,330 PDF URLs. Quite a lot of URLs had already changed, but 111,793 documents were successfully downloaded. Of these, 104,746 were PDFs.
All the requests and responses generated by the crawler were captured in 1,433 WARCs each around 1GB in size, totalling around 1.5TB of data.
Processing the WARCs
We already have tools for handling WARCs, so the task was to re-use them and see what we got. As this collection is mostly PDFs, Apache Tika and PDFBox are doing most of the work, but the webarchive-discovery wrapper helps run them at scale and add in additional metadata.
The WARCs were transferred to our internal Hadoop cluster, and in just over an hour the text and associated metadata were available as about 5GB of compressed JSON Lines.
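The production pipeline relies on webarchive-discovery to run Tika and PDFBox at Hadoop scale, but the core step can be sketched locally with the warcio and tika Python packages (the file names here are placeholders):

```python
# A local sketch of the WARC-to-text step: pull PDF responses out of a WARC
# and extract their text with Apache Tika, writing one JSON line per document.
# The real pipeline uses webarchive-discovery on Hadoop rather than this.
import json
from warcio.archiveiterator import ArchiveIterator
from tika import parser  # tika-python starts a local Tika server on first use

with open("theses-00001.warc.gz", "rb") as stream, open("text.jsonl", "w") as out:
    for record in ArchiveIterator(stream):
        if record.rec_type != "response":
            continue
        ctype = record.http_headers.get_header("Content-Type", "")
        if "pdf" not in ctype.lower():
            continue
        parsed = parser.from_buffer(record.content_stream().read())
        out.write(json.dumps({
            "url": record.rec_headers.get_header("WARC-Target-URI"),
            "content": (parsed.get("content") or "").strip(),
        }) + "\n")
```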
A Legal Aside
Before proceeding, there’s a legal problem that we need to address. Although these documents are freely available over the open web, the rights and licenses under which they are being made available can be extremely varied and complex.
There’s no problem gathering the content and using it for data mining. The problem is that there are limitations on what we can redistribute without permission: we can’t redistribute the original PDFs, or any close approximation.
However, collections of facts about the PDFs are fine.
But for the other openVirus tools to do their work, we need to be able to find out what each thesis is about. So how can we make this work?
One answer is to generate statistical summaries of the contents of the documents. For example, we can break the text of each document up into individual words, and count how often each word occurs. These word frequencies are no substitute for the real text, but they are redistributable and suitable for answering simple queries.
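As a minimal sketch of that idea (the real tokenisation, and the exact fields indexed, may differ):

```python
# Count how often each word occurs in a document's text.
import re
from collections import Counter

def word_frequencies(text: str) -> dict:
    """Lower-case the text, split it into simple word tokens and count them."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return dict(Counter(tokens))

print(word_frequencies("Masks, masks and more masks."))
# {'masks': 3, 'and': 1, 'more': 1}
```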
These simple queries can be used to narrow down the overall dataset, picking out a relevant subset. Once the list of documents of interest is down to a manageable size, an individual researcher can download the original documents themselves, from the original hosts[2]. As the researcher now has local copies, they can run their own tools over them, including the openVirus tools.
Word Frequencies
A second, simpler Hadoop job was created, post-processing the raw text and replacing it with the word frequency data. This produced 6GB of uncompressed JSON Lines data, which could then be loaded into an instance of the Apache Solr search tool [3].
While Solr provides a user interface, it’s not really suitable for general users, nor is it entirely safe to expose to the World Wide Web. To mitigate this, the index was built on a virtual server well away from any production systems, and wrapped with a web server configured in a way that should prevent problems.
The API this provides (see the Solr documentation for details) enables us to find which theses include which terms. Here are some example queries:
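For instance, sketched with Python’s requests library (the Solr endpoint and field names below are placeholders rather than the live service):

```python
# A sketch of hitting the Solr select API; the endpoint and field names
# are placeholders, not the live index.
import requests

SOLR = "https://example.org/solr/theses/select"

# How many theses mention both "face" and "mask"?
r = requests.get(SOLR, params={"q": "text:face AND text:mask", "rows": 0, "wt": "json"})
print(r.json()["response"]["numFound"])

# Break theses mentioning "ventilator" down by a (hypothetical) institution facet.
r = requests.get(SOLR, params={
    "q": "text:ventilator",
    "rows": 0,
    "wt": "json",
    "facet": "true",
    "facet.field": "institution",
})
print(r.json()["facet_counts"]["facet_fields"]["institution"])
```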
This is fine for programmatic access, but with a little extra wrapping we can make it more useful to more people.
APIs & Notebooks
For example, I was able to create live API documentation and a simple user interface using Google’s Colaboratory.
Google Colaboratory is a proprietary platform, but those notebooks can be exported as more standard Jupyter Notebooks. See here for an example.
Faceted Search
Having carefully exposed the API to the open web, I was also able to take an existing browser-based faceted search interface and modify it to suit our use case.
Best of all, this is running on the Glitch collaborative coding platform, so you can go look at the source code and remix it yourself, if you like.
Limitations
The main limitation of using word-frequencies instead of full-text is that phrase search is broken. Searching for face AND mask will work as expected, but searching for “face mask” doesn’t.
Another problem is that the EThOS metadata has not been integrated with the raw text search. This would give us a much richer experience, like accurate publication years and more helpful facets[4].
In terms of user interface, the faceted search UI above is very basic, but for the openVirus project the API is likely to be of more use in the short term.
Next Steps
To make the search more usable, the next logical step is to attempt to integrate the full-text search with the EThOS metadata.
Then, if the results look good, we can start to work out how to feed the results into the workflow of the openVirus tool suite.
1. Even things like negative results, which are informative but can be difficult to publish in article form. ↩︎
2. This is similar to the data-sharing pattern used by Twitter researchers. See, for example, the DocNow Catalogue. ↩︎
3. We use Apache Solr a lot so this was the simplest choice for us. ↩︎
4. Note that since writing this post, this limitation has been rectified. ↩︎