THE BRITISH LIBRARY

UK Web Archive blog

12 June 2014

How big is the UK web?

The British Library is about to embark on its annual task of archiving the entire UK web space. We will be pushing the button, sending out our ‘bots to crawl every British domain for storage in the UK Legal deposit web archive. How much will we capture? Even our experts can only make an educated guess.

Red-button

You’ve probably played the time honoured village fete game, to guess how many jelly beans are in the jar and the winner gets a prize? Well perhaps we can ask you to guess the size of the UK internet and the nearest gets….the glory of being right. Some facts from last year might help.

2013 Web Crawl
In 2013 the Library conducted the first crawl of all .uk websites. We started with 3.86 million seeds (websites), which led to the capture of 1.9 billion URLs (web pages, docs, images). All this resulted in 30.84 terabytes (TB) of data! It took the library robots 70 days to collect.

Geolocation
In addition to the .uk domains the Library has the scope to collect websites that are hosted in the UK so we will therefore attempt to geolocate IP addresses within the geographical confines of the UK. This means that we will be pulling in many .com, .net, .info and many other Top Level Domains (TLDs). How many extra websites? How much data? We just don’t know at this time.

De-duplication
A huge issue in collecting the web is the large number of duplicates that are captured and saved, something that can add a great deal to the volume collected. Of the 1.9 billion web pages etc. a significant number are probably copies and our technical team have worked hard this time to attempt to reduce this or ‘de-duplicate’. We are, however, uncertain at the moment as to how much effect this will eventually have on the total volume of data collected.

Predictions
In summary then, in 2014 we will be looking to collect all of the .uk domain names plus all the websites that we can find that are hosted in the UK (.com, .net, .info etc.), overall a big increase in the number of ‘seeds’ (websites). It is hard, however, to predict what effect these changes will have compared to last year. What the final numbers might be is anyone’s guess? What do you think?

Let us know in the comments below, or on twitter (@UKWebArchive) YOUR predictions for 2014 – Number of URLs, size in terabytes (TBs) and (if you are feeling very brave), the number of hosts e.g. organisations like the BBC and NHS consist of lots of websites each but are one 'host'.

We want:

  • URLs (in billions)
  • Size (in terabytes)
  • Hosts (in millions) 

#UKWebCrawl2014

We will announce the winner when all the data is safely on our servers sometime in the summer. Good luck.

Comments

Verify your Comment

Previewing your Comment

This is only a preview. Your comment has not yet been posted.

Working...
Your comment could not be posted. Error type:
Your comment has been saved. Comments are moderated and will not appear until approved by the author. Post another comment

The letters and numbers you entered did not match the image. Please try again.

As a final step before posting your comment, enter the letters and numbers you see in the image below. This prevents automated programs from posting comments.

Having trouble reading this image? View an alternate.

Working...

Post a comment

Comments are moderated, and will not appear until the author has approved them.