THE BRITISH LIBRARY

UK Web Archive blog

Information from the team at the UK Web Archive, the Library's premier resource of archived UK websites

Introduction

News and views from the British Library’s web archiving team and guests. Posts about the public UK Web Archive, and since April 2013, about web archiving as part of non-print legal deposit. Editor-in-chief: Peter Webster (Engagement and Liaison Manager).

10 May 2013

The new NHS: a reform you could see from space?


[A guest post by Jennie Grimshaw, Lead Curator for social policy and official publications at the British Library.]

The controversial Health and Social Care Act 2012 ushered in the most radical reform of the National Health Service since its launch in 1948. On April 1st 2013, the main changes set out in the Act came into force, and most parts of the NHS will be affected in some way.

Clinical commissioning groups (CCGs) replace primary care trusts (PCTs) and are the cornerstone of the new system. There are 211 CCGs in total, commissioning care for an average of 226,000 people each. Each of the 8,000 GP practices in England is now part of a CCG. These groups will commission the majority of health services, and in 2013/14 will be responsible for a budget of £65bn, about 60% of the total NHS budget. CCGs will be accountable to and supported by NHS England, formerly the NHS Commissioning Board, which will also directly commission primary care and specialist services. NHS services will be opened up to competition from providers that meet NHS standards on price, quality and safety, with a new regulator (Monitor) and an expectation that the vast majority of hospitals will become foundation trusts by 2014.

In addition, local authorities will take on a bigger role, assuming responsibility for budgets for public health. Health and wellbeing boards will have duties to encourage integrated working between commissioners of services across health, social care and children’s services, involving democratically elected representatives of local people. Local authorities are expected to work more closely with other health and care providers, community groups and agencies, using their knowledge of local communities to tackle challenges such as smoking, alcohol, drug misuse and obesity.

Finally, the Local Involvement Networks (LINks) were replaced by 152 Local Healthwatch operating under the leadership of a new consumer champion, Healthwatch England. Each local Healthwatch is part of its local community, and will work in partnership with other local organisations to ensure that the voices of consumers and those who use services reach the ears of decision makers.

The Coalition government’s radical reform of the NHS has attracted criticism on several grounds: cost and disruption; backdoor privatisation; and the introduction of price competition, which risks decisions being made on the basis of price rather than clinical need. It has also met determined opposition from the health professions.

The scope of the collection

In the light of this debate, the British Library has chosen the NHS Reform of 2013 for its first themed collection of archived websites under the new Non-Print Legal Deposit Regulations, which allow it to gather and preserve all sites in the UK web domain. For this collection we have hand-selected the sites of:

  • NHS bodies abolished under the reform (primary care trusts, LINks, strategic health authorities and some public health programmes and agencies). Many of these archived sites are already publicly available in the UK Web Archive. (See also this earlier post);
  • The emerging new bodies (clinical commissioning groups, health and wellbeing boards, and local healthwatch);
  • Groups campaigning for or (mainly) against the changes (medical royal colleges, professional associations, medical charities, trade unions, grass roots organisations);
  • Press and media commentators (including blogs, the BBC and national newspapers from the Sun to the Guardian);
  • The Government and the regulators (including legislation);
  • The private sector providers preparing to move into the market. 

Behind the scenes, this intensive crawl of a relatively small number of sites is going on alongside the general crawl of the whole UK domain which we blogged about last week. These sites are being crawled more frequently than would be typical for the domain crawl, but only for a three-month period.

The collection will be available for onsite access at the six legal deposit libraries for the UK from this summer. We hope that it will present a balanced view of the impact of the reform, and the debate surrounding it. We'll be blogging again nearer the time with a review of the archive.

30 April 2013

Dispatches from the domain crawl #1


After the blaze of publicity surrounding the advent of Non-Print Legal Deposit, the web archiving team have been busy putting the regulations into practice. This is the first of a series of dispatches from the domain crawl, documenting our discoveries as we begin crawling the whole of the UK web domain for the first time.

Firstly, some numbers. In the first week, we acquired nearly 3.6TB of compressed data (in its raw, uncompressed form, the data is ~40% larger) from some 191 million URLs. Although we staggered the launch as a series of smaller crawls, by the end of the week we reached a sustained rate of 300Mb/s. The bulk of this was from the general crawl of the whole domain, which we kicked off with a list of 3.8 million hostnames.
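For readers who like to check the arithmetic, here is a rough back-of-envelope sketch of what those figures imply. It assumes decimal units (1 TB = 10**12 bytes, 1 Mb = 10**6 bits), which the post does not state, and the variable names are purely illustrative.

```python
# Back-of-envelope figures derived from the numbers quoted above.
# Assumption: decimal units (1 TB = 10**12 bytes, 1 Mb = 10**6 bits).

COMPRESSED_BYTES = 3.6e12        # "nearly 3.6TB of compressed data"
UNCOMPRESSED_FACTOR = 1.4        # raw data is "~40% larger"
URL_COUNT = 191e6                # "some 191 million URLs"
SUSTAINED_BITS_PER_SEC = 300e6   # "a sustained rate of 300Mb/s"
SECONDS_PER_DAY = 86_400

# Approximate size of the first week's data before compression.
uncompressed_bytes = COMPRESSED_BYTES * UNCOMPRESSED_FACTOR

# Average compressed size per URL harvested.
avg_compressed_per_url = COMPRESSED_BYTES / URL_COUNT

# Volume transferred in one day if the sustained rate were held throughout.
bytes_per_day_at_rate = SUSTAINED_BITS_PER_SEC / 8 * SECONDS_PER_DAY

print(f"Uncompressed: {uncompressed_bytes / 1e12:.2f} TB")
print(f"Average per URL (compressed): {avg_compressed_per_url / 1e3:.1f} kB")
print(f"One day at the sustained rate: {bytes_per_day_at_rate / 1e12:.2f} TB")
```

These are upper-level estimates only: the first week included a staggered launch, so the sustained rate was not held for the whole period.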

At this stage it is difficult to determine what our success rate is - that is, how successful we are at harvesting each resource we target. This is partly because the Heritrix crawler has what might be described as an optimistic approach to determining what in a harvested page is actually a real link to another resource (particularly when parsing JavaScript). As a result, some of the occasions on which Heritrix does not return a resource are due to the fact that there was not a real resource to be had.

At this early stage it is also hard to distinguish reliably between an erroneous response for a real resource that has disappeared, and an occasion on which access to a real resource was blocked. Over time, we'll learn more about how best to answer some of these questions, which will hopefully start to reveal interesting things about the UK web as a whole.
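As an illustration of the kind of first-pass tally this implies, the sketch below groups HTTP status codes into rough buckets. It is not how Heritrix or our own reporting works; the bucket names and the mapping are assumptions made for the example, and for the reasons given above a "gone" response may equally reflect a link that never really existed.

```python
from collections import Counter
from typing import Iterable


def bucket(status: int) -> str:
    """Place an HTTP status code in a rough, first-pass bucket (assumed mapping)."""
    if 200 <= status < 400:
        return "fetched"      # successes and redirects
    if status in (404, 410):
        return "gone"         # missing, or perhaps never a real resource
    if status in (401, 403):
        return "blocked"      # access refused
    return "other"            # server errors and anything else


def summarise(statuses: Iterable[int]) -> Counter:
    """Count how many responses fall into each bucket."""
    return Counter(bucket(s) for s in statuses)


if __name__ == "__main__":
    print(summarise([200, 404, 403, 500, 301, 410]))
```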

Roger Coram / Andy Jackson / Peter Webster

 

16 April 2013

Just what is the UK web domain anyway?


This sounds like a simple question. Ten seconds on most sites will tell a human viewer where a site originates from, and a little digging will produce the answer eventually. But under Non-Print Legal Deposit, we need a scalable way of settling the question without human intervention. Our remit under the new regulations extends to sites that are issued from a .uk or other UK geographic top-level domain, or where part of the publishing process takes place in the UK. (See the regulations here, and a summary here.)

[Image: UK map]

We estimate that there are just short of five million sites that end in .uk - a simple, unambiguous and machine-readable way of knowing that a site originates from within the UK and so is covered by the remit we now have. However, not all UK sites end in .uk. Many .com, .org and other sites are in fact published from within the UK, and there are few reliable figures as to how many of these there are. And so to identify which of these fall within the scope of the regulations, we need other methods.
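The .uk test really is as simple as it sounds. A minimal sketch, using only the Python standard library, might look like the following; the function name is ours for illustration, and it checks the .uk suffix only, not any other UK geographic top-level domain.

```python
from urllib.parse import urlsplit


def has_uk_suffix(url_or_host: str) -> bool:
    """Return True if the hostname of a URL (or a bare hostname) ends in .uk."""
    # urlsplit only recognises the host when a scheme or leading "//" is present.
    if "://" not in url_or_host and not url_or_host.startswith("//"):
        url_or_host = "//" + url_or_host
    host = urlsplit(url_or_host).hostname or ""
    # Ignore a trailing dot, as in a fully qualified "example.co.uk."
    return host.rstrip(".").endswith(".uk")


if __name__ == "__main__":
    for candidate in ("http://www.bl.uk/", "www.example.co.uk", "https://example.com/page.uk"):
        print(candidate, has_uk_suffix(candidate))
```

Note that the check looks at the hostname alone, so a .com site with ".uk" somewhere in its path is not counted.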

One such method is to find out where the site is hosted. www.geoiptool.com provides information on where a server is located, although it is difficult to attain 100% accuracy. Another way is to look at where the domain name is registered, using a service such as www.whois.net. However, in many cases domains are registered by one company on behalf of another or of an individual, perhaps because they want their contact details to remain private. There also isn't (yet) a straightforward way of querying any of these services at scale for thousands or indeed millions of sites.
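If such lookups were available at scale, the two signals could be gathered along the following lines. This is a sketch only: the two lookup functions are hypothetical stand-ins passed in by the caller (here backed by small dictionaries), not the interface of any real geolocation or WHOIS service, and "GB" is assumed as the country code for the UK.

```python
from typing import Callable, List, Optional

# Hypothetical lookup type: takes a hostname and returns an ISO country code
# such as "GB", or None when nothing reliable is known.
CountryLookup = Callable[[str], Optional[str]]


def uk_evidence(host: str,
                hosting_country: CountryLookup,
                registrant_country: CountryLookup) -> List[str]:
    """Collect whichever location signals point to the UK for a hostname."""
    evidence = []
    if hosting_country(host) == "GB":
        evidence.append("hosted in UK")
    if registrant_country(host) == "GB":
        evidence.append("registered in UK")
    return evidence


if __name__ == "__main__":
    # Stand-in data for illustration; real lookups would query external services.
    hosting = {"example.com": "GB"}.get
    registrant = {"example.com": "US"}.get
    print(uk_evidence("example.com", hosting, registrant))
```

Returning the list of signals, rather than a yes or no, leaves room for the caveats above: neither signal is conclusive by itself.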

There may be sites for which we have direct knowledge, from the site owner, that their .com domain is operated from within the UK, but that could only ever be for a tiny proportion of sites. And so after all these possibilities are exhausted, the next step is to make judgements based on the presentation of the site itself. But what in a site is "enough"? A postal address in a Contact Us page is a possibility; so is a UK-domain email address (for those sites whose owners don't use anything as twentieth century as the post).
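To make those two possibilities concrete, here is a rough sketch of how a page's text might be scanned for them. The patterns are deliberately simplified assumptions: the postcode pattern matches the general shape of a UK postcode rather than validating it, and will produce false positives, so a match is a hint and not proof.

```python
import re
from typing import Dict

# Simplified shape of a UK postcode, e.g. "NW1 2DB" (not a full validation).
UK_POSTCODE = re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}\b", re.IGNORECASE)

# An email address whose domain ends in .uk (and not, say, in .uk.com).
UK_EMAIL = re.compile(
    r"[\w.+-]+@[\w-]+(?:\.[\w-]+)*\.uk(?![\w-])(?!\.[\w-])",
    re.IGNORECASE,
)


def uk_signals(page_text: str) -> Dict[str, bool]:
    """Report which UK-looking contact details appear in a page's text."""
    return {
        "postcode": bool(UK_POSTCODE.search(page_text)),
        "uk_email": bool(UK_EMAIL.search(page_text)),
    }


if __name__ == "__main__":
    print(uk_signals("Write to us at 96 Euston Road, London NW1 2DB."))
    print(uk_signals("Email info@example.co.uk for details"))
```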

What if a site doesn't disclose the information we might like, but is self-evidently from the UK (once you look at the content)? One example is Conservative Home, a prominent political site, which nowhere explicitly states that it is published in the UK. This is a particular issue for blogs, which are often hosted on a platform service such as WordPress (which is based in San Antonio, Texas) but would be thought by most to be "published" from wherever the author is based. There are similar issues in determining which parts of social media sites such as Twitter or Facebook should be treated as published from within the UK.

All of this of course supposes that all website owners tell the truth about where they are based. There may be cases where a site is published in another country but purports to be from the UK, perhaps to protect the author from a repressive regime. Conversely an owner might, for reasons which are hard to predict, wish that their site published within the UK did not appear to be.

It's early days for Non-Print Legal Deposit, and some of these issues will become clearer as we gain more experience with just these sorts of difficult questions. 

[Map reproduced courtesy of Showeet.com, under a Creative Commons Attribution-NoDerivs 3.0 licence.]

Peter Webster, Web Archiving Engagement and Liaison Manager