The Newsroom blog: Metadata

31 January 2019

The anatomy of news

“I hear new news every day”, wrote the scholar Robert Burton in 1628, “and those ordinary rumours of war, plagues, fires, inundations, thefts, murders, massacres, meteors, comets, spectrums, prodigies, apparitions, of towns taken, cities besieged in France, Germany, Turkey, Poland, daily musters and preparations, and such like.” For Burton, this firehose of news amounted to a “vast confusion”, though his attitude seems to have been one of wonder rather than fear.

Burton was an Oxford man, but made regular trips to London. There he would have paid a visit to the Exchange, gathering up news and gossip from the merchants crowding the surrounding streets, before moving on to St. Paul’s Churchyard, perhaps stopping to buy a pamphlet from a hawker on the way. In front of the Cathedral he might have picked up some more pamphlets from the many booksellers lining the border of its square, or a copy of Nathaniel Butter and Nicholas Bourne’s new news publication, an innovative weekly format copied from the continent, although, somewhat disappointingly, it wouldn’t have contained any domestic news.

This short walk helps us understand how Burton perceived a world of overwhelming information. But what would he have made of the 21st century? Indeed, what would he have made of the 19th? Had he been writing, say, 250 years later, in 1872, Burton would surely have been overwhelmed by the number of titles available to him on a daily basis.

A late-seventeenth-century London coffee house (Usage terms: Creative Commons Attribution Non Commercial Share Alike licence. Held by © Trustees of the British Museum)

The 19th century is a new world for me, coming from a background of 17th century newspapers. And it is a different world. There’s the name, for one thing: the Oxford English Dictionary records the first use of the word ‘newspaper’, to mean a publication of regular, periodical news, in 1688. My own work is on the first half of the 17th century, when the word ‘news-book’ was most common, as was a host of words and phrases like ‘coranto’, ‘weekly news-sheet’, ‘weekly pamphlet’ and ‘Mercuries’, with overlapping, shifting and slightly different meanings.

This naming change can be useful – it helps us to grasp the real intellectual and material differences between the news world of the 17th century and that of the 19th. Although the change was gradual and not always linear – changes and innovations often moved backwards as well as forwards – the march of progress did eventually pick up pace. 17th century news looked very different, much like a few sheets of A4 paper folded in half, with news in a single column. It was called a news-book because it looked like a small book. The way information was organised was different, too: early 17th century news-books contained a series of paragraphs each from a particular place, recording all the news collected from that place. The invention of the ‘article’, a unit of news based on one particular subject or event, was not to happen for some time.

The evolution from one to eight columns

This categorical divide also continues with the data. I estimate there are 1,000,000 words in Early English Books Online’s entire periodicals collection. The British Library’s collection of 19th century news runs to hundreds of millions of pages (we wrote recently that the collection consists of 60 million issues, 450 million pages... perhaps four trillion words... twenty-six trillion characters…). The other seismic change is that a computer can be taught to read (with varying accuracy) 19th century news. For the 17th, it’s still very difficult.

This Optical Character Recognition is what allows me to load up the British Newspaper Archive and check if my great-great-granddad committed any crimes in 1839 (still can’t find anything), for example, or check Limerick hurling scores from 1887. This difference isn’t just trivial: it represents a complete step-change in the way we approach newspaper history. For one thing, the datasets increase in size, by orders of magnitude. I have created a dataset of about 15,000 rows, manually collected, by reading 17th century news and noting down bits of information in a spreadsheet. 15,000 rows, from about 400 newspaper issues, which took many months to create. Yesterday, in a few hours, I created a dataset of N-Grams (basically combinations of words) from a single issue of one 19th century title. It contained 150,000 rows.
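
For readers unfamiliar with N-Grams, a minimal Python sketch of how a passage of text can be turned into rows of word combinations with counts may help. The function name `ngram_rows`, the crude tokenising rule and the sample sentence are all assumptions made for this illustration; this is not the method used to build the dataset described above.

```python
import re
from collections import Counter


def ngram_rows(text, n=2):
    """Return (ngram, count) rows for every run of n consecutive words."""
    # Crude tokeniser, assumed for the example: lowercase the text and keep
    # runs of letters and apostrophes. Real OCR output needs more care.
    words = re.findall(r"[a-z']+", text.lower())
    # Slide a window of n words across the text, one word at a time.
    grams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    # Count repeats and return the rows, most frequent first.
    return Counter(grams).most_common()


sample = "I hear new news every day, and new news every week."
rows = ngram_rows(sample, n=2)
for gram, count in rows:
    print(gram, count)
```

Even this one short sentence yields eight distinct two-word rows, which gives some sense of how a full newspaper issue can run to six figures once longer combinations are included.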

150,000 rows of generated data, from one issue. Multiply that by about 250 for a weekday title, then by hundreds of titles, then by 200 years and the potential for ‘big data’ is rather astonishing. Of course, this data is not as rich with information as my humble spreadsheet, nor does it record any kind of fine-grained detail, but it does change the types of processing, computing power and storage needed, and most importantly, the types of intellectual questions that are and are not answerable. My 17th century dataset is like interviewing everyone in a small town, in some detail; the 19th century datasets we’ll be working with on our Heritage Made Digital newspapers project record the cosmos – albeit from far away. We don’t know much, but we know it about an enormous number of things. But the differences extend past volume: there is also a step-change in readership and scope.
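
The multiplication sketched above can be worked through as a back-of-envelope calculation. The figure of 100 titles is an assumption, taken as a cautious lower bound for "hundreds of titles", and the sum assumes every title runs for the full 200 years, so the result is only an order-of-magnitude guide.

```python
rows_per_issue = 150_000  # from the single issue described above
issues_per_year = 250     # "about 250 for a weekday title"
titles = 100              # ASSUMPTION: lower bound for "hundreds of titles"
years = 200               # assumes every title ran for the whole period

total_rows = rows_per_issue * issues_per_year * titles * years
print(f"{total_rows:,} rows")  # 750,000,000,000 rows
```

On these assumptions the total comes to 750 billion rows, against 15,000 in the hand-built spreadsheet.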

The 19th century newspaper was everywhere. Some of the most popular 17th century newsbooks were probably printed in weekly runs of about 2,000; by 1863, the Daily Telegraph had a circulation of 120,000 per day. In 1628 Burton was overwhelmed by information in London and Oxford but elsewhere the firehose could be a drip, or a drought. By the 19th century news surged through the country’s arteries, veins and capillaries: at first everywhere within the reach of the train; eventually the telegraph, information finally travelling at the speed of light, in dots and dashes. It was the most pervasive cultural object of the century.

Newspaper titles held by the British Library, year by year, 1621-1900

Even accounting for the reuse and sharing of copies this is a fundamentally very different type of cultural artefact. If I analyse every page of news in the early 17th century, I have a vast record of events, and the thoughts and feelings of a select group of people. In the 19th century, the newspaper is a reasonable proxy for the way society thinks. To me it seems as though news in the 19th century captures a good proportion of a collective consciousness. It is a reasonable (though problematic) way to infer societal change. Through the newspaper’s great reach we can understand historical forces. The articles and personalities in the 19th century newspaper can tell us about structures of power. Its advertisements identify trends, economic forces and the changing roles within the family. The words themselves and their frequencies can help us understand the use of language, or uncover drifts in sentiments towards political movements, ideologies and so forth. In the 17th century the readership is so small, such a small part of the diet of information ingested by both important and ordinary people, that the questions we ask of its remains are different. Not less important, certainly not less interesting, but surely of a different kind.

Yes, the 19th century news world feels like a different one to the 17th. A mostly new world, with some evidence of the ruins of its earlier civilisation: the old towers are fallen, though echoes of their presence remain. The vast confusion had been replaced with one infinitely greater. Our job is to find, research and understand the new techniques that are necessary to make sense of this information overload.

Yann Ryan

Curator, Newspaper Data

Posted by Luke McKernan at 10:30 AM in Digitisation , Metadata , Newspapers | Permalink

07 January 2019

Heritage Made Digital - the newspapers

The British Library is currently engaged on a major programme entitled Heritage Made Digital. The aim of the programme is to transform digital access to the British Library's heritage collections by streamlining digitisation workflows, undertaking strategically led digitisation and making existing digitised content available as openly as copyright and licensing agreements allow. Heritage Made Digital is embracing a wide range of materials, from manuscripts through to sounds, and one of its major elements is newspapers.

Unfit newspaper volumes awaiting conservation inspection

The first thing to ask is why the British Library needs to be digitising newspapers, when we already have a very productive relationship with family history company Findmypast, which selects and digitises newspapers for the British Newspaper Archive, providing us with digital preservation copies in the process. It has digitised over 20 million pages from our collection, and adds hundreds of thousands of extra pages each month.

The simple answer is that there is more that we would like to see digitised that isn't likely to get digitised soon otherwise. The greater part of newspapers processed by Findmypast come from our microfilmed copies, because it is so much easier and quicker to do so (about eighteen times quicker than digitising from print). But only a third of our collection of some 60 million newspaper issues has been microfilmed. Of the newspapers for which we have only print, some get digitised, but many do not. In part this is because of the condition of many of the newspapers, often produced using low-quality newsprint and for many years not stored in optimum conditions. We define preservation status of our newspapers under three categories: good, poor and unfit. Unfit no one gets to see, even onsite, unless we have a microfilm or digital access version. And around 4.5% of our collection (or 20 million pages) is in an unfit state and with no microfilmed or digitised copy available. That's a lot of newspapers not to be making available at all.

So, for Heritage Made Digital, we have chosen to concentrate on newspapers in a poor or unfit condition. This is not as straightforward as it might sound, since few runs of a newspaper title (i.e. from its first date to its last date) exist under one condition status. One volume may be good, another poor, another unfit (e.g. with a broken spine, crumbling pages etc). Therefore, although we want to concentrate on poor or unfit newspapers, we also want to digitise full runs of newspaper titles, because this will make best sense for researchers. In practice, we find that 40% of the volumes we are digitising for Heritage Made Digital are in a poor or unfit state.

We have set other restrictions for ourselves, with the aim of offering the best result for the widest range of research users. We are only digitising newspapers that are out of copyright, so that we can make the results freely available online - both the digitised pages and the data created by digitisation. Calculating when a newspaper goes out of copyright is complicated, but we are sticking to a 140-year rule - so the run of the newspaper has to have ended by 1878.

Next, we are primarily digitising newspapers that were published in London but which were distributed outside London as well. So, not newspapers for the areas of London only (i.e. London regionals), but metropolitan newspapers with a wider circulation. Curiously enough, this is a neglected area for newspaper digitisation. The British Newspaper Archive focusses heavily on British regional newspapers, while the main UK national newspapers available digitally are almost entirely those where the title still exists (e.g. The Guardian, The Times). In other words, we have identified a gap, one which we think will make a significant difference to what is available online so far.

We are not in competition with Findmypast, however - in fact, we are working closely with them. Every newspaper that we digitise will be made freely available via the British Library's catalogue, but they will also be made available via the British Newspaper Archive (a subscription site). That means that almost all of our digitised newspapers will be searchable - by title, date and word - in the one place. As things stand, the newspapers will be appearing on the BNA first, and secondly (at a date still to be determined) through the British Library catalogue, using the Universal Viewer display tool (a development project still in progress).

Waiting to be digitised

So, what are we digitising?

It will be around 1.3 million pages, 1 million from print and another 300,000 from microfilm. We're still choosing the titles to digitise, even as we start digitising, as we find out more through a process of preservation need and research, but it will be somewhere around 180 newspaper titles, many of them short runs of a year or less. We can't provide a definitive list as yet, but these are some of the titles (with title changes) that have gone to our imaging studios already:

Baldwin's London Weekly Journal (1803-1836)
The Bee-Hive / The Penny Bee-Hive (1862-1876)
The British Liberator (1833)
Colored News (1855)
Illustrated Sporting News and Theatrical and Music Review / Illustrated Sporting and Theatrical News (1862-1870)
The Lady's Newspaper and Pictorial Times (1847-1863)
Mirror of the Times (1800-1823)
Morning Herald (1801-1869)
The News / The News and Sunday Herald / The News and Sunday Globe (1805-1839)
People's Weekly Police Gazette (1835-1836)
Pictorial Times (1843-1848)
The Saint James's Chronicle (1801-1866)
The Sun / The Sun & Central Press (1801-1876)
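
The 140-year rule described above amounts to a simple test on the last year of a title's run. A minimal Python sketch follows, using four of the titles listed above; the function name `run_is_eligible` and the final "Hypothetical Gazette" entry are invented for illustration, and the sketch encodes only the stated cutoff, not the full copyright calculation, which the post notes is complicated.

```python
CUTOFF_YEAR = 1878  # the run must have ended by this year (140-year rule)

# (title, first year, last year), taken from the list above, plus one
# invented entry to show a title that would fail the test.
titles = [
    ("The Bee-Hive / The Penny Bee-Hive", 1862, 1876),
    ("Colored News", 1855, 1855),
    ("Morning Herald", 1801, 1869),
    ("The Sun / The Sun & Central Press", 1801, 1876),
    ("Hypothetical Gazette", 1870, 1900),  # HYPOTHETICAL, not a real title
]


def run_is_eligible(last_year, cutoff=CUTOFF_YEAR):
    """True if the whole run ended on or before the cutoff year."""
    return last_year <= cutoff


eligible = [title for title, first, last in titles if run_is_eligible(last)]
for title in eligible:
    print(title)
```

Because the test is applied to the end of the run, a long-lived title that began in 1801 but carried on past the cutoff would be excluded in full.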

There is a lot more that we have planned. We're exploring academic partnerships (we're already working closely with the recently-announced British Library/Alan Turing Institute data science project Living with Machines). We're aiming to do creative things with the data. We will be publishing blog posts, both about the content and about the decisions we're making on what gets digitised. We will be producing online guides and research tools, aimed at both the specialist and the general user.

We think that we have come up with a model for the digitisation of newspapers, in particular the way in which we are working in partnership with Findmypast, which will be particularly productive. We certainly hope to build on it beyond the life of the project. We can't show you any newspapers digitised through Heritage Made Digital, or offer any free datasets, as yet. But we will do soon.

It's worth remembering that the British Library has 60 million newspapers, from 1619 to the present day. After a decade or more of intensive work, we have digitised just 5%. There is a long, long way to go.

Posted by Luke McKernan at 10:08 AM in Archives , Digitisation , Metadata , Newspapers | Permalink

24 September 2015

Mining the FT

We're pleased to announce a partnership with the Financial Times to open up its archives to new kinds of research. The business news daily newspaper has been running since 1888, and has a wealth of information on national and international economic news, and in recent years reporting on general news, the arts and society. Its digital archive is available in the standard search-and-browse manner to institutional subscribers via Cengage Gale, but the newspaper is interested to explore different ways to make its archives available, with an emphasis on what can be done with its data.

The full digital archive runs 1888-2010 and comprises 903,029 pages from 37,464 print editions. However, the collaboration is starting off with a relatively small amount of content, which may expand later. The FT has agreed a licence which permits use of the data for academic research purposes, either onsite at the British Library or via controlled remote access.

Four complete sample years of FT page images (as JPEGs) and data (XML) are being made available to research teams: 1888, 1939, 1966 and 1991. The licence runs to the end of 2015, when we will review what has been learned and will see how access and use may be extended thereafter. So the sample years would be ideal for researchers developing data-driven projects who need some test content to scope future plans, or to test tools or applications that they may be developing.

Anyone who is interested should get in touch with Luke McKernan, Lead Curator News & Moving Image at the British Library, who can provide further details. Research teams may also be interested to take part in the Library's first news hackathon, scheduled for November 16th, which will include FT data alongside data derived from the Library's own news collection. More news on this will be published soon.

The collaboration with the Financial Times is one part of emerging plans for British Library news data. The structure of news content offers numerous opportunities for analysing, interrogating, visualising and rethinking what news archives are today, as well as creating new kinds of newspaper and other news media history. We held a news data workshop on September 7th, where we brought together researchers, developers and content owners to look at ways we might develop plans for news data that would best benefit researchers. There's a report on the workshop on our Digital Scholarship blog.

We hope to be issuing news on further news archive datasets that we can make available for research in the near future.

Posted by Luke McKernan at 10:59 AM in Digital scholarship , Metadata , Newspapers | Permalink | Comments( 0)

29 August 2014

St Pancras Intelligencer no. 33

Your humble blogger is taking a rest from Newsroom duties for a couple of weeks while he heads off on vacation, so there will be no St Pancras Intelligencer next Friday, nor the next. So make the most of this week's select gathering of news about news, and look out for plenty more from the Newsroom blog on our return.

GDELT comparison of 'conflict events' in Germany 7/8/2009 – 9/6/2009 (green left of black line) and 9/6/2009 – 11/5/2009 (green right of black line) compared with Egypt (red) - see http://blog.gdeltproject.org/towards-psychohistory-uncovering-the-patterns-of-world-history-with-google-bigquery/

Can computers replace historians?: Rory Cellan-Jones at BBC News notes the work of the GDELT project ('a global database of society'), which has collected media reports of events from sources in more than 100 languages covering a period of 35 years. It is using the data to draw out the pattern of world events with the sort of analysis that would have taken historians years to compile in the traditional manner. News looks like it is the first draft of history after all.

'Daily Mail' solves Internet paradox: Michael Wolff at USA Today looks admiringly on how the Daily Mail created the separate beast of Mail Online and created the world's 'most-trafficked' English-language newspaper website.

Open journalism also means opening up your data, so others can use and improve it: Gigaom's Mathew Ingram (never a week goes by but we don't find ourselves recommending his writings) calls for journalists to free up their data - because it's good for journalism.

How the news upstarts covered ISIS: DigiDay examines how news' new kids on the block, including Vice, BuzzFeed, Mashable, International Business Times and Vocativ have been beating newspapers at their traditional game when it comes to coverage of the rise of ISIS.

https://bellingcat.com/resources/case-studies/2014/08/22/gun-safety-self-defense-and-road-marches-finding-an-isis-training-camp/

Gun Safety, Self Defense, and Road Marches – Finding an ISIS Training Camp: Talking of which, news coup of the week was undoubtedly Eliot Higgins' Kickstarter-funded citizen journalism site, Bellingcat, which showed how to identify the location of an ISIS training camp using Google Earth and Bing Maps.

Can the UK’s broadcast news providers keep doing more for less?: Former ITN chief turned journalism academic Stewart Purvis looks at the struggles broadcasters have, caught between the demands of innovation and tradition:

At the opposite ends of the scale are the traditional TV news audience, predominantly over 55 years of age, and the 16-34 audience which is converting to or adopting online news use at a startling rate, especially since the arrival of smart phones and tablets ... whereas daily average TV viewing is currently three times higher among adults aged 55-plus than among adults age 16-34, the ratio is more like five or six to one when it comes to news. In the middle is the 35-54 audience which currently has a foot in both camps but whose future allegiance to TV news cannot be taken for granted.

Vice News sparks debate on engaging younger viewers: On the same theme, The Guardian looks at how traditional broadcasters such as the BBC and Channel 4 News are aiming to attract a generation at home on YouTube and social media.

Is local TV vanity over sanity?: Media Week looks at how the plans are going for the launch of local television stations across the UK, and doesn't think that things are going too well.

New Orleans newspaper page, from www.noladna.com

Old newspapers, new value: Printmaker J.S. Makkos writes a beautifully-illustrated piece for The Atlantic about making new products out of old New Orleans newspapers, and reminds us of old controversies about the disposal of surplus newspaper archives and the dangers of keeping only the grey images of microfilm. For more, see the New Orleans Digital Newspaper Archive.

The Times' newsroom set to ring with the sounds of typewriters once more: What fun - a speaker has been introduced into The Times newsroom at London Bridge, which relays the sounds of typewriters, recalling the newsroom of old. The intention is apparently to boost energy levels and encourage journalists to meet deadlines as the sound of the typewriters rises to a crescendo. Ian Burrell at The Independent looks on, with not a little bemusement.

Posted by Luke McKernan at 8:12 AM in Journalism , Metadata , Newspapers , Social media , St Pancras Intelligencer , Web | Permalink | Comments( 0)

11 April 2014

St Pancras Intelligencer no. 13

Welcome to the latest edition of the St Pancras Intelligencer, our weekly round-up of news about news - stories about news production, publications, apps, digitised resources, events and what is happening with the newspaper collection (and other news collections) at the British Library.

The Newsroom

Opening day: So of course the British Library tops the week's news about news with the opening on April 7th of the Newsroom, its new reading room for news. Newspapers, television news, radio news and web news can now all be found in the one physical space - though for newspapers that means microfilm and digital for now, until the print papers become available again in the autumn. It all looks very beautiful - and has a lot more people in it than in this photo taken just before it opened.

Shift 2014: It's all been happening here this week, with Newsworks, the marketing body for UK national newspapers, holding its Shift 2014 conference at the British Library. The live blog of the event includes reactions to star turns such as the editors of The Guardian (Alan Rusbridger), The Independent (Amol Rajan) and The Telegraph (Jason Seiken) and Sir Martin Sorrell, chief executive of WPP. Jason Seiken's speech is here.

Here & Then: And there's more. The British Newspaper Archive, which provides digitised copies of British Library newspapers online, has issued a free iPhone app, Here & Then, with articles, images and adverts from the collection. Oh, and 135,000 pages were added to the BNA site in March.

What will yesterday’s news look like tomorrow?: Article of the week, by a mile. Adrienne LaFrance at Medium looks at the future of news archives, with a focus on how they are catalogued and their data mapped for rediscovery in the future. "News organizations need to design archives that better mirror the experience of consuming news in real time, and reflect the idea that the fundamental nature of a story is ongoing".

The Press Freedom Issue: Contributoria, the community funded, collaborative journalism site, published a special issue on press freedom this month. Among the great articles available are Crowdfunding critical thought: How alternative finance builds alternative journalism, Court and council reporting - still a bedrock of local news?, Pirate journalism and The printing press created journalism. The Internet will destroy it. Read and learn.

News is still a man's world: A City University study reveals that male experts still outnumber female experts by a ratio of four to one on flagship radio and TV news programmes.

Has Thompson at the NYT given newspapers a new way to pull in extra cash and readers?: Mark Thompson, former BBC DG and now heading the New York Times, may have had a big idea - New York Times Premier, an added subscription to the online version of the newspaper, with additional content, offers (two free ebooks a month), even special crosswords. The Drum speculates.

Upvoting the news: long, engrossing article by Alex Leavitt for Medium on how news spreads across social media channels, with particular emphasis on Reddit.

The state of Egypt's news media: Al Jazeera's excellent news analysis programme The Listening Post looks at the "sorry state of journalism in Egypt".

A sample 'card' from Vox.com

Three good things about Ezra Klein’s new site Vox, plus three challenges that it faces: The much-hyped Vox.com site, with celebrity news blogger Ezra Klein, launched on April 6th. Mathew Ingram at Gigaom says what he likes (especially the user-friendly 'cards' with background information to stories) then wonders how it will thrive.

Bristol Post editor baffled by fact that front page gay kiss costs thousands of sales: Press Gazette reports on what happened when Bristol Post editor Mike Norton decided to put same-sex marriage on his paper's front page.

'Video-checking' the Clegg and Farage debate: Fact-checking videos - where videos of speeches are analysed to see whether or not the statements made stand up - have been popularised by The Washington Post's Truth Teller. Now the fact-checking organisation Full Fact have done the same for LBC's Nick Clegg v Nigel Farage debate.

Peaches Geldof – was the coverage by newspapers, and TV, over the top?: Roy Greenslade ponders on what would have been proportionate news coverage for the sad death of Peaches Geldof.

More UGC, fewer photographers – and no paywalls: Editors set out visions of future: Hold the Front Page reports on the Society of Editors Regional Conference, where likely changes to the regional newspaper world were set out: user-generated content, smaller offices, cover price rises, no staff photographers, and no paywalls.

One easy, transparent way of making accuracy visible: open sourcing: George Brock argues that the way for news providers to build up trust is through links to source material - footnotes, sort of, though he prefers the term open sourcing.

How some journalists are using anonymous secret-sharing apps: Using apps like Whisper and Secret to turn rumour into news.

We need to talk: Raju Narisetti, senior vice president of strategy at News Corp, poses 26 questions to ask news organisations about the move to digital. Fascinating insight into a business in transition.

Posted by Luke McKernan at 8:57 AM in British Newspaper Archive , Journalism , Metadata , Newspapers , Newsroom , Radio , Social media , St Pancras Intelligencer , Television , User-generated content , Video , Web | Permalink | Comments( 0)

04 April 2014

St Pancras Intelligencer no. 12

From The Poke via @jameshoggarth

45 local news stories that rocked the world: It started with Patrick Smith at Buzzfeed - now headlines from UK regional newspapers are fast becoming an Internet cult. The Poke collect 45 that show just why we love local newspapers so.

Against beautiful journalism: Thought-provoking article from Felix Salmon at the Reuters blog, who argues against the over-designed nature of some (mostly American) news sites. "Today, when you read a story at the New Republic, or Medium, or any of a thousand other sites, it looks great; every story looks great. Even something as simple as a competition announcement comes with a full-page header and whiz-bang scrollkit graphics. The result is a cognitive disconnect..."

How 3 publishers are innovating with online video: Journalism.co.uk looks at how Huffington Post, the Washington Post and BuzzFeed are taking different approaches to using video, as discussed at the FT Digital Media conference.

Harry Chapman Pincher: Perhaps the best-named journalist ever, certainly one of the most famous living British journalists, Chapman Pincher has turned 100 years old and is still writing. Nick Higham at BBC News profiles the man who became legendary for his espionage scoops.

Safeguarding the “first rough draft of history”: How pleasant to have a history of newspapers (with thank yous to the British Library for its newspaper preservation work) from Sylvia Morris at the excellent Shakespeare Blog.

In praise of the almost-journalists: A fine piece by Dan Gillmor at Slate on the distinctive contribution to online news made by advocacy organisations such as Human Rights Watch and Cato Institute.

News Corp boss brands Washington Post journalists 'high priests': Not such good times for journalists of the old school. The Guardian reports how News Corp's Chief Executive Robert Thomson feels that the Washington Post's journalists have failed to embrace the transition to digital.

Apple Adds Talk Radio And News To iTunes Radio Starting With NPR: iTunes Radio gets its first non-music offering with this team up with NPR (National Public Radio), Techcrunch reports.

Journalists increasingly under fire from hackers, Google researchers show: ArsTechnica reports that news organisations are increasingly being targeted by state-sponsored hackers.

The Evolution of Automated Breaking News Stories: Is this the future of news? Technology Review reports on how a Google engineer has developed an algorithm, Wikipedia Live Monitor, that spots breaking news stories on the Web and illustrates them with pictures. Now it is tweeting them.

Debugging the backlash to data journalism: Data journalism has been all the rage, so inevitably there has been a backlash. Alexander Howard at Tow Center provides a good overview of the phenomenon, its strengths and its limitations.

Taming the news beast: The Newsroom blog goes to an International Society for Knowledge Organization event on news archives and news metadata, and comes back thoughtful.

London Live – capital's first dedicated TV channel – takes to the air: The Evening Standard-backed TV channel went live on March 31st. Meanwhile, Jim Waterson at BuzzFeed provides an entertaining history of the last time someone tried to launch a TV station called London Live.

The Guardian crowned newspaper of the year at Press Awards for government surveillance reports: Press Gazette names all the winners at the Press Awards. Meanwhile, former Guardian columnist Glenn Greenwald has won the University of Georgia's McGill Medal for Journalistic Courage.

German officials ban journalist from naming his son #Wikileaks. No comment.

Posted by Luke McKernan at 7:52 AM in Archives , Journalism , Metadata , Newspapers , People , Radio , St Pancras Intelligencer , Television , Video | Permalink | Comments( 0)

02 April 2014

Taming the news beast

Taming the News Beast was the striking title of a seminar held on April 1st by ISKO UK, the British branch of the International Society for Knowledge Organization. Subtitled "finding context and value in text and data", its aim was to explore the ways in which we can control the explosion of news information data and derive value from it. Much has been written about this explosion from the points of view of its producers and consumers, but less well known are the huge challenges it presents for those whose job it is to manage such data by working effectively with those who generate it. Few environments depend more on effective information management - while creating any number of problems for those trying to apply the rules - than the news industry today. Hence the seminar, which aimed "to share knowledge from the intersections of technology, semantics and product development".

Looking at the large lecture theatre at University College London filled to the brim with an enthusiastic audience of data developers, information scientists, journalism students and archivists, your blogger was moved to think that things were very different to when he spent his time at library college, many years ago now. Library and information studies, as they called it then, excited no one. Now, in the era of big data, it is where the big ideas are happening. Librarians (let's continue to give them their traditional name) are masters of the digital universe, or might aspire to be. Metadata is cool; ontologies are where it's at; semantics really means something.

The epitome of this excitement about information management - particularly news information - is the work coming out of BBC development projects such as BBC News Labs, which was introduced in a presentation by its Innovation Manager, Matt Shearer. News Labs has a small team of people looking at better ways in which to manage news information, both within and outside the BBC. Its work includes the Juicer API (for semantic prototyping), the #newsHACK days for testing of product development ideas, entity extraction (extracting key terms from a mass of unstructured text), linked data (the important principle of working with data based on terms produced for DBpedia which other institutions can share in to create linked-up knowledge) and the Storyline ontology. There is particular excitement in trying to extract searchable terms for audiovisual media, through such technologies as speech, image and music recognition. If there is a pattern, the machines can be trained to recognise it.
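For readers who like to see the idea rather than just hear about it, here is a toy sketch of entity extraction tied to linked data identifiers. It is emphatically not the Juicer API or anything News Labs has published: the `GAZETTEER` lookup table and the `extract_entities` function are hypothetical names invented for this illustration, and a real system would use trained models rather than a hand-made list. The only thing borrowed from the real world is the general shape of DBpedia resource identifiers.

```python
import re

# Hypothetical hand-made gazetteer (an assumption for illustration only):
# surface form as it appears in text -> DBpedia-style resource identifier.
GAZETTEER = {
    "Edward Snowden": "http://dbpedia.org/resource/Edward_Snowden",
    "Moscow": "http://dbpedia.org/resource/Moscow",
    "Russia": "http://dbpedia.org/resource/Russia",
}


def extract_entities(text):
    """Return sorted (surface form, identifier) pairs for names found in text."""
    found = []
    for name, uri in GAZETTEER.items():
        # Whole-word match so that a name is not found inside a longer word.
        if re.search(r"\b" + re.escape(name) + r"\b", text):
            found.append((name, uri))
    return sorted(found)


if __name__ == "__main__":
    for name, uri in extract_entities("Edward Snowden remains at a Moscow airport."):
        print(name, "->", uri)
```

Because two organisations using the same identifier for "Moscow" are demonstrably talking about the same place, their catalogues can be joined up without anyone having to agree on spelling or house style first. That is the linked data principle in miniature.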

Shearer's enthusiastic and sometimes mind-spinning presentation was matched by his colleague Jeremy Tarling, data architect with News Labs, who introduced Storyline - an open data model for news. Storyline is a way of structuring news stories around themes, based on a linked data model. The linked data bit is the way of ensuring consistency and shareability (they are working with other news organisations on the project). The theme element is about a new way of presenting news online which joins up stories in a less linear, more intuitive fashion. If you type 'Edward Snowden' into a search engine you will get hundreds of stories - how do you sort these out, or tell what the overarching narrative is that connects them all? If you can bundle the Snowden stories that your news organisation has produced around stories that go to make up the Edward Snowden theme - for example, Snowden at Moscow airport, Snowden finds job in Russia - you start to impose more of a pattern, and to draw out more of a story - the storyline, that is.
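The bundling idea can be sketched in a few lines of code. What follows is only a guess at the general shape, not the Storyline ontology itself: the `Story` class, its field names and the `build_storylines` function are all assumptions made for this illustration, and the `sequence` numbers are placeholders standing in for real publication dates.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Story:
    """A single news story, tagged with the theme it belongs to (hypothetical model)."""
    headline: str
    storyline: str  # the theme tag a journalist would apply
    sequence: int   # placeholder for a publication date


def build_storylines(stories):
    """Group stories by theme, ordering each group by its sequence number."""
    grouped = defaultdict(list)
    for story in stories:
        grouped[story.storyline].append(story)
    return {
        theme: sorted(items, key=lambda s: s.sequence)
        for theme, items in grouped.items()
    }


if __name__ == "__main__":
    stories = [
        Story("Snowden finds job in Russia", "Edward Snowden", 2),
        Story("Snowden at Moscow airport", "Edward Snowden", 1),
    ]
    for theme, items in build_storylines(stories).items():
        print(theme, [s.headline for s in items])
```

Everything here hangs on the `storyline` tag being applied consistently, which is why the model depends so heavily on the people doing the tagging.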

The nuts and bolts of this are interesting, because it requires journalists to tag their stories correctly, and listening between the lines one could see that some journalists were more willing and able to do so than others. But this sort of data innovation is happening, and it will have a dramatic impact on how news sources such as the BBC News website look in the future.

The energy, resources and ingenuity put into such work by the BBC can leave the rest of us overwhelmed, not to say humbled, but the remaining speakers had equally interesting things to say. Rob Corrao, Chief Operating Officer of LAC Group, gave a dry, droll account of how his consultancy company had been brought in to enable ABC News in New York to get on top of the "endless torrent" of news information coming in every day. This was a different approach to the problem, more of an exercise in logistics than simple data management policies. They managed the people and the work-processes first, then everything else fell into place. A content strategy was essential to understanding how best to manage the news process, including such simple ideas as prioritising the digitisation of footage of people likely to feature before long in obituary pieces. The more you know what the news will be in advance, the easier it is to manage it.

Ian Roberts of the University of Sheffield introduced AnnoMarket, a European-funded project which will process your text documents for you, or conduct analyses of news and social media sources. As automated metadata extraction tools start to make more of an impact (that is, tools which extract useful information from digital sources), so businesses are popping up which will do the hard work for you. Send them a large bunch of documents in digital form, and they will analyse them for you. Essentially it's like handing them a book and getting back an index.

Finally Pete Sowerbutts of the Press Association talked about how the news agency is applying semantic data management tools to its news archives, so that with a bit of basic information about a subject (e.g. name, age, occupation), place or organisation and some properly applied tagging, a linked-up catalogue starts to emerge. People, places and organisations are the subjects that all of the projects like to tackle, because they are easily defined. Themes - i.e. what news stories are actually about - are harder to pin down, semantically speaking.

Beneath all the jargon, much of this was about tackling age-old problems of how best to catalogue the world around us. Librarians in the room of a particular vintage looked like they had seen all of this before, and indeed they had. Librarians' role in life is to try to impose order on an impossibly chaotic world. Previously they came up with classification schemes and controlled vocabularies and tried to make real-life objects match these. Now we have automated systems which try to apply similar rules with reduced human intervention because of the sheer vastness of the data we are trying to manage, and because it is digital and digital lets you do this sort of thing. Yet real life continues to elude all of our attempts to describe it precisely. Sometimes the only way you are going to find out what a news publication is actually about is to pick it up and read it. But you still have to find it in the first place.

An unanswered question for me was whether what applies to news applies to news archives. News changes once it has been produced. It turns into a body of information about the past, where the stories that mattered when they were news may no longer matter, because researchers will approach the body of information with their own ideas in mind, looking across stories as much as they may look directly for them. Our finding tools for news archives must be practical, but they must not be too prescriptive. ABC News may hope to guess what the news will be in the future, but the news archivist can never be so presumptuous. It is you, the users, who will provide the storylines.

Posted by Luke McKernan at 11:18 PM in Catalogues and databases, Metadata, Technology, Television

14 March 2014

St Pancras Intelligencer no. 9

The Newsroom: Well of course we have to start with our own big news, which is that the Newsroom - the British Library's new reading room for news - opens at St Pancras on Monday 7 April. Is this the first library space ever to be named after a blog...?

Named Entity Recognition for newspapers: Not the most exciting title for a blog post, but something worth reading closely by anyone interested in the future of digitised newspaper research. Europeana Newspapers explains how key terms can be extracted from newspaper text to enhance search and improve linkage of data.

News Archive Connected Studio: Build Studio: Keep an eye on what Peter Rippon and his team at the BBC are doing in planning how to open up their news archives. Much audience testing is coming first.

Why Twitter will never be a news organization: An interesting interview in Time with Twitter's Head of News, Vivian Schiller. "The Twitter news team is never going to pick and choose news stories, pick and choose winners. That’s not our job at all. But what we need to do is ... to make it easier for news organizations but also for our consumers to find what they’re looking for."

Why Twitter can't keep crashing: Mat Honan at Wired says that Twitter has become too important to how the world gains its news to have the crashes that it not infrequently does have. "It is the definition of breaking news. Twitter is increasingly the key place where information is born – stuff that maybe starts with one person but is important to the whole world."

Strictly algorithm: Really interesting article by Stuart Dredge at The Guardian on how the news we want finds us - through algorithms - and what this means for news, journalism and democracy.

Thomas Jewell Bennett: an early supporter of Indian Home Rule: Pat Farrington writes for the British Library's Untold Lives blog on her great-uncle, editor of the Times of India, some of whose letters are held here.

Russia’s information warriors are on the march – we must respond: Anne Applebaum at the Telegraph sets out to sort out the truth from lies in the Russian media's reporting of the crisis in Ukraine.

Ah, sweet irony: For aficionados of errors in TV subtitles, much joy was brought about by this misinterpretation of Matt Frei talking about Russian Foreign Minister Sergey Lavrov on Channel 4 News.

BBC values: The BBC Academy interviews James Harding, director of BBC News, about values and maintaining audience trust.

Endangered species: At British Journalism Review Kim Fletcher argues that traditional newspaper editors are on their way out; content officers are on their way in.

Fleet Street editors of the past were little different from those of today: Talking of which, Roy Greenslade reviews Dennis Griffiths' Blum & Taff: A tale of two editors, on R.D. Blumenfeld and H.A. Gwynne, Fleet Street greats from another age.

Why venture capitalists are suddenly investing in news: Adrienne LaFrance at Quartz looks at why the investment money is pouring into the new kids on the news block: Buzzfeed, Upworthy, Vice etc. As one interviewee puts it: "They are all technology companies first ... They understand how people utilize technology and how to present and create content."

Journalism startups aren't a revolution if they're filled with all these white men: Emily Bell looks at the somewhat familiar make-up of some supposedly cutting edge news start-ups.

Robot reporters and the age of drone journalism: And finally, look out for Emily Bell's lecture on how new technologies are driving the future of journalism, at the British Library on 25 April.

Posted by Luke McKernan at 8:07 AM in Digitisation, Metadata, Newspapers, Newsroom, Social media, St Pancras Intelligencer, Television, Web
