The Newsroom blog

News about yesterday's news, and where news may be going

02 April 2014

Taming the news beast

Taming the News Beast was the striking title of a seminar held on April 1st by ISKO UK, the British branch of the International Society for Knowledge Organization. Subtitled "finding context and value in text and data", its aim was to explore the ways in which we can control the explosion of news information and derive value from it. Much has been written about this explosion from the point of view of its producers and consumers, but less well known are the huge challenges it presents for those whose job it is to manage such data while working effectively with those who generate it. Few environments depend more on effective information management - while creating any number of problems for those trying to apply the rules - than the news industry today. Hence the seminar, which aimed "to share knowledge from the intersections of technology, semantics and product development".

[Image: BBC News Labs]

Looking at the large lecture theatre at University College London filled to the brim with an enthusiastic audience of data developers, information scientists, journalism students and archivists, your blogger was moved to think that things were very different to when he spent his time at library college, many years ago now. Library and information studies, as they called it then, excited no one. Now, in the era of big data, it is where the big ideas are happening. Librarians (let's continue to give them their traditional name) are masters of the digital universe, or might aspire to be. Metadata is cool; ontologies are where it's at; semantics really means something.

The epitome of this excitement about information management - particularly news information - is the work coming out of BBC development projects such as BBC News Labs, which was introduced in a presentation by its Innovation Manager, Matt Shearer. News Labs has a small team of people looking at better ways in which to manage news information, both within and outside the BBC. Its work includes the Juicer API (for semantic prototyping), the #newsHACK days for testing product development ideas, entity extraction (pulling key terms out of a mass of unstructured text), linked data (the important principle of describing data with shared terms, such as those in DBpedia, which other institutions can also use to create linked-up knowledge) and the Storyline ontology. There is particular excitement in trying to extract searchable terms from audiovisual media, through such technologies as speech, image and music recognition. If there is a pattern, the machines can be trained to recognise it.
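To make the entity extraction idea concrete, here is a minimal sketch in Python using the open-source spaCy library - an illustrative assumption on my part, not a tool News Labs said it uses. It pulls the people, places and organisations out of a scrap of unstructured text:

    import spacy

    # Load a small general-purpose English model (installed separately with
    # "python -m spacy download en_core_web_sm").
    nlp = spacy.load("en_core_web_sm")

    text = ("Edward Snowden remained at Moscow airport for weeks before "
            "Russia granted him temporary asylum.")

    # Run the pipeline and keep the entity types a news catalogue cares about.
    doc = nlp(text)
    for ent in doc.ents:
        if ent.label_ in {"PERSON", "GPE", "ORG"}:
            print(ent.text, ent.label_)

Mapping the extracted terms onto shared identifiers (a DBpedia entry, say) is what turns plain keywords into linked data.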

Shearer's enthusiastic and sometimes mind-spinning presentation was matched by his colleague Jeremy Tarling, data architect with News Labs, who introduced Storyline - an open data model for news. Storyline is a way of structuring news stories around themes, based on a linked data model. The linked data element ensures consistency and shareability (they are working with other news organisations on the project). The theme element is about a new way of presenting news online which joins up stories in a less linear, more intuitive fashion. Type 'Edward Snowden' into a search engine and you will get hundreds of stories - how do you sort them out, or tell what overarching narrative connects them all? If you can bundle the Snowden stories that your news organisation has produced around the stories that go to make up the Edward Snowden theme - for example, Snowden at Moscow airport, Snowden finds job in Russia - you start to impose more of a pattern, and to draw out more of a story: the storyline, that is.
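For illustration only, here is how such a storyline might be modelled as linked data in Python with the rdflib library. The namespace and property names below are hypothetical stand-ins of my own, not the published Storyline ontology terms:

    from rdflib import RDF, Graph, Literal, Namespace, URIRef

    # Hypothetical vocabulary standing in for a storyline ontology.
    STORY = Namespace("http://example.org/storyline/")

    g = Graph()
    g.bind("story", STORY)

    # One theme (the storyline) joining up several individual stories.
    snowden = URIRef("http://example.org/storyline/edward-snowden")
    g.add((snowden, RDF.type, STORY.Storyline))
    g.add((snowden, STORY.title, Literal("Edward Snowden")))

    for slug, headline in [
        ("snowden-moscow-airport", "Snowden at Moscow airport"),
        ("snowden-russia-job", "Snowden finds job in Russia"),
    ]:
        story = URIRef(f"http://example.org/news/{slug}")
        g.add((story, RDF.type, STORY.NewsItem))
        g.add((story, STORY.headline, Literal(headline)))
        g.add((story, STORY.partOfStoryline, snowden))

    print(g.serialize(format="turtle"))

Because every story and theme gets a stable URI, another organisation can point at the same Snowden identifier and the two datasets join up - which is the shareability the project is after.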

The nuts and bolts of this are interesting, because it requires journalists to tag their stories correctly, and listening between the lines one could see that some journalists were more willing and able to do so than others. But this sort of data innovation is happening, and it will have a dramatic impact on how news sources such as the BBC News website look in the future.

The energy, resources and ingenuity put into such work by the BBC can leave the rest of us overwhelmed, not to say humbled, but the remaining speakers had equally interesting things to say. Rob Corrao, Chief Operating Officer of LAC Group, gave a dry, droll account of how his consultancy had been brought in to help ABC News in New York get on top of the "endless torrent" of news information coming in every day. This was a different approach to the problem, more an exercise in logistics than in data management policy. They managed the people and the work processes first, and everything else fell into place. A content strategy was essential to understanding how best to manage the news process, down to such simple ideas as prioritising the digitisation of footage of people likely to feature before long in obituary pieces. The more you know in advance what the news will be, the easier it is to manage.

Ian Roberts of the University of Sheffield introduced AnnoMarket, an EU-funded project which will process your text documents for you, or conduct analyses of news and social media sources. As automated metadata extraction - tools which pull useful, structured information out of digital sources - starts to make more of an impact, businesses are popping up which will do the hard work for you. Send them a large batch of documents in digital form and they will analyse them for you. Essentially it's like handing them a book and getting back an index.
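The book-and-index analogy can be sketched in a few lines of Python. Everything here is invented for illustration - a real service would use trained recognisers rather than a fixed term list - but it shows the shape of the output: each extracted term mapped to the documents that mention it.

    from collections import defaultdict

    # Stand-in for a real extraction service: match a fixed list of known
    # terms; a production system would recognise entities statistically.
    KNOWN_TERMS = ["Edward Snowden", "Moscow", "BBC", "Press Association"]

    documents = {
        "story-001": "Edward Snowden spent weeks in Moscow before gaining asylum.",
        "story-002": "The BBC and the Press Association are testing semantic tagging.",
    }

    # Invert documents-to-terms into a back-of-book-style index.
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for term in KNOWN_TERMS:
            if term in text:
                index[term].append(doc_id)

    for term in sorted(index):
        print(f"{term}: {', '.join(index[term])}")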

Finally Pete Sowerbutts of the Press Association talked about how the news agency is applying semantic data management tools to its news archives, so that with a little basic information about a person (e.g. name, age, occupation), place or organisation, plus some properly applied tagging, a linked-up catalogue starts to emerge. People, places and organisations are the subjects that all of the projects like to tackle, because they are easily defined. Themes - i.e. what news stories are actually about - are harder to pin down, semantically speaking.
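As a hedged sketch of that idea - invented records, not the Press Association's actual schema - give each person, place or organisation a stable identifier and a few basic properties, tag stories with those identifiers, and the cross-references that make up a catalogue fall out almost for free:

    from dataclasses import dataclass

    @dataclass
    class Entity:
        entity_id: str  # the stable identifier stories are tagged with
        kind: str       # "person", "place" or "organisation"
        name: str
        notes: str = ""

    entities = {
        "e1": Entity("e1", "person", "Edward Snowden", "b. 1983, former NSA contractor"),
        "e2": Entity("e2", "place", "Moscow"),
    }

    # Stories carry entity identifiers, not free-text keywords.
    story_tags = {
        "story-001": ["e1", "e2"],
        "story-002": ["e1"],
    }

    def stories_about(entity_id: str) -> list[str]:
        """The catalogue view: every story in which an entity appears."""
        return [sid for sid, tags in story_tags.items() if entity_id in tags]

    print(stories_about("e1"))  # ['story-001', 'story-002']

Themes get no such easy treatment: a person has a name and a date of birth to anchor an identifier to, but what a story is about has no equivalent handle.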

Beneath all the jargon, much of this was about tackling age-old problems of how best to catalogue the world around us. Librarians in the room of a particular vintage looked like they had seen all of this before, and indeed they had. Librarians' role in life is to try to impose order on an impossibly chaotic world. Previously they came up with classification schemes and controlled vocabularies and tried to make real-life objects match them. Now we have automated systems which try to apply similar rules with reduced human intervention, because of the sheer vastness of the data we are trying to manage, and because digital data lends itself to this sort of processing. Yet real life continues to elude all of our attempts to describe it precisely. Sometimes the only way you are going to find out what a news publication is actually about is to pick it up and read it. But you still have to find it in the first place.

An unanswered question for me was whether what applies to news applies to news archives. News changes once it has been produced. It turns into a body of information about the past, where the stories that mattered when they were news may no longer matter, because researchers will approach the body of information with their own ideas in mind, looking across stories as much as they may look directly for them. Our finding tools for news archives must be practical, but they must not be too prescriptive. ABC News may hope to guess what the news will be in the future, but the news archivist can never be so presumptuous. It is you, the users, who will provide the storylines.


Comments

I feel your last paragraph has highlighted a real problem for news archives: the processes for managing the content rarely include thoughts about what happens to the content in the long term (or even the medium term in some cases) - even though today's news has a constant need for archive footage. There are obvious reasons for this: the people who make decisions about what software to use and how to implement it aren't being judged on questions like 'will I be able to find this footage when I need it again in the future?', 'will I have enough information about it to know whether or not I can use it?' or 'will it be in a format that my company can use?'.

The sad thing is that almost all content-production organisations have invested in new technology in the last few years, and this would have been a great opportunity to think about the life of their content before, during and after its immediate use. But very few organisations do this.

This is an issue for the archives of news producers, and for those of us (such as the British Library) who manage news archives designed for academic researchers. More thought should go into investing in future as well as present use, certainly.
