16 December 2024
Closing the language gap: automated language identification in British Library catalogue records
What do you do when you have millions of books and no record of the language they were written in? Collection Metadata Analyst Victoria Morris looks back to describe how she worked on this in 2020...
Context
In an age of online library catalogues, recording the language in which a book (or any other textual resource) is written is vital to library curators and users alike, as it allows them to search for resources in a particular language, and to filter search results by language.
As the graph below illustrates, although language information is routinely added to British Library catalogue records created as part of ongoing annual production, fewer than 30% of legacy records (from the British Library’s foundation catalogues) contain language information. As of October 2018, nearly 4.7 million records lacked any explicit language information. Of these, 78% also lacked information about the country of publication, so it would not be possible to infer language from the place of publication.
The question is: what can be done about this? In most cases, the language of the resource described can be immediately identified by a person viewing the book (or indeed the catalogue record for the book). With such a large number of books to deal with, though, it would be infeasible to start working through them one at a time ... an automated language identification process is required.
Language identification
Language identification (or language detection) refers to the process of determining the natural language in which a given piece of text is written. The texts analysed are commonly referred to as documents.
There are two possible avenues of approach: using either linguistic models or statistical models. Whilst linguistic models have the potential to be more realistic, they are also more complex, relying on detailed linguistic knowledge. For example, some linguistic models involve analysis of the grammatical structure of a document, and therefore require knowledge of the morphological properties of nouns, verbs, adjectives, etc. within all the languages of interest.
Statistical models are based on the analysis of certain features present within a training corpus of documents. These features might be words, character n-grams (sequences of n adjacent characters) or word n-grams (sequences of n adjacent words). These features are examined in a purely statistical, ‘language-agnostic’ manner: words are understood simply as sequences of letter-like characters bounded by non-letter-like characters, not as words in any linguistic sense. When a document in an unknown language is encountered, its features can be compared to those of the training corpus, and a prediction can thereby be made about the language of the document.
Our project was limited to an investigation of statistical models, since these could be more readily implemented using generic processing rules.
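To illustrate the kinds of features involved, here is a minimal Python sketch of word and character n-gram extraction; the function names are invented for this example, and this is not the project’s actual code:

```python
import re

def word_features(text):
    # A 'word' here is just a run of letter-like characters bounded by
    # non-letter-like characters; no linguistic knowledge is involved.
    return re.findall(r"\w+", text.lower())

def char_ngrams(text, n=3):
    # Character n-grams: every sequence of n adjacent characters.
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(word_features("Négy szerelem"))  # ['négy', 'szerelem']
print(char_ngrams("szerelem", 3))      # ['sze', 'zer', 'ere', 'rel', 'ele', 'lem']
```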
What can be analysed?
Since the vast majority of the books lacking language information have not been digitised, the language identification had to be based solely on the catalogue record. The title, edition statement and series title were extracted from catalogue records, and formed the test documents for analysis, on the assumption that these elements would be in the same language as the resource being described.
Although there are examples of catalogue records where these metadata elements are in a language different to that of the resource being described (as in, for example, The Four Gospels in Fanti, below), it was felt that this assumption was reasonable for the majority of catalogue records.
Measures of success
The effectiveness of a language identification model can be quantified by two measures: precision and recall. Precision measures the ability of the model not to make incorrect language predictions, whilst recall measures the ability of the model to find all instances of documents in a particular language. In this context, high precision is of greater value than high recall, since it is preferable to provide no information about the language of content of a resource than to provide incorrect information.
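As a concrete sketch (with invented names, not the project’s evaluation code), precision and recall for a single language could be computed like this:

```python
def precision_recall(true_langs, predicted_langs, lang):
    # Precision: of the records predicted to be in `lang`, how many were?
    # Recall: of the records actually in `lang`, how many did we find?
    pairs = list(zip(true_langs, predicted_langs))
    tp = sum(1 for t, p in pairs if t == lang and p == lang)
    fp = sum(1 for t, p in pairs if t != lang and p == lang)
    fn = sum(1 for t, p in pairs if t == lang and p != lang)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# One correct Zulu prediction, one Zulu record missed:
print(precision_recall(["Zulu", "Zulu", "English"],
                       ["Zulu", None, "English"], "Zulu"))  # (1.0, 0.5)
```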
Various statistical models were investigated, with only a Bayesian statistical model based on analysis of words providing anything approaching satisfactory precision. This model was therefore selected for further development.
The Bayesian idea
Bayesian methods are based on a calculation of the probabilities that a book is written in each language under consideration. An assumption is made that the words present within the book title are statistically independent; this is obviously a false assumption (since, for example, adjacent words are likely to belong to the same language), but it allows application of the following proportionality:

P(language | word₁, word₂, …, wordₙ) ∝ P(language) × P(word₁ | language) × P(word₂ | language) × … × P(wordₙ | language)
The right-hand side of this proportionality can be calculated based on an analysis of the training corpus. The language of the test document is then predicted to be the language which maximises the above probability.
Because of the assumption of word-independence, this method is often referred to as naïve Bayesian classification.
What that means in practice is this: we notice that whenever the word ‘szerelem’ appears in a book title for which we have language information, the language is Hungarian. Therefore, if we find a book title which contains the word ‘szerelem’, but we don’t have language information for that book, we can predict that the book is probably in Hungarian.
If we repeat this for every word appearing in every title of each of the 12 million resources where we do have language information, then we can build up a model, which we can use to make predictions about the language(s) of the 4.7 million records that we’re interested in. Simple!
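A minimal naïve Bayes sketch along these lines might look as follows; the function names and the smoothing constant are assumptions made for this example, not details of the project’s implementation:

```python
import math
from collections import Counter, defaultdict

def train(corpus):
    # corpus: iterable of (title_words, language) pairs taken from
    # records which already carry language information.
    lang_counts = Counter()
    word_counts = defaultdict(Counter)
    for words, lang in corpus:
        lang_counts[lang] += 1
        for w in words:
            word_counts[lang][w] += 1
    return lang_counts, word_counts

def predict(words, lang_counts, word_counts, alpha=1e-6):
    # Score each language by P(language) x the product of P(word | language),
    # working in log space to avoid underflow; the tiny constant `alpha`
    # stops a single unseen word from zeroing the whole product.
    total = sum(lang_counts.values())
    best_lang, best_score = None, float("-inf")
    for lang, n in lang_counts.items():
        bucket = word_counts[lang]
        bucket_size = sum(bucket.values())
        score = math.log(n / total)
        for w in words:
            score += math.log((bucket[w] + alpha) / (bucket_size + alpha))
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang

lang_counts, word_counts = train([(["szerelem"], "Hungarian"), (["love"], "English")])
print(predict(["szerelem"], lang_counts, word_counts))  # Hungarian
```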
Training corpus
The training corpus was built from British Library catalogue records which contain language information. Records recorded as being in ‘Miscellaneous languages’, ‘Multiple languages’, ‘Sign languages’, ‘Undetermined’ and ‘No linguistic content’ were excluded. This yielded 12,254,341 records, of which 9,578,175 were for English-language resources. Words were extracted from the title, edition statement and series title, and stored in a ‘language bucket’ for the relevant language.
Language buckets were analysed in order to create a matrix of probabilities, whereby a number was assigned to each word-language pair (for all words encountered within the catalogue, and all languages listed in a controlled list) to represent the probability that that word belongs to that language. Selected examples are listed in the table below; the final row in the table illustrates the fact that shorter words tend to be common to many languages, and are therefore of less use than longer words in language identification.
Word | Language probabilities
౨ | {Telugu: 0.750, Somali: 0.250}
aaaarrgghh | {English: 1.000}
aaavfleeße | {German: 1.000}
aafjezatsd | {German: 0.333, Low German: 0.333, Limburgish: 0.333}
aanbidding | {Germanic (Other): 0.048, Afrikaans: 0.810, Low German: 0.048, Dutch: 0.095}
نبوغ | {Persian: 0.067, Arabic: 0.200, Pushto: 0.333, Iranian (Other): 0.333, Azerbaijani: 0.067}
metodicheskiĭ | {Russian: 0.981, Kazakh: 0.019}
nuannersuujuaannannginneranik | {Kalâtdlisut: 1.000}
karga | {Faroese: 0.020, Papiamento: 0.461, Guarani: 0.010, Zaza: 0.010, Esperanto: 0.010, Estonian: 0.010, Iloko: 0.176, Maltese: 0.010, Pampanga: 0.010, Tagalog: 0.078, Ladino: 0.137, Basque: 0.029, English: 0.010, Turkish: 0.029}
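A sketch of how such probabilities might be derived from the language buckets, estimating P(language | word) as the fraction of a word’s occurrences falling in each bucket (names invented for the example):

```python
from collections import Counter, defaultdict

def build_probability_matrix(records):
    # records: iterable of (words, language) pairs from the training corpus.
    buckets = defaultdict(Counter)
    for words, lang in records:
        for w in words:
            buckets[w][lang] += 1
    # For each word, normalise its counts into a probability distribution.
    return {
        word: {lang: round(n / sum(counts.values()), 3)
               for lang, n in counts.items()}
        for word, counts in buckets.items()
    }

records = [(["karga"], "Papiamento"), (["karga"], "Tagalog")]
print(build_probability_matrix(records)["karga"])
# {'Papiamento': 0.5, 'Tagalog': 0.5}
```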
Results
Precision and recall varied enormously between languages. Zulu, for instance, had 100% precision but only 20% recall; this indicates that all records detected as being in Zulu had been correctly classified, but that the majority of Zulu records had either been mis-classified or had received no language prediction at all. In practical terms, this meant that a prediction “this book is in Zulu” was a prediction that we could trust, but we couldn’t assume that we had found all of the Zulu books. Looking at our results across all languages, we could generate a picture (formally termed a ‘confusion matrix’) to indicate how different languages were performing (see below). The shaded cells on the diagonal represent resources where the language has been correctly identified, whilst the other shaded cells show us where things have gone wrong.
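For illustration only, such a matrix can be accumulated simply by counting (actual, predicted) pairs; the language pairings below are hypothetical:

```python
from collections import Counter

def confusion_matrix(true_langs, predicted_langs):
    # Diagonal entries (actual == predicted) are correct identifications;
    # off-diagonal entries show where predictions have gone wrong.
    return Counter(zip(true_langs, predicted_langs))

cm = confusion_matrix(["Zulu", "Zulu", "Xhosa"], ["Zulu", "Xhosa", "Xhosa"])
print(cm[("Zulu", "Zulu")])   # 1 correct Zulu identification
print(cm[("Zulu", "Xhosa")])  # 1 Zulu record mis-classified
```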
The best-performing languages were Hawaiian, Malay, Zulu, Icelandic, English, Samoan, Finnish, Welsh, Latin and French, whilst the worst-performing languages were Shona, Turkish, Pushto, Slovenian, Azerbaijani, Javanese, Vietnamese, Bosnian, Thai and Somali.
Where possible, predictions were checked by language experts from the British Library’s curatorial teams. Such validation facilitated the identification of off-diagonal shaded areas (i.e. languages for which predictions should be treated with caution), and enabled acceptance thresholds to be set. For example, the model tends to over-predict English, in part due to the predominance of English-language material in the training corpus; the acceptance threshold for English was therefore set at 100%: predictions of English would only be accepted if the model claimed that it was 100% certain that the language was English. For other languages, the acceptance threshold was generally between 95% and 99%.
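Applying such thresholds might look like the following sketch; the threshold values echo those quoted above, but the function and its parameters are assumptions for the example:

```python
def accept_prediction(language, confidence, thresholds, default=0.95):
    # A prediction is only accepted if the model's confidence clears the
    # per-language acceptance threshold.
    return confidence >= thresholds.get(language, default)

thresholds = {"English": 1.00}  # over-predicted languages get stricter bars
print(accept_prediction("English", 0.997, thresholds))   # False
print(accept_prediction("Hungarian", 0.97, thresholds))  # True
```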
Outcomes
Two batches of records have been completed to date. In the first batch, language codes were assigned to 1.15 million records with 99.7% confidence; in the second batch, a further 1 million language codes were assigned with 99.4% confidence. Work on a third batch is currently underway, and it is hoped that this will yield at least a further million language code assignments. The graph below shows the impact that this project is having on the British Library catalogue.
The project has already been well-received by Library colleagues, who have been able to use the additional language coding to assist them in identifying curatorial responsibilities and better understanding the collection.
Further reading
For a more in-depth, mathematical write-up of this project, please see a paper written for Cataloging & Classification Quarterly, which is available at: https://doi.org/10.1080/01639374.2019.1700201, and is also in the BL research repository at https://bl.iro.bl.uk/work/6c99ffcb-0003-477d-8a58-64cf8c45ecf5.