06 July 2020
Archivists, Stop Wasting Your Ref-ing Time!
“I didn’t get where I am today by manually creating individual catalogue references for thousands of archival records!”
One of the most laborious yet necessary tasks of an archivist is the generation of catalogue references. This was once the bane of my life. But I now have a technological solution, which anyone can download and use for free.
Meet ReG: the newest team member of the Endangered Archives Programme (EAP). He’s not as entertaining as Reginald D Hunter. She’s not as lyrical as Regina Spektor. But like 1970s sitcom character Reggie Perrin, ReG provides a logical solution to the daily grind of office life - though less extreme and hopefully more successful.
Reggie Perrin’s boss CJ was famed for his “I didn’t get where I am today” catchphrase, and as EAP’s resident GJ, I decided to employ my own ReG, without whom I wouldn’t be where I am today. Rather than writing this blog, my eyes would be drowning in metadata, my mind gathering dust, and my ears fleeing from the sound of colleagues and collaborators banging on my door, demanding to know why I’m so far behind in my work.
The problem
EAP metadata is created in spreadsheets by digitisation teams all over the world. It is then processed by the EAP team in London and ingested into the British Library’s cataloguing system.
When I joined EAP in 2018 one of the first projects to process was the Barbados Mercury and Bridgetown Gazette. It took days to create all of the catalogue references for this large newspaper collection, which spans more than 60 years.
Microsoft Excel’s fill down feature helped automate part of this task, but repeating this for thousands of rows is time-consuming and error-prone.
I needed to find a solution to this.
During 2019 I established new workflows to semi-automate several aspects of the cataloguing process using OpenRefine - but OpenRefine is primarily a data cleaning tool, and its difficulty in understanding hierarchical relationships meant that it was not suitable for this task.
Learning to code
For some time I toyed with the idea of learning to write computer code using the Python programming language. I dabbled with free online tutorials. But it was tough to make practical sense of these generic tutorials, hard to find time, and my motivation dwindled.
When the British Library teamed up with The National Archives and Birkbeck, University of London to launch a PG Cert in Computing for Information Professionals, I jumped at the chance to take part in the trial run.
It was a leap certainly worth taking because I now have the skills to write code for the purpose of transforming and analysing large volumes of data. And the first product of this new skillset is a computer program that accurately generates catalogue references for thousands of rows of data in mere seconds.
The solution - ReG in action
By coincidence, one of the first projects I needed to catalogue after creating this program was another Caribbean newspaper digitised by the same team at the Barbados Archives Department: The Barbadian.
This collection was a similar size and structure to the Barbados Mercury, but the generation of all the catalogue references took just a few seconds. All I needed to do was:
- Open ReG
- Enter the project ID for the collection (reference prefix)
- Enter the filename of the spreadsheet containing the metadata
And Bingo! All my references were generated in a new file.
How it works in a nutshell
The basic principle of the program is that it reads a single column in the dataset, which contains the hierarchical information. In the example above, it read the “Level” column.
It then uses this information to calculate the structured numbering of the catalogue references, which it populates in the “Reference” column.
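To give a feel for the principle, here is a minimal sketch in Python of how a list of levels can be turned into structured references. It is an illustration only, not ReG's actual code: the function name `generate_references`, the `HIERARCHY` list and the sample levels are all assumptions made for this example.

```python
# Assumed hierarchy terms, highest level first (hypothetical, for illustration).
HIERARCHY = ["Collection", "Series", "File"]


def generate_references(prefix, levels, hierarchy=HIERARCHY):
    """Return one reference per row, given each row's hierarchical level."""
    rank = {term: position for position, term in enumerate(hierarchy)}
    # Each stack entry is [rank, own number, number of children seen so far].
    # The first entry stands for the prefix itself and is never removed.
    stack = [[-1, None, 0]]
    references = []
    for level in levels:
        r = rank[level]  # raises KeyError if the term is not in the hierarchy
        # Step back up the tree until the top of the stack is this row's parent.
        while stack[-1][0] >= r:
            stack.pop()
        parent = stack[-1]
        parent[2] += 1  # this row is the parent's next child
        stack.append([r, parent[2], 0])
        numbers = [str(entry[1]) for entry in stack[1:]]
        references.append("/".join([prefix, *numbers]))
    return references


levels = ["Collection", "Series", "File", "File", "Series", "File"]
for level, reference in zip(levels, generate_references("EAP1251", levels)):
    print(level, reference)
# Collection EAP1251/1
# Series EAP1251/1/1
# File EAP1251/1/1/1
# File EAP1251/1/1/2
# Series EAP1251/1/2
# File EAP1251/1/2/1
```

The sketch assumes the rows are already in tree order, with each parent listed directly above its children. Counting children per parent, rather than per level, keeps the numbering correct even if a branch skips a level.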
Reference format
The generated references conform to the following format:
- Each reference begins with a prefix that is common to the whole dataset. This is the prefix that the user enters at the start of the program. In the example above, that is “EAP1251”.
- Forward slashes ( / ) are used to indicate a new hierarchical level.
- Each record is assigned its own number relative to its sibling records, and that number is shared with all of the children of that record.
In the example above, the reference for the first collection is formatted:
The reference for the first series of the first collection is formatted:
The reference for the second series of the first collection is:
No matter how complex the hierarchical structure of the dataset, the program will quickly and accurately generate references for every record in accordance with this format.
While ReG was designed primarily for use by EAP, it should work for anyone who generates reference numbers using the same format.
For users of the Calm cataloguing software, ReG could be used to complete the “RefNo” column, which determines the tree structure of a collection when a spreadsheet is ingested into Calm.
With wider re-use in mind, some settings can be configured to suit individual requirements.
For example, you can configure the names of the columns that ReG reads and generates references in. For EAP, the reference generation column is named “Reference”, but for Calm users, it could be configured as “RefNo”.
Users can also configure their own hierarchy. You have complete freedom to set the hierarchical terms applicable to your institution and complete freedom to set the hierarchical order of those terms.
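As a rough illustration of this kind of configuration, the sketch below merges a user's overrides onto a set of defaults. The setting names (`level_column`, `reference_column`, `hierarchy`), the JSON format and the default hierarchy are assumptions made for this example; ReG's own configuration file may look quite different.

```python
import json

# Assumed defaults, using the EAP column names mentioned in this post.
# The hierarchy terms are hypothetical.
DEFAULT_CONFIG = {
    "level_column": "Level",
    "reference_column": "Reference",
    "hierarchy": ["Collection", "Series", "File"],
}


def load_config(json_text=""):
    """Merge user overrides (a JSON object) onto the defaults."""
    overrides = json.loads(json_text) if json_text.strip() else {}
    unknown = set(overrides) - set(DEFAULT_CONFIG)
    if unknown:
        raise KeyError(f"Unknown settings: {sorted(unknown)}")
    return {**DEFAULT_CONFIG, **overrides}


# A Calm user might only need to change the name of the reference column.
calm_config = load_config('{"reference_column": "RefNo"}')
print(calm_config["reference_column"])  # RefNo
print(calm_config["level_column"])      # Level
```

Rejecting unknown setting names is a design choice in this sketch: a misspelt setting fails loudly instead of being silently ignored.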
It is possible that some minor EAP idiosyncrasies might preclude reuse of this program for some users. If this is the case, by all means get in touch; perhaps I can tweak the code to make it more applicable to users beyond EAP - though some tweaks may be more feasible than others.
Additional validation features
While generating references is ReG's core function, it also includes several validation features to help you spot and correct problems with your data.
Unexpected item in the hierarchy area
For catalogue references to be calculated, all the data in the level column must match a term within the configured hierarchy. The program checks this and, if it finds a discrepancy, notifies the user and offers two ways to proceed.
Option 1: Rename unexpected terms
First, users have the option to rename any unexpected terms. This is useful for correcting typographical errors, such as this example - where “Files” should be “File”.
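The check and the renaming step can be sketched in a few lines of Python. The function names `find_unexpected_terms` and `rename_terms`, the hierarchy and the sample data are hypothetical, chosen for this illustration and not taken from ReG itself.

```python
def find_unexpected_terms(levels, hierarchy):
    """Return level terms missing from the hierarchy, in order of first appearance."""
    known = set(hierarchy)
    unexpected = []
    for term in levels:
        if term not in known and term not in unexpected:
            unexpected.append(term)
    return unexpected


def rename_terms(levels, renames):
    """Return a copy of the level column with the given terms renamed."""
    return [renames.get(term, term) for term in levels]


hierarchy = ["Collection", "Series", "File"]  # assumed hierarchy
levels = ["Collection", "Series", "Files", "File"]

print(find_unexpected_terms(levels, hierarchy))  # ['Files']

fixed = rename_terms(levels, {"Files": "File"})
print(find_unexpected_terms(fixed, hierarchy))   # []
```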
Option 2: Build a one-off hierarchy
Alternatively, users can create a one-off hierarchy that matches the terms in the dataset. In the following example, the unexpected hierarchical term “Specimen” is a bona fide term. It is just not part of the configured hierarchy.
Rather than quitting the program and amending the configuration file, the user can simply establish a new, one-off hierarchy within the program.
This hierarchy will not be saved for future instances. It is just used for this one-off occasion. If the user wants “Specimen” to be recognised in the future, the configuration file will also need to be updated.
Single child records
To avoid redundant information, it is sometimes advisable for an archivist to eliminate single child records from a collection. ReG will identify any such records, notify the user, and give them three options to proceed:
- Delete single child records
- Delete the parents of single child records
- Keep the single child records and/or their parents
Depending on how the user chooses to proceed, ReG will produce one of three results, each of which affects the rows that remain and the structure of the generated references.
In this example, the third series in the original dataset contains a single child - a single file.
The most notable result is option B, where the parent was deleted. Looking at the “Level” column, the single child now appears to be a sibling of the files from the second series. But the reference number indicates that this file is part of a different branch within the tree structure.
This is more clearly illustrated by the following tree diagrams.
This functionality means that ReG will help you spot any single child records that you may otherwise have been unaware of.
But it also gives you a means of creating an appropriate hierarchical structure when cataloguing in a spreadsheet. If you intentionally insert dummy parents for single child records, ReG can generate references that map the appropriate tree structure and then remove the dummy parent records in one seamless process.
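For the curious, spotting single child records is straightforward once the references exist, because each reference contains its parent's reference. The following Python sketch is an assumption about how such a check could work, not a description of ReG's internals; the function name `find_single_children` and the sample references are made up for this example.

```python
from collections import Counter


def find_single_children(references):
    """Return the references of records that are the only child of their parent."""
    records = set(references)
    # A record's parent reference is everything before the final slash.
    parent_of = {ref: ref.rsplit("/", 1)[0] for ref in references}
    child_counts = Counter(parent_of.values())
    return [
        ref
        for ref in references
        # Skip top-level records, whose "parent" is only the prefix.
        if parent_of[ref] in records and child_counts[parent_of[ref]] == 1
    ]


references = [
    "EAP1251/1",
    "EAP1251/1/1",
    "EAP1251/1/1/1",
    "EAP1251/1/1/2",
    "EAP1251/1/2",
    "EAP1251/1/2/1",
]
print(find_single_children(references))  # ['EAP1251/1/2/1']
```

Here the second series has only one file, so that file is flagged. What happens next (deleting the child, deleting the parent, or keeping both) is then a decision for the user.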
And finally ...
If you’ve got this far, you probably recognise the problem and have at least a passing interest in finding a solution. If so, please feel free to download the software, give it a go, and get in touch.
If you spot any problems, or have any suggested enhancements, I would welcome your input. You certainly won’t be wasting my time - and you might just save some of yours.
Download links
For making this possible, I am particularly thankful to Jody Butterworth, Sam van Schaik, Nora McGregor, Stelios Sotiriadis, and Peter Wood.
This blog post is by Dr Graham Jevon, Endangered Archives Programme cataloguer. He is on Twitter as @GJHistory.