08 December 2015
Using Open Refine to create XML Records for Wikimedia Batch Upload Tool
We upload quite a lot of our British Library digitised collections to Wikimedia Commons. Their GLAMwiki Toolset lets us fashion our metadata up front so that it is consistent, and upload files in bulk by collection. This bulk uploader utility requires the metadata to be in a flat XML file.
There is actually quite a comprehensive guide over at Wikimedia Commons on how to get started using this tool, but I thought I might also share an example of how we do it over here! In particular, how we use OpenRefine to quickly turn a spreadsheet of records into an XML file appropriate for use in the GLAMwiki Toolset.
Example: Uploading BL Wildlife Sounds to Wikimedia Commons
Cheryl, our fabulous curator of Wildlife & Environmental Sounds at the British Library, wanted to upload a collection of wildlife sounds ahead of a Europeana Sounds Edit-a-thon.
- She sent me this spreadsheet describing each sound file, downloaded straight out of our BL Sounds Catalogue. The sound files themselves were saved on a webserver and each had its own URL.
- I opened her spreadsheet in OpenRefine to have a look over the data and do any necessary data cleaning. Note: I highly recommend Owen Stephens’ excellent Introduction to OpenRefine which he teaches regularly as part of our internal staff training programme.
- In another browser window I checked to see if there was an existing metadata template for sounds so that I could see which standard Wikimedia fields we could map our own data to. The closest template I found was Musical Work which would roughly do the trick.
- Using the Musical Work template as my guide, I made a few quick updates to Cheryl’s original data in OpenRefine (again, see Owen’s great guide for tips on how to do that using regular expressions in OpenRefine). For example, I:
1) Changed her “CC License” column to “Permissions” and replaced her original text “cc-by” with the appropriate wiki subtemplate "{{cc-by-sa-4.0}}"
2) Created a new column called "GWToolsetTitle" and populated it with the file name I wanted displayed for each sound file. As standard practice, we like to include the unique British Library shelfmark in the file names of all of our items on Wikimedia Commons, so I built each name by concatenating the existing Shelfmark field with the Title field into the new GWToolsetTitle column.
- In Notepad++ I fashioned my XML template. If a field doesn’t already exist in the Wikimedia template you can create your own using the "other_fields" option. We wanted to have our own unique fields for "British Library Shelfmark", "Copyright Holder", "Recordist" and "Recording Date" displayed on Wikimedia Commons, so I added those as other_fields in my template. Note that in my XML template example each {{cells["XXXX"].value}} expression pulls in the value from the matching column in my OpenRefine project, where XXXX is the column heading.
- Once I had my template in Notepad++ ready to go, I went back to my project in OpenRefine and clicked on "Export/Templating" on the right-hand side.
- There’s a lot of JSON in there by default.
- I deleted all that JSON on the left side and copied and pasted my own XML template in there. You can see on the right side all of my columns are instantly transformed into a flat XML record ready for export!
- Clicking on "Export" downloaded a .txt file which I saved as .xml. I ran it through a validator tool (I like Code Beautify) to make sure it all checked out. Then presto! This is the flat XML file I then used in the batch upload tool!
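If you would rather script the same steps than click through them, here is a minimal sketch in Python of the idea: tidy each row, drop its values into a per-record XML template, and check that the result parses. This is not what OpenRefine does internally and it is not the template I actually used. The sample rows, the element names (records, record, Shelfmark and so on), the example.org URLs and the order and separator used for GWToolsetTitle are all assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical rows standing in for the catalogue spreadsheet. The column
# names and values are illustrative only, not the real data.
rows = [
    {"Shelfmark": "WS0001", "Title": "Blackbird song", "CC License": "cc-by",
     "Recordist": "A. Example", "URL": "http://example.org/sounds/ws0001.ogg"},
    {"Shelfmark": "WS0002", "Title": "Frogs & toads at dusk", "CC License": "cc-by",
     "Recordist": "B. Example", "URL": "http://example.org/sounds/ws0002.ogg"},
]

# One flat record per row. The element names here are assumptions; use the
# ones that match the fields you plan to map in the upload tool.
RECORD_TEMPLATE = """  <record>
    <GWToolsetTitle>{GWToolsetTitle}</GWToolsetTitle>
    <Permissions>{Permissions}</Permissions>
    <Shelfmark>{Shelfmark}</Shelfmark>
    <Recordist>{Recordist}</Recordist>
    <URL>{URL}</URL>
  </record>"""


def clean(row):
    """Mirror the two spreadsheet edits described in the steps above."""
    out = dict(row)
    # Rename "CC License" to "Permissions" and swap in the wiki subtemplate.
    licence = out.pop("CC License")
    out["Permissions"] = "{{cc-by-sa-4.0}}" if licence == "cc-by" else licence
    # Join shelfmark and title (order and separator are assumptions).
    out["GWToolsetTitle"] = f'{out["Shelfmark"]} {out["Title"]}'
    return out


def render(rows):
    """Fill the template once per row and wrap the records in a root element."""
    records = []
    for row in rows:
        # Escape &, < and > so values like "Frogs & toads" stay well-formed.
        safe = {key: escape(str(value)) for key, value in clean(row).items()}
        records.append(RECORD_TEMPLATE.format(**safe))
    return "<records>\n" + "\n".join(records) + "\n</records>\n"


xml_text = render(rows)
# fromstring raises xml.etree.ElementTree.ParseError if the XML is malformed.
root = ET.fromstring(xml_text)
print(xml_text)
print(len(root.findall("record")), "records parsed")
```

The parse at the end plays the same role as the validator step: it only confirms the file is well-formed XML, not that the fields will map correctly in the upload tool. The escaping is worth keeping in mind whichever route you take, since a stray ampersand in a title is enough to make the whole file fail to parse.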
By Nora McGregor, Digital Curator