Why use machine learning in cultural heritage and humanities research?

Most stories start at the beginning; this one, however, is going to start at the end.

This first post was going to be a reflection on machine learning and the use of TensorFlow – a software library used for machine learning applications such as neural networks. It was to follow from attending a pre-conference workshop delivered by colleagues from the AI Lab at the National Library of Norway (NLN) as part of the second international Fantastic Futures conference (2019) at Stanford University.

However, at the end of a session (Intro to TensorFlow) delivered by André Walsøe and Freddy Wetjen from the NLN, they posed a question:

How do we (the GLAM community) move from an experimentation phase into a production phase, using digitised heritage collections and machine learning techniques?

Bear in mind that behind their question is a lot of pre-existing experimentation and collaboration at the NLN (roughly four years) and a continuing partnership with Stanford University Library.

An initial thought was to break that question down, starting with the state of heritage and research collections.

  • Digitised or born-digital collections – many heritage and research institutions have these. ✔
  • Collections are not well described – manual description does not scale – and this is not uncommon at all in heritage and research institutions. ✔
  • Collections are not well accessed – it depends on how access is measured, but it is safe to say many items in collections are rarely, if ever, used. ✔

If there are digital collections and there are barriers to their discovery and use, it’s a no-brainer to apply techniques that make them more accessible and reusable. Just a quick pointer to the FAIR (Findable, Accessible, Interoperable, Reusable) and CARE (Collective benefit, Authority to control, Responsibility, Ethics) principles as an aide-mémoire for facilitating data-enabled / data-driven / data-intensive research and data curation.

Then to look at the types of professionals looking after collections in heritage and research institutions.

  • Some collection professionals are technically skilled – few, though, are trained as computer or data scientists.
  • It is safe to say that information science expertise is common among collection professionals – and many are used to working with software.
  • Few collection professionals have ventured into the “carpentries” (e.g. Library Carpentry) – many are yet to extend their technical skills through this type of training.

Finally, to look at the experimental work with machine learning that André Walsøe, Freddy Wetjen and others with computer science knowledge and technical skills have undertaken.

  • Tools hosted in the cloud are readily available (like Jupyter notebooks and TensorFlow) – and they are being used in experiments (a toy sketch of this kind of experiment follows this list).
  • The experiments are revealing some useful outcomes (in the case of the NLN, to support discovery) – and this is great to see.
  • There are technologists and curators who can be drawn into teams to work on and reflect on the experiments, and to think further about other questions, i.e. the why and the what-for (the strategy and business case).
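
To make those “experiments” a little more concrete, here is a deliberately tiny sketch of the sort of thing such a notebook might contain: training a small TensorFlow classifier to suggest a broad subject label for a short snippet of OCR’d text. The snippets, subject codes and model are invented placeholders for illustration, not the NLN’s actual data or approach.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Toy stand-ins for OCR'd snippets and invented subject codes (not real collection data).
texts = tf.constant([
    "parliamentary debates 1905",
    "coastal fisheries survey",
    "church parish register",
])
labels = tf.constant([0, 1, 2])  # e.g. 0 = politics, 1 = industry, 2 = genealogy

# Turn free text into fixed-length bag-of-words vectors.
vectorizer = layers.TextVectorization(max_tokens=5000, output_mode="multi_hot")
vectorizer.adapt(texts)
features = vectorizer(texts)

# A very small classifier that suggests a subject label for a snippet.
model = tf.keras.Sequential([
    layers.Dense(16, activation="relu"),
    layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(features, labels, epochs=10, verbose=0)

# Suggested label probabilities for an uncatalogued snippet.
print(model.predict(vectorizer(tf.constant(["herring catch statistics 1923"]))))
```

The point is not the model, which is trivial, but that the whole workflow sits comfortably inside a cloud-hosted notebook of the kind already in use.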

A useful moment to reflect back on an investigation into the uptake of linked open data methods, funded by a VALA travel scholarship back in 2015.

It became clear, after reviewing practice change associated with linked open data in various parts of the world, that there is a common cycle, and that cycle may be relevant here. It seems likely that what is missing is a collection strategy that leverages machine learning and can be aligned with institutional goals.

It also seems reasonable to suggest that collection service goals remain consistent, but could perhaps be extended. International uptake of machine learning methods in documentation and collection practices by GLAMs could be focused on:

  • Enhancing metadata quality through data enrichment (a small sketch of one enrichment approach follows this list).
  • Improving resource discovery across data silos. 
  • Enabling data integration and data interoperability. 
  • Changing documentation and metadata management practices.
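
As a small illustration of the first point, metadata enrichment can be as simple as running an item’s OCR text through an off-the-shelf named-entity model and treating the detected entities as candidate index terms for a cataloguer to review. The snippet below uses spaCy and its small English model as one possible tool; the item text and workflow are assumptions for illustration, not a description of any institution’s pipeline.

```python
import spacy

# Assumes the small English model has been installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# An invented example of OCR'd item text.
ocr_text = (
    "Letter from Fridtjof Nansen to the Royal Geographical Society, London, "
    "dated 14 March 1897, concerning the Fram expedition."
)

doc = nlp(ocr_text)

# Group detected entities by type as candidate metadata terms;
# a cataloguer would review these before anything is accepted.
suggested_terms = {}
for ent in doc.ents:
    suggested_terms.setdefault(ent.label_, set()).add(ent.text)

print(suggested_terms)
# e.g. {'PERSON': {'Fridtjof Nansen'}, 'GPE': {'London'}, 'DATE': {'14 March 1897'}, ...}
```

Keeping the human review step is what turns this from automated tagging into enrichment of curated metadata.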

Machine learning may extend collection services to include “data services” by:

  • Offering derived data as collection items.
  • Supplying training and test datasets as reference data (fully described, versioned, and maintained online with persistent identifiers) – a hypothetical record of this kind is sketched after this list.
  • Providing advice on appropriate models and approaches.
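
To give a sense of what “fully described, versioned, and maintained online with persistent identifiers” could look like in practice, here is a hypothetical, minimal description of a training dataset offered as a reference item. Every field name and value is invented for illustration, including the DOI; a real record would follow an agreed schema such as schema.org/Dataset or a local metadata profile.

```python
# A hypothetical, minimal description of a derived training dataset.
# All values are placeholders, including the persistent identifier.
reference_dataset = {
    "title": "Newspaper front pages labelled by decade (training split)",
    "version": "1.2.0",
    "persistent_identifier": "https://doi.org/10.xxxx/placeholder",
    "derived_from": "Digitised newspaper collection, 1900-1960",
    "licence": "CC BY 4.0",
    "record_count": 25000,
    "splits": {"train": 20000, "test": 5000},
    "maintained_by": "Collection data services team",
    "last_updated": "2019-12-01",
}

print(reference_dataset["persistent_identifier"])
```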

This is, though, a time to acknowledge that data science is already “in the house” at heritage and research institutions maintaining digital collections.

And a significant proportion of the professional community working with digital collections may need to be brought along into the world of data science, cloud services and data infrastructure, as does the wider community that needs to be drawn into this exploration and change.