Google Books 2020 Update

A look back — and forward — at the ongoing partnership between Google and Harvard Library.

What would you do if Google came to you and said: You have 1 million items that we would like to scan for you and make available to the world?

Over the past two years, a team from Access Services, Stacks Management, Library Technology Services, Information and Technical Services, Harvard Depository, and ReCAP has been attempting to do just that as part of a Harvard Library Digital Strategies and Innovation (DSI) initiative. This project began nearly a decade after our first partnership with Google Books, and it has been an opportunity to approach this work differently: to identify the challenges that we face at each step of the workflow and to look for creative, iterative ways to meet them.

How do you fulfill a request for 1 million items? 

We knew that Google had already applied several parameters in their review of our collection. In setting their candidate list, they had identified items that were in the public domain, and they were focusing on items that were unique to Harvard. We took their candidate list and thought about our collections both narrowly and broadly, balancing the needs of Google’s workflow with the needs of our own staff who would be identifying, pulling, barcoding, and shipping selected material.

Google’s standard workflow involves sending 24 carts at a time, staging them in a library space, and loading them onto a truck for shipment at predetermined intervals. For most of our repositories, that was a non-starter. Loading docks and staging space are at a premium, so we looked for alternative arrangements that would meet our local needs and still get carts to Google. Lee Fenn was the wizard behind the data — helping to identify sets of material within the Google candidate list that would allow Harvard Library to take advantage of this opportunity while at the same time understanding our logistical limitations and meeting our own strategic goals. 

The first set of material we identified was from the Widener stacks. Working with our partners at Google and ReCAP — with Leila Smith at the helm — we were able to send 26,543 volumes from the stacks of Widener to Google via a stop at ReCAP. Once they were scanned, Google returned those materials to ReCAP for accessioning, freeing up 2,333 linear feet of space in the Widener stacks. That work took place in the winter and spring of 2018.

Once we had demonstrated the viability of that workflow, sending materials to ReCAP and allowing Google to pick up and return from there, we knew we could also send any materials from the Google candidate list that had already been accessioned to ReCAP using the same workflow. That meant that we were able to send an additional 16,853 items — or 3,168 linear feet of material — for scanning in the fall of 2018. 

One major difference between the materials at Widener and those that had already been accessioned at ReCAP, however, is that many items at Widener required individual barcoding. Google needed a barcode on each item both for tracking purposes and to retrieve the item’s metadata. The ReCAP items had already been barcoded, and that led us to consider another set of materials that had already been barcoded — items at the Harvard Depository.

HD staff were able to pull and send 147,322 items to Google for scanning while continuing to use the existing workflow for returning items to ReCAP for accessioning, freeing up valuable space at HD. These materials were primarily from Widener and Harvard Law School collections, and represented a significant portion of Google’s candidate list.

What about the data?

This work began just as Harvard Library was transitioning from Aleph to Alma, a challenge that ITS and LTS were able to take on despite the complexity. Access Services staff raced against the clock to get the first set of Widener stacks pulls completed before the switch to Alma. Once the Alma transition was complete, they partnered with colleagues in ITS and LTS to develop a new workflow for creating the shipping manifests that accompany each set of carts and allow Google to retrieve the metadata about the items we send.

At the same time, our transition to Alma also meant revisiting the pipeline for adding links in the catalog to items that Google has scanned. Laura Morse’s team within LTS has developed a new workflow for creating and updating links to Google Books, but we are also working with our Google partners to update the pipeline for exposing our Google scans in HathiTrust and linking to the HathiTrust interface from our catalog. This work is ongoing, and it isn’t a static fix. We are continually facing new challenges with how our Alma data is used and updated, and every time we move items around — whether for scanning, relocating, or in this case both — we uncover new issues that lead to new opportunities.

What’s next?

At the same time that we were experimenting with these workflows to send items from Widener and HD and return them to ReCAP, we were also experimenting with a second Google workflow — their sheet-fed scanning process. We knew that there were items within our collection that, once converted to digital form, would no longer be retained in physical form. The most obvious candidates for this workflow came from our Government Documents collection at Lamont. In 2018 we piloted this approach locally using our own high-speed scanner. Hugh Truslow was able to work with his colleagues at Boston Public Library, who serve as our regional Government Depository representatives, to clear a list of titles for this project. Using our local experience as a testing ground, we were able to move forward with identifying broader sets of government documents that we could send through Google’s workflow. 

Laura Sheriff and the team at Lamont have been able to send 1,429 items using this workflow, and we are prepared to continue that work through 2020. The goal is to clear as many stacks on the D-Level of Lamont as possible, an opportunity for reusing that space in myriad ways that further our strategic objectives as a library. Again, this involves barcoding, updates to Alma records, and work by LTS to replace our print holdings with links to Google Books and ultimately HathiTrust.

But why?

Our partnership with Google is not just about scanning services. It is about an ongoing relationship that seeks to enhance and refine the use of digital surrogates. Between 2004 and 2009, Google scanned 891,164 volumes from Harvard. Google has begun reprocessing those materials, enhancing and correcting the raw images and running them through updated OCR to create better, more searchable, machine-readable text.  

As part of this relationship, we are involved in the Google Library Partners group, an active community of our colleagues from peer institutions who also share their materials with Google. As a group we have been able to advocate for and contribute to reviews for handling of materials, quality assurance in scanning, and expanded treatments for items with foldouts or materials of non-traditional size. We have also led a review of how our peers provide access to materials and are actively partnering with HathiTrust to conduct more research into how users find and utilize these materials.

I am grateful to have been a part of this project and for the support from Library leadership to make this moonshot possible. If you are interested in working to identify items from your own collections that may be suitable for scanning by Google, please email claire_demarco@harvard.edu.
 

By Claire DeMarco, Associate Director, Digital Strategies and Innovation