Open Books: Making Historical Texts Accessible

Prajit Dhar, PhD candidate, Faculty of Arts

Open Research objectives

Making the outputs of research freely accessible.

Using open collaborative methods and tools to increase efficiency and widen participation in research.

Practices

Creating new tools or technologies to facilitate Open Research practices

Introduction

This project aims to provide open access to the linguistic information from two text corpora: the Google Books Ngram Corpus [1] and the Corpus of Historical American English [2].

Motivation

While working on this project, I realized that this was the first time I had needed to concern myself with purchasing costs and distribution licenses. I also realized that not everyone is fortunate enough to have the computational resources that I currently have.

For this reason, I felt it was necessary to give users the opportunity to access data that would otherwise be restricted.

Lessons learned

I clearly underestimated the effort needed to work with massive amounts of data. Several factors came into play, such as the choice of server and the available download bandwidth. The current configuration took 2-3 months to finalize before we could even begin to work on this project.

Then came the issue of where to store the processed information from the corpora. Thankfully, there exist data repositories such as Zenodo that allow us to store the huge datasets.

Finally, I would like to highlight a recent hackathon [6], where I participated in a team of five. Over the course of nine days, our team came up with a tool called InnovAItor [7] that allows users to search and visualize the datasets mentioned above. It is still a work in progress, but I was proud to be a part of the team and grateful for the opportunity to work on this project.