Corpus DB

Welcome to the Corpus-DB Project, a textual corpus database for the digital humanities.


This project aggregates public domain texts, enriches their metadata with data from sources like Wikipedia, and makes those texts available according to that metadata. This makes it easy to download subcorpora like:

Corpus-DB has several components:

Read more about the database in this introductory blog post. The scripts used to generate the database are in the gitenberg-experiments repo, and some usage examples can be found in the examples directory on GitHub.
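To give a rough idea of what "making texts available according to their metadata" can mean in practice, here is a minimal Python sketch of selecting a subcorpus by metadata. It does not use the Corpus-DB code or API: the `Text` record, its fields, the `select_subcorpus` helper, and the toy records are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Text:
    """Hypothetical record: one public domain text plus a little metadata."""
    title: str
    author: str
    year: int
    subjects: tuple = ()


# Toy records for illustration only; this is not data pulled from Corpus-DB.
TEXTS = [
    Text("Bleak House", "Charles Dickens", 1853, ("Legal stories",)),
    Text("Great Expectations", "Charles Dickens", 1861, ("Bildungsromans",)),
    Text("Middlemarch", "George Eliot", 1871, ("Domestic fiction",)),
    Text("Dracula", "Bram Stoker", 1897, ("Gothic fiction",)),
]


def select_subcorpus(texts, author=None, subject=None, year_range=None):
    """Return the texts that match every criterion that was supplied.

    A criterion left as None is ignored. year_range is an inclusive
    (start, end) pair.
    """
    selected = []
    for text in texts:
        if author is not None and text.author != author:
            continue
        if subject is not None and subject not in text.subjects:
            continue
        if year_range is not None:
            start, end = year_range
            if not (start <= text.year <= end):
                continue
        selected.append(text)
    return selected


if __name__ == "__main__":
    # Example: every text in the toy list by one author.
    for text in select_subcorpus(TEXTS, author="Charles Dickens"):
        print(text.year, text.title)
```

The same idea extends to any metadata field the database records: each added criterion narrows the selection, and the result is a list of texts that can be handed to whatever analysis comes next.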

Contributing

I could use some help with this, especially if you know Python or Haskell, have library or bibliography experience, or simply like books. Get in touch in the chat room, or contact me via email.