Corpus-DB is a textual corpus database for the digital humanities. The project aggregates public-domain texts, enriches their metadata from sources like Wikipedia, and makes the texts retrievable by that metadata. This makes it easy to download subcorpora like:

Bildungsromans
Dickens novels
Poetry published in the 1880s
Novels set in London

Corpus-DB has several components:

Scripts for aggregating metadata, written in Python
The database, currently a few SQLite databases
A REST API for querying the database, currently in progress
Analytic experiments, mostly in Python

Read more about the database in the introductory blog post. The scripts used to generate the database are in the gitenberg-experiments repo. Some usage examples can be found in the examples directory on GitHub.
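
As a rough illustration of selecting a subcorpus by metadata, here is a minimal sketch using Python's standard sqlite3 module. The table name `text`, its columns (`title`, `author`, `genre`, `year`), and the helper functions `build_demo_db` and `select_subcorpus` are all assumptions made for this example; the actual Corpus-DB schema may look quite different. The sketch builds a small in-memory database so it can run without the real data files.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a HYPOTHETICAL schema.

    The real Corpus-DB tables and column names may differ.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE text ("
        "id INTEGER PRIMARY KEY, title TEXT, author TEXT, "
        "genre TEXT, year INTEGER)"
    )
    # A few illustrative rows, not taken from the actual database.
    conn.executemany(
        "INSERT INTO text (title, author, genre, year) VALUES (?, ?, ?, ?)",
        [
            ("Great Expectations", "Dickens, Charles", "novel", 1861),
            ("Bleak House", "Dickens, Charles", "novel", 1853),
            ("Middlemarch", "Eliot, George", "novel", 1871),
            ("Underwoods", "Stevenson, Robert Louis", "poetry", 1887),
        ],
    )
    conn.commit()
    return conn


def select_subcorpus(conn, author=None, genre=None, year_from=None, year_to=None):
    """Return titles matching every filter that was supplied."""
    clauses, params = [], []
    if author is not None:
        clauses.append("author = ?")
        params.append(author)
    if genre is not None:
        clauses.append("genre = ?")
        params.append(genre)
    if year_from is not None:
        clauses.append("year >= ?")
        params.append(year_from)
    if year_to is not None:
        clauses.append("year <= ?")
        params.append(year_to)
    # Only fixed clause strings are joined here; values go through
    # placeholders, so user input never ends up in the SQL text.
    where = " AND ".join(clauses) if clauses else "1 = 1"
    query = f"SELECT title FROM text WHERE {where} ORDER BY title"
    return [row[0] for row in conn.execute(query, params)]


if __name__ == "__main__":
    conn = build_demo_db()
    print(select_subcorpus(conn, author="Dickens, Charles"))
    print(select_subcorpus(conn, genre="poetry", year_from=1880, year_to=1889))
```

Combining optional filters this way mirrors the kinds of subcorpora listed above (an author's novels, poetry from a given decade), though a filter such as "novels set in London" would need additional metadata columns that this sketch does not model.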

Contributing

I could use some help with this, especially if you know Python or Haskell, have library or bibliography experience, or simply like books. Get in touch in the chat room, or contact me via email.
