Corpus-DB is a textual corpus database for the digital humanities. This project aggregates public domain texts, enriches their metadata with information from sources like Wikipedia, and makes the texts retrievable by that metadata. This makes it easy to download subcorpora such as:
- Bildungsromans
- Dickens novels
- Poetry published in the 1880s
- Novels set in London
Corpus-DB has several components:
- Scripts for aggregating metadata, written in Python
- The database, currently stored as a few SQLite files
- A REST API for querying the database, currently in progress
- Analytic experiments, mostly in Python
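As a rough illustration of what selecting a subcorpus by metadata could look like against a SQLite database, here is a minimal Python sketch. The `texts` table, its columns (`title`, `author`, `year`, `genre`, `setting`), the toy rows, and the helper names `build_demo_db` and `select_titles` are all assumptions made for this example; they are not Corpus-DB's actual schema or API.

```python
import sqlite3

# Hypothetical columns that may be used as filters (not Corpus-DB's real schema).
ALLOWED_COLUMNS = {"author", "year", "genre", "setting"}


def build_demo_db():
    """Create an in-memory database with a hypothetical schema and toy rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE texts ("
        "id INTEGER PRIMARY KEY, title TEXT, author TEXT, "
        "year INTEGER, genre TEXT, setting TEXT)"
    )
    # Simplified, illustrative metadata only.
    rows = [
        ("Great Expectations", "Charles Dickens", 1861, "Bildungsroman", "London"),
        ("Bleak House", "Charles Dickens", 1853, "Novel", "London"),
        ("Middlemarch", "George Eliot", 1871, "Novel", "Midlands"),
    ]
    conn.executemany(
        "INSERT INTO texts (title, author, year, genre, setting) "
        "VALUES (?, ?, ?, ?, ?)",
        rows,
    )
    conn.commit()
    return conn


def select_titles(conn, **criteria):
    """Return titles whose metadata matches every keyword criterion exactly."""
    unknown = set(criteria) - ALLOWED_COLUMNS
    if unknown:
        raise ValueError(f"Unsupported metadata fields: {sorted(unknown)}")
    sql = "SELECT title FROM texts"
    if criteria:
        # Column names come from the allow-list; values are bound as parameters.
        sql += " WHERE " + " AND ".join(f"{col} = ?" for col in criteria)
    sql += " ORDER BY title"
    return [row[0] for row in conn.execute(sql, tuple(criteria.values()))]


if __name__ == "__main__":
    conn = build_demo_db()
    print(select_titles(conn, author="Charles Dickens"))
    print(select_titles(conn, genre="Bildungsroman"))
```

Each keyword argument becomes one equality filter, so a subcorpus like "Dickens novels" or "novels set in London" reduces to a single parameterized query. A real schema would likely need richer matching (date ranges, multiple genres per text) than this sketch shows.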
Read more about the database in this introductory blog post. The scripts used to generate the database are in the gitenberg-experiments repo. Some usage examples can be found in the examples directory on GitHub.
Contributing
I could use some help with this, especially if you know Python or Haskell, have library or bibliography experience, or simply like books. Get in touch in the chat room, or contact me via email.