The Common Pile

We are a group of researchers working together to collect and curate openly licensed and public domain data for training large language models. So far, we have released:

The Common Pile v0.1, an 8 TB dataset of text from over 30 diverse sources
Our paper: The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text
Comma v0.1-1T and Comma v0.1-2T, 7B-parameter LLMs trained on text from the Common Pile v0.1
The dataset used to train the Comma v0.1 models
Our code for collecting data from each source

If you're interested in contributing, please open an issue on GitHub!