The Common Pile

We are a group of researchers working together to collect and curate openly licensed and public domain data for training large language models. So far, we have released:

If you're interested in contributing, please open an issue on GitHub!