Releasing Common Corpus: the largest public domain dataset for training LLMs

Today we announce the release of Common Corpus on HuggingFace:

Common Corpus is the largest public domain dataset released for training LLMs.

Common Corpus includes 500 billion words from a wide variety of cultural heritage initiatives.

Common Corpus is multilingual and is the largest public domain corpus to date in English, French, Dutch, Spanish, German and Italian.

Common Corpus shows that it is possible to train fully open LLMs on sources free of copyright concerns.