Releasing Common Corpus: the largest public domain dataset for training LLMs

Today we are announcing the release of Common Corpus on HuggingFace:

  • Common Corpus is the largest public domain dataset released for training LLMs.
  • Common Corpus includes 500 billion words drawn from a wide range of cultural heritage initiatives.
  • Common Corpus is multilingual and is the largest corpus to date in English, French, Dutch, Spanish, German and Italian.
  • Common Corpus shows it is possible to train fully open LLMs on sources without copyright concerns.