ashish-soni08's Collections

Pre-Training-Data-for-LLMs

updated Aug 16, 2024

Open-source datasets that have been used to pre-train large language models


  • tiiuae/falcon-refinedweb

    Updated Jun 20, 2023 • 968M rows • 18k downloads • 855 likes

  • togethercomputer/RedPajama-Data-1T

    Updated Jun 17, 2024 • 1.73M rows • 2.93k downloads • 1.09k likes

  • mikex86/stackoverflow-posts

    Updated Aug 1, 2023 • 58.3M rows • 1.46k downloads • 54 likes