aya-se committed on
Commit 68b43b8 · verified · 1 Parent(s): a2b2429

Add the description of Swallow Corpus v2 to README.md

Files changed (1)
  1. README.md +6 -0
README.md CHANGED
@@ -119,6 +119,12 @@ The following datasets were used for continual pre-training.
  - [Swallow Corpus Version 2](https://arxiv.org/abs/2404.17733)
  - [The-stack-v2](https://huggingface.co/datasets/bigcode/the-stack-v2-train-smol-ids)
 
+ ### Swallow Corpus Version 2
+
+ We built the Swallow Corpus by extracting high-quality Japanese texts from Common Crawl. In Version 2, we expanded the scope of the Common Crawl collection and modified the pipeline sequence to enable more flexible quality filtering. For Llama 3.1 Swallow v0.2, we further enhanced quality filtering and sampling during training compared to v0.1, resulting in the use of even higher-quality Japanese texts.
+
+ Further details of the methodology and analysis will be provided in a forthcoming paper.
+
  ## Risks and Limitations
 
  The models released here are still in the early stages of our research and development and have not been tuned to ensure outputs align with human intent and safety considerations.