Add the description of Swallow Corpus v2 to README.md
README.md
CHANGED
@@ -119,6 +119,12 @@ The following datasets were used for continual pre-training.
 - [Swallow Corpus Version 2](https://arxiv.org/abs/2404.17733)
 - [The-stack-v2](https://huggingface.co/datasets/bigcode/the-stack-v2-train-smol-ids)
 
+### Swallow Corpus Version 2
+
+We built the Swallow Corpus by extracting high-quality Japanese texts from Common Crawl. In Version 2, we expanded the scope of the Common Crawl collection and modified the pipeline sequence to enable more flexible quality filtering. For Llama 3.1 Swallow v0.2, we further enhanced quality filtering and sampling during training compared to v0.1, resulting in the use of even higher-quality Japanese texts.
+
+Further details of the methodology and analysis will be provided in a forthcoming paper.
+
 ## Risks and Limitations
 
 The models released here are still in the early stages of our research and development and have not been tuned to ensure outputs align with human intent and safety considerations.