Update README.md
README.md
CHANGED
@@ -14,7 +14,7 @@ This repository provides Japanese ModernBERT trained by [SB Intuitions](https://
 [ModernBERT](https://arxiv.org/abs/2412.13663) is a new variant of the BERT model that combines local and global attention, allowing it to handle long sequences while maintaining high computational efficiency.
 It also incorporates modern architectural improvements, such as [RoPE](https://arxiv.org/abs/2104.09864).
 
-Our ModernBERT-Ja-310M is trained on a high-quality corpus of Japanese and English text comprising **4.
+Our ModernBERT-Ja-310M is trained on a high-quality corpus of Japanese and English text comprising **4.09T tokens**, featuring a vocabulary size of 102,400 and a sequence length of **8,192** tokens.
 
 
 ## How to Use
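The hunk above ends at the `## How to Use` heading. For readers of this diff, usage reduces to the standard `transformers` fill-mask workflow; a minimal sketch, assuming the `sbintuitions/modernbert-ja-310m` checkpoint id and a `transformers` release recent enough to include ModernBERT (the example sentence is illustrative):

```python
from transformers import pipeline

# Fill-mask pipeline; checkpoint id assumed to be sbintuitions/modernbert-ja-310m.
fill_mask = pipeline("fill-mask", model="sbintuitions/modernbert-ja-310m")

# Read the mask token from the tokenizer instead of hardcoding it.
mask = fill_mask.tokenizer.mask_token
for pred in fill_mask(f"日本の首都は{mask}です。"):  # "The capital of Japan is <mask>."
    print(pred["token_str"], pred["score"])
```

Reading the mask token from the tokenizer avoids hardcoding `[MASK]` versus `<mask>`, which varies across checkpoints.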
@@ -81,7 +81,7 @@ Next, we conducted two phases of context length extension.
 
 1. **Pre-training**
   - Training with **3.51T tokens**, including Japanese and English data extracted from web corpora.
-  - The sequence length is 1,024 with
+  - The sequence length is 1,024 with [best-fit packing](https://arxiv.org/abs/2404.10830).
   - Masking rate is **30%** (with 80-10-10 rule).
 2. **Context Extension (CE): Phase 1**
   - Training with **430B tokens**, comprising high-quality Japanese and English data.