Fill-Mask
Transformers
Safetensors
Japanese
English
modernbert
hpprc committed (verified)
Commit 72d1b14 · 1 Parent(s): 8d9e4cc

Update README.md

Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -14,7 +14,7 @@ This repository provides Japanese ModernBERT trained by [SB Intuitions](https://
  [ModernBERT](https://arxiv.org/abs/2412.13663) is a new variant of the BERT model that combines local and global attention, allowing it to handle long sequences while maintaining high computational efficiency.
  It also incorporates modern architectural improvements, such as [RoPE](https://arxiv.org/abs/2104.09864).
 
- Our ModernBERT-Ja-310M is trained on a high-quality corpus of Japanese and English text comprising **4.1T tokens**, featuring a vocabulary size of 102,400 and a sequence length of **8,192** tokens.
+ Our ModernBERT-Ja-310M is trained on a high-quality corpus of Japanese and English text comprising **4.09T tokens**, featuring a vocabulary size of 102,400 and a sequence length of **8,192** tokens.
 
 
  ## How to Use
@@ -81,7 +81,7 @@ Next, we conducted two phases of context length extension.
 
  1. **Pre-training**
   - Training with **3.51T tokens**, including Japanese and English data extracted from web corpora.
- - The sequence length is 1,024 with naive sequence packing.
+ - The sequence length is 1,024 with [best-fit packing](https://arxiv.org/abs/2404.10830).
  - Masking rate is **30%** (with 80-10-10 rule).
  2. **Context Extension (CE): Phase 1**
  - Training with **430B tokens**, comprising high-quality Japanese and English data.
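The context lines in the first hunk reference the README's "## How to Use" section without showing it. For readers landing on this diff, a minimal fill-mask sketch is given below; the repo id `sbintuitions/modernbert-ja-310m` and the use of the `fill-mask` pipeline are assumptions based on the model name and task tag rather than lines from this commit, and a recent transformers release with ModernBERT support is required.

```python
from transformers import pipeline

# Assumed repo id; confirm against the model card's "How to Use" section before running.
model_id = "sbintuitions/modernbert-ja-310m"
fill_mask = pipeline("fill-mask", model=model_id)

# Use the tokenizer's own mask token instead of hard-coding one.
text = f"日本の首都は{fill_mask.tokenizer.mask_token}です。"
for pred in fill_mask(text, top_k=3):
    print(pred["token_str"], pred["score"])
```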
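The substantive change in the second hunk replaces "naive sequence packing" with [best-fit packing](https://arxiv.org/abs/2404.10830). As a rough illustration of that technique under stated assumptions, and not of SB Intuitions' actual data pipeline, the sketch below packs tokenized documents into 1,024-token training sequences with a best-fit-decreasing heuristic; all names are illustrative.

```python
def best_fit_packing(doc_lengths, max_len=1024):
    """Pack tokenized documents (given by token count) into sequences of at most max_len tokens."""
    # 1) Split any document longer than max_len into max_len-sized chunks, so nothing is truncated.
    chunks = []
    for n in doc_lengths:
        chunks += [max_len] * (n // max_len)
        if n % max_len:
            chunks.append(n % max_len)

    # 2) Best-fit decreasing: place each chunk into the sequence whose leftover room is smallest but still fits.
    sequences = []   # chunk lengths packed into each training sequence
    space = []       # remaining capacity of each sequence
    for size in sorted(chunks, reverse=True):
        fits = [i for i, s in enumerate(space) if s >= size]
        if fits:
            i = min(fits, key=lambda j: space[j])   # tightest fit
            sequences[i].append(size)
            space[i] -= size
        else:
            sequences.append([size])                # open a new 1,024-token sequence
            space.append(max_len - size)
    return sequences

# Example: five documents of various token counts packed into 1,024-token sequences.
print(best_fit_packing([900, 700, 300, 120, 2048]))
```

Compared with naive packing, this keeps documents intact and reduces padding, which is the motivation given in the cited paper.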
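The unchanged bullet about a 30% masking rate with the 80-10-10 rule follows standard BERT-style masked language modeling: of the selected positions, 80% are replaced by the mask token, 10% by a random token, and 10% are kept unchanged but still predicted. The sketch below shows that rule in generic PyTorch; the function name and special-token ids are placeholders, not taken from this repository, and special tokens are not excluded for brevity.

```python
import torch

def bert_style_masking(input_ids, mask_token_id, vocab_size, mlm_prob=0.30):
    """Illustrative MLM masking with the 80-10-10 rule (not the repository's training code)."""
    labels = input_ids.clone()
    selected = torch.rand(input_ids.shape) < mlm_prob      # choose ~30% of positions
    labels[~selected] = -100                                # compute loss only on selected positions

    masked = input_ids.clone()

    # 80% of selected positions -> mask token
    to_mask = selected & (torch.rand(input_ids.shape) < 0.8)
    masked[to_mask] = mask_token_id

    # 10% of selected positions -> random token (half of the remaining 20%)
    to_random = selected & ~to_mask & (torch.rand(input_ids.shape) < 0.5)
    random_tokens = torch.randint(vocab_size, input_ids.shape)
    masked[to_random] = random_tokens[to_random]

    # the final 10% keep their original token but are still predicted
    return masked, labels

# Toy example: vocabulary size of 102,400 as in the model card; the mask id is a placeholder.
ids = torch.randint(102_400, (1, 16))
masked_ids, labels = bert_style_masking(ids, mask_token_id=4, vocab_size=102_400)
```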