Update README.md
README.md
CHANGED
@@ -23,13 +23,13 @@ Our ModernBERT-Ja-310M is trained on a high-quality corpus of Japanese and English
 You can use our models directly with the transformers library v4.48.0 or higher:
 
 ```bash
-pip install -U transformers>=4.48.0
+pip install -U "transformers>=4.48.0"
 ```
 
 Additionally, if your GPUs support Flash Attention 2, we recommend using our models with Flash Attention 2.
 
 ```
-pip install flash-attn
+pip install flash-attn --no-build-isolation
 ```
 
 ### Example Usage
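The quotes added around `transformers>=4.48.0` keep the shell from interpreting `>=` as a redirect. For reference, a minimal loading sketch that opts into the Flash Attention 2 path described in this hunk; the `attn_implementation` switch and half-precision dtype are standard transformers options rather than something taken from this README, and `AutoModelForMaskedLM` is an assumption about how the checkpoint is used:

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "sbintuitions/modernbert-ja-310m"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Flash Attention 2 needs a fp16/bf16 dtype and a supported GPU;
# drop attn_implementation to fall back to the default attention.
model = AutoModelForMaskedLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
```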
@@ -55,6 +55,8 @@ for result in results:
 
 ## Model Series
 
+We provide ModernBERT-Ja in several model sizes. Below is a summary of each model.
+
 |ID| #Param. | #Param.<br>w/o Emb.|Dim.|Inter. Dim.|#Layers|
 |-|-|-|-|-|-|
 |[sbintuitions/modernbert-ja-30m](https://huggingface.co/sbintuitions/modernbert-ja-30m)|37M|10M|256|1024|10|
@@ -62,6 +64,13 @@ for result in results:
 |[sbintuitions/modernbert-ja-130m](https://huggingface.co/sbintuitions/modernbert-ja-130m)|132M|80M|512|2048|19|
 |**[sbintuitions/modernbert-ja-310m](https://huggingface.co/sbintuitions/modernbert-ja-310m)**|315M|236M|768|3072|25|
 
+For all models,
+the vocabulary size is 102,400,
+the head dimension is 64,
+and the activation function is GELU.
+The configuration for global attention and sliding window attention consists of 1 global attention layer followed by 2 sliding window attention layers (global–local–local).
+The sliding window attention context size is 128, with global_rope_theta set to 160,000 and local_rope_theta set to 10,000.
+
 
 ## Model Description
 
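As a sketch of how the architecture details added in this hunk map onto the model configuration, they could be read back via `AutoConfig`; the field names below follow the ModernBERT configuration in transformers v4.48+ and are assumptions, not values quoted from this README:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("sbintuitions/modernbert-ja-310m")

# Assumed ModernBERT config field names; getattr with a default keeps the
# sketch safe if a name differs in a given transformers version.
for name in [
    "vocab_size",                  # 102,400
    "hidden_size",                 # 768 for the 310m model
    "num_attention_heads",         # head dim = hidden_size / num_attention_heads = 64
    "hidden_activation",           # GELU
    "global_attn_every_n_layers",  # global-local-local layer pattern
    "local_attention",             # sliding window context size (128)
    "global_rope_theta",           # 160,000
    "local_rope_theta",            # 10,000
]:
    print(f"{name}: {getattr(config, name, 'not present')}")
```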
@@ -168,7 +177,6 @@ For datasets with predefined `train`, `validation`, and `test` sets, we simply t
 The evaluation results are shown in the table.
 `#Param.` represents the number of parameters in both the input embedding layer and the Transformer layers, while `#Param. w/o Emb.` indicates the number of parameters in the Transformer layers only.
 
-
 Despite being a long-context model capable of processing sequences of up to 8,192 tokens, our ModernBERT-Ja-310M also exhibited strong performance in short-sequence evaluations.
 
 ## Ethical Considerations