Fill-Mask · Transformers · Safetensors · Japanese · English · modernbert
hpprc committed (verified)
Commit 5875c2b · 1 Parent(s): 30e5b54

Update README.md

Files changed (1): README.md (+11, -3)
README.md CHANGED
@@ -23,13 +23,13 @@ Our ModernBERT-Ja-310M is trained on a high-quality corpus of Japanese and Engli
  You can use our models directly with the transformers library v4.48.0 or higher:

  ```bash
- pip install -U transformers>=4.48.0
+ pip install -U "transformers>=4.48.0"
  ```

  Additionally, if your GPUs support Flash Attention 2, we recommend using our models with Flash Attention 2.

  ```
- pip install flash-attn
+ pip install flash-attn --no-build-isolation
  ```

  ### Example Usage
@@ -55,6 +55,8 @@ for result in results:

  ## Model Series

+ We provide ModernBERT-Ja in several model sizes. Below is a summary of each model.
+
  |ID| #Param. | #Param.<br>w/o Emb.|Dim.|Inter. Dim.|#Layers|
  |-|-|-|-|-|-|
  |[sbintuitions/modernbert-ja-30m](https://huggingface.co/sbintuitions/modernbert-ja-30m)|37M|10M|256|1024|10|
@@ -62,6 +64,13 @@ for result in results:
  |[sbintuitions/modernbert-ja-130m](https://huggingface.co/sbintuitions/modernbert-ja-130m)|132M|80M|512|2048|19|
  |**[sbintuitions/modernbert-ja-310m](https://huggingface.co/sbintuitions/modernbert-ja-310m)**|315M|236M|768|3072|25|

+ For all models,
+ the vocabulary size is 102,400,
+ the head dimension is 64,
+ and the activation function is GELU.
+ Global attention and sliding-window attention are interleaved as 1 global layer followed by 2 local layers (global–local–local).
+ The sliding-window attention context size is 128, with global_rope_theta set to 160,000 and local_rope_theta set to 10,000.
+

  ## Model Description

@@ -168,7 +177,6 @@ For datasets with predefined `train`, `validation`, and `test` sets, we simply t
  The evaluation results are shown in the table.
  `#Param.` represents the number of parameters in both the input embedding layer and the Transformer layers, while `#Param. w/o Emb.` indicates the number of parameters in the Transformer layers only.

-
  Despite being a long-context model capable of processing sequences of up to 8,192 tokens, our ModernBERT-Ja-310M also exhibited strong performance in short-sequence evaluations.

  ## Ethical Considerations
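For readers applying the install changes above, here is a minimal sketch of enabling the Flash Attention 2 path when loading this checkpoint. It assumes the `attn_implementation` argument of `from_pretrained` in transformers ≥ 4.48, a GPU supported by FlashAttention 2, and half-precision weights; it is not taken from the model card.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Sketch only: requires the flash-attn package and a GPU that FlashAttention 2 supports.
model_id = "sbintuitions/modernbert-ja-310m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,               # FlashAttention 2 expects fp16/bf16 weights
    attn_implementation="flash_attention_2",  # omit this line to use the default attention
)
model.to("cuda")
```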
 
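The hunk context also references an `### Example Usage` section and a `for result in results:` loop. A minimal fill-mask sketch in that spirit, using the standard `transformers` pipeline API; the input sentence is illustrative and not taken from the README.

```python
from transformers import pipeline

# Fill-mask sketch; the mask token is read from the tokenizer rather than hard-coded.
fill_mask = pipeline("fill-mask", model="sbintuitions/modernbert-ja-310m")
mask = fill_mask.tokenizer.mask_token
results = fill_mask(f"日本の首都は{mask}です。")  # "The capital of Japan is <mask>."
for result in results:
    print(result["token_str"], round(result["score"], 4))
```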
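The paragraph added in the third hunk lists architecture settings shared across the series (vocabulary size 102,400, head dimension 64, GELU, a global–local–local attention pattern, a 128-token sliding window, global_rope_theta 160,000, local_rope_theta 10,000). A hedged way to check these against the released checkpoint follows; the attribute names are assumed from the ModernBERT configuration in transformers ≥ 4.48 and should be verified against the model's config.json.

```python
from transformers import AutoConfig

# Print the shared architecture settings described in the model card.
# Attribute names are assumptions; getattr() guards against any that differ.
config = AutoConfig.from_pretrained("sbintuitions/modernbert-ja-310m")
for name in (
    "vocab_size",
    "hidden_size",
    "intermediate_size",
    "num_hidden_layers",
    "global_attn_every_n_layers",
    "local_attention",
    "global_rope_theta",
    "local_rope_theta",
):
    print(f"{name}: {getattr(config, name, 'not present')}")
```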
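The final hunk's context reiterates that the model processes sequences of up to 8,192 tokens. A small sketch of capping inputs at that length with the standard tokenizer truncation options; the repeated string is only a stand-in for a real long document.

```python
from transformers import AutoTokenizer

# Tokenize an (illustrative) long document, truncating at the model's 8,192-token context.
tokenizer = AutoTokenizer.from_pretrained("sbintuitions/modernbert-ja-310m")
long_text = "これは長い文書の一部です。" * 4000  # stand-in for a real long document
encoded = tokenizer(long_text, truncation=True, max_length=8192, return_tensors="pt")
print(encoded["input_ids"].shape)  # at most (1, 8192)
```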