Fill-Mask · Transformers · Safetensors · Japanese · English · modernbert
hpprc committed (verified)
Commit 5875c2b · 1 Parent(s): 30e5b54

Update README.md

Files changed (1): README.md (+11, -3)
README.md CHANGED
@@ -23,13 +23,13 @@ Our ModernBERT-Ja-310M is trained on a high-quality corpus of Japanese and Engli
  You can use our models directly with the transformers library v4.48.0 or higher:

  ```bash
- pip install -U transformers>=4.48.0
+ pip install -U "transformers>=4.48.0"
  ```

  Additionally, if your GPUs support Flash Attention 2, we recommend using our models with Flash Attention 2.

  ```
- pip install flash-attn
+ pip install flash-attn --no-build-isolation
  ```

  ### Example Usage
@@ -55,6 +55,8 @@ for result in results:

  ## Model Series

+ We provide ModernBERT-Ja in several model sizes. Below is a summary of each model.
+
  |ID| #Param. | #Param.<br>w/o Emb.|Dim.|Inter. Dim.|#Layers|
  |-|-|-|-|-|-|
  |[sbintuitions/modernbert-ja-30m](https://huggingface.co/sbintuitions/modernbert-ja-30m)|37M|10M|256|1024|10|
@@ -62,6 +64,13 @@ for result in results:
  |[sbintuitions/modernbert-ja-130m](https://huggingface.co/sbintuitions/modernbert-ja-130m)|132M|80M|512|2048|19|
  |**[sbintuitions/modernbert-ja-310m](https://huggingface.co/sbintuitions/modernbert-ja-310m)**|315M|236M|768|3072|25|

+ For all models,
+ the vocabulary size is 102,400,
+ the head dimension is 64,
+ and the activation function is GELU.
+ Global attention and sliding-window attention are interleaved as 1 global layer followed by 2 local layers (global–local–local).
+ The sliding-window attention context size is 128, with global_rope_theta set to 160,000 and local_rope_theta set to 10,000.
+

  ## Model Description

@@ -168,7 +177,6 @@ For datasets with predefined `train`, `validation`, and `test` sets, we simply t
  The evaluation results are shown in the table.
  `#Param.` represents the number of parameters in both the input embedding layer and the Transformer layers, while `#Param. w/o Emb.` indicates the number of parameters in the Transformer layers only.

-
  Despite being a long-context model capable of processing sequences of up to 8,192 tokens, our ModernBERT-Ja-310M also exhibited strong performance in short-sequence evaluations.

  ## Ethical Considerations
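For readers applying the install changes above, here is a minimal sketch of enabling the Flash Attention 2 path when loading this checkpoint. It assumes the `attn_implementation` argument of `from_pretrained` in transformers ≥ 4.48, a GPU supported by FlashAttention 2, and half-precision weights; it is not taken from the model card.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Sketch only: requires the flash-attn package and a GPU that FlashAttention 2 supports.
model_id = "sbintuitions/modernbert-ja-310m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,               # FlashAttention 2 expects fp16/bf16 weights
    attn_implementation="flash_attention_2",  # omit this line to use the default attention
)
model.to("cuda")
```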
 
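The hunk context also references an `### Example Usage` section and a `for result in results:` loop. A minimal fill-mask sketch in that spirit, using the standard `transformers` pipeline API; the input sentence is illustrative and not taken from the README.

```python
from transformers import pipeline

# Fill-mask sketch; the mask token is read from the tokenizer rather than hard-coded.
fill_mask = pipeline("fill-mask", model="sbintuitions/modernbert-ja-310m")
mask = fill_mask.tokenizer.mask_token
results = fill_mask(f"日本の首都は{mask}です。")  # "The capital of Japan is <mask>."
for result in results:
    print(result["token_str"], round(result["score"], 4))
```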
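The paragraph added in the third hunk lists architecture settings shared across the series (vocabulary size 102,400, head dimension 64, GELU, a global–local–local attention pattern, a 128-token sliding window, global_rope_theta 160,000, local_rope_theta 10,000). A hedged way to check these against the released checkpoint follows; the attribute names are assumed from the ModernBERT configuration in transformers ≥ 4.48 and should be verified against the model's config.json.

```python
from transformers import AutoConfig

# Print the shared architecture settings described in the model card.
# Attribute names are assumptions; getattr() guards against any that differ.
config = AutoConfig.from_pretrained("sbintuitions/modernbert-ja-310m")
for name in (
    "vocab_size",
    "hidden_size",
    "intermediate_size",
    "num_hidden_layers",
    "global_attn_every_n_layers",
    "local_attention",
    "global_rope_theta",
    "local_rope_theta",
):
    print(f"{name}: {getattr(config, name, 'not present')}")
```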
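The final hunk's context reiterates that the model processes sequences of up to 8,192 tokens. A small sketch of capping inputs at that length with the standard tokenizer truncation options; the repeated string is only a stand-in for a real long document.

```python
from transformers import AutoTokenizer

# Tokenize an (illustrative) long document, truncating at the model's 8,192-token context.
tokenizer = AutoTokenizer.from_pretrained("sbintuitions/modernbert-ja-310m")
long_text = "これは長い文書の一部です。" * 4000  # stand-in for a real long document
encoded = tokenizer(long_text, truncation=True, max_length=8192, return_tensors="pt")
print(encoded["input_ids"].shape)  # at most (1, 8192)
```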