imos committed (verified)
Commit e10be72 · 1 Parent(s): b3176ac

Upload folder using huggingface_hub
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ tokenizer.jsonl filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,135 @@
1
+ ---
2
+ license: other
3
+ license_name: plamo-community-license
4
+ license_link: https://huggingface.co/pfnet/plamo-2-8b/blob/main/LICENSE/ja
5
+ language:
6
+ - en
7
+ - ja
8
+ pipeline_tag: text-generation
9
+ library_name: transformers
10
+ extra_gated_heading: PLaMo community license to download PLaMo 2 8B
11
+ extra_gated_description: To download PLaMo 2 8B, you have to agree to our license.
12
+ PLaMo 2 8B is released under the PLaMo community license. For non-commercial use, please contact
13
+ us via this [form](https://forms.gle/mTL8tBLrMYXKNZD56).
14
+ extra_gated_button_content: agree to PLaMo community license
15
+ extra_gated_prompt: "(English version is under construction. We apologize for the\
16
+ \ inconvenience.)\n### PLaMoコミュニティライセンス契約\nPLaMoコミュニティライセンス契約には、株式会社Preferred Networksが提供する別途定める大規模言語基盤モデルPLaMo及びその派生物を利用するためのライセンスの内容及びユーザーが遵守する事項等が定められている。ユーザーのPLaMo及びその派生物の利用には本契約が適用され、本契約に同意又は本モデル等を利用することにより、ユーザーは本契約に拘束される。\n\
17
+ #### 第1条(定義)\n(1) 「本契約」とは、PLaMoコミュニティライセンス契約を意味する。\n(2) 「PFN」とは、株式会社Preferred Networksを意味する。\n\
18
+ (3) 「本モデル」とは、別途定める「PLaMo」という名称のモデルの重み、モデルコード、トークナイザー、学習スクリプト及びこれらに付随してPFNが提供するものを意味する。\n\
19
+ (4) 「ユーザー」とは、本モデルを利用する個人又は法人を意味する。\n(5) 「派生モデル」とは、本モデルを改変又は利用し作成されるモデルの重み、モデルコード及びその他作成されたモデルの付随物を意味する。\n\
20
+ (6) 「生成物」とは、本モデル又は派生モデルの出力結果を意味する。\n(7) 「本モデル等」とは、本モデル、派生モデル及び生成物の総称を意味する。\n(8)\
21
+ \ 「本ライセンス」とは、PFNがユーザーに対して本契約に基づき本モデル等を利用することを許諾することを意味する。\n(9) 「商業目的」とは、 私的使用又は学術用途の範囲を超える、事業での利用又は営利を目的とする利用を意味する。なお、商業目的にはユーザーの製品、サービス又は事業の開発、変更又は提供(ホスティングサービスやAPI経由での提供を含む。)を目的とした使用及びユーザーの組織内部における利用も含まれる。\n\
22
+ #### 第2条(ユーザー)\nユーザーは、18歳以上又はその居住国で単独で契約を締結できる年齢に達していなければならない。但し、ユーザーの親権者又は法定代理人が本契約をユーザーが締結することに同意している場合はこの限りではない。\n\
23
+ #### 第3条(本ライセンス)\n(1) PFNは、ユーザーが本契約に同意しかつ本契約を遵守することを条件に、ユーザーに対して、本モデル等を本契約に定める条件及び範囲内で利用することを許諾する。\n\
24
+ (2) 本ライセンスは非独占、世界的、譲渡不可及びロイヤリティ無料とする。\n(3) ユーザーは、以下の条件をいずれも満たす場合に限り、商業目的を含む形で本モデル等を利用することができる。なお、ユーザーがこれらの条件のいずれかを満たさなくなった場合は、ユーザーはその時点で本モデル等を商業目的で利用することはできず、商業目的で本モデル等を利用したい場合は、新たにPFNから商業用のライセンスを取得しなければならない。\n\
25
+ \n (i) PFNの公式登録ページ https://forms.gle/mTL8tBLrMYXKNZD56 に事前に登録すること。\n\n (ii) ユーザー又はその関係会社の直近事業年度の収入又は売上が10億円(ユーザーの現地通貨換算額)を超えないこと。\n\
26
+ \n#### 第4条(再配布及び表示義務)\n(1) ユーザーが本モデル等(派生モデルやその生成物を含む)を第三者に提供する場合、以下の条件を満たさなければならない。\n\
27
+ \n (i) 本契約のコピーを提供し、本契約の条件を遵守させること。\n\n (ii) 「Built with PLaMo」と明示し、関連ウェブサイト、ユーザーインターフェース、ブログ記事、製品情報ページ又は製品ドキュメントに記載すること。\n\
28
+ \n (iii) 本モデル等を利用して作成した AI モデルの名称に「PLaMo」を含めること。\n\n#### 第5条(生成物の利用)\n(1) ユーザーは、生成物を本モデル又は派生モデルの生成物であることを明示することを条件に、公表することができる。\n\
29
+ (2) 生成物を利用してモデルを学習した場合、そのモデルは派生モデルとして本契約の条件が適用され、本契約のライセンス条件の下でのみ利用、配布及び商業化することができる。\n\
30
+ #### 第6条(その他利用条件)\nユーザーは、本モデル等の利用に関して、以下に定める行為をしてはならない。\n(1) 法令又は公序良俗に違反する行為\n(2)\
31
+ \ 犯罪行為又はこれを予告、関与、助長その他これらに関連する行為\n(3) PFN又は第三者の権利又は利益を侵害する行為\n(4) PFN又は第三者の名誉若しくは信用を毀損する行為\n\
32
+ (5) 生成物がPFNの公式見解等であるものという錯誤を生む情報を流布する行為\n(6) 虚偽の情報を発信する行為\n(7) 上記の他、PFNが不適切と合理的に判断する行為\n\
33
+ #### 第7条(保証の否認)\n(1) 本モデル及び生成物は、「現状有姿」で提供され、PFNは、これらに対して、正確性、真実性、商品性、品質、性能、特定目的への適合性、権利の非侵害など一切の保証をしない。\n\
34
+ (2) ユーザーは、法律、医療、金融又は人物評価その他重要な事項の決定に関して、生成物を唯一の証拠、評価又は意見として使用してはならない。\n(3) ユーザーは、本モデル等の使用及びその結果に関して全ての責任を負う。\n\
35
+ #### 第8条(責任の制限)\n(1) 契約責任、不法行為又は製造物責任その他の法的責任のいずれかであるかを問わず、PFNが本契約及び本モデル等に関してユーザーに対して負う損害賠償の責任は、通常かつ直接の損害に限り(逸失利益、特別損害、間接損害その他の損害については、その予見可能性の有無に関わらず、責任を負わない。)、損害賠償額の上限は、500円とする。但し、PFNに故意又は重過失が認められる場合はこの限りではない。\n\
36
+ (2) 前項に関わらず、ユーザーが本モデル等を事業のために利用する場合は、PFNは本契約及び本モデル等に関してユーザーに対して一切の損害賠償責任及びその他の責任を負わない。\n\
37
+ #### 第9条(ユーザーの責任)\n(1) ユーザーは、本モデル等の取得及び利用に関して、適用される法令(輸出入及び貿易に関連する法令を含む。)及び本契約を遵守する。\n\
38
+ (2) ユーザーは、本契約違反又は本モデル等の使用によって、PFNに損害を与えた場合は、その損害を賠償する。\n(3) ユーザーの本モデル等の使用に起因して、PFNが第三者から損害賠償請求その他請求を受けた場合、ユーザーは、当該請求からPFNを免責し、PFNに損害を与えないようにする。\n\
39
+ #### 第10条(権利の帰属)\n(1) 本モデルの一切の権利は、PFN又はPFNに本モデルのライセンスをしている第三者に帰属する。\n(2) 派生モデルのうち、ユーザーが本モデルを改変した部分の権利はユーザーに帰属し、その他の部分の権利はPFNに帰属する。\n\
40
+ (3) 生成物の一切の権利はユーザーに帰属する。\n#### 第11条(契約期間及び終了)\n(1) 本契約は、ユーザーが本契約に同意したとき又は本モデルにアクセスしたときから、本契約が解約されたときまでとする。\n\
41
+ (2) ユーザーが本契約のいずれかの条項に違反した場合、PFNは直ちに本契約を解除することができ、ユーザーは本モデル等のすべてのコピーを削除し、利用を即時に停止しなければならない。\n\
42
+ #### 第12条(契約の変更)\nPFNは、本契約(本モデル等に関するルールや諸規定等を含む。以下本条において同じ。)を変更できるものとする。PFNは、本契約を変更する場合には、変更の内容及び変更の効力発生時期を、当該効力発生時期までにPFN所定の方法で告知するものとする。\n\
43
+ #### 第13条(準拠法及び管轄裁判所)\n(1) 本契約の準拠法は日本法とする。\n(2) 本モデル等及び本契約に起因する紛争については、東京地方裁判所が専属的合意管轄裁判所とする。"
44
+ base_model: pfnet/plamo-2-8b
45
+ tags:
46
+ - plamo
47
+ - translation
48
+ ---
49
+
50
+ # PLaMo Translation Model
51
+
52
+ PLaMo翻訳モデルはPreferred Networksによって開発された翻訳向け特化型大規模言語モデルです。
53
+ 詳しくは[ブログ記事](https://tech.preferred.jp/ja/blog/plamo-translate/)および[プレスリリース](https://www.preferred.jp/ja/news/pr20250527/)を参照してください。
54
+
55
+ PLaMo Translation Model is a specialized large-scale language model developed by Preferred Networks for translation tasks.
56
+ For details, please refer to the [blog post](https://tech.preferred.jp/ja/blog/plamo-translate/) and [press release](https://www.preferred.jp/ja/news/pr20250527/).
57
+
58
+ List of models:
59
+ - [plamo-2-translate](http://huggingface.co/pfnet/plamo-2-translate) ... Post-trained model for translation
60
+ - [plamo-2-translate-base](http://huggingface.co/pfnet/plamo-2-translate-base) ... Base model for translation
61
+ - [plamo-2-translate-eval](http://huggingface.co/pfnet/plamo-2-translate-eval) ... Pair-wise evaluation model
62
+
63
+ PLaMo Translation Model is released under the PLaMo community license. Please check the following license and agree to it before downloading.
64
+
65
+ - (EN) under construction: we apologize for the inconvenience
66
+ - (JA) https://www.preferred.jp/ja/plamo-community-license/
67
+
68
+ **NOTE**: This model has **NOT** been instruction-tuned for chat dialog or other downstream tasks.
69
+
70
+ ### For *commercial* users
71
+
72
+ Please check the PLaMo community license and contact us via the following form to use it for commercial purposes.
73
+
74
+ - (EN/JA) https://forms.gle/mTL8tBLrMYXKNZD56
75
+
76
+ ## Usage
77
+
78
+ ### main/base model
79
+
80
+ ```py
81
+ import vllm
82
+
83
+ # max_model_len/max_num_batched_tokens can be increased when running on a GPU with substantial memory.
84
+ # NOTE: Switch to "pfnet/plamo-2-translate-base" to try the base model.
85
+ llm = vllm.LLM(model="pfnet/plamo-2-translate", trust_remote_code=True, max_model_len=2000, max_num_batched_tokens=2000)
86
+
87
+ prompt = r'''<|plamo:op|>dataset
88
+ translation
89
+ <|plamo:op|>input lang=English
90
+ Write the text to be translated here.
91
+ <|plamo:op|>output lang=Japanese
92
+ '''
93
+
94
+ responses = llm.generate([prompt] * 1, sampling_params=vllm.SamplingParams(temperature=0, max_tokens=1024, stop=["<|plamo:op|>"]))
95
+ # NOTE: This outputs "ここに翻訳するテキストを入力してください。".
96
+ print(responses[0].outputs[0].text)
97
+ ```
98
+
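The vLLM snippet above is the documented path. As a hedged aside (not from the model card): since `config.json` maps `AutoModelForCausalLM` to this repository's `modeling_plamo.Plamo2ForCausalLM`, the same prompt format should also be usable through Hugging Face `transformers` with `trust_remote_code=True`. The following is a minimal sketch under that assumption; the generation arguments and the manual stop handling are illustrative, not an official recipe.

```py
# Minimal sketch, assuming the remote code loads via AutoModelForCausalLM
# (see auto_map in config.json). Requires transformers, torch, and accelerate
# (for device_map="auto").
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "pfnet/plamo-2-translate"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name, trust_remote_code=True, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = (
    "<|plamo:op|>dataset\n"
    "translation\n"
    "<|plamo:op|>input lang=English\n"
    "Write the text to be translated here.\n"
    "<|plamo:op|>output lang=Japanese\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=1024, do_sample=False)
generated = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False)
# vLLM stops at "<|plamo:op|>"; here the output is trimmed manually instead.
print(generated.split("<|plamo:op|>")[0])
```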
99
+ ### evaluation model
100
+
101
+ ```py
102
+ import vllm
103
+
104
+ # max_model_len/max_num_batched_tokens can be increased when running on a GPU with substantial memory.
105
+ llm = vllm.LLM(model="pfnet/plamo-2-translate-eval", trust_remote_code=True, max_model_len=2000, max_num_batched_tokens=2000)
106
+
107
+ prompt = r'''<|plamo:op|>dataset
108
+ translation evaluation
109
+ <|plamo:op|>input lang=English
110
+ This is an apple.
111
+ <|plamo:op|>output id=A lang=Japanese
112
+ これはりんごです。
113
+ <|plamo:op|>output id=B lang=Japanese
114
+ これはリンゴです。
115
+ <|plamo:op|>best
116
+ id='''
117
+
118
+ responses = llm.generate([prompt] * 1, sampling_params=vllm.SamplingParams(temperature=0, max_tokens=1, stop=["<|plamo:op|>"]))
119
+ # NOTE: This outputs "A".
120
+ print(responses[0].outputs[0].text)
121
+ ```
122
+
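For repeated comparisons, the single-token answer above can be wrapped in a small helper. This is only a sketch built from the prompt format shown in the example; the function name `prefer` and its defaults are invented here and are not part of the repository.

```py
# Illustrative helper around the pair-wise evaluation prompt shown above.
import vllm

llm = vllm.LLM(model="pfnet/plamo-2-translate-eval", trust_remote_code=True,
               max_model_len=2000, max_num_batched_tokens=2000)

def prefer(source: str, candidate_a: str, candidate_b: str,
           src_lang: str = "English", tgt_lang: str = "Japanese") -> str:
    prompt = (
        "<|plamo:op|>dataset\n"
        "translation evaluation\n"
        f"<|plamo:op|>input lang={src_lang}\n{source}\n"
        f"<|plamo:op|>output id=A lang={tgt_lang}\n{candidate_a}\n"
        f"<|plamo:op|>output id=B lang={tgt_lang}\n{candidate_b}\n"
        "<|plamo:op|>best\nid="
    )
    responses = llm.generate([prompt], sampling_params=vllm.SamplingParams(
        temperature=0, max_tokens=1, stop=["<|plamo:op|>"]))
    return responses[0].outputs[0].text.strip()  # expected to be "A" or "B"

print(prefer("This is an apple.", "これはりんごです。", "これはリンゴです。"))
```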
123
+ ## Bias, Risks, and Limitations
124
+
125
+ PLaMo Translation Model is a new technology that carries risks with use. Testing conducted to date has been in English and Japanese, and has not covered, nor could it cover all scenarios. For these reasons, as with all LLMs, PLaMo Translation Model’s potential outputs cannot be predicted in advance, and the model may in some instances produce inaccurate, biased or other objectionable responses to user prompts. Therefore, before deploying any applications of PLaMo Translation Model, developers should perform safety testing and tuning tailored to their specific applications of the model.
126
+
127
+ ## Acknowledgement
128
+
129
+ This model was trained under the project “Research and Development Project of the Enhanced Infrastructures for Post 5G Information and Communication System” (JPNP 20017), subsidized by the New Energy and Industrial Technology Development Organization (NEDO).
130
+
131
+
132
+ ## AI policies for Preferred Networks, Inc. group
133
+
134
+ - (EN) https://www.preferred.jp/en/company/aipolicy/
135
+ - (JA) https://www.preferred.jp/ja/company/aipolicy/
config.json ADDED
@@ -0,0 +1,44 @@
1
+ {
2
+ "architectures": [
3
+ "Plamo2ForCausalLM"
4
+ ],
5
+ "attention_window_size": 32768,
6
+ "auto_map": {
7
+ "AutoConfig": "modeling_plamo.Plamo2Config",
8
+ "AutoModelForCausalLM": "modeling_plamo.Plamo2ForCausalLM"
9
+ },
10
+ "bos_token_id": 1,
11
+ "eos_token_id": 1,
12
+ "eval_attention_n_bit": null,
13
+ "eval_mlp_n_bit": null,
14
+ "fp8_accum_dtype": "bfloat16",
15
+ "full_attention_idx": [],
16
+ "hidden_size": 4096,
17
+ "hidden_size_per_head": 128,
18
+ "image_feature_size": null,
19
+ "image_proj_type": "linear",
20
+ "image_token_id": null,
21
+ "intermediate_size": 16384,
22
+ "linear_type": "normal",
23
+ "mamba_chunk_size": 256,
24
+ "mamba_d_conv": 4,
25
+ "mamba_d_state": 64,
26
+ "mamba_enabled": true,
27
+ "mamba_num_heads": 64,
28
+ "mamba_step": 2,
29
+ "max_position_embeddings": 10485760,
30
+ "model_type": "plamo2",
31
+ "num_attention_heads": 32,
32
+ "num_hidden_layers": 32,
33
+ "num_key_value_heads": 4,
34
+ "pad_token_id": 3,
35
+ "rms_norm_eps": 1e-06,
36
+ "rope_local_theta": 1000000.0,
37
+ "rope_theta": 1000000.0,
38
+ "sliding_window": 32768,
39
+ "tokenizer_class": "Plamo2Tokenizer",
40
+ "torch_dtype": "bfloat16",
41
+ "transformers_version": "4.46.2",
42
+ "use_cache": false,
43
+ "vocab_size": 100032
44
+ }
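A hedged reading of the configuration above: the `auto_map` entries route `AutoConfig` and `AutoModelForCausalLM` to the custom classes in `modeling_plamo.py` when `trust_remote_code=True` is passed, and `mamba_enabled`/`mamba_step` describe the hybrid layout in which Mamba (SSM) mixers alternate with sliding-window attention layers (the alternation is visible in the weight map further below). The snippet is only a sketch of inspecting those fields, not part of the repository.

```py
# Sketch: load and inspect the custom Plamo2Config via the auto_map above.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("pfnet/plamo-2-translate", trust_remote_code=True)
print(config.model_type)             # "plamo2"
print(config.mamba_enabled)          # True: Mamba (SSM) mixers are enabled
print(config.mamba_step)             # 2: Mamba and attention layers alternate
print(config.attention_window_size)  # 32768-token sliding window for attention layers
print(config.num_hidden_layers, config.hidden_size, config.vocab_size)
```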
generation_config.json ADDED
@@ -0,0 +1,7 @@
1
+ {
2
+ "_from_model_config": true,
3
+ "bos_token_id": 1,
4
+ "eos_token_id": 1,
5
+ "pad_token_id": 3,
6
+ "transformers_version": "4.46.2"
7
+ }
model-00001-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:707a7c1f9bb4982b098864d7215a3985fa3f026f2ebd0a82799fea72e807349f
3
+ size 4771172576
model-00002-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ab65beeb0731f66b6542cbfb3b6967f72e12386f55bd2234d690cdcda991ac4f
3
+ size 4964802416
model-00003-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:88a4bb84ec3e982f16497c375013e7efdc07890a8cd7126be0bcdefe6193a9f7
3
+ size 4832590896
model-00004-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a90408e0d462b199de23f1e29730d1608c3c425d384c6a3c0dc34fbeea9b335f
3
+ size 3668492768
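Each of the four `model-*.safetensors` entries above is a Git LFS pointer (spec v1): the repository records only the `oid sha256` and `size`, and the actual shard is fetched by Git LFS or `huggingface_hub`. As an illustrative aside (the local path is hypothetical), a downloaded shard can be checked against the recorded digest and size:

```py
# Sketch: verify a downloaded shard against the sha256 and size from its LFS pointer.
import hashlib
import os

def verify_shard(path: str, expected_sha256: str, expected_size: int) -> bool:
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_sha256 and os.path.getsize(path) == expected_size

print(verify_shard(
    "model-00004-of-00004.safetensors",
    "a90408e0d462b199de23f1e29730d1608c3c425d384c6a3c0dc34fbeea9b335f",
    3668492768,
))
```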
model.safetensors.index.json ADDED
@@ -0,0 +1,441 @@
1
+ {
2
+ "metadata": {
3
+ "total_size": 18237007872
4
+ },
5
+ "weight_map": {
6
+ "model.embed_tokens.weight": "model-00001-of-00004.safetensors",
7
+ "model.layers.layers.0.mixer.A_log": "model-00001-of-00004.safetensors",
8
+ "model.layers.layers.0.mixer.B_norm_weight": "model-00001-of-00004.safetensors",
9
+ "model.layers.layers.0.mixer.C_norm_weight": "model-00001-of-00004.safetensors",
10
+ "model.layers.layers.0.mixer.D": "model-00001-of-00004.safetensors",
11
+ "model.layers.layers.0.mixer.bcdt_proj.weight": "model-00001-of-00004.safetensors",
12
+ "model.layers.layers.0.mixer.conv1d.weight": "model-00001-of-00004.safetensors",
13
+ "model.layers.layers.0.mixer.dt_bias": "model-00001-of-00004.safetensors",
14
+ "model.layers.layers.0.mixer.dt_norm_weight": "model-00001-of-00004.safetensors",
15
+ "model.layers.layers.0.mixer.dt_proj.weight": "model-00001-of-00004.safetensors",
16
+ "model.layers.layers.0.mixer.in_proj.weight": "model-00001-of-00004.safetensors",
17
+ "model.layers.layers.0.mixer.out_proj.weight": "model-00001-of-00004.safetensors",
18
+ "model.layers.layers.0.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
19
+ "model.layers.layers.0.mlp.gate_up_proj.weight": "model-00001-of-00004.safetensors",
20
+ "model.layers.layers.0.post_mixer_norm.weight": "model-00001-of-00004.safetensors",
21
+ "model.layers.layers.0.post_mlp_norm.weight": "model-00001-of-00004.safetensors",
22
+ "model.layers.layers.0.pre_mixer_norm.weight": "model-00001-of-00004.safetensors",
23
+ "model.layers.layers.0.pre_mlp_norm.weight": "model-00001-of-00004.safetensors",
24
+ "model.layers.layers.1.mixer.k_weight": "model-00001-of-00004.safetensors",
25
+ "model.layers.layers.1.mixer.o_proj.weight": "model-00001-of-00004.safetensors",
26
+ "model.layers.layers.1.mixer.q_weight": "model-00001-of-00004.safetensors",
27
+ "model.layers.layers.1.mixer.qkv_proj.weight": "model-00001-of-00004.safetensors",
28
+ "model.layers.layers.1.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
29
+ "model.layers.layers.1.mlp.gate_up_proj.weight": "model-00001-of-00004.safetensors",
30
+ "model.layers.layers.1.post_mixer_norm.weight": "model-00001-of-00004.safetensors",
31
+ "model.layers.layers.1.post_mlp_norm.weight": "model-00001-of-00004.safetensors",
32
+ "model.layers.layers.1.pre_mixer_norm.weight": "model-00001-of-00004.safetensors",
33
+ "model.layers.layers.1.pre_mlp_norm.weight": "model-00001-of-00004.safetensors",
34
+ "model.layers.layers.10.mixer.A_log": "model-00002-of-00004.safetensors",
35
+ "model.layers.layers.10.mixer.B_norm_weight": "model-00002-of-00004.safetensors",
36
+ "model.layers.layers.10.mixer.C_norm_weight": "model-00002-of-00004.safetensors",
37
+ "model.layers.layers.10.mixer.D": "model-00002-of-00004.safetensors",
38
+ "model.layers.layers.10.mixer.bcdt_proj.weight": "model-00002-of-00004.safetensors",
39
+ "model.layers.layers.10.mixer.conv1d.weight": "model-00002-of-00004.safetensors",
40
+ "model.layers.layers.10.mixer.dt_bias": "model-00002-of-00004.safetensors",
41
+ "model.layers.layers.10.mixer.dt_norm_weight": "model-00002-of-00004.safetensors",
42
+ "model.layers.layers.10.mixer.dt_proj.weight": "model-00002-of-00004.safetensors",
43
+ "model.layers.layers.10.mixer.in_proj.weight": "model-00002-of-00004.safetensors",
44
+ "model.layers.layers.10.mixer.out_proj.weight": "model-00002-of-00004.safetensors",
45
+ "model.layers.layers.10.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
46
+ "model.layers.layers.10.mlp.gate_up_proj.weight": "model-00002-of-00004.safetensors",
47
+ "model.layers.layers.10.post_mixer_norm.weight": "model-00002-of-00004.safetensors",
48
+ "model.layers.layers.10.post_mlp_norm.weight": "model-00002-of-00004.safetensors",
49
+ "model.layers.layers.10.pre_mixer_norm.weight": "model-00002-of-00004.safetensors",
50
+ "model.layers.layers.10.pre_mlp_norm.weight": "model-00002-of-00004.safetensors",
51
+ "model.layers.layers.11.mixer.k_weight": "model-00002-of-00004.safetensors",
52
+ "model.layers.layers.11.mixer.o_proj.weight": "model-00002-of-00004.safetensors",
53
+ "model.layers.layers.11.mixer.q_weight": "model-00002-of-00004.safetensors",
54
+ "model.layers.layers.11.mixer.qkv_proj.weight": "model-00002-of-00004.safetensors",
55
+ "model.layers.layers.11.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
56
+ "model.layers.layers.11.mlp.gate_up_proj.weight": "model-00002-of-00004.safetensors",
57
+ "model.layers.layers.11.post_mixer_norm.weight": "model-00002-of-00004.safetensors",
58
+ "model.layers.layers.11.post_mlp_norm.weight": "model-00002-of-00004.safetensors",
59
+ "model.layers.layers.11.pre_mixer_norm.weight": "model-00002-of-00004.safetensors",
60
+ "model.layers.layers.11.pre_mlp_norm.weight": "model-00002-of-00004.safetensors",
61
+ "model.layers.layers.12.mixer.A_log": "model-00002-of-00004.safetensors",
62
+ "model.layers.layers.12.mixer.B_norm_weight": "model-00002-of-00004.safetensors",
63
+ "model.layers.layers.12.mixer.C_norm_weight": "model-00002-of-00004.safetensors",
64
+ "model.layers.layers.12.mixer.D": "model-00002-of-00004.safetensors",
65
+ "model.layers.layers.12.mixer.bcdt_proj.weight": "model-00002-of-00004.safetensors",
66
+ "model.layers.layers.12.mixer.conv1d.weight": "model-00002-of-00004.safetensors",
67
+ "model.layers.layers.12.mixer.dt_bias": "model-00002-of-00004.safetensors",
68
+ "model.layers.layers.12.mixer.dt_norm_weight": "model-00002-of-00004.safetensors",
69
+ "model.layers.layers.12.mixer.dt_proj.weight": "model-00002-of-00004.safetensors",
70
+ "model.layers.layers.12.mixer.in_proj.weight": "model-00002-of-00004.safetensors",
71
+ "model.layers.layers.12.mixer.out_proj.weight": "model-00002-of-00004.safetensors",
72
+ "model.layers.layers.12.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
73
+ "model.layers.layers.12.mlp.gate_up_proj.weight": "model-00002-of-00004.safetensors",
74
+ "model.layers.layers.12.post_mixer_norm.weight": "model-00002-of-00004.safetensors",
75
+ "model.layers.layers.12.post_mlp_norm.weight": "model-00002-of-00004.safetensors",
76
+ "model.layers.layers.12.pre_mixer_norm.weight": "model-00002-of-00004.safetensors",
77
+ "model.layers.layers.12.pre_mlp_norm.weight": "model-00002-of-00004.safetensors",
78
+ "model.layers.layers.13.mixer.k_weight": "model-00002-of-00004.safetensors",
79
+ "model.layers.layers.13.mixer.o_proj.weight": "model-00002-of-00004.safetensors",
80
+ "model.layers.layers.13.mixer.q_weight": "model-00002-of-00004.safetensors",
81
+ "model.layers.layers.13.mixer.qkv_proj.weight": "model-00002-of-00004.safetensors",
82
+ "model.layers.layers.13.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
83
+ "model.layers.layers.13.mlp.gate_up_proj.weight": "model-00002-of-00004.safetensors",
84
+ "model.layers.layers.13.post_mixer_norm.weight": "model-00002-of-00004.safetensors",
85
+ "model.layers.layers.13.post_mlp_norm.weight": "model-00002-of-00004.safetensors",
86
+ "model.layers.layers.13.pre_mixer_norm.weight": "model-00002-of-00004.safetensors",
87
+ "model.layers.layers.13.pre_mlp_norm.weight": "model-00002-of-00004.safetensors",
88
+ "model.layers.layers.14.mixer.A_log": "model-00002-of-00004.safetensors",
89
+ "model.layers.layers.14.mixer.B_norm_weight": "model-00002-of-00004.safetensors",
90
+ "model.layers.layers.14.mixer.C_norm_weight": "model-00002-of-00004.safetensors",
91
+ "model.layers.layers.14.mixer.D": "model-00002-of-00004.safetensors",
92
+ "model.layers.layers.14.mixer.bcdt_proj.weight": "model-00002-of-00004.safetensors",
93
+ "model.layers.layers.14.mixer.conv1d.weight": "model-00002-of-00004.safetensors",
94
+ "model.layers.layers.14.mixer.dt_bias": "model-00002-of-00004.safetensors",
95
+ "model.layers.layers.14.mixer.dt_norm_weight": "model-00002-of-00004.safetensors",
96
+ "model.layers.layers.14.mixer.dt_proj.weight": "model-00002-of-00004.safetensors",
97
+ "model.layers.layers.14.mixer.in_proj.weight": "model-00002-of-00004.safetensors",
98
+ "model.layers.layers.14.mixer.out_proj.weight": "model-00002-of-00004.safetensors",
99
+ "model.layers.layers.14.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
100
+ "model.layers.layers.14.mlp.gate_up_proj.weight": "model-00002-of-00004.safetensors",
101
+ "model.layers.layers.14.post_mixer_norm.weight": "model-00002-of-00004.safetensors",
102
+ "model.layers.layers.14.post_mlp_norm.weight": "model-00002-of-00004.safetensors",
103
+ "model.layers.layers.14.pre_mixer_norm.weight": "model-00002-of-00004.safetensors",
104
+ "model.layers.layers.14.pre_mlp_norm.weight": "model-00002-of-00004.safetensors",
105
+ "model.layers.layers.15.mixer.k_weight": "model-00002-of-00004.safetensors",
106
+ "model.layers.layers.15.mixer.o_proj.weight": "model-00002-of-00004.safetensors",
107
+ "model.layers.layers.15.mixer.q_weight": "model-00002-of-00004.safetensors",
108
+ "model.layers.layers.15.mixer.qkv_proj.weight": "model-00002-of-00004.safetensors",
109
+ "model.layers.layers.15.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
110
+ "model.layers.layers.15.mlp.gate_up_proj.weight": "model-00002-of-00004.safetensors",
111
+ "model.layers.layers.15.post_mixer_norm.weight": "model-00002-of-00004.safetensors",
112
+ "model.layers.layers.15.post_mlp_norm.weight": "model-00002-of-00004.safetensors",
113
+ "model.layers.layers.15.pre_mixer_norm.weight": "model-00002-of-00004.safetensors",
114
+ "model.layers.layers.15.pre_mlp_norm.weight": "model-00002-of-00004.safetensors",
115
+ "model.layers.layers.16.mixer.A_log": "model-00002-of-00004.safetensors",
116
+ "model.layers.layers.16.mixer.B_norm_weight": "model-00002-of-00004.safetensors",
117
+ "model.layers.layers.16.mixer.C_norm_weight": "model-00002-of-00004.safetensors",
118
+ "model.layers.layers.16.mixer.D": "model-00002-of-00004.safetensors",
119
+ "model.layers.layers.16.mixer.bcdt_proj.weight": "model-00002-of-00004.safetensors",
120
+ "model.layers.layers.16.mixer.conv1d.weight": "model-00002-of-00004.safetensors",
121
+ "model.layers.layers.16.mixer.dt_bias": "model-00002-of-00004.safetensors",
122
+ "model.layers.layers.16.mixer.dt_norm_weight": "model-00002-of-00004.safetensors",
123
+ "model.layers.layers.16.mixer.dt_proj.weight": "model-00002-of-00004.safetensors",
124
+ "model.layers.layers.16.mixer.in_proj.weight": "model-00002-of-00004.safetensors",
125
+ "model.layers.layers.16.mixer.out_proj.weight": "model-00002-of-00004.safetensors",
126
+ "model.layers.layers.16.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
127
+ "model.layers.layers.16.mlp.gate_up_proj.weight": "model-00003-of-00004.safetensors",
128
+ "model.layers.layers.16.post_mixer_norm.weight": "model-00003-of-00004.safetensors",
129
+ "model.layers.layers.16.post_mlp_norm.weight": "model-00003-of-00004.safetensors",
130
+ "model.layers.layers.16.pre_mixer_norm.weight": "model-00003-of-00004.safetensors",
131
+ "model.layers.layers.16.pre_mlp_norm.weight": "model-00003-of-00004.safetensors",
132
+ "model.layers.layers.17.mixer.k_weight": "model-00003-of-00004.safetensors",
133
+ "model.layers.layers.17.mixer.o_proj.weight": "model-00003-of-00004.safetensors",
134
+ "model.layers.layers.17.mixer.q_weight": "model-00003-of-00004.safetensors",
135
+ "model.layers.layers.17.mixer.qkv_proj.weight": "model-00003-of-00004.safetensors",
136
+ "model.layers.layers.17.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
137
+ "model.layers.layers.17.mlp.gate_up_proj.weight": "model-00003-of-00004.safetensors",
138
+ "model.layers.layers.17.post_mixer_norm.weight": "model-00003-of-00004.safetensors",
139
+ "model.layers.layers.17.post_mlp_norm.weight": "model-00003-of-00004.safetensors",
140
+ "model.layers.layers.17.pre_mixer_norm.weight": "model-00003-of-00004.safetensors",
141
+ "model.layers.layers.17.pre_mlp_norm.weight": "model-00003-of-00004.safetensors",
142
+ "model.layers.layers.18.mixer.A_log": "model-00003-of-00004.safetensors",
143
+ "model.layers.layers.18.mixer.B_norm_weight": "model-00003-of-00004.safetensors",
144
+ "model.layers.layers.18.mixer.C_norm_weight": "model-00003-of-00004.safetensors",
145
+ "model.layers.layers.18.mixer.D": "model-00003-of-00004.safetensors",
146
+ "model.layers.layers.18.mixer.bcdt_proj.weight": "model-00003-of-00004.safetensors",
147
+ "model.layers.layers.18.mixer.conv1d.weight": "model-00003-of-00004.safetensors",
148
+ "model.layers.layers.18.mixer.dt_bias": "model-00003-of-00004.safetensors",
149
+ "model.layers.layers.18.mixer.dt_norm_weight": "model-00003-of-00004.safetensors",
150
+ "model.layers.layers.18.mixer.dt_proj.weight": "model-00003-of-00004.safetensors",
151
+ "model.layers.layers.18.mixer.in_proj.weight": "model-00003-of-00004.safetensors",
152
+ "model.layers.layers.18.mixer.out_proj.weight": "model-00003-of-00004.safetensors",
153
+ "model.layers.layers.18.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
154
+ "model.layers.layers.18.mlp.gate_up_proj.weight": "model-00003-of-00004.safetensors",
155
+ "model.layers.layers.18.post_mixer_norm.weight": "model-00003-of-00004.safetensors",
156
+ "model.layers.layers.18.post_mlp_norm.weight": "model-00003-of-00004.safetensors",
157
+ "model.layers.layers.18.pre_mixer_norm.weight": "model-00003-of-00004.safetensors",
158
+ "model.layers.layers.18.pre_mlp_norm.weight": "model-00003-of-00004.safetensors",
159
+ "model.layers.layers.19.mixer.k_weight": "model-00003-of-00004.safetensors",
160
+ "model.layers.layers.19.mixer.o_proj.weight": "model-00003-of-00004.safetensors",
161
+ "model.layers.layers.19.mixer.q_weight": "model-00003-of-00004.safetensors",
162
+ "model.layers.layers.19.mixer.qkv_proj.weight": "model-00003-of-00004.safetensors",
163
+ "model.layers.layers.19.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
164
+ "model.layers.layers.19.mlp.gate_up_proj.weight": "model-00003-of-00004.safetensors",
165
+ "model.layers.layers.19.post_mixer_norm.weight": "model-00003-of-00004.safetensors",
166
+ "model.layers.layers.19.post_mlp_norm.weight": "model-00003-of-00004.safetensors",
167
+ "model.layers.layers.19.pre_mixer_norm.weight": "model-00003-of-00004.safetensors",
168
+ "model.layers.layers.19.pre_mlp_norm.weight": "model-00003-of-00004.safetensors",
169
+ "model.layers.layers.2.mixer.A_log": "model-00001-of-00004.safetensors",
170
+ "model.layers.layers.2.mixer.B_norm_weight": "model-00001-of-00004.safetensors",
171
+ "model.layers.layers.2.mixer.C_norm_weight": "model-00001-of-00004.safetensors",
172
+ "model.layers.layers.2.mixer.D": "model-00001-of-00004.safetensors",
173
+ "model.layers.layers.2.mixer.bcdt_proj.weight": "model-00001-of-00004.safetensors",
174
+ "model.layers.layers.2.mixer.conv1d.weight": "model-00001-of-00004.safetensors",
175
+ "model.layers.layers.2.mixer.dt_bias": "model-00001-of-00004.safetensors",
176
+ "model.layers.layers.2.mixer.dt_norm_weight": "model-00001-of-00004.safetensors",
177
+ "model.layers.layers.2.mixer.dt_proj.weight": "model-00001-of-00004.safetensors",
178
+ "model.layers.layers.2.mixer.in_proj.weight": "model-00001-of-00004.safetensors",
179
+ "model.layers.layers.2.mixer.out_proj.weight": "model-00001-of-00004.safetensors",
180
+ "model.layers.layers.2.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
181
+ "model.layers.layers.2.mlp.gate_up_proj.weight": "model-00001-of-00004.safetensors",
182
+ "model.layers.layers.2.post_mixer_norm.weight": "model-00001-of-00004.safetensors",
183
+ "model.layers.layers.2.post_mlp_norm.weight": "model-00001-of-00004.safetensors",
184
+ "model.layers.layers.2.pre_mixer_norm.weight": "model-00001-of-00004.safetensors",
185
+ "model.layers.layers.2.pre_mlp_norm.weight": "model-00001-of-00004.safetensors",
186
+ "model.layers.layers.20.mixer.A_log": "model-00003-of-00004.safetensors",
187
+ "model.layers.layers.20.mixer.B_norm_weight": "model-00003-of-00004.safetensors",
188
+ "model.layers.layers.20.mixer.C_norm_weight": "model-00003-of-00004.safetensors",
189
+ "model.layers.layers.20.mixer.D": "model-00003-of-00004.safetensors",
190
+ "model.layers.layers.20.mixer.bcdt_proj.weight": "model-00003-of-00004.safetensors",
191
+ "model.layers.layers.20.mixer.conv1d.weight": "model-00003-of-00004.safetensors",
192
+ "model.layers.layers.20.mixer.dt_bias": "model-00003-of-00004.safetensors",
193
+ "model.layers.layers.20.mixer.dt_norm_weight": "model-00003-of-00004.safetensors",
194
+ "model.layers.layers.20.mixer.dt_proj.weight": "model-00003-of-00004.safetensors",
195
+ "model.layers.layers.20.mixer.in_proj.weight": "model-00003-of-00004.safetensors",
196
+ "model.layers.layers.20.mixer.out_proj.weight": "model-00003-of-00004.safetensors",
197
+ "model.layers.layers.20.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
198
+ "model.layers.layers.20.mlp.gate_up_proj.weight": "model-00003-of-00004.safetensors",
199
+ "model.layers.layers.20.post_mixer_norm.weight": "model-00003-of-00004.safetensors",
200
+ "model.layers.layers.20.post_mlp_norm.weight": "model-00003-of-00004.safetensors",
201
+ "model.layers.layers.20.pre_mixer_norm.weight": "model-00003-of-00004.safetensors",
202
+ "model.layers.layers.20.pre_mlp_norm.weight": "model-00003-of-00004.safetensors",
203
+ "model.layers.layers.21.mixer.k_weight": "model-00003-of-00004.safetensors",
204
+ "model.layers.layers.21.mixer.o_proj.weight": "model-00003-of-00004.safetensors",
205
+ "model.layers.layers.21.mixer.q_weight": "model-00003-of-00004.safetensors",
206
+ "model.layers.layers.21.mixer.qkv_proj.weight": "model-00003-of-00004.safetensors",
207
+ "model.layers.layers.21.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
208
+ "model.layers.layers.21.mlp.gate_up_proj.weight": "model-00003-of-00004.safetensors",
209
+ "model.layers.layers.21.post_mixer_norm.weight": "model-00003-of-00004.safetensors",
210
+ "model.layers.layers.21.post_mlp_norm.weight": "model-00003-of-00004.safetensors",
211
+ "model.layers.layers.21.pre_mixer_norm.weight": "model-00003-of-00004.safetensors",
212
+ "model.layers.layers.21.pre_mlp_norm.weight": "model-00003-of-00004.safetensors",
213
+ "model.layers.layers.22.mixer.A_log": "model-00003-of-00004.safetensors",
214
+ "model.layers.layers.22.mixer.B_norm_weight": "model-00003-of-00004.safetensors",
215
+ "model.layers.layers.22.mixer.C_norm_weight": "model-00003-of-00004.safetensors",
216
+ "model.layers.layers.22.mixer.D": "model-00003-of-00004.safetensors",
217
+ "model.layers.layers.22.mixer.bcdt_proj.weight": "model-00003-of-00004.safetensors",
218
+ "model.layers.layers.22.mixer.conv1d.weight": "model-00003-of-00004.safetensors",
219
+ "model.layers.layers.22.mixer.dt_bias": "model-00003-of-00004.safetensors",
220
+ "model.layers.layers.22.mixer.dt_norm_weight": "model-00003-of-00004.safetensors",
221
+ "model.layers.layers.22.mixer.dt_proj.weight": "model-00003-of-00004.safetensors",
222
+ "model.layers.layers.22.mixer.in_proj.weight": "model-00003-of-00004.safetensors",
223
+ "model.layers.layers.22.mixer.out_proj.weight": "model-00003-of-00004.safetensors",
224
+ "model.layers.layers.22.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
225
+ "model.layers.layers.22.mlp.gate_up_proj.weight": "model-00003-of-00004.safetensors",
226
+ "model.layers.layers.22.post_mixer_norm.weight": "model-00003-of-00004.safetensors",
227
+ "model.layers.layers.22.post_mlp_norm.weight": "model-00003-of-00004.safetensors",
228
+ "model.layers.layers.22.pre_mixer_norm.weight": "model-00003-of-00004.safetensors",
229
+ "model.layers.layers.22.pre_mlp_norm.weight": "model-00003-of-00004.safetensors",
230
+ "model.layers.layers.23.mixer.k_weight": "model-00003-of-00004.safetensors",
231
+ "model.layers.layers.23.mixer.o_proj.weight": "model-00003-of-00004.safetensors",
232
+ "model.layers.layers.23.mixer.q_weight": "model-00003-of-00004.safetensors",
233
+ "model.layers.layers.23.mixer.qkv_proj.weight": "model-00003-of-00004.safetensors",
234
+ "model.layers.layers.23.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
235
+ "model.layers.layers.23.mlp.gate_up_proj.weight": "model-00003-of-00004.safetensors",
236
+ "model.layers.layers.23.post_mixer_norm.weight": "model-00003-of-00004.safetensors",
237
+ "model.layers.layers.23.post_mlp_norm.weight": "model-00003-of-00004.safetensors",
238
+ "model.layers.layers.23.pre_mixer_norm.weight": "model-00003-of-00004.safetensors",
239
+ "model.layers.layers.23.pre_mlp_norm.weight": "model-00003-of-00004.safetensors",
240
+ "model.layers.layers.24.mixer.A_log": "model-00003-of-00004.safetensors",
241
+ "model.layers.layers.24.mixer.B_norm_weight": "model-00003-of-00004.safetensors",
242
+ "model.layers.layers.24.mixer.C_norm_weight": "model-00003-of-00004.safetensors",
243
+ "model.layers.layers.24.mixer.D": "model-00003-of-00004.safetensors",
244
+ "model.layers.layers.24.mixer.bcdt_proj.weight": "model-00003-of-00004.safetensors",
245
+ "model.layers.layers.24.mixer.conv1d.weight": "model-00003-of-00004.safetensors",
246
+ "model.layers.layers.24.mixer.dt_bias": "model-00003-of-00004.safetensors",
247
+ "model.layers.layers.24.mixer.dt_norm_weight": "model-00003-of-00004.safetensors",
248
+ "model.layers.layers.24.mixer.dt_proj.weight": "model-00003-of-00004.safetensors",
249
+ "model.layers.layers.24.mixer.in_proj.weight": "model-00003-of-00004.safetensors",
250
+ "model.layers.layers.24.mixer.out_proj.weight": "model-00003-of-00004.safetensors",
251
+ "model.layers.layers.24.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
252
+ "model.layers.layers.24.mlp.gate_up_proj.weight": "model-00003-of-00004.safetensors",
253
+ "model.layers.layers.24.post_mixer_norm.weight": "model-00003-of-00004.safetensors",
254
+ "model.layers.layers.24.post_mlp_norm.weight": "model-00003-of-00004.safetensors",
255
+ "model.layers.layers.24.pre_mixer_norm.weight": "model-00003-of-00004.safetensors",
256
+ "model.layers.layers.24.pre_mlp_norm.weight": "model-00003-of-00004.safetensors",
257
+ "model.layers.layers.25.mixer.k_weight": "model-00003-of-00004.safetensors",
258
+ "model.layers.layers.25.mixer.o_proj.weight": "model-00003-of-00004.safetensors",
259
+ "model.layers.layers.25.mixer.q_weight": "model-00003-of-00004.safetensors",
260
+ "model.layers.layers.25.mixer.qkv_proj.weight": "model-00003-of-00004.safetensors",
261
+ "model.layers.layers.25.mlp.down_proj.weight": "model-00004-of-00004.safetensors",
262
+ "model.layers.layers.25.mlp.gate_up_proj.weight": "model-00004-of-00004.safetensors",
263
+ "model.layers.layers.25.post_mixer_norm.weight": "model-00004-of-00004.safetensors",
264
+ "model.layers.layers.25.post_mlp_norm.weight": "model-00004-of-00004.safetensors",
265
+ "model.layers.layers.25.pre_mixer_norm.weight": "model-00004-of-00004.safetensors",
266
+ "model.layers.layers.25.pre_mlp_norm.weight": "model-00004-of-00004.safetensors",
267
+ "model.layers.layers.26.mixer.A_log": "model-00004-of-00004.safetensors",
268
+ "model.layers.layers.26.mixer.B_norm_weight": "model-00004-of-00004.safetensors",
269
+ "model.layers.layers.26.mixer.C_norm_weight": "model-00004-of-00004.safetensors",
270
+ "model.layers.layers.26.mixer.D": "model-00004-of-00004.safetensors",
271
+ "model.layers.layers.26.mixer.bcdt_proj.weight": "model-00004-of-00004.safetensors",
272
+ "model.layers.layers.26.mixer.conv1d.weight": "model-00004-of-00004.safetensors",
273
+ "model.layers.layers.26.mixer.dt_bias": "model-00004-of-00004.safetensors",
274
+ "model.layers.layers.26.mixer.dt_norm_weight": "model-00004-of-00004.safetensors",
275
+ "model.layers.layers.26.mixer.dt_proj.weight": "model-00004-of-00004.safetensors",
276
+ "model.layers.layers.26.mixer.in_proj.weight": "model-00004-of-00004.safetensors",
277
+ "model.layers.layers.26.mixer.out_proj.weight": "model-00004-of-00004.safetensors",
278
+ "model.layers.layers.26.mlp.down_proj.weight": "model-00004-of-00004.safetensors",
279
+ "model.layers.layers.26.mlp.gate_up_proj.weight": "model-00004-of-00004.safetensors",
280
+ "model.layers.layers.26.post_mixer_norm.weight": "model-00004-of-00004.safetensors",
281
+ "model.layers.layers.26.post_mlp_norm.weight": "model-00004-of-00004.safetensors",
282
+ "model.layers.layers.26.pre_mixer_norm.weight": "model-00004-of-00004.safetensors",
283
+ "model.layers.layers.26.pre_mlp_norm.weight": "model-00004-of-00004.safetensors",
284
+ "model.layers.layers.27.mixer.k_weight": "model-00004-of-00004.safetensors",
285
+ "model.layers.layers.27.mixer.o_proj.weight": "model-00004-of-00004.safetensors",
286
+ "model.layers.layers.27.mixer.q_weight": "model-00004-of-00004.safetensors",
287
+ "model.layers.layers.27.mixer.qkv_proj.weight": "model-00004-of-00004.safetensors",
288
+ "model.layers.layers.27.mlp.down_proj.weight": "model-00004-of-00004.safetensors",
289
+ "model.layers.layers.27.mlp.gate_up_proj.weight": "model-00004-of-00004.safetensors",
290
+ "model.layers.layers.27.post_mixer_norm.weight": "model-00004-of-00004.safetensors",
291
+ "model.layers.layers.27.post_mlp_norm.weight": "model-00004-of-00004.safetensors",
292
+ "model.layers.layers.27.pre_mixer_norm.weight": "model-00004-of-00004.safetensors",
293
+ "model.layers.layers.27.pre_mlp_norm.weight": "model-00004-of-00004.safetensors",
294
+ "model.layers.layers.28.mixer.A_log": "model-00004-of-00004.safetensors",
295
+ "model.layers.layers.28.mixer.B_norm_weight": "model-00004-of-00004.safetensors",
296
+ "model.layers.layers.28.mixer.C_norm_weight": "model-00004-of-00004.safetensors",
297
+ "model.layers.layers.28.mixer.D": "model-00004-of-00004.safetensors",
298
+ "model.layers.layers.28.mixer.bcdt_proj.weight": "model-00004-of-00004.safetensors",
299
+ "model.layers.layers.28.mixer.conv1d.weight": "model-00004-of-00004.safetensors",
300
+ "model.layers.layers.28.mixer.dt_bias": "model-00004-of-00004.safetensors",
301
+ "model.layers.layers.28.mixer.dt_norm_weight": "model-00004-of-00004.safetensors",
302
+ "model.layers.layers.28.mixer.dt_proj.weight": "model-00004-of-00004.safetensors",
303
+ "model.layers.layers.28.mixer.in_proj.weight": "model-00004-of-00004.safetensors",
304
+ "model.layers.layers.28.mixer.out_proj.weight": "model-00004-of-00004.safetensors",
305
+ "model.layers.layers.28.mlp.down_proj.weight": "model-00004-of-00004.safetensors",
306
+ "model.layers.layers.28.mlp.gate_up_proj.weight": "model-00004-of-00004.safetensors",
307
+ "model.layers.layers.28.post_mixer_norm.weight": "model-00004-of-00004.safetensors",
308
+ "model.layers.layers.28.post_mlp_norm.weight": "model-00004-of-00004.safetensors",
309
+ "model.layers.layers.28.pre_mixer_norm.weight": "model-00004-of-00004.safetensors",
310
+ "model.layers.layers.28.pre_mlp_norm.weight": "model-00004-of-00004.safetensors",
311
+ "model.layers.layers.29.mixer.k_weight": "model-00004-of-00004.safetensors",
312
+ "model.layers.layers.29.mixer.o_proj.weight": "model-00004-of-00004.safetensors",
313
+ "model.layers.layers.29.mixer.q_weight": "model-00004-of-00004.safetensors",
314
+ "model.layers.layers.29.mixer.qkv_proj.weight": "model-00004-of-00004.safetensors",
315
+ "model.layers.layers.29.mlp.down_proj.weight": "model-00004-of-00004.safetensors",
316
+ "model.layers.layers.29.mlp.gate_up_proj.weight": "model-00004-of-00004.safetensors",
317
+ "model.layers.layers.29.post_mixer_norm.weight": "model-00004-of-00004.safetensors",
318
+ "model.layers.layers.29.post_mlp_norm.weight": "model-00004-of-00004.safetensors",
319
+ "model.layers.layers.29.pre_mixer_norm.weight": "model-00004-of-00004.safetensors",
320
+ "model.layers.layers.29.pre_mlp_norm.weight": "model-00004-of-00004.safetensors",
321
+ "model.layers.layers.3.mixer.k_weight": "model-00001-of-00004.safetensors",
322
+ "model.layers.layers.3.mixer.o_proj.weight": "model-00001-of-00004.safetensors",
323
+ "model.layers.layers.3.mixer.q_weight": "model-00001-of-00004.safetensors",
324
+ "model.layers.layers.3.mixer.qkv_proj.weight": "model-00001-of-00004.safetensors",
325
+ "model.layers.layers.3.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
326
+ "model.layers.layers.3.mlp.gate_up_proj.weight": "model-00001-of-00004.safetensors",
327
+ "model.layers.layers.3.post_mixer_norm.weight": "model-00001-of-00004.safetensors",
328
+ "model.layers.layers.3.post_mlp_norm.weight": "model-00001-of-00004.safetensors",
329
+ "model.layers.layers.3.pre_mixer_norm.weight": "model-00001-of-00004.safetensors",
330
+ "model.layers.layers.3.pre_mlp_norm.weight": "model-00001-of-00004.safetensors",
331
+ "model.layers.layers.30.mixer.A_log": "model-00004-of-00004.safetensors",
332
+ "model.layers.layers.30.mixer.B_norm_weight": "model-00004-of-00004.safetensors",
333
+ "model.layers.layers.30.mixer.C_norm_weight": "model-00004-of-00004.safetensors",
334
+ "model.layers.layers.30.mixer.D": "model-00004-of-00004.safetensors",
335
+ "model.layers.layers.30.mixer.bcdt_proj.weight": "model-00004-of-00004.safetensors",
336
+ "model.layers.layers.30.mixer.conv1d.weight": "model-00004-of-00004.safetensors",
337
+ "model.layers.layers.30.mixer.dt_bias": "model-00004-of-00004.safetensors",
338
+ "model.layers.layers.30.mixer.dt_norm_weight": "model-00004-of-00004.safetensors",
339
+ "model.layers.layers.30.mixer.dt_proj.weight": "model-00004-of-00004.safetensors",
340
+ "model.layers.layers.30.mixer.in_proj.weight": "model-00004-of-00004.safetensors",
341
+ "model.layers.layers.30.mixer.out_proj.weight": "model-00004-of-00004.safetensors",
342
+ "model.layers.layers.30.mlp.down_proj.weight": "model-00004-of-00004.safetensors",
343
+ "model.layers.layers.30.mlp.gate_up_proj.weight": "model-00004-of-00004.safetensors",
344
+ "model.layers.layers.30.post_mixer_norm.weight": "model-00004-of-00004.safetensors",
345
+ "model.layers.layers.30.post_mlp_norm.weight": "model-00004-of-00004.safetensors",
346
+ "model.layers.layers.30.pre_mixer_norm.weight": "model-00004-of-00004.safetensors",
347
+ "model.layers.layers.30.pre_mlp_norm.weight": "model-00004-of-00004.safetensors",
348
+ "model.layers.layers.31.mixer.k_weight": "model-00004-of-00004.safetensors",
349
+ "model.layers.layers.31.mixer.o_proj.weight": "model-00004-of-00004.safetensors",
350
+ "model.layers.layers.31.mixer.q_weight": "model-00004-of-00004.safetensors",
351
+ "model.layers.layers.31.mixer.qkv_proj.weight": "model-00004-of-00004.safetensors",
352
+ "model.layers.layers.31.mlp.down_proj.weight": "model-00004-of-00004.safetensors",
353
+ "model.layers.layers.31.mlp.gate_up_proj.weight": "model-00004-of-00004.safetensors",
354
+ "model.layers.layers.31.post_mixer_norm.weight": "model-00004-of-00004.safetensors",
355
+ "model.layers.layers.31.post_mlp_norm.weight": "model-00004-of-00004.safetensors",
356
+ "model.layers.layers.31.pre_mixer_norm.weight": "model-00004-of-00004.safetensors",
357
+ "model.layers.layers.31.pre_mlp_norm.weight": "model-00004-of-00004.safetensors",
358
+ "model.layers.layers.4.mixer.A_log": "model-00001-of-00004.safetensors",
359
+ "model.layers.layers.4.mixer.B_norm_weight": "model-00001-of-00004.safetensors",
360
+ "model.layers.layers.4.mixer.C_norm_weight": "model-00001-of-00004.safetensors",
361
+ "model.layers.layers.4.mixer.D": "model-00001-of-00004.safetensors",
362
+ "model.layers.layers.4.mixer.bcdt_proj.weight": "model-00001-of-00004.safetensors",
363
+ "model.layers.layers.4.mixer.conv1d.weight": "model-00001-of-00004.safetensors",
364
+ "model.layers.layers.4.mixer.dt_bias": "model-00001-of-00004.safetensors",
365
+ "model.layers.layers.4.mixer.dt_norm_weight": "model-00001-of-00004.safetensors",
366
+ "model.layers.layers.4.mixer.dt_proj.weight": "model-00001-of-00004.safetensors",
367
+ "model.layers.layers.4.mixer.in_proj.weight": "model-00001-of-00004.safetensors",
368
+ "model.layers.layers.4.mixer.out_proj.weight": "model-00001-of-00004.safetensors",
369
+ "model.layers.layers.4.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
370
+ "model.layers.layers.4.mlp.gate_up_proj.weight": "model-00001-of-00004.safetensors",
371
+ "model.layers.layers.4.post_mixer_norm.weight": "model-00001-of-00004.safetensors",
372
+ "model.layers.layers.4.post_mlp_norm.weight": "model-00001-of-00004.safetensors",
373
+ "model.layers.layers.4.pre_mixer_norm.weight": "model-00001-of-00004.safetensors",
374
+ "model.layers.layers.4.pre_mlp_norm.weight": "model-00001-of-00004.safetensors",
375
+ "model.layers.layers.5.mixer.k_weight": "model-00001-of-00004.safetensors",
376
+ "model.layers.layers.5.mixer.o_proj.weight": "model-00001-of-00004.safetensors",
377
+ "model.layers.layers.5.mixer.q_weight": "model-00001-of-00004.safetensors",
378
+ "model.layers.layers.5.mixer.qkv_proj.weight": "model-00001-of-00004.safetensors",
379
+ "model.layers.layers.5.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
380
+ "model.layers.layers.5.mlp.gate_up_proj.weight": "model-00001-of-00004.safetensors",
381
+ "model.layers.layers.5.post_mixer_norm.weight": "model-00001-of-00004.safetensors",
382
+ "model.layers.layers.5.post_mlp_norm.weight": "model-00001-of-00004.safetensors",
383
+ "model.layers.layers.5.pre_mixer_norm.weight": "model-00001-of-00004.safetensors",
384
+ "model.layers.layers.5.pre_mlp_norm.weight": "model-00001-of-00004.safetensors",
385
+ "model.layers.layers.6.mixer.A_log": "model-00001-of-00004.safetensors",
386
+ "model.layers.layers.6.mixer.B_norm_weight": "model-00001-of-00004.safetensors",
387
+ "model.layers.layers.6.mixer.C_norm_weight": "model-00001-of-00004.safetensors",
388
+ "model.layers.layers.6.mixer.D": "model-00001-of-00004.safetensors",
389
+ "model.layers.layers.6.mixer.bcdt_proj.weight": "model-00001-of-00004.safetensors",
390
+ "model.layers.layers.6.mixer.conv1d.weight": "model-00001-of-00004.safetensors",
391
+ "model.layers.layers.6.mixer.dt_bias": "model-00001-of-00004.safetensors",
392
+ "model.layers.layers.6.mixer.dt_norm_weight": "model-00001-of-00004.safetensors",
393
+ "model.layers.layers.6.mixer.dt_proj.weight": "model-00001-of-00004.safetensors",
394
+ "model.layers.layers.6.mixer.in_proj.weight": "model-00001-of-00004.safetensors",
395
+ "model.layers.layers.6.mixer.out_proj.weight": "model-00001-of-00004.safetensors",
396
+ "model.layers.layers.6.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
397
+ "model.layers.layers.6.mlp.gate_up_proj.weight": "model-00001-of-00004.safetensors",
398
+ "model.layers.layers.6.post_mixer_norm.weight": "model-00001-of-00004.safetensors",
399
+ "model.layers.layers.6.post_mlp_norm.weight": "model-00001-of-00004.safetensors",
400
+ "model.layers.layers.6.pre_mixer_norm.weight": "model-00001-of-00004.safetensors",
401
+ "model.layers.layers.6.pre_mlp_norm.weight": "model-00001-of-00004.safetensors",
402
+ "model.layers.layers.7.mixer.k_weight": "model-00001-of-00004.safetensors",
403
+ "model.layers.layers.7.mixer.o_proj.weight": "model-00001-of-00004.safetensors",
404
+ "model.layers.layers.7.mixer.q_weight": "model-00001-of-00004.safetensors",
405
+ "model.layers.layers.7.mixer.qkv_proj.weight": "model-00001-of-00004.safetensors",
406
+ "model.layers.layers.7.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
407
+ "model.layers.layers.7.mlp.gate_up_proj.weight": "model-00002-of-00004.safetensors",
408
+ "model.layers.layers.7.post_mixer_norm.weight": "model-00002-of-00004.safetensors",
409
+ "model.layers.layers.7.post_mlp_norm.weight": "model-00002-of-00004.safetensors",
410
+ "model.layers.layers.7.pre_mixer_norm.weight": "model-00002-of-00004.safetensors",
411
+ "model.layers.layers.7.pre_mlp_norm.weight": "model-00002-of-00004.safetensors",
412
+ "model.layers.layers.8.mixer.A_log": "model-00002-of-00004.safetensors",
413
+ "model.layers.layers.8.mixer.B_norm_weight": "model-00002-of-00004.safetensors",
414
+ "model.layers.layers.8.mixer.C_norm_weight": "model-00002-of-00004.safetensors",
415
+ "model.layers.layers.8.mixer.D": "model-00002-of-00004.safetensors",
416
+ "model.layers.layers.8.mixer.bcdt_proj.weight": "model-00002-of-00004.safetensors",
417
+ "model.layers.layers.8.mixer.conv1d.weight": "model-00002-of-00004.safetensors",
418
+ "model.layers.layers.8.mixer.dt_bias": "model-00002-of-00004.safetensors",
419
+ "model.layers.layers.8.mixer.dt_norm_weight": "model-00002-of-00004.safetensors",
420
+ "model.layers.layers.8.mixer.dt_proj.weight": "model-00002-of-00004.safetensors",
421
+ "model.layers.layers.8.mixer.in_proj.weight": "model-00002-of-00004.safetensors",
422
+ "model.layers.layers.8.mixer.out_proj.weight": "model-00002-of-00004.safetensors",
423
+ "model.layers.layers.8.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
424
+ "model.layers.layers.8.mlp.gate_up_proj.weight": "model-00002-of-00004.safetensors",
425
+ "model.layers.layers.8.post_mixer_norm.weight": "model-00002-of-00004.safetensors",
426
+ "model.layers.layers.8.post_mlp_norm.weight": "model-00002-of-00004.safetensors",
427
+ "model.layers.layers.8.pre_mixer_norm.weight": "model-00002-of-00004.safetensors",
428
+ "model.layers.layers.8.pre_mlp_norm.weight": "model-00002-of-00004.safetensors",
429
+ "model.layers.layers.9.mixer.k_weight": "model-00002-of-00004.safetensors",
430
+ "model.layers.layers.9.mixer.o_proj.weight": "model-00002-of-00004.safetensors",
431
+ "model.layers.layers.9.mixer.q_weight": "model-00002-of-00004.safetensors",
432
+ "model.layers.layers.9.mixer.qkv_proj.weight": "model-00002-of-00004.safetensors",
433
+ "model.layers.layers.9.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
434
+ "model.layers.layers.9.mlp.gate_up_proj.weight": "model-00002-of-00004.safetensors",
435
+ "model.layers.layers.9.post_mixer_norm.weight": "model-00002-of-00004.safetensors",
436
+ "model.layers.layers.9.post_mlp_norm.weight": "model-00002-of-00004.safetensors",
437
+ "model.layers.layers.9.pre_mixer_norm.weight": "model-00002-of-00004.safetensors",
438
+ "model.layers.layers.9.pre_mlp_norm.weight": "model-00002-of-00004.safetensors",
439
+ "model.norm.weight": "model-00004-of-00004.safetensors"
440
+ }
441
+ }
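The `weight_map` above tells loaders which shard stores each parameter; `transformers` consumes this index automatically when loading the sharded checkpoint. For illustration only (the tensor name is taken from the map, and `safe_open` is the standard safetensors API), a single tensor could be read by hand like this:

```py
# Sketch: locate and load one tensor from the sharded checkpoint using the index.
import json
from safetensors import safe_open

with open("model.safetensors.index.json") as f:
    index = json.load(f)

name = "model.layers.layers.0.mixer.A_log"
shard = index["weight_map"][name]  # -> "model-00001-of-00004.safetensors"
with safe_open(shard, framework="pt", device="cpu") as f:
    tensor = f.get_tensor(name)
print(shard, tuple(tensor.shape))
```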
modeling_plamo.py ADDED
@@ -0,0 +1,1707 @@
1
+ import enum
2
+ import math
3
+ import warnings
4
+ from typing import Any, Dict, List, Literal, NamedTuple, Optional, Tuple, Union
5
+
6
+ try:
7
+ # It is difficult to install mamba_ssm on a login node because
8
+ # it requires a GPU for installation
9
+ import mamba_ssm
10
+ except ModuleNotFoundError:
11
+ warnings.warn("mamba_ssm could not be imported", stacklevel=2)
12
+ try:
13
+ # It is difficult to install causal_conv1d on a login node because
14
+ # it requires a GPU for installation
15
+ import causal_conv1d.causal_conv1d_interface as causal_conv1d
16
+ except ModuleNotFoundError:
17
+ warnings.warn("causal_conv1d could not be imported", stacklevel=2)
18
+ import torch
19
+ from torch import nn
20
+ from torch.nn import functional as F
21
+ from transformers import PretrainedConfig, PreTrainedModel
22
+ from transformers.modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast
23
+
24
+
25
+ def _is_first_token(mask: torch.Tensor) -> torch.Tensor:
26
+ assert mask.dtype == torch.bool
27
+ B, Nh, q_len, kv_len = mask.shape
28
+ mask = mask[:, :, :, -q_len:]
29
+ cont = q_len != kv_len
30
+ v = False if cont else True
31
+ out = torch.logical_not(torch.diagonal(mask, offset=-1, dim1=-2, dim2=-1).bool())
32
+ out = torch.cat(
33
+ [
34
+ torch.full(size=(B, Nh, 1), dtype=torch.bool, device=out.device, fill_value=v),
35
+ out,
36
+ ],
37
+ dim=-1,
38
+ )
39
+ return out
40
+
41
+
42
+ def _swiglu(h: torch.Tensor) -> torch.Tensor:
43
+ h0, h1 = h.chunk(2, dim=-1)
44
+ return torch.nn.functional.silu(h0) * h1
45
+
46
+
47
+ class RotaryEmbedding(torch.nn.Module):
48
+ def __init__(
49
+ self, dim: int, max_position_embeddings: int = 2048, base: int = 10000, device: Optional[torch.device] = None
50
+ ) -> None:
51
+ super().__init__()
52
+
53
+ self.dim = dim
54
+ self.max_position_embeddings = max_position_embeddings
55
+ self.base = base
56
+ inv_freq = 1.0 / (self.base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
57
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
58
+
59
+ # Build here to make `torch.jit.trace` work.
60
+ self._set_cos_sin_cache(
61
+ seq_len=max_position_embeddings, device=self.inv_freq.device, dtype=torch.get_default_dtype()
62
+ )
63
+
64
+ def _set_cos_sin_cache(self, seq_len: int, device: Any, dtype: Any) -> None:
65
+ self.max_seq_len_cached = seq_len
66
+ t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype) # type: ignore
67
+
68
+ freqs = torch.einsum("i,j->ij", t, self.inv_freq)
69
+ # Different from paper, but it uses a different permutation in order to obtain the same calculation
70
+ emb = torch.cat((freqs, freqs), dim=-1)
71
+ self.register_buffer("cos_cached", emb.cos()[None, None, :, :].to(dtype), persistent=False)
72
+ self.register_buffer("sin_cached", emb.sin()[None, None, :, :].to(dtype), persistent=False)
73
+
74
+ def forward(self, x: torch.Tensor, seq_len: int) -> Tuple[torch.Tensor, torch.Tensor]:
75
+ # x: [bs, num_attention_heads, seq_len, head_size]
76
+ if seq_len > self.max_seq_len_cached:
77
+ self._set_cos_sin_cache(seq_len=seq_len, device=x.device, dtype=x.dtype)
78
+
79
+ return (
80
+ self.cos_cached[:, :, :seq_len, ...].to(dtype=x.dtype), # type: ignore
81
+ self.sin_cached[:, :, :seq_len, ...].to(dtype=x.dtype), # type: ignore
82
+ )
83
+
84
+
85
+ def _rotate_half(x: torch.Tensor) -> torch.Tensor:
86
+ """Rotates half the hidden dims of the input."""
87
+ x1 = x[..., : x.shape[-1] // 2]
88
+ x2 = x[..., x.shape[-1] // 2 :]
89
+ return torch.cat((-x2, x1), dim=-1)
90
+
91
+
92
+ def _rotary_pos_emb(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor, position_ids: torch.Tensor) -> torch.Tensor:
93
+ # The first two dimensions of cos and sin are always 1, so we can `squeeze` them.
94
+ cos = cos.squeeze(1).squeeze(0) # [seq_len, dim]
95
+ sin = sin.squeeze(1).squeeze(0) # [seq_len, dim]
96
+ cos = cos[position_ids].unsqueeze(1) # [bs, 1, seq_len, dim]
97
+ sin = sin[position_ids].unsqueeze(1) # [bs, 1, seq_len, dim]
98
+ x_embed = (x * cos) + (_rotate_half(x) * sin)
99
+ return x_embed
100
+
101
+
102
+ class LinearType(str, enum.Enum):
103
+ Normal = "normal"
104
+ Fp8 = "fp8"
105
+ Fp8Retain = "fp8-retain"
106
+
107
+
108
+ class Plamo2Config(PretrainedConfig): # type: ignore
109
+ model_type: str = "plamo2"
110
+
111
+ def __init__(
112
+ self,
113
+ hidden_size: int = 4096,
114
+ num_hidden_layers: int = 32,
115
+ rms_norm_eps: float = 1e-6,
116
+ tie_word_embeddings: bool = True,
117
+ # Attention
118
+ num_attention_heads: int = 32,
119
+ num_key_value_heads: int = 4,
120
+ hidden_size_per_head: int = 128,
121
+ max_position_embeddings: int = 2048,
122
+ attention_window_size: int = 2048,
123
+ full_attention_idx: list[int] | None = None,
124
+ rope_theta: int = 10000,
125
+ rope_local_theta: int = 10000,
126
+ # Mamba
127
+ mamba_d_state: int = 64,
128
+ mamba_d_conv: int = 4,
129
+ mamba_num_heads: int = 64,
130
+ mamba_step: int = 2,
131
+ mamba_chunk_size: int = 256,
132
+ mamba_enabled: bool = True,
133
+ # MLP
134
+ intermediate_size: int = 13312,
135
+ # Tokenizer
136
+ vocab_size: int = 32000,
137
+ tokenizer_class: str = "Plamo2Tokenizer",
138
+ pad_token_id: Optional[int] = None,
139
+ bos_token_id: int = 1,
140
+ eos_token_id: int = 2,
141
+ # Multimodal
142
+ image_token_id: Optional[int] = None,
143
+ image_feature_size: Optional[int] = None,
144
+ image_proj_type: Literal["linear", "mlp"] = "linear",
145
+ # FP8
146
+ linear_type: LinearType = LinearType.Normal,
147
+ fp8_accum_dtype: Optional[str] = None,
148
+ # Evaluation
149
+ eval_attention_n_bit: Optional[int] = None,
150
+ eval_mlp_n_bit: Optional[int] = None,
151
+ use_cache: bool = True,
152
+ **kwargs: Any,
153
+ ) -> None:
154
+ # max_position_embeddings is often used to determine the max length during inference,
155
+ # but Samba-style hybrid models should be able to extrapolate beyond it.
156
+ self.max_position_embeddings = max(10 * 1024 * 1024, max_position_embeddings)
157
+ self.hidden_size = hidden_size
158
+ self.rms_norm_eps = rms_norm_eps
159
+
160
+ self.num_hidden_layers = num_hidden_layers
161
+ self.num_attention_heads = num_attention_heads
162
+ self.hidden_size_per_head = hidden_size_per_head
163
+ self.num_key_value_heads = num_key_value_heads
164
+ self.attention_window_size = attention_window_size
165
+ self.full_attention_idx = full_attention_idx if full_attention_idx is not None else []
166
+ self.rope_theta = rope_theta
167
+ self.rope_local_theta = rope_local_theta
168
+
169
+ self.mamba_d_state = mamba_d_state
170
+ self.mamba_d_conv = mamba_d_conv
171
+ self.mamba_num_heads = mamba_num_heads
172
+ self.mamba_step = mamba_step
173
+ self.mamba_chunk_size = mamba_chunk_size
174
+ self.mamba_enabled = mamba_enabled
175
+
176
+ self.intermediate_size = intermediate_size
177
+
178
+ self.vocab_size = vocab_size
179
+
180
+ self.image_token_id = image_token_id
181
+ self.image_feature_size = image_feature_size
182
+ self.image_proj_type = image_proj_type
183
+
184
+ self.linear_type = linear_type
185
+ self.fp8_accum_dtype = fp8_accum_dtype
186
+
187
+ self.eval_attention_n_bit = eval_attention_n_bit
188
+ self.eval_mlp_n_bit = eval_mlp_n_bit
189
+ self.use_cache = use_cache
190
+
191
+ # fields for vLLM
192
+ self.sliding_window = attention_window_size
193
+
194
+ super().__init__(
195
+ tokenizer_class=tokenizer_class,
196
+ pad_token_id=pad_token_id,
197
+ bos_token_id=bos_token_id,
198
+ eos_token_id=eos_token_id,
199
+ tie_word_embeddings=tie_word_embeddings,
200
+ **kwargs,
201
+ )
202
+
203
+ @property
204
+ def layers_block_type(self) -> list[str]:
205
+ return ["mamba" if is_mamba(self, i) else "attention" for i in range(self.num_hidden_layers)]
206
+
207
+ @property
208
+ def rope_local_base_freq(self) -> int:
209
+ return self.rope_local_theta
210
+
211
+
212
+ class Plamo2AttentionCache(torch.nn.Module):
213
+ def __init__(self, key: torch.Tensor, value: torch.Tensor) -> None:
214
+ super().__init__()
215
+ B, nh, L, c = key.shape
216
+ assert len(value.shape) == 4
217
+ assert value.shape[0] == B
218
+ assert value.shape[2] == L
219
+ self.register_parameter("key", torch.nn.Parameter(key, requires_grad=False))
220
+ self.register_parameter("value", torch.nn.Parameter(value, requires_grad=False))
221
+
222
+
223
+ class Plamo2MambaCache(torch.nn.Module):
224
+ def __init__(self, conv_state: torch.Tensor, ssm_state: torch.Tensor) -> None:
225
+ super().__init__()
226
+ # conv_state: [B, C, d_conv]
227
+ # ssm_state: [B, nhead, nchannel_per_head, d_state]
228
+ assert len(conv_state.shape) == 3
229
+ assert len(ssm_state.shape) == 4
230
+ assert conv_state.shape[0] == ssm_state.shape[0]
231
+ self.register_parameter("conv_state", torch.nn.Parameter(conv_state, requires_grad=False))
232
+ self.register_parameter("ssm_state", torch.nn.Parameter(ssm_state, requires_grad=False))
233
+
234
+
235
+ Plamo2LayerCache = Plamo2AttentionCache | Plamo2MambaCache
236
+
237
+
238
+ class Plamo2Cache(torch.nn.Module):
239
+ """
240
+ Stores the states of the model for fast decoding.
241
+ `transformers` uses `transformers.Cache` for this purpose, but its interface and variable names are
242
+ deeply tied to the Transformer architecture (e.g., `key_states`), which makes it difficult to use
243
+ with other architectures (e.g., Mamba).
244
+ This class provides a similar interface to `transformers.Cache`, but is designed to also handle
245
+ the state of Mamba properly.
246
+ """
247
+
248
+ def __init__(self, config: Plamo2Config) -> None:
249
+ super().__init__()
250
+ self.config = config
251
+ self.cache = torch.nn.ModuleList([None for _ in range(config.num_hidden_layers)]) # type: ignore
252
+
253
+ def append_kv(self, key: torch.Tensor, value: torch.Tensor, layer_idx: int) -> tuple[torch.Tensor, torch.Tensor]:
254
+ c = self.cache[layer_idx]
255
+ if c is None:
256
+ return key, value
257
+ assert isinstance(c, Plamo2AttentionCache)
258
+
259
+ def _validate(cache: torch.Tensor, new_tensor: torch.Tensor) -> None:
260
+ assert len(cache.shape) == 4
261
+ assert len(new_tensor.shape) == 4
262
+ assert cache.shape[0] == new_tensor.shape[0]
263
+ assert cache.shape[1] == new_tensor.shape[1]
264
+ assert cache.shape[3] == new_tensor.shape[3]
265
+
266
+ _validate(c.key, key)
267
+ _validate(c.value, value)
268
+ assert key.shape[2] == value.shape[2]
269
+ return torch.cat([c.key, key], dim=2), torch.cat([c.value, value], dim=2)
270
+
271
+ def update_attention(
272
+ self, key_states: torch.Tensor, value_states: torch.Tensor, layer_idx: int
273
+ ) -> Plamo2AttentionCache:
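+ # Append the new key/value states to this layer's cache; sliding-window layers keep only
+ # the most recent `attention_window_size` positions, full-attention layers keep everything.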
274
+ full_attn = layer_idx in self.config.full_attention_idx
275
+ window_size = self.config.attention_window_size
276
+
277
+ if self.cache[layer_idx] is None:
278
+ if full_attn:
279
+ self.cache[layer_idx] = Plamo2AttentionCache(key_states, value_states)
280
+ else:
281
+ self.cache[layer_idx] = Plamo2AttentionCache(
282
+ key_states[:, :, -window_size:, :], value_states[:, :, -window_size:, :]
283
+ )
284
+ else:
285
+ c = self.cache[layer_idx]
286
+ assert isinstance(c, Plamo2AttentionCache)
287
+ k, v = self.append_kv(key_states, value_states, layer_idx)
288
+ if full_attn:
289
+ c.key.data = k
290
+ c.value.data = v
291
+ else:
292
+ c.key.data = k[:, :, -window_size:, :]
293
+ c.value.data = v[:, :, -window_size:, :]
294
+ return self.cache[layer_idx] # type: ignore
295
+
296
+ def update_mamba(self, conv_state: torch.Tensor, ssm_state: torch.Tensor, layer_idx: int) -> Plamo2MambaCache:
297
+ if self.cache[layer_idx] is None:
298
+ self.cache[layer_idx] = Plamo2MambaCache(conv_state, ssm_state)
299
+ else:
300
+ c = self.cache[layer_idx]
301
+ assert isinstance(c, Plamo2MambaCache)
302
+ assert c.conv_state.shape == conv_state.shape
303
+ assert c.ssm_state.shape == ssm_state.shape
304
+ c.conv_state.data = conv_state
305
+ c.ssm_state.data = ssm_state
306
+ return self.cache[layer_idx] # type: ignore
307
+
308
+ def __getitem__(self, layer_idx: int) -> Plamo2LayerCache | None:
309
+ assert layer_idx < len(self.cache)
310
+ layer_cache = self.cache[layer_idx]
311
+ return layer_cache # type: ignore
312
+
313
+ def __len__(self) -> int:
314
+ return len(self.cache)
315
+
316
+ def get_seq_length(self, layer_idx: Optional[int] = None) -> int:
317
+ if layer_idx is not None:
318
+ c = self.cache[layer_idx]
319
+ assert isinstance(c, Plamo2AttentionCache)
320
+ return c.key.shape[2] # type: ignore
321
+
322
+ sequence_length: int | None = None
323
+ for layer_cache in self.cache:
324
+ if isinstance(layer_cache, Plamo2AttentionCache):
325
+ sequence_length = (
326
+ max(layer_cache.key.shape[2], sequence_length)
327
+ if sequence_length is not None
328
+ else layer_cache.key.shape[2]
329
+ )
330
+ assert sequence_length is not None
331
+ return sequence_length
332
+
333
+ def get_max_length(self) -> int | None:
334
+ return None
335
+
336
+ def get_usable_length(self, new_seq_length: int, layer_idx: Optional[int] = 0) -> int:
337
+ """Given the sequence length of the new inputs, returns the usable length of the cache."""
338
+ # Cache without size limit -> all cache is usable
339
+ # Cache with size limit -> if the cache length plus the length of the new inputs is larger than the maximum cache
340
+ # length, we will need to evict part of the cache (and thus not all cache is usable)
341
+ max_length = self.get_max_length()
342
+ previous_seq_length = self.get_seq_length(layer_idx)
343
+ if max_length is not None and previous_seq_length + new_seq_length > max_length:
344
+ return max_length - new_seq_length
345
+ return previous_seq_length
346
+
347
+ def reorder_cache(self, beam_idx: torch.Tensor) -> None:
348
+ def _mamba(cache: Plamo2MambaCache) -> Plamo2MambaCache:
349
+ return Plamo2MambaCache(
350
+ conv_state=cache.conv_state.index_select(0, beam_idx),
351
+ ssm_state=cache.ssm_state.index_select(0, beam_idx),
352
+ )
353
+
354
+ def _attention(cache: Plamo2AttentionCache) -> Plamo2AttentionCache:
355
+ return Plamo2AttentionCache(
356
+ key=cache.key.index_select(0, beam_idx),
357
+ value=cache.value.index_select(0, beam_idx),
358
+ )
359
+
360
+ for i in range(len(self.cache)):
361
+ if self.cache[i] is None:
362
+ continue
363
+ layer_cache = self.cache[i]
364
+ if isinstance(layer_cache, Plamo2MambaCache):
365
+ self.cache[i] = _mamba(layer_cache)
366
+ else:
367
+ assert isinstance(layer_cache, Plamo2AttentionCache)
368
+ self.cache[i] = _attention(layer_cache)
369
+
370
+ @property
371
+ def seen_tokens(self) -> int | None:
372
+ return None
373
+
374
+
375
+ class DecoderInput(NamedTuple):
376
+ hidden_states: torch.Tensor
377
+ attention_mask: Optional[torch.Tensor] = None
378
+ past_states: Optional[Plamo2Cache] = None
379
+ output_hidden_states: Optional[bool] = False
380
+ output_attentions: Optional[bool] = False
381
+ gradient_checkpointing: bool = False
382
+ input_ids: Optional[torch.Tensor] = None
383
+
384
+
385
+ class DecoderOutput(NamedTuple):
386
+ hidden_states: torch.Tensor
387
+ all_hidden_states: Optional[Tuple[torch.Tensor, ...]]
388
+ all_self_attns: Optional[Tuple[torch.Tensor, ...]]
389
+
390
+
391
+ # Copied from transformers.models.bart.modeling_bart._make_causal_mask
392
+ def _make_causal_mask(
393
+ input_ids_shape: Tuple[int, int], dtype: torch.dtype, device: torch.device, past_key_values_length: int = 0
394
+ ) -> torch.Tensor:
395
+ """
396
+ Make causal mask used for bi-directional self-attention.
397
+ """
398
+ bsz, tgt_len = input_ids_shape
399
+ mask = torch.full((tgt_len, tgt_len), float("-inf"), device=device)
400
+ mask_cond = torch.arange(mask.size(-1), device=device)
401
+ mask.masked_fill_(mask_cond < (mask_cond + 1).view(mask.size(-1), 1), 0)
402
+ mask = mask.to(dtype)
403
+
404
+ if past_key_values_length > 0:
405
+ mask = torch.cat([torch.zeros(tgt_len, past_key_values_length, dtype=dtype, device=device), mask], dim=-1)
406
+ return mask[None, None, :, :].expand(bsz, 1, tgt_len, tgt_len + past_key_values_length)
407
+
408
+
409
+ # Copied from transformers.models.bart.modeling_bart._expand_mask
410
+ def _expand_mask(mask: torch.Tensor, dtype: torch.dtype, tgt_len: Optional[int] = None) -> torch.Tensor:
411
+ """
412
+ Expands attention_mask from `[bsz, seq_len]` to `[bsz, 1, tgt_seq_len, src_seq_len]`.
413
+ """
414
+ bsz, src_len = mask.size()
415
+ tgt_len = tgt_len if tgt_len is not None else src_len
416
+
417
+ expanded_mask = mask[:, None, None, :].expand(bsz, 1, tgt_len, src_len).to(dtype)
418
+
419
+ inverted_mask = 1.0 - expanded_mask
420
+
421
+ return inverted_mask.masked_fill(inverted_mask.to(torch.bool), float("-inf")) # type: ignore
422
+
423
+
424
+ def _rms_norm(
425
+ hidden_states: torch.Tensor, weight: Optional[torch.Tensor], eps: float, offset: float = 1.0
426
+ ) -> torch.Tensor:
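+ # Note: `weight` is stored as a deviation from `offset`, so a zero-initialized weight
+ # corresponds to an effective scale of `offset` (1.0 by default).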
427
+ input_dtype = hidden_states.dtype
428
+ hidden_states = hidden_states.to(torch.float32)
429
+ variance = hidden_states.pow(2).mean(-1, keepdim=True)
430
+ hidden_states = hidden_states * torch.rsqrt(variance + eps)
431
+ hidden_states = hidden_states.to(input_dtype)
432
+ if weight is not None:
433
+ hidden_states = (offset + weight) * hidden_states
434
+ return hidden_states
435
+
436
+
437
+ class RMSNorm(nn.Module):
438
+ def __init__(
439
+ self,
440
+ hidden_size: int,
441
+ eps: float = 1e-6,
442
+ offset: float = 1.0,
443
+ device: Optional[Union[torch.device, str]] = None,
444
+ ) -> None:
445
+ super().__init__()
446
+ self.weight = nn.Parameter(torch.zeros(hidden_size, device=device))
447
+ self.variance_epsilon = eps
448
+ self.offset = offset
449
+
450
+ def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
451
+ return _rms_norm(hidden_states, self.weight, self.variance_epsilon, offset=self.offset)
452
+
453
+
454
+ def get_initial_dt_bias(num_heads: int) -> torch.Tensor:
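+ # Standard Mamba dt initialization: sample dt log-uniformly in [dt_min, dt_max] and return
+ # its inverse softplus, so that softplus(dt_bias) recovers the sampled dt at initialization.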
455
+ dt_min = 0.001
456
+ dt_max = 0.1
457
+ dt = torch.exp(torch.rand(num_heads) * (math.log(dt_max) - math.log(dt_min)) + math.log(dt_min))
458
+ dt = torch.clamp(dt, 1e-4)
459
+ inv_dt = dt + torch.log(-torch.expm1(-dt))
460
+ return inv_dt
461
+
462
+
463
+ def get_initial_A(num_heads: int) -> torch.Tensor:
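+ # A_log is initialized to log(1..num_heads); the forward pass uses A = -exp(A_log),
+ # i.e. one negative scalar decay rate per head (Mamba2-style scalar A).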
464
+ A = torch.arange(1, num_heads + 1, dtype=torch.float32)
465
+ return torch.log(A)
466
+
467
+
468
+ def _bf16_supported_in_triton() -> bool:
469
+ # Newer torch (2.2.0 and later?) supports bfloat16 even on Volta GPUs,
470
+ # but triton cannot compile bf16 kernels for Volta.
471
+ major, _ = torch.cuda.get_device_capability()
472
+ return major >= 8
473
+
474
+
475
+ def _get_trition_dtype(dtype: torch.dtype) -> torch.dtype:
476
+ if dtype != torch.bfloat16:
477
+ return dtype
478
+ if _bf16_supported_in_triton():
479
+ return dtype
480
+ return torch.float32
481
+
482
+
483
+ def ssd_update_state(
484
+ ssm_state: torch.Tensor,
485
+ x: torch.Tensor,
486
+ dt: torch.Tensor,
487
+ A: torch.Tensor,
488
+ B: torch.Tensor,
489
+ C: torch.Tensor,
490
+ D: torch.Tensor,
491
+ z: torch.Tensor,
492
+ dt_bias: torch.Tensor,
493
+ dt_softplus: bool,
494
+ ) -> torch.Tensor:
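+ # Single-token recurrent step: advances `ssm_state` in place by one timestep and returns
+ # the gated output for that step (wraps mamba_ssm's selective_state_update / its reference impl).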
495
+ assert ssm_state.dtype == torch.float32
496
+ if dt.is_cuda:
497
+ dtype = _get_trition_dtype(x.dtype)
498
+ else:
499
+ dtype = x.dtype
500
+ if dt.is_cuda:
501
+ f = mamba_ssm.ops.triton.selective_state_update.selective_state_update
502
+ else:
503
+ f = mamba_ssm.ops.triton.selective_state_update.selective_state_update_ref
504
+
505
+ hidden_size_per_head = x.shape[-1]
506
+ d_state = B.shape[-1]
507
+ A = A[:, None, None].expand(-1, hidden_size_per_head, d_state).float()
508
+ dt = dt[..., None].expand(-1, -1, hidden_size_per_head)
509
+ dt_bias = dt_bias[:, None].expand(-1, hidden_size_per_head)
510
+ D = D[:, None].expand(-1, hidden_size_per_head)
511
+ assert ssm_state.dtype == torch.float32
512
+ out = f(
513
+ ssm_state,
514
+ x.to(dtype),
515
+ dt.to(dtype),
516
+ A.float(),
517
+ B.to(dtype),
518
+ C.to(dtype),
519
+ D.float(),
520
+ z.to(dtype),
521
+ dt_bias.float(),
522
+ dt_softplus=dt_softplus,
523
+ )
524
+ return out[:, None] # type: ignore
525
+
526
+
527
+ def _ssd_chunk_scan_combined_naive(
528
+ x: torch.Tensor,
529
+ dt: torch.Tensor,
530
+ A: torch.Tensor,
531
+ B: torch.Tensor,
532
+ C: torch.Tensor,
533
+ D: torch.Tensor,
534
+ z: torch.Tensor,
535
+ dt_bias: torch.Tensor,
536
+ dt_softplus: bool,
537
+ seq_idx: torch.Tensor | None,
538
+ ssm_state: torch.Tensor,
539
+ ) -> tuple[torch.Tensor, torch.Tensor]:
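+ # Reference implementation: steps through the sequence one token at a time with
+ # `ssd_update_state`, resetting `ssm_state` whenever `seq_idx` indicates a new sequence.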
540
+ assert ssm_state.dtype == torch.float32
541
+ length = x.shape[1]
542
+ ys = []
543
+ for i in range(length):
544
+ if i != 0 and seq_idx is not None:
545
+ ssm_state = torch.where(
546
+ (seq_idx[:, i - 1] != seq_idx[:, i])[:, None, None, None],
547
+ torch.zeros_like(ssm_state),
548
+ ssm_state,
549
+ )
550
+ y = ssd_update_state(
551
+ ssm_state,
552
+ x[:, i],
553
+ dt[:, i],
554
+ A,
555
+ B[:, i],
556
+ C[:, i],
557
+ D,
558
+ z=z[:, i],
559
+ dt_bias=dt_bias,
560
+ dt_softplus=dt_softplus,
561
+ )
562
+ ys.append(y)
563
+ return torch.cat(ys, dim=1), ssm_state
564
+
565
+
566
+ def _ssd_chunk_scan_combined_cpu(
567
+ x: torch.Tensor,
568
+ dt: torch.Tensor,
569
+ A: torch.Tensor,
570
+ B: torch.Tensor,
571
+ C: torch.Tensor,
572
+ chunk_size: int,
573
+ D: torch.Tensor,
574
+ z: torch.Tensor,
575
+ dt_bias: torch.Tensor,
576
+ dt_softplus: bool,
577
+ ) -> tuple[torch.Tensor, torch.Tensor]:
578
+ # (bsize, nhead, nchunk, chunk_size)
579
+ dt = dt.float() # We want high precision for this before cumsum
580
+ dt = dt.permute(0, 2, 1).unflatten(2, (-1, chunk_size)) # type: ignore
581
+ if dt_bias is not None:
582
+ dt = dt + dt_bias[None, :, None, None]
583
+ if dt_softplus:
584
+ dt = F.softplus(dt)
585
+ dA = dt * A[None, :, None, None]
586
+ dA_cumsum = torch.cumsum(dA, dim=-1)
587
+
588
+ _, _, nheads, _ = x.shape
589
+ dstate = B.shape[-1]
590
+ _ = dt.shape[2]
591
+
592
+ with torch.profiler.record_function("ssd_chunk_scan_combined_cpu_chunk_state"):
593
+ # The following is equivalent to `mamba_ssm.ops.triton.ssd_combined.chunk_state_ref(B, x, dt, dA_cumsum)`,
594
+ # but the `einsum` in that function is too slow on CPU.
595
+ x_ = torch.unflatten(x, 1, (-1, chunk_size))
596
+ assert B.shape[2] == nheads # B should be already expanded
597
+ B_ = torch.unflatten(B, 1, (-1, chunk_size)).to(x.dtype) # (bsize, nchunk, chunk_size, nheads, dstate)
598
+ decay_states = torch.exp((dA_cumsum[:, :, :, -1:] - dA_cumsum)).to(x.dtype)
599
+ dt_ = dt.to(x.dtype)
600
+
601
+ # einsum("bclhn,bhcl,bhcl,bclhp->bchpn", B_, decay_states, dt_, x_)
602
+ B_ = B_.permute(0, 1, 3, 4, 2) # bchnl
603
+ tmp = dt_ * decay_states # bhcl
604
+ tmp = tmp.permute(0, 2, 1, 3)[:, :, :, None] # bch1l
605
+ tmp = B_ * tmp # bchnl
606
+ x_ = x_.permute(0, 1, 3, 2, 4) # bchlp
607
+ tmp = tmp @ x_ # bchnp
608
+ states = tmp.permute(0, 1, 2, 4, 3) # bchpn
609
+
610
+ states_dtype = states.dtype
611
+ if states.dtype not in [torch.float32, torch.float64]:
612
+ states = states.to(torch.float32)
613
+ with torch.profiler.record_function("ssd_chunk_scan_combined_cpu_state_passing"):
614
+ out, last_state = mamba_ssm.ops.triton.ssd_combined.state_passing_ref(
615
+ states.flatten(start_dim=-2, end_dim=-1),
616
+ dA_cumsum[:, :, :, -1],
617
+ )
618
+ states = torch.unflatten(out, -1, (-1, dstate))
619
+ last_state = torch.unflatten(last_state, -1, (-1, dstate))
620
+ states = states.to(states_dtype)
621
+ with torch.profiler.record_function("ssd_chunk_scan_combined_cpu_chunk_scan"):
622
+ out = mamba_ssm.ops.triton.ssd_combined.chunk_scan_ref(B, C, x, dt, dA_cumsum, states, D=D, z=z)
623
+
624
+ return out, last_state
625
+
626
+
627
+ @torch.profiler.record_function("ssd_chunk_scan_combined")
628
+ def ssd_chunk_scan_combined(
629
+ x: torch.Tensor,
630
+ dt: torch.Tensor,
631
+ A: torch.Tensor,
632
+ B: torch.Tensor,
633
+ C: torch.Tensor,
634
+ chunk_size: int,
635
+ D: torch.Tensor,
636
+ z: torch.Tensor,
637
+ dt_bias: torch.Tensor,
638
+ dt_softplus: bool,
639
+ return_final_states: bool,
640
+ seq_idx: torch.Tensor | None,
641
+ ssm_state: torch.Tensor | None,
642
+ ) -> tuple[torch.Tensor, torch.Tensor] | torch.Tensor:
643
+ if seq_idx is not None:
644
+ assert seq_idx.dtype == torch.int32
645
+ assert ssm_state is None
646
+ assert not return_final_states
647
+ if ssm_state is not None:
648
+ assert ssm_state.dtype == torch.float32
649
+ assert seq_idx is None
650
+
651
+ length = x.shape[1]
652
+
653
+ """
654
+ The state is updated as follows:
655
+ ```
656
+ dt = softplus(dt)
657
+ dA = exp(dt * A)
658
+ state_next = state * dA + dB * x
659
+ ```
660
+
661
+ To avoid updating state, we set dt to -inf and x to 0
662
+ because `softplus(-inf) = 0` and `exp(0) = 1`
663
+ """
664
+ pad = (chunk_size - length % chunk_size) % chunk_size
665
+ x = torch.nn.functional.pad(x, pad=[0, 0, 0, 0, pad, 0], value=0.0)
666
+ dt = torch.nn.functional.pad(dt, pad=[0, 0, pad, 0], value=float("-inf"))
667
+ B = torch.nn.functional.pad(B, pad=[0, 0, 0, 0, pad, 0], value=0.0)
668
+ C = torch.nn.functional.pad(C, pad=[0, 0, 0, 0, pad, 0], value=0.0)
669
+ z = torch.nn.functional.pad(z, pad=[0, 0, 0, 0, pad, 0], value=0.0)
670
+ if seq_idx is not None:
671
+ seq_idx = torch.nn.functional.pad(seq_idx, pad=[pad, 0], value=0)
672
+
673
+ length = x.shape[1]
674
+ assert length % chunk_size == 0, (length, chunk_size)
675
+
676
+ if dt.is_cuda:
677
+ dtype = _get_trition_dtype(x.dtype)
678
+ out = mamba_ssm.ops.triton.ssd_combined.mamba_chunk_scan_combined( # type: ignore
679
+ x.to(dtype),
680
+ dt.to(dtype),
681
+ A.float(),
682
+ B.to(dtype),
683
+ C.to(dtype),
684
+ chunk_size,
685
+ D=D.float(),
686
+ z=z.to(dtype),
687
+ initial_states=ssm_state,
688
+ dt_bias=dt_bias.float(),
689
+ dt_softplus=dt_softplus,
690
+ seq_idx=seq_idx,
691
+ return_final_states=return_final_states,
692
+ )
693
+ if return_final_states:
694
+ return out[0][:, pad:], out[1]
695
+ else:
696
+ assert isinstance(out, torch.Tensor)
697
+ return out[:, pad:]
698
+ else:
699
+ if ssm_state is None and seq_idx is None:
700
+ tmp = _ssd_chunk_scan_combined_cpu(
701
+ x,
702
+ dt,
703
+ A,
704
+ B,
705
+ C,
706
+ chunk_size,
707
+ D=D,
708
+ z=z,
709
+ dt_bias=dt_bias.float(),
710
+ dt_softplus=dt_softplus,
711
+ )
712
+ else:
713
+ if ssm_state is None:
714
+ bsize, _, num_heads, channel = x.shape
715
+ state = B.shape[-1]
716
+ ssm_state = torch.zeros(bsize, num_heads, channel, state, dtype=torch.float32, device=x.device)
717
+ tmp = _ssd_chunk_scan_combined_naive(
718
+ x, dt, A, B, C, D, z=z, dt_bias=dt_bias, dt_softplus=dt_softplus, seq_idx=seq_idx, ssm_state=ssm_state
719
+ )
720
+ tmp = (tmp[0][:, pad:], tmp[1])
721
+ if return_final_states:
722
+ return tmp
723
+ else:
724
+ return tmp[0]
725
+
726
+
727
+ def _causal_conv1d_update(
728
+ conv_state: torch.Tensor, weight: torch.Tensor, xBC: torch.Tensor
729
+ ) -> tuple[torch.Tensor, torch.Tensor]:
730
+ dtype = conv_state.dtype
731
+ xBC = xBC.to(dtype)
732
+ weight = weight.to(dtype)
733
+ if conv_state.is_cuda:
734
+ x = causal_conv1d.causal_conv1d_update(
735
+ x=xBC,
736
+ conv_state=conv_state,
737
+ weight=weight[:, 0, :],
738
+ activation="silu",
739
+ )
740
+ return x, conv_state
741
+ else:
742
+ x = causal_conv1d.causal_conv1d_update_ref(
743
+ x=xBC,
744
+ conv_state=conv_state,
745
+ weight=weight[:, 0, :],
746
+ activation="silu",
747
+ )
748
+ return x, conv_state
749
+
750
+
751
+ def _causal_conv1d_naive(
752
+ conv_state: torch.Tensor, weight: torch.Tensor, x: torch.Tensor, seq_idx: torch.Tensor | None
753
+ ) -> tuple[torch.Tensor, torch.Tensor]:
754
+ length = x.shape[-1]
755
+ out = torch.zeros_like(x)
756
+ for i in range(length):
757
+ if i != 0 and seq_idx is not None:
758
+ conv_state = torch.where(
759
+ (seq_idx[:, i - 1] != seq_idx[:, i])[:, None, None],
760
+ torch.zeros_like(conv_state),
761
+ conv_state,
762
+ )
763
+ out[:, :, i : i + 1], conv_state = _causal_conv1d_update(conv_state, weight, x[:, :, i : i + 1])
764
+ return out, conv_state
765
+
766
+
767
+ @torch.profiler.record_function("causal_conv1d")
768
+ def _causal_conv1d(
769
+ conv_state: torch.Tensor | None, weight: torch.Tensor, x: torch.Tensor, seq_idx: torch.Tensor | None
770
+ ) -> tuple[torch.Tensor, torch.Tensor | None]:
771
+ dtype = x.dtype
772
+ if conv_state is not None:
773
+ dtype = conv_state.dtype
774
+ assert seq_idx is None
775
+ if seq_idx is not None:
776
+ assert seq_idx.dtype == torch.int32
777
+ assert conv_state is None
778
+ weight = weight.to(dtype)
779
+ x = x.to(dtype)
780
+
781
+ return_final_states = conv_state is not None
782
+ if weight.is_cuda:
783
+ if x.stride(1) != 1:
784
+ # to channel-last format
785
+ x = x.transpose(-1, -2).contiguous().transpose(-1, -2)
786
+ if conv_state is not None:
787
+ if conv_state.stride(1) != 1:
788
+ # to channel-last format
789
+ conv_state = conv_state.transpose(-1, -2).contiguous().transpose(-1, -2)
790
+ tmp = causal_conv1d.causal_conv1d_fn(
791
+ x=x,
792
+ weight=weight[:, 0, :],
793
+ initial_states=conv_state,
794
+ return_final_states=conv_state is not None,
795
+ activation="silu",
796
+ seq_idx=seq_idx,
797
+ )
798
+ if conv_state is not None:
799
+ x, conv_state = tmp
800
+ else:
801
+ x = tmp
802
+ else:
803
+ if seq_idx is None:
804
+ x, conv_state = causal_conv1d.causal_conv1d_ref(
805
+ x=x,
806
+ initial_states=conv_state,
807
+ return_final_states=True,
808
+ weight=weight[:, 0, :],
809
+ activation="silu",
810
+ )
811
+ else:
812
+ if conv_state is None:
813
+ bsize = x.shape[0]
814
+ dim = weight.shape[0]
815
+ d_conv = weight.shape[-1]
816
+ conv_state = torch.zeros(bsize, dim, d_conv - 1, dtype=x.dtype, device=x.device)
817
+ x, conv_state = _causal_conv1d_naive(conv_state, weight, x, seq_idx)
818
+ if return_final_states:
819
+ return x, conv_state
820
+ else:
821
+ return x, None
822
+
823
+
824
+ class Mamba(torch.nn.Module):
825
+ def __init__(self, config: Plamo2Config, layer_idx: int) -> None:
826
+ super().__init__()
827
+ self.config = config
828
+ self.layer_idx = layer_idx
829
+ self.hidden_size = config.hidden_size
830
+ self.d_state = config.mamba_d_state
831
+ self.d_conv = config.mamba_d_conv
832
+ self.chunk_size = config.mamba_chunk_size
833
+ self.num_heads = config.mamba_num_heads
834
+ # TODO add mamba_hidden_size_per_head config (?)
835
+ self.hidden_size_per_head = config.hidden_size_per_head
836
+
837
+ self.intermediate_size = self.num_heads * self.hidden_size_per_head
838
+
839
+ self.in_proj = torch.nn.Linear(self.hidden_size, 2 * self.intermediate_size, bias=False)
840
+ self.conv1d = torch.nn.Conv1d(
841
+ in_channels=self.intermediate_size,
842
+ out_channels=self.intermediate_size,
843
+ bias=False, # TODO the original implementation uses bias
844
+ kernel_size=self.d_conv,
845
+ groups=self.intermediate_size,
846
+ padding=0,
847
+ )
848
+ self.dt_dim = max(64, self.hidden_size // 16)
849
+ # Notes:
850
+ # Mamba2 removes this linear projection for simplicity (Figure 6 in the paper),
851
+ # but it may degrade the ability of context-length extrapolation.
852
+ self.bcdt_proj = torch.nn.Linear(
853
+ self.intermediate_size,
854
+ self.dt_dim + 2 * self.d_state,
855
+ bias=False,
856
+ )
857
+ self.dt_proj = torch.nn.Linear(self.dt_dim, self.num_heads, bias=False)
858
+
859
+ self.dt_bias = torch.nn.Parameter(get_initial_dt_bias(self.num_heads))
860
+ self.A_log = torch.nn.Parameter(get_initial_A(self.num_heads))
861
+ self.D = torch.nn.Parameter(torch.ones(self.num_heads))
862
+
863
+ # TODO norm weight before gating like Mamba2
864
+ self.dt_norm_weight = torch.nn.Parameter(torch.ones(self.dt_dim))
865
+ self.B_norm_weight = torch.nn.Parameter(torch.ones(self.d_state))
866
+ self.C_norm_weight = torch.nn.Parameter(torch.ones(self.d_state))
867
+
868
+ self.out_proj = torch.nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
869
+
870
+ def _no_weight_decay_param_names(self) -> set[str]:
871
+ return set(["D", "dt_bias", "A_log"])
872
+
873
+ def forward(
874
+ self,
875
+ hidden_states: torch.Tensor,
876
+ attention_mask: Optional[torch.Tensor] = None,
877
+ past_states: Optional[Plamo2Cache] = None,
878
+ ) -> Tuple[torch.Tensor, Optional[Plamo2Cache]]:
879
+ bsize, length, _ = hidden_states.shape
880
+ is_update = length == 1 and past_states is not None
881
+
882
+ bool_mask: torch.Tensor | None = None
883
+ seq_idx: torch.Tensor | None = None
884
+ if attention_mask is not None:
885
+ if len(attention_mask.shape) == 2:
886
+ attention_mask = attention_mask[None, None].expand(bsize, 1, -1, -1)
887
+ assert len(attention_mask.shape) == 4
888
+
889
+ if past_states is None:
890
+ # TODO: support seq_idx with cache
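+ # Build per-token sequence ids from the packed attention mask: `seq_idx` increases by one
+ # at every "first token", and the conv/SSM scans reset their state whenever it changes.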
891
+ bool_mask_4d = attention_mask == 0
892
+ is_first_token = _is_first_token(bool_mask_4d)[:, 0, :]
893
+ seq_idx = torch.cumsum(is_first_token, dim=-1) - 1
894
+ seq_idx = seq_idx.to(torch.int32)
895
+
896
+ # The `generate` function creates an attention mask that also covers past tokens,
897
+ # but Mamba does not use them.
898
+ attention_mask = attention_mask[:, 0, -length:, -length:]
899
+ bool_mask = torch.diagonal(attention_mask, dim1=-2, dim2=-1) == 0
900
+
901
+ conv_state: torch.Tensor | None
902
+ ssm_state: torch.Tensor | None
903
+ if past_states is None:
904
+ conv_state = None
905
+ ssm_state = None
906
+ elif past_states[self.layer_idx] is None:
907
+ conv_state = torch.zeros(
908
+ bsize, self.intermediate_size, self.d_conv - 1, dtype=hidden_states.dtype, device=hidden_states.device
909
+ )
910
+ ssm_state = torch.zeros(
911
+ bsize,
912
+ self.num_heads,
913
+ self.hidden_size_per_head,
914
+ self.d_state,
915
+ dtype=torch.float32,
916
+ device=hidden_states.device,
917
+ )
918
+ else:
919
+ c = past_states[self.layer_idx]
920
+ assert isinstance(c, Plamo2MambaCache)
921
+ conv_state = c.conv_state
922
+ ssm_state = c.ssm_state
923
+
924
+ zx = self.in_proj(hidden_states)
925
+ zx = zx.reshape(bsize, length, self.num_heads, -1)
926
+ # z: (bsize, length, num_heads, hidden_size_per_head)
927
+ # x: (bsize, length, num_heads, hidden_size_per_head)
928
+ z, x = torch.split(zx, [self.hidden_size_per_head, self.hidden_size_per_head], dim=-1)
929
+
930
+ # conv
931
+ x = x.reshape(bsize, length, -1).transpose(1, 2) # (bsize, intermediate_size, length)
932
+ if bool_mask is not None:
933
+ x = torch.where(bool_mask[:, None, :], x, 0.0)
934
+ if is_update:
935
+ assert conv_state is not None
936
+ x, conv_state = _causal_conv1d_update(conv_state, self.conv1d.weight, x)
937
+ else:
938
+ x, conv_state = _causal_conv1d(conv_state, self.conv1d.weight, x, seq_idx=seq_idx)
939
+ x = x.to(dtype=hidden_states.dtype)
940
+ x = x.transpose(1, 2) # (bsize, length, intermediate_size)
941
+ x = x.reshape(bsize, length, -1)
942
+ # x: (bsize, length, num_heads, hidden_size_per_head)
943
+ # B: (bsize, length, 1, d_state)
944
+ # C: (bsize, length, 1, d_state)
945
+ # dt: (bsize, length, dt_dim)
946
+ BCdt = self.bcdt_proj(x)
947
+ x = x.reshape(bsize, length, self.num_heads, -1)
948
+ B, C, dt = torch.split(BCdt, [self.d_state, self.d_state, self.dt_dim], dim=-1)
949
+ B = B[:, :, None, :]
950
+ C = C[:, :, None, :]
951
+
952
+ A = -torch.exp(self.A_log.float()) # (num_heads,)
953
+ dt = _rms_norm(dt, None, self.config.rms_norm_eps) * self.dt_norm_weight[None, None, :]
954
+ B = _rms_norm(B, None, self.config.rms_norm_eps) * self.B_norm_weight[None, None, None, :]
955
+ C = _rms_norm(C, None, self.config.rms_norm_eps) * self.C_norm_weight[None, None, None, :]
956
+
957
+ # (bsize, length, num_heads, 1)
958
+ dt = self.dt_proj(dt)[..., None]
959
+
960
+ # TODO it may not be required
961
+ B = B.expand(-1, -1, self.num_heads, -1)
962
+ C = C.expand(-1, -1, self.num_heads, -1)
963
+
964
+ if bool_mask is not None:
965
+ """
966
+ The state is updated as follows:
967
+ ```
968
+ dt = softplus(dt)
969
+ dA = exp(dt * A)
970
+ state_next = state * dA + dB * x
971
+ ```
972
+
973
+ To avoid updating state, we set dt to -inf and x to 0
974
+ because `softplus(-inf) = 0` and `exp(0) = 1`
975
+ """
976
+ dt = torch.where(bool_mask[:, :, None, None], dt, float("-inf"))
977
+ x = torch.where(bool_mask[:, :, None, None], x, 0.0)
978
+
979
+ # ssm
980
+ if is_update:
981
+ assert ssm_state is not None
982
+ out = ssd_update_state(
983
+ ssm_state,
984
+ x[:, 0],
985
+ dt[:, 0].reshape(bsize, -1),
986
+ A,
987
+ B[:, 0],
988
+ C[:, 0],
989
+ D=self.D,
990
+ z=z[:, 0],
991
+ dt_bias=self.dt_bias,
992
+ dt_softplus=True,
993
+ )
994
+ else:
995
+ tmp = ssd_chunk_scan_combined(
996
+ x,
997
+ dt.reshape(bsize, length, -1),
998
+ A,
999
+ B,
1000
+ C,
1001
+ self.chunk_size,
1002
+ D=self.D,
1003
+ z=z,
1004
+ dt_bias=self.dt_bias,
1005
+ dt_softplus=True,
1006
+ return_final_states=past_states is not None,
1007
+ seq_idx=seq_idx,
1008
+ ssm_state=ssm_state,
1009
+ )
1010
+ if past_states is not None:
1011
+ out, ssm_state = tmp
1012
+ else:
1013
+ assert isinstance(tmp, torch.Tensor)
1014
+ out = tmp
1015
+
1016
+ y = self.out_proj(out.reshape(bsize, length, -1))
1017
+
1018
+ if past_states is not None:
1019
+ assert ssm_state is not None
1020
+ assert conv_state is not None
1021
+ past_states.update_mamba(conv_state, ssm_state, self.layer_idx)
1022
+
1023
+ return y, past_states
1024
+
1025
+
1026
+ def swa_mask(q_len: int, kv_len: int, device: torch.device, window_size: int) -> torch.Tensor:
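+ # Boolean band mask of shape (q_len, kv_len): True where the query and key positions
+ # (aligned at the sequence end) are within `window_size` of each other. Causality is
+ # enforced separately by the causal attention mask this is combined with.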
1027
+ max_len = max(q_len, kv_len)
1028
+ mask = (
1029
+ torch.ones(max_len, max_len, dtype=torch.bool, device=device)
1030
+ .triu(diagonal=-window_size)
1031
+ .tril(diagonal=window_size)
1032
+ )
1033
+ return mask[-q_len:, -kv_len:]
1034
+
1035
+
1036
+ class Attention(torch.nn.Module):
1037
+ def __init__(self, config: Plamo2Config, layer_idx: int) -> None:
1038
+ super().__init__()
1039
+ self.config = config
1040
+ self.layer_idx = layer_idx
1041
+ self.hidden_size = config.hidden_size
1042
+ head_dim = config.hidden_size_per_head
1043
+ self.max_position_embeddings = config.max_position_embeddings
1044
+
1045
+ self.q_num_heads = config.num_attention_heads
1046
+ self.qk_dim = self.v_dim = head_dim
1047
+ self.k_num_heads = self.v_num_heads = config.num_key_value_heads
1048
+ assert self.q_num_heads % self.k_num_heads == 0
1049
+ self.n_group = self.q_num_heads // self.k_num_heads
1050
+
1051
+ self.q_proj_dim = self.q_num_heads * self.qk_dim
1052
+ self.k_proj_dim = self.k_num_heads * self.qk_dim
1053
+ self.v_proj_dim = self.k_num_heads * self.v_dim
1054
+ self.qkv_proj = nn.Linear(self.hidden_size, self.q_proj_dim + self.k_proj_dim + self.v_proj_dim, bias=False)
1055
+ self.o_proj = nn.Linear(self.q_num_heads * self.v_dim, self.hidden_size, bias=False)
1056
+
1057
+ self.q_weight = torch.nn.Parameter(torch.ones((self.q_num_heads, self.qk_dim)))
1058
+ self.k_weight = torch.nn.Parameter(torch.ones((self.k_num_heads, self.qk_dim)))
1059
+
1060
+ self.full_attn = self.layer_idx in self.config.full_attention_idx
1061
+ base = self.config.rope_theta if self.full_attn else self.config.rope_local_theta
1062
+ self.rotary_emb = RotaryEmbedding(
1063
+ self.qk_dim, max_position_embeddings=self.config.attention_window_size, base=base
1064
+ )
1065
+
1066
+ def forward(
1067
+ self,
1068
+ hidden_states: torch.Tensor,
1069
+ attention_mask: Optional[torch.Tensor] = None,
1070
+ past_states: Optional[Plamo2Cache] = None,
1071
+ output_attentions: bool = False,
1072
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Plamo2Cache]]:
1073
+ bsz, q_len, _ = hidden_states.size()
1074
+
1075
+ qkv = self.qkv_proj(hidden_states)
1076
+ query_states, key_states, value_states = torch.split(
1077
+ qkv, [self.q_proj_dim, self.k_proj_dim, self.v_proj_dim], dim=-1
1078
+ )
1079
+ query_states = query_states.view(bsz, q_len, self.q_num_heads, self.qk_dim).transpose(1, 2)
1080
+ key_states = key_states.view(bsz, q_len, self.k_num_heads, self.qk_dim).transpose(1, 2)
1081
+ value_states = value_states.view(bsz, q_len, self.v_num_heads, self.v_dim).transpose(1, 2)
1082
+
1083
+ attn_dtype = query_states.dtype
1084
+
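+ # Per-head QK normalization: RMS-normalize queries and keys, then scale each head by a
+ # learned weight (commonly used to stabilize attention logits).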
1085
+ query_states = _rms_norm(query_states, None, 1e-6) * self.q_weight[None, :, None]
1086
+ key_states = _rms_norm(key_states, None, 1e-6) * self.k_weight[None, :, None]
1087
+
1088
+ if past_states is not None:
1089
+ # reuse k, v, self_attention
1090
+ key_states_new = key_states
1091
+ value_states_new = value_states
1092
+ key_states, value_states = past_states.append_kv(key_states, value_states, self.layer_idx) # type: ignore
1093
+ past_states.update_attention(key_states_new, value_states_new, self.layer_idx)
1094
+
1095
+ kv_seq_len = key_states.shape[-2]
1096
+ device = hidden_states.device
1097
+ position_ids = torch.arange(kv_seq_len, dtype=torch.long, device=device)[None]
1098
+ q_position_ids = position_ids[:, -query_states.shape[2] :]
1099
+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
1100
+ query_states = _rotary_pos_emb(query_states, cos, sin, q_position_ids)
1101
+ key_states = _rotary_pos_emb(key_states, cos, sin, position_ids)
1102
+ # [bsz, nh, t, hd]
1103
+
1104
+ def _expand_kv(t: torch.Tensor, repeat: int, target: int) -> torch.Tensor:
1105
+ t = torch.repeat_interleave(t, repeat, dim=1)
1106
+ return t[:, :target]
1107
+
1108
+ # expand shared kv
1109
+ assert self.k_num_heads == self.v_num_heads
1110
+ key_states = _expand_kv(key_states, self.n_group, self.q_num_heads)
1111
+ value_states = _expand_kv(value_states, self.n_group, self.q_num_heads)
1112
+
1113
+ query_states = query_states.to(attn_dtype)
1114
+ key_states = key_states.to(attn_dtype)
1115
+ value_states = value_states.to(attn_dtype)
1116
+ if attention_mask is not None and attention_mask.dtype != torch.bool:
1117
+ attention_mask = attention_mask.to(attn_dtype)
1118
+ if attention_mask is None:
1119
+ if not self.full_attn:
1120
+ assert key_states.shape[2] <= self.config.attention_window_size + 1
1121
+ attn_output = F.scaled_dot_product_attention(query_states, key_states, value_states, is_causal=True)
1122
+ else:
1123
+ if attention_mask.dtype == torch.bool:
1124
+ attention_mask = torch.where(attention_mask, torch.tensor(0.0, dtype=torch.float), float("-inf"))
1125
+ if len(attention_mask.shape) == 2:
1126
+ attention_mask = attention_mask[None, None]
1127
+ assert len(attention_mask.shape) == 4
1128
+
1129
+ if not self.full_attn:
1130
+ m_swa = swa_mask(
1131
+ query_states.shape[2], key_states.shape[2], query_states.device, self.config.attention_window_size
1132
+ )
1133
+ # The `generate` function creates an attention mask that does not account for the sliding window.
1134
+ m_swa = m_swa[None, None]
1135
+ attention_mask = attention_mask[:, :, -query_states.shape[2] :, -key_states.shape[2] :]
1136
+ attention_mask = torch.where(m_swa, attention_mask, float("-inf"))
1137
+
1138
+ # Like AttentionMaskConverter._unmask_unattended in huggingface transformers,
1139
+ # we need to attend to all tokens in fully masked rows for `scaled_dot_product_attention`.
1140
+ bool_mask = torch.logical_not(torch.isneginf(attention_mask))
1141
+ valid_tokens = torch.sum(bool_mask, dim=-1).bool() # (..., q_len)
1142
+ attention_mask = torch.where(valid_tokens[..., None], attention_mask, float(0.0))
1143
+ attn_output = F.scaled_dot_product_attention(
1144
+ query_states, key_states, value_states, attn_mask=attention_mask
1145
+ )
1146
+
1147
+ attn_output = attn_output.transpose(1, 2)
1148
+
1149
+ attn_output = attn_output.reshape(bsz, q_len, self.q_num_heads * self.v_dim)
1150
+ attn_output = self.o_proj(attn_output)
1151
+
1152
+ if not output_attentions:
1153
+ attn_weights = None
1154
+
1155
+ return attn_output, attn_weights, past_states
1156
+
1157
+
1158
+ class MLP(nn.Module):
1159
+ def __init__(self, config: Plamo2Config) -> None:
1160
+ super().__init__()
1161
+ self.config = config
1162
+ self.hidden_size = config.hidden_size
1163
+ self.intermediate_size = config.intermediate_size
1164
+ self.gate_up_proj = torch.nn.Linear(self.hidden_size, self.intermediate_size * 2, bias=False)
1165
+ self.down_proj = torch.nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
1166
+
1167
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
1168
+ h = self.gate_up_proj(x)
1169
+ h = _swiglu(h)
1170
+ return self.down_proj(h) # type: ignore
1171
+
1172
+
1173
+ class Plamo2DecoderLayer(torch.nn.Module):
1174
+ def __init__(self, config: Plamo2Config, layer_idx: int) -> None:
1175
+ super().__init__()
1176
+ self.config = config
1177
+ self.hidden_size = config.hidden_size
1178
+ self.is_mamba = config.layers_block_type[layer_idx] == "mamba"
1179
+ self.mixer: torch.nn.Module
1180
+ if self.is_mamba:
1181
+ self.mixer = Mamba(config, layer_idx)
1182
+ else:
1183
+ self.mixer = Attention(config, layer_idx)
1184
+ self.mlp = MLP(config)
1185
+ """
1186
+ Notes: Model performance degraded when all offsets were set to 1.
1187
+ """
1188
+ self.pre_mixer_norm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps, offset=1.0)
1189
+ self.post_mixer_norm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps, offset=1.0 / 5)
1190
+ self.pre_mlp_norm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps, offset=1.0)
1191
+ self.post_mlp_norm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps, offset=1.0 / (5**1.5))
1192
+
1193
+ def forward(
1194
+ self,
1195
+ hidden_states: torch.Tensor,
1196
+ attention_mask: Optional[torch.Tensor] = None,
1197
+ past_state: Optional[Plamo2Cache] = None,
1198
+ output_attentions: Optional[bool] = False,
1199
+ ) -> Tuple[Any, ...]:
1200
+ # from LlamaDecoder
1201
+ residual = hidden_states
1202
+ hidden_states = self.pre_mixer_norm(hidden_states)
1203
+
1204
+ # Self Attention
1205
+ if self.is_mamba:
1206
+ hidden_states_sa, present_key_value = self.mixer(
1207
+ hidden_states=hidden_states,
1208
+ attention_mask=attention_mask,
1209
+ past_states=past_state,
1210
+ )
1211
+ self_attn_weights = None
1212
+ else:
1213
+ hidden_states_sa, self_attn_weights, present_key_value = self.mixer(
1214
+ hidden_states=hidden_states,
1215
+ attention_mask=attention_mask,
1216
+ past_states=past_state,
1217
+ output_attentions=output_attentions,
1218
+ )
1219
+
1220
+ hidden_states_sa = self.post_mixer_norm(hidden_states_sa)
1221
+ hidden_states = residual + hidden_states_sa
1222
+
1223
+ residual = hidden_states
1224
+ hidden_states = self.pre_mlp_norm(hidden_states)
1225
+
1226
+ # Fully Connected
1227
+ hidden_states_mlp = self.mlp(hidden_states)
1228
+
1229
+ # Residual
1230
+ hidden_states_mlp = self.post_mlp_norm(hidden_states_mlp)
1231
+ hidden_states = residual + hidden_states_mlp
1232
+
1233
+ outputs: Any = (hidden_states,)
1234
+
1235
+ if output_attentions:
1236
+ outputs += (self_attn_weights,)
1237
+
1238
+ return outputs # type: ignore
1239
+
1240
+
1241
+ def is_mamba(config: Plamo2Config, i: int) -> bool:
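+ # Layer pattern: with the default mamba_step=2, Mamba and attention layers alternate
+ # (attention at indices where i % mamba_step == mamba_step // 2, Mamba elsewhere);
+ # models with at most mamba_step // 2 layers use attention only in the final layer.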
1242
+ if not config.mamba_enabled:
1243
+ return False
1244
+ assert config.mamba_step > 1
1245
+ assert i < config.num_hidden_layers
1246
+
1247
+ if config.num_hidden_layers <= (config.mamba_step // 2):
1248
+ # use attention in last layer
1249
+ return i != config.num_hidden_layers - 1
1250
+ return (i % config.mamba_step) != (config.mamba_step // 2)
1251
+
1252
+
1253
+ class Plamo2Decoder(torch.nn.Module):
1254
+ def __init__(self, config: Plamo2Config) -> None:
1255
+ super().__init__()
1256
+
1257
+ self.layers = torch.nn.ModuleList(
1258
+ [Plamo2DecoderLayer(config, layer_idx=i) for i in range(config.num_hidden_layers)]
1259
+ )
1260
+ self.gradient_checkpointing = False
1261
+
1262
+ def forward(self, x: DecoderInput) -> DecoderOutput:
1263
+ all_hidden_states: Optional[Tuple[torch.Tensor, ...]] = () if x.output_hidden_states else None
1264
+ all_self_attns: Optional[Tuple[torch.Tensor, ...]] = () if x.output_attentions else None
1265
+ hidden_states = x.hidden_states
1266
+
1267
+ for decoder_layer in self.layers:
1268
+ if x.output_hidden_states:
1269
+ assert all_hidden_states is not None
1270
+ all_hidden_states += (hidden_states,)
1271
+
1272
+ if self.training and x.gradient_checkpointing:
1273
+ layer_outputs = self._gradient_checkpointing_func(
1274
+ decoder_layer.__call__,
1275
+ hidden_states,
1276
+ x.attention_mask,
1277
+ x.past_states,
1278
+ x.output_attentions,
1279
+ )
1280
+ else:
1281
+ layer_outputs = decoder_layer(
1282
+ hidden_states,
1283
+ attention_mask=x.attention_mask,
1284
+ past_state=x.past_states,
1285
+ output_attentions=x.output_attentions,
1286
+ )
1287
+
1288
+ hidden_states = layer_outputs[0]
1289
+
1290
+ if x.output_attentions:
1291
+ assert layer_outputs[1] is not None
1292
+ assert all_self_attns is not None
1293
+ all_self_attns += (layer_outputs[1],)
1294
+ return DecoderOutput(hidden_states, all_hidden_states, all_self_attns)
1295
+
1296
+
1297
+ class Plamo2PreTrainedModel(PreTrainedModel): # type: ignore
1298
+ config_class = Plamo2Config
1299
+ _no_split_modules: List[str]
1300
+ base_model_prefix = "model"
1301
+ supports_gradient_checkpointing = True
1302
+ _no_split_modules = ["PlamoDecoderLayer"]
1303
+ _skip_keys_device_placement = "past_key_values"
1304
+ _keys_to_ignore_on_load_unexpected = [r"decoder\.version"]
1305
+
1306
+ def _init_weights(self, module: torch.nn.Module) -> None:
1307
+ std = 0.02
1308
+ if isinstance(module, nn.Linear):
1309
+ module.weight.data.normal_(mean=0.0, std=std)
1310
+ if module.bias is not None:
1311
+ module.bias.data.zero_()
1312
+ elif isinstance(module, nn.Embedding):
1313
+ module.weight.data.normal_(mean=0.0, std=std)
1314
+ if module.padding_idx is not None:
1315
+ module.weight.data[module.padding_idx].zero_()
1316
+
1317
+
1318
+ class Plamo2Model(Plamo2PreTrainedModel):
1319
+ def __init__(self, config: Plamo2Config):
1320
+ super().__init__(config)
1321
+ assert config.eval_attention_n_bit is None
1322
+ assert config.eval_mlp_n_bit is None
1323
+
1324
+ self.padding_idx = config.pad_token_id
1325
+ self.vocab_size = config.vocab_size
1326
+
1327
+ self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
1328
+ if config.image_feature_size is not None:
1329
+ if config.image_proj_type == "mlp":
1330
+ self.image_proj = MLPImageProjector(config) # type: ignore
1331
+ elif config.image_proj_type == "linear":
1332
+ self.image_proj = nn.Linear(config.image_feature_size, config.hidden_size, bias=False) # type: ignore
1333
+ else:
1334
+ raise ValueError(f"Unknown image_proj_type: {config.image_proj_type}")
1335
+ self.layers = Plamo2Decoder(config) # type: ignore
1336
+ self.norm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
1337
+
1338
+ self.gradient_checkpointing = False
1339
+ # Initialize weights and apply final processing
1340
+ self.post_init()
1341
+
1342
+ def get_input_embeddings(self) -> torch.nn.Embedding:
1343
+ return self.embed_tokens
1344
+
1345
+ def set_input_embeddings(self, value: torch.nn.Embedding) -> None:
1346
+ self.embed_tokens = value
1347
+
1348
+ # Copied from transformers.models.bart.modeling_bart.BartDecoder._prepare_decoder_attention_mask
1349
+ def _prepare_decoder_attention_mask(
1350
+ self,
1351
+ attention_mask: torch.Tensor,
1352
+ input_shape: Tuple[int, int],
1353
+ inputs_embeds: Optional[torch.Tensor],
1354
+ past_key_values_length: int,
1355
+ ) -> Optional[torch.Tensor]:
1356
+ # create causal mask
1357
+ # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
1358
+ combined_attention_mask: Optional[torch.Tensor] = None
1359
+ if input_shape[-1] > 1:
1360
+ assert inputs_embeds is not None
1361
+ combined_attention_mask = _make_causal_mask(
1362
+ input_shape,
1363
+ inputs_embeds.dtype,
1364
+ device=inputs_embeds.device,
1365
+ past_key_values_length=past_key_values_length,
1366
+ )
1367
+ input_shape = (input_shape[0], combined_attention_mask.shape[2])
1368
+
1369
+ if attention_mask is not None:
1370
+ if attention_mask.dim() == 4:
1371
+ # Custom 4D attention mask
1372
+ expanded_attn_mask = attention_mask
1373
+ else:
1374
+ # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
1375
+ assert inputs_embeds is not None
1376
+ expanded_attn_mask = _expand_mask(attention_mask, inputs_embeds.dtype, tgt_len=input_shape[-1]).to(
1377
+ inputs_embeds.device
1378
+ )
1379
+ combined_attention_mask = (
1380
+ expanded_attn_mask if combined_attention_mask is None else expanded_attn_mask + combined_attention_mask
1381
+ )
1382
+
1383
+ return combined_attention_mask
1384
+
1385
+ def forward(
1386
+ self,
1387
+ input_ids: Optional[torch.LongTensor] = None,
1388
+ attention_mask: Optional[torch.Tensor] = None,
1389
+ position_ids: Optional[torch.Tensor] = None,
1390
+ past_key_values: Optional[Plamo2Cache] = None,
1391
+ inputs_embeds: Optional[torch.Tensor] = None,
1392
+ image_features: Optional[torch.Tensor] = None,
1393
+ use_cache: Optional[bool] = None,
1394
+ output_attentions: Optional[bool] = None,
1395
+ output_hidden_states: Optional[bool] = None,
1396
+ return_dict: Optional[bool] = None,
1397
+ cache_position: Optional[torch.LongTensor] = None,
1398
+ **kwargs: Any,
1399
+ ) -> Union[Tuple, BaseModelOutputWithPast]:
1400
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
1401
+ output_hidden_states = (
1402
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
1403
+ )
1404
+ use_cache = use_cache if use_cache is not None else self.config.use_cache
1405
+
1406
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1407
+
1408
+ # retrieve input_ids and inputs_embeds
1409
+ if (input_ids is None) ^ (inputs_embeds is not None):
1410
+ raise ValueError("You must specify exactly one of input_ids or inputs_embeds")
1411
+
1412
+ if self.gradient_checkpointing and self.training and use_cache:
1413
+ use_cache = False
1414
+
1415
+ if inputs_embeds is None:
1416
+ inputs_embeds = self.embed_tokens(input_ids)
1417
+ batch_size, seq_length, _ = inputs_embeds.shape
1418
+
1419
+ seq_length_with_past = seq_length
1420
+ past_key_values_length = 0
1421
+ if past_key_values is not None:
1422
+ past_key_values_length = past_key_values.get_seq_length()
1423
+ seq_length_with_past = seq_length_with_past + past_key_values_length
1424
+ assert cache_position is None, "cache_position is not supported yet"
1425
+
1426
+ if image_features is not None:
1427
+ assert self.config.image_token_id is not None
1428
+ image_embeds = self.image_proj(image_features)
1429
+ assert image_embeds.shape == inputs_embeds.shape, (image_embeds.shape, inputs_embeds.shape)
1430
+ mask = input_ids == self.config.image_token_id
1431
+ inputs_embeds[mask] = image_embeds[mask]
1432
+
1433
+ # embed positions
1434
+ require_attn_mask = False
1435
+ if not self.training or past_key_values is not None:
1436
+ require_attn_mask = True
1437
+ if seq_length_with_past >= self.config.attention_window_size:
1438
+ require_attn_mask = True
1439
+ if require_attn_mask and attention_mask is None:
1440
+ attention_mask = torch.ones(
1441
+ (batch_size, seq_length_with_past), dtype=torch.bool, device=inputs_embeds.device
1442
+ )
1443
+ if attention_mask is not None:
1444
+ attention_mask = self._prepare_decoder_attention_mask(
1445
+ attention_mask, (batch_size, seq_length), inputs_embeds, past_key_values_length
1446
+ )
1447
+
1448
+ hidden_states = inputs_embeds
1449
+
1450
+ if use_cache and past_key_values is None:
1451
+ past_key_values = Plamo2Cache(self.config)
1452
+
1453
+ # decoder layers
1454
+ out = self.layers(
1455
+ DecoderInput(
1456
+ hidden_states,
1457
+ attention_mask,
1458
+ past_key_values,
1459
+ output_hidden_states,
1460
+ output_attentions,
1461
+ self.gradient_checkpointing,
1462
+ )
1463
+ )
1464
+ assert isinstance(out, DecoderOutput)
1465
+ hidden_states = out.hidden_states
1466
+ all_hidden_states = out.all_hidden_states
1467
+ all_self_attns = out.all_self_attns
1468
+
1469
+ hidden_states = self.norm(hidden_states)
1470
+
1471
+ # add hidden states from the last decoder layer
1472
+ if output_hidden_states:
1473
+ assert all_hidden_states is not None
1474
+ all_hidden_states += (hidden_states,)
1475
+
1476
+ if not return_dict:
1477
+ return tuple(
1478
+ v for v in [hidden_states, past_key_values, all_hidden_states, all_self_attns] if v is not None
1479
+ )
1480
+ return BaseModelOutputWithPast(
1481
+ last_hidden_state=hidden_states,
1482
+ past_key_values=past_key_values,
1483
+ hidden_states=all_hidden_states,
1484
+ attentions=all_self_attns,
1485
+ )
1486
+
1487
+
1488
+ class Plamo2ForCausalLM(Plamo2PreTrainedModel):
1489
+ _tied_weights_keys = ["lm_head.weight"]
1490
+
1491
+ # Without this, the model cannot be loaded into a meta device.
1492
+ # Relevant code:
1493
+ # https://github.com/huggingface/transformers/blob/v4.44.2/src/transformers/modeling_utils.py#L4376-L4381
1494
+ # https://github.com/huggingface/transformers/blob/v4.44.2/src/transformers/modeling_utils.py#L356
1495
+ # https://github.com/pytorch/pytorch/blob/v2.4.1/torch/nn/modules/module.py#L2068
1496
+ _supports_param_buffer_assignment = False
1497
+
1498
+ def __init__(self, config: Plamo2Config) -> None:
1499
+ super().__init__(config)
1500
+ self.model = Plamo2Model(config)
1501
+
1502
+ self.vocab_size = config.vocab_size
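+ # The lm_head output dimension is padded up to a multiple of 16 (the extra logits are
+ # sliced off again in `forward`); this is likely for hardware-friendly matmul shapes.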
1503
+ vocab_size = ((self.vocab_size + 15) // 16) * 16
1504
+ self.lm_head: torch.nn.Module = nn.Linear(config.hidden_size, vocab_size, bias=False)
1505
+
1506
+ # Initialize weights and apply final processing
1507
+ self.post_init()
1508
+
1509
+ def get_input_embeddings(self) -> torch.nn.Embedding:
1510
+ return self.model.embed_tokens
1511
+
1512
+ def set_input_embeddings(self, value: torch.nn.Embedding) -> None:
1513
+ self.model.embed_tokens = value
1514
+
1515
+ def get_output_embeddings(self) -> torch.nn.Module:
1516
+ return self.lm_head
1517
+
1518
+ def set_output_embeddings(self, new_embeddings: torch.nn.Module) -> None:
1519
+ self.lm_head = new_embeddings
1520
+
1521
+ def set_decoder(self, decoder: Plamo2Model) -> None:
1522
+ self.model = decoder
1523
+
1524
+ def get_decoder(self) -> Plamo2Model:
1525
+ return self.model
1526
+
1527
+ def forward( # type: ignore
1528
+ self,
1529
+ input_ids: Optional[torch.LongTensor] = None,
1530
+ attention_mask: Optional[torch.Tensor] = None,
1531
+ position_ids: Optional[torch.Tensor] = None,
1532
+ past_key_values: Optional[Plamo2Cache] = None,
1533
+ inputs_embeds: Optional[torch.FloatTensor] = None,
1534
+ image_features: Optional[torch.Tensor] = None,
1535
+ labels: Optional[torch.LongTensor] = None,
1536
+ use_cache: Optional[bool] = None,
1537
+ output_attentions: Optional[bool] = None,
1538
+ output_hidden_states: Optional[bool] = None,
1539
+ return_dict: Optional[bool] = None,
1540
+ cache_position: Optional[torch.LongTensor] = None,
1541
+ logits_to_keep: int | torch.Tensor = 0,
1542
+ **kwargs: Any,
1543
+ ) -> Union[Tuple, CausalLMOutputWithPast]:
1544
+ r"""
1545
+ Args:
1546
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
1547
+ Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
1548
+ config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
1549
+ (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
1550
+
1551
+ Returns:
1552
+
1553
+ Example:
1554
+
1555
+ ```python
1556
+ >>> from transformers import AutoTokenizer, AutoModelForCausalLM
1557
+
1558
+ >>> model = AutoModelForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS, trust_remote_code=True)
1559
+ >>> tokenizer = AutoTokenizer.from_pretrained(PATH_TO_CONVERTED_TOKENIZER, trust_remote_code=True)
1560
+
1561
+ >>> prompt = "Hey, are you consciours? Can you talk to me?"
1562
+ >>> inputs = tokenizer(prompt, return_tensors="pt")
1563
+
1564
+ >>> # Generate
1565
+ >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
1566
+ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
1567
+ "Hey, are you consciours? Can you talk to me?\nI'm not consciours, but I can talk to you."
1568
+ ```"""
1569
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
1570
+ output_hidden_states = (
1571
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
1572
+ )
1573
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1574
+
1575
+ # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
1576
+ outputs = self.model(
1577
+ input_ids=input_ids,
1578
+ attention_mask=attention_mask,
1579
+ position_ids=position_ids,
1580
+ past_key_values=past_key_values,
1581
+ inputs_embeds=inputs_embeds,
1582
+ image_features=image_features,
1583
+ use_cache=use_cache,
1584
+ output_attentions=output_attentions,
1585
+ output_hidden_states=output_hidden_states,
1586
+ return_dict=return_dict,
1587
+ cache_position=cache_position,
1588
+ **kwargs,
1589
+ )
1590
+
1591
+ hidden_states = outputs[0]
1592
+ logits = self.lm_head(hidden_states)
1593
+ slice_indices = slice(-logits_to_keep, None) if isinstance(logits_to_keep, int) else logits_to_keep
1594
+ logits = logits[:, slice_indices, : self.vocab_size]
1595
+
1596
+ loss = None
1597
+ if labels is not None:
1598
+ if len(kwargs) > 0 and set(kwargs.keys()) != set(["ignore_index"]):
1599
+ warnings.warn(
1600
+ f"The following kwargs may not be supported: {', '.join(kwargs.keys())}. ",
1601
+ stacklevel=2,
1602
+ )
1603
+ loss = self.loss_function(logits=logits, labels=labels, vocab_size=self.config.vocab_size, **kwargs)
1604
+
1605
+ if not return_dict:
1606
+ output = (logits,) + outputs[1:]
1607
+ return (loss,) + output if loss is not None else output
1608
+
1609
+ return CausalLMOutputWithPast(
1610
+ loss=loss,
1611
+ logits=logits,
1612
+ past_key_values=outputs.past_key_values,
1613
+ hidden_states=outputs.hidden_states,
1614
+ attentions=outputs.attentions,
1615
+ )
1616
+
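A minimal usage sketch of the `forward` defined above, assuming `model` is a loaded `Plamo2ForCausalLM` and `tokenizer` its matching tokenizer (both loaded elsewhere with `trust_remote_code=True`). Passing `labels` returns the built-in causal-LM loss, and the logits come back already sliced to `config.vocab_size`:

```python
import torch

# Assumes `model` and `tokenizer` were loaded elsewhere with trust_remote_code=True.
inputs = tokenizer("PLaMo is a large language model.", return_tensors="pt")
with torch.no_grad():
    out = model(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        labels=inputs["input_ids"],   # built-in shifted cross-entropy loss
    )
print(out.loss)          # scalar loss tensor
print(out.logits.shape)  # (batch, seq_len, config.vocab_size)
```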
1617
+ def prepare_inputs_for_generation(
1618
+ self,
1619
+ input_ids: torch.Tensor,
1620
+ past_key_values: Optional[Plamo2Cache] = None,
1621
+ attention_mask: Optional[torch.Tensor] = None,
1622
+ inputs_embeds: Optional[torch.Tensor] = None,
1623
+ image_features: Optional[torch.Tensor] = None,
1624
+ **kwargs: Any,
1625
+ ) -> Dict[str, Any]:
1626
+ if past_key_values:
1627
+ input_ids = input_ids[:, -1:]
1628
+ if image_features is not None:
1629
+ image_features = image_features[:, -1:, :]
1630
+
1631
+ position_ids = kwargs.get("position_ids", None)
1632
+ if attention_mask is not None and position_ids is None:
1633
+ # create position_ids on the fly for batch generation
1634
+ position_ids = attention_mask.long().cumsum(-1) - 1
1635
+ position_ids.masked_fill_(attention_mask == 0, 1)
1636
+ if past_key_values:
1637
+ position_ids = position_ids[:, -1].unsqueeze(-1)
1638
+
1639
+ # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
1640
+ if inputs_embeds is not None and past_key_values is None:
1641
+ model_inputs: Dict[str, Any] = {"inputs_embeds": inputs_embeds}
1642
+ else:
1643
+ model_inputs = {"input_ids": input_ids}
1644
+
1645
+ model_inputs.update(
1646
+ {
1647
+ "position_ids": position_ids,
1648
+ "past_key_values": past_key_values,
1649
+ "use_cache": kwargs.get("use_cache"),
1650
+ "attention_mask": attention_mask,
1651
+ "image_features": image_features,
1652
+ }
1653
+ )
1654
+ return model_inputs
1655
+
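`prepare_inputs_for_generation` derives `position_ids` from the attention mask with a cumulative sum, so left padding does not shift the positions of real tokens. A small standalone check of that arithmetic:

```python
import torch

# Left-padded batch: 0 marks padding, 1 marks real tokens.
attention_mask = torch.tensor([[0, 0, 1, 1, 1],
                               [1, 1, 1, 1, 1]])
position_ids = attention_mask.long().cumsum(-1) - 1
position_ids.masked_fill_(attention_mask == 0, 1)
print(position_ids)
# tensor([[1, 1, 0, 1, 2],
#         [0, 1, 2, 3, 4]])  -> real tokens count from 0; padding gets a dummy 1
```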
1656
+ @staticmethod
1657
+ def _reorder_cache(past_key_values: Plamo2Cache, beam_idx: torch.Tensor) -> Plamo2Cache:
1658
+ past_key_values.reorder_cache(beam_idx)
1659
+ return past_key_values
1660
+
1661
+
1662
+ class MLPImageProjector(nn.Module):
1663
+ def __init__(self, config: Plamo2Config) -> None:
1664
+ super().__init__()
1665
+ self.config = config
1666
+
1667
+ assert config.image_feature_size is not None # for typing
1668
+
1669
+ # nn.LayerNorm is not supported by PFVM, so use RMSNorm + Bias instead to approximate this.
1670
+ self.norm0 = RMSNorm(config.image_feature_size, eps=config.rms_norm_eps)
1671
+ self.bias0 = Bias(config.image_feature_size)
1672
+
1673
+ # PFVM doesn't support Linear with bias, so add bias manually afterwards.
1674
+ self.linear1 = nn.Linear(config.image_feature_size, config.hidden_size, bias=False)
1675
+ self.bias1 = Bias(config.hidden_size)
1676
+ self.act1 = nn.GELU()
1677
+
1678
+ self.linear2 = nn.Linear(config.hidden_size, config.hidden_size, bias=False)
1679
+ self.bias2 = Bias(config.hidden_size)
1680
+
1681
+ def forward(
1682
+ self,
1683
+ hidden_states: torch.Tensor,
1684
+ ) -> torch.Tensor:
1685
+ hidden_states = self.norm0(hidden_states)
1686
+ hidden_states = self.bias0(hidden_states)
1687
+
1688
+ hidden_states = self.linear1(hidden_states)
1689
+ hidden_states = self.bias1(hidden_states)
1690
+ hidden_states = self.act1(hidden_states)
1691
+
1692
+ hidden_states = self.linear2(hidden_states)
1693
+ hidden_states = self.bias2(hidden_states)
1694
+
1695
+ return hidden_states
1696
+
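The projector's comments note that `nn.LayerNorm` is replaced by `RMSNorm` plus a learned bias because of a PFVM limitation. As a rough sanity check of why that substitution approximates LayerNorm: on mean-centered inputs, RMS normalization coincides with LayerNorm's normalization step (learned scale and bias omitted; this sketch uses plain tensor ops, not the repo's `RMSNorm` class):

```python
import torch

def rms_normalize(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # RMS normalization: no mean subtraction, no learned parameters.
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)

x = torch.randn(2, 8)
ln = torch.nn.functional.layer_norm(x, (8,), eps=1e-6)
approx = rms_normalize(x - x.mean(-1, keepdim=True))
print((ln - approx).abs().max())  # ~0: identical once the input is mean-centered
```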
1697
+
1698
+ class Bias(nn.Module):
1699
+ def __init__(self, num_features: int) -> None:
1700
+ super().__init__()
1701
+ self._bias = nn.Parameter(torch.zeros((num_features,)))
1702
+
1703
+ def forward(
1704
+ self,
1705
+ x: torch.Tensor,
1706
+ ) -> torch.Tensor:
1707
+ return x + self._bias
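The classes above (`Plamo2ForCausalLM`, `MLPImageProjector`, `Bias`) complete the modeling file. A hedged end-to-end sketch: the custom classes are normally reached through the `auto_map` machinery with `trust_remote_code=True`; the repo id below is an assumption taken from the model card's license link and may need adjusting:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "pfnet/plamo-2-8b"  # assumption: replace with the repository you actually use
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo, trust_remote_code=True, torch_dtype=torch.bfloat16
)

inputs = tokenizer("これからの人工知能技術は", return_tensors="pt")
generated = model.generate(inputs.input_ids, max_new_tokens=32)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```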
special_tokens_map.json ADDED
@@ -0,0 +1,24 @@
1
+ {
2
+ "bos_token": {
3
+ "content": "<|plamo:bos|>",
4
+ "lstrip": false,
5
+ "normalized": false,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "eos_token": "<|plamo:bos|>",
10
+ "pad_token": {
11
+ "content": "<|plamo:pad|>",
12
+ "lstrip": false,
13
+ "normalized": false,
14
+ "rstrip": false,
15
+ "single_word": false
16
+ },
17
+ "unk_token": {
18
+ "content": "<|plamo:unk|>",
19
+ "lstrip": false,
20
+ "normalized": false,
21
+ "rstrip": false,
22
+ "single_word": false
23
+ }
24
+ }
tokenization_plamo.py ADDED
@@ -0,0 +1,392 @@
1
+ import json
2
+ import math
3
+ import os
4
+ from shutil import copyfile
5
+ from typing import Any, Optional, Tuple
6
+
7
+ import numpy as np
8
+
9
+ # NOTE: numba does not support type hints for njit: https://github.com/python/mypy/issues/16149
10
+ from numba import njit # type: ignore[attr-defined]
11
+ from numba.core import types
12
+ from numba.typed import Dict, List
13
+ from transformers.tokenization_utils import PreTrainedTokenizer
14
+ from transformers.utils import logging
15
+
16
+ VOCAB_FILES_NAMES = {"vocab_file": "tokenizer.jsonl"}
17
+ logger = logging.get_logger(__name__)
18
+
19
+ INVALID_SCORE = -20000000
20
+ UNKNOWN_SCORE = -10000000
21
+
22
+ TABLE_PIECE_LENGTH = 0
23
+ TABLE_TOKEN_ID = 1
24
+ TABLE_SCORE = 2
25
+ TABLE_PIECE_ID = 3
26
+
27
+ PATH_TOKEN_LENGTH = 0
28
+ PATH_TOKEN_ID = 1
29
+ PATH_NUM_TOKENS = 2
30
+
31
+
32
+ class AhoCorasick:
33
+ def __init__(self) -> None:
34
+ # List of tokens in the vocabulary.
35
+ self._tokens: list[str]
36
+
37
+ # A mapping from a byte code point to a token ID, used for byte fallback.
38
+ self._bytes: np.ndarray
39
+
40
+ # A mapping from a suffix's piece code to a suffix ID.
41
+ #
42
+ # Typically, the Aho-Corasick algorithm builds a Trie and adds suffix links between nodes
43
+ # of the Trie. In this implementation, a suffix ID corresponds to a node in the trie, and
44
+ # a piece code to an edge (in other words, a pair of a node and the next character).
45
+ #
46
+ # A piece code is a 64-bit integer:
47
+ # - The upper 32 bits store the Unicode code point of the first character.
48
+ # - The lower 32 bits store the suffix ID of the remaining suffix.
49
+ #
50
+ # A suffix ID is an integer indicating the starting position in the _table.
51
+ self._to_suffix_id: Dict[types.int64, types.int32]
52
+
53
+ # Flattened table representing the Trie structure for the Aho-Corasick algorithm.
54
+ # It stores information including scores for each piece (prefix) within each suffix.
55
+ # It is flattened for memory efficiency and performance. Suffixes are stored in
56
+ # lexicographical order of their reversed strings, which improves memory access locality
57
+ # when exploring new characters starting from the string's end. Pieces within a suffix are
58
+ # stored in the decreasing order of their lengths.
59
+ #
60
+ # Each piece (a prefix of the suffix) contains four pieces of information:
61
+ # - TABLE_PIECE_LENGTH: Length of the piece.
62
+ # - TABLE_TOKEN_ID: Token ID (or -1 if the piece is not a valid token).
63
+ # - TABLE_SCORE: Score (or INVALID_SCORE if the piece is not a valid token).
64
+ # - TABLE_PIECE_ID: Piece ID of the suffix.
65
+ #
66
+ # Each suffix also includes a sentinel row with a length of 1, a score of UNKNOWN_SCORE,
67
+ # and a token ID of -1. Sentinel rows are identified by the score being UNKNOWN_SCORE.
68
+ self._table: np.ndarray
69
+
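The comments above define a "piece code" as a 64-bit integer: the upper 32 bits hold the Unicode code point of the first character, the lower 32 bits the suffix ID of the remainder. A tiny illustration of that packing (the helper names are illustrative, not part of the class):

```python
# Illustrative pack/unpack of the 64-bit piece code described above.
def make_piece_code(first_char: str, suffix_id: int) -> int:
    return (ord(first_char) << 32) | suffix_id

def split_piece_code(code: int) -> tuple[int, int]:
    return code >> 32, code & 0xFFFFFFFF

code = make_piece_code("あ", 42)
assert split_piece_code(code) == (ord("あ"), 42)
```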
70
+ def build(self, vocab: list[Any]) -> None:
71
+ self._bytes = np.zeros(256, dtype=np.int32)
72
+ self._to_suffix_id = Dict.empty(key_type=types.int64, value_type=types.int32)
73
+
74
+ # Build suffix_to_score and token_to_token_id.
75
+ # The suffix_to_score dictionary maps a suffix to its score. It also includes all suffixes
76
+ # of the token for the Trie structure for the Aho-Corasick algorithm. If a suffix is not a
77
+ # valid token, its score is set to math.nan.
78
+ # The token_to_token_id dictionary maps a token to its token ID.
79
+ suffix_to_score: dict[str, float] = {}
80
+ token_to_token_id: dict[str, int] = {}
81
+ self._tokens = []
82
+ for token_id, row in enumerate(vocab):
83
+ assert isinstance(row[0], str), row
84
+ assert isinstance(row[1], (int, float)), row
85
+
86
+ token = str(row[0])
87
+ self._tokens.append(token)
88
+ token_to_token_id[token] = token_id
89
+
90
+ # Special handling for byte tokens.
91
+ if len(row) > 2 and row[2] == "BYTE":
92
+ assert len(token) == 6 and token.startswith("<0x") and token.endswith(">"), row[0]
93
+ self._bytes[int(row[0][3:5], 16)] = token_id
94
+ continue
95
+
96
+ suffix_to_score[token] = float(row[1])
97
+ # Ensure that all suffixes are included in suffix_to_score.
98
+ for i in range(1, len(token)):
99
+ suffix_to_score[token[i:]] = suffix_to_score.get(token[i:], math.nan)
100
+
101
+ # Ensure all byte tokens are set.
102
+ for i in range(256):
103
+ assert self._bytes[i] != 0, f"Byte token for <0x{i:02X}> is not set."
104
+
105
+ # List suffixes in lexicographical order of their reversed strings.
106
+ suffixes = list(suffix_to_score.keys())
107
+ suffixes.append("")
108
+ suffixes.sort(key=lambda x: x[::-1])
109
+
110
+ # Build suffix_to_id, which is a mapping from a suffix to a suffix ID, and _to_suffix_id,
111
+ # which is a mapping from a piece code to a suffix ID.
112
+ suffix_to_id: dict[str, int] = {}
113
+ num_pieces = 0
114
+ for s in suffixes:
115
+ suffix_to_id[s] = num_pieces
116
+ if s != "":
117
+ self._to_suffix_id[ord(s[0]) << 32 | suffix_to_id[s[1:]]] = np.int32(num_pieces)
118
+ num_pieces += 1 + sum(s[:i] in suffix_to_score for i in range(1, len(s) + 1))
119
+ assert suffix_to_id[""] == 0, suffix_to_id[""]
120
+
121
+ # Build _table, which is a flattened table representing the Trie structure for the Aho-Corasick.
122
+ self._table = np.zeros((num_pieces, 4), dtype=np.int32)
123
+ i = 0
124
+ for suffix in suffixes:
125
+ # Add all prefixes of the suffix to the table.
126
+ for piece_length in range(len(suffix), 0, -1):
127
+ piece = suffix[:piece_length]
128
+ score = suffix_to_score.get(piece, None)
129
+ if score is None:
130
+ continue
131
+ self._table[i, TABLE_PIECE_LENGTH] = piece_length
132
+ self._table[i, TABLE_TOKEN_ID] = token_to_token_id.get(piece, -1)
133
+ self._table[i, TABLE_SCORE] = round(score * 1e4) if math.isfinite(score) else INVALID_SCORE
134
+ self._table[i, TABLE_PIECE_ID] = suffix_to_id[piece]
135
+ i += 1
136
+
137
+ # Add a sentinel row.
138
+ self._table[i, TABLE_PIECE_LENGTH] = 1
139
+ self._table[i, TABLE_TOKEN_ID] = -1
140
+ self._table[i, TABLE_SCORE] = UNKNOWN_SCORE
141
+ i += 1
142
+ assert i == num_pieces, (i, num_pieces)
143
+
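`build()` lays suffixes out in lexicographical order of their reversed strings, so suffixes that end the same way sit next to each other in `_table` (the memory-locality argument in the comments above). A quick standalone illustration of that ordering:

```python
# Sorting by the reversed string groups suffixes that share an ending.
suffixes = ["ab", "b", "cab", "a", ""]
suffixes.sort(key=lambda s: s[::-1])
print(suffixes)  # ['', 'a', 'b', 'ab', 'cab']
```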
144
+ @staticmethod
145
+ @njit
146
+ def _encode(
147
+ to_suffix_id: Dict[types.int64, types.int32],
148
+ table: np.ndarray,
149
+ bytes: np.ndarray,
150
+ data: np.ndarray,
151
+ ) -> np.ndarray:
152
+ # Initialize scores array with a high value and set the score at the end to 0.
153
+ # This array keeps track of the minimum cost (best score) to encode from each position to the end.
154
+ scores = np.full((len(data) + 1,), 2**60, dtype=np.int64)
155
+ scores[-1] = 0
156
+
157
+ # Path array to store the best path information.
158
+ # The path array keeps track of token length, token ID, and number of tokens needed to encode.
159
+ path = np.zeros((len(data) + 1, 3), dtype=np.int32)
160
+
161
+ # Initialize suffix_id to 0, which represents the root of the Trie.
162
+ suffix_id = 0
163
+
164
+ # Process the input data from the end to the beginning.
165
+ for i in range(len(data) - 1, -1, -1):
166
+ c = data[i]
167
+
168
+ # Find the next suffix ID by iterating the suffix IDs of prefixes of the current suffix.
169
+ # NOTE: If no suffix ID is found, suffix_id will be set to 0.
170
+ for p in range(suffix_id, len(table)):
171
+ suffix_id = to_suffix_id.get(c << 32 | table[p, TABLE_PIECE_ID], np.int32(0))
172
+ # If a next suffix ID is found or a sentinel row is reached, break the loop.
173
+ if suffix_id > 0 or table[p, TABLE_SCORE] == UNKNOWN_SCORE:
174
+ break
175
+
176
+ # Update the best path to the current position. If multiple paths have the same score,
177
+ # this chooses the longest prefix as the best path (table is sorted in the decreasing
178
+ # order of piece length).
179
+ for p in range(suffix_id, len(table)):
180
+ score = table[p, TABLE_SCORE]
181
+ if score > INVALID_SCORE:
182
+ piece_length = table[p, TABLE_PIECE_LENGTH]
183
+ s = scores[i + piece_length] - score
184
+ if s < scores[i]:
185
+ scores[i] = s
186
+ path[i, PATH_TOKEN_LENGTH] = piece_length
187
+ path[i, PATH_TOKEN_ID] = table[p, TABLE_TOKEN_ID]
188
+ path[i, PATH_NUM_TOKENS] = path[i + piece_length, PATH_NUM_TOKENS] + 1
189
+ if score == UNKNOWN_SCORE:
190
+ # Add number of bytes to represent `c` in UTF-8 (minus 1; 1 is already
191
+ # added above).
192
+ path[i, PATH_NUM_TOKENS] += (c >= 0x80) + (c >= 0x800) + (c >= 0x10000)
193
+
194
+ # If it reaches a sentinel row, break the loop.
195
+ if score == UNKNOWN_SCORE:
196
+ break
197
+
198
+ # Decode the best path from the beginning to get the token IDs.
199
+ pos = 0
200
+ token_ids = np.zeros(path[0, PATH_NUM_TOKENS], dtype=np.int32)
201
+ token_pos = 0
202
+ while pos < len(data):
203
+ if path[pos, PATH_TOKEN_ID] >= 0:
204
+ token_ids[token_pos] = path[pos, PATH_TOKEN_ID]
205
+ token_pos += 1
206
+ else:
207
+ # Fall back to byte tokens.
208
+ c = data[pos]
209
+ s = 1 + (c >= 0x80) + (c >= 0x800) + (c >= 0x10000)
210
+ # Add byte tokens representing UTF-8 bytes.
211
+ for i in range(s):
212
+ b = c if s == 1 else (0xF00 >> s) & 0xFF if i == 0 else 0x80
213
+ token_ids[token_pos] = bytes[b | ((c >> (s - i - 1) * 6) & 0x3F)]
214
+ token_pos += 1
215
+
216
+ # Ensure that pos should increase by at least 1.
217
+ assert path[pos, PATH_TOKEN_LENGTH] > 0, (pos, path[pos])
218
+ pos += path[pos, PATH_TOKEN_LENGTH]
219
+
220
+ return token_ids
221
+
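The byte-fallback branch above counts the UTF-8 length of a code point with `1 + (c >= 0x80) + (c >= 0x800) + (c >= 0x10000)`, which agrees with Python's own UTF-8 encoder:

```python
# Check the UTF-8 length formula used in the byte-fallback path.
def utf8_len(codepoint: int) -> int:
    return 1 + (codepoint >= 0x80) + (codepoint >= 0x800) + (codepoint >= 0x10000)

for ch in ["A", "é", "あ", "😀"]:   # 1-, 2-, 3- and 4-byte characters
    assert utf8_len(ord(ch)) == len(ch.encode("utf-8")), ch
```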
222
+ def encode(self, data: str) -> np.ndarray:
223
+ """Encodes a string into a sequence of token IDs."""
224
+ return np.asarray(
225
+ self._encode(
226
+ self._to_suffix_id,
227
+ self._table,
228
+ self._bytes,
229
+ # Convert a string into a numpy array of Unicode code points.
230
+ # NOTE: This skips UTF-32 BOM.
231
+ np.frombuffer(data.encode("utf-32"), dtype=np.int32)[1:],
232
+ )
233
+ )
234
+
235
+ def encode_as_tokens(self, data: str) -> list[str]:
236
+ """Encodes a string into a sequence of tokens."""
237
+ return [self._tokens[token_id] for token_id in self.encode(data)]
238
+
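`encode()` converts the input string to an array of Unicode code points by encoding it as UTF-32 and dropping the BOM, which is equivalent to taking `ord()` per character; a short check:

```python
import numpy as np

text = "PLaMoの分かち書き"
via_utf32 = np.frombuffer(text.encode("utf-32"), dtype=np.int32)[1:]  # [1:] drops the BOM
via_ord = np.array([ord(c) for c in text], dtype=np.int32)
assert np.array_equal(via_utf32, via_ord)
```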
239
+
240
+ class Plamo2Tokenizer(PreTrainedTokenizer): # type: ignore
241
+ vocab_files_names = VOCAB_FILES_NAMES
242
+ model_input_names = ["input_ids", "attention_mask"]
243
+
244
+ _save_files = [
245
+ "special_tokens_map.json",
246
+ "tokenization_plamo.py",
247
+ "tokenizer.jsonl",
248
+ "tokenizer_config.json",
249
+ ]
250
+
251
+ def __init__(
252
+ self,
253
+ vocab_file: str,
254
+ unk_token: str = "<|plamo:unk|>",
255
+ bos_token: str = "<|plamo:bos|>",
256
+ eos_token: str = "<|plamo:eos|>",
257
+ pad_token: str = "<|plamo:pad|>",
258
+ cls_token: Optional[str] = None,
259
+ sep_token: Optional[str] = None,
260
+ mask_token: Optional[str] = None,
261
+ clean_up_tokenization_spaces: bool = False,
262
+ **kwargs: Any,
263
+ ) -> None:
264
+ """Tokenizer for PLaMo.
265
+
266
+ Args:
267
+ vocab_file (str): Vocabulary file path.
268
+ unk_token (str): Unknown token.
269
+ bos_token (str): Beginning of sentence token.
270
+ eos_token (str): End of sentence token.
271
+ pad_token (str): Padding token.
272
+ cls_token (str):
273
+ Classification token, to extract a summary of an input sequence leveraging self-attention along the
274
+ full depth of the model.
275
+ sep_token (str): Separation token, to separate context and query in an input sequence.
276
+ mask_token (str): Mask token, to use when training a model with masked-language modeling.
277
+ clean_up_tokenization_spaces (bool): Whether or not to clean up the tokenization spaces.
281
+ """
282
+ if "add_bos_token" not in kwargs:
283
+ kwargs["add_bos_token"] = False
284
+ if "add_eos_token" not in kwargs:
285
+ kwargs["add_eos_token"] = False
286
+ self.data: list[Any] = [json.loads(line) for line in open(vocab_file, "r", encoding="utf-8")]
287
+ self.vocab: dict[str, int] = {v[0]: i for i, v in enumerate(self.data)}
288
+ self.aho_corasick = AhoCorasick()
289
+ self.aho_corasick.build(self.data)
290
+ self.vocab_file = vocab_file
291
+ self.add_bos_token = kwargs["add_bos_token"]
292
+ self.add_eos_token = kwargs["add_eos_token"]
293
+
294
+ super().__init__(
295
+ vocab_file=vocab_file,
296
+ unk_token=unk_token,
297
+ bos_token=bos_token,
298
+ eos_token=eos_token,
299
+ pad_token=pad_token,
300
+ cls_token=cls_token,
301
+ sep_token=sep_token,
302
+ mask_token=mask_token,
303
+ clean_up_tokenization_spaces=clean_up_tokenization_spaces,
304
+ **kwargs,
305
+ )
306
+
307
+ # the functions below are copied from hf transformers LlamaTokenizer's implementation to fix the behaviour of the tokenizer
308
+ # https://github.com/huggingface/transformers/blob/v4.30.2/src/transformers/models/llama/tokenization_llama.py
309
+
310
+ def __getstate__(self) -> dict[str, Any]:
311
+ state = self.__dict__.copy()
312
+ state["aho_corasick"] = None
313
+ return state
314
+
315
+ def __setstate__(self, d: dict[str, Any]) -> None:
316
+ self.__dict__ = d
317
+ self.aho_corasick = AhoCorasick()
318
+ self.aho_corasick.build(self.data)
319
+
320
+ @property
321
+ def vocab_size(self) -> Any:
322
+ """Returns vocab size"""
323
+ return len(self.data)
324
+
325
+ def token_to_score(self, token: str) -> Optional[float]:
326
+ """Returns score of the token"""
327
+ token_id = self.vocab.get(token, None)
328
+ return None if token_id is None else self.data[token_id][1]
329
+
330
+ def get_vocab(self) -> dict[str, int]:
331
+ """Returns vocab as a dict"""
332
+ vocab = self.vocab.copy()
333
+ vocab.update(self.added_tokens_encoder)
334
+ return vocab
335
+
336
+ def convert_tokens_to_string(self, tokens: List[str]) -> str:
337
+ """Converts a sequence of tokens (string) in a single string."""
338
+ return b"".join(
339
+ [bytes([int(t[3:5], 16)]) if t.startswith("<0x") else t.encode("utf-8") for t in tokens]
340
+ ).decode("utf-8", errors="replace")
341
+
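`convert_tokens_to_string` rebuilds raw bytes from `<0xNN>` byte-fallback tokens and re-encodes ordinary tokens as UTF-8 before decoding everything together. A hedged mini-example with a hand-written token list (real token boundaries depend on the vocabulary; the three byte tokens happen to spell "あ"):

```python
# Illustrative only: one ordinary token plus three byte-fallback tokens (UTF-8 of "あ").
tokens = ["Hello ", "<0xE3>", "<0x81>", "<0x82>"]
text = b"".join(
    bytes([int(t[3:5], 16)]) if t.startswith("<0x") else t.encode("utf-8") for t in tokens
).decode("utf-8", errors="replace")
print(text)  # Hello あ
```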
342
+ def _tokenize(self, text: str) -> Any:
343
+ """Returns a tokenized string."""
344
+ return self.aho_corasick.encode_as_tokens(text)
345
+
346
+ def _convert_token_to_id(self, token: str) -> Any:
347
+ """Converts a token (str) in an id using the vocab."""
348
+ return self.vocab.get(token, 0)
349
+
350
+ def _convert_id_to_token(self, index: int) -> Any:
351
+ """Converts an index (integer) in a token (str) using the vocab."""
352
+ return self.data[index][0]
353
+
354
+ def build_inputs_with_special_tokens(
355
+ self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
356
+ ) -> List[int]:
357
+ bos_token_id = [self.bos_token_id] if self.add_bos_token else []
358
+ eos_token_id = [self.eos_token_id] if self.add_eos_token else []
359
+
360
+ output = bos_token_id + token_ids_0 + eos_token_id
361
+
362
+ if token_ids_1 is not None:
363
+ output = output + bos_token_id + token_ids_1 + eos_token_id
364
+
365
+ return output
366
+
367
+ def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
368
+ """
369
+ Save the vocabulary and special tokens file to a directory.
370
+
371
+ Args:
372
+ save_directory (`str`):
373
+ The directory in which to save the vocabulary.
374
+
375
+ Returns:
376
+ `Tuple(str)`: Paths to the files saved.
377
+ """
378
+ if not os.path.isdir(save_directory):
379
+ logger.error(f"Vocabulary path ({save_directory}) should be a directory")
380
+ return ("",)
381
+ out_vocab_file = os.path.join(
382
+ save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
383
+ )
384
+
385
+ if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file) and os.path.isfile(self.vocab_file):
386
+ copyfile(self.vocab_file, out_vocab_file)
387
+ elif not os.path.isfile(self.vocab_file):
388
+ with open(out_vocab_file, "w") as f:
389
+ for token in self.data:
390
+ print(json.dumps(token, ensure_ascii=False), file=f)
391
+
392
+ return (out_vocab_file,)
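A hedged usage sketch for this tokenizer once it has been loaded through `AutoTokenizer` with `trust_remote_code=True` (see the loading sketch after the modeling code). With `add_bos_token: true` from `tokenizer_config.json` below, a BOS id should be prepended by default:

```python
# Assumes `tokenizer` is the Plamo2Tokenizer loaded with trust_remote_code=True.
ids = tokenizer("こんにちは、世界")["input_ids"]
print(ids[0] == tokenizer.bos_token_id)          # True when add_bos_token is enabled
print(tokenizer.convert_ids_to_tokens(ids)[:8])  # tokens, including byte fallbacks if any
print(tokenizer.decode(ids, skip_special_tokens=True))  # should round-trip to the input
```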
tokenizer.jsonl ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3af94303d6c676e875e13165abf4e4a9d42c663ef247a8985cdeedb223cc0681
3
+ size 10573745
tokenizer_config.json ADDED
@@ -0,0 +1,56 @@
1
+ {
2
+ "add_bos_token": true,
3
+ "add_eos_token": false,
4
+ "added_tokens_decoder": {
5
+ "0": {
6
+ "content": "<|plamo:unk|>",
7
+ "lstrip": false,
8
+ "normalized": false,
9
+ "rstrip": false,
10
+ "single_word": false,
11
+ "special": true
12
+ },
13
+ "1": {
14
+ "content": "<|plamo:bos|>",
15
+ "lstrip": false,
16
+ "normalized": false,
17
+ "rstrip": false,
18
+ "single_word": false,
19
+ "special": true
20
+ },
21
+ "2": {
22
+ "content": "<|plamo:eos|>",
23
+ "lstrip": false,
24
+ "normalized": false,
25
+ "rstrip": false,
26
+ "single_word": false,
27
+ "special": true
28
+ },
29
+ "3": {
30
+ "content": "<|plamo:pad|>",
31
+ "lstrip": false,
32
+ "normalized": false,
33
+ "rstrip": false,
34
+ "single_word": false,
35
+ "special": true
36
+ }
37
+ },
38
+ "auto_map": {
39
+ "AutoTokenizer": [
40
+ "tokenization_plamo.Plamo2Tokenizer",
41
+ null
42
+ ]
43
+ },
44
+ "bos_token": "<|plamo:bos|>",
45
+ "clean_up_tokenization_spaces": false,
46
+ "cls_token": null,
47
+ "eos_token": "<|plamo:bos|>",
48
+ "extra_special_tokens": {},
49
+ "local_file_only": true,
50
+ "mask_token": null,
51
+ "model_max_length": 1000000000000000019884624838656,
52
+ "pad_token": "<|plamo:pad|>",
53
+ "sep_token": null,
54
+ "tokenizer_class": "Plamo2Tokenizer",
55
+ "unk_token": "<|plamo:unk|>"
56
+ }