Commit 267ea5c · verified · 1 Parent(s): ed0d0ab
zhangyikai committed: Upload folder using huggingface_hub
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
LICENSE ADDED
@@ -0,0 +1,203 @@
+ Copyright (C) 2025 AIDC-AI
+
+ Apache License
+ Version 2.0, January 2004
+ http://www.apache.org/licenses/
+
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+ 1. Definitions.
+
+ "License" shall mean the terms and conditions for use, reproduction,
+ and distribution as defined by Sections 1 through 9 of this document.
+
+ "Licensor" shall mean the copyright owner or entity authorized by
+ the copyright owner that is granting the License.
+
+ "Legal Entity" shall mean the union of the acting entity and all
+ other entities that control, are controlled by, or are under common
+ control with that entity. For the purposes of this definition,
+ "control" means (i) the power, direct or indirect, to cause the
+ direction or management of such entity, whether by contract or
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
+ outstanding shares, or (iii) beneficial ownership of such entity.
+
+ "You" (or "Your") shall mean an individual or Legal Entity
+ exercising permissions granted by this License.
+
+ "Source" form shall mean the preferred form for making modifications,
+ including but not limited to software source code, documentation
+ source, and configuration files.
+
+ "Object" form shall mean any form resulting from mechanical
+ transformation or translation of a Source form, including but
+ not limited to compiled object code, generated documentation,
+ and conversions to other media types.
+
+ "Work" shall mean the work of authorship, whether in Source or
+ Object form, made available under the License, as indicated by a
+ copyright notice that is included in or attached to the work
+ (an example is provided in the Appendix below).
+
+ "Derivative Works" shall mean any work, whether in Source or Object
+ form, that is based on (or derived from) the Work and for which the
+ editorial revisions, annotations, elaborations, or other modifications
+ represent, as a whole, an original work of authorship. For the purposes
+ of this License, Derivative Works shall not include works that remain
+ separable from, or merely link (or bind by name) to the interfaces of,
+ the Work and Derivative Works thereof.
+
+ "Contribution" shall mean any work of authorship, including
+ the original version of the Work and any modifications or additions
+ to that Work or Derivative Works thereof, that is intentionally
+ submitted to Licensor for inclusion in the Work by the copyright owner
+ or by an individual or Legal Entity authorized to submit on behalf of
+ the copyright owner. For the purposes of this definition, "submitted"
+ means any form of electronic, verbal, or written communication sent
+ to the Licensor or its representatives, including but not limited to
+ communication on electronic mailing lists, source code control systems,
+ and issue tracking systems that are managed by, or on behalf of, the
+ Licensor for the purpose of discussing and improving the Work, but
+ excluding communication that is conspicuously marked or otherwise
+ designated in writing by the copyright owner as "Not a Contribution."
+
+ "Contributor" shall mean Licensor and any individual or Legal Entity
+ on behalf of whom a Contribution has been received by Licensor and
+ subsequently incorporated within the Work.
+
+ 2. Grant of Copyright License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ copyright license to reproduce, prepare Derivative Works of,
+ publicly display, publicly perform, sublicense, and distribute the
+ Work and such Derivative Works in Source or Object form.
+
+ 3. Grant of Patent License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ (except as stated in this section) patent license to make, have made,
+ use, offer to sell, sell, import, and otherwise transfer the Work,
+ where such license applies only to those patent claims licensable
+ by such Contributor that are necessarily infringed by their
+ Contribution(s) alone or by combination of their Contribution(s)
+ with the Work to which such Contribution(s) was submitted. If You
+ institute patent litigation against any entity (including a
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
+ or a Contribution incorporated within the Work constitutes direct
+ or contributory patent infringement, then any patent licenses
+ granted to You under this License for that Work shall terminate
+ as of the date such litigation is filed.
+
+ 4. Redistribution. You may reproduce and distribute copies of the
+ Work or Derivative Works thereof in any medium, with or without
+ modifications, and in Source or Object form, provided that You
+ meet the following conditions:
+
+ (a) You must give any other recipients of the Work or
+ Derivative Works a copy of this License; and
+
+ (b) You must cause any modified files to carry prominent notices
+ stating that You changed the files; and
+
+ (c) You must retain, in the Source form of any Derivative Works
+ that You distribute, all copyright, patent, trademark, and
+ attribution notices from the Source form of the Work,
+ excluding those notices that do not pertain to any part of
+ the Derivative Works; and
+
+ (d) If the Work includes a "NOTICE" text file as part of its
+ distribution, then any Derivative Works that You distribute must
+ include a readable copy of the attribution notices contained
+ within such NOTICE file, excluding those notices that do not
+ pertain to any part of the Derivative Works, in at least one
+ of the following places: within a NOTICE text file distributed
+ as part of the Derivative Works; within the Source form or
+ documentation, if provided along with the Derivative Works; or,
+ within a display generated by the Derivative Works, if and
+ wherever such third-party notices normally appear. The contents
+ of the NOTICE file are for informational purposes only and
+ do not modify the License. You may add Your own attribution
+ notices within Derivative Works that You distribute, alongside
+ or as an addendum to the NOTICE text from the Work, provided
+ that such additional attribution notices cannot be construed
+ as modifying the License.
+
+ You may add Your own copyright statement to Your modifications and
+ may provide additional or different license terms and conditions
+ for use, reproduction, or distribution of Your modifications, or
+ for any such Derivative Works as a whole, provided Your use,
+ reproduction, and distribution of the Work otherwise complies with
+ the conditions stated in this License.
+
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
+ any Contribution intentionally submitted for inclusion in the Work
+ by You to the Licensor shall be under the terms and conditions of
+ this License, without any additional terms or conditions.
+ Notwithstanding the above, nothing herein shall supersede or modify
+ the terms of any separate license agreement you may have executed
+ with Licensor regarding such Contributions.
+
+ 6. Trademarks. This License does not grant permission to use the trade
+ names, trademarks, service marks, or product names of the Licensor,
+ except as required for reasonable and customary use in describing the
+ origin of the Work and reproducing the content of the NOTICE file.
+
+ 7. Disclaimer of Warranty. Unless required by applicable law or
+ agreed to in writing, Licensor provides the Work (and each
+ Contributor provides its Contributions) on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+ implied, including, without limitation, any warranties or conditions
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+ PARTICULAR PURPOSE. You are solely responsible for determining the
+ appropriateness of using or redistributing the Work and assume any
+ risks associated with Your exercise of permissions under this License.
+
+ 8. Limitation of Liability. In no event and under no legal theory,
+ whether in tort (including negligence), contract, or otherwise,
+ unless required by applicable law (such as deliberate and grossly
+ negligent acts) or agreed to in writing, shall any Contributor be
+ liable to You for damages, including any direct, indirect, special,
+ incidental, or consequential damages of any character arising as a
+ result of this License or out of the use or inability to use the
+ Work (including but not limited to damages for loss of goodwill,
+ work stoppage, computer failure or malfunction, or any and all
+ other commercial damages or losses), even if such Contributor
+ has been advised of the possibility of such damages.
+
+ 9. Accepting Warranty or Additional Liability. While redistributing
+ the Work or Derivative Works thereof, You may choose to offer,
+ and charge a fee for, acceptance of support, warranty, indemnity,
+ or other liability obligations and/or rights consistent with this
+ License. However, in accepting such obligations, You may act only
+ on Your own behalf and on Your sole responsibility, not on behalf
+ of any other Contributor, and only if You agree to indemnify,
+ defend, and hold each Contributor harmless for any liability
+ incurred by, or claims asserted against, such Contributor by reason
+ of your accepting any such warranty or additional liability.
+
+ END OF TERMS AND CONDITIONS
+
+ APPENDIX: How to apply the Apache License to your work.
+
+ To apply the Apache License to your work, attach the following
+ boilerplate notice, with the fields enclosed by brackets "[]"
+ replaced with your own identifying information. (Don't include
+ the brackets!) The text should be enclosed in the appropriate
+ comment syntax for the file format. We also recommend that a
+ file or class name and description of purpose be included on the
+ same "printed page" as the copyright notice for easier
+ identification within third-party archives.
+
+ Copyright [yyyy] [name of copyright owner]
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
NOTICE ADDED
@@ -0,0 +1,14 @@
+ Copyright (C) 2025 AIDC-AI
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+ http://www.apache.org/licenses/LICENSE-2.0
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+
+ This model was trained based on the following models:
+ 1. Qwen2.5 (https://huggingface.co/Qwen/Qwen2.5-32B-Instruct), license: (https://huggingface.co/Qwen/Qwen2.5-32B-Instruct/blob/main/LICENSE, SPDX-License-Identifier: Apache-2.0).
+ 2. AimV2 (https://huggingface.co/apple/aimv2-1B-patch14-448), license: AMLR (https://huggingface.co/apple/deeplabv3-mobilevit-small/blob/main/LICENSE). The Apple Machine Learning Research Model is licensed under the Apple Machine Learning Research Model License Agreement.
README.md CHANGED
@@ -1,3 +1,229 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ datasets:
+ - AIDC-AI/Ovis-dataset
+ library_name: transformers
+ tags:
+ - MLLM
+ pipeline_tag: image-text-to-text
+ language:
+ - en
+ - zh
+ base_model:
+ - AIDC-AI/Ovis2-34B
+ ---
+
+ # Ovis2-34B-GPTQ-Int8
+ <div align="center">
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/637aebed7ce76c3b834cea37/3IK823BZ8w-mz_QfeYkDn.png" width="30%"/>
+ </div>
+
+ ## Introduction
+ [GitHub](https://github.com/AIDC-AI/Ovis) | [Paper](https://arxiv.org/abs/2405.20797)
+
+ We are pleased to announce the release of **Ovis2**, our latest advancement in multi-modal large language models (MLLMs). Ovis2 inherits the innovative architectural design of the Ovis series, aimed at structurally aligning visual and textual embeddings. As the successor to Ovis1.6, Ovis2 incorporates significant improvements in both dataset curation and training methodology.
+
+ **Key Features**:
+
+ - **Small Model Performance**: Optimized training strategies give small-scale models higher capability density, delivering leading performance across model tiers.
+
+ - **Enhanced Reasoning Capabilities**: Chain-of-Thought (CoT) reasoning is significantly strengthened through the combination of instruction tuning and preference learning.
+
+ - **Video and Multi-Image Processing**: Video and multi-image data are incorporated into training to enhance the ability to handle complex visual information across frames and images.
+
+ - **Multilingual Support and OCR**: Multilingual OCR is extended beyond English and Chinese, and structured data extraction from complex visual elements such as tables and charts is improved.
+
+ <div align="center">
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/637aebed7ce76c3b834cea37/XB-vgzDL6FshrSNGyZvzc.png" width="100%" />
+ </div>
+
+ ## Model Zoo
+
+ | Ovis MLLMs | ViT | LLM | Model Weights | Demo |
+ |:-----------|:-----------------------:|:---------------------:|:-------------------------------------------------------:|:--------------------------------------------------------:|
+ | Ovis2-1B | aimv2-large-patch14-448 | Qwen2.5-0.5B-Instruct | [Huggingface](https://huggingface.co/AIDC-AI/Ovis2-1B) | [Space](https://huggingface.co/spaces/AIDC-AI/Ovis2-1B) |
+ | Ovis2-2B | aimv2-large-patch14-448 | Qwen2.5-1.5B-Instruct | [Huggingface](https://huggingface.co/AIDC-AI/Ovis2-2B) | [Space](https://huggingface.co/spaces/AIDC-AI/Ovis2-2B) |
+ | Ovis2-4B | aimv2-huge-patch14-448 | Qwen2.5-3B-Instruct | [Huggingface](https://huggingface.co/AIDC-AI/Ovis2-4B) | [Space](https://huggingface.co/spaces/AIDC-AI/Ovis2-4B) |
+ | Ovis2-8B | aimv2-huge-patch14-448 | Qwen2.5-7B-Instruct | [Huggingface](https://huggingface.co/AIDC-AI/Ovis2-8B) | [Space](https://huggingface.co/spaces/AIDC-AI/Ovis2-8B) |
+ | Ovis2-16B | aimv2-huge-patch14-448 | Qwen2.5-14B-Instruct | [Huggingface](https://huggingface.co/AIDC-AI/Ovis2-16B) | [Space](https://huggingface.co/spaces/AIDC-AI/Ovis2-16B) |
+ | Ovis2-34B | aimv2-1B-patch14-448 | Qwen2.5-32B-Instruct | [Huggingface](https://huggingface.co/AIDC-AI/Ovis2-34B) | - |
+ | Ovis2-2B-GPTQ-Int4 | aimv2-large-patch14-448 | Qwen2.5-1.5B-Instruct | [Huggingface](https://huggingface.co/AIDC-AI/Ovis2-2B-GPTQ-Int4) | - |
+ | Ovis2-4B-GPTQ-Int4 | aimv2-huge-patch14-448 | Qwen2.5-3B-Instruct | [Huggingface](https://huggingface.co/AIDC-AI/Ovis2-4B-GPTQ-Int4) | - |
+ | Ovis2-8B-GPTQ-Int4 | aimv2-huge-patch14-448 | Qwen2.5-7B-Instruct | [Huggingface](https://huggingface.co/AIDC-AI/Ovis2-8B-GPTQ-Int4) | - |
+ | Ovis2-16B-GPTQ-Int4 | aimv2-huge-patch14-448 | Qwen2.5-14B-Instruct | [Huggingface](https://huggingface.co/AIDC-AI/Ovis2-16B-GPTQ-Int4) | - |
+ | Ovis2-34B-GPTQ-Int4 | aimv2-1B-patch14-448 | Qwen2.5-32B-Instruct | [Huggingface](https://huggingface.co/AIDC-AI/Ovis2-34B-GPTQ-Int4) | [Space](https://huggingface.co/spaces/AIDC-AI/Ovis2-34B-GPTQ-Int4) |
+ | Ovis2-34B-GPTQ-Int8 | aimv2-1B-patch14-448 | Qwen2.5-32B-Instruct | [Huggingface](https://huggingface.co/AIDC-AI/Ovis2-34B-GPTQ-Int8) | - |
+
+ ## Quantized Models
+ We quantized Ovis2 with [GPTQModel](https://github.com/ModelCloud/GPTQModel). Follow the steps below to run it.
+
+ ### Installation
+ Run the following commands to set up a basic environment. Make sure you are running with CUDA 12.1.
+ ```bash
+ conda create -n <your_env_name> python=3.10
+ conda activate <your_env_name>
+ pip install torch==2.4.0 transformers==4.49.0 pillow==10.3.0
+ pip install flash-attn==2.7.0.post2 --no-build-isolation
+ pip install gptqmodel
+ pip install numpy==1.25.0
+ ```
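+
+ As a quick, optional sanity check of the environment (a minimal sketch; the version strings are simply the ones pinned above), you can confirm that the pinned packages are importable and that CUDA is visible:
+ ```python
+ import torch
+ import transformers
+
+ # Confirm the pinned versions from the installation step and that a CUDA device is visible.
+ print("torch:", torch.__version__)                  # expected: 2.4.0
+ print("transformers:", transformers.__version__)    # expected: 4.49.0
+ print("CUDA available:", torch.cuda.is_available())
+ if torch.cuda.is_available():
+     print("device:", torch.cuda.get_device_name(0))
+ ```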
71
+
72
+ ### Usage
73
+ Below is a code snippet to run quantized Ovis2 series with multimodal inputs. For additional usage instructions, including inference wrapper and Gradio UI, please refer to [Ovis GitHub](https://github.com/AIDC-AI/Ovis?tab=readme-ov-file#inference).
74
+ ```python
75
+ import torch
76
+ from PIL import Image
77
+ from transformers import GenerationConfig
78
+ from gptqmodel import GPTQModel
79
+
80
+ # load model
81
+ # customize load device
82
+ load_device = "cuda:0"
83
+ torch.cuda.set_device(load_device)
84
+ # We take AIDC-AI/Ovis2-34B-GPTQ-Int4 as an example. Note that the code snippet is
85
+ # applicable to any GPTQ-quantized Ovis2 model.
86
+ model = GPTQModel.load("AIDC-AI/Ovis2-34B-GPTQ-Int4", device=load_device, trust_remote_code=True)
87
+ model.model.generation_config = GenerationConfig.from_pretrained("AIDC-AI/Ovis2-34B-GPTQ-Int4")
88
+ text_tokenizer = model.get_text_tokenizer()
89
+ visual_tokenizer = model.get_visual_tokenizer()
90
+
91
+ # For inference, quantization affects only the model loading part. The rest is the same
92
+ # as unquantized Ovis2 models. Here we show how to inference with single image input
93
+ # and without batching. For other input types and batch inference, please refer to
94
+ # https://huggingface.co/AIDC-AI/Ovis2-34B.
95
+ image_path = input("Enter image path: ")
96
+ images = [Image.open(image_path)]
97
+ max_partition = 9
98
+ text = input("Enter prompt: ")
99
+ query = f'<image>\n{text}'
100
+
101
+ # format conversation
102
+ prompt, input_ids, pixel_values = model.preprocess_inputs(query, images, max_partition=max_partition)
103
+ attention_mask = torch.ne(input_ids, text_tokenizer.pad_token_id)
104
+ input_ids = input_ids.unsqueeze(0).to(device=model.device)
105
+ attention_mask = attention_mask.unsqueeze(0).to(device=model.device)
106
+ if pixel_values is not None:
107
+ pixel_values = pixel_values.to(dtype=visual_tokenizer.dtype, device=visual_tokenizer.device)
108
+ pixel_values = [pixel_values]
109
+
110
+ # generate output
111
+ with torch.inference_mode():
112
+ gen_kwargs = dict(
113
+ max_new_tokens=1024,
114
+ do_sample=False,
115
+ top_p=None,
116
+ top_k=None,
117
+ temperature=None,
118
+ repetition_penalty=None,
119
+ eos_token_id=model.generation_config.eos_token_id,
120
+ pad_token_id=text_tokenizer.pad_token_id,
121
+ use_cache=True
122
+ )
123
+ output_ids = model.generate(input_ids, pixel_values=pixel_values, attention_mask=attention_mask, **gen_kwargs)[0]
124
+ output = text_tokenizer.decode(output_ids, skip_special_tokens=True)
125
+ print(f'Output:\n{output}')
126
+ ```
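+
+ For reference, a multi-image query follows the same pattern: put one `<image>` placeholder per image in the prompt and pass all images in the `images` list (this rule is also noted in the calibration-data comments in the next section). The snippet below is a minimal sketch that reuses `model`, `text_tokenizer`, `visual_tokenizer`, and `max_partition` from above; the file paths are placeholders, and the authoritative multi-image and video examples are in the [Ovis2-34B](https://huggingface.co/AIDC-AI/Ovis2-34B) card.
+ ```python
+ from PIL import Image
+
+ # Hypothetical image paths; use one '<image>' mark per input image.
+ images = [Image.open("example_1.jpg"), Image.open("example_2.jpg")]
+ query = "<image>\n<image>\nDescribe the difference between the two images."
+
+ # Preprocessing is identical to the single-image case above.
+ prompt, input_ids, pixel_values = model.preprocess_inputs(query, images, max_partition=max_partition)
+ # The attention-mask construction, dtype/device handling, and model.generate(...) call
+ # then proceed exactly as in the single-image example.
+ ```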
+
+ ## Quantize Your Own Ovis2 Model with GPTQModel
+ We provide a demonstration code snippet for quantizing your own fine-tuned **Ovis2** model. Before running the code, follow the installation steps **above** to set up an environment for quantization.
+ ```python
+ import torch
+ from gptqmodel import QuantizeConfig, GPTQModel
+
+ model_path = "path/to/finetuned/model"
+ quantize_save_path = "path/to/save/quantized/model"
+
+ quantize_config = QuantizeConfig(
+     bits=4,         # 4 or 8
+     group_size=128  # it is recommended to set the value to 128
+ )
+
+ model = GPTQModel.load(
+     model_path,
+     quantize_config,
+     torch_dtype=torch.bfloat16,
+     trust_remote_code=True,
+     use_cache=False
+ )
+ model.model.llm.model.config.use_cache = False
+ model.model.config.llm_config.use_cache = False
+
+ # Data list for calibration, which should be in the following format:
+ #
+ # Single Image / Multi-Image / Video: for video input, frames should be pre-selected and
+ # form a list, which makes it equivalent to multi-image input. See more on video input at
+ # https://huggingface.co/AIDC-AI/Ovis2-34B.
+ # data_list = [
+ #     {
+ #         "image": ["path/to/image(s)/of/this/sample", ...],
+ #         "conversations": [
+ #             {
+ #                 "from": "human",
+ #                 "value": "<image>\n[Your sample prompt]"
+ #                 # For multi-image input, the number of '<image>' marks should be equal
+ #                 # to the number of input images. For video input, the mark should be a
+ #                 # single '<video>'.
+ #             },
+ #             {
+ #                 "from": "gpt",
+ #                 "value": "[Your sample answer]"
+ #             }
+ #         ]
+ #     },
+ #     ...
+ # ]
+ #
+ # Pure Text:
+ # data_list = [
+ #     {
+ #         "conversations": [
+ #             {
+ #                 "from": "human",
+ #                 "value": "[Your sample prompt]"
+ #             },
+ #             {
+ #                 "from": "gpt",
+ #                 "value": "[Your sample answer]"
+ #             }
+ #         ]
+ #     },
+ #     ...
+ # ]
+ data_list = [...]
+
+ model.quantize(data_list, batch_size=1, calibration_enable_gpu_cache=True)
+ print("Quantized! Now saving...")
+
+ model.save(quantize_save_path, max_shard_size='5GB')
+ print("All done!")
+ ```
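+
+ After saving, you can reload the freshly quantized checkpoint with the same loading code used in the Usage section above to smoke-test it (a minimal sketch; `quantize_save_path` is the placeholder path from the snippet above):
+ ```python
+ from gptqmodel import GPTQModel
+
+ # Reload the just-saved checkpoint the same way the Usage section loads the released models.
+ quantized = GPTQModel.load(quantize_save_path, device="cuda:0", trust_remote_code=True)
+ text_tokenizer = quantized.get_text_tokenizer()
+ visual_tokenizer = quantized.get_visual_tokenizer()
+ print("Reloaded quantized model from", quantize_save_path)
+ ```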
+
+ ## Performance
+ We report the performance of the quantized Ovis2 series. OpenCompass results are evaluated with [VLMEvalKit](https://github.com/open-compass/VLMEvalKit).
+
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/645cb4b4a03f3ebb0bde20e0/9jXouHtOH-1L5lPEtZN5S.png)
+
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/645cb4b4a03f3ebb0bde20e0/HZqgbFU-RkuqvWhmo1HpM.png)
+
+ ## Citation
+ If you find Ovis useful, please consider citing the paper:
+ ```
+ @article{lu2024ovis,
+   title={Ovis: Structural Embedding Alignment for Multimodal Large Language Model},
+   author={Shiyin Lu and Yang Li and Qing-Guo Chen and Zhao Xu and Weihua Luo and Kaifu Zhang and Han-Jia Ye},
+   year={2024},
+   journal={arXiv:2405.20797}
+ }
+ ```
+
+ ## License
+ This project is licensed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0.txt) (SPDX-License-Identifier: Apache-2.0).
+
+ ## Disclaimer
+ We used compliance-checking algorithms during the training process to ensure the compliance of the trained model to the best of our ability. Due to the complexity of the data and the diversity of language model usage scenarios, we cannot guarantee that the model is completely free of copyright issues or improper content. If you believe anything infringes on your rights or generates improper content, please contact us, and we will promptly address the matter.
added_tokens.json ADDED
@@ -0,0 +1,24 @@
+ {
+   "</tool_call>": 151658,
+   "<tool_call>": 151657,
+   "<|box_end|>": 151649,
+   "<|box_start|>": 151648,
+   "<|endoftext|>": 151643,
+   "<|file_sep|>": 151664,
+   "<|fim_middle|>": 151660,
+   "<|fim_pad|>": 151662,
+   "<|fim_prefix|>": 151659,
+   "<|fim_suffix|>": 151661,
+   "<|im_end|>": 151645,
+   "<|im_start|>": 151644,
+   "<|image_pad|>": 151655,
+   "<|object_ref_end|>": 151647,
+   "<|object_ref_start|>": 151646,
+   "<|quad_end|>": 151651,
+   "<|quad_start|>": 151650,
+   "<|repo_name|>": 151663,
+   "<|video_pad|>": 151656,
+   "<|vision_end|>": 151653,
+   "<|vision_pad|>": 151654,
+   "<|vision_start|>": 151652
+ }
config.json ADDED
@@ -0,0 +1,277 @@
1
+ {
2
+ "architectures": [
3
+ "Ovis"
4
+ ],
5
+ "auto_map": {
6
+ "AutoConfig": "configuration_ovis.OvisConfig",
7
+ "AutoModelForCausalLM": "modeling_ovis.Ovis"
8
+ },
9
+ "conversation_formatter_class": "QwenConversationFormatter",
10
+ "disable_tie_weight": false,
11
+ "hidden_size": 5120,
12
+ "llm_attn_implementation": "flash_attention_2",
13
+ "llm_config": {
14
+ "_attn_implementation_autoset": true,
15
+ "_name_or_path": "Qwen/Qwen2.5-32B-Instruct",
16
+ "add_cross_attention": false,
17
+ "architectures": [
18
+ "Qwen2ForCausalLM"
19
+ ],
20
+ "attention_dropout": 0.0,
21
+ "bad_words_ids": null,
22
+ "begin_suppress_tokens": null,
23
+ "bos_token_id": 151643,
24
+ "chunk_size_feed_forward": 0,
25
+ "cross_attention_hidden_size": null,
26
+ "decoder_start_token_id": null,
27
+ "diversity_penalty": 0.0,
28
+ "do_sample": false,
29
+ "early_stopping": false,
30
+ "encoder_no_repeat_ngram_size": 0,
31
+ "eos_token_id": 151645,
32
+ "exponential_decay_length_penalty": null,
33
+ "finetuning_task": null,
34
+ "forced_bos_token_id": null,
35
+ "forced_eos_token_id": null,
36
+ "hidden_act": "silu",
37
+ "hidden_size": 5120,
38
+ "id2label": {
39
+ "0": "LABEL_0",
40
+ "1": "LABEL_1"
41
+ },
42
+ "initializer_range": 0.02,
43
+ "intermediate_size": 27648,
44
+ "is_decoder": false,
45
+ "is_encoder_decoder": false,
46
+ "label2id": {
47
+ "LABEL_0": 0,
48
+ "LABEL_1": 1
49
+ },
50
+ "length_penalty": 1.0,
51
+ "max_length": 20,
52
+ "max_position_embeddings": 32768,
53
+ "max_window_layers": 70,
54
+ "min_length": 0,
55
+ "model_type": "qwen2",
56
+ "no_repeat_ngram_size": 0,
57
+ "num_attention_heads": 40,
58
+ "num_beam_groups": 1,
59
+ "num_beams": 1,
60
+ "num_hidden_layers": 64,
61
+ "num_key_value_heads": 8,
62
+ "num_return_sequences": 1,
63
+ "output_attentions": false,
64
+ "output_hidden_states": false,
65
+ "output_scores": false,
66
+ "pad_token_id": null,
67
+ "prefix": null,
68
+ "problem_type": null,
69
+ "pruned_heads": {},
70
+ "remove_invalid_values": false,
71
+ "repetition_penalty": 1.0,
72
+ "return_dict": true,
73
+ "return_dict_in_generate": false,
74
+ "rms_norm_eps": 1e-06,
75
+ "rope_scaling": null,
76
+ "rope_theta": 1000000.0,
77
+ "sep_token_id": null,
78
+ "sliding_window": null,
79
+ "suppress_tokens": null,
80
+ "task_specific_params": null,
81
+ "temperature": 1.0,
82
+ "tf_legacy_loss": false,
83
+ "tie_encoder_decoder": false,
84
+ "tie_word_embeddings": false,
85
+ "tokenizer_class": null,
86
+ "top_k": 50,
87
+ "top_p": 1.0,
88
+ "torch_dtype": "bfloat16",
89
+ "torchscript": false,
90
+ "typical_p": 1.0,
91
+ "use_bfloat16": false,
92
+ "use_cache": false,
93
+ "use_sliding_window": false,
94
+ "vocab_size": 152064
95
+ },
96
+ "model_type": "ovis",
97
+ "multimodal_max_length": 32768,
98
+ "quantization_config": {
99
+ "bits": 8,
100
+ "checkpoint_format": "gptq",
101
+ "desc_act": true,
102
+ "group_size": 128,
103
+ "lm_head": false,
104
+ "meta": {
105
+ "damp_auto_increment": 0.0025,
106
+ "damp_percent": 0.01,
107
+ "mse": 0.0,
108
+ "quantizer": [
109
+ "gptqmodel:2.0.0-dev"
110
+ ],
111
+ "static_groups": false,
112
+ "true_sequential": true,
113
+ "uri": "https://github.com/modelcloud/gptqmodel"
114
+ },
115
+ "pack_dtype": "int32",
116
+ "quant_method": "gptq",
117
+ "sym": true
118
+ },
119
+ "torch_dtype": "bfloat16",
120
+ "transformers_version": "4.49.0",
121
+ "use_cache": false,
122
+ "visual_tokenizer_config": {
123
+ "_attn_implementation_autoset": true,
124
+ "_name_or_path": "",
125
+ "add_cross_attention": false,
126
+ "architectures": null,
127
+ "backbone_config": {
128
+ "_attn_implementation_autoset": true,
129
+ "_name_or_path": "apple/aimv2-1B-patch14-448",
130
+ "add_cross_attention": false,
131
+ "architectures": [
132
+ "AIMv2Model"
133
+ ],
134
+ "attention_dropout": 0.0,
135
+ "auto_map": {
136
+ "AutoConfig": "configuration_aimv2.AIMv2Config",
137
+ "AutoModel": "modeling_aimv2.AIMv2Model",
138
+ "FlaxAutoModel": "modeling_flax_aimv2.FlaxAIMv2Model"
139
+ },
140
+ "bad_words_ids": null,
141
+ "begin_suppress_tokens": null,
142
+ "bos_token_id": null,
143
+ "chunk_size_feed_forward": 0,
144
+ "cross_attention_hidden_size": null,
145
+ "decoder_start_token_id": null,
146
+ "diversity_penalty": 0.0,
147
+ "do_sample": false,
148
+ "early_stopping": false,
149
+ "encoder_no_repeat_ngram_size": 0,
150
+ "eos_token_id": null,
151
+ "exponential_decay_length_penalty": null,
152
+ "finetuning_task": null,
153
+ "forced_bos_token_id": null,
154
+ "forced_eos_token_id": null,
155
+ "hidden_size": 2048,
156
+ "id2label": {
157
+ "0": "LABEL_0",
158
+ "1": "LABEL_1"
159
+ },
160
+ "image_size": 448,
161
+ "intermediate_size": 5632,
162
+ "is_decoder": false,
163
+ "is_encoder_decoder": false,
164
+ "label2id": {
165
+ "LABEL_0": 0,
166
+ "LABEL_1": 1
167
+ },
168
+ "length_penalty": 1.0,
169
+ "max_length": 20,
170
+ "min_length": 0,
171
+ "model_type": "aimv2",
172
+ "no_repeat_ngram_size": 0,
173
+ "num_attention_heads": 16,
174
+ "num_beam_groups": 1,
175
+ "num_beams": 1,
176
+ "num_channels": 3,
177
+ "num_hidden_layers": 24,
178
+ "num_return_sequences": 1,
179
+ "output_attentions": false,
180
+ "output_hidden_states": false,
181
+ "output_scores": false,
182
+ "pad_token_id": null,
183
+ "patch_size": 14,
184
+ "prefix": null,
185
+ "problem_type": null,
186
+ "projection_dropout": 0.0,
187
+ "pruned_heads": {},
188
+ "qkv_bias": false,
189
+ "remove_invalid_values": false,
190
+ "repetition_penalty": 1.0,
191
+ "return_dict": true,
192
+ "return_dict_in_generate": false,
193
+ "rms_norm_eps": 1e-05,
194
+ "sep_token_id": null,
195
+ "suppress_tokens": null,
196
+ "task_specific_params": null,
197
+ "temperature": 1.0,
198
+ "tf_legacy_loss": false,
199
+ "tie_encoder_decoder": false,
200
+ "tie_word_embeddings": true,
201
+ "tokenizer_class": null,
202
+ "top_k": 50,
203
+ "top_p": 1.0,
204
+ "torch_dtype": "bfloat16",
205
+ "torchscript": false,
206
+ "typical_p": 1.0,
207
+ "use_bfloat16": false,
208
+ "use_bias": false
209
+ },
210
+ "backbone_kwargs": {},
211
+ "bad_words_ids": null,
212
+ "begin_suppress_tokens": null,
213
+ "bos_token_id": null,
214
+ "chunk_size_feed_forward": 0,
215
+ "cross_attention_hidden_size": null,
216
+ "decoder_start_token_id": null,
217
+ "depths": null,
218
+ "diversity_penalty": 0.0,
219
+ "do_sample": false,
220
+ "drop_cls_token": false,
221
+ "early_stopping": false,
222
+ "encoder_no_repeat_ngram_size": 0,
223
+ "eos_token_id": null,
224
+ "exponential_decay_length_penalty": null,
225
+ "finetuning_task": null,
226
+ "forced_bos_token_id": null,
227
+ "forced_eos_token_id": null,
228
+ "hidden_stride": 2,
229
+ "id2label": {
230
+ "0": "LABEL_0",
231
+ "1": "LABEL_1"
232
+ },
233
+ "is_decoder": false,
234
+ "is_encoder_decoder": false,
235
+ "label2id": {
236
+ "LABEL_0": 0,
237
+ "LABEL_1": 1
238
+ },
239
+ "length_penalty": 1.0,
240
+ "max_length": 20,
241
+ "min_length": 0,
242
+ "model_type": "aimv2_visual_tokenizer",
243
+ "no_repeat_ngram_size": 0,
244
+ "num_beam_groups": 1,
245
+ "num_beams": 1,
246
+ "num_return_sequences": 1,
247
+ "output_attentions": false,
248
+ "output_hidden_states": false,
249
+ "output_scores": false,
250
+ "pad_token_id": null,
251
+ "prefix": null,
252
+ "problem_type": null,
253
+ "pruned_heads": {},
254
+ "remove_invalid_values": false,
255
+ "repetition_penalty": 1.0,
256
+ "return_dict": true,
257
+ "return_dict_in_generate": false,
258
+ "sep_token_id": null,
259
+ "suppress_tokens": null,
260
+ "task_specific_params": null,
261
+ "tau": 1.0,
262
+ "temperature": 1.0,
263
+ "tf_legacy_loss": false,
264
+ "tie_encoder_decoder": false,
265
+ "tie_word_embeddings": true,
266
+ "tokenize_function": "softmax",
267
+ "tokenizer_class": null,
268
+ "top_k": 50,
269
+ "top_p": 1.0,
270
+ "torch_dtype": null,
271
+ "torchscript": false,
272
+ "typical_p": 1.0,
273
+ "use_bfloat16": false,
274
+ "use_indicators": false,
275
+ "vocab_size": 65536
276
+ }
277
+ }
configuration_aimv2.py ADDED
@@ -0,0 +1,63 @@
+ # copied from https://huggingface.co/apple/aimv2-huge-patch14-448
+ from typing import Any
+
+ from transformers.configuration_utils import PretrainedConfig
+
+ __all__ = ["AIMv2Config"]
+
+
+ class AIMv2Config(PretrainedConfig):
+     """This is the configuration class to store the configuration of an [`AIMv2Model`].
+
+     Instantiating a configuration with the defaults will yield a similar configuration
+     to that of the [apple/aimv2-large-patch14-224](https://huggingface.co/apple/aimv2-large-patch14-224).
+
+     Args:
+         hidden_size: Dimension of the hidden representations.
+         intermediate_size: Dimension of the SwiGLU representations.
+         num_hidden_layers: Number of hidden layers in the Transformer.
+         num_attention_heads: Number of attention heads for each attention layer
+             in the Transformer.
+         num_channels: Number of input channels.
+         image_size: Image size.
+         patch_size: Patch size.
+         rms_norm_eps: Epsilon value used for the RMS normalization layer.
+         attention_dropout: Dropout ratio for attention probabilities.
+         projection_dropout: Dropout ratio for the projection layer after the attention.
+         qkv_bias: Whether to add a bias to the queries, keys and values.
+         use_bias: Whether to add a bias in the feed-forward and projection layers.
+         kwargs: Keyword arguments for the [`PretrainedConfig`].
+     """
+
+     model_type: str = "aimv2"
+
+     def __init__(
+         self,
+         hidden_size: int = 1024,
+         intermediate_size: int = 2816,
+         num_hidden_layers: int = 24,
+         num_attention_heads: int = 8,
+         num_channels: int = 3,
+         image_size: int = 224,
+         patch_size: int = 14,
+         rms_norm_eps: float = 1e-5,
+         attention_dropout: float = 0.0,
+         projection_dropout: float = 0.0,
+         qkv_bias: bool = False,
+         use_bias: bool = False,
+         **kwargs: Any,
+     ):
+         super().__init__(**kwargs)
+         self.hidden_size = hidden_size
+         self.intermediate_size = intermediate_size
+         self.num_hidden_layers = num_hidden_layers
+         self.num_attention_heads = num_attention_heads
+         self.num_channels = num_channels
+         self.patch_size = patch_size
+         self.image_size = image_size
+         self.attention_dropout = attention_dropout
+         self.rms_norm_eps = rms_norm_eps
+
+         self.projection_dropout = projection_dropout
+         self.qkv_bias = qkv_bias
+         self.use_bias = use_bias
configuration_ovis.py ADDED
@@ -0,0 +1,204 @@
1
+ from abc import ABC, abstractmethod
2
+ from typing import List, Dict, Union, Optional
3
+
4
+ from transformers import PretrainedConfig, AutoConfig, AutoModel
5
+ from .configuration_aimv2 import AIMv2Config
6
+ from .modeling_aimv2 import AIMv2Model
7
+
8
+ IGNORE_ID = -100
9
+ IMAGE_TOKEN_ID = -200
10
+ IMAGE_TOKEN = "<image>"
11
+ IMAGE_ATOM_ID = -300
12
+ IMAGE_INDICATOR_IDS = [-301, -302, -303, -304, -305]
13
+
14
+ AutoConfig.register("aimv2", AIMv2Config)
15
+ AutoModel.register(AIMv2Config, AIMv2Model)
16
+
17
+ # ----------------------------------------------------------------------
18
+ # Visual Tokenizer Configuration
19
+ # ----------------------------------------------------------------------
20
+ class BaseVisualTokenizerConfig(PretrainedConfig):
21
+ def __init__(
22
+ self,
23
+ vocab_size=16384,
24
+ tokenize_function="softmax",
25
+ tau=1.0,
26
+ depths=None,
27
+ drop_cls_token=False,
28
+ backbone_config: Optional[Union[PretrainedConfig, dict]] = None,
29
+ hidden_stride: int = 1,
30
+ **kwargs
31
+ ):
32
+ super().__init__(**kwargs)
33
+ self.vocab_size = vocab_size
34
+ self.tokenize_function = tokenize_function
35
+ self.tau = tau
36
+ if isinstance(depths, str):
37
+ depths = [int(x) for x in depths.split('|')]
38
+ self.depths = depths
39
+ self.backbone_kwargs = {}
40
+ self.drop_cls_token = drop_cls_token
41
+ if backbone_config is not None:
42
+ assert isinstance(backbone_config, (PretrainedConfig, dict)), \
43
+ f"expect `backbone_config` to be instance of PretrainedConfig or dict, but got {type(backbone_config)} type"
44
+ if not isinstance(backbone_config, PretrainedConfig):
45
+ model_type = backbone_config['model_type']
46
+ backbone_config.pop('model_type')
47
+ backbone_config = AutoConfig.for_model(model_type, **backbone_config)
48
+ self.backbone_config = backbone_config
49
+ self.hidden_stride = hidden_stride
50
+
51
+
52
+ class Aimv2VisualTokenizerConfig(BaseVisualTokenizerConfig):
53
+ model_type = "aimv2_visual_tokenizer"
54
+
55
+ def __init__(self, **kwargs):
56
+ super().__init__(**kwargs)
57
+ if self.drop_cls_token:
58
+ self.drop_cls_token = False
59
+ if self.depths:
60
+ assert len(self.depths) == 1
61
+ self.backbone_kwargs['num_hidden_layers'] = self.depths[0]
62
+
63
+
64
+ AutoConfig.register("aimv2_visual_tokenizer", Aimv2VisualTokenizerConfig)
65
+
66
+
67
+ # ----------------------------------------------------------------------
68
+ # Ovis Configuration
69
+ # ----------------------------------------------------------------------
70
+ class OvisConfig(PretrainedConfig):
71
+ model_type = "ovis"
72
+
73
+ def __init__(
74
+ self,
75
+ llm_config: Optional[Union[PretrainedConfig, dict]] = None,
76
+ visual_tokenizer_config: Optional[Union[PretrainedConfig, dict]] = None,
77
+ multimodal_max_length=8192,
78
+ hidden_size=None,
79
+ conversation_formatter_class=None,
80
+ llm_attn_implementation=None,
81
+ disable_tie_weight=False,
82
+ **kwargs
83
+ ):
84
+ super().__init__(**kwargs)
85
+ if llm_config is not None:
86
+ assert isinstance(llm_config, (PretrainedConfig, dict)), \
87
+ f"expect `llm_config` to be instance of PretrainedConfig or dict, but got {type(llm_config)} type"
88
+ if not isinstance(llm_config, PretrainedConfig):
89
+ model_type = llm_config['model_type']
90
+ llm_config.pop('model_type')
91
+ llm_config = AutoConfig.for_model(model_type, **llm_config)
92
+ self.llm_config = llm_config
93
+ if visual_tokenizer_config is not None:
94
+ assert isinstance(visual_tokenizer_config, (PretrainedConfig, dict)), \
95
+ f"expect `visual_tokenizer_config` to be instance of PretrainedConfig or dict, but got {type(visual_tokenizer_config)} type"
96
+ if not isinstance(visual_tokenizer_config, PretrainedConfig):
97
+ model_type = visual_tokenizer_config['model_type']
98
+ visual_tokenizer_config.pop('model_type')
99
+ visual_tokenizer_config = AutoConfig.for_model(model_type, **visual_tokenizer_config)
100
+ self.visual_tokenizer_config = visual_tokenizer_config
101
+ self.multimodal_max_length = multimodal_max_length
102
+ self.hidden_size = hidden_size
103
+ self.conversation_formatter_class = conversation_formatter_class
104
+ self.llm_attn_implementation = llm_attn_implementation
105
+ self.disable_tie_weight = disable_tie_weight
106
+
107
+
108
+ # ----------------------------------------------------------------------
109
+ # Conversation Formatter
110
+ # ----------------------------------------------------------------------
111
+ class ConversationFormatter(ABC):
112
+ support_tokenizer_types = None
113
+
114
+ def __init__(self, tokenizer):
115
+ tokenizer_type = type(tokenizer).__name__
116
+ assert tokenizer_type in self.support_tokenizer_types, \
117
+ f'Invalid tokenizer type, expected one from `{self.support_tokenizer_types}`, but got `{tokenizer_type}`'
118
+ self.tokenizer = tokenizer
119
+ self.image_token = IMAGE_TOKEN
120
+ self.image_token_id = IMAGE_TOKEN_ID
121
+ self.ignore_id = IGNORE_ID
122
+
123
+ def _tokenize_with_image_symbol(self, text):
124
+ text_chunks = [self.tokenizer(chunk, add_special_tokens=False).input_ids for chunk in
125
+ text.split(self.image_token)]
126
+ token_ids = []
127
+ num_chuck = len(text_chunks)
128
+ for i, chunk in enumerate(text_chunks):
129
+ token_ids.extend(chunk)
130
+ if i < num_chuck - 1:
131
+ token_ids.append(self.image_token_id)
132
+ return token_ids
133
+
134
+ @abstractmethod
135
+ def format(self, conversations: List[Dict], generation_preface=None):
136
+ pass
137
+
138
+ @abstractmethod
139
+ def format_query(self, query, generation_preface=""):
140
+ pass
141
+
142
+
143
+ class QwenConversationFormatter(ConversationFormatter):
144
+ support_tokenizer_types = ['QWenTokenizer', 'Qwen2TokenizerFast']
145
+
146
+ def __init__(self, tokenizer):
147
+ super().__init__(tokenizer)
148
+ self.from2role = {
149
+ "system": "<|im_start|>system\n",
150
+ "human": "<|im_start|>user\n",
151
+ "gpt": "<|im_start|>assistant\n",
152
+ }
153
+ self.gpt_token_num = None
154
+ self.im_end = "<|im_end|>\n"
155
+ self.default_system_prompt = "You are a helpful assistant."
156
+
157
+ def format(self, conversations: List[Dict], generation_preface=None):
158
+ if self.gpt_token_num is None:
159
+ self.gpt_token_num = len(self.tokenizer(self.from2role["gpt"], add_special_tokens=False).input_ids)
160
+
161
+ if conversations[0]["from"] != "system":
162
+ conversations.insert(0, {
163
+ "from": "system",
164
+ "value": self.default_system_prompt
165
+ })
166
+
167
+ if generation_preface is not None:
168
+ conversations.append({
169
+ "from": "gpt",
170
+ "value": generation_preface
171
+ })
172
+
173
+ prompt = ""
174
+ input_ids = []
175
+ labels = []
176
+ num_conversation = len(conversations)
177
+ for i, conversation in enumerate(conversations):
178
+ frm = conversation["from"]
179
+ role = self.from2role[frm]
180
+ message = conversation["value"]
181
+ text = role + message
182
+ if i < num_conversation - 1 or generation_preface is None:
183
+ text += self.im_end
184
+ prompt += text
185
+ token_ids = self._tokenize_with_image_symbol(text)
186
+ input_ids.extend(token_ids)
187
+ label_ids = [self.ignore_id] * len(token_ids)
188
+ if frm == "gpt" and generation_preface is None:
189
+ # learning `\n` following `im_end` is meaningless, so the last `\n` token is ignored in label
190
+ label_ids[self.gpt_token_num:-1] = token_ids[self.gpt_token_num:-1]
191
+ labels.extend(label_ids)
192
+
193
+ assert self._tokenize_with_image_symbol(prompt) == input_ids
194
+ assert len(input_ids) == len(labels)
195
+
196
+ return prompt, input_ids, labels
197
+
198
+ def format_query(self, query, generation_preface=""):
199
+ prompt, input_ids, _ = self.format([{
200
+ "from": "human",
201
+ "value": query
202
+ }], generation_preface=generation_preface)
203
+
204
+ return prompt, input_ids
generation_config.json ADDED
@@ -0,0 +1,15 @@
+ {
+   "bos_token_id": 151643,
+   "do_sample": true,
+   "eos_token_id": [
+     151645,
+     151643
+   ],
+   "multimodal_max_length": 32768,
+   "pad_token_id": 151643,
+   "repetition_penalty": 1.05,
+   "temperature": 0.7,
+   "top_k": 20,
+   "top_p": 0.8,
+   "transformers_version": "4.49.0"
+ }
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model-00001-of-00009.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:31ae9b24a7cd23db4fcaed4b22ac81712dba712ec1a6451f893bd2745a884620
+ size 4907142744
model-00002-of-00009.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9f81967bba3188be0e8d56f55e0fb8cd592562d23a5801e215c64da66b27c0ab
+ size 4992878208
model-00003-of-00009.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:fd0e16824a9be081891012bdea97fc932c539ae72814881f0d3186be25b76ff1
+ size 4992878336
model-00004-of-00009.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ecac693371e90df59071aa4522c408b32066a2fe6434bddfdf0752a3ed2c7df7
+ size 4992878336
model-00005-of-00009.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8a857c3741713a799f8f28e99b798d0f44310f24c72de0f5585cfadf252d2928
+ size 4992878336
model-00006-of-00009.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1ebaf7192f0e72b9a1e7bbf70664e92f6e341a9927e67fbbdbdc38d3136bc2c1
+ size 4992878336
model-00007-of-00009.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3e61f20d2013eb7553f85e052195190f99b64a039e06cfeeca457c9ed823c928
+ size 3640032456
model-00008-of-00009.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:77d41a5176fdcf0f2bf42b11d20feda0151510c3b2ddb10de8502f8ca0d54990
+ size 4030220952
model-00009-of-00009.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0d0ce68d9d72cb3bbe72a58ce646f3def257d0f1513fb8e971278a654599a49b
+ size 1745011108
model.safetensors.index.json ADDED
The diff for this file is too large to render. See raw diff
 
modeling_aimv2.py ADDED
@@ -0,0 +1,198 @@
1
+ # adapted from https://huggingface.co/apple/aimv2-huge-patch14-448 (modification: add gradient checkpoint support)
2
+ from typing import Optional, Tuple, Union
3
+
4
+ import torch
5
+ from .configuration_aimv2 import AIMv2Config
6
+ from torch import nn
7
+ from torch.nn import functional as F
8
+ from transformers.modeling_outputs import BaseModelOutputWithNoAttention
9
+ from transformers.modeling_utils import PreTrainedModel
10
+
11
+ __all__ = ["AIMv2Model"]
12
+
13
+
14
+ class RMSNorm(nn.Module):
15
+ def __init__(self, dim: int, eps: float = 1e-6):
16
+ super().__init__()
17
+ self.weight = nn.Parameter(torch.ones(dim))
18
+ self.eps = eps
19
+
20
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
21
+ output = self._norm(x.float()).type_as(x)
22
+ return output * self.weight
23
+
24
+ def extra_repr(self) -> str:
25
+ return f"{tuple(self.weight.shape)}, eps={self.eps}"
26
+
27
+ def _norm(self, x: torch.Tensor) -> torch.Tensor:
28
+ return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
29
+
30
+
31
+ class AIMv2SwiGLUFFN(nn.Module):
32
+ def __init__(self, config: AIMv2Config):
33
+ super().__init__()
34
+ hidden_features = config.intermediate_size
35
+ in_features = config.hidden_size
36
+ bias = config.use_bias
37
+
38
+ self.fc1 = nn.Linear(in_features, hidden_features, bias=bias)
39
+ self.fc2 = nn.Linear(hidden_features, in_features, bias=bias)
40
+ self.fc3 = nn.Linear(in_features, hidden_features, bias=bias)
41
+
42
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
43
+ x = F.silu(self.fc1(x)) * self.fc3(x)
44
+ x = self.fc2(x)
45
+ return x
46
+
47
+
48
+ class AIMv2PatchEmbed(nn.Module):
49
+ def __init__(self, config: AIMv2Config):
50
+ super().__init__()
51
+ self.proj = nn.Conv2d(
52
+ config.num_channels,
53
+ config.hidden_size,
54
+ kernel_size=(config.patch_size, config.patch_size),
55
+ stride=(config.patch_size, config.patch_size),
56
+ )
57
+ self.norm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
58
+
59
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
60
+ x = self.proj(x).flatten(2).transpose(1, 2)
61
+ x = self.norm(x)
62
+ return x
63
+
64
+
65
+ class AIMv2ViTPreprocessor(nn.Module):
66
+ def __init__(self, config: AIMv2Config):
67
+ super().__init__()
68
+ num_patches = (config.image_size // config.patch_size) ** 2
69
+
70
+ self.patchifier = AIMv2PatchEmbed(config)
71
+ self.pos_embed = nn.Parameter(torch.zeros((1, num_patches, config.hidden_size)))
72
+
73
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
74
+ tokens = self.patchifier(x)
75
+ _, N, _ = tokens.shape
76
+ pos_embed = self.pos_embed.to(tokens.device)
77
+ tokens = tokens + pos_embed[:, :N]
78
+ return tokens
79
+
80
+
81
+ class AIMv2Attention(nn.Module):
82
+ def __init__(self, config: AIMv2Config):
83
+ super().__init__()
84
+ dim = config.hidden_size
85
+
86
+ self.num_heads = config.num_attention_heads
87
+ self.qkv = nn.Linear(dim, dim * 3, bias=config.qkv_bias)
88
+ self.attn_drop = nn.Dropout(config.attention_dropout)
89
+ self.proj = nn.Linear(dim, dim, bias=config.use_bias)
90
+ self.proj_drop = nn.Dropout(config.projection_dropout)
91
+
92
+ def forward(
93
+ self, x: torch.Tensor, mask: Optional[torch.Tensor] = None
94
+ ) -> torch.Tensor:
95
+ B, N, C = x.shape
96
+ qkv = (
97
+ self.qkv(x)
98
+ .reshape(B, N, 3, self.num_heads, C // self.num_heads)
99
+ .permute(2, 0, 3, 1, 4)
100
+ )
101
+ q, k, v = qkv.unbind(0)
102
+
103
+ x = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
104
+ x = x.transpose(1, 2).contiguous().reshape(B, N, C)
105
+ x = self.proj(x)
106
+ x = self.proj_drop(x)
107
+ return x
108
+
109
+
110
+ class AIMv2Block(nn.Module):
111
+ def __init__(self, config: AIMv2Config):
112
+ super().__init__()
113
+ self.attn = AIMv2Attention(config)
114
+ self.norm_1 = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
115
+ self.mlp = AIMv2SwiGLUFFN(config)
116
+ self.norm_2 = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
117
+
118
+ def forward(
119
+ self, x: torch.Tensor, mask: Optional[torch.Tensor] = None
120
+ ) -> torch.Tensor:
121
+ x = x + self.attn(self.norm_1(x), mask)
122
+ x = x + self.mlp(self.norm_2(x))
123
+ return x
124
+
125
+
126
+ class AIMv2Transformer(nn.Module):
127
+ def __init__(self, config: AIMv2Config):
128
+ super().__init__()
129
+ self.blocks = nn.ModuleList(
130
+ [AIMv2Block(config) for _ in range(config.num_hidden_layers)]
131
+ )
132
+ self.post_trunk_norm = RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
133
+ self.gradient_checkpointing = False
134
+
135
+ def forward(
136
+ self,
137
+ tokens: torch.Tensor,
138
+ mask: Optional[torch.Tensor] = None,
139
+ output_hidden_states: bool = False,
140
+ ) -> Tuple[torch.Tensor, Optional[Tuple[torch.Tensor, ...]]]:
141
+ hidden_states = () if output_hidden_states else None
142
+ for block in self.blocks:
143
+ if self.gradient_checkpointing and self.training:
144
+ tokens = self._gradient_checkpointing_func(block.__call__, tokens, mask)
145
+ else:
146
+ tokens = block(tokens, mask)
147
+ if output_hidden_states:
148
+ hidden_states += (tokens,)
149
+ tokens = self.post_trunk_norm(tokens)
150
+ return tokens, hidden_states
151
+
152
+
153
+ class AIMv2PretrainedModel(PreTrainedModel):
154
+ config_class = AIMv2Config
155
+ base_model_prefix = "aimv2"
156
+ supports_gradient_checkpointing = True
157
+ main_input_name = "pixel_values"
158
+ _no_split_modules = ["AIMv2ViTPreprocessor", "AIMv2Block"]
159
+ _supports_sdpa = True
160
+
161
+
162
+ class AIMv2Model(AIMv2PretrainedModel):
163
+ def __init__(self, config: AIMv2Config):
164
+ super().__init__(config)
165
+ self.preprocessor = AIMv2ViTPreprocessor(config)
166
+ self.trunk = AIMv2Transformer(config)
167
+
168
+ def forward(
169
+ self,
170
+ pixel_values: torch.Tensor,
171
+ mask: Optional[torch.Tensor] = None,
172
+ output_hidden_states: Optional[bool] = None,
173
+ return_dict: Optional[bool] = None,
174
+ ) -> Union[
175
+ Tuple[torch.Tensor],
176
+ Tuple[torch.Tensor, Tuple[torch.Tensor, ...]],
177
+ BaseModelOutputWithNoAttention,
178
+ ]:
179
+ if output_hidden_states is None:
180
+ output_hidden_states = self.config.output_hidden_states
181
+ if return_dict is None:
182
+ return_dict = self.config.use_return_dict
183
+
184
+ x = self.preprocessor(pixel_values)
185
+ x, hidden_states = self.trunk(
186
+ x, mask, output_hidden_states=output_hidden_states
187
+ )
188
+
189
+ if not return_dict:
190
+ res = (x,)
191
+ res += (hidden_states,) if output_hidden_states else ()
192
+ return res
193
+
194
+ return BaseModelOutputWithNoAttention(
195
+ last_hidden_state=x,
196
+ hidden_states=hidden_states,
197
+ )
198
+
modeling_ovis.py ADDED
@@ -0,0 +1,590 @@
1
+ # Copyright (C) 2025 AIDC-AI
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ # http://www.apache.org/licenses/LICENSE-2.0
7
+ #
8
+ # Unless required by applicable law or agreed to in writing, software
9
+ # distributed under the License is distributed on an "AS IS" BASIS,
10
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
11
+ #
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+
15
+ import logging
16
+ import os
17
+ import importlib.metadata
18
+
19
+ from packaging import version
20
+ from importlib import import_module
21
+ from typing import List, Callable, Union, Optional, Dict
22
+
23
+ import PIL.Image
24
+ import torch
25
+ from torch import Tensor
26
+ from torch.nn import init
27
+ from torch.nn.functional import softmax, gumbel_softmax, pad
28
+ from transformers.utils import is_flash_attn_2_available
29
+ from transformers import PreTrainedModel, AutoModel, AutoTokenizer, AutoModelForCausalLM, AutoImageProcessor
30
+ from transformers.generation.utils import GenerateOutput
31
+
32
+ from .configuration_ovis import BaseVisualTokenizerConfig, Aimv2VisualTokenizerConfig
33
+ from .configuration_ovis import OvisConfig, ConversationFormatter
34
+ from .configuration_ovis import IGNORE_ID, IMAGE_ATOM_ID, IMAGE_INDICATOR_IDS, IMAGE_TOKEN_ID
35
+
36
+ # ----------------------------------------------------------------------
37
+ # Visual Tokenizer
38
+ # ----------------------------------------------------------------------
39
+ class BaseVisualTokenizer(PreTrainedModel):
40
+ base_model_prefix = "backbone"
41
+ main_input_name = None
42
+ _image_processor_class = None
43
+ _image_processor_kwargs = {}
44
+ _backbone_class = None
45
+ _backbone_name_or_path = None
46
+
47
+ def __init__(self, config: BaseVisualTokenizerConfig, *inputs, **kwargs):
48
+ super().__init__(config, *inputs, **kwargs)
49
+ self.image_processor = AutoImageProcessor.from_pretrained(kwargs['image_processor_name_or_path'])
50
+ self.backbone = AutoModel.from_config(self.config.backbone_config)
51
+ head_dim = self.config.vocab_size - len(IMAGE_INDICATOR_IDS) # reserved tokens for IMAGE_INDICATORS
52
+ self.head = torch.nn.Sequential(
53
+ torch.nn.Linear(
54
+ self.backbone.config.hidden_size * self.config.hidden_stride * self.config.hidden_stride, head_dim,
55
+ bias=False
56
+ ),
57
+ torch.nn.LayerNorm(head_dim)
58
+ )
59
+
60
+ assert all((self.image_processor.do_resize,
61
+ not getattr(self.image_processor, 'do_center_crop', False),
62
+ self.image_processor.do_rescale,
63
+ self.image_processor.do_normalize
64
+ )), f"image_processor `{self.image_processor}` is not supported currently"
65
+
66
+ def get_backbone(self):
67
+ return self.backbone
68
+
69
+ def get_image_processor(self):
70
+ return self.image_processor
71
+
72
+ def mock_input(self):
73
+ height, width = self.get_image_size()
74
+ return torch.zeros(1, 3, height, width), self.construct_image_placeholders((1, 1))
75
+
76
+ def get_head(self):
77
+ return self.head
78
+
79
+ def get_image_size(self):
80
+ raise NotImplementedError
81
+
82
+ @staticmethod
83
+ def construct_image_placeholders(grid):
84
+ image_placeholders = [IMAGE_INDICATOR_IDS[0], IMAGE_ATOM_ID, IMAGE_INDICATOR_IDS[1]]
85
+ if grid[0] * grid[1] > 1:
86
+ for r in range(grid[0]):
87
+ for c in range(grid[1]):
88
+ image_placeholders.append(IMAGE_ATOM_ID)
89
+ if c < grid[1] - 1:
90
+ image_placeholders.append(IMAGE_INDICATOR_IDS[2])
91
+ if r < grid[0] - 1:
92
+ image_placeholders.append(IMAGE_INDICATOR_IDS[3])
93
+ image_placeholders.append(IMAGE_INDICATOR_IDS[4])
94
+ return image_placeholders
95
+
96
+ def preprocess_image(self, image: PIL.Image.Image, max_partition=9, covering_threshold=0.9, convert_to_rgb=True):
97
+ def _preprocess(img: PIL.Image.Image, side):
98
+ # first resize and preprocess
99
+ w, h = img.size
100
+ if w == h:
101
+ new_width = new_height = side
102
+ elif w > h:
103
+ new_width = side
104
+ new_height = int(h / w * new_width)
105
+ else:
106
+ new_height = side
107
+ new_width = int(w / h * new_height)
108
+ new_size = dict(height=new_height, width=new_width)
109
+ pixel_values = self.image_processor.preprocess(img, size=new_size, return_tensors='pt')['pixel_values']
110
+
111
+ # then pad to square
112
+ square_values = torch.zeros([1, 3, side, side], dtype=pixel_values.dtype, device=pixel_values.device)
113
+ new_height, new_width = pixel_values.shape[2:]
114
+ if new_height == new_width:
115
+ square_values[:, :, :, :] = pixel_values
116
+ elif new_height > new_width:
117
+ from_index = (side - new_width) // 2
118
+ square_values[:, :, :, from_index:from_index + new_width] = pixel_values
119
+ else:
120
+ from_index = (side - new_height) // 2
121
+ square_values[:, :, from_index:from_index + new_height, :] = pixel_values
122
+
123
+ return square_values
124
+
125
+ def _partition(img, grid):
126
+ w, h = img.size
127
+ row_height = h // grid[0]
128
+ col_width = w // grid[1]
129
+
130
+ partition = []
131
+ for row in range(grid[0]):
132
+ for col in range(grid[1]):
133
+ left = col * col_width
134
+ upper = row * row_height
135
+ right = w if col == grid[1] - 1 else (col + 1) * col_width
136
+ lower = h if row == grid[0] - 1 else (row + 1) * row_height
137
+ partition.append((left, upper, right, lower))
138
+
139
+ return partition
140
+
141
+ def _covering_area(left, upper, right, lower, side):
142
+ w = right - left
143
+ h = lower - upper
144
+ w, h = max(w, h), min(w, h)
145
+ if w > side:
146
+ h = h / w * side
147
+ w = side
148
+ return w * h
149
+
150
+ def _get_best_grid(img, side):
151
+ img_area = img.size[0] * img.size[1]
152
+
153
+ candidate_grids = []
154
+ for i in range(1, max_partition + 1):
155
+ for j in range(1, max_partition + 1):
156
+ if i * j <= max_partition:
157
+ candidate_grids.append((i, j))
158
+
159
+ all_grids = []
160
+ good_grids = []
161
+ for grid in candidate_grids:
162
+ partition = _partition(img, grid)
163
+ covering_ratio = sum([_covering_area(*p, side) for p in partition]) / img_area
164
+ assert covering_ratio <= 1.0
165
+ all_grids.append((grid, covering_ratio))
166
+ if covering_ratio > covering_threshold:
167
+ good_grids.append((grid, covering_ratio))
168
+
169
+ if len(good_grids) > 0:
170
+ # pick the good partition with minimum #sub_images and break the tie using covering_ratio
171
+ return sorted(good_grids, key=lambda x: (x[0][0] * x[0][1], -x[1]))[0][0]
172
+ else:
173
+ # pick the partition with maximum covering_ratio and break the tie using #sub_images
174
+ return sorted(all_grids, key=lambda x: (-x[1], x[0][0] * x[0][1]))[0][0]
175
+
176
+ if convert_to_rgb and image.mode != 'RGB':
177
+ image = image.convert('RGB')
178
+
179
+ sides = self.get_image_size()
180
+ if sides[0] != sides[1]:
181
+ raise ValueError('get_image_size() returns non-square size')
182
+ side = sides[0]
183
+ grid = _get_best_grid(image, side)
184
+ partition = _partition(image, grid)
185
+ crops = [image.crop(p) for p in partition]
186
+ if len(crops) > 1:
187
+ crops.insert(0, image)
188
+ pixel_values = torch.cat([_preprocess(crop, side) for crop in crops], dim=0)
189
+ image_placeholders = self.construct_image_placeholders(grid)
190
+ return pixel_values, image_placeholders
191
+
192
+ def tokenize(self, logits):
193
+ def st_argmax(y_soft, dim): # straight-through argmax: hard one-hot forward, soft gradient in backward
194
+ index = y_soft.max(dim, keepdim=True)[1]
195
+ y_hard = torch.zeros_like(y_soft, memory_format=torch.legacy_contiguous_format).scatter_(dim, index, 1.0)
196
+ ret = y_hard - y_soft.detach() + y_soft
197
+ return ret
198
+
199
+ if self.config.tokenize_function == 'softmax':
200
+ tokens = softmax(logits, dim=-1)
201
+ elif self.config.tokenize_function == 'gumbel_argmax':
202
+ tokens = gumbel_softmax(logits, tau=self.config.tau, hard=True)
203
+ elif self.config.tokenize_function == 'st_argmax':
204
+ tokens = st_argmax(logits, dim=-1)
205
+ else:
206
+ raise ValueError(
207
+ f'Invalid `max_type`, expected softmax or gumbel_argmax or st_argmax, but got {self.config.tokenize_function}')
208
+ return tokens
209
+
210
+ def encode(self, pixel_values):
211
+ output = self.backbone(pixel_values, output_hidden_states=True, return_dict=True)
212
+ features = output.hidden_states[-1]
213
+ if self.config.drop_cls_token:
214
+ features = features[:, 1:, :]
215
+
216
+ # merge number of `hidden_stride * hidden_stride` hidden states together to reduce token sequence length
217
+ # e.g., for hidden_stride=2, this leads to a token length reduction: 1024 -> 256 for aimv2
218
+ if self.config.hidden_stride > 1:
219
+ n, l, d = features.shape # this `d` may differ from the `d` above
220
+ sqrt_l = int(l ** 0.5)
221
+ assert sqrt_l ** 2 == l, "The token sequence length should be a perfect square."
222
+ features = features.reshape(n, sqrt_l, sqrt_l, d)
223
+ pl = (self.config.hidden_stride - (sqrt_l % self.config.hidden_stride)) % self.config.hidden_stride
224
+ features = pad(features, (0, 0, 0, pl, 0, pl), "constant", 0)
225
+ sqrt_l += pl
226
+ features = features.reshape(n, sqrt_l // self.config.hidden_stride, self.config.hidden_stride,
227
+ sqrt_l // self.config.hidden_stride, self.config.hidden_stride, d)
228
+ features = features.permute(0, 1, 3, 2, 4, 5) # [n, sqrt_l/hs, sqrt_l/hs, hs, hs, d]
229
+ features = features.flatten(3) # [n, sqrt_l/hs, sqrt_l/hs, hs*hs*d]
230
+ features = features.reshape(
231
+ n, -1, self.config.hidden_stride * self.config.hidden_stride * d)
232
+
233
+ return features
234
+
235
+ def forward(self, pixel_values) -> torch.Tensor: # [BatchSize, ImageShape] -> [BatchSize, #Token, VocabSize]
236
+ features = self.encode(pixel_values)
237
+ logits = self.head(features)
238
+ tokens = self.tokenize(logits)
239
+ # tokens' shape is [BatchSize, #Token, VocabSize-5]; pad with a zero tensor of shape [BatchSize, #Token, 5]
240
+ # so that tokens' shape becomes [BatchSize, #Token, VocabSize]
241
+ batch_size, token_len, _ = tokens.shape
242
+ padding_tensor = torch.zeros(size=(batch_size, token_len, len(IMAGE_INDICATOR_IDS)),
243
+ dtype=tokens.dtype,
244
+ device=tokens.device,
245
+ layout=tokens.layout,
246
+ requires_grad=False)
247
+ tokens = torch.cat((tokens, padding_tensor), dim=2)
248
+ return tokens
249
+
250
+
251
+ class Aimv2VisualTokenizer(BaseVisualTokenizer):
252
+ config_class = Aimv2VisualTokenizerConfig
253
+ supports_gradient_checkpointing = True
254
+ _no_split_modules = ["AIMv2ViTPreprocessor", "AIMv2Block"]
255
+ _image_processor_kwargs = dict(do_center_crop=False)
256
+
257
+ def get_image_size(self):
258
+ height = self.image_processor.crop_size["height"]
259
+ width = self.image_processor.crop_size["width"]
260
+ return height, width
261
+
262
+
263
+ AutoModel.register(Aimv2VisualTokenizerConfig, Aimv2VisualTokenizer)
264
+
265
+
266
+ # ----------------------------------------------------------------------
267
+ # Ovis
268
+ # ----------------------------------------------------------------------
269
+ class VisualEmbedding(torch.nn.Embedding):
270
+ def forward(self, visual_tokens: Tensor) -> Tensor:
271
+ if visual_tokens.dtype in [torch.int8, torch.int16, torch.int32, torch.int64, torch.long]:
272
+ return super().forward(visual_tokens)
273
+ return torch.matmul(visual_tokens, self.weight)
274
+
275
+ def reset_parameters(self, mean=0., std=1.) -> None:
276
+ init.normal_(self.weight, mean=mean, std=std)
277
+ self._fill_padding_idx_with_zero()
278
+
279
+
280
+ class OvisPreTrainedModel(PreTrainedModel):
281
+ config_class = OvisConfig
282
+ base_model_prefix = "ovis"
283
+
284
+
285
+ class Ovis(OvisPreTrainedModel):
286
+
287
+ def __init__(self, config: OvisConfig, *inputs, **kwargs):
288
+ super().__init__(config, *inputs, **kwargs)
289
+ attn_kwargs = dict()
290
+ if self.config.llm_attn_implementation:
291
+ if self.config.llm_attn_implementation == "flash_attention_2":
292
+ assert (is_flash_attn_2_available() and
293
+ version.parse(importlib.metadata.version("flash_attn")) >= version.parse("2.6.3")), \
294
+ "Using `flash_attention_2` requires having `flash_attn>=2.6.3` installed."
295
+ attn_kwargs["attn_implementation"] = self.config.llm_attn_implementation
296
+ self.llm = AutoModelForCausalLM.from_config(self.config.llm_config, **attn_kwargs)
297
+ assert self.config.hidden_size == self.llm.config.hidden_size, "hidden size mismatch"
298
+ self.text_tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path)
299
+ self.visual_tokenizer = AutoModel.from_config(self.config.visual_tokenizer_config,
300
+ image_processor_name_or_path=self.config.name_or_path)
301
+ self.vte = VisualEmbedding(
302
+ self.config.visual_tokenizer_config.vocab_size,
303
+ self.config.hidden_size,
304
+ device=self.visual_tokenizer.device,
305
+ dtype=self.visual_tokenizer.dtype
306
+ )
307
+
308
+ def _merge_modules(modules_list: tuple):
309
+ merged_modules = []
310
+ for modules in modules_list:
311
+ merged_modules.extend(modules if modules else [])
312
+ return merged_modules
313
+
314
+ self._no_split_modules = _merge_modules((self.llm._no_split_modules, self.visual_tokenizer._no_split_modules))
315
+ self._skip_keys_device_placement = self.llm._skip_keys_device_placement
316
+ self._keep_in_fp32_modules = _merge_modules(
317
+ (self.llm._keep_in_fp32_modules, self.visual_tokenizer._keep_in_fp32_modules))
318
+ self.is_parallelizable = all((self.llm.is_parallelizable, self.visual_tokenizer.is_parallelizable))
319
+ self.supports_gradient_checkpointing = True
320
+ self._supports_flash_attn_2 = True
321
+
322
+ def get_text_tokenizer(self):
323
+ return self.text_tokenizer
324
+
325
+ def get_visual_tokenizer(self):
326
+ return self.visual_tokenizer
327
+
328
+ def tie_weights(self):
329
+ if not self.config.disable_tie_weight:
330
+ self.get_llm().tie_weights()
331
+
332
+ def get_llm(self):
333
+ return self.llm
334
+
335
+ def get_vte(self):
336
+ return self.vte
337
+
338
+ def get_wte(self):
339
+ return self.llm.get_input_embeddings()
340
+
341
+ def get_conversation_formatter(self) -> ConversationFormatter:
342
+ if getattr(self, 'conversation_formatter', None) is None:
343
+ self.conversation_formatter = getattr(import_module(".configuration_ovis", __package__),
344
+ self.config.conversation_formatter_class)(self.text_tokenizer)
345
+ return self.conversation_formatter
346
+
347
+ def forward(
348
+ self,
349
+ input_ids: torch.Tensor,
350
+ attention_mask: torch.Tensor,
351
+ labels: Optional[torch.Tensor],
352
+ pixel_values: List[Optional[torch.Tensor]],
353
+ **kwargs
354
+ ):
355
+ # assert self.training, "`forward` can only be used in training. For inference, use `generate`."
356
+ _, inputs_embeds, labels, attention_mask = self.merge_multimodal(
357
+ text_input_ids=input_ids,
358
+ text_attention_masks=attention_mask,
359
+ text_labels=labels,
360
+ pixel_values=pixel_values
361
+ )
362
+ return self.llm(inputs_embeds=inputs_embeds, labels=labels, attention_mask=attention_mask, **kwargs)
363
+
364
+ def merge_multimodal(
365
+ self,
366
+ text_input_ids: torch.Tensor,
367
+ text_attention_masks: torch.Tensor,
368
+ text_labels: Optional[torch.Tensor],
369
+ pixel_values: List[Optional[torch.Tensor]],
370
+ left_padding: bool = False
371
+ ):
372
+ input_device = text_input_ids.device
373
+ visual_vocab_size = self.get_visual_tokenizer().config.vocab_size
374
+ visual_indicator_embeds = self.get_vte()(
375
+ torch.tensor(
376
+ list(range(visual_vocab_size - 5, visual_vocab_size)),
377
+ dtype=torch.long,
378
+ device=self.get_visual_tokenizer().device
379
+ )
380
+ ).to(device=input_device)
381
+
382
+ if self.training:
383
+ # During training, to be compatible with DeepSpeed ZeRO, every sample has to include a pixel_values tensor.
384
+ # For a text-only sample, one can simply pass an all-zero tensor as pixel_values; it will be ignored
385
+ # (see below in this function), so the gradient is not affected.
386
+ num_images = [x.shape[0] for x in pixel_values]
387
+ visual_tokens = self.visual_tokenizer(torch.cat([x for x in pixel_values], dim=0))
388
+ visual_embeds = torch.split(self.get_vte()(visual_tokens).to(dtype=self.dtype, device=input_device),
389
+ split_size_or_sections=num_images, dim=0)
390
+ visual_input_ids = torch.split(torch.argmax(visual_tokens, dim=-1).to(device=input_device),
391
+ split_size_or_sections=num_images, dim=0)
392
+ visual_labels = [torch.full(x.shape, IGNORE_ID, dtype=torch.long, device=input_device) for x in
393
+ visual_input_ids]
394
+ else:
395
+ # At inference, a sample may be text-only, indicated by a `None` pixel_values entry
396
+ num_images = [x.shape[0] if x is not None else 0 for x in pixel_values]
397
+ if sum(num_images) > 0:
398
+ visual_tokens = self.visual_tokenizer(torch.cat([x for x in pixel_values if x is not None], dim=0))
399
+ visual_embeds = torch.split(self.get_vte()(visual_tokens).to(dtype=self.dtype, device=input_device),
400
+ split_size_or_sections=num_images, dim=0)
401
+ visual_input_ids = torch.split(torch.argmax(visual_tokens, dim=-1).to(device=input_device),
402
+ split_size_or_sections=num_images, dim=0)
403
+ visual_labels = [torch.full(x.shape, IGNORE_ID, dtype=torch.long, device=input_device) for x in
404
+ visual_input_ids]
405
+ else:
406
+ # just placeholders
407
+ visual_embeds = [None] * len(num_images)
408
+ visual_input_ids = [None] * len(num_images)
409
+ visual_labels = [None] * len(num_images)
410
+ # just placeholders
411
+ if text_labels is None:
412
+ text_labels = torch.full(text_input_ids.shape, IGNORE_ID, dtype=torch.long, device=input_device)
413
+
414
+ input_embeds = []
415
+ attention_masks = []
416
+ labels = []
417
+ for text_input_id, text_label, text_attention_mask, visual_embed, visual_input_id, visual_label in zip(
418
+ text_input_ids, text_labels, text_attention_masks, visual_embeds, visual_input_ids, visual_labels
419
+ ):
420
+ placeholder_token_mask = torch.lt(text_input_id, 0)
421
+ text_embed = self.get_wte()(torch.masked_fill(text_input_id, placeholder_token_mask, 0))
422
+ for i, indicator_id in enumerate(IMAGE_INDICATOR_IDS):
423
+ text_embed[text_input_id == indicator_id] = visual_indicator_embeds[i]
424
+ image_atom_positions = torch.where(torch.eq(text_input_id, IMAGE_ATOM_ID))[0].tolist()
425
+ if len(image_atom_positions) > 0:
426
+ input_embed_parts = []
427
+ attention_mask_parts = []
428
+ label_parts = []
429
+ prev_image_atom_position = -1
430
+ for index, image_atom_position in enumerate(image_atom_positions):
431
+ input_embed_parts.append(
432
+ text_embed[prev_image_atom_position + 1:image_atom_position, :])
433
+ label_parts.append(
434
+ text_label[prev_image_atom_position + 1:image_atom_position])
435
+ attention_mask_parts.append(
436
+ text_attention_mask[prev_image_atom_position + 1:image_atom_position])
437
+ input_embed_parts.append(visual_embed[index])
438
+ attention_mask_parts.append(
439
+ torch.ones_like(visual_label[index], dtype=torch.bool))
440
+ label_parts.append(visual_label[index])
441
+ prev_image_atom_position = image_atom_position
442
+ if prev_image_atom_position + 1 < text_input_id.shape[0]:
443
+ input_embed_parts.append(
444
+ text_embed[prev_image_atom_position + 1:, :])
445
+ attention_mask_parts.append(
446
+ text_attention_mask[prev_image_atom_position + 1:])
447
+ label_parts.append(
448
+ text_label[prev_image_atom_position + 1:])
449
+ input_embed = torch.cat(input_embed_parts, dim=0)
450
+ attention_mask = torch.cat(attention_mask_parts, dim=0)
451
+ label = torch.cat(label_parts, dim=0)
452
+ else:
453
+ input_embed = text_embed
454
+ attention_mask = text_attention_mask
455
+ label = text_label
456
+ if self.training:
457
+ # Make visual_embed & visual_indicator_embeds involved in the backward graph,
458
+ # to be compatible with deepspeed zero and ddp.
459
+ input_embed += torch.sum(visual_embed * 0.0) + torch.sum(visual_indicator_embeds * 0.0)
460
+ input_embeds.append(input_embed)
461
+ attention_masks.append(attention_mask)
462
+ labels.append(label)
463
+
464
+ if self.training: # padding to self.config.multimodal_max_length for increased training speed
465
+ padding_size = max(0, self.config.multimodal_max_length - len(input_embeds[0]))
466
+ input_embeds[0] = torch.nn.ConstantPad2d((0, 0, 0, padding_size), 0.0)(input_embeds[0])
467
+ attention_masks[0] = torch.nn.ConstantPad1d((0, padding_size), False)(attention_masks[0])
468
+ labels[0] = torch.nn.ConstantPad1d((0, padding_size), IGNORE_ID)(labels[0])
469
+ batch_input_embeds = self.pad_truncate_sequence(input_embeds, batch_first=True, padding_value=0.0, left_padding=left_padding)
470
+ batch_attention_mask = self.pad_truncate_sequence(attention_masks, batch_first=True, padding_value=False, left_padding=left_padding)
471
+ batch_labels = self.pad_truncate_sequence(labels, batch_first=True, padding_value=IGNORE_ID, left_padding=left_padding)
472
+
473
+ return visual_input_ids, batch_input_embeds, batch_labels, batch_attention_mask
474
+
475
+ def pad_truncate_sequence(self, sequences: List[torch.Tensor], batch_first: bool = True, padding_value: float = 0.0, left_padding: bool = False) -> torch.Tensor:
476
+ if not left_padding:
477
+ pad_sequence = torch.nn.utils.rnn.pad_sequence(sequences, batch_first=batch_first, padding_value=padding_value)
478
+ return pad_sequence[:,:self.config.multimodal_max_length]
479
+ else:
480
+ pad_sequence = torch.nn.utils.rnn.pad_sequence([i.flip(dims=[0]) for i in sequences],batch_first=True, padding_value=padding_value).flip(dims=[1])
481
+ return pad_sequence[:,-self.config.multimodal_max_length:]
482
+
483
+ def preprocess_inputs(
484
+ self,
485
+ text_or_conversations: Union[List[Dict], str],
486
+ images: Optional[List[PIL.Image.Image]],
487
+ max_partition=9,
488
+ generation_preface='',
489
+ return_labels=False,
490
+ propagate_exception=True,
491
+ frame_selector=None,
492
+ frame_selector_kwargs=None
493
+ ):
494
+ # convert text to conversations
495
+ if isinstance(text_or_conversations, str):
496
+ conversations = [{
497
+ "from": "human",
498
+ "value": text_or_conversations
499
+ }]
500
+ elif isinstance(text_or_conversations, list):
501
+ conversations = text_or_conversations
502
+ else:
503
+ raise ValueError(f'Invalid type of `text_or_conversations`, expected `List[Dict]` or `str`,'
504
+ f' but got {type(text_or_conversations)}')
505
+
506
+ if frame_selector is not None:
507
+ frame_selector_kwargs = frame_selector_kwargs or {}
508
+ conversations, images = frame_selector(conversations=conversations, frames=images, **frame_selector_kwargs)
509
+
510
+ # format conversations
511
+ prompt, raw_input_ids, raw_labels = self.get_conversation_formatter().format(
512
+ conversations, generation_preface=generation_preface)
513
+
514
+ # place image placeholders
515
+ input_ids = []
516
+ labels = []
517
+ pixel_values = []
518
+ invalidate_label = False
519
+ image_token_indices = [i for i, v in enumerate(raw_input_ids) if v == IMAGE_TOKEN_ID]
520
+ last_image_token_index = -1
521
+ for i in range(len(image_token_indices)):
522
+ head = 0 if i == 0 else image_token_indices[i - 1] + 1
523
+ tail = image_token_indices[i]
524
+ last_image_token_index = tail
525
+ input_ids.extend(raw_input_ids[head:tail])
526
+ labels.extend(raw_labels[head:tail])
527
+ try:
528
+ image = images[i]
529
+ raw_pixel_values, image_placeholders = self.visual_tokenizer.preprocess_image(
530
+ image, max_partition=max_partition)
531
+ except Exception as e:
532
+ if propagate_exception:
533
+ raise e
534
+ logging.exception(e)
535
+ invalidate_label = True
536
+ raw_pixel_values, image_placeholders = self.visual_tokenizer.mock_input()
537
+ input_ids.extend(image_placeholders)
538
+ labels.extend([IGNORE_ID] * len(image_placeholders))
539
+ pixel_values.append(raw_pixel_values)
540
+ input_ids.extend(raw_input_ids[last_image_token_index + 1:])
541
+ labels.extend(raw_labels[last_image_token_index + 1:])
542
+
543
+ # return tensors
544
+ input_ids = torch.tensor(input_ids, dtype=torch.long)
545
+ labels = torch.tensor([IGNORE_ID] * len(labels) if invalidate_label else labels, dtype=torch.long)
546
+ pixel_values = torch.cat(pixel_values, dim=0) if len(pixel_values) > 0 else None
547
+
548
+ if return_labels:
549
+ return prompt, input_ids, pixel_values, labels
550
+ else:
551
+ return prompt, input_ids, pixel_values
552
+
553
+ def save_pretrained(
554
+ self,
555
+ save_directory: Union[str, os.PathLike],
556
+ is_main_process: bool = True,
557
+ state_dict: Optional[dict] = None,
558
+ save_function: Callable = torch.save,
559
+ push_to_hub: bool = False,
560
+ max_shard_size: Union[int, str] = "5GB",
561
+ safe_serialization: bool = True,
562
+ variant: Optional[str] = None,
563
+ token: Optional[Union[str, bool]] = None,
564
+ save_peft_format: bool = True,
565
+ **kwargs
566
+ ):
567
+ super().save_pretrained(save_directory,
568
+ is_main_process=is_main_process,
569
+ state_dict=state_dict,
570
+ save_function=save_function,
571
+ safe_serialization=safe_serialization)
572
+ self.get_text_tokenizer().save_pretrained(save_directory)
573
+ self.get_visual_tokenizer().get_image_processor().save_pretrained(save_directory)
574
+
575
+ def generate(
576
+ self,
577
+ inputs: Optional[torch.Tensor] = None,
578
+ **kwargs
579
+ ) -> Union[GenerateOutput, torch.LongTensor]:
580
+ _, inputs_embeds, labels, attention_mask = self.merge_multimodal(
581
+ text_input_ids=inputs,
582
+ text_attention_masks=kwargs.pop('attention_mask'),
583
+ text_labels=None,
584
+ pixel_values=kwargs.pop('pixel_values'),
585
+ left_padding=True
586
+ )
587
+ inputs_embeds = inputs_embeds.detach()
588
+ torch.cuda.empty_cache()
589
+
590
+ return self.llm.generate(inputs=None, inputs_embeds=inputs_embeds, attention_mask=attention_mask, **kwargs)
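Taken together, preprocess_inputs, merge_multimodal and generate above define the inference path. The sketch below is illustrative only: the checkpoint path, dtype, device placement and generation settings are assumptions (and this GPTQ-quantized checkpoint may additionally require a GPTQ runtime such as gptqmodel), while the method names and argument shapes follow modeling_ovis.py as added in this commit:

import torch
import PIL.Image
from transformers import AutoModelForCausalLM

# Placeholder path: point this at the repository containing this commit.
model = AutoModelForCausalLM.from_pretrained(
    "<path-to-this-repo>",
    torch_dtype=torch.float16,   # assumption
    trust_remote_code=True,
).eval()
text_tokenizer = model.get_text_tokenizer()
visual_tokenizer = model.get_visual_tokenizer()

image = PIL.Image.open("example.jpg")          # any RGB image
query = "<image>\nDescribe this image."

# preprocess_inputs tiles the image into at most max_partition crops and splices
# image placeholder ids into the token sequence, as implemented above.
prompt, input_ids, pixel_values = model.preprocess_inputs(query, [image], max_partition=9)
attention_mask = torch.ne(input_ids, text_tokenizer.pad_token_id)

input_ids = input_ids.unsqueeze(0).to(model.device)
attention_mask = attention_mask.unsqueeze(0).to(model.device)
pixel_values = [pixel_values.to(dtype=visual_tokenizer.dtype, device=visual_tokenizer.device)]

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        pixel_values=pixel_values,       # List[Optional[Tensor]], one entry per sample
        attention_mask=attention_mask,
        max_new_tokens=256,
        do_sample=False,
    )[0]
print(text_tokenizer.decode(output_ids, skip_special_tokens=True))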
preprocessor_config.json ADDED
@@ -0,0 +1,27 @@
1
+ {
2
+ "crop_size": {
3
+ "height": 448,
4
+ "width": 448
5
+ },
6
+ "do_center_crop": false,
7
+ "do_convert_rgb": true,
8
+ "do_normalize": true,
9
+ "do_rescale": true,
10
+ "do_resize": true,
11
+ "image_mean": [
12
+ 0.48145466,
13
+ 0.4578275,
14
+ 0.40821073
15
+ ],
16
+ "image_processor_type": "CLIPImageProcessor",
17
+ "image_std": [
18
+ 0.26862954,
19
+ 0.26130258,
20
+ 0.27577711
21
+ ],
22
+ "resample": 3,
23
+ "rescale_factor": 0.00392156862745098,
24
+ "size": {
25
+ "shortest_edge": 448
26
+ }
27
+ }
quantize_config.json ADDED
@@ -0,0 +1,21 @@
1
+ {
2
+ "bits": 8,
3
+ "group_size": 128,
4
+ "desc_act": true,
5
+ "sym": true,
6
+ "lm_head": false,
7
+ "quant_method": "gptq",
8
+ "checkpoint_format": "gptq",
9
+ "pack_dtype": "int32",
10
+ "meta": {
11
+ "quantizer": [
12
+ "gptqmodel:2.0.0-dev"
13
+ ],
14
+ "uri": "https://github.com/modelcloud/gptqmodel",
15
+ "damp_percent": 0.01,
16
+ "damp_auto_increment": 0.0025,
17
+ "static_groups": false,
18
+ "true_sequential": true,
19
+ "mse": 0.0
20
+ }
21
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,25 @@
1
+ {
2
+ "additional_special_tokens": [
3
+ "<|im_start|>",
4
+ "<|im_end|>",
5
+ "<|object_ref_start|>",
6
+ "<|object_ref_end|>",
7
+ "<|box_start|>",
8
+ "<|box_end|>",
9
+ "<|quad_start|>",
10
+ "<|quad_end|>",
11
+ "<|vision_start|>",
12
+ "<|vision_end|>",
13
+ "<|vision_pad|>",
14
+ "<|image_pad|>",
15
+ "<|video_pad|>"
16
+ ],
17
+ "eos_token": {
18
+ "content": "<|im_end|>",
19
+ "lstrip": false,
20
+ "normalized": false,
21
+ "rstrip": false,
22
+ "single_word": false
23
+ },
24
+ "pad_token": "<unk>"
25
+ }
tokenizer.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9c5ae00e602b8860cbd784ba82a8aa14e8feecec692e7076590d014d7b7fdafa
3
+ size 11421896
tokenizer_config.json ADDED
@@ -0,0 +1,209 @@
1
+ {
2
+ "add_bos_token": false,
3
+ "add_prefix_space": false,
4
+ "added_tokens_decoder": {
5
+ "151643": {
6
+ "content": "<|endoftext|>",
7
+ "lstrip": false,
8
+ "normalized": false,
9
+ "rstrip": false,
10
+ "single_word": false,
11
+ "special": true
12
+ },
13
+ "151644": {
14
+ "content": "<|im_start|>",
15
+ "lstrip": false,
16
+ "normalized": false,
17
+ "rstrip": false,
18
+ "single_word": false,
19
+ "special": true
20
+ },
21
+ "151645": {
22
+ "content": "<|im_end|>",
23
+ "lstrip": false,
24
+ "normalized": false,
25
+ "rstrip": false,
26
+ "single_word": false,
27
+ "special": true
28
+ },
29
+ "151646": {
30
+ "content": "<|object_ref_start|>",
31
+ "lstrip": false,
32
+ "normalized": false,
33
+ "rstrip": false,
34
+ "single_word": false,
35
+ "special": true
36
+ },
37
+ "151647": {
38
+ "content": "<|object_ref_end|>",
39
+ "lstrip": false,
40
+ "normalized": false,
41
+ "rstrip": false,
42
+ "single_word": false,
43
+ "special": true
44
+ },
45
+ "151648": {
46
+ "content": "<|box_start|>",
47
+ "lstrip": false,
48
+ "normalized": false,
49
+ "rstrip": false,
50
+ "single_word": false,
51
+ "special": true
52
+ },
53
+ "151649": {
54
+ "content": "<|box_end|>",
55
+ "lstrip": false,
56
+ "normalized": false,
57
+ "rstrip": false,
58
+ "single_word": false,
59
+ "special": true
60
+ },
61
+ "151650": {
62
+ "content": "<|quad_start|>",
63
+ "lstrip": false,
64
+ "normalized": false,
65
+ "rstrip": false,
66
+ "single_word": false,
67
+ "special": true
68
+ },
69
+ "151651": {
70
+ "content": "<|quad_end|>",
71
+ "lstrip": false,
72
+ "normalized": false,
73
+ "rstrip": false,
74
+ "single_word": false,
75
+ "special": true
76
+ },
77
+ "151652": {
78
+ "content": "<|vision_start|>",
79
+ "lstrip": false,
80
+ "normalized": false,
81
+ "rstrip": false,
82
+ "single_word": false,
83
+ "special": true
84
+ },
85
+ "151653": {
86
+ "content": "<|vision_end|>",
87
+ "lstrip": false,
88
+ "normalized": false,
89
+ "rstrip": false,
90
+ "single_word": false,
91
+ "special": true
92
+ },
93
+ "151654": {
94
+ "content": "<|vision_pad|>",
95
+ "lstrip": false,
96
+ "normalized": false,
97
+ "rstrip": false,
98
+ "single_word": false,
99
+ "special": true
100
+ },
101
+ "151655": {
102
+ "content": "<|image_pad|>",
103
+ "lstrip": false,
104
+ "normalized": false,
105
+ "rstrip": false,
106
+ "single_word": false,
107
+ "special": true
108
+ },
109
+ "151656": {
110
+ "content": "<|video_pad|>",
111
+ "lstrip": false,
112
+ "normalized": false,
113
+ "rstrip": false,
114
+ "single_word": false,
115
+ "special": true
116
+ },
117
+ "151657": {
118
+ "content": "<tool_call>",
119
+ "lstrip": false,
120
+ "normalized": false,
121
+ "rstrip": false,
122
+ "single_word": false,
123
+ "special": false
124
+ },
125
+ "151658": {
126
+ "content": "</tool_call>",
127
+ "lstrip": false,
128
+ "normalized": false,
129
+ "rstrip": false,
130
+ "single_word": false,
131
+ "special": false
132
+ },
133
+ "151659": {
134
+ "content": "<|fim_prefix|>",
135
+ "lstrip": false,
136
+ "normalized": false,
137
+ "rstrip": false,
138
+ "single_word": false,
139
+ "special": false
140
+ },
141
+ "151660": {
142
+ "content": "<|fim_middle|>",
143
+ "lstrip": false,
144
+ "normalized": false,
145
+ "rstrip": false,
146
+ "single_word": false,
147
+ "special": false
148
+ },
149
+ "151661": {
150
+ "content": "<|fim_suffix|>",
151
+ "lstrip": false,
152
+ "normalized": false,
153
+ "rstrip": false,
154
+ "single_word": false,
155
+ "special": false
156
+ },
157
+ "151662": {
158
+ "content": "<|fim_pad|>",
159
+ "lstrip": false,
160
+ "normalized": false,
161
+ "rstrip": false,
162
+ "single_word": false,
163
+ "special": false
164
+ },
165
+ "151663": {
166
+ "content": "<|repo_name|>",
167
+ "lstrip": false,
168
+ "normalized": false,
169
+ "rstrip": false,
170
+ "single_word": false,
171
+ "special": false
172
+ },
173
+ "151664": {
174
+ "content": "<|file_sep|>",
175
+ "lstrip": false,
176
+ "normalized": false,
177
+ "rstrip": false,
178
+ "single_word": false,
179
+ "special": false
180
+ }
181
+ },
182
+ "additional_special_tokens": [
183
+ "<|im_start|>",
184
+ "<|im_end|>",
185
+ "<|object_ref_start|>",
186
+ "<|object_ref_end|>",
187
+ "<|box_start|>",
188
+ "<|box_end|>",
189
+ "<|quad_start|>",
190
+ "<|quad_end|>",
191
+ "<|vision_start|>",
192
+ "<|vision_end|>",
193
+ "<|vision_pad|>",
194
+ "<|image_pad|>",
195
+ "<|video_pad|>"
196
+ ],
197
+ "bos_token": null,
198
+ "chat_template": "{%- if tools %}\n {{- '<|im_start|>system\\n' }}\n {%- if messages[0]['role'] == 'system' %}\n {{- messages[0]['content'] }}\n {%- else %}\n {{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}\n {%- endif %}\n {{- \"\\n\\n# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within <tools></tools> XML tags:\\n<tools>\" }}\n {%- for tool in tools %}\n {{- \"\\n\" }}\n {{- tool | tojson }}\n {%- endfor %}\n {{- \"\\n</tools>\\n\\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\\n<tool_call>\\n{\\\"name\\\": <function-name>, \\\"arguments\\\": <args-json-object>}\\n</tool_call><|im_end|>\\n\" }}\n{%- else %}\n {%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n {%- else %}\n {{- '<|im_start|>system\\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\\n' }}\n {%- endif %}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n<tool_call>\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n</tool_call>' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n<tool_response>\\n' }}\n {{- message.content }}\n {{- '\\n</tool_response>' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n",
199
+ "clean_up_tokenization_spaces": false,
200
+ "eos_token": "<|im_end|>",
201
+ "errors": "replace",
202
+ "extra_special_tokens": {},
203
+ "model_max_length": 131072,
204
+ "pad_token": "<unk>",
205
+ "split_special_tokens": false,
206
+ "tokenizer_class": "Qwen2TokenizerFast",
207
+ "unk_token": null,
208
+ "_commit_hash": null
209
+ }
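tokenizer_config.json above declares a Qwen2TokenizerFast with a ChatML-style chat_template. A small sketch of rendering a prompt from it with the standard transformers apply_chat_template API; the tokenizer path is a placeholder:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("<path-to-this-repo>")
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
# Expected shape of the rendered prompt, per the template above:
# <|im_start|>system
# You are a helpful assistant.<|im_end|>
# <|im_start|>user
# Hello!<|im_end|>
# <|im_start|>assistant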
vocab.json ADDED
The diff for this file is too large to render. See raw diff