Improve model card: update pipeline tag, add library name and license, and improve content

#2
by nielsr HF Staff - opened
Files changed (1) hide show
  1. README.md +40 -109
README.md CHANGED
@@ -1,11 +1,31 @@
1
  ---
2
- pipeline_tag: text-generation
 
 
3
  ---
4
- # 8b version of our ChemVLM
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
5
  ## Citation
6
- arxiv.org/abs/2408.07246
7
 
8
- ```
9
  @inproceedings{li2025chemvlm,
10
  title={Chemvlm: Exploring the power of multimodal large language models in chemistry area},
11
  author={Li, Junxian and Zhang, Di and Wang, Xunzhi and Hao, Zeying and Lei, Jingdi and Tan, Qian and Zhou, Cai and Liu, Wei and Yang, Yaotian and Xiong, Xinrui and others},
@@ -17,9 +37,11 @@ arxiv.org/abs/2408.07246
17
  }
18
  ```
19
 
20
- Codebase and datasets can be found at https://github.com/AI4Chem/ChemVlm.
21
 
22
- ### Performances of our 8b model on several tasks
 
 
23
 
24
  | Datasets | MMChemOCR | CMMU | MMCR-bench | Reaction type |
25
  | :----- | :----- | :----- |:----- |:----- |
@@ -27,123 +49,32 @@ Codebase and datasets can be found at https://github.com/AI4Chem/ChemVlm.
27
  |scores of ChemVLM-8b| 81.75/57.69 | 52.7(SOTA) | 33.6 | 16.79 |
28
 
29
 
30
- Quick start as below(```transformers>=4.37.0 is needed```)
31
- Update: You may also need
32
- ```
33
- pip install sentencepiece
34
- pip install einops
35
- pip install timm
36
- pip install accelerate>=0.26.0
37
- ```
38
 
39
- Code:
40
- ```Python
41
- from transformers import AutoTokenizer, AutoModelforCasualLM
42
  import torch
43
  import torchvision.transforms as T
44
- import transformers
45
  from torchvision.transforms.functional import InterpolationMode
 
46
 
47
-
48
- IMAGENET_MEAN = (0.485, 0.456, 0.406)
49
- IMAGENET_STD = (0.229, 0.224, 0.225)
50
-
51
- IMAGENET_MEAN = (0.485, 0.456, 0.406)
52
- IMAGENET_STD = (0.229, 0.224, 0.225)
53
-
54
-
55
- def build_transform(input_size):
56
- MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
57
- transform = T.Compose([
58
- T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
59
- T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
60
- T.ToTensor(),
61
- T.Normalize(mean=MEAN, std=STD)
62
- ])
63
- return transform
64
-
65
-
66
- def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
67
- best_ratio_diff = float('inf')
68
- best_ratio = (1, 1)
69
- area = width * height
70
- for ratio in target_ratios:
71
- target_aspect_ratio = ratio[0] / ratio[1]
72
- ratio_diff = abs(aspect_ratio - target_aspect_ratio)
73
- if ratio_diff < best_ratio_diff:
74
- best_ratio_diff = ratio_diff
75
- best_ratio = ratio
76
- elif ratio_diff == best_ratio_diff:
77
- if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
78
- best_ratio = ratio
79
- return best_ratio
80
-
81
-
82
- def dynamic_preprocess(image, min_num=1, max_num=6, image_size=448, use_thumbnail=False):
83
- orig_width, orig_height = image.size
84
- aspect_ratio = orig_width / orig_height
85
-
86
- # calculate the existing image aspect ratio
87
- target_ratios = set(
88
- (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
89
- i * j <= max_num and i * j >= min_num)
90
- target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])
91
-
92
- # find the closest aspect ratio to the target
93
- target_aspect_ratio = find_closest_aspect_ratio(
94
- aspect_ratio, target_ratios, orig_width, orig_height, image_size)
95
-
96
- # calculate the target width and height
97
- target_width = image_size * target_aspect_ratio[0]
98
- target_height = image_size * target_aspect_ratio[1]
99
- blocks = target_aspect_ratio[0] * target_aspect_ratio[1]
100
-
101
- # resize the image
102
- resized_img = image.resize((target_width, target_height))
103
- processed_images = []
104
- for i in range(blocks):
105
- box = (
106
- (i % (target_width // image_size)) * image_size,
107
- (i // (target_width // image_size)) * image_size,
108
- ((i % (target_width // image_size)) + 1) * image_size,
109
- ((i // (target_width // image_size)) + 1) * image_size
110
- )
111
- # split the image
112
- split_img = resized_img.crop(box)
113
- processed_images.append(split_img)
114
- assert len(processed_images) == blocks
115
- if use_thumbnail and len(processed_images) != 1:
116
- thumbnail_img = image.resize((image_size, image_size))
117
- processed_images.append(thumbnail_img)
118
- return processed_images
119
-
120
-
121
- def load_image(image_file, input_size=448, max_num=6):
122
- image = Image.open(image_file).convert('RGB')
123
- transform = build_transform(input_size=input_size)
124
- images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
125
- pixel_values = [transform(image) for image in images]
126
- pixel_values = torch.stack(pixel_values)
127
- return pixel_values
128
 
129
  tokenizer = AutoTokenizer.from_pretrained('AI4Chem/ChemVLM-8B', trust_remote_code=True)
130
-
131
- query = "Please describe the molecule in the image."
132
- image_path = "your image path"
133
- pixel_values = load_image(image_path, max_num=6).to(torch.bfloat16).cuda()
134
-
135
-
136
  model = AutoModelForCausalLM.from_pretrained(
137
  "AI4Chem/ChemVLM-8B",
138
  torch_dtype=torch.bfloat16,
139
  low_cpu_mem_usage=True,
140
  trust_remote_code=True
141
- ).to(device).eval().cuda()
 
 
 
 
142
 
143
  gen_kwargs = {"max_length": 1000, "do_sample": True, "temperature": 0.7, "top_p": 0.9}
144
 
145
  response = model.chat(tokenizer, pixel_values, query, gen_kwargs)
146
-
147
-
148
  ```
149
-
 
1
  ---
2
+ pipeline_tag: image-text-to-text
3
+ library_name: transformers
4
+ license: apache-2.0
5
  ---
6
+
7
+ # ChemVLM-8B: A Multimodal Large Language Model for Chemistry
8
+
9
+ This is the 8b version of ChemVLM, a multimodal large language model designed for chemical applications.
10
+
11
+ ## Paper
12
+
13
+ [ChemVLM: Exploring the Power of Multimodal Large Language Models in Chemistry Area](https://huggingface.co/papers/2408.07246)
14
+
15
+ ### Abstract
16
+
17
+ Large Language Models (LLMs) have achieved remarkable success and have been applied across various scientific fields, including chemistry. However, many chemical tasks require the processing of visual information, which cannot be successfully handled by existing chemical LLMs. This brings a growing need for models capable of integrating multimodal information in the chemical domain. In this paper, we introduce **ChemVLM**, an open-source chemical multimodal large language model specifically designed for chemical applications. ChemVLM is trained on a carefully curated bilingual multimodal dataset that enhances its ability to understand both textual and visual chemical information, including molecular structures, reactions, and chemistry examination questions. We develop three datasets for comprehensive evaluation, tailored to Chemical Optical Character Recognition (OCR), Multimodal Chemical Reasoning (MMCR), and Multimodal Molecule Understanding tasks. We benchmark ChemVLM against a range of open-source and proprietary multimodal large language models on various tasks. Experimental results demonstrate that ChemVLM achieves competitive performance across all evaluated tasks. Our model can be found at https://huggingface.co/AI4Chem/ChemVLM-26B.
18
+
19
+ ## Model Description
20
+
21
+ The architecture of ChemVLM is based on InternVLM and incorporates both vision and language processing components. The model is trained on a bilingual multimodal dataset containing chemical information, including molecular structures, reactions, and chemistry exam questions. More details about the architecture can be found in the [Github README](https://github.com/AI4Chem/ChemVlm).
22
+
23
+ ![ChemVLM](https://github.com/AI4Chem/ChemVlm/raw/main/imgs/ChemVLM.jpg)
24
+
25
+
26
  ## Citation
 
27
 
28
+ ```bibtex
29
  @inproceedings{li2025chemvlm,
30
  title={Chemvlm: Exploring the power of multimodal large language models in chemistry area},
31
  author={Li, Junxian and Zhang, Di and Wang, Xunzhi and Hao, Zeying and Lei, Jingdi and Tan, Qian and Zhou, Cai and Liu, Wei and Yang, Yaotian and Xiong, Xinrui and others},
 
37
  }
38
  ```
39
 
40
+ ## Codebase and Datasets
41
 
42
+ Codebase and datasets can be found at https://github.com/AI4Chem/ChemVlm.
43
+
44
+ ## Performances of our 8b model on several tasks
45
 
46
  | Datasets | MMChemOCR | CMMU | MMCR-bench | Reaction type |
47
  | :----- | :----- | :----- |:----- |:----- |
 
49
  |scores of ChemVLM-8b| 81.75/57.69 | 52.7(SOTA) | 33.6 | 16.79 |
50
 
51
 
52
+ ## Quick Start
 
 
 
 
 
 
 
53
 
54
+ ```python
55
+ from transformers import AutoTokenizer, AutoModelForCausalLM
 
56
  import torch
57
  import torchvision.transforms as T
 
58
  from torchvision.transforms.functional import InterpolationMode
59
+ from PIL import Image
60
 
61
+ # ... (helper functions from original model card)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
62
 
63
  tokenizer = AutoTokenizer.from_pretrained('AI4Chem/ChemVLM-8B', trust_remote_code=True)
 
 
 
 
 
 
64
  model = AutoModelForCausalLM.from_pretrained(
65
  "AI4Chem/ChemVLM-8B",
66
  torch_dtype=torch.bfloat16,
67
  low_cpu_mem_usage=True,
68
  trust_remote_code=True
69
+ ).cuda().eval()
70
+
71
+ query = "Please describe the molecule in the image."
72
+ image_path = "your image path"
73
+ pixel_values = load_image(image_path, max_num=6).to(torch.bfloat16).cuda()
74
 
75
  gen_kwargs = {"max_length": 1000, "do_sample": True, "temperature": 0.7, "top_p": 0.9}
76
 
77
  response = model.chat(tokenizer, pixel_values, query, gen_kwargs)
78
+ print(response)
 
79
  ```
80
+ Install required libraries with `pip install transformers>=4.37.0 sentencepiece einops timm accelerate>=0.26.0`. Make sure to have `torch` and `torchvision` installed as well.