---
license: mit
---

# M4-LongVA-7B-Qwen2

![images](./assets/framework.png)

Enhancing Interactive Capabilities in MLLMs

M4-7B is an extension of [LongVA-7B](https://github.com/EvolvingLMMs-Lab/LongVA), further trained using the [M4-IT](https://huggingface.co/datasets/ColorfulAI/M4-IT) dataset, which comprises 9,963 visual instruction tuning instances. This training was conducted without any special modifications to the existing training pipeline.
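
If you want to inspect the training data, a minimal sketch is to pull a local snapshot of the dataset repository with `huggingface_hub` (the exact file layout is documented on the M4-IT dataset card):

```python
from huggingface_hub import snapshot_download

# Download a local copy of the M4-IT instruction-tuning data (9,963 instances);
# see the dataset card for the file layout.
local_dir = snapshot_download(repo_id="ColorfulAI/M4-IT", repo_type="dataset")
print(local_dir)  # path to the downloaded files
```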

## Usage

*Please refer to [M4](https://github.com/patrick-tssn/M4) to install the relevant packages.* The example below captions a video and injects a new, interrupting query partway through decoding via `generate_parallel`.

```python
import os
from PIL import Image
import numpy as np
import torchaudio
import torch
from decord import VideoReader, cpu
import whisper

# fix the random seed for reproducibility
torch.manual_seed(0)

from intersuit.model.builder import load_pretrained_model
from intersuit.mm_utils import tokenizer_image_speech_tokens, process_images
from intersuit.constants import IMAGE_TOKEN_INDEX, SPEECH_TOKEN_INDEX

import ChatTTS
chat = ChatTTS.Chat()
chat.load(source='local', compile=True)

import warnings
warnings.filterwarnings("ignore")

model_path = "checkpoints/M4-LongVA-7B-Qwen2"
video_path = "local_demo/assets/water.mp4"
max_frames_num = 16  # you can increase this to several thousand, as long as your GPU memory can handle it :)
gen_kwargs = {"do_sample": True, "temperature": 0.5, "top_p": None, "num_beams": 1, "use_cache": True, "max_new_tokens": 1024}
tokenizer, model, image_processor, _ = load_pretrained_model(model_path, None, "llava_qwen", device_map="cuda:0", attn_implementation="eager")

# original query
query = "Give a detailed caption of the video as if I am blind."
prompt = f"<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<image>{query}\n<|im_end|>\n<|im_start|>assistant\n"
input_ids = tokenizer_image_speech_tokens(prompt, tokenizer, IMAGE_TOKEN_INDEX, SPEECH_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(model.device)
pad_token_ids = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id
attention_masks = input_ids.ne(pad_token_ids).to(input_ids.device)

# new (interrupting) query: a few examples, keep the one you want to try
# new_query = "How many people in the video?"
# new_query = "Okay, I see."
new_query = "Sorry to interrupt."
new_query_pos = 10  # decoding step at which the new query arrives
new_prompt = f"<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n{new_query}\n<|im_end|>\n<|im_start|>assistant\n"
new_input_ids = tokenizer_image_speech_tokens(new_prompt, tokenizer, IMAGE_TOKEN_INDEX, SPEECH_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(model.device)

# video input: uniformly sample max_frames_num frames
vr = VideoReader(video_path, ctx=cpu(0))
total_frame_num = len(vr)
uniform_sampled_frames = np.linspace(0, total_frame_num - 1, max_frames_num, dtype=int)
frame_idx = uniform_sampled_frames.tolist()
frames = vr.get_batch(frame_idx).asnumpy()
video_tensor = image_processor.preprocess(frames, return_tensors="pt")["pixel_values"].to(model.device, dtype=torch.bfloat16)

# answer the original query while the new query is injected after new_query_pos decoded tokens
with torch.inference_mode():
    output_ids = model.generate_parallel(input_ids,
                                         attention_mask=attention_masks,
                                         images=[video_tensor],
                                         modalities=["video"],
                                         new_query=new_input_ids,
                                         new_query_pos=new_query_pos,
                                         query_str=query,
                                         new_query_str=new_query,
                                         tokenizer=tokenizer,
                                         **gen_kwargs)
outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(outputs)
```
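
The snippet above loads ChatTTS and imports `whisper`/`torchaudio` without using them here; they matter when queries are spoken rather than typed. As a rough, hypothetical sketch (not the repository's exact pipeline; the output shape and the 24 kHz sample rate are assumptions that may vary across ChatTTS versions), the interrupting query could be synthesized to audio like this:

```python
import numpy as np
import torch
import torchaudio

# Hypothetical: synthesize the interrupting text query to speech with the
# ChatTTS instance (`chat`) loaded above, then save it as a WAV file.
wavs = chat.infer([new_query])                       # one waveform per input text
wav = torch.from_numpy(np.asarray(wavs[0])).float()
if wav.dim() == 1:                                   # torchaudio expects (channels, samples)
    wav = wav.unsqueeze(0)
torchaudio.save("new_query.wav", wav, 24000)         # ChatTTS audio is typically 24 kHz
```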

For more information about the interaction inference pipeline, please visit the [M4 GitHub repository](https://github.com/patrick-tssn/M4).