zhiqings committed · Commit 4faf534 · 1 Parent(s): 17343f4

Create README.md

Files changed (1)
1. README.md +57 -0
README.md ADDED
@@ -0,0 +1,57 @@
---
license: apache-2.0
inference: false
---
# LLaVA-RLHF Model Card

## Model details

**Model type:**
LLaVA-RLHF is an aligned, end-to-end trained large multimodal model that combines a CLIP vision encoder with Vicuna for general-purpose visual and language understanding, achieving strong visual reasoning and perception capabilities in the spirit of the multimodal GPT-4.
Thanks to Factually Augmented RLHF, LLaVA-RLHF is more helpful and hallucinates less than LLaVA and other open-source LMMs.

**Usage:**
**NOTE: The RLHFed model is trained with LoRA and the bfloat16 data type.**
Users must apply the PEFT LoRA adapter on top of the LLaVA-SFT+ base model, as shown below.

```python
import torch
from peft import PeftModel

# LlavaLlamaForCausalLM is provided by the LLaVA codebase (the `llava` package),
# which needs to be installed alongside torch and peft.
from llava.model import LlavaLlamaForCausalLM

# Load the LLaVA-SFT+ base model in bfloat16, then apply the RLHF LoRA adapter on top.
dtype = torch.bfloat16
model_path = "LLaVA-RLHF-7b-v1.5-224/sft_model"
lora_path = "LLaVA-RLHF-7b-v1.5-224/rlhf_lora_adapter_model"
model = LlavaLlamaForCausalLM.from_pretrained(
    model_path,
    device_map={"": "cuda:0"},
    torch_dtype=dtype,
)
model = PeftModel.from_pretrained(
    model,
    lora_path,
)
```
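
Optionally, the LoRA weights can be folded into the base model for inference. The following is a minimal sketch assuming the generic PEFT LoRA API (`merge_and_unload`), not anything specific to LLaVA-RLHF:

```python
# Optional: merge the LoRA adapter into the base weights and switch to eval mode.
# merge_and_unload() is the standard PEFT method for LoRA adapters; skip this step
# if you prefer to keep the adapter separate from the base model.
model = model.merge_and_unload()
model.eval()
```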

**Model date:**
LLaVA-RLHF was trained in Sept 2023.

**Paper or resources for more information:**
https://llava-rlhf.github.io/

**License:**
Apache License 2.0

**Where to send questions or comments about the model:**
https://github.com/Edward-Sun/LLaVA-RLHF/issues

## Intended use
**Primary intended uses:**
The primary use of LLaVA-RLHF is research on large multimodal chatbots.

**Primary intended users:**
The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.

## Training dataset
- 595K filtered image-text pairs from CC3M.
- 150K GPT-generated multimodal instruction-following chat data.
- 83K VQA v2 instruction-following VQA data.
- 16K A-OKVQA instruction-following CoT-VQA data.
- 23K FLICKR instruction-following spotting captioning data.
- 10K LLaVA-based human preference data.