## LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models

This is the official PyTorch implementation of [LLMDet](). Please see our [GitHub](https://github.com/iSEE-Laboratory/LLMDet) for the code.

### 1 Introduction

<img src="./compare_result.png" style="zoom:30%;" />

Recent open-vocabulary detectors achieve promising performance with abundant region-level annotated data. In this work, we show that co-training an open-vocabulary detector with a large language model that generates detailed image-level captions can further improve performance. To achieve this goal, we first collect a dataset, GroundingCap-1M, in which each image is accompanied by its grounding labels and an image-level detailed caption. With this dataset, we fine-tune an open-vocabulary detector with training objectives that include a standard grounding loss and a caption generation loss. We take advantage of a large language model to generate both region-level short captions for each region of interest and image-level long captions for the whole image. Under the supervision of the large language model, the resulting detector, LLMDet, outperforms the baseline by a clear margin and enjoys superior open-vocabulary ability. Further, we show that the improved LLMDet can in turn build a stronger large multi-modal model, achieving mutual benefits.
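
To make the training objective above concrete, below is a minimal, hypothetical PyTorch sketch of how a grounding loss and an LLM-supervised caption generation loss could be combined into one multi-task objective. This is not the official LLMDet code: `ToyLLMDet`, `training_step`, and the stand-in losses are illustrative placeholders only.

```python
# Minimal sketch (NOT the official LLMDet implementation): combining a grounding
# loss with an LLM-supervised caption generation loss in one training step.
import torch
import torch.nn as nn


class ToyLLMDet(nn.Module):
    """Hypothetical stand-in for an open-vocabulary detector with a caption head."""

    def __init__(self, feat_dim=256, vocab_size=32000):
        super().__init__()
        self.backbone = nn.Linear(3 * 32 * 32, feat_dim)    # stand-in for Swin backbone + encoder
        self.box_head = nn.Linear(feat_dim, 4)               # stand-in for the detection decoder
        self.caption_head = nn.Linear(feat_dim, vocab_size)  # maps features into the LLM vocabulary

    def forward(self, images):
        feats = self.backbone(images.flatten(1))
        return self.box_head(feats), self.caption_head(feats)


def training_step(model, images, gt_boxes, caption_tokens, caption_weight=1.0):
    """One step of the multi-task objective: grounding loss + caption generation loss."""
    pred_boxes, caption_logits = model(images)
    # Grounding loss: a plain L1 box regression stands in here for the full
    # grounding objective (classification + box + alignment terms).
    grounding_loss = nn.functional.l1_loss(pred_boxes, gt_boxes)
    # Caption generation loss: cross-entropy against caption tokens produced by the
    # large language model (region-level short captions, image-level long captions).
    caption_loss = nn.functional.cross_entropy(caption_logits, caption_tokens)
    return grounding_loss + caption_weight * caption_loss


model = ToyLLMDet()
images = torch.randn(2, 3, 32, 32)
gt_boxes = torch.rand(2, 4)
caption_tokens = torch.randint(0, 32000, (2,))
loss = training_step(model, images, gt_boxes, caption_tokens)
loss.backward()
```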

### 2 Model Zoo

| Model | AP<sup>mini</sup> | AP<sub>r</sub> | AP<sub>c</sub> | AP<sub>f</sub> | AP<sup>val</sup> | AP<sub>r</sub> | AP<sub>c</sub> | AP<sub>f</sub> |
| ----------------------------- | ----------------- | -------------- | -------------- | -------------- | ---------------- | -------------- | -------------- | -------------- |
| LLMDet Swin-T only p5 | 44.5 | 38.6 | 39.3 | 50.3 | 34.6 | 25.5 | 29.9 | 43.8 |
| LLMDet Swin-T | 44.7 | 37.3 | 39.5 | 50.7 | 34.9 | 26.0 | 30.1 | 44.3 |
| LLMDet Swin-B | 48.3 | 40.8 | 43.1 | 54.3 | 38.5 | 28.2 | 34.3 | 47.8 |
| LLMDet Swin-L | 51.1 | 45.1 | 46.1 | 56.6 | 42.0 | 31.6 | 38.8 | 50.2 |
| LLMDet Swin-L (chunk size 80) | 52.4 | 44.3 | 48.8 | 57.1 | 43.2 | 32.8 | 40.5 | 50.8 |

**NOTE:**

1. AP<sup>mini</sup>: evaluated on LVIS `minival`.
2. AP<sup>val</sup>: evaluated on LVIS `val 1.0`.
3. All reported AP values are fixed AP.
4. All checkpoints and logs can be found on [huggingface](https://huggingface.co/fushh7/LLMDet) and [modelscope](https://modelscope.cn/models/fushh7/LLMDet); see the download snippet after these notes.
5. Other benchmarks are evaluated with `LLMDet Swin-T only p5`.
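
Since the checkpoints live in ordinary Hugging Face / ModelScope repositories, they can also be fetched programmatically. A minimal sketch with `huggingface_hub` is shown below; the exact filenames inside the repo are not listed here, so use `allow_patterns` once you know which files you need.

```python
# Download the released LLMDet checkpoints/logs from the Hugging Face Hub.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="fushh7/LLMDet",         # repo referenced in the notes above
    local_dir="checkpoints/LLMDet",  # where to place the files
)
print("checkpoints downloaded to:", local_dir)
```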

### 3 Our Experiment Environment

Note: other environments may also work. A quick version check is sketched after this list.

- pytorch==2.2.1+cu121
- transformers==4.37.2
- numpy==1.22.2 (numpy should be lower than 1.24; numpy==1.23 or 1.22 is recommended)
- mmcv==2.2.0, mmengine==0.10.5
- timm, deepspeed, pycocotools, lvis, jsonlines, fairscale, nltk, peft, wandb
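
To verify an installed environment against the versions listed above, a plain-Python check like the following can catch the common numpy mismatch early (this is only a convenience sketch, not part of the project code):

```python
# Sanity-check the key package versions listed above.
import mmcv
import mmengine
import numpy
import torch
import transformers

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("numpy:", numpy.__version__)
print("mmcv:", mmcv.__version__, "| mmengine:", mmengine.__version__)

# numpy >= 1.24 removed deprecated aliases (np.float, np.int, ...), a common
# source of breakage in older detection codebases.
assert tuple(int(x) for x in numpy.__version__.split(".")[:2]) < (1, 24), "numpy should be < 1.24"
```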

### 4 Data Preparation

```
|--huggingface
| |--bert-base-uncased
| |--siglip-so400m-patch14-384
| |--my_llava-onevision-qwen2-0.5b-ov-2
| |--mm_grounding_dino
| | |--grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth
| | |--grounding_dino_swin-b_pretrain_obj365_goldg_v3de-f83eef00.pth
| | |--grounding_dino_swin-l_pretrain_obj365_goldg-34dcdc53.pth
|--grounding_data
| |--coco
| | |--annotations
| | | |--instances_train2017_vg_merged6.jsonl
| | | |--instances_val2017.json
| | | |--lvis_v1_minival_inserted_image_name.json
| | | |--lvis_od_val.json
| | |--train2017
| | |--val2017
| |--flickr30k_entities
| | |--flickr_train_vg7.jsonl
| | |--flickr30k_images
| |--gqa
| | |--gqa_train_vg7.jsonl
| | |--images
| |--llava_cap
| | |--LLaVA-ReCap-558K_tag_box_vg7.jsonl
| | |--images
| |--v3det
| | |--annotations
| | | |--v3det_2023_v1_train_vg7.jsonl
| | |--images
|--LLMDet (code)
```

- pretrained models
  - `bert-base-uncased` and `siglip-so400m-patch14-384` are downloaded directly from huggingface.
  - To fully reproduce our results, please download `my_llava-onevision-qwen2-0.5b-ov-2` from [huggingface](https://huggingface.co/fushh7/LLMDet) or [modelscope](https://modelscope.cn/models/fushh7/LLMDet), which we slightly fine-tuned in early exploration. The original `llava-onevision-qwen2-0.5b-ov` can also reproduce our results, but users then need to pretrain the projector.
  - Since LLMDet is fine-tuned from `mm_grounding_dino`, please download its checkpoints [swin-t](https://download.openmmlab.com/mmdetection/v3.0/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth), [swin-b](https://download.openmmlab.com/mmdetection/v3.0/mm_grounding_dino/grounding_dino_swin-b_pretrain_obj365_goldg_v3det/grounding_dino_swin-b_pretrain_obj365_goldg_v3de-f83eef00.pth), and [swin-l](https://download.openmmlab.com/mmdetection/v3.0/mm_grounding_dino/grounding_dino_swin-l_pretrain_obj365_goldg/grounding_dino_swin-l_pretrain_obj365_goldg-34dcdc53.pth) for training.
- grounding data (GroundingCap-1M)
  - `coco`: You can download it from the [COCO](https://cocodataset.org/) official website or from [opendatalab](https://opendatalab.com/OpenDataLab/COCO_2017).
  - `lvis`: LVIS shares the same images with COCO. You can download the minival annotation file from [here](https://huggingface.co/GLIPModel/GLIP/blob/main/lvis_v1_minival_inserted_image_name.json) and the val 1.0 annotation file from [here](https://huggingface.co/GLIPModel/GLIP/blob/main/lvis_od_val.json).
  - `flickr30k_entities`: [Flickr30k images](http://shannon.cs.illinois.edu/DenotationGraph/).
  - `gqa`: [GQA images](https://nlp.stanford.edu/data/gqa/images.zip).
  - `llava_cap`: [images](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain/blob/main/images.zip).
  - `v3det`: The V3Det dataset can be downloaded from [opendatalab](https://opendatalab.com/V3Det/V3Det).
  - Our generated jsonl files can be found on [huggingface](https://huggingface.co/fushh7/LLMDet) or [modelscope](https://modelscope.cn/models/fushh7/LLMDet); a quick way to inspect them is sketched after this list.
  - For other evaluation datasets, please refer to [MM-GDINO](https://github.com/open-mmlab/mmdetection/blob/main/configs/mm_grounding_dino/dataset_prepare.md).
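
Each annotation file above is a plain JSON Lines file, so it is easy to eyeball a few records before training. Below is a minimal sketch using the `jsonlines` package from the dependency list; the path is taken from the directory tree above, and since the field names depend on the released files, the sketch only prints the keys.

```python
# Peek at the first few records of a GroundingCap-1M annotation file.
import jsonlines

path = "grounding_data/coco/annotations/instances_train2017_vg_merged6.jsonl"
with jsonlines.open(path) as reader:
    for i, record in enumerate(reader):
        # Field names depend on the released files, so just list them.
        print(f"record {i}: keys = {sorted(record.keys())}")
        if i == 2:
            break
```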

### 5 Usage

#### 5.1 Training

```
bash dist_train.sh configs/grounding_dino_swin_t.py 8 --amp
```
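
Before launching a run, it can be handy to inspect or tweak the config programmatically. Below is a small sketch using MMEngine's `Config` API; the commented tweaks assume the usual MMDetection 3.x config layout, so check the loaded config before relying on them.

```python
# Load and optionally edit a training config before calling dist_train.sh.
from mmengine.config import Config

cfg = Config.fromfile("configs/grounding_dino_swin_t.py")
print(sorted(cfg.keys()))  # list the top-level config sections

# Typical tweaks, assuming the standard MMDetection 3.x layout:
# cfg.train_dataloader.batch_size = 2
# cfg.optim_wrapper.optimizer.lr = 1e-4

cfg.dump("configs/grounding_dino_swin_t_custom.py")  # save an edited copy
```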

#### 5.2 Evaluation

```
bash dist_test.sh configs/grounding_dino_swin_t.py tiny.pth 8
```
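
For quick qualitative checks on single images (as opposed to the full LVIS evaluation above), MMDetection's `DetInferencer` may work with LLMDet configs the same way it does with MM-Grounding-DINO, since LLMDet is fine-tuned from it. Treat the snippet below as an untested sketch and the text prompt format as an assumption.

```python
# Single-image open-vocabulary inference (untested sketch).
from mmdet.apis import DetInferencer

inferencer = DetInferencer(
    model="configs/grounding_dino_swin_t.py",  # same config as the evaluation command
    weights="tiny.pth",                        # same checkpoint as the evaluation command
)

# Grounding-DINO-style prompt: candidate categories separated by " . ".
result = inferencer(
    "demo.jpg",
    texts="person . dog . surfboard .",
    out_dir="outputs/",
)
print(result["predictions"][0].keys())  # e.g. labels / scores / bboxes
```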

### 6 License

LLMDet is released under the Apache 2.0 license. Please see the LICENSE file for more information.

### 7 Bibtex

If you find our work helpful for your research, please consider citing our paper (coming very soon!).

```
@article{fu2025llmdet,
  title={LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models},
  author={Fu, Shenghao and Yang, Qize and Mo, Qijie and Yan, Junkai and Wei, Xihan and Meng, Jingke and Xie, Xiaohua and Zheng, Wei-Shi},
}
```

### 8 Acknowledgement

Our LLMDet is heavily inspired by many outstanding prior works, including:

- [MM-Grounding-DINO](https://github.com/open-mmlab/mmdetection/tree/main/configs/mm_grounding_dino)
- [LLaVA1.5](https://github.com/haotian-liu/LLaVA)
- [LLaVA OneVision](https://github.com/LLaVA-VL/LLaVA-NeXT)
- [ShareGPT4V](https://github.com/ShareGPT4Omni/ShareGPT4V)
- [ASv2](https://github.com/OpenGVLab/all-seeing)
- [RAM](https://github.com/xinyu1205/recognize-anything)

Thanks to the authors of the above projects for open-sourcing their assets!