Update README.md
README.md
CHANGED
@@ -10,7 +10,7 @@ datasets:
 pipeline_tag: image-feature-extraction
 ---
 
-InternViT-300M-448px
+# InternViT-300M-448px
 
 [\[🆕 Blog\]](https://internvl.github.io/blog/) [\[📜 InternVL 1.0 Paper\]](https://arxiv.org/abs/2312.14238) [\[📜 InternVL 1.5 Report\]](https://arxiv.org/abs/2404.16821) [\[🗨️ Chat Demo\]](https://internvl.opengvlab.com/)
 
@@ -26,24 +26,6 @@ This update primarily focuses on enhancing the efficiency of the vision foundation model
 - **Pretrain Dataset:** LAION-en, LAION-zh, COYO, GRIT, COCO, TextCaps, Objects365, OpenImages, All-Seeing, Wukong-OCR, LaionCOCO-OCR, and other OCR-related datasets.
 To enhance the OCR capability of the model, we have incorporated additional OCR data alongside the general caption datasets. Specifically, we utilized PaddleOCR to perform Chinese OCR on images from Wukong and English OCR on images from LAION-COCO.
 
-## Released Models
-### Vision Foundation model
-| Model                   | Date       | Download                                                                | Note                             |
-| ----------------------- | ---------- | ----------------------------------------------------------------------- | -------------------------------- |
-| InternViT-6B-448px-V1-5 | 2024.04.20 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5) | support dynamic resolution, super strong OCR (🔥new) |
-| InternViT-6B-448px-V1-2 | 2024.02.11 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2) | 448 resolution                   |
-| InternViT-6B-448px-V1-0 | 2024.01.30 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-0) | 448 resolution                   |
-| InternViT-6B-224px      | 2023.12.22 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-224px)      | vision foundation model          |
-| InternVL-14B-224px      | 2023.12.22 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-14B-224px)      | vision-language foundation model |
-
-### Multimodal Large Language Model (MLLM)
-| Model                   | Date       | Download                                                               | Note                             |
-| ----------------------- | ---------- | ----------------------------------------------------------------------- | -------------------------------- |
-| InternVL-Chat-V1-5      | 2024.04.18 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5)      | support 4K image; super strong OCR; Approaching the performance of GPT-4V and Gemini Pro on various benchmarks like MMMU, DocVQA, ChartQA, MathVista, etc. (🔥new) |
-| InternVL-Chat-V1-2-Plus | 2024.02.21 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2-Plus) | more SFT data and stronger       |
-| InternVL-Chat-V1-2      | 2024.02.11 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2)      | scaling up LLM to 34B            |
-| InternVL-Chat-V1-1      | 2024.01.24 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-1)      | support Chinese and stronger OCR |
-
 ## Model Usage (Image Embeddings)
 
 ```python
@@ -86,7 +68,3 @@ If you find this project useful in your research, please consider citing:
 }
 
 ```
-
-## Acknowledgement
-
-InternVL is built with reference to the code of the following projects: [OpenAI CLIP](https://github.com/openai/CLIP), [Open CLIP](https://github.com/mlfoundations/open_clip), [CLIP Benchmark](https://github.com/LAION-AI/CLIP_benchmark), [EVA](https://github.com/baaivision/EVA/tree/master), [InternImage](https://github.com/OpenGVLab/InternImage), [ViT-Adapter](https://github.com/czczup/ViT-Adapter), [MMSegmentation](https://github.com/open-mmlab/mmsegmentation), [Transformers](https://github.com/huggingface/transformers), [DINOv2](https://github.com/facebookresearch/dinov2), [BLIP-2](https://github.com/salesforce/LAVIS/tree/main/projects/blip2), [Qwen-VL](https://github.com/QwenLM/Qwen-VL/tree/master/eval_mm), and [LLaVA-1.5](https://github.com/haotian-liu/LLaVA). Thanks for their awesome work!
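
The pretraining notes above mention that PaddleOCR was used to run Chinese OCR over Wukong images and English OCR over LAION-COCO images. For readers unfamiliar with that step, here is a minimal sketch of per-image OCR annotation using the open-source `paddleocr` package; the file names, confidence threshold, and caption-joining logic are illustrative assumptions, not the authors' actual pipeline:

```python
# Hedged sketch: turn PaddleOCR detections into a single OCR "caption" per image.
from paddleocr import PaddleOCR

# One engine per language: 'ch' for Chinese (Wukong-OCR), 'en' for English
# (LaionCOCO-OCR). use_angle_cls enables rotated-text classification.
ocr_zh = PaddleOCR(lang='ch', use_angle_cls=True)
ocr_en = PaddleOCR(lang='en', use_angle_cls=True)

def ocr_caption(image_path: str, engine: PaddleOCR, min_conf: float = 0.5) -> str:
    """Join the recognized text lines of one image into a caption string."""
    result = engine.ocr(image_path, cls=True)
    if not result or result[0] is None:  # no text detected
        return ''
    # Each detection is [bounding_box, (text, confidence)].
    return ' '.join(text for _, (text, conf) in result[0] if conf >= min_conf)

# Illustrative file names, not part of the released datasets.
print(ocr_caption('wukong_sample.jpg', ocr_zh))
print(ocr_caption('laion_coco_sample.jpg', ocr_en))
```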
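
The `## Model Usage (Image Embeddings)` snippet appears only as trailing context above; its body is elided between the second and third hunks. For orientation, extracting image embeddings from an InternViT checkpoint with `transformers` typically follows the pattern below. This is a hedged sketch, not necessarily the README's exact code: the `AutoModel` plus `CLIPImageProcessor` combination with `trust_remote_code=True` is the usual pattern for OpenGVLab vision encoders, and the image path is illustrative.

```python
import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor

# trust_remote_code pulls in the custom InternViT modeling code
# shipped with the checkpoint.
model = AutoModel.from_pretrained(
    'OpenGVLab/InternViT-300M-448px',
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).cuda().eval()

image_processor = CLIPImageProcessor.from_pretrained('OpenGVLab/InternViT-300M-448px')

image = Image.open('./examples/image1.jpg').convert('RGB')  # illustrative path
pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()

with torch.no_grad():
    outputs = model(pixel_values)

# pooler_output is a single embedding per image;
# last_hidden_state holds the per-patch token features.
print(outputs.pooler_output.shape)
```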