PE-Core-B16-224 / README.md

	---
	license: apache-2.0
	---

	# Model Details

	Perception Encoder (PE) is a state-of-the-art encoder for image and video understanding trained via simple vision-language learning. It was introduced in "[Perception Encoder: The best visual embeddings are not at the output of the network](https://ai.meta.com/research/publications/perception-encoder-the-best-visual-embeddings-are-not-at-the-output-of-the-network/)".

	Model Developer: Meta

	Model Overview: Perception Encoder (PE) is a family of large-scale vision encoder models with state-of-the-art performance on a large variety of vision tasks. By using a robust contrastive pretraining recipe and finetuning on synthetically aligned videos, PE not only outperforms all existing models on classification and retrieval, but also internally produces strong, general features that scale for downstream tasks. With alignment tuning to capitalize on those general features, PE lets large-scale contrastive pretraining transfer to downstream tasks.

	<img src="https://huggingface.co/facebook/PE-Core-G14-448/resolve/main/docs/pe_image1.png" style="width: 100%; margin: 0 auto; display: block;" />


	| Scale | Tower | Params | Width | Depth | MLP | Heads | CLIP Dim | Resolution | Patch Size | Text Context Length |
	| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
	| B | Vision | 0.09B | 768 | 12 | 3072 | 12 | 1024 | 224 | 16 | 32 |
	| | Text | 0.31B | 1024 | 24 | 4096 | 16 | 1024 | 224 | 16 | 32 |
	| L | Vision | 0.32B | 1024 | 24 | 4096 | 16 | 1024 | 336 | 14 | 32 |
	| | Text | 0.31B | 1024 | 24 | 4096 | 16 | 1024 | 336 | 14 | 32 |
	| G | Vision | 1.88B | 1536 | 50 | 8960 | 16 | 1280 | 448 | 14 | 72 |
	| | Text | 0.47B | 1280 | 24 | 5120 | 20 | 1280 | 448 | 14 | 72 |
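
As a rough reading aid for the table, the sketch below derives the number of image patches per input from the Resolution and Patch Size columns, and estimates the vision-tower parameter count from Width, Depth, and MLP. The estimate assumes a standard transformer block (four width-by-width attention projections plus a two-layer MLP) and ignores biases, norms, embeddings, and pooling. That block layout is an assumption, not something this README states, so treat the estimates as approximate.

```python
# Vision-tower rows copied from the table above.
VISION_TOWERS = {
    "B": {"width": 768, "depth": 12, "mlp": 3072, "resolution": 224, "patch": 16, "params_b": 0.09},
    "L": {"width": 1024, "depth": 24, "mlp": 4096, "resolution": 336, "patch": 14, "params_b": 0.32},
    "G": {"width": 1536, "depth": 50, "mlp": 8960, "resolution": 448, "patch": 14, "params_b": 1.88},
}


def num_patches(resolution, patch):
    """Number of non-overlapping square patches in a square input image."""
    side = resolution // patch
    return side * side


def approx_block_params(width, depth, mlp):
    """Rough parameter count, ASSUMING a standard transformer block.

    Counts only the attention projections (Q, K, V, output) and the
    two MLP matrices; biases, norms, and embeddings are ignored.
    """
    attention = 4 * width * width
    feed_forward = 2 * width * mlp
    return depth * (attention + feed_forward)


for scale, cfg in VISION_TOWERS.items():
    patches = num_patches(cfg["resolution"], cfg["patch"])
    estimate_b = approx_block_params(cfg["width"], cfg["depth"], cfg["mlp"]) / 1e9
    print(f"{scale}: {patches} patches, ~{estimate_b:.2f}B estimated vs {cfg['params_b']}B listed")
```

For the B/16 model at 224 px, for example, the patch grid is 14 x 14, i.e. 196 patches per image.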


	# How to use

	## PE codebase
	The pretraining code is available at https://github.com/facebookresearch/perception_models. Set up the environment as follows:
	```shell
	git clone https://github.com/facebookresearch/perception_models.git
	cd perception_models
	conda create --name occhi-env python=3.12
	conda activate occhi-env
	# Install PyTorch
	pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 xformers --index-url https://download.pytorch.org/whl/cu124
	# We use torchcodec for decoding videos into PyTorch tensors
	conda install ffmpeg -c conda-forge
	pip install torchcodec==0.1 --index-url=https://download.pytorch.org/whl/cu124
	pip install -e .
	```
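
After running the commands above, a quick sanity check can confirm that the expected packages are discoverable in the active environment. This is a minimal sketch: `missing_packages` is a hypothetical helper (not part of the PE codebase), and the top-level import name `occhi` is inferred from the `from occhi.vision_encoder.factory import ...` line in this README, so adjust it if the package layout differs.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be located.

    find_spec only checks that a package can be found; it does not
    import it, so this does not prove the package works at runtime.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


# Top-level import names for the packages installed above.
# "occhi" is an assumption based on the import used in this README.
REQUIRED = ["torch", "torchvision", "torchaudio", "xformers", "torchcodec", "occhi"]

missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages were found.")
```
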
	## Image and Text Feature Extraction with a Trained Model :robot:
	```python
	import torch
	from occhi.vision_encoder.factory import create_model_and_transforms, get_tokenizer
	from PIL import Image

	model_name = 'PEv1-B16-224'
	pretrained = 'PATH_TO_PE_Core_B16_224'  # path to the downloaded checkpoint

	model, _, preprocess = create_model_and_transforms(
	    model_name,
	    pretrained=pretrained,
	)
	model = model.cuda()
	tokenizer = get_tokenizer(model_name)

	# Preprocess one image and tokenize the candidate captions
	image = preprocess(Image.open("docs/cat.png")).unsqueeze(0).cuda()
	text = tokenizer(["a diagram", "a dog", "a cat"]).cuda()

	with torch.no_grad(), torch.autocast("cuda"):
	    image_features = model.encode_image(image)
	    text_features = model.encode_text(text)
	    # L2-normalize so the dot product is a cosine similarity
	    image_features /= image_features.norm(dim=-1, keepdim=True)
	    text_features /= text_features.norm(dim=-1, keepdim=True)
	    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

	print("Label probs:", text_probs)  # prints: [[0.0, 0.0, 1.0]]
	```
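
To make the normalization and softmax steps in the snippet above concrete, here is a dependency-free sketch of the same computation on toy vectors: L2-normalize both sides, scale the cosine similarities by 100, and apply a softmax over the text candidates. The vectors and the helper names (`l2_normalize`, `softmax`, `label_probs`) are illustrative assumptions, not real PE embeddings or PE APIs.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_probs(image_vec, text_vecs, scale=100.0):
    """Softmax over scaled cosine similarities between one image and several texts."""
    image_unit = l2_normalize(image_vec)
    logits = []
    for text_vec in text_vecs:
        text_unit = l2_normalize(text_vec)
        cosine = sum(a * b for a, b in zip(image_unit, text_unit))
        logits.append(scale * cosine)
    return softmax(logits)


# Toy 3-dimensional "embeddings" (made up for illustration only).
image_vec = [0.1, 0.2, 0.9]
text_vecs = [
    [1.0, 0.0, 0.0],  # stands in for "a diagram"
    [0.0, 1.0, 0.0],  # stands in for "a dog"
    [0.0, 0.0, 1.0],  # stands in for "a cat"
]

probs = label_probs(image_vec, text_vecs)
best = max(range(len(probs)), key=probs.__getitem__)
print("Label probs:", probs)
print("Best match index:", best)
```

Because the similarities are multiplied by 100 before the softmax, even moderate differences in cosine similarity yield a near one-hot distribution, which is why the real snippet reports probabilities close to 0 and 1.
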
	You can find more details in the GitHub repo.
	# Evaluation
	We evaluate the pretrained PE models on zero-shot image and video tasks. The results are shown below.
	## Zero-Shot Image Results
	<img src="https://huggingface.co/facebook/PE-Core-G14-448/resolve/main/docs/pe_zeroshot_image.png" style="width: 100%; margin: 0;" />
	## Zero-Shot Video Results
	<img src="https://huggingface.co/facebook/PE-Core-G14-448/resolve/main/docs/pe_zeroshot_video.png" style="width: 90%; margin: 0" />
	# Citation
	If you find our code useful for your research, please consider citing:

	@article{PE,
	  title={Perception Encoder: The best visual embeddings are not at the output of the network},
	  author={},
	  journal={arXiv:xxx.xxxxx},
	  year={2025}
	}