license: apache-2.0

Model Details

Perception Encoder (PE) is a state-of-the-art encoder for image and video understanding trained via simple vision-language learning. It was introduced in "Perception Encoder: The best visual embeddings are not at the output of the network".

Model Developer: Meta

Model Overview: Perception Encoder (PE) is a family of large-scale vision encoder models with state-of-the-art performance on a wide variety of vision tasks. By using a robust contrastive pretraining recipe and finetuning on synthetically aligned videos, PE not only outperforms all existing models on classification and retrieval, but also internally produces strong, general features that scale for downstream tasks. Combined with alignment tuning that capitalizes on those general features, PE allows large-scale contrastive pretraining to transfer to downstream tasks.

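| Scale | Tower | Params | Width | Depth | MLP | Heads | CLIP Dim | Resolution | Patch Size | Text Context Length |
|---|---|---|---|---|---|---|---|---|---|---|
| B | Vision | 0.09B | 768 | 12 | 3072 | 12 | 1024 | 224 | 16 | 32 |
| B | Text | 0.31B | 1024 | 24 | 4096 | 16 | 1024 | 224 | 16 | 32 |
| L | Vision | 0.32B | 1024 | 24 | 4096 | 16 | 1024 | 336 | 14 | 32 |
| L | Text | 0.31B | 1024 | 24 | 4096 | 16 | 1024 | 336 | 14 | 32 |
| G | Vision | 1.88B | 1536 | 50 | 8960 | 16 | 1280 | 448 | 14 | 72 |
| G | Text | 0.47B | 1280 | 24 | 5120 | 20 | 1280 | 448 | 14 | 72 |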
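As a reading aid for the table, the patch grid each vision tower sees follows from the Resolution and Patch Size columns: a square image of side R split into P×P patches gives (R/P)² patches. The sketch below uses only values from the table. The `patch_count` helper is illustrative and not part of the PE codebase, and it counts image patches only (any additional tokens the model may use are not covered here).

```python
# Resolution and patch size per scale, copied from the table above.
CONFIGS = {
    "B": {"resolution": 224, "patch_size": 16},
    "L": {"resolution": 336, "patch_size": 14},
    "G": {"resolution": 448, "patch_size": 14},
}


def patch_count(resolution: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image.

    Illustrative helper (an assumption, not a PE API).
    """
    if resolution % patch_size != 0:
        raise ValueError("resolution must be divisible by patch_size")
    side = resolution // patch_size
    return side * side


for scale, cfg in CONFIGS.items():
    side = cfg["resolution"] // cfg["patch_size"]
    print(f"{scale}: {side}x{side} grid, {patch_count(**cfg)} patches")
```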

How to use

PE codebase

We provide the pretraining code in https://github.com/facebookresearch/perception_models

git clone https://github.com/facebookresearch/perception_models.git
cd perception_models
conda create --name occhi-env python=3.12
conda activate occhi-env
# Install PyTorch
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 xformers --index-url https://download.pytorch.org/whl/cu124
# We use torchcodec for decoding videos into PyTorch tensors
conda install ffmpeg -c conda-forge
pip install torchcodec==0.1 --index-url=https://download.pytorch.org/whl/cu124
pip install -e .
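
After running the commands above, it can help to confirm that the packages are visible to the active environment before loading a model. The following is a minimal sketch using only the Python standard library; the `missing_packages` helper is hypothetical and not part of the repository, and it checks only that each package can be located, not that its version or CUDA build is correct.

```python
import importlib.util

# Top-level packages installed by the pip commands above.
REQUIRED = ["torch", "torchvision", "torchaudio", "xformers", "torchcodec"]


def missing_packages(names):
    """Return the names that cannot be located in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages can be located.")
```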

Image and Text Feature Extraction with a Trained Model :robot:

import torch
from occhi.vision_encoder.factory import create_model_and_transforms, get_tokenizer
from PIL import Image

model_name = 'PEv1-B16-224'  # no trailing space in the model name
pretrained = 'PATH_TO_PE_Core_B16_224'  # path to the downloaded checkpoint

model, _, preprocess = create_model_and_transforms(
    model_name,
    pretrained=pretrained,
)
model = model.cuda()
tokenizer = get_tokenizer(model_name)
image = preprocess(Image.open("docs/cat.png")).unsqueeze(0).cuda()
text = tokenizer(["a diagram", "a dog", "a cat"]).cuda()

with torch.no_grad(), torch.autocast("cuda"):
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)  # prints: [[0.0, 0.0, 1.0]]

You can find more details in the GitHub repo.
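
To make the scoring step in the snippet above concrete, here is a minimal sketch of the same arithmetic in plain Python: L2-normalize the features, take dot products (cosine similarities), scale by 100, and apply a softmax over the text prompts. The toy vectors and the helper names (`l2_normalize`, `softmax`, `label_probs`) are made up for illustration; they stand in for the model outputs and are not part of the PE codebase.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_probs(image_feature, text_features, scale=100.0):
    """Softmax over scaled cosine similarities between one image and N texts."""
    img = l2_normalize(image_feature)
    logits = []
    for text_feature in text_features:
        txt = l2_normalize(text_feature)
        logits.append(scale * sum(a * b for a, b in zip(img, txt)))
    return softmax(logits)


# Toy 3-dimensional features (hypothetical values, not real model outputs).
image_feature = [0.1, 0.2, 0.9]
text_features = [
    [0.9, 0.1, 0.0],  # stands in for "a diagram"
    [0.2, 0.9, 0.1],  # stands in for "a dog"
    [0.1, 0.2, 0.8],  # stands in for "a cat"
]

probs = label_probs(image_feature, text_features)
print("Label probs:", probs)
```

Because the similarities are multiplied by 100 before the softmax, even modest differences in cosine similarity produce a sharply peaked distribution.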

Evaluation

We evaluate the pretrained PE models on zero-shot image and video tasks.

Zero-Shot Image Results

Zero-Shot Video Results

Citation

If you find our code useful for your research, please consider citing:

@article{PE,
    title={Perception Encoder: The best visual embeddings are not at the output of the network},
    author={},
    journal={arXiv:xxx.xxxxx},
    year={2025}
}