---
language: multilingual
license: mit
tags:
- onnx
- optimum
- quantized
- int8
- text-embedding
- onnxruntime
- opset14
- text-classification
- gpu
- optimized
datasets:
- mmarco
pipeline_tag: sentence-similarity
---

# gte-multilingual-reranker-base-onnx-op14-opt-gpu-int8-quantized

This model is an INT8-quantized ONNX version of [Alibaba-NLP/gte-multilingual-reranker-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-reranker-base), exported with ONNX opset 14.

## Model Details

- **Quantization Type**: INT8
- **ONNX Opset**: 14
- **Task**: text-classification (reranking)
- **Target Device**: GPU
- **Optimized**: Yes
- **Framework**: ONNX Runtime
- **Original Model**: [Alibaba-NLP/gte-multilingual-reranker-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-reranker-base)
- **Quantized On**: 2025-03-27

## Environment and Package Versions

| Package | Version |
| --- | --- |
| transformers | 4.48.3 |
| optimum | 1.24.0 |
| onnx | 1.17.0 |
| onnxruntime | 1.21.0 |
| torch | 2.5.1 |
| numpy | 1.26.4 |
| huggingface_hub | 0.28.1 |
| python | 3.12.9 |
| system | Darwin 24.3.0 |

### Applied Optimizations

| Optimization | Setting |
| --- | --- |
| Graph Optimization Level | Extended |
| Optimize for GPU | Yes |
| Use FP16 | No |
| Transformers-Specific Optimizations | Enabled |
| GELU Fusion | Enabled |
| Layer Norm Fusion | Enabled |
| Attention Fusion | Enabled |
| Skip Layer Norm Fusion | Enabled |
| GELU Approximation | Enabled |

## Usage

This model is a reranker, so it scores query-document pairs rather than embedding single texts:

```python
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

# Load the quantized model and tokenizer (local directory or Hub repo id)
model = ORTModelForSequenceClassification.from_pretrained("quantized_model")
tokenizer = AutoTokenizer.from_pretrained("quantized_model")

# Each input is a [query, document] pair; a higher logit means a better match
pairs = [
    ["what is the capital of China?", "Beijing is the capital of China."],
    ["what is the capital of China?", "Paris is the capital of France."],
]
inputs = tokenizer(pairs, padding=True, truncation=True, max_length=512, return_tensors="pt")

# Run inference and read out one relevance score per pair
outputs = model(**inputs)
scores = outputs.logits.view(-1)
print(scores)
```

## Quantization Process

This model was exported to ONNX (opset 14) with Hugging Face's Optimum library and quantized to INT8 with ONNX Runtime. Graph optimization targeting GPU devices was applied during export, using the settings listed under Applied Optimizations above. A hedged reproduction sketch appears at the end of this card.

## Performance Comparison

Quantized models generally trade a small amount of accuracy for faster inference and a smaller footprint: INT8 weights take roughly a quarter of the storage of FP32 weights. No benchmark numbers are reported for this export, so treat the expected speedup over the original model as an estimate; the timing sketch below shows one way to measure it on your own hardware.
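As a rough check, the following minimal sketch times the quantized model against the original PyTorch checkpoint on the same batch. It is a sketch under assumptions, not a reported benchmark: the local path `quantized_model`, the sample pairs, and the warmup/run counts are placeholders, and the original checkpoint may require `trust_remote_code=True` because it uses a custom architecture.

```python
import time

import torch
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# For a GPU measurement, also pass provider="CUDAExecutionProvider" here
quantized = ORTModelForSequenceClassification.from_pretrained("quantized_model")
original = AutoModelForSequenceClassification.from_pretrained(
    "Alibaba-NLP/gte-multilingual-reranker-base", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("quantized_model")

pairs = [["what is the capital of China?", "Beijing is the capital of China."]] * 8
inputs = tokenizer(pairs, padding=True, truncation=True, max_length=512, return_tensors="pt")

def bench(fn, warmup=3, runs=20):
    """Average seconds per call after a short warmup."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs

with torch.no_grad():
    t_original = bench(lambda: original(**inputs))
t_quantized = bench(lambda: quantized(**inputs))

print(f"original:  {t_original * 1000:.1f} ms/batch")
print(f"quantized: {t_quantized * 1000:.1f} ms/batch")
```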
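## Reproduction Sketch

The export, graph-optimization, and quantization flow described under Quantization Process can be approximated with Optimum roughly as follows. This is a minimal sketch, not the exact script used to produce this model: the save directories are placeholders, the `OptimizationConfig` values mirror the Applied Optimizations table, and the dynamic `avx512_vnni` quantization config is an assumption.

```python
from optimum.onnxruntime import (
    ORTModelForSequenceClassification,
    ORTOptimizer,
    ORTQuantizer,
)
from optimum.onnxruntime.configuration import AutoQuantizationConfig, OptimizationConfig

model_id = "Alibaba-NLP/gte-multilingual-reranker-base"

# Export the PyTorch checkpoint to ONNX; trust_remote_code is typically
# required for the custom gte architecture. (This card used opset 14;
# the optimum-cli ONNX exporter exposes an --opset flag for that.)
model = ORTModelForSequenceClassification.from_pretrained(
    model_id, export=True, trust_remote_code=True
)

# Graph optimization mirroring the Applied Optimizations table:
# extended level (2), GPU target, FP16 off, transformers-specific
# fusions on (the individual fusions are enabled by default)
optimization_config = OptimizationConfig(
    optimization_level=2,
    optimize_for_gpu=True,
    fp16=False,
    enable_transformers_specific_optimizations=True,
    enable_gelu_approximation=True,
)
optimizer = ORTOptimizer.from_pretrained(model)
optimizer.optimize(save_dir="optimized_model", optimization_config=optimization_config)

# Dynamic INT8 quantization; the exact config used for this model is not
# recorded on the card, so avx512_vnni here is an assumption
quantization_config = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
quantizer = ORTQuantizer.from_pretrained("optimized_model", file_name="model_optimized.onnx")
quantizer.quantize(save_dir="quantized_model", quantization_config=quantization_config)
```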