Cohere on Hugging Face Inference Providers 🔥

Published April 16, 2025

reach-vb Vaibhav Srivastav

burtenshaw ben burtenshaw

merve Merve Noyan

celinah Célina Hanouti

alexrs Alejandro Rodriguez

CohereLabs

julien-c Julien Chaumond

sbrandeis Simon Brandeis

We're thrilled to share that Cohere is now a supported Inference Provider on the Hugging Face Hub! Cohere is also the first model creator to share and serve its models directly on the Hub.

Cohere is committed to building and serving models purpose-built for enterprise use cases. Their comprehensive suite of secure AI solutions, from cutting-edge Generative AI to powerful Embeddings and Ranking models, is designed to tackle real-world business challenges. Additionally, Cohere Labs, Cohere’s in-house research lab, supports fundamental research and seeks to change the spaces where research happens.

Starting now, you can run serverless inference on Cohere and Cohere Labs models via Inference Providers.

Light up your projects with Cohere and Cohere Labs today!

Cohere Models

Cohere and Cohere Labs bring to Inference Providers a swathe of models that excel at specific business applications. Let’s explore some in detail.

CohereLabs/c4ai-command-a-03-2025

Optimized for demanding enterprises that require fast, secure, and high-quality AI. Its 256k context length (2x most leading models) can handle much longer enterprise documents. Other key features include Cohere’s advanced retrieval-augmented generation (RAG) with verifiable citations, agentic tool use, enterprise-grade security, and strong multilingual performance (support for 23 languages).

CohereLabs/aya-expanse-32b

Focuses on state-of-the-art multilingual support, applying the latest research on multilingual pre-training. Supports Arabic, Chinese (simplified & traditional), Czech, Dutch, English, French, German, Greek, Hebrew, Hindi, Indonesian, Italian, Japanese, Korean, Persian, Polish, Portuguese, Romanian, Russian, Spanish, Turkish, Ukrainian, and Vietnamese with 128K context length.

CohereLabs/c4ai-command-r7b-12-2024

Ideal for low-cost or low-latency use cases, bringing state-of-the-art performance in its class of open-weight models across real-world tasks. This model offers a context length of 128k. It delivers a powerful combination of multilingual support, citation-verified retrieval-augmented generation (RAG), reasoning, tool use, and agentic behavior. Also supports 23 languages.

CohereLabs/aya-vision-32b

32-billion parameter model with advanced capabilities optimized for a variety of vision-language use cases, including OCR, captioning, visual reasoning, summarization, question answering, code, and more. It expands multimodal capabilities to 23 languages spoken by over half the world's population.

How it works

You can use Cohere models directly on the Hub, either in the website UI or via the client SDKs.

You can find all the examples mentioned in this section on the Cohere documentation page.

In the website UI

You can search for Cohere models by filtering by the inference provider in the model hub.

From the Model Card, you can select the inference provider and run inference directly in the UI.

From the client SDKs

Let’s walk through using Cohere models from client SDKs. We’ve also made a Colab notebook with these snippets, in case you want to try them out right away.

From Python, using huggingface_hub

The following example shows how to use Command R7B with Cohere as your inference provider. You can use a Hugging Face token for automatic routing through Hugging Face, or your own Cohere API key if you have one.

Install huggingface_hub v0.30.0 or later:

pip install -U "huggingface_hub>=0.30.0"

Use the huggingface_hub Python library to call Cohere endpoints by setting the provider parameter.

from huggingface_hub import InferenceClient

client = InferenceClient(
    provider="cohere",
    api_key="xxxxxxxxxxxxxxxxxxxxxxxx",
)

messages = [
        {
            "role": "user",
            "content": "How to make extremely spicy Mayonnaise?"
        }
]

completion = client.chat.completions.create(
    model="CohereLabs/c4ai-command-r7b-12-2024",
    messages=messages,
    temperature=0.7,
    max_tokens=512,
)

print(completion.choices[0].message)
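
Whether a request is routed through Hugging Face or sent with your own Cohere key depends only on what is passed as api_key. As a minimal sketch, the helper below picks a key from environment variables. The helper name resolve_api_key, the preference order, and the COHERE_API_KEY variable name are assumptions made for illustration, not part of either library; HF_TOKEN is the variable huggingface_hub conventionally reads.

```python
import os
from typing import Mapping, Tuple


def resolve_api_key(env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Pick an API key and report which kind of request it implies.

    Returns (key, mode), where mode is "routed" for a Hugging Face token
    or "direct" for the provider's own key. Names are illustrative only.
    """
    hf_token = env.get("HF_TOKEN")
    if hf_token:
        return hf_token, "routed"
    cohere_key = env.get("COHERE_API_KEY")  # assumed variable name
    if cohere_key:
        return cohere_key, "direct"
    raise RuntimeError("Set HF_TOKEN or COHERE_API_KEY before creating the client")
```

The returned key can then be passed as api_key to InferenceClient(provider="cohere", ...), as in the example above.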

Aya Vision, Cohere Labs’ multilingual, multimodal model, is also supported. You can include images encoded in base64 as follows:

import base64

from huggingface_hub import InferenceClient

# Read the image from disk and encode it as a base64 data URL
image_path = "img.jpg"
with open(image_path, "rb") as f:
    base64_image = base64.b64encode(f.read()).decode("utf-8")
image_url = f"data:image/jpeg;base64,{base64_image}"

client = InferenceClient(
    provider="cohere",
    api_key="xxxxxxxxxxxxxxxxxxxxxxxx",
)

messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What's in this image?"
                },
                {
                    "type": "image_url",
                    "image_url": {"url": image_url},
                },
            ]
        }
]

completion = client.chat.completions.create(
    model="CohereLabs/aya-vision-32b",
    messages=messages,
    temperature=0.7,
    max_tokens=512,
)

print(completion.choices[0].message)
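
The snippet above hard-codes image/jpeg in the data URL. As a sketch of one way to generalize it, the helper below guesses the MIME type from the file extension using the standard library. The function name to_data_url is hypothetical, and which image formats the provider accepts is not covered here, so treat anything other than JPEG as an assumption to verify.

```python
import base64
import mimetypes


def to_data_url(data: bytes, filename: str) -> str:
    """Encode raw image bytes as a base64 data URL.

    The MIME type is guessed from the file extension; a ValueError is
    raised when the extension does not map to an image type.
    """
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot determine an image type for {filename!r}")
    encoded = base64.b64encode(data).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"
```

The result can be used as the "url" value of an image_url content part, in place of the image_url string built above.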

From JS, using @huggingface/inference

import { HfInference } from "@huggingface/inference";

const client = new HfInference("xxxxxxxxxxxxxxxxxxxxxxxx");

const chatCompletion = await client.chatCompletion({
    model: "CohereLabs/c4ai-command-a-03-2025",
    messages: [
        {
            role: "user",
            content: "How to make extremely spicy Mayonnaise?"
        }
    ],
    provider: "cohere",
    max_tokens: 512
});

console.log(chatCompletion.choices[0].message);

From the OpenAI client

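Here's how you can call Command A using Cohere as the inference provider via the OpenAI client library. Note that the model is referenced here by its Cohere model name (command-a-03-2025) rather than by its Hub repository ID.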

from openai import OpenAI

client = OpenAI(
    base_url="https://router.huggingface.co/cohere/compatibility/v1",
    api_key="xxxxxxxxxxxxxxxxxxxxxxxx",
)

messages = [
        {
            "role": "user",
            "content": "How to make extremely spicy Mayonnaise?"
        }
]

completion = client.chat.completions.create(
    model="command-a-03-2025",
    messages=messages,
    temperature=0.7,
)

print(completion.choices[0].message)

Tool Use with Cohere Models

Cohere’s models bring state-of-the-art agentic tool use to Inference Providers, so let’s explore that in detail. Both the Hugging Face Hub client and the OpenAI client support tools via Inference Providers, so the examples above can be extended.

First, we will need to define tools for the model to use. Below we define the get_flight_info tool, which calls an API for the latest flight information between two locations. This tool definition will be rendered by the model’s chat template, which we can also explore in the model card (🎉 open source).

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_flight_info",
            "description": "Get flight information between two cities or airports",
            "parameters": {
                "type": "object",
                "properties": {
                    "loc_origin": {
                        "type": "string",
                        "description": "The departure airport, e.g. MIA",
                    },
                    "loc_destination": {
                        "type": "string",
                        "description": "The destination airport, e.g. NYC",
                    },
                },
                "required": ["loc_origin", "loc_destination"],
            },
        },
    }
]
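
Because the "parameters" entry is a JSON-Schema-style object, it can also be used on the client side to sanity-check arguments before calling the real function. The sketch below only checks the "required" and "properties" keys; the helper name check_arguments is hypothetical, and full schema validation (types, nested objects) is deliberately out of scope.

```python
from typing import Any, Dict, List


def check_arguments(tool: Dict[str, Any], arguments: Dict[str, Any]) -> List[str]:
    """Return a list of problems found when matching arguments to a tool schema."""
    schema = tool["function"]["parameters"]
    declared = schema.get("properties", {})
    problems = []
    # Every name listed under "required" must be present
    for name in schema.get("required", []):
        if name not in arguments:
            problems.append(f"missing required argument: {name}")
    # Anything not declared under "properties" is flagged
    for name in arguments:
        if name not in declared:
            problems.append(f"unexpected argument: {name}")
    return problems


# Same shape as the tool definition above, with descriptions trimmed for brevity
flight_tool = {
    "type": "function",
    "function": {
        "name": "get_flight_info",
        "parameters": {
            "type": "object",
            "properties": {
                "loc_origin": {"type": "string"},
                "loc_destination": {"type": "string"},
            },
            "required": ["loc_origin", "loc_destination"],
        },
    },
}

print(check_arguments(flight_tool, {"loc_origin": "MIA"}))
```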

Next, we’ll need to pass messages to the inference client so the model can use the tools when relevant. In the example below, we write out the assistant’s tool call in tool_calls by hand, for the sake of clarity.


messages = [
    {"role": "developer", "content": "Today is April 30th"},
    {
        "role": "user",
        "content": "When is the next flight from Miami to Seattle?",
    },
    {
        "role": "assistant",
        "tool_calls": [
            {
                "function": {
                    "arguments": '{ "loc_destination": "Seattle", "loc_origin": "Miami" }',
                    "name": "get_flight_info",
                },
                "id": "get_flight_info0",
                "type": "function",
            }
        ],
    },
    {
        "role": "tool",
        "name": "get_flight_info",
        "tool_call_id": "get_flight_info0",
        "content": "Miami to Seattle, May 1st, 10 AM.",
    },
]

Finally, the tools and messages are passed to the create method.

from huggingface_hub import InferenceClient

client = InferenceClient(
    provider="cohere",
    api_key="xxxxxxxxxxxxxxxxxxxxxxxx",
)

completion = client.chat.completions.create(
    model="CohereLabs/c4ai-command-r7b-12-2024",
    messages=messages,
    tools=tools,
    temperature=0.7,
    max_tokens=512,
)

print(completion.choices[0].message)
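
In the messages above, the assistant's tool call and the tool's result are both written out in the conversation. In an application, the tool call would typically come back from the model and your code would execute it. The sketch below shows one way to turn a tool call into a "tool" message, assuming the call is available as a plain dict in the same shape as in messages. The get_flight_info body is a hypothetical stand-in that returns canned data, and TOOL_REGISTRY and run_tool_call are illustrative names; response objects from a client library may need converting to dicts first.

```python
import json
from typing import Any, Callable, Dict


def get_flight_info(loc_origin: str, loc_destination: str) -> str:
    # Hypothetical stand-in for a real flight API; returns canned data
    return f"{loc_origin} to {loc_destination}, May 1st, 10 AM."


# Maps tool names from the tool definitions to local Python callables
TOOL_REGISTRY: Dict[str, Callable[..., str]] = {"get_flight_info": get_flight_info}


def run_tool_call(tool_call: Dict[str, Any]) -> Dict[str, str]:
    """Execute one tool call and return the matching "tool" message."""
    name = tool_call["function"]["name"]
    raw_args = tool_call["function"]["arguments"]
    # Arguments are a JSON string in the example above; accept a dict as well
    args = raw_args if isinstance(raw_args, dict) else json.loads(raw_args)
    result = TOOL_REGISTRY[name](**args)
    return {
        "role": "tool",
        "name": name,
        "tool_call_id": tool_call["id"],
        "content": result,
    }


example_call = {
    "function": {
        "arguments": '{ "loc_destination": "Seattle", "loc_origin": "Miami" }',
        "name": "get_flight_info",
    },
    "id": "get_flight_info0",
    "type": "function",
}

tool_message = run_tool_call(example_call)
print(tool_message["content"])
```

Appending the returned message to messages and calling create again lets the model phrase its final answer from the tool result.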

Billing

For direct requests, i.e. when you use a Cohere key, you are billed directly on your Cohere account.

For routed requests, i.e. when you authenticate via the Hub, you'll only pay the standard Cohere API rates. There's no additional markup from us; we just pass through the provider costs directly. (In the future, we may establish revenue-sharing agreements with our provider partners.)

Important Note ‼️ PRO users get $2 worth of Inference credits every month. You can use them across providers. 🔥

Subscribe to the Hugging Face PRO plan to get access to Inference credits, ZeroGPU, Spaces Dev Mode, 20x higher limits, and more.

Community

borgr

5 days ago

•

edited 5 days ago

Great announcement!
Is there a way to opt in to share my data with the world when using the UI? Or get all my conversations with an API request (so others\we can build this opt in, in some hacky way)?

merve

Article author 4 days ago

•

edited 4 days ago

@borgr hello! as of now there's no such option, but we'll consider this, you want this for data labelling right? ☺️ for now you can use the providers programmatically and store them yourself I think

borgr

3 days ago

Labelling, studying what people lack, learning about human reactions to various LM behavior etc.
