After diving into the latest benchmark results, it’s clear that Meta’s new Llama 4 lineup (Maverick, Scout, and Behemoth) is no joke.
Here are a few standout highlights 🔍:
Llama 4 Maverick hits the sweet spot between cost and performance:
- Outperforms GPT-4o in image tasks like ChartQA (90.0 vs 85.7) and DocVQA (94.4 vs 92.8)
- Beats others in MathVista and MMLU Pro too, and at a fraction of the cost ($0.19–$0.49 vs $4.38 🤯)
Llama 4 Scout is lean, cost-efficient, and surprisingly capable:
- Strong performance across image and language tasks (e.g. ChartQA: 88.8, DocVQA: 94.4)
- More affordable than most competitors and still beats out larger models like Gemini 2.0 Flash-Lite
Llama 4 Behemoth is the heavy hitter:
- Tops the charts in LiveCodeBench (49.4), MATH-500 (95.0), and MMLU Pro (82.2)
- Even edges out Claude 3 Sonnet and Gemini 2 Pro in multiple areas
Meta didn’t just show up; they delivered across multimodal, coding, reasoning, and multilingual benchmarks.
And honestly? Seeing this level of performance, especially at lower inference costs, is a big deal for anyone building on LLMs.
Curious to see how these models do in real-world apps next.
Detect hallucinations in answers based on context and questions using ModernBERT with 8192-token context support!
### Model Details
- **Model Name**: [lettucedect-large-modernbert-en-v1](KRLabsOrg/lettucedect-large-modernbert-en-v1)
- **Organization**: [KRLabsOrg](KRLabsOrg)
- **Github**: [https://github.com/KRLabsOrg/LettuceDetect](https://github.com/KRLabsOrg/LettuceDetect)
- **Architecture**: ModernBERT (Large) with extended context support up to 8192 tokens
- **Task**: Token Classification / Hallucination Detection
- **Training Dataset**: [RagTruth](wandb/RAGTruth-processed)
- **Language**: English
- **Capabilities**: Detects hallucinated spans in answers, provides confidence scores, and calculates average confidence across detected spans.
LettuceDetect excels at processing long documents to determine if an answer aligns with the provided context, making it a powerful tool for ensuring factual accuracy.
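Here is a minimal sketch of how you could run span-level detection, based on the usage documented in the LettuceDetect repo; treat the `HallucinationDetector` class, its arguments, and the example texts as assumptions and check the README if the interface has changed.

```python
# Minimal sketch: span-level hallucination detection with LettuceDetect
# (pip install lettucedetect). The interface below follows the repo's
# documented usage; double-check the README if it has moved.
from lettucedetect.models.inference import HallucinationDetector

detector = HallucinationDetector(
    method="transformer",
    model_path="KRLabsOrg/lettucedect-large-modernbert-en-v1",
)

contexts = ["France is a country in Europe. The capital of France is Paris. The population of France is 67 million."]
question = "What is the capital of France? What is the population of France?"
answer = "The capital of France is Paris. The population of France is 69 million."

# Returns the hallucinated spans of the answer along with confidence scores.
predictions = detector.predict(
    contexts=contexts, question=question, answer=answer, output_format="spans"
)
print(predictions)
```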
Hugging Face just launched the AI Agents Course – a free journey from beginner to expert in AI agents!
- Learn AI Agent fundamentals, use cases and frameworks
- Use top libraries like LangChain & LlamaIndex
- Compete in challenges & earn a certificate
- Hands-on projects & real-world applications
🙋🏻‍♂️ Hey there folks, Open LLM Europe just released the Lucie 7B-Instruct model, a bilingual instruct model trained on open data! You can check out my unofficial demo here while we wait for the official inference API from the group: Tonic/Lucie-7B. Hope you like it 🚀
- (P1, P2): P2 has a shorter survival time and a higher risk score → Concordant ✅
- (P1, P3): P3 has a longer survival time and a lower risk score → Concordant ✅
- (P2, P3): P3 has a longer survival time and a lower risk score → Concordant ✅

Total pairs = 3
Total concordant pairs = 3

C-index for Group A = Concordant pairs / Total pairs = 3/3 = 1.0
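As a quick sanity check, here is the same pairwise count in Python; the survival times and risk scores for P1–P3 are made-up placeholders chosen so the three pairs come out concordant, exactly as above.

```python
from itertools import combinations

# Hypothetical Group A data: patient -> (survival_time, risk_score).
patients = {
    "P1": (5.0, 0.9),
    "P2": (3.0, 1.2),   # shorter survival, higher risk than P1
    "P3": (8.0, 0.4),   # longer survival, lower risk than P1 and P2
}

concordant, total = 0, 0
for (_, (t_a, r_a)), (_, (t_b, r_b)) in combinations(patients.items(), 2):
    if t_a == t_b:
        continue  # tied survival times are ignored in this simple version
    total += 1
    # Concordant: the patient who survives less time has the higher risk score.
    if (t_a < t_b and r_a > r_b) or (t_b < t_a and r_b > r_a):
        concordant += 1

print(concordant, total, concordant / total)  # 3 3 1.0
```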
Step 2: Calculate C-index for All Groups
Repeat the process for all groups. For now we can assume:

- Group A: C-index = 1.0
- Group B: C-index = 0.8
- Group C: C-index = 0.6

Step 3: Stratified Concordance Index
The Stratified Concordance Index combines the C-index scores of all groups, focusing on:

- Average performance across groups (mean of C-indices).
- Consistency across groups (low standard deviation of C-indices).

Formula: Stratified C-index = Mean(C-index scores) - Standard Deviation(C-index scores)
Calculate the mean: Mean = (1.0 + 0.8 + 0.6) / 3 = 0.8

Calculate the standard deviation: Standard Deviation = sqrt(((1.0-0.8)^2 + (0.8-0.8)^2 + (0.6-0.8)^2) / 3) ≈ 0.16

Stratified C-index = 0.8 - 0.16 = 0.64
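The same arithmetic in a few lines of Python (statistics.pstdev is the population standard deviation used above):

```python
from statistics import mean, pstdev

# Per-group C-index values assumed in Step 2.
c_indices = [1.0, 0.8, 0.6]

mu = mean(c_indices)       # 0.8
sigma = pstdev(c_indices)  # ~0.163

stratified_c_index = mu - sigma
print(round(stratified_c_index, 2))  # ~0.64
```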
If you can record the problem and share it there, or on the forums in your own post, please don't be shy; I'm not sure, but I do think it helps 🤗🤗🤗
Boomers still pick zenodo.org instead of huggingface??? Absolutely clownish nonsense, my random datasets have 30x more downloads and views than front page zenodos... gonna write a comparison blog, but yeah... cringe.
Easy steps for an effective RAG pipeline with LLMs!

1. Document Embedding & Indexing
We can start by using embedding models to vectorize documents and store them in vector databases (Elasticsearch, Pinecone, Weaviate) for efficient retrieval.
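A minimal sketch of this step, assuming sentence-transformers and a plain in-memory index; the model name and documents are placeholders, and in practice the vectors would go into Elasticsearch, Pinecone, or Weaviate.

```python
# Step 1 sketch: embed document chunks and keep them in an in-memory index.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # example model choice

chunks = [
    "LettuceDetect flags hallucinated spans in RAG answers.",
    "ModernBERT supports sequences of up to 8192 tokens.",
    "Llama 4 Maverick targets a strong cost/performance trade-off.",
]

# Normalized embeddings so a plain dot product equals cosine similarity later.
index = embedder.encode(chunks, normalize_embeddings=True)
print(index.shape)  # (3, 384) for this model
```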
2. Smart Querying
Then we can generate query embeddings, retrieve the top-K relevant chunks, and apply hybrid search if needed for better precision.
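A sketch of the retrieval step, continuing from the Step 1 snippet (it reuses `embedder`, `chunks`, and `index`); the query and K are just examples, and hybrid search would add a keyword score on top.

```python
# Step 2 sketch: embed the query and pull the top-K most similar chunks.
import numpy as np

query = "How long a context can ModernBERT handle?"
q_vec = embedder.encode([query], normalize_embeddings=True)[0]

scores = index @ q_vec                # cosine similarity (vectors are normalized)
top_k = np.argsort(scores)[::-1][:2]  # indices of the 2 best-scoring chunks

retrieved = [chunks[i] for i in top_k]
print(retrieved)
```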
3. Context Management
We can concatenate the retrieved chunks, optimize their order, and stay within token limits to preserve response coherence.
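A rough sketch of the token-budget logic; whitespace splitting stands in for real tokenization here, so swap in your model's tokenizer for exact counts.

```python
# Step 3 sketch: concatenate retrieved chunks, in relevance order, under a token budget.
def build_context(chunks: list[str], max_tokens: int = 512) -> str:
    kept, used = [], 0
    for chunk in chunks:          # assumes chunks arrive ranked by relevance
        n = len(chunk.split())    # crude token estimate
        if used + n > max_tokens:
            break                 # stop before exceeding the budget
        kept.append(chunk)
        used += n
    return "\n\n".join(kept)

retrieved = [
    "ModernBERT supports sequences of up to 8192 tokens.",
    "LettuceDetect flags hallucinated spans in RAG answers.",
]
context = build_context(retrieved, max_tokens=512)
print(context)
```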
4. Prompt Engineering
Then we can instruct the LLM to leverage the retrieved context, using clear instructions that prioritize the provided information.
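One possible prompt template; the exact wording is just an example, not a standard.

```python
# Step 4 sketch: tell the model to answer from the retrieved context only.
question = "How long a context can ModernBERT handle?"
context = "ModernBERT supports sequences of up to 8192 tokens."

prompt = f"""Answer the question using ONLY the context below.
If the context does not contain the answer, say you don't know.

Context:
{context}

Question: {question}
Answer:"""
print(prompt)
```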
5. Post-Processing
Finally, we can implement response verification and fact-checking, and integrate feedback loops to refine the responses.
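As a sketch of the idea, here is a naive lexical check that flags answer sentences with little word overlap with the retrieved context; a real pipeline would use an NLI model or a span-level detector such as LettuceDetect instead.

```python
# Step 5 sketch: flag answer sentences that share few words with the context.
def flag_unsupported(answer: str, context: str, threshold: float = 0.5) -> list[str]:
    context_words = set(context.lower().split())
    flagged = []
    for sentence in answer.split("."):
        words = set(sentence.lower().split())
        if not words:
            continue
        overlap = len(words & context_words) / len(words)
        if overlap < threshold:
            flagged.append(sentence.strip())
    return flagged

context = "The capital of France is Paris. Its population is about 67 million."
answer = "The capital of France is Paris. It recently hosted the first lunar olympics."
print(flag_unsupported(answer, context))  # flags the second sentence for review
```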