Update README.md

Krikri is built on top of [Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B).
# Model Information

- Vocabulary extension of the Llama-3.1 tokenizer with Greek tokens
- 128k context length (approximately 80,000 Greek words)
- We extend the pretraining of Llama-3.1-8B with added proficiency for the Greek language by utilizing a large training corpus.
  * This corpus includes 56.7 billion monolingual Greek tokens, constructed from publicly available resources.
  * Additionally, to mitigate catastrophic forgetting and ensure that the model has bilingual capabilities, we use additional sub-corpora with monolingual English texts (21 billion tokens) and Greek-English parallel data (5.5 billion tokens).
  * The training corpus also contains 7.8 billion math and code tokens.
  * This corpus has been processed, filtered, and deduplicated to ensure data quality and is outlined below:

| Sub-corpus | # Tokens | Percentage |
|------------|----------|------------|
| Greek      | 56.7 B   | 62.3 %     |
| English    | 21.0 B   | 23.1 %     |
| Parallel   | 5.5 B    | 6.0 %      |
| Math/Code  | 7.8 B    | 8.6 %      |
| **Total**  | 91 B     | **100%**   |

Chosen subsets of the 91-billion-token corpus were upsampled, resulting in a size of **110 billion tokens**.

🚨 **More information on the post-training corpus and methodology coming soon.** 🚨

# How to use

A minimal sketch, assuming the model is served behind an OpenAI-compatible endpoint (e.g. a vLLM server); `base_url`, `api_key`, and the model name are deployment-specific placeholders:

```python
from openai import OpenAI

base_url = "http://localhost:8000/v1"  # placeholder: adjust to your deployment

client = OpenAI(
    api_key="EMPTY",  # placeholder: many OpenAI-compatible servers ignore the key
    base_url=base_url,
)

# "You are an advanced translation system that answers with Python lists."
system_prompt = "Είσαι ένα ανεπτυγμένο μεταφραστικό σύστημα που απαντάει με λίστες Python."
# "Give me the following list with each of its strings translated into Greek: [...]"
user_prompt = "Δώσε μου την παρακάτω λίστα με μεταφρασμένο κάθε string της στα ελληνικά: ['Ethics of duty', 'Postmodern ethics', 'Consequentialist ethics', 'Utilitarian ethics', 'Deontological ethics', 'Virtue ethics', 'Relativist ethics']"

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt},
]

response = client.chat.completions.create(model="Llama-Krikri-8B-Instruct", messages=messages)
print(response.choices[0].message.content)
```

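Since the system prompt instructs the model to answer with a Python list, the reply can be parsed directly with the standard library. The reply string below is illustrative only, not an actual model output:

```python
import ast

# Illustrative reply string in the format the system prompt requests.
reply = "['Ηθική του καθήκοντος', 'Μεταμοντέρνα ηθική', 'Ηθική της αρετής']"

# ast.literal_eval safely parses Python literals without executing code.
translated = ast.literal_eval(reply)
assert isinstance(translated, list) and all(isinstance(s, str) for s in translated)
print(len(translated))  # 3
```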
# Evaluation

## Greek Benchmarks

The evaluation suite we created for the Greek language includes 6 test sets. You can run the suite by cloning this [lighteval fork](https://github.com/LeonVouk/lighteval).

Our evaluation suite includes:
* Four machine-translated versions ([ARC Greek](https://huggingface.co/datasets/ilsp/arc_greek), [Truthful QA Greek](https://huggingface.co/datasets/ilsp/truthful_qa_greek), [HellaSwag Greek](https://huggingface.co/datasets/ilsp/hellaswag_greek), [MMLU Greek](https://huggingface.co/datasets/ilsp/mmlu_greek)) of established English benchmarks for language understanding and reasoning ([ARC Challenge](https://arxiv.org/abs/1803.05457), [Truthful QA](https://arxiv.org/abs/2109.07958), [HellaSwag](https://arxiv.org/abs/1905.07830), [MMLU](https://arxiv.org/abs/2009.03300)).
* An existing benchmark for question answering in Greek ([Belebele](https://arxiv.org/abs/2308.16884)).
* A novel benchmark created by the ILSP team for medical question answering, based on the medical exams of [DOATAP](https://www.doatap.gr) ([Medical MCQA](https://huggingface.co/datasets/ilsp/medical_mcqa_greek)).

Our evaluation of Llama-Krikri-8B is performed in a few-shot setting, consistent with the settings of the [Open LLM leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard). Our training enhances performance across all Greek test sets by an average improvement of **+10.8%**. The results for the Greek test sets are shown in the following table:

|                          | Medical MCQA EL (15-shot) | Belebele EL (5-shot) | HellaSwag EL (10-shot) | ARC-Challenge EL (25-shot) | TruthfulQA MC2 EL (0-shot) | MMLU EL (5-shot) | Average |
|--------------------------|---------------------------|----------------------|------------------------|----------------------------|----------------------------|------------------|---------|
| Meltemi-7B-Instruct-v1.5 | --% | --% | --% | --% | --% | --% | --% |
| Llama-3.1-8B-Instruct    | --% | --% | --% | --% | --% | --% | --% |
| Llama-Krikri-8B-Instruct | **38.5%** | **86.5%** | **61.0%** | **45.11%** | **53.7%** | **50.0%** | **56.7%** |

## English Benchmarks

|                          | Winogrande (5-shot) | Belebele (5-shot) | HellaSwag (10-shot) | ARC-Challenge (25-shot) | TruthfulQA MC2 (0-shot) | MMLU (5-shot) | Average |
|--------------------------|---------------------|-------------------|---------------------|-------------------------|-------------------------|---------------|---------|
| Meltemi-7B-Instruct-v1.5 | --% | --% | --% | --% | --% | --% | --% |
| Llama-3.1-8B-Instruct    | --% | --% | --% | --% | --% | --% | --% |
| Llama-Krikri-8B-Instruct | 71.9% | **89.5%** | 78.1% | 55.9% | **51.3%** | 63.0% | **69.3%** |

Please note that all evaluations were run with the latest version of lighteval, which differs in some respects from earlier versions. This is why we report scores for Meltemi-7B-v1.5-Instruct that differ from previously reported results.

🚨 **Instruction following and chat capability evaluation benchmarks coming soon.** 🚨

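The shot counts above (e.g. 5-shot, 25-shot) mean each test question is preceded by that many solved examples in the prompt. A minimal, illustrative sketch of how such a prompt is assembled (not the exact lighteval format):

```python
def build_few_shot_prompt(examples, query):
    # Each example is a solved (question, answer) pair; the final query is
    # left unanswered for the model to complete.
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{shots}\n\nQ: {query}\nA:"

# A 2-shot prompt built from two solved examples plus the test question.
prompt = build_few_shot_prompt([("2+2", "4"), ("3+3", "6")], "4+4")
print(prompt)
```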
# Acknowledgements