LVouk committed · verified
Commit cea07d0 · 1 Parent(s): 8489343

Update README.md

Files changed (1): README.md (+16 −40)

README.md CHANGED
@@ -18,26 +18,30 @@ Krikri is built on top of [Llama-3.1-8B](https://huggingface.co/meta-llama/Llama
 
 ![image/png](llama-krikri-image.jpg)
 
 # Model Information
 
 - Vocabulary extension of the Llama-3.1 tokenizer with Greek tokens
- - 128k context length
 - We extend the pretraining of Llama-3.1-8B with added proficiency for the Greek language, by utilizing a large training corpus.
- * This corpus includes 55 billion monolingual Greek tokens, constructed from publicly available resources.
- * Additionaly, to mitigate catastrophic forgetting and ensure that the model has bilingual capabilities, we use additional sub-corpora with monolingual English texts (23,3 billion tokens) and Greek-English parallel data (5,26 billion tokens).
- * The training corpus also contains 6 billion math and code tokens.
 * This corpus has been processed, filtered, and deduplicated to ensure data quality and is outlined below:
 
 | Sub-corpus | # Tokens | Percentage |
 |-----------|------------------|------------|
- | Greek | 55,097,452,359 | 61.4% |
- | English | 23,340,749,356 | 26.0% |
- | Parallel | 5,262,998,873 | 6.0% |
- | Math/Code | 5,951,964,497 | 6.6% |
- | **Total** | **89,653,165,085** | **100%** |
 
- Chosen subsets of the 89.65 billion corpus were upsampled resulting in a size of **110 billion tokens**.
 
 # How to use
@@ -88,7 +92,7 @@ client = OpenAI(
     base_url=base_url,
 )
 
- system_prompt = "Είσαι ένα ανεπτυγμένο μεταφραστικό σύστημα που απαντάει απευθείας με λίστες Python."
 user_prompt = "Δώσε μου την παρακάτω λίστα με μεταφρασμένο κάθε string της στα ελληνικά: ['Ethics of duty', 'Postmodern ethics', 'Consequentialist ethics', 'Utilitarian ethics', 'Deontological ethics', 'Virtue ethics', 'Relativist ethics']"
 
 messages = [
@@ -103,35 +107,7 @@ print(response.choices[0].message.content)
 
 # Evaluation
 
-
- ## Greek Benchmarks
-
- The evaluation suite we created for the Greek language includes 6 test sets. You can run the suite by cloning this [lighteval fork](https://github.com/LeonVouk/lighteval).
-
- Our evaluation suite includes:
- * Four machine-translated versions ([ARC Greek](https://huggingface.co/datasets/ilsp/arc_greek), [Truthful QA Greek](https://huggingface.co/datasets/ilsp/truthful_qa_greek), [HellaSwag Greek](https://huggingface.co/datasets/ilsp/hellaswag_greek), [MMLU Greek](https://huggingface.co/datasets/ilsp/mmlu_greek)) of established English benchmarks for language understanding and reasoning ([ARC Challenge](https://arxiv.org/abs/1803.05457), [Truthful QA](https://arxiv.org/abs/2109.07958), [Hellaswag](https://arxiv.org/abs/1905.07830), [MMLU](https://arxiv.org/abs/2009.03300)).
- * An existing benchmark for question answering in Greek ([Belebele](https://arxiv.org/abs/2308.16884))
- * A novel benchmark created by the ILSP team for medical question answering based on the medical exams of [DOATAP](https://www.doatap.gr) ([Medical MCQA](https://huggingface.co/datasets/ilsp/medical_mcqa_greek)).
-
- Our evaluation for Llama-Krikri-8B is performed in a few-shot setting, consistent with the settings in the [Open LLM leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard). We can see that our training enhances performance across all Greek test sets by a **+10.8%** average improvement. The results for the Greek test sets are shown in the following table:
-
- | | Medical MCQA EL (15-shot) | Belebele EL (5-shot) | HellaSwag EL (10-shot) | ARC-Challenge EL (25-shot) | TruthfulQA MC2 EL (0-shot) | MMLU EL (5-shot) | Average |
- |--------------------------|---------------------------|----------------------|------------------------|----------------------------|----------------------------|------------------|---------|
- | Meltemi-7B-Instruct-v1.5 | % | % | % | % | % | % | % |
- | Llama-3.1-8B-Instruct | % | % | % | % | % | % | % |
- | Llama-Krikri-8B-Instruct | **38.5%** | **86.5%** | **61.0%** | **45.11%** | **53.7%** | **50.0%** | **56.7%** |
-
-
- ## English Benchmarks
-
- | | Winogrande (5-shot) | Belebele (5-shot) | HellaSwag (10-shot) | ARC-Challenge (25-shot) | TruthfulQA MC2 (0-shot) | MMLU (5-shot) | Average |
- |--------------------------|---------------------|-------------------|---------------------|-------------------------|-------------------------|---------------|---------|
- | Meltemi-7B-Instruct-v1.5 | --% | --% | --% | --% | --% | --% | --% |
- | Llama-3.1-8B-Instruct | --% | --% | --% | --% | --% | --% | --% |
- | Llama-Krikri-8B-Instruct | 71.9% | **89.5%** | 78.1% | 55.9% | **51.3%** | 63.0% | **69.3%** |
-
- Please note that all evaluations were run with the latest version of lighteval, which has some differences from past versions. This is why we report different scores for Meltemi-7B-v1.5-Instruct compared to previous reported results.
-
 
 # Acknowledgements
 
 
 
 ![image/png](llama-krikri-image.jpg)
 
+
 # Model Information
 
 - Vocabulary extension of the Llama-3.1 tokenizer with Greek tokens
+ - 128k context length (approximately 80,000 Greek words)
 - We extend the pretraining of Llama-3.1-8B with added proficiency for the Greek language, by utilizing a large training corpus.
+ * This corpus includes 56.7 billion monolingual Greek tokens, constructed from publicly available resources.
+ * Additionally, to mitigate catastrophic forgetting and ensure that the model has bilingual capabilities, we use additional sub-corpora with monolingual English texts (21 billion tokens) and Greek-English parallel data (5.5 billion tokens).
+ * The training corpus also contains 7.8 billion math and code tokens.
 * This corpus has been processed, filtered, and deduplicated to ensure data quality and is outlined below:
 
 | Sub-corpus | # Tokens | Percentage |
 |-----------|------------------|------------|
+ | Greek | 56.7 B | 62.3% |
+ | English | 21.0 B | 23.1% |
+ | Parallel | 5.5 B | 6.0% |
+ | Math/Code | 7.8 B | 8.6% |
+ | **Total** | **91.0 B** | **100%** |
+
 
+ Chosen subsets of the 91 billion token corpus were upsampled, resulting in a size of **110 billion tokens**.
+
+ 🚨 **More information on the post-training corpus and methodology coming soon.** 🚨
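As a sanity check, the percentages in the corpus table above follow directly from the quoted token counts, and the upsampling note works out to roughly a 1.2x average factor (a minimal sketch using only the figures from the table):

```python
# Token counts in billions, taken from the training-corpus table above.
corpus = {"Greek": 56.7, "English": 21.0, "Parallel": 5.5, "Math/Code": 7.8}

total = sum(corpus.values())
shares = {name: round(100 * tokens / total, 1) for name, tokens in corpus.items()}

print(round(total, 1))  # 91.0 (billion tokens)
print(shares)           # {'Greek': 62.3, 'English': 23.1, 'Parallel': 6.0, 'Math/Code': 8.6}

# Upsampling chosen subsets to 110 billion tokens is ~1.21x on average.
print(round(110.0 / total, 2))  # 1.21
```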
 
 # How to use
 
     base_url=base_url,
 )
 
+ system_prompt = "Είσαι ένα ανεπτυγμένο μεταφραστικό σύστημα που απαντάει με λίστες Python."
 user_prompt = "Δώσε μου την παρακάτω λίστα με μεταφρασμένο κάθε string της στα ελληνικά: ['Ethics of duty', 'Postmodern ethics', 'Consequentialist ethics', 'Utilitarian ethics', 'Deontological ethics', 'Virtue ethics', 'Relativist ethics']"
 
 messages = [
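The new `system_prompt` says, in English, "You are an advanced translation system that replies with Python lists." Since the snippet is truncated in this diff, here is a minimal self-contained sketch of the request it ends up building; the endpoint URL and model id are placeholder assumptions (not stated in this README), the prompt list is shortened, and the actual network call is left commented out:

```python
import json

# Placeholder assumptions: a local OpenAI-compatible server and a model id.
base_url = "http://localhost:8000/v1"
model_id = "ilsp/Llama-Krikri-8B-Instruct"

# Greek prompts from the snippet above (English gloss in comments).
# "You are an advanced translation system that replies with Python lists."
system_prompt = "Είσαι ένα ανεπτυγμένο μεταφραστικό σύστημα που απαντάει με λίστες Python."
# "Give me the following list with each of its strings translated into Greek: [...]"
user_prompt = (
    "Δώσε μου την παρακάτω λίστα με μεταφρασμένο κάθε string της στα ελληνικά: "
    "['Ethics of duty', 'Postmodern ethics', 'Consequentialist ethics']"
)

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt},
]

# JSON body an OpenAI-compatible client POSTs to {base_url}/chat/completions.
payload = json.dumps({"model": model_id, "messages": messages}, ensure_ascii=False)

# With the `openai` package installed and a server running, the equivalent call is:
#   client = OpenAI(api_key="EMPTY", base_url=base_url)
#   response = client.chat.completions.create(model=model_id, messages=messages)
#   print(response.choices[0].message.content)
print(json.loads(payload)["messages"][0]["role"])  # system
```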
 
 
 # Evaluation
 
+ 🚨 **Instruction following and chat capability evaluation benchmarks coming soon.** 🚨
 
 # Acknowledgements