LVouk committed · Commit 3e64cfb · verified · 1 Parent(s): 5d4863b

Update README.md

Files changed (1):
  1. README.md +40 -39

README.md CHANGED
@@ -83,45 +83,6 @@ To build the SFT & DPO data, we utilized various methodologies including:
  - Creating data for sentence and document translation using high-quality parallel corpora mainly from [ELRC-SHARE](https://elrc-share.eu/).
  - Synthetically extracting question-answer pairs and multi-turn dialogues from diverse sources such as Wikipedia, EUR-LEX, Greek School Books, and Kallipos.

- # Evaluation
-
- In the table below, we report the scores for our chat evaluation suite, which includes:
- - [Greek IFEval](https://huggingface.co/datasets/ilsp/ifeval_greek) (strict)
- - [English IFEval](https://huggingface.co/datasets/google/IFEval) (strict)
- - [Greek MT-Bench](https://huggingface.co/datasets/ilsp/mt-bench-greek) using gpt-4o-2024-08-06 as the judge model.
- - [English MT-Bench](https://huggingface.co/datasets/HuggingFaceH4/mt_bench_prompts) using gpt-4o-2024-08-06 as the judge model.
-
- We can observe that *Llama-Krikri-8B-Instruct exhibits the strongest performance* in instruction following for both Greek and English across all the models we tested. In particular, it surpasses Llama-3.1-8B-Instruct by **+21.7** and **+7.3** percentage points on the Greek and English IFEval respectively.
- It also exhibits **the strongest chat capabilities on the Greek MT-Bench** (+0.28 compared to Aya Expanse 8B), while also being very competitive on the English variant of MT-Bench.
-
- | Model | IFEval EL (strict) | IFEval EN (strict) | MT-Bench EL | MT-Bench EN |
- |------------------------------|--------------------|--------------------|-------------|-------------|
- | Qwen 2.5 7B Instruct | 46.2% | 74.8% | 5.83 | **7.87** |
- | EuroLLM 9B Instruct | 51.3% | 64.5% | 5.98 | 6.27 |
- | Aya Expanse 8B | 50.4% | 62.2% | 7.68 | 6.92 |
- | Meltemi 7B v1.5 Instruct | 32.7% | 41.2% | 6.25 | 5.46 |
- | Llama-3.1-8B Instruct | 45.8% | 75.1% | 6.46 | 7.25 |
- | **Llama-Krikri-8B Instruct** | **67.5%** | **82.4%** | **7.96** | 7.21 |
-
-
- We also used the [Arena-Hard-Auto](https://huggingface.co/datasets/lmarena-ai/arena-hard-auto-v0.1) automatic evaluation tool, as well as the translated (and post-edited) version for Greek that is publicly available [here](https://huggingface.co/datasets/ilsp/m-ArenaHard_greek). We report two scores for Arena-Hard-Auto:
- - No Style Control: The original version of the benchmark.
- - With Style Control: The benchmark with style control methods for Markdown elements. You can read more about the methodology and technical background in this [blog post](https://lmsys.org/blog/2024-08-28-style-control/).
-
- Below, we show the scores for the Greek version of Arena-Hard-Auto for various open and closed chat models. The scores were determined using **gpt-4o-2024-08-06 as the judge model** and **gpt-4o-mini-2024-07-18 as the baseline model** (i.e., the baseline model scores 50% by construction).
-
- Llama-Krikri-8B Instruct exhibits very strong chat capabilities by scoring **higher than models over 8 times its size** (such as Llama-3.1-70B Instruct), and it is also **competitive with closed-source models** (e.g., GPT-4o-Mini) and **highly performant open-source models** (e.g., Gemma 2 27B IT & Aya Expanse 32B).
- ![image/png](arena_hard_el.png)
-
- Below, we show the scores for the original Arena-Hard-Auto dataset for various open and closed chat models. We followed the original methodology by using **gpt-4-1106-preview as the judge model** and **gpt-4-0314 as the baseline model**.
-
- Llama-Krikri-8B Instruct also performs very well in the English variant of Arena-Hard-Auto: it is **competitive with significantly larger previous-generation LLMs** (such as Qwen 2 72B Instruct) and it **improves upon Llama-3.1-8B Instruct by +24.5% / +16%** (no style control / with style control).
- ![image/png](arena_hard_en.png)
-
- **Please note** that judge models are biased towards student models trained on data distilled from them. You can read more [here](https://arxiv.org/pdf/2502.01534).
-
- 🚨 **More information on post-training, methodology, and evaluation coming soon.** 🚨
-

  # How to use

@@ -191,6 +152,46 @@ print(response.choices[0].message.content)
  # ['Ηθική καθήκοντος', 'Μεταμοντέρνα ηθική', 'Συνεπειοκρατική ηθική', 'Ωφελιμιστική ηθική', 'Δεοντολογική ηθική', 'Ηθική αρετών', 'Σχετικιστική ηθική']
  ```

+ # Evaluation
+
+ In the table below, we report the scores for our chat evaluation suite, which includes (see the aggregation sketch after this list):
+ - [Greek IFEval](https://huggingface.co/datasets/ilsp/ifeval_greek) (strict average)
+ - [English IFEval](https://huggingface.co/datasets/google/IFEval) (strict average)
+ - [Greek MT-Bench](https://huggingface.co/datasets/ilsp/mt-bench-greek) using gpt-4o-2024-08-06 as the judge model.
+ - [English MT-Bench](https://huggingface.co/datasets/HuggingFaceH4/mt_bench_prompts) using gpt-4o-2024-08-06 as the judge model.
+
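As referenced in the list above, here is a minimal sketch of how the IFEval score could be aggregated, assuming that "strict average" denotes the mean of prompt-level and instruction-level accuracy under the strict checks; the function and the toy data are illustrative, not taken from the released evaluation code.

```python
# Sketch only: assumes "strict average" = mean of prompt-level and
# instruction-level accuracy under IFEval's strict (non-relaxed) checks.
from typing import List

def ifeval_strict_average(results: List[List[bool]]) -> float:
    # results[i][j] is True if instruction j of prompt i passed the strict check.
    prompt_level = sum(all(r) for r in results) / len(results)
    instruction_level = sum(sum(r) for r in results) / sum(len(r) for r in results)
    return 0.5 * (prompt_level + instruction_level)

# Toy example: 3 prompts with 2, 2, and 1 verifiable instructions.
print(ifeval_strict_average([[True, True], [True, False], [True]]))  # ~0.733
```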
+ We can observe that *Llama-Krikri-8B-Instruct exhibits the strongest performance* in instruction following for both Greek and English across all the models we tested. In particular, it surpasses Llama-3.1-8B-Instruct by **+21.7** and **+7.3** percentage points on the Greek and English IFEval respectively.
+ It also exhibits **the strongest chat capabilities on the Greek MT-Bench** (+0.28 compared to Aya Expanse 8B), while also being very competitive on the English variant of MT-Bench.
+
+ | Model | IFEval EL (strict avg) | IFEval EN (strict avg) | MT-Bench EL | MT-Bench EN |
+ |------------------------------|------------------------|------------------------|-------------|-------------|
+ | Qwen 2.5 7B Instruct | 46.2% | 74.8% | 5.83 | **7.87** |
+ | EuroLLM 9B Instruct | 51.3% | 64.5% | 5.98 | 6.27 |
+ | Aya Expanse 8B | 50.4% | 62.2% | 7.68 | 6.92 |
+ | Meltemi 7B v1.5 Instruct | 32.7% | 41.2% | 6.25 | 5.46 |
+ | Llama-3.1-8B Instruct | 45.8% | 75.1% | 6.46 | 7.25 |
+ | **Llama-Krikri-8B Instruct** | **67.5%** | **82.4%** | **7.96** | 7.21 |
+
+
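For clarity on how the gains quoted above are derived, the snippet below recomputes them directly from the table as absolute differences; it is a quick sanity check, not part of the evaluation suite.

```python
# Sanity check: the gains quoted in the text are absolute differences taken from the table.
krikri = {"ifeval_el": 67.5, "ifeval_en": 82.4, "mtbench_el": 7.96}
llama31 = {"ifeval_el": 45.8, "ifeval_en": 75.1}
aya_expanse = {"mtbench_el": 7.68}

print(round(krikri["ifeval_el"] - llama31["ifeval_el"], 1))       # 21.7 percentage points
print(round(krikri["ifeval_en"] - llama31["ifeval_en"], 1))       # 7.3 percentage points
print(round(krikri["mtbench_el"] - aya_expanse["mtbench_el"], 2)) # 0.28 MT-Bench points
```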
+ We also used the [Arena-Hard-Auto](https://huggingface.co/datasets/lmarena-ai/arena-hard-auto-v0.1) automatic evaluation tool, as well as the translated (and post-edited) version for Greek that is publicly available [here](https://huggingface.co/datasets/ilsp/m-ArenaHard_greek). We report two scores for Arena-Hard-Auto:
+ - No Style Control: The original version of the benchmark.
+ - With Style Control: The benchmark with style control methods for Markdown elements (a sketch of the kind of style features involved follows below). You can read more about the methodology and technical background in this [blog post](https://lmsys.org/blog/2024-08-28-style-control/).
+
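As a rough illustration of what style control adjusts for, the sketch below computes the kind of per-answer style covariates (length and Markdown density) described in the linked blog post; the exact feature definitions and the downstream regression are simplified assumptions here, not the benchmark's own implementation.

```python
import re

def style_features(answer: str) -> dict:
    """Simplified per-answer style covariates: length and Markdown density."""
    return {
        "length": len(answer.split()),                           # rough proxy for token count
        "headers": len(re.findall(r"^#{1,6} ", answer, re.M)),   # markdown headers
        "lists": len(re.findall(r"^\s*(?:[-*]|\d+\.) ", answer, re.M)),
        "bold": len(re.findall(r"\*\*[^*]+\*\*", answer)),
    }

# With style control, features like these enter the pairwise comparison model as
# extra covariates, so the score reflects content quality rather than formatting.
print(style_features("# Title\n- point one\n- point two\n**bold** text"))
```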
+ Below, we show the scores for the Greek version of Arena-Hard-Auto for various open and closed chat models. The scores were determined using **gpt-4o-2024-08-06 as the judge model** and **gpt-4o-mini-2024-07-18 as the baseline model** (i.e., the baseline model scores 50% by construction).
+
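For intuition on why the baseline model sits at 50% by construction, here is a deliberately simplified view of the pairwise protocol (the real Arena-Hard-Auto pipeline fits a Bradley-Terry model with bootstrapped confidence intervals; the helper below is illustrative only): the judge compares each model's answer against the baseline's answer, and a model's score is its share of wins, with ties counting as half.

```python
from typing import List

def pairwise_score(verdicts: List[str]) -> float:
    """verdicts[i] in {"win", "tie", "loss"}: the model vs. the baseline on prompt i."""
    points = {"win": 1.0, "tie": 0.5, "loss": 0.0}
    return sum(points[v] for v in verdicts) / len(verdicts)

# The baseline compared against itself ties on every prompt, hence 50% by construction.
print(pairwise_score(["tie"] * 10))                   # 0.5
print(pairwise_score(["win", "win", "tie", "loss"]))  # 0.625
```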
+ Llama-Krikri-8B Instruct exhibits very strong chat capabilities by scoring **higher than models over 8 times its size** (such as Llama-3.1-70B Instruct), and it is also **competitive with closed-source models** (e.g., GPT-4o-Mini) and **highly performant open-source models** (e.g., Gemma 2 27B IT & Aya Expanse 32B).
+ ![image/png](arena_hard_el.png)
+
+ Below, we show the scores for the original Arena-Hard-Auto dataset for various open and closed chat models. We followed the original methodology by using **gpt-4-1106-preview as the judge model** and **gpt-4-0314 as the baseline model**.
+
+ Llama-Krikri-8B Instruct also performs very well in the English variant of Arena-Hard-Auto: it is **competitive with significantly larger previous-generation LLMs** (such as Qwen 2 72B Instruct) and it **improves upon Llama-3.1-8B Instruct by +24.5% / +16%** (no style control / with style control).
+ ![image/png](arena_hard_en.png)
+
+ **Please note** that judge models are biased towards student models trained on data distilled from them. You can read more [here](https://arxiv.org/pdf/2502.01534).
+
+ 🚨 **More information on post-training, methodology, and evaluation coming soon.** 🚨
+
+
  # Acknowledgements

  The ILSP team utilized Amazon's cloud computing services, which were made available via GRNET under the [OCRE Cloud framework](https://www.ocre-project.eu/), providing Amazon Web Services for the Greek Academic and Research Community.