Update README.md

Krikri is built on top of [Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B).
# Model Information

- Vocabulary extension of the Llama-3.1 tokenizer with Greek tokens
- 128k context length (approximately 80,000 Greek words)
- We extend the pretraining of Llama-3.1-8B with added proficiency for the Greek language by utilizing a large training corpus.
  * This corpus includes 56.7 billion monolingual Greek tokens, constructed from publicly available resources.
  * Additionally, to mitigate catastrophic forgetting and ensure that the model has bilingual capabilities, we use additional sub-corpora with monolingual English texts (21 billion tokens) and Greek-English parallel data (5.5 billion tokens).
  * The training corpus also contains 7.8 billion math and code tokens.
  * This corpus has been processed, filtered, and deduplicated to ensure data quality and is outlined below:

| Sub-corpus | # Tokens | Percentage |
|------------|----------|------------|
| Greek      | 56.7 B   | 62.3 %     |
| English    | 21.0 B   | 23.1 %     |
| Parallel   | 5.5 B    | 6.0 %      |
| Math/Code  | 7.8 B    | 8.6 %      |
| **Total**  | 91 B     | **100%**   |

Chosen subsets of the 91-billion-token corpus were upsampled, resulting in a size of **110 billion tokens**.

🚨 **More information on the post-training corpus and methodology coming soon.** 🚨

# How to use

A minimal sketch, assuming the model is served behind an OpenAI-compatible endpoint (e.g. a vLLM server); `base_url`, `api_key`, and the model name are deployment-specific placeholders:

```python
from openai import OpenAI

base_url = "http://localhost:8000/v1"  # placeholder: adjust to your deployment

client = OpenAI(
    api_key="EMPTY",  # placeholder: many OpenAI-compatible servers ignore the key
    base_url=base_url,
)

# "You are an advanced translation system that answers with Python lists."
system_prompt = "Είσαι ένα ανεπτυγμένο μεταφραστικό σύστημα που απαντάει με λίστες Python."
# "Give me the following list with each of its strings translated into Greek: [...]"
user_prompt = "Δώσε μου την παρακάτω λίστα με μεταφρασμένο κάθε string της στα ελληνικά: ['Ethics of duty', 'Postmodern ethics', 'Consequentialist ethics', 'Utilitarian ethics', 'Deontological ethics', 'Virtue ethics', 'Relativist ethics']"

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt},
]

response = client.chat.completions.create(model="Llama-Krikri-8B-Instruct", messages=messages)
print(response.choices[0].message.content)
```

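Since the system prompt instructs the model to answer with a Python list, the reply can be parsed directly with the standard library. The reply string below is illustrative only, not an actual model output:

```python
import ast

# Illustrative reply string in the format the system prompt requests.
reply = "['Ηθική του καθήκοντος', 'Μεταμοντέρνα ηθική', 'Ηθική της αρετής']"

# ast.literal_eval safely parses Python literals without executing code.
translated = ast.literal_eval(reply)
assert isinstance(translated, list) and all(isinstance(s, str) for s in translated)
print(len(translated))  # 3
```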
# Evaluation

## Greek Benchmarks

The evaluation suite we created for the Greek language includes 6 test sets. You can run the suite by cloning this [lighteval fork](https://github.com/LeonVouk/lighteval).

Our evaluation suite includes:
* Four machine-translated versions ([ARC Greek](https://huggingface.co/datasets/ilsp/arc_greek), [Truthful QA Greek](https://huggingface.co/datasets/ilsp/truthful_qa_greek), [HellaSwag Greek](https://huggingface.co/datasets/ilsp/hellaswag_greek), [MMLU Greek](https://huggingface.co/datasets/ilsp/mmlu_greek)) of established English benchmarks for language understanding and reasoning ([ARC Challenge](https://arxiv.org/abs/1803.05457), [Truthful QA](https://arxiv.org/abs/2109.07958), [HellaSwag](https://arxiv.org/abs/1905.07830), [MMLU](https://arxiv.org/abs/2009.03300)).
* An existing benchmark for question answering in Greek ([Belebele](https://arxiv.org/abs/2308.16884)).
* A novel benchmark created by the ILSP team for medical question answering, based on the medical exams of [DOATAP](https://www.doatap.gr) ([Medical MCQA](https://huggingface.co/datasets/ilsp/medical_mcqa_greek)).

Our evaluation of Llama-Krikri-8B is performed in a few-shot setting, consistent with the settings of the [Open LLM leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard). Our training enhances performance across all Greek test sets by an average improvement of **+10.8%**. The results for the Greek test sets are shown in the following table:

|                          | Medical MCQA EL (15-shot) | Belebele EL (5-shot) | HellaSwag EL (10-shot) | ARC-Challenge EL (25-shot) | TruthfulQA MC2 EL (0-shot) | MMLU EL (5-shot) | Average |
|--------------------------|---------------------------|----------------------|------------------------|----------------------------|----------------------------|------------------|---------|
| Meltemi-7B-Instruct-v1.5 | --% | --% | --% | --% | --% | --% | --% |
| Llama-3.1-8B-Instruct    | --% | --% | --% | --% | --% | --% | --% |
| Llama-Krikri-8B-Instruct | **38.5%** | **86.5%** | **61.0%** | **45.11%** | **53.7%** | **50.0%** | **56.7%** |

## English Benchmarks

|                          | Winogrande (5-shot) | Belebele (5-shot) | HellaSwag (10-shot) | ARC-Challenge (25-shot) | TruthfulQA MC2 (0-shot) | MMLU (5-shot) | Average |
|--------------------------|---------------------|-------------------|---------------------|-------------------------|-------------------------|---------------|---------|
| Meltemi-7B-Instruct-v1.5 | --% | --% | --% | --% | --% | --% | --% |
| Llama-3.1-8B-Instruct    | --% | --% | --% | --% | --% | --% | --% |
| Llama-Krikri-8B-Instruct | 71.9% | **89.5%** | 78.1% | 55.9% | **51.3%** | 63.0% | **69.3%** |

Please note that all evaluations were run with the latest version of lighteval, which differs in some respects from earlier versions. This is why we report scores for Meltemi-7B-v1.5-Instruct that differ from previously reported results.

🚨 **Instruction following and chat capability evaluation benchmarks coming soon.** 🚨

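The shot counts above (e.g. 5-shot, 25-shot) mean each test question is preceded by that many solved examples in the prompt. A minimal, illustrative sketch of how such a prompt is assembled (not the exact lighteval format):

```python
def build_few_shot_prompt(examples, query):
    # Each example is a solved (question, answer) pair; the final query is
    # left unanswered for the model to complete.
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{shots}\n\nQ: {query}\nA:"

# A 2-shot prompt built from two solved examples plus the test question.
prompt = build_few_shot_prompt([("2+2", "4"), ("3+3", "6")], "4+4")
print(prompt)
```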
# Acknowledgements