Update README.md

Added link to LMSYS-Chat-1M-Synth dataset.

README.md (CHANGED):
```diff
@@ -10,6 +10,7 @@ license:
 model_type: llama
 datasets:
 - lmsys/lmsys-chat-1m
+- tokyotech-llm/lmsys-chat-1m-synth
 - argilla/magpie-ultra-v0.1
 ---
 
@@ -187,6 +188,7 @@ The following datasets were used for the instruction tuning.
 - `lmsys-chat-1m-synth-ja-wo-pii-and-template-instructions`
   - Single-turn Japanese instruction dataset synthesized and derived from [lmsys-chat-1m](https://huggingface.co/datasets/lmsys/lmsys-chat-1m) [\[Zhang+, ICLR24\]](https://openreview.net/forum?id=BOfDKxfwt0). First-turn user instructions were translated into Japanese via DeepL (machine translation), and assistant responses were generated using [Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct). [Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) served as a judge for rejection sampling (n=6).
     Conversations containing personally identifiable information (PII) and template-based user instructions were removed. Duplicate instructions were removed.
+  - The dataset is available at [tokyotech-llm/lmsys-chat-1m-synth](https://huggingface.co/datasets/tokyotech-llm/lmsys-chat-1m-synth).
 - `filtered-magpie-ultra-ja`
   - A Japanese variant of the `filtered-magpie-ultra-en` dataset, translated into Japanese by [gemma-2-27b-it](https://huggingface.co/google/gemma-2-27b-it).
 - `gemma-magpie`
@@ -194,6 +196,7 @@ The following datasets were used for the instruction tuning.
 - English
 - `lmsys-chat-1m-synth-en-wo-pii-and-template-instructions`
   - The creation process is similar to `lmsys-chat-1m-synth-ja-wo-pii-and-template-instructions`, but this version uses the original English user instructions. The assistant responses were generated in English as well. Rejection sampling was not applied for this version.
+  - The dataset is available at [tokyotech-llm/lmsys-chat-1m-synth](https://huggingface.co/datasets/tokyotech-llm/lmsys-chat-1m-synth).
 - `filtered-magpie-ultra-en`
   - A subset of the [magpie-ultra](https://huggingface.co/datasets/argilla/magpie-ultra-v0.1) dataset, developed following the MAGPIE recipe [\[Xu+, arXiv24\]](https://arxiv.org/abs/2406.08464) using [Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct). This subset includes only samples rated as 'average,' 'good,' or 'excellent.'
 
```
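For convenience, the newly linked dataset can be pulled with the `datasets` library. This is a minimal sketch: the `train` split name is an assumption, and access to the dataset may be gated, so check the dataset card and authenticate with `huggingface-cli login` if needed.

```python
# Minimal sketch: load the newly linked dataset with Hugging Face `datasets`.
# ASSUMPTIONS: the "train" split name is a guess, and access may be gated;
# see https://huggingface.co/datasets/tokyotech-llm/lmsys-chat-1m-synth.
from datasets import load_dataset

ds = load_dataset("tokyotech-llm/lmsys-chat-1m-synth", split="train")
print(ds)     # column names and row count
print(ds[0])  # one synthesized instruction/response record
```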
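The Japanese subset's description mentions best-of-n rejection sampling with Llama-3.1-70B-Instruct as judge (n=6). The commit gives neither the judge prompts nor the scoring scheme, so the sketch below only illustrates the general shape of the procedure; `generate` and `judge_score` are hypothetical stand-ins for the actual model calls.

```python
# Schematic sketch of best-of-n rejection sampling as described in the README:
# sample n candidate responses, score each with a judge model, keep the best.
# `generate` and `judge_score` are hypothetical stand-ins -- the commit does
# not specify the actual prompts, APIs, or scoring scale.
from typing import Callable, List

def rejection_sample(
    instruction: str,
    generate: Callable[[str], str],            # e.g. wraps Llama-3.1-405B-Instruct
    judge_score: Callable[[str, str], float],  # e.g. wraps Llama-3.1-70B-Instruct
    n: int = 6,                                # n=6 per the dataset description
) -> str:
    candidates: List[str] = [generate(instruction) for _ in range(n)]
    scores = [judge_score(instruction, c) for c in candidates]
    # keep the candidate the judge rates highest
    return candidates[max(range(n), key=scores.__getitem__)]
```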
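Similarly, `filtered-magpie-ultra-en` keeps only samples rated 'average,' 'good,' or 'excellent.' A comparable filter over the public magpie-ultra-v0.1 release might look like the sketch below; the `quality` column name is an assumption about that dataset's schema, not something stated in this commit.

```python
# Sketch of the rating filter described for `filtered-magpie-ultra-en`:
# keep only rows whose quality label is average/good/excellent.
# ASSUMPTION: magpie-ultra-v0.1 exposes the rating in a `quality` column;
# verify against the dataset card before relying on this.
from datasets import load_dataset

KEEP = {"average", "good", "excellent"}

magpie = load_dataset("argilla/magpie-ultra-v0.1", split="train")
filtered = magpie.filter(lambda row: row.get("quality") in KEEP)
print(f"kept {len(filtered)} of {len(magpie)} samples")
```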