Improved the description and fixed some typos.
README.md (changed):
@@ -130,13 +130,12 @@ We used the Language Model Evaluation Harness(v.0.4.2) and Code Generation LM Ev
 
 ### MT-Bench JA
 
-We used [Japanese MT-Bench](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_question) to assess the
-We utilized the following settings:
+We used [Japanese MT-Bench](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_question) to assess the capabilities of multi-turn dialogue with the following settings:
 
-
+- Implementation: FastChat [Zheng+, 2023] (commit #e86e70d0)
 - Question: [Nejumi LLM-Leaderboard NEO, mtbench_ja_question_v3](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_question/v3)
 - Reference Answer: [Nejumi LLM-Leaderboard NEO, mtbench_ja_referenceanswer_v1](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_referenceanswer/v1)
-- Prompt for Judge: [Nejumi LLM-
+- Prompt for Judge: [Nejumi LLM-Leaderboard NEO, mtbench_ja_prompt_v1](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_prompt/v1)
 - Judge: `gpt-4-1106-preview`
 - Scoring: Absolute scale normalized to a 0-1 range, averaged over five runs.
 
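The scoring line above can be sketched in code. This is only an illustrative assumption, not the repository's actual evaluation script: MT-Bench judges typically rate each answer on an absolute 1-10 scale, so "normalized to a 0-1 range, averaged over five runs" could look like dividing each run's mean judge score by the scale maximum and averaging across runs.

```python
def normalized_score(run_scores: list[float], max_score: float = 10.0) -> float:
    """Average per-run mean judge scores, rescaled to the 0-1 range.

    `run_scores` holds one mean judge score per run (assumed 1-10 scale);
    the result is their average divided by `max_score`.
    """
    if not run_scores:
        raise ValueError("need at least one run")
    return sum(s / max_score for s in run_scores) / len(run_scores)

# Five hypothetical runs of mean judge scores:
runs = [7.2, 7.5, 7.1, 7.4, 7.3]
print(round(normalized_score(runs), 3))  # mean 7.3 on a 10-point scale -> 0.73
```

Averaging the five runs before reporting smooths out the run-to-run variance of the GPT-4 judge.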