Improved the description and fixed some typos.
README.md (changed):
@@ -130,13 +130,12 @@ We used the Language Model Evaluation Harness(v.0.4.2) and Code Generation LM Ev
 
 ### MT-Bench JA
 
-We used [Japanese MT-Bench](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_question) to assess the
-We utilized the following settings:
+We used [Japanese MT-Bench](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_question) to assess the capabilities of multi-turn dialogue with the following settings:
 
-
+- Implementation: FastChat [Zheng+, 2023] (commit #e86e70d0)
 - Question: [Nejumi LLM-Leaderboard NEO, mtbench_ja_question_v3](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_question/v3)
 - Reference Answer: [Nejumi LLM-Leaderboard NEO, mtbench_ja_referenceanswer_v1](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_referenceanswer/v1)
-- Prompt for Judge: [Nejumi LLM-
+- Prompt for Judge: [Nejumi LLM-Leaderboard NEO, mtbench_ja_prompt_v1](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_prompt/v1)
 - Judge: `gpt-4-1106-preview`
 - Scoring: Absolute scale normalized to a 0-1 range, averaged over five runs.
 
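The scoring line above can be sketched in code. This is only an illustrative assumption, not the repository's actual evaluation script: MT-Bench judges typically rate each answer on an absolute 1-10 scale, so "normalized to a 0-1 range, averaged over five runs" could look like dividing each run's mean judge score by the scale maximum and averaging across runs.

```python
def normalized_score(run_scores: list[float], max_score: float = 10.0) -> float:
    """Average per-run mean judge scores, rescaled to the 0-1 range.

    `run_scores` holds one mean judge score per run (assumed 1-10 scale);
    the result is their average divided by `max_score`.
    """
    if not run_scores:
        raise ValueError("need at least one run")
    return sum(s / max_score for s in run_scores) / len(run_scores)

# Five hypothetical runs of mean judge scores:
runs = [7.2, 7.5, 7.1, 7.4, 7.3]
print(round(normalized_score(runs), 3))  # mean 7.3 on a 10-point scale -> 0.73
```

Averaging the five runs before reporting smooths out the run-to-run variance of the GPT-4 judge.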