nokazaki committed on Commit 3f35859 · verified · 1 Parent(s): 7cffb7b

Improved the description and fixed some typos.

Files changed (1): README.md +3 -4

README.md CHANGED
@@ -130,13 +130,12 @@ We used the Language Model Evaluation Harness(v.0.4.2) and Code Generation LM Ev
 
 ### MT-Bench JA
 
-We used [Japanese MT-Bench](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_question) to assess the instruction-following capabilities of models.
-We utilized the following settings:
+We used [Japanese MT-Bench](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_question) to assess the capabilities of multi-turn dialogue with the following settings:
 
-- Implemantation: FastChat [Zheng+, 2023] (commit #e86e70d0)
+- Implementation: FastChat [Zheng+, 2023] (commit #e86e70d0)
 - Question: [Nejumi LLM-Leaderboard NEO, mtbench_ja_question_v3](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_question/v3)
 - Reference Answer: [Nejumi LLM-Leaderboard NEO, mtbench_ja_referenceanswer_v1](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_referenceanswer/v1)
-- Prompt for Judge: [Nejumi LLM-Lederboard NEO, mtbench_ja_prompt_v1](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_prompt/v1)
+- Prompt for Judge: [Nejumi LLM-Leaderboard NEO, mtbench_ja_prompt_v1](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_prompt/v1)
 - Judge: `gpt-4-1106-preview`
 - Scoring: Absolute scale normalized to a 0-1 range, averaged over five runs.
 
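The scoring line in the diff describes an aggregation that can be sketched in a few lines. This is a minimal illustration, not the evaluation code itself; it assumes the MT-Bench judge emits scores on a 1-10 scale that are normalized by dividing by 10 (the README does not spell out the normalization, so that divisor is an assumption), then averaged across the five runs.

```python
def normalized_mtbench_score(run_scores):
    """Average per-run judge scores (assumed 1-10 scale) and rescale to 0-1.

    run_scores: one average judge score per evaluation run.
    """
    # Assumption: normalization is a simple division by the 10-point maximum.
    return sum(s / 10 for s in run_scores) / len(run_scores)

# Hypothetical scores from five runs, as in the README's setup.
print(normalized_mtbench_score([7.2, 7.0, 7.4, 7.1, 7.3]))  # -> 0.72
```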