KURE-v1
Introducing the Korea University Retrieval Embedding model, KURE-v1.
It performs strongly on Korean text retrieval, specifically outperforming most multilingual embedding models.
To our knowledge, it is one of the best publicly available Korean retrieval models.
For details, visit the KURE repository (https://github.com/nlpai-lab/KURE).
Model Versions
Model Name | Dimension | Sequence Length | Introduction |
---|---|---|---|
KURE-v1 | 1024 | 8192 | Fine-tuned BAAI/bge-m3 with Korean data via CachedGISTEmbedLoss |
KoE5 | 1024 | 512 | Fine-tuned intfloat/multilingual-e5-large with ko-triplet-v1.0 via CachedMultipleNegativesRankingLoss |
Model Description
This is the model card of a 🤗 transformers model that has been pushed to the Hub.
- Developed by: NLP&AI Lab
- Language(s) (NLP): Korean, English
- License: MIT
- Finetuned from model: BAAI/bge-m3
Example code
Install Dependencies
First install the Sentence Transformers library:
pip install -U sentence-transformers
Python code
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("nlpai-lab/KURE-v1")
# Run inference
sentences = [
    # Query: how do the Constitution and the Court Organization Act enable diverse legal approaches, such as guaranteeing fundamental rights?
    '헌법과 법원조직법은 어떤 방식을 통해 기본권 보장 등의 다양한 법적 모색을 가능하게 했어',
    # Passage on diversifying the composition of the Supreme Court (relevant to the query)
    '4. 시사점과 개선방향 앞서 살펴본 바와 같이 우리 헌법과 ｢법원조직 법｣은 대법원 구성을 다양화하여 기본권 보장과 민주주의 확립에 있어 다각적인 법적 모색을 가능하게 하는 것을 근본 규범으로 하고 있다. 더욱이 합의체로서의 대법원 원리를 채택하고 있는 것 역시 그 구성의 다양성을 요청하는 것으로 해석된다. 이와 같은 관점에서 볼 때 현직 법원장급 고위법관을 중심으로 대법원을 구성하는 관행은 개선할 필요가 있는 것으로 보인다.',
    # Passage on a 2001 German Federal Constitutional Court ruling about broadcasting court proceedings (less relevant)
    '연방헌법재판소는 2001년 1월 24일 5:3의 다수견해로 「법원조직법」 제169조 제2문이 헌법에 합치된다는 판결을 내렸음 ○ 5인의 다수 재판관은 소송관계인의 인격권 보호, 공정한 절차의 보장과 방해받지 않는 법과 진실 발견 등을 근거로 하여 텔레비전 촬영에 대한 절대적인 금지를 헌법에 합치하는 것으로 보았음 ○ 그러나 나머지 3인의 재판관은 행정법원의 소송절차는 특별한 인격권 보호의 이익도 없으며, 텔레비전 공개주의로 인해 법과 진실 발견의 과정이 언제나 위태롭게 되는 것은 아니라면서 반대의견을 제시함 ○ 왜냐하면 행정법원의 소송절차에서는 소송당사자가 개인적으로 직접 심리에 참석하기보다는 변호사가 참석하는 경우가 많으며, 심리대상도 사실문제가 아닌 법률문제가 대부분이기 때문이라는 것임 □ 한편, 연방헌법재판소는 「연방헌법재판소법」(Bundesverfassungsgerichtsgesetz: BVerfGG) 제17a조에 따라 제한적이나마 재판에 대한 방송을 허용하고 있음 ○ 「연방헌법재판소법」 제17조에서 「법원조직법」 제14절 내지 제16절의 규정을 준용하도록 하고 있지만, 녹음이나 촬영을 통한 재판공개와 관련하여서는 「법원조직법」과 다른 내용을 규정하고 있음',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 1024)
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# Results for KURE-v1
# tensor([[1.0000, 0.6967, 0.5306],
# [0.6967, 1.0000, 0.4427],
# [0.5306, 0.4427, 1.0000]])
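For retrieval, the first sentence in the example can be treated as a query and the remaining sentences as candidate passages. The sketch below ranks candidates by cosine similarity in plain Python. `cosine` and `rank_by_cosine` are hypothetical helpers (not part of sentence-transformers), and the toy vectors are stand-ins for `model.encode(...)` outputs so that the snippet runs without downloading the model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_cosine(query_emb, doc_embs, top_k=3):
    """Return (document index, score) pairs for the top_k most similar documents."""
    scores = [cosine(query_emb, d) for d in doc_embs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order[:top_k]]


# Toy stand-ins for embeddings; with KURE-v1 these would come from model.encode(...).
query_emb = [1.0, 0.0, 0.0]
doc_embs = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 0.0],  # same direction as the query
    [1.0, 1.0, 0.0],  # 45 degrees from the query
]

ranking = rank_by_cosine(query_emb, doc_embs, top_k=2)
print(ranking)
```

Applied to the similarity matrix printed above, the query scores 0.6967 against the first passage and 0.5306 against the second, so the first passage would be ranked higher.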
Training Details
Training Data
KURE-v1
- Korean (query, document, hard negative) data with 5 hard negatives per query
- 2,000,000 examples
Training Procedure
- loss: CachedGISTEmbedLoss from sentence-transformers
- batch size: 4096
- learning rate: 2e-05
- epochs: 1
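CachedGISTEmbedLoss is an in-batch contrastive loss that uses a separate guide model to filter out likely false negatives, combined with gradient caching so that large batches (such as the 4096 used here) fit in memory. The following is a simplified, dependency-free sketch of the guided-masking idea only. It is not the sentence-transformers implementation: it ignores caching and explicit hard negatives, and the function name, temperature value, and toy similarity matrices are illustrative assumptions.

```python
import math


def guided_contrastive_loss(model_sim, guide_sim, temperature):
    """Mean cross-entropy over anchors, skipping negatives the guide rates too high.

    model_sim[i][j] and guide_sim[i][j] hold the similarity between anchor i
    and positive j under the trained model and the guide model respectively.
    The true positive for anchor i sits on the diagonal (j == i).
    """
    losses = []
    for i, row in enumerate(model_sim):
        threshold = guide_sim[i][i]
        # Keep the positive plus every in-batch negative that the guide rates
        # no higher than the positive; the rest are treated as false negatives.
        kept = [j for j in range(len(row)) if j == i or guide_sim[i][j] <= threshold]
        logits = [row[j] / temperature for j in kept]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_norm = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
        losses.append(log_norm - row[i] / temperature)
    return sum(losses) / len(losses)


# Toy 2x2 batch (assumed values).
model_sim = [[0.9, 0.8], [0.2, 0.7]]
guide_sim = [[0.8, 0.95], [0.1, 0.9]]  # guide rates negative (0, 1) above the positive
no_filter = [[1.0, 0.0], [0.0, 1.0]]   # a guide that never masks anything

guided = guided_contrastive_loss(model_sim, guide_sim, temperature=0.1)
unguided = guided_contrastive_loss(model_sim, no_filter, temperature=0.1)
print(guided, unguided)
```

In this toy batch the guide removes the suspicious negative for the first anchor, so that anchor is no longer penalised for scoring it highly and the guided loss comes out lower than the unguided one.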
Evaluation
Metrics
- Recall, Precision, NDCG, F1
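As a rough illustration of how these metrics are commonly defined at a cutoff k, the sketch below computes them for a single query with binary relevance. The exact evaluation code behind the reported numbers is not specified here, so the function name `retrieval_metrics`, the document IDs, and details such as how scores are averaged are assumptions.

```python
import math


def retrieval_metrics(ranked_ids, relevant_ids, k):
    """Recall@k, Precision@k, NDCG@k and F1@k for one query (binary relevance)."""
    relevant = set(relevant_ids)
    hits = [1 if doc_id in relevant else 0 for doc_id in ranked_ids[:k]]

    recall = sum(hits) / len(relevant) if relevant else 0.0
    precision = sum(hits) / k
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)

    # DCG discounts a hit at 1-based rank r by log2(r + 1).
    dcg = sum(h / math.log2(pos + 2) for pos, h in enumerate(hits))
    # Ideal DCG: all relevant documents (up to k) ranked first.
    idcg = sum(1 / math.log2(pos + 2) for pos in range(min(len(relevant), k)))
    ndcg = dcg / idcg if idcg > 0 else 0.0

    return {"recall": recall, "precision": precision, "ndcg": ndcg, "f1": f1}


# Hypothetical query with two relevant documents; one of them is ranked first.
ranked = ["d3", "d7", "d1", "d9"]
relevant = ["d3", "d9"]
print(retrieval_metrics(ranked, relevant, k=1))
print(retrieval_metrics(ranked, relevant, k=3))
```

Under these definitions, Precision@1 and NDCG@1 are both 1 when the top result is relevant and 0 otherwise, which is consistent with the identical Precision and NDCG columns in the Top-k 1 results.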
Benchmark Datasets
- Ko-StrategyQA: Korean ODQA multi-hop retrieval dataset (translated from StrategyQA)
- AutoRAGRetrieval: Korean document retrieval dataset built by parsing PDFs from five domains: finance, public sector, healthcare, law, and commerce
- MIRACLRetrieval: Wikipedia-based Korean document retrieval dataset
- PublicHealthQA: Korean document retrieval dataset for the medical and public health domains
- BelebeleRetrieval: FLORES-200-based Korean document retrieval dataset
- MrTidyRetrieval: Wikipedia-based Korean document retrieval dataset
- MultiLongDocRetrieval: Korean long-document retrieval dataset covering various domains
- XPQARetrieval: Korean document retrieval dataset covering various domains
Results
The tables below show each model's results averaged over all benchmark datasets. Detailed results are available on the KURE GitHub.
Top-k 1
Model | Average Recall_top1 | Average Precision_top1 | Average NDCG_top1 | Average F1_top1 |
---|---|---|---|---|
nlpai-lab/KURE-v1 | 0.52640 | 0.60551 | 0.60551 | 0.55784 |
dragonkue/BGE-m3-ko | 0.52361 | 0.60394 | 0.60394 | 0.55535 |
BAAI/bge-m3 | 0.51778 | 0.59846 | 0.59846 | 0.54998 |
Snowflake/snowflake-arctic-embed-l-v2.0 | 0.51246 | 0.59384 | 0.59384 | 0.54489 |
nlpai-lab/KoE5 | 0.50157 | 0.57790 | 0.57790 | 0.53178 |
intfloat/multilingual-e5-large | 0.50052 | 0.57727 | 0.57727 | 0.53122 |
jinaai/jina-embeddings-v3 | 0.48287 | 0.56068 | 0.56068 | 0.51361 |
BAAI/bge-multilingual-gemma2 | 0.47904 | 0.55472 | 0.55472 | 0.50916 |
intfloat/multilingual-e5-large-instruct | 0.47842 | 0.55435 | 0.55435 | 0.50826 |
intfloat/multilingual-e5-base | 0.46950 | 0.54490 | 0.54490 | 0.49947 |
intfloat/e5-mistral-7b-instruct | 0.46772 | 0.54394 | 0.54394 | 0.49781 |
Alibaba-NLP/gte-multilingual-base | 0.46469 | 0.53744 | 0.53744 | 0.49353 |
Alibaba-NLP/gte-Qwen2-7B-instruct | 0.46633 | 0.53625 | 0.53625 | 0.49429 |
openai/text-embedding-3-large | 0.44884 | 0.51688 | 0.51688 | 0.47572 |
Salesforce/SFR-Embedding-2_R | 0.43748 | 0.50815 | 0.50815 | 0.46504 |
upskyy/bge-m3-korean | 0.43125 | 0.50245 | 0.50245 | 0.45945 |
jhgan/ko-sroberta-multitask | 0.33788 | 0.38497 | 0.38497 | 0.35678 |
Top-k 3
Model | Average Recall_top3 | Average Precision_top3 | Average NDCG_top3 | Average F1_top3 |
---|---|---|---|---|
nlpai-lab/KURE-v1 | 0.68678 | 0.28711 | 0.65538 | 0.39835 |
dragonkue/BGE-m3-ko | 0.67834 | 0.28385 | 0.64950 | 0.39378 |
BAAI/bge-m3 | 0.67526 | 0.28374 | 0.64556 | 0.39291 |
Snowflake/snowflake-arctic-embed-l-v2.0 | 0.67128 | 0.28193 | 0.64042 | 0.39072 |
intfloat/multilingual-e5-large | 0.65807 | 0.27777 | 0.62822 | 0.38423 |
nlpai-lab/KoE5 | 0.65174 | 0.27329 | 0.62369 | 0.37882 |
BAAI/bge-multilingual-gemma2 | 0.64415 | 0.27416 | 0.61105 | 0.37782 |
jinaai/jina-embeddings-v3 | 0.64116 | 0.27165 | 0.60954 | 0.37511 |
intfloat/multilingual-e5-large-instruct | 0.64353 | 0.27040 | 0.60790 | 0.37453 |
Alibaba-NLP/gte-multilingual-base | 0.63744 | 0.26404 | 0.59695 | 0.36764 |
Alibaba-NLP/gte-Qwen2-7B-instruct | 0.63163 | 0.25937 | 0.59237 | 0.36263 |
intfloat/multilingual-e5-base | 0.62099 | 0.26144 | 0.59179 | 0.36203 |
intfloat/e5-mistral-7b-instruct | 0.62087 | 0.26144 | 0.58917 | 0.36188 |
openai/text-embedding-3-large | 0.61035 | 0.25356 | 0.57329 | 0.35270 |
Salesforce/SFR-Embedding-2_R | 0.60001 | 0.25253 | 0.56346 | 0.34952 |
upskyy/bge-m3-korean | 0.59215 | 0.25076 | 0.55722 | 0.34623 |
jhgan/ko-sroberta-multitask | 0.46930 | 0.18994 | 0.43293 | 0.26696 |
Top-k 5
Model | Average Recall_top5 | Average Precision_top5 | Average NDCG_top5 | Average F1_top5 |
---|---|---|---|---|
nlpai-lab/KURE-v1 | 0.73851 | 0.19130 | 0.67479 | 0.29903 |
dragonkue/BGE-m3-ko | 0.72517 | 0.18799 | 0.66692 | 0.29401 |
BAAI/bge-m3 | 0.72954 | 0.18975 | 0.66615 | 0.29632 |
Snowflake/snowflake-arctic-embed-l-v2.0 | 0.72962 | 0.18875 | 0.66236 | 0.29542 |
nlpai-lab/KoE5 | 0.70820 | 0.18287 | 0.64499 | 0.28628 |
intfloat/multilingual-e5-large | 0.70124 | 0.18316 | 0.64402 | 0.28588 |
BAAI/bge-multilingual-gemma2 | 0.70258 | 0.18556 | 0.63338 | 0.28851 |
jinaai/jina-embeddings-v3 | 0.69933 | 0.18256 | 0.63133 | 0.28505 |
intfloat/multilingual-e5-large-instruct | 0.69018 | 0.17838 | 0.62486 | 0.27933 |
Alibaba-NLP/gte-multilingual-base | 0.69365 | 0.17789 | 0.61896 | 0.27879 |
intfloat/multilingual-e5-base | 0.67250 | 0.17406 | 0.61119 | 0.27247 |
Alibaba-NLP/gte-Qwen2-7B-instruct | 0.67447 | 0.17114 | 0.60952 | 0.26943 |
intfloat/e5-mistral-7b-instruct | 0.67449 | 0.17484 | 0.60935 | 0.27349 |
openai/text-embedding-3-large | 0.66365 | 0.17004 | 0.59389 | 0.26677 |
Salesforce/SFR-Embedding-2_R | 0.65622 | 0.17018 | 0.58494 | 0.26612 |
upskyy/bge-m3-korean | 0.65477 | 0.17015 | 0.58073 | 0.26589 |
jhgan/ko-sroberta-multitask | 0.53136 | 0.13264 | 0.45879 | 0.20976 |
Top-k 10
Model | Average Recall_top10 | Average Precision_top10 | Average NDCG_top10 | Average F1_top10 |
---|---|---|---|---|
nlpai-lab/KURE-v1 | 0.79682 | 0.10624 | 0.69473 | 0.18524 |
dragonkue/BGE-m3-ko | 0.78450 | 0.10492 | 0.68748 | 0.18288 |
BAAI/bge-m3 | 0.79195 | 0.10592 | 0.68723 | 0.18456 |
Snowflake/snowflake-arctic-embed-l-v2.0 | 0.78669 | 0.10462 | 0.68189 | 0.18260 |
intfloat/multilingual-e5-large | 0.75902 | 0.10147 | 0.66370 | 0.17693 |
nlpai-lab/KoE5 | 0.75296 | 0.09937 | 0.66012 | 0.17369 |
BAAI/bge-multilingual-gemma2 | 0.76153 | 0.10364 | 0.65330 | 0.18003 |
jinaai/jina-embeddings-v3 | 0.76277 | 0.10240 | 0.65290 | 0.17843 |
intfloat/multilingual-e5-large-instruct | 0.74851 | 0.09888 | 0.64451 | 0.17283 |
Alibaba-NLP/gte-multilingual-base | 0.75631 | 0.09938 | 0.64025 | 0.17363 |
Alibaba-NLP/gte-Qwen2-7B-instruct | 0.74092 | 0.09607 | 0.63258 | 0.16847 |
intfloat/multilingual-e5-base | 0.73512 | 0.09717 | 0.63216 | 0.16977 |
intfloat/e5-mistral-7b-instruct | 0.73795 | 0.09777 | 0.63076 | 0.17078 |
openai/text-embedding-3-large | 0.72946 | 0.09571 | 0.61670 | 0.16739 |
Salesforce/SFR-Embedding-2_R | 0.71662 | 0.09546 | 0.60589 | 0.16651 |
upskyy/bge-m3-korean | 0.71895 | 0.09583 | 0.60258 | 0.16712 |
jhgan/ko-sroberta-multitask | 0.61225 | 0.07826 | 0.48687 | 0.13757 |
Citation
If you find our paper or models helpful, please consider citing them as follows:
@misc{KURE,
publisher = {Youngjoon Jang, Junyoung Son, Taemin Lee},
year = {2024},
url = {https://github.com/nlpai-lab/KURE}
}
@misc{KoE5,
author = {NLP & AI Lab and Human-Inspired AI research},
title = {KoE5: A New Dataset and Model for Improving Korean Embedding Performance},
year = {2024},
publisher = {Youngjoon Jang, Junyoung Son, Taemin Lee},
journal = {GitHub repository},
howpublished = {\url{https://github.com/nlpai-lab/KoE5}},
}