# Mixed-BGE-M3-Email

This is a fine-tuned version of [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) optimized for email content retrieval. The model was trained on a mixed-language (English/Korean) email dataset to improve retrieval performance for various email-related queries.

## Model Description

- **Model Type:** Embedding model (encoder-only)
- **Base Model:** BAAI/bge-m3
- **Languages:** English, Korean
- **Domain:** Email content, business communication
- **Training Data:** Mixed-language email dataset with various types of queries (metadata, long-form, short-form, yes/no questions)

## Quickstart

```python
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.docstore.document import Document
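# Note: on newer LangChain releases these classes are imported from
# `langchain_community` (e.g. from langchain_community.embeddings import HuggingFaceEmbeddings)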

# Initialize the embedding model
embeddings = HuggingFaceEmbeddings(
    model_name="doubleyyh/mixed-bge-m3-email",
    model_kwargs={'device': 'cuda'},
    encode_kwargs={'normalize_embeddings': True}
)

# Example emails
emails = [
    {
        "subject": "회의 일정 λ³€κ²½ μ•ˆλ‚΄",
        "from": [["κΉ€μ² μˆ˜", "[email protected]"]],
        "to": [["이영희", "[email protected]"]],
        "cc": [["박지원", "[email protected]"]],
        "date": "2024-03-26T10:00:00",
        "text_body": "μ•ˆλ…•ν•˜μ„Έμš”, 내일 μ˜ˆμ •λœ ν”„λ‘œμ νŠΈ λ―ΈνŒ…μ„ μ˜€ν›„ 2μ‹œλ‘œ λ³€κ²½ν•˜κ³ μž ν•©λ‹ˆλ‹€."
    },
    {
        "subject": "Project Timeline Update",
        "from": [["John Smith", "[email protected]"]],
        "to": [["Team", "[email protected]"]],
        "cc": [],
        "date": "2024-03-26T11:30:00",
        "text_body": "Hi team, I'm writing to update you on the Q2 project milestones."
    }
]

# Format emails into documents
docs = []
for email in emails:
    # Format email content
    content = "\n".join([f"{k}: {v}" for k, v in email.items()])
    docs.append(Document(page_content=content))

# Create FAISS index
db = FAISS.from_documents(docs, embeddings)

# Query examples (supports both Korean and English)
queries = [
    "회의 μ‹œκ°„μ΄ μ–Έμ œλ‘œ λ³€κ²½λ˜μ—ˆλ‚˜μš”?",
    "When is the meeting rescheduled?",
    "ν”„λ‘œμ νŠΈ 일정",
    "Q2 milestones"
]

# Perform similarity search
for query in queries:
    print(f"\nQuery: {query}")
    results = db.similarity_search(query, k=1)
    print(f"Most relevant email:\n{results[0].page_content[:200]}...")
```
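
If you prefer not to use LangChain, the model can also be queried through the standard `sentence-transformers` interface. The following is a minimal sketch; it assumes the repository loads with the usual `SentenceTransformer` configuration, as the BAAI/bge-m3 base model does, and the sample passages are illustrative only:

```python
from sentence_transformers import SentenceTransformer

# Assumes the repo exposes the standard SentenceTransformer interface,
# like the BAAI/bge-m3 base model.
model = SentenceTransformer("doubleyyh/mixed-bge-m3-email")

query = "When is the meeting rescheduled?"
passages = [
    "Hi team, I'm writing to update you on the Q2 project milestones.",
    "Hello, I would like to move tomorrow's project meeting to 2 PM.",
]

# With normalized embeddings, the dot product equals cosine similarity
query_emb = model.encode(query, normalize_embeddings=True)
passage_embs = model.encode(passages, normalize_embeddings=True)

scores = passage_embs @ query_emb
for passage, score in sorted(zip(passages, scores), key=lambda x: -x[1]):
    print(f"{score:.4f}  {passage}")
```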


## Intended Use & Limitations

### Intended Use
- Email content retrieval
- Similar document search in email corpora
- Question answering over email content (see the retriever sketch after this list)
- Multi-language email search systems
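
For the question-answering use case, the FAISS index built in the Quickstart can be wrapped as a retriever and its results passed to whatever LLM you already use. A minimal sketch, reusing the `db` index from the Quickstart; the prompt format and `k` value are illustrative assumptions, not part of the model:

```python
# Reuses the `db` FAISS index created in the Quickstart above.
retriever = db.as_retriever(search_kwargs={"k": 2})

question = "When is the meeting rescheduled?"
context_docs = retriever.get_relevant_documents(question)

# Build a simple context block to hand to an LLM of your choice
context = "\n\n---\n\n".join(doc.page_content for doc in context_docs)
prompt = (
    "Answer the question using only the emails below.\n\n"
    f"Emails:\n{context}\n\n"
    f"Question: {question}\n"
)
print(prompt)  # pass `prompt` to your preferred LLM
```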

### Limitations
- Performance may vary on domains outside email content
- Best suited for business communication contexts
- Although the model supports both English and Korean, performance may differ between the two languages

## Citation

```bibtex
@misc{mixed-bge-m3-email,
  author = {doubleyyh},
  title = {Mixed-BGE-M3-Email: Fine-tuned Embedding Model for Email Content},
  year = {2024},
  publisher = {HuggingFace}
}
```

## License

This model follows the same license as the base model (bge-m3).

## Contact

For questions or feedback, please use the GitHub repository issues section or contact through HuggingFace.