
Adam Fetzer

Rexschwert

AI & ML interests

AI, Big Data, Data Science, Machine Learning, Computer Vision, Natural Language Processing

Recent Activity

liked a dataset 6 days ago
OmniSVG/MMSVG-Illustration
liked a dataset 9 days ago
HuggingFaceM4/the_cauldron

Organizations

Rexschwert

Rexschwert's activity

reacted to nyuuzyou's post with 🔥👍 10 days ago
🇷🇺 Russian Forum Messages Dataset - nyuuzyou/ruforum

Collection of approximately 58 million Russian forum messages featuring:

- Complete message content from Russian online forums spanning 2010-2025
- Comprehensive metadata including unique message IDs and timestamps
- Full text content preserving original user discussions and interactions
- Monolingual dataset focused exclusively on Russian language content

This dataset offers a unique textual archive of Russian online conversations suitable for text generation, sentiment analysis, and language modeling research. Released to the public domain under the CC0 1.0 license.
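Because each message carries a unique ID and a timestamp, slicing the corpus by period is straightforward. A minimal sketch of that filtering over hypothetical records; the field names (`id`, `timestamp`, `text`) are assumptions for illustration, not the dataset's documented schema:

```python
# Sketch: filter forum messages to a year range using per-message
# Unix timestamps. Field names are hypothetical, not the real schema.
from datetime import datetime, timezone

def messages_between(messages, start_year, end_year):
    """Keep messages whose Unix timestamp falls within [start_year, end_year]."""
    lo = datetime(start_year, 1, 1, tzinfo=timezone.utc).timestamp()
    hi = datetime(end_year + 1, 1, 1, tzinfo=timezone.utc).timestamp()
    return [m for m in messages if lo <= m["timestamp"] < hi]

# Toy records: one inside the 2010-2025 window, one outside it.
sample = [
    {"id": 1, "timestamp": 1262304000, "text": "..."},  # 2010-01-01 UTC
    {"id": 2, "timestamp": 1893456000, "text": "..."},  # 2030-01-01 UTC
]
print(len(messages_between(sample, 2010, 2025)))  # 1
```

The same predicate works unchanged as a filter function when streaming the dataset, since it only touches the metadata fields.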
reacted to mlabonne's post with 🔥 about 1 month ago
reacted to tomaarsen's post with 🧠 about 1 month ago
🇪🇺 An assembly of 18 European companies, labs, and universities has banded together to launch EuroBERT! It's a state-of-the-art multilingual encoder for 15 European and other widely spoken languages, designed to be finetuned for retrieval, classification, etc.

🇪🇺 15 Languages: English, French, German, Spanish, Chinese, Italian, Russian, Polish, Portuguese, Japanese, Vietnamese, Dutch, Arabic, Turkish, Hindi
3️⃣ 3 model sizes: 210M, 610M, and 2.1B parameters - very useful sizes, in my opinion
➡️ Sequence length of 8192 tokens! Nice to see these longer sequence lengths for encoders becoming more common.
⚙️ Architecture based on Llama, but with bi-directional (non-causal) attention to turn it into an encoder. Flash Attention 2 is supported.
🔥 A new Pareto frontier (stronger *and* smaller) for multilingual encoder models
📊 Evaluated against mDeBERTa, mGTE, and XLM-RoBERTa on Retrieval, Classification, and Regression (after finetuning for each task separately): EuroBERT punches way above its weight.
📝 Detailed paper, incl. data details: FineWeb for English and CulturaX for multilingual data, The Stack v2 and Proof-Pile-2 for code.

Check out the release blogpost here: https://huggingface.co/blog/EuroBERT/release
* EuroBERT/EuroBERT-210m
* EuroBERT/EuroBERT-610m
* EuroBERT/EuroBERT-2.1B

The next step is for researchers to build upon the 3 EuroBERT base models and publish strong retrieval, zero-shot classification, etc. models for all to use. I'm very much looking forward to it!
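For retrieval use, an encoder's per-token outputs are typically collapsed into one sentence embedding via mean pooling over the attention mask. A minimal NumPy sketch of that step; the arrays here are toy stand-ins for what a model such as EuroBERT would return through Hugging Face `transformers`:

```python
# Sketch: mean pooling of encoder token embeddings into one sentence
# vector, ignoring padding positions (a common retrieval recipe).
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors where the mask is 1.

    token_embeddings: (seq_len, hidden_dim)
    attention_mask:   (seq_len,) with 1 for real tokens, 0 for padding
    """
    mask = attention_mask[:, None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=0)
    count = np.maximum(mask.sum(), 1e-9)  # guard against all-padding input
    return summed / count

# Toy example: 3 tokens, the last one is padding and must not contribute.
emb = np.array([[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]])
mask = np.array([1, 1, 0])
print(mean_pool(emb, mask))  # [2. 3.]
```

The resulting vectors can then be compared with cosine similarity for retrieval, which is exactly the kind of downstream model the post hopes researchers will publish on top of the base checkpoints.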