CodeSearchNet Multilingual Tokenizer
A specialized tokenizer trained on code from 6 programming languages (Python, Java, JavaScript, PHP, Ruby, Go) using the CodeSearchNet dataset.
Model Details
Model Description
This tokenizer is based on GPT-2's tokenizer but retrained specifically for source code across multiple programming languages. It provides more efficient tokenization for code compared to general-purpose tokenizers.
- Model type: BPE Tokenizer
- Languages: Python, Java, JavaScript, PHP, Ruby, Go
- Vocabulary size: 64,000 tokens
- Retrained from: GPT-2 tokenizer (openai-community/gpt2)
Uses
Direct Use
This tokenizer is designed for preprocessing source code before training or inference with language models. It's particularly useful for:
- Code generation models
- Code completion systems
- Code analysis and understanding tasks
- Multi-language programming assistants
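For example, the tokenizer can batch-encode code snippets for model training or inference. The snippets below are illustrative, and because the GPT-2-derived tokenizer ships without a padding token, one is assigned from the EOS token; a minimal sketch:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("helmo/code-search-net-multilang-tokenizer")

# GPT-2-style tokenizers have no pad token by default; reuse EOS for batching.
tokenizer.pad_token = tokenizer.eos_token

snippets = [
    "def greet(name):\n    return f'Hello, {name}!'",
    "function add(a, b) { return a + b; }",
]

# Produces input_ids and attention_mask, ready to feed to a code language model.
batch = tokenizer(snippets, padding=True, truncation=True, max_length=256, return_tensors="pt")
print(batch["input_ids"].shape)
```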
Performance
Compared to the original GPT-2 tokenizer, this specialized tokenizer achieves:
- Python: 25% fewer tokens on average
- Java: 31% fewer tokens on average
- JavaScript: 21% fewer tokens on average
- Go: 14% fewer tokens on average
- PHP: 14% fewer tokens on average
- Ruby: 13% fewer tokens on average
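These figures are averages over the evaluation corpus; savings vary per snippet. A quick way to check the difference on your own code (the snippet below is illustrative) is to count tokens with both tokenizers:

```python
from transformers import AutoTokenizer

code_tok = AutoTokenizer.from_pretrained("helmo/code-search-net-multilang-tokenizer")
gpt2_tok = AutoTokenizer.from_pretrained("openai-community/gpt2")

snippet = "def fibonacci(n):\n    return n if n < 2 else fibonacci(n - 1) + fibonacci(n - 2)"

# Fewer tokens for the same input means more code fits into a fixed context window.
print("code tokenizer:", len(code_tok.tokenize(snippet)))
print("gpt2 tokenizer:", len(gpt2_tok.tokenize(snippet)))
```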
How to Get Started
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("helmo/code-search-net-multilang-tokenizer")

# Example usage
code = '''public class Calculator {
    public int add(int a, int b) {
        return a + b;
    }
}'''

tokens = tokenizer.tokenize(code)      # list of subword strings
token_ids = tokenizer.encode(code)     # list of integer token ids
```
Training Details
Training Data
Trained on the CodeSearchNet dataset, which contains:
- ~2M code functions across 6 programming languages
- Real-world code from GitHub repositories
- Functions with associated documentation
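The dataset is available on the Hugging Face Hub. A minimal loading sketch (assuming the public code_search_net dataset and its whole_func_string field; recent versions of datasets may require trust_remote_code=True):

```python
from datasets import load_dataset

# Each language is a separate configuration: "python", "java", "javascript", "php", "ruby", "go".
raw_datasets = load_dataset("code_search_net", "python")

# Every example is a single function with its documentation and metadata.
print(raw_datasets["train"][0]["whole_func_string"])
```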
Training Procedure
- Base model: GPT-2 tokenizer (50,257-token vocabulary)
- Training method: BPE (Byte-Pair Encoding)
- Final vocabulary: 64,000 tokens
- Training corpus: Combined functions from all 6 languages in CodeSearchNet
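A minimal sketch of this kind of retraining with transformers' train_new_from_iterator (an illustrative recipe, not necessarily the exact script used; it assumes the code_search_net dataset's whole_func_string field):

```python
from datasets import concatenate_datasets, load_dataset
from transformers import AutoTokenizer

languages = ["python", "java", "javascript", "php", "ruby", "go"]
corpus = concatenate_datasets(
    [load_dataset("code_search_net", lang, split="train") for lang in languages]
)

def batch_iterator(batch_size=1000):
    # Stream the corpus in chunks so the whole dataset never sits in memory at once.
    for start in range(0, len(corpus), batch_size):
        yield corpus[start : start + batch_size]["whole_func_string"]

old_tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
new_tokenizer = old_tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=64000)
new_tokenizer.save_pretrained("code-search-net-multilang-tokenizer")
```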
Technical Specifications
Model Architecture
- Algorithm: Byte-Pair Encoding (BPE)
- Vocabulary size: 64,000
- Special tokens: Inherited from GPT-2 tokenizer
- Subword handling: Optimized for code syntax and patterns
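The final vocabulary and the inherited special tokens can be inspected directly; a small check:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("helmo/code-search-net-multilang-tokenizer")

print(len(tokenizer))                # vocabulary size (64,000)
print(tokenizer.special_tokens_map)  # special tokens inherited from GPT-2, e.g. <|endoftext|>

# Code-oriented merges: multi-space indentation and common keywords tend to
# map to few tokens (exact splits depend on the learned vocabulary).
print(tokenizer.tokenize("        return a + b"))
```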
Citation
```bibtex
@misc{codesearchnet-multilang-tokenizer,
  title={CodeSearchNet Multilingual Tokenizer},
  author={Monteiro, Hélder},
  year={2025},
  howpublished={Hugging Face Model Hub},
}
```
Dataset Reference
```bibtex
@article{husain2019codesearchnet,
  title={CodeSearchNet Challenge: Evaluating the State of Semantic Code Search},
  author={Husain, Hamel and Wu, Ho-Hsiang and Gazit, Tiferet and Allamanis, Miltiadis and Brockschmidt, Marc},
  journal={arXiv preprint arXiv:1909.09436},
  year={2019}
}
```