CodeSearchNet Multilingual Tokenizer

A specialized tokenizer trained on code from 6 programming languages (Python, Java, JavaScript, PHP, Ruby, Go) using the CodeSearchNet dataset.

Model Details

Model Description

This tokenizer is based on GPT-2's tokenizer but retrained specifically on source code from multiple programming languages. It encodes the same code into fewer tokens than general-purpose tokenizers (see the Performance section below).

  • Model type: BPE Tokenizer
  • Languages: Python, Java, JavaScript, PHP, Ruby, Go
  • Vocabulary size: 64,000 tokens
  • Finetuned from: GPT-2 tokenizer

Uses

Direct Use

This tokenizer is designed for preprocessing source code before training or inference with language models (a short preprocessing sketch follows the list below). It is particularly useful for:

  • Code generation models
  • Code completion systems
  • Code analysis and understanding tasks
  • Multi-language programming assistants
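
As a minimal preprocessing sketch (assuming a GPT-2-style setup with no dedicated padding token, so EOS is reused for padding; the snippets are illustrative), the code below batches a few code strings into padded, truncated model inputs:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("helmo/code-search-net-multilang-tokenizer")

# GPT-2-derived tokenizers usually ship without a padding token; reuse EOS (assumption)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

snippets = [
    "def square(x):\n    return x * x\n",       # Python
    "function square(x) { return x * x; }",     # JavaScript
]

# Padded, truncated batch suitable as language-model input
batch = tokenizer(snippets, padding=True, truncation=True, max_length=128)
print(batch["input_ids"])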

Performance

Compared to the original GPT-2 tokenizer, this specialized tokenizer produces the following average token savings (a comparison sketch follows the list):

  • Python: 25% fewer tokens on average
  • Java: 31% fewer tokens on average
  • JavaScript: 21% fewer tokens on average
  • Go: 14% fewer tokens on average
  • PHP: 14% fewer tokens on average
  • Ruby: 13% fewer tokens on average
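
These savings depend on the evaluation corpus, so exact percentages will vary. A minimal sketch for comparing token counts on your own code (the snippet below is illustrative, not from the original evaluation):

from transformers import AutoTokenizer

gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")
code_tokenizer = AutoTokenizer.from_pretrained("helmo/code-search-net-multilang-tokenizer")

snippet = "def add(a, b):\n    return a + b\n"

gpt2_count = len(gpt2_tokenizer.tokenize(snippet))
code_count = len(code_tokenizer.tokenize(snippet))
print(f"GPT-2: {gpt2_count} tokens, code tokenizer: {code_count} tokens")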

How to Get Started

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("helmo/code-search-net-multilang-tokenizer")

# Example usage
code = '''public class Calculator {
    public int add(int a, int b) {
        return a + b;
    }
}'''

tokens = tokenizer.tokenize(code)      # list of subword token strings
token_ids = tokenizer.encode(code)     # list of integer token IDs

Training Details

Training Data

Trained on the CodeSearchNet dataset, which contains (a loading sketch follows the list):

  • ~2M code functions across 6 programming languages
  • Real-world code from GitHub repositories
  • Functions with associated documentation
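
A sketch of inspecting this data with the Hugging Face datasets library, assuming the code_search_net dataset on the Hub, where the raw function text is stored in the whole_func_string column:

from datasets import load_dataset

# Load one language split of CodeSearchNet (repeat for the other five languages)
dataset = load_dataset("code_search_net", "python", split="train")

print(dataset.num_rows)
print(dataset[0]["whole_func_string"][:200])  # raw source of the first function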

Training Procedure

  • Base tokenizer: GPT-2 tokenizer (50,257-token vocabulary)
  • Training method: BPE (Byte-Pair Encoding)
  • Final vocabulary: 64,000 tokens
  • Training corpus: Combined functions from all 6 languages in CodeSearchNet
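
The exact training script is not published here, but a tokenizer like this can be retrained from GPT-2 with train_new_from_iterator. The sketch below uses a tiny in-memory corpus as a stand-in for the combined CodeSearchNet functions:

from transformers import AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Stand-in corpus; the actual training iterated over ~2M CodeSearchNet functions
corpus = [
    "def add(a, b):\n    return a + b\n",
    "public int add(int a, int b) { return a + b; }",
]

def training_corpus():
    # Yield batches of raw code strings
    for i in range(0, len(corpus), 1000):
        yield corpus[i : i + 1000]

new_tokenizer = old_tokenizer.train_new_from_iterator(training_corpus(), vocab_size=64000)
new_tokenizer.save_pretrained("code-search-net-multilang-tokenizer")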

Technical Specifications

Model Architecture

  • Algorithm: Byte-Pair Encoding (BPE)
  • Vocabulary size: 64,000
  • Special tokens: Inherited from GPT-2 tokenizer
  • Subword handling: Optimized for code syntax and patterns
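
A quick way to check these specifications against the published files (exact output depends on the uploaded tokenizer):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("helmo/code-search-net-multilang-tokenizer")

print(len(tokenizer))                # total vocabulary size, expected 64,000
print(tokenizer.special_tokens_map)  # special tokens inherited from GPT-2 (e.g. <|endoftext|>)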

Citation

@misc{codesearchnet-multilang-tokenizer,
  title={CodeSearchNet Multilingual Tokenizer},
  author={Hélder Monteiro},
  year={2025},
  howpublished={Hugging Face Model Hub},
}

Dataset Reference

@article{husain2019codesearchnet,
  title={CodeSearchNet Challenge: Evaluating the State of Semantic Code Search},
  author={Husain, Hamel and Wu, Ho-Hsiang and Gazit, Tiferet and Allamanis, Miltiadis and Brockschmidt, Marc},
  journal={arXiv preprint arXiv:1909.09436},
  year={2019}
}