---
license: mit
library_name: transformers
pipeline_tag: feature-extraction
---

# PDeepPP: A Comprehensive Protein Language Model Hub

This repository contains the models presented in the paper *A general language model for peptide identification*.

PDeepPP is a hybrid protein language model designed to predict post-translational modification (PTM) sites, analyze biologically relevant features, and support a wide range of protein sequence analysis tasks. This repository serves as the central hub for accessing and exploring the specialized PDeepPP models, each fine-tuned for a specific task such as PTM site prediction or bioactivity analysis.

All models in this repository can be used with their corresponding datasets, available on [GitHub](https://github.com/fondress/PDeepPP/tree/main).


## Overview

PDeepPP integrates state-of-the-art transformer-based self-attention mechanisms with convolutional neural networks (CNNs) to capture both global and local features in protein sequences. By leveraging pretrained embeddings from ESM and incorporating modular architecture components, PDeepPP offers a robust framework for protein sequence analysis.

This repository contains links to multiple task-specific PDeepPP models. These models are pre-trained or fine-tuned on publicly available datasets and are hosted on Hugging Face for easy access.


## Key Features

- **Flexible Architecture:** Combines self-attention and convolutional operations for robust feature extraction.
- **Task-Specific Models:** Includes pre-trained models for PTM prediction, bioactivity classification, and more.
- **Dataset Support:** Models are validated on benchmark datasets such as PTM and BPS, demonstrating performance on real-world tasks.
- **Extensibility:** Users can fine-tune the models on custom datasets for new tasks.

## Available Models

### General Models

### Task-Specific Models

#### Post-Translational Modifications (PTMs)

#### Bioactivity Prediction


## Model Architecture

PDeepPP is built on a hybrid architecture (a minimal sketch follows the list) that includes:

- **Self-Attention Global Features:** Captures long-range dependencies in protein sequences.
- **TransConv1d Module:** Combines transformer layers with convolutional layers for local feature extraction.
- **PosCNN Module:** Incorporates position-aware convolutional operations to enhance sequence representation.
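
To make these components concrete, here is a minimal, illustrative PyTorch sketch of the two named modules. The module names follow the list above, but the layer counts, dimensions, and wiring are assumptions rather than the released implementation:

```python
# Illustrative sketch only: module names follow the README, but sizes and
# wiring are assumptions, not the released PDeepPP implementation.
import torch
import torch.nn as nn

class TransConv1d(nn.Module):
    """Transformer encoder followed by a 1D convolution for local features."""
    def __init__(self, dim=128, heads=4, kernel_size=3):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.conv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                   # x: (batch, seq_len, dim)
        x = self.encoder(x)                 # global, long-range dependencies
        x = self.conv(x.transpose(1, 2))    # local motifs via convolution
        return x.transpose(1, 2)

class PosCNN(nn.Module):
    """Position-aware convolution: adds a learned positional signal before conv."""
    def __init__(self, dim=128, max_len=512, kernel_size=3):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(1, max_len, dim))
        self.conv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                   # x: (batch, seq_len, dim)
        x = x + self.pos[:, : x.size(1)]    # inject position information
        return self.conv(x.transpose(1, 2)).transpose(1, 2)

x = torch.randn(2, 64, 128)                 # (batch, seq_len, embedding_dim)
print(TransConv1d()(x).shape, PosCNN()(x).shape)  # both torch.Size([2, 64, 128])
```

In the full model these paths operate on pretrained ESM embeddings; see the source repository for the actual definitions.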

## How to Use

To use any of the models, install the required dependencies, such as `torch` and `transformers`:

```bash
# The --index-url below selects CUDA 11.8 wheels; omit it for a CPU-only install
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install transformers
```

Here's a quick example of how to load a model (usage details for models with specific biological features are given under Task-Specific Models):

```python
from transformers import AutoModel, AutoTokenizer

# Replace {task_type} with a concrete task name from the model list above
model_name = "fondress/PDeepPP_{task_type}"

# trust_remote_code is required because PDeepPP uses a custom architecture
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
```
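
Once loaded, a model can be used as a feature extractor. The sketch below continues from the snippet above and assumes the custom model follows the standard `transformers` forward interface and exposes `last_hidden_state`; the example sequence is arbitrary:

```python
import torch

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"      # arbitrary example sequence
inputs = tokenizer(sequence, return_tensors="pt")  # tokenizer loaded above

with torch.no_grad():
    outputs = model(**inputs)

# Assumption: per-residue embeddings are returned as last_hidden_state
embeddings = outputs.last_hidden_state             # (1, seq_len, hidden_dim)
```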

## Training and Customization

You can fine-tune PDeepPP for custom tasks using your own datasets. The model supports:

- **Custom PTM types:** Extend the model to predict additional post-translational modifications.
- **Sequence classification tasks:** Adapt the model to classify protein sequences based on custom labels.
- **Feature extraction for downstream analyses:** Use PDeepPP to generate embeddings for tasks like clustering or similarity calculation.

Refer to the `PDeepPPConfig` class in the source repository for details on available hyperparameters and customization options.
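
As a starting point, the sketch below attaches a simple classification head to PDeepPP embeddings for a hypothetical binary task (e.g., a new PTM type). The pooling choice, hidden size, and hyperparameters are illustrative assumptions, not values from `PDeepPPConfig`:

```python
# Hedged sketch of adapting PDeepPP embeddings to a custom binary task;
# the head and training step are illustrative, not part of the released code.
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Mean-pools per-residue embeddings and maps them to class logits."""
    def __init__(self, hidden_dim, num_classes=2):
        super().__init__()
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, embeddings):             # (batch, seq_len, hidden_dim)
        pooled = embeddings.mean(dim=1)        # average over sequence positions
        return self.fc(pooled)

# hidden_dim must match the loaded model's embedding size (assumption: 128)
head = ClassificationHead(hidden_dim=128)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on dummy data
dummy_emb = torch.randn(4, 33, 128)            # 4 sequences of length 33
labels = torch.randint(0, 2, (4,))
loss = criterion(head(dummy_emb), labels)
loss.backward()
optimizer.step()
```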


## Citation

If you use any of the PDeepPP models in your research, please cite the associated paper:

```bibtex
@article{your_reference,
  title={A general language model for peptide identification},
  author={Author Name},
  journal={Journal Name},
  year={2025}
}
```

Code: https://github.com/fondress/PDeepPP