license: mit
library_name: transformers
pipeline_tag: feature-extraction
PDeepPP: A Comprehensive Protein Language Model Hub
This repository contains the PDeepPP models presented in the paper "A general language model for peptide identification".
PDeepPP is a hybrid protein language model designed to predict post-translational modification (PTM) sites, analyze biologically relevant features, and support a wide range of protein sequence analysis tasks. This repository serves as the central hub for accessing and exploring various specialized PDeepPP models, each fine-tuned for specific tasks, such as PTM site prediction, bioactivity analysis, and more.
Each model in this repository can be used with its corresponding dataset, available on GitHub: https://github.com/fondress/PDeepPP/tree/main
Overview
PDeepPP integrates transformer-based self-attention mechanisms with convolutional neural networks (CNNs) to capture both global and local features in protein sequences. By leveraging pretrained embeddings from ESM and incorporating modular architecture components, PDeepPP offers a robust framework for protein sequence analysis.
This repository contains links to multiple task-specific PDeepPP models. These models are pre-trained or fine-tuned on publicly available datasets and are hosted on Hugging Face for easy access.
Key Features
- Flexible Architecture: Combines self-attention and convolutional operations for robust feature extraction.
- Task-Specific Models: Includes pre-trained models for PTM prediction, bioactivity classification, and more.
- Dataset Support: Models are validated on datasets such as PTM and BPS, ensuring performance on real-world tasks.
- Extensibility: Users can fine-tune the models on custom datasets for new tasks.
Available Models
General Models
Task-Specific Models
Post-Translational Modifications (PTMs)
- PDeepPP Phosphorylation (Serine)
- PDeepPP Phosphorylation (Tyrosine)
- PDeepPP Glycosylation (N-linked)
- PDeepPP Glycosylation (O-linked)
- PDeepPP Methylation (Lysine)
- PDeepPP Methylation (Arginine)
- PDeepPP SUMOylation
- PDeepPP Ubiquitin
- PDeepPP S-Palmitoylation
- PDeepPP N6-acetyllysine (K)
- PDeepPP Hydroxyproline (P)
- PDeepPP Hydroxyproline (K)
- PDeepPP Pyrrolidone-carboxylic-acid (Q)
Bioactivity Prediction
- PDeepPP ACE
- PDeepPP BBP
- PDeepPP DPPIV
- PDeepPP Toxicity
- PDeepPP Antimalarial (Main)
- PDeepPP Antimalarial (Alternative)
- PDeepPP Anticancer (Main)
- PDeepPP Anticancer (Alternative)
- PDeepPP Antiviral
- PDeepPP Antioxidant
- PDeepPP Antibacterial
- PDeepPP Antifungal
- PDeepPP Antimicrobial
- PDeepPP Anti-MRSA
- PDeepPP Antiparasitic
- PDeepPP Bitter
- PDeepPP Umami
- PDeepPP Neuro
- PDeepPP Quorum
- PDeepPP TTCA
Model Architecture
PDeepPP is built on a hybrid architecture that includes:
- Self-Attention Global Features: Captures long-range dependencies in protein sequences.
- TransConv1d Module: Combines transformer layers with convolutional layers for local feature extraction.
- PosCNN Module: Incorporates position-aware convolutional operations to enhance sequence representation.
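The idea of pairing global self-attention with local convolution can be illustrated with a toy example in plain Python. This is a minimal sketch under strong assumptions, not the PDeepPP implementation: the attention has no learned projections, the convolution uses a fixed averaging kernel, and the function names `self_attention`, `local_average_conv`, and `hybrid_features` are illustrative only. The actual TransConv1d and PosCNN modules are defined in the source repository.

```python
import math
from typing import List

Vector = List[float]


def self_attention(seq: List[Vector]) -> List[Vector]:
    """Scaled dot-product self-attention with queries = keys = values = seq."""
    if not seq:
        return []
    d = len(seq[0])
    out = []
    for q in seq:
        # Similarity of this position to every position in the sequence.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in seq]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        # Weighted average of all positions: every output sees the whole sequence.
        out.append([sum(w / total * v[j] for w, v in zip(weights, seq)) for j in range(d)])
    return out


def local_average_conv(seq: List[Vector], kernel_size: int = 3) -> List[Vector]:
    """1D convolution with a uniform kernel and zero padding (same output length)."""
    if kernel_size % 2 == 0:
        raise ValueError("kernel_size must be odd")
    if not seq:
        return []
    half, d = kernel_size // 2, len(seq[0])
    out = []
    for i in range(len(seq)):
        acc = [0.0] * d
        for j in range(i - half, i + half + 1):
            if 0 <= j < len(seq):  # positions outside the sequence count as zeros
                for c in range(d):
                    acc[c] += seq[j][c] / kernel_size
        out.append(acc)
    return out


def hybrid_features(seq: List[Vector], kernel_size: int = 3) -> List[Vector]:
    """Concatenate global (attention) and local (convolution) features per position."""
    global_feats = self_attention(seq)
    local_feats = local_average_conv(seq, kernel_size)
    return [g + l for g, l in zip(global_feats, local_feats)]


# Made-up 2-dimensional "residue embeddings" for a 4-residue sequence.
toy_sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
for features in hybrid_features(toy_sequence):
    print([round(x, 3) for x in features])
```

Each output row has twice the input width: the first half mixes information from every position, while the second half depends only on a position's immediate neighbours.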
How to Use
To use any of the models, install the required dependencies, such as torch and transformers:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install transformers
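After installation, a quick check that both packages are importable can save debugging time later. The following is a minimal sketch using only the Python standard library; the helper name `missing_packages` is illustrative and not part of PDeepPP.

```python
import importlib.util
from typing import Iterable, List


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the top-level packages from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(["torch", "transformers"])
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```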
Here is a quick example of how to load a model (usage of models for specific biological features is covered under Task-Specific Models):
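from transformers import AutoModel

# Replace {task_type} with the task suffix of the model you want to load
model_name = "fondress/PDeepPP_{task_type}"
# trust_remote_code=True lets transformers run the custom model code shipped in the model repository
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)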
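Because the `{task_type}` placeholder has to be filled in before loading, it may help to build the repository name in one place. The sketch below assumes the naming pattern shown above; the helpers `pdeeppp_repo_id` and `load_pdeeppp` and the task suffix `"example_task"` are hypothetical, so check the model pages for the actual suffixes.

```python
def pdeeppp_repo_id(task_type: str, owner: str = "fondress") -> str:
    """Build a Hub repository ID following the "fondress/PDeepPP_{task_type}" pattern."""
    task = task_type.strip()
    if not task or "/" in task or any(ch.isspace() for ch in task):
        raise ValueError(f"invalid task type: {task_type!r}")
    return f"{owner}/PDeepPP_{task}"


def load_pdeeppp(task_type: str):
    """Load a PDeepPP model for one task (needs transformers and network access)."""
    # Imported lazily so that pdeeppp_repo_id works even without transformers installed.
    from transformers import AutoModel

    return AutoModel.from_pretrained(pdeeppp_repo_id(task_type), trust_remote_code=True)


# "example_task" is a made-up placeholder, not a real model suffix.
print(pdeeppp_repo_id("example_task"))
```

Validating the suffix up front turns a typo into an immediate `ValueError` rather than a failed download.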
Training and Customization
You can fine-tune PDeepPP for custom tasks using your own datasets. The model supports:
- Custom PTM types: Extend the model to predict additional post-translational modifications.
- Sequence classification tasks: Adapt the model to classify protein sequences based on custom labels.
- Feature extraction for downstream analyses: Use PDeepPP to generate embeddings for tasks like clustering or similarity calculation.
Refer to the PDeepPPConfig class in the source repository for details on available hyperparameters and customization options.
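For the similarity use case mentioned above, cosine similarity is a common choice once embeddings have been extracted. The following is a minimal sketch in plain Python; the two vectors are made-up placeholders rather than PDeepPP outputs, and how embeddings are obtained from a given model depends on that model's own interface.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Placeholder embeddings standing in for two peptide representations.
embedding_a = [0.2, 0.7, 0.1]
embedding_b = [0.3, 0.6, 0.0]
print(round(cosine_similarity(embedding_a, embedding_b), 4))
```

Values close to 1 indicate vectors pointing in nearly the same direction; the same function can feed a pairwise similarity matrix for clustering.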
Citation
If you use any of the PDeepPP models in your research, please cite the associated paper:
@article{your_reference,
title={A general language model for peptide identification},
author={Author Name},
journal={Journal Name},
year={2025}
}