---
license: mit
library_name: transformers
pipeline_tag: feature-extraction
---

# PDeepPP: A Comprehensive Protein Language Model Hub

This repository contains the model as presented in [A general language model for peptide identification](https://huggingface.co/papers/2502.15610).

PDeepPP is a hybrid protein language model designed to predict post-translational modification (PTM) sites, analyze biologically relevant features, and support a wide range of protein sequence analysis tasks. This repository serves as the central hub for accessing and exploring the specialized PDeepPP models, each fine-tuned for a specific task such as PTM site prediction or bioactivity classification.

All models in this repository can be used with their corresponding datasets, which are available in the [PDeepPP GitHub repository](https://github.com/fondress/PDeepPP/tree/main).

---

## Overview

PDeepPP integrates state-of-the-art transformer-based self-attention mechanisms with convolutional neural networks (CNNs) to capture both global and local features in protein sequences. By leveraging pretrained embeddings from `ESM` and incorporating modular architecture components, PDeepPP offers a robust framework for protein sequence analysis.
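
For reference, per-residue ESM embeddings of the kind PDeepPP builds on can be produced directly with `transformers`. The sketch below uses the public `facebook/esm2_t6_8M_UR50D` checkpoint purely as an illustration; the exact ESM variant used to pretrain PDeepPP is not specified here.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Small public ESM-2 checkpoint, used here only for illustration.
esm_name = "facebook/esm2_t6_8M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(esm_name)
esm = AutoModel.from_pretrained(esm_name)

sequence = "MKTAYIAKQRQISFVKSHFSRQ"  # toy protein sequence
inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    # Per-residue embeddings: (1, seq_len + special tokens, 320)
    per_residue = esm(**inputs).last_hidden_state
```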

This repository contains links to multiple task-specific PDeepPP models. These models are pre-trained or fine-tuned on publicly available datasets and are hosted on Hugging Face for easy access.

---

## Key Features

- **Flexible Architecture**: Combines self-attention and convolutional operations for robust feature extraction.
- **Task-Specific Models**: Includes pre-trained models for PTM prediction, bioactivity classification, and more.
- **Dataset Support**: Models are validated on datasets such as PTM and BPS, ensuring performance on real-world tasks.
- **Extensibility**: Users can fine-tune the models on custom datasets for new tasks.

---

## Available Models

### General Models
- [PDeepPP Main](https://huggingface.co/fondress/PDeepPP)

### Task-Specific Models

#### Post-Translational Modifications (PTMs)
- [PDeepPP Phosphorylation (Serine)](https://huggingface.co/fondress/PDeepPP_Phosphoserine)
- [PDeepPP Phosphorylation (Tyrosine)](https://huggingface.co/fondress/PDeepPP_Phosphorylation-Y)
- [PDeepPP Glycosylation (N-linked)](https://huggingface.co/fondress/PDeepPP_N-linked-glycosylation-N)
- [PDeepPP Glycosylation (O-linked)](https://huggingface.co/fondress/PDeepPP_O-linked-glycosylation)
- [PDeepPP Methylation (Lysine)](https://huggingface.co/fondress/PDeepPP_Methylation-K)
- [PDeepPP Methylation (Arginine)](https://huggingface.co/fondress/PDeepPP_Methylation-R)
- [PDeepPP SUMOylation](https://huggingface.co/fondress/PDeepPP_SUMOylation)
- [PDeepPP Ubiquitin](https://huggingface.co/fondress/PDeepPP_Ubiquitin)
- [PDeepPP S-Palmitoylation](https://huggingface.co/fondress/PDeepPP_S-Palmitoylation)
- [PDeepPP N6-acetyllysine(K)](https://huggingface.co/fondress/PDeepPP_N6-acetyllysine-K)
- [PDeepPP Hydroxyproline (P)](https://huggingface.co/fondress/PDeepPP_Hydroxyproline-P)
- [PDeepPP Hydroxyproline (K)](https://huggingface.co/fondress/PDeepPP_Hydroxyproline-K)
- [PDeepPP Pyrrolidone-carboxylic-acid (Q)](https://huggingface.co/fondress/PDeepPP_Pyrrolidone-carboxylic-acid-Q)

#### Bioactivity Prediction
- [PDeepPP ACE](https://huggingface.co/fondress/PDeepPP_ACE)
- [PDeepPP BBP](https://huggingface.co/fondress/PDeepPP_BBP)
- [PDeepPP DPPIV](https://huggingface.co/fondress/PDeepPP_DPPIV)
- [PDeepPP Toxicity](https://huggingface.co/fondress/PDeepPP_Toxicity)
- [PDeepPP Antimalarial (Main)](https://huggingface.co/fondress/PDeepPP_Antimalarial-main)
- [PDeepPP Antimalarial (Alternative)](https://huggingface.co/fondress/PDeepPP_Antimalarial-alternative)
- [PDeepPP Anticancer (Main)](https://huggingface.co/fondress/PDeepPP_Anticancer-main)
- [PDeepPP Anticancer (Alternative)](https://huggingface.co/fondress/PDeepPP_Anticancer-alternative)
- [PDeepPP Antiviral](https://huggingface.co/fondress/PDeepPP_Antiviral)
- [PDeepPP Antioxidant](https://huggingface.co/fondress/PDeepPP_Antioxidant)
- [PDeepPP Antibacterial](https://huggingface.co/fondress/PDeepPP_Antibacterial)
- [PDeepPP Antifungal](https://huggingface.co/fondress/PDeepPP_Antifungal)
- [PDeepPP Antimicrobial](https://huggingface.co/fondress/PDeepPP_Antimicrobial)
- [PDeepPP Anti-MRSA](https://huggingface.co/fondress/PDeepPP_Anti-MRSA)
- [PDeepPP Antiparasitic](https://huggingface.co/fondress/PDeepPP_Antiparasitic)
- [PDeepPP Bitter](https://huggingface.co/fondress/PDeepPP_bitter)
- [PDeepPP Umami](https://huggingface.co/fondress/PDeepPP_umami)
- [PDeepPP Neuro](https://huggingface.co/fondress/PDeepPP_neuro)
- [PDeepPP Quorum](https://huggingface.co/fondress/PDeepPP_Quorum)
- [PDeepPP TTCA](https://huggingface.co/fondress/PDeepPP_TTCA)

---

## Model Architecture

PDeepPP is built on a hybrid architecture that includes:

- **Self-Attention Global Features**: Captures long-range dependencies in protein sequences.
- **TransConv1d Module**: Combines transformer layers with convolutional layers for local feature extraction.
- **PosCNN Module**: Incorporates position-aware convolutional operations to enhance sequence representation.
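
To make the division of labor concrete, here is a minimal, illustrative PyTorch sketch of how a transformer encoder branch and a position-aware convolutional branch might be combined. The module names mirror the description above, but the layer sizes and wiring are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn

class TransConv1dSketch(nn.Module):
    """Illustrative hybrid block: self-attention for global context,
    followed by a 1D convolution for local features (assumed wiring)."""
    def __init__(self, dim=128, heads=4, kernel_size=3):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.conv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                    # x: (batch, seq_len, dim)
        g = self.encoder(x)                  # global, long-range features
        l = self.conv(g.transpose(1, 2))     # local features along the sequence axis
        return l.transpose(1, 2)             # back to (batch, seq_len, dim)

class PosCNNSketch(nn.Module):
    """Illustrative position-aware CNN: adds a learned positional embedding
    before convolution (assumed design)."""
    def __init__(self, dim=128, max_len=512, kernel_size=3):
        super().__init__()
        self.pos = nn.Embedding(max_len, dim)
        self.conv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                    # x: (batch, seq_len, dim)
        idx = torch.arange(x.size(1), device=x.device)
        h = x + self.pos(idx)                # inject position information
        return self.conv(h.transpose(1, 2)).transpose(1, 2)
```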

---

## How to Use

To use any of the models, you need to install the required dependencies, such as `torch` and `transformers`:

```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install transformers
```
Here is a quick example of how to load a model (the checkpoints for specific biological features are listed under Available Models above):

```python
from transformers import AutoModel, AutoTokenizer

# Load the model.
# Replace {task_type} with one of the task-specific suffixes listed above,
# e.g. "fondress/PDeepPP_ACE" or "fondress/PDeepPP_Phosphoserine".
model_name = "fondress/PDeepPP_{task_type}"
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
```
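
Once loaded, the model can be used to embed peptide sequences. The snippet below is a minimal sketch, assuming the checkpoint ships a compatible tokenizer and that the custom model code returns token-level embeddings as `last_hidden_state`; if the remote code exposes a different output attribute, adjust accordingly. The example task and sequence are likewise illustrative.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative task-specific checkpoint and toy peptide sequence.
model_name = "fondress/PDeepPP_ACE"
sequence = "MKTAYIAKQRQISFVKSHFSRQ"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
model.eval()

inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Assumption: the remote code exposes per-residue embeddings as `last_hidden_state`.
embeddings = outputs.last_hidden_state
print(embeddings.shape)  # (batch, sequence_length, hidden_dim)
```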

## Training and Customization

You can fine-tune PDeepPP for custom tasks using your own datasets. The model supports:

- **Custom PTM types**: Extend the model to predict additional post-translational modifications.
- **Sequence classification tasks**: Adapt the model to classify protein sequences based on custom labels.
- **Feature extraction for downstream analyses**: Use PDeepPP to generate embeddings for tasks like clustering or similarity calculation.

Refer to the `PDeepPPConfig` class in the source repository for details on available hyperparameters and customization options.
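
As one illustration of the extensibility points above, the sketch below attaches a simple classification head to a loaded backbone for a binary sequence-classification task. The mean-pooling strategy, the head, and the `hidden_size`/`last_hidden_state` attribute names are assumptions for demonstration, not part of `PDeepPPConfig`.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class PDeepPPClassifierSketch(nn.Module):
    """Hypothetical wrapper: mean-pools backbone embeddings and adds a linear head."""
    def __init__(self, model_name="fondress/PDeepPP", num_labels=2):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(model_name, trust_remote_code=True)
        hidden = self.backbone.config.hidden_size  # assumes a standard config attribute
        self.head = nn.Linear(hidden, num_labels)

    def forward(self, **inputs):
        hidden_states = self.backbone(**inputs).last_hidden_state  # assumed output name
        pooled = hidden_states.mean(dim=1)  # simple mean pooling over residues
        return self.head(pooled)            # logits for each custom label
```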

---
## Citation

If you use any of the PDeepPP models in your research, please cite the associated paper:

```
@article{your_reference,
  title={A general language model for peptide identification},
  author={Author Name},
  journal={Journal Name},
  year={2025}
}
```

Code: [https://github.com/fondress/PDeepPP](https://github.com/fondress/PDeepPP)