Introduction

T-cell recognition of cancer neoepitopes derived from somatic mutations can lead to tumor regression. Because effective neoepitopes are difficult to identify, a database for sharing experimentally validated cancer neoantigens would benefit precision cancer immunotherapy. Meanwhile, routine in silico neoepitope prediction is important but laborious for clinical use. Here we present NEPdb, a database that contains more than 17,000 validated human immunogenic neoantigens and ineffective neoepitopes in the context of human leukocyte antigens (HLAs), curated from published literature with our semi-automatic pipeline. NEPdb also provides pan-cancer predicted HLA-I neoepitopes derived from 16,745 shared cancer somatic mutations, generated with state-of-the-art predictors. With a well-designed search engine and visualization modes, this database should improve the efficiency of neoantigen-based cancer studies and treatments.

Table 1. Statistics of the data included in NEPdb.

Data content        HLA-Ⅰ data    HLA-Ⅱ data    Total data
Entry (Total)       12239         5310          17549
Entry (Positive)    155           18            173
Entry (Negative)    12084         5292          17376
Tumor type          22            11            23
HLA allele          60            35            95
Gene                2063          811           2068
Protein sequence    2332          895           2337

Statistics

Validated Neopeptide Dataset

We manually gathered validated neoantigens and non-immunogenic peptides for this database. The statistics of the data included are shown in Table 1. Currently, the dataset comprises 173 immunogenic neoepitopes and 17,376 non-immunogenic peptides from human cancers, collected from 41 papers published in recent years. Most were tested with T cell assays in vitro, or in vivo through clinical vaccine immunization or T cell-based adoptive transfer.

Predicted Neopeptide Dataset

For pan-cancer neoepitope prediction, we selected 16,745 dominant non-synonymous mutations from COSMIC that fall in 683 cancer genes (Cancer Gene Census), lead to amino acid changes, and occur at least 3 times. We applied two state-of-the-art pMHC binding prediction algorithms (HLAthena and NetMHCpan 4.0) to predict the binding probability of each neopeptide in the pool against 95 HLA-I alleles (516,036 × 95 = 49,023,420 peptide-allele interactions in total). The HLA-Ⅰ distributions in the Validated Neopeptide Dataset (VND) and the Predicted Neopeptide Dataset (PND) are presented in the figure below.

Name                       Count
Cancer gene                683
Non-synonymous mutation    16745
Neopeptide                 516036
HLA class Ⅰ                95
Total prediction           49023420

Comparison of HLA distribution and peptide number for VND and PND. The number of peptides corresponding to each HLA-I allele is shown for the immunogenic neoantigen dataset (VND), the ineffective neoantigen dataset (VND), the HLAthena-predicted neoantigen dataset (PND), and the NetMHCpan-predicted neoantigen dataset (PND), respectively.
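
The way a single substitution expands into a pool of candidate neopeptides, and the pool into peptide-allele pairs, can be illustrated with a minimal Python sketch. The peptide lengths (8 to 11 residues), the function name `mutant_windows`, and the toy sequence are assumptions made for illustration; the text above does not state how the 516,036 neopeptides were enumerated. Only the two counts at the end are taken from the text.

```python
def mutant_windows(protein, pos, alt, lengths=(8, 9, 10, 11)):
    """Return all peptides of the given lengths that span a substituted residue.

    protein: wild-type amino acid sequence
    pos:     0-based index of the substituted residue
    alt:     the substituted amino acid
    lengths: peptide lengths to enumerate (8-11 is an assumption, not from NEPdb)
    """
    mutant = protein[:pos] + alt + protein[pos + 1:]
    peptides = set()
    for k in lengths:
        # A window [start, start + k) must contain pos and stay inside the protein.
        first = max(0, pos - k + 1)
        last = min(pos, len(mutant) - k)
        for start in range(first, last + 1):
            peptides.add(mutant[start:start + k])
    return peptides


# Toy sequence and substitution, for illustration only.
toy_protein = "MTEYKLVVVGAGGVGKSALTIQLIQ"
windows = mutant_windows(toy_protein, 11, "D")

# Pair count from the figures stated in the text.
N_NEOPEPTIDES = 516_036
N_ALLELES = 95
total_pairs = N_NEOPEPTIDES * N_ALLELES
print(len(windows), total_pairs)
```

Each enumerated peptide would then be scored against every allele by each predictor, which is why the number of predictions grows as the product of the two counts.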

Application

Figure. Overall performance of nine HLA class Ⅰ prediction algorithms (immunogenic data from NEPdb)

*The decimal above each bar represents the accuracy rate of the algorithm.

Nine commonly used peptide-MHC binding prediction algorithms were each evaluated on the positive samples from our Validated Neopeptide Dataset.

NetMHCcons 1.1, NetMHCpan 4.0, and HLAthena performed better than the others under this criterion.
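
One plausible reading of an accuracy rate computed on positive samples only is the fraction of validated immunogenic peptides that a predictor calls binders. The sketch below illustrates that reading; the percentile-rank cutoff of 2.0, the predictor names, and the scores are all hypothetical, and the exact criterion used for the figure is not stated here.

```python
def positive_call_rate(ranks, threshold=2.0):
    """Fraction of validated immunogenic peptides called binders.

    ranks:     predicted percentile ranks (lower means stronger predicted binding)
    threshold: rank cutoff for calling a binder (assumed value, not from NEPdb)
    """
    if not ranks:
        raise ValueError("no scores supplied")
    hits = sum(1 for r in ranks if r <= threshold)
    return hits / len(ranks)


# Hypothetical percentile ranks for the same five positive peptides.
example_ranks = {
    "predictor_A": [0.1, 0.8, 1.5, 3.2, 0.4],
    "predictor_B": [0.5, 2.5, 4.0, 6.1, 1.9],
}
rates = {name: positive_call_rate(r) for name, r in example_ranks.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.2f}")
```

Because only positive samples enter the calculation, this measure rewards sensitivity and says nothing about how many non-immunogenic peptides a predictor would also call binders.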