NEPdb

Introduction

T-cell recognition of cancer neoepitopes derived from somatic mutations can lead to tumor regression. Because effective neoepitopes are difficult to identify, a database for sharing experimentally validated cancer neoantigens would benefit precision cancer immunotherapy. Meanwhile, routine in silico neoepitope prediction is important but laborious for clinical use. Here we present NEPdb, a database that contains more than 17,000 validated human immunogenic neoantigens and ineffective neoepitopes in the context of human leukocyte antigens (HLAs), curated from published literature with our semi-automatic pipeline. NEPdb also provides pan-cancer predicted HLA-I neoepitopes derived from 16,745 shared cancer somatic mutations, using state-of-the-art predictors. With a well-designed search engine and visualization modes, this database is intended to improve the efficiency of neoantigen-based cancer studies and treatments.

Data content	HLA-Ⅰ data	HLA-Ⅱ data	Total data
Entry (Total)	12239	5310	17549
Entry (Positive)	155	18	173
Entry (Negative)	12084	5292	17376
Tumor type	22	11	23
HLA allele	60	35	95
Gene	2063	811	2068
Protein sequence	2332	895	2337

Statistics

Validated Neopeptide Dataset

We manually gathered validated neoantigens and non-immunogenic peptides for this database; the statistics are shown in Table 1. Currently, the Validated Neopeptide Dataset (VND) comprises 173 immunogenic neoepitopes and 17,376 non-immunogenic peptides from human cancers, collected from 41 papers published in recent years. Most were tested with T cell assays in vitro, or in vivo through clinical vaccine immunization or T cell-based adoptive transfer.
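The entry counts in Table 1 can be cross-checked with a few lines of arithmetic. The sketch below is only a consistency check on the published numbers; the dictionary layout and the helper name `check_row` are illustrative choices, not part of NEPdb.

```python
# Entry counts copied from Table 1, in the order (HLA-I, HLA-II, Total).
table1 = {
    "total":    (12239, 5310, 17549),
    "positive": (155, 18, 173),
    "negative": (12084, 5292, 17376),
}

def check_row(hla1, hla2, total):
    """Return True when the two HLA class columns sum to the Total column."""
    return hla1 + hla2 == total

# Each row: HLA-I entries + HLA-II entries should equal the Total column.
rows_consistent = all(check_row(*row) for row in table1.values())

# Each column: positive + negative entries should equal all entries.
columns_consistent = all(
    p + n == t
    for p, n, t in zip(table1["positive"], table1["negative"], table1["total"])
)

# Share of validated entries that are immunogenic (positive).
positive_fraction = table1["positive"][2] / table1["total"][2]
print(rows_consistent, columns_consistent, f"{positive_fraction:.2%}")
```

The check applies only to the entry rows: the Tumor type, Gene and Protein sequence rows count distinct items that can appear under both HLA classes, so their columns are not expected to add up to the total.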

Predicted Neopeptide Dataset

For pan-cancer neoepitope prediction, we selected 16,745 dominant non-synonymous mutations from COSMIC that fall in 683 cancer genes (Cancer Gene Census), lead to amino acid changes, and occur at least 3 times. We applied two state-of-the-art pMHC binding prediction algorithms (HLAthena and NetMHCpan 4.0) to predict the binding probability of each neopeptide in the pool against 95 HLA-I alleles (a total of 516,036 × 95 = 49,023,420 interactions). The HLA-Ⅰ distributions of the Validated Neopeptide Dataset (VND) and the Predicted Neopeptide Dataset (PND) are presented in the figure below.

Name	Count
Cancer gene	683
Non-synonymous mutation	16745
Neopeptide	516036
HLA class Ⅰ	95
Total prediction	49023420
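As a rough illustration of the selection criteria and the prediction count described above, the following minimal sketch filters a handful of invented mutation records and then reproduces the total from the table. The `Mutation` record, the `select_mutations` helper, and the toy data are all hypothetical; the actual NEPdb pipeline and the COSMIC file format are not described here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Mutation:
    gene: str
    aa_change: str    # e.g. "G12D"; empty string when the mutation is synonymous
    occurrences: int  # how many times the mutation was observed

def select_mutations(mutations, cancer_genes, min_occurrences=3):
    """Keep amino-acid-changing mutations in listed cancer genes seen often enough."""
    return [
        m for m in mutations
        if m.gene in cancer_genes
        and m.aa_change
        and m.occurrences >= min_occurrences
    ]

# Toy records, invented for illustration only.
toy = [
    Mutation("KRAS", "G12D", 500),
    Mutation("KRAS", "", 40),       # synonymous: dropped
    Mutation("TP53", "R175H", 2),   # seen fewer than 3 times: dropped
    Mutation("GENE_X", "A1V", 10),  # not in the gene list: dropped
]
selected = select_mutations(toy, cancer_genes={"KRAS", "TP53"})

# Prediction count from the table: every neopeptide is paired with every allele.
n_neopeptides = 516_036
n_alleles = 95
total_predictions = n_neopeptides * n_alleles
print(len(selected), total_predictions)
```

The product matches the "Total prediction" row of the table (49,023,420).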

Comparison of HLA distribution and peptide number for VND and PND. The number of peptides corresponding to each HLA-I allele is shown for the immunogenic neoantigen dataset (VND), the ineffective neoantigen dataset (VND), the HLAthena-predicted neoantigen dataset (PND), and the NetMHCpan-predicted neoantigen dataset (PND).

Application

Figure. Overall performance of nine HLA class Ⅰ prediction algorithms (immunogenic data from NEPdb)

*The decimal above each bar represents the accuracy rate of the algorithm.

Nine commonly used peptide-MHC binding prediction algorithms were each evaluated on the positive samples from the Validated Neopeptide Dataset.

NetMHCcons 1.1, NetMHCpan 4.0 and HLAthena performed better than the others under this criterion.
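One plausible way to compute such an accuracy on positive samples is the fraction of validated immunogenic peptide-HLA pairs that a predictor calls binders. The sketch below assumes each predictor reports a percentile rank and that a rank of 2.0 or lower counts as a predicted binder; the page does not state the exact criterion, so the cutoff, the function name `positive_accuracy`, and the toy ranks are all assumptions.

```python
def positive_accuracy(ranks, rank_threshold=2.0):
    """Fraction of positive peptide-HLA pairs whose percentile rank passes the cutoff.

    `ranks` holds one predicted percentile rank per validated immunogenic pair
    (lower rank means stronger predicted binding). The 2.0 cutoff is an assumption.
    """
    if not ranks:
        raise ValueError("no positive samples to evaluate")
    called_binders = sum(1 for r in ranks if r <= rank_threshold)
    return called_binders / len(ranks)

# Hypothetical ranks for five positive samples under two made-up predictors.
toy_ranks = {
    "predictor_a": [0.1, 0.8, 1.9, 3.5, 12.0],
    "predictor_b": [0.5, 2.5, 4.0, 6.0, 0.2],
}
accuracy = {name: positive_accuracy(r) for name, r in toy_ranks.items()}

# List predictors from best to worst under this criterion.
for name, acc in sorted(accuracy.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.2f}")
```

Because only positive samples are scored, this quantity is the recall of each predictor on known immunogenic peptides; it says nothing about how many non-immunogenic peptides would also pass the cutoff.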

A database of T-cell experimentally-validated neoantigens and pan-cancer predicted neoepitopes for cancer immunotherapy
