Abstract
Amyloid-like nanofibers from self-assembling peptides can promote viral gene transfer for therapeutic applications. Traditionally, new sequences are discovered either by screening large libraries or by creating derivatives of known active peptides. However, the discovery of de novo peptides, which are unrelated in sequence to any known active peptide, is limited by the difficulty of rationally predicting structure–activity relationships, because their activities typically have multi-scale and multi-parameter dependencies. Here, we used a small library of 163 peptides to predict de novo sequences for viral infectivity enhancement using a machine learning (ML) approach based on natural language processing. Specifically, we trained an ML model using continuous vector representations of the peptides, which were previously shown to retain relevant information embedded in the sequences. We used the trained ML model to sample the sequence space of peptides with 6 amino acids to identify promising candidates. These 6-mers were then further screened for charge and aggregation propensity. The resulting 16 new 6-mers were tested, giving a hit rate of 25%. Strikingly, these de novo sequences are the shortest active peptides for infectivity enhancement reported so far and show no sequence relation to the training set. Moreover, by screening the chemical space, we discovered the first hydrophobic peptide fibrils with a moderately negative surface charge that can enhance infectivity. Hence, this ML strategy is a time- and cost-efficient way to expand the chemical space of short functional self-assembling peptides, exemplified here for therapeutic viral gene delivery.
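The sample-then-filter workflow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_score` is a hypothetical placeholder for the trained regression model, the charge table is a deliberately crude assumption (side chains of K, R, D, and E only, with termini and histidine ignored), and the aggregation-propensity screen is omitted. The `three_grams` helper only shows the overlapping 3-residue units on which ProtVec-style embeddings are defined.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Assumption: integer side-chain charges near neutral pH; His and termini ignored.
CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}


def three_grams(seq):
    """Return the overlapping 3-grams of a sequence."""
    return [seq[i:i + 3] for i in range(len(seq) - 2)]


def net_charge(seq):
    """Sum the simplified side-chain charges of a peptide."""
    return sum(CHARGE.get(aa, 0) for aa in seq)


def toy_score(seq):
    """Hypothetical stand-in for a trained activity model.

    It simply counts hydrophobic residues and carries no predictive meaning.
    """
    return sum(aa in "AVILMFWY" for aa in seq)


def sample_candidates(score, n_samples, length=6, top_k=10, seed=0):
    """Draw random peptides, keep those with net positive charge, rank by score."""
    rng = random.Random(seed)
    seen = set()
    for _ in range(n_samples):
        seen.add("".join(rng.choice(AMINO_ACIDS) for _ in range(length)))
    # Sort first so that ties in the score are broken deterministically.
    positive = [s for s in sorted(seen) if net_charge(s) > 0]
    return sorted(positive, key=score, reverse=True)[:top_k]


if __name__ == "__main__":
    for peptide in sample_candidates(toy_score, n_samples=5000):
        print(peptide, net_charge(peptide), toy_score(peptide))
```

In a real pipeline the scoring function would be a regression model evaluated on the peptide embedding, and the surviving candidates would still need to pass an aggregation-propensity screen before synthesis.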
Supplementary materials
Title
Inverse design of viral infectivity-enhancing peptide fibrils from continuous protein-vector embeddings
Description
The Supporting Information contains 17 pages (Sections 1–11) with Figures S1–S9 and Tables S1 and S2. Section 1 provides further information on the trained regression model; Sections 2 and 3 describe the selection process based on aggregation prediction and N-gram similarity; Section 4 evaluates the property-activity correlation of the de novo peptides and the training set. Sections 5–8 summarize detailed experimental data: TEM, infection rates, cell viability, and FT-IR spectra. Table S1 summarizes the infection data and the predicted and experimental aggregation behavior of the training set; Table S2 summarizes the complete physicochemical properties of the de novo peptides.
Title
Table S3
Description
Top 12320 sequences from Monte Carlo ProtVec LASSO model screening with information on predicted infectivity, hydrophobicity, and net charge.
Title
Table S4
Description
Top 3669 peptides with a net positive charge, with information on aggregation prediction results from Aggrescan, APPNN, and PATH.
Title
Table S5
Description
N-gram similarity matrix composed of the top 3669 peptides and the 163 peptides from the training set.