Data-Driven Models for Predicting Intrinsically Disordered Protein Polymer Physics Directly from Composition or Sequence

Tzu-Hsuan Chao; Shiv Rekhi; Jeetain Mittal; Daniel P. Tabor

doi:10.26434/chemrxiv-2023-wrnq1

Theoretical and Computational Chemistry


24 March 2023, Version 1

Working Paper


This content is a preprint and has not undergone peer review at the time of posting.

Abstract

Understanding intrinsically disordered proteins (IDPs) at the molecular level is challenging because they are difficult to characterize experimentally. Computational understanding of IDPs also requires fundamental advances, as the leading tools for predicting protein folding (e.g., AlphaFold) typically fail to describe the structural ensembles of IDPs. The focus of this paper is to 1) develop new representations for IDPs and 2) pair these representations with classical machine learning and deep learning models to predict the radius of gyration and scaling exponent of IDPs. Here, we build a new physically motivated feature called the bag-of-amino-acid-interactions, which encodes pairwise interactions explicitly into the representation. This feature essentially counts and weights all possible non-bonded interactions in a sequence and thus is, in principle, compatible with arbitrary sequence lengths. To evaluate this new feature, both categorical and physically motivated featurization techniques are tested on a computational dataset containing 10,000 sequences simulated at the coarse-grained level. The results indicate that this new feature outperforms the others and possesses solid extrapolation capabilities. In future work, this feature could provide physical insights into amino acid interactions, including their temperature dependence, and could be applied to other protein spaces.
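To make the idea of counting and weighting non-bonded pairwise interactions concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the weighting scheme or which pairs count as non-bonded, so the function name `bag_of_interactions`, the `min_separation` cutoff, and the `weight` callback are all hypothetical choices made for illustration.

```python
from itertools import combinations_with_replacement

# The 20 standard amino acids, in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Map each unordered residue-type pair to a fixed slot (20 * 21 / 2 = 210 slots).
PAIR_INDEX = {
    pair: k
    for k, pair in enumerate(combinations_with_replacement(AMINO_ACIDS, 2))
}


def bag_of_interactions(sequence, min_separation=2, weight=lambda sep: 1.0):
    """Return a fixed-length vector of weighted residue-pair counts.

    Assumptions (not taken from the preprint):
    - residues closer than `min_separation` along the chain are treated
      as bonded and skipped;
    - `weight` maps the sequence separation |i - j| to a contribution,
      defaulting to a plain count.
    Non-standard residue letters raise a KeyError.
    """
    features = [0.0] * len(PAIR_INDEX)
    n = len(sequence)
    for i in range(n):
        for j in range(i + min_separation, n):
            # Sort so that (A, D) and (D, A) land in the same slot.
            a, b = sorted((sequence[i], sequence[j]))
            features[PAIR_INDEX[(a, b)]] += weight(j - i)
    return features
```

For the four-residue sequence `ACDE` with the default settings, three pairs contribute (A-D, A-E, and C-E). The output always has 210 entries regardless of sequence length, which is the property that makes this kind of representation compatible with arbitrary sequence lengths.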

Keywords

Intrinsically Disordered Proteins

Protein Representations

Supplementary materials

Supporting Information: simulation snapshots, an illustration of each encoding method on one example sequence, and learning curves.

Supplementary weblinks

Link to GitHub Repository for Paper: code for models and generation of plots.


Version History

Mar 24, 2023 Version 1

Metrics

Views: 1,066

Downloads: 621

License

The content is available under CC BY NC ND 4.0


Funding

Welch Foundation: A-2049-20200401

Welch Foundation: A-2113-20220331

National Institutes of Health: R01GM136917

Author’s competing interest statement

The author(s) have declared they have no conflict of interest with regard to this content

Ethics

The author(s) have declared ethics committee/IRB approval is not relevant to this content
