Abstract
Enzymes enable sustainable and environmentally friendly solutions in industrial biocatalysis, bioremediation, and biosensing. Evolutionary data have proven pivotal for narrowing down the vast sequence space during enzyme optimization. However, capturing all important dependencies among residues is challenging due to the nonlinear influence of coevolution at individual positions. To overcome this challenge, deep learning methods are being actively trained on protein sequence data. While they have demonstrated incredible capacity to grasp protein evolution, the strategies to leverage this information for the design of promising biocatalysts remain largely unexplored. Here, we introduce evolutionary trajectories generated by a generative deep-learning framework of variational autoencoders. We optimized and utilized this framework and the latent space geometry to produce a set of deep-learning-based ancestral sequences of model enzymes haloalkane dehalogenases. The generated novel proteins were expressed and experimentally characterized, showing stability and activity at the level of the wild type for soluble variants. We also identified a major limitation: the sequences distant from the template tend to accumulate many insertions and deletions, known to compromise protein solubility. Taking this limitation into account, we demonstrate that the geometry of the latent space, together with the generative potential of variational autoencoders, can be used for diversification of natural protein sequences.