Abstract
Predicting protein–ligand binding affinity from three-dimensional (3D) structural data is a central task in structure-based drug discovery, yet it remains challenging due to limited data availability, structural complexity, and the sparse nature of 3D molecular representations. In this study, we investigate the application of Vision Transformers (ViTs) to affinity prediction. We evaluate our approach on two benchmark datasets, where its performance is competitive with, and in some cases surpasses, that of state-of-the-art models. We study the influence of hyperparameters and analyze model behavior using explainable AI (XAI) techniques. This analysis reveals that spatially proximal patches with similar attention scores cluster around biologically relevant regions, indicating that the model captures key interaction features. Furthermore, we show that data augmentation strategies can improve performance, underscoring the potential for further enhancement. Despite challenges related to data sparsity and conformational variability, ViTs show strong performance and robustness in structure-based affinity prediction. Our findings highlight their effectiveness in learning spatial patterns and suggest broader applicability to related tasks, such as protein–protein or protein–nucleic acid interaction modeling.