Abstract
Molecular property prediction has become essential in accelerating advancements in drug discovery and materials science. Graph Neural Networks have recently demonstrated remarkable success in molecular representation learning; however, their broader adoption
is impeded by two significant challenges: (1) data scarcity and constrained model generalization due to the expensive and time-consuming task of acquiring labeled data, and (2) inadequate initial node and edge features that fail to incorporate comprehensive chemical domain knowledge, notably orbital information. To address these limitations, we introduce a Knowledge-Guided Graph (KGG) framework employing self-supervised learning to pre-train models using orbital-level features in order to mitigate reliance on
extensive labeled datasets. In addition, we propose novel representations for atomic hybridization and bond types that explicitly consider orbital engagement. Our pre-training strategy is cost-efficient, utilizing approximately 250,000 molecules from the ZINC15 dataset, in contrast to contemporary approaches that typically require between two and ten million molecules, consequently reducing
the risk of data contamination. Extensive evaluations on diverse downstream molecular property datasets demonstrate that our method significantly outperforms state-of-the-art baselines. Complementary analyses, including t-SNE visualizations and comparisons
with traditional molecular fingerprints, further validate the effectiveness and robustness of our proposed KGG approach.
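For illustration, the sketch below shows one way orbital-aware atom and bond descriptors of the kind described above could be derived with RDKit. The orbital vocabularies, feature layout, and the featurize helper are illustrative assumptions, not the exact KGG encoding.

```python
# Minimal sketch (illustrative assumption, not the exact KGG encoding) of
# orbital-aware atom and bond descriptors built with RDKit.
from rdkit import Chem
from rdkit.Chem.rdchem import BondType, HybridizationType

# Hypothetical mapping: hybridization state -> (s, p, d) orbitals engaged.
HYBRIDIZATION_ORBITALS = {
    HybridizationType.SP:    (1, 1, 0),
    HybridizationType.SP2:   (1, 2, 0),
    HybridizationType.SP3:   (1, 3, 0),
    HybridizationType.SP3D:  (1, 3, 1),
    HybridizationType.SP3D2: (1, 3, 2),
}

# Hypothetical mapping: bond type -> (sigma, pi) bonds contributed; aromatic
# bonds are treated here as one sigma plus half a pi bond.
BOND_ORBITALS = {
    BondType.SINGLE:   (1.0, 0.0),
    BondType.DOUBLE:   (1.0, 1.0),
    BondType.TRIPLE:   (1.0, 2.0),
    BondType.AROMATIC: (1.0, 0.5),
}

def featurize(smiles: str):
    """Return per-atom and per-bond orbital-style features for one molecule."""
    mol = Chem.MolFromSmiles(smiles)
    atom_feats = [
        (a.GetAtomicNum(),) + HYBRIDIZATION_ORBITALS.get(a.GetHybridization(), (0, 0, 0))
        for a in mol.GetAtoms()
    ]
    bond_feats = [
        (b.GetBeginAtomIdx(), b.GetEndAtomIdx())
        + BOND_ORBITALS.get(b.GetBondType(), (0.0, 0.0))
        for b in mol.GetBonds()
    ]
    return atom_feats, bond_feats

atoms, bonds = featurize("c1ccccc1O")  # phenol
print(atoms[0], bonds[0])
```

In this sketch each atom carries its atomic number plus the orbitals engaged by its hybridization state, and each bond carries its endpoints plus sigma/pi contributions; the actual KGG features may differ in vocabulary and dimensionality.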
Supplementary materials
Title
KGG: Knowledge-Guided Graph Self-Supervised Learning to Enhance Molecular Property Predictions
Description
Knowledge-Guided Graph (KGG) introduces orbital-aware atomic-hybridization and bond-type encodings, enabling cost-efficient self-supervised pre-training. Our framework mitigates reliance on extensive labeled datasets and reduces the risk of data contamination. Comprehensive benchmarks across diverse molecular-property tasks on MoleculeNet public datasets, alongside t-SNE visualization and comparisons with traditional fingerprints, confirm that KGG consistently
surpasses contemporary self-supervised learning baselines in effectiveness and robustness.
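As a hedged illustration of the fingerprint comparison and t-SNE analysis mentioned above, the snippet below embeds a few placeholder molecules with Morgan fingerprints and projects them to two dimensions; the molecules, fingerprint parameters, and perplexity are assumptions, not the paper's settings.

```python
# Illustrative sketch (assumed workflow, not the paper's exact setup): compare
# molecules via traditional Morgan fingerprints projected with t-SNE.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.manifold import TSNE

smiles_list = ["CCO", "c1ccccc1", "CC(=O)O", "CCN(CC)CC"]  # placeholder molecules

fps = []
for smi in smiles_list:
    mol = Chem.MolFromSmiles(smi)
    bv = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
    fps.append(np.array(bv, dtype=np.float32))
X = np.vstack(fps)

# Perplexity must stay below the number of samples; it is tiny here only
# because the toy set has four molecules.
coords = TSNE(n_components=2, perplexity=2.0, random_state=0).fit_transform(X)
print(coords.shape)  # (4, 2)
```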