Abstract
Deep generative models are increasingly crucial in de novo drug design, where they excel at rapidly exploring vast chemical spaces. By integrating techniques such as adversarial networks, reinforcement learning, and transfer learning, these models can effectively leverage both public datasets and collected experimental data. However, accurately assessing the generalization ability of these models for drug discovery applications while minimizing cost remains challenging. An accurate yet cost-effective solution would provide substantial benefits to both academia and industry. In this study, we propose three accuracy-based methods for predicting the theoretical coverage of generative models and classify them into three cost levels: low, medium, and high. For all of these methods, the derivative of a unique curve with respect to sample iterations can serve as the most cost-effective yet reliable metric for evaluating generalization ability. To address sampling non-uniformity, we propose a novel double-parameter mathematical model that accurately fits both experimental and theoretical coverage across various generative architectures. Furthermore, the developed model provides qualitative insights into how transfer learning and reinforcement learning influence the performance of generative models, by examining the changes that result from increased non-uniformity and enhanced probabilities of sampling target molecules.
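To make the derivative-based metric concrete, the following is a minimal sketch under stated assumptions, not the method of this study. It assumes the "unique curve" is the cumulative number of distinct molecules plotted against the number of samples drawn, and it replaces a real generative model with a toy power-law sampler over integer molecule IDs. The function names (`unique_curve`, `curve_slope`, `simulate_samples`) and the `skew` parameter are hypothetical.

```python
import random


def unique_curve(samples):
    """Cumulative count of distinct items seen after each sample."""
    seen = set()
    curve = []
    for item in samples:
        seen.add(item)
        curve.append(len(seen))
    return curve


def curve_slope(curve, window):
    """Finite-difference slope of the curve over its last `window` samples.

    A slope near 1 means almost every new sample is still novel;
    a slope near 0 means the sampler is mostly repeating itself.
    """
    if window <= 0 or window >= len(curve):
        raise ValueError("window must satisfy 0 < window < len(curve)")
    return (curve[-1] - curve[-1 - window]) / window


def simulate_samples(n_molecules, n_samples, skew, seed=0):
    """Toy stand-in for a generative model (an assumption of this sketch).

    Molecule IDs are drawn with probability proportional to
    1 / (rank + 1) ** skew, so a larger `skew` gives a less uniform sampler.
    """
    rng = random.Random(seed)
    weights = [1.0 / (rank + 1) ** skew for rank in range(n_molecules)]
    return rng.choices(range(n_molecules), weights=weights, k=n_samples)


if __name__ == "__main__":
    for skew in (0.0, 1.0):
        samples = simulate_samples(n_molecules=5000, n_samples=5000, skew=skew)
        curve = unique_curve(samples)
        print(f"skew={skew}: unique={curve[-1]}, "
              f"late slope={curve_slope(curve, 500):.3f}")
```

In this toy setting, the late-stage slope can be read without enumerating the full space, which is the sense in which a derivative of the curve is a cheap signal; how it maps to the coverage estimates proposed here is not specified in this abstract.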