Abstract
Physical molecular models are widely used in educational settings for teaching organic chemistry and other branches of chemistry, offering an intuitive understanding of molecular structures. Virtual models, while less intuitive, provide additional functionality, such as retrieving molecular names and other properties. To the best of our knowledge, there is currently a gap between physical 3D molecular models and their digital counterparts. This paper introduces a computer vision model designed to bridge this gap by converting images of physical molecular models into digital DeepSMILES representations. This conversion facilitates further information retrieval, enhancing educational utility. We developed synthetic and real datasets to train our model and evaluated its performance across various dataset combinations. Additionally, we attempted to improve the model’s accuracy using multi-image input and beam search. With both techniques combined, we achieved 62.0% top-1 accuracy and 80.3% top-3 accuracy on our validation set. We also explored the model’s characteristics, including its explainability via saliency maps and its calibration. Finally, we discuss the model’s limitations and directions for future research.