Abstract
The vastness of materials space, particularly for metal-organic frameworks (MOFs), makes it difficult to identify promising materials for specific applications efficiently. Although high-throughput computational approaches, including machine learning, have been useful for the rapid screening and rational design of MOFs, they tend to neglect descriptors related to synthesis. One way to improve the efficiency of MOF discovery is to mine published MOF papers for the materials informatics knowledge they contain. Here, by adapting the chemistry-aware natural language processing tool ChemDataExtractor (CDE), we generated an open-source database of MOFs focused on their synthetic properties: the DigiMOF database. Using the CDE web scraping package alongside the Cambridge Structural Database (CSD) MOF subset, we automatically downloaded 43,281 unique MOF journal articles, extracted 15,501 unique MOF materials, and text mined over 52,680 associated properties, including synthesis method, solvent, organic linker, metal precursor, and topology. This centralised, structured database reveals the MOF synthetic data embedded within thousands of MOF publications. The DigiMOF database and associated software are publicly available so that other researchers can conduct further analysis of alternative MOF production pathways and create additional parsers to search for other desirable properties.