Recognize chemical formulas from research papers using a transformer-based artificial neural network
In recent years, deep learning has played a vital role in many fields of science and technology. This progress has spurred AI-based tools that help us find information. Modern deep learning technologies, which typically require large volumes of high-quality data for training neural networks, will also transform chemistry.
The good news is that chemical data ages well. Even if a molecule was first synthesized more than a century ago, information about its structure, properties, and methods of synthesis is still useful today. Even in the era of universal digitalization, an organic chemist may turn to a thesis in a library collection, published as early as the turn of the 20th century, for information on a poorly studied molecule.
There is no universally recognized method for presenting chemical formulas. Chemists employ a variety of shorthand notations to represent common chemical groups: “tBu”, “t-Bu”, and “tert-Bu” are all accepted abbreviations for the tert-butyl group. To make matters worse, chemists frequently use a single template with several “placeholders” (R1, R2, etc.) to refer to a large family of similar compounds, and those placeholders may be defined anywhere: in the figure itself, in the body text of the article, or in the supplementary materials. Drawing styles fluctuate from journal to journal and evolve over time, scientists’ personal preferences change, and standards shift. As a result, even a seasoned chemist can be perplexed by a “puzzle” encountered in a journal article, and for a computer algorithm the problem seems intractable.
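To make the placeholder convention concrete, here is a minimal, purely illustrative sketch (not Syntelly’s code; all names are hypothetical) of how shorthand spellings can be normalized and R-group placeholders expanded into SMILES fragments:

```python
# Several shorthand spellings map to one canonical abbreviation.
ABBREVIATION_ALIASES = {
    "tBu": "tBu",
    "t-Bu": "tBu",
    "tert-Bu": "tBu",
    "Me": "Me",
}

# Canonical abbreviations map to SMILES fragments (attachment point implicit).
ABBREVIATION_SMILES = {
    "tBu": "C(C)(C)C",   # tert-butyl
    "Me": "C",           # methyl
}

def expand_template(template: str, r_groups: dict) -> str:
    """Substitute placeholder labels (R1, R2, ...) in a SMILES-like
    template with the fragments they stand for."""
    result = template
    for label, shorthand in r_groups.items():
        canonical = ABBREVIATION_ALIASES[shorthand]
        result = result.replace(f"[{label}]", ABBREVIATION_SMILES[canonical])
    return result

# One template plus a placeholder table yields one concrete structure:
print(expand_template("c1ccccc1[R1]", {"R1": "t-Bu"}))  # c1ccccc1C(C)(C)C
```

A single template with n placeholders, each admitting dozens of substituents, thus stands in for thousands of concrete molecules — which is exactly why the notation is convenient for chemists and hard for software.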
Using artificial intelligence, researchers from Syntelly, a startup spun off from Skoltech and Sirius University, have developed a neural network-based solution for automated recognition of chemical formulas in scanned research papers.
To tackle the problem, the researchers turned to the Transformer, a neural network architecture originally developed by Google for machine translation. Instead of translating between languages, they used this sophisticated tool to convert the image of a molecule or molecular template into its textual representation. The resulting Transformer-based architecture, named ‘Image2SMILES’, significantly improved chemical structure recognition.
Much to the researchers’ amazement, the neural network could learn virtually anything, as long as the relevant drawing style was present in the training data. On the other hand, a Transformer needs tens of millions of examples to train, and manually collecting that many chemical formulas from research publications is impractical. Instead, the scientists took a different route and built a data generator that produces training examples by merging randomly chosen molecular fragments and representation styles.
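The generator idea can be sketched as follows. This is a toy illustration under stated assumptions, not Syntelly’s actual generator: the scaffold, fragment, and style lists are invented, and a real system would render each pair as an image for training.

```python
import random

SCAFFOLDS = ["c1ccccc1{R}", "C1CCCCC1{R}", "c1ccncc1{R}"]  # core templates
FRAGMENTS = ["C", "C(C)(C)C", "O", "N", "Cl"]              # substituents
STYLES = ["tBu", "t-Bu", "tert-Bu"]  # drawing styles for the same group

def generate_example(rng: random.Random) -> dict:
    """Return one synthetic training pair: the target SMILES string the
    model should emit, plus the style the image would be drawn in."""
    scaffold = rng.choice(SCAFFOLDS)
    fragment = rng.choice(FRAGMENTS)
    style = rng.choice(STYLES)
    return {
        "target_smiles": scaffold.format(R=fragment),
        "render_style": style,
    }

# Because examples are assembled on the fly, the generator can supply
# effectively unlimited, never-repeating training data.
rng = random.Random(0)
batch = [generate_example(rng) for _ in range(3)]
for ex in batch:
    print(ex["target_smiles"], ex["render_style"])
```

The key design point is that the combinatorics do the work: a modest library of fragments and styles yields far more distinct (image, SMILES) pairs than could ever be collected by hand.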
The research team believes their method is an important step toward an artificial intelligence system able to ‘read’ and ‘understand’ research papers at the level of a highly trained chemist.