Machine learning and artificial intelligence (AI) are transforming chemistry, allowing scientists to automate rote or tiresome tasks and tackle previously intractable questions.
These tools are having a similar effect on our newsroom. For years, C&EN reporters submitted lists of related stories with each article they wrote. For the last few months, we’ve been testing an AI-powered algorithm that can do that tedious work for them. The algorithm generates the related content we recommend both between paragraphs and at the end of each story.
So what does it take to create an algorithm that meets the needs of C&EN’s readers?
Quite a lot, it turns out. C&EN’s product and editorial teams worked closely with our publisher’s AI team to create and train the algorithm that’s working in the background today. We tinkered and tested until it performed roughly as well as—and sometimes even better than—our veteran journalists at recommending related content.
When looking to improve the algorithm’s performance, the AI team tried tweaking both the core word-embedding algorithm itself and the dataset it was trained on. For all our fellow machine-learning nerds, the first algorithm we tried was based on GloVe (global vectors for word representation) and a statistical technique called principal component analysis, which together map articles onto a 100-dimensional vector space so that their semantic similarity can be compared. We trained the first algorithm on a dataset of millions of documents, largely derived from Wikipedia.
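For readers who want a feel for how comparing articles in a vector space works, here is a minimal sketch in Python. It is not our production code. It assumes each article vector is simply the average of its word vectors, it skips the principal component analysis step, and it ranks candidates by cosine similarity. The three-dimensional word vectors, the article IDs, and the function names are all made up for illustration; real GloVe vectors would be loaded from a pretrained file.

```python
import math

# Hypothetical 3-dimensional word vectors standing in for pretrained embeddings.
TOY_VECTORS = {
    "isotope": [0.9, 0.1, 0.0],
    "meteorite": [0.8, 0.2, 0.1],
    "planet": [0.7, 0.3, 0.0],
    "nitrogen": [0.1, 0.9, 0.1],
    "soil": [0.0, 0.8, 0.3],
    "rock": [0.4, 0.6, 0.1],
}

# Hypothetical articles, each reduced to a few keywords.
ARTICLES = {
    "mars_collision": ["planet", "isotope", "rock"],
    "jupiter_age": ["meteorite", "isotope", "planet"],
    "nitrogen_rocks": ["rock", "nitrogen", "soil"],
}


def article_vector(tokens, word_vectors):
    """Average the vectors of the known words in an article."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        raise ValueError("article contains no words with known vectors")
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(query_id, articles, word_vectors, top_n=1):
    """Return the IDs of the top_n articles most similar to query_id."""
    vectors = {aid: article_vector(toks, word_vectors)
               for aid, toks in articles.items()}
    query = vectors[query_id]
    scored = [(cosine_similarity(query, vec), aid)
              for aid, vec in vectors.items() if aid != query_id]
    scored.sort(reverse=True)  # highest similarity first
    return [aid for _, aid in scored[:top_n]]


print(recommend("mars_collision", ARTICLES, TOY_VECTORS))
```

With these toy numbers, the article that shares the "isotope" and "planet" keywords ranks ahead of the one that shares only "rock". A real system would use far more dimensions and a full vocabulary.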
While that first algorithm worked pretty well for most of our stories, it had trouble making relevant recommendations for some of our most technical chemistry coverage. For example, with the first algorithm, this article (on how a collision with a Mars-sized object could explain Earth’s chemical composition) originally got this recommendation (about how rocks are a significant source of the world’s nitrogen). Our editors noted that the algorithm seemed to overlook our previous astrochemistry coverage.
So we tweaked the training set, trying instead a smaller but more targeted dataset drawn entirely from C&EN articles. We also tried a different word-embedding algorithm, swapping GloVe for word2vec, and used Gensim doc2vec to generate article vectors for semantic similarity comparison.
Across the board, the second algorithm provided more relevant recommendations, according to our editors. For example, for the Mars story above, the algorithm recommended this story about how scientists better estimated Jupiter’s age based on meteorites’ chemical composition. Our editors preferred this recommendation, noting that both the original and the recommendation report “isotope ratios measured in space, with an eye toward solar system origins.”
And we’re still not done. Next we’ll examine how readers click through the recommended articles, so that we can begin making relevant usage-based recommendations, as services like Spotify or Netflix do, rather than content-based ones. We’re also investigating whether we can use AI-powered taxonomy tagging on archival C&EN content to help our readers find what they are looking for faster.
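To illustrate the difference between content-based and usage-based recommendations, here is a small Python sketch of one simple usage-based approach: counting how often two articles are read in the same session and recommending the most frequent companions. This is only an assumption about what such a system could look like, not a description of what we will build. The session data, article IDs, and function names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical reading sessions: each list holds the articles one reader opened.
SESSIONS = [
    ["mars_collision", "jupiter_age"],
    ["mars_collision", "jupiter_age", "pfas_update"],
    ["nitrogen_rocks", "pfas_update"],
]


def co_click_counts(sessions):
    """Count, for every pair of articles, how many sessions contain both."""
    counts = defaultdict(Counter)
    for session in sessions:
        unique = sorted(set(session))  # ignore repeat visits within a session
        for a, b in combinations(unique, 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def recommend_from_usage(article_id, counts, top_n=2):
    """Return the articles most often read alongside article_id."""
    companions = counts.get(article_id, {})
    # Sort by descending count, then by ID so ties are broken predictably.
    ranked = sorted(companions.items(), key=lambda kv: (-kv[1], kv[0]))
    return [aid for aid, _ in ranked[:top_n]]


counts = co_click_counts(SESSIONS)
print(recommend_from_usage("mars_collision", counts))
```

Unlike the content-based approach, this sketch never looks at the text of an article. It relies entirely on reader behavior, so it cannot recommend anything for a story that no one has read yet.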
These projects are part of a long-standing goal in our newsroom: spend less time on manual production tasks and more time creating high-impact journalism. There are other benefits, too. For example, machine learning keeps our recommendations timely. Because the algorithm is dynamic, a story about PFAS written a year ago will show the most current and relevant related stories, so you can be sure that the coverage we recommend is as up to date as possible.
Thanks to ACS’s Jofia Prakash and Marley Zhu for developing the new algorithm, C&EN’s Jessica Morrison, Jyllian Kemsley, Mike McCoy, Linda Wang, and Lauren Wolf for testing it, and IT’s Sujit Boda and Selina Liu and the rest of C&EN’s product team for implementing it.
Editor’s Note: Amanda Yarnell, C&EN’s editorial director, oversees editorial and product development. Yinghao Ma, a senior scientist at the American Chemical Society, leads a team focused on artificial intelligence and enterprise architecture.