Traffic from bots run by artificial intelligence companies is disrupting scientific journal websites. Some publications report that their websites are now visited more by bots than by genuine users.
AI firms typically use bots to access scholarly content and scrape whatever data they can to train the large language models (LLMs) that power their writing assistance tools and other products.
While some scholarly publishers have signed deals giving access to AI firms, advocates for authors and rights holders have said scientists and academics should be given a chance to opt out of this practice and should receive compensation and credit when their papers are used to train AI chatbots.
“I can’t tell the future, but it’s gotten worse and worse, and there’s no sign of it stopping,” says Eric Hellman, cofounder and president of the Free Ebook Foundation, who published a blog post about the problem last month. “The bots keep coming back. They are never satisfied.”
After Hellman’s post appeared, Ian Mulvany, chief technology officer at the BMJ Group, wrote a follow-up post noting that its journals are grappling with the same problem.
“The issue is a real one,” Mulvany wrote. “I think as we move through the year we will see better mitigations become available. I’m not convinced we will see better behaviours from the LLM bots.”
Mulvany said that BMJ Group journals were accessed more than 100 million times from data centers in Hong Kong and Singapore in a recent 3-week period alone. “These aggressive bots are attempting to crawl entire websites within a short period, overloading our web servers and negatively impacting the experience of legitimate users,” Mulvany wrote, quoting one of his BMJ Group colleagues.
Publishers of chemistry journals, however, say they aren’t yet seeing the same deluge. For example, a spokesperson for the American Chemical Society (ACS) says the publisher hasn’t detected significant bot activity.
“ACS’s goal is to protect the integrity of the scholarly record; as such, ACS believes that scraping of content should be done under an agreed license,” the spokesperson adds. (ACS publishes C&EN but doesn’t influence its editorial content.)
“It’s harder and harder to distinguish robots from real users, and so real users are getting caught up in this whole thing,” Hellman says. “I’ve been running websites for almost 30 years now, and I’ve never seen bots that are this aggressive.”
To counter the problem, some website-hosting firms have started using Cloudflare or other security tools to block AI bot traffic, Hellman says. For now, that seems to be “successful at limiting the impact” of the bots, he says. “You can block IP address ranges, but then the bots come in through other addresses from other data centers from different countries, [or] they start to use VPNs,” he says, referring to virtual private networks, which can obscure the addresses from which traffic originates.
But blocking bot traffic wholesale also affects useful bots. For example, the revered Internet Archive uses bots to preserve digital content by feeding it into the Wayback Machine, an archive of billions of web pages.
“It’s hard to figure out which are the good bots and which are the bad bots,” Hellman says. “There’s collateral damage for almost all of the methods that you would use to block bots.”