In an effort to spur biomedical research, the National Institutes of Health is undertaking a number of cross-cutting initiatives under the auspices of its Roadmap for Medical Research. One such initiative, which targets the use of small molecules as probes for biological systems, is crossing swords with the private sector.
The initiative--Molecular Libraries & Imaging (MLI)--includes the generation of compound libraries and the establishment of screening centers to assay these compounds. The information about these MLI compounds and their biological activities is to be collected and made publicly available. To that end, NIH has set up a chemical structure database called PubChem as one component of the initiative.
While the idea of such a database fits nicely into the goals of the NIH Roadmap, the scope of compounds contained in PubChem has officials at the American Chemical Society deeply concerned and asking the agency to focus the database's contents on what the society understands to be the database's stated mission. At issue is whether PubChem, as it currently exists, duplicates and therefore unfairly competes with ACS's Chemical Abstracts Service (CAS) Registry.
To understand ACS's apprehension, one must first understand the two databases. Based in Columbus, Ohio, CAS employs nearly 1,300 people to sort through, analyze, and index all known small-molecule information from publicly disclosed chemical literature and patents. The substance identification database, known as the CAS Registry, is a result of this careful process.
The Registry is available electronically for a fee to industry through SciFinder, to universities through SciFinder Scholar, and around the world on the Internet through STN (Scientific & Technical Network). Through these various forms, the CAS Registry is used globally by chemical, pharmaceutical, and biotechnology companies, as well as by colleges, universities, and patent offices.
The computer-based publication technologies that underpin the Registry were developed in the 1960s. From 1965 through 1975, ACS received three National Science Foundation grants for technology development on what became the CAS Registry. The results of this work were published in a series of articles appearing in the Journal of Chemical Information & Computer Sciences from 1976 to 1981.
Those grants did not cover general database building. ACS has subsequently invested more than $500 million in developing, maintaining, and enhancing the Registry, according to ACS.
THE CAS REGISTRY currently contains more than 25 million organic and inorganic compounds. Its records provide unique CAS Registry index numbers, molecular structures, chemical properties, and synonyms in a cross-searchable platform. The Registry also links to other CAS databases for thorough chemical literature references and patent information.
PubChem, on the other hand, is a fledgling database that has been up and running for less than a year. In that time, it has compiled information on nearly a million compounds. According to NIH, its budget for fiscal 2005 is $3.2 million, and it is expected to grow to just under $3.7 million next year.
A staff of 14--all but one of whom hold doctoral degrees in computational chemistry and biology--is involved in the programming and maintenance of PubChem. The database is completely automated with no manual curation and is intended to be an entry point for biomedical researchers into chemistry, according to NIH.
"A lot of what the MLI initiative is about is trying to bring biomedical researchers into the wonderful world of small molecules," says National Institute of General Medical Sciences Director Jeremy M. Berg, who is a Ph.D. chemist. "Crucially, the initiative requires the creation of an informatics database and tools necessary for integrating information about the biological activities discovered for MLI compounds with related information in other public databases," he explains.
The searchable database provides users with computer-generated molecular structures, physical properties, synonyms, and links to available biomedical literature--general data similar to that found in the CAS Registry. It is also integrated with other resources of the National Center for Biotechnology Information, a division of the National Library of Medicine, such as PubMed and GenBank.
Once the MLI initiative's Molecular Libraries Screening Center Network (MLSCN) is up and running, PubChem will serve as the primary depository for the resulting compound data. In the meantime, to make use of the database, NIH has begun populating it with chemical structures from agency and other publicly available chemical structure databases. Researchers from around the world have also begun to deposit small-molecule records into PubChem.
Because PubChem is completely automated, erroneous data may creep in, NIH acknowledges. But it maintains that the public nature of the database will weed out the incorrect information. To date, NIH has amassed approximately 850,000 chemical structure records, representing about 650,000 unique compounds, from publicly available sources.
ACS, however, contends that by populating PubChem with data not derived from or relevant to NIH Roadmap research, the agency is setting up at taxpayers' expense a general chemical structure database that replicates the CAS Registry. ACS argues that this step takes PubChem beyond its intended scope.
According to ACS, it was told by NIH that the purpose of the database was to house compound data derived from MLSCN. ACS Executive Director and Chief Executive Officer Madeleine Jacobs cites a Jan. 21 letter from Dushanka V. Kleinman, assistant director for NIH Roadmap coordination, that states, "PubChem's purpose is to archive and make publicly available for search and retrieval chemical structures and bioassay data generated by [MLSCN]."
This understanding by ACS that PubChem's purpose is limited to MLSCN is incorrect, says Christopher P. Austin, senior adviser to the director for translational research at the National Human Genome Research Institute. He explains that the statement of PubChem's purpose, posted on the NIH Roadmap's MLI website for a year or more, says that "a new and comprehensive database of chemical structures and their biological activities ... called PubChem, will house both compound information from the scientific literature as well as screening and probe data from the MLSCN."
Austin says that collecting as much small-molecule information from the scientific community as possible will be key to maximizing PubChem's impact on biomedical research. "Ideally, the ultimate goal of the MLI initiative is to have a small-molecule modulator of every protein in the human and other genomes and to be able to understand what the fundamental principles are by which small molecules interact with proteins," he explains.
But ACS argues that the setting up of a comprehensive database that replicates the form and content of the CAS Registry represents unfair competition. According to ACS, the government is using taxpayer funds to set up a business entity that already exists in the private sector--something that runs counter to Office of Management & Budget policy guidance Circular A-130.
"THE PROBLEM is that people don't naturally think of the information business as an industry," explains CAS President Robert J. Massie. "It's not easy for most people to understand that a collection of data is the product that we have worked to create," he points out.
"We are not saying that PubChem as it exists today will destroy our business," Massie says. "We are saying that it is a platform that can be built upon to eventually replicate the Registry."
NIH, however, maintains that it has no plans to expand PubChem beyond a rudimentary chemical structure database. "We've been very clear that each record will include the molecular structure, names from the depositor, a few calculated properties, and links," Berg says. "There are no resources available and no reason to duplicate what CAS has available," he notes.
"On the other hand," Berg continues, "if the information is already publicly available, not making it accessible is a violation of our mission." He says that this is where a lot of the current tension arises. "If Congress were to ask--given that this information is publicly available--why we didn't link it into PubChem, I don't have a good answer," he says.
ACS, however, believes that including publicly available data already accessible through the CAS Registry falls outside NIH's mission and diverts scarce resources from the agency's main mission, which is basic research, Jacobs says.
Jacobs stresses that ACS has at no time called for PubChem to be shut down but rather is asking the agency to limit the database's scope so that it is mission related. She points out that the current contents go beyond that mission.
"It's hard for us to see how putting in explosives, pesticides, and other molecules that are not druggable targets actually advances the NIH mission," Jacobs says. "In our view, this [database represents] mission creep by NIH."
NIH's Berg admits that some of the data currently in PubChem are not obviously relevant to the MLI initiative, but he attributes this to the fact that various existing databases were deposited into PubChem using an automated process to populate and make use of the database until MLI compound data are available. He notes that this automated dump was done "out of concern for not wasting resources" by hand-sorting the compounds.
Even with a large public data set, NIH maintains that it could never replicate the CAS Registry in the depth of information it provides, specifically synthetic and patent information. In fact, NIH's Austin agrees with Jacobs that increasing the depth of data elements that PubChem includes would fall outside the agency's mission.
CAS has raised the concern about what happens if NIH and its researchers get deeply involved with small molecules and ask NIH to increase the depth of data included in PubChem, Austin notes. In that case, he says, the only reasonable way to handle the request is to send researchers who want more information to various specialty databases such as the CAS Registry, adding that this is already being done.
Austin notes that he and others at NIH have made the case that an expansion of PubChem to a database that completely duplicates the CAS Registry is not something NIH has the resources or the mission to do. "The problem with this is that there is no way to reassure" ACS of what will happen in the future, he points out.
AT PRESENT, Austin argues that PubChem could open a new market for CAS products. He notes that many biomedical researchers don't consider chemistry relevant to their research and haven't heard of ACS or CAS. He echoes Berg's point that the MLI initiative will bring an enormous group of biomedical researchers into chemistry and argues that if CAS were to allow PubChem to link to the Registry, CAS would gain in the long run. "ACS is not only overestimating what PubChem can do, will do, and has the mission and resources to do, but they are also underestimating the market that PubChem could bring to them," Austin says.
CAS, however, believes that it is already well-known in the market. "If a scientist is involved in drug discovery--whether you call that person a chemist, biochemist, chemical biologist, or molecular biologist--he or she is using the CAS databases," says Michael W. Dennis, vice president of planning and development at CAS and a Ph.D. chemist. He adds that many researchers may not know CAS by name, but they do know its products--namely SciFinder and SciFinder Scholar.
Officials from both ACS and NIH, through correspondence and a number of meetings, have been trying to find a way to maximize PubChem's effectiveness without impinging on the CAS Registry. At a March meeting, ACS suggested and NIH agreed to set up a smaller working group to deal with the issue.
The working group, however, has yet to meet because the two sides remain split on the mandate for the group. While both sides agree that ACS and NIH need to work together effectively for the good of the scientific community and research, ACS holds that the working group should collaborate to focus PubChem as a repository for NIH Roadmap data and leave CAS to provide the larger body of publicly available compounds. NIH counters that the charge should focus not on where the data come from but rather on the depth and type of information included in PubChem compound records.
Both sides have also taken their cases to Congress, with ACS seeking help in limiting the compound scope of PubChem and NIH justifying the breadth of it. Among those targeted is Rep. Ralph Regula (R-Ohio), who serves as chair of the House Appropriations subcommittee that includes NIH in its jurisdiction. The appropriation markup had not occurred before C&EN went to press, but according to Regula's office, he has been in discussions with both sides and will hold off making any decisions related to congressional action until he can sort through everything.
The Scholarly Publishing & Academic Resources Coalition has recently weighed in on the debate. In a letter to Regula in support of NIH's position, Executive Director Rick Johnson writes, "We believe that [ACS's] concern is unfounded and that the American public is well served by continued development and maintenance of PubChem."
ACS and NIH returned to the table on June 3 to try to find a solution before Congress makes its decision. At that meeting, NIH Director Elias A. Zerhouni and ACS President William F. Carroll made progress, but at press time, both sides were still trying to work out some details to resolve this issue.