Libraries have always been the storehouses and caretakers of human knowledge. But with so much scholarly material now being produced in digital form, a new storage option has emerged: online archives that can provide free and (one hopes) enduring access to this material.
These digital repositories essentially are high-capacity electronic filing cabinets packed with information pertinent to a particular field or institution. They range from the simple—say, a collection of scholarly articles—to the comprehensive, pulling in dissertations, reports, conference presentations, institutional records, data sets, photos, videos, and educational materials such as lectures and curricula.
The concept of digital repositories seems like a surefire winner, not only because they serve as a convenient central source from which to obtain information, but also because they can raise the profile of the contributors and their sponsoring institutions. Yet such archives are receiving decidedly mixed responses. Some researchers don't want to make the effort to place items in an online archive. Others aren't even aware of the opportunity. Furthermore, publishers' copyright and prior-publication policies can limit what goes into an archive. Many institutions and fields of research haven't yet built a repository. And although a repository can be launched on a shoestring budget, building and maintaining one can in some cases be costly.
Even in the U.K., where several research funding organizations have recently expressed their support for the concept, only one-third of universities have an open-access repository, and "most of these have very little content in them," according to a report released in June by the U.K.'s Joint Information Systems Committee, which funds many repository projects.
But there are some genuine success stories. Two of the oldest and largest digital repositories—one launched at Los Alamos National Laboratory (LANL) and the other at Switzerland's CERN (European Organization for Nuclear Research)—are centered on physics.
Physicist Paul Ginsparg set up arXiv.org (pronounced "archive") at LANL in 1991 as a means to circulate and archive digital versions of physics papers prior to formal peer review. The practice updated physicists' age-old tradition of disseminating print versions of these papers, dubbed preprints, among their colleagues to obtain feedback and establish precedence. The repository's 380,000 documents, which are submitted by authors, are open access, meaning they can be viewed for free by anyone. Ginsparg moved the repository to Cornell University in 2001.
CERN, the world's largest particle physics lab, launched its own repository in 1993. Known as the CERN Document Server, the repository contains 860,000 records for items including preprints, published articles, photos of people and equipment, posters and presentations, descriptions of experiments at the lab, a library catalog, bibliographic records for books, and other items. Much of the content can be viewed by anyone who visits the repository website.
Although authors themselves can submit their work to CERN's repository, most material is collected by the library, often through automated harvesting from the Web. The library is also scanning old documents into the repository. "We have to be a little bit proactive," explains Joanne Yeomans, a CERN librarian who works with the repository. "We don't just sit there waiting for all these busy authors to bring their articles to us."
CERN's repository has several applications. "For us, it's a central database of all the files and records of interest to CERN's staff and users," Yeomans says. "It's a central location that people can go to to find all kinds of documents that might be useful to them in their scientific work." For instance, a researcher could use it to create a reading list for a student, post a CV, or keep up with the scholarly literature. "Many physicists get by on what they find in the CERN repository and the arXiv.org repository," she adds. The repository can also be sifted for information about CERN. For CERN's 50th anniversary in 2004, for example, the repository was searched to reveal which countries had contributed to the development of the institution, Yeomans says.
The digital repository at CERN runs on CDS Invenio software that the institution developed itself. More than a dozen other institutions have adopted this free software for their own repositories.
CERN periodically modifies the software to broaden its capabilities. For example, the latest release provides a service akin to Amazon's that reports information such as "people who viewed this document have also viewed ...," according to Jean-Yves Le Meur, project leader of the CERN document server. This information could allow users to discover new sources of material related to their interests, he says.
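The "people who viewed this document have also viewed" idea can be illustrated with a simple co-view count. The sketch below is not CDS Invenio's implementation, which the article does not describe; the session data, document IDs, and function names are all hypothetical, and it assumes the repository can group views into per-visitor sessions.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_co_views(sessions):
    """Count how often each pair of documents appears in the same session."""
    co_views = defaultdict(Counter)
    for viewed in sessions:
        # Deduplicate within a session so repeat views are not double counted.
        for a, b in combinations(sorted(set(viewed)), 2):
            co_views[a][b] += 1
            co_views[b][a] += 1
    return co_views


def also_viewed(co_views, doc_id, top_n=3):
    """Return up to top_n documents most often viewed alongside doc_id."""
    counts = co_views.get(doc_id, {})
    # Sort by descending count, then by ID so ties are ordered predictably.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [doc for doc, _ in ranked[:top_n]]


# Hypothetical viewing sessions, one list of document IDs per visitor.
sessions = [
    ["preprint-1", "preprint-2", "photo-7"],
    ["preprint-1", "preprint-2"],
    ["preprint-1", "talk-3"],
    ["preprint-2", "photo-7"],
]
co_views = build_co_views(sessions)
print(also_viewed(co_views, "preprint-1"))
```

Counting raw co-occurrences favors popular documents; a production service would likely normalize for that, but the plain count is enough to show the principle.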
Between the CERN and arXiv.org repositories, physics advances are quite well covered. But other scientific fields that lack the tradition of extensive circulation of preprints, including chemistry, are less well represented in repositories.
Massachusetts Institute of Technology librarian Steven Gass notes that some journals, such as those of the American Chemical Society, which publishes C&EN, "won't publish anything that has been posted online in some kind of prior form. That discourages people in chemical disciplines from sharing their information prior to it showing up as a publication of ACS."
Brian D. Crawford, senior vice president responsible for the society's journal publishing program, confirms that "ACS journal editors consider self-posting of original research on the Web, such as within a digital repository, to constitute prior publication." He explains that the editors "feel strongly that research communicated in ACS journals should be both peer reviewed and original in nature. Public disclosure and systematic dissemination via the Web, such as via Web posting of preprints, preempts a journal's ability to be seen as the authoritative forum for the publication of work that has undergone such review." Crawford adds that the editors believe that "publication in a journal should be seen as the mechanism whereby the researcher in chemistry establishes priority and precedence in communicating original research results, after appropriate peer review."
Some highly focused chemistry repositories do exist, however. For instance, Michael B. Hursthouse, a crystallographer at the University of Southampton, in England, and his colleagues have set up eCrystals, an archive of complete crystal structure data for small molecules. eCrystals is currently a demonstration project but is expected to achieve permanent status soon. When it does, more structures will be added to the archive. For each structure, the site provides keywords and chemical identifiers that allow scientists searching the Web for information about particular compounds to retrieve the relevant eCrystals entry. eCrystals is part of eBank UK, a project that is exploring the potential for integrating research data sets into digital repositories.
The open-access World Wide Molecular Matrix, which is in the University of Cambridge's institutional repository, contains properties including heats of formation, 3-D structures, and dipole moments for more than 170,000 small molecules. Another resource, the NMRShiftDB database, provides free access to the nuclear magnetic resonance spectra of 19,000 organic structures.
Chemists with a biological bent can use sites such as the Protein Data Bank, a repository that contains open-access data on the 3-D structures of 38,000 large biological molecules.
Readers interested in the broader biomedical and life sciences journal literature can use PubMed Central (PMC), a free digital archive launched by the National Institutes of Health in 2000. Readers can access some articles as soon as they are placed in the archive; other articles become available after a publisher-imposed delay ranging from three months to three years.
Papers can be placed in the repository by authors or publishers. In general, however, the response of both to PMC has been lukewarm. Only one-fourth of authors whose work has been funded by NIH have submitted a paper to PMC (C&EN, March 13, page 13). And some publishers, including ACS, are concerned that the site's free access policy could reduce revenues from subscriptions and other article-access fees.
An archive such as PubMed Central, which focuses on biomedical science, is an example of a subject-based repository. But archives can also be organized around a place rather than a subject. MIT's DSpace archive, for instance, is focused specifically on organizing scholarly materials associated with the university and as such is an institutional repository. Such repositories can serve as a showcase for an institution's scholarly work, promote the institutional "brand," and raise the profile of both the contributors and their institution.
MIT and Hewlett-Packard designed the free DSpace software platform, which has since been adopted by about 150 other institutions. Set up in 2002, MIT's DSpace-based archive costs an estimated $275,000 annually to run and maintain. It now contains more than 20,000 items, of which two-thirds are theses and dissertations. The repository also holds technical reports, data sets, photos, video files, and teaching materials.
"It's almost like, 'What wouldn't we take?'" explains Gass, who serves as the contact person for MIT faculty and researchers interested in depositing their materials in DSpace. "Having said that, we're making a commitment to preserve this stuff over time. So we go into some detail about what formats we feel we can commit to preserving in the long term." Those formats include Adobe PDF, JPEG, GIF, RTF, and XML. "People can certainly put other things in, but we don't provide the same level of assurance in terms of the longevity of preservation," Gass says.
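The two-tier commitment Gass describes could be expressed as a check on a file's extension. This is only a sketch and not how DSpace itself classifies files: the function name and the tier labels are hypothetical, and treating ".jpg" as equivalent to ".jpeg" is an assumption.

```python
from pathlib import Path

# Formats the article lists as ones MIT commits to preserving long term.
# ".jpg" is included as an assumed alias for ".jpeg".
LONG_TERM_FORMATS = {".pdf", ".jpeg", ".jpg", ".gif", ".rtf", ".xml"}


def preservation_level(filename):
    """Classify a deposit by the preservation assurance its format would get."""
    suffix = Path(filename).suffix.lower()
    if suffix in LONG_TERM_FORMATS:
        return "long-term"
    # Other formats are accepted, but without the same assurance.
    return "accepted, no long-term assurance"


for name in ["thesis.PDF", "figure.gif", "simulation.dat"]:
    print(name, "->", preservation_level(name))
```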
Responsibility for filling the archive, most of which is open access, rests in the hands of MIT professors. The university is still trying to determine whether its decentralized collection model is the right way to go, Gass says. "As easy as we've made it to submit things into it, it's often seen as another extra piece of work that faculty need to do."
Recognizing that their repositories might otherwise languish, a few universities have established policies that encourage faculty to deposit content. Portugal's University of Minho, for instance, has been offering university departments and research centers whose researchers place documents in its repository a financial incentive that is declining over time. The policy was designed to jump-start compliance with Minho's deposit mandate.
Australia's Queensland University of Technology motivates authors by collecting statistics about downloads of material from its repository. Paula Callan, project officer for the archive, notifies researchers and their departments when they reach a major download milestone. "I tend to get plenty of phone calls afterward from nonparticipating researchers asking how they can get their work in the repository," she says.
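A milestone notice of the kind Callan sends could be triggered by comparing download totals before and after each statistics update. The thresholds, names, and message wording below are hypothetical; the article does not say which milestones Queensland uses or how the notices are generated.

```python
# Hypothetical download milestones; the article does not list the real ones.
MILESTONES = (100, 500, 1000, 5000)


def milestones_crossed(previous_total, new_total, milestones=MILESTONES):
    """Return the milestones passed between two cumulative download totals."""
    return [m for m in milestones if previous_total < m <= new_total]


def build_notices(author, title, previous_total, new_total):
    """Create one notice per milestone newly reached by a deposited item."""
    return [
        f"{author}: '{title}' has passed {m} downloads."
        for m in milestones_crossed(previous_total, new_total)
    ]


for notice in build_notices("A. Researcher", "Sample paper", 90, 520):
    print(notice)
```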
The Queensland repository runs on GNU EPrints, a free and widely used software platform for building institutional repositories. Stevan Harnad, a cognitive scientist in the electronics and computer science (ECS) department at the University of Southampton and at the University of Quebec, Montreal, commissioned the software at Southampton.
Some Southampton departments, including ECS, mandate author deposition of papers into publicly accessible repositories, he notes. Harnad is a champion of such "self-archiving" and open access in general, as are Hursthouse and others in the chemistry department. Southampton's librarians also back the concepts, and the university itself is a leader in the repository and open-access movements, Harnad says. As a result, Southampton has one of the most successful institutional repositories around, particularly in terms of chemistry and ECS content. Established in 2002, the archive contains more than 20,000 records, including more than 2,000 chemistry documents.
To make the contents of their repositories more accessible, institutions are designing them to abide by standard protocols that facilitate content searches. For example, the University of Michigan's OAIster service allows users to search through almost 9 million records describing digital materials at more than 600 institutions and then, in many cases, click on a link to view the materials of interest.
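One widely used protocol of this kind is OAI-PMH, the Open Archives Initiative Protocol for Metadata Harvesting. The article does not name a protocol, so treat that choice as an assumption. The sketch below builds a harvesting request and reads titles and links from a response; the repository address and the sample record are made up, and no network call is made.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# XML namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build a ListRecords request asking for Dublin Core metadata."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def parse_records(xml_text):
    """Extract the OAI identifier, title, and link from each harvested record."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(".//oai:record", NS):
        records.append({
            "oai_id": record.findtext("oai:header/oai:identifier", default="", namespaces=NS),
            "title": record.findtext(".//dc:title", default="", namespaces=NS),
            "link": record.findtext(".//dc:identifier", default="", namespaces=NS),
        })
    return records


# Made-up response standing in for what a repository might return.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repository.example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample crystal structure report</dc:title>
          <dc:identifier>https://repository.example.org/record/1</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

print(list_records_url("https://repository.example.org/oai"))
print(parse_records(SAMPLE_RESPONSE))
```

A search service in the style of OAIster would run such requests against many repositories and index the returned metadata, which is why a result can link back to the item at its home institution.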
Down the road, searching will become more sophisticated. Yeomans notes that institutions including the U.K.'s National Centre for Text Mining are developing language processing techniques to recognize concepts, and not just words, in a document and then to make connections between that document and others that relate to the same concepts. "A physicist wouldn't think, 'I'll go and search the chemical databases just in case something similar has been done there,' " Yeomans explains. But through automatic searching and processing of the content of multiple repositories, "these kinds of associations might be found and brought to the attention of a physicist or, in reverse, to a chemist."
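The difference between matching words and matching concepts can be shown with a toy example. The sketch below maps different words to a shared concept through a tiny hand-made thesaurus and then links documents from different repositories that share a concept. It is far simpler than real text mining, and the thesaurus, document IDs, and texts are all invented for illustration.

```python
import re
from itertools import combinations

# Hypothetical mini-thesaurus: each concept lists words that signal it.
CONCEPTS = {
    "superconductivity": {"superconductor", "superconducting", "superconductivity"},
    "crystal structure": {"crystal", "crystallography", "lattice"},
}


def concepts_in(text):
    """Return the concepts whose signal words appear in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return {concept for concept, terms in CONCEPTS.items() if words & terms}


def related_pairs(documents):
    """Link every pair of documents that share at least one concept."""
    tagged = {doc_id: concepts_in(text) for doc_id, text in documents.items()}
    links = []
    for a, b in combinations(sorted(tagged), 2):
        shared = tagged[a] & tagged[b]
        if shared:
            links.append((a, b, sorted(shared)))
    return links


# Invented records from a physics repository and a chemistry repository.
documents = {
    "physics:001": "Lattice vibrations in a superconducting film",
    "chem:042": "Crystallography of a new copper oxide superconductor",
    "chem:077": "Synthesis of an organic dye",
}
print(related_pairs(documents))
```

The two linked titles share no significant word, so a plain keyword search on either would miss the other; the connection appears only at the concept level.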
Such capabilities might encourage more researchers to use repositories. For now, a mere 15% of articles are being self-archived, according to Harnad. To move beyond that figure, authors will need to be convinced that repositories are a good way to maximize the impact of their research.