Undergraduate Education

Now is the time to curate your data: Use the shutdown as an opportunity

by Alexander J. Norquist and Joshua Schrier
May 22, 2020 | APPEARED IN VOLUME 98, ISSUE 20

 

This is a guest editorial by Alexander J. Norquist, professor of chemistry at Haverford College, and Joshua Schrier, Kim B. and Stephen E. Bepler Chair Professor of Chemistry at Fordham University.

As an experimentalist and a computational chemist who have worked together for years, we are being affected differently by the COVID-19 pandemic. While computational scientists can remain active amid laboratory shutdowns, those of us who rely upon experiments for our work face a new reality. Instead of preparing for the summer research season, we find the hoods empty and the benches gathering dust. The pandemic is forcing us to rethink how we can be productive.

You may have already started diving into a new part of the literature, writing proposals, or preparing manuscripts. But how can we make full use of the host of experimental students who are ready and willing to work, including countless undergraduates who have been effectively furloughed by summer research programs?

The current crisis is an opportunity for experimentalists to fully embrace the data-driven age in which we all now live. Now is the time to digitize and curate experimental data that languish in handwritten laboratory notebooks and in files scattered across hard drives with arbitrary names and formats. Let’s use this crisis to do something important that all too often never makes it to the top of our to-do lists but that has lasting value for our discipline and our students.

In our own work, we’ve found that digitizing the records from old laboratory notebooks provides a valuable source of data about otherwise unreported experimental failures. By analyzing all our past experiments, we were able to uncover the molecular properties that most affected the outcomes of our reactions, namely the polarizability of the organic amines in our hydrothermal crystal growth experiments (Nature 2016, DOI: 10.1038/nature17439). Our first-year undergraduate students digitized each reaction from written notebooks in less than 2 min. While doing so, they got a firsthand look at what makes a laboratory notebook good or bad and learned how best to organize data.

Digitization and curation of data locked in lab notebooks can also reveal our biases as experimenters. Those biases limit our ability to explore chemical space, which in turn limits the utility of the resulting data to train machine-learning models (Nature 2019, DOI: 10.1038/s41586-019-1540-5). Collecting synthetic reaction data is a first step to identifying unexplored areas of chemical space. For example, in a recent collaboration with chemists from Northwestern University, we collected a series of hydrothermal syntheses. Upon plotting the data, it became apparent that certain combinations of reaction conditions had never been tried. Our reactions in that unexplored region produced novel polar racemates (J. Am. Chem. Soc. 2020, DOI: 10.1021/jacs.0c0123). This type of exploratory data analysis can be an excellent introductory research activity for students at all levels.

Finally, digitizing and curating your data can also be the first step toward collaborating with machine-learning experts—or learning how to build predictive machine-learning models yourself. Creating a proper database of experiments is not only the prerequisite for building such models, but it also forces you to think about the structure of the problem and pose questions in a more formal way. Machine-learning models can help you both quantify existing guiding principles of your science and discover new ones that can be tested in the laboratory once the crisis is over.

If you are uncertain about where to start, don’t despair. Based on our experience training undergraduate students to digitize and curate data, we’ve collected a list of 12 practical steps you can take to make the summer of COVID-19 as productive as possible.

The 12 steps

Follow these suggestions for digitizing and curating data to make the summer of COVID-19 as productive as possible.

1. Don’t be afraid to start. A spreadsheet with well-defined columns is a fine place to start. Each experiment gets a row, and each property has its own column. Plan to have a different column for everything that could vary in your experiment. Use consistent terminology for categories and names of things.

2. You don’t need full lab access. You can digitize and curate your data even if time in the lab is limited by social distancing. Use a cell phone or inexpensive USB document camera to take photos of notebook pages. It takes just one person to get the photos.

3. Capture every possible detail of your experiment. All the data and information on the notebook page should be captured. If everything makes it in, you’re less likely to have to re-enter data. Record raw data exactly as it is presented in the laboratory notebooks. Include a failed or incomplete reaction column that you can use to tag incompletely described experiments. Likewise, include a separate column to tag questionable reactions—for example, if the balance was faulty or reaction vessels leaked.

4. Use a systematic description scheme. Describe experiments in terms of the materials that are used (identities and quantities), the actions that are taken on them (types, durations, and settings), the human and machine actors involved (who performed the experiment and what type of instrument was used), observations during the final experiment, and the outcomes you are attempting to predict. Many such organizational schemata have been developed for the purpose of lab automation, but you don’t need robots to benefit from thinking about your results in this way.

5. Missing data are OK. Record all entries, even those with missing data. You will need to decide how to code these missing entries. Whatever you decide, record it in your documentation.

6. Don’t forget about metadata. Provenance—information about who did the experiment, when they did it, and where they did it—can be as important as the primary data, and often these data allow one to find unexpected relations within the dataset or correlations with other datasets. Record lab-notebook page numbers, or attach a digital photo of the notebook page to facilitate recovering old data or tracking down errors. And don’t forget to record who entered the data in the database.

7. Adopt standard naming conventions for molecules. Few people write their lab notebooks using full IUPAC names. Be sure that molecular entities use a standardized, machine-readable representation, such as SMILES or InChI. If your lab uses a set of standard abbreviations, use them, and include a glossary that defines them.

8. Document your data. It may seem obvious to you what the column names mean, but it will not be obvious to your collaborators. Keep a parallel document describing each column name, its definition, expected minima and maxima, and units (if applicable). If categorical data are denoted with a coding scheme (for example, 1 = red and 2 = blue), document those choices. If units remain constant within a column, capture this in the documentation. If units vary within a column, include a separate column for the units.

9. Look for errors. It is easy to enter the wrong data. To catch data entry errors, visualize the distribution of values in each column—a transposed digit or missing decimal point will be obvious. You can also randomly pick reactions for verification against the primary record. Whatever your method, start looking for errors sooner rather than later.

10. Start learning about programming, data curation, and machine learning. It’s easier than you think. There are excellent free, self-paced online resources from places like the Molecular Sciences Software Institute and the Carpentries that can help you get started.

11. Disseminate your work. If you have a clear hypothesis, use your data to pursue that goal, and publish your dataset along with that work in a traditional publication or on a preprint server. It is also possible to publish and cite datasets without claiming a hypothesis via specialized publications such as Scientific Data or data archiving services such as the Materials Data Facility. Services such as DataCite can also provide a citable DOI and searchable registry for materials uploaded to your own institutional repository.

12. Don’t be afraid to do it again. Accept the fact that you might miss something in the first attempt. You may have to go back and do it again. As Samuel Beckett wrote in his 1983 story “Worstward Ho,” “No matter. Try again. Fail again. Fail better.”
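
For readers who want to see how a few of these steps fit together, here is a minimal sketch in Python, using only the standard library. It assumes the spreadsheet from step 1 has been exported as a CSV file, keeps the documentation from step 8 as a small data dictionary, uses a single code for missing entries (step 5), and flags values outside the documented ranges (step 9). The column names, ranges, outcome coding, and example rows are all hypothetical and chosen only for illustration; they are not taken from our own datasets.

```python
import csv
import io

# Step 8: a data dictionary describing each column. All names, ranges, and
# codes here are illustrative assumptions; adapt them to your own experiments.
DATA_DICTIONARY = {
    "notebook_page": {"type": str, "description": "Notebook ID and page number"},
    "entered_by": {"type": str, "description": "Initials of the person who entered the row"},
    "amine_smiles": {"type": str, "description": "Organic amine as a SMILES string"},
    "amine_mass_g": {"type": float, "min": 0.0, "max": 5.0, "units": "g"},
    "temperature_C": {"type": float, "min": 20.0, "max": 250.0, "units": "degrees C"},
    "outcome": {"type": int, "min": 1, "max": 4, "description": "Hypothetical 1-4 quality score"},
}

# Step 5: one documented code for missing entries.
MISSING = "NA"

# Made-up rows standing in for a spreadsheet exported as CSV.
# The second data row has a misplaced decimal point in the mass column.
EXAMPLE_CSV = """notebook_page,entered_by,amine_smiles,amine_mass_g,temperature_C,outcome
AJN-3-017,JS,C1CNCCN1,0.215,90,4
AJN-3-018,JS,NCCN,2150,90,1
AJN-3-019,JS,C1CNCCN1,NA,110,3
"""


def check_rows(csv_text):
    """Step 9: return a list of (row_number, column, message) problems."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    # Row 1 of the file is the header, so the first data row is row 2.
    for row_number, row in enumerate(reader, start=2):
        for column, spec in DATA_DICTIONARY.items():
            raw = row.get(column)
            if raw is None:
                problems.append((row_number, column, "column is missing"))
                continue
            if raw == MISSING or spec["type"] is str:
                continue  # coded missing values and free text are not range-checked
            try:
                value = spec["type"](raw)
            except ValueError:
                problems.append((row_number, column, f"{raw!r} is not a valid number"))
                continue
            if not spec["min"] <= value <= spec["max"]:
                problems.append(
                    (row_number, column,
                     f"{value} outside expected range {spec['min']} to {spec['max']}")
                )
    return problems


for row_number, column, message in check_rows(EXAMPLE_CSV):
    print(f"row {row_number}, {column}: {message}")
```

Run on the example rows, the check flags the mass entry in the second data row, while the row with a coded missing mass passes. A range check like this catches only gross errors, so it complements, rather than replaces, plotting each column and spot-checking entries against the notebook.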

Views expressed on this page are those of the authors and not necessarily those of C&EN or ACS.
