If you’re a biomedical researcher and you’re using bar graphs to plot continuous data, Tracey L. Weissgerber has one word for you: “Stop.”
Weissgerber, a physiologist at Mayo Clinic, is on a crusade to change the way biomedical scientists and others who work with biological samples present their data. She’s especially keen for people to trade bar graphs for other types of plots that reveal, rather than obscure, the underlying data points and their distribution.
She developed her dislike of bar graphs as a result of her research into preeclampsia, a pregnancy complication characterized by high blood pressure.
“One of the things we’ve learned about preeclampsia is it’s a syndrome in which women get the same symptoms, but they get them for different reasons,” she says. “Multiple things can go wrong and lead to preeclampsia. In some women, it may be problems with the mother. In other women, it may be problems with the baby. If we want to make progress as a field, one of the things we need to do is look for subgroups. There are no subgroups in a bar graph. A bar graph completely masks that.”
Working with biostatisticians, Weissgerber surveyed 703 papers in top physiology journals (PLOS Biol. 2015, DOI: 10.1371/journal.pbio.1002128). She and her colleagues found that most papers presented continuous data (variables such as age, weight, blood pressure, or biomarker concentration) as bar graphs with a mean and standard error or standard deviation, even when the sample size was too small for this type of statistical analysis to be meaningful.
When Weissgerber showed the two statisticians she works with examples of data sets in which sample size was not sufficient, “they were both a little depressed,” she says. “It’s been an eye-opening process for them as well as for me. I’ve learned a lot about statistics and the problems with the way I was taught statistics as a biomedical science student. And they’ve learned a lot about how things are done in the basic sciences.”
▸ Hometown: Geneva, N.Y.
▸ Education: Ph.D. in kinesiology and health studies, Queen’s University, in Kingston, Ontario
▸ Current position: assistant professor in the division of nephrology and hypertension at Mayo Clinic
▸ Favorite molecule: water, especially on hot days
▸ Favorite statistical method: jackknife resampling, because it sounds dangerous and exciting
▸ Professional highlight: “Our team’s paper on bar graphs was viewed more than 100,000 times in the first month of publication and has contributed to policy changes in several journals.”
The biggest problem with bar graphs is that multiple data distributions can lead to the same bar graph. Without being able to see the underlying data points, it’s impossible to critically evaluate the data or tell what the real data distribution is.
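To make that point concrete, here is a minimal Python sketch. The numbers are invented for illustration and do not come from Weissgerber's survey, and the `rescale` helper is a hypothetical name introduced here. It builds three small groups with different shapes (evenly spread, split into two clusters, and dominated by one outlier) and then shifts and scales them so that all three share the same mean and standard deviation, which means a bar graph with error bars would draw them identically.

```python
import statistics


def rescale(values, target_mean, target_sd):
    """Linearly shift and scale values to a given mean and sample SD."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [target_mean + (v - mean) * target_sd / sd for v in values]


# Hypothetical measurements, invented for illustration only.
symmetric = [4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0]
target_mean = statistics.mean(symmetric)
target_sd = statistics.stdev(symmetric)

# Same summary statistics, very different shapes:
# two tight clusters (possible subgroups) and a group with one outlier.
bimodal = rescale([1.0, 1.2, 0.8, 1.0, 5.0, 5.2, 4.8, 5.0], target_mean, target_sd)
outlier = rescale([1.0, 1.1, 0.9, 1.0, 1.2, 0.8, 1.0, 6.0], target_mean, target_sd)

groups = [("symmetric", symmetric), ("bimodal", bimodal), ("outlier", outlier)]
for name, data in groups:
    print(f"{name:9s} mean={statistics.mean(data):.2f} "
          f"sd={statistics.stdev(data):.2f} "
          f"values={[round(v, 1) for v in sorted(data)]}")
```

The printed means and standard deviations match across the three groups, while the sorted values show how differently the points are arranged. Only a plot of the individual points would reveal that.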
Weissgerber suggests different presentations depending on the size of the data set. Data sets with fewer than 10 observations per group should be presented as dot plots. Dot plots can be used for larger data sets too, but other options exist for these situations. One is what’s known as a box plot. In the case of a vertical plot, it looks like a bar graph that doesn’t extend all the way down to the x-axis. A line that runs horizontally through the box represents the median of the data set, and the top and bottom of the box represent the 75th and 25th percentiles, respectively, of the underlying data points. “Whiskers,” vertical lines extending from the top and bottom of the box, show the minimum and maximum of the data set once outliers are excluded. The outliers are shown as individual points.
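The numbers behind a box plot can be computed in a few lines. The sketch below uses only Python's standard library and makes two assumptions that the article does not specify: quartiles are interpolated with the "inclusive" method, and a point counts as an outlier if it lies more than 1.5 times the interquartile range beyond the box, a common convention. The function name `box_plot_summary` and the heart rate values are hypothetical.

```python
import statistics


def box_plot_summary(values, whisker_factor=1.5):
    """Compute the quantities a box plot draws.

    Assumption: outliers are points beyond whisker_factor * IQR from the box.
    """
    data = sorted(values)
    # 25th percentile, median, and 75th percentile.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker_factor * iqr
    high_fence = q3 + whisker_factor * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],    # smallest non-outlier
        "whisker_high": inside[-1],  # largest non-outlier
        "outliers": outliers,
    }


# Hypothetical resting heart rates (beats per minute), for illustration only.
heart_rates = [58, 62, 64, 66, 67, 70, 71, 73, 75, 78, 110]
summary = box_plot_summary(heart_rates)
print(summary)
```

For these made-up values the box spans 65 to 74 with the median at 70, the whiskers reach 58 and 78, and 110 is drawn as a separate point.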
Still, box plots aren’t perfect. If the data points are clustered in two spots (a bimodal distribution) or in more than two (a multimodal distribution), a box plot can hide that fact. For medium to large data sets with these distributions, Weissgerber says a better option is a violin plot, so called because of its resemblance to the instrument. A violin plot traces the estimated density of the data, typically drawn symmetrically around a central axis, so the outline is widest where the data points are most concentrated. “If your data are bimodal, a violin plot will show that very nicely,” Weissgerber says. “You’ll have a bulge at the top and a bulge at the bottom and this thin bit in the center.”
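The outline of a violin plot is usually a kernel density estimate, a smoothed curve showing where the data points pile up. The following Python sketch is a simplified illustration of that idea rather than the method of any particular plotting package. The data values, the fixed bandwidth of 0.6, and the helper names `gaussian_kde` and `count_peaks` are all assumptions made for this example; plotting libraries typically choose the bandwidth automatically.

```python
import math
import statistics


def gaussian_kde(values, bandwidth):
    """Return a function that estimates density using Gaussian kernels."""
    norm = 1.0 / (len(values) * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values
        )

    return density


def count_peaks(density, low, high, steps=200):
    """Count local maxima of the density on an evenly spaced grid."""
    xs = [low + (high - low) * i / steps for i in range(steps + 1)]
    ys = [density(x) for x in xs]
    return sum(1 for i in range(1, steps) if ys[i - 1] < ys[i] > ys[i + 1])


# Hypothetical values, invented for illustration: clusters near 2 and near 8.
bimodal = [1.5, 1.8, 2.0, 2.1, 2.3, 2.6, 7.4, 7.7, 7.9, 8.0, 8.2, 8.5]

density = gaussian_kde(bimodal, bandwidth=0.6)  # bandwidth is an assumption
peaks = count_peaks(density, low=0.0, high=10.0)
median = statistics.median(bimodal)

print(f"median = {median}, density peaks = {peaks}")
```

With these made-up values the median lands at 5.0, in the gap between the two clusters where there are no observations at all, while the density curve has two peaks. Those two peaks are the bulges Weissgerber describes, and the gap is the thin bit in the center.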
Although Weissgerber discourages the use of bar graphs, she also emphasizes that they’re not evil. “If you’re showing counts or proportions, that’s what a bar graph is designed for, and you should definitely use it,” she says. “When we say don’t use bar graphs, it’s for continuous data—something such as age or heart rate that has a range of values and a distribution that’s hard to represent in the bar graph shape.”
Weissgerber hopes that when scientists address problems with data visualization, the experience will encourage them to think about other statistical issues.
“A lot of people are uncomfortable with talking about statistical issues. Some of them never had statistics training at all; others had statistics training that wasn’t designed for them,” she says. When Weissgerber showed her statistician colleagues what she was learning about the ways people were using statistics, “they both said, ‘I would have designed courses completely differently if I’d known these were the types of sample sizes people were using and the type of questions they were asking.’ ” What works for epidemiologists, who work with large data sets, doesn’t work for basic biomedical scientists.
Weissgerber and her colleagues are providing tools to help improve data visualization for studies with small sample sizes. They already have a free web-based tool for generating line graphs (PLOS Biol. 2016, DOI: 10.1371/journal.pbio.1002484), which is available at statistika.mfub.bg.ac.rs/interactive-linegraph. Weissgerber expects to publish something similar for dot plots soon. The team advocates using interactive data sets and graphics.
“There are limits to what you can present in a static graph,” she says. “As soon as your data set gets a little larger, a little more complex, you end up with a lot of crossing lines, and it’s just a huge mess.”
With interactive graphics, for instance, people reading a journal article can turn to the supplementary information and plot the things they most want to see from the data set. Such data treatments can help tease apart overlapping groups and differences in response patterns.
But no matter how data are represented, the most important thing is transparency, Weissgerber says. “As scientists, we need to be able to critically evaluate the work of others. Bar graphs stop us from doing that because they conceal the underlying data.”