Visualizing Data: When, Why, and How Part 3: The Importance of Integrity: How Colour Choice Influences Interpretation (Article 2)
This article is part of a multi-part series on data visualization. Parts 1 and 2 focus on using data visualization throughout the data science workflow and determining when visualizing your data is an appropriate approach for communicating information. Part 3, The Importance of Integrity, focuses on factors that affect effective and honest communication of a data story. This is the second article of Part 3.
Colour is a critical factor affecting the viewer’s interpretation of a visualization. In this article, we will start with some background on colour theory, and then discuss two different aspects of colour usage that can drive this interpretation.
Colour is integral to the way that we perceive composition and pattern. The visual effects of different colours and colour combinations are the subject of the study of colour theory, the concepts of which can help explain why different visual compositions are more or less effective. These concepts, used to compose paintings, photographs, and other artwork, are just as relevant to the way that our eyes move through the two-dimensional compositions that data visualizations essentially are. Thus, in this section, we will explore how the elements of colour affect our interpretation of plotted data.
There are three elements of colour: hue, value, and intensity. Hue is what we think of as a colour – it is the colour name (“red”, “blue”, “green”), with no indication of lightness, darkness, saturation, etc. Hues can be demonstrated on a colour wheel, such as the one below:
Value is the lightness or darkness of a colour, and a high value or light value colour reflects more light and has greater brightness. For example, a pure yellow is higher value than a pure purple. The figure below shows a gradient of values.
Last, intensity is a colour’s saturation, or how different it is from gray. Pure colours are at their highest intensity, and lower intensity colours begin to look dirtier or dingier. The figure below shows a gradient of intensities for turquoise.
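To make the three elements concrete, here is a minimal sketch that breaks a colour into approximate hue, intensity, and value. It assumes colours are given as sRGB triples in the range 0 to 1, uses HSV saturation as a stand-in for intensity, and uses relative luminance as a stand-in for value. The specific colours, including the dull turquoise, are hypothetical choices for illustration and are not taken from the figures.

```python
import colorsys


def srgb_to_linear(c):
    """Undo the sRGB gamma curve for a single channel in [0, 1]."""
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb):
    """Approximate 'value' (perceived lightness) of an sRGB colour."""
    r, g, b = (srgb_to_linear(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def describe(rgb):
    """Return (hue in degrees, HSV saturation, relative luminance)."""
    h, s, _v = colorsys.rgb_to_hsv(*rgb)
    return round(h * 360), round(s, 2), round(relative_luminance(rgb), 3)


# Hypothetical example colours (assumptions, not the article's exact shades).
yellow = (1.0, 1.0, 0.0)
purple = (0.5, 0.0, 0.5)
dull_turquoise = (0.4, 0.6, 0.58)

for name, rgb in [("yellow", yellow), ("purple", purple),
                  ("dull turquoise", dull_turquoise)]:
    print(name, describe(rgb))
```

Under these assumptions, the pure yellow comes out with a much higher luminance than the pure purple, which matches the description of yellow as the higher value colour, and the dull turquoise has a saturation well below 1.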
Some hues, like red, will carry weight even when present in a small amount. In a two-dimensional composition, an artist might use this effect to intentionally draw the eye of the viewer, or consider balancing a red element with other heavy elements, such as a complex design, larger regions of lower-intensity colours, or small regions of other intense colours. For example, consider the painting below, by Edgar Degas.
In the Degas painting above, the composition and use of colour work together to focus the viewer’s eye on the dancer in the red shawl at the front (bottom) left, and to move attention between this dancer and the small patch of orange on the dancer’s skirt in the back (top) right.
We will see shortly that colour has a similar effect on the viewer’s attention in a data visualization, in that red points or lines will draw the attention of the viewer, even if they make up a small part of an overall trend or group of data. Areas of high intensity can also act to steal the show, even when small, capturing the viewer’s eye. The third element of colour, value, can have as strong an effect, with areas of similar values defining the viewer’s path through a visualization.
These elements are inextricable from your colour choices, and may act to enhance or undermine accurate interpretation of data. Be aware of their effects so that you can communicate honestly!
Colour will affect the way your eyes move through a data visualization just as much as it would with a two-dimensional artistic composition. It’s helpful to be aware of these concepts so that you understand how they will affect interpretation, whether your own or that of the viewers of your visualizations.
The figures below demonstrate how these elements of colour can influence where your attention is drawn. What do you notice about the trends in each plot?
All four plots show the same data, despite the fact that the importance of different trends may look different in each plot. Groups 1 and 2 are randomly normally distributed over the two variables – there is no relationship between the variables for these groups. However, Group 3 has a parabolic relationship between the two variables, and Group 4 has a linear relationship.
In plot (a), the red points draw the eye because the colour red is a dominant hue and because this particular shade of red is lower in value – it’s darker. These points appear to have a parabolic shape. While it is not clear what is happening with the other points, you might visually extend this relationship to the other groups of points, and start exploring whether this is an overall trend.
In plot (b), the red points are now linearly related, and emphasize this relationship, which was less obvious when Group 4 was plotted in orange.
In plot (c), the red and orange points still draw the eye, but in this case, these groups (1 and 2) are random and do not appear to have a relationship. You might notice the green or blue points and explore whether there are relationships between variables for these groups, but they are less likely to drive your intuition around the overall relationship. The same is true for plot (d), which is plotted using an entirely different colour scale. The low value purple points are still noticeable, but the high value yellow points pull attention away from the other colours, especially in contrast with the low value purple.
What is the “correct” way to show the data? Again, this depends on what you are doing with the data and what you are trying to communicate. If you are doing exploratory data analysis and your purpose is to find groups with interesting relationships between the two variables, it might be helpful to have your eye drawn to the parabolic relationship of the points in red in plot (a) or the linear relationship of the points in plot (b). However, if your purpose is to communicate the general relationship between the two variables, these two plots might drive home the conclusion that there is a significant relationship, when in fact these are exceptions rather than the rule.
This is a relatively simple example, with only two variables and only four groups. Something like this might be relevant if you’re exploring variable relationships in pair plots during exploratory data analysis. However, the same concepts scale up to more complex figures. For example, imagine that you’re looking at a complex network graph where some of the nodes or edges are colours that stand out more than others, whether due to hue, intensity, or value – this would still affect the relationships that you will see first, and thus your developed intuition around the data.
The importance of value contrasts is again demonstrated in the plots below. Here, a y variable is plotted against month, a categorical variable. How many of the groups appear to have a consistent time-dependent relationship?
In the plot above, groups 1 and 2 have values that are randomly distributed around 20 for all months in the plot, whereas group 3 has a parabolic relationship, with a minimum around May. Because group 3 is plotted in a high value yellow against a low value background, it dominates visually, making the overall trend appear to be parabolic over time.
In the plots below, the same data are plotted with two different colour assignments against two different backgrounds. Plots (a) and (c), and plots (b) and (d), have the same colours per group, whereas plots (a) and (b), and (c) and (d), have the same plot colour backgrounds. When is each colour more dominant? How does this affect what patterns are most noticeable?
The high value yellow is most noticeable against the low value gray background, so the parabolic pattern is more obvious in (a) than in (b). In the second two plots, the red becomes the clearest colour, so the parabolic pattern is more obvious in (d) than in (c).
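One rough way to check which colour will stand out against a given background is to compare luminance contrast. The sketch below uses the WCAG-style contrast ratio, which ranges from 1 (no contrast) to 21 (black on white). The colour values are hypothetical, and the light background is an assumption, since the exact colours used in the plots are not given here. The ratio only captures differences in value, so it will not reflect the extra pull of a dominant hue such as red.

```python
def _linear(c):
    """Undo the sRGB gamma curve for a single channel in [0, 1]."""
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def luminance(rgb):
    """Relative luminance of an sRGB colour with channels in [0, 1]."""
    r, g, b = (_linear(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(fg, bg):
    """WCAG-style contrast ratio between a foreground and background colour."""
    hi, lo = sorted((luminance(fg), luminance(bg)), reverse=True)
    return (hi + 0.05) / (lo + 0.05)


# Hypothetical colours, chosen only for illustration.
yellow = (1.0, 0.9, 0.1)
red = (0.8, 0.1, 0.1)
dark_gray = (0.2, 0.2, 0.2)   # low value background
white = (1.0, 1.0, 1.0)       # assumed high value background

for bg_name, bg in [("dark gray", dark_gray), ("white", white)]:
    for fg_name, fg in [("yellow", yellow), ("red", red)]:
        print(f"{fg_name} on {bg_name}: {contrast_ratio(fg, bg):.2f}")
```

With these particular values, yellow has the higher contrast on the dark background and red has the higher contrast on the light one, which is consistent with the shift in dominance described above.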
In none of these cases is it impossible to figure out that one group has a clear pattern and the others are more random. However, if you are looking quickly at a plot for the underlying story, colour combinations such as these can affect your initial read on the patterns, driving the direction you take your analysis or affecting the impression of a take-home message.
Again, the best option for the plot depends on what you are trying to accomplish. Many colour schemes and approaches exist where colours are selected to be balanced visually, or to be the most visible to those who are colour blind (see the viridis palette, discussion on colourblindness, ColorBrewer for maps, and colour selection suggestions here). It is important to consider these factors as well as the effect that a chosen palette has on the visibility of trends.
Colour similarity, or relatedness, can also affect how people interpret relationships between data in a figure. For example, in the D3 category20 scale, the first two colours are a dark blue and a light blue; when using all colours in this scale, viewers will understandably assume that data in these colours are related, whether or not this is the case.
This extends to colours that are near each other on the colour wheel. For example, when seeing similar colours (e.g., reds and oranges), people will interpret greater relatedness in the underlying data than when seeing complementary colours (e.g., red and green – although, keep colour blindness in mind). In visualizations, this can be used to effectively indicate actual relatedness in the data, but beware of unintentionally communicating incorrect conclusions – to yourself or to others!
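As a rough check on how close two colours sit on a colour wheel, you can compare their hue angles. This sketch uses the HSV hue from Python's standard `colorsys` module, with pure red, orange, and green as hypothetical examples. One caveat: the HSV wheel is not the painter's colour wheel. In HSV, red's opposite is cyan, whereas red and green are the traditional complementary pair, so treat the angles as an approximation of similarity rather than a definition of it.

```python
import colorsys


def hue_degrees(rgb):
    """Hue angle (0-360 degrees) of an sRGB colour with channels in [0, 1]."""
    h, _s, _v = colorsys.rgb_to_hsv(*rgb)
    return h * 360.0


def hue_distance(rgb_a, rgb_b):
    """Smallest angle (0-180 degrees) between two hues on the HSV wheel."""
    diff = abs(hue_degrees(rgb_a) - hue_degrees(rgb_b)) % 360.0
    return min(diff, 360.0 - diff)


# Hypothetical pure colours for illustration.
red = (1.0, 0.0, 0.0)
orange = (1.0, 0.5, 0.0)
green = (0.0, 1.0, 0.0)

print("red vs orange:", round(hue_distance(red, orange)))
print("red vs green:", round(hue_distance(red, green)))
```

A small hue distance between two groups' colours is a hint that viewers may read those groups as related.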
This can also be relevant for interpreting patterns across groups that should not necessarily be grouped together. For example, take a look at the line plots below, showing changes in values over time. What trends seem more apparent in each plot?
All of the data in these plots were created as random walks, keeping only series that remained within specified bounds. Nonetheless, different patterns are emphasized by different colour schemes. In plot (a), the red lines – colours that are similar and also lower value – emphasize the series that trend upward at the end of the time shown, whereas the green lines peak earlier in the time series. This causes the viewer’s mind to group the lines that have similar trends, and can suggest patterns that don’t actually exist in the randomly created dataset. In plot (b), the colours are assigned randomly, and the overall impression is more random, although there is still the tendency to group red lines.
Plots (c) and (d) have similar effects, with a different colour scheme (viridis). In plot (c), the low value purples emphasize the trend towards high values at the end of the time series, whereas plot (d) appears more moderate.
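If you would like to experiment with data like these, the sketch below shows one way to generate random walks and keep only the series that stay within bounds. The step size, bounds, series length, and number of series are all assumptions made for illustration; the parameters used for the plots above are not stated.

```python
import random


def bounded_random_walk(n_steps, lower, upper, rng, step_sd=1.0,
                        start=0.0, max_tries=10_000):
    """Generate one Gaussian random walk, rejecting any walk that leaves
    the interval [lower, upper] and trying again."""
    for _ in range(max_tries):
        walk = [start]
        for _ in range(n_steps - 1):
            walk.append(walk[-1] + rng.gauss(0.0, step_sd))
            if not lower <= walk[-1] <= upper:
                break  # out of bounds: discard this walk
        else:
            return walk  # completed without leaving the bounds
    raise RuntimeError("no walk stayed within bounds; loosen the bounds")


rng = random.Random(42)  # fixed seed so the example is repeatable
series = [bounded_random_walk(50, -10.0, 10.0, rng) for _ in range(8)]
print(len(series), "series of length", len(series[0]))
```

Because every series comes from the same random process, any grouping you see after colouring them is produced by the colour assignment rather than by the data.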
The plots below demonstrate a similar colour-grouping effect with scatterplot data.
All four plots include the same data: 9 groups of points normally distributed around points on a grid. In plot (a), it is easy to visually group the red and orange points and spot an increasing linear trend in the data. The same is true of the low value blues and purples in plot (b). Plots (c) and (d) have colours arranged more randomly across the dataset, and it is more obvious that the groups themselves are random, with no relationship between Variables 1 and 2.
If groups one, two, and three are related, it might make sense to use similar colours for them – but if not, beware of visual effects that appear because of grouping similar colours!
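The sketch below sets up a comparable experiment: nine groups of normally distributed points centred on a 3 × 3 grid, with two colour assignments. In the first, the three groups on the diagonal are given similar warm colours; in the second, the same palette is shuffled across the groups. The grid spacing, spread, point counts, and colour names are all hypothetical and are not the values used in the plots above.

```python
import random

rng = random.Random(0)  # fixed seed so the example is repeatable

# Nine group centres on a 3 x 3 grid (hypothetical spacing).
centres = [(x, y) for x in (1, 2, 3) for y in (1, 2, 3)]

# 30 points per group, normally distributed around each centre.
groups = {
    i: [(rng.gauss(cx, 0.2), rng.gauss(cy, 0.2)) for _ in range(30)]
    for i, (cx, cy) in enumerate(centres)
}

# A hypothetical palette running from warm to cool colours.
palette = ["red", "orangered", "orange", "gold", "yellowgreen",
           "green", "teal", "blue", "purple"]

# Assignment 1: the diagonal groups get the three most similar warm colours,
# which encourages the eye to join them into an increasing "trend".
diagonal = [i for i, (cx, cy) in enumerate(centres) if cx == cy]
others = [i for i in groups if i not in diagonal]
grouped = dict(zip(diagonal + others, palette))

# Assignment 2: the same colours, shuffled across the groups.
shuffled_palette = palette[:]
rng.shuffle(shuffled_palette)
shuffled = dict(zip(groups, shuffled_palette))

print("grouped: ", grouped)
print("shuffled:", shuffled)
```

Plotting the same points with each mapping in your plotting library of choice is a quick way to see how much of an apparent trend comes from the colour assignment alone.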
When working with data with three variables, it is common to plot the data as a heatmap, with colour indicating the value of the 3rd variable (z-axis). In this context, it is just as important to pay attention to how the range of the axis determines interpretation of patterns and values as it is with two-dimensional plots – except that here the range is a colour scale, rather than a plot axis.
For example, when there is substantial variation at one end of the range but extreme values on the other, notable patterns can be obscured. In this case, your options are similar to those above: you could consider log-transformation of the variable represented by colour, or try using a colour scale with more variation.
In plot (a) below, where is your attention drawn, and what might your conclusion be in terms of the dependence of Z on X and Y?
It looks like there might be a bit of variation throughout, but the only clear pattern is a peak in values centred around X = 15 and Y = 15. In contrast, take a look at the same data plotted with a broader range of colours (b), as well as with the data log-transformed (c and d).
In these plots, it becomes clearer that Z has a relationship with X and Y throughout the range of these latter variables. The broader colour range in plot (b) helps communicate this by providing more contrast at the low end of the range. However, this new step in colour – from purple to white – has a high contrast, and this might indicate to the viewer that this difference is as important as those at the higher end. Whether or not this is appropriate depends on the variables being described and the story being told.
In plots (c) and (d), log-transformation also makes the variation across this range clearer with both colour palettes, by emphasizing differences between smaller values and down-playing differences between higher values. While log-transformation might not make sense as part of your later data processing, it is a useful tool that you can use to explore variation at smaller values.
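The effect of log-transformation on a colour scale can be seen with a little arithmetic. A colour scale maps each value to a position between 0 and 1 along the palette; the sketch below compares a linear mapping with a logarithmic one. The sample values are made up for illustration, and the log mapping assumes all values are strictly positive. Plotting libraries usually provide this directly, for example matplotlib's `LogNorm`.

```python
import math


def linear_norm(z, zmin, zmax):
    """Position of z along the colour scale (0 to 1) with a linear mapping."""
    return (z - zmin) / (zmax - zmin)


def log_norm(z, zmin, zmax):
    """Position of z along the colour scale (0 to 1) with a log10 mapping.
    Requires z, zmin, and zmax to be strictly positive."""
    return ((math.log10(z) - math.log10(zmin))
            / (math.log10(zmax) - math.log10(zmin)))


# Hypothetical Z values: variation at the low end plus one extreme value.
values = [1, 2, 5, 10, 1000]
zmin, zmax = min(values), max(values)

for z in values:
    print(f"z={z:>5}  linear={linear_norm(z, zmin, zmax):.3f}  "
          f"log={log_norm(z, zmin, zmax):.3f}")
```

With the linear mapping, the values from 1 to 10 are squeezed into less than 1% of the colour scale, so they appear as nearly the same colour. With the log mapping they spread over roughly the first third of the scale, which is why the low-end pattern becomes visible.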
Part 1, Data Visualization Throughout the Data Science Workflow, is in two parts and can be found here and here.
Part 2 of the series is When Is Data Visualization a Good Choice.
The third part of this series, The Importance of Integrity, consists of three articles, How Plot Parameters Influence Interpretation, How Color Choice Influences Interpretation, and Maps – Potentials & Pitfalls (forthcoming).