Cause of Death: {gghighlight} with a Line Graph
This is part 2 of the final project in Communicating and Transforming Data. In addition to answering questions about causes of death using real data, I also hope to facilitate data analysis in R by documenting the steps that lead to the final results.
The R project aims to answer the following three questions with three main data visualizations:
Q1. How do leading causes of death change over the past 18 years?
Q2. What are the changing patterns of each leading cause of death over the years?
Q3. What are the distinct causes of death in each state in the United States?
This post presents the second part of the project, which shows the data visualization for Q2. For the steps leading to the plot, including data preparation and tidying, please refer to part 1 of the project.
Here are the required packages:
install.packages("rio")
install.packages("here")
install.packages("tidyverse")
install.packages("paletteer")
install.packages("janitor")
install.packages("gghighlight")
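# {colorblindr} is not on CRAN; install it from GitHub
install.packages("remotes")
remotes::install_github("clauswilke/colorblindr")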
Here are the global settings:
library(tidyverse)
library(rio)
library(here)
library(paletteer)
library(janitor)
library(gghighlight)
library(colorblindr)
knitr::opts_chunk$set(echo = TRUE,
                      message = FALSE,
                      warning = FALSE)
The second plot of this R project series is for policy-makers and health-related researchers. It is a quick summary plot of how the causes of death have changed over the years.
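The plotting code in this post relies on tidy_df, which is built in part 1 and is not reproduced here. As a rough sketch of the aggregation step only, the example below uses a hypothetical stand-in called toy_df. Its state column and all of its values are assumptions made up for illustration, and the .groups argument assumes {dplyr} 1.0.0 or later.

```r
library(dplyr)

# Hypothetical stand-in for tidy_df from part 1 (made-up values):
# one row per state, year, and cause
toy_df <- tibble::tribble(
  ~state,   ~year, ~cause,          ~deaths,
  "Oregon", 2015,  "Cancer",        100,
  "Idaho",  2015,  "Cancer",         50,
  "Oregon", 2016,  "Cancer",        110,
  "Idaho",  2016,  "Cancer",         60,
  "Oregon", 2015,  "Heart disease", 120,
  "Idaho",  2015,  "Heart disease",  70,
  "Oregon", 2016,  "Heart disease", 115,
  "Idaho",  2016,  "Heart disease",  65
)

# Collapse over states: total deaths per year and cause
toy_summary <- toy_df %>%
  group_by(year, cause) %>%
  summarise(deaths_by_year_cause = sum(deaths), .groups = "drop")

# Keep only the last year, which is what the text labels are built from
toy_labels <- filter(toy_summary, year == 2016)
toy_labels
```

The same two steps, a grouped sum followed by a filter on the final year, are what produce the line data and the label data in the plot code that follows.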
# create data for labeling
df_plot2 <- tidy_df %>%
  group_by(year, cause) %>%
  summarise(deaths_by_year_cause = sum(deaths))

tidy_df %>%
  group_by(year, cause) %>%
  summarise(deaths_by_year_cause = sum(deaths)) %>%
  # plot
  ggplot(aes(x = year, y = deaths_by_year_cause / 10000, color = cause)) +
  # linewidth replaces size for lines in ggplot2 >= 3.4.0
  geom_line(linewidth = 1) +
  scale_x_continuous(breaks = seq(2000, 2016, by = 2),
                     expand = c(0, 0)) +
  scale_y_log10(expand = c(0, 0),
                breaks = c(3, 5, 10, 20, 30, 60, 100),
                limits = c(2, 100)) +
  # paletteer >= 1.0.0 expects the palette as a "package::palette" string
  scale_color_paletteer_d("rcartocolor::Vivid") +
  theme_classic(base_size = 15) +
  # label each line at its last year (2016)
  geom_text(data = filter(df_plot2, year == 2016),
            aes(label = cause),
            nudge_x = 5.5, hjust = 1, size = 4) +
  guides(color = "none") +
  theme(panel.grid.minor = element_blank(),
        legend.key.size = unit(3, "lines"),
        plot.title = element_text(face = "bold"),
        plot.subtitle = element_text(face = "italic")) +
  labs(title = "Causes of Death Across Time",
       subtitle = "Top 2 causes of death are heart disease and cancer.\nAlzheimer's disease increases rapidly.",
       x = "Year",
       y = "No. of deaths (in 10k)",
       caption = "Source: Centers for Disease Control and Prevention")
The plot makes it easy to see how the number of deaths changed across the years. Some causes show increasing numbers (suicide, kidney disease, diabetes, Alzheimer’s disease, chronic lower respiratory disease, unintentional injuries), some show decreasing numbers (influenza and pneumonia, stroke, heart disease), and some change little (cancer).
The downside of this plot is that some of the text labels overlap. I tried to maximize the figure height and the spacing between causes to reduce the overlap without distorting the figure too much, but some labels still collide. For example, the label for “Influenza and pneumonia” overlaps the one for “Kidney disease”.
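One option that this post does not pursue is to let {ggrepel} push overlapping labels apart. The sketch below is only an illustration under assumptions: labels_df and its values are made up to mimic three series that end close together, and {ggrepel} is not part of the original package list.

```r
library(ggplot2)
library(ggrepel)

# Made-up end-of-series values that sit close together on the y axis
labels_df <- data.frame(
  year = 2016,
  deaths_10k = c(5.1, 5.0, 4.5),
  cause = c("Influenza and pneumonia", "Kidney disease", "Suicide")
)

p <- ggplot(labels_df, aes(x = year, y = deaths_10k, color = cause)) +
  geom_point() +
  # geom_text_repel() moves labels apart; direction = "y" only shifts them vertically
  geom_text_repel(aes(label = cause),
                  direction = "y", hjust = 0, nudge_x = 0.5,
                  segment.color = "grey60") +
  # leave room on the right for the labels
  scale_x_continuous(limits = c(2015, 2020)) +
  guides(color = "none") +
  theme_classic()
p
```

Repelled labels stay next to their lines, at the cost of small connector segments and less predictable label positions.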
So how about looking into a way to get rid of the text?
tidy_df %>%
  group_by(year, cause) %>%
  summarise(deaths_by_year_cause = sum(deaths)) %>%
  # order the legend entries
  mutate(cause = factor(cause,
                        levels = c("Heart disease", "Cancer", "Unintentional injuries",
                                   "Chronic lower respiratory diseases", "Stroke",
                                   "Alzheimer's disease", "Diabetes",
                                   "Influenza and pneumonia", "Kidney disease", "Suicide"))) %>%
  ggplot(aes(x = year, y = deaths_by_year_cause / 10000, color = cause)) +
  # linewidth replaces size for lines in ggplot2 >= 3.4.0
  geom_line(linewidth = 2) +
  scale_x_continuous(breaks = seq(2000, 2016, by = 2), expand = c(0, 0)) +
  scale_y_log10(expand = c(0, 0),
                breaks = c(3, 5, 10, 20, 30, 60, 100),
                limits = c(2, 100)) +
  # paletteer >= 1.0.0 expects the palette as a "package::palette" string
  scale_color_paletteer_d("rcartocolor::Vivid") +
  theme_classic(base_size = 26) +
  theme(panel.grid.minor = element_blank(),
        legend.key.size = unit(3, "lines"),
        plot.title = element_text(face = "bold"),
        plot.subtitle = element_text(face = "italic")) +
  labs(title = "Causes of Death Across Time",
       subtitle = "Top 2 causes of death are heart disease and cancer.\nAlzheimer's disease increases rapidly.",
       x = "Year",
       y = "No. of deaths (in 10k)",
       caption = "Source: Centers for Disease Control and Prevention",
       color = "")
Although each cause is now legible in the color legend, a total of 10 colors definitely increases the audience’s cognitive load. The other concern is that, as far as I know, no color-blind-friendly qualitative palette offers 10 or more colors. Therefore, I decided to use {gghighlight} to select three causes of interest and plot them with the color-blind-friendly palette colorblindr::scale_color_OkabeIto() in the final version of the plot below. Note that scale_color_OkabeIto() needs two additional packages, {cowplot} and {colorspace}, to work. Here are the details.
tidy_df %>%
  group_by(year, cause) %>%
  summarise(deaths_by_year_cause = sum(deaths)) %>%
  mutate(cause = factor(cause,
                        levels = c("Heart disease", "Cancer", "Unintentional injuries",
                                   "Chronic lower respiratory diseases", "Stroke",
                                   "Alzheimer's disease", "Diabetes",
                                   "Influenza and pneumonia", "Kidney disease", "Suicide"))) %>%
  ggplot(aes(x = year, y = deaths_by_year_cause / 10000, color = cause)) +
  # linewidth replaces size for lines in ggplot2 >= 3.4.0
  geom_line(linewidth = 2) +
  # keep three causes in color; the others are drawn in grey
  gghighlight(cause %in% c("Heart disease", "Cancer", "Alzheimer's disease")) +
  scale_x_continuous(breaks = seq(2000, 2016, by = 2),
                     expand = c(0, 0),
                     limits = c(1999, 2017)) +
  scale_y_log10(expand = c(0, 0),
                breaks = c(3, 5, 10, 20, 30, 60, 100),
                limits = c(2, 100)) +
  # color-blind-friendly
  scale_color_OkabeIto() +
  theme_classic(base_size = 15) +
  guides(color = "none") +
  theme(panel.grid.minor = element_blank(),
        plot.title = element_text(face = "bold"),
        plot.subtitle = element_text(face = "italic")) +
  labs(title = "Causes of Death Across Time",
       subtitle = "Top 2 causes of death are heart disease and cancer.\nAlzheimer's disease increases rapidly.",
       x = "Year",
       y = "No. of deaths (in 10k)",
       caption = "Source: Centers for Disease Control and Prevention",
       color = "")
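For readers who cannot install {colorblindr}, the same idea can be sketched with the Okabe-Ito colors that ship with base R (4.0.0 or later) through palette.colors(). This is an alternative under stated assumptions, not the method used above: the toy data frame, its cause names "A" to "D", and its values are made up, and scale_color_manual() stands in for scale_color_OkabeIto().

```r
library(ggplot2)
library(gghighlight)

# Made-up series for four causes, for illustration only
toy <- expand.grid(year = 2000:2004,
                   cause = c("A", "B", "C", "D"),
                   stringsAsFactors = FALSE)
toy$deaths_10k <- 10 + as.integer(factor(toy$cause)) * 5 + (toy$year - 2000)

# Base R's Okabe-Ito palette has 9 entries; drop the leading black
okabe_ito <- unname(palette.colors(palette = "Okabe-Ito"))[-1]
length(okabe_ito)  # fewer than the 10 causes in the full plot

p <- ggplot(toy, aes(x = year, y = deaths_10k, color = cause)) +
  geom_line() +
  # keep "A" and "C" in color; the other series are greyed out
  gghighlight(cause %in% c("A", "C"), use_direct_label = FALSE) +
  scale_color_manual(values = okabe_ito)
p
```

Because only the highlighted series keep a color, the small size of the palette stops being a constraint once {gghighlight} narrows the plot to a few causes.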
This plot is intended for policy-makers and researchers who want to see the overall changes in causes of death across the years. Heart disease, cancer, and Alzheimer’s disease are the three causes that need the most attention in disease intervention and prevention: heart disease and cancer are the top two causes of death throughout the period, and deaths from Alzheimer’s disease are increasing rapidly.
Text and figures are licensed under Creative Commons Attribution CC BY-NC 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Chen (2019, March 12). Szu-Hua Teresa Chen, PT, PhD: Changing patterns of leading causes of death over time. Retrieved from https://teresashchen.github.io/blog/posts/2019-03-12-cause-of-death-gghighlight-with-a-line-graph/
BibTeX citation
@misc{chen2019changing,
  author = {Chen, Teresa},
  title = {Szu-Hua Teresa Chen, PT, PhD: Changing patterns of leading causes of death over time},
  url = {https://teresashchen.github.io/blog/posts/2019-03-12-cause-of-death-gghighlight-with-a-line-graph/},
  year = {2019}
}