What is ggplot?

“ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics. You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.” https://ggplot2.tidyverse.org/.

ggplot2 lets you create different types of graphics that are often more visually more appealing than base R graphics. There are a lot of pre-set options that will make your plots look nicer than base R graphics automatically, or, you can tweak these individually for very fine control.

The general structure of a ggplot object includes (1) the data in a data frame, (2) indications of what your x and y vars are (3) a specific “geom” to tell ggplot what kind of plot you actually want (sometimes there will be more than one.) What you should put (and what will actually run) depends on the nature of your x and y variables e.g. are both variables continuous? Is y continuous but x discrete? etc. This will also impact how you specify options for your respective axes. Another important aspect of a ggplot is (4) aesthetic mappings. This isn’t necessary to create a ggplot, per se, but its flexibility is one of the main reasons to use ggplot as opposed to base

How to

We’ll start by installing and loading ggplot2, and then look at the basic structure behind a ggplot.

Install ggplot2 and load the library

# install.packages("ggplot2")
library(ggplot2)

Load some data

data(iris)
class(iris)

## [1] "data.frame"

str(iris)

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Side note: Keep track of which variables are numeric/factors. Often if you think your code is correct but the plot doesn’t look right it’s because ggplot2 thinks your numeric is a factor, or vice versa.

Now, we’ll create a basic scatter plot.

We need to provide ggplot with the data frame, and the names of our x and y-variables. geom_point() tells it we want a scatter plot. This is the basic ggplot2 structure.

g1 <- ggplot(iris, aes(x=Sepal.Width, y=Petal.Length)) + 
      geom_point()
print(g1) # even though typing 'g1' into the console would produce the plot in an interactive R session, any time you are creating a document (pdf, jpeg) you need to wrap your ggplot object in print()

It might be more interesting to look at this by species. We can color by another variable in our data frame, in this case, Species (a factor).

g1 <- ggplot(iris, aes(Sepal.Width, Petal.Length, colour = Species)) + 
      geom_point()
print(g1)

The ordering of species is determined by the factor levels. If you want to change this ordering you can relevel the factor, or if you don’t want to change your data, you can set the order of the breaks and labels manually.

levels(iris$Species)

## [1] "setosa"     "versicolor" "virginica"

Side note: this is what happens if Species isn’t a factor.

ggplot(iris, aes(Sepal.Width, Petal.Length, colour = as.numeric(Species))) + 
      geom_point()

We created our ggplot layer and called it g1. We can treat this like a regular R object; save it and print it later, or look at its attributes. This will give us an idea of some of the specific parts of the plot we can manipulate.

attributes(g1)

## $names
## [1] "data"        "layers"      "scales"      "mapping"     "theme"      
## [6] "coordinates" "facet"       "plot_env"    "labels"     
## 
## $class
## [1] "gg"     "ggplot"

We can also add things to the original ggplot layer to create a new ggplot object. For example, let’s connect the points on our scatter plot.

g2 <- g1 + geom_line()
print(g2)

Now that we have a basic plot, we will go through and see how various pieces can be tweaked. Let’s walk through how to creat and modify a title, axis labels, legend, and colors.

Titles

Adding a title is fairly intuitive, just use ggtitle().

g3 <- g1 + ggtitle("This is my plot.")
print(g3)

By default the title is left-justified. If we want to change the horizontal alignment, we can do by adding a theme layer and setting hjust. Notice our g3 object with the left-justified title doesn’t have this theme layer.

g3$theme # this is an empty list

## list()

I like to have all parts of the code in the same place, so I’m going to create some new ggplot objects.

ggplot(iris, aes(Sepal.Width, Petal.Length, colour = Species)) + 
          geom_point() +
          ggtitle("This is my plot, hjust = 0.") +
          theme(plot.title = element_text(hjust = 0))

ggplot(iris, aes(Sepal.Width, Petal.Length, colour = Species)) + 
          geom_point() +
          ggtitle("This is my plot, hjust = 1") +
          theme(plot.title = element_text(hjust = 1))

g4 <- ggplot(iris, aes(Sepal.Width, Petal.Length, colour = Species)) + 
          geom_point() +
          ggtitle("This is my plot centered, hjust = 0.5.") +
          theme(plot.title = element_text(hjust = 0.5)) 

print(g4)

Notice our theme attribute is no longer an empty list.

g4$theme # not empty anymore!

## List of 1
##  $ plot.title:List of 11
##   ..$ family       : NULL
##   ..$ face         : NULL
##   ..$ colour       : NULL
##   ..$ size         : 'rel' num 1.2
##   ..$ hjust        : num 0.5
##   ..$ vjust        : num 1
##   ..$ angle        : NULL
##   ..$ lineheight   : NULL
##   ..$ margin       : 'margin' num [1:4] 0pt 0pt 5.5pt 0pt
##   .. ..- attr(*, "valid.unit")= int 8
##   .. ..- attr(*, "unit")= chr "pt"
##   ..$ debug        : NULL
##   ..$ inherit.blank: logi FALSE
##   ..- attr(*, "class")= chr [1:2] "element_text" "element"
##  - attr(*, "class")= chr [1:2] "theme" "gg"
##  - attr(*, "complete")= logi FALSE
##  - attr(*, "validate")= logi FALSE

We also could have created this plot by adding the theme layer directly to the object with the title.

print(g3 + theme(plot.title = element_text(hjust = 0.5)))

Axes

Use scale_x_continuous and scale_y_continuous to modify the axis titles. Notice you need to specify whether each variable is continuous, or discrete.

g5 <- ggplot(iris, aes(Sepal.Width, Petal.Length, colour = Species)) + 
          geom_point() +
          ggtitle("This is my plot centered.") +
          scale_x_continuous(name = "Sepal width (units)") +
          scale_y_continuous(name = "Petal length (units)") +
          theme(plot.title = element_text(hjust = 0.5)) 
print(g5)

Different types of plots get different types of aesthetics, which are labeled differently. Generally, points and lines have colour aesthetics, and polygons have fill aesthetics. For example, if we create a bar chart, we need to (1) specify a fill instead of a colour for our aesthetic mapping, and (2) use scale_x_discrete() to modify the x-axis title.

g_bar1 <- ggplot(iris, aes(Species, Sepal.Width, fill = Species)) +
              geom_bar(stat = "identity") +
              ggtitle("This is my plot centered.") +
              scale_x_discrete("Species blah") +
              scale_y_continuous("Sepal width (units)") +
              theme(plot.title = element_text(hjust = 0.5))
print(g_bar1)

Instead of the Species label for the x-axis, we might want to remove the title and make the species names more visible. Angling the x-asis text and modifying its size is especially useful if you have many factor levels and your labels are crowding each other.

g_bar2 <- ggplot(iris, aes(Species, Sepal.Width, fill = Species)) +
              geom_bar(stat = "identity") +
              ggtitle("This is my plot centered.") +
              scale_x_discrete(name = "", labels = c("Setosa", "Versicolor", "Virginica")) +
              scale_y_continuous("Sepal width (units)") +
              theme(plot.title = element_text(hjust = 0.5),
                    axis.text.x = element_text(angle=45, vjust=0.5, size=12, color = "black"))
print(g_bar2)

Legends

Since the bars are already labeled, we could remove the legend of our bar chart. Again, we need to specify the type of aesthetic legend that we want to remove. In this case it’s a fill legend. If this were a scatter or line plot, it would be a colour legend.

g_bar3 <- ggplot(iris, aes(Species, Sepal.Width, fill = Species)) +
              geom_bar(stat = "identity") +
              ggtitle("This is my plot centered.") +
              scale_x_discrete(name = "", labels = c("Setosa", "Versicolor", "Virginica")) +
              scale_y_continuous("Sepal width (units)") +
              theme(plot.title = element_text(hjust = 0.5),
                    axis.text.x = element_text(angle=45, vjust=0.5, size=12, color = "black")) +
              scale_fill_discrete(guide = FALSE)
print(g_bar3)

Or, we could use specific colors and modify the legend in other ways. Here, I am suppressing the x-axis tick marks and title, and setting colors manually, and modifying the legend labels and title. I’m also adding a manual line-break to the legend title. Scale_fill_manual is only needed for the manual coloring - you could modify the legend labels and title using scale_fill_discrete() if you were ok with default colors.

g_bar4 <- ggplot(iris, aes(Species, Sepal.Width, fill = Species)) +
              geom_bar(stat = "identity") +
              ggtitle("This is my plot centered with ugly colors.") +
              scale_x_discrete(name = "", breaks = NULL) +
              scale_y_continuous("Sepal width (units)") +
              theme(plot.title = element_text(hjust = 0.5)) +
              scale_fill_manual(values=c("#999999", "#E69F00", "#56B4E9"), 
                                name="Flower\nSpecies",
                                labels=c("Setosa", "Versicolor", "Virginica"))
print(g_bar4)

Long/wide data: for ggplot all of your data must be in a single data frame, and grouping variables have to be stacked. Yes, this is annoying.

Lets say we want to create a plot that lets us compare the bias and variance components of mse. First, we’ll simulate some data that we might have.

set.seed(123)
bias <- rnorm(7, 0, 2) 
bias2 <- bias^2
var <- rgamma(7, 1, 1)
mse <- bias2 + var
dat <- as.data.frame(rbind(mse, bias2, var))
names(dat) <- paste0("model_", 1:ncol(dat))
print(dat)

##        model_1   model_2    model_3    model_4    model_5    model_6
## mse   3.923850 4.0978521 10.6819333 0.70251405 0.10022603 12.6797889
## bias2 1.256532 0.2119267  9.7182864 0.01988573 0.06686127 11.7657916
## var   2.667319 3.8859253  0.9636468 0.68262832 0.03336475  0.9139972
##         model_7
## mse   1.1019922
## bias2 0.8497750
## var   0.2522172

We might want a plot that looks like this.

Or one that looks like this.

But, for either plot we need to get our data from “wide” to “long.”

print(dat) # wide

##        model_1   model_2    model_3    model_4    model_5    model_6
## mse   3.923850 4.0978521 10.6819333 0.70251405 0.10022603 12.6797889
## bias2 1.256532 0.2119267  9.7182864 0.01988573 0.06686127 11.7657916
## var   2.667319 3.8859253  0.9636468 0.68262832 0.03336475  0.9139972
##         model_7
## mse   1.1019922
## bias2 0.8497750
## var   0.2522172

print(gdat) # long

##      Model variable       value
## 1  model_1    bias2  1.25653180
## 2  model_2    bias2  0.21192671
## 3  model_3    bias2  9.71828643
## 4  model_4    bias2  0.01988573
## 5  model_5    bias2  0.06686127
## 6  model_6    bias2 11.76579164
## 7  model_7    bias2  0.84977500
## 8  model_1      var  2.66731861
## 9  model_2      var  3.88592535
## 10 model_3      var  0.96364683
## 11 model_4      var  0.68262832
## 12 model_5      var  0.03336475
## 13 model_6      var  0.91399722
## 14 model_7      var  0.25221723

The package reshape2 will automate a lot of this for you. It has two main functions. melt() takes wide data and makes it long, and cast will take long data and make it wide. There is another R package called tidyr that will also do this sort of thing for you, but I have not personally used it. This is how I converted my data from wide to long.

dat <- as.data.frame(t(dat))
dat$Model <- rownames(dat)
suppressPackageStartupMessages(require(reshape2))
gdat <- melt(dat[, 2:4], id.vars = "Model")
print(gdat)

##      Model variable       value
## 1  model_1    bias2  1.25653180
## 2  model_2    bias2  0.21192671
## 3  model_3    bias2  9.71828643
## 4  model_4    bias2  0.01988573
## 5  model_5    bias2  0.06686127
## 6  model_6    bias2 11.76579164
## 7  model_7    bias2  0.84977500
## 8  model_1      var  2.66731861
## 9  model_2      var  3.88592535
## 10 model_3      var  0.96364683
## 11 model_4      var  0.68262832
## 12 model_5      var  0.03336475
## 13 model_6      var  0.91399722
## 14 model_7      var  0.25221723

Another reason to be aware of this stacking format is if you want to map additional aesthetic elements (size, shape, line thickness, etc.) to a graph, you’ll need a vector for each aesthetic element that runs the entire length of your data. This is often important if you are combining multiple types of graphs or aesthetic elements. Another way to map multiple aesthetic elements by some factor variable is using the group aesthetic.

g <- ggplot(iris, aes(Sepal.Width, Petal.Length, colour = Species)) + 
      geom_point() + # as written, this inherits 'color' from the aes above
      geom_smooth(method = lm, se = FALSE) # as written, this inherits 'color' from the aes above 
      
print(g)

g <- ggplot(iris, aes(Sepal.Width, Petal.Length, group = Species)) + # specify a grouping var
      geom_point(aes(colour = Species)) + # specify color specifically by the group var
      geom_smooth(method = lm, se = FALSE, aes(colour = Species, linetype = Species)) # specify color and line type by the group var
      
print(g)

Combining multiple ggplot objects

In base R, “par(mfrow=c( , ))” is a very nice, quick way to display multiple plots on a single page. This does NOT work for ggplot plots. Happily, there are several ways around this:

Facets

Facet example

Packages
- gridExtra
- cowplot
- egg

The basic idea for gridExtra/cowplot/egg is to divide up the plotting space using a grid and to plot the ggplot objects accordingly. I’ll leave the details to these references.

Special characters in titles and labels

We can include greek letters and sub/superscripts by putting our title in an expression.

g4 <- ggplot(iris, aes(Sepal.Width, Petal.Length, colour = Species)) + 
          geom_point() +
          ggtitle(expression("This is my plot centered,"~mu^2~", "~hat(beta))) +
          theme(plot.title = element_text(hjust = 0.5)) 

print(g4)

Or, if we want Greek letters, sub/super scripts along with variables, we need to use bquote.

# what's the correlation between Petal.Length and Sepal.Width?
myCor <- cor(iris$Petal.Length, iris$Sepal.Width)
myCor # let's round this or it'll all print...

## [1] -0.4284401

myCor <- round(myCor, digits = 2)
myCor

## [1] -0.43

g4 <- ggplot(iris, aes(Sepal.Width, Petal.Length, colour = Species)) + 
          geom_point() +
          ggtitle(bquote("r" == .(myCor)~"and"~mu^2~", "~hat(beta))) +
          theme(plot.title = element_text(hjust = 0.5)) 

print(g4)

R-bloggers bquote reference

Creating ggplots within functions

bquote can be especially helpful when creating ggplots as function outputs. Instead of plotting all three species together, we could write a function to create a scatter plot for each one, with an appropriate title.

myScatter <- function(df, subvar, sublevel, xvar, yvar){
  g_df <- df[which(df[, subvar] == sublevel),]
  g <- ggplot(g_df, aes(x=xvar, y=yvar)) +
       geom_point() +
       ggtitle(bquote("Species: "~.(sublevel)))
  return(g)
}

g <- myScatter(iris, "Species", "setosa", Sepal.Width, Petal.Length) 
print(g)
# Error in FUN(X[[i]], ...) : object 'Sepal.Width' not found

What’s the problem with this?

Within the function, xvar and yvar resolve to Sepal.Width and Petal.Length respectively, which are not defined objects.

Similarly, the code below won’t give us what we want.

g <- myScatter(iris, "Species", "setosa", "Sepal.Width", "Petal.Length") 
print(g)

Instead, we need to include our x and y vars as strings and have ggplot map those strings to the variables we want in one of several ways.

### option 1
myScatter <- function(df, subvar, sublevel, xvar, yvar){
  g_df <- df[which(df[, subvar] == sublevel),]
  
  g_df$xvar <- g_df[, xvar] # just create vars named "xvar" 
  g_df$yvar <- g_df[, yvar] # and "yvar"
  
  g <- ggplot(g_df, aes(x=xvar, y=yvar)) +
       geom_point() +
       ggtitle(bquote("Species: "~.(sublevel)))
  return(g)
}

g <- myScatter(iris, "Species", "setosa", "Sepal.Width", "Petal.Length")
print(g)

### option 2
myScatter <- function(df, subvar, sublevel, xvar, yvar){
  g_df <- df[which(df[, subvar] == sublevel),]
  g <- ggplot(g_df, aes_string(x=xvar, y=yvar)) + # use aes_string, not aes
       geom_point() +
       ggtitle(bquote("Species: "~.(sublevel)))
  return(g)
}

g <- myScatter(iris, "Species", "setosa", "Sepal.Width", "Petal.Length")
print(g)

### option 3
myScatter <- function(df, subvar, sublevel, xvar, yvar){
  xvar <- sym(xvar) # sym converts the symbol to the string that the variable contains
  print(xvar)
  yvar <- sym(yvar)
  g_df <- df[which(df[, subvar] == sublevel),]
  g <- ggplot(g_df, aes(x=!!xvar, y=!!yvar)) +  # !! unquotes the string
       geom_point() +
       ggtitle(bquote("Species: "~.(sublevel)))
  return(g)
}

g <- myScatter(iris, "Species", "setosa", "Sepal.Width", "Petal.Length")

## Sepal.Width

print(g)

### option 4
myScatter <- function(df, subvar, sublevel, xvar, yvar){
  xvar <- ensym(xvar) # sym converts the symbol to the string that the variable contains
  print(xvar)
  yvar <- ensym(yvar)
  g_df <- df[which(df[, subvar] == sublevel),]
  g <- ggplot(g_df, aes(x=!!xvar, y=!!yvar)) +  # !! unquotes the string
       geom_point() +
       ggtitle(bquote("Species: "~.(sublevel)))
  return(g)
}

g <- myScatter(iris, "Species", "setosa", "Sepal.Width", "Petal.Length")

## Sepal.Width

print(g)

Option 1 is admittedly hacky, and you’ll end up with the not-very-informative axes labels “xvar” and “yvar.” However it works consistently, and you can always pass informative labels in as characters.

Option 2 works well here, but in practice I have had this break a lot.

Options 3 and 4 are new as of V3.0.0 and so I admit I haven’t used them.

Colors

R color brewer includes nice color palettes, as well as some specific to color-blind individuals.
For more information on the ggplot2 default color scales, and how to customize color scales, see the scales package
A handy function for getting a vector of the default ggplot2 hex colors (which is just equally spaced hues around the color wheel, starting from 15) is the following:

gg_color_hue <- function(n) {
  hues = seq(15, 375, length = n + 1)
  hcl(h = hues, l = 65, c = 100)[1:n]
}

(Thanks, stackoverflow.)

This is useful because it allows you to create non-ggplots that use the same color scheme as ggplots. It’s also handy if you have one plot with say, 5 colors, and then you want to only plot 3 of those colors, but you want them to be consistent across plots.

Interactive ggplots: ggplotly

Other Resources

http://www.cookbook-r.com/Graphs/
https://ggplot2.tidyverse.org/
Google / StackOverflow
Old but helpful and well-illustrated reshape2 examples
Difference between reshape2 and tidyr

Data visualization with ggplot2

Anna Reisetter

10/17/2019