An Introduction to R
Andy Teucher
Contents
1 Introduction
2 Introduction to R and RStudio
3 Data frames
4 Exploring Data Frames
5 Data visualization with ggplot2
6 Subsetting data
7 Dataframe manipulation with dplyr
8 Dataframe manipulation with tidyr
9 Writing data
10 Basic statistics
11 Writing functions
12 Flow control
13 Best Practices
14 Getting help
1 Introduction
1.0.1 Course website
These notes are a pdf version of the website for the course, which can be viewed at: https://ateucher.github.io/rcourse_site
1.0.2 Credits
Most of the material here was borrowed and adapted from Software Carpentry's novice R Bootcamp material,
which they make available for reuse under the Creative Commons Attribution (CC-BY) license. These are
amazing people, doing amazing things to help the scientific world be more productive with their code and
data. Check them out, and if you get the chance to attend a bootcamp, do it.
The course notes from Poisson Consulting's 2012 R Course were also very helpful in putting this material
together.
1.0.3 Source
Source material for the course notes can be found here: https://github.com/ateucher/rcourse_site
1.0.4 License
Notes: CC-BY. Code: MIT
2 Introduction to R and RStudio
2.1 Learning Objectives
• To gain familiarity with the various panes in the RStudio IDE
• To gain familiarity with the buttons, shortcuts and options in the RStudio IDE
• To understand variables and how to assign to them
• To be able to manage your workspace in an interactive R session
• To be able to use mathematical and comparison operations
• To be able to call functions
• To be able to create self-contained projects in RStudio
2.2 Introduction to RStudio
Throughout this lesson, we’re going to teach you some of the fundamentals of the R language as well as some
best practices for organising code for scientific projects that will make your life easier.
We’ll be using RStudio: a free, open source R integrated development environment. It provides a built in
editor, works on all platforms (including on servers) and provides many advantages such as integration with
version control and project management.
Basic layout
When you first open RStudio, you will be greeted by three panels:
• The interactive R console (entire left)
• Environment/History (tabbed in upper right)
• Files/Plots/Packages/Help/Viewer (tabbed in lower right)
Once you open files, such as R scripts, an editor panel will also open in the top left.
2.3 Work flow within RStudio
There are two main ways one can work within RStudio.
1. Test and play within the interactive R console
• This works well when doing small tests and initially starting off.
• It quickly becomes laborious
2. Start writing in an .R file and use RStudio’s command / short cut to push current line, selected lines or
modified lines to the interactive R console.
• This is a great way to start; all your code is saved for later
• You will be able to run the file you create from within RStudio or using R’s source() function.
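For example, assuming your commands are saved in a script called analysis.R (the file name here is just illustrative), you can run the whole file from the console:
source("analysis.R")  # runs every line of the script, from top to bottom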
2.4 Tip: Running segments of your code
RStudio offers you great flexibility in running code from within the editor window. There are
buttons, menu choices, and keyboard shortcuts. To run the current line, you can 1. click on the
Run button just above the editor panel, or 2. select “Run Lines” from the “Code” menu, or 3. hit
Ctrl-Enter in Windows or Linux or Command-Enter on OS X. (This shortcut can also be seen by
hovering the mouse over the button). To run a block of code, select it and then Run. If you have
modified a line of code within a block of code you have just run, there is no need to reselect the
section and Run, you can use the next button along, Re-run the previous region. This will
run the previous code block including the modifications you have made.
2.5 Introduction to R
Much of your time in R will be spent in the R interactive console. This is where you will run all of your code,
and can be a useful environment to try out ideas before adding them to an R script file. This console in
RStudio is the same as the one when you open up the basic R GUI.
The first thing you will see in the R interactive session is a bunch of information, followed by a “>” and a
blinking cursor. It operates on the idea of a “Read, Evaluate, Print loop” (REPL): you type in commands, R
tries to execute them, and then returns a result.
2.6 Using R as a calculator
The simplest thing you could do with R is arithmetic:
1 + 100
## [1] 101
And R will print out the answer, with a preceding "[1]". Don't worry about this for now, we'll explain that
later. For now think of it as indicating output.
If you type in an incomplete command, R will wait for you to complete it:
> 1 +
+
Any time you hit return and the R session shows a “+” instead of a “>”, it means it’s waiting for you to
complete the command. If you want to cancel a command you can simply hit “Esc” and RStudio will give
you back the “>” prompt.
2.7 Tip: Cancelling commands
Cancelling a command isn't just useful for killing incomplete commands: you can also use it to
tell R to stop running code (for example if it's taking much longer than you expect), or to get rid
of the code you’re currently writing.
When using R as a calculator, the order of operations is the same as you would have learnt back in school.
From highest to lowest precedence:
• Parentheses: (, )
• Exponents: ^
• Divide: /
• Multiply: *
• Add: +
• Subtract: -
3 + 5 * 2
## [1] 13
Use parentheses to group operations in order to force the order of evaluation if it differs from the default, or
to make clear what you intend.
(3 + 5) * 2
## [1] 16
This can get unwieldy when not needed, but clarifies your intentions. Remember that others may later read
your code.
(3 + (5 * (2 ^ 2))) # hard to read
3 + 5 * 2 ^ 2       # clear, if you remember the rules
3 + 5 * (2 ^ 2)     # if you forget some rules, this might help
The text after each line of code is called a “comment”. Anything that follows after the hash (or octothorpe)
symbol # is ignored by R when it executes code.
Really small or large numbers get a scientific notation:
2 / 10000
## [1] 2e-04
Which is shorthand for "multiplied by 10^XX". So 2e-4 is shorthand for 2 * 10^(-4).
You can write numbers in scientific notation too:
5e3  # Note the lack of minus here
## [1] 5000
2.8 Functions
Most of R's functionality comes from its functions. A function takes zero, one or multiple arguments,
depending on the function, and returns a value. To call a function, type its name followed by a pair of
brackets, and include any arguments inside the brackets.
log(10)
## [1] 2.302585
To find out more about a function called function_name type ?function_name. To search for the functions
associated with a topic type ??topic or ??"multiple topics". As well as providing a detailed description
of the command and how it works, scrolling to the bottom of the help page will usually show a collection of
code examples which illustrate command usage.
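For example, to look at the help for the log function we just used, or to search the help system more broadly:
?log              # open the help page for log
??"logarithm"     # search all installed help pages for the word "logarithm"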
Exercise 1 Which function calculates sums? And what arguments does it take?
2.8.1 Arguments
The documentation for log indicates that the function requires an argument x that is a vector of numeric
(real) or complex numbers and an argument base which is the base of the logarithm.
Exercise 2 What kind of logarithm does the log function take by default?
When calling a function its arguments can be specified using positional and/or named matching.
log(x = 10, base = 2)
## [1] 3.321928
log(10, 2)
## [1] 3.321928
log(2, 10)
## [1] 0.30103
2.9 Mathematical functions
R has many built in mathematical functions.
# trigonometry functions
sin(1)
## [1] 0.841471
# natural logarithm
log(1)
## [1] 0
log10(10) # base-10 logarithm
## [1] 1
exp(0.5) # e^(1/2)
## [1] 1.648721
Don't worry about trying to remember every function in R. You can simply look them up on Google, or if you
can remember the start of the function's name, use the tab completion in RStudio.
This is one advantage that RStudio has over R on its own: it has autocompletion abilities that allow you to
more easily look up functions, their arguments, and the values that they take.
2.10 Comparing things
We can also do comparison in R:
1 == 1  # equality (note two equals signs, read as "is equal to")
## [1] TRUE
1 != 2  # inequality (read as "is not equal to")
## [1] TRUE
1 < 2   # less than
## [1] TRUE
1 <= 1  # less than or equal to
## [1] TRUE
1 > 0   # greater than
## [1] TRUE
1 >= -9 # greater than or equal to
## [1] TRUE
2.11 Tip: Comparing Numbers
A word of warning about comparing numbers: you should never use == to compare two numbers
unless they are integers (a data type which can specifically represent only whole numbers).
Computers may only represent decimal numbers with a certain degree of precision, so two numbers
which look the same when printed out by R, may actually have different underlying representations
and therefore be different by a small margin of error (called Machine numeric tolerance).
Instead you should use the all.equal function.
Further reading: http://floating-point-gui.de/
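A quick sketch of the problem and the fix (the specific numbers are just an illustration):
0.1 + 0.2 == 0.3           # FALSE on most machines: the two sides differ by a tiny amount
all.equal(0.1 + 0.2, 0.3)  # TRUE: compares the numbers within a small tolerance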
2.12 Variables and assignment
We can store values in variables by giving them a name, and using the assignment operator <- (to save
keystrokes, RStudio has a shortcut: typing Alt and - together inserts <-):
x <- 1 / 40
Notice that assignment does not print a value. Instead, we stored it for later in something called a variable.
x now contains the value 0.025:
x
## [1] 0.025
Look for the Environment tab in one of the panes of RStudio, and you will see that x and its value have
appeared. Our variable x can be used in place of a number in any calculation that expects a number:
log(x)
## [1] -3.688879
Notice also that variables can be reassigned:
x <- 100
x used to contain the value 0.025 and now it has the value 100.
Assignment values can contain the variable being assigned to:
x <- x + 1 #notice how RStudio updates its description of x on the top right tab
The right hand side of the assignment can be any valid R expression. The right hand side is fully evaluated
before the assignment occurs.
Exercise 3 Create an object called x with the value 7. What is the value of x^x? Save the value in an object
called i. If you assign the value 20 to the object x does the value of i change? What does this indicate about
how R assigns values to objects?
Variable names can contain letters, numbers, underscores and periods. They cannot start with a number nor
contain spaces at all. Different people use different conventions for long variable names, these include
• periods.between.words
• underscores_between_words
• camelCaseToSeparateWords
What you use is up to you, but be consistent.
It is also possible to use the = operator for assignment:
x = 1 / 40
But this is much less common among R users. The most important thing is to be consistent with the
operator you use. There are occasionally places where it is less confusing to use <- than =, and it is the most
common symbol used in the community. So the recommendation is to use <-.
2.13 Managing your environment
There are a few useful commands you can use to interact with the R session.
ls will list all of the variables and functions stored in the global environment (your working R session):
ls()
[1] "x" "y"
Note here that we didn't give any arguments to ls, but we still needed to give the parentheses to tell R to
call the function.
If we type ls by itself, R will print out the source code for that function!
You can use rm to delete objects you no longer need:
rm(x)
If you have lots of things in your environment and want to delete all of them, you can pass the results of ls
to the rm function:
rm(list = ls())
In this case we’ve combined the two. Just like the order of operations, anything inside the innermost
parentheses is evaluated first, and so on.
In this case we’ve specified that the results of ls should be used for the list argument in rm.
2.14 Tip: Warnings vs. Errors
Pay attention when R does something unexpected! Errors are thrown when R cannot
proceed with a calculation. Warnings, on the other hand, usually mean that the function has run,
but it probably hasn't worked as expected.
In both cases, the message that R prints out usually gives you clues about how to fix the problem.
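As a rough illustration of the difference (the messages below are approximately what R prints):
as.numeric("five")   # the function runs, but warns you and returns NA
## Warning: NAs introduced by coercion
## [1] NA
"five" + 1           # R cannot proceed at all, so it throws an error
## Error in "five" + 1 : non-numeric argument to binary operator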
2.15 Challenge 1
Which of the following are valid R variable names?
min_height
max.height
_age
.mass
MaxLength
min-length
2widths
celsius2kelvin
2.16 Challenge 2
What will be the value of each variable after each statement in the following program?
mass <- 47.5
age <- 122
mass <- mass * 2.3
age <- age - 20
2.17 Challenge 3
Run the code from the previous challenge, and write a command to compare mass to age. Is mass
larger than age?
2.18 Challenge 4
Clean up your working environment by deleting the mass and age variables.
2.19 Project management with RStudio
2.19.1 Introduction
The scientific process is naturally incremental, and many projects start life as random notes, some data, some
code, then a report or manuscript, and eventually everything is a bit mixed together.
It’s pretty easy to get data scattered among many different folders, with multiple versions.
There are many reasons why we should avoid this:
1. It is really hard to tell which version of your data is the original and which is the modified;
2. It gets really messy because it mixes files with various extensions together;
3. It probably takes you a lot of time to actually find things, and relate the correct figures to the exact
files/code that has been used to generate it;
A good project layout will ultimately make your life easier:
• It will help ensure the integrity of your data;
• It makes it simpler to share your code with someone else (a lab-mate, collaborator, or supervisor);
• It allows you to easily upload your code with your manuscript submission;
• It makes it easier to pick the project back up after a break.
2.19.2 A possible solution
Fortunately, there are tools and packages which can help you manage your work effectively.
One of the most powerful and useful aspects of RStudio is its project management functionality. We’ll be
using this today to create a self-contained, reproducible project.
2.20 Challenge 5: Creating a self-contained project
We’re going to create a new project in RStudio:
1. Click the "File" menu button, then "New Project".
2. Click "New Directory".
3. Click "Empty Project".
4. Type in the name of the directory to store your project, e.g. "r_course".
5. Click the "Create Project" button.
Now when we start R in this project directory, or open this project with RStudio, all of our work on this
project will be entirely self-contained in this directory.
2.20.1 Best practices for project organisation
Although there is no “best” way to lay out a project, there are some general principles to adhere to that will
make project management easier:
2.20.2 Treat data as read only
This is probably the most important goal of setting up a project. Data is typically time consuming and/or
expensive to collect. Working with them interactively (e.g., in Excel) where they can be modified means you
are never sure of where the data came from, or how it has been modified since collection. It is therefore a
good idea to treat your data as “read-only”.
2.20.3 Data Cleaning
In many cases your data will be “dirty”: it will need significant preprocessing to get into a format R (or any
other programming language) will find useful. This task is sometimes called “data munging”. I find it useful
to store these scripts in a separate folder, and create a second “read-only” data folder to hold the “cleaned”
data sets.
2.20.4 Treat generated output as disposable
Anything generated by your scripts should be treated as disposable: it should all be able to be regenerated
from your scripts.
There are lots of different ways to manage this output. I find it useful to have an output folder with different
sub-directories for each separate analysis. This makes it easier later, as many of my analyses are exploratory
and don’t end up being used in the final project, and some of the analyses get shared between projects.
2.20.5 Separate function definition and application
The most effective way I find to work in R, is to play around in the interactive session, then copy commands
across to a script file when I'm sure they work and do what I want. You can also save all the commands
you've entered using the history command, but I don't find it useful because when I'm typing it's 90% trial
and error.
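If you do want to keep a record of what you typed, base R's history tools can do it (a small sketch; the file name is made up, and these only work in an interactive session):
history()                           # show the most recent commands you have entered
savehistory("my-session.Rhistory")  # write the session's command history to a file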
When your project is new and shiny, the script file usually contains many lines of directly executed code.
As it matures, reusable chunks get pulled into their own functions. It’s a good idea to separate these into
separate folders; one to store useful functions that you’ll reuse across analyses and projects, and one to store
the analysis scripts.
2.20.6 Save the data in the data directory
Now that we have a good directory structure, we will place/save the data file in the data/ directory.
2.21 Challenge 6
Download the gapminder data from here.
1. Download the file (CTRL + S, right mouse click -> “Save as”, or File -> “Save page as”)
2. Make sure it’s saved under the name gapminder-FiveYearData.csv
3. Save the file in the data/ folder within your project.
We will load and inspect these data later.
3 Data frames
3.1 Learning Objectives
• To be aware of the different types of data
• To begin exploring the data.frame, and understand how it’s related to vectors, factors and
lists
• To be able to ask questions from R about the type, class, and structure of an object.
One of R’s most powerful features is its ability to deal with tabular data - like what you might already have in
a spreadsheet or a CSV. Let’s start by making a toy dataset in your data/ directory, called feline-data.csv:
coat,weight,likes_string
calico,2.1,1
black,5.0,0
tabby,3.2,1
We can load this into R via the following:
cats <- read.csv(file="data/feline-data.csv")
cats
##     coat weight likes_string
## 1 calico    2.1            1
## 2  black    5.0            0
## 3  tabby    3.2            1
We can begin exploring our dataset right away, pulling out columns via the following:
cats$weight
## [1] 2.1 5.0 3.2
cats$coat
## [1] calico black tabby
## Levels: black calico tabby
We can do other operations on the columns:
## We discovered that the scale weighs one Kg light:
cats$weight + 2
## [1] 4.1 7.0 5.2
paste("My cat is", cats$coat)
## [1] "My cat is calico" "My cat is black"
"My cat is tabby"
But what about
cats$weight + cats$coat
## Warning in Ops.factor(cats$weight, cats$coat): '+' not meaningful for
## factors
## [1] NA NA NA
Understanding what happened here is key to successfully analyzing data in R.
3.2 Data Types
If you guessed that the last command will return an error because 2.1 plus black is nonsense, you’re right and you already have some intuition for an important concept in programming called data types. We can ask
what type of data something is:
class(cats$weight)
## [1] "numeric"
class(cats$coat)
## [1] "factor"
There are 5 main classes: numeric (double), integer, complex, logical and character. Factor is a special class that we'll
get into later.
class(1.25)
## [1] "numeric"
class(1L)
## [1] "integer"
class(TRUE)
## [1] "logical"
class('banana')
## [1] "character"
Note the L suffix to insist that a number is an integer. Character classes are always enclosed in quotation
marks.
No matter how complicated our analyses become, all data in R is interpreted as one of these basic data types.
This strictness has some really important consequences. Try adding another row to your cat data like this:
tabby,2.3 or 2.4,1
Reload your cats data like before, and check what type of data we find in the weight column:
cats <- read.csv(file="data/feline-data.csv")
class(cats$weight)
## [1] "factor"
Oh no, our weights aren’t numeric anymore! If we try to do the same math we did on them before, we run
into trouble:
cats$weight + 1
## Warning in Ops.factor(cats$weight, 1): '+' not meaningful for factors
## [1] NA NA NA NA
What happened? When R reads a csv into one of these tables, it insists that everything in a column be the
same basic type; if it can’t understand everything in the column as a double, then nobody in the column gets
to be a double. The table that R loaded our cats data into is something called a data.frame, and it is our
first example of something called a data structure - things that R knows how to build out of the basic data
types. In order to successfully use our data in R, we need to understand what these basic data structures
are, and how they behave. For now, let’s remove that extra line from our cats data and reload it, while we
investigate this behavior further:
feline-data.csv:
coat,weight,likes_string
calico,2.1,1
black,5.0,0
tabby,3.2,1
And back in RStudio:
cats <- read.csv(file="data/feline-data.csv")
3.3 Vectors & Type Coercion
To better understand the behavior we just saw, let’s meet another of the data structures: the vector. All
vectors are one of the classes we met above. We can create a vector by calling the function of the same name:
x <- numeric(5)
x
## [1] 0 0 0 0 0
y <- character(3)
y
## [1] "" "" ""
Just like you might be familiar with from vectors elsewhere, a vector in R is essentially an ordered list of
things, with the special condition that everything in the vector must be the same basic data type.
You can check if something is a vector:
str(x)
##  num [1:5] 0 0 0 0 0
The somewhat cryptic output from this command indicates the basic data type found in this vector; the
number of things in the vector; and a few examples of what's actually in the vector. If we similarly do
str(cats$weight)
##  num [1:3] 2.1 5 3.2
we see that that’s a vector, too - the columns of data we load into R data.frames are all vectors, and that’s
the root of why R forces everything in a column to be the same basic data type.
3.4 Discussion 1
Why is R so opinionated about what we put in our columns of data? How does this help us?
You can also make vectors with explicit contents with the c (combine) function:
x <- c(2,6,3)
x
## [1] 2 6 3
y <- c("Hello", "Goodbye", "I love data")
y
## [1] "Hello"       "Goodbye"     "I love data"
Given what we’ve learned so far, what do you think the following will produce?
x <- c(2,6,'3')
This is something called type coercion, and it is the source of many surprises and the reason why we need to
be aware of the basic data types and how R will interpret them. Consider:
x <- c('a', TRUE)
x
## [1] "a"
"TRUE"
x <- c(0, TRUE)
x
## [1] 0 1
The coercion rules go: logical -> integer -> numeric -> complex -> character. You can try to force
coercion against this flow using the as. functions:
x <- c('0','2','4')
x
## [1] "0" "2" "4"
y <- as.numeric(x)
y
## [1] 0 2 4
z <- as.logical(y)
z
## [1] FALSE  TRUE  TRUE
As you can see, some surprising things can happen when R forces one basic data type into another! Nitty-gritty
of type coercion aside, the point is: if your data doesn’t look like what you thought it was going to look like,
type coercion may well be to blame; make sure everything is the same type in your vectors and your columns
of data.frames, or you will get nasty surprises!
But coercion isn't necessarily a bad thing. For example, likes_string is numeric, but we know that the 1s and 0s
actually represent TRUE and FALSE (a common way of representing them). R has a special data type
called logical, which has two states: TRUE or FALSE, which is exactly what our data represents. We can
'coerce' this column to be logical by using the as.logical function:
cats$likes_string
## [1] 1 0 1
cats$likes_string <- as.logical(cats$likes_string)
cats$likes_string
## [1]  TRUE FALSE  TRUE
You can also append things to an existing vector using the c (combine) function:
x <- c('a', 'b', 'c')
x
## [1] "a" "b" "c"
x <- c(x, 'd')
x
## [1] "a" "b" "c" "d"
You can also make series of numbers:
mySeries <- 1:10
mySeries
##  [1]  1  2  3  4  5  6  7  8  9 10
seq(10)
##  [1]  1  2  3  4  5  6  7  8  9 10
seq(1,10, by=0.1)
## [1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0 2.1 2.2 2.3 2.4 2.5 2.6
## [18] 2.7 2.8 2.9 3.0 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9
## [ reached getOption("max.print") -- omitted 61 entries ]
We can ask a few other questions about vectors:
x <- seq(10)
head(x, n=2)
## [1] 1 2
tail(x, n=4)
## [1]  7  8  9 10
length(x)
## [1] 10
Finally, you can give names to elements in your vector, and ask for them that way:
x <- 5:8
names(x) <- c("a", "b", "c", "d")
x
## a b c d
## 5 6 7 8
x['b']
## b
## 6
3.4.1 Missing values
Missing values are represented by NA. Functions such as min, max and mean that require knowledge of all the
input values return an NA if one or more values are missing. This behaviour can be altered by setting the
na.rm argument to be TRUE.
x <- c(1, 2, 3, NA)
mean(x)
## [1] NA
mean(x, na.rm = TRUE)
## [1] 2
3.5 Factors
str(cats$coat)
##  Factor w/ 3 levels "black","calico",..: 2 1 3
Another important data structure is called a factor. Factors usually look like character data, but are typically
used to represent categorical information. For example, let’s make a vector of strings labeling cat colorations
for all the cats in our study:
coats <- c('tabby', 'tortoiseshell', 'tortoiseshell', 'black', 'tabby')
coats
## [1] "tabby"
## [5] "tabby"
"tortoiseshell" "tortoiseshell" "black"
str(coats)
##
chr [1:5] "tabby" "tortoiseshell" "tortoiseshell" "black" ...
We can turn a vector into a factor like so:
CATegories <- as.factor(coats)
str(CATegories)
##
Factor w/ 3 levels "black","tabby",..: 2 3 3 1 2
Now R has noticed that there are three possible categories in our data - but it also did something surprising;
instead of printing out the strings we gave it, we got a bunch of numbers instead. R has replaced our
human-readable categories with numbered indices under the hood:
class(coats)
## [1] "character"
typeof(coats)
## [1] "character"
class(CATegories)
## [1] "factor"
typeof(CATegories)
## [1] "integer"
3.6 Challenge 2
When we loaded our cats data, the coats column was interpreted as a factor; try using the help
for read.csv to figure out how to keep text columns as character vectors instead of factors; then
write a command or two to show that the cats$coats column actually is a character vector when
loaded in this way.
In modeling functions, it's important to know what the baseline levels are. This is assumed to be the first
factor level, but by default factors are labeled in alphabetical order. You can change this by specifying the levels:
mydata <- c("case", "control", "control", "case")
x <- factor(mydata, levels = c("control", "case"))
str(x)
##  Factor w/ 2 levels "control","case": 2 1 1 2
In this case, we've explicitly told R that "control" should be represented by 1, and "case" by 2. This designation
can be very important for interpreting the results of statistical models!
3.7 Lists
Another data structure you’ll want in your bag of tricks is the list. A list is simpler in some ways than the
other types, because you can put anything you want in it:
x <- list(1, "a", TRUE, 1+4i)
x
## [[1]]
## [1] 1
##
## [[2]]
## [1] "a"
##
## [[3]]
## [1] TRUE
##
## [[4]]
## [1] 1+4i
x[2]
## [[1]]
## [1] "a"
x <- list(title = "Research Bazaar", numbers = 1:10, data = TRUE )
x
## $title
## [1] "Research Bazaar"
##
## $numbers
##  [1]  1  2  3  4  5  6  7  8  9 10
##
## $data
## [1] TRUE
We can now understand something a bit surprising in our data.frame; what happens if we run:
typeof(cats)
## [1] "list"
We see that data.frames look like lists ‘under the hood’ - this is because a data.frame is really a list of vectors
and factors, as they have to be - in order to hold those columns that are a mix of vectors and factors, the
data.frame needs something a bit more flexible than a vector to put all the columns together into a familiar
table.
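You can see this list-like behaviour with functions we have already met:
typeof(cats)   # "list"
length(cats)   # 3: one element per column
names(cats)    # the element names are the column names: "coat", "weight", "likes_string"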
3.8 Matrices
Last but not least is the matrix. We can declare a matrix full of zeros:
x <- matrix(0, ncol=6, nrow=3)
x
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    0    0    0    0    0    0
## [2,]    0    0    0    0    0    0
## [3,]    0    0    0    0    0    0
and we can ask for and put values in the elements of our matrix with a couple of different notations:
x[1,1] <- 1
x
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    1    0    0    0    0    0
## [2,]    0    0    0    0    0    0
## [3,]    0    0    0    0    0    0
x[1][1]
## [1] 1
x[1][1] <- 2
x[1,1]
## [1] 2
x
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    2    0    0    0    0    0
## [2,]    0    0    0    0    0    0
## [3,]    0    0    0    0    0    0
3.9 Challenge 3
What do you think will be the result of length(x)? Try it. Were you right? Why / why not?
3.10 Challenge 4
Make another matrix, this time containing the numbers 1:50, with 5 columns and 10 rows. Did
the matrix function fill your matrix by column, or by row, as its default behaviour? See if you
can figure out how to change this. (hint: read the documentation for matrix!)
3.11 Challenge 5
Create a list of length two containing a character vector for each of the sections in this part of the
workshop:
• Data types
• Data structures
Populate each character vector with the names of the data types and data structures we’ve seen
so far.
3.12 Challenge solutions
Solutions to challenges
3.13 Discussion 1
By keeping everything in a column the same, we allow ourselves to make simple assumptions
about our data; if you can interpret one entry in the column as a number, then you can interpret
all of them as numbers, so we don’t have to check every time. This consistency, like consistently
using the same separator in our data files, is what people mean when they talk about clean data;
in the long run, strict consistency goes a long way to making our lives easier in R.
3.14 Solution to Challenge 1
x <- 11:20
subset <- x[3:5]
names(subset) <- c('S', 'W', 'C')
3.15 Solution to Challenge 2
cats <- read.csv(file="data/feline-data.csv", stringsAsFactors=FALSE)
str(cats$coat)
##  chr [1:3] "calico" "black" "tabby"
Note: new students find the help files difficult to understand; make sure to let them know that
this is typical, and encourage them to take their best guess based on semantic meaning, even if
they aren’t sure.
3.16 Solution to challenge 3
What do you think will be the result of length(x)?
x <- matrix(0, ncol=6, nrow=3)
length(x)
## [1] 18
Because a matrix is really just a vector with added dimension attributes, length gives you the
total number of elements in the matrix.
3.17 Solution to challenge 4
Make another matrix, this time containing the numbers 1:50, with 5 columns and 10 rows. Did
the matrix function fill your matrix by column, or by row, as its default behaviour? See if you
can figure out how to change this. (hint: read the documentation for matrix!)
x <- matrix(1:50, ncol=5, nrow=10)
x <- matrix(1:50, ncol=5, nrow=10, byrow = TRUE) # to fill by row
3.18 Solution to Challenge 5
dataTypes <- c('double', 'complex', 'integer', 'character', 'logical')
dataStructures <- c('data.frame', 'vector', 'factor', 'list', 'matrix')
answer <- list(dataTypes, dataStructures)
Note: it’s nice to make a list in big writing on the board or taped to the wall listing all of these
types and structures - leave it up for the rest of the workshop to remind people of the importance
of these basics.
4 Exploring Data Frames
4.1 Learning Objectives
• To learn how to manipulate a data.frame in memory
• To tour some best practices of exploring and understanding a data.frame when it is first
loaded.
At this point, you've seen it all - in the last lesson, we toured all the basic data types and data structures in R.
Everything you do will be a manipulation of those tools. But a whole lot of the time, the star of the show is
going to be the data.frame - that table that we started with that information from a CSV gets dumped into
when we load it. In this lesson, we’ll learn a few more things about working with data.frame.
We learned last time that the columns in a data.frame were vectors, so that our data are consistent in type
throughout the column. As such, we can perform operations on them just as we did with vectors:
# Calculate weight of cats in g
cats$weight * 1000
## [1] 2100 5000 3200
We can also assign this result to a new column in the data frame:
cats$weight_g <- cats$weight * 1000
cats
##     coat weight likes_string weight_g
## 1 calico    2.1            1     2100
## 2  black    5.0            0     5000
## 3  tabby    3.2            1     3200
Our new column has appeared!
4.2 Discussion 1
What do you think
cats$weight[4]
will print at this point?
So far, you’ve seen the basics of manipulating data.frames with our cat data; now, let’s use those skills to
digest a more realistic dataset.
4.3 Reading in data
Remember earlier we obtained the gapminder dataset, which contains GDP, population, and life expectancy
for many countries around the world.
If you’re curious about where this data comes from you might like to look at the Gapminder website.
Let’s first open up the data in Excel, an environment we’re familiar with, to have a quick look.
Now we want to load the gapminder data into R.
As its file extension would suggest, the file contains comma-separated values, and seems to contain a header
row.
We can use read.csv to read this into R
gapminder <- read.csv(file="data/gapminder-FiveYearData.csv")
head(gapminder)
##       country year      pop continent lifeExp gdpPercap
## 1 Afghanistan 1952  8425333      Asia  28.801  779.4453
## 2 Afghanistan 1957  9240934      Asia  30.332  820.8530
## 3 Afghanistan 1962 10267083      Asia  31.997  853.1007
## 4 Afghanistan 1967 11537966      Asia  34.020  836.1971
## 5 Afghanistan 1972 13079460      Asia  36.088  739.9811
## [ reached getOption("max.print") -- omitted 1 row ]
4.4 Miscellaneous Tips
1. Another type of file you might encounter is the tab-separated format. You can use read.delim
to read in tab-separated files.
2. If your file uses a different separator, the more generic read.table will let you specify it
with the sep argument (see the sketch below).
3. You can also read in files from the Internet by replacing the file paths with a web address.
4. You can read directly from excel spreadsheets without converting them to plain text first by
using the xlsx package.
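For example, a minimal sketch of those two calls (the file names here are made up for illustration):
dat <- read.delim("data/some-file.txt")                            # tab-separated file
dat <- read.table("data/some-file.csv", sep = ";", header = TRUE)  # any separator, e.g. ";"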
To make sure our analysis is reproducible, we should put the code into a script file so we can come back to it
later.
4.5 Challenge 3
Go to file -> new file -> R script, and write an R script to load in the gapminder dataset.
Run the script using the source function, using the file path as its argument (or by pressing the
“source” button in RStudio).
4.6 Using data frames: the gapminder dataset
To recap what we’ve just learned, let’s have a look at our example data (life expectancy in various countries
for various years).
Remember, there are a few functions we can use to interrogate data structures in R:
class() # what is the data structure?
length() # how long is it? What about two dimensional objects?
attributes() # does it have any metadata?
str() # A full summary of the entire object
dim() # Dimensions of the object - also try nrow(), ncol()
Let’s use them to explore the gapminder dataset.
class(gapminder)
## [1] "data.frame"
The gapminder data is stored in a “data.frame”. This is the default data structure when you read in data,
and (as we’ve heard) is useful for storing data with mixed types of columns.
Let’s look at some of the columns.
4.7 Challenge 4: Data types in a real dataset
Look at the first 6 rows of the gapminder data frame we loaded before:
head(gapminder)
##       country year      pop continent lifeExp gdpPercap
## 1 Afghanistan 1952  8425333      Asia  28.801  779.4453
## 2 Afghanistan 1957  9240934      Asia  30.332  820.8530
## 3 Afghanistan 1962 10267083      Asia  31.997  853.1007
## 4 Afghanistan 1967 11537966      Asia  34.020  836.1971
## 5 Afghanistan 1972 13079460      Asia  36.088  739.9811
## [ reached getOption("max.print") -- omitted 1 row ]
Write down what data type you think is in each column
class(gapminder$year)
## [1] "integer"
class(gapminder$lifeExp)
## [1] "numeric"
Can anyone guess what we should expect the type of the continent column to be?
class(gapminder$continent)
## [1] "factor"
If you were expecting the answer to be "character", you would rightly be surprised by the answer.
One of the default behaviours of R is to treat any text columns as “factors” when reading in data. The
reason for this is that text columns often represent categorical data, which need to be factors to be handled
appropriately by the statistical modeling functions in R.
However it’s not obvious behaviour, and something that trips many people up. We can disable this behaviour
when we read in the data.
gapminder <- read.csv(file="data/gapminder-FiveYearData.csv",
stringsAsFactors = FALSE)
4.8 Tip
I highly recommend burning this pattern into your memory, or getting it tattooed onto your arm.
The first thing you should do when reading data in, is check that it matches what you expect, even if the
command ran without warnings or errors. The str function, short for “structure”, is really useful for this:
str(gapminder)
## 'data.frame':    1704 obs. of  6 variables:
##  $ country  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: chr  "Asia" "Asia" "Asia" "Asia" ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...
We can see that the object is a data.frame with 1,704 observations (rows), and 6 variables (columns). Below
that, we see the name of each column, followed by a “:”, followed by the type of variable in that column,
along with the first few entries.
As discussed above, we can retrieve or modify the column or row names of the data.frame:
colnames(gapminder)
## [1] "country"
"year"
"pop"
"continent" "lifeExp"
"gdpPercap"
copy <- gapminder
colnames(copy) <- letters[1:6]
head(copy, n=3)
##
a
b
c
d
e
f
## 1 Afghanistan 1952 8425333 Asia 28.801 779.4453
## 2 Afghanistan 1957 9240934 Asia 30.332 820.8530
## 3 Afghanistan 1962 10267083 Asia 31.997 853.1007
4.9 Challenge 5
Recall that we also used the names function (above) to modify column names. Does it matter
which you use? You can check help with ?names and ?colnames to see whether it should matter.
rownames(gapminder)[1:20]
##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14"
## [15] "15" "16" "17" "18" "19" "20"
See those numbers in the square brackets on the left? That tells you the number of the first entry in that row
of output. So we see that for the 5th row, the rowname is “5”. In this case, the rownames are simply the row
numbers.
4.10 Challenge Solutions
Solutions to challenges 2 & 3.
4.11 Solution to Challenge 2
Create a data frame that holds the following information for yourself:
• First name
• Last name
• Age
Then use rbind to add the same information for the people sitting near you.
Now use cbind to add a column of logicals answering the question, “Is there anything in this
workshop you’re finding confusing?”
my_df <- data.frame(first_name = "Andy", last_name = "Teucher", age = 36)
my_df <- rbind(my_df, data.frame(first_name = "Jane", last_name = "Smith", age = 29))
my_df <- rbind(my_df, data.frame(first_name = c("Jo", "John"), last_name = c("White", "Lee"), age =
my_df <- cbind(my_df, confused = c(FALSE, FALSE, TRUE, FALSE))
4.12 Solution to Challenge 5
?colnames tells you that the colnames function is the same as names for a data frame. For other
structures, they may not be the same. In particular, names does not work for matrices, but
colnames does. You can verify this with
m <- matrix(1:9, nrow=3)
colnames(m) <- letters[1:3] # works as you would expect
names(m) <- letters[1:3] # destroys the matrix
5 Data visualization with ggplot2
5.1 Learning Objectives
• To be able to use ggplot2 to generate publication quality graphics
• To understand the basics of the grammar of graphics:
  – The aesthetics layer
  – The geometry layer
  – Adding statistics
  – Transforming scales
  – Coloring or paneling by groups.
Plotting our data is one of the best ways to quickly explore it and the various relationships between variables.
There are three main plotting systems in R, the base plotting system, the lattice package, and the ggplot2
package.
5.2 Base plotting
R's base (built-in) plotting functions are powerful and very flexible, but not overly user friendly. For simple
exploratory plots that don't need to look nice, they are useful. They are generally specified as plot(x, y, ...)
plot(gapminder$lifeExp, gapminder$gdpPercap)
[Figure: scatterplot of gapminder$gdpPercap against gapminder$lifeExp]
You can also specify them in a formula format plot(y ~ x, data='', ...)
plot(gdpPercap ~ lifeExp, data=gapminder)
[Figure: scatterplot of gdpPercap against lifeExp]
hist(gapminder$gdpPercap)
[Figure: histogram of gapminder$gdpPercap]
boxplot(gdpPercap ~ continent, data=gapminder)
[Figure: boxplots of gdpPercap by continent (Africa, Americas, Asia, Europe, Oceania)]
5.3 ggplot2
Today we’ll be learning about the ggplot2 package developed by Hadley Wickham, because it is the most
effective for creating publication quality graphics.
5.3.1 ggplot2 and the Grammar of Graphics
The ggplot2 package provides an R implementation of Leland Wilkinson’s Grammar of Graphics (1999). The
Grammar of Graphics challenges data analysts to think beyond the garden variety plot types (e.g. scatter-plot,
barplot) and to consider the components that make up a plot or graphic, such as how data are represented
on the plot (as lines, points, etc.), how variables are mapped to coordinates or plotting shape or colour, what
transformation or statistical summary is required, and so on. Specifically, ggplot2 allows users to build a
plot layer-by-layer by specifying:
• The data,
• some aesthetics, that map variables in the data to a visual representation on the plot. This tells ggplot2
how to show each variable, such as axes on the plot or to size, shape, color, etc.
• a geom, which specifies the geometry of how the data are represented on the plot (points, lines, bars,
etc.),
• a stat, a statistical transformation or summary of the data applied prior to plotting,
• facets, that allow the data to be divided into chunks on the basis of other categorical or continuous
variables and the same plot drawn for each chunk.
Because ggplot2 implements a layered grammar of graphics, data points and additional information (scatterplot smoothers, confidence bands, etc.) can be added to the plot via additional layers, each of which utilize
further geoms, aesthetics, and stats.
Let’s start off with an example:
library(ggplot2)
ggplot(data = gapminder, aes(x = lifeExp, y = gdpPercap)) +
geom_point()
[Figure: scatterplot of gdpPercap against lifeExp]
So the first thing we do is call the ggplot function. This function lets R know that we’re creating a new plot,
and any of the arguments we give the ggplot function are the global options for the plot: they apply to all
layers on the plot.
We’ve passed in two arguments to ggplot. First, we tell ggplot what data we want to show on our figure, in
this example the gapminder data we read in earlier. For the second argument we passed in the aes function,
which tells ggplot how variables in the data map to aesthetic properties of the figure, in this case the x and
y locations. Here we told ggplot we want to plot the “lifeExp” column of the gapminder data frame on the
x-axis, and the “gdpPercap” column on the y-axis. Notice that we didn’t need to explicitly pass aes these
columns (e.g. x = gapminder[, "lifeExp"]), this is because ggplot is smart enough to know to look in
the data for that column!
By itself, the call to ggplot isn’t enough to draw a figure:
ggplot(data = gapminder, aes(x = lifeExp, y = gdpPercap))
[Figure: a blank plotting area with lifeExp on the x axis and gdpPercap on the y axis, but no data drawn]
We need to tell ggplot how we want to visually represent the data, which we do by adding a new geom
layer. In our example, we used geom_point, which tells ggplot we want to visually represent the relationship
between x and y as a scatterplot of points:
ggplot(data = gapminder, aes(x = lifeExp, y = gdpPercap)) +
geom_point()
[Figure: scatterplot of gdpPercap against lifeExp]
5.4 Challenge 1
Modify the example so that the figure visualises how life expectancy has changed over time:
ggplot(data = gapminder, aes(x = lifeExp, y = gdpPercap)) + geom_point()
Hint: the gapminder dataset has a column called “year”, which should appear on the x-axis.
5.5 Challenge 2
In the previous examples and challenge we’ve used the aes function to tell the scatterplot geom
about the x and y locations of each point. Another aesthetic property we can modify is the point
color. Modify the code from the previous challenge to color the points by the “continent” column.
What trends do you see in the data? Are they what you expected?
5.6 Geom Layers
Using a scatterplot probably isn’t the best for visualising change over time. Instead, let’s tell ggplot to
visualise the data as a line plot:
ggplot(data = gapminder, aes(x=year, y=lifeExp, by=country, color=continent)) +
geom_line()
[Figure: lifeExp over year as lines, one line per country, coloured by continent]
Instead of adding a geom_point layer, we’ve added a geom_line layer. We’ve added the by aesthetic, which
tells ggplot to draw a line for each country.
But what if we want to visualise both lines and points on the plot? We can simply add another layer to the
plot:
ggplot(data = gapminder, aes(x=year, y=lifeExp, by=country, color=continent)) +
geom_line() + geom_point()
[Figure: lifeExp over year with both lines and points, coloured by continent]
It’s important to note that each layer is drawn on top of the previous layer. In this example, the points have
been drawn on top of the lines. Here’s a demonstration:
ggplot(data = gapminder, aes(x=year, y=lifeExp, by=country)) +
geom_line(aes(color=continent)) + geom_point()
[Figure: lifeExp over year; lines coloured by continent, with black points drawn on top]
In this example, the aesthetic mapping of color has been moved from the global plot options in ggplot to
the geom_line layer so it no longer applies to the points. Now we can clearly see that the points are drawn
on top of the lines.
5.7 Challenge 3
Switch the order of the point and line layers from the previous example. What happened?
There are many other geoms we can use to explore the data. One common one is a histogram, so we can see
the distribution of a single variable:
ggplot(gapminder, aes(x = gdpPercap)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
[Figure: histogram of gdpPercap (count on the y axis)]
Or a boxplot to compare distribution of life expectancy across continents:
ggplot(gapminder, aes(x = continent, y = lifeExp)) + geom_boxplot()
[Figure: boxplots of lifeExp by continent]
5.8 Transformations and statistics
ggplot also makes it easy to overlay statistical models over the data. To demonstrate we’ll go back to our
first example:
ggplot(data = gapminder, aes(x = lifeExp, y = gdpPercap, color=continent)) +
geom_point()
[Figure: scatterplot of gdpPercap against lifeExp, coloured by continent]
Currently it’s hard to see the relationship between the points due to some strong outliers in GDP per capita.
We can change the scale of units on the y axis using the scale functions. These control the mapping between
the data values and visual values of an aesthetic.
ggplot(data = gapminder, aes(x = lifeExp, y = gdpPercap)) +
geom_point() + scale_y_log10()
[Figure: gdpPercap against lifeExp with a log10-scaled y axis]
The log10 function applied a transformation to the values of the gdpPercap column before rendering them
on the plot, so that each multiple of 10 now only corresponds to an increase in 1 on the transformed scale,
e.g. a GDP per capita of 1,000 is now 3 on the y axis, a value of 10,000 corresponds to 4 on the y axis and so
on. This makes it easier to visualise the spread of data on the y-axis.
We can fit a simple linear relationship to the data by adding another layer, geom_smooth:
ggplot(data = gapminder, aes(x = lifeExp, y = gdpPercap)) +
geom_point() + scale_y_log10() + geom_smooth(method="lm")
[Figure: gdpPercap against lifeExp (log10 y axis) with a fitted linear trend line]
We can make the line thicker by setting the size aesthetic in the geom_smooth layer:
ggplot(data = gapminder, aes(x = lifeExp, y = gdpPercap)) +
geom_point() + scale_y_log10() + geom_smooth(method="lm", size=1.5)
[Figure: the same plot with a thicker fitted line]
There are two ways an aesthetic can be specified. Here we set the size aesthetic by passing it as an argument
to geom_smooth. Previously in the lesson we’ve used the aes function to define a mapping between data
variables and their visual representation.
5.9 Challenge 4
Modify the color and size of the points on the point layer in the previous example.
Hint: do not use the aes function.
5.10 Multi-panel figures
Earlier we visualised the change in life expectancy over time across all countries in one plot. Alternatively,
we can split this out over multiple panels by adding a layer of facet panels:
ggplot(data = gapminder, aes(x = year, y = lifeExp, color=continent)) +
geom_line() + facet_wrap( ~ country)
[Figure: lifeExp over year, one panel per country, lines coloured by continent]
The facet_wrap layer took a “formula” as its argument, denoted by the tilde (~). This tells R to draw a
panel for each unique value in the country column of the gapminder dataset.
5.11 Modifying text
To clean this figure up for a publication we need to change some of the text elements. The x-axis is way too
cluttered, and the y axis should read “Life expectancy”, rather than the column name in the data frame.
We can do this by adding a couple of different layers. The theme layer controls the axis text, and overall
text size, and there are special layers for changing the axis labels. To change the legend title, we need to use
the scales layer.
ggplot(data = gapminder, aes(x = year, y = lifeExp, color=continent)) +
geom_line() + facet_wrap( ~ country) +
xlab("Year") + ylab("Life expectancy") + ggtitle("Figure 1") +
scale_fill_discrete(name="Continent") +
theme(axis.text.x=element_blank(), axis.ticks.x=element_blank())
[Figure 1: lifeExp over year, one panel per country, with the y axis labelled "Life expectancy", the x axis labelled "Year", and the x-axis text removed]
This is just a taste of what you can do with ggplot2. RStudio provides a really useful cheat sheet of the
different layers available, and more extensive documentation is available on the ggplot2 website. Finally, if
you have no idea how to change something, a quick google search will usually send you to a relevant question
and answer on Stack Overflow with reusable code to modify!
5.12 Challenge 5
Create a density plot of GDP per capita, filled by continent.
Advanced: - Transform the x axis to better visualise the data spread. - Add a facet layer to panel
the density plots by year.
5.12.1 Further ggplot2 resources
• The official ggplot2 documentation
• The ggplot2 book, by the developer, Hadley Wickham
• The ggplot2 Google Group (mailing list, discussion forum).
• Intermediate Software Carpentry lesson on data visualization with ggplot2.
• A blog with a good number of posts describing how to reproduce various kinds of plots using ggplot2.
• Thousands of questions and answers tagged with "ggplot2" on Stack Overflow, a programming Q&A site.
5.13 Challenge solutions
Solutions to challenges
5.14 Solution to challenge 1
Modify the example so that the figure visualises how life expectancy has changed over time:
ggplot(data = gapminder, aes(x = year, y = lifeExp)) + geom_point()
[Figure: scatterplot of lifeExp against year]
5.15 Solution to challenge 2
In the previous examples and challenge we’ve used the aes function to tell the scatterplot geom
about the x and y locations of each point. Another aesthetic property we can modify is the point
color. Modify the code from the previous challenge to color the points by the “continent” column.
What trends do you see in the data? Are they what you expected?
ggplot(data = gapminder, aes(x = year, y = lifeExp, color=continent)) +
geom_point()
[Figure: scatterplot of lifeExp against year, coloured by continent]
5.16 Solution to challenge 3
Switch the order of the point and line layers from the previous example. What happened?
ggplot(data = gapminder, aes(x=year, y=lifeExp, by=country)) +
geom_point() + geom_line(aes(color=continent))
[Figure: lifeExp over year with points drawn first and coloured lines drawn on top]
The lines now get drawn over the points!
5.17 Solution to challenge 4
Modify the color and size of the points on the point layer in the previous example.
Hint: do not use the aes function.
ggplot(data = gapminder, aes(x = lifeExp, y = gdpPercap)) +
geom_point(size=3, color="orange") + scale_y_log10() +
geom_smooth(method="lm", size=1.5)
[Figure: large orange points of gdpPercap against lifeExp (log10 y axis) with a thick fitted line]
5.18 Solution to challenge 5
Create a density plot of GDP per capita, filled by continent.
Advanced: - Transform the x axis to better visualise the data spread. - Add a facet layer to panel
the density plots by year.
ggplot(data = gapminder, aes(x = gdpPercap, fill=continent)) +
geom_density(alpha=0.6) + facet_wrap( ~ year) + scale_x_log10()
[Figure: density plots of gdpPercap (log10 x axis), filled by continent, one panel per year]
6 Subsetting data
6.1 Learning Objectives
• To be able to subset vectors and data frames
• To be able to extract individual and multiple elements:
– by index,
– by name,
– using comparison operations
• To be able to skip and remove elements from various data structures.
R has many powerful subset operators and mastering them will allow you to easily perform complex operations
on any kind of dataset.
There are six different ways we can subset any kind of object, and three different subsetting operators for the
different data structures.
Let’s start with the workhorse of R: atomic vectors.
x <- c(5.4, 6.2, 7.1, 4.8, 7.5)
names(x) <- c('a', 'b', 'c', 'd', 'e')
x
##   a   b   c   d   e
## 5.4 6.2 7.1 4.8 7.5
So now that we’ve created a dummy vector to play with, how do we get at its contents?
50
6.2
Accessing elements using their indices
To extract elements of a vector we can give their corresponding index, starting from one:
x[1]
##   a
## 5.4
x[4]
##   d
## 4.8
The square brackets operator is just like any other function. For atomic vectors (and matrices), it means “get
me the nth element”.
We can ask for multiple elements at once:
x[c(1, 3)]
##   a   c
## 5.4 7.1
Or slices of the vector:
x[1:4]
##   a   b   c   d
## 5.4 6.2 7.1 4.8
the : operator just creates a sequence of numbers from the left element to the right. I.e. x[1:4] is equivalent
to x[c(1,2,3,4)].
We can ask for the same element multiple times:
x[c(1,1,3)]
##   a   a   c
## 5.4 5.4 7.1
If we ask for a number outside of the vector, R will return missing values:
x[6]
## <NA>
##   NA
This is a vector of length one containing an NA, whose name is also NA.
If we ask for the 0th element, we get an empty vector:
x[0]
## named numeric(0)
6.3
Vector numbering in R starts at 1
In many programming languages (C and python, for example), the first element of a vector has
an index of 0. In R, the first element is 1.
6.4
Skipping and removing elements
If we use a negative number as the index of a vector, R will return every element except for the one specified:
x[-2]
##   a   c   d   e
## 5.4 7.1 4.8 7.5
We can skip multiple elements:
x[c(-1, -5)]
# or x[-c(1,5)]
##   b   c   d
## 6.2 7.1 4.8
6.5
Tip: Order of operations
A common trip up for novices occurs when trying to skip slices of a vector. Most people first try
to negate a sequence like so:
x[-1:3]
## Error in x[-1:3]: only 0's may be mixed with negative subscripts
This gives a somewhat cryptic error:
But remember the order of operations. : is really a function, so what happens is it takes its first
argument as -1, and second as 3, so generates the sequence of numbers: c(-1, 0, 1, 2, 3).
The correct solution is to wrap that function call in brackets, so that the - operator applies to
the results:
x[-(1:3)]
##   d   e
## 4.8 7.5
To remove elements from a vector, we need to assign the results back into the variable:
x <- x[-4]
x
##   a   b   c   e
## 5.4 6.2 7.1 7.5
6.6
Challenge 1
Given the following code:
x <- c(5.4, 6.2, 7.1, 4.8, 7.5)
names(x) <- c('a', 'b', 'c', 'd', 'e')
print(x)
##   a   b   c   d   e
## 5.4 6.2 7.1 4.8 7.5
1. Come up with at least 3 different commands that will produce the following output:
##   b   c   d
## 6.2 7.1 4.8
2. Compare notes with your neighbour. Did you have different strategies?
6.7
Subsetting by name
We can extract elements by using their name, instead of index:
x[c("a", "c")]
##   a   c
## 5.4 7.1
This is usually a much more reliable way to subset objects: the position of various elements can often change
when chaining together subsetting operations, but the names will always remain the same!
Unfortunately we can’t skip or remove elements so easily.
To skip (or remove) a single named element:
x[-which(names(x) == "a")]
##   b   c   d   e
## 6.2 7.1 4.8 7.5
The which function returns the indices of all TRUE elements of its argument. Remember that expressions evaluate before being passed to functions. Let's break this down so that it's clearer what's happening.
First this happens:
names(x) == "a"
## [1]  TRUE FALSE FALSE FALSE FALSE
The condition operator is applied to every name of the vector x. Only the first name is “a” so that element is
TRUE.
which then converts this to an index:
which(names(x) == "a")
## [1] 1
Only the first element is TRUE, so which returns 1. Now that we have indices the skipping works because we
have a negative index!
Skipping multiple named indices is similar, but uses a different comparison operator:
x[-which(names(x) %in% c("a", "c"))]
##   b   d   e
## 6.2 4.8 7.5
The %in% goes through each element of its left argument, in this case the names of x, and asks, “Does this
element occur in the second argument?”.
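To see what %in% returns on its own, before which() and the minus sign are applied, here is a quick sketch (assuming x still has the names a to e):
names(x) %in% c("a", "c")          # TRUE FALSE  TRUE FALSE FALSE
which(names(x) %in% c("a", "c"))   # 1 3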
6.8
Challenge 2
Run the following code to define vector x as above:
x <- c(5.4, 6.2, 7.1, 4.8, 7.5)
names(x) <- c('a', 'b', 'c', 'd', 'e')
print(x)
##   a   b   c   d   e
## 5.4 6.2 7.1 4.8 7.5
Given this vector x, what would you expect the following to do?
x[-which(names(x) == "g")]
Try out this command and see what you get. Did this match your expectation? Why did we get
this result? (Tip: test out each part of the command on its own like we just did above - this is a
useful debugging strategy)
Which of the following are true:
• A) if there are no TRUE values passed to which, an empty vector is returned
• B) if there are no TRUE values passed to which, an error message is shown
• C) integer() is an empty vector
• D) making an empty vector negative produces an “everything” vector
• E) x[] gives the same result as x[integer()]
54
6.9
Tip: Non-unique names
You should be aware that it is possible for multiple elements in a vector to have the same name.
(For a data frame, columns can have the same name — although R tries to avoid this — but row
names must be unique.) Consider these examples:
x <- 1:3
x
## [1] 1 2 3
names(x) <- c('a', 'a', 'a')
x
## a a a
## 1 2 3
x['a']
# only returns first value
## a
## 1
x[which(names(x) == 'a')]
# returns all three values
## a a a
## 1 2 3
6.10
Tip: Getting help for operators
Remember you can search for help on operators by wrapping them in quotes: help("%in%") or
?"%in%".
So why can’t we use == like before? That’s an excellent question.
Let’s take a look at just the comparison component:
names(x) == c('a', 'c')
## Warning in names(x) == c("a", "c"): longer object length is not a multiple
## of shorter object length
## [1]  TRUE FALSE  TRUE
Obviously “c” is in the names of x, so why didn’t this work? == works slightly differently than %in%. It will
compare each element of its left argument to the corresponding element of its right argument.
Here’s a mock illustration:
c("a", "b", "c", "e")
|
|
|
|
c("a", "c")
# names of x
# The elements == is comparing
When one vector is shorter than the other, it gets recycled:
c("a", "b", "c", "e")  # names of x
  |    |    |    |     # The elements == is comparing
c("a", "c", "a", "c")
In this case R simply repeats c("a", "c") twice. If the longer vector length isn’t a multiple of the shorter
vector length, then R will also print out a warning message:
names(x) == c('a', 'c', 'e')
## [1]  TRUE FALSE FALSE
This difference between == and %in% is important to remember, because it can introduce hard to find and
subtle bugs!
6.11
Subsetting through other logical operations
We can also more simply subset through logical operations:
x[c(TRUE, TRUE, FALSE, FALSE)]
## a a
## 1 2
Note that in this case, the logical vector is also recycled to the length of the vector we’re subsetting!
x[c(TRUE, FALSE)]
## a a
## 1 3
Since comparison operators evaluate to logical vectors, we can also use them to succinctly subset vectors:
x[x > 7]
## named integer(0)
6.12
Tip: Combining logical operations
There are many situations in which you will wish to combine multiple conditions. To do so several
logical operations exist in R:
• | (logical OR): returns TRUE if either the left or the right side is TRUE.
• & (logical AND): returns TRUE only if both the left and the right side are TRUE.
• ! (logical NOT): converts TRUE to FALSE and FALSE to TRUE.
• & and | are vectorised: they compare the individual elements of two vectors, and recycling rules also apply. In contrast, && and || only look at the first element of each vector, so they are best kept for single TRUE/FALSE values, for example inside if() statements.
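For example, a quick sketch combining comparisons on the x defined above:
x[x < 5 | x > 7]        # values less than 5 OR greater than 7: 7.1, 4.8 and 7.5
x[!(names(x) == "a")]   # everything except the element named "a"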
6.13
Challenge 3
Given the following code:
x <- c(5.4, 6.2, 7.1, 4.8, 7.5)
names(x) <- c('a', 'b', 'c', 'd', 'e')
print(x)
##   a   b   c   d   e
## 5.4 6.2 7.1 4.8 7.5
1. Write a subsetting command to return the values in x that are greater than 4 and less than
7.
6.14
Handling special values
At some point you will encounter functions in R which cannot handle missing, infinite, or undefined data.
There are a number of special functions you can use to filter out this data:
• is.na will return all positions in a vector, matrix, or data.frame containing NA.
• likewise, is.nan, and is.infinite will do the same for NaN and Inf.
• is.finite will return all positions in a vector, matrix, or data.frame that do not contain NA, NaN or
Inf.
• na.omit will filter out all missing values from a vector
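A small made-up example (the vector y below is hypothetical, not part of the gapminder data):
y <- c(1, NA, 3, NaN, Inf)
is.na(y)       # TRUE for the NA and the NaN
is.finite(y)   # FALSE for the NA, NaN and Inf
na.omit(y)     # drops the missing values, leaving 1, 3 and Inf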
6.15
Data frames
Remember that data frames are lists under the hood, so similar rules apply. However, they are also two-dimensional objects:
[ with one argument will act the same way as for lists, where each list element corresponds to a column. The resulting object will be a data frame:
head(gapminder[3])
##        pop
## 1  8425333
## 2  9240934
## 3 10267083
## 4 11537966
## 5 13079460
## 6 14880372
Similarly, [[ will act to extract a single column:
head(gapminder[["lifeExp"]])
## [1] 28.801 30.332 31.997 34.020 36.088 38.438
And $ provides a convenient shorthand to extract columns by name:
head(gapminder$year)
## [1] 1952 1957 1962 1967 1972 1977
With two arguments, [ behaves the same way as for matrices:
gapminder[1:3,]
##       country year      pop continent lifeExp gdpPercap
## 1 Afghanistan 1952  8425333      Asia  28.801  779.4453
## 2 Afghanistan 1957  9240934      Asia  30.332  820.8530
## 3 Afghanistan 1962 10267083      Asia  31.997  853.1007
If we subset a single row, the result will be a data frame (because the elements are mixed types):
gapminder[3,]
##       country year      pop continent lifeExp gdpPercap
## 3 Afghanistan 1962 10267083      Asia  31.997  853.1007
But for a single column the result will be a vector (this can be changed with the third argument, drop =
FALSE).
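For example, a quick sketch of the difference:
head(gapminder[, "lifeExp"])                # a plain vector
head(gapminder[, "lifeExp", drop = FALSE])  # still a data frame with one column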
6.16
Challenge 4
Fix each of the following common data frame subsetting errors:
1. Extract observations collected for the year 1957
gapminder[gapminder$year = 1957,]
2. Extract all columns except 1 through to 4
gapminder[,-1:4]
3. Extract the rows where the life expectancy is longer than 80 years
gapminder[gapminder$lifeExp > 80]
4. Extract the first row, and the fourth and fifth columns (lifeExp and gdpPercap).
gapminder[1, 4, 5]
5. Advanced: extract rows that contain information for the years 2002 and 2007
gapminder[gapminder$year == 2002 | 2007,]
6.17
Challenge 5
1. Why does gapminder[1:20] return an error? How does it differ from gapminder[1:20, ]?
2. Create a new data.frame called gapminder_small that only contains rows 1 through 9 and
19 through 23. You can do this in one or two steps.
6.18
Challenge solutions
Solutions to challenges.
6.19
Solution to challenge 1
Given the following code:
x <- c(5.4, 6.2, 7.1, 4.8, 7.5)
names(x) <- c('a', 'b', 'c', 'd', 'e')
print(x)
##   a   b   c   d   e
## 5.4 6.2 7.1 4.8 7.5
1. Come up with at least 3 different commands that will produce the following output:
##   b   c   d
## 6.2 7.1 4.8
x[2:4]
x[-c(1,5)]
x[c("b", "c", "d")]
x[c(2,3,4)]
6.20
Solution to challenge 2
Run the following code to define vector x as above:
x <- c(5.4, 6.2, 7.1, 4.8, 7.5)
names(x) <- c('a', 'b', 'c', 'd', 'e')
print(x)
##   a   b   c   d   e
## 5.4 6.2 7.1 4.8 7.5
Given this vector x, what would you expect the following to do?
x[-which(names(x) == "g")]
Try out this command and see what you get. Did this match your expectation?
Why did we get this result? (Tip: test out each part of the command on its own like we just did
above - this is a useful debugging strategy)
Which of the following are true:
• A) if there are no TRUE values passed to “which”, an empty vector is returned
• B) if there are no TRUE values passed to “which”, an error message is shown
• C) integer() is an empty vector
• D) making an empty vector negative produces an “everything” vector
• E) x[] gives the same result as x[integer()]
Answer: A and C are correct.
The which command returns the index of every TRUE value in its input. The names(x) == "g"
command didn’t return any TRUE values. Because there were no TRUE values passed to the which
command, it returned an empty vector. Negating this vector with the minus sign didn’t change
its meaning. Because we used this empty vector to retrieve values from x, it produced an empty
numeric vector. It was a named numeric empty vector because the vector type of x is “named
numeric” since we assigned names to the values (try str(x) ).
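Stepping through it piece by piece (a sketch using the x defined above):
names(x) == "g"              # FALSE FALSE FALSE FALSE FALSE
which(names(x) == "g")       # integer(0): an empty vector
-which(names(x) == "g")      # still integer(0); negating it doesn't change anything
x[-which(names(x) == "g")]   # named numeric(0): an empty (named) numeric vector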
6.21
Solution to challenge 4
Fix each of the following common data frame subsetting errors:
1. Extract observations collected for the year 1957
# gapminder[gapminder$year = 1957,]
gapminder[gapminder$year == 1957,]
2. Extract all columns except 1 through to 4
# gapminder[,-1:4]
gapminder[,-c(1:4)]
3. Extract the rows where the life expectancy is longer than 80 years
# gapminder[gapminder$lifeExp > 80]
gapminder[gapminder$lifeExp > 80,]
4. Extract the first row, and the fourth and fifth columns (lifeExp and gdpPercap).
# gapminder[1, 4, 5]
gapminder[1, c(4, 5)]
5. Advanced: extract rows that contain information for the years 2002 and 2007
# gapminder[gapminder$year == 2002 | 2007,]
gapminder[gapminder$year == 2002 | gapminder$year == 2007,]
gapminder[gapminder$year %in% c(2002, 2007),]
6.22
Solution to challenge 5
1. Why does gapminder[1:20] return an error? How does it differ from gapminder[1:20, ]?
Answer: gapminder is a data.frame so needs to be subsetted on two dimensions. gapminder[1:20,
] subsets the data to give the first 20 rows and all columns.
2. Create a new data.frame called gapminder_small that only contains rows 1 through 9 and
19 through 23. You can do this in one or two steps.
gapminder_small <- gapminder[c(1:9, 19:23),]
7
Dataframe manipulation with dplyr
7.1
Learning Objectives
• To be able to use the five main dataframe manipulation ‘verbs’ with pipes in dplyr
Manipulation of dataframes means many things to many researchers: we often select certain observations (rows) or variables (columns), group the data by certain variable(s), or calculate summary statistics. We can do these operations using normal base R operations:
mean(gapminder[gapminder$continent == "Africa", "gdpPercap"])
## [1] 2193.755
mean(gapminder[gapminder$continent == "Americas", "gdpPercap"])
## [1] 7136.11
mean(gapminder[gapminder$continent == "Asia", "gdpPercap"])
## [1] 7902.15
But this isn’t very nice because there is a fair bit of repetition. Repeating yourself will cost you time, both
now and later, and potentially introduce some nasty bugs.
7.2
The dplyr package
Luckily, the dplyr package provides a number of very useful functions for manipulating dataframes in a way
that will reduce the above repetition, reduce the probability of making errors, and probably even save you
some typing. As an added bonus, you might even find the dplyr grammar easier to read.
Here we’re going to cover 6 of the most commonly used functions as well as using pipes (%>%) to combine
them.
1.
2.
3.
4.
5.
select()
filter()
group_by()
summarize()
mutate()
If you have not installed this package earlier, please do so:
install.packages('dplyr')
Now let’s load the package:
library(dplyr)
7.3
Using select()
If, for example, we wanted to move forward with only a few of the variables in our dataframe we could use
the select() function. This will keep only the variables you select.
year_country_gdp <- select(gapminder,year,country,gdpPercap)
If we open up year_country_gdp we’ll see that it only contains the year, country and gdpPercap. Above we
used ‘normal’ grammar, but the strengths of dplyr lie in combining several functions using pipes. Since the
pipes grammar is unlike anything we’ve seen in R before, let’s repeat what we’ve done above using pipes.
year_country_gdp <- gapminder %>% select(year,country,gdpPercap)
To help you understand why we wrote that in that way, let’s walk through it step by step. First we summon
the gapminder dataframe and pass it on, using the pipe symbol %>%, to the next step, which is the select()
function. In this case we don't specify which data object we use in the select() function since it gets that
from the previous pipe. Fun Fact: There is a good chance you have encountered pipes before in the shell.
In R, a pipe symbol is %>% while in the shell it is | but the concept is the same!
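To make the equivalence concrete, here is a quick sketch: the pipe passes whatever is on its left as the first argument of the function on its right, so these two lines do exactly the same thing.
head(gapminder, 2)
gapminder %>% head(2)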
7.4
Using filter()
If we now wanted to move forward with the above, but only with European countries, we can combine select
and filter
year_country_gdp_euro <- gapminder %>%
filter(continent=="Europe") %>%
select(year,country,gdpPercap)
7.5
Challenge 1
Write a single command (which can span multiple lines and includes pipes) that will produce
a dataframe that has the African values for lifeExp, country and year, but not for other
Continents. How many rows does your dataframe have and why?
As with last time, first we pass the gapminder dataframe to the filter() function, then we pass the filtered
version of the gapminder dataframe to the select() function. Note: The order of operations is very
important in this case. If we used ‘select’ first, filter would not be able to find the variable continent since we
would have removed it in the previous step.
7.6
Using group_by() and summarize()
Now, we were supposed to be reducing the error-prone repetitiveness of what can be done with base R, but up to now we haven't done that, since we would have to repeat the above for each continent. Instead of filter(), which only keeps observations that meet your criteria (in the above: continent=="Europe"), we can use group_by(), which essentially splits the data by every unique value of the grouping variable(s) - every value you could have used in filter().
str(gapminder)
## 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...
str(gapminder %>% group_by(continent))
## Classes 'grouped_df', 'tbl_df', 'tbl' and 'data.frame': 1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...
##  - attr(*, "vars")=List of 1
##   ..$ : symbol continent
##  - attr(*, "drop")= logi TRUE
##  - attr(*, "indices")=List of 5
##   ..$ : int  24 25 26 27 28 29 30 31 32 33 ...
##   ..$ : int  48 49 50 51 52 53 54 55 56 57 ...
##   ..$ : int  0 1 2 3 4 5 6 7 8 9 ...
##   ..$ : int  12 13 14 15 16 17 18 19 20 21 ...
##   ..$ : int  60 61 62 63 64 65 66 67 68 69 ...
##  - attr(*, "group_sizes")= int  624 300 396 360 24
##  - attr(*, "biggest_group_size")= int  624
##  - attr(*, "labels")='data.frame':  5 obs. of  1 variable:
##   ..$ continent: Factor w/ 5 levels "Africa","Americas",..: 1 2 3 4 5
##   ..- attr(*, "vars")=List of 1
##   .. ..$ : symbol continent
##   ..- attr(*, "drop")= logi TRUE
You will notice that the structure of the dataframe where we used group_by() (grouped_df) is not the same as the original gapminder (data.frame). A grouped_df can be thought of as a list where each item in the list is a data.frame containing only the rows that correspond to a particular value of continent (at least in the example above).
7.7
Using summarize()
The above was a bit on the uneventful side, because group_by() is much more exciting in conjunction with summarize(). This will allow us to create new variable(s) by applying functions to each of the continent-specific data frames. That is to say, using the group_by() function we split our original dataframe into multiple pieces, and then we can run functions (e.g. mean() or sd()) within summarize().
gdp_bycontinents <- gapminder %>%
group_by(continent) %>%
summarize(mean_gdpPercap=mean(gdpPercap))
That allowed us to calculate the mean gdpPercap for each continent, but it gets even better.
7.8
Challenge 2
Calculate the average life expectancy per country. Which had the longest life expectancy and
which had the shortest life expectancy?
The function group_by() allows us to group by multiple variables. Let’s group by year and continent.
gdp_bycontinents_byyear <- gapminder %>%
group_by(continent,year) %>%
summarize(mean_gdpPercap=mean(gdpPercap))
That is already quite powerful, but it gets even better! You’re not limited to defining 1 new variable in
summarize().
gdp_pop_bycontinents_byyear <- gapminder %>%
group_by(continent,year) %>%
summarize(mean_gdpPercap=mean(gdpPercap),
sd_gdpPercap=sd(gdpPercap),
mean_pop=mean(pop),
sd_pop=sd(pop))
7.9
Using mutate()
We can also create new variables prior to (or even after) summarizing information using mutate().
gdp_pop_bycontinents_byyear <- gapminder %>%
  mutate(gdp_billion = gdpPercap * pop / 10^9) %>%
  group_by(continent, year) %>%
  summarize(mean_gdpPercap = mean(gdpPercap),
            sd_gdpPercap = sd(gdpPercap),
            mean_pop = mean(pop),
            sd_pop = sd(pop),
            mean_gdp_billion = mean(gdp_billion),
            sd_gdp_billion = sd(gdp_billion))
7.10
Advanced Challenge
Calculate the average life expectancy in 2002 of 2 randomly selected countries for each continent.
Then arrange the continent names in reverse order. Hint: Use the dplyr functions arrange()
and sample_n(), they have similar syntax to other dplyr functions.
7.11
Solution to Challenge 1
year_country_lifeExp_Africa <- gapminder %>%
filter(continent=="Africa") %>%
select(year,country,lifeExp)
7.12
Solution to Challenge 2
lifeExp_bycountry <- gapminder %>%
group_by(country) %>%
summarize(mean_lifeExp=mean(lifeExp))
7.13
Solution to Advanced Challenge
lifeExp_2countries_bycontinents <- gapminder %>%
filter(year==2002) %>%
group_by(continent) %>%
sample_n(2) %>%
summarize(mean_lifeExp=mean(lifeExp)) %>%
arrange(desc(mean_lifeExp))
7.14
Other great resources
Data Wrangling Cheat sheet
Introduction to dplyr
8
Dataframe manipulation with tidyr
8.1
Learning Objectives
• To understand the concepts of ‘long’ and ‘wide’ data formats and be able to convert between them with tidyr
Researchers often want to manipulate their data from the ‘wide’ to the ‘long’ format, or vice-versa. The ‘long’
format is where:
• each column is a variable
• each row is an observation
In the ‘long’ format, you usually have 1 column for the observed variable and the other columns are ID
variables.
For the ‘wide’ format each row is often a site/subject/patient and you have multiple observation variables
containing the same type of data. These can be either repeated observations over time, or observation of
multiple variables (or a mix of both). You may find that data entry is simpler in the ‘wide’ format, or that some other applications prefer it. However, many of R's functions have been designed assuming you have ‘long’ format data. This tutorial will help you efficiently transform your data regardless of its original format.
These data formats mainly affect readability. For humans, the wide format is often more intuitive since we can often see more of the data on the screen due to its shape. However, the long format is more machine readable and is closer to the formatting of databases. The ID variables in our dataframes are similar to the fields in a database and the observed variables are like the database values.
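As a tiny, made-up illustration of the two shapes (these example data are not part of gapminder):
# 'wide': one row per site, one column per year of counts
wide <- data.frame(site = c("A", "B"),
                   count_2019 = c(10, 20),
                   count_2020 = c(12, 18))

# 'long': one row per site-year observation
long <- data.frame(site  = c("A", "A", "B", "B"),
                   year  = c(2019, 2020, 2019, 2020),
                   count = c(10, 12, 20, 18))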
8.2
Getting started
First install the packages if you haven’t already done so (you probably installed dplyr in the previous lesson):
#install.packages("tidyr")
#install.packages("dplyr")
Load the packages
library("tidyr")
library("dplyr")
First, lets look at the structure of our original gapminder dataframe:
str(gapminder)
## 'data.frame':    1704 obs. of  6 variables:
##  $ country  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: chr  "Asia" "Asia" "Asia" "Asia" ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...
8.3
Challenge 1
Is gapminder a purely long, purely wide, or some intermediate format?
Sometimes, as with the gapminder dataset, we have multiple types of observed data. It is somewhere in
between the purely ‘long’ and ‘wide’ data formats. We have 3 “ID variables” (continent, country, year)
and 3 “Observation variables” (pop,lifeExp,gdpPercap). I usually prefer my data in this intermediate format
in most cases despite not having ALL observations in 1 column given that all 3 observation variables have
different units. There are few operations that would need us to stretch out this dataframe any longer (i.e. 4
ID variables and 1 Observation variable).
While using many of the functions in R, which are often vector based, you usually do not want to do
mathematical operations on values with different units. For example, using the purely long format, a single
mean for all of the values of population, life expectancy, and GDP would not be meaningful since it would
return the mean of values with 3 incompatible units. The solution is that we first manipulate the data either
by grouping (see the lesson on dplyr), or we change the structure of the dataframe. Note: Some plotting
functions in R actually work better in the wide format data.
8.4
From wide to long format with gather()
Until now, we’ve been using the nicely formatted original gapminder dataset, but ‘real’ data (i.e. our own
research data) will never be so well organized. Here let’s start with the wide format version of the gapminder
dataset.
str(gap_wide)
## 'data.frame':    142 obs. of  38 variables:
##  $ continent     : chr  "Africa" "Africa" "Africa" "Africa" ...
##  $ country       : chr  "Algeria" "Angola" "Benin" "Botswana" ...
##  $ gdpPercap_1952: num  2449 3521 1063 851 543 ...
##  $ gdpPercap_1957: num  3014 3828 960 918 617 ...
##  $ gdpPercap_1962: num  2551 4269 949 984 723 ...
##  $ gdpPercap_1967: num  3247 5523 1036 1215 795 ...
##  $ gdpPercap_1972: num  4183 5473 1086 2264 855 ...
##  $ gdpPercap_1977: num  4910 3009 1029 3215 743 ...
##  $ gdpPercap_1982: num  5745 2757 1278 4551 807 ...
##  $ gdpPercap_1987: num  5681 2430 1226 6206 912 ...
##  $ gdpPercap_1992: num  5023 2628 1191 7954 932 ...
##  $ gdpPercap_1997: num  4797 2277 1233 8647 946 ...
##  $ gdpPercap_2002: num  5288 2773 1373 11004 1038 ...
##  $ gdpPercap_2007: num  6223 4797 1441 12570 1217 ...
##  $ lifeExp_1952  : num  43.1 30 38.2 47.6 32 ...
##  $ lifeExp_1957  : num  45.7 32 40.4 49.6 34.9 ...
##  $ lifeExp_1962  : num  48.3 34 42.6 51.5 37.8 ...
##  $ lifeExp_1967  : num  51.4 36 44.9 53.3 40.7 ...
##  $ lifeExp_1972  : num  54.5 37.9 47 56 43.6 ...
##  $ lifeExp_1977  : num  58 39.5 49.2 59.3 46.1 ...
##  $ lifeExp_1982  : num  61.4 39.9 50.9 61.5 48.1 ...
##  $ lifeExp_1987  : num  65.8 39.9 52.3 63.6 49.6 ...
##  $ lifeExp_1992  : num  67.7 40.6 53.9 62.7 50.3 ...
##  $ lifeExp_1997  : num  69.2 41 54.8 52.6 50.3 ...
##  $ lifeExp_2002  : num  71 41 54.4 46.6 50.6 ...
##  $ lifeExp_2007  : num  72.3 42.7 56.7 50.7 52.3 ...
##  $ pop_1952      : num  9279525 4232095 1738315 442308 4469979 ...
##  $ pop_1957      : num  10270856 4561361 1925173 474639 4713416 ...
##  $ pop_1962      : num  11000948 4826015 2151895 512764 4919632 ...
##  $ pop_1967      : num  12760499 5247469 2427334 553541 5127935 ...
##  $ pop_1972      : num  14760787 5894858 2761407 619351 5433886 ...
##  $ pop_1977      : num  17152804 6162675 3168267 781472 5889574 ...
##  $ pop_1982      : num  20033753 7016384 3641603 970347 6634596 ...
##  $ pop_1987      : num  23254956 7874230 4243788 1151184 7586551 ...
##  $ pop_1992      : num  26298373 8735988 4981671 1342614 8878303 ...
##  $ pop_1997      : num  29072015 9875024 6066080 1536536 10352843 ...
##  $ pop_2002      : int  31287142 10866106 7026113 1630347 12251209 7021078 15929988 4048013 8835739 6
##  $ pop_2007      : int  33333216 12420476 8078314 1639131 14326203 8390505 17696293 4369038 10238807
The first step towards getting our nice intermediate data format is to first convert from the wide to the long
format. The tidyr function gather() will ‘gather’ your observation variables into a single variable.
gap_long <- gap_wide %>%
    gather(obstype_year, obs_values, starts_with('pop'),
           starts_with('lifeExp'), starts_with('gdpPercap'))
str(gap_long)
## 'data.frame':    5112 obs. of  4 variables:
##  $ continent   : chr  "Africa" "Africa" "Africa" "Africa" ...
##  $ country     : chr  "Algeria" "Angola" "Benin" "Botswana" ...
##  $ obstype_year: chr  "pop_1952" "pop_1952" "pop_1952" "pop_1952" ...
##  $ obs_values  : num  9279525 4232095 1738315 442308 4469979 ...
Here we have used piping syntax which is similar to what we were doing in the previous lesson with dplyr. In
fact, these are compatible and you can use a mix of tidyr and dplyr functions by piping them together
Inside gather() we first name the new column for the new ID variable (obstype_year), then the name for the new amalgamated observation variable (obs_values), then the names of the old observation variables. We could have typed out all the observation variables, but as in the select() function (see the dplyr lesson), we can use the starts_with() argument to select all variables that start with the desired character string. gather() also allows the alternative syntax of using the - symbol to identify which variables are not to be gathered (i.e. the ID variables):
gap_long <- gap_wide %>% gather(obstype_year,obs_values,-continent,-country)
str(gap_long)
## 'data.frame':    5112 obs. of  4 variables:
##  $ continent   : chr  "Africa" "Africa" "Africa" "Africa" ...
##  $ country     : chr  "Algeria" "Angola" "Benin" "Botswana" ...
##  $ obstype_year: chr  "gdpPercap_1952" "gdpPercap_1952" "gdpPercap_1952" "gdpPercap_1952" ...
##  $ obs_values  : num  2449 3521 1063 851 543 ...
That may seem trivial with this particular dataframe, but sometimes you have 1 ID variable and 40 observation variables with irregular variable names. The flexibility is a huge time saver!
Now obstype_year actually contains 2 pieces of information, the observation type (pop,lifeExp, or
gdpPercap) and the year. We can use the separate() function to split the character strings into multiple
variables
gap_long <- gap_long %>% separate(obstype_year,into=c('obs_type','year'),sep="_")
gap_long$year <- as.integer(gap_long$year)
8.5
Challenge 2
Using gap_long, calculate the mean life expectancy, population, and gdpPercap for each continent.
Hint: use the group_by() and summarize() functions we learned in the dplyr lesson
8.6
From long to intermediate format with spread()
Now, just to double-check our work, let's use the opposite of gather() to spread our observation variables back out with the aptly named spread(). We can spread gap_long back out to either the original intermediate format or the widest format. Let's start with the intermediate format.
gap_normal <- gap_long %>% spread(obs_type,obs_values)
dim(gap_normal)
## [1] 1704    6
dim(gapminder)
## [1] 1704    6
names(gap_normal)
## [1] "continent" "country"   "year"      "gdpPercap" "lifeExp"   "pop"
names(gapminder)
## [1] "country"   "year"      "pop"       "continent" "lifeExp"   "gdpPercap"
Now we’ve got an intermediate dataframe gap_normal with the same dimensions as the original gapminder,
but the order of the variables is different. Let’s fix that before checking if they are all.equal().
gap_normal <- gap_normal[,names(gapminder)]
all.equal(gap_normal,gapminder)
## [1] "Component \"country\": 1704 string mismatches"
## [2] "Component \"pop\": Mean relative difference: 1.634504"
## [3] "Component \"continent\": 1212 string mismatches"
## [4] "Component \"lifeExp\": Mean relative difference: 0.203822"
## [5] "Component \"gdpPercap\": Mean relative difference: 1.162302"
head(gap_normal)
##   country year      pop continent lifeExp gdpPercap
## 1 Algeria 1952  9279525    Africa  43.077  2449.008
## 2 Algeria 1957 10270856    Africa  45.685  3013.976
## 3 Algeria 1962 11000948    Africa  48.303  2550.817
## 4 Algeria 1967 12760499    Africa  51.407  3246.992
## 5 Algeria 1972 14760787    Africa  54.518  4182.664
##  [ reached getOption("max.print") -- omitted 1 row ]
head(gapminder)
##       country year      pop continent lifeExp gdpPercap
## 1 Afghanistan 1952  8425333      Asia  28.801  779.4453
## 2 Afghanistan 1957  9240934      Asia  30.332  820.8530
## 3 Afghanistan 1962 10267083      Asia  31.997  853.1007
## 4 Afghanistan 1967 11537966      Asia  34.020  836.1971
## 5 Afghanistan 1972 13079460      Asia  36.088  739.9811
##  [ reached getOption("max.print") -- omitted 1 row ]
We’re almost there, the original was sorted by country, continent, then year.
gap_normal <- gap_normal %>% arrange(country,continent,year)
all.equal(gap_normal,gapminder)
## [1] TRUE
That’s great! We’ve gone from the longest format back to the intermediate and we didn’t introduce any
errors in our code.
Now let's convert the long format all the way back to the wide format. In the wide format, we will keep country and continent as ID variables and spread the observations across the 3 metrics (pop, lifeExp, gdpPercap) and time (year). First we need to create appropriate labels for all our new variables (time*metric combinations) and we also need to unify our ID variables to simplify the process of defining gap_wide.
gap_temp <- gap_long %>% unite(var_ID,continent,country,sep="_")
str(gap_temp)
## 'data.frame':    5112 obs. of  4 variables:
##  $ var_ID    : chr  "Africa_Algeria" "Africa_Angola" "Africa_Benin" "Africa_Botswana" ...
##  $ obs_type  : chr  "gdpPercap" "gdpPercap" "gdpPercap" "gdpPercap" ...
##  $ year      : int  1952 1952 1952 1952 1952 1952 1952 1952 1952 1952 ...
##  $ obs_values: num  2449 3521 1063 851 543 ...
gap_temp <- gap_long %>%
unite(ID_var,continent,country,sep="_") %>%
unite(var_names,obs_type,year,sep="_")
str(gap_temp)
## 'data.frame':    5112 obs. of  3 variables:
##  $ ID_var    : chr  "Africa_Algeria" "Africa_Angola" "Africa_Benin" "Africa_Botswana" ...
##  $ var_names : chr  "gdpPercap_1952" "gdpPercap_1952" "gdpPercap_1952" "gdpPercap_1952" ...
##  $ obs_values: num  2449 3521 1063 851 543 ...
Using unite() we now have a single ID variable, which is a combination of continent and country, and we have defined variable names. We're now ready to pipe in spread():
gap_wide_new <- gap_long %>%
unite(ID_var,continent,country,sep="_") %>%
unite(var_names,obs_type,year,sep="_") %>%
spread(var_names,obs_values)
str(gap_wide_new)
## 'data.frame':    142 obs. of  37 variables:
##  $ ID_var        : chr  "Africa_Algeria" "Africa_Angola" "Africa_Benin" "Africa_Botswana" ...
##  $ gdpPercap_1952: num  2449 3521 1063 851 543 ...
##  $ gdpPercap_1957: num  3014 3828 960 918 617 ...
##  $ gdpPercap_1962: num  2551 4269 949 984 723 ...
##  $ gdpPercap_1967: num  3247 5523 1036 1215 795 ...
##  $ gdpPercap_1972: num  4183 5473 1086 2264 855 ...
##  $ gdpPercap_1977: num  4910 3009 1029 3215 743 ...
##  $ gdpPercap_1982: num  5745 2757 1278 4551 807 ...
##  $ gdpPercap_1987: num  5681 2430 1226 6206 912 ...
##  $ gdpPercap_1992: num  5023 2628 1191 7954 932 ...
##  $ gdpPercap_1997: num  4797 2277 1233 8647 946 ...
##  $ gdpPercap_2002: num  5288 2773 1373 11004 1038 ...
##  $ gdpPercap_2007: num  6223 4797 1441 12570 1217 ...
##  $ lifeExp_1952  : num  43.1 30 38.2 47.6 32 ...
##  $ lifeExp_1957  : num  45.7 32 40.4 49.6 34.9 ...
##  $ lifeExp_1962  : num  48.3 34 42.6 51.5 37.8 ...
##  $ lifeExp_1967  : num  51.4 36 44.9 53.3 40.7 ...
##  $ lifeExp_1972  : num  54.5 37.9 47 56 43.6 ...
##  $ lifeExp_1977  : num  58 39.5 49.2 59.3 46.1 ...
##  $ lifeExp_1982  : num  61.4 39.9 50.9 61.5 48.1 ...
##  $ lifeExp_1987  : num  65.8 39.9 52.3 63.6 49.6 ...
##  $ lifeExp_1992  : num  67.7 40.6 53.9 62.7 50.3 ...
##  $ lifeExp_1997  : num  69.2 41 54.8 52.6 50.3 ...
##  $ lifeExp_2002  : num  71 41 54.4 46.6 50.6 ...
##  $ lifeExp_2007  : num  72.3 42.7 56.7 50.7 52.3 ...
##  $ pop_1952      : num  9279525 4232095 1738315 442308 4469979 ...
##  $ pop_1957      : num  10270856 4561361 1925173 474639 4713416 ...
##  $ pop_1962      : num  11000948 4826015 2151895 512764 4919632 ...
##  $ pop_1967      : num  12760499 5247469 2427334 553541 5127935 ...
##  $ pop_1972      : num  14760787 5894858 2761407 619351 5433886 ...
##  $ pop_1977      : num  17152804 6162675 3168267 781472 5889574 ...
##  $ pop_1982      : num  20033753 7016384 3641603 970347 6634596 ...
##  $ pop_1987      : num  23254956 7874230 4243788 1151184 7586551 ...
##  $ pop_1992      : num  26298373 8735988 4981671 1342614 8878303 ...
##  $ pop_1997      : num  29072015 9875024 6066080 1536536 10352843 ...
##  $ pop_2002      : num  31287142 10866106 7026113 1630347 12251209 ...
##  $ pop_2007      : num  33333216 12420476 8078314 1639131 14326203 ...
8.7 Challenge 3
Take this one step further and create a gap_ludicrously_wide format data frame by spreading over countries, year and the 3 metrics. Hint: this new dataframe should only have 5 rows.
Now we have a great ‘wide’ format dataframe, but the ID_var could be more usable, let’s separate it into 2
variables with separate()
gap_wide_betterID <- separate(gap_wide_new,ID_var,c("continent","country"),sep="_")
gap_wide_betterID <- gap_long %>%
unite(ID_var,continent,country,sep="_") %>%
unite(var_names,obs_type,year,sep="_") %>%
spread(var_names,obs_values) %>%
separate(ID_var,c("continent","country"),sep="_")
str(gap_wide_betterID)
## 'data.frame':    142 obs. of  38 variables:
##  $ continent     : chr  "Africa" "Africa" "Africa" "Africa" ...
##  $ country       : chr  "Algeria" "Angola" "Benin" "Botswana" ...
##  $ gdpPercap_1952: num  2449 3521 1063 851 543 ...
##  $ gdpPercap_1957: num  3014 3828 960 918 617 ...
##  $ gdpPercap_1962: num  2551 4269 949 984 723 ...
##  $ gdpPercap_1967: num  3247 5523 1036 1215 795 ...
##  $ gdpPercap_1972: num  4183 5473 1086 2264 855 ...
##  $ gdpPercap_1977: num  4910 3009 1029 3215 743 ...
##  $ gdpPercap_1982: num  5745 2757 1278 4551 807 ...
##  $ gdpPercap_1987: num  5681 2430 1226 6206 912 ...
##  $ gdpPercap_1992: num  5023 2628 1191 7954 932 ...
##  $ gdpPercap_1997: num  4797 2277 1233 8647 946 ...
##  $ gdpPercap_2002: num  5288 2773 1373 11004 1038 ...
##  $ gdpPercap_2007: num  6223 4797 1441 12570 1217 ...
##  $ lifeExp_1952  : num  43.1 30 38.2 47.6 32 ...
##  $ lifeExp_1957  : num  45.7 32 40.4 49.6 34.9 ...
##  $ lifeExp_1962  : num  48.3 34 42.6 51.5 37.8 ...
##  $ lifeExp_1967  : num  51.4 36 44.9 53.3 40.7 ...
##  $ lifeExp_1972  : num  54.5 37.9 47 56 43.6 ...
##  $ lifeExp_1977  : num  58 39.5 49.2 59.3 46.1 ...
##  $ lifeExp_1982  : num  61.4 39.9 50.9 61.5 48.1 ...
##  $ lifeExp_1987  : num  65.8 39.9 52.3 63.6 49.6 ...
##  $ lifeExp_1992  : num  67.7 40.6 53.9 62.7 50.3 ...
##  $ lifeExp_1997  : num  69.2 41 54.8 52.6 50.3 ...
##  $ lifeExp_2002  : num  71 41 54.4 46.6 50.6 ...
##  $ lifeExp_2007  : num  72.3 42.7 56.7 50.7 52.3 ...
##  $ pop_1952      : num  9279525 4232095 1738315 442308 4469979 ...
##  $ pop_1957      : num  10270856 4561361 1925173 474639 4713416 ...
##  $ pop_1962      : num  11000948 4826015 2151895 512764 4919632 ...
##  $ pop_1967      : num  12760499 5247469 2427334 553541 5127935 ...
##  $ pop_1972      : num  14760787 5894858 2761407 619351 5433886 ...
##  $ pop_1977      : num  17152804 6162675 3168267 781472 5889574 ...
##  $ pop_1982      : num  20033753 7016384 3641603 970347 6634596 ...
##  $ pop_1987      : num  23254956 7874230 4243788 1151184 7586551 ...
##  $ pop_1992      : num  26298373 8735988 4981671 1342614 8878303 ...
##  $ pop_1997      : num  29072015 9875024 6066080 1536536 10352843 ...
##  $ pop_2002      : num  31287142 10866106 7026113 1630347 12251209 ...
##  $ pop_2007      : num  33333216 12420476 8078314 1639131 14326203 ...
all.equal(gap_wide,gap_wide_betterID)
## [1] TRUE
There and back again!
8.8
Solution to Challenge 1
The original gapminder data.frame is in an intermediate format. It is not purely long since it had
multiple observation variables (pop,lifeExp,gdpPercap).
8.9
Solution to Challenge 2
gap_long %>% group_by(continent,obs_type) %>%
summarize(means=mean(obs_values))
## Source: local data frame [15 x 3]
## Groups: continent [?]
## 
##    continent  obs_type        means
##        (chr)     (chr)        (dbl)
## 1     Africa gdpPercap 2.193755e+03
## 2     Africa   lifeExp 4.886533e+01
## 3     Africa       pop 9.916003e+06
## 4   Americas gdpPercap 7.136110e+03
## 5   Americas   lifeExp 6.465874e+01
## 6   Americas       pop 2.450479e+07
## 7       Asia gdpPercap 7.902150e+03
## 8       Asia   lifeExp 6.006490e+01
## 9       Asia       pop 7.703872e+07
##  [ reached getOption("max.print") -- omitted 6 rows ]
8.10
Solution to Challenge 3
gap_ludicrously_wide <- gap_long %>%
unite(var_names,obs_type,year,country,sep="_") %>%
spread(var_names,obs_values)
8.11
Other great resources
Data Wrangling Cheat sheet
Introduction to tidyr
9
Writing data
9.1
Learning Objectives
• To be able to write out plots and data from R
9.2
Saving plots
You have already seen how to save the most recent plot you create in ggplot2, using the command ggsave.
As a refresher:
ggsave("My_most_recent_plot.pdf")
You can save a plot from within RStudio using the ‘Export’ button in the ‘Plot’ window. This will give you
the option of saving as a .pdf or as .png, .jpg or other image formats.
Sometimes you will want to save plots without creating them in the ‘Plot’ window first. Perhaps you want
to make a pdf document with multiple pages: each one a different plot, for example. Or perhaps you’re
looping through multiple subsets of a file, plotting data from each subset, and you want to save each plot,
but obviously can’t stop the loop to click ‘Export’ for each one.
In this case you can use a more flexible approach. The function pdf creates a new pdf device. You can control
the size and resolution using the arguments to this function.
pdf("Life_Exp_vs_time.pdf", width=12, height=4)
ggplot(data=gapminder, aes(x=year, y=lifeExp, colour=country)) +
geom_line()
# You then have to make sure to turn off the pdf device!
dev.off()
Open up this document and have a look.
9.3
Challenge 1
Rewrite your ‘pdf’ command to print a second page in the pdf, showing a facet plot (hint: use
facet_grid) of the same data with one panel per continent.
The commands jpeg, png etc. are used similarly to produce documents in different formats.
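For example, a minimal sketch using png() (the file name here is just illustrative):
# open a png device, draw the plot, then close the device
png("Life_Exp_vs_time.png", width = 800, height = 400)
ggplot(data = gapminder, aes(x = year, y = lifeExp, colour = country)) +
  geom_line()
dev.off()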
9.4
Writing data
At some point, you’ll also want to write out data from R.
We can use the write.csv function for this, which is very similar to read.csv from before.
Let’s create a data-cleaning script, for this analysis, we only want to focus on the gapminder data for Australia:
aust_subset <- gapminder[gapminder$country == "Australia",]
write.csv(aust_subset, file="cleaned-data/gapminder-aus.csv")
Open up the file from the file browser and have a look.
Hmm, that’s not quite what we wanted. Where did all these quotation marks come from? Also the row
numbers are meaningless.
Let’s look at the help file to work out how to change this behaviour.
?write.csv
By default R will wrap character vectors with quotation marks when writing out to file. It will also write out
the row and column names.
Let’s fix this:
write.csv(aust_subset, file="cleaned-data/gapminder-aus.csv",
quote=FALSE, row.names=FALSE)
Now let's look at the data again. That looks better!
9.5
Challenge 2
Write a data-cleaning script file that subsets the gapminder data to include only data points
collected since 1990.
Use this script to write out the new subset to a file in the cleaned-data/ directory.
10
Basic statistics
Of course, R was written by statisticians, for statisticians. We’re not going to go deep into stats - partly
because I’m not really that qualified to teach it, and because we don’t have time to cover all of the potential
needs that people in the course will have. But we can cover a few of the basics, and introduce the common R
way of fitting statistical models.
10.0.1
t-test
We’ll keep going with our gapminder data; we want to test if GDP is significantly different between the
Americas and Europe in 2007; so we can use a basic two-sample t-test.
First, let’s search the help to find out what functions are avaible: ??"t-test" . Student’s t-test is the one
we want. There are a few variations of the t-test available. If we are testing a single sample against a known
value (for example, find out if something is different from 0), we would use the single-sample t-test like so:
# Simulate some data with a normal distribution, a mean of 0, and sd of 1.
data <- rnorm(100)
mean(data)
## [1] -0.1294407
t.test(data, mu=0)
## 
##  One Sample t-test
## 
## data:  data
## t = -1.3695, df = 99, p-value = 0.174
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  -0.3169871  0.0581058
## sample estimates:
##  mean of x 
## -0.1294407
## Unsurprisingly, not significant.
For our GDP question data, we want to use a two-sample t-test. I like using the formula specification because
it’s similar to how many other statistical tests are specified: t.test(Value ~ factor, data=)
Since we’re only interested in Europe and the Americas in 2007, we need to do a bit of filtering of the data.
library(dplyr)
gdp_07_EuAm <- filter(gapminder,
continent %in% c("Americas", "Europe"),
year == 2007)
summary(gdp_07_EuAm)
##         country         year           pop          
##  Albania  : 1     Min.   :2007   Min.   :   301931  
##  Argentina: 1     1st Qu.:2007   1st Qu.:  4933193  
##  Austria  : 1     Median :2007   Median :  9319622  
##  Belgium  : 1     Mean   :2007   Mean   : 26999449  
##  Bolivia  : 1     3rd Qu.:2007   3rd Qu.: 27379710  
##      continent     lifeExp        gdpPercap    
##  Africa  : 0    Min.   :60.92   Min.   : 1202  
##  Americas:25    1st Qu.:72.89   1st Qu.: 7952  
##  Asia    : 0    Median :76.19   Median :13172  
##  Europe  :30    Mean   :75.81   Mean   :18667  
##  Oceania : 0    3rd Qu.:79.36   3rd Qu.:31320  
##  [ reached getOption("max.print") -- omitted 2 rows ]
gdp_07_EuAm <- droplevels(gdp_07_EuAm)
t.test(gdpPercap ~ continent, data = gdp_07_EuAm)
## 
##  Welch Two Sample t-test
## 
## data:  gdpPercap by continent
## t = -4.8438, df = 52.996, p-value = 1.148e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -19870.011  -8232.889
## sample estimates:
## mean in group Americas   mean in group Europe 
##               11003.03               25054.48
10.0.2
Simple linear regression
Let’s explore the relationship between life expectancy and year
reg <- lm(lifeExp ~ year, data=gapminder)
We won’t go into too much detail, but briefly:
• lm estimates linear statistical models
• The first argument is a formula, with a ~ b meaning that a, the dependent (or response) variable, is a
function of b, the independent variable.
• We tell lm to use the gapminder data frame, so it knows where to find the variables lifeExp and year.
Let’s look at the output, which is an object of class lm:
reg
## 
## Call:
## lm(formula = lifeExp ~ year, data = gapminder)
## 
## Coefficients:
## (Intercept)         year  
##   -585.6522       0.3259
class(reg)
## [1] "lm"
There’s a great deal stored in this object!
For now, we can look at the summary:
summary(reg)
## 
## Call:
## lm(formula = lifeExp ~ year, data = gapminder)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -39.949  -9.651   1.697  10.335  22.158 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -585.65219   32.31396  -18.12   <2e-16 ***
## year           0.32590    0.01632   19.96   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.63 on 1702 degrees of freedom
## Multiple R-squared:  0.1898, Adjusted R-squared:  0.1893 
## F-statistic: 398.6 on 1 and 1702 DF,  p-value: < 2.2e-16
As you might expect, life expectancy has slowly been increasing over time, so we see a significant positive
association!
10.0.2.1
Plot the data with the regression line, along with confidence limits
p <- ggplot(gapminder, aes(x = year, y = lifeExp)) + geom_point()
dummy <- data.frame(year = seq(from = min(gapminder$year),
to = max(gapminder$year),
length.out = 100))
pred <- predict(reg, newdata=dummy, interval = "conf")
dummy <- cbind(dummy, pred)
p + geom_line(data = dummy, aes(y = fit)) +
geom_line(data = dummy, aes(y = lwr), linetype = 'dashed') +
geom_line(data = dummy, aes(y = upr), linetype = 'dashed')
[Figure: lifeExp against year with the fitted regression line and dashed confidence limits]
ggplot2 will also generate a fitted line and confidence intervals for you - which is useful, but only works for a univariate relationship... it's also nice to do it yourself as above so you know that the fit is coming directly from the regression model you ran.
p + geom_smooth(method="lm")
[Figure: the same data with the smooth produced by geom_smooth(method = "lm")]
10.0.2.2
Checking Assumptions
We can check the assumptions of the model by plotting the residuals vs the fitted values.
fitted <- fitted(reg)
residuals <- resid(reg)
ggplot(data=NULL, aes(x = fitted, y = residuals)) + geom_point() +
geom_hline(yintercept = 0)
[Figure: residuals against fitted values with a horizontal line at zero]
We can also check the assumptions using plot(). There are actually a bunch of different plot methods in R, which are dispatched depending on the type of object you call them on. When you call plot on an lm object, a series of diagnostic plots is created to help us check the assumptions of the model.
plot(reg)
[Figures: the four lm diagnostic plots for lm(lifeExp ~ year) - Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage]
Get more information on these plots by checking ?plot.lm.
10.0.3
Analysis of Variance (ANOVA)
Now say we want to extend our GDP analysis above to all continents; then we can't use a t-test, we have to use an ANOVA. Since an ANOVA is simply a linear regression model with a categorical rather than continuous predictor variable, we still use the lm() function. Let's test for differences in GDP per capita among continents.
gap_07 <- filter(gapminder, year == 2007)
gdp_aov <- lm(gdpPercap ~ continent, data=gapminder)
summary(gdp_aov)
## 
## Call:
## lm(formula = gdpPercap ~ continent, data = gapminder)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -13496  -4376  -1332    997 105621 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         2193.8      346.8   6.326 3.21e-10 ***
## continentAmericas   4942.4      608.6   8.121 8.79e-16 ***
## continentAsia       5708.4      556.5  10.257  < 2e-16 ***
## continentEurope    12275.7      573.3  21.412  < 2e-16 ***
## continentOceania   16427.9     1801.9   9.117  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8662 on 1699 degrees of freedom
## Multiple R-squared:  0.2296, Adjusted R-squared:  0.2278 
## F-statistic: 126.6 on 4 and 1699 DF,  p-value: < 2.2e-16
anova(gdp_aov)
## Analysis of Variance Table
## 
## Response: gdpPercap
##             Df     Sum Sq    Mean Sq F value    Pr(>F)    
## continent    4 3.7990e+10 9497557167  126.57 < 2.2e-16 ***
## Residuals 1699 1.2749e+11   75037832                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
10.0.3.1
Plot
ggplot(data = gap_07, aes(x = continent, y = gdpPercap)) + geom_boxplot()
[Figure: boxplots of gdpPercap by continent]
ggplot(data = gap_07, aes(x = continent, y = gdpPercap)) + geom_point()
[Figure: gdpPercap by continent shown as points]
ggplot(data = gap_07, aes(x = continent, y = gdpPercap, colour = continent)) +
geom_jitter()
[Figure: gdpPercap by continent shown as jittered points, coloured by continent]
10.0.3.2
Check assumptions
fitted <- fitted(gdp_aov)
residuals <- resid(gdp_aov)
ggplot(data=NULL, aes(x = fitted, y = residuals)) + geom_point() +
geom_hline(yintercept = 0)
[Figure: residuals against fitted values for the ANOVA model, with a horizontal line at zero]
10.0.4
More advanced linear models and model selection using AIC
Here we’re going to divert to a different dataset: Measurements of Sepals and Petals (widths and lengths)
in three species of Iris. We are going to explore the relationship between sepal length, sepal width among
species.
mod1 <- lm(Sepal.Length ~ Sepal.Width * Species, data=iris) # includes interaction term
mod1a <- lm(Sepal.Length ~ Sepal.Width + Species + Sepal.Width:Species, data=iris) #Equivalent to above
mod2 <- lm(Sepal.Length ~ Sepal.Width + Species, data=iris) # ANCOVA
mod3 <- lm(Sepal.Length ~ Sepal.Width, data=iris)
mod4 <- lm(Sepal.Length ~ Species, data=iris)
AIC(mod1, mod2, mod3, mod4)
##      df      AIC
## mod1  7 187.0922
## mod2  5 183.9366
## mod3  3 371.9917
## mod4  4 231.4520
Let’s plot the data:
ggplot(iris, aes(x=Sepal.Width, y=Sepal.Length, colour=Species, group=Species)) +
geom_point() +
geom_smooth(method="lm", formula = y ~ x)
[Figure: Sepal.Length against Sepal.Width, with points and separate fitted lines for each Species]
10.0.5
Generalized linear models: Logistic regression
Say you want to know whether elevation can predict whether or not a particular species of beetle is present
(all other things being equal of course). You walk up a hillside, starting at 100m elevation and sampling for
the beetle every 10m until you reach 1000m. At each stop you record whether the beetle is present (1) or absent (0).
First, let’s simulate some data
## Generate a sequence of elevations
elev <- seq(100, 1000, by=10)
# Generate a vector of probabilities the same length as `elev`, increasing from 0 to 1
probs <- seq(0, 1, length.out = length(elev))
## Generate a sequence of 0's and 1's
pres <- rbinom(length(elev), 1, prob=probs)
## combine into a data frame and remove constituent parts
elev_pres.data <- data.frame(elev, pres)
rm(elev, pres)
## Plot the data
ggplot(elev_pres.data, aes(x = elev, y = pres)) + geom_point()
[Figure: beetle presence/absence (pres) against elevation (elev)]
Presence / absence data is a classic example of where to use logistic regression: the outcome is binary (0 or 1), and the predictor variable is continuous (elevation, in this case). Logistic regression is a particular type of model in the family of Generalized Linear Models. Where ordinary least squares regression assumes a normal distribution of the response variable, generalized linear models can assume other distributions. Logistic regression assumes a binomial distribution (the outcome will be in one of two states). Another common example is the Poisson distribution, which is often useful for count data.
Implementing GLMs is relatively straightforward using the glm() function. You specify the model formula in
the same way as in lm(), and specify the distribution you want in the family parameter.
lr1 <- glm(pres ~ elev, data=elev_pres.data, family=binomial)
summary(lr1)
## 
## Call:
## glm(formula = pres ~ elev, family = binomial, data = elev_pres.data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0408  -0.8305   0.4611   0.8683   1.9630  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.532193   0.629982  -4.019 5.83e-05 ***
## elev         0.004768   0.001074   4.441 8.96e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 126.054  on 90  degrees of freedom
## Residual deviance:  99.601  on 89  degrees of freedom
## AIC: 103.6
## 
## Number of Fisher Scoring iterations: 3
So let’s add the curve generated by the logistic regression to the plot:
ggplot(elev_pres.data, aes(x = elev, y = pres)) +
geom_point() +
geom_line(aes(y = predict(lr1, type="response")))
[Figure: beetle presence/absence against elevation with the fitted logistic regression curve]
11
Writing functions
11.1 Learning Objectives
• Define a function that takes arguments.
• Return a value from a function.
• Test a function.
• Set default values for function arguments.
• Explain why we should divide programs into small, single-purpose functions.
If we only had one data set to analyze, it would probably be faster to load the file into a spreadsheet and use
that to plot simple statistics. However, the gapminder data is updated periodically, and we may want to pull
in that new information later and re-run our analysis again. We may also obtain similar data from a different
source in the future.
In this lesson, we’ll learn how to write a function so that we can repeat several operations with a single
command.
11.2
What is a function?
Functions gather a sequence of operations into a whole, preserving it for ongoing use. Functions
provide:
• a name we can remember and invoke it by
• relief from the need to remember the individual operations
• a defined set of inputs and expected outputs
• rich connections to the larger programming environment
As the basic building block of most programming languages, user-defined functions constitute
“programming” as much as any single abstraction can. If you have written a function, you are a
computer programmer.
11.3
Defining a function
Let’s open a new R script file in the functions/ directory and call it functions-lesson.R.
my_sum <- function(a, b) {
  the_sum <- a + b
  return(the_sum)
}
Let's define a function fahr_to_kelvin that converts temperatures from Fahrenheit to Kelvin:
fahr_to_kelvin <- function(temp) {
  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
  return(kelvin)
}
We define fahr_to_kelvin by assigning it to the output of function. The list of argument names are
contained within parentheses. Next, the body of the function–the statements that are executed when it
runs–is contained within curly braces ({}). The statements in the body are indented by two spaces. This
makes the code easier to read but does not affect how the code operates.
When we call the function, the values we pass to it are assigned to those variables so that we can use them
inside the function. Inside the function, we use a return statement to send a result back to whoever asked for
it.
11.4 Tip
In R, the return statement is not required. R automatically returns
whichever variable is on the last line of the body of the function. Since we are just learning, we
will explicitly define the return statement.
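For instance, this version (our own sketch, not part of the lesson files) behaves exactly like fahr_to_kelvin even though it has no return statement:

# The value of the last expression is returned automatically
fahr_to_kelvin_implicit <- function(temp) {
  ((temp - 32) * (5 / 9)) + 273.15
}
fahr_to_kelvin_implicit(32)  # 273.15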
Let’s try running our function. Calling our own function is no different from calling any other function:
# freezing point of water
fahr_to_kelvin(32)
## [1] 273.15
# boiling point of water
fahr_to_kelvin(212)
## [1] 373.15
11.5 Challenge 1

Write a function called kelvin_to_celsius that takes a temperature in Kelvin and returns that
temperature in Celsius.

Hint: To convert from Kelvin to Celsius you subtract 273.15.
11.6 Combining functions

The real power of functions comes from mixing, matching and combining them into ever larger chunks to get
the effect we want.
Let’s define two functions that will convert temperature from Fahrenheit to Kelvin, and Kelvin to Celsius:
fahr_to_kelvin <- function(temp) {
  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
  return(kelvin)
}

kelvin_to_celsius <- function(temp) {
  celsius <- temp - 273.15
  return(celsius)
}
11.7 Challenge 2
Define the function to convert directly from Fahrenheit to Celsius, by reusing the two functions
above (or using your own functions if you prefer).
We’re going to define a function that calculates the Gross Domestic Product of a nation from the data
available in our dataset:
# Takes a dataset and multiplies the population column
# with the GDP per capita column.
calcGDP <- function(dat) {
  gdp <- dat$pop * dat$gdpPercap
  return(gdp)
}
We define calcGDP by assigning it to the output of function. The list of argument names is contained
within parentheses. Next, the body of the function (the statements executed when you call the function) is
contained within curly braces ({}).
We’ve indented the statements in the body by two spaces. This makes the code easier to read but does not
affect how it operates.
When we call the function, the values we pass to it are assigned to the arguments, which become variables
inside the body of the function.
Inside the function, we use the return function to send back the result. This return function is optional: R
will automatically return the results of whatever command is executed on the last line of the function.
calcGDP(head(gapminder))
## [1]  6567086330  7585448670  8758855797  9648014150  9678553274 11697659231
That’s not very informative. Let’s add some more arguments so we can calculate the GDP for particular years or countries.
# Takes a dataset and multiplies the population column
# with the GDP per capita column.
calcGDP <- function(dat, year=NULL, country=NULL) {
  if(!is.null(year)) {
    dat <- dat[dat$year %in% year, ]
  }
  if (!is.null(country)) {
    dat <- dat[dat$country %in% country,]
  }
  gdp <- dat$pop * dat$gdpPercap
  new <- data.frame(dat, gdp=gdp)
  return(new)
}
If you’ve been writing these functions down into a separate R script (a good idea!), you can load the
functions into your R session by using the source function:
source("functions/functions-lesson.R")
Ok, so there’s a lot going on in this function now. In plain English, the function now subsets the provided
data by year if the year argument isn’t empty, then subsets the result by country if the country argument
isn’t empty. Then it calculates the GDP for whatever subset emerges from the previous two steps. The
function then adds the GDP as a new column to the subsetted data and returns this as the final result. You
can see that the output is much more informative than just getting a vector of numbers.
Let’s take a look at what happens when we specify the year:
head(calcGDP(gapminder, year=2007))
##        country year      pop continent lifeExp gdpPercap          gdp
## 12 Afghanistan 2007 31889923      Asia  43.828  974.5803  31079291949
## 24     Albania 2007  3600523    Europe  76.423 5937.0295  21376411360
## 36     Algeria 2007 33333216    Africa  72.301 6223.3675 207444851958
## 48      Angola 2007 12420476    Africa  42.731 4797.2313  59583895818
##  [ reached getOption("max.print") -- omitted 2 rows ]
Or for a specific country:
calcGDP(gapminder, country="Australia")
##      country year      pop continent lifeExp gdpPercap          gdp
## 61 Australia 1952  8691212   Oceania  69.120  10039.60  87256254102
## 62 Australia 1957  9712569   Oceania  70.330  10949.65 106349227169
## 63 Australia 1962 10794968   Oceania  70.930  12217.23 131884573002
## 64 Australia 1967 11872264   Oceania  71.100  14526.12 172457986742
##  [ reached getOption("max.print") -- omitted 8 rows ]
Or both:
calcGDP(gapminder, year=2007, country="Australia")
##      country year      pop continent lifeExp gdpPercap          gdp
## 72 Australia 2007 20434176   Oceania  81.235  34435.37 703658358894
Let’s walk through the body of the function:
calcGDP <- function(dat, year=NULL, country=NULL) {
Here we’ve added two arguments, year and country. We’ve set the default value for both to NULL using
the = operator in the function definition. This means that those arguments will take on those values unless
the user specifies otherwise.
if(!is.null(year)) {
  dat <- dat[dat$year %in% year, ]
}
if (!is.null(country)) {
  dat <- dat[dat$country %in% country,]
}
Here, we check whether each additional argument is set to null, and whenever it is not null we overwrite
the dataset stored in dat with the subset given by that argument.
I did this so that our function is more flexible for later. We can ask it to calculate the GDP for:
• The whole dataset;
• A single year;
• A single country;
• A single combination of year and country.
By using %in% rather than ==, we can also give multiple years or countries to those arguments.
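For example, because of %in% we can hand the function several years and countries at once (this particular call is our own illustration, assuming the gapminder data is loaded):

calcGDP(gapminder, year = c(1952, 2007), country = c("Australia", "New Zealand"))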
11.8 Tip: Pass by value
Functions in R almost always make copies of the data to operate on inside of a function body.
When we modify dat inside the function we are modifying the copy of the gapminder dataset
stored in dat, not the original variable we gave as the first argument.
This is called “pass-by-value” and it makes writing code much safer: you can always be sure that
whatever changes you make within the body of the function, stay inside the body of the function.
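A small sketch of our own (not part of calcGDP) makes this concrete:

zero_pop <- function(dat) {
  dat$pop <- 0        # changes only the local copy of dat
  return(dat$pop[1])
}
gap_head <- head(gapminder)
zero_pop(gap_head)    # 0
gap_head$pop[1]       # unchanged - still the original population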
11.9 Tip: Function scope
Another important concept is scoping: any variables (or functions!) you create or modify inside
the body of a function only exist for the lifetime of the function’s execution. When we call
calcGDP, the variables dat, gdp and new only exist inside the body of the function. Even if we
have variables of the same name in our interactive R session, they are not modified in any way
when executing a function.
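Before returning to the walkthrough of calcGDP, here is a tiny sketch of our own showing scope in action:

y <- 10
show_local_y <- function() {
  y <- 5   # this y exists only while the function runs
  return(y)
}
show_local_y()  # 5
y               # still 10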
  gdp <- dat$pop * dat$gdpPercap
  new <- data.frame(dat, gdp=gdp)
  return(new)
}
Finally, we calculated the GDP on our new subset, and created a new data frame with that column added.
This means when we call the function later we can see the context for the returned GDP values, which is
much better than in our first attempt where we just got a vector of numbers.
11.10 Challenge 3
The paste function can be used to combine text together, e.g:
best_practice <- c("Write", "programs", "for", "people", "not", "computers")
paste(best_practice, collapse=" ")
## [1] "Write programs for people not computers"
Write a function called fence that takes two vectors as arguments, called text and wrapper, and
prints out the text wrapped with the wrapper:
fence(text=best_practice, wrapper="***")
Note: the paste function has an argument called sep, which specifies the separator between pieces of text.
The default is a space: " ". The default for paste0 is no space: "".
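For example (a quick illustration of the two defaults):

paste("Hello", "world")    # "Hello world" - sep defaults to a space
paste0("Hello", "world")   # "Helloworld" - no separator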
11.11 Tip
R has some unique aspects that can be exploited when performing more complicated operations.
We will not be writing anything that requires knowledge of these more advanced concepts. In
the future when you are comfortable writing functions in R, you can learn more by reading the
R Language Manual or this chapter from Advanced R Programming by Hadley Wickham. For
context, R uses the terminology “environments” instead of frames.
11.12 Tip: Testing and documenting

It’s important to both test functions and document them: documentation helps you, and others,
understand what the purpose of your function is and how to use it, and it’s important to make
sure that your function actually does what you think it does.
When you first start out, your workflow will probably look a lot like this:
1. Write a function
2. Comment parts of the function to document its behaviour
3. Load in the source file
4. Experiment with it in the console to make sure it behaves as you expect
5. Make any necessary bug fixes
6. Rinse and repeat.
Formal documentation for functions, written in separate .Rd files, gets turned into the documentation you see in help files. The roxygen2 package allows R coders to write documentation alongside
the function code and then process it into the appropriate .Rd files. You will want to switch to
this more formal method of writing documentation when you start writing more complicated R
projects.
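To give a flavour of what that looks like, here is a sketch of roxygen2-style comments for our fahr_to_kelvin function (the @param, @return and @examples tags are standard roxygen2 tags; the wording is our own):

#' Convert Fahrenheit to Kelvin
#'
#' @param temp Numeric vector of temperatures in degrees Fahrenheit.
#' @return Numeric vector of temperatures in Kelvin.
#' @examples
#' fahr_to_kelvin(32)
fahr_to_kelvin <- function(temp) {
  kelvin <- ((temp - 32) * (5 / 9)) + 273.15
  return(kelvin)
}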
Formal automated tests can be written using the testthat package.
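For example, a minimal test (our own sketch using testthat’s test_that() and expect_equal()) might look like:

library(testthat)

test_that("fahr_to_kelvin converts known temperatures", {
  expect_equal(fahr_to_kelvin(32), 273.15)
  expect_equal(fahr_to_kelvin(212), 373.15)
})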
11.13 Challenge solutions
11.14 Solution to challenge 1

Write a function called kelvin_to_celsius that takes a temperature in Kelvin and returns that
temperature in Celsius.

kelvin_to_celsius <- function(temp) {
  celsius <- temp - 273.15
  return(celsius)
}
11.15 Solution to challenge 2

Define the function to convert directly from Fahrenheit to Celsius, by reusing these two functions
above.

fahr_to_celsius <- function(temp) {
  temp_k <- fahr_to_kelvin(temp)
  result <- kelvin_to_celsius(temp_k)
  return(result)
}
11.16 Solution to challenge 3

Write a function called fence that takes two vectors as arguments, called text and wrapper, and
prints out the text wrapped with the wrapper:

fence <- function(text, wrapper){
  text <- c(wrapper, text, wrapper)
  result <- paste(text, collapse = " ")
  return(result)
}
best_practice <- c("Write", "programs", "for", "people", "not", "computers")
fence(text=best_practice, wrapper="***")
## [1] "*** Write programs for people not computers ***"
12 Flow control

12.1 Learning Objectives
• Write conditional statements with if and else.
• Write and understand for loops.
Often when we’re coding we want to control the flow of our actions. This can be done by setting actions to
occur only if a condition or a set of conditions is met. Alternatively, we can set an action to occur a
particular number of times.
There are several ways you can control flow in R. For conditional statements, the most commonly used
approaches are the constructs:
# if
if (condition is true) {
  perform action
}

# if ... else
if (condition is true) {
  perform action
} else {  # that is, if the condition is false,
  perform alternative action
}
Say, for example, that we want R to print a message if a variable x has a particular value:
# sample a random number from a Poisson distribution
# with a mean (lambda) of 8
x <- rpois(1, lambda=8)

if (x >= 10) {
  print("x is greater than or equal to 10")
}

x
## [1] 8
Note you may not get the same output as your neighbour because you may be sampling different random
numbers from the same distribution.
Let’s set a seed so that we all generate the same ‘pseudo-random’ number, and then print more information:
set.seed(10)  # so that we all generate the same number
x <- rpois(1, lambda=8)

if (x >= 10) {
  print("x is greater than or equal to 10")
} else if (x > 5) {
  print("x is greater than 5")
} else {
  print("x is less than 5")
}
## [1] "x is greater than 5"
12.2 Tip: pseudo-random numbers
In the above case, the function rpois generates a random number following a Poisson distribution
with a mean (i.e. lambda) of 8. The function set.seed guarantees that all machines will
generate the exact same ‘pseudo-random’ number (more about pseudo-random numbers). So if
we set.seed(10), we see that x takes the value 8. You should get the exact same number.
Important: when R evaluates the condition inside if statements, it is looking for a logical element, i.e.,
TRUE or FALSE. This can cause some headaches for beginners. For example:
x <- 4 == 3
if (x) {
  "4 equals 3"
}
As we can see, the message was not printed because the vector x is FALSE:
x <- 4 == 3
x
## [1] FALSE
12.3 Challenge 1
Use an if statement to print a suitable message reporting whether there are any records from
2002 in the gapminder dataset. Now do the same for 2012.
Did anyone get a warning message like this?
## Warning in if (gapminder$year == 2012) {: the condition has length > 1 and
## only the first element will be used
If your condition evaluates to a vector with more than one logical element, the function if will still run, but
will only evaluate the condition in the first element. Here you need to make sure your condition is of length 1.
12.4 Tip: any and all
The any function will return TRUE if at least one TRUE value is found within a vector, otherwise
it will return FALSE. This can be used in a similar way to the %in% operator. The function all,
as the name suggests, will only return TRUE if all values in the vector are TRUE.
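A quick illustration with the gapminder data (assuming it is loaded):

any(gapminder$year == 2002)   # TRUE: at least one record is from 2002
all(gapminder$year == 2002)   # FALSE: not every record is from 2002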
12.5 Repeating operations
If you want to iterate over a set of values, when the order of iteration is important, and perform the same
operation on each, a for loop will do the job. This is the most flexible of looping operations, but therefore
also the hardest to use correctly. Avoid using for loops unless the order of iteration is important: i.e. the
calculation at each iteration depends on the results of previous iterations.
The basic structure of a for loop is:
for(iterator in set of values){
  do a thing
}
For example:
for(i in 1:10){
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
The 1:10 bit creates a vector on the fly; you can iterate over any other vector as well.
We can use a for loop nested within another for loop to iterate over two things at once.
for (i in 1:5){
  for(j in c('a', 'b', 'c', 'd', 'e')){
    print(paste(i,j))
  }
}
## [1] "1 a"
## [1] "1 b"
## [1] "1 c"
## [1] "1 d"
## [1] "1 e"
## [1] "2 a"
## [1] "2 b"
## [1] "2 c"
## [1] "2 d"
## [1] "2 e"
## [1] "3 a"
## [1] "3 b"
## [1] "3 c"
## [1] "3 d"
## [1] "3 e"
## [1] "4 a"
## [1] "4 b"
## [1] "4 c"
## [1] "4 d"
## [1] "4 e"
## [1] "5 a"
## [1] "5 b"
## [1] "5 c"
## [1] "5 d"
## [1] "5 e"
Rather than printing the results, we could write the loop output to a new object.
output_vector <- c()
for (i in 1:5){
  for(j in c('a', 'b', 'c', 'd', 'e')){
    temp_output <- paste(i, j)
    output_vector <- c(output_vector, temp_output)
  }
}
output_vector
## [1] "1 a" "1 b" "1 c" "1 d" "1 e" "2 a" "2 b" "2 c" "2 d" "2 e" "3 a"
## [12] "3 b" "3 c" "3 d" "3 e" "4 a" "4 b" "4 c" "4 d" "4 e" "5 a" "5 b"
## [23] "5 c" "5 d" "5 e"
This approach can be useful, but ‘growing your results’ (building the result object incrementally) is computationally inefficient, so avoid it when you are iterating through a lot of values.
12.6 Tip: don’t grow your results
One of the biggest things that trips up novices and experienced R users alike is building a results
object (vector, list, matrix, data frame) as your for loop progresses. Computers are very bad at
handling this, so your calculations can very quickly slow to a crawl. It’s much better to define an
empty results object of the appropriate dimensions beforehand. So if you know the end result
will be stored in a matrix like above, create an empty matrix with 5 rows and 5 columns, then at
each iteration store the results in the appropriate location.
A better way is to define your (empty) output object before filling in the values. For this example, it looks
more involved, but is still more efficient.
output_matrix <- matrix(nrow=5, ncol=5)
j_vector <- c('a', 'b', 'c', 'd', 'e')
for (i in 1:5){
  for(j in 1:5){
    temp_j_value <- j_vector[j]
    temp_output <- paste(i, temp_j_value)
    output_matrix[i, j] <- temp_output
  }
}
output_vector2 <- as.vector(output_matrix)
output_vector2
## [1] "1 a" "2 a" "3 a" "4 a" "5 a" "1 b" "2 b" "3 b" "4 b" "5 b" "1 c"
## [12] "2 c" "3 c" "4 c" "5 c" "1 d" "2 d" "3 d" "4 d" "5 d" "1 e" "2 e"
## [23] "3 e" "4 e" "5 e"
12.7 Tip: While loops
Sometimes you will find yourself needing to repeat an operation until a certain condition is met.
You can do this with a while loop.
while(this condition is true){
  do a thing
}
As an example, here’s a while loop that generates random numbers from a uniform distribution
(the runif function) between 0 and 1 until it gets one that’s less than 0.1.
z <- 1
while(z > 0.1){
  z <- runif(1)
  print(z)
}
while loops will not always be appropriate. You have to be particularly careful that you don’t
end up in an infinite loop because your condition is never met.
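One simple guard (our own sketch, not from the original lesson) is to cap the number of iterations:

z <- 1
n_tries <- 0
while(z > 0.1 && n_tries < 1000){
  z <- runif(1)
  n_tries <- n_tries + 1   # stop eventually even if z never drops below 0.1
}
z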
12.8 Challenge 2
Compare the objects output_vector and output_vector2. Are they the same? If not, why not?
How would you change the last block of code to make output_vector2 the same as output_vector?
12.9 Challenge 3
Write a script that loops through the gapminder data by continent and prints out whether the
mean life expectancy is smaller or larger than 50 years.
12.10 Challenge 4

Modify the script from Challenge 3 to also loop over each country. This time print out whether
the life expectancy is smaller than 50, between 50 and 70, or greater than 70.
12.11 Challenge 5 - Advanced
Write a script that loops over each country in the gapminder dataset, tests whether the country
starts with a ‘B’, and graphs life expectancy against time as a line graph if the mean life expectancy
is under 50 years.
13 Best Practices

13.0.1 Some best practices for using R and designing programs

1. Start your code with a description of what it is:
# This is code to replicate the analyses and figures from my 2014 Science paper.
# Code developed by Andy Teucher and friends
2. Run all of your import statements (library or require):
library(ggplot2)
library(reshape)
library(vegan)
3. Set your working directory with setwd() at the beginning of an R session, and avoid changing it once a
script is underway. Better yet, start R inside a project folder.
4. Use # or #- to set off sections of your code so you can easily scroll through it and find things.
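For example, RStudio also recognises a comment followed by four or more dashes as a foldable code section (the section names below are our own illustration):

# Load packages -----------------------------------------------------------
library(ggplot2)

# Analysis ------------------------------------------------------------------
# ... analysis code goes here ...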
5. If you have only one or a few functions, put them at the top of your code, so they are among the first
things run. If you have written many functions, put them all in their own .R file, and source them. Sourcing
the file will define all of these functions so that you can use them as you need them.
source("my_genius_fxns.R")
6. Use consistent style within your code.
7. Keep your code modular. If a single function or loop gets too long, consider breaking it into smaller
pieces.
8. Don’t repeat yourself. Automate! If you are repeating the same piece of code on multiple objects or
files, use a loop or a function to do the same thing. The more you repeat yourself, the more likely you
are to make a mistake.
9. Manage all of your source files for a project in the same directory. Then use relative paths as necessary.
For example, use
dat <- read.csv(file = "files/dataset-2013-01.csv", header = TRUE)
rather than:
dat <- read.csv(file = "/Users/ateucher/Documents/sannic-project/files/dataset-2013-01.csv", header = TRUE)
10. Don’t save your session to an .RData file when you quit (the default option R offers). Instead,
start in a clean environment so that older objects don’t contaminate your current environment;
leftover objects can lead to unexpected results, especially if the code is run on someone else’s machine.
11. Where possible keep track of sessionInfo() somewhere in your project folder. Session information
is invaluable since it captures all of the packages used in the current project. If a newer version of
a project changes the way a function behaves, you can always go back and reinstall the version that
worked (Note: At least on CRAN all older versions of packages are permanently archived).
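One simple way to do this (our own sketch; the file name is arbitrary) is to write the output of sessionInfo() to a text file in the project folder:

# Record the R version and loaded packages for this session
writeLines(capture.output(sessionInfo()), "session_info.txt")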
12. Collaborate. Grab a buddy and practice “code review”. We do it for methods and papers, why not
code? Our code is a major scientific product and the result of a lot of hard work!
13. Develop your code using version control and frequent updates!
14 Getting help
To get help for a particular function in R, type ? and then the function name:
?mean
To get help for a topic in R, do a “fuzzy” search with ?? (wrap the phrase in quotes if more than one word):
??"t-test"
14.0.2 General help

• Cookbook for R - lots of plotting help here, including ggplot2
• Google (It actually knows what “R” is now)
• Stackoverflow (use the [r] tag - also [ggplot2], [dplyr], etc.)
• http://www.rdocumentation.org/
• Each other! (Talk it out)
• Tell it to the duck
14.0.3 Cheat Sheets

• General R cheatsheet
• RStudio Cheatsheets on:
  – Data Wrangling with dplyr and tidyr
  – Data Visualization with ggplot2
  – Using RStudio
  – Other more advanced topics such as package development, Shiny, and R Markdown

14.0.4 ggplot2

• ggplot2 official documentation
• Color Brewer: A great general resource for choosing colour palettes
14.0.5 Books

• The Art of R Programming
• Hadley Wickham’s online book: Advanced R Programming
• The R Graphics Cookbook by Winston Chang - the paper version of the Cookbook for R website mentioned above
14.0.6 Other learning resources - online courses, etc.

• R for cats
• Try R - codeschool
• swirl - Learn statistics and R simultaneously - within R itself!
• R Programming (Coursera) - starts June 2!
  – There are a number of R and/or statistics courses on Coursera. It’s a free online learning platform, and the courses are taught by high calibre professors from good universities.