This is a document designed to give a basic introduction to R for those who have no coding experience. This should help you get prepared for applied projects in CPLN and MUSA 500-600 level coursework within a couple of weeks. It coalesces introductory materials I’ve developed over the past six years for MUSA 508, MUSA 795, CPLN 501, and CPLN 675 and should give you what you need to get started in any of those courses.
This isn’t a “coding” tutorial, it’s a starter kit that demonstrates how R can do some familiar Excel or ArcGIS-like tasks - loading spreadsheets and spatial data, creating new columns of data, summarizing and querying data, making charts and maps - and do them faster, better, and more reproducibly.
Use the table of contents at left to navigate through the document - the simplest topics are first (e.g. “What is R?”, “How do I download R?”), the latter sections are more advanced (Geoprocessing, Census Data, Markdown documents). There are references to books and other materials at the end of the doc.
This document will be updated on an ongoing basis to include new packages and methods - you may see a few “Coming soon” previews towards the end of the document.
Michael Fichman
Associate Professor of Practice
University of Pennsylvania Weitzman School of Design
Department of City and Regional Planning
R is a programming language you can use to do stuff that you can do in Excel, or ArcGIS, or stats software like SAS… but (with some practice) faster, better, and more efficiently.
You can load data in spreadsheet form, make new columns, create data summaries, make charts, make maps. You can load spatial data and use R as a GIS. You can make statistical models. You can also do way more complicated stuff too. Most importantly, you can set up routines or workflows to let you repeat processes that in other software would take lots of clicks and steps.
It’s “open source,” which means it is written and maintained by a community of people like you and me. That also means that sometimes, things will go wrong and there isn’t always a manual to consult.
If you’ve never programmed before, that’s OK, you’re in the right place.
I run R code and do projects in R Studio, which is an interface for working with R. You can write code to load and manipulate data, and you can see the plots and data sets you've loaded, all in one place.
Now open up R Studio and let’s take a look at some of the components.
Your R Studio GUI (graphical user interface) should look something like this:
(Image by Kieran Healy)
You can type and execute code in the console.
I could type something like myName <- "Michael" into the console, and it would create a variable called myName that is equal to the character string “Michael” in my “environment.”
Go ahead, try it.
myName <- "Michael"
This is where you can see the data sets and other things you've loaded into R to work on. If I load an Excel spreadsheet into R using read_excel with something like

myData <- read_excel("filepath_on_my_computer.xlsx")

it will show up in my environment as a thing called myData.
You can clear the environment by going to the Session dropdown and selecting Clear Workspace.
You can also save your environment or load up an old one from a project you were working on.
If you load up a .R or .rmd file - two types of R files - they will show up as tabs in your code window.
If I give you some code, you can execute it a piece at a time by copying-and-pasting into the console, or by highlighting code and hitting ctrl+Enter to run it.
If you are writing code, you can save it as either a .R file (just plain code) or a .rmd file (a Markdown file, like what you are reading now - something you can write to present code or analysis).
-If you make charts or plots - they show up in this window.
-The “Files” tab is a place to graphically look through file trees on your computer.
-You can consult the “Packages” or “Help” tabs to find documentation about different functions and packages in the R world. (More about packages later)
How do you “run code” in R?
Usually, I will give you a bunch of code in class, and you will have to adapt that code for a project you are doing. Eventually, you will write some of your own code - but most people cut and paste stuff and switch variable names and input data sets around. That’s how I do much of my work - I grab things from my code base I’ve generated over the years. So don’t worry about writing things from scratch.
Code with a # in front of it is called “commented code.” This stuff doesn't execute - it's where you make notes. Code without the # will run - either as written in the console or by hitting ctrl+Enter in the code window.
# This code doesn't do anything, but the next line does
print('hello world')
## [1] "hello world"
R is case sensitive - many of your issues can be resolved by checking your syntax. You might get some kind of error because you created a data set called derp and you try to call it in the code with DERP. Happens all the time.
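A quick illustration - the object derp exists after the first line runs, but DERP does not:

derp <- c(1, 2, 3)
# Typing DERP in the console would return: Error: object 'DERP' not found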
TUTORIAL: If this is your first time using R, you might want to try out a .R script called running_code_in_R.R which you can find here. It has some simple code and simple operations you can try - download the .R file, load it into your R Studio and follow the instructions to create some simple objects.
Most pieces of data you load into R or create with R will have a name. If you load in a spreadsheet, you will want to give it a name so it shows up in your R environment. You can assign information to a variable using this sign: <-
You can have all kinds of data objects:
A data frame: a matrix of data with column names, basically like an excel spreadsheet (much more on this later)
A vector: a list of items, denoted with the letter c, like this:
myVector <- c("A", "B", "C")
myVariable <- "derp"
myNumber <- 10
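If you're ever unsure what kind of object you've got, the class function will tell you:

class(myVector)

## [1] "character"

class(myNumber)

## [1] "numeric"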
There are numbers, characters, and other data types. There are also NA data - missing data - these are tricky. You can't do arithmetic on these.
Look at the example below. We create a vector of numeric data called vector1, which is just numbers, and then we ask for its median. That works. Then we create one called vector2 with an NA in there - taking the median no longer works.
vector1 <- c(1, 4, 6, 7, 10)
median(vector1)
## [1] 6
vector2 <- c(1, 4, 6, NA, 10)
median(vector2)
## [1] NA
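If you just want R to ignore the missing values, most summary functions take an na.rm argument:

median(vector2, na.rm = TRUE)

## [1] 5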
A lot of the things you will want to do with R will require loading packages. “Base R” is basic R - there are a bunch of functions it does. Since R is open source, lots of people have developed their own cool packages that do additional stuff.
A package is a set of functions - some of them make graphics, some handle spatial data. You will almost always start your session by loading some packages.
The first time you use a package on your computer, you need to install it with the command install.packages, like this:
install.packages('tidyverse')
If you have already installed a package on your computer, you can load it into your environment using the library command.
library(tidyverse)
Packages all come with documentation - the functions in these packages have little vignettes that tell you how to use them. Check the “Packages” tab in your R Studio environment to see these, or just google package names or function names to find the documentation.
You can also type things like ??select in your console and documentation about the function select will pop up in your “Help” window in R Studio.
The tidyverse, which we loaded in the previous code chunk, is a group of important packages which are used to wrangle data, make graphics, load data, handle dates, handle text. Better yet, these projects are coordinated by their developers, so they all work nicely together.
For work in our class, you should load up the tidyverse on every project. Most of my code doesn’t work without it.
Most projects will start with you loading some spreadsheet-like data - from a survey you conducted, from an Excel sheet, from the Census' website, from a shapefile I gave you for class. For non-spatial data, these spreadsheet-like data will be in a format called a “data frame” (or a variant called a “tibble”).
There are a lot of file types you can load, and some of them have their own functions - for example, you can read in Excel data with the read_excel function in the readxl package. Later on I will show you how to read in a .shp. (You can read more about these data import functionalities here.)
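For instance, here's a quick sketch of reading an Excel file with readxl (the file path here is made up):

library(readxl)
myExcelData <- read_excel("filepath/onMyComputer/fileName.xlsx")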
There are several ways to load data into your R environment.
You can load data from your computer in one of two ways: either by going to File / Import Data Set in the R Studio dropdown menu and clicking on the relevant options (note that “From txt” will let you load a csv, txt, tsv or other similar file types), or by doing it with code, like this:
myData <- read.csv("filepath/onMyComputer/fileName.csv")
You can load data from the web, like from Google Drive or Github, using the same kind of syntax. Here we load a csv from the web:
myData <- read.csv("https://raw.githubusercontent.com/mafichman/R_FAQ_For_Planners/main/data/someCensusData.csv")
Now that you’ve loaded in some data, you can examine them in several ways. You can View(myData) and see a visual representation of the data in an excel-like way, or you can glimpse your data frame. glimpse will tell you what data types the columns are (character, numeric, etc), let you see some samples of the data, and let you see how many rows and columns the data contain.
glimpse(myData)
## Rows: 384
## Columns: 23
## $ X <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, …
## $ GEOID <dbl> 4.2101e+10, 4.2101e+10, 4.2101e+10, 4.2101e+10,…
## $ NAME.x <chr> "Census Tract 1, Philadelphia County, Pennsylva…
## $ total_pop.2010 <int> 2883, 1886, 2915, 1525, 3156, 948, 938, 2538, 1…
## $ total_HU.2010 <int> 2455, 1128, 2038, 1531, 2534, 750, 713, 2284, 1…
## $ total_vacant.2010 <int> 422, 123, 206, 362, 239, 114, 60, 341, 67, 433,…
## $ med_HH_Income.2010 <int> 73041, 43218, 65577, 21832, 50020, 45357, 53938…
## $ total_White.2010 <int> 2660, 358, 2968, 876, 2762, 849, 863, 1981, 118…
## $ total_GradDeg.2010 <int> 918, 254, 1107, 323, 1048, 321, 426, 699, 496, …
## $ vacancyPct.2010 <dbl> 0.17189409, 0.10904255, 0.10107949, 0.23644677,…
## $ pctWhite.2010 <dbl> 0.9226500, 0.1898197, 1.0181818, 0.5744262, 0.8…
## $ NAME.y <chr> "Census Tract 1, Philadelphia County, Pennsylva…
## $ total_pop.2016 <int> 3983, 2357, 3315, 2633, 3202, 1305, 1021, 3238,…
## $ total_HU.2016 <int> 2624, 1079, 1925, 1969, 2880, 942, 757, 2729, 1…
## $ total_vacant.2016 <int> 319, 129, 160, 299, 500, 68, 81, 452, 109, 363,…
## $ med_HH_Income.2016 <int> 103772, 50455, 93036, 57604, 70038, 40568, 7125…
## $ total_White.2016 <int> 3020, 825, 2740, 1605, 2721, 1126, 991, 2138, 1…
## $ total_GradDeg.2016 <int> 1094, 455, 1075, 847, 1147, 409, 548, 1333, 743…
## $ vacancyPct.2016 <dbl> 0.12157012, 0.11955514, 0.08311688, 0.15185373,…
## $ pctWhite.2016 <dbl> 0.7582224, 0.3500212, 0.8265460, 0.6095708, 0.8…
## $ change_med_HH_Income <dbl> 23375.771, 2884.947, 20855.396, 33573.518, 1498…
## $ change_Grad_Degree_Pct <dbl> -0.043750978, 0.058365438, -0.055476303, 0.1098…
## $ pointBreeze <chr> "REST OF PHILADELPHIA", "REST OF PHILADELPHIA",…
After you manipulate data, you might want to write it out to file. Everything we do in the R Studio environment leaves the underlying files untouched - so any project you do might need a step where you output data.
You can do this with commands like write.csv - so you could take your data frame myData and write it out as a file with a name and location of your choosing, like so:
write.csv(myData, "your_file_path/your_file_name.csv")
We talk a bit later about reading/writing spatial data as well.
dplyr is the core data wrangling package in “tidy” R. You can use functions from dplyr to manipulate data frames.
You can make new columns, and remove or rename columns from your data frame. You can summarize data (e.g. what is the sum of all the rows in a given column). You can filter data to keep only rows that meet some criteria.
It does lots of other stuff too.
There is a handy guide for all the dplyr data tools at R Studio’s dplyr cheat sheet.
Very important to what I'm about to show you from here on out is a thing called a “pipe” - %>% - an operator you put in your code to chain functions together and make data-wrangling “recipes.”
To understand what the pipe does, think about the following “recipe”: I make an omelette by taking a data frame called eggs and applying the functions scramble and fry:
omelette <- eggs %>% scramble() %>% fry()
OK, keep that in mind while we look at dplyr commands. Let's use a few basic dplyr commands to wrangle our dataframe myData.
mutate makes a new column. I am going to overwrite myData - making a new version that has a new column in it. Each row of the data frame represents a census tract, and I'd like to see what the difference is in tract population between 2010 and 2016. I'll call this new column pop_change and set its value equal to total_pop.2016 minus total_pop.2010. Note the pipe used to “chain” the mutate statement in there.
myData <- myData %>%
mutate(pop_change = total_pop.2016 - total_pop.2010)
Did it work? Try typing View(myData) or glimpse(myData) into your console.
I’m now going to create a second variable called pop_change_positive using mutate. If the population change was positive, I’m going to set this variable equal to TRUE - I’ll use an ifelse statement to make its value contingent on the value of pop_change.
myData <- myData %>%
mutate(pop_change_positive = ifelse(pop_change > 0, TRUE, FALSE))
filter lets you reduce your data frame based on some criteria. I'd like to create a new data frame called population_loss that consists only of census tracts in myData that lost population (e.g. pop_change_positive == FALSE).
population_loss <- myData %>%
filter(pop_change_positive == FALSE)
I can do this kind of filtering based on lots of criteria.
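For example, here's a sketch combining two criteria with & (both conditions must be true) - the data frame name big_loss is made up for illustration:

big_loss <- myData %>%
  filter(pop_change_positive == FALSE & total_pop.2010 > 2000)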
select keeps only the columns you specify.
I’m going to make a new data frame, called twoVariables, that consists of only the columns NAME.y and GEOID from the data frame myData.
twoVariables <- myData %>%
select(NAME.y, GEOID)
rename changes the names of your columns. I'd like to rename the column from the data frame twoVariables called NAME.y to be called just NAME.
twoVariables <- twoVariables %>%
rename(NAME = NAME.y)
You’ll often want to know about the characteristics or central tendencies of your data. Using the group_by and summarize commands you can do lots of this.
Here are some examples - notice I’m not creating new data frames here.
First, let’s just find out what the median pop_change was between 2010 and 2016 for Philadelphia census tracts.
myData %>%
summarize(median_pop_change = median(pop_change))
## median_pop_change
## 1 133
What if we want to know how one type of tract versus another varied in terms of pop_change? We can group our data into categories using group_by and then summarize. For tracts grouped by pop_change_positive (remember we created this column as a TRUE/FALSE - was the population change between 2010 and 2016 positive?), what is the median pop_change?
myData %>%
group_by(pop_change_positive) %>%
summarize(median_pop_change = median(pop_change))
## # A tibble: 2 × 2
## pop_change_positive median_pop_change
## <lgl> <dbl>
## 1 FALSE -332
## 2 TRUE 352
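You can also compute several summaries in one summarize call. Here's a sketch using columns already in myData (the summary names are made up):

myData %>%
  group_by(pop_change_positive) %>%
  summarize(n_tracts = n(),
            median_pop_change = median(pop_change),
            median_income_2016 = median(med_HH_Income.2016, na.rm = TRUE))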
You can do “tabular joins” between data sets when they have a “unique ID” - a number or code that identifies each observation or subject in your data. For example, the “GEOID” is the unique code associated with each census tract.
There are several types of joins you can use in R. Hadley Wickham’s “R For Data Science” has a great chapter and series of code examples on the mechanics of left, right, inner, outer, full, and other joins.
The basic mechanics of a join go like this:
new_dataframe <- left_join(dataframe1, dataframe2, by = c("unique_ID"))
This will get you a data set where you preserve all of dataframe1 (the “left” side of the left join), and tack on anything from dataframe2 that matches the unique_ID found in dataframe1. Notice the by call where you specify the unique ID - that's a weird bit of syntax. You can join on multiple unique IDs if you want.
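To see exactly what a left join keeps and drops, here's a toy sketch with two made-up data frames:

df_a <- data.frame(unique_ID = c("001", "002", "003"), pop = c(100, 200, 300))
df_b <- data.frame(unique_ID = c("001", "002", "004"), income = c(50, 60, 70))

# "003" is kept with an NA income; "004" from df_b is dropped
left_join(df_a, df_b, by = c("unique_ID"))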
One concept that is important but a bit hard to engage is data format. The dataframe myData that we have been working with so far is what's called a “wide” data set. Each row is a census tract, each column is a variable. This is great for doing column math with mutate and it's good for making certain kinds of plots - for instance a scatter plot with two continuous variables (more on this in the next section).
Take a look at myData using the command View(myData) and you'll see that it is indeed “wide” format.
There is another form of data called “long” data that are useful for other circumstances, like making comparisons between continuous variables across categories. (Note: Often you find time-series data in long format).
Let’s examine indicators from 2010 of tracts that appreciated in population between 2010 and 2016:
We gather those variables, and make a data frame where each row is a tract, its pop_change_positive value (TRUE/FALSE), a variable name, and a value for that variable. Then we can group_by the population change direction (TRUE = positive) and the variable, and summarize the median_value for each variable.
myData %>%
select(pop_change_positive, med_HH_Income.2010, total_White.2010, total_HU.2010) %>%
gather(-pop_change_positive, key = "variable", value = "value") %>%
group_by(pop_change_positive, variable) %>%
summarize(median_value = median(value, na.rm = TRUE))
## # A tibble: 6 × 3
## # Groups: pop_change_positive [2]
## pop_change_positive variable median_value
## <lgl> <chr> <dbl>
## 1 FALSE med_HH_Income.2010 31264
## 2 FALSE total_HU.2010 1859
## 3 FALSE total_White.2010 932.
## 4 TRUE med_HH_Income.2010 37174
## 5 TRUE total_HU.2010 1673
## 6 TRUE total_White.2010 1326.
Note there is a more modern version of gather known as pivot_longer, where the syntax would look like this:
myData %>%
select(pop_change_positive, med_HH_Income.2010, total_White.2010, total_HU.2010) %>%
pivot_longer(cols = -pop_change_positive) %>%
group_by(pop_change_positive, name) %>%
summarize(median_value = median(value, na.rm = TRUE))
Wide and long data are a weird concept - this graphic is pretty helpful for understanding how data can transform from one form to another.
Can you think of a few situations in which one form or another is appropriate?
Image: https://www.garrickadenbuie.com/
The wide/long data concept is a bit nebulous, but it really comes in handy when you are making facetted graphics - a subject covered in the next section.
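For completeness: the inverse operation is pivot_wider, which spreads long data back out to wide. A quick sketch, assuming you saved the long summary from the previous chunk as a (hypothetical) object called long_summary:

long_summary %>%
  pivot_wider(names_from = name, values_from = median_value)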
ggplot2 is the main graphics library in the tidyverse - it's fantastic. You can make bar plots, line plots, and much more with it.
You make ggplots by writing “recipes” for charts. The + operator is a bit like what the %>% is in dplyr/tidyverse - you use it to add elements together.
The core elements of ggplot recipes are called geometries (a geom), and the way you style the data is with aesthetics (aes). You can add some other styles to your ggplots, but if it involves changing something about the appearance of the plot based on data, you use an aes call.
There are lots of resources for ggplot - the simplest and easiest one to check out is R Studio’s ggplot cheat sheet. There is also a whole (free) book on data viz in ggplot - Data Visualization by Kieran Healy. It’s fantastic.
Let’s make a simple scatterplot.
Think about the “recipe” for the following chart, where we plot Philadelphia census tract median household income as a function of the tract's percentage of vacant housing units, using this narrative (remember each row in myData is a Philadelphia census tract): plot myData where the x aesthetic is vacancyPct.2016 and the y aesthetic is med_HH_Income.2016:
ggplot()+
geom_point(data = myData,
aes(x = vacancyPct.2016, y = med_HH_Income.2016))
We can add more aesthetics to this plot. Let’s do this first by just tinkering with the colors to make it look cool - we will do this with some commands outside the aesthetics to make the points blue.
ggplot()+
geom_point(data = myData,
aes(x = vacancyPct.2016, y = med_HH_Income.2016),
color = "blue")
OK, now let’s make the point coloration change according to a data-related element, something we have to specify in the aesthetics. We specify the color as being equal to pop_change_positive:
ggplot()+
geom_point(data = myData,
aes(x = vacancyPct.2016, y = med_HH_Income.2016, color = pop_change_positive))
You can very simply make “small multiple” plots where you subdivide your data into categories for graphic comparison. This is done with the facet_wrap command.
Here’s an alternate take on our previous plot, cut into two plots. Notice in the code I pulled pop_change_positive out of the aesthetics from the previous plot and put it in the facet_wrap command.
ggplot()+
geom_point(data = myData,
aes(x = vacancyPct.2016, y = med_HH_Income.2016))+
facet_wrap(~pop_change_positive)
Remember how we used gather and pivot_longer to make our wide data into long data in the last section? That's an incredibly useful technique for making quick data-mining visualizations across a number of variables.
Notice how below, I do the same data manipulation I did in the previous section, and then I just pipe in the ggplot call, and I use an argument in the facet_wrap called scales so that each chart has a y-axis with a scale appropriate to that variable.
myData %>%
select(pop_change_positive, med_HH_Income.2010, total_White.2010, total_HU.2010) %>%
pivot_longer(cols = -pop_change_positive) %>%
group_by(pop_change_positive, name) %>%
summarize(median_value = median(value, na.rm = TRUE)) %>%
ggplot()+
geom_bar(aes(x = pop_change_positive, y = median_value), stat = "identity")+
facet_wrap(~name, scales = "free")
You can alter most elements of the ggplot. We could take our initial scatterplot and do some things to spruce it up, like add a title, change the axis labels, etc.
ggplot()+
geom_point(data = myData,
aes(x = vacancyPct.2016, y = med_HH_Income.2016))+
labs(title="Philadelphia Tract Med. HH Income as a Function of Pct. Vacancy",
subtitle = "Data: US Census Bureau, 2016 ACS 5-Year Estimates",
x ="Pct Vacant Housing Units (0-1)", y = "Median Household Income (2016 $)")
I like to make some pre-made recipes where I style the lines, fonts and other elements, and then I can just add it to my plots. Here's a recipe I call plotTheme. Load it up into your environment and then you can add it to your plots to make them nicer. The sky is the limit.
plotTheme <- theme(
plot.title =element_text(size=12),
plot.subtitle = element_text(size=8),
plot.caption = element_text(size = 6),
axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
axis.text.y = element_text(size = 10),
axis.title.y = element_text(size = 10),
# Set the entire chart region to blank
panel.background=element_blank(),
plot.background=element_blank(),
#panel.border=element_rect(colour="#F0F0F0"),
# Format the grid
panel.grid.major=element_line(colour="#D0D0D0", linewidth=.75),
axis.ticks=element_blank())
Now I can add plotTheme to my recipe from the last plot for some extra styling. Notice I just tack it onto the recipe below with a +:
ggplot()+
geom_point(data = myData,
aes(x = vacancyPct.2016, y = med_HH_Income.2016))+
labs(title="Philadelphia Tract Med. HH Income as a Function of Pct. Vacancy",
subtitle = "Data: US Census Bureau, 2016 ACS 5-Year Estimates",
x ="Pct Vacant Housing Units (0-1)", y = "Median Household Income (2016 $)")+
plotTheme
You can access the US Census using the tidycensus package - this is much faster than grabbing Census data on the web. You can grab the variables you want with or without spatial data attached. Even if you only use R for this - you're making your life a bit easier. You can grab data with tidycensus, write it to file (more on this in the next section) and then use it in Excel or ArcGIS.
Here’s how it works - you “tell” the census, through some R Code, that you want certain variables, for certain years, and certain geographies, and it returns it to you in your R Environment.
There is a whole book on tidycensus from its creator/maintainer (cough cough MUSA Masterclass guest speaker 2021) Kyle Walker.
To use tidycensus you should install and load the tidycensus package. We are going to grab some spatial data as well, so load the sf package.
If you don’t have these packages installed, you should install them as follows (if you do, skip this step):
install.packages('tidycensus')
install.packages('sf')
OK, if you have the packages installed on your computer, now load them into your R environment with a library call:
library(tidycensus)
library(sf)
You will need a “key” to access the Census API. You can find one at their website.
Paste it into the code block below and run it to load your key into your R environment:
census_api_key("YOUR KEY GOES HERE", overwrite = TRUE)
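If you'd rather not paste the key in every session, census_api_key can also save it to your .Renviron file for future sessions using the install argument:

census_api_key("YOUR KEY GOES HERE", overwrite = TRUE, install = TRUE)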
Now that we have our census credentials loaded, we can start downloading information from the API using some functions from tidycensus. We are going to grab some 2016 ACS estimates for Philadelphia census tracts. In order to choose variables of interest, we are going to load the 2016 ACS data dictionary using the tidycensus function load_variables. We turn it into a dataframe called acs_variable_list.2016.
acs_variable_list.2016 <- load_variables(2016, #year
"acs5", #five year ACS estimates
cache = TRUE)
Once we have loaded these data, we can observe and search through the data frames of variable information, which should appear in our global environment, either by clicking on them or using the View(acs_variable_list.2016) command.
Look around in the data frame for a few minutes and see what's in there.
We can populate a vector of variable names we will send to the Census API. We call this list acs_vars. This is the beauty of a code-based workflow - you can take this vector and put anything you want in it when you have a new analysis to do, and re-run it for different variables. These need to be character strings, and hence, in quotes as you see below.
Keep in mind the categories and code numbers change a bit over time - you may need separate vectors for different census years.
acs_vars <- c("B01001_001E", # ACS total Pop estimate
"B25002_001E", # Estimate of total housing units
"B25002_003E", # Number of vacant housing units
"B19013_001E", # Median HH Income ($)
"B02001_002E", # People describing themselves as "white alone"
"B06009_006E") # Total graduate or professional degree
We use the get_acs function in tidycensus to query the API and get tract-level data for all of Philadelphia from the 2016 American Community Survey's (ACS) 5-year estimates. Notice the different arguments for the function, and that they require certain types of info. For example, geography requires one of a finite list of answers, and they have to be formatted as character strings.
Remember the ?? function - you can learn about the parameters for get_acs this way. There is also a function called get_decennial which you can use for decennial census counts.
We ask for data on our acs_vars for all tracts in Philadelphia County, PA in 2016. We ask for “wide” data (e.g. one variable per column, one row per tract) and we set geometry to FALSE. (Later, we will set geometry to TRUE and get some spatial data.)
acsTractsPHL.2016 <- get_acs(geography = "tract",
year = 2016,
variables = acs_vars,
geometry = FALSE,
state = "PA",
county = "Philadelphia",
output = "wide")
Voila - now you have census data. Take a look and see what you have there - since it’s the ACS, they are estimates - so you’ll have data marked “E” for estimate, or “M” for margin of error.
Let's use a glimpse command to check it out:
glimpse(acsTractsPHL.2016)
## Rows: 384
## Columns: 14
## $ GEOID <chr> "42101008301", "42101008302", "42101006300", "42101007300"…
## $ NAME <chr> "Census Tract 83.01, Philadelphia County, Pennsylvania", "…
## $ B01001_001E <dbl> 4809, 4334, 4264, 2529, 4443, 7415, 4311, 5856, 4738, 2960…
## $ B01001_001M <dbl> 606, 516, 637, 382, 570, 794, 591, 716, 536, 424, 449, 324…
## $ B25002_001E <dbl> 1988, 2221, 1697, 1361, 2262, 3177, 2069, 2395, 2509, 1342…
## $ B25002_001M <dbl> 56, 52, 60, 80, 72, 122, 73, 58, 55, 106, 81, 47, 63, 80, …
## $ B25002_003E <dbl> 371, 527, 374, 203, 262, 590, 459, 109, 407, 218, 369, 105…
## $ B25002_003M <dbl> 119, 118, 100, 121, 134, 175, 108, 86, 125, 99, 114, 45, 9…
## $ B19013_001E <dbl> 29770, 31202, 25871, 25369, 56944, 28201, 24402, 37390, 30…
## $ B19013_001M <dbl> 6061, 4228, 7342, 8100, 12681, 8161, 2987, 4350, 6472, 267…
## $ B02001_002E <dbl> 113, 89, 397, 124, 2649, 463, 83, 70, 47, 466, 1087, 551, …
## $ B02001_002M <dbl> 80, 119, 243, 71, 387, 224, 81, 50, 38, 170, 347, 174, 465…
## $ B06009_006E <dbl> 80, 92, 29, 55, 1033, 166, 50, 265, 175, 74, 101, 65, 10, …
## $ B06009_006M <dbl> 69, 53, 43, 48, 199, 74, 65, 130, 99, 49, 93, 38, 14, 14, …
The variable names that come with the stock Census data aren't that helpful. You can use the rename function from dplyr to make the columns in your data intelligible.
acsTractsPHL.2016 <- acsTractsPHL.2016 %>%
rename (total_pop.2016 = B01001_001E,
total_HU.2016 = B25002_001E,
total_vacant.2016 = B25002_003E,
med_HH_Income.2016 = B19013_001E,
total_White.2016 = B02001_002E,
total_GradDeg.2016 = B06009_006E)
We can use R as a GIS and easily manipulate vector data (points, lines and polygons). The main package we use for this is called sf. Here are some things that it does:
Reads .shps, geojson and other spatial data files.
Allows you to manipulate spatial data using the same tools we use to manipulate data frames - dplyr tools like mutate, rename, etc.
Lets you do geoprocessing like you do in ArcGIS - spatial joins, area and distance calculations, union and dissolve functions, and more.
Makes maps by interfacing with ggplot
Let’s load in some spatial data in two ways.
First, let's just load in some simple data using st_read - that's the sf package's basic read function. You can read stuff from the web or from a filepath on your computer this way.
These are some data I grabbed from Open Street Map - Philadelphia area restaurants. You might recognize some of these. If you want to know more about grabbing OSM data, check my repo with sample code.
These data are points (vector data can be points, lines or polygons).
restaurants <- st_read("https://raw.githubusercontent.com/mafichman/R_FAQ_For_Planners/main/data/restaurants.geojson")
We can also get census data from tidycensus in spatial form. You might remember the code block below from Section 8 of this demo - where we grabbed a table of census data. We can grab the same data but in spatial form by changing the geometry argument to TRUE.
These are polygon data - each row is a census tract, each column is a variable associated with each tract.
acsTractsPHL.2016.sf <- get_acs(geography = "tract",
year = 2016,
variables = acs_vars,
geometry = TRUE,
state = "PA",
county = "Philadelphia",
output = "wide") %>%
rename (total_pop.2016 = B01001_001E,
total_HU.2016 = B25002_001E,
total_vacant.2016 = B25002_003E,
med_HH_Income.2016 = B19013_001E,
total_White.2016 = B02001_002E,
total_GradDeg.2016 = B06009_006E)
Let's examine an sf object, restaurants. Call the object in your console and look at some of the characteristics.
It is worth noting here that the 10th row in the data set, “Mi Puebla”, is a fantastic Mexican spot in Mt Airy. It makes me hungry just seeing that data point.
restaurants
## Simple feature collection with 827 features and 7 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -75.29785 ymin: 39.87397 xmax: -74.92357 ymax: 40.1477
## Geodetic CRS: WGS 84
## First 10 features:
## name osm_id addr.street addr.postcode
## 1 Friendly's 332503202 <NA> <NA>
## 2 Sam's Morning Glory 333786044 <NA> <NA>
## 3 Rexy's Bar and Restaurant 363599966 <NA> <NA>
## 4 Applebee's 363600675 <NA> <NA>
## 5 Groove Ground 431179432 <NA> <NA>
## 6 Spring Garden Pizza & Restaurant 566683522 <NA> <NA>
## 7 Carollo's Pizza 583287916 <NA> <NA>
## 8 Earth Bread & Brewery 596245658 Germantown Ave <NA>
## 9 Cresheim Valley Grain Exchange 596245661 Germantown Ave <NA>
## 10 Mi Puebla RESTAURANT & Bakery 596261287 Germantown Ave <NA>
## addr.city addr.state cuisine geometry
## 1 <NA> <NA> american POINT (-74.95853 39.94496)
## 2 <NA> <NA> <NA> POINT (-75.15893 39.94086)
## 3 <NA> <NA> <NA> POINT (-75.09105 39.89246)
## 4 <NA> NJ american POINT (-75.08793 39.89296)
## 5 <NA> <NA> <NA> POINT (-75.07044 39.91681)
## 6 <NA> <NA> pizza POINT (-75.15755 39.96205)
## 7 <NA> <NA> <NA> POINT (-75.0522 39.96179)
## 8 <NA> <NA> <NA> POINT (-75.19035 40.0592)
## 9 <NA> <NA> american POINT (-75.19071 40.05961)
## 10 <NA> <NA> <NA> POINT (-75.19061 40.05987)
This looks a lot like a data frame, and it is! You can easily manipulate sf objects just like data frames with dplyr, like this:
sf_object <- sf_object %>% mutate(column_a = column_b + column_c)
But there are a couple major differences between sf objects and data frames.
First, there is the geometry column - this is a set of drawing and geoprocessing instructions. You can't do a mutate or other kind of data manipulation on this column. restaurants is a point file, so each row's geometry is a single point.
Second, there is a crs or “coordinate reference system” associated with the file. If you've used GIS you've heard of “projection” - that's where it's stored in the sf object. The crs here is 4326, also known as “WGS 84” - this is the common latitude/longitude coordinate system you might be familiar with. You can change the crs when you are doing geoprocessing (more on that in a moment).
The st_transform command is your friend - that's how you do reprojections. Look it up - ??st_transform
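Here's a quick sketch of a reprojection - moving restaurants into Pennsylvania South State Plane (EPSG 2272, units of feet), a crs we use again later in this document (restaurants_projected is a made-up name):

restaurants_projected <- st_transform(restaurants, crs = 2272)
st_crs(restaurants_projected)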
The universe of coordinate systems can be found on a great website called spatialreference.org.
ggplot supports a geometry called geom_sf that allows you to plot sf objects. As with making graphs and plots, you have lots of styling options with ggplot.
ggplot()+
geom_sf(data = acsTractsPHL.2016.sf)
You can make very complex maps in ggplot. You can add many geom_sf drawing layers, and you can use ggplot's aes features to make cool aesthetics for your maps. Here's something a bit more complicated:
I use a fill parameter in my aes call to symbologize total_GradDeg.2016 in the census data.
I set the linework to be transparent (this is outside the aes call because it doesn't pertain to any data, just looks).
I add my Philly area restaurants on top, and give them a color and a size parameter.
ggplot()+
geom_sf(data = acsTractsPHL.2016.sf,
aes(fill = total_GradDeg.2016),
color = "transparent")+
geom_sf(data = restaurants, color = "yellow", size = 0.5)
The sf package has lots of commands that do basic geoprocessing. I can do spatial joins, I can calculate distances and areas, I can do spatial processes like buffer, union, and dissolve. Mind your projections when you do these things. If you try to calculate distance while your crs is in decimal degrees, you'll get outputs in decimal degrees (probably not helpful).
You can join spatial objects to one another in sf just like in any other GIS. Let's join restaurants to acsTractsPHL.2016.sf. After we do that join, we can summarize how many restaurants are in each tract. Before we get started, we have to make sure our data are projected in the same coordinate system. Let's use the st_crs command to check on that. Turns out restaurants is in WGS 84 (crs = 4326), and acsTractsPHL.2016.sf is in NAD 83 (crs = 4269):
st_crs(restaurants) == st_crs(acsTractsPHL.2016.sf)
## [1] FALSE
The function st_join does the work for us here to join points to the polygons that contain them, and it has lots of options. Here, we say the type of join is intersects, as in “does this point from the left hand data set intersect the polygon from the right hand data set?” We also say left = TRUE - indicating we want a “left join” - preserving all items from the left hand data set, whether or not they join to the items on the right. The restaurants that don't match a tract (e.g. those outside of Philly) come through with NA values in the tract columns.
Notice that I pipe the st_transform command to restaurants right inside the join statement and re-project it so this operation works!
restaurants.and.tracts <- st_join(restaurants %>%
st_transform(crs = st_crs(acsTractsPHL.2016.sf)), acsTractsPHL.2016.sf,
join = st_intersects,
left = TRUE)
This operation produces a data set where the rows are restaurants, and there is a column denoting the tract it falls in.
Now we can convert to a data frame (as.data.frame), summarize the number of observations by census tract (group_by, summarize), and join these to our tract sf file by GEOID (left_join).
restaurant_summary <-
restaurants.and.tracts %>%
as.data.frame() %>%
group_by(GEOID) %>%
summarize(restaurant_sum = n())
tracts_and_restaurants <-
left_join(acsTractsPHL.2016.sf, restaurant_summary,
by = c("GEOID"))
ggplot()+
geom_sf(data = tracts_and_restaurants,
aes(fill = restaurant_sum),
color = "transparent")
The sf package can do most of what you do with vectors in ArcGIS, including…
Geometric functions like st_union and st_buffer - a “dissolve” is done by grouping your data and calling summarize, which unions the geometries. You can find a polygon centroid with st_centroid.
Measurement functions: Calculate a polygon's area or perimeter (in the native units of its projection) with st_area. Find the distance between two sf objects with st_distance.
There are some spatial functions that you need to find other packages for, like K-nearest-neighbors functions.
Check out the sf “cheat sheet” to see what it can do:
A “fishnet” is a raster-like grid where you impose a uniform surface of areal units - squares or hexagons - across a landscape. This can be really useful for visualizing or processing information.
I can make a “fishnet” grid across Philadelphia County, and join the restaurant point data to it in order to visualize restaurant density in a different way.
Let's start by loading a geojson representing Philadelphia County, and coercing it to the sf format using st_as_sf. Not sure what the projection is? It's important to know - the cellsize you set for your fishnet will be in the native units of the projection. You can find that out using st_crs to “ask” your data about the projection. We see that this is in something called ["EPSG",4269] - the units here are degrees of latitude and longitude - those aren't linear units like feet or meters!
philaCounty <- read_sf("https://raw.githubusercontent.com/mafichman/R_FAQ_For_Planners/main/data/phila_county.geojson") %>%
st_as_sf()
st_crs(philaCounty)
## Coordinate Reference System:
## User input: NAD83
## wkt:
## GEOGCRS["NAD83",
## DATUM["North American Datum 1983",
## ELLIPSOID["GRS 1980",6378137,298.257222101,
## LENGTHUNIT["metre",1]]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433]],
## CS[ellipsoidal,2],
## AXIS["geodetic latitude (Lat)",north,
## ORDER[1],
## ANGLEUNIT["degree",0.0174532925199433]],
## AXIS["geodetic longitude (Lon)",east,
## ORDER[2],
## ANGLEUNIT["degree",0.0174532925199433]],
## ID["EPSG",4269]]
Before we make our fishnet, let's project it to coordinate reference system 2272, which is in feet. How did I choose that? Need to find the right coordinate system for your study area? Visit spatialreference.org or use the crsuggest package (more on this in the end references of this document if you're interested).
philaCounty <- philaCounty %>%
st_transform(2272)
Finally, let's make our fishnet. We use the st_make_grid function to create the fishnet - and we specify that we want the cellsize to equal 2500 - that's in the native units of the projection, feet. We ask for square cells. Check out this data pipeline - we use an operation to keep only the fishnet cells that intersect the original philaCounty shape, and then we create a uniqueID using a mutate command - giving each shape an identifier that is useful for doing analysis later.
fishnet <-
st_make_grid(philaCounty,
cellsize = 2500,
square = TRUE) %>%
.[philaCounty] %>% # clips the grid to the philaCounty extent
st_sf() %>%
mutate(uniqueID = rownames(.))
Check it out in ggplot!
ggplot()+
geom_sf(data = fishnet)
Now we can join our restaurants to the fishnet using the same routine we used for our tract-based summary, but with a different result because of the different areal units.
restaurants.and.fishnet <- st_join(restaurants %>%
st_transform(crs = st_crs(fishnet)), fishnet,
join = st_intersects,
left = TRUE)
restaurant_fishnet_summary <-
restaurants.and.fishnet %>%
as.data.frame() %>%
group_by(uniqueID) %>%
summarize(restaurant_sum = n())
fishnet_and_restaurants <-
left_join(fishnet, restaurant_fishnet_summary,
by = c("uniqueID"))
ggplot()+
geom_sf(data = fishnet_and_restaurants,
aes(fill = restaurant_sum),
color = "transparent")
You can write out any sf objects as .shp files which are readable in ArcGIS. This is especially useful if you want to just use R to grab data efficiently and want to map it elsewhere. I prefer to write geojsons - they are more widely usable and the column names are more stable.
Use the function “st_write” like so:
st_write(my_sf_object, "my_filepath/my_filename.shp")
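st_write figures out the output format from the file extension, so writing one of the geojsons I prefer is the same call:

st_write(my_sf_object, "my_filepath/my_filename.geojson")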
I will probably be asking you to turn in assignments using R Markdown. Markdown is a way to present your code-based projects as a more polished document (you're reading a markdown right now!). You can use this tool to make repeatable reporting workflows for non-technical audiences using R. I often write data reports like this one for my clients, where I “hide” all the code and just show the outputs.
A markdown document embeds “chunks” of R code in between bits of text. These chunks can simply be bits of code, or they can be code that creates some kind of image like a ggplot chart or a kable table. You can manipulate the options of a code chunk so that the output appears a certain way.
The basic file format of a markdown is .rmd. You can “knit” the markdown into a polished document in HTML, pdf or doc format.
If you want to experiment with some markdown language - you can find a download button at the top right of this document and download the code that was used to make this.
For more details on using R Markdown, consult http://rmarkdown.rstudio.com or check out this great book from Xie et al.
As you work on increasingly sophisticated or collaborative projects using R, you will want to figure out a way to manage your files and code, and to figure out ways to keep your code working as you improve and develop it.
Best practices in coding is a subject for a whole entire course, but here are a few simple tips to get you started.
R has a “projects” file-type which can help you keep an organized file structure and working environment in R Studio. There is a file type called .Rproj that coordinates this. You can open a new project by going to File -> New Project in R Studio. Check out the “Files” part of your R Studio window and see how you can now graphically view your files. Best of all, if you keep things contained in a project, it's more portable from computer to computer because you're just calling files inside one Project.
Read Hadley Wickham’s tutorial on how working with R Projects can help you with organization.
Github is a code-collaboration environment that allows you to use “source control” and work on things with others without accidentally over-writing or breaking other peoples’ code. Github works well with R Projects. You can “push” and “pull” code and data straight from R Studio or use a program called Github Desktop (which I prefer). You will need an account on Github to make this all work.
Here is a quick step-by-step guide to setting up a collaborative repo, an R project, and sending some code back and forth, using Github Desktop and R Studio:
Create a new repo on your github page
Add your github collaborator by going to settings/collaborators
Go to your Github Desktop app, and “clone” the repo - this will populate a version on your computer in its own folder.
Go into R Studio, and start a new “project” - put it in the repo folder. Do some work - create a new R script, add some data.
When you are done - go to Github Desktop, write some annotations about the changes you made to the code, and “push” the changes to the web repo.
Your collaborator can clone the project, and then push and pull things themselves. This will help you manage changes and versions. If you keep all your code and data inside the Github folder, the file paths will all be the same and it will be easy to collaborate. You can work separately without breaking each other’s code - use “branches” to make changes on your own version of the code, and then push it to the main code base later.
The finer mechanics of collaborating with Github are too complex for this document, but check out Matt Harris’ tutorial from MUSA 508 on using Github for detailed instructions on how to get set up and start collaborating.
It's a bit weird to send HTML markdowns to clients and n00bs and say “you need to DL this and open it in your browser to see it”, but you can easily publish a knit html markdown to the web using github. If you name the markdown (in HTML form) index, you can go to your repo settings, and activate the “pages” functionality. This will publish the html file as a github.io webpage.
This page itself is an example - [check out the repo behind it](https://github.com/mafichman/R_FAQ_For_Planners) to see how this all works!
You will probably find yourself writing code that other people have to use - or that you will save and use at some time in the future. It’s important to make sure you format and annotate your code so that your colleagues or your future self can use it, understand it, and troubleshoot it.
Here are some simple tips to keep your code nice and neat:
Name your files and variables things that are intelligible. Name a dataframe census_2011 instead of nonsense like new_dataframe. Don't call your functions or your data objects names that might be used elsewhere in your R environment - for example, don't try to name a dataframe false or median.
Use spaces in your code before and after operators like parentheses or equals signs, like this:
theMedian <- median(c(1, 4, 6))
Use a return at the end of each pipe when you are using tidyverse and dplyr, and use a return after each + when you are using ggplot2.
Use indentations and multiple lines to split apart long functions - try to use one idea per line. R Studio does a good job of managing indentations for you automatically.
Here's a quick example of a few of these things in action, piping together a few operations. We mutate a new variable called high_income in acsTractsPHL.2016.sf and send that categorical to the fill command in a geom_sf in ggplot2 using the sf package. In the “bad” example we just write a big giant sentence. In the “better” example we use returns after our pipes and plus signs, and we break up our ifelse statement using returns to make it easier to read.
# Bad
acsTractsPHL.2016.sf %>% mutate(high_income = ifelse(med_HH_Income.2016 > 100000, "Over 100k", "Under 100k")) %>% ggplot()+ geom_sf(aes(fill = high_income))
# Better
acsTractsPHL.2016.sf %>%
mutate(high_income = ifelse(med_HH_Income.2016 > 100000,
"Over 100k",
"Under 100k")) %>%
ggplot()+
geom_sf(aes(fill = high_income))
Check out https://style.tidyverse.org/index.html for more info on how to write legible code.
New additions for 2024 will include tips on how to organize your R Projects, R scripts, data, markdowns and more.
Upcoming additions include vignettes on Socrata and REST API query functionalities for R, and using two-factor authentication with boxr and PennBox.
Presently, vignettes for boxr can be found at https://github.com/mafichman/boxr.
Upcoming additions will include vignettes for some of the following packages - Leaflet, osmdata, raster/terra, and crsuggest.
Currently, a vignette on the use of osmdata for Open Street Map data can be found at https://github.com/mafichman/osm_data.
The mapview package is a quick and easy way to look at spatial data interactively with a basemap. mapview can handle many kinds of spatial data, including sf vectors. I like to use mapview to make quality checks on my data, and quickly scan for patterns at multiple scales. With static maps in ggplot with sf, sometimes this can be cumbersome. mapview has some drawbacks - it's hard to make sophisticated styling, and it can choke on large data sets (more than a few thousand objects).
Here is a really quick web map of our restaurants data set from Section 9 in mapview.
To use mapview for the first time you should install and load the package. (If you already have it installed, skip this step):
install.packages('mapview')
Now load it into your R environment with a library call:
library(mapview)
Now we can simply feed restaurants to the mapView function. You can click on the points and see their attributes. Simple! If you have data to symbologize, you can use the zcol argument to specify it - see the sketch after the map below.
mapView(restaurants)
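For example, a quick sketch symbologizing the points by the cuisine column we saw earlier:

mapView(restaurants, zcol = "cuisine")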
This section contains online resources for learning R and troubleshooting code-based problems.
It’s super important to know how to ask a good question on the internet or on the class forum about some issue you are having. Googling around is a really good way to find help - so you can google something about the error you are having, or look on https://stackoverflow.com, which is a community of people doing Q&A and troubleshooting.
You can look up details about the functions you are using in R using the ?? command in your console, like this:
??read.csv
You can also check out the documentation about the packages, functions and the arguments they take in the Packages tab in the Viewer window in R Studio.
If you are going to ask a question on the class forum, include some code, and details about your problem - like the data sources that you tried to use and how they look. This allows somebody to reproduce the issue on their own, or at least understand it.
A good question gives your respondent lots of background info:
What is the nature of the data in your question? Can you post a glimpse command so we can see what the column names and data types are?
What is the code snippet that didn’t run correctly?
What kind of error message did you get?
What packages do you have loaded?
Did you google the question before you came and asked for help? (The answer should be yes).
Here are some examples of some bad questions, and some alternatives:
Class forum issue: “I tried to make this code work (pastes code block) and I got an error.”
->
“I ran this code block using data that look like this (show the results of a head or glimpse command) and got this error (copy and paste error message). I have the following packages installed….”
Google query:
“How do you get my file to join the other?”
->
“Spatial joins in r with sf”
Generative AI solutions like ChatGPT can be useful aids in debugging code, writing functions, and solving coding problems. It's helpful to think of ChatGPT as a highly structured, highly customized search engine. To get the best results, use the same kind of steps described in the previous section about asking a question on a class message board.
Keep in mind that if you have not educated yourself about how a coding language works, you not only can’t ask good questions of a GPT, you cannot audit the results of queries effectively. You might not be able to interpret the varying degrees of wrongness you get from GPT outputs. You might also generate outputs that are either a) inoperable or worse b) functional but fundamentally wrong for your task.
Consider the following situation where I want to mutate a new column and I consult ChatGPT about how to format the data. I will provide an example of a good query and a bad query as I try to solve my problem. I am going to load the data set mtcars (a data set about cars and their attributes that comes with R and is often used to create code examples just like this!).
data(mtcars)
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
I'd like to mutate a column that is a ratio of horsepower (hp) to miles per gallon (mpg). Simply, I would say something like this:
mtcars %>% mutate(hp_mpg_ratio = hp/mpg)
However, I get tons of decimal places when I do this, and I don’t know how to fix that! So I ask ChatGPT:
How do I round to two decimal places in R?
ChatGPT will give me a generic code example that I have to interpret and apply to my code, but if I ask it
Can you adjust this tidy R code to round the output to two decimal places?
mtcars %>% mutate(hp_mpg_ratio = hp/mpg)
I will get a precise re-working of my code.
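The reworked code would look something like this:

mtcars %>% mutate(hp_mpg_ratio = round(hp/mpg, 2))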
The more sophisticated your problem, the more important it is that you ask a very specific question, sometimes even providing background information about packages, data types, and so on.
Let's say I wanted to make a plot of mpg, which is a continuous variable, but I am interested in the number of cars that are over or under 20 mpg. I can ask a very specific question like the following:
How do I adjust this tidy R code so that the continuous variable mpg is engineered as a categorical variable that is either over or under 20. The result should be a bar plot and the feature transformation should take place within the ggplot call.
ggplot(mtcars)+ geom_histogram(aes(mpg))
The result is as follows:
ggplot(mtcars, aes(x = ifelse(mpg > 20, "Over 20", "Under 20"))) +
geom_bar(color = "black", fill = "steelblue") +
labs(x = "MPG Category", y = "Frequency", title = "Bar Plot of MPG Over/Under 20")
R Studio’s website has some really good cheat sheets with simple walkthroughs of basic functions in some core packages
Handy tips for making your code readable
Color Palette ideas
If you’re in a class with me - these books are likely to be among the required or supplemental readings. They are all fantastic, open source, and extremely, extremely useful. Here they are in “bookdown” form for free.
R for Data Science - Hadley Wickham - Supplementary text for CPLN 675
Public Policy Analytics - Ken Steif - Text for MUSA 508
R Markdown Cookbook - Yihui Xie, Christophe Dervieux, Emily Riederer
Have a question or something you'd like to see included here? Add it to [the “issues” section of the repo for this text](https://github.com/mafichman/R_FAQ_For_Planners/issues)!