ggplot(data = enterobase_assemblystats,aes(x=`Contig Number(>=200 bp)`))+
geom_histogram(binwidth = 5)
Module 5
Basic data visualization in RStudio
R is very commonly used in scientific research for statistical analysis and data visualization for many different types of data modalities.
R is a programming language, whereas Rstudio is an Integrated development environment or IDE.
The goal of an IDE is to provide a single environment for writing and running code, troubleshooting/debugging, package management as well as rendering visualizations.
In this workshop, we will primarily be using R/Rstudio for data visualization. I will quickly get to handling real tabular data and creating plots while explaining some basic concepts along the way. However, if you would like a comprehensive tutorial on the basics of R from scratch, please check out the Software Carpentry R tutorial and/or Swirl
Your RStudio will look something like this:
You can write your code in the Source pane and press ctrl + enter
to run the selected line(s). This is your scripting environment where you can save everything you are typing.
Alternatively, you can also write code in the Console pane and press enter
to run as you were doing for the bash command line. The code in the console is not saved.
Any datasets you load or variables/functions you set will be visible in the Environments pane.
Any plots/documents that you render will be visible in the Output pane.
Setting up your R environment
New packages can be installed using install.packages("package_name")
Not all installed R packages are available by default. You can load packages using the library
function.
The R working directory can be seen using the getwd()
command
You can change the working directory using setwd("/path/to/dir")
install.packages(c("data.table","dplyr","tidyverse","ggplot2"))
library(dplyr)
library(tidyverse)
library(ggplot2)
getwd()
setwd("NWU_2025_workshop_data/test_datasets/results")
Download an example table
We learned about Enterobase in Module 3. On your browser, navigate to the Enterobase Salmonella page, and search for the serovar ‘Minnesota’. Download the assemblystats table and the Achtman 7 Gene MLST table.
Reading your data
We are using the function fread
which is part of the data.table
package to read a table.
To use functions in R, you call the name of the function, and enclose the different values the function will accept within parenthesis.
To access the help page for any function, type ?
followed by the name of the function
?fread
fread
is a function to quickly read .tsv
or .csv
files
Lets read in our Enterobase assemblystats table
<-fread(file="/path/to/20251020_enterobase_minnesota_assemblystats.tsv",header=T,sep="\t") enterobase_assemblystats
fread(...)
: Calling thefread
functionfile="..."
: Providing the path for the file to be readheader=T
: Stating that the file has a header (which is the first line in thecsv
ortsv
that has all the column names).T
is short forTRUE
.sep="\t"
: Stating the delimiter in the file, since our file is tab separated, we set the delimiter as\t
.enterobase_assemblystats<-
: We are saving the read table to an object calledenterobase_assemblystats
. Once it is saved, it should appear in your Environment pane. You can click on it here and it will open a new tab in your Source pane where you can browse through the table.
Making simple plots using ggplot2
Ggplot is the most common R tool used for creating publication quality plots.
ggplot follows a consistent grammar (The gg in ggplot means grammar of graphics) making it very easy to build upon.
Apart from the default ggplot options, there are several additional packages using the ggplot grammar such as ggtree, ggside, ggsignif and many others that are all very helpful in scientific data visualization. ggplot can make (almost) any kind of plot you can think of!
Making a histogram using ggplot2
Lets make a simple histogram showing the distribution of the number of contigs in our dataset
ggplot()
: Calling the ggplot functiondata = enterobase_assemblystats
: The data that we want to use, here it isenterobase_assemblystats
, which is the enterobase table that we readaes()
: These are the mapping aesthetics (i.e which column of the input dataset we want our x and y axes to be)x=`Contig Number(>=200 bp)`
: The name of the column we want on thex
axis.- Unfortunately, the table has column names with spaces and special characters like
(>=)
which can have different special meaning. So we have to take some extra steps to make sure the column names are read properly. - Here we enclose the column name in `` - these are called “backticks”.
- Now, the column name will be read literally and the special characters will not have special meanings.
- Normally you could simply say
x=column_name
if your column names don’t have any special characters
+ geom_histogram(binwidth = 5)
- Adding a new layer- In ggplot, the
+
symbol is used to signify the addition of a new layer. We usedggplot()
to tell it what data to use, and now we usegeom_
to tell it how to visually represent the data. geom_histogram
is the histogram function in ggplot.binwidth = 5
is a value for the width of the histogram “bars” thatgeom_histogram
accepts. This is the range of values that are represented by one bar in the histogram. Increase or decrease this number to see how it changes the plot!
- In ggplot, the
Making a scatterplot using ggplot2
Make a scatter plot (geom_point
) with the number of contigs on the x axis and the N50 value in the y axis
Click to reveal answer!
ggplot(data = enterobase_assemblystats,aes(x=`Contig Number(>=200 bp)`,y=N50))+
geom_point()
ggplot()
: Calling the ggplot functiondata = enterobase_assemblystats
: The data that we want to use, here it isenterobase_assemblystats
, which is the QUAST output table that we readaes()
: These are the mapping aesthetics (i.e which column of the input dataset we want our x and y axes to be)x=`Contig Number(>=200 bp), y=N50`
: The name of the columns we want on thex
andy
axes.
+ geom_point()
- Adding the scatterplot function in ggplot
Looks like there are two complete genomes in the table with very high N50 values that’s throwing off the scale!
We can make the plot with only samples having N50 values less than 4 million (<4e+06).
This means we have to subset the
enterobase_assemblystats
table, only keeping rows where theN50
column has a value less than 4 millionWe can do this like so:
$N50 < 4e+06,] enterobase_assemblystats[enterobase_assemblystats
enterobase_assemblystats$N50
: The$
sign is used to call a specific column within the dataframe. Here we are calling the “N50” column.< 4e+06
means ‘less than 4 million’. This is our conditionenterobase_assemblystats$N50 < 4e+06
: Type this in the console and look at the output! You will see a list (in R, this is called a vector) of ‘TRUE’ and/or ‘FALSE’ values. It is telling us for which rows the above condition (i.e, N50 < 4 million) is TRUE (or FALSE)- But we don’t just want to know for which rows the condition is TRUE for, we want to filter the dataframe such that we get all the rows where the condition is true.
enterobase_assemblystats[enterobase_assemblystats$N50 < 4e+06,]
: This is how you filter a dataframe in R.- The format is
dataframe[rows,columns]
- The rows we want and the columns we want are split by a comma
- Instead of
rows
- we have our condition which is all the rows where N50 < 4e+06. - We are not filtering by columns, we want all columns so we leave it blank. That’s why there is nothing after the comma
- The format is
`enterobase_assemblystats[enterobase_assemblystats$N50 < 4e+06,c(1,4,7)]` will print only the first, fourth and seventh column of the dataframe. After the **comma** we specify the columns we want, enclosed in `c()`.
You can run `colnames(dataframe)` to see the names of all the columns (and the column number). Run `colnames(enterobase_assemblystats)` on the console and find the number for the N50 column!
We can thereby alter the ggplot
command this: Note that it is identical to the previous ggplot
command, we have only changed what we give as input to data
.
ggplot(data = enterobase_assemblystats[enterobase_assemblystats$N50<4000000,],aes(x=`Contig Number(>=200 bp)`,y=N50))+
geom_point()
Try playing around with different x and y axes, and also different filters. You can also add a color
/ fill
to aes
to colour your datapoints/bars based on a specific factor.
EXERCISE: Remake the N50 vs Number of contigs scatter plot colouring the points by ‘Country’ !
Click here to see the answer!
ggplot(data = enterobase_assemblystats[enterobase_assemblystats$N50<4000000,],aes(x=`Contig Number(>=200 bp)`,y=N50,color=Country))+
geom_point()
EXERCISE: Make a boxplot showing the distribution of N50 values per country using geom_boxplot
Click here to see the answer!
ggplot(data = enterobase_assemblystats[enterobase_assemblystats$N50<4000000,],aes(x=N50,y=Country))+
geom_boxplot()
Data cleaning/wrangling
We saw how to subset/filter a dataframe. Lets look at some more examples
Get rid of NAs
You will see that your enterobase table has many missing or “NA” values. We can select only specific rows that do not have missing values. For eg: Lets select only the “Uberstrain”,“Source Niche”, “Collection Year”, “Country”, and “Sample ID” columns and remove all rows with ‘NA’ such that only rows where all four columns have values will be present
First select only the columns we want. Use
colnames(enterobase_assemblystats)
to get the column numbers for our desired columns (alternatively you can also specify the names in quotes separated by commas withinc()
).Write the subsetted dataframe to a new dataframe using
<-
<-enterobase_assemblystats[,c(1,5,8,13,29)] enterobase_assemblystats_filtered
Now remove all “NA” values from the new dataframe.
<-enterobase_assemblystats_filtered[complete.cases(enterobase_assemblystats_filtered)] enterobase_assemblystats_filtered
Cleaning non-standard column values (“unknown” “Unknown” “ND”)
You can see in the ‘Source Niche’ column in the filtered
dataframe that there are still some blank values that did not get marked appropriately as ‘NA’. There are also some columns marked ‘ND’ which is effectively the same as a missing value. These types of errors are common with public data which is why you must always double check after filtering.
How do you check if all the values in a column are OK?
An easy trick is to use the table
command, which produces frequency tables. That is, it prints out every value in a column along with the number of times that value has occurred.
table(enterobase_assemblystats_filtered$`Source Niche`)
Animal Feed Companion Animal Environment
62 12 1 60
Food Human Livestock ND
261 154 29 20
Poultry Wild Animal
82 2
Here we can see a blank value has occurred 62 times and “ND” has occurred 20 times in the filtered set. Lets remove all rows where the “Source Niche” column has value = “” (blank) or = “ND”
<-enterobase_assemblystats_filtered[!enterobase_assemblystats_filtered$`Source Niche` %in% c("","ND"),] enterobase_assemblystats_filtered
- Here, we followed the same format for filtering the dataframe as we did with the N50 values:
dataframe[rows,columns]
- For the
rows
, we entered!enterobase_assemblystats_filtered$`Source Niche` %in% c("","ND")
!
means NOT , so by starting the condition with!
, we are saying “For rows where the following condition is NOT TRUEenterobase_assemblystats_filtered$`Source Niche`
: Refers to the “Source Niche” column, enclosed in backticks because of the space character%in% c("","ND")
: means if the value is in the listc("","ND")
- which are the values we are looking for (“” is blank, and “ND” is ND)- So the full command means, Get all rows from the dataframe “enterobase_assemblystats_filtered” IF the value in the “Source Niche” column is NOT “” or “ND”
Alternatively, if you know what you are looking for, you can appropriately specify what values should be counted as NA
right when you read your dataframe using the na.strings=
option.
<-fread("/home/vishnu/UmeaU_backup/NWU_2025_workshop/NWU_2025_workshop_data/test_datasets/20251020_enterobase_minnesota_assemblystats.tsv",na.strings = c("ND","","Unkown","unknown")) enterobase_assemblystats
Merging data tables
You can merge two different data frames by a common column using merge
Read in the enterobase 7 Gene MLST table you downloaded
The advantage with fread
is that you can select exactly which columns to read in right at the start so that the whole table isn’t read. In our case, we only want column 1 and 34, so we specify that as select = c(1,34)
.
<-fread(file="/path/to/20251020_enterobase_minnesota_mlst.tsv",header=T,sep="\t",select = c(1,34)) enterobase_mlst
Lets merge the MLST table with our filtered assemblystats table using the merge
command
<-merge(enterobase_assemblystats_filtered,enterobase_mlst,by="Uberstrain") enterobase_assemblystats_mlst_merged
- The
merge
function is used like so:merge(dataframe_1,dataframe_2,by="column_name")
- As long as the column name specified is the same for both our dataframes (in our case both columns are named “Uberstrain”), it will merge, keeping only values common in both dataframes.
Adding a column by condition
You can add a new column based on a specific condition using the mutate
function. For example lets say we want a column called ‘is_human’ which will have the values “yes” if “Source Niche” equals “Human” and “no” if not.
<-mutate(enterobase_assemblystats_mlst_merged, is_human = if_else(`Source Niche` == "Human", "yes", "no")) enterobase_assemblystats_mlst_merged
- We used the
mutate
function like so:mutate(dataframe, <name for new column> = <value for new column>)
- We named the new column
is_human
, and used anif_else
condition for the value if_else(`Source Niche` == "Human", "yes", "no")
:if_else(<CONDITION>,<What to do if TRUE>,<What to do if FALSE>)
- Here the CONDITION is IF “Source Niche” equals (
==
) “Human”. If true, then the value for the new columnis_human
is “yes”, else, the value is “no”
- Here the CONDITION is IF “Source Niche” equals (
Group by and summarize
If you want the mean of a column, say N50, you can simply do:
mean(enterobase_assemblystats$N50, na.rm = TRUE)
[1] 238868.8
The na.rm
option is to remove the NA
values.
But what if you want the mean N50 value for each country? That is where the function group_by
is used
%>% group_by(Country) %>% summarise(mean_N50_per_country = mean(N50, na.rm = TRUE)) enterobase_assemblystats
# A tibble: 26 × 2
Country mean_N50_per_country
<chr> <dbl>
1 "" 455326.
2 "Austria" 230209
3 "Benin" 268875
4 "Brazil" 252957.
5 "Canada" 225783.
6 "Chile" 189261
7 "Colombia" 270202.
8 "England" 259390
9 "France" 153003.
10 "Germany" 175992.
# ℹ 16 more rows
%>%
is similar to the|
(pipe) character in bash.enterobase_assemblystats %>% group_by()
pipes the enterobase_assemblystats table to the group_by functiongroup_by(Country) %>%
: we want to group the different rows from the input by the “Country” column. This creates as many “groups” as there are countries. When we pipe this out, each “group” will be treated separately.summarise(mean_N50_per_country = mean(N50, na.rm = TRUE))
:- We piped the “groups” to the
summarise
function, which will now provide a “summary” for each group. - We are asking it to create a new column
mean_N50_per_country
mean(N50, na.rm = TRUE)
is the samemean
function we used above. We are using themean
function to get the mean of N50 values while removing any missing values usingna.rm=TRUE
- We piped the “groups” to the
- This provides an output of “Grouped” means, each group being a country.
Sources and further reading
- Enterobase: https://enterobase.warwick.ac.uk/
- GGplot2 Essentials from STHDA: https://www.sthda.com/english/wiki/ggplot2-essentials
- R Graph Gallery: https://r-graph-gallery.com/ggplot2-package.html
- Software Carpentry R course: https://swcarpentry.github.io/r-novice-gapminder/index.html