Module 1
Bash terminal basics
This module covers the basics of navigating the bash terminal, introduces some key bash commands, and describes common file formats.
What is the command line? Why is it essential?
- The command line is a way to interact with your computer simply by typing on your keyboard, without clicking on icons
- Instead of manually clicking through multiple windows and files, you type out “commands” and your computer does them for you
- This allows you to easily reproduce what you did before, automate several tasks, and handle large datasets too cumbersome for a human to manually go through.
- This also provides a great foundation for learning other programming languages
- Several bioinformatic tools do not have a graphical user interface (GUI) and need to be used from the command-line
Exploring the file system
To ‘move’ around your computer, typically you would use the built-in file explorer, where you would point and click to navigate through different files and folders.
We can also perform the same actions from the terminal by simply typing out the PATH to each file or folder
The PATH to a file is the exact location of that file on the computer.
It is very important to know at all times exactly where you are in the file system as well as where the files you are working with are.
The working directory
Open the terminal and create an empty file
touch example.txt
We created a file called example.txt
using the touch
command. Here, we simply specified the name of the file to be created, but did NOT specify where the file should be created. This means, the file got created in the WORKING DIRECTORY.
If you are creating or referring to a file without specifying the PATH , your computer will assume you are pointing to the current WORKING DIRECTORY.
The WORKING DIRECTORY refers to the directory you are currently in.
The WORKING DIRECTORY can also be signified as ./
.
While ./
will always point to the current WORKING DIRECTORY, it is merely a shorthand that refers to the actual PATH of your WORKING DIRECTORY.
The pwd
command or ‘Print Working Directory’ prints the path to the current working directory. It might look something like this
pwd
/mnt/c/Users/username/project
- The path to any location always starts with /, which signifies the root directory. The root is the starting location and it is as far back in the filesystem as you can go.
- If you imagine the filesystem as a tree with each branch being a different folder, the root directory is the base of the tree trunk.
- From the root directory /, we are inside the directory mnt, followed by c, followed by Users, and so on.
- When we created example.txt, we created it from the project directory.
- This means the PATH to example.txt is /mnt/c/Users/username/project/example.txt
The terms “folder” and “directory” are generally used interchangeably, though “directory” is the technically correct term.
The home directory
When you open your terminal, your default location will be your HOME directory
This can be signified as ~/
.
While ~/
will always point to your default HOME directory, it is merely a shorthand that is referring to the actual PATH to your HOME directory.
This path can be seen using:
echo $HOME
/mnt/c/Users/username
- The echo command simply prints out the text that follows it.
- $HOME refers to the built-in variable called HOME.
- $HOME contains a value, which is the path to your HOME directory.
- In this case, my home directory is /mnt/c/Users/username
- So if I want to point to
example.txt
, another way for me to do it is:
~/project/example.txt
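As a quick check of the shorthand described above, the sketch below compares ~ with the value stored in $HOME. The variable names home_from_variable and home_from_tilde are made up for this illustration, and the path printed will differ on your machine.

```shell
# Print the path stored in the built-in HOME variable
echo "$HOME"

# The shell expands an unquoted ~ to the same path before the command runs
echo ~

# Store both in (hypothetical) variables so they can be compared
home_from_variable="$HOME"
home_from_tilde=~

# [ a = b ] is true when the two pieces of text are identical
if [ "$home_from_variable" = "$home_from_tilde" ]; then
    echo "~ and \$HOME point to the same directory"
fi
```

Note that ~ is only expanded when it is not inside quotes, which is why it is left unquoted here.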
Changing directories
The cd
command (or “change directory”) allows you to move between different directories
For example, if we would like to move to our home directory, we can type
cd ~/
In the above example, you would directly move to your home directory no matter where in the filesystem you were, as ~/
always refers to your home directory.
Now our current working directory is our home directory.
If we would like to switch to the project
directory, we can type
cd project
We can move out one level from the current working directory using ../
. Similar to how ./
always refers to the current directory, ../
always refers to one directory above the current directory.
For example, if our current directory is /mnt/c/Users/username/project/
cd ../
pwd
/mnt/c/Users/username/
cd ../
takes us one directory above project
, which in this case is username
. If we were to repeat this, we would go up one more directory, taking us to Users
:
cd ../
pwd
/mnt/c/Users/
Therefore ./
and ../
are shorthands that refer to paths that change depending on the current working directory. In other words, they are RELATIVE .
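The steps above can be tried safely in a throwaway directory. This is a minimal sketch: the scratch location comes from mktemp -d, and the Users/username/project folders are created only to mirror the example. basename (not covered in this module) prints just the last part of a path.

```shell
# Work inside a throwaway directory so nothing in your real folders is touched
scratch=$(mktemp -d)
mkdir -p "$scratch/Users/username/project"

cd "$scratch/Users/username/project"
pwd                              # path ends in /Users/username/project

cd ../                           # go up one level
up_one=$(basename "$(pwd)")
echo "$up_one"                   # username

cd ../                           # go up one more level
up_two=$(basename "$(pwd)")
echo "$up_two"                   # Users

cd ./username/project            # move back down using a relative path
back_down=$(basename "$(pwd)")
echo "$back_down"                # project
```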
Absolute vs. Relative Paths
There are two main types of paths when referring to files and directories: ABSOLUTE and RELATIVE paths.
Absolute Paths always start from the root directory /
and specify the exact location of a file or folder in the filesystem. For example, the absolute path to example.txt
is:
/mnt/c/Users/username/project/example.txt
Relative Paths are paths that are relative to the current working directory. For instance, if you are currently in the username directory, you can refer to example.txt
by:
./project/example.txt
If you are in project
, and you want to refer to another file another_example.txt
that is in Users
, you can refer to it relative to your current location such as
../../another_example.txt
../../
goes two levels above project
to reach Users
and then refers to another_example.txt
.
It’s generally a good idea to use absolute paths when you’re working with files or directories, especially when writing scripts or commands that could be run from different locations. This ensures that you are always referring to the exact file, no matter where you are in the filesystem.
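To see why a relative path depends on where you are while an absolute path does not, here is a small sketch in a scratch directory. The -ef and -e checks are bash test operators that are not covered in this module: -ef is true when two paths point to the same file, and -e is true when a path exists. All variable names are made up for the illustration.

```shell
# Build a small folder structure in a throwaway location
scratch=$(mktemp -d)
mkdir -p "$scratch/username/project"
touch "$scratch/username/project/example.txt"

# From the username directory, the relative and absolute paths reach the same file
cd "$scratch/username"
if [ ./project/example.txt -ef "$scratch/username/project/example.txt" ]; then
    same_from_username="yes"
else
    same_from_username="no"
fi
echo "Same file from username: $same_from_username"          # yes

# After moving into project, the same relative path no longer resolves...
cd project
if [ -e ./project/example.txt ]; then
    relative_still_works="yes"
else
    relative_still_works="no"
fi
echo "Relative path still works: $relative_still_works"      # no

# ...but the absolute path still does
if [ -e "$scratch/username/project/example.txt" ]; then
    absolute_still_works="yes"
else
    absolute_still_works="no"
fi
echo "Absolute path still works: $absolute_still_works"      # yes
```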
The autocomplete feature
When you are typing out filenames or directory names in the shell, it will attempt to automatically complete what you are typing with the file/directory names found in the location you are typing (or in the directory you are currently in, if you have not typed a path). For example, say you want to change to the directory /mnt/c/Users/username/
.
When you begin typing /mnt/c/Users/us
and then press the tab
(⭾) button on your keyboard, it will automatically complete to /mnt/c/Users/username/
.
If you have multiple folders that begin with user
, it will complete to the furthest possible character. For example if you have the following files in your Users
directory
$: ls Users
user1 user2 username
The autocomplete feature will complete /mnt/c/Users/us
to /mnt/c/Users/user
, after which you can type n
to make it /mnt/c/Users/usern
and then press tab
again, which will then complete it to /mnt/c/Users/username
The autocomplete feature works the same way for files, for example /mnt/c/Users/username/project/exa
will complete to /mnt/c/Users/username/project/example.txt
.
Note that file and directory names are CASE-SENSITIVE: example.txt
and Example.txt
will be treated as different files. If you type exa
and press tab
, it will never complete to Example.txt
.
Other useful commands
mkdir
: Makes a new directory called dir1
in the current directory (project
)
mkdir dir1
I recommend making a separate project
directory (or any name of your choosing) for this workshop to perform all upcoming exercises.
In general it is good practice to not use spaces or special characters ( such as ?!<>,[]{}()
) in your file or folder names as usually, spaces and special characters have a SPECIAL MEANING in the command line and stand for specific operations. This can lead to the system getting confused when you refer to a folder with a special character, as it will assume you are performing an operation that the character stands for. For example, the space character is usually the separator for different options/modifiers for command line tools. Use _
instead of spaces.
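The sketch below illustrates the problem with spaces, using a scratch directory and made-up folder names. Because the space separates arguments, mkdir receives two names instead of one.

```shell
# Work in a throwaway directory
scratch=$(mktemp -d)
cd "$scratch"

# The space splits this into two arguments, so TWO directories are created
mkdir my data
ls                                 # lists: data  my

# Count how many entries were created
unquoted_count=$(ls | wc -l)
echo "$unquoted_count"             # 2

# An underscore keeps the name as a single argument
mkdir my_data
ls                                 # lists: data  my  my_data
```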
ls
: Lists all contents in the specified directory.
In the below example we list all the contents in the directory project
ls /mnt/c/Users/username/project/
dir1
example.txt
cp
: Creates a copy of a file.
The below example copies example.txt
to dir1
.
cp example.txt ./dir1/
mv
: Moves a file from one location to another.
The below example moves example.txt
from dir1
to dir2
.
mkdir dir2
mv ./dir1/example.txt ./dir2/
cat
: Display all the contents within a file.
To display all the contents of example.txt:
cat example.txt
In our case since we did not write anything into example.txt
it will be empty. Feel free to open example.txt
in your text editor, write some text into it and then try out the cat
command.
head
and tail
: Display the first n or last n lines in a file
By default head
or tail
will display the first 10 or last 10 lines of a file.
head example.txt
rm
: Removes (deletes) a specified file. In the below example we delete example.txt
in dir2
rm /mnt/c/Users/username/project/dir2/example.txt
Be very careful when using the rm
command as there will be no confirmation dialogue box asking you if you are sure you want to delete the file as you might be used to. The file will simply PERMANENTLY be deleted!
man
: Shows the manual (help) page for any command.
You can also follow any command with either -h
or --help
to display the help page. Look up the help page of the above commands to see all the different ways they can be used!
man ls
ls --help
Feel free to experiment with the above commands, navigating between different folders, creating and moving files, using absolute and relative paths and so on.
You can also add modifiers to each command to slightly change its behaviour. For example ls
simply lists the names of all the files in the specified directory but ls -l
lists them with a lot more additional information! You can also use multiple modifiers at a time like ls -l -a
(What does -a
do?) . Make sure to look into the --help
pages for the commands to know more.
EXERCISE: Try to find the modifiers for each of the commands above to do the following :
- List files along with the file sizes
Click to reveal answer
ls -sh
- List files sorted by the time they were last modified
Click to reveal answer
ls -lt
- Print the first 30 lines from a file
Click to reveal answer
head -n 30 filename.txt
- Print all lines starting from the 3rd line
Click to reveal answer
tail -n +3 filename.txt
- Copy or move a directory along with all the contents
Click to reveal answer
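cp -r /path/to/dir /path/to/new_dir
mv /path/to/dir /path/to/new_dir # mv moves a directory and its contents without any extra modifier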
- Delete a directory
Click to reveal answer
rm -r /path/to/dir
Files and File Extensions
Most files on your computer have a file extension, which is part of the file's name and comes after the last period (dot) in the filename. For example, in example.txt
, .txt
is the file extension.
File extensions are used by the operating system to determine what kind of file it is, what program should open it, and how it should be processed. Some common file extensions include:
- .txt: Text file
- .csv: A table/dataframe with comma separated values
- .docx: Microsoft Word document
- .pdf: PDF document
- .jpg: JPEG image file
- .sh: Shell script
Why Are File Extensions Important?
File extensions are important because they help the system identify how to handle a file. For example:
- A .txt file is usually opened with a simple text editor (e.g., Notepad, VSCode, etc.).
- A .docx file is typically opened in Microsoft Word or similar word processing software.
The extension does not determine the content of a file, just how it is generally interpreted by the system. You can change a file extension, but it doesn’t change the underlying content, though it might prevent the file from being opened correctly. For example you can change the extension of a .jpg
image to .txt
, but that doesn’t mean you can open it in Notepad as it will still be an image.
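The same point can be shown from the terminal in the other direction: renaming a text file so that it ends in .jpg does not turn it into an image. This is a small sketch with made-up file names. It uses > to write text into a file, which is explained in the section on standard output redirection.

```shell
# Work in a throwaway directory
scratch=$(mktemp -d)
cd "$scratch"

# Create a small plain text file
echo "this is plain text" > notes.txt

# Change only the extension; the bytes inside the file are untouched
mv notes.txt notes.jpg

# cat still prints the original text
renamed_content=$(cat notes.jpg)
echo "$renamed_content"            # this is plain text
```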
Plain Text Files
What is Plain Text?
Plain text files contain ONLY text. There is no additional formatting (such as bold/italics/colours), images, or embedded metadata. These files can be created and edited with simple text editors such as NotePad (Windows), TextEdit (MacOS), gedit (Ubuntu)
A Word or Google Docs document containing only text is still not a plain text file. It still has a lot of embedded formatting and metadata.
Plain text is UNIVERSAL. It is not a proprietary format that can only be read by specific programmes (for example, .docx
). Any text editor across any system can read and manipulate plain text files.
In bioinformatics, plain text files are preferred for storing data because they can be easily read, processed, and manipulated by a variety of different software. Scripts for some programming languages such as Python, Bash, and R are also in plain text format.
However, not all bioinformatics data comes in plain text. Many data formats are designed for efficient storage, performance, or specific analysis tasks, and they may be binary formats (e.g., BAM). In bioinformatics workflows, plain text formats are still commonly used for various data exchange and reporting.
For some programming languages (for example, C and Rust), the source code needs to be compiled before it can be executed. While the source code itself is in plain text, it cannot be executed until it is compiled into a binary. In the case of Python, R and Bash, no separate compilation step is needed.
Here are some of the most common plain text files used in bioinformatics:
1. .csv or .tsv
This will probably be the most common file format you will be directly interacting with on a regular basis. csv
(Comma Separated Values) or tsv
(Tab separated values) are used to store tabular data with commas (in csv) or tabs (in tsv) as the separator in-between values in the table. In other words, it is a way to represent a table in plain-text, with rows separated by new lines and columns separated by commas or tabs. This separator is referred to as the delimiter, with tsv
files also being referred to as tab delimited files and csv
as comma delimited. Many of the common bioinformatic file formats we will be discussing below are also tab delimited files.
Example .tsv file:
sample_name Gene_symbol Sequence_name
SAMN09634554 sseK2 type III secretion system effector arginine glycosyltransferase SseK2
SAMN42760649 mdsB multidrug efflux RND transporter permease subunit MdsB
SAMN07714001 golT gold/copper-translocating P-type ATPase GolT
Same example as a .csv
sample_name,Gene_symbol,Sequence_name
SAMN09634554,sseK2,type III secretion system effector arginine glycosyltransferase SseK2
SAMN42760649,mdsB,multidrug efflux RND transporter permease subunit MdsB
SAMN07714001,golT,gold/copper-translocating P-type ATPase GolT
A .csv
or .tsv
may not be the most convenient way for us to visualize a table compared to software like Excel that neatly separates each value into individual cells. However, when you are working with large datasets having thousands to millions of rows, these formats enable extremely fast and efficient processing of the data. Moreover, several bioinformatics tools will use .csv
and/or .tsv
files as their inputs or outputs. So it is very important you are familiar with handling these files.
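As a rough illustration of handling delimited files from the terminal, the sketch below builds a tiny .tsv from two rows of the example above and pulls out one column. The file names are made up, and cut is a standard command that is not otherwise covered in this module. The last step assumes that no value contains a comma, which is not safe for real data in general.

```shell
# Work in a throwaway directory
scratch=$(mktemp -d)
cd "$scratch"

# printf turns each \t into a real tab and each \n into a new line
printf 'sample_name\tGene_symbol\n'  > genes.tsv
printf 'SAMN09634554\tsseK2\n'      >> genes.tsv
printf 'SAMN42760649\tmdsB\n'       >> genes.tsv

# cut -f 2 prints the second column; tab is its default delimiter
second_column=$(cut -f 2 genes.tsv)
echo "$second_column"

# Swap every tab for a comma to get a csv (only safe if no value contains a comma)
cat genes.tsv | tr '\t' ',' > genes.csv
first_csv_line=$(head -n 1 genes.csv)
cat genes.csv
```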
2. FASTA
- FASTA is one of the most widely used formats for representing biological sequences, such as DNA, RNA, or protein sequences. A FASTA file can consist of multiple sequences. Each entry in a FASTA file has two components:
  - Description line: Starts with >, followed by an identifier and optional description.
  - Sequence: The actual sequence of nucleotides (DNA) or amino acids (protein). The sequence can be in a single line or broken up into multiple lines.
- In a genome FASTA file, each sequence is called a contig. In most cases a genome will not be assembled “perfectly” into a single long stretch of DNA, and will be broken up into contigs. This is covered in detail in Module 2.
- Extensions: .fasta, .fna, .fa, .faa
Example FASTA format:
>HNLNHC_00005 Oxaloacetate decarboxylase
GTGCGCGAGGACCTTGGCTTTATCCCGCTGGTGACCCCCACTTCACAGATT
GTCGGCACCCAGGCGGTGCTCAACGTCCTGACCGGCGAACGCTACAAAACC
>HNLNHC_00010 Oxaloacetate decarboxylase beta chain 2
ATGGAAAGTCTGAACGCCCTGCTTCAGGGCATGGGGCTGATGCACCTTGGC
GCAGGCCAGGCCATCATGCTGCTGGTGAGCCTGCTGCTGCTGTGGCTGGCG
3. FASTQ
- The FASTQ format is an extension of the FASTA format, used for storing raw sequence data with quality scores. Each entry contains four lines:
  - Header: Starts with @ followed by the sequence identifier. For paired-end sequencing, this identifier is what keeps track of the “mates”.
  - Sequence: The actual sequence of nucleotides.
  - Plus line: Starts with a + sign (may also be followed by the same sequence identifier).
  - Quality scores: Per-base quality scores corresponding to the sequence in line 2. Each symbol here corresponds to a numeric score for the base call in the corresponding position. Learn more about quality scores here.
- Extensions: .fastq, .fq. Typically paired-end fastq reads for a single sample will have the suffixes _R1.fastq and _R2.fastq.
Example FASTQ format:
@SRR16006951.1 1 length=149
CTGTTCGATATTGCCGCCTTGCGCCCCGCGCCGCTCACCCCGCTGGTGGCATTAATTACCGGCCACTGCGTCAGATCCAAAAGACCGCCGTCAATCAGCGGTTTTAGCGACAACTGCGCTGCGGTTGGATAGCAACCAGGAACCGCAAT
+SRR16006951.1 1 length=149
AAAAAEEEEEEAEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE/EEEEEEEEEAEEEEEEEEEEAAEAEEEEEEEEEEEEEEEEEEAAEEEEEEEEEEEEEEEEEEEEEEEA</EEEEE/<EAAAE/<
@SRR16006951.2 2 length=148
GTCGCCGTGTTTCTCTCCTTTGATGGCGAACTCGACACCCAGCCGCTGATAGAACAGCTATGGCAAGCGGGCAAGCGCGTCTACCTTCCGGTTCTTCATCCCTTCAGCCCTGGCAACCTCCTGTTCCTGCACTATCATACGCAGAGTG
+SRR16006951.2 2 length=148
AAAAAEEEEAEE/EEE/EAEEAEEAEEEEEEA<AEEEEEE//EEAE<EEEA/EE<EAEEEA<AEEE/E<EEEEEAEAE<EA/E<A6E/E/E/E/<E/E</<A/E<</EEE/<EE/EAE//A/<<E/E/6<EA/A</A//</AE/E/E/
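Because each FASTQ entry takes exactly four lines, the number of reads can be estimated by dividing the line count by four. The sketch below does this on a made-up two-read file (toy.fastq, with invented reads). It uses redirection and piping, which are explained later in this module, and assumes the file is well formed with no blank lines.

```shell
# Work in a throwaway directory
scratch=$(mktemp -d)
cd "$scratch"

# A toy FASTQ file with two made-up reads (four lines each)
printf '@read1\nACGT\n+\nEEEE\n'  > toy.fastq
printf '@read2\nTTGA\n+\nAAEE\n' >> toy.fastq

# Count the lines, then divide by four to get the number of reads
total_lines=$(cat toy.fastq | wc -l)
read_count=$(( total_lines / 4 ))
echo "$read_count"                 # 2
```

Counting lines is more reliable than searching for the @ symbol, since @ is also a valid quality score character and can appear at the start of a quality line.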
4. GFF
- GFF or General Feature Format is a tab delimited file with information about genes and other features from a FASTA file.
- Once a FASTA file is annotated, i.e. by running some kind of gene finding/functional annotation software, specific regions in the FASTA file will be identified as genes (or other genetic features such as non-coding RNA, tRNA and so on).
- A GFF file may vary depending on exactly how it was generated, but in general it has metadata lines starting with ##, followed by one line per feature with columns for:
  - Sequence ID/contig ID that was annotated
  - The specific feature (CDS/ncRNA etc)
  - The start and end positions of the feature on the contig
  - More details regarding the annotation
- Extensions: .gff, .gff3
Example GFF format :
##gff-version 3
##feature-ontology https://github.com/The-Sequence-Ontology/SO-Ontologies/blob/v3.1/so.obo
contig_1 CDS 90 1232 ID=HCEKFJ_00005;Name=Biotin/lipoyl-binding protein;product=Biotin/lipoyl-binding protein;
contig_46 ncRNA 1728 1828 ID=HCEKFJ_23455;Name=RNAI;gene=RNAI;product=RNAI;
5. GenBank
- GenBank is also a gene feature/annotation format containing information about genes and other features, but is larger and contains more information than the gff file. For example:
  - Information regarding the person/institution that generated the file, with contact information
  - Whether the record is connected to a specific publication
  - Amino acid sequences for each annotated gene
- Extensions: .gbk, .gbff
Example GenBank format : NCBI Sample Genbank Record
6. SAM/BAM
- SAM (Sequence Alignment/Map) is a tab delimited file with information about the alignment of sequences to a reference genome.
- Depending on how the SAM file was generated, the formatting can vary slightly but the first few columns are mandatory. In general all SAM files have
- The identifier of read that was mapped
- The SAM flag describing certain properties of the mapped read
- The reference contig ID, mapping quality
- The mapping position
- A compressed representation of the alignment called CIGAR (for eg: if any insertions/deletions are present and at what position)
- See here for a more detailed explanation of SAM files.
- BAM is the binary equivalent of SAM. BAM files are not in plain-text; they are compressed (and usually also indexed) versions of SAM files. This makes BAM files more efficient in terms of storage and allows fast querying for alignments across specific positions without having to search through the whole file.
- Extensions: .sam, .bam. A BAM index will have the extension .bai
Example SAM format:
@HD VN:1.6 SO:coordinate
@SQ SN:ref LN:45
r001 99 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA *
r003 0 ref 9 30 5S6M * 0 0 GCCTAAGCTAA * SA:Z:ref,29,-,6H5M,17,0;
7. VCF
- VCF or Variant Call Format is a tab delimited file that stores genetic variation data, typically generated from the alignments in a SAM/BAM file.
- A VCF file contains file metadata (lines starting with ##) followed by genetic variants (SNPs/indels) at each position when compared to the reference.
- In the example below, you can see that the lines starting with ## have information about the file and the INFO and FORMAT columns.
- Similar to the SAM file, the VCF file can have slight differences depending on how it was generated but the first few columns are mandatory.
- For eg: A vcf file generated from an alignment against an annotated reference will also provide the functional nature of the variant, for eg: if a SNP was found in a coding region, the VCF file will have additional information (eg: Synonymous/Non-synonymous SNP,if indel causes frameshift)
- The binary equivalent of a VCF file is BCF
- Extensions: .vcf, .bcf
Example VCF Format:
##fileformat=VCFv4.2
##reference=GRCh37
##INFO=<ID=AB,Number=A,Type=Float,Description="Allele balance at heterozygous sites: a number between 0 and 1 representing the ratio of reads showing the reference allele to all reads, considering only reads from individuals called as heterozygous">
##INFO=<ID=AO,Number=A,Type=Integer,Description="Count of full observations of this alternate haplotype.">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total read depth at the locus">
##INFO=<ID=QA,Number=A,Type=Integer,Description="Alternate allele quality sum in phred">
##INFO=<ID=QR,Number=1,Type=Integer,Description="Reference allele quality sum in phred">
##INFO=<ID=RO,Number=1,Type=Integer,Description="Count of full observations of the reference haplotype.">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=RO,Number=1,Type=Integer,Description="Reference allele observation count">
##FORMAT=<ID=QR,Number=1,Type=Integer,Description="Sum of quality of the reference observations">
##FORMAT=<ID=AO,Number=A,Type=Integer,Description="Alternate allele observation count">
##FORMAT=<ID=QA,Number=A,Type=Integer,Description="Sum of quality of the alternate observations">
##FORMAT=<ID=GL,Number=G,Type=Float,Description="Genotype Likelihood, log10-scaled likelihoods of the data given the called genotype for each possible genotype generated from the reference and alternate alleles given the sample ploidy">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
NZ_CP019184.1 39681 . T A 682.658 . AB=0;AO=20;DP=20;QA=780;QR=0;RO=0;TYPE=snp GT:DP:RO:QR:AO:QA:GL
NZ_CP019184.1 40836 . A G 682.658 . AB=0;AO=20;DP=20;QA=780;QR=0;RO=0;TYPE=snp GT:DP:RO:QR:AO:QA:GL
NZ_CP019184.1 48112 . A G 682.658 . AB=0;AO=20;DP=20;QA=780;QR=0;RO=0;TYPE=snp GT:DP:RO:QR:AO:QA:GL
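Since metadata and header lines in a VCF start with #, the variant records can be separated from them with the grep command, which is introduced later in this module. The sketch below uses a made-up, heavily trimmed file (toy.vcf, with only the first five columns), so it is an illustration of the idea and not a valid VCF. The ^ in "^#" restricts the match to lines that start with #, and cut is a standard command not otherwise covered here.

```shell
# Work in a throwaway directory
scratch=$(mktemp -d)
cd "$scratch"

# A trimmed toy VCF: one metadata line, one header line, two variant lines
printf '##fileformat=VCFv4.2\n'               > toy.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\n'         >> toy.vcf
printf 'NZ_CP019184.1\t39681\t.\tT\tA\n'     >> toy.vcf
printf 'NZ_CP019184.1\t40836\t.\tA\tG\n'     >> toy.vcf

# -v keeps lines that do NOT start with #, and -c counts them
variant_count=$(grep -v -c "^#" toy.vcf)
echo "$variant_count"              # 2

# Print only the POS column (column 2) of the variant lines
positions=$(grep -v "^#" toy.vcf | cut -f 2)
echo "$positions"
```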
Working with files in the terminal
Now that we have gone through some common file formats we will be working with, let’s learn how to manipulate these files and extract information from them in the terminal.
Standard output redirection
When you enter any command in the terminal and see the output on your screen - this output is called the standard output or stdout
. For eg: the ls
command lists the files and folders in the current directory. Everything that is listed is the stdout
ls
dir1
dir2
example.txt
The stdout
of any command can be redirected into a file using >
.
ls > list_of_files.txt
Note that there is no stdout
printed this time. This is because we redirected the stdout
into list_of_files.txt
. Now you can see the contents of list_of_files.txt
using the cat
command that we learned before.
cat list_of_files.txt
dir1
dir2
example.txt
> REDIRECTION IS DANGEROUS
Be very careful when writing to a file with > because, if you happen to specify an existing file, it will OVERWRITE all the contents of the existing file with whatever you are redirecting into it. This process is IRREVERSIBLE and you will lose all the original contents of the file. If you want to append (or add to) an existing file, you can use >> and it will add new lines instead of overwriting the existing lines.
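The difference between the two operators can be seen safely in a scratch directory. This is a small sketch with a made-up file name (notes.txt).

```shell
# Work in a throwaway directory
scratch=$(mktemp -d)
cd "$scratch"

# > creates notes.txt (or replaces it if it already exists)
echo "first line" > notes.txt

# >> adds to the end of the existing file
echo "second line" >> notes.txt
after_append=$(cat notes.txt)
echo "$after_append"               # prints both lines

# > again: the two earlier lines are gone for good
echo "replacement" > notes.txt
after_overwrite=$(cat notes.txt)
echo "$after_overwrite"            # replacement
```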
It is also possible to redirect the stdout
of one command into the input of another command using the “pipe” character, |
. This process is called piping. For example, let’s redirect the output of ls
to a new command wc
. wc
prints the number of lines, words, and characters present in the given input. In this case, the input would be whatever the ls
command outputs. Here, wc
is telling us that there are 3 lines, 3 words and 22 characters.
ls | wc
3 3 22
Performing the above command is basically the same as performing:
ls > list_of_files.txt
wc list_of_files.txt
But by using piping, we were able to do it much faster and without writing a new file.
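Pipes can also be chained, with each command passing its output on to the next. The sketch below recreates the three example entries in a scratch directory so the counts are predictable. The variable names are made up for the illustration.

```shell
# Recreate the example contents in a throwaway directory
scratch=$(mktemp -d)
cd "$scratch"
mkdir dir1 dir2
touch example.txt

# One pipe: count the entries listed by ls
line_count=$(ls | wc -l)
echo "$line_count"                 # 3

# Another single pipe: keep only the first entry
first_entry=$(ls | head -n 1)
echo "$first_entry"                # dir1

# Two pipes: keep the first two entries, then count them
first_two_count=$(ls | head -n 2 | wc -l)
echo "$first_two_count"            # 2
```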
Extracting specific information
The grep
command will search for a string (a specific piece of alphanumeric text) within a plain-text file. It is similar to the ctrl + F
or find feature that you may have commonly used to search for text within documents. For example, if we want to search for the string dir
in list_of_files.txt
, we can do
grep "dir" list_of_files.txt
dir1
dir2
grep
by default will output ALL lines containing the matching string. The string matching is CASE-SENSITIVE, therefore grepping for the string Dir
in this case will give an empty result. However if you want grep to perform a CASE-INSENSITIVE search, you can do:
grep -i "Dir" list_of_files.txt
dir1
dir2
In the above example, grep
ignores the case of the string to be searched. There are many such useful modifiers, check man grep
or grep --help
to see the different behaviours. Another useful modifier is -v
for INVERTED matching. This will output only lines that DO NOT HAVE the search string.
grep -v "dir" list_of_files.txt
example.txt
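The modifiers above can be combined with -c, which makes grep report the number of matching lines instead of the lines themselves. The sketch below writes the three example names into a fresh list_of_files.txt inside a scratch directory, so the counts are known in advance.

```shell
# Work in a throwaway directory with a known file
scratch=$(mktemp -d)
cd "$scratch"
printf 'dir1\ndir2\nexample.txt\n' > list_of_files.txt

# -c counts the lines that contain the search string
matching=$(grep -c "dir" list_of_files.txt)
echo "$matching"                   # 2

# -i ignores case, so DIR still matches dir1 and dir2
case_insensitive=$(grep -c -i "DIR" list_of_files.txt)
echo "$case_insensitive"           # 2

# -v inverts the match, so this counts lines WITHOUT dir
not_matching=$(grep -c -v "dir" list_of_files.txt)
echo "$not_matching"               # 1
```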
If grep
is similar to the find operation, then sed
is the equivalent of find + replace. sed
can substitute a specific character, string or pattern with another. By default, sed
will replace the first occurrence of the search string in each line. The syntax for sed
substitutions is as follows:
sed 's/search_string/replace_string/' filename
Lets say we want to replace dir
in list_of_files.txt
with Directory
sed 's/dir/Directory/' list_of_files.txt
Directory1
Directory2
example.txt
Note that sed
is simply performing the specified action in the stdout
. It is not actually changing the text within list_of_files.txt
. If you would like to REWRITE list_of_files.txt
with the specified action, you can perform an in-place substitution with the -i
modifier in sed
. THIS IS IRREVERSIBLE.
sed -i 's/dir/Directory/' list_of_files.txt
cat list_of_files.txt
Directory1
Directory2
example.txt
Both grep
and sed
have the -i
modifier, however in grep
it means to perform case-insensitive matches, while in sed
it means to perform the substitution in-place. It is very important to note that the same modifier can mean different things for different tools. It is always a good idea to check the help page for each tool. How will you perform a case-insensitive substitution in sed
?
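Because sed replaces only the first occurrence in each line by default, a line with several matches is only partly changed. Adding g (for global) after the final / makes sed replace every occurrence in the line. A minimal sketch using echo to supply a single line of made-up input:

```shell
# Default behaviour: only the first dir in the line is replaced
one_replaced=$(echo "dir1 dir2 dir3" | sed 's/dir/Directory/')
echo "$one_replaced"               # Directory1 dir2 dir3

# With the g flag: every dir in the line is replaced
all_replaced=$(echo "dir1 dir2 dir3" | sed 's/dir/Directory/g')
echo "$all_replaced"               # Directory1 Directory2 Directory3
```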
Another useful command is tr
or translate . While it is similar to sed
in that it performs find+replace tasks, tr
is faster and easier to use for character-by-character operations, while sed
is better for longer strings. For example, if you want to replace a single character with another, tr
can do it very easily. For example, lets pipe the sed
output to tr
to convert Directory
into directory
sed 's/dir/Directory/' list_of_files.txt | tr 'D' 'd'
directory1
directory2
example.txt
In the above command, tr
looks for all instances of D
and converts it into d
. While this can be done with sed
, sed
by default only performs operations for the FIRST INSTANCE PER LINE. In other words, sed
splits your input line by line, and performs the operation independently for each line.
tr
performs the operation on a character by character basis, and therefore can work across different lines, as the newline (or \n
- which is the character that actually represents a new line) itself is treated as just another character. This allows you to do operations such as the one below very easily.
sed 's/dir/Directory/' list_of_files.txt | tr '\n' ' '
Directory1 Directory2 example.txt
Here, we replaced all newline characters (\n) with spaces.
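tr can also translate whole sets of characters at once: each character in the first set is swapped for the character at the same position in the second set. Two small sketches on made-up sequences follow. Note that the second one gives the complement of the sequence, not the reverse complement.

```shell
# Convert lowercase letters to uppercase
upper=$(echo "atgccg" | tr 'a-z' 'A-Z')
echo "$upper"                      # ATGCCG

# Swap each base for its complement: A<->T and C<->G
complement=$(echo "ATGCCG" | tr 'ACGT' 'TGCA')
echo "$complement"                 # TACGGC
```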
Working with bioinformatics data
Let’s download some real bioinformatics data and see how we can use what we have learned so far.
If you have performed the setup.sh
step in Module0 , you can skip to the exercise. The file should be present in NWU_2025_workshop_data/test_datasets/GCA_049744075.1
Activate your conda environment so that the necessary bioinformatics packages are loaded. See Module0: Setup for more information.
conda activate NWU_2025_workshop
Download a genome fasta file using NCBI’s datasets
tool. The datasets
tool should already be installed in the conda environment. You can see how to use the datasets
tool by typing datasets --help
. To download a genome for a specific accession number, the command is :
datasets download genome accession GCA_049744075.1
This will download a zipped archive from NCBI. Let’s extract the archive using unzip
, take the file we need and then remove the rest.
unzip ncbi_dataset.zip
mkdir test_datasets
mv ncbi_dataset/data/GCA_049744075.1/GCA_049744075.1_ASM4974407v1_genomic.fna test_datasets/
rm -r ncbi_dataset.zip ncbi_dataset # you can keep specifying files/folders after the rm command to remove them all
EXERCISE: Perform the following actions on the above fasta file
- Count the number of contigs in the fasta file
Click to reveal answer
The grep
command can be used to count the number of contigs (or sequences) in a FASTA file. Each sequence in a FASTA file starts with a header line (>
), so we can grep the >
character to get only the header lines, and add the -c
flag to grep so that it reports the number of matches.
grep -c ">" GCA_049744075.1_ASM4974407v1_genomic.fna
Alternatively, we can also use wc -l
to report the number of lines matched, as grep
by default outputs all matching lines
grep ">" GCA_049744075.1_ASM4974407v1_genomic.fna | wc -l
- Extract only sequence names from the FASTA File and write it to a new file.
click to reveal answer
We can extract only sequence names by grepping the “>” character as done previously, but now we pipe it to sed
to remove only the >
, keeping only the actual names themselves, and then redirect that output to a new text file.
grep ">" GCA_049744075.1_ASM4974407v1_genomic.fna | sed 's/>//' > GCA_049744075.1_header_names.txt
- Count the total number of bases in the FASTA file.
click to reveal answer
We use inverted matching with grep -v
to get all lines EXCEPT lines starting with “>”, then we remove all newlines using tr -d
, and then use wc
to count the number of characters. We have to remove all newlines (\n
) as each \n
is considered a separate character, but here we only want to count the number of nucleotides.
grep -v ">" GCA_049744075.1_ASM4974407v1_genomic.fna | tr -d '\n' | wc -c
- In silico restriction digestion - how many fragments will you get if you digest your genome with EcoRI?
click to reveal answer
The recognition site for EcoRI is GAATTC
. We will only get the DNA sequence from our FASTA file using grep
and tr
as above, but now pipe it to sed
and replace each occurrence of GAATTC
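with a newline (in effect, “cutting” our DNA), then we count the total number of lines. Note that this treats all contigs as one joined sequence, and the count can be off by one depending on whether the output ends with a newline, so treat the result as an estimate of the number of fragments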
grep -v ">" GCA_049744075.1_ASM4974407v1_genomic.fna | tr -d '\n' | sed 's/GAATTC/\n/g' | wc -l
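Before trusting pipelines like these on a real genome, it can help to run them on a tiny file where the answers are known in advance. The sketch below uses a made-up two-contig file (toy.fna) with 16 bases and exactly one GAATTC site. It counts cut sites with grep -o, which prints each match on its own line; this is an alternative to the sed approach above, not the workshop's own answer.

```shell
# Work in a throwaway directory
scratch=$(mktemp -d)
cd "$scratch"

# Toy FASTA: contig_1 has 12 bases over two lines, contig_2 has 4 bases
printf '>contig_1\nAAGAATTCAA\nCC\n'  > toy.fna
printf '>contig_2\nGGGG\n'           >> toy.fna

# Number of contigs = number of header lines
contig_count=$(grep -c ">" toy.fna)
echo "$contig_count"               # 2

# Total bases = characters left after dropping headers and newlines
base_count=$(grep -v ">" toy.fna | tr -d '\n' | wc -c)
echo "$base_count"                 # 16

# Number of EcoRI sites: -o prints every match on its own line
site_count=$(grep -v ">" toy.fna | tr -d '\n' | grep -o "GAATTC" | wc -l)
echo "$site_count"                 # 1
```

For a single linear molecule, N cut sites give N + 1 fragments, while a circular molecule gives N fragments.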