Module 1

Bash terminal basics

This module covers the basics of navigating the bash terminal, some key bash commands, and common file formats.

What is the command line? Why is it essential?

  • The command line is a way to interact with your computer simply by typing on your keyboard, without clicking on icons
  • Instead of manually clicking through multiple windows and files, you type out “commands” and your computer does them for you
  • This allows you to easily reproduce what you did before, automate several tasks, and handle large datasets too cumbersome for a human to manually go through.
  • This also provides a great foundation for learning other programming languages
  • Several bioinformatic tools do not have a graphical user interface (GUI) and need to be used from the command-line

Exploring the file system

To ‘move’ around your computer, typically you would use the built-in file explorer, where you would point and click to navigate through different files and folders.

We can also perform the same actions from the terminal by simply typing out the PATH to each file or folder

The PATH to a file is the exact location of that file on the computer.

It is very important to know at all times exactly where you are in the file system as well as where the files you are working with are.

The working directory

Open the terminal and create an empty file

touch example.txt

We created a file called example.txt using the touch command. Here, we simply specified the name of the file to be created, but did NOT specify where the file should be created. This means the file was created in the WORKING DIRECTORY.

If you are creating or referring to a file without specifying the PATH , your computer will assume you are pointing to the current WORKING DIRECTORY.

The WORKING DIRECTORY refers to the directory you are currently in.

The WORKING DIRECTORY can also be signified as ./ .

While ./ will always point to the current WORKING DIRECTORY, it is merely a shorthand for the actual PATH to your working directory.
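
As a quick check (a minimal sketch, assuming you are still in the directory where you created example.txt), both of the commands below point to the same file - one written without any path and one using the ./ shorthand:

ls example.txt
ls ./example.txt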

The pwd command or ‘Print Working Directory’ prints the path to the current working directory. It might look something like this

pwd
/mnt/c/Users/username/project
  • The full path to any location always starts with /, which signifies the root directory. The root is the starting point of the filesystem and is as far back as you can go.
  • If you imagine the filesystem as a tree with each branch being a different folder, the root directory is the base of the tree trunk
  • From the root directory /, we are inside the directory mnt, followed by c, followed by Users and so on.
  • When we created example.txt, we created it from the project directory.
  • This means, the PATH to example.txt is
/mnt/c/Users/username/project/example.txt

The terms “folder” and “directory” are generally used interchangeably, though “directory” is the technically correct term.

The home directory

When you open your terminal, your default location will be your HOME directory

This can be signified as ~/ .

While ~/ will always point to your default HOME directory, it is merely a shorthand for the actual PATH to your HOME directory.

This path can be seen using:

echo $HOME
/mnt/c/Users/username
  • The echo command simply prints out the text that follows it
  • $HOME refers to the built-in variable called HOME
  • $HOME contains a value, which is the path to your HOME directory
  • In this case, my home directory is /mnt/c/Users/username
  • So if I want to point to example.txt, another way for me to do it is:
~/project/example.txt
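
For example (assuming, as in the example above, that the project directory sits directly inside your home directory), ls will find the file through the ~/ shorthand no matter where you currently are in the filesystem:

ls ~/project/example.txt
/mnt/c/Users/username/project/example.txt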

Changing directories

The cd command (or “change directory”) allows you to move between different directories

For example, if we would like to move to our home directory, we can type

cd ~/
Note

In the above example, you would directly move to your home directory no matter where in the filesystem you were, as ~/ always refers to your home directory.

Now our current working directory is our home directory.

If we would like to switch to the project directory, we can type

cd project

We can move up one level from the current working directory using ../ . Similar to how ./ always refers to the current directory, ../ always refers to the directory one level above the current directory.

For example, if our current directory is /mnt/c/Users/username/project/

cd ../
pwd
/mnt/c/Users/username/

cd ../ takes us one directory above project, which in this case is username. If we were to repeat this, we would go up one more directory, taking us to Users:

cd ../
pwd
/mnt/c/Users/

Therefore ./ and ../ are shorthands that refer to paths that change depending on the current working directory. In other words, they are RELATIVE .

Absolute vs. Relative Paths

There are two main types of paths when referring to files and directories: ABSOLUTE and RELATIVE paths.

Absolute Paths always start from the root directory / and specify the exact location of a file or folder in the filesystem. For example, the absolute path to example.txt is:

/mnt/c/Users/username/project/example.txt

Relative Paths are paths that are relative to the current working directory. For instance, if you are currently in the username directory, you can refer to example.txt by:

./project/example.txt

If you are in project , and you want to refer to another file another_example.txt that is in Users, you can refer to it relative to your current location such as

../../another_example.txt

../../ goes two levels above project to reach Users and then refers to another_example.txt .

It’s generally a good idea to use absolute paths when you’re working with files or directories, especially when writing scripts or commands that could be run from different locations. This ensures that you are always referring to the exact file, no matter where you are in the filesystem.
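
For example (a small sketch reusing the placeholder paths from above), both commands below list the same file when run from the username directory, but only the first one will also work from every other location in the filesystem:

ls /mnt/c/Users/username/project/example.txt
ls ./project/example.txt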

The autocomplete feature

When you are typing out filenames or directory names in the shell, it will attempt to automatically complete what you are typing with the file/directory names WITHIN THE DIRECTORY YOU ARE TYPING IN (the directory part of the path you have typed so far, or the current directory if you have not typed a path). For example, say you want to change to the directory /mnt/c/Users/username/ .

When you begin typing /mnt/c/Users/us and then press the tab (⭾) button on your keyboard, it will automatically complete to /mnt/c/Users/username/ .

If you have multiple folders whose names begin with the same characters, it will complete up to the longest shared prefix. For example, if you have the following folders in your Users directory

ls Users

user1 user2 username

The autocomplete feature will complete /mnt/c/Users/us to /mnt/c/Users/user, after which you can type n to make it /mnt/c/Users/usern and then press tab again, which will then complete it to /mnt/c/Users/username

The autocomplete feature works the same way for files, for example /mnt/c/Users/username/project/exa will complete to /mnt/c/Users/username/project/example.txt .

The shell is CASE-SENSITIVE

example.txt and Example.txt will be treated as different files. If you type exa and press tab, it will never complete to Example.txt.

Other useful commands

mkdir : Makes a new directory called dir1 in the current directory (project)

mkdir dir1

I recommend making a separate project directory (or any name of your choosing) for this workshop to perform all upcoming exercises.

DO NOT USE SPACES IN FILE/FOLDER NAMES

In general, it is good practice to avoid spaces and special characters ( such as ?!<>,[]{}() ) in your file and folder names. Spaces and special characters usually have a SPECIAL MEANING in the command line and stand for specific operations, so the system can get confused when a name contains one of them and assume you are performing the operation that character stands for. For example, the space character is usually the separator between the different options/modifiers of a command-line tool. Use _ instead of spaces.
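
A quick demonstration of why spaces cause trouble (safe to try out in your workshop directory):

mkdir my folder      # creates TWO directories, one called "my" and one called "folder"
mkdir my_folder      # creates a single directory called "my_folder"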

ls : Lists all contents in the specified directory.

In the below example we list all the contents in the directory project

ls /mnt/c/Users/username/project/
dir1
example.txt

cp : Creates a copy of a file.

The below example copies example.txt to dir1 .

cp example.txt ./dir1/  

mv : Moves a file from one location to another.

The below example moves example.txt from dir1 to dir2.

mkdir dir2
mv ./dir1/example.txt ./dir2/  

cat : Display all the contents within a file.

To display all the contents of example.txt:

cat example.txt

In our case since we did not write anything into example.txt it will be empty. Feel free to open example.txt in your text editor, write some text into it and then try out the cat command.

head and tail : Display the first n or last n lines in a file

By default head or tail will display the first 10 or last 10 lines of a file.

head example.txt 

rm : Removes (deletes) a specified file. In the below example we delete example.txt in dir2

rm /mnt/c/Users/username/project/dir2/example.txt
REMOVING IS PERMANENT

Be very careful when using the rm command: unlike what you might be used to, there will be no confirmation dialogue box asking whether you are sure you want to delete the file. The file will simply be PERMANENTLY deleted!

man: Shows the manual (help) page for any command.

You can also follow any command with either -h or --help to display the help page. Look up the help page of the above commands to see all the different ways they can be used!

man ls
ls --help

Feel free to experiment with the above commands, navigating between different folders, creating and moving files, using absolute and relative paths and so on.

You can also add modifiers to each command to slightly change its behaviour. For example ls simply lists the names of all the files in the specified directory but ls -l lists them with a lot more additional information! You can also use multiple modifiers at a time like ls -l -a (What does -a do?) . Make sure to look into the --help pages for the commands to know more.

EXERCISE: Try to find the modifiers for each of the commands above to do the following:

  1. List files along with the file sizes
Click to reveal answer
ls -sh
  2. List files sorted by the time they were last modified
Click to reveal answer
ls -lt
  3. Print the first 30 lines from a file
Click to reveal answer
head -n 30 filename.txt
  4. Print all lines starting from the 3rd line
Click to reveal answer
tail -n +3 filename.txt
  5. Copy or move a directory along with all its contents
Click to reveal answer
cp -r /path/to/dir /path/to/new_dir
  6. Delete a directory
Click to reveal answer
rm -r /path/to/dir

Files and File Extensions

Most files on your computer have a file extension, which is part of the name and comes after the last period (dot) in the filename. For example, in example.txt, .txt is the file extension.

File extensions are used by the operating system to determine what kind of file it is, what program should open it, and how it should be processed. Some common file extensions include:

  • .txt - Text file
  • .csv - A table/dataframe with comma separated values
  • .docx - Microsoft Word document
  • .pdf - PDF document
  • .jpg - JPEG image file
  • .sh - Shell script

Why Are File Extensions Important?

File extensions are important because they help the system identify how to handle a file. For example:

  • A .txt file is usually opened with a simple text editor (e.g., Notepad, VSCode, etc.)
  • A .docx file is typically opened in Microsoft Word or similar word processing software.

Note

The extension does not determine the content of a file, just how it is generally interpreted by the system. You can change a file extension, but it doesn’t change the underlying content, though it might prevent the file from being opened correctly. For example you can change the extension of a .jpg image to .txt , but that doesn’t mean you can open it in Notepad as it will still be an image.
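
If you want to check what a file actually contains, regardless of its name, the file command (available on most Linux systems, and only shown here as an aside) inspects the content rather than the extension:

file example.txt   # reports the detected content type, e.g. "ASCII text" for a plain text file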

Plain Text Files

What is Plain Text?

Plain text files contain ONLY text. There is no additional formatting (such as bold/italics/colours), images, or embedded metadata. These files can be created and edited with simple text editors such as Notepad (Windows), TextEdit (macOS), or gedit (Ubuntu)

Note

A Word or Google Docs document containing only text is still not a plain text file; it still has a lot of embedded formatting and metadata.

Plain text is UNIVERSAL. It is not a proprietary format that can only be read by specific programs (for example, .docx). Any text editor on any system can read and manipulate plain text files.

In bioinformatics, plain text files are preferred for storing data because they can be easily read, processed, and manipulated by a variety of different software. Scripts for some programming languages such as Python, Bash, and R are also in plain text format.

However, not all bioinformatics data comes in plain text. Many data formats are designed for efficient storage, performance, or specific analysis tasks, and they may be binary formats (e.g., BAM). In bioinformatics workflows, plain text formats are still commonly used for various data exchange and reporting.

Note

For some programming languages, the scripts need to be compiled before they can be executed (for eg: C, Rust). While the script itself will be in plain-text, it cannot be executed unless it is compiled into a binary. In the case of Python, R and Bash, compilation is not needed.

Here are some of the most common plain text files used in bioinformatics:

1. .csv or .tsv

This will probably be the most common file format you will be directly interacting with on a regular basis. csv (Comma Separated Values) or tsv (Tab separated values) are used to store tabular data with commas (in csv) or tabs (in tsv) as the separator in-between values in the table. In other words, it is a way to represent a table in plain-text, with rows separated by new lines and columns separated by commas or tabs. This separator is referred to as the delimiter, with tsv files also being referred to as tab delimited files and csv as comma delimited. Many of the common bioinformatic file formats we will be discussing below are also tab delimited files.

Example .tsv file:

sample_name   Gene_symbol   Sequence_name
SAMN09634554    sseK2   type III secretion system effector arginine glycosyltransferase SseK2
SAMN42760649    mdsB    multidrug efflux RND transporter permease subunit MdsB
SAMN07714001    golT    gold/copper-translocating P-type ATPase GolT

Same example as a .csv

sample_name,Gene_symbol,Sequence_name
SAMN09634554,sseK2,type III secretion system effector arginine glycosyltransferase SseK2
SAMN42760649,mdsB,multidrug efflux RND transporter permease subunit MdsB
SAMN07714001,golT,gold/copper-translocating P-type ATPase GolT
Note

A .csv or .tsv may not be the most convenient way to visualize a table compared to software like Excel, which neatly separates each value into individual cells. However, when you are working with large datasets with thousands to millions of rows, these formats enable extremely fast and efficient processing of the data. Moreover, several bioinformatics tools use .csv and/or .tsv files as their inputs or outputs, so it is very important that you are familiar with handling these files.
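
As a small preview of working with delimited files in the terminal, here is a minimal sketch using the head command from above plus one extra command, cut, that is not covered elsewhere in this module. The filenames genes.tsv and genes.csv are just hypothetical examples:

head -n 2 genes.tsv          # print the header line plus the first data row
cut -f 2 genes.tsv           # print only the second column; cut treats tabs as the delimiter by default
cut -d ',' -f 2 genes.csv    # for a csv, tell cut to use a comma as the delimiter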

2. FASTA

  • FASTA is one of the most widely used formats for representing biological sequences, such as DNA, RNA, or protein sequences. A FASTA file can consist of multiple sequences. Each entry in a FASTA file has two components :
    • Description line: Starts with >, followed by an identifier and optional description.
    • Sequence: The actual sequence of nucleotides (DNA) or amino acids (protein). The sequence can be in a single line or broken up into multiple lines.
  • In a genome FASTA file, each sequence is called a contig . In most cases a genome will not be assembled “perfectly” into a single long stretch of DNA, and will be broken up into contigs. This is covered in detail in Module 2.
  • Extensions: .fasta,.fna,.fa,.faa

Example FASTA format:

>HNLNHC_00005 Oxaloacetate decarboxylase
GTGCGCGAGGACCTTGGCTTTATCCCGCTGGTGACCCCCACTTCACAGATT
GTCGGCACCCAGGCGGTGCTCAACGTCCTGACCGGCGAACGCTACAAAACC

>HNLNHC_00010 Oxaloacetate decarboxylase beta chain 2
ATGGAAAGTCTGAACGCCCTGCTTCAGGGCATGGGGCTGATGCACCTTGGC
GCAGGCCAGGCCATCATGCTGCTGGTGAGCCTGCTGCTGCTGTGGCTGGCG

3. FASTQ

  • The FASTQ format is an extension of the FASTA format, used for storing raw sequence data with quality scores. Each entry contains four lines:
    • Header: Starts with @ followed by the sequence identifier. For paired-end sequencing, this identifier is what keeps track of the “mates”.
    • Sequence: The actual sequence of nucleotides.
    • Plus line: Starts with a + sign (may also be followed by the same sequence identifier)
    • Quality scores: Per-base quality score corresponding to the sequence in line 2. Each symbol here corresponds to a numeric score for the base call in the corresponding position. Learn more about quality scores here.
  • Extensions: .fastq,.fq . Typically paired-end fastq reads for a single sample will have the suffix _R1.fastq and _R2.fastq.

Example FASTQ format:

@SRR16006951.1 1 length=149
CTGTTCGATATTGCCGCCTTGCGCCCCGCGCCGCTCACCCCGCTGGTGGCATTAATTACCGGCCACTGCGTCAGATCCAAAAGACCGCCGTCAATCAGCGGTTTTAGCGACAACTGCGCTGCGGTTGGATAGCAACCAGGAACCGCAAT
+SRR16006951.1 1 length=149
AAAAAEEEEEEAEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE/EEEEEEEEEAEEEEEEEEEEAAEAEEEEEEEEEEEEEEEEEEAAEEEEEEEEEEEEEEEEEEEEEEEA</EEEEE/<EAAAE/<
@SRR16006951.2 2 length=148
GTCGCCGTGTTTCTCTCCTTTGATGGCGAACTCGACACCCAGCCGCTGATAGAACAGCTATGGCAAGCGGGCAAGCGCGTCTACCTTCCGGTTCTTCATCCCTTCAGCCCTGGCAACCTCCTGTTCCTGCACTATCATACGCAGAGTG
+SRR16006951.2 2 length=148
AAAAAEEEEAEE/EEE/EAEEAEEAEEEEEEA<AEEEEEE//EEAE<EEEA/EE<EAEEEA<AEEE/E<EEEEEAEAE<EA/E<A6E/E/E/E/<E/E</<A/E<</EEE/<EE/EAE//A/<<E/E/6<EA/A</A//</AE/E/E/
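
Since every read occupies exactly four lines, counting the reads in an uncompressed FASTQ file comes down to counting lines and dividing by 4. A minimal sketch, assuming a hypothetical file called sample_R1.fastq (the $(( )) arithmetic is a bash feature not covered in this module):

wc -l sample_R1.fastq                         # total number of lines in the file
echo $(( $(wc -l < sample_R1.fastq) / 4 ))    # lines divided by 4 = number of reads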

4. GFF

  • GFF or General Feature Format is a tab delimited file with information about genes and other features from a FASTA file.
  • Once a FASTA file is annotated i.e by running some kind of gene finding/functional annotation software, specific regions in the fasta file will be identified as genes (or other genetic features such as non-coding RNA, tRNA and so on)
  • A GFF file may vary depending on exactly how it was generated, but in general it contains:
    • Metadata lines starting with ##
    • Sequence ID/contig ID that was annotated
    • The specific feature (CDS/ncRNA etc)
    • The start and end positions of the feature on the contig
    • More details regarding the annotation
  • Extensions: .gff, .gff3

Example GFF format :

##gff-version 3
##feature-ontology https://github.com/The-Sequence-Ontology/SO-Ontologies/blob/v3.1/so.obo
contig_1    CDS     90      1232    ID=HCEKFJ_00005;Name=Biotin/lipoyl-binding protein;product=Biotin/lipoyl-binding protein;
contig_46   ncRNA   1728    1828    ID=HCEKFJ_23455;Name=RNAI;gene=RNAI;product=RNAI;
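
As a small preview of the grep command covered later in this module, you could get a rough count of how many CDS features an annotation contains (the filename annotation.gff is hypothetical; note that this simple match counts every line containing the string CDS, not only the feature column):

grep -c "CDS" annotation.gff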

5. GenBank

  • GenBank is also a gene feature/annotation format containing information about genes and other features, but is larger and contains more information than the gff file. For eg:
    • Information regarding the person/institution that generated the file with contact information
    • If the record is connected to a specific publication
    • Amino acid sequences for each annotated gene.
  • Extensions: .gbk, .gbff

Example GenBank format : NCBI Sample Genbank Record

6. SAM/BAM

  • SAM (Sequence Alignment/Map) is a tab delimited format with information about the alignment of sequences to a reference genome.
  • Depending on how the SAM file was generated, the formatting can vary slightly but the first few columns are mandatory. In general all SAM files have
    • The identifier of the read that was mapped
    • The SAM flag describing certain properties of the mapped read
    • The reference contig ID, mapping quality
    • The mapping position
    • A compressed representation of the alignment called CIGAR (for eg: if any insertions/deletions are present and at what position)
  • See here for a more detailed explanation of SAM files.
  • BAM is the binary equivalent of SAM. BAM files are not in plain-text; they are compressed (and usually also indexed) versions of SAM files. This makes BAM files more efficient in terms of storage and allows fast querying for alignments at specific positions without having to search through the whole file.
  • Extensions: .sam, .bam . A BAM index will have the extension .bai

Example SAM format:

@HD VN:1.6 SO:coordinate
@SQ SN:ref LN:45
r001 99 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA *
r003 0 ref 9 30 5S6M * 0 0 GCCTAAGCTAA * SA:Z:ref,29,-,6H5M,17,0;
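
Because BAM files are binary, commands like cat or head will only show unreadable characters. A tool such as samtools (commonly available in bioinformatics environments, though not required for this module) can convert a BAM file back to the plain-text SAM representation. A minimal sketch, assuming a hypothetical file called alignment.bam (the | symbol is the “pipe”, explained later in this module):

samtools view -h alignment.bam | head    # -h includes the header lines; head limits the output to the first 10 lines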

7. VCF

  • VCF or Variant Call Format is a tab delimited file derived from an alignment (SAM/BAM) that specifically contains genetic variation data.
  • A VCF file contains file metadata (lines starting with ##) followed by genetic variants (SNPs/indels) at each position when compared to the reference.
  • In the example below, you can see that the lines starting with ## have information about the file and the INFO and FORMAT columns
  • Similar to the SAM file, the VCF file can have slight differences depending on how it was generated but the first few columns are mandatory.
  • For eg: a VCF file generated from an alignment against an annotated reference will also describe the functional nature of each variant; if a SNP is found in a coding region, the VCF file will have additional information such as whether the SNP is synonymous or non-synonymous, or whether an indel causes a frameshift
  • The binary equivalent of a VCF file is BCF
  • Extensions: .vcf , .bcf

Example VCF Format:

##fileformat=VCFv4.2
##reference=GRCh37
##INFO=<ID=AB,Number=A,Type=Float,Description="Allele balance at heterozygous sites: a number between 0 and 1 representing the ratio of reads showing the reference allele to all reads, considering only reads from individuals called as heterozygous">
##INFO=<ID=AO,Number=A,Type=Integer,Description="Count of full observations of this alternate haplotype.">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total read depth at the locus">
##INFO=<ID=QA,Number=A,Type=Integer,Description="Alternate allele quality sum in phred">
##INFO=<ID=QR,Number=1,Type=Integer,Description="Reference allele quality sum in phred">
##INFO=<ID=RO,Number=1,Type=Integer,Description="Count of full observations of the reference haplotype.">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=RO,Number=1,Type=Integer,Description="Reference allele observation count">
##FORMAT=<ID=QR,Number=1,Type=Integer,Description="Sum of quality of the reference observations">
##FORMAT=<ID=AO,Number=A,Type=Integer,Description="Alternate allele observation count">
##FORMAT=<ID=QA,Number=A,Type=Integer,Description="Sum of quality of the alternate observations">
##FORMAT=<ID=GL,Number=G,Type=Float,Description="Genotype Likelihood, log10-scaled likelihoods of the data given the called genotype for each possible genotype generated from the reference and alternate alleles given the sample ploidy">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT
NZ_CP019184.1   39681   .   T   A   682.658 .   AB=0;AO=20;DP=20;QA=780;QR=0;RO=0;TYPE=snp  GT:DP:RO:QR:AO:QA:GL    
NZ_CP019184.1   40836   .   A   G   682.658 .   AB=0;AO=20;DP=20;QA=780;QR=0;RO=0;TYPE=snp  GT:DP:RO:QR:AO:QA:GL    
NZ_CP019184.1   48112   .   A   G   682.658 .   AB=0;AO=20;DP=20;QA=780;QR=0;RO=0;TYPE=snp  GT:DP:RO:QR:AO:QA:GL    

Working with files in the terminal

Now that we have gone through some common file formats we will be working with, let's learn how to manipulate these files and extract information from them in the terminal.

Standard output redirection

When you enter any command in the terminal and see the output on your screen, this output is called the standard output or stdout . For eg: the ls command lists the files and folders in the current directory, and everything that is listed is the stdout

ls 

dir1
dir2
example.txt

The stdout of any command can be redirected into a file using >.

ls > list_of_files.txt

Note that there is no stdout printed this time. This is because we redirected the stdout into list_of_files.txt. Now you can see the contents of list_of_files.txt using the cat command that we learned before.

cat list_of_files.txt

dir1
dir2
example.txt
THE > REDIRECTION IS DANGEROUS

Be very careful when writing to a file with > : if you happen to specify an existing file, it will OVERWRITE all the contents of that file with whatever you are redirecting into it. This process is IRREVERSIBLE and you will lose all the original contents of the file. If you want to append to (add to the end of) an existing file, use >> instead; it will add new lines instead of overwriting the existing ones.
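
For example, here is a small sketch of the difference, using the echo command from earlier and a throwaway file called notes.txt (so that list_of_files.txt is left untouched for the examples below):

echo "first line" > notes.txt     # > creates notes.txt (or overwrites it if it already exists)
echo "second line" >> notes.txt   # >> appends to the end of notes.txt instead
cat notes.txt

first line
second line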

It is also possible to redirect the stdout of one command into the input of another command using the “pipe” character - |. This process is called piping. For example, let's redirect the output of ls to a new command, wc . wc prints the number of lines, words, and characters present in the given input. In this case, the input is whatever the ls command outputs. Here, wc is telling us that there are 3 lines, 3 words and 22 characters.

ls | wc

3  3  22

Performing the above command is basically the same as performing:

ls > list_of_files.txt
wc list_of_files.txt

But by using piping, we were able to do it much faster and without writing a new file.

Extracting specific information

The grep command will search for a string (a specific piece of alphanumeric text) within a plain-text file. It is similar to the ctrl + F or find feature that you may have commonly used to search for text within documents. For example, if we want to search for the string dir in list_of_files.txt, we can do

grep "dir" list_of_files.txt

dir1
dir2

grep by default will output ALL lines containing the matching string. The string matching is CASE-SENSITIVE, therefore grepping for the string Dir in this case will give an empty result. However if you want grep to perform a CASE-INSENSITIVE search, you can do:

grep -i "Dir" list_of_files.txt

dir1
dir2

In the above example, grep ignores the case of the string to be searched. There are many such useful modifiers, check man grep or grep --help to see the different behaviours. Another useful modifier is -v for INVERTED matching. This will output only lines that DO NOT HAVE the search string.

grep -v "dir" list_of_files.txt

example.txt

If grep is similar to the find operation, then sed is the equivalent of find + replace. sed can substitute a specific character, string or pattern with another. By default, sed will replace the first occurrence of the search string in each line. The syntax for sed substitutions is as follows:

sed 's/search_string/replace_string/' filename

Let's say we want to replace dir in list_of_files.txt with Directory

sed 's/dir/Directory/' list_of_files.txt

Directory1
Directory2
example.txt

Note that sed is simply printing the result of the specified action to the stdout. It is not actually changing the text within list_of_files.txt. If you would like to REWRITE list_of_files.txt with the specified action, you can perform an in-place substitution with the -i modifier in sed. THIS IS IRREVERSIBLE.

sed -i 's/dir/Directory/' list_of_files.txt

cat list_of_files.txt

Directory1
Directory2
example.txt

Both grep and sed have the -i modifier, however in grep it means to perform case-insensitive matches, while in sed it means to perform the substitution in-place. It is very important to note that the same modifier can mean different things for different tools. It is always a good idea to check the help page for each tool. How will you perform a case-insensitive substitution in sed ?

Another useful command is tr or translate . While it is similar to sed in that it performs find+replace tasks, tr is faster and easier to use for character-by-character operations, while sed is better for longer strings. For example, if you want to replace a single character with another, tr can do it very easily. Let's pipe the sed output to tr to convert Directory into directory

sed 's/dir/Directory/' list_of_files.txt | tr 'D' 'd'

directory1
directory2
example.txt

In the above command, tr looks for all instances of D and converts them into d. While this can be done with sed, sed by default only performs the operation on the FIRST INSTANCE PER LINE. In other words, sed splits your input line by line, and performs the operation independently for each line.

tr performs the operation on a character by character basis, and therefore can work across different lines, as the newline (or \n - which is the character that actually represents a new line) itself is treated as just another character. This allows you to do operations such as the one below very easily.

sed 's/dir/Directory/' list_of_files.txt | tr '\n' ' '

Directory1 Directory2 example.txt

Here, we replaced all newline characters (\n) with spaces.
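
As a quick aside, tr also accepts character ranges, which makes case conversion across an entire input very easy:

cat list_of_files.txt | tr 'a-z' 'A-Z'

DIRECTORY1
DIRECTORY2
EXAMPLE.TXT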

Working with bioinformatics data

Let's download some real bioinformatics data and see how we can use what we have learned so far.

Note

If you have performed the setup.sh step in Module0 , you can skip to the exercise. The file should be present in NWU_2025_workshop_data/test_datasets/GCA_049744075.1

Activate your conda environment so that the necessary bioinformatics packages are loaded. See Module0: Setup for more information.

conda activate NWU_2025_workshop

Download a genome fasta file using NCBI’s datasets tool. The datasets tool should already be installed in the conda environment. You can see how to use the datasets tool by typing datasets --help . To download a genome for a specific accession number, the command is :

datasets download genome accession GCA_049744075.1

This will download a zipped archive from NCBI. Let's extract the archive using unzip, take the file we need and then remove the rest.

unzip ncbi_dataset.zip
mkdir test_datasets
mv ncbi_dataset/data/GCA_049744075.1/GCA_049744075.1_ASM4974407v1_genomic.fna test_datasets/
rm -r ncbi_dataset.zip ncbi_dataset # you can keep specifying files/folders after the rm command to remove them all

EXERCISE: Perform the following actions on the above fasta file

  1. Count the number of contigs in the fasta file
Click to reveal answer

The grep command can be used to count the number of contigs (or sequences) in a FASTA file. Each sequence in a FASTA file starts with a header line (>), so we can grep the > character to get only the header lines, and add the -c flag to grep so that it reports the number of matches.

grep -c ">" GCA_049744075.1_ASM4974407v1_genomic.fna

Alternatively, we can also use wc -l to report the number of lines matched, as grep by default outputs all matching lines

grep ">" GCA_049744075.1_ASM4974407v1_genomic.fna | wc -l
  2. Extract only sequence names from the FASTA file and write them to a new file.
Click to reveal answer

We can extract only the sequence names by grepping the “>” character as done previously, but now we pipe it to sed to remove the > character, keeping only the names themselves, and then redirect that output to a new text file.

grep ">" GCA_049744075.1_ASM4974407v1_genomic.fna | sed 's/>//' > GCA_049744075.1_header_names.txt
  3. Count the total number of bases in the FASTA file.
Click to reveal answer

We use inverted matching with grep -v to get all lines EXCEPT lines starting with “>”, then we remove all newlines using tr -d, and then use wc to count the number of characters. We have to remove all newlines (\n) as each \n is considered a separate character, but here we only want to count the number of nucleotides.

grep -v ">" GCA_049744075.1_ASM4974407v1_genomic.fna | tr -d '\n' | wc -c
  4. In silico restriction digestion - how many fragments will you get if you digest your genome with EcoRI?
Click to reveal answer

The recognition site for EcoRI is GAATTC . We first extract only the DNA sequence from our FASTA file using grep and tr as above, then pipe it to sed and replace each occurrence of GAATTC with a newline (in effect, “cutting” our DNA), and finally count the total number of lines

grep -v ">" GCA_049744075.1_ASM4974407v1_genomic.fna | tr -d '\n' | sed 's/GAATTC/\n/g' | wc -l