So… now we have a folder full of fasta files. Wouldn’t it be nice to know what is in them? There are a couple of ways of exploring these files. We will look at three – more, tail and grep.
The more command lets you read through a text file one screen at a time. Go to your terminal. You should still be in the Unix folder; if you aren’t, use what you know about cd to get back there. One of the fasta files we extracted from Archive.zip is KP760420.fasta. In the terminal, type:
more KP760420.fasta
This shows you the first screenful of text from any text file. If the file is longer than one screen, you can move through it a screen at a time by hitting the space bar (or press q to quit). These fasta files are short enough that ‘more’ shows the entire contents of the file.
The tail command is similar, but shows only the last few lines of the file. Try:
tail KP760420.fasta
In this case, you don’t see the first line of the file, because the fasta file is longer than the 10 lines tail returns by default. You would normally use tail to look at the end of a really long file; with more you would need to scroll through every line, and if your file is thousands of lines long, that could take a while.
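If 10 lines is too many or too few, tail takes a -n option that sets how many lines to print. The sketch below is only an illustration: it builds a small made-up fasta file (the record name and sequence are hypothetical, not one of the downloaded files) so you can see the difference without touching your real data.

```shell
# Build a small, made-up fasta file to experiment on (hypothetical record).
demo=$(mktemp)
{
  echo ">demo_seq hypothetical example record"
  for i in 1 2 3 4 5 6 7 8 9 10 11; do
    echo "ACGTACGTACGT"
  done
} > "$demo"

# The file has 12 lines. By default tail prints the last 10,
# so the header line at the top is cut off.
default_count=$(tail "$demo" | wc -l | tr -d ' ')

# -n sets how many lines to print; here, only the last 3.
short_count=$(tail -n 3 "$demo" | wc -l | tr -d ' ')

echo "default: $default_count lines, with -n 3: $short_count lines"

# Remove the scratch file.
rm -f "$demo"
```

On your own data the same idea would be tail -n 3 followed by the file name.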
If you want to search for a specific word or set of characters in a file, you can use grep. Is the string “TGCCATATAAGGGACTGAA” in the fasta file KP760420.fasta? In the terminal type:
grep TGCCATATAAGGGACTGAA KP760420.fasta
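If the search string is in the file, grep will print every line it is found in; if it is not, grep prints nothing. Because grep searches one line at a time, a string that is split across a line break in the sequence will not be found. grep has an option
-c
which will return the number of lines that contain a string (not the total number of occurrences, so a line with two matches still counts once). See how many lines of the sequence in KP760420.fasta contain ‘TATA’.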
grep -c TATA KP760420.fasta
How many lines of the sequence in KP760420.fasta contain ‘ATGC’?
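Note that -c counts lines that match, not individual matches. If you want the number of occurrences, one common approach is the -o option (available in GNU and BSD grep, though not required by POSIX), which prints each non-overlapping match on its own line so that wc -l can count them. The example below is a sketch on a made-up record; the file contents are hypothetical and chosen so the two counts differ.

```shell
# Made-up fasta record for illustration (hypothetical name and sequence).
demo=$(mktemp)
printf '%s\n' \
  '>demo_seq hypothetical example record' \
  'TATAGGTATA' \
  'CCGGTATACC' \
  'GGGGCCCCAA' > "$demo"

# -c counts matching lines: two of the sequence lines contain TATA.
line_count=$(grep -c TATA "$demo")

# -o prints each non-overlapping match on its own line, so counting
# those lines gives the number of occurrences: three in total.
match_count=$(grep -o TATA "$demo" | wc -l | tr -d ' ')

echo "lines with TATA: $line_count, occurrences of TATA: $match_count"
# prints: lines with TATA: 2, occurrences of TATA: 3

# Remove the scratch file.
rm -f "$demo"
```

Neither count sees a match that is split across a line break, and -o does not report overlapping matches.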
grep can also look in multiple files with one command. Say we want to know which of the many fasta files we downloaded are sequences from Brugia. Unix uses * as a wildcard that matches every file name in the current folder, and it can be used as an argument. This code:
grep Brugia *
asks the computer to look in every file in the folder it is pointing to and print each line that contains the word Brugia, with the name of the file it came from in front. Give it a try now.
How many of the fasta files are from Brugia? Which are they?
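When you only care about which files match, grep's -l option prints just the file names, one per line. The sketch below shows this on three made-up fasta files in a scratch folder; the file names and sequences are hypothetical and are not the files from Archive.zip.

```shell
# Make a scratch folder with three made-up fasta files (hypothetical contents).
workdir=$(mktemp -d)
printf '>A1 Brugia malayi hypothetical record\nACGTACGT\n' > "$workdir/a.fasta"
printf '>B2 Onchocerca volvulus hypothetical record\nTTGACCAA\n' > "$workdir/b.fasta"
printf '>C3 Brugia pahangi hypothetical record\nGGCATGCA\n' > "$workdir/c.fasta"

# With several files, grep puts the file name in front of each matching line.
(cd "$workdir" && grep Brugia *.fasta)

# -l prints only the names of the files that contain a match.
matching_files=$(cd "$workdir" && grep -l Brugia *.fasta)
echo "$matching_files"

# Count the matching files by counting those names.
file_count=$(echo "$matching_files" | wc -l | tr -d ' ')
echo "files mentioning Brugia: $file_count"

# Remove the scratch folder.
rm -rf "$workdir"
```

Using *.fasta instead of * limits the search to files whose names end in .fasta, which is handy if the folder also holds other kinds of files.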
How would you use the -c option to find out how many lines in each file contain the string “ATGTG”?