When you log in, you start in your home directory. Each user has
his or her own home directory. It can be anywhere in the
system, but on most Linux systems it is under the /home directory.
In our case the home directory is ``/home/your account id''. If
you use Konqueror to inspect your home directory (K menu, Internet
submenu, Konqueror entry; click the ``home'' icon), you will see that the
directory initially contains the sub-directories ``www'' and ``Desktop''.
We could create a directory structure directly from Konqueror, but we
will use the shell command line instead. For our tutorial we want the
following structure
cd tutorials
Now we should copy the pages. The main internet page for this
tutorial is located at
http://mairinque.ime.usp.br/~gubi/book/tutorial/index.html
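To mirror the site completely we will call wget with the mirror option (-m) and with the -nH option, which keeps wget from creating a top-level directory named after the host. The command below also uses --cut-dirs=2, which drops the first two directory levels of the remote path (here ``~gubi/book'') from the local copy. To copy the site completely, type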
wget -nH -m --cut-dirs=2 http://mairinque.ime.usp.br/~gubi/book/tutorial/index.html
Not only will the file index.html be copied to your directory,
but so will all the files its links refer to: the pages that
index.html links to and the pages those pages
point to. If you use your local browser to open the file index.html in this directory, you will see your local copy of the
tutorial page. Try the links and you will see that all the
respective pages are now local to your computer.
Sites with passwords: Many sites on the internet require a password to show some part of their material. Wget can retrieve pages protected by passwords. If you know a valid user and its password, you can use the options --http-user= and --http-passwd=. The book site has a part that is password protected. Try the command
wget -nH -m http://mairinque.ime.usp.br/~gubi/book/protected/protected.html
You will get back an error message stating that you need a password to access the page. To retrieve the information you can use the user bookreader with the password bookworm. Let's try:
wget -nH -m --http-user=bookreader --http-passwd=bookworm http://mairinque.ime.usp.br/~gubi/book/protected.html
Now you have downloaded the protected pages.
/home/bioinfo/tdr2006/sequences
and its subdirectories, looking for all files with the extension
``.fasta''
find /home/bioinfo/tdr2006/sequences -name "*.fasta" -print
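If you want to experiment with find without touching the tutorial data, the search can be tried on a throwaway directory tree. This is only a minimal sketch: the directory and file names are invented for illustration, and it assumes the mktemp command is available on your system.

```shell
# Build a small temporary tree (all names here are made up).
workdir=$(mktemp -d)
mkdir -p "$workdir/sequences/sub1" "$workdir/sequences/sub2"
touch "$workdir/sequences/a.fasta" \
      "$workdir/sequences/sub1/b.fasta" \
      "$workdir/sequences/sub2/notes.txt"

# Search the tree and its subdirectories for names ending in .fasta.
found=$(find "$workdir/sequences" -name "*.fasta" -print | sort)
echo "$found"

# One path is printed per line, so counting lines counts the matches.
n_found=$(echo "$found" | wc -l | tr -d ' ')
echo "matches: $n_found"
```

The file notes.txt is skipped because its name does not end in ``.fasta''.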
~/recentFiles
ls -l /home/bioinfo/tdr2006/sequences/fastaFiles
The files that were created after December 2005 are
~/recentFiles. Since we need to copy
three files, it would be tiresome to type the complete path for each
of them. The easiest thing to do is to ``go''
to the /home/bioinfo/tdr2006/sequences/fastaFiles directory,
so that we only need to type the names of the files (that is, their relative paths) instead of typing long paths many times.
However, we need to use the complete path for the destination
directory.
cd /home/bioinfo/tdr2006/sequences/fastaFiles
cp newFile1.fasta newFile2.fasta even_newer_file.fasta ~/recentFiles
Finally, to make sure the files were copied correctly, check the
``recentFiles'' directory.
ls -l ~/recentFiles
Please note that in the listing, all files are listed as being created today.
Now, remember to go to your new directory.
cd ~/recentFiles
more newFile1.fasta
Try pressing the ``enter'' key and the space bar to inspect the file and see what happens. Another option is to ask for all files to be inspected at once:
more newFile1.fasta newFile2.fasta newFile3.fasta even_newer_file.fasta
Initially, you have almost the same output as before, but when the
first page of the file is displayed, before the actual contents,
there are five lines indicating the name of the file being
displayed.
Also, when you reach the end of a file, the contents of the next one will appear. Please note that, on the last page of a file, there is a message at the bottom of the page indicating the next file to be displayed.
After you finish inspecting the files, you will see that the file
``even_newer_file.fasta'' is not actually a fasta file; it
contains normal text.
Try doing the same thing using the less command. Please note that you can move backwards using the ``page up'' key. The up-arrow and down-arrow keys also work as expected.
even_newer_file.fasta'' is misnamed and
does not contain genetic sequences in FASTA format. Now we will
change the file's name to ``even_newer_file.txt''.
mv even_newer_file.fasta even_newer_file.txt
Now if you list your files (using the ``ls -l'' command) you
will see that the file's name has changed. Please note that the date of the
file remains unchanged.
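The different effect of cp and mv on the file date can be checked with a small experiment. The sketch below is an illustration only: the file names are invented, it works in a temporary directory, and it uses touch -t to give a file an old date so that the difference is visible.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Create a file and give it an old modification date (1 January 2006, 00:00).
echo "just some text" > old_name.fasta
touch -t 200601010000 old_name.fasta

cp old_name.fasta copy.fasta     # the copy gets the current date
mv old_name.fasta new_name.txt   # the renamed file keeps the old date

ls -l

# find -newer lists files modified more recently than the reference file.
newer=$(find . -name "copy.fasta" -newer new_name.txt)
echo "newer than new_name.txt: $newer"
```

In the ``ls -l'' listing the copy shows today's date, while the renamed file still shows the date set with touch.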
grep "Gubialan" newFile1.fasta
This file contains two entries where the name appears in the
header.
We could proceed and reissue the command for the other two files, replacing ``newFile1.fasta'' with ``newFile2.fasta'' and with ``newFile3.fasta''. However, we can do this all in one command line:
grep "Gubialan" newFile1.fasta newFile2.fasta newFile3.fasta
Notice that we now have many lines, all preceded by the file names.
We can see that no entries are displayed for newFile3.fasta;
therefore it is the file that does not contain sequences generated
by Gubialan. A third way of issuing this command is to use
``wildcards''. Wildcards are characters with special meaning in the
shell command line. One that is particularly useful is ``*''. When
``*'' appears in a file name on the command line, it stands for
``any characters, any length''. When the shell finds this character, it
checks which existing file names match the
expression and places them in the command line. In other words,
if we type
grep "Gubialan" newFile*.fasta
then ``newFile*.fasta'' means
``any existing file with a name starting with 'newFile' and ending with '.fasta'''.
Therefore we get the same command line as before, since in this case ``*'' can be successfully replaced by ``1'', ``2'' and ``3''. In the same way we can type an even shorter command:
grep "Gubialan" newFile*
In this case the shell will again find every possible completion and issue
the command. This time ``*'' is replaced by ``1.fasta'',
``2.fasta'' and ``3.fasta''. One important note: if there were other files
with names starting with ``newFile'', they would also have been
included in the command line.
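A simple way to see what a wildcard turns into is to put echo in front of the expression, since echo just prints the command line it receives. The sketch below assumes a temporary directory with invented file names, including one extra file whose name starts with ``newFile''.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch newFile1.fasta newFile2.fasta newFile3.fasta newFileNotes.txt other.fasta

# echo prints its arguments after the shell has expanded the wildcard.
narrow=$(echo newFile*.fasta)
wide=$(echo newFile*)
echo "newFile*.fasta -> $narrow"
echo "newFile*       -> $wide"
```

The second expansion also picks up newFileNotes.txt, which is exactly the situation described in the note above.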
wc newFile1.fasta
As a result three numbers will appear on the screen: respectively
the number of lines in the file, the number of words, and the
number of characters. We can see from the output that newFile1.fasta has
XXXX lines, YYYY words, and ZZZ characters.
We can also give the names of more than one file, either by typing all the file names or by using wildcards:
wc newFile*
As you can see in the output, in this case the program shows not only
the counts for each file, but also the total count.
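To see how the three numbers relate to the contents of a file, it helps to run wc on a file small enough to count by hand. This is a sketch with an invented three-line file in a temporary directory; the options -l, -w and -c ask wc for one count at a time.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '>seq1 Gubialan\nACGT\nTTGA\n' > sample.fasta

wc sample.fasta              # lines, words, characters, file name

# Reading from standard input makes wc print only the number.
lines=$(wc -l < sample.fasta | tr -d ' ')
words=$(wc -w < sample.fasta | tr -d ' ')
chars=$(wc -c < sample.fasta | tr -d ' ')
echo "lines=$lines words=$words chars=$chars"
```

The file has 3 lines, 4 words and 25 characters; the character count includes the end-of-line character of each line.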
grep ">" newFile1.fasta
We can also look at the fasta headers of all fasta files of our
example:
grep ">" newFile*.fasta
By visual inspection we can see that there are a total of 17 fasta headers, and therefore, 17 sequences in our files.
|'' character. If we want the output of program p1 to be
used directly as the input of program p2, we need just to
``connect'' them using a pipe: p1 | p2.
grep ">" newFile1.fasta
We only need to ``pipe'' it into the command wc:
grep ">" newFile1.fasta | wc
The output will show that newFile1.fasta contains 3 sequences.
We can do this also for all fasta files:
grep ">" newFile*.fasta | wc
The output will show how many fasta headers all three files have.
grep ">" newFile*.fasta
Next, we want to select only those lines that contain the word
``Gubialan'':
grep ">" newFile*.fasta | grep "Gubialan"
Finally we want to count these lines:
grep ">" newFile*.fasta | grep "Gubialan" | wc
The result will show the number of fasta sequences that were
sequenced by Gubialan.
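The three-stage pipeline can be tried on files small enough to check by eye. The following is a minimal sketch with invented sequences in a temporary directory; it uses ``wc -l'' so that only the line count is printed.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '>seq1 Gubialan\nACGT\n>seq2 Other\nGGCC\n>seq3 Gubialan\nTTAA\n' > newFile1.fasta
printf '>seq4 Other\nCCGG\n' > newFile2.fasta

# Stage 1 keeps the header lines, stage 2 keeps those naming Gubialan,
# stage 3 counts the lines that are left.
all_headers=$(grep ">" newFile*.fasta | wc -l | tr -d ' ')
by_gubialan=$(grep ">" newFile*.fasta | grep "Gubialan" | wc -l | tr -d ' ')
echo "headers: $all_headers, by Gubialan: $by_gubialan"
```

Each stage only sees what the previous stage printed, so the order of the stages matters.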
cat newFile*
However, if you try the command above you will notice that the
contents of all the files are indeed put together, but they are only
displayed on the screen. To create the file ``all.fasta'' we
now need to redirect this output to the file:
cat newFile* > all.fasta
You can now inspect your file and verify that it indeed contains the
sequences of all the previous files. For this you can either use
more:
more all.fasta
Or, for example, check only the fasta headers to see if all the ones
from the three files are in all.fasta:
grep ">" all.fasta
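Instead of comparing the headers by eye, you can also compare counts: the number of headers in the combined file should equal the number of headers in the separate files. This is a sketch on invented files in a temporary directory, not the tutorial data.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '>seq1\nACGT\n>seq2\nGGCC\n' > newFile1.fasta
printf '>seq3\nTTAA\n' > newFile2.fasta

# Concatenate the files and send the result to all.fasta instead of the screen.
cat newFile* > all.fasta

separate=$(grep ">" newFile*.fasta | wc -l | tr -d ' ')
combined=$(grep ">" all.fasta | wc -l | tr -d ' ')
echo "headers in the separate files: $separate"
echo "headers in all.fasta:          $combined"
```

If the two numbers differ, something went wrong with the concatenation.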
latestPaper_v'' in the directory
/home/bioinfo/tdr2006/researchGroup and its subdirectories.
_v'', so we will accept anything after this initial
part of the name. To specify that, we just put the wildcard at the
end. We also want to start the search at the directory
``/home/bioinfo/tdr2006/researchGroup'', so you should issue the command:
find /home/bioinfo/tdr2006/researchGroup -name "latestPaper_v*"
As a result you will see the file names with their complete path.
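The quotes around the pattern matter: they make sure that find, and not the shell, interprets the wildcard. The sketch below shows what can happen without them. It assumes a temporary directory with invented files, one of which matches the pattern in the current directory.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir -p researchGroup/dir1 researchGroup/dir2
touch researchGroup/dir1/latestPaper_v1.txt researchGroup/dir2/latestPaper_v3.txt
touch latestPaper_v0.txt     # a matching file in the current directory

# Quoted: find receives the pattern itself and matches it in every directory.
quoted=$(find researchGroup -name "latestPaper_v*" | wc -l | tr -d ' ')

# Unquoted: the shell first replaces the pattern with latestPaper_v0.txt,
# so find looks for that exact name and finds nothing under researchGroup.
unquoted=$(find researchGroup -name latestPaper_v* | wc -l | tr -d ' ')

echo "quoted: $quoted, unquoted: $unquoted"
```

With the quotes both files are found; without them, in this particular setup, none is.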
/home/bioinfo/tdr2006/researchGroup (find, cp, cd)
cd
mkdir papers
Now, two files were found in the previous item:
``/home/bioinfo/tdr2006/researchGroup/dir1/latestPaper_v1.txt'' and
``/home/bioinfo/tdr2006/researchGroup/dir2/latestPaper_v3.txt''. There are many
ways in which we can perform the copy, depending on whether we use relative or absolute paths and on our use of
wildcards. We will show three.
In the first one we use only relative paths and no wildcards:
cp ../bioinfo/tdr2006/researchGroup/dir1/latestPaper_v1.txt ../bioinfo/tdr2006/researchGroup/dir2/latestPaper_v3.txt papers
Now we use absolute paths for the source files:
cp /home/bioinfo/tdr2006/researchGroup/dir1/latestPaper_v1.txt /home/bioinfo/tdr2006/researchGroup/dir2/latestPaper_v3.txt papers
Finally we try a shorter version, with wildcards:
cp ../bioinfo/tdr2006/researchGroup/dir*/latestPaper_v* papers
Please note that wildcards were used twice, once to describe the
directory names and once to specify the files.
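Before running a copy with wildcards it can be reassuring to preview which files will be copied. One way, sketched below on an invented directory tree in a temporary directory, is to put echo in front of the command, which prints the expanded command line without executing it.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir -p researchGroup/dir1 researchGroup/dir2 papers
echo "first version" > researchGroup/dir1/latestPaper_v1.txt
echo "third version" > researchGroup/dir2/latestPaper_v3.txt
echo "unrelated"     > researchGroup/dir2/notes.txt

# Preview: echo shows the command line after wildcard expansion.
echo cp researchGroup/dir*/latestPaper_v* papers

# The real copy: one wildcard matches the directories, the other the files.
cp researchGroup/dir*/latestPaper_v* papers
copied=$(ls papers | tr '\n' ' ')
echo "copied: $copied"
```

Only the two ``latestPaper_v'' files reach the papers directory; notes.txt does not match the pattern and is left alone.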
cd papers
Now, to find out the differences between the files
``latestPaper_v1.txt'' and ``latestPaper_v3.txt'' we
use the diff command:
diff latestPaper_v1.txt latestPaper_v3.txt
The output of the program shows only the differences between the files. For each difference, the line numbers involved are given first. In our output the first difference starts with ``1c1,2'', which means that line 1 of the first file was changed into lines 1 and 2 of the second file. Next come the contents of the lines of the first file, preceded by ``<'', and the contents of the lines of the second file, preceded by ``>''. The next differences are described in the same way.
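The notation is easier to read on a tiny example. The sketch below uses two invented files (old.txt and new.txt, not the tutorial's papers) that are built so that diff reports a single ``1c1,2'' change.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'Title: draft\nbody text\n' > old.txt
printf 'Title: final\nSubtitle\nbody text\n' > new.txt

# diff exits with status 1 when the files differ, which is not an error here.
diff old.txt new.txt > changes.txt || true
cat changes.txt

first_line=$(head -n 1 changes.txt)
n_lines=$(wc -l < changes.txt | tr -d ' ')
```

The line ``body text'' is the same in both files, so it does not appear in the output at all.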
cd
grep ">" newFile1.fasta > newFile1.names
grep ">" newFile2.fasta > newFile2.names
diff newFile1.names newFile2.names
grep ">" newFile1.fasta | sort > newFile1.names.sorted
grep ">" newFile2.fasta | sort > newFile2.names.sorted
diff newFile1.names.sorted newFile2.names.sorted
Check the results and you will see that now there are no common
sequences displayed in the diff output.
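Why sorting helps can be seen on two small lists that share most entries but store them in a different order. This is a sketch with invented header names in a temporary directory.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '>seqA\n>seqB\n>seqC\n' > first.names
printf '>seqC\n>seqA\n>seqB\n>seqD\n' > second.names

# Unsorted: diff compares line by line, so the different order shows up as changes.
diff first.names second.names > unsorted.diff || true

# Sorted: lines common to both files line up, and only real differences remain.
sort first.names  > first.names.sorted
sort second.names > second.names.sorted
diff first.names.sorted second.names.sorted > sorted.diff || true

cat sorted.diff
unsorted_lines=$(wc -l < unsorted.diff | tr -d ' ')
sorted_lines=$(wc -l < sorted.diff | tr -d ' ')
echo "unsorted diff: $unsorted_lines lines, sorted diff: $sorted_lines lines"
```

After sorting, the only entry reported is the header that exists in the second list but not in the first.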
2>, >
Note: The software will generate the file all.fasta.slowprocess.log
~/programs/quickprocess all.fasta
As you noticed, the output quickly scrolls out of your shell window. Now we will run quickprocess
again and save the standard output in quickprocess.out and the error output in quickprocess.error:
~/programs/quickprocess all.fasta > quickprocess.out 2> quickprocess.error
You can now use more or less to inspect each output.
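The two redirections can be tried without the tutorial programs. In the sketch below, the shell function ``noisy'' is a hypothetical stand-in for a program like quickprocess: it writes one line to the standard output and one line to the error output.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical stand-in for a program that writes to both outputs.
noisy() {
    echo "result line"          # goes to the standard output
    echo "warning line" >&2     # goes to the error output
}

# ">" captures the standard output, "2>" captures the error output.
noisy > run.out 2> run.error

out_text=$(cat run.out)
err_text=$(cat run.error)
echo "run.out contains:   $out_text"
echo "run.error contains: $err_text"
```

Nothing appears on the screen while the function runs, because both outputs were sent to files.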
^z,bg,top)
~/programs/slowprocess all.fasta
You will notice that the shell ``freezes''. That means the program is running and the shell
is waiting for output or for the program to finish. To terminate it, just type ctrl-c.
~/programs/slowprocess
Now, to stop the current program you should type ctrl-z (that is, press the ``ctrl'' and the ``z'' keys
at the same time). This will stop the program, or put it to ``sleep''. You will notice
that, after typing ctrl-z, you can again type shell commands. To check whether the program is sleeping,
try typing
jobs
It will show that the program ``slowprocess'' is currently stopped. To resume the program you should type:
fg
This will put the program in the ``foreground'', blocking the shell again. There is an alternative:
you can type ctrl-z and then the command
bg
This will send the program to run in the ``background''. If you use the jobs command again, you will
see that the program ``slowprocess'' is now running.
2:''). We also want to start our process directly in the
background, so we need to type ``&'' at the end of the command line:
slowprocess 2> slowprocess.error &
You can check now if the process is actually running with the ``jobs'' command.
top
You will see 12 columns. The first one shows the process id number, which the system uses to identify the process. The second column, ``USER'', identifies the user that started the program. The 9th column, ``%CPU'', shows how much of the computer's processor is being used by each program. The 10th column, ``%MEM'', shows how much of the computer's memory is being used by each program. The 11th column, ``TIME+'', shows how much processor time each program has used since it started. Finally, the last column shows the shell command associated with each program.
tail slowAndBadProcess.error
Keep doing this once every minute until you see that ``something wrong'' is printed in the last line.
jobs
Your output can be slightly different depending on what you have done in your shell so far. The important thing is to note the job number of ``slowAndBadProcess''. Now, you can kill the program by using the kill command (supposing it is job [1]):
kill %1
Please note that we need to precede the job number with the ``%'' symbol.
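The kill command also accepts a process id (the number shown in the first column of top) instead of a job number. The sketch below illustrates this variant; it is not the tutorial's own procedure. The command ``sleep 30'' is a stand-in for a long-running program, and ``$!'' is the shell variable holding the process id of the most recently started background program.

```shell
# Start a stand-in for a slow program directly in the background.
sleep 30 &
pid=$!
echo "started background process $pid"

# kill -0 sends no signal; it only checks that the process still exists.
if kill -0 "$pid" 2>/dev/null; then state_before=running; else state_before=gone; fi

# Ask the process to terminate, then wait until the shell has collected it.
kill "$pid"
wait "$pid" 2>/dev/null || true

if kill -0 "$pid" 2>/dev/null; then state_after=running; else state_after=gone; fi
echo "before: $state_before, after: $state_after"
```

Without the ``%'' symbol the number given to kill is always read as a process id, which is why the job-number form needs it.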
diff all.fasta all.slowprocess.fasta
Remember, the program output will show only the differences between the two files.
Lines that occur only in the file ``all.fasta'' will be preceded by a ``<'' and lines occurring only
in ``all.slowprocess.fasta'' will be preceded by a ``>''.
scp visitor@mairinque.ime.usp.br:lotsaseqs.fasta .
The format of the command is always the same: first scp,
the command name; then the source file, specified as the
account, ``@'', the computer address, ``:'', and the name of
the file; and finally the destination, in our case the current
directory (remember, if we give a directory as the destination of a
copy command, files are copied with the same name). It is very
important that you do not put any spaces in the description of the source
file (you can see that in the command above).
When you issue the command, the computer will ask for a password. This is only natural, since the computer mairinque.ime.usp.br is checking whether you are really authorized to use the account. Now type the password ``20visit06'' (do not type the quote symbols!). The system will show the progress of the copy and then return to the shell. Now you can check if the file was really copied:
ls -l lotsaseqs.fasta
You will see that the file has just been created.
We still need to copy the programs, which are located in the subdirectory programs, on the same machine and account. Again, we use scp, but this time we add the ``-r'' option to copy the directory recursively and put the copy inside ``tutorial'':
scp -r visitor@mairinque.ime.usp.br:programs ~/tutorial
grep ">" lotsaseqs.fasta | wc
Remember, the first number is the line count.
grep ">" all.fasta | sort > all.fasta.headers
grep ">" lotsaseqs.fasta | sort > lotsaseqs.fasta.headers
These commands first extract all fasta headers from a file, then sort them, and finally store the
results in a file. Be careful: the first time we use the character ``>'', it is within quotes for
the grep command, meaning we want to select all lines that have ``>'' in them. The second time
we use ``>'', it has no quotes around it, which means we want to save the output of the previous
command in a file. Therefore, the output of sort is stored in the files ``all.fasta.headers'' and
``lotsaseqs.fasta.headers'', respectively. Finally, the pipeline operator (``|'') makes the output
of grep the input of sort.