Learning by doing


Throughout this section we explain the main Linux commands and actions by means of a tutorial. Each step proposes a task and shows you how to perform it. We strongly recommend that you read this chapter while carrying out the actions on the companion Knoppix system. Therefore, before starting this section, boot your computer from the system CD-ROM (see the section that explains how to create the system CD and how to boot it).

Creating a work environment (a simple directory structure)

When you log in you start in your home directory. Each user has his or her own home directory. This directory can be anywhere in the system, but in most Linux systems it is under the /home directory. In our case the home directory is ``/home/your account id''. If you use Konqueror to inspect your home directory (K menu, Internet submenu, Konqueror entry; click the ``home'' icon) you will see that the directory initially contains the sub-directories ``www'' and ``Desktop''. We could create a directory structure directly from Konqueror, but we will use the shell command line instead. For our tutorial we want the following structure:

recentFiles
results
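
As a minimal sketch, the two directories listed above could be created from the shell with mkdir. To keep the sketch safe to try, it works inside a scratch directory created with mktemp (an assumption of this example); in the tutorial you would run the mkdir line from your home directory.

```shell
# Work in a throwaway directory so the sketch does not touch your real home.
work=$(mktemp -d)
cd "$work"

# mkdir accepts several names and creates one directory for each.
mkdir recentFiles results

# List the result; each new directory appears as one entry.
ls -l
```

Typing ``cd'' with no argument afterwards takes you back to your home directory.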

Creating a local copy of an internet site

Task:

Copy the hypertext (html) version of the tutorial and FAQ of this chapter into your local machine's tutorial directory.

Comments:

One very useful Linux program is wget. With this program you can mirror a whole internet site on your computer with just one command. Wget also works if the site requires a password. To use wget you need to know the address of the main page you want to copy. If the page is password-protected you also need a valid user name and password. With wget you can copy only the page itself, the page and all the pages it links to, the pages those pages link to, and so forth. You can also restrict the copy to pages that live on the same internet server (so you can, for example, avoid copying advertising links from a commercial page).

Issuing the command:

First we need to go to the tutorials directory.

               cd tutorials

Now we should copy the pages. The main internet page for this tutorial is located at

        http://mairinque.ime.usp.br/~gubi/book/tutorial/index.html

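To mirror the site completely we call wget with the mirror option (-m) and with the -nH option^1.3, which stops wget from creating a local directory named after the server. The --cut-dirs=2 option additionally drops the first two directory levels of the address (~gubi/book) from the local path. Therefore, to copy the site completely you need to type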

        wget -nH -m --cut-dirs=2 http://mairinque.ime.usp.br/~gubi/book/tutorial/index.html

Not only the file index.html will be copied to your directory, but also all the files its links refer to. This means that all the pages reached by the links will be present, as well as all the pages those pages point to. If you use your local browser to open the local file index.html, you will see your local copy of the tutorial page. Try the links and you will see that all the respective pages are now local to your computer.

Sites with passwords: Many sites on the internet require a password to show some of their material. Wget can retrieve pages protected by passwords. If you know a valid user and its password you can use the options --http-user= and --http-passwd= (recent versions of wget spell the second option --http-password=). The book site has a part that is password protected. Try the command

 wget -nH -m http://mairinque.ime.usp.br/~gubi/book/protected/protected.html

You will get back an error message stating that you need a password to access the page. To retrieve the information you can use the user bookreader with the password bookworm. Let's try:

 wget -nH -m --http-user=bookreader --http-passwd=bookworm http://mairinque.ime.usp.br/~gubi/book/protected/protected.html

Now you have downloaded the protected pages.

Finding files in a local directory

Task:

Check the directory /home/bioinfo/tdr2006/sequences and its subdirectories, looking for all files with the extension ``.fasta''.

Comments:

Command line unix has a file search program similar to the ``search'' facility of Windows Explorer. Konqueror also has a similar facility. However, using the command line can be much faster than opening a browser and navigating through a set of menus.

Issuing the command:

To find a file we use the find program. This program needs two parameters: the directory in which to start the search, and the name of the file to be searched for. You can use wildcards to describe file names. Wildcards include the following characters:

* -- This wildcard stands for any character or series of characters, or even no character at all. In our case, writing *.fasta indicates names that start with any characters whatsoever, but necessarily end with ``.fasta''.
? -- The question mark stands for any single character. Writing ``?asta'' will match ``fasta'', ``pasta'', ``.asta'', and so on.
[list of characters] -- This expression is just like the `?' wildcard, but instead of standing for any character, it matches one of the characters in the list. A range may be indicated with a `-'. For instance, ``[fp]asta[0-9]'' will match ``fasta0, fasta1, fasta2, ..., fasta9, pasta0, ..., pasta9''.

Wildcards can be used anywhere in the specification of a name to be searched. To perform the task described in this item you type the command

 find /home/bioinfo/tdr2006/sequences -name "*.fasta" -print
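
The same kind of search can be tried on a small scratch tree. The sketch below is only an illustration: the directory comes from mktemp and the file names (a.fasta, b.fasta, notes.txt) are hypothetical.

```shell
# Build a small scratch tree (hypothetical file names) to search in.
top=$(mktemp -d)
mkdir -p "$top/sequences/sub"
touch "$top/sequences/a.fasta" "$top/sequences/sub/b.fasta" "$top/sequences/notes.txt"

# The pattern is quoted so that find, not the shell, interprets the wildcard.
found=$(find "$top/sequences" -name "*.fasta" -print | sort)
echo "$found"
```

Both a.fasta and sub/b.fasta are reported, because find also descends into subdirectories; notes.txt does not match the pattern.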

Checking file information, moving files

Task:

Check the date and size of the fasta files you found in the previous item, and copy the files that were created after December 2005 to the directory ~/recentFiles.

Comments:

There are many options the user can specify when issuing the list command (ls). In particular we can ask for a long listing, which shows not only the names of the files, but also their permissions, size, and date of last modification. The copy command (cp) can be used to copy one or more files. When we specify more than one file to be copied, the destination of the copy command needs to be a directory. In this case, each file is copied under its original name.

Issuing the commands:

You need first to issue the ls command with the ``-l'' option (for long):

         ls -l /home/bioinfo/tdr2006/sequences/fastaFiles

The files that were created after December 2005 are

newFile1.fasta
newFile2.fasta
newFile3.fasta
even_newer_file.fasta

Next you can issue a single copy command to copy all four files into the directory ~/recentFiles. Since we need to copy four files, it would be tiresome to type the complete path of every file. The easiest thing to do is to ``go'' to the /home/bioinfo/tdr2006/sequences/fastaFiles directory, so that we only need to type the names of the files (that is, their relative paths) and avoid typing long addresses many times. However, we still need to use the complete path for the destination directory.

         cd /home/bioinfo/tdr2006/sequences/fastaFiles
         cp newFile1.fasta newFile2.fasta newFile3.fasta even_newer_file.fasta  ~/recentFiles

Finally, to make sure the files were copied correctly, check the ``recentFiles'' directory.

       ls -l ~/recentFiles

Please note that, in the listing, all files appear as if they had been created today: cp gives each copy a new modification date.
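
The copy step can be rehearsed in a scratch area before touching real data. This is a sketch under stated assumptions: the directories are created with mktemp and the two empty files only stand in for the real sequence files.

```shell
# Scratch source and destination directories (stand-ins for the real ones).
base=$(mktemp -d)
mkdir "$base/fastaFiles" "$base/recentFiles"
touch "$base/fastaFiles/newFile1.fasta" "$base/fastaFiles/newFile2.fasta"

# Go to the source directory so the file names can be typed as relative paths.
cd "$base/fastaFiles"

# With several source files, the last argument must be a directory.
cp newFile1.fasta newFile2.fasta "$base/recentFiles"

# Long listing: permissions, size and modification date of each copy.
ls -l "$base/recentFiles"
```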

Now, remember to go to your new directory.

         cd ~/recentFiles

Inspecting file contents (more, less)

Task:

Inspect the files you just copied, checking if they are really all fasta files.

Comments:

Sometimes we need to take a ``quick look'' at a file, just to check what is in it. One option is to start a text editor program (such as, for example, Notepad). However, in Linux there is an easier way of looking inside a file: the more command. This command shows the contents of a file on your terminal screen, one page at a time (where a ``page'' is exactly the current size of the terminal window). To go to the next page, press the space bar; to move down one line only, press the enter key; to end the program, press the ``q'' key. There is a more sophisticated version of more called less (don't mind the name, it is ``computer scientist humor''). Less has more options, such as searching the text for specific strings and moving backwards in the text. More is always available in Linux systems; less is not. For more details on less, check the Linux online manual.

Issuing the commands:

To inspect a file, type the more command followed by the name of the file. You can also give more several files at once; in this case, more will browse the files one after the other. To see a single file (for example, newFile1.fasta), you should type:

         more newFile1.fasta

Try typing the ``enter'' key and the space bar to inspect the file and see what happens. Another option is to ask at once for all files to be inspected:

        more newFile1.fasta newFile2.fasta newFile3.fasta even_newer_file.fasta

Initially you get almost the same output as before, but when the first page of a file is displayed, the actual contents are preceded by a few lines giving the name of the file being displayed.

Also, as you reach the end of a file, the contents of the next one will appear. Please note that, on the last page of a file, there is a line of text at the bottom of the page indicating the next file to be displayed.

After you finish inspecting the files you will see that ``even_newer_file.fasta'' is not actually a fasta file: it contains normal text.

Try doing the same thing using the less command. Please note that you can move backwards using the ``page up'' key. Also, the uparrow and downarrow keys will work accordingly.

Changing file names

Task:

In the previous task we inspected the copied files and found out that the file ``even_newer_file.fasta'' is misnamed and does not contain genetic sequences in FASTA format. Now we will change the file's name to ``even_newer_file.txt''.

Comments:

In unix there is no ``rename'' command. Instead we have the ``move'' command, named mv. This command can be used either to change a file's location or simply to change a file's name. The user specifies the old name and the new name of the file. Both the old name and the new name can be relative paths or complete paths. When the two paths name different directories, the command both moves and renames the file.

Issuing the commands:

Simply type:

     mv even_newer_file.fasta even_newer_file.txt

Now if you list your files (using the ``ls -l'' command) you will see that your file has changed names. Please note that the date of the file remains unchanged.
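
The renaming can be sketched on a throwaway file. The scratch directory and the file contents below are assumptions made only for this example.

```shell
# Scratch directory holding one misnamed file.
dir=$(mktemp -d)
cd "$dir"
echo "plain text, not a sequence" > even_newer_file.fasta

# Same directory, new name: here mv acts as a rename.
mv even_newer_file.fasta even_newer_file.txt

# Only the new name is listed; the old one no longer exists.
ls -l
```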

Checking files for specific content

Task:

Find the fasta files that contain sequences generated by Gubialan. For this exercise we assume that the name of the person who generated the sequence is written in the header of each fasta sequence (that is, in the first line, the one that starts with ``>'').

Comments:

There is a very powerful search program in unix called grep. This program is used to inspect the contents of files, looking for specific patterns of characters. The program examines the files and prints all the lines that contain the searched pattern (in our case, the word ``Gubialan''). If more than one file is specified in the search, the name of the file is printed before each line, so the user can identify which lines came from which files. We will use both forms in this exercise.

Issuing the commands:

If we want to search for the word ``Gubialan'' in the file newFile1.fasta we should type

       grep "Gubialan" newFile1.fasta

This file contains two entries where the name appears in the header.

We could proceed and reissue the command for the other two files, substituting ``newFile1.fasta'' by ``newFile2.fasta'' and by ``newFile3.fasta''. However we can do this all in one command line:

        grep "Gubialan" newFile1.fasta newFile2.fasta newFile3.fasta

Notice that we now have several lines, each preceded by a file name. We can see that no entries from newFile3.fasta are displayed; therefore it is the file that does not contain sequences generated by Gubialan. A third way of issuing this command is to use ``wildcards''. Wildcards are characters with a special meaning in the shell command line. One that is particularly useful is ``*''. When ``*'' appears in a command line, it stands for ``any characters, any length''. When the Linux shell finds this character, it checks which existing file names match the expression and expands them inside the command line. In other words, if we type

        grep "Gubialan" newFile*.fasta

We have that ``newFile*.fasta'' means

``any existing file with name starting with 'newFile' and ending with '.fasta'''.

Therefore we will have the same command line as before, since ``*'' can in this case be successfully substituted by ``1'', ``2'' and ``3''. In the same way we can type an even shorter command:

        grep "Gubialan" newFile*

In this case Linux will again find every possible completion and issue the command. This time ``*'' is substituted by ``1.fasta'', ``2.fasta'' and ``3.fasta''. One important note: if there were other files with names starting with ``newFile'', they would also be included in the command line.
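
To see the expansion at work, the sketch below builds three tiny files in a scratch directory. Their contents are invented for the illustration and do not reproduce the tutorial's real data.

```shell
# Three small stand-in files; only the first two mention Gubialan.
dir=$(mktemp -d)
cd "$dir"
printf '>seq1 Gubialan\nACGT\n' > newFile1.fasta
printf '>seq2 Gubialan\nGGCC\n' > newFile2.fasta
printf '>seq3 Someone\nTTAA\n'  > newFile3.fasta

# echo shows the command line after the shell has expanded the wildcard.
echo newFile*.fasta

# grep receives the three names and prefixes each match with its file name.
matches=$(grep "Gubialan" newFile*.fasta)
echo "$matches"
```

The grep output contains one line for newFile1.fasta and one for newFile2.fasta, and nothing for newFile3.fasta.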

Checking file sizes

Task:

Check the number of lines of each of the three fasta files.

Comments:

The Linux shell offers a command, wc (from Word Count), to measure the size of files. It gives not only the number of bytes (characters) that the file contains, but also the number of words and the number of lines.

Issuing the commands:

To count the number of characters, words and lines of file ``newFile1.fasta'' we just need to type:

          wc newFile1.fasta

As a result three numbers will appear on the screen: respectively, the number of lines in the file, the number of words, and the number of characters. We can see from the output that newFile1.fasta has XXXX lines, YYYY words, and ZZZ characters.

We can also give the names of more than one file, either by typing all the file names or by using wildcards:

           wc newFile*

As you can see in the output, in this case the program shows not only the counts for each file, but also the total count.
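
A small sketch of wc on a file whose size is known in advance. The file contents are made up for the example, and the -l option (lines only) is an addition that the text above does not use.

```shell
# A two-line stand-in file.
dir=$(mktemp -d)
cd "$dir"
printf '>seq1 example\nACGT ACGT\n' > newFile1.fasta

# Prints lines, words and characters (bytes), followed by the file name.
wc newFile1.fasta

# -l restricts the output to the line count; reading from standard input
# keeps the file name out of the output.
lines=$(wc -l < newFile1.fasta)
echo "lines: $lines"
```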

Using grep to select lines in a file

Task:

Look at all the fasta headers of the fasta files.

Comments:

It is very common to have a file or a set of files in which you only want to look at specific lines. In the Windows world, users generally do this by opening the file in a text editor and then using the ``find'' facility to look at each line. This can be tiresome and will not work for very large files, since most text editors in Windows cannot handle really big files. The grep command, on the other hand, can handle files of any size. Moreover, the user only sees the lines he or she is interested in. The challenge is to find out which pattern we are searching for. In the case of the fasta header lines this is easy: we want the lines of the file that contain ``>''. Since ``>'' also has a special meaning for the shell, it must be written between quotes in the command.

Issuing the commands:

To see all fasta headers of the file newFile1.fasta, we need just to type:

          grep ">" newFile1.fasta

We can also look at the fasta headers of all fasta files of our example:

          grep ">" newFile*.fasta

By visual inspection we can see that there are a total of 17 fasta headers, and therefore, 17 sequences in our files.

Using grep and wc together

Task:

Check how many fasta sequences there are in each file.

Comments:

This task was actually performed manually in the previous item. In that case it was easy to count the sequences by hand because there were very few of them. However, if the files are large and contain a big number of sequences, manual counting becomes tiresome and error prone. To automatically count the lines output by the grep command, we can ``attach'' the wc command to it. In the Linux shell we can directly connect the output of one program to the input of another program. We do this by using pipes. Pipes are like intermediate files, but much faster and simpler to specify. A pipe in the shell is specified using the ``|'' character. If we want the output of program p1 to be used directly as the input of program p2, we just need to ``connect'' them using a pipe: p1 | p2.

Issuing the commands:

If we want to count how many lines are output by the command:

        grep ">" newFile1.fasta

We only need to ``pipe it in'' the command wc:

        grep ">" newFile1.fasta | wc

The first number in the output is the number of lines, which shows that newFile1.fasta contains 3 sequences.

We can do this also for all fasta files:

        grep ">" newFile*.fasta | wc

The output will show how many fasta headers the three files have in total.
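
The pipe can be tried on a stand-in file with a known number of headers. The file contents are invented, and wc is called with -l (lines only), an option the text above does not use, so that a single number comes out.

```shell
# A stand-in fasta file with three sequences.
dir=$(mktemp -d)
cd "$dir"
printf '>seq1\nACGT\n>seq2\nGGCC\n>seq3\nTTAA\n' > newFile1.fasta

# grep writes the header lines into the pipe; wc -l counts them.
# The quotes around ">" matter: unquoted, the shell would treat it as redirection.
headers=$(grep ">" newFile1.fasta | wc -l)
echo "headers: $headers"
```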

Using grep and wc together

Task:

Check how many fastas were generated by Gubialan.

Comments:

We can have more than one pipe in a single command line. This means we can connect two, three, four, any number of programs in a single processing sequence. This is particularly useful when we have filter programs like grep, so the user can perform many successive filtering steps, all in a single command line.

Issuing the commands:

In this case we want to perform two filterings and one counting: first we want to select only the fasta headers from the files:

        grep ">" newFile*.fasta

Next, we want to select only those lines that contain the word ``Gubialan''

        grep ">" newFile*.fasta | grep "Gubialan"

Finally we want to count these lines

        grep ">" newFile*.fasta | grep "Gubialan" | wc

The result will show the number of fasta sequences that were generated by Gubialan.
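
The three-stage pipeline can be checked on made-up data where the answer is known. Both files and their headers are assumptions of this sketch, and wc -l is used instead of plain wc to obtain a single number.

```shell
# Two stand-in files: two headers name Gubialan, one names someone else.
dir=$(mktemp -d)
cd "$dir"
printf '>seq1 Gubialan\nACGT\n>seq2 Someone\nGGCC\n' > newFile1.fasta
printf '>seq3 Gubialan\nTTAA\n' > newFile2.fasta

# Stage 1 keeps the headers, stage 2 keeps those naming Gubialan,
# stage 3 counts the lines that are left.
by_author=$(grep ">" newFile*.fasta | grep "Gubialan" | wc -l)
echo "sequences by Gubialan: $by_author"
```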

Concatenating files, saving a program's output in a file

Task:

Create a new file named ``all.fasta'', with the contents of all three fasta files.

Comments:

One common way to join the contents of different files is to use a text editor and its ``cut and paste'' tools, a process that is tiresome and error prone, and that cannot be applied to a large number of files or to very large files. In Linux, joining files can be done extremely fast using the cat command. This command displays the contents of a file on the screen. It is similar to more, seen above, but the contents are displayed all at once, with no pause after each page. The command cat can also be used to display the contents of many files, by means of wildcards. However, to fulfill the task of creating a new file we still need another element: ``output redirection''. In the shell, we can save the screen output of any command in a file by redirecting this output. To redirect the output of a command to a file we add, at the end of the command, the ``>'' character followed by the name of the file.

Issuing the commands:

We will use cat and wildcards to make the system output the contents of all the files at once:

        cat newFile*

However, if you try the command above you will notice that the contents of all the files are really put together, but they are all displayed in the screen. To create the file ``all.fasta'' we need now to redirect this output to the file:

        cat newFile* > all.fasta

You can now inspect your file and verify that it indeed contains the sequences of all the previous files. For this you can either use more:

        more all.fasta

Or, for example, check only the fasta headers to see if all the ones from the three files are in all.fasta:

        grep ">" all.fasta
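
A compact sketch of the whole step on two stand-in files (their contents are invented for the example):

```shell
# Two small stand-in files.
dir=$(mktemp -d)
cd "$dir"
printf '>seq1\nACGT\n' > newFile1.fasta
printf '>seq2\nGGCC\n' > newFile2.fasta

# The shell expands newFile* first; ">" then sends cat's output to all.fasta.
cat newFile* > all.fasta

# Both headers should now be found in the joined file.
grep ">" all.fasta
```

Note that the name of the output file does not match the wildcard. If it did (for example, a pattern such as *.fasta together with the output file all.fasta), the output file would itself be among the inputs.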

Searching files by name

Task:

Find all the files with names starting with ``latestPaper_v'' in the directory /home/bioinfo/tdr2006/researchGroup and its subdirectories.

Comments:

Searching for files by their names, or by parts of their names, is a very common task. Very frequently users download some files from the internet and realize later that they are not where they were supposed to be. For this task the unix shell has the program find. The user specifies the starting point of the search (a directory) and the name of the file to be searched for. As a result, find shows the path to every file matching the request. You can either specify the complete file name or use wildcards to specify only parts of the name.

Issuing the commands:

We want to find all files with names starting with ``latestPaper_v'', so we will accept anything after this initial part of the name. To specify this we just put the ``*'' wildcard at the end. We also want to start the search at the directory ``/home/bioinfo/tdr2006/researchGroup''. Therefore you should issue the command:

      find /home/bioinfo/tdr2006/researchGroup -name "latestPaper_v*"

As a result you will see the file names with their complete path.

Copying files

Task:

Create the directory ``papers'' in your home directory. Copy the files you found in the previous item to the new directory you just created. There should be two files, one in each subdirectory of /home/bioinfo/tdr2006/researchGroup (find, cp, cd).

Comments:

When copying files we can also use wildcards. It is important to remember that, if we use wildcards, we are probably specifying more than one file to be copied, so the destination has to be a directory.

Issuing the commands:

First we need to create the new directory. We first go to the home directory by using the command cd with no argument, and then use mkdir to create the new directory:

       cd
       mkdir papers

Now, there were two files found in the previous item: ``/home/bioinfo/tdr2006/researchGroup/dir1/latestPaper_v1.txt'' and ``/home/bioinfo/tdr2006/researchGroup/dir2/latestPaper_v3.txt''. There are many ways in which we can perform the copy, depending on whether we use relative paths or absolute paths and on our use of wildcards. We will show three.

In the first one we use only relative paths and no wildcards:

      cp ../bioinfo/tdr2006/researchGroup/dir1/latestPaper_v1.txt ../bioinfo/tdr2006/researchGroup/dir2/latestPaper_v3.txt papers

Now we use absolute paths for the source files:

      cp /home/bioinfo/tdr2006/researchGroup/dir1/latestPaper_v1.txt /home/bioinfo/tdr2006/researchGroup/dir2/latestPaper_v3.txt papers

Finally we try a shorter version, with wildcards:

      cp ../bioinfo/tdr2006/researchGroup/dir*/latestPaper_v* papers

Please note that wildcards were used twice: once to describe the directory names and once to specify the files.
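
The double use of wildcards can be rehearsed on a scratch copy of the layout. The tree below is built with mktemp and empty files, purely as an assumption for the example.

```shell
# Scratch copy of the directory layout, with empty stand-in files.
base=$(mktemp -d)
mkdir -p "$base/researchGroup/dir1" "$base/researchGroup/dir2" "$base/papers"
touch "$base/researchGroup/dir1/latestPaper_v1.txt" \
      "$base/researchGroup/dir2/latestPaper_v3.txt"
cd "$base"

# echo shows what the shell hands to cp after expanding both wildcards.
echo researchGroup/dir*/latestPaper_v*

# Several sources, so the last argument is the destination directory.
cp researchGroup/dir*/latestPaper_v* papers
ls papers
```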

Comparing the contents of files

Task:

Check the differences between the two files, and verify how many sequences the latest version specifies (diff, more).

Comments:

Files are constantly changing. Sometimes, to avoid losing important information due to bad editing, users keep many versions of a file instead of always modifying the same file. This way we limit the damage of editing mistakes. Another common situation (as is the case with this chapter) is that two people are working on different parts of the same file. Finally, there is the case where we just need to know the differences between two arbitrary files. We can do all this in Linux using the diff command. We type the command name followed by the names of the two files, and a description of the differences is shown in the terminal window.

Issuing the commands:

To find out the differences between the files let's move first into directory ``papers'':

      cd papers

Now, to find out the differences between the files ``latestPaper_v1.txt'' and ``latestPaper_v3.txt'' we use the diff command:

       diff latestPaper_v1.txt latestPaper_v3.txt

The output of the program shows only the differences between the files. First, the line numbers involved are given. In our output the first difference starts with ``1c1,2'', which means that line 1 of the first file was changed into lines 1 and 2 of the second file. Next come the contents of the lines of the first file, preceded by ``<'', and the contents of the lines of the second file, preceded by ``>''. The next differences are described in the same way.
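
The output format is easier to read on a tiny example. The two files below (v1.txt and v2.txt) are hypothetical and were chosen so that the first difference is also reported as ``1c1,2''.

```shell
# Two small files: line 1 of the first becomes lines 1 and 2 of the second.
dir=$(mktemp -d)
cd "$dir"
printf 'alpha\nbeta\n'        > v1.txt
printf 'ALPHA\nextra\nbeta\n' > v2.txt

# diff exits with status 1 when the files differ, so "|| true" keeps
# scripts that stop on errors from aborting here.
report=$(diff v1.txt v2.txt || true)
echo "$report"
```

The report starts with the line ``1c1,2'', followed by ``< alpha'', a separator line ``---'', and then ``> ALPHA'' and ``> extra''. The common line ``beta'' is not shown.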

Returning to the home directory

Task:

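Go back to the sequence directory (~/recentFiles).

Comments:

As seen in the previous items, cd with no argument takes you to your home directory. From there, a relative path leads to the directory that holds the sequence files.

Issuing the commands:

       cd
       cd recentFiles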

Checking for new sequences in fasta files

Task:

Find out the names of the sequences of newFile2.fasta that are not present in newFile1.fasta. All the sequences will have a similar name, with a common prefix and a number. Verify the names of the sequences and check if there is a sequence missing (grep, sort, more)

Comments:

We need to compare the names of sequences. So far we know how to select the sequence names using grep, how to store the results of a program in a file using >, and how to compare the contents of two files using diff. If we want to know which sequences are in newFile2.fasta and not in newFile1.fasta, we first use grep to isolate the sequence names in two files, ``newFile1.names'' and ``newFile2.names'' (these are example names; any name can be used). Then we can use the command diff on the new files:

       grep ">" newFile1.fasta > newFile1.names
       grep ">" newFile2.fasta > newFile2.names
       diff newFile1.names newFile2.names

Checking for new sequences in fasta files with unordered sequences

Task:

If you examine closely the output of the diff command of the last item you will notice a problem: some sequences that are in both files are reported as differences (see sequences XXXX and YYY). Try to correct the problem.

Comments:

The problem is that diff reads both files in order and checks the differences line by line. If the sequences are not in the same order in both files, the results will be misleading. Therefore, before comparing the sequence names, we need to sort both files to be sure the names appear in the same order. The unix shell provides a sorting command, sort, that reads a file and prints its lines in alphabetical order. If, after selecting the sequence names, we sort them before saving them in the names files, the procedure should work appropriately.

Issuing the commands:

We will perform almost the same task as before, but we can now ``pipe'' the result of the grep program into sort before storing them. We will store the results under new file names for clarity.

       grep ">" newFile1.fasta | sort > newFile1.names.sorted
       grep ">" newFile2.fasta | sort > newFile2.names.sorted
       diff newFile1.names.sorted newFile2.names.sorted

Check the results and you will see that now there are no common sequences displayed in the diff output.
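
The effect of sorting can be verified on stand-in data in which the two files share two sequences but list them in a different order. All sequence names below are invented for the sketch.

```shell
# newFile1 lists seq2 before seq1; newFile2 has both in order, plus seq3.
dir=$(mktemp -d)
cd "$dir"
printf '>seq2\nGGCC\n>seq1\nACGT\n' > newFile1.fasta
printf '>seq1\nACGT\n>seq2\nGGCC\n>seq3\nTTAA\n' > newFile2.fasta

# Select the headers, sort them, and only then store them.
grep ">" newFile1.fasta | sort > newFile1.names.sorted
grep ">" newFile2.fasta | sort > newFile2.names.sorted

# diff exits with status 1 when the files differ, hence "|| true".
new_names=$(diff newFile1.names.sorted newFile2.names.sorted || true)
echo "$new_names"
```

Only the genuinely new sequence is reported: the output is ``2a3'' followed by ``> >seq3''. The first ``>'' is diff's marker for a line of the second file, and the second one belongs to the fasta header itself.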

Running programs, output redirection

Task:

Run the program ``quickprocess'' on the file all.fasta. Then run it again, saving the output and the error messages in different files (>, 2>). Obs: the software will also generate the file all.fasta.slowprocess.log.

Comments:

Running a specific program from the command line is like running a shell command (in fact, all the shell commands we have seen are Linux programs). However, when programs generate a lot of screen output, we can save this output in files. To understand this we first need to explain the concepts of standard output and error output. In Linux, programs have three types of output: file output, standard output and error output. Normally, whatever is sent to standard output and to error output is shown in the shell window. However, the user can redirect either or both outputs to files. To redirect a program's standard output to a file we use the ``>'' character, and to redirect the error output to a file we use the ``2>'' characters (with no space between them).

Issuing the commands:

We want to run the program ``quickprocess'' on the file all.fasta. This program was downloaded when you performed the wget command and is in the directory ``programs''. Initially we will run the program ``quickprocess'' normally:

      ~/programs/quickprocess all.fasta

As you noticed, the output quickly scrolls out of your shell window. Now we will run quickprocess again, saving the standard output in quickprocess.out and the error output in quickprocess.error:

      ~/programs/quickprocess all.fasta > quickprocess.out 2> quickprocess.error

You can now use more or less to inspect each output^1.4.
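
The separation of the two outputs can be observed without the real program. In the sketch below, the shell function noisy is a hypothetical stand-in for quickprocess that writes one line to each output.

```shell
dir=$(mktemp -d)
cd "$dir"

# Hypothetical stand-in for quickprocess: one line on each output stream.
noisy() {
    echo "result line"         # goes to standard output
    echo "warning line" >&2    # goes to error output
}

# ">" captures standard output, "2>" captures error output.
noisy > quickprocess.out 2> quickprocess.error

cat quickprocess.out
cat quickprocess.error
```

After the run, quickprocess.out holds only ``result line'' and quickprocess.error holds only ``warning line''.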

Running a slow program, stopping it, running it in the background

Task:

Run the program ``slowprocess'' on the file all.fasta. This is a slow program and it will not end soon. End the program. In the following items we will run it in the background and check its CPU usage (^z, bg, top).

Comments:

In Linux you can end any program that is being run from the shell. This is an important feature when a program starts taking much more time than anticipated. To end a program you only have to type ctrl-c (that is, press the ``ctrl'' and the ``c'' keys at the same time).

Issuing the commands:

To start the slowprocess program for the file all.fasta just type:

       ~/programs/slowprocess all.fasta

You will notice that the shell ``freezes''. That means that the program is running and that the shell is waiting for its output or for it to finish. To end it just type ctrl-c.

Running processes in the background

Task:

Run program slowprocess, stop it, make it continue in the background.

Comments:

Linux actually runs many programs at the same time. You can see this when you have a clock program in your taskbar at the same time that you are using the shell. In the previous exercise you terminated the execution of a program in the shell by typing ctrl-c. Alternatively, you can stop the program temporarily and later have it resume its execution exactly at the point where it was stopped. Once a program is stopped in the shell you can restart it normally, or make it run in the background. Programs running in the background behave normally, but you can keep using the shell and typing new commands while the original program runs. This means you can have many programs running in the shell's background. The shell also offers a command, ``jobs'', to check which programs are currently running or sleeping in the shell.

Issuing the commands:

Start the program ``slowprocess'':

          ~/programs/slowprocess

Now, to stop the current program you should type ctrl-z (that is, press the ``ctrl'' and the ``z'' keys at the same time). This will stop the program, or put it to ``sleep''. You will notice that, after typing ctrl-z, you can again type shell commands. To check that the program is sleeping, try typing

      jobs

It will show that the program ``slowprocess'' is currently stopped. To resume the program you should type:

fg

which will put the program in the ``foreground'', blocking the shell again. There is an alternative: you can type ctrl-z again and then the command

bg

which will send the program to run in the ``background''. If you use the jobs command again, you will see that the program ``slowprocess'' is now running.

Killing a process, starting slowprocess directly in the background, error output in a file

Task:

Kill the running process ``slowprocess'' and restart it directly in the background, storing the error output in the file ``slowprocess.error''.

Comments:

We have seen how to put a running process in the background. However, we can also start a process directly in the background, saving some typing. If we want a process to run directly in the background we should type ``&'' at the end of the command line (just before pressing ``enter''). The shell prompt will return immediately and the process will run at the same time, in the background.

Issuing the commands:

We want the error output of our program to go into the file ``slowprocess.error''; to do this we will use error output redirection (``2>''). We also want to start our process directly in the background, so we need to type ``&'' at the end of the command line:

       ~/programs/slowprocess 2> slowprocess.error &

You can check now if the process is actually running with the ``jobs'' command.

Check CPU usage

Task:

Check how much of the computer's capacity is being used by the program slowprocess.

Comments:

We have seen the command ``jobs'', which lists all the processes being run from the shell. However, this command only shows what is running, not how much of the computer's capacity is allocated to it. Linux offers a program, top, that shows how much of the computer's memory and cpu time is being used by each program at a given moment. It is important to note, however, that this program shows ALL the programs being run on the computer, by you and potentially by other users too.

Issuing the commands:

The program ``slowprocess'' should still be running. To check how it is using the system you should type

top

You will see 12 columns. The first one shows the process id number, which the system uses to identify the process. The second column, ``USER'', identifies the user that started the program. The 9th column, ``%CPU'', shows how much of the computer's processor is being used by each program. The 10th column, ``%MEM'', shows how much of the computer's memory is being used by each program. The 11th column, ``TIME+'', shows how much processor time the program has used since it started (this is not the same as the time elapsed on the clock). Finally, the last column shows the shell command associated with each program. To leave top and return to the shell, press ``q''.

Checking the final lines of a file

Task:

Interactively check the final lines of slowAndBadProcess.error (tail); when you see ``something wrong'' printed, kill the program.

Comments:

In the field of Bioinformatics it is not uncommon to have programs that run for a long time. This can be due to at least three different reasons: some other program is using all the computer's resources, your program is simply slow, or something is wrong with your program. When something goes wrong, generally some message will appear at the end of the program's output. An easy way to check the end of the output is to send it to a file and keep checking the last lines of that file. This way you can keep all output saved in a file for later detailed inspection and, at the same time, avoid browsing the whole file to reach the last lines. To check the last lines of a file we use the command tail.

Issuing the commands:

The program ``slowAndBadProcess'' was designed to misbehave. However, it will take some time for this to happen. When it happens, the program will print ``something wrong'' in the error output. In the previous exercises you already started the ``slowAndBadProcess'' program, sending its error output to ``slowAndBadProcess.error''. To check the last lines of that file, you should issue the command:

       tail slowAndBadProcess.error

Repeat this command about once every minute until you see ``something wrong'' printed in the last line.
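
The repeated check can also be left to the shell. The following is only a sketch under stated assumptions: fake_bad_process is a hypothetical stand-in for slowAndBadProcess, the file lives in a temporary directory, and the pause between checks is one second instead of one minute so that the example finishes quickly.

```shell
workdir=$(mktemp -d)
errfile="$workdir/slowAndBadProcess.error"
: > "$errfile"                            # create the file so tail can always open it

# Hypothetical stand-in: behaves at first, then reports a problem.
fake_bad_process() {
    echo "step 1 ok" >&2
    sleep 1
    echo "something wrong" >&2
}
fake_bad_process 2> "$errfile" &

# Look at the last line over and over until the message shows up.
until tail -n 1 "$errfile" | grep -q "something wrong"; do
    sleep 1
done

last_line=$(tail -n 1 "$errfile")
echo "last line: $last_line"
wait                                      # let the stand-in finish
```

At the prompt, ``tail -f slowAndBadProcess.error'' is another option: it keeps the file open and prints new lines as they arrive, until you press ctrl-c.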

Killing a program running in the background

Task:

Kill the program slowAndBadProcess

Comments:

When a program starts misbehaving it is good practice to kill it, especially if it consumes a lot of the system's resources, which is the case of slowAndBadProcess. Once you know you need to kill a program, you should first check its number in the shell process roster. You can do this by using the command jobs. Once you know the number you can use the kill command to terminate that program.

Issuing the commands:

First check which is the process number in the shell roster by using the jobs command:

       jobs

Your output can be slightly different depending on what you have done in your shell so far. The important thing is to note the number of ``slowAndBadProcess''. Now you can kill the program by using the kill command (supposing it is process [1]):

        kill %1

Please note that we need to precede the number of the process with the ``%'' symbol.
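
The kill command also accepts a process id (the number shown in the first column of top) instead of a ``%'' job number. The sketch below uses that form; ``sleep 60'' is only an assumed stand-in for slowAndBadProcess, and the variable names are illustrative.

```shell
sleep 60 &                                # stand-in for a long-running program
pid=$!                                    # process id of the job just started

kill "$pid"                               # ask the process to terminate
wait "$pid" 2>/dev/null || true           # collect it so the shell forgets the job

# "kill -0" sends no signal; it only tests whether the process still exists.
if kill -0 "$pid" 2>/dev/null; then
    still_running=yes
else
    still_running=no
fi
echo "still running: $still_running"
```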

Wait for slowprocess to finish

Task:
Comments:
Issuing the commands:

Comparing two similar files

Task:

Check the differences between all.fasta and all.slowprocess.fasta

Comments:

We can now use the diff command to compare the contents of our files all.fasta and all.slowprocess.fasta.

Issuing the commands:

        diff all.fasta all.slowprocess.fasta

Remember, the program output will show only the differences between the two files. Lines that occur only in file ``all.fasta'' will be preceded by a ``<'' and lines occurring only in ``all.slowprocess.fasta'' will be preceded by a ``>''.

Copying a file from a user account on a remote computer

Task:

Copy the file lotsaseqs.fasta and some programs from the home directory of user visitor in computer mairinque.ime.usp.br.

Comments:

File transfer is probably one of the most common computer tasks today. Millions of people surf the internet and download all kinds of files. Files are also exchanged using email. However, it is not trivial to make a file available on the internet, and there is always a limit on the size of files that can be sent by email. Linux permits transferring ANY file of ANY size between two Linux/Unix computers connected to the internet, provided the user has the appropriate permissions. You just need an account name and password on both machines. This is particularly useful when you are away from your computer and find out that you need a file that you did not make available on the internet. Copying files between different machines is not much different from copying files within the same machine. However, you need to specify some extra information: the internet address of the remote computer and the user account you are going to use on that computer (remember that Linux organizes all system security around user accounts). The command to perform a remote copy is scp (from Secure CoPy).

Issuing the commands:

In our exercise we are bringing some files from the remote computer mairinque.ime.usp.br. Since you need an account we provided a visitor account named ``visitor''. To copy the file you should issue the command

        scp visitor@mairinque.ime.usp.br:lotsaseqs.fasta .

The format of the command is always the same: first scp, the command name; then the source file, written as the account, ``@'', the computer address, ``:'', and the name of the file; and finally the destination, in our case the current directory (remember, if we give a directory as the destination of a copy command, files are copied with the same name). It is very important that you do not put any spaces inside the source file description (as you can see in the command above).

When you issue the command, the computer will ask for a password. This is only natural, since the computer mairinque.ime.usp.br is checking whether you are really authorized to use the account. Now type the password ``20visit06'' (do not type the quote symbols!). The system will show the progress of the copy and then return to the shell. Now you can check that the file was really copied:

      ls -l lotsaseqs.fasta

You will see that the file has just been created.

We still need to copy the programs. They are located in the subdirectory programs, in the same machine and account. Again we use scp, but this time we add the ``-r'' option to copy the directory recursively and put the copy inside ``tutorial'':

            scp -r visitor@mairinque.ime.usp.br:programs ~/tutorial

Counting fasta sequences in a file

Task:

Check how many sequences were downloaded with the new file.

Comments:

We have performed this task before: we need to use grep to select the fasta header lines, and then use wc to count the number of selected lines.

Issuing the commands:

We can run the two commands, connecting them with a Linux ``pipe''.

       grep ">" lotsaseqs.fasta | wc

Remember, the first number is the line count.
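
A small self-contained check of the counting pipeline follows, using a made-up file with three sequences (sample.fasta and its contents are assumptions for illustration). Passing ``-l'' to wc prints only the line count, and grep's ``-c'' option counts the matching lines directly.

```shell
workdir=$(mktemp -d)
# Made-up fasta file with three sequences.
printf '>seq1\nACGT\n>seq2\nGGCC\nTTAA\n>seq3\nCCGG\n' > "$workdir/sample.fasta"

# Same pipeline as above, but "wc -l" prints only the number of lines.
grep ">" "$workdir/sample.fasta" | wc -l

# The same count done by grep alone; "^>" matches ">" only at the start of a line.
count=$(grep -c '^>' "$workdir/sample.fasta")
echo "$count"
```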

Investigating fasta files

Task:

Compare the files all.fasta and lotsaseqs.fasta, checking which are the sequences in all.fasta that are not in lotsaseqs.fasta, and vice versa (sort, diff)

Comments:

We are trying to find the sequences present in one file and not in the other, and vice versa. The ``hard'' way to do it is to compare the sequences by their content. However, this would be too hard to do without writing a computer program. Instead, we will compare sequences by their names: we take the sequence names from both files and check which ones differ. We know that we can check differences between two files using diff. However, for diff to work properly, we need the sequences to be in alphabetical order. We cannot do this for the whole sequences, but we can work with only the fasta headers instead. If we extract only the fasta headers from both files and sort them, we can then use diff to perform the comparison. We know from previous exercises that we can extract the fasta headers using grep, and that we can sort lines using sort.

Issuing the commands:

We first need to separate the fasta headers and sort them. This result needs to be stored in a temporary file:

        grep ">" all.fasta | sort > all.fasta.headers
        grep ">" lotsaseqs.fasta | sort > lotsaseqs.fasta.headers

These commands first extract all fasta headers from a file, then sort them, and finally store the results in a file. Be careful: the first time we use the character ``>'' it is within quotes, as an argument to grep, meaning we want to select all lines that have ``>'' in them. The second time we use ``>'' it has no quotes around it, which means we want to save the output of the previous command in a file. Therefore, the output of sort is stored in files ``all.fasta.headers'' and ``lotsaseqs.fasta.headers'', respectively. Finally, the pipeline operator (``|'') makes the output of grep the input of sort.
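
With both sorted header files in place, the comparison itself is a single diff call on the two ``.headers'' files. The sketch below runs the whole sequence on two tiny made-up files (a.fasta and b.fasta are assumptions standing in for all.fasta and lotsaseqs.fasta) so that the result is easy to read.

```shell
workdir=$(mktemp -d)
# Two made-up fasta files: seqA is in both, seqB only in a, seqC only in b.
printf '>seqB\nGGCC\n>seqA\nACGT\n' > "$workdir/a.fasta"
printf '>seqC\nTTAA\n>seqA\nACGT\n' > "$workdir/b.fasta"

# Extract the headers and sort them, as in the commands above.
grep ">" "$workdir/a.fasta" | sort > "$workdir/a.fasta.headers"
grep ">" "$workdir/b.fasta" | sort > "$workdir/b.fasta.headers"

# diff exits with status 1 when the files differ, hence "|| true".
diff_out=$(diff "$workdir/a.fasta.headers" "$workdir/b.fasta.headers" || true)
echo "$diff_out"
```

In the output, the header found only in the first file is preceded by ``<'' and the header found only in the second file is preceded by ``>''; the shared header seqA is not listed.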

gubi
2006-01-18