Data Management Tutorial

Grid Induction - 4th Hands-on Workshop,
NA3 Training Team, Israel

 


So far we have used files located on the UI for our computational tasks.
Note that these files were transferred to and from the WN via the InputSandbox and OutputSandbox JDL attributes.
As mentioned previously, this may not always be desirable, especially when the computational task requires large amounts of data.

The purpose of this tutorial is to acquaint you with the Grid Data Management tools, which enable you to use files stored on the Grid for your computational task and to store files created by your job on a Grid SE.


Our first step will be to find out which Storage Elements (SEs) are accessible to us.


Log in to grid-tutor.ct.infn.it with ssh, using the username and password you used in the previous sections.
Then run the following command, which lists the accessible SEs:

% lcg-infosites --vo gilda se


The output of the command is:

Avail Space(Kb)  Used Space(Kb)  SEs
----------------------------------------------------------
50350000         2790000         trigrid-ce01.unime.it
767750000        41110000        gildase01.roma3.infn.it
54240000         3760000         iceage-se-01.ct.infn.it
68320000         4970000         gildase.oact.inaf.it
51890000         3920000         grid038.ct.infn.it
2590000000       900000000       aliserv6.ct.infn.it
51890000         3920000         grid038.ct.infn.it
27944372         3215408         testbed005.cnaf.infn.it
63810000         10260000        egee016.cnaf.infn.it
898705096        13702320        grid005.iucc.ac.il

* For more information, use lcg-infosites --help
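
If you prefer not to pick an SE by eye, the listing can be filtered with awk. The following is only a sketch: pick_largest_se is a hypothetical helper (it is not part of the LCG tools), and it assumes the three-column layout shown above (available space, used space, host name). The demo feeds it a few rows copied from the listing instead of calling the Grid.

```shell
#!/bin/sh
# pick_largest_se: hypothetical helper, not an LCG command.
# Reads "lcg-infosites --vo <vo> se" style output on stdin and prints
# the host name of the SE with the most available space.
pick_largest_se() {
    # Only rows with exactly three fields, the first two numeric, are data rows;
    # this skips the header, the dashed rule and any footer text.
    awk 'NF == 3 && $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ {
             if ($1 + 0 > max) { max = $1 + 0; se = $3 }
         }
         END { if (se != "") print se }'
}

# Demo on rows copied from the listing above (no Grid access needed).
pick_largest_se <<'EOF'
Avail Space(Kb)  Used Space(Kb)  SEs
----------------------------------------------------------
50350000         2790000         trigrid-ce01.unime.it
2590000000       900000000       aliserv6.ct.infn.it
898705096        13702320        grid005.iucc.ac.il
EOF
# prints: aliserv6.ct.infn.it

# On the UI you would pipe the real command into the helper instead:
#   lcg-infosites --vo gilda se | pick_largest_se
```

Free space is only one possible criterion; an SE close to the site where your job runs may be a better choice.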

Our job will use files that are stored on the Grid.
Choose one of the SEs (from the list returned by the lcg-infosites command) to which you will upload your files.



Before we upload the files, each of you will create your own "private" directory in the file catalog.
Name the directory after the user name you have been given (e.g., mine is kunikver).
The following command creates a directory in the file catalog:

% lfc-mkdir -p /grid/<vo>/<user_name>


(e.g., lfc-mkdir -p /grid/gilda/kunikver)


The following command uploads a file (i.e., transfers it from your UI to a Storage Element) and registers it on the file catalog.

% lcg-cr -d <SE_name> -l lfn:/grid/<vo>/<user_name>/<any_string_you_want> --vo gilda <src_file>


<SE_name> = the name of the SE.
lfn = the logical file name.
Note: the lfn must be of the form lfn:/grid/gilda/<any_string_you_want>; otherwise you will receive an "Invalid lfn" error.
--vo = Virtual Organization.
<src_file> = the physical file you want to upload from the UI.
It must have the following format: file:/home/<user_name>/<dir>/<file_name> (including the file extension).
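
Since a malformed lfn or file: argument is the most common cause of errors here, it can help to assemble both strings in one place. The sketch below uses two hypothetical helper functions, make_lfn and make_file_url (neither is part of the LCG tools), and only prints the resulting lcg-cr command rather than running it.

```shell
#!/bin/sh
# Hypothetical helpers for building lcg-cr arguments; not LCG commands.

# make_lfn <vo> <user_name> <file_name>: logical file name in the catalog.
make_lfn() {
    printf 'lfn:/grid/%s/%s/%s\n' "$1" "$2" "$3"
}

# make_file_url <path>: file: URL for a file on the UI.
# Relative paths are made absolute, because the file: form needs a full path.
make_file_url() {
    case "$1" in
        /*) printf 'file:%s\n' "$1" ;;
        *)  printf 'file:%s/%s\n' "$(pwd)" "$1" ;;
    esac
}

# Dry run: print the command instead of executing it.
echo lcg-cr -d egee016.cnaf.infn.it \
     -l "$(make_lfn gilda kunikver vered.txt)" \
     --vo gilda "$(make_file_url /home/kunikver/vered.txt)"
# prints: lcg-cr -d egee016.cnaf.infn.it -l lfn:/grid/gilda/kunikver/vered.txt --vo gilda file:/home/kunikver/vered.txt
```

Removing the leading echo would run the upload for real, which requires a valid proxy and access to the chosen SE.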

Now we are going to upload a file from the UI to the Grid.
Create a file containing your name and save it as "<your_name>.txt" (e.g., vered.txt).
To upload the file, type the following command:

% lcg-cr -d <SE_name> -l lfn:/grid/<vo>/<user_name>/<file_name> --vo <vo> file:/home/<username>/<file_name>


(e.g., lcg-cr -d egee016.cnaf.infn.it -l lfn:/grid/gilda/kunikver/vered.txt --vo gilda file:/home/kunikver/vered.txt).

The output of the command is:

guid:cf281639-4ad6-4a09-8f98-cc4efdb91857

This is the GUID of the file you have just uploaded to the Grid.



Congratulations !!! You have just uploaded your own file to a Grid SE :-)



The lcg-lg (list GUID) command returns the guid associated with a specified lfn or surl.
Use the lfn you've given to the file you've uploaded to the SE to retrieve its guid.

% lcg-lg --vo <vo> lfn:/grid/<vo>/<user_name>/<your_file_name>


(e.g., lcg-lg --vo gilda lfn:/grid/gilda/kunikver/proteins1.fasta).

The output of the command is:

guid:ef8a45b3-8af9-46d6-a438-bd4b38c94354
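
If you need the GUID in a script, the "guid:" prefix can be stripped with sed. This is a minimal sketch under the assumption that the output has exactly the form shown above; extract_guid is a hypothetical helper name, and the demo uses the sample line rather than a live lcg-lg call.

```shell
#!/bin/sh
# extract_guid: hypothetical helper, not an LCG command.
# Reads lcg-lg (or lcg-cr) output on stdin and prints the bare GUID,
# i.e. the 36-character identifier that follows "guid:".
extract_guid() {
    sed -n 's/^guid:\([0-9a-fA-F-]\{36\}\)$/\1/p' | head -n 1
}

# Demo with the sample output shown above.
echo "guid:ef8a45b3-8af9-46d6-a438-bd4b38c94354" | extract_guid
# prints: ef8a45b3-8af9-46d6-a438-bd4b38c94354

# On the UI:
#   lcg-lg --vo gilda lfn:/grid/gilda/kunikver/proteins1.fasta | extract_guid
```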



Now let's prepare the JDL for the job we are going to run.
Create the following jdl (alignment.jdl) and place it under your home directory.

[
 Type="Job";
 JobType="Normal";
 VirtualOrganisation="gilda";
 Executable="/bin/sh";
 Arguments="alignment.sh alignment.exe proteins1.fasta proteins2.fasta blosum62.txt my_alignment";
 StdOutput="alignment.out";
 StdError="alignment.err";
 InputSandbox={"alignment.sh"};
 OutputSandbox={"alignment.err", "alignment.out"};
 RetryCount=0;
 Rank=other.GlueCEStateFreeCPUs;
]

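As you can see, the JDL runs alignment.exe on the protein sequences in the files proteins1.fasta and proteins2.fasta.
Only alignment.sh is listed in the InputSandbox, so the executable, the sequence files and the substitution matrix are not shipped from the UI. Hence, we are using files stored on the Grid for our job.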
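
Because the Executable is /bin/sh, the first word of Arguments (alignment.sh) is the script that the shell runs, and the remaining words reach that script as the positional parameters $1 to $5. The short sketch below illustrates the mapping; show_args is a hypothetical stand-in for alignment.sh.

```shell
#!/bin/sh
# show_args: hypothetical stand-in for alignment.sh, used only to show
# how the words after "alignment.sh" in Arguments arrive as $1..$5.
show_args() {
    echo "exe=$1 seq1=$2 seq2=$3 matrix=$4 output=$5"
}

show_args alignment.exe proteins1.fasta proteins2.fasta blosum62.txt my_alignment
# prints: exe=alignment.exe seq1=proteins1.fasta seq2=proteins2.fasta matrix=blosum62.txt output=my_alignment
```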



Now, let's prepare the shell script (alignment.sh) that the JDL runs. Place it in your home directory, next to alignment.jdl.
The script downloads the required files from the Grid (by using the lfn) to the WN (where the computation itself takes place).
The output of the program is initially saved into a file located on the local machine (WN).
When the program terminates, the output file is saved on the Grid and registered in the file catalog (using the lcg-cr command).

#!/bin/sh

##@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@##
## Author : Vered Kunik - kunikver@post.tau.ac.il
## Description:
## this program executes a local alignment algorithm on the allocated WN.
##
## input parameters:
## $1 = name of exe file
## $2 = name of first alignment file (in fasta format)
## $3 = name of second alignment file (in fasta format)
## $4 = name of substitution matrix (blosum62)
## $5 = name of output file to which the alignment will be flushed
##@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@##

# read the input parameters
localAlignExe=$1;
proteins1=$2;
proteins2=$3;
blosum62=$4;
output_file=$5;

# ensuring that the correct catalog service is used.
export LCG_GFAL_INFOSYS="grid004.ct.infn.it:2170";
export LFC_HOST="lfc-gilda.ct.infn.it";
export LCG_CATALOG_TYPE="lfc";

# creating an empty file on the WN.
touch $output_file;

# downloading the files from the SE to the WN
# (replace kunikver with your own user name).
lcg-cp --vo gilda lfn:/grid/gilda/kunikver/$proteins1 file:`pwd`/$proteins1;
lcg-cp --vo gilda lfn:/grid/gilda/kunikver/$proteins2 file:`pwd`/$proteins2;
lcg-cp --vo gilda lfn:/grid/gilda/kunikver/$blosum62 file:`pwd`/$blosum62;
lcg-cp --vo gilda lfn:/grid/gilda/kunikver/$localAlignExe file:`pwd`/$localAlignExe;

chmod 755 $localAlignExe;

# running the executable.
./$localAlignExe -mode ss2ss -seq $proteins1 -seq2 $proteins2 -matrix $blosum62 -A 11 -B 1 -ethresh 1.0e3 -dbsize 1 -align >> $output_file;

# uploading the output to the SE (use your own user name in the lfn).
lcg-cr --vo gilda -d gilda-se-01.pd.infn.it -l lfn:/grid/gilda/kunikver/$output_file file:`pwd`/$output_file;


if [ $? -eq 0 ]
then
echo "the output file was successfully copied to the SE"
else
echo "unable to copy output file to the SE"
fi
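
Note that the script above checks the result of the final lcg-cr only; a failed lcg-cp goes unnoticed until the executable cannot find its input. One possible way to catch this earlier is a small wrapper, sketched below. fetch_from_grid and the GRID_CP variable are hypothetical names introduced for illustration; GRID_CP defaults to lcg-cp and exists only so that the function can be tried out without Grid access.

```shell
#!/bin/sh
# GRID_CP is a hypothetical override hook (assumption): the copy command
# to use. It defaults to the real lcg-cp.
GRID_CP=${GRID_CP:-lcg-cp}

# fetch_from_grid <lfn_dir> <file_name>
# Copies one catalog entry into the current directory and reports failure
# through its exit status. Plain sh has no local variables, so lfn_dir and
# name are ordinary globals.
fetch_from_grid() {
    lfn_dir=$1
    name=$2
    if "$GRID_CP" --vo gilda "lfn:$lfn_dir/$name" "file:$(pwd)/$name"; then
        echo "fetched $name"
    else
        echo "unable to fetch $name from the Grid" >&2
        return 1
    fi
}

# Intended use inside alignment.sh (commented out: it needs Grid access):
# for f in "$proteins1" "$proteins2" "$blosum62" "$localAlignExe"; do
#     fetch_from_grid /grid/gilda/kunikver "$f" || exit 1
# done
```

Exiting with a non-zero status on the first failed download also makes the problem easier to spot in alignment.err.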





Now (finally) we are ready to run the job with the glite-job-submit command (if you wish, you can instead submit the job via the GENIUS portal).

% glite-job-submit -o <username.id> alignment.jdl


As we've done previously, use the glite-job-status command to check the progress of the job:

% glite-job-status -i <username.id>
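
Rather than re-running the status command by hand, you could test its output in a script. The sketch below assumes that the status report contains a line of the form "Current Status: Done (Success)", as quoted in this tutorial; job_succeeded is a hypothetical helper name, and the demo uses a sample line instead of a live query.

```shell
#!/bin/sh
# job_succeeded: hypothetical helper, not a gLite command.
# Reads glite-job-status output on stdin and succeeds only if it reports
# "Current Status: Done (Success)" (status line format is an assumption
# based on the text of this tutorial).
job_succeeded() {
    grep -q 'Current Status:[[:space:]]*Done (Success)'
}

# Demo with a sample status line (no Grid access needed).
printf 'Current Status:     Done (Success)\n' | job_succeeded \
    && echo "job finished successfully"
# prints: job finished successfully

# Possible polling loop on the UI (commented out; note that it would never
# end if the job is aborted, so add a timeout before relying on it):
# until glite-job-status -i <username.id> | job_succeeded; do
#     sleep 60
# done
```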



After the job terminates (i.e., Current Status: Done (Success)), check the contents of your directory to make sure that the output file is there by using:

% lfc-ls /grid/gilda/<username>



Let's retrieve the output of our job (the files that were returned via the OutputSandbox) by using the following command:

% glite-job-output --dir directory_name <URI>


Now let's retrieve the output file that was stored on the Grid:

% lcg-cp --vo gilda lfn:/grid/gilda/<username>/<file_name> file:/home/<username>/<file_name>


The output file is: my_alignment


The Grid Data Management tools offer many more utilities, but due to our schedule limitations we will not be able to go through all of them.
Please refer to the Data Management lecture for further information.

Now let us move on to our next topic: Web Job Submission Tutorial