Introduction
This tutorial gives instructions on how to use
some of the services on NPACI Grid. To run this tutorial,
the prerequisites in Getting
Started must be met.
The tutorial consists of the following sections:
- Signing Onto the Grid
- Submitting Jobs
- Transferring Files
- Monitoring and Discovery Services
- APST
Signing Onto the Grid: Creating a Proxy Certificate
Proxies are certificates signed by the user,
or by another proxy, that do not require a password to submit
a job. They are intended for short-term use, when the
user is submitting many jobs and cannot be troubled to repeat
a password for every job.
Proxies provide a convenient alternative to
constantly entering passwords, but are also less secure than
the user's normal security credential. Therefore, they
should be deleted after they are no longer needed (or after
they expire).
To create a proxy with the default expiration
(12 hours), run the grid-proxy-init program:
% grid-proxy-init
Enter the passphrase you used when you first
created your certificate. Your output should look like
this:
Your identity: /C=US/O=NPACI/USERID=ux00000
Enter GRID pass phrase for this identity:
Creating proxy ..................... Done
Your proxy is valid until: Thu Jul 17 01:29:53 2003
Obtain information about the current proxy by
running grid-proxy-info:
% grid-proxy-info
Output:
subject : /C=US/O=NPACI/USERID=ux00000/CN=proxy
issuer : /C=US/O=NPACI/USERID=ux00000
type : full
strength : 512 bits
path : /tmp/x509up_u000
timeleft : 11:59:58
To delete a proxy previously created with grid-proxy-init,
run:
% grid-proxy-destroy
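Because a proxy expires, it can be useful to check how much lifetime remains before submitting more jobs. The following is a minimal sketch, assuming grid-proxy-info prints a timeleft line in the H:MM:SS form shown above. The function names and the one-hour default threshold are illustrative choices for this sketch, not part of the Globus tools.

```shell
#!/bin/sh
# Illustrative helpers (hypothetical names, not Globus commands).

# Read grid-proxy-info output on stdin and print the timeleft value.
proxy_timeleft() {
    sed -n 's/^timeleft *: *//p'
}

# Convert an H:MM:SS value into a number of seconds.
timeleft_to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Succeed (exit status 0) when fewer than $2 seconds remain.
# The default of 3600 is an arbitrary threshold chosen for this sketch.
proxy_needs_renewal() {
    seconds=$(timeleft_to_seconds "$1")
    [ "$seconds" -lt "${2:-3600}" ]
}
```

For example, left=$(grid-proxy-info | proxy_timeleft) followed by proxy_needs_renewal "$left" && grid-proxy-init would prompt for a new proxy only when less than an hour remains.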
Further background information about certificates
Submitting Jobs
There are two different job services available
on the NPACI Grid: Globus and Condor-G. A primary advantage
of Globus job services is that a single script may be used
to submit jobs to heterogeneous resources. Condor-G
builds on the Globus job submission process, providing a single
access point for running and monitoring jobs. Each of
these is described below.
Globus Jobs
To run a globus job remotely, the job must be
submitted to a machine where a globus gatekeeper is running.
A gatekeeper is a root-level process which handles
all globus job requests at a remote site. For this tutorial,
use the following gatekeeper sites:
- longhorn.tacc.utexas.edu
- morpheus.engin.umich.edu
- hypnos.engin.umich.edu
- tf004i.sdsc.edu
When a gatekeeper receives a job request, it
authenticates the user of the request before starting a jobmanager
on the local host. Each NPACI Grid gatekeeper has two
jobmanagers available: one for submitting interactive jobs
(called jobmanager-fork) and one for submitting jobs to a
batch queue. The batch jobmanager provides an interface
to the local job scheduler, such as PBS or loadleveler.
The interactive jobmanager should be used for
small jobs and for testing and debugging. Batch jobmanagers
are generally reserved for high performance computing applications
requiring supercomputing resources. In this tutorial,
you will submit some small jobs to the batch queues to become
familiar with these services.
When running globus job submission commands,
you must specify the gatekeeper host and, optionally, the jobmanager.
By default, your job will run interactively with jobmanager-fork. To
use the batch jobmanager, it must be specified explicitly.
The following is the syntax for accessing the batch jobmanagers
at each of the gatekeeper sites:
- longhorn.tacc.utexas.edu/jobmanager-loadleveler
- morpheus.engin.umich.edu/jobmanager-pbs
- hypnos.engin.umich.edu/jobmanager-pbs
- tf004i.sdsc.edu/jobmanager-loadleveler
Refer to the Grid
Resource Matrix for a complete list of gatekeepers and
jobmanagers available on the NPACI Grid.
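When scripting submissions, the host-to-jobmanager pairs listed above can be kept in one place. The following sketch encodes only the four tutorial gatekeepers from this section; the function names are hypothetical, and any other host is rejected rather than guessed.

```shell
#!/bin/sh
# Print the batch jobmanager contact for a tutorial gatekeeper host.
# The mapping mirrors the list in this tutorial; other hosts are rejected.
batch_contact() {
    case "$1" in
        longhorn.tacc.utexas.edu|tf004i.sdsc.edu)
            echo "$1/jobmanager-loadleveler" ;;
        morpheus.engin.umich.edu|hypnos.engin.umich.edu)
            echo "$1/jobmanager-pbs" ;;
        *)
            echo "unknown gatekeeper: $1" >&2
            return 1 ;;
    esac
}

# Print the contact for the interactive (fork) jobmanager.
fork_contact() {
    echo "$1/jobmanager-fork"
}
```

A script could then pass "$(batch_contact hypnos.engin.umich.edu)" wherever a gatekeeper and jobmanager are expected.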
Another prerequisite for running a globus job
is that the binary must have been compiled on the remote machine.
In this tutorial, we run commands that reside at the same
location at each site, such as /bin/date and /bin/hostname.
There are several commands for submitting jobs
to remote resources:
- globus-job-submit
- globus-job-run
- globusrun
A tutorial for each of these commands is provided
below. At any time, you may obtain detailed information about
any globus command with the -help option:
% globus-job-submit -help

globus-job-submit
The globus-job-submit command runs in the background
-- once the job is submitted, the connection to the remote
host goes away and control is returned to the user.
This command returns a contact string which allows you to
monitor your job. Using globus-job-submit, you could
submit a large batch job, exit the system, and return later
to query for the job status. The basic syntax is:
% globus-job-submit <gatekeeper-host>[/<job-manager>] <command>
The job-manager argument is optional; if not
specified, the default job manager (jobmanager-fork) on the
given gatekeeper-host will be used. As an example, the
following submits a simple command to the fork jobmanager
at TACC:
% globus-job-submit longhorn.tacc.utexas.edu /bin/date
The globus-job-submit command returns a contact
string that looks something like this:
https://longhorn.tacc.utexas.edu:43406/1949838/1059696984/
This contact string uniquely identifies your
job. You may use this string to get information about
your job from any NPACI Grid host. For example, use
the following command to get the status of your job (such as
PENDING, ACTIVE, DONE, or FAILED):
% globus-job-status <contact-string>
If the previous command returned 'DONE', the
output may be retrieved with:
% globus-job-get-output <contact-string>
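Checking the status by hand can be replaced with a small polling loop. The sketch below assumes globus-job-status prints a single status word such as DONE or FAILED, as described above. STATUS_CMD, POLL_SECONDS, and MAX_POLLS are names invented for this sketch so that the status command, the delay, and the retry limit can be overridden.

```shell
#!/bin/sh
# Poll a job until it reaches DONE or FAILED.
# STATUS_CMD, POLL_SECONDS, and MAX_POLLS are hypothetical knobs for
# this sketch; they are not Globus settings.
wait_for_job() {
    contact=$1
    polls=0
    while [ "$polls" -lt "${MAX_POLLS:-120}" ]; do
        status=$("${STATUS_CMD:-globus-job-status}" "$contact")
        case "$status" in
            DONE)   return 0 ;;
            FAILED) return 1 ;;
        esac
        polls=$((polls + 1))
        sleep "${POLL_SECONDS:-30}"
    done
    return 2   # gave up before the job finished
}
```

With a contact string saved in a variable, wait_for_job "$contact" && globus-job-get-output "$contact" would fetch the output only once the job reports DONE.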
As stated earlier, each of the NPACI Grid resources
has batch queues available for running large jobs.
To access these queues, the batch jobmanager must be specified.
For example, the following submits a job to the batch queue
on Blue Horizon at SDSC:
% globus-job-submit tf004i.sdsc.edu/jobmanager-loadleveler \
  /bin/date
Because you are submitting a job to a batch
queue, your job will have to wait for resources to become
available before it will be executed. Generally speaking,
small commands like this would instead be submitted to the interactive
jobmanager.
Most of the batch job queues require other parameters
for job submission. For example, the PBS queue on hypnos
requires the following:
- queue = route
- max_wall_time=<seconds>
- email_address=<your@email.address>
For this tutorial, the recommended max_wall_time
is 45 seconds. An example showing how to submit a job
to the batch queue on hypnos is:
% globus-job-submit hypnos.engin.umich.edu/jobmanager-pbs \
  -x "(queue=route)(max_wall_time=45)(email_address=your@email)" \
  /bin/date
Refer to the Grid
Services Matrix for all parameters required for each batch
jobmanager.
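The string passed with -x in the hypnos example is a sequence of parenthesized name=value pairs. A small helper can assemble it from separate arguments, which avoids quoting mistakes in scripts. This is only a sketch; rsl_clause is a hypothetical name, and it does no validation of the parameter names.

```shell
#!/bin/sh
# Build an RSL clause such as "(queue=route)(max_wall_time=45)"
# from name=value arguments. Fails if no arguments are given.
rsl_clause() {
    [ "$#" -gt 0 ] || return 1
    printf '(%s)' "$@"
    printf '\n'
}
```

The result can be used as -x "$(rsl_clause queue=route max_wall_time=45 email_address=your@email)" on the globus-job-submit command line.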
To stop any job that is still running, use:
% globus-job-clean <contact-string>
globus-job-run
The globus-job-run command is intended for shorter
term interactive jobs. It is similar to launching a
job with ssh/rsh but also includes functionality for staging
executables and input/output files. The basic syntax
is the same as globus-job-submit:
% globus-job-run <gatekeeper-host>[/<job-manager>] <command>
For example:
% globus-job-run hypnos.engin.umich.edu /bin/date
The output from this command looks like this:
Fri Aug 1 01:55:31 EDT 2003
Unlike the globus-job-submit command, the connection
to the remote host is maintained until the job completes.
Therefore, there is no 'contact string' provided to query
the status of the job or fetch job output.
The following example stages the local executable
to the remote host and executes it. To run this example,
you will need to supply an executable that can run on the remote
host.
% globus-job-run tf004i.sdsc.edu -stage <local-executable>
.

globusrun
The Resource Specification Language (RSL) provides
a common language for specifying resource and execution parameters
necessary for running remote jobs. For example, an RSL
file for submitting a simple job to each GRAM server on the
NPACI Grid would look like this:
+
( &(resourceManagerContact="longhorn.tacc.utexas.edu")
(executable=/bin/hostname)
)
( &(resourceManagerContact="tf004i.sdsc.edu")
(executable=/bin/hostname)
)
( &(resourceManagerContact="hypnos.engin.umich.edu")
(executable=/bin/hostname)
)
( &(resourceManagerContact="morpheus.engin.umich.edu")
(executable=/bin/hostname)
)
The resourceManagerContact parameter specifies
the gatekeeper host and (optionally) jobmanager where the
command is to be submitted. In this example, the default
jobmanager (fork) is executed on each host. The command
to be executed is placed in the executable parameter.
It is important to specify the full pathname to each executable
for the job to run successfully.
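Since every entry in the multi-request above has the same shape, the RSL can be generated from a list of gatekeepers rather than typed by hand. The following sketch prints the same structure for any executable and set of contacts; make_multirequest is a hypothetical helper name.

```shell
#!/bin/sh
# Print an RSL multi-request that runs one executable on several
# gatekeepers. Usage: make_multirequest <executable> <contact>...
make_multirequest() {
    executable=$1
    shift
    echo "+"
    for contact in "$@"; do
        echo "( &(resourceManagerContact=\"$contact\")"
        echo "    (executable=$executable)"
        echo ")"
    done
}
```

For example, make_multirequest /bin/hostname longhorn.tacc.utexas.edu tf004i.sdsc.edu > run.rsl writes a two-site request to run.rsl.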
The globusrun command gives you access to the
full power of RSL.
If the above RSL segment were saved to a file
named run.rsl, it could be executed with the following globusrun
command from any NPACI Grid host:
% globusrun -o -f run.rsl
Output:
longhorn.tacc.utexas.edu
tf004i
hypnos.engin.umich.edu
morpheus.engin.umich.edu
Interestingly, the hostname command executed
on Blue Horizon does not return domain information.
This is a peculiarity of the /bin/hostname binary that exists
on Blue Horizon.

For More Information
More information on the globus commands may
be found at:
http://www.globus.org/v1.1/programs/index.html
Another tutorial with more detailed information
on RSL is at:
http://www.ipg.nasa.gov/ipgusers/globus/4-globus.html
A complete list of RSL parameters is available
here:
http://www.globus.org/gram/gram_rsl_parameters.html
Finally, the Globus RSL specification is here,
providing a complete description of RSL and some simple examples:
http://www.globus.org/gram/rsl_spec1.html

Condor-G Jobs
Condor-G provides advanced job submission and
monitoring capabilities on the NPACI Grid. The Condor-G
job manager automatically handles file transfers and job I/O
while using the Globus Toolkit for job launching. The
Condor-G distribution also provides a useful tool called DAGMan
to define job dependencies.
All Condor-G services for the NPACI Grid run
on griddle.sdsc.edu. From this host, jobs may be launched
on any NPACI Grid resource.
Prerequisites
In order to submit jobs, you must have your
environment set up correctly. To run this tutorial, you
need to:
- meet all of the prerequisites in the Getting
Started Guide, which includes:
  - adding setup.csh or setup.sh to your shell initialization
  file on griddle (refer to Configuring
  Your Environment for details)
  - copying your certificate (.globus directory) to griddle
- make sure your proxy has not expired; run grid-proxy-init
if you are not sure
Simple Example
The following exercise submits a very simple
job to a remote site. Log in to longhorn.tacc.utexas.edu
and create a file named 'hello.sh' with the following contents:
#!/bin/sh
echo "hello!!!"
Make sure the script is executable:
[longhorn]% chmod +x hello.sh
Now log in to griddle.sdsc.edu and create a Condor
submit description file with the contents shown below (or
download from here).
Edit the executable parameter value in this
file and save it as a file named test.script.
# path to executable on remote host
executable = /paci/sdsc/ux<YOUR_ID>/hello.sh
# do not stage executable from local to remote host
transfer_executable = false
# host and jobmanager where job is to be submitted
globusscheduler = longhorn.tacc.utexas.edu/jobmanager-fork
# condor-g always uses the globus universe
universe=globus
# local files where standard output and error will be placed
output = hello.out
error = hello.error
# local file where Condor records job events (submit, execute, terminate)
log = hello.log
# submit the job
queue
Now submit the job:
[griddle]% condor_submit test.script
Your output from the condor_submit
command should look something like this:
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 8.
You may query for the status of your job as
follows:
[griddle]% condor_q
Output:
-- Submitter: griddle.sdsc.edu : <132.249.20.36:60976> : griddle.sdsc.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
9.0 ux555555 8/2 18:42 0+00:00:00 I 0 0.0 hello.sh
1 jobs; 1 idle, 0 running, 0 held
As the job runs, Condor records job events
in hello.log. Once the job has
completed, the remote job output will be placed in hello.out
and any error messages will be placed in hello.error.
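When several similar jobs are needed, the submit description can be produced by a script instead of edited by hand. The sketch below prints a description with the same keys as the test.script example; make_submit_file and its three arguments are hypothetical names chosen for illustration.

```shell
#!/bin/sh
# Print a Condor-G submit description to stdout.
# Usage: make_submit_file <remote-executable> <gatekeeper/jobmanager> <name>
# <name> is used as the base name of the output, error, and log files.
make_submit_file() {
    cat <<EOF
executable = $1
transfer_executable = false
globusscheduler = $2
universe = globus
output = $3.out
error = $3.error
log = $3.log
queue
EOF
}
```

Redirect the output to a file and pass that file to condor_submit, for example make_submit_file /bin/date longhorn.tacc.utexas.edu/jobmanager-fork hello > hello.script.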
The following are additional Condor submit description
files available for download. The first shows how to
submit a simple MPI job to the loadleveler queue on longhorn,
while the second is a general template for description file
creation.

Using DAGMan
A directed acyclic graph (DAG) is used to represent
a set of jobs where the input, output, or execution of one
or more jobs is dependent on other jobs. The jobs are
nodes (vertices) in the graph while edges (arcs) identify
the dependencies. DAGMan submits jobs to Condor-G in
an order represented by a DAG and processes the results.
An input file defined prior to submission describes the DAG;
multiple Condor submit description files must be created to
define each job.
The following example shows 4 jobs. Job
A must run before B and C; that is, B and C both depend on
A. Similarly, B and C must run before the last job, D.
In this tutorial, A will run on Longhorn, B on Morpheus, C on
Hypnos, and D on Horizon.

Save the following (or download
from here) in a file named
example.dag:
Job A longhorn.condor
Job B morpheus.condor
Job C hypnos.condor
Job D horizon.condor
PARENT A CHILD B C
PARENT B C CHILD D
Retry C 3
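To check that the PARENT and CHILD lines describe the dependencies you intend, the edges can be listed and sorted with the standard tsort utility. This sketch is only a reading aid: it is not how DAGMan schedules jobs, it ignores every line other than PARENT lines, and dag_edges and dag_order are hypothetical names.

```shell
#!/bin/sh
# Read a DAGMan input file on stdin and print one "parent child" pair
# per dependency edge.
dag_edges() {
    awk 'toupper($1) == "PARENT" {
        child_at = 0
        for (i = 2; i <= NF; i++)
            if (toupper($i) == "CHILD") child_at = i
        if (child_at == 0) next
        for (p = 2; p < child_at; p++)
            for (c = child_at + 1; c <= NF; c++)
                print $p, $c
    }'
}

# Print the job names in one order that respects every dependency.
dag_order() {
    dag_edges | tsort
}
```

Running dag_order < example.dag lists A before B and C, and D last; the relative order of B and C is not fixed because neither depends on the other.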
Create a longhorn.condor
description file that looks like this:
#
# Simple job on longhorn
#
universe = globus
executable = /bin/hostname
transfer_executable = false
globusscheduler = longhorn.tacc.utexas.edu/jobmanager-fork
output = job.$(cluster).out
error = job.$(cluster).err
log = job.log
queue
The hypnos.condor, morpheus.condor, and horizon.condor description
files are the same as above, with the only change being the
globusscheduler parameter:
For hypnos: hypnos.engin.umich.edu/jobmanager-fork
For morpheus: morpheus.engin.umich.edu/jobmanager-fork
For horizon: tf004i.sdsc.edu/jobmanager-fork
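Because only one line differs between the files, the other three can be derived from longhorn.condor with sed. The following is a sketch under that assumption; with_scheduler is a hypothetical helper name.

```shell
#!/bin/sh
# Copy a submit description from stdin to stdout, replacing only the
# globusscheduler value. Usage: with_scheduler <gatekeeper/jobmanager>
# "|" is the sed delimiter because the contact contains "/".
with_scheduler() {
    sed "s|^globusscheduler *=.*|globusscheduler = $1|"
}
```

For example, with_scheduler hypnos.engin.umich.edu/jobmanager-fork < longhorn.condor > hypnos.condor creates the hypnos file. The comment lines at the top of the copy still mention longhorn and may be edited afterwards.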
Download or create your own *.condor
description files for each site, then submit the DAG using
the condor_submit_dag command:
[griddle]% condor_submit_dag example.dag
Output:
Checking your DAG input file and all submit files it references.
This might take a while...
Done.
---------------------------------------------------------
File for submitting this DAG to Condor : example.dag.condor.sub
Log of DAGMan debugging messages : example.dag.dagman.out
Log of Condor library debug messages : example.dag.lib.out
Log of the life of condor_dagman itself : example.dag.dagman.log
Condor Log file for all jobs of this DAG : job.log
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 39.
---------------------------------------------------------
Once the job is running, check on the job status
as follows:
[griddle]% condor_q -dag
Output:
-- Submitter: griddle.sdsc.edu :
ID OWNER/NODENAME SUBMITTED RUN_TIME ST PRI SIZE CMD
39.0 ux555555 8/1 18:46 0+00:00:56 R 0 2.0 condor_dagman -f -
43.0 |-D 8/1 18:47 0+00:00:00 I 0 0.0 date
2 jobs; 1 idle, 1 running, 0 held
Look in the job.log file for the Condor log of all
jobs in this DAG. Output from each job is placed
into job.<cluster ID>.out as each job completes.
To remove any condor jobs before they complete,
use condor_rm:
[griddle]% condor_rm -all
More information about Condor DAGMan and
other Condor commands may be found in the Condor
Manual. In it you will find information on how to
process data between the jobs and other more advanced features.

Transferring Files
This tutorial shows two methods for transferring
files on the NPACI Grid: via the GridFTP protocol and via
gsiscp.
GridFTP
GridFTP is a high performance, secure, reliable
data transfer protocol optimized for high bandwidth, wide
area networks. GridFTP is based on FTP, the highly popular
Internet file transfer protocol.
GridFTP provides the following protocol features:
- GSI security on control and data channels
- Multiple data channels for parallel transfers
- Partial file transfers
- Third-party (direct server-to-server) transfers
- Authenticated data channels
- Reusable data channels
- Command pipelining
The NPACI Grid uses the GridFTP-based gsiftp
server, which has been configured to run on all NPACI
Grid login hosts. The globus-url-copy client program
is a GridFTP client which may be used to transfer files from
the command line. The usage for this command is as follows:
globus-url-copy <source> <destination>
The source and destination arguments may be
URLs, local files, or standard input/output. The following
table shows some acceptable values for these arguments:
| source/destination | Description |
| file:<fullpath>, file://<host><fullpath> | For local files -- relative pathnames not allowed |
| gsiftp://<host><path> | For remote files -- relative paths allowed |
| http://<path to file>, https://<path to file> | For accessing web files -- relative paths allowed |
| - (dash) | Standard input and output |
For example, to transfer a local file to longhorn
you would do something like this:
% globus-url-copy file:/home/ux444444/tmpfile \
  gsiftp://longhorn.tacc.utexas.edu/~/xfer-file
An example of a third-party transfer is as follows:
% globus-url-copy gsiftp://tf004i.sdsc.edu/~/datafile.bin \
  gsiftp://longhorn.tacc.utexas.edu/~/datafile.bin
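Since local files must be given as full paths, a script that builds these URLs can check the path before calling globus-url-copy. The helpers below are a sketch with hypothetical names; they only format strings and do not contact any server.

```shell
#!/bin/sh
# Print a file: URL for a local path. Relative paths are rejected
# because they are not allowed for local files.
local_file_url() {
    case "$1" in
        /*) echo "file:$1" ;;
        *)  echo "not an absolute path: $1" >&2
            return 1 ;;
    esac
}

# Print a gsiftp URL. Usage: gsiftp_url <host> <path>
# <path> is given without a leading slash, for example ~/xfer-file.
gsiftp_url() {
    echo "gsiftp://$1/$2"
}
```

The first transfer above could then be written as globus-url-copy "$(local_file_url /home/ux444444/tmpfile)" "$(gsiftp_url longhorn.tacc.utexas.edu '~/xfer-file')".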

gsiscp
The GSI-enabled scp command uses syntax similar
to the regular scp command. This command is not based
on the GridFTP protocol. To use this command on the
NPACI Grid, you must specify the port number of the GSI-SSH
server running at the remote sites.
For example:
gsiscp -P 1022 datafile.bin \
longhorn.tacc.utexas.edu:datafile.bin
All GSI server ports are listed in the Grid
Services Matrix.

Monitoring and Discovery Services
The monitoring and discovery services (MDS)
are available to publish and access NPACI Grid system and
application data. The MDS components of the NPACI Grid
software stack are Globus MDS, Ganglia, and NWS.
Globus MDS
Here are a few example queries against the GIIS:
To get information about file systems with at least
100 GB available, run the following command:
grid-info-search -x -h giis.npaci.edu \
-b 'Mds-Vo-name=npaci,o=grid' \
'(Mds-Fs-freeMB>=100000)' \
Mds-Host-hn Mds-Fs-sizeMB Mds-Fs-freeMB
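The search filter in this query is an ordinary LDAP filter string, so the threshold can be computed rather than typed. The sketch below uses 1 GB = 1000 MB to match the value in the query above; fs_free_filter is a hypothetical name.

```shell
#!/bin/sh
# Print an LDAP filter matching file systems with at least $1 GB free,
# using 1 GB = 1000 MB as in the query in this tutorial.
fs_free_filter() {
    echo "(Mds-Fs-freeMB>=$(( $1 * 1000 )))"
}
```

Passing "$(fs_free_filter 250)" in place of the quoted filter would search for file systems with at least 250 GB free.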
You have a couple of options for graphically
viewing LDAP data published via the Globus MDS:
- Point your browser at http://npackage.cs.ucsb.edu/ldapbrowser/login.php.
Be patient after clicking on the 'Explore' button.
The page may remain blank while it is loading.
- Use an LDAP browser such as the LDAP Editor/Viewer:
  - Freely available at: http://www.iit.edu/~gawojar/ldap/
  - Note: the installation instructions
  state that the browser is launched by double-clicking
  browser.jar. If this is not available, double-click
  on lbe.jar.
To view the NPACI Grid data, configure your
LDAP browser with the following values:
- Host: giis.npaci.edu
- Port: 2135
- Base DN: Mds-Vo-name=npaci,o=Grid
The image below shows the LDAP browser with
the Operating System version displayed for one of the blue
horizon machines at SDSC.

For more information about the Globus MDS,
see http://www.globus.org/mds/.

Network Weather Service
The Network Weather Service (NWS) is a distributed
system that periodically monitors and dynamically forecasts
the performance various network and computational resources
can deliver over a given time interval. The service
operates a distributed set of performance sensors (network
monitors, CPU monitors, etc.) from which it gathers readings
of the instantaneous conditions. It then uses numerical
models to generate forecasts of what the conditions will be
for a given time frame.
NWS sensors are running on each NPACI Grid resource.
In order to query information about NWS, you need to know
the host where NWS name and memory servers are running: nws.npaci.edu.
The name server implements a directory capability used to
bind process and data names with low-level contact information
(such as TCP/IP port numbers or address pairs).
The following contacts the name server and reports
on all hosts that have registered to it:
% nws_search -N nws.npaci.edu hosts
The '-N' argument specifies the name server
host. To see the activities associated with each sensor,
do the following:
% nws_search -N nws.npaci.edu activities
To view the most recent measurements of the
available CPU sensor on b80n03, run the following command:
% nws_extract -M nws.sdsc.edu -N nws.sdsc.edu \
  -f time,measurement availableCpu b80n03.sdsc.edu
The NWS user's guide explains a great deal about the use of the
NWS, and provides man pages and more usage examples.
NWS statistics may also be viewed on the web at http://nws.cs.ucsb.edu/CGI/graphIt.cgi
and via the GridView
portion of the NPACI hotpage. For more information
about the Network Weather Service, refer to http://nws.cs.ucsb.edu/.
Ganglia
Ganglia provides a scalable real-time monitoring
system designed specifically for clusters. Ganglia monitoring
daemons (gmond) run on all nodes of the cluster, while the
Ganglia Meta Daemons provide a single point of collection
for multiple gmonds.
Ganglia information may be viewed within the
Globus MDS. Connect to the NPACI GIIS (as documented
above) and drill down into the information for either
morpheus or hypnos.
More information about Ganglia is available
at http://ganglia.sourceforge.net/

APST
The AppLeS Parameter Sweep Template (APST) automates
the execution of parameter sweep applications. These
applications typically involve running a single program (or
a small set of programs) many times, varying the runs by changing
the command line arguments, the input files, or both.
The individual runs are usually independent, and little or
no communication takes place other than by using output files
from one run as input files to another.
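To make the idea of a parameter sweep concrete, the sketch below prints one command line for every combination of two argument lists. It is plain shell and not APST syntax; sweep_commands and the ./sim program in the usage note are hypothetical.

```shell
#!/bin/sh
# Print one command line per parameter combination.
# Usage: sweep_commands <program> "<values for arg 1>" "<values for arg 2>"
# The value lists are split on whitespace.
sweep_commands() {
    for first in $2; do
        for second in $3; do
            echo "$1 $first $second"
        done
    done
}
```

For example, sweep_commands ./sim "1 2" "a b" prints four independent runs, which is the kind of task list that a tool such as APST distributes across resources.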
A very good tutorial for APST is located at
http://grail.sdsc.edu/projects/apst/tutorial.html.
