National Partnership for Advanced Computational Infrastructure: Archives

These pages are a copy of the original www.npaci.edu website, and should be used for historical reference only.


NPACI-Grid: Tutorial



User Guide - Tutorial

Introduction

This tutorial gives instructions on how to use some of the services on NPACI Grid.  To run this tutorial, the prerequisites in Getting Started must be met.

The tutorial consists of the following sections:

  1. Signing Onto the Grid
  2. Submitting Jobs
  3. Transferring Files
  4. Monitoring and Discovery Services
  5. APST

Signing Onto the Grid: Creating a Proxy Certificate

Proxies are certificates, signed by the user or by another proxy, that allow jobs to be submitted without entering a passphrase.  They are intended for short-term use, when the user is submitting many jobs and does not want to re-enter a passphrase for every job.

Proxies are a convenient alternative to entering a passphrase repeatedly, but they are less secure than the user's normal security credential.  They should therefore be deleted once they are no longer needed or have expired.

To create a proxy with the default expiration (12 hours), run the grid-proxy-init program:

% grid-proxy-init

Enter the passphrase you used when you first created your certificate.  Your output should look like this:

Your identity: /C=US/O=NPACI/USERID=ux00000
Enter GRID pass phrase for this identity:
Creating proxy ..................... Done
Your proxy is valid until: Thu Jul 17 01:29:53 2003

Obtain information about the current proxy by running grid-proxy-info:

% grid-proxy-info

Output:

subject : /C=US/O=NPACI/USERID=ux00000/CN=proxy
issuer : /C=US/O=NPACI/USERID=ux00000
type : full
strength : 512 bits
path : /tmp/x509up_u000
timeleft : 11:59:58
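
When many jobs are scripted, it can help to check the remaining proxy lifetime before submitting.  The following is a minimal Python sketch that reads the timeleft field from text shaped like the sample output above.  It assumes the "name : value" layout and the HH:MM:SS format shown here; the function name is hypothetical and not part of the Globus Toolkit.

```python
def timeleft_seconds(proxy_info: str) -> int:
    """Return the remaining proxy lifetime in seconds.

    Assumes one "name : value" pair per line and a timeleft value
    formatted as HH:MM:SS, as in the sample output.
    """
    for line in proxy_info.splitlines():
        # Split on the first colon only, so the value keeps its own colons.
        key, sep, value = line.partition(":")
        if sep and key.strip() == "timeleft":
            hours, minutes, seconds = (int(part) for part in value.strip().split(":"))
            return hours * 3600 + minutes * 60 + seconds
    raise ValueError("no timeleft field found")


# Sample text copied from the grid-proxy-info output in this tutorial.
sample = """subject : /C=US/O=NPACI/USERID=ux00000/CN=proxy
issuer : /C=US/O=NPACI/USERID=ux00000
type : full
strength : 512 bits
path : /tmp/x509up_u000
timeleft : 11:59:58"""

remaining = timeleft_seconds(sample)
print(remaining)  # 43198
```

A script could compare this value against the expected run time of its jobs and prompt the user to run grid-proxy-init again when too little time remains.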

To delete a proxy previously created with grid-proxy-init, run:

% grid-proxy-destroy

Further background information about certificates is available on the Certificates page of the user reference.

Submitting Jobs

There are two job services available on the NPACI Grid: Globus and Condor-G.  A primary advantage of Globus job services is that a single script may be used to submit jobs to heterogeneous resources.  Condor-G builds on the Globus job submission process, providing a single access point for running and monitoring jobs.  Each is described below.

Globus Jobs

To run a Globus job remotely, the job must be submitted to a machine where a Globus gatekeeper is running.  A gatekeeper is a root-level process that handles all Globus job requests at a remote site.  For this tutorial, use the following gatekeeper sites:

  • longhorn.tacc.utexas.edu
  • morpheus.engin.umich.edu
  • hypnos.engin.umich.edu
  • tf004i.sdsc.edu

When a gatekeeper receives a job request, it authenticates the requesting user before starting a jobmanager on the local host.  Each NPACI Grid gatekeeper has two jobmanagers available: one for submitting interactive jobs (called jobmanager-fork) and one for submitting jobs to a batch queue.  The batch jobmanager provides an interface to the local job scheduler, such as PBS or LoadLeveler.

The interactive jobmanager should be used for small jobs and for testing and debugging.  Batch jobmanagers are generally reserved for high performance computing applications requiring supercomputing resources.  In this tutorial, you will submit some small jobs to the batch queues to become familiar with these services. 

When running Globus job submission commands, you must specify the gatekeeper host and, optionally, the jobmanager.  By default, your job will run interactively with jobmanager-fork.  To use the batch jobmanager, you must specify it explicitly.  The batch jobmanagers at each of the gatekeeper sites are addressed as follows:

  • longhorn.tacc.utexas.edu/jobmanager-loadleveler
  • morpheus.engin.umich.edu/jobmanager-pbs
  • hypnos.engin.umich.edu/jobmanager-pbs
  • tf004i.sdsc.edu/jobmanager-loadleveler

Refer to the Grid Resource Matrix for a complete list of gatekeepers and jobmanagers available on the NPACI Grid.
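
The host[/jobmanager] convention described above can be illustrated with a short Python sketch.  It only splits the string and applies the jobmanager-fork default stated in this tutorial; the function name is hypothetical and the code does not contact any gatekeeper.

```python
DEFAULT_JOBMANAGER = "jobmanager-fork"  # default named in this tutorial


def split_contact(contact: str) -> tuple[str, str]:
    """Split 'host[/jobmanager]' into (host, jobmanager).

    When no jobmanager is given, the interactive default is returned.
    """
    host, sep, jobmanager = contact.partition("/")
    return host, (jobmanager if sep else DEFAULT_JOBMANAGER)


print(split_contact("hypnos.engin.umich.edu/jobmanager-pbs"))
# ('hypnos.engin.umich.edu', 'jobmanager-pbs')
print(split_contact("longhorn.tacc.utexas.edu"))
# ('longhorn.tacc.utexas.edu', 'jobmanager-fork')
```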

Another prerequisite for running a Globus job is that the binary must have been compiled for the remote machine.  In this tutorial, we run commands that reside at the same location at each site, such as /bin/date and /bin/hostname.

There are several commands for submitting jobs to remote resources:

  • globus-job-submit
  • globus-job-run
  • globusrun

A tutorial for each of these commands is provided below. At any time, you may obtain detailed information about any globus command with the -help option:

% globus-job-submit -help

globus-job-submit

The globus-job-submit command runs in the background: once the job is submitted, the connection to the remote host is closed and control returns to the user.  The command returns a contact string that allows you to monitor your job.  Using globus-job-submit, you could submit a large batch job, exit the system, and return later to query the job status.  The basic syntax is:

% globus-job-submit <gatekeeper-host>[/<job-manager>] <command>

The job-manager argument is optional; if not specified, the default job manager (jobmanager-fork) on the given gatekeeper-host will be used.  As an example, the following submits a simple command to the fork jobmanager at TACC:

% globus-job-submit longhorn.tacc.utexas.edu /bin/date

The globus-job-submit command returns a contact string that looks something like this:

https://longhorn.tacc.utexas.edu:43406/1949838/1059696984/

This contact string uniquely identifies your job.  You may use this string to get information about your job from any NPACI Grid host.  For example, use the following command to get the status of your job (PENDING, ACTIVE, DONE, FAILED, etc.):

% globus-job-status <contact-string>

If the previous command returned 'DONE', the output may be retrieved with:

% globus-job-get-output <contact-string>
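
The submit, query, and fetch cycle above lends itself to a polling loop.  The Python sketch below shows only the loop logic: get_status is a caller-supplied function (hypothetical here) that would, in practice, wrap a call to globus-job-status for a given contact string.  Only DONE and FAILED are treated as final states, which is an assumption based on the states listed in this tutorial; other states may exist.

```python
import time
from typing import Callable

# States treated as final; an assumption based on the states named above.
TERMINAL_STATES = {"DONE", "FAILED"}


def wait_for_job(get_status: Callable[[], str],
                 interval: float = 30.0,
                 max_polls: int = 100) -> str:
    """Poll until the job reaches a final state or max_polls is used up.

    Returns the last status seen, which may be non-final if polling gave up.
    """
    status = get_status()
    polls = 1
    while status not in TERMINAL_STATES and polls < max_polls:
        time.sleep(interval)
        status = get_status()
        polls += 1
    return status


# Stand-in for real status queries, so the sketch runs without a grid.
fake_statuses = iter(["PENDING", "ACTIVE", "ACTIVE", "DONE"])
final = wait_for_job(lambda: next(fake_statuses), interval=0.0)
print(final)  # DONE
```

Once the loop reports DONE, the output can be fetched with globus-job-get-output as shown above.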

As stated earlier, each of the NPACI Grid resources has batch queues available for running large jobs.  To access these queues, the batch jobmanager must be specified.  For example, the following submits a job to the batch queue on Blue Horizon at SDSC:

% globus-job-submit tf004i.sdsc.edu/jobmanager-loadleveler /bin/date

Because you are submitting a job to a batch queue, your job must wait for resources to become available before it runs.  Generally speaking, small commands like this would be submitted to the interactive jobmanager instead.

Most of the batch job queues require other parameters for job submission.  For example, the PBS queue on hypnos requires the following:

  • queue = route
  • max_wall_time=<seconds>
  • email_address=<your@email.address>

For this tutorial, the recommended max_wall_time is 45 seconds.  An example showing how to submit a job to the batch queue on hypnos is:

% globus-job-submit hypnos.engin.umich.edu/jobmanager-pbs \
    -x "(queue=route)(max_wall_time=45)(email_address=your@email)" \
    /bin/date
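
The string passed with -x is a sequence of parenthesized name=value clauses.  The following Python sketch assembles such a string from a dictionary.  The function name is hypothetical, and the sketch does no quoting or escaping, so it is only suitable for simple values like those in the example above.

```python
def rsl_clauses(params: dict[str, object]) -> str:
    """Join parameters into '(name=value)' clauses, in insertion order.

    No quoting is applied; values containing spaces or parentheses
    would need additional handling.
    """
    return "".join(f"({name}={value})" for name, value in params.items())


extra = rsl_clauses({
    "queue": "route",
    "max_wall_time": 45,            # value recommended for this tutorial
    "email_address": "your@email",  # placeholder, as in the example above
})
print(extra)  # (queue=route)(max_wall_time=45)(email_address=your@email)
```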

Refer to the Grid Services Matrix for all parameters required for each batch jobmanager.

To stop any job that is still running, use:

% globus-job-clean <contact-string>

 

globus-job-run

The globus-job-run command is intended for short-term interactive jobs.  It is similar to launching a job with ssh or rsh, but it can also stage executables and input/output files.  The basic syntax is the same as for globus-job-submit:

% globus-job-run <gatekeeper-host>[/<job-manager>] <command>

For example:

% globus-job-run hypnos.engin.umich.edu /bin/date

The output from this command looks like this:

Fri Aug 1 01:55:31 EDT 2003

Unlike the globus-job-submit command, the connection to the remote host is maintained until the job completes.  Therefore, there is no 'contact string' provided to query the status of the job or fetch job output.

The following example stages the local executable to the remote host and executes it.  To run this example, you will need to supply an executable that can run on the remote host.

% globus-job-run tf004i.sdsc.edu -stage <local-executable>

globusrun

The Resource Specification Language (RSL) provides a common language for specifying resource and execution parameters necessary for running remote jobs.  For example, an RSL file for submitting a simple job to each GRAM server on the NPACI Grid would look like this:

+
( &(resourceManagerContact="longhorn.tacc.utexas.edu")
(executable=/bin/hostname)
)
( &(resourceManagerContact="tf004i.sdsc.edu")
(executable=/bin/hostname)
)
( &(resourceManagerContact="hypnos.engin.umich.edu")
(executable=/bin/hostname)
)
( &(resourceManagerContact="morpheus.engin.umich.edu")
(executable=/bin/hostname)
)

The resourceManagerContact parameter specifies the gatekeeper host and (optionally) jobmanager where the command is to be submitted.  In this example, the default jobmanager (fork) is executed on each host.  The command to be executed is placed in the executable parameter.  It is important to specify the full pathname to each executable for the job to run successfully.
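
Because the multi-request above repeats the same block for every host, it can be generated rather than typed.  The Python sketch below reproduces the layout of the example for a list of contacts; the function name is hypothetical and the output is plain text intended to be saved to a file.

```python
def multi_request(contacts: list[str], executable: str) -> str:
    """Build a '+' multi-request with one sub-request per contact.

    Each contact is 'host' or 'host/jobmanager'; the executable should
    be a full pathname, as recommended above.
    """
    lines = ["+"]
    for contact in contacts:
        lines.append(f'( &(resourceManagerContact="{contact}")')
        lines.append(f"(executable={executable})")
        lines.append(")")
    return "\n".join(lines)


rsl = multi_request(
    [
        "longhorn.tacc.utexas.edu",
        "tf004i.sdsc.edu",
        "hypnos.engin.umich.edu",
        "morpheus.engin.umich.edu",
    ],
    "/bin/hostname",
)
print(rsl)
```

Printing rsl yields the same text as the example file shown above.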

The globusrun command gives you access to the full power of RSL.

If the above RSL segment was saved to a file named run.rsl, it could be executed with the following globusrun command from any NPACI Grid host:

% globusrun -o -f run.rsl

Output:

longhorn.tacc.utexas.edu
tf004i
hypnos.engin.umich.edu
morpheus.engin.umich.edu

Note that the hostname command executed on Blue Horizon does not return domain information.  This is a peculiarity of the /bin/hostname binary on Blue Horizon.

For More Information

More information on the globus commands may be found at:

http://www.globus.org/v1.1/programs/index.html

Another tutorial with more detailed information on RSL is at:

http://www.ipg.nasa.gov/ipgusers/globus/4-globus.html

A complete list of RSL parameters is available here:

http://www.globus.org/gram/gram_rsl_parameters.html

Finally, the Globus RSL specification is here, providing a complete description of RSL and some simple examples:

http://www.globus.org/gram/rsl_spec1.html

Condor-G Jobs

Condor-G provides advanced job submission and monitoring capabilities on the NPACI Grid.  The Condor-G job manager automatically handles file transfers and job I/O while using the Globus Toolkit for job launching.  The Condor-G distribution also provides a useful tool called DAGMan to define job dependencies. 

All Condor-G services for the NPACI Grid run on griddle.sdsc.edu.  From this host, jobs may be launched on any NPACI Grid resource.

Prerequisites

To submit jobs, your environment must be set up correctly.  To run this tutorial, you need to:

  • meet all of the prerequisites in the Getting Started Guide, which include:
    • adding setup.csh or setup.sh to your shell initialization file on griddle (refer to Configuring Your Environment for details)
    • copying your certificate (.globus directory) to griddle
  • make sure your proxy has not expired; run grid-proxy-init if you are not sure

Simple Example

The following exercise submits a very simple job to a remote site.  Log in to longhorn.tacc.utexas.edu and create a file named 'hello.sh' with the following contents:

#!/bin/sh

echo "hello!!!"

Make sure the script is executable:

[longhorn]% chmod +x hello.sh

Now log in to griddle.sdsc.edu and create a Condor submit description file with the contents shown below.  Edit the executable parameter value so that it points to your copy of hello.sh, and save the file as test.script.

# path to executable on remote host
executable = /paci/sdsc/ux<YOUR_ID>/hello.sh

# do not stage executable from local to remote host
transfer_executable = false

# host and jobmanager where job is to be submitted
globusscheduler = longhorn.tacc.utexas.edu/jobmanager-fork

# condor-g always uses the globus universe
universe=globus

# local files where standard output and error will be placed
output = hello.out
error = hello.error

# local file where Condor will log job events
log = hello.log

# submit the job
queue

Now submit the job:

[griddle]% condor_submit test.script

Your output from the condor_submit command should look something like this:

Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 8.

You may query for the status of your job as follows:

[griddle]% condor_q

Output:

-- Submitter: griddle.sdsc.edu : <132.249.20.36:60976> : griddle.sdsc.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
9.0 ux555555 8/2 18:42 0+00:00:00 I 0 0.0 hello.sh

1 jobs; 1 idle, 0 running, 0 held

While the job runs, Condor records job events (such as submission, execution, and termination) in hello.log.  Once the job has completed, the remote job output is placed in hello.out and any error messages are placed in hello.error.

The following are additional Condor submit description files available for download.  The first shows how to submit a simple MPI job to the loadleveler queue on longhorn, while the second is a general template for description file creation.

Using DAGMan

A directed acyclic graph (DAG) is used to represent a set of jobs where the input, output, or execution of one or more jobs is dependent on other jobs.  The jobs are nodes (vertices) in the graph while edges (arcs) identify the dependencies.  DAGMan submits jobs to Condor-G in an order represented by a DAG and processes the results.  An input file defined prior to submission describes the DAG; multiple Condor submit description files must be created to define each job.

The following example shows four jobs.  Job A must run before B and C (that is, B and C both depend on A); similarly, B and C must both complete before the last job, D, runs.  In this tutorial, A will run on Longhorn, B on Morpheus, C on Hypnos, and D on Horizon.

Save the following in a file named example.dag.  The PARENT ... CHILD lines declare the dependencies, and the Retry line tells DAGMan to retry job C up to three times if it fails:

Job A longhorn.condor
Job B morpheus.condor
Job C hypnos.condor
Job D horizon.condor
PARENT A CHILD B C
PARENT B C CHILD D
Retry C 3
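
To see which submission orders this DAG permits, the dependency lines can be fed to a topological sort.  The Python sketch below parses only the Job and PARENT ... CHILD lines (other keywords, such as Retry, are ignored) and uses the standard graphlib module.  It illustrates the ordering constraint and is not how DAGMan itself is implemented.

```python
from graphlib import TopologicalSorter


def parse_dag(text: str) -> dict[str, set[str]]:
    """Map each job name to the set of jobs it depends on."""
    deps: dict[str, set[str]] = {}
    for line in text.splitlines():
        words = line.split()
        if not words:
            continue
        keyword = words[0].upper()
        if keyword == "JOB":
            deps.setdefault(words[1], set())
        elif keyword == "PARENT":
            # Names before CHILD are parents; names after it are children.
            split = [w.upper() for w in words].index("CHILD")
            parents, children = words[1:split], words[split + 1:]
            for child in children:
                deps.setdefault(child, set()).update(parents)
    return deps


dag_text = """Job A longhorn.condor
Job B morpheus.condor
Job C hypnos.condor
Job D horizon.condor
PARENT A CHILD B C
PARENT B C CHILD D
Retry C 3"""

deps = parse_dag(dag_text)
order = list(TopologicalSorter(deps).static_order())
print(order[0], order[-1])  # A D
```

B and C may appear in either order between A and D, since neither depends on the other.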

Create a longhorn.condor description file that looks like this:

#
# Simple job on longhorn
#
universe = globus
executable = /bin/hostname
transfer_executable = false
globusscheduler = longhorn.tacc.utexas.edu/jobmanager-fork
output = job.$(cluster).out
error = job.$(cluster).err
log = job.log
queue

The hypnos.condor, morpheus.condor, and horizon.condor description files are the same as above; the only change is the value of the globusscheduler parameter:

For hypnos:  hypnos.engin.umich.edu/jobmanager-fork
For morpheus:  morpheus.engin.umich.edu/jobmanager-fork
For horizon:  tf004i.sdsc.edu/jobmanager-fork
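
Since the four description files differ only in the globusscheduler value, they can be rendered from one template.  The following Python sketch does this in memory; the template text is copied from longhorn.condor above, while the function and variable names are hypothetical.

```python
# Template copied from longhorn.condor; only the site name in the
# comment and the globusscheduler value vary.
TEMPLATE = """#
# Simple job on {site}
#
universe = globus
executable = /bin/hostname
transfer_executable = false
globusscheduler = {contact}
output = job.$(cluster).out
error = job.$(cluster).err
log = job.log
queue
"""

SITES = {
    "longhorn": "longhorn.tacc.utexas.edu/jobmanager-fork",
    "hypnos": "hypnos.engin.umich.edu/jobmanager-fork",
    "morpheus": "morpheus.engin.umich.edu/jobmanager-fork",
    "horizon": "tf004i.sdsc.edu/jobmanager-fork",
}


def render_descriptions(sites: dict[str, str]) -> dict[str, str]:
    """Return a mapping of '<site>.condor' filenames to file contents."""
    return {
        f"{site}.condor": TEMPLATE.format(site=site, contact=contact)
        for site, contact in sites.items()
    }


descriptions = render_descriptions(SITES)
print(sorted(descriptions))
# ['horizon.condor', 'hypnos.condor', 'longhorn.condor', 'morpheus.condor']
```

Writing each value of descriptions to the file named by its key produces the four files that example.dag refers to.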

Download or create your own *.condor description files for each site, then submit the DAG using the condor_submit_dag command:

[griddle]% condor_submit_dag example.dag

Output:

Checking your DAG input file and all submit files it references.
This might take a while...
Done.
---------------------------------------------------------
File for submitting this DAG to Condor : example.dag.condor.sub
Log of DAGMan debugging messages : example.dag.dagman.out
Log of Condor library debug messages : example.dag.lib.out
Log of the life of condor_dagman itself : example.dag.dagman.log
 
Condor Log file for all jobs of this DAG : job.log
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 39.
---------------------------------------------------------

Once the job is running, check on the job status as follows:

[griddle]% condor_q -dag

Output:

-- Submitter: griddle.sdsc.edu :
ID OWNER/NODENAME SUBMITTED RUN_TIME ST PRI SIZE CMD
39.0 ux555555 8/1 18:46 0+00:00:56 R 0 2.0 condor_dagman -f -
43.0 |-D   8/1 18:47 0+00:00:00 I 0 0.0 date
 
2 jobs; 1 idle, 1 running, 0 held

Look in the job.log file for the Condor event log covering all jobs in the DAG.  Output from each job is placed into job.<cluster ID>.out, and error messages into job.<cluster ID>.err, as each job completes.

To remove any condor jobs before they complete, use condor_rm:

[griddle]% condor_rm -all

More information about DAGMan and other Condor commands may be found in the Condor Manual, which also explains how to pass data between jobs and describes other, more advanced features.

Transferring Files

This tutorial shows two methods for transferring files on the NPACI Grid: via the GridFTP protocol and via gsiscp.

GridFTP

GridFTP is a high performance, secure, reliable data transfer protocol optimized for high bandwidth, wide area networks.  GridFTP is based on FTP, the highly popular Internet file transfer protocol.  GridFTP provides the following protocol features:

  • GSI security on control and data channels
  • Multiple data channels for parallel transfers
  • Partial file transfers
  • Third-party (direct server-to-server) transfers
  • Authenticated data channels
  • Reusable data channels
  • Command pipelining

The NPACI Grid uses the GridFTP-based gsiftp server, which has been configured to run on all NPACI Grid login hosts.  The globus-url-copy client program is a GridFTP client which may be used to transfer files from the command line.  The usage for this command is as follows:

globus-url-copy <source> <destination>

The source and destination arguments may be URLs, local files, or standard input/output.  The following table shows some acceptable values for these arguments:

source/destination          Description

file:<fullpath>             For local files -- relative pathnames not allowed
file://<host><fullpath>

gsiftp://<host><path>       For remote files -- relative paths allowed

http://<path to file>       For accessing web files -- relative paths allowed
https://<path to file>

-  (dash)                   Standard input and output

For example, to transfer a local file to longhorn you would do something like this:

% globus-url-copy file:/home/ux444444/tmpfile \
   gsiftp://longhorn.tacc.utexas.edu/~/xfer-file

An example of a third-party transfer is as follows:

% globus-url-copy gsiftp://tf004i.sdsc.edu/~/datafile.bin \
   gsiftp://longhorn.tacc.utexas.edu/~/datafile.bin
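
The URL forms in the table can be assembled programmatically when transfers are scripted.  The Python sketch below builds the two forms used in the examples and the resulting argument list.  The helper names are hypothetical, and the code only constructs strings; it does not run globus-url-copy.

```python
import os


def file_url(path: str) -> str:
    """Local file URL; made absolute, since relative paths are not allowed."""
    return "file:" + os.path.abspath(path)


def gsiftp_url(host: str, path: str) -> str:
    """Remote URL; a path such as '~/name' refers to the home directory."""
    return f"gsiftp://{host}/{path.lstrip('/')}"


source = file_url("/home/ux444444/tmpfile")
destination = gsiftp_url("longhorn.tacc.utexas.edu", "~/xfer-file")

# Argument list suitable for passing to a process launcher.
argv = ["globus-url-copy", source, destination]
print(argv)
```

Replacing source with a second gsiftp_url(...) value gives the argument list for a third-party transfer.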

gsiscp

The GSI-enabled scp command uses syntax similar to that of the regular scp command.  It is not based on the GridFTP protocol.  To use this command on the NPACI Grid, you must specify the port number of the GSI-SSH server running at the remote site.

For example:

% gsiscp -P 1022 datafile.bin \
    longhorn.tacc.utexas.edu:datafile.bin

All GSI server ports are listed in the Grid Services Matrix.

Monitoring and Discovery Services

The monitoring and discovery services (MDS) are available to publish and access NPACI Grid system and application data.  The MDS components of the NPACI Grid software stack are Globus MDS, Ganglia, and NWS.

Globus MDS

The following is an example query against the GIIS (Grid Index Information Service).

To get information about file systems with at least 100 GB available, run the following command:

% grid-info-search -x -h giis.npaci.edu \
    -b 'Mds-Vo-name=npaci,o=grid' \
    '(Mds-Fs-freeMB>=100000)' \
    Mds-Host-hn Mds-Fs-sizeMB Mds-Fs-freeMB
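
The search filter in this query expresses the threshold in megabytes.  The short Python sketch below builds the filter and the full argument list from a size in gigabytes.  It assumes 1 GB = 1000 MB, which is the conversion implied by the example; the function names are hypothetical.

```python
def free_space_filter(min_free_gb: float) -> str:
    """LDAP filter matching file systems with at least min_free_gb free."""
    min_free_mb = int(min_free_gb * 1000)  # assumes 1 GB = 1000 MB
    return f"(Mds-Fs-freeMB>={min_free_mb})"


def grid_info_search_args(min_free_gb: float,
                          host: str = "giis.npaci.edu",
                          base: str = "Mds-Vo-name=npaci,o=grid") -> list[str]:
    """Argument list mirroring the command shown above."""
    return [
        "grid-info-search", "-x", "-h", host,
        "-b", base,
        free_space_filter(min_free_gb),
        "Mds-Host-hn", "Mds-Fs-sizeMB", "Mds-Fs-freeMB",
    ]


print(free_space_filter(100))  # (Mds-Fs-freeMB>=100000)
```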

You have a couple of options for graphically viewing LDAP data published via the Globus MDS:

  • Point your browser at http://npackage.cs.ucsb.edu/ldapbrowser/login.php.  Be patient after clicking the 'Explore' button; the page may remain blank while it loads.
  • Use an LDAP browser such as the LDAP Editor/Viewer:
    • Freely available at: http://www.iit.edu/~gawojar/ldap/
    • Note: the installation instructions state that the browser is launched by double-clicking browser.jar.  If this file is not available, double-click lbe.jar instead.

To view the NPACI Grid data, configure your LDAP browser with the following values:

    • Host: giis.npaci.edu
    • Port: 2135
    • Base DN: Mds-Vo-name=npaci,o=Grid

[Figure: the LDAP browser displaying the operating system version for one of the Blue Horizon machines at SDSC.]

For more information about the Globus MDS, see http://www.globus.org/mds/.


Network Weather Service

The Network Weather Service (NWS) is a distributed system that periodically monitors and dynamically forecasts the performance various network and computational resources can deliver over a given time interval.  The service operates a distributed set of performance sensors (network monitors, CPU monitors, etc.) from which it gathers readings of the instantaneous conditions.  It then uses numerical models to generate forecasts of what the conditions will be for a given time frame.

NWS sensors are running on each NPACI Grid resource.  To query NWS, you need to know the host where the NWS name server and memory server are running: nws.npaci.edu.  The name server implements a directory capability used to bind process and data names with low-level contact information (such as TCP/IP port numbers or address pairs).

The following contacts the name server and reports on all hosts that have registered to it:

% nws_search -N nws.npaci.edu hosts

The '-N' argument specifies the name server host.  To see the activities associated with each sensor, do the following:

% nws_search -N nws.npaci.edu activities

To view the most recent measurements of the available CPU sensor on b80n03, run the following command:

% nws_extract -M nws.npaci.edu -N nws.npaci.edu \
    -f time,measurement availableCpu b80n03.sdsc.edu

The NWS user's guide explains the use of NWS in detail and provides man pages and more usage examples.  NWS statistics may also be viewed on the web at http://nws.cs.ucsb.edu/CGI/graphIt.cgi and via the GridView portion of the NPACI HotPage.  For more information about the Network Weather Service, refer to http://nws.cs.ucsb.edu/.

Ganglia

Ganglia provides a scalable real-time monitoring system designed specifically for clusters.  Ganglia monitoring daemons (gmond) run on all nodes of the cluster, while the Ganglia Meta Daemons provide a single point of collection for multiple gmonds.

Ganglia information may be viewed within the Globus MDS.  Connect to the NPACI GIIS (as documented above) and drill down into the information for either morpheus or hypnos.

More information about Ganglia is available at http://ganglia.sourceforge.net/


APST

The AppLeS Parameter Sweep Template (APST) automates the execution of parameter sweep applications.  These applications typically involve running a single program (or a small set of programs) many times, varying the runs by changing the command line arguments, the input files, or both.  The individual runs are usually independent, and little or no communication takes place other than by using output files from one run as input files to another.
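
To make the idea of a parameter sweep concrete, the Python sketch below enumerates the command line for every combination of a small set of parameter values.  The program name and flags are hypothetical, and this is not APST's task description format; it only illustrates what "many independent runs with varying arguments" means.

```python
from itertools import product


def sweep_commands(program: str, parameters: dict[str, list]) -> list[str]:
    """Return one command line per combination of parameter values."""
    names = list(parameters)
    commands = []
    for values in product(*(parameters[name] for name in names)):
        args = " ".join(f"--{name} {value}" for name, value in zip(names, values))
        commands.append(f"{program} {args}")
    return commands


# Hypothetical program and parameters, for illustration only.
commands = sweep_commands("./simulate", {"alpha": [0.1, 0.2], "seed": [1, 2, 3]})
print(len(commands))  # 6
print(commands[0])    # ./simulate --alpha 0.1 --seed 1
```

Each resulting command is an independent run, which is what makes such applications straightforward to distribute across grid resources.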

A detailed tutorial for APST is available at http://grail.sdsc.edu/projects/apst/tutorial.html.