
2.11 DAGMan Applications

A directed acyclic graph (DAG) can be used to represent a set of computations where the input, output, or execution of one or more computations is dependent on one or more other computations. The computations are nodes (vertices) in the graph, and the edges (arcs) identify the dependencies. Condor finds machines for the execution of programs, but it does not schedule programs based on dependencies. The Directed Acyclic Graph Manager (DAGMan) is a meta-scheduler for the execution of programs (computations). DAGMan submits the programs to Condor in an order represented by a DAG and processes the results. A DAG input file describes the DAG, and further submit description file(s) are used by DAGMan when submitting programs to run under Condor.

DAGMan is itself executed as a scheduler universe job within Condor. As DAGMan submits programs, it monitors log file(s) to enforce the ordering required within the DAG. DAGMan is also responsible for scheduling, recovery, and reporting on the set of programs submitted to Condor.


2.11.1 DAGMan Terminology

To DAGMan, a node in a DAG may encompass more than a single program submitted to run under Condor. Figure 2.2 illustrates the elements of a node.

Figure 2.2: One Node within a DAG

Before Condor version 6.7.17, the number of Condor jobs per node was restricted to one. This restriction is now relaxed, such that all Condor jobs within a node must share a single cluster number. See the condor_submit manual page for a further definition of a cluster. A limitation exists such that all jobs within the single cluster must use the same log file.
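
As an illustration, the submit description file for a node might place several jobs in a single cluster. The following minimal sketch (the executable and file names are hypothetical) queues three jobs in one cluster, all sharing one log file:

    # Hypothetical submit description file for one node; the single
    # queue statement creates one cluster containing three jobs.
    executable = worker
    arguments  = $(Process)
    log        = node.log
    queue 3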

As DAGMan schedules and submits jobs within nodes to Condor, these jobs are defined to succeed or fail based on their return values. This success or failure is propagated in well-defined ways to the level of a node within a DAG. Further progression of computation (towards completing the DAG) may be defined based upon the success or failure of one or more nodes.

The failure of a single job within a cluster of multiple jobs (within a single node) causes the entire cluster of jobs to fail. Any other jobs within the failed cluster of jobs are immediately removed. Each node within a DAG is further defined to succeed or fail, based upon the return values of a PRE script, the job(s) within the cluster, and/or a POST script.

2.11.2 Input File Describing the DAG

The input file used by DAGMan is called a DAG input file. It may specify seven types of items:

  1. A list of the nodes in the DAG which cause the submission of one or more Condor jobs. Each entry serves to name a node and specify a Condor submit description file.
  2. A list of the nodes in the DAG which cause the submission of a data placement job. Each entry serves to name a node and specify the Stork submit description file.
  3. Any processing required to take place before submission of a node's Condor or Stork job, or after a node's Condor or Stork job has completed execution.
  4. A description of the dependencies within the DAG.
  5. The number of times to retry a node's execution, if a node within the DAG fails.
  6. Any definition of macros associated with a node.
  7. A node's exit value that causes the entire DAG to abort.

Comments may be placed in the DAG input file. The pound character (#) as the first character on a line identifies the line as a comment. Comments do not span lines.

A simple diamond-shaped DAG, as shown in Figure 2.3, is presented as a starting point for examples. This DAG contains four nodes.

Figure 2.3: Diamond DAG

A very simple DAG input file for this diamond-shaped DAG is

    # Filename: diamond.dag
    #
    JOB  A  A.condor 
    JOB  B  B.condor 
    JOB  C  C.condor	
    JOB  D  D.condor
    PARENT A CHILD B C
    PARENT B C CHILD D

Each DAG input file key word is described below.


2.11.2.1 JOB

The first set of lines in the simple DAG input file list each of the jobs that appear in the DAG. Each job to be managed by Condor is described by a single line that begins with the key word JOB. The syntax used for each JOB entry is

JOB JobName SubmitDescriptionFileName [DIR directory] [DONE]

A JOB entry maps a JobName to a Condor submit description file. The JobName uniquely identifies nodes within the DAGMan input file and in output messages. Note that the name for each node within the DAG must be unique.

The key words JOB and DONE are not case sensitive. Therefore, DONE, Done, and done are all equivalent. The values defined for JobName and SubmitDescriptionFileName are case sensitive, as file names in the Unix file system are case sensitive. The JobName can be any string that contains no white space.

The DIR option specifies a working directory for this node, from which the Condor job will be submitted, and from which a PRE and/or POST script will be run. Note that a DAG containing DIR specifications cannot be run in conjunction with the -usedagdir command-line argument to condor_submit_dag. A rescue DAG generated by a DAG run with the -usedagdir argument will contain DIR specifications, so the rescue DAG must be run without the -usedagdir argument.

The optional DONE identifies a job as being already completed. This is useful in situations where the user wishes to verify results, but does not need all programs within the dependency graph to be executed. The DONE feature is also utilized when an error occurs causing the DAG to be aborted without completion. DAGMan generates a Rescue DAG, a DAG input file that can be used to restart and complete a DAG without re-executing completed nodes.
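
As an illustration, the following hypothetical JOB entries use these options. Node A's job is submitted from the subdirectory run_A, and node B is treated as already complete:

    JOB A A.condor DIR run_A
    JOB B B.condor DONE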


2.11.2.2 DATA

The DATA key word specifies a job to be managed by the Stork data placement server. The syntax used for each DATA entry is

DATA JobName SubmitDescriptionFileName [DIR directory] [DONE]

A DATA entry maps a JobName to a Stork submit description file. In all other respects, the DATA key word is identical to the JOB key word.

Here is an example of a simple DAG that stages in data using Stork, processes the data using Condor, and stages the processed data out using Stork. Depending upon the implementation, multiple data jobs to stage in data or to stage out data may be run in parallel.

    DATA    STAGE_IN1  stage_in1.stork
    DATA    STAGE_IN2  stage_in2.stork
    JOB     PROCESS    process.condor 
    DATA    STAGE_OUT1 stage_out1.stork
    DATA    STAGE_OUT2 stage_out2.stork
    PARENT  STAGE_IN1 STAGE_IN2 CHILD PROCESS
    PARENT  PROCESS CHILD STAGE_OUT1 STAGE_OUT2


2.11.2.3 SCRIPT

The third type of item in a DAG input file identifies processing that is done either before a job within the DAG is submitted to Condor or Stork for execution, or after a job within the DAG completes its execution. Processing done before a job is submitted to Condor or Stork is called a PRE script. Processing done after a job completes its execution under Condor or Stork is called a POST script. A node in the DAG consists of the job together with PRE and/or POST scripts.

PRE and POST script lines within the DAG input file use the syntax:

SCRIPT PRE JobName ExecutableName [arguments]

SCRIPT POST JobName ExecutableName [arguments]

The SCRIPT key word identifies the type of line within the DAG input file. The PRE or POST key word specifies the relative timing of when the script is to be run. The JobName specifies the node to which the script is attached. The ExecutableName specifies the script to be executed, and it may be followed by any command line arguments to that script. The ExecutableName and optional arguments are case sensitive; they have their case preserved.

Scripts are optional for each job, and any scripts are executed on the machine from which the DAG is submitted; this is not necessarily the same machine upon which the node's Condor or Stork job is run. Further, a single cluster of Condor jobs may be spread across several machines.

A PRE script is commonly used to place files in a staging area for the cluster of jobs to use. A POST script is commonly used to clean up or remove files once the cluster of jobs is finished running. An example uses PRE and POST scripts to stage files that are stored on tape. The PRE script reads compressed input files from the tape drive, uncompresses them, and places the input files in the current directory. The cluster of Condor jobs reads these input files and produces output files. The POST script compresses the output files, writes them out to the tape, and then removes both the staged input files and the output files.

DAGMan takes note of the exit value of the scripts as well as the job. A script with an exit value not equal to 0 fails. If the PRE script fails, then neither the job nor the POST script runs, and the node fails.

If the PRE script succeeds, the Condor or Stork job is submitted. If the job fails and there is no POST script, the DAG node is marked as failed. An exit value not equal to 0 indicates program failure. It is therefore important that a successful program return the exit value 0.

If the job fails and there is a POST script, node failure is determined by the exit value of the POST script. A failing value from the POST script marks the node as failed. A succeeding value from the POST script (even with a failed job) marks the node as successful. Therefore, the POST script may need to consider the return value from the job.

By default, the POST script is run regardless of the job's return value.

A node not marked as failed at any point is successful. Table 2.1 summarizes the success or failure of an entire node for all possibilities. An S stands for success, an F stands for failure, and the dash character (-) identifies that there is no script.


Table 2.1: Node success or failure definition

    PRE   -  -  F        S  S  -  -  -  -  S  S  S  S
    JOB   S  F  not run  S  F  S  S  F  F  S  F  F  S
    POST  -  -  not run  -  -  S  F  S  F  S  S  F  F
    node  S  F  F        S  F  S  F  S  F  S  S  F  F


Two variables may be used within the DAG input file, and may ease script writing. The variables are often utilized in the arguments passed to a PRE or POST script. The variable $JOB evaluates to the (case sensitive) string defined for JobName. For use as an argument to POST scripts, the $RETURN variable evaluates to the return value of the Condor or Stork job. A job that dies due to a signal is reported with a $RETURN value representing the negative signal number. For example, SIGKILL (signal 9) is reported as -9. A job whose batch system submission fails is reported as -1001. A job that is externally removed from the batch system queue (by something other than condor_dagman) is reported as -1002.

As an example, consider the diamond-shaped DAG. Suppose the PRE script expands a compressed file needed as input to nodes B and C. The file name is of the form JobName.gz. The DAG input file becomes

    # Filename: diamond.dag
    #
    JOB  A  A.condor 
    JOB  B  B.condor 
    JOB  C  C.condor	
    JOB  D  D.condor
    SCRIPT PRE  B  pre.csh $JOB .gz
    SCRIPT PRE  C  pre.csh $JOB .gz
    PARENT A CHILD B C
    PARENT B C CHILD D

The script pre.csh uses the arguments to form the file name of the compressed file:

    #!/bin/csh
    gunzip $argv[1]$argv[2]
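
A POST script may use the $RETURN variable in a similar way. As a sketch (the script name and the policy of accepting exit value 2 are hypothetical), the DAG input file line

SCRIPT POST B post.csh $RETURN

together with the script

    #!/bin/csh
    # Hypothetical policy: job exit values 0 and 2 both count as success.
    if ($argv[1] == 0 || $argv[1] == 2) then
        exit 0
    endif
    exit 1

marks node B as successful when its job exits with value 0 or 2, and as failed otherwise.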


2.11.2.4 PARENT..CHILD

The fourth type of item in the DAG input file describes the dependencies within the DAG. Nodes are parents and/or children within the DAG. A parent node must be completed successfully before any of its children may be started. A child node may only be started once all its parents have successfully completed.

The syntax of a dependency line within the DAG input file:

PARENT ParentJobName... CHILD ChildJobName...

The PARENT key word is followed by one or more ParentJobNames. The CHILD key word is followed by one or more ChildJobNames. Each child job depends on every parent job within the line. A single line in the input file can specify the dependencies from one or more parents to one or more children. As an example, the line

PARENT p1 p2 CHILD c1 c2
produces four dependencies:
  1. p1 to c1
  2. p1 to c2
  3. p2 to c1
  4. p2 to c2

2.11.2.5 RETRY

The fifth type of item in the DAG input file provides a way to retry failed nodes. Use of retry is optional. The syntax for retry is

RETRY JobName NumberOfRetries [UNLESS-EXIT value]

where JobName identifies the node. NumberOfRetries is an integer number of times to retry the node after failure. The implied number of retries for any node is 0, the same as not having a retry line in the file. Retry is implemented on nodes, not parts of a node.

The diamond-shaped DAG example may be modified to retry node C:

    # Filename: diamond.dag
    #
    JOB  A  A.condor 
    JOB  B  B.condor 
    JOB  C  C.condor	
    JOB  D  D.condor
    PARENT A CHILD B C
    PARENT B C CHILD D
    Retry  C 3

If node C is marked as failed (for any reason), then it is started over as a first retry. The node will be tried a second and third time, if it continues to fail. If the node is marked as successful, then further retries do not occur.

Retry of a node may be short-circuited using the optional key word UNLESS-EXIT (followed by an integer exit value). If the node exits with the specified integer exit value, then no further processing will be done on the node.
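
For example, the following line tries node C up to three times after a failure, but stops processing the node immediately if C exits with the value 42 (an arbitrary value chosen for this example):

    RETRY C 3 UNLESS-EXIT 42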

2.11.2.6 VARS

The sixth type of item in the DAG input file provides a method for defining a macro. This macro may then be used in a submit description file. These macros are defined on a per-node basis, using the following format.

VARS JobName macroname="string" [macroname="string"... ]

The macro may be used within the submit description file. A macroname consists of alphanumeric characters (a-z, A-Z, and 0-9), as well as the underscore character. The space character delimits macros, when there is more than one macro defined for a node.

Correct syntax requires that the string must be enclosed in double quotes. To use a double quote inside string, escape it with the backslash character (\). To add the backslash character itself, use two backslashes (\\).
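
As an illustration, the following hypothetical definitions use both escapes; the first macro's value contains double quotes, and the second contains a single backslash:

    VARS A quoted="say \"hello\""
    VARS A winpath="C:\\data"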

Note that macro names cannot begin with the string "queue" (in any combination of upper and lower case).

If the DAG input file contains

    # Filename: diamond.dag
    #
    JOB  A  A.condor 
    JOB  B  B.condor 
    JOB  C  C.condor	
    JOB  D  D.condor
    VARS A state="Wisconsin"

then file A.condor may use the macro state. This example submit description file for the Condor job in node A passes the value of the macro as a command-line argument to the job.

    # file name: A.condor
    executable = A.exe
    log        = A.log
    error      = A.err
    arguments  = $(state)
    queue

This Condor job's command line will be

A.exe Wisconsin
The use of macros may allow a reduction in the necessary number of unique submit description files.

2.11.2.7 ABORT-DAG-ON

The seventh type of item in the DAG input file provides a way to abort the entire DAG if a given node returns a specific exit code. The syntax for ABORT-DAG-ON:

ABORT-DAG-ON JobName AbortExitValue [RETURN DAGReturnValue]

If the node specified by JobName returns the specified AbortExitValue, the DAG is immediately aborted. A DAG abort differs from a node failure, in that a DAG abort causes all nodes within the DAG to be stopped immediately. This includes removing the jobs in nodes that are currently running. A node failure allows the DAG to continue running, until no more progress can be made due to dependencies.

An abort overrides node retries. If a node returns the abort exit value, the DAG is aborted, even if the node has retry specified.

When a DAG aborts, by default it exits with the node return value that caused the abort. This can be changed by using the optional RETURN key word along with specifying the desired DAGReturnValue. The DAG abort return value can be used for DAGs within DAGs, allowing an inner DAG to cause an abort of an outer DAG.

Adding ABORT-DAG-ON for node C in the diamond-shaped DAG

    # Filename: diamond.dag
    #
    JOB  A  A.condor 
    JOB  B  B.condor 
    JOB  C  C.condor	
    JOB  D  D.condor
    PARENT A CHILD B C
    PARENT B C CHILD D
    Retry  C 3
    ABORT-DAG-ON C 10 RETURN 1

causes the DAG to be aborted, if node C exits with a return value of 10. Any other currently running nodes (only node B is a possibility for this particular example) are stopped and removed. If this abort occurs, the return value for the DAG is 1.

2.11.3 Submit Description File

Each node in a DAG may use a unique submit description file. One key limitation is that each Condor submit description file must submit jobs described by a single cluster number. At the present time, DAGMan cannot deal with a submit description file producing multiple job clusters.

At one time, DAGMan required that all jobs within all nodes specify the same, single log file. This is no longer the case. However, if the DAG utilizes a large number of separate log files, performance may suffer. Therefore, it is better to have fewer, or even only a single log file. Unfortunately, each Stork job currently requires a separate log file. DAGMan enforces the dependencies within a DAG using the events recorded in the log file(s) produced by job submission to Condor.

Here is a modified version of the DAG input file for the diamond-shaped DAG. The modification has each node use the same submit description file.

    # Filename: diamond.dag
    #
    JOB  A  diamond_job.condor 
    JOB  B  diamond_job.condor 
    JOB  C  diamond_job.condor	
    JOB  D  diamond_job.condor
    PARENT A CHILD B C
    PARENT B C CHILD D

Here is the single Condor submit description file for this DAG:

    # Filename: diamond_job.condor
    #
    executable   = /path/diamond.exe
    output       = diamond.out.$(cluster)
    error        = diamond.err.$(cluster)
    log          = diamond_condor.log
    universe     = vanilla
    notification = NEVER
    queue

This example uses the same Condor submit description file for all the jobs in the DAG. This implies that each node within the DAG runs the same job. The $(cluster) macro produces unique file names for each job's output. As the Condor job within each node causes a separate job submission, each has a unique cluster number.

Notification is set to NEVER in this example. This tells Condor not to send e-mail about the completion of a job submitted to Condor. For DAGs with many nodes, this reduces or eliminates excessive numbers of e-mails.

A separate example shows an intended use of a VARS entry in the DAG input file. This use may dramatically reduce the number of Condor submit description files needed for a DAG. In the case where the submit description file for each node varies only in file naming, the use of a substitution macro within the submit description file reduces the need to a single submit description file. Note that the user log file for a job currently cannot be specified using a macro passed from the DAG.

The example uses a single submit description file in the DAG input file, and uses the VARS entry to name output files.

The relevant portion of the DAG input file appears as

JOB A theonefile.sub
JOB B theonefile.sub
JOB C theonefile.sub

VARS A outfilename="A"
VARS B outfilename="B"
VARS C outfilename="C"

The submit description file appears as

    # submit description file called:  theonefile.sub
    executable   = progX
    universe     = standard
    output       = $(outfilename)
    error        = error.$(outfilename)
    log          = progX.log
    queue

For a DAG like this one with thousands of nodes, being able to write and maintain a single submit description file and a single, yet more complex, DAG input file is preferable.
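
A DAG input file of this size is usually generated by a small script rather than written by hand. Here is a minimal sketch in shell; the node names, the count of 1000, and the output file name big.dag are hypothetical:

    #!/bin/sh
    # Hypothetical generator: 1000 nodes, all sharing theonefile.sub
    for i in `seq 1 1000`; do
        echo "JOB  N$i theonefile.sub"
        echo "VARS N$i outfilename=\"N$i\""
    done > big.dag

This produces JOB and VARS lines for nodes N1 through N1000; PARENT lines describing the dependencies would be appended in the same way.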


2.11.4 Job Submission

A DAG is submitted using the program condor_submit_dag. See the condor_submit_dag manual page for complete details. A simple submission has the syntax

condor_submit_dag DAGInputFileName

The diamond-shaped DAG example may be submitted with

condor_submit_dag diamond.dag
In order to guarantee recoverability, the DAGMan program itself is run as a Condor job. As such, it needs a submit description file. condor_submit_dag produces this needed submit description file, naming it by appending .condor.sub to the DAGInputFileName. This submit description file may be edited if the DAG is submitted with

condor_submit_dag -no_submit diamond.dag
causing condor_submit_dag to generate the submit description file, but not submit DAGMan to Condor. To submit the DAG, once the submit description file is edited, use

condor_submit diamond.dag.condor.sub

An optional argument to condor_submit_dag, -maxjobs, is used to specify the maximum number of batch jobs that DAGMan may submit at one time. It is commonly used when there is a limited amount of input file staging capacity. As a specific example, consider a case where each job will require 4 Mbytes of input files, and the jobs will run in a directory with a volume of 100 Mbytes of free space. Using the argument -maxjobs 25 guarantees that a maximum of 25 jobs, using a maximum of 100 Mbytes of space, will be submitted to Condor and/or Stork at one time.

While the -maxjobs argument is used to limit the number of batch system jobs submitted at one time, it may be desirable to limit the number of scripts running at one time. The optional -maxpre argument limits the number of PRE scripts that may be running at one time, while the optional -maxpost argument limits the number of POST scripts that may be running at one time.

An optional argument to condor_submit_dag, -maxidle, is used to limit the number of idle jobs within a given DAG. When the number of idle node jobs in the DAG reaches the specified value, condor_dagman will stop submitting jobs, even if there are ready nodes in the DAG. Once some of the idle jobs start to run, condor_dagman will resume submitting jobs. Note that this parameter only limits the number of idle jobs submitted by a given instance of condor_dagman. Idle jobs submitted by other sources (including other condor_dagman runs) are ignored.
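
As an illustration, these limits may be combined on one command line; the values shown are arbitrary:

    condor_submit_dag -maxjobs 25 -maxpre 2 -maxpost 2 -maxidle 50 diamond.dag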

DAGs that submit jobs to Stork using the DATA key word must also specify the Stork user log file, using the -storklog argument.

2.11.5 Job Monitoring, Job Failure, and Job Removal

After submission, the progress of the DAG can be monitored by looking at the log file(s), observing the e-mail that job submission to Condor causes, or by using condor_q -dag. DAGMan also writes a large amount of status and debugging information to an extra file. The name of this extra file is produced by appending .dagman.out to DAGInputFileName; for example, if the DAG file is diamond.dag, this extra file is diamond.dag.dagman.out. If this extra file grows too large, limit its size with the MAX_DAGMAN_LOG configuration macro (see section 3.3.4).
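
For example, assuming the diamond-shaped DAG was submitted from the current working directory, its progress might be checked with

    % condor_q -dag
    % tail diamond.dag.dagman.out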

If you have some kind of problem in your DAGMan run, please save the corresponding dagman.out file; it is the most important debugging tool for DAGMan. As of version 6.8.2, the dagman.out is appended to, rather than overwritten, with each new DAGMan run.

condor_submit_dag attempts to check the DAG input file. If a problem is detected, condor_submit_dag prints out an error message and aborts.

To remove an entire DAG, consisting of DAGMan plus any jobs submitted to Condor or Stork, remove the DAGMan job running under Condor. condor_q will list the job number. Use the job number to remove the job, for example

% condor_q
-- Submitter: turunmaa.cs.wisc.edu : <128.105.175.125:36165> : turunmaa.cs.wisc.edu
 ID      OWNER          SUBMITTED     RUN_TIME ST PRI SIZE CMD
  9.0   smoler         10/12 11:47   0+00:01:32 R  0   8.7  condor_dagman -f -
 11.0   smoler         10/12 11:48   0+00:00:00 I  0   3.6  B.out
 12.0   smoler         10/12 11:48   0+00:00:00 I  0   3.6  C.out

    3 jobs; 2 idle, 1 running, 0 held

% condor_rm 9.0

Before the DAGMan job stops running, it uses condor_rm and/or stork_rm to remove any jobs within the DAG that are running.

In the case where a machine is scheduled to go down, DAGMan will clean up memory and exit. However, it will leave any submitted jobs in Condor's queue.

2.11.6 Job Recovery: The Rescue DAG

DAGMan can help with the resubmission of uncompleted portions of a DAG, when one or more nodes result in failure. If any node in the DAG fails, the remainder of the DAG is continued until no more forward progress can be made based on the DAG's dependencies. At this point, DAGMan produces a file called a Rescue DAG.

The Rescue DAG is a DAG input file, functionally the same as the original DAG file. It additionally contains an indication of successfully completed nodes by appending the DONE key word to the node's JOB or DATA lines. If the DAG is resubmitted using this Rescue DAG input file, the nodes marked as completed will not be re-executed.

The Rescue DAG is automatically generated by DAGMan when a node within the DAG fails. The file name assigned is DAGInputFileName, appended with the suffix .rescue. Statistics about the failed DAG execution are presented as comments at the beginning of the Rescue DAG input file.
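
As a sketch, if nodes A and B of the diamond-shaped DAG succeeded and node C failed, the generated file diamond.dag.rescue might contain (statistics comments omitted):

    # Rescue DAG for diamond.dag
    JOB  A  A.condor DONE
    JOB  B  B.condor DONE
    JOB  C  C.condor
    JOB  D  D.condor
    PARENT A CHILD B C
    PARENT B C CHILD D

Resubmitting with condor_submit_dag diamond.dag.rescue then re-executes only nodes C and D.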

If the Rescue DAG file is generated before all retries of a node are completed, then the Rescue DAG file will also contain Retry entries. The number of retries will be set to the appropriate remaining number of retries.

The granularity defining success or failure in the Rescue DAG input file is given for nodes. The Condor job within a node may result in the submission of multiple Condor jobs under a single cluster. If one of the multiple jobs fails, the node fails. Therefore, a resubmission of the Rescue DAG will again result in the submission of the entire cluster of jobs.


2.11.7 Visualizing DAGs with dot

It can be helpful to see a picture of a DAG. DAGMan can assist you in visualizing a DAG by creating the input files used by the AT&T Research Labs graphviz package. dot is a program within this package, available from http://www.graphviz.org/, and it is used to draw pictures of DAGs.

DAGMan produces one or more dot files as the result of an extra line in a DAGMan input file. The line appears as

    DOT dag.dot

This creates a file called dag.dot, which contains a specification of the DAG before any jobs within the DAG are submitted to Condor. The dag.dot file is used to create a visualization of the DAG by using this file as input to dot. This example creates a PostScript file, with a visualization of the DAG:

    dot -Tps dag.dot -o dag.ps

Within the DAGMan input file, the DOT command can take several optional parameters:

  UPDATE - Updates the dot file every time a significant change occurs in the DAG.
  DONT-UPDATE - Creates a single dot file when DAGMan begins executing. This is the default if UPDATE is not used.
  OVERWRITE - Overwrites the dot file each time it is created or updated. This is the default, unless DONT-OVERWRITE is specified.
  DONT-OVERWRITE - Creates multiple dot files instead of overwriting a single one. DAGMan forms each file name by appending a period and an integer, as in dag.dot.0, dag.dot.1, dag.dot.2, and so on.
  INCLUDE pathtofilename - Includes the contents of the named file within the file created by the DOT command.

If conflicting parameters are used in a DOT command, the last one listed is used.


2.11.8 Advanced Usage: A DAG within a DAG

The organization and dependencies of the jobs within a DAG are the keys to its utility. There are cases when a DAG is easier to visualize and construct hierarchically, as when a node within a DAG is also a DAG. Condor DAGMan handles this situation with grace. Since more than one DAG is being discussed, terminology is introduced to clarify which DAG is which. Reuse the example diamond-shaped DAG as given in Figure 2.3. Assume that node B of this diamond-shaped DAG will itself be a DAG. The DAG of node B is called the inner DAG, and the diamond-shaped DAG is called the outer DAG.

To make DAGs within DAGs, the essential element is getting the name of the submit description file for the inner DAG correct within the outer DAG's input file.

Work on the inner DAG first. The goal is to generate a Condor submit description file for this inner DAG. Here is a very simple linear DAG input file used as an example of the inner DAG.

    # Filename: inner.dag
    #
    JOB  X  X.submit
    JOB  Y  Y.submit
    JOB  Z  Z.submit
    PARENT X CHILD Y
    PARENT Y CHILD Z

Use condor_submit_dag to create a submit description file for this inner DAG:

   condor_submit_dag -no_submit inner.dag
The resulting file will be named inner.dag.condor.sub. This file will be needed in the DAG input file of the outer DAG. The naming of the file is the name of the DAG input file (inner.dag) with the suffix .condor.sub.

A simple example of a DAG input file for the outer DAG is

    # Filename: diamond.dag
    #
    JOB  A  A.submit 
    JOB  B  inner.dag.condor.sub
    JOB  C  C.submit	
    JOB  D  D.submit
    PARENT A CHILD B C
    PARENT B C CHILD D

The outer DAG is then submitted as before, with

   condor_submit_dag diamond.dag

More than one level of nested DAGs is supported.

One item to get right: to locate the log files used in ordering the DAG, DAGMan either needs a completely flat directory structure (all files for outer and inner DAGs within the same directory), or it needs full path names to all log files.


2.11.9 Single Submission of Multiple, Independent DAGs

A single use of condor_submit_dag may execute multiple, independent DAGs. Each independent DAG has its own DAG input file. These DAG input files are command-line arguments to condor_submit_dag (see the condor_submit_dag manual page).
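
For example, the hypothetical command

    condor_submit_dag one.dag two.dag

runs the two independent DAGs described by one.dag and two.dag under a single instance of DAGMan.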

Internally, all of the independent DAGs are combined into a single, larger DAG, with no dependencies between the original independent DAGs. As a result, any generated rescue DAG file represents all of the input DAGs as a single DAG. The file name of this rescue DAG is based on the DAG input file listed first within the command-line arguments to condor_ submit_dag (unlike a single-DAG rescue DAG file, however, the file name will be <whatever>.dag_multi.rescue, as opposed to just <whatever>.dag.rescue). Other files such as dagman.out and the lock file also have names based on this first DAG input file.

The success or failure of the independent DAGs is well defined. When multiple, independent DAGs are submitted with a single command, the success of the composite DAG is defined as the logical AND of the success of each independent DAG. This implies that failure is defined as the logical OR of the failure of any of the independent DAGs.

By default, DAGMan internally renames the nodes to avoid node name collisions. If all node names are unique, the renaming of nodes may be disabled by setting the configuration variable DAGMAN_MUNGE_NODE_NAMES to False (see 3.3.23).


2.11.10 File Paths in DAGs

By default, condor_dagman assumes that all relative paths in a DAG input file and the associated Condor submit description files are relative to the current working directory when condor_submit_dag is run. Note that relative paths in submit description files can be modified by the submit command initialdir; see the condor_submit manual page for more details. The rest of this discussion ignores initialdir.

In most cases, path names relative to the current working directory are the desired behavior. However, if running multiple DAGs with a single condor_dagman, and each DAG is in its own directory, this will cause problems. In this case, use the -usedagdir command-line argument to condor_submit_dag (see the condor_submit_dag manual page for more details). This tells condor_dagman to run each DAG as if condor_submit_dag had been run in the directory in which the relevant DAG file exists.

For example, assume that a directory called parent contains two subdirectories called dag1 and dag2, and that dag1 contains the DAG input file one.dag and dag2 contains the DAG input file two.dag. Further, assume that each DAG is set up to be run from its own directory with the following command:

cd dag1; condor_submit_dag one.dag
This will correctly run one.dag.

The goal is to run the two, independent DAGs located within dag1 and dag2 while the current working directory is parent. To do so, run the following command:

condor_submit_dag -usedagdir dag1/one.dag dag2/two.dag

Of course, if all paths in the DAG input file(s) and the relevant submit description files are absolute, the -usedagdir argument is not needed; however, using absolute paths is NOT generally a good idea.

If you do not use -usedagdir, relative paths can still work for multiple DAGs, if all file paths are given relative to the current working directory as condor_ submit_dag is executed. However, this means that, if the DAGs are in separate directories, they cannot be submitted from their own directories, only from the parent directory the paths are set up for.
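
As a hypothetical illustration, if every path within one.dag, two.dag, and their submit description files were instead written relative to parent (for example, dag1/X.submit rather than X.submit), both DAGs could be submitted together without -usedagdir, but only from parent:

    cd parent
    condor_submit_dag dag1/one.dag dag2/two.dag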

Note that if you use the -usedagdir argument, and your run results in a rescue DAG, the rescue DAG file will be written to the current working directory, and should be run from that directory. The rescue DAG includes all the path information necessary to run each node job in the proper directory.


2.11.11 Configuration

Configuration variables relating to DAGMan may be found in section 3.3.23.

