

2.4 Road-map for Running Jobs

The road to using Condor effectively is a short one. The basics are quickly and easily learned.

Here are all the steps needed to run a job using Condor.

Code Preparation.
A job run under Condor must be able to run as a background batch job. Condor runs the program unattended and in the background. A program that runs in the background will not be able to do interactive input and output. Condor can redirect console output (stdout and stderr) and keyboard input (stdin) to and from files for you. Create any needed files that contain the proper keystrokes needed for program input. Make certain the program will run correctly with the files.
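As a quick check, verify that the program runs unattended with its input and output redirected. A minimal sketch, using tr as a stand-in for the real job (all file names here are illustrative):

```shell
# Canned "keystrokes" the program would otherwise read interactively.
printf 'hello\nworld\n' > job.in

# Run the stand-in job unattended: stdin, stdout, and stderr all redirected.
tr '[:lower:]' '[:upper:]' < job.in > job.out 2> job.err

cat job.out    # prints HELLO and WORLD
```

If the program behaves correctly with this kind of redirection, it is ready to run as a background batch job.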

The Condor Universe.
Condor has several runtime environments, called universes, from which to choose. Of the universes, two are likely choices when learning to submit a job to Condor: the standard universe and the vanilla universe. The standard universe allows a job running under Condor to handle system calls by returning them to the machine where the job was submitted. The standard universe also provides the mechanisms necessary to take a checkpoint and migrate a partially completed job, should the machine on which the job is executing become unavailable. To use the standard universe, it is necessary to relink the program with the Condor library using the condor_compile command. The manual page for condor_compile on page [*] has details.

The vanilla universe provides a way to run jobs that cannot be relinked. There is no way to take a checkpoint or migrate a job executed under the vanilla universe. For access to input and output files, jobs must either use a shared file system, or use Condor's File Transfer mechanism.

Choose a universe under which to run the Condor program, and re-link the program if necessary.

Submit description file.
A submit description file controls the details of a job submission. The file contains information about the job, such as the executable to run, the files to use for keyboard and screen data, the platform type required to run the program, and where to send e-mail when the job completes. You can also tell Condor how many times to run a program; it is simple to run the same program multiple times with multiple data sets.

Write a submit description file to go with the job, using the examples provided in section 2.5.1 for guidance.
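For instance, a minimal submit description file might look like the following sketch (the file names are illustrative; section 2.5.1 has complete examples):

```
# Run one copy of the program in the vanilla universe.
universe   = vanilla
executable = program
input      = job.in
output     = job.out
error      = job.err
log        = job.log
queue
```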

Submit the Job.
Submit the program to Condor with the condor_submit command.

Once the job is submitted, Condor does the rest toward running it. Monitor the job's progress with the condor_q and condor_status commands. You may modify the order in which Condor will run your jobs with condor_prio. If desired, Condor can even inform you in a log file every time your job is checkpointed and/or migrated to a different machine.

When your program completes, Condor will tell you (by e-mail, if preferred) the exit status of your program and various statistics about its performance, including time used and I/O performed. If you are using a log file for the job (which is recommended), the exit status will be recorded in the log file. You can remove a job from the queue prematurely with condor_rm.
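The whole cycle, from submission to removal, looks something like this (the cluster and job identifiers are illustrative):

```
% condor_submit program.submit   # queue the job(s) described in the file
% condor_q                       # list your jobs still in the queue
% condor_status                  # show the state of machines in the pool
% condor_prio -p 5 42.0          # raise the priority of job 42.0
% condor_rm 42.0                 # remove job 42.0 from the queue
```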

2.4.1 Choosing a Condor Universe

A universe in Condor defines an execution environment. Condor Version 6.8.7 supports several different universes for user jobs: the standard, vanilla, PVM, MPI, grid, java, scheduler, parallel, and local universes.

The universe attribute is specified in the submit description file. If a universe is not specified, the default is standard.

The standard universe provides migration and reliability, but has some restrictions on the programs that can be run. The vanilla universe provides fewer services, but has very few restrictions. The PVM universe is for programs written to the Parallel Virtual Machine interface. See section 2.9 for more about PVM and Condor. The MPI universe is for programs written to the MPICH interface; it has been superseded by the parallel universe. See section 2.10.5 for more about MPI and Condor. The grid universe allows users to submit jobs, using Condor's interface, for execution on grid resources. The java universe allows users to run jobs written for the Java Virtual Machine (JVM). The scheduler universe allows users to submit lightweight jobs to be spawned by the condor_schedd daemon on the submit host itself. The parallel universe is for programs that require multiple machines for one job. See section 2.10 for more about the parallel universe.

Standard Universe

In the standard universe, Condor provides checkpointing and remote system calls. These features make a job more reliable and allow it uniform access to resources from anywhere in the pool. To prepare a program as a standard universe job, it must be relinked with condor_compile. Most programs can be prepared as standard universe jobs, but there are a few restrictions.

Condor checkpoints a job at regular intervals. A checkpoint image is essentially a snapshot of the current state of a job. If a job must be migrated from one machine to another, Condor makes a checkpoint image, copies the image to the new machine, and restarts the job from where it left off. If a machine should crash or fail while running a job, Condor can restart the job on a new machine using the most recent checkpoint image. In this way, jobs can run for months or years even in the face of occasional computer failures.

Remote system calls make a job perceive that it is executing on its home machine, even though the job may execute on many different machines over its lifetime. When a job runs on a remote machine, a second process, called the condor_shadow, runs on the machine where the job was submitted. When the job attempts a system call, the condor_shadow performs the system call instead and sends the results to the remote machine. For example, if a job attempts to open a file that is stored on the submitting machine, the condor_shadow will find the file and send the data to the machine where the job is running.

To convert your program into a standard universe job, you must use condor_compile to relink it with the Condor libraries. Put condor_compile in front of your usual link command. You do not need to modify the program's source code, but you do need access to the unlinked object files. A commercial program that is packaged as a single executable file cannot be converted into a standard universe job.

For example, if you would have linked the job by executing:

% cc main.o tools.o -o program

Then, relink the job for Condor with:

% condor_compile cc main.o tools.o -o program
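condor_compile can also wrap an entire build, which is convenient when the program is built by a Makefile (this assumes the Makefile's final link step invokes a compiler that condor_compile supports):

```
% condor_compile make program
```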

There are a few restrictions on standard universe jobs:

  1. Multi-process jobs are not allowed. This includes system calls such as fork(), exec(), and system().

  2. Interprocess communication is not allowed. This includes pipes, semaphores, and shared memory.

  3. Network communication must be brief. A job may make network connections using system calls such as socket(), but a network connection left open for long periods will delay checkpointing and migration.

  4. Sending or receiving the SIGUSR2 or SIGTSTP signals is not allowed. Condor reserves these signals for its own use. Sending or receiving all other signals is allowed.

  5. Alarms, timers, and sleeping are not allowed. This includes system calls such as alarm(), getitimer(), and sleep().

  6. Multiple kernel-level threads are not allowed. However, multiple user-level threads are allowed.

  7. Memory mapped files are not allowed. This includes system calls such as mmap() and munmap().

  8. File locks are allowed, but not retained between checkpoints.

  9. All files must be opened read-only or write-only. A file opened for both reading and writing will cause trouble if a job must be rolled back to an old checkpoint image. For compatibility reasons, a file opened for both reading and writing will result in a warning but not an error.

  10. A fair amount of disk space must be available on the submitting machine for storing a job's checkpoint images. A checkpoint image is approximately equal to the virtual memory consumed by a job while it runs. If disk space is short, a special checkpoint server can be designated for storing all the checkpoint images for a pool.

  11. On Digital Unix (OSF/1), HP-UX, and Linux, your job must be statically linked. Dynamic linking is allowed on all other platforms.

  12. Reading from or writing to files larger than 2 GB is not supported.

Vanilla Universe

The vanilla universe in Condor is intended for programs which cannot be successfully relinked. Shell scripts are another case where the vanilla universe is useful. Jobs run under the vanilla universe cannot checkpoint or use remote system calls. This has unfortunate consequences for a job that is partially completed when the remote machine running it must be returned to its owner: Condor has only two choices. It can suspend the job, hoping to complete it at a later time, or it can give up and restart the job from the beginning on another machine in the pool.

Since Condor's remote system call features cannot be used with the vanilla universe, access to the job's input and output files becomes a concern. One option is for Condor to rely on a shared file system, such as NFS or AFS. Alternatively, Condor has a mechanism for transferring files on behalf of the user. In this case, Condor will transfer any files needed by a job to the execution site, run the job, and transfer the output back to the submitting machine.
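File transfer is enabled with a few commands in the submit description file (the input file name is illustrative; section 2.5.4 has the details):

```
# Transfer files explicitly instead of relying on a shared file system.
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_input_files    = job.in
```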

Under Unix, Condor presumes a shared file system for vanilla jobs. However, if a shared file system is unavailable, a user can enable the Condor File Transfer mechanism. On Windows platforms, the default is to use the File Transfer mechanism. For details on running a job with a shared file system, see section 2.5.3 on page [*]. For details on using the Condor File Transfer mechanism, see section 2.5.4 on page [*].

PVM Universe

The PVM universe allows programs written for the Parallel Virtual Machine interface to be used within the opportunistic Condor environment. Please see section 2.9 for more details.

Grid Universe

The Grid universe in Condor is intended to provide the standard Condor interface to users who wish to start jobs intended for remote management systems. Section 5.3 on page [*] has details on using the Grid universe. The manual page for condor_submit on page [*] has detailed descriptions of the grid-related attributes.

Java Universe

A program submitted to the Java universe may run on any sort of machine with a JVM regardless of its location, owner, or JVM version. Condor will take care of all the details, such as finding the JVM binary and setting the classpath.

Scheduler Universe

The scheduler universe allows users to submit lightweight jobs to be run immediately, alongside the condor_schedd daemon on the submit host itself. Scheduler universe jobs are not matched with a remote machine, and will never be preempted. They do not obey the machine's requirements expression.

Originally intended for meta-schedulers such as condor_dagman, the scheduler universe can also be used to manage jobs of any sort that must run on the submit host.

However, unlike the local universe, the scheduler universe does not use a condor_starter daemon to manage the job, and thus offers limited features and policy support. The local universe is a better choice for most jobs which must run on the submit host, as it offers a richer set of job management features, and is more consistent with other universes such as the vanilla universe. The scheduler universe may be retired in the future, in favor of the newer local universe.

Parallel Universe

The parallel universe allows parallel programs, such as MPI jobs, to be run within the opportunistic Condor environment. Please see section 2.10 for more details.

Local Universe

The local universe allows a Condor job to be submitted and executed with different assumptions for the execution conditions of the job. The job does not wait to be matched with a machine. It instead executes right away, on the machine where the job is submitted. The job will never be preempted. The machine requirements are not considered for local universe jobs.
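Selecting this behavior takes a single command in the submit description file; the rest of the file is unchanged from a vanilla universe submission (a minimal sketch, with illustrative file names):

```
universe   = local
executable = program
log        = job.log
queue
```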
