

2.14 Special Environment Considerations

2.14.1 AFS

The Condor daemons do not run authenticated to AFS; they do not possess AFS tokens. Therefore, no child process of Condor will be AFS authenticated. This means you must set file permissions so that your job can access any necessary files residing on an AFS volume without relying on your own AFS permissions.

If a job you submit to Condor needs to access files residing in AFS, you have the following choices:

  1. Copy the needed files from AFS to either a local hard disk where Condor can access them using remote system calls (if this is a standard universe job), or copy them to an NFS volume.
  2. If you must keep the files on AFS, then set a host ACL (using the AFS fs setacl command) on the subdirectory that will serve as the current working directory for the job. For a standard universe job, the host ACL needs to give read/write permission to any process on the submit machine. For a vanilla universe job, you need to set the ACL such that any host in the pool can access the files without being authenticated. If you do not know how to use an AFS host ACL, ask the person at your site responsible for the AFS configuration.
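
As a rough sketch of the second choice, the script below prints (it does not run) the kind of AFS commands commonly used to build a host ACL: machine entries are placed in a protection group, and the group is then given rights on the job directory. The group name, directory, and IP addresses are hypothetical placeholders, and the rlidwk rights are only one possible choice. Creating machine entries usually requires AFS administrator privileges, so treat the output as something to review with your AFS administrator rather than a recipe.

```shell
#!/bin/sh
# Hypothetical placeholders: adjust the group, directory, and hosts for your site.
GROUP="johndoe:condorhosts"
JOB_DIR="/afs/example.org/user/johndoe/jobdir"

# Print the AFS commands that would grant the listed hosts read/write
# access to JOB_DIR. Nothing is executed; the output is for review only.
print_host_acl_commands() {
    echo "pts creategroup -name $GROUP"
    for ip in "$@"; do
        # A machine entry is named by its IP address and must belong to
        # a group before it can appear on an ACL.
        echo "pts createuser -name $ip"
        echo "pts adduser -user $ip -group $GROUP"
    done
    echo "fs setacl -dir $JOB_DIR -acl $GROUP rlidwk"
}

print_host_acl_commands 192.0.2.10 192.0.2.11
```

For a standard universe job, the host list would contain only the submit machine; for a vanilla universe job, it would need to cover every host in the pool.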

The Condor Team hopes to improve upon how Condor deals with AFS authentication in a subsequent release.

Please see section 3.12.1 in the Administrators Manual for further discussion of this problem.

2.14.2 NFS Automounter

If your current working directory when you run condor_submit is accessed via an NFS automounter, Condor may have problems if the automounter later decides to unmount the volume before your job has completed. This is because condor_submit has likely stored the dynamic mount point as the job's initial current working directory, and the automounter could unmount this mount point automatically.

There is a simple workaround: when submitting your job, use the initialdir command in your submit description file to point to the stable access point. For example, suppose the NFS automounter is configured to mount a volume at mount point /a/ whenever the directory /home/johndoe is accessed. Adding the following line to the submit description file solves the problem.

        initialdir = /home/johndoe

2.14.3 Condor Daemons That Do Not Run as root

Condor is normally installed such that the Condor daemons have root permission. This allows Condor to run the condor_shadow process and your job with your UID and file access rights. When Condor is started as root, your Condor jobs can access whatever files you can.

However, it is possible that whoever installed Condor did not have root access, or decided not to run the daemons as root. That is unfortunate, since Condor is designed to be run as the Unix user root. To see if Condor is running as root on a specific machine, enter the command

        condor_status -master -l <machine-name>

where machine-name is the name of the specified machine. This command displays a condor_master ClassAd; if the attribute RealUid equals zero, then the Condor daemons are indeed running with root access. If the RealUid attribute is not zero, then the Condor daemons do not have root access.
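
The check described above can be scripted. The following is a minimal sketch, assuming the ClassAd is printed one attribute per line in the form "Attribute = value"; the function name check_real_uid and the machine name in the usage comment are hypothetical.

```shell
#!/bin/sh
# Read a ClassAd (one "Attribute = value" pair per line) on standard
# input and report whether RealUid indicates root access.
check_real_uid() {
    # Take the value of the first RealUid attribute, if any.
    uid=$(awk -F' *= *' '$1 == "RealUid" { print $2; exit }')
    if [ -z "$uid" ]; then
        echo "RealUid not found"
        return 2
    elif [ "$uid" -eq 0 ]; then
        echo "daemons run with root access"
    else
        echo "daemons run as UID $uid, not root"
        return 1
    fi
}

# Usage on a real pool (the machine name is a placeholder):
#   condor_status -master -l machine.example.org | check_real_uid
```

The exit status (0 for root, 1 for non-root, 2 if the attribute is missing) lets the function be used in other scripts.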

NOTE: The Unix program ps is not an effective method of determining if Condor is running with root access. When using ps, it may often appear that the daemons are running as the condor user instead of root. However, note that the ps command shows the current effective owner of the process, not the real owner. (See the getuid(2) and geteuid(2) Unix man pages for details.) In Unix, a process running under the real UID of root may switch its effective UID. (See the seteuid(2) man page.) For security reasons, the daemons only set the effective UID to root when absolutely necessary (to perform a privileged operation).

If the daemons are not running with root access, you need to make every file and directory that your job will touch readable or writable, as needed, by the UID (user id) specified by the RealUid attribute. Often this may mean using the Unix command chmod 777 on the directory where you submit your Condor job.

2.14.4 Job Leases

A job lease specifies how long a given job will attempt to run on a remote resource, even if that resource loses contact with the submitting machine. Similarly, it is the length of time the submitting machine will spend trying to reconnect to the (now disconnected) execution host, before the submitting machine gives up and tries to claim another resource to run the job. The goal is run-only-once semantics, so that the condor_schedd daemon does not allow the same job to run on multiple sites simultaneously.

If the submitting machine is alive, it periodically renews the job lease, and all is well. If the submitting machine is dead, or the network goes down, the job lease will no longer be renewed. Eventually the lease expires. While the lease has not expired, the execute host continues to try to run the job, in the hope that the submit machine will come back to life and reconnect. If the job completes and the lease has not expired, yet the submitting machine is still dead, the condor_starter daemon will wait for a condor_shadow daemon to reconnect before sending final information on the job and its output files. Should the lease expire, the condor_startd daemon kills off the condor_starter daemon and the user job.

The user must set a value for job_lease_duration to keep a job running in the case that the submit side no longer renews the lease. There is a tradeoff in setting the value of job_lease_duration. With too small a value, the job might get killed before the submitting machine has a chance to recover, and forward progress on the job will be lost. With too large a value, the execute resource will be tied up waiting for the job lease to expire. The value should be chosen based on how long the user is willing to tie up the execute machines, how quickly submit machines come back up, and how much work would be lost if the lease expires, the job is killed, and the job must start over from its beginning.

job_lease_duration is only valid for vanilla and java universe jobs. Chirp I/O and streaming I/O (which uses Chirp I/O) may not be used in conjunction with a defined job_lease_duration.
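
As an illustration, the following sketch writes a small submit description file for a vanilla universe job that defines a lease. The file name lease_job.sub, the executable my_job, and the value 1200 (interpreted as seconds) are hypothetical; choose a duration according to the tradeoff described above.

```shell
#!/bin/sh
# Write a hypothetical submit description file with a job lease.
# The executable name and the 1200-second lease are placeholders.
cat > lease_job.sub <<'EOF'
universe           = vanilla
executable         = my_job
job_lease_duration = 1200
queue
EOF

# Submit it on a machine with Condor installed:
#   condor_submit lease_job.sub
```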

A current limitation is that jobs with a defined job_lease_duration will not reconnect if the jobs flock to a remote pool.
