CyberShake Workflow Monitoring

This page outlines how to monitor a single CyberShake workflow. There are three levels of monitoring that can be useful:

  1. Condor queue, on shock
  2. Individual logs for the workflow, on shock
  3. Jobs queued or running, on a remote cluster

A discussion of each is below.

Condor Queue

A good place to start is by looking in the Condor queue. Run the command

condor_q -dag -nobatch

This command is aliased to 'cdag' in the cybershk account. It will show you a list of all the top-level workflows, subworkflows, and individual jobs queued on shock:

$>condor_q -dag -nobatch
-- Schedd: shock.usc.edu : <128.125.230.120:9618?... @ 09/27/18 11:27:37
 ID        OWNER/NODENAME                                 SUBMITTED     RUN_TIME ST PRI  SIZE CMD
133483.0   cybershk                                      9/26 19:55   0+15:32:11 R  0     0.0 pegasus-dagman -p 0 -f -l . -Lockfile CyberS
134699.0    |-subdax_CyberShake_s4440_pre_s4440_preDAX   9/27 08:09   0+03:18:23 R  70    2.9 condor_dagman -p 0 -f -l . -Notification nev
134712.0     |-CheckSgt_CheckSgt_s4440_y                 9/27 08:10   0+00:55:12 R  20    0.0 pegasus-kickstart -n scec::CheckSgt:1.0 -N C
134713.0     |-CheckSgt_CheckSgt_s4440_x                 9/27 08:10   0+00:53:11 R  20    0.0 pegasus-kickstart -n scec::CheckSgt:1.0 -N C
134703.0    |-subdax_CyberShake_s4440_Synth_s4440_dax_0  9/27 08:09   0+03:18:11 R  70    2.9 condor_dagman -p 0 -f -l . -Notification nev
134714.0     |-DirectSynth_DirectSynth                   9/27 08:11   0+00:00:00 I  20    0.0 pegasus-kickstart -n scec::DirectSynth:1.0 -
134624.0   cybershk                                      9/27 06:05   0+05:22:11 R  0     0.0 pegasus-dagman -p 0 -f -l . -Lockfile CyberS
134639.0    |-subdax_AWP_SGT_s3854_AWP_SGT_s3854         9/27 06:17   0+05:10:07 R  50    2.9 condor_dagman -p 0 -f -l . -Notification nev
134717.0     |-Velocity_Params_Velocity_Params_s3854     9/27 08:30   0+00:00:00 I  40    0.0 pegasus-kickstart -n scec::Velocity_Params:1
134751.0     |-AWP_GPU_AWP_GPU_s3854_x                   9/27 10:51   0+00:00:00 I  50    0.0 pegasus-kickstart -n scec::AWP_GPU:1.0 -N AW
134752.0     |-AWP_GPU_AWP_GPU_s3854_y                   9/27 10:51   0+00:00:00 I  50    0.0 pegasus-kickstart -n scec::AWP_GPU:1.0 -N AW
11 jobs; 0 completed, 0 removed, 4 idle, 8 running, 0 held, 0 suspended

The way to interpret this is that there are two top-level workflows running, for two different sites. Those have Condor IDs 133483 and 134624.

The first one is running the pre and Synth subworkflows; the pre workflow has two CheckSgt jobs running (the 'R') and the Synth workflow has one DirectSynth job queued (the 'I', for Idle).

The second one is running the AWP subworkflow. It has a Velocity_Params and two AWP_GPU jobs, all queued.

Signs of problems

Held Jobs

Held jobs ('H') are an indication of a disconnect between Condor and the job state on a remote system. You can quickly see what jobs are held by running the alias 'cdag_held' in the cybershk account. You can learn more about why the job was held by running

condor_q -l <job id>

or 'cql' on cybershk. Often the output you get isn't very informative.
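
To cut through the noise, you can grep the output for the hold reason; HoldReason is a standard Condor job attribute, and the job ID below is just an example:

condor_q -l 134714.0 | grep -i HoldReason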

If only a few jobs are held, try removing them individually using condor_rm <job id>, or the 'crm' alias in the cybershk account, since sometimes individual jobs get held without it being a larger issue.

If lots of jobs are held, you should check a few things:

  • Is the remote system they should be running on down?
  • Have the grid certificates expired?
  • If you're running on Titan, is the rvgahp server down? (You can usually figure this out by running condor_q -l on a held job.)
  • Are disks or quota full, either on shock or the remote system?

If you still can't figure out the problem, try submitting a test GRAM job and then a test Condor job to see if those run successfully.
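
For example, one quick sanity check (assuming the Globus tools are on your path; the gatekeeper host is a placeholder) is to verify the proxy lifetime and run a trivial GRAM job:

grid-proxy-info -timeleft
globus-job-run <gatekeeper host> /bin/date

On the Condor side, a minimal test submit file run from the cybershk account looks something like the sketch below; this is a local vanilla-universe job, so swap in the grid settings for the remote site if you want to exercise the full submission path:

universe   = vanilla
executable = /bin/hostname
output     = test.out
error      = test.err
log        = test.log
queue

Submit it with 'condor_submit test.sub' and watch it with condor_q.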

Workflow disappears

If a workflow disappears from the Condor queue, it has finished for some reason - either because it completed successfully, or because it failed. If you've checked the Run Manager website, or you otherwise know it failed, you should check the individual logs as described below. In particular, check the end of the *.dag.dagman.out file, which will list the job(s) that failed. You can then look at the individual *.out* logs for the failed job to try to figure out why.
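
For example, a quick way to pull the failure messages out of the dagman log (adjust the pattern to taste):

grep -iE 'fail|error' *.dag.dagman.out | tail -50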

Remote System

Another good way to monitor is to log into the remote system and check that what you see in the remote queue matches what you see in the Condor queue.

For example, if you log into Blue Waters and run 'qstat -u <username>', you should be able to see the 6 remote jobs listed above in their appropriate running and idle states. There may be a difference of a few seconds in how long a job has been running.

Note that if you are running on multiple systems, you will have to run condor_q -l and check the pool attribute to figure out what system a job is supposed to run on.
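
For example, to check the pool attribute for a single job (the job ID is illustrative):

condor_q -l 134751.0 | grep -i pool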

Signs of problems

If there is not a one-to-one correspondence between jobs in the Condor queue and jobs in the remote queue, there is a problem. If load on shock gets too high for an extended period, Condor can get out of sync with the remote system, and it may then resubmit jobs which are already on the remote system. To fix this, delete all but the last instance of each job on the remote system, since that's the only one which will register as a completion with Condor.
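
For example, on a PBS-based system like Blue Waters you might list your jobs and then delete the duplicates by job ID (the IDs are illustrative; on a Slurm system use squeue and scancel instead):

qstat -u <username>
qdel 1234567 1234568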

Individual logs

You can also dig into the individual logs for a workflow or a job. This is not usually a good way to monitor, but it can be helpful when trying to debug issues.

All the logs for a workflow are contained within a directory structure: /home/scec-02/cybershk/runs/<site>_<Integrated, SGT, or PP>_dax/dags/cybershk/pegasus. You can also figure out the directory by running condor_q -l <job id> on a job and looking at the entry for UserLog.
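
For example (the job ID is illustrative):

condor_q -l 134714.0 | grep UserLog

The directory containing that log file is the workflow's log directory.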

There are a few different kinds of files which can be helpful to look at:

  • *.dag.dagman.out: File listing workflow status. Periodically it lists how many jobs are running/ready/not ready/held, when jobs are submitted, when they finish, and if there are issues.
  • *.log: File with only individual job submission, execution, and termination events. If jobs are in the Condor queue but not on the remote system, it's good to look here to see whether Condor thinks the job was submitted to the remote resource.
  • *.sub: Submission scripts for each individual job. If a job keeps failing, you can copy part of the 'arguments' string into a PBS or Slurm script on the remote system and try running the job to see if you can figure out why it is failing (see the sketch after this list).
  • *.out.*: The output from the job. If the job succeeds, the output will only be the kickstart record; if the job fails, it will include stdout and stderr from the job, which can be helpful.
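
As a sketch of the *.sub approach above, a minimal Slurm script might look like the following; the job name, resources, and executable path are all placeholders, and on a PBS system you would use the equivalent #PBS directives:

#!/bin/bash
#SBATCH -J test_failing_job
#SBATCH -N 1
#SBATCH -t 01:00:00
#SBATCH -o test_failing_job.%j.out

# Paste the executable and the relevant part of the 'arguments' string
# from the job's *.sub file here:
/path/to/executable <arguments copied from the .sub file>

Submit it with 'sbatch test.slurm' and compare its output to the *.out.* logs on shock.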