Difference between revisions of "CyberShake Workflow Monitoring"
Revision as of 18:48, 27 September 2018
This page outlines the monitoring for a single CyberShake workflow. There are three levels of monitoring which can be useful:
- Condor queue, on shock
- Individual logs for the workflow, on shock
- Jobs queued or running, on a remote cluster
A discussion of each is below.
Condor Queue
A good place to start is by looking in the condor queue. Run the command
condor_q -dag -nobatch
This command is aliased to 'cdag' in the cybershk account. It shows a list of all the top-level workflows, subworkflows, and individual jobs queued on shock:
$>condor_q -dag -nobatch

-- Schedd: shock.usc.edu : <128.125.230.120:9618?... @ 09/27/18 11:27:37
ID OWNER/NODENAME SUBMITTED RUN_TIME ST PRI SIZE CMD
133483.0 cybershk 9/26 19:55 0+15:32:11 R 0 0.0 pegasus-dagman -p 0 -f -l . -Lockfile CyberS
134699.0 |-subdax_CyberShake_s4440_pre_s4440_preDAX 9/27 08:09 0+03:18:23 R 70 2.9 condor_dagman -p 0 -f -l . -Notification nev
134712.0 |-CheckSgt_CheckSgt_s4440_y 9/27 08:10 0+00:55:12 R 20 0.0 pegasus-kickstart -n scec::CheckSgt:1.0 -N C
134713.0 |-CheckSgt_CheckSgt_s4440_x 9/27 08:10 0+00:53:11 R 20 0.0 pegasus-kickstart -n scec::CheckSgt:1.0 -N C
134703.0 |-subdax_CyberShake_s4440_Synth_s4440_dax_0 9/27 08:09 0+03:18:11 R 70 2.9 condor_dagman -p 0 -f -l . -Notification nev
134714.0 |-DirectSynth_DirectSynth 9/27 08:11 0+00:00:00 I 20 0.0 pegasus-kickstart -n scec::DirectSynth:1.0 -
134624.0 cybershk 9/27 06:05 0+05:22:11 R 0 0.0 pegasus-dagman -p 0 -f -l . -Lockfile CyberS
134639.0 |-subdax_AWP_SGT_s3854_AWP_SGT_s3854 9/27 06:17 0+05:10:07 R 50 2.9 condor_dagman -p 0 -f -l . -Notification nev
134717.0 |-Velocity_Params_Velocity_Params_s3854 9/27 08:30 0+00:00:00 I 40 0.0 pegasus-kickstart -n scec::Velocity_Params:1
134751.0 |-AWP_GPU_AWP_GPU_s3854_x 9/27 10:51 0+00:00:00 I 50 0.0 pegasus-kickstart -n scec::AWP_GPU:1.0 -N AW
134752.0 |-AWP_GPU_AWP_GPU_s3854_y 9/27 10:51 0+00:00:00 I 50 0.0 pegasus-kickstart -n scec::AWP_GPU:1.0 -N AW

11 jobs; 0 completed, 0 removed, 4 idle, 8 running, 0 held, 0 suspended
To interpret this output: there are two top-level workflows running, one for each of two different sites (s4440 and s3854). They have Condor IDs 133483 and 134624.
The first one is running the pre and Synth subworkflows; the pre workflow has two CheckSgt jobs running (the 'R') and the Synth workflow has one DirectSynth job queued (the 'I', for Idle).
The second one is running the AWP subworkflow. It has one Velocity_Params job and two AWP_GPU jobs, all queued.
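The reading of the queue described above can also be done mechanically. The following is a minimal sketch in Python, not part of the CyberShake or HTCondor tooling: it assumes the column layout shown in the sample output (ID, node name, submit date and time, run time, state), and the names ROW, parse_rows, and summarize are hypothetical. It groups the rows under each top-level workflow and counts how many entries are in each state.

```python
import re
from collections import Counter

# Matches one job row of `condor_q -dag -nobatch` output, for example:
# "134712.0 |-CheckSgt_CheckSgt_s4440_y 9/27 08:10 0+00:55:12 R 20 0.0 ..."
# Assumption: columns appear in the order shown in the sample on this page.
ROW = re.compile(
    r"^\s*(?P<id>\d+\.\d+)\s+"                   # Condor job ID, e.g. 134712.0
    r"(?P<name>\S+)\s+"                          # owner, or "|-" + node name
    r"(?P<submitted>\d+/\d+\s+\d+:\d+)\s+"       # submit date and time
    r"(?P<run_time>\d+\+\d+:\d+:\d+)\s+"         # accumulated run time
    r"(?P<state>[A-Z<>])\s+"                     # state code: R, I, H, ...
)


def parse_rows(text):
    """Return one dict per job row; header and summary lines are skipped."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            rows.append(match.groupdict())
    return rows


def summarize(text):
    """Map each top-level workflow ID to a Counter of its children's states.

    A row whose name does not start with "|-" is treated as a top-level
    workflow; every following "|-" row is counted under it.
    """
    summary = {}
    current = None
    for row in parse_rows(text):
        if not row["name"].startswith("|-"):
            current = row["id"]
            summary[current] = Counter()
        elif current is not None:
            summary[current][row["state"]] += 1
    return summary


# Rows copied from the sample output on this page.
SAMPLE = """\
ID OWNER/NODENAME SUBMITTED RUN_TIME ST PRI SIZE CMD
133483.0 cybershk 9/26 19:55 0+15:32:11 R 0 0.0 pegasus-dagman -p 0 -f -l .
134699.0 |-subdax_CyberShake_s4440_pre_s4440_preDAX 9/27 08:09 0+03:18:23 R 70 2.9 condor_dagman
134712.0 |-CheckSgt_CheckSgt_s4440_y 9/27 08:10 0+00:55:12 R 20 0.0 pegasus-kickstart
134713.0 |-CheckSgt_CheckSgt_s4440_x 9/27 08:10 0+00:53:11 R 20 0.0 pegasus-kickstart
134703.0 |-subdax_CyberShake_s4440_Synth_s4440_dax_0 9/27 08:09 0+03:18:11 R 70 2.9 condor_dagman
134714.0 |-DirectSynth_DirectSynth 9/27 08:11 0+00:00:00 I 20 0.0 pegasus-kickstart
134624.0 cybershk 9/27 06:05 0+05:22:11 R 0 0.0 pegasus-dagman -p 0 -f -l .
134639.0 |-subdax_AWP_SGT_s3854_AWP_SGT_s3854 9/27 06:17 0+05:10:07 R 50 2.9 condor_dagman
134717.0 |-Velocity_Params_Velocity_Params_s3854 9/27 08:30 0+00:00:00 I 40 0.0 pegasus-kickstart
134751.0 |-AWP_GPU_AWP_GPU_s3854_x 9/27 10:51 0+00:00:00 I 50 0.0 pegasus-kickstart
134752.0 |-AWP_GPU_AWP_GPU_s3854_y 9/27 10:51 0+00:00:00 I 50 0.0 pegasus-kickstart
"""

if __name__ == "__main__":
    for workflow_id, counts in summarize(SAMPLE).items():
        print(workflow_id, dict(counts))
```

The counts include the subdax entries themselves as well as the leaf jobs, since both appear as "|-" rows. To use it on live data, the text passed to summarize would be the captured output of the condor_q command above.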
Signs of problems
Held jobs ('H') indicate a disconnect between Condor and the job state on a remote system. You can quickly see how many jobs are held from the summary line at the end of the condor_q output. You can learn more about why a job was held by running
condor_q <job id>
Often the output isn't very informative. If only a few jobs are held, try removing them using condor_rm <job id>, or the 'crm' alias in the cybershk account. If many jobs are held,