CyberShake Workflow Monitoring
From SCECpedia
This page outlines the monitoring for a single CyberShake workflow. There are three levels of monitoring which can be useful:
- Condor queue, on shock
- Individual logs for the workflow, on shock
- Jobs queued or running, on a remote cluster
A discussion of each is below.
Condor Queue
A good place to start is by looking in the condor queue. Run the command
condor_q -dag -nobatch
This command is aliased to 'cdag' in the cybershk account. The output lists all the top-level workflows, subworkflows, and individual jobs queued on shock:
$>condor_q -dag -nobatch
-- Schedd: shock.usc.edu : <128.125.230.120:9618?... @ 09/27/18 11:27:37
 ID OWNER/NODENAME SUBMITTED RUN_TIME ST PRI SIZE CMD
133483.0 cybershk 9/26 19:55 0+15:32:11 R 0 0.0 pegasus-dagman -p 0 -f -l . -Lockfile CyberS
134699.0  |-subdax_CyberShake_s4440_pre_s4440_preDAX 9/27 08:09 0+03:18:23 R 70 2.9 condor_dagman -p 0 -f -l . -Notification nev
134712.0   |-CheckSgt_CheckSgt_s4440_y 9/27 08:10 0+00:55:12 R 20 0.0 pegasus-kickstart -n scec::CheckSgt:1.0 -N C
134713.0   |-CheckSgt_CheckSgt_s4440_x 9/27 08:10 0+00:53:11 R 20 0.0 pegasus-kickstart -n scec::CheckSgt:1.0 -N C
134703.0  |-subdax_CyberShake_s4440_Synth_s4440_dax_0 9/27 08:09 0+03:18:11 R 70 2.9 condor_dagman -p 0 -f -l . -Notification nev
134714.0   |-DirectSynth_DirectSynth 9/27 08:11 0+00:00:00 I 20 0.0 pegasus-kickstart -n scec::DirectSynth:1.0 -
134624.0 cybershk 9/27 06:05 0+05:22:11 R 0 0.0 pegasus-dagman -p 0 -f -l . -Lockfile CyberS
134639.0  |-subdax_AWP_SGT_s3854_AWP_SGT_s3854 9/27 06:17 0+05:10:07 R 50 2.9 condor_dagman -p 0 -f -l . -Notification nev
134717.0   |-Velocity_Params_Velocity_Params_s3854 9/27 08:30 0+00:00:00 I 40 0.0 pegasus-kickstart -n scec::Velocity_Params:1
134751.0   |-AWP_GPU_AWP_GPU_s3854_x 9/27 10:51 0+00:00:00 I 50 0.0 pegasus-kickstart -n scec::AWP_GPU:1.0 -N AW
134752.0   |-AWP_GPU_AWP_GPU_s3854_y 9/27 10:51 0+00:00:00 I 50 0.0 pegasus-kickstart -n scec::AWP_GPU:1.0 -N AW
The way to interpret this is that there are two top-level workflows running, for two different sites. Those have Condor IDs 133483 and 134624. The first one is running the pre and Synth subworkflows; the pre workflow has two CheckSgt jobs running (the 'R' in the ST column) and the Synth workflow has one DirectSynth job queued (the 'I', for Idle). The second one is running the AWP_SGT subworkflow, which has one Velocity_Params job and two AWP_GPU jobs, all idle.
Signs of problems
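When many workflows are queued, reading the condor_q listing from the Condor Queue section row by row gets tedious. The following is a minimal Python sketch, not part of the CyberShake tooling, that tallies job states per top-level workflow. It assumes the column layout shown in that listing (ID, OWNER/NODENAME, SUBMITTED, RUN_TIME, ST, PRI, SIZE, CMD), that child rows are marked with '|-', and that subworkflow rows can be recognized by a command starting with condor_dagman. The names ROW, SAMPLE, and summarize are hypothetical, and SAMPLE is copied from the listing in that section.

```python
import re
from collections import Counter

# One row of `condor_q -dag -nobatch`, assuming the column layout from the
# listing: ID, OWNER/NODENAME, SUBMITTED (date and time), RUN_TIME, ST,
# PRI, SIZE, CMD.
ROW = re.compile(
    r"^\s*(?P<id>\d+\.\d+)\s+(?P<name>\S+)\s+\d+/\d+\s+\d+:\d+\s+"
    r"\d+\+\d+:\d+:\d+\s+(?P<st>\S)\s+\S+\s+\S+\s+(?P<cmd>.*)$"
)

# Rows copied from the example listing on this page.
SAMPLE = """\
-- Schedd: shock.usc.edu : <128.125.230.120:9618?... @ 09/27/18 11:27:37
 ID OWNER/NODENAME SUBMITTED RUN_TIME ST PRI SIZE CMD
133483.0 cybershk 9/26 19:55 0+15:32:11 R 0 0.0 pegasus-dagman -p 0 -f -l . -Lockfile CyberS
134699.0  |-subdax_CyberShake_s4440_pre_s4440_preDAX 9/27 08:09 0+03:18:23 R 70 2.9 condor_dagman -p 0 -f -l . -Notification nev
134712.0   |-CheckSgt_CheckSgt_s4440_y 9/27 08:10 0+00:55:12 R 20 0.0 pegasus-kickstart -n scec::CheckSgt:1.0 -N C
134713.0   |-CheckSgt_CheckSgt_s4440_x 9/27 08:10 0+00:53:11 R 20 0.0 pegasus-kickstart -n scec::CheckSgt:1.0 -N C
134703.0  |-subdax_CyberShake_s4440_Synth_s4440_dax_0 9/27 08:09 0+03:18:11 R 70 2.9 condor_dagman -p 0 -f -l . -Notification nev
134714.0   |-DirectSynth_DirectSynth 9/27 08:11 0+00:00:00 I 20 0.0 pegasus-kickstart -n scec::DirectSynth:1.0 -
134624.0 cybershk 9/27 06:05 0+05:22:11 R 0 0.0 pegasus-dagman -p 0 -f -l . -Lockfile CyberS
134639.0  |-subdax_AWP_SGT_s3854_AWP_SGT_s3854 9/27 06:17 0+05:10:07 R 50 2.9 condor_dagman -p 0 -f -l . -Notification nev
134717.0   |-Velocity_Params_Velocity_Params_s3854 9/27 08:30 0+00:00:00 I 40 0.0 pegasus-kickstart -n scec::Velocity_Params:1
134751.0   |-AWP_GPU_AWP_GPU_s3854_x 9/27 10:51 0+00:00:00 I 50 0.0 pegasus-kickstart -n scec::AWP_GPU:1.0 -N AW
134752.0   |-AWP_GPU_AWP_GPU_s3854_y 9/27 10:51 0+00:00:00 I 50 0.0 pegasus-kickstart -n scec::AWP_GPU:1.0 -N AW
"""


def summarize(listing):
    """Return {top-level Condor ID: Counter of job states} for a listing."""
    summary = {}
    current = None  # ID of the top-level workflow the current rows belong to
    for line in listing.splitlines():
        m = ROW.match(line)
        if m is None:
            continue  # Schedd banner, column header, or blank line
        if not m["name"].startswith("|-"):
            # A row without the '|-' marker starts a new top-level workflow.
            current = m["id"]
            summary[current] = Counter()
        elif current is not None and not m["cmd"].startswith("condor_dagman"):
            # Count individual jobs only; subworkflow managers are skipped.
            summary[current][m["st"]] += 1
    return summary


for workflow_id, states in summarize(SAMPLE).items():
    print(workflow_id, dict(states))
```

To run this against the live queue, the SAMPLE text could be replaced with the captured output of condor_q -dag -nobatch, for example via Python's subprocess module. Because the parser depends on the exact column layout, rows in a different format are silently skipped rather than counted.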