CyberShake Workflow Monitoring

This page outlines the monitoring for a single CyberShake workflow. There are three levels of monitoring which can be useful:

  1. Condor queue, on shock
  2. Individual logs for the workflow, on shock
  3. Jobs queued or running, on a remote cluster

A discussion of each is below.

Condor Queue

A good place to start is by looking at the Condor queue. Run the command

condor_q -dag -nobatch

This command is aliased to 'cdag' in the cybershk account. It will show you a list of all the top-level workflows, subworkflows, and individual jobs queued on shock.
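
The alias itself isn't shown here, but it presumably boils down to something like the following in the cybershk account's shell startup files (an assumption; check the account configuration for the exact definition):

# Assumed definition of the 'cdag' shortcut; verify in the cybershk account's shell startup files
alias cdag='condor_q -dag -nobatch'

A sample listing: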

$>condor_q -dag -nobatch
-- Schedd: shock.usc.edu : <128.125.230.120:9618?... @ 09/27/18 11:27:37
 ID        OWNER/NODENAME                                 SUBMITTED     RUN_TIME ST PRI  SIZE CMD
133483.0   cybershk                                      9/26 19:55   0+15:32:11 R  0     0.0 pegasus-dagman -p 0 -f -l . -Lockfile CyberS
134699.0    |-subdax_CyberShake_s4440_pre_s4440_preDAX   9/27 08:09   0+03:18:23 R  70    2.9 condor_dagman -p 0 -f -l . -Notification nev
134712.0     |-CheckSgt_CheckSgt_s4440_y                 9/27 08:10   0+00:55:12 R  20    0.0 pegasus-kickstart -n scec::CheckSgt:1.0 -N C
134713.0     |-CheckSgt_CheckSgt_s4440_x                 9/27 08:10   0+00:53:11 R  20    0.0 pegasus-kickstart -n scec::CheckSgt:1.0 -N C
134703.0    |-subdax_CyberShake_s4440_Synth_s4440_dax_0  9/27 08:09   0+03:18:11 R  70    2.9 condor_dagman -p 0 -f -l . -Notification nev
134714.0     |-DirectSynth_DirectSynth                   9/27 08:11   0+00:00:00 I  20    0.0 pegasus-kickstart -n scec::DirectSynth:1.0 -
134624.0   cybershk                                      9/27 06:05   0+05:22:11 R  0     0.0 pegasus-dagman -p 0 -f -l . -Lockfile CyberS
134639.0    |-subdax_AWP_SGT_s3854_AWP_SGT_s3854         9/27 06:17   0+05:10:07 R  50    2.9 condor_dagman -p 0 -f -l . -Notification nev
134717.0     |-Velocity_Params_Velocity_Params_s3854     9/27 08:30   0+00:00:00 I  40    0.0 pegasus-kickstart -n scec::Velocity_Params:1
134751.0     |-AWP_GPU_AWP_GPU_s3854_x                   9/27 10:51   0+00:00:00 I  50    0.0 pegasus-kickstart -n scec::AWP_GPU:1.0 -N AW
134752.0     |-AWP_GPU_AWP_GPU_s3854_y                   9/27 10:51   0+00:00:00 I  50    0.0 pegasus-kickstart -n scec::AWP_GPU:1.0 -N AW

The way to interpret this is that there are two top-level workflows running, for two different sites; they have Condor IDs 133483 and 134624. The first one is running the pre and Synth subworkflows: the pre workflow has two CheckSgt jobs running (status 'R') and the Synth workflow has one DirectSynth job queued but idle (status 'I', for Idle).
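
If the queue gets long, it can help to narrow the listing to the jobs managed by a single subworkflow. This is plain HTCondor usage rather than anything CyberShake-specific, and the ID below is just the pre subworkflow from the example above, so treat it as a sketch:

# Show only the node jobs whose parent DAGMan has cluster ID 134699 (the pre subworkflow in the listing above)
condor_q -nobatch -constraint 'DAGManJobId == 134699'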

Signs of problems