CyberShake Workflow Monitoring

From SCECpedia
 

Revision as of 18:34, 27 September 2018

This page outlines how to monitor a single CyberShake workflow. Three levels of monitoring can be useful:

  1. Condor queue, on shock
  2. Individual logs for the workflow, on shock
  3. Jobs queued or running, on a remote cluster

A discussion of each is below.

Condor Queue

A good place to start is the Condor queue. Run the command:

condor_q -dag -nobatch

This command is aliased to 'cdag' in the cybershk account. It lists all the top-level workflows, subworkflows, and individual jobs queued on shock:

$>condor_q -dag -nobatch
-- Schedd: shock.usc.edu : <128.125.230.120:9618?... @ 09/27/18 11:27:37
 ID        OWNER/NODENAME                                 SUBMITTED     RUN_TIME ST PRI  SIZE CMD
133483.0   cybershk                                      9/26 19:55   0+15:32:11 R  0     0.0 pegasus-dagman -p 0 -f -l . -Lockfile CyberS
134699.0    |-subdax_CyberShake_s4440_pre_s4440_preDAX   9/27 08:09   0+03:18:23 R  70    2.9 condor_dagman -p 0 -f -l . -Notification nev
134712.0     |-CheckSgt_CheckSgt_s4440_y                 9/27 08:10   0+00:55:12 R  20    0.0 pegasus-kickstart -n scec::CheckSgt:1.0 -N C
134713.0     |-CheckSgt_CheckSgt_s4440_x                 9/27 08:10   0+00:53:11 R  20    0.0 pegasus-kickstart -n scec::CheckSgt:1.0 -N C
134703.0    |-subdax_CyberShake_s4440_Synth_s4440_dax_0  9/27 08:09   0+03:18:11 R  70    2.9 condor_dagman -p 0 -f -l . -Notification nev
134714.0     |-DirectSynth_DirectSynth                   9/27 08:11   0+00:00:00 I  20    0.0 pegasus-kickstart -n scec::DirectSynth:1.0 -
134624.0   cybershk                                      9/27 06:05   0+05:22:11 R  0     0.0 pegasus-dagman -p 0 -f -l . -Lockfile CyberS
134639.0    |-subdax_AWP_SGT_s3854_AWP_SGT_s3854         9/27 06:17   0+05:10:07 R  50    2.9 condor_dagman -p 0 -f -l . -Notification nev
134717.0     |-Velocity_Params_Velocity_Params_s3854     9/27 08:30   0+00:00:00 I  40    0.0 pegasus-kickstart -n scec::Velocity_Params:1
134751.0     |-AWP_GPU_AWP_GPU_s3854_x                   9/27 10:51   0+00:00:00 I  50    0.0 pegasus-kickstart -n scec::AWP_GPU:1.0 -N AW
134752.0     |-AWP_GPU_AWP_GPU_s3854_y                   9/27 10:51   0+00:00:00 I  50    0.0 pegasus-kickstart -n scec::AWP_GPU:1.0 -N AW

This output shows two top-level workflows running, for 2 different sites. Their Condor IDs are 133483 and 134624; each top-level row is followed by the subworkflows and jobs that belong to it, marked with '|-'.

The first one is running the pre and Synth subworkflows. The pre workflow has two CheckSgt jobs running (state 'R' in the ST column, for Running), and the Synth workflow has one DirectSynth job queued (state 'I', for Idle).

The second one is running the AWP subworkflow. It has the Velocity_Params
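When the queue is long, the same reading can be done by a small script. The following is a minimal sketch, not part of the CyberShake tooling: it assumes the column layout shown in the listing above, and it assumes that subworkflow rows can be recognized by the '|-subdax_' prefix seen there. The names ROW and summarize are hypothetical.

```python
import re
from collections import Counter

# Matches one row of `condor_q -dag -nobatch` output, assuming the layout
# shown above: ID, OWNER/NODENAME, submit date and time, run time, ST.
ROW = re.compile(
    r"^\s*(?P<id>\d+\.\d+)\s+"   # Condor job ID (cluster.proc)
    r"(?P<name>\S+)\s+"          # owner, or '|-' followed by the node name
    r"\d+/\d+\s+\d+:\d+\s+"      # SUBMITTED date and time
    r"\d+\+\d+:\d+:\d+\s+"       # RUN_TIME
    r"(?P<state>[A-Z<>])\s"      # ST column, e.g. R or I
)

def summarize(text):
    """Return (top-level workflow IDs, Counter of job states).

    Rows whose name starts with '|-subdax_' are treated as subworkflows
    and skipped; this prefix is an assumption taken from the listing.
    """
    top_level = []
    job_states = Counter()
    for line in text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # header lines, blank lines, anything unrecognized
        name = m["name"]
        if not name.startswith("|-"):
            top_level.append(m["id"])      # top-level workflow row
        elif not name.startswith("|-subdax_"):
            job_states[m["state"]] += 1    # individual job row
    return top_level, job_states
```

Passing the captured output of the command to summarize() returns the IDs of the top-level workflows and a count of individual jobs per state, so a run with many idle jobs and few running ones stands out at a glance.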

Signs of problems