Workflow Program Requirements

From SCECpedia

Latest revision as of 21:51, 14 August 2018

We recommend the following programming standards for CME scientific software. These standards help make the codes interoperate. For programs to be hosted as components in workflows, we ask scientists to prepare their code this way.

Characteristics of a Workflow System

Most workflow systems will provide the following capabilities.

  1. Format for expressing program execution order and data dependencies.
  2. Job scheduler that automates the orderly execution of the programs.
  3. File and metadata management to track files in the system.
  4. Support for distributed processing across multiple computing resources.
  5. Dependencies (e.g., HTCondor, Airflow)
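
As a rough illustration of the first two capabilities, the sketch below derives an execution order from the files each job reads and writes. The job names and file names are hypothetical, and the sketch uses Python's standard `graphlib` module (Python 3.9 or later) rather than the format of any particular workflow system.

```python
from graphlib import TopologicalSorter

# Hypothetical jobs: each declares the files it reads and the files it writes.
JOBS = {
    "make_mesh": {"inputs": ["model.cfg"], "outputs": ["mesh.bin"]},
    "run_sim": {"inputs": ["mesh.bin", "source.txt"], "outputs": ["seis.bbp"]},
    "plot": {"inputs": ["seis.bbp"], "outputs": ["seis.png"]},
}


def execution_order(jobs):
    """Return job names in an order that respects data dependencies."""
    # Map each output file to the job that produces it.
    producer = {}
    for name, job in jobs.items():
        for path in job["outputs"]:
            producer[path] = name
    # A job depends on whichever job produces each of its input files.
    # Inputs with no producer are assumed to exist before the workflow starts.
    graph = {
        name: {producer[p] for p in job["inputs"] if p in producer}
        for name, job in jobs.items()
    }
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    print(execution_order(JOBS))
```

A real scheduler would also dispatch independent jobs in parallel and track the resulting files; this sketch only shows how the ordering follows from the declared inputs and outputs.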

General Program Guidelines for Programs used in a Workflow

  1. The code should return an exit code when it runs. It should return only two values: 0 for a successful return and 1 for an error exit.
  2. For each code, the number of input and output files must be known and should always be the same whenever the program is run.
  3. There should be no compiled-in references to programs, path names, or file names in the code. It is okay to set default values for file names, but users of the code should be able to override any default file name using command line parameters.
  4. If the code references any other executables, the program should accept a command line parameter that specifies the directory where the executable can be found.
  5. The code should accept command line parameters for all input file names and should not assume that an input file it uses has a particular name.
  6. The code should accept command line parameters for all output files so that we can assign the output file name and output directory.
  7. The code should read variable inputs either from the command line or from a configuration file. A code should have its own config file, or its own section in a config file, that contains parameters related only to itself and no other programs.
  8. If the code takes parameters on the command line, it is better to use attribute name, attribute value pairs (e.g. -o output_file_name -i input_file_name) rather than positional values (e.g. first command line parameter is the input file name, second command line parameter is the output file name).
  9. If the code requires any environment variables, it should verify when it starts that they can be read successfully, and exit with an error if they cannot.
  10. When SCEC runs the program, we will probably submit it to a job scheduler, using a job scheduler scripting language like PBS. Demonstrating that the software can be run through a PBS interface makes it much easier for us to use the program in a workflow.
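
The guidelines above can be sketched as a small program skeleton. This is a minimal illustration, not a SCEC-provided template: the option names, the `SCRATCH_DIR` environment variable, the `my_code` config section, and the `scale` parameter are all hypothetical. The "science" step simply multiplies each input value by a scale factor.

```python
import argparse
import configparser
import os
import sys

# Hypothetical: environment variables this program needs (guideline 9).
REQUIRED_ENV = ["SCRATCH_DIR"]


def parse_args(argv):
    # Named attribute/value options rather than positional ones (guideline 8).
    parser = argparse.ArgumentParser(description="Workflow-friendly program sketch")
    parser.add_argument("-i", "--input", required=True, help="input file name")
    parser.add_argument("-o", "--output", required=True, help="output file name")
    parser.add_argument("-c", "--config", help="optional configuration file")
    parser.add_argument("--bin-dir", help="directory holding helper executables")
    return parser.parse_args(argv)


def main(argv=None, environ=None):
    environ = os.environ if environ is None else environ
    try:
        args = parse_args(argv)
    except SystemExit as exc:
        # argparse exits with 2 on bad usage; map every failure to 1.
        return 0 if exc.code == 0 else 1

    missing = [name for name in REQUIRED_ENV if not environ.get(name)]
    if missing:
        print("missing environment variables: " + ", ".join(missing), file=sys.stderr)
        return 1

    if args.bin_dir and not os.path.isdir(args.bin_dir):
        print("executable directory not found: " + args.bin_dir, file=sys.stderr)
        return 1

    # Read parameters from this program's own config section (guideline 7).
    scale = 1.0
    if args.config:
        config = configparser.ConfigParser()
        if not config.read(args.config):
            print("cannot read config file: " + args.config, file=sys.stderr)
            return 1
        scale = config.getfloat("my_code", "scale", fallback=1.0)

    try:
        with open(args.input) as src:
            values = [float(line) for line in src if line.strip()]
        with open(args.output, "w") as dst:
            for value in values:
                dst.write(f"{value * scale}\n")
    except (OSError, ValueError) as err:
        print(f"error: {err}", file=sys.stderr)
        return 1
    return 0  # exactly one input file, one output file, exit code 0 or 1


if __name__ == "__main__":
    sys.exit(main())
```

Under these assumptions the program could be invoked as `python my_code.py -i in.txt -o out.txt -c my_code.cfg`, which is the kind of command line a job scheduler script would contain.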

Broadband Platform Specific Recommendations

  1. Programs should input and output seismograms in broadband platform format, which contains 3 components in a single file.
  2. Programs should work with seismograms of arbitrary length (remove any restriction that the number of samples be a power of 2).

Grid-based Program Guidelines

The Pegasus Programming community has defined a series of guidelines for writing software that will run in a grid environment. Their guidelines are posted on the Pegasus web site. This link will open a PDF file.

Pegasus-developed Grid Program Guidelines