HPC Troubleshooting
This page outlines common issues encountered when scheduling jobs on High Performance Computing systems (e.g. Expanse, Discovery, Frontera). Issues are generalized where possible so that they apply broadly across systems; they include job configuration, queue selection, and system access. Wherever possible, example problems and known solutions are also listed here.
Out of Memory / Creeping Memory
An Out of Memory (OOM) error occurs when a job uses more memory than was allocated to it. Because of application and runtime overhead, your application should be configured to use less than 100% of the memory allocated to the job, leaving a buffer between the two.
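To confirm that a job exceeded its allocation, SLURM's accounting tools can compare requested and peak memory. The commands below are a minimal sketch: the job ID is a placeholder, and seff is an optional SLURM contrib tool that may not be installed on every system.

 # Requested vs. peak memory for a completed job (job ID is a placeholder)
 sacct -j 1234567 --format=JobID,State,ReqMem,MaxRSS,Elapsed

 # Per-job memory/CPU efficiency summary, if the seff contrib tool is installed
 seff 1234567

A job killed for exceeding its memory allocation may show a state such as OUT_OF_MEMORY, or a MaxRSS close to ReqMem, in this output.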
Example Problem
In the following OOM example, the UCERF3-ETAS Quakeworx application is run on Expanse on a single shared node with a 50 GB allocation, 95% utilization (i.e. roughly 47 GB usable by the application), 400 catalogs, and 8 cores per node.
File:UCERF3 ETAS Tutorial 12 ospjob.log.txt
 Memory after loop:
   in use memory:      25,748,489
   free memory:        23,534,582
   allocated memory:   49,283,072
   max memory:         49,283,072
   total free memory:  23,534,582
 Memory at end of simultation
   in use memory:      25,760,779
   free memory:        23,522,292
   allocated memory:   49,283,072
   max memory:         49,283,072
   total free memory:  23,522,292
 [23:55:23.256 (pool-1-thread-1)]: completed 341 (47 ruptures)
 [23:55:23.283 (pool-1-thread-1)]: completed binary output 341
 [23:55:23.283 (main)]: processing binary filter for 341
 [23:55:23.285 (pool-1-thread-1)]: calculating 349
 [23:55:23.285 (pool-1-thread-1)]: Instantiating ERF
 /expanse/lustre/projects/usc143/qwxdev/apps/expanse/rocky8.8/ucerf3-etas/069e27e/ucerf3-etas-launcher/sbin/u3etas_jar_wrapper.sh: line 33: 2307392 Killed java -Djava.awt.headless=true $MEM_ARG -cp $DIR/opensha/opensha-all.jar $@
 Thu Jan 16 23:55:24 PST 2025
The "max memory" and "allocated memory" are matched at 47GB (49,283,072 / 1024^2 = 47). Java is reserving that amount of memory from the OS, even though a lot of it is unused within the JVM, that's what the OS sees java using.
Solution
Increase the buffer between what Java is allowed to use and what you request from SLURM. In this case we increased the allocation from 50 GB to 55 GB and decreased the utilization from 95% (50 × 0.95 = 47.5 GB) to 90% (55 × 0.90 = 49.5 GB).
By increasing the memory requested and decreasing the utilization rate, we enlarged the buffer while maintaining (in fact slightly increasing) the memory available to the application.
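The adjusted job request might look like the following SLURM batch fragment. This is a minimal sketch: the partition, core count, and wall time reflect the example above, the account and launch command are omitted, and the utilization percentage is an application-level setting in the Quakeworx/UCERF3-ETAS configuration rather than a SLURM option.

 #!/bin/bash
 #SBATCH --partition=shared     # single shared node on Expanse
 #SBATCH --nodes=1
 #SBATCH --ntasks-per-node=8    # 8 cores per node
 #SBATCH --mem=55G              # raised from 50G to widen the buffer
 #SBATCH --time=24:00:00        # placeholder wall time

 # Application-level setting (not a SLURM option): cap the JVM at 90% of the
 # job's memory, i.e. roughly -Xmx49G, leaving a buffer for JVM/system overhead.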
As the simulation continues, memory consumption gradually creeps toward the "max memory" value. This is normal: the JVM will not exceed its configured maximum, so as long as the buffer between that maximum and the SLURM allocation is large enough to cover overhead, the job should not be killed.
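To watch the memory consumption of a running job, SLURM's sstat can report its current peak resident memory. The command below is a sketch; the job ID is a placeholder, and on most systems you query a specific step such as the batch step.

 # Peak resident memory of a running job's batch step (job ID is a placeholder)
 sstat -j 1234567.batch --format=JobID,MaxRSS,MaxVMSize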
See the successful job execution with creeping memory consumption in the following log file.