HPC Troubleshooting
This page outlines common issues encountered when scheduling jobs on High Performance Computing systems (e.g. Expanse, Discovery, Frontera). Issues are generalized where possible so that they apply broadly across systems; they include job configuration, queue selection, and system access. Wherever possible, example problems and known solutions are also listed here.
Out of Memory / Creeping Memory
An Out of Memory (OOM) error occurs when a job uses more memory than was allocated to it. Because of application and runtime overhead, your application should be configured to use less than 100% of the memory allocated to the job, leaving a buffer between the two.
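To confirm that a job exceeded its allocation, SLURM's accounting tools can compare requested and peak memory. The commands below are a minimal sketch: the job ID is a placeholder, and seff is an optional SLURM contrib tool that may not be installed on every system.

 # Requested vs. peak memory for a completed job (job ID is a placeholder)
 sacct -j 1234567 --format=JobID,State,ReqMem,MaxRSS,Elapsed

 # Per-job memory/CPU efficiency summary, if the seff contrib tool is installed
 seff 1234567

A job killed for exceeding its memory allocation may show a state such as OUT_OF_MEMORY, or a MaxRSS close to ReqMem, in this output.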
Example Problem
In the following OOM example, the UCERF3-ETAS Quakeworx application is run on Expanse on a single shared node with a 50 GB allocation, 95% utilization (i.e. roughly 47 GB usable by the application), 400 catalogs, and 8 cores per node.
File:UCERF3 ETAS Tutorial 12 ospjob.log.txt
 Memory after loop:
   in use memory:      25,748,489
   free memory:        23,534,582
   allocated memory:   49,283,072
   max memory:         49,283,072
   total free memory:  23,534,582
 Memory at end of simultation
   in use memory:      25,760,779
   free memory:        23,522,292
   allocated memory:   49,283,072
   max memory:         49,283,072
   total free memory:  23,522,292
 [23:55:23.256 (pool-1-thread-1)]: completed 341 (47 ruptures)
 [23:55:23.283 (pool-1-thread-1)]: completed binary output 341
 [23:55:23.283 (main)]: processing binary filter for 341
 [23:55:23.285 (pool-1-thread-1)]: calculating 349
 [23:55:23.285 (pool-1-thread-1)]: Instantiating ERF
 /expanse/lustre/projects/usc143/qwxdev/apps/expanse/rocky8.8/ucerf3-etas/069e27e/ucerf3-etas-launcher/sbin/u3etas_jar_wrapper.sh: line 33: 2307392 Killed java -Djava.awt.headless=true $MEM_ARG -cp $DIR/opensha/opensha-all.jar $@
 Thu Jan 16 23:55:24 PST 2025
The "max memory" and "allocated memory" are matched at 47GB (49,283,072 / 1024^2 = 47). Java is reserving that amount of memory from the OS, even though a lot of it is unused within the JVM, that's what the OS sees java using.
Solution
Increase the buffer between what Java is allowed to use and what you request from SLURM. In this case we increased the allocation from 50 GB to 55 GB and decreased the utilization from 95% (50 × 0.95 = 47.5 GB) to 90% (55 × 0.90 = 49.5 GB).
By increasing the memory requested and decreasing the utilization rate, we enlarged the buffer while maintaining (in fact slightly increasing) the memory available to the application.
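The adjusted job request might look like the following SLURM batch fragment. This is a minimal sketch: the partition, core count, and wall time reflect the example above, the account and launch command are omitted, and the utilization percentage is an application-level setting in the Quakeworx/UCERF3-ETAS configuration rather than a SLURM option.

 #!/bin/bash
 #SBATCH --partition=shared     # single shared node on Expanse
 #SBATCH --nodes=1
 #SBATCH --ntasks-per-node=8    # 8 cores per node
 #SBATCH --mem=55G              # raised from 50G to widen the buffer
 #SBATCH --time=24:00:00        # placeholder wall time

 # Application-level setting (not a SLURM option): cap the JVM at 90% of the
 # job's memory, i.e. roughly -Xmx49G, leaving a buffer for JVM/system overhead.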
As the simulation continues, memory consumption gradually creeps toward the "max memory" value. This is normal: the JVM will not exceed its configured maximum, so as long as the buffer between that maximum and the SLURM allocation is large enough to cover overhead, the job should not be killed.
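To watch the memory consumption of a running job, SLURM's sstat can report its current peak resident memory. The command below is a sketch; the job ID is a placeholder, and on most systems you query a specific step such as the batch step.

 # Peak resident memory of a running job's batch step (job ID is a placeholder)
 sstat -j 1234567.batch --format=JobID,MaxRSS,MaxVMSize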
See the successful job execution with creeping memory consumption in the following log file.