

<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://strike.scec.org/scecwiki/index.php?action=history&amp;feed=atom&amp;title=HPC_Troubleshooting</id>
	<title>HPC Troubleshooting - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://strike.scec.org/scecwiki/index.php?action=history&amp;feed=atom&amp;title=HPC_Troubleshooting"/>
	<link rel="alternate" type="text/html" href="https://strike.scec.org/scecwiki/index.php?title=HPC_Troubleshooting&amp;action=history"/>
	<updated>2026-05-05T06:41:59Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.34.2</generator>
	<entry>
		<id>https://strike.scec.org/scecwiki/index.php?title=HPC_Troubleshooting&amp;diff=30132&amp;oldid=prev</id>
		<title>Bhatthal: Create HPC Troubleshooting page with OOM exception entry.</title>
		<link rel="alternate" type="text/html" href="https://strike.scec.org/scecwiki/index.php?title=HPC_Troubleshooting&amp;diff=30132&amp;oldid=prev"/>
		<updated>2025-01-27T17:53:09Z</updated>

		<summary type="html">&lt;p&gt;Create HPC Troubleshooting page with OOM exception entry.&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;This page outlines common issues encountered when scheduling jobs on High Performance Computing (HPC) systems (e.g. Expanse, Discovery, Frontera). Issues are generalized where possible so that they apply across systems. They include job configuration, queue selection, and system access. Wherever possible, example problems and known solutions are also listed here.&lt;br /&gt;
&lt;br /&gt;
== Out of Memory / Creeping Memory ==&lt;br /&gt;
An Out of Memory (OOM) exception occurs when a job uses more memory than was&lt;br /&gt;
allocated for it.  Because the process carries overhead beyond the memory the&lt;br /&gt;
application itself uses, do not configure your application to use 100% of the&lt;br /&gt;
memory allocated for the job.&lt;br /&gt;
&lt;br /&gt;
=== Example Problem ===&lt;br /&gt;
In the following OOM example, the UCERF3-ETAS Quakeworx application is run on&lt;br /&gt;
Expanse on a single shared node with a 50GB allocation, 95% utilization (i.e.&lt;br /&gt;
47GB), 400 catalogs, and 8 cores per node.&lt;br /&gt;
&lt;br /&gt;
[[File:UCERF3 ETAS Tutorial 12 ospjob.log.txt]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Memory after loop:&lt;br /&gt;
      in use memory: 25,748,489&lt;br /&gt;
      free memory: 23,534,582&lt;br /&gt;
      allocated memory: 49,283,072&lt;br /&gt;
      max memory: 49,283,072&lt;br /&gt;
      total free memory: 23,534,582&lt;br /&gt;
Memory at end of simultation&lt;br /&gt;
      in use memory: 25,760,779&lt;br /&gt;
      free memory: 23,522,292&lt;br /&gt;
      allocated memory: 49,283,072&lt;br /&gt;
      max memory: 49,283,072&lt;br /&gt;
      total free memory: 23,522,292&lt;br /&gt;
[23:55:23.256 (pool-1-thread-1)]: completed 341 (47 ruptures)&lt;br /&gt;
[23:55:23.283 (pool-1-thread-1)]: completed binary output 341&lt;br /&gt;
[23:55:23.283 (main)]: processing binary filter for 341&lt;br /&gt;
[23:55:23.285 (pool-1-thread-1)]: calculating 349&lt;br /&gt;
[23:55:23.285 (pool-1-thread-1)]: Instantiating ERF&lt;br /&gt;
/expanse/lustre/projects/usc143/qwxdev/apps/expanse/rocky8.8/ucerf3-etas/069e27e/ucerf3-etas-launcher/sbin/u3etas_jar_wrapper.sh: line 33: 2307392 Killed                  java -Djava.awt.headless=true $MEM_ARG -cp $DIR/opensha/opensha-all.jar $@&lt;br /&gt;
Thu Jan 16 23:55:24 PST 2025&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The &amp;quot;max memory&amp;quot; and &amp;quot;allocated memory&amp;quot; values both equal 47GB (49,283,072 KB /&lt;br /&gt;
1024^2 = 47GB).  Java has reserved that full amount from the OS.  Even though&lt;br /&gt;
much of it is unused within the JVM, the OS sees Java using all of it.&lt;br /&gt;
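&lt;br /&gt;
The conversion above can be reproduced with a short Python sketch. It assumes the log&lt;br /&gt;
reports memory in KB, which the 1024^2 conversion implies; the helper name kb_to_gb is&lt;br /&gt;
hypothetical and the values are copied from the log excerpt.&lt;br /&gt;

```python
# Sketch only: assumes the log reports memory in KB, as the conversion above implies.
KB_PER_GB = 1024 ** 2  # 1 GB = 1024 * 1024 KB


def kb_to_gb(kb):
    """Convert a KB count from the log into GB (hypothetical helper)."""
    return kb / KB_PER_GB


# Values copied from the final memory report in the log excerpt.
max_memory_gb = kb_to_gb(49_283_072)
in_use_gb = kb_to_gb(25_760_779)
free_gb = kb_to_gb(23_522_292)

print(f"max memory:    {max_memory_gb:.1f} GB")
print(f"in use memory: {in_use_gb:.1f} GB")
print(f"free memory:   {free_gb:.1f} GB")
```

Under that assumption, roughly half of the 47GB reserved by Java was in use inside the JVM when the process was killed.&lt;br /&gt;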
&lt;br /&gt;
=== Solution ===&lt;br /&gt;
Increase the buffer between what Java is allowed to use and what you request&lt;br /&gt;
with SLURM.  In this case we increased the allocation from 50GB to 55GB and&lt;br /&gt;
decreased our utilization from 95% (50*.95=47.5) to 90% (55*.9=49.5).&lt;br /&gt;
&lt;br /&gt;
By increasing the memory requested and decreasing the utilization rate, we&lt;br /&gt;
increased the buffer from 2.5GB to 5.5GB while keeping at least as much memory&lt;br /&gt;
available to the application.&lt;br /&gt;
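&lt;br /&gt;
As a minimal sketch of this arithmetic, the following Python snippet computes the memory&lt;br /&gt;
available to Java and the remaining buffer for both configurations. The function name&lt;br /&gt;
memory_budget is hypothetical and is not part of the UCERF3-ETAS launcher scripts.&lt;br /&gt;

```python
# Sketch only: memory_budget is a hypothetical helper, not part of the launcher scripts.
def memory_budget(slurm_gb, utilization_pct):
    """Return (java_gb, buffer_gb) for a SLURM request and a utilization percentage."""
    java_gb = slurm_gb * utilization_pct / 100  # memory Java is allowed to use
    buffer_gb = slurm_gb - java_gb              # headroom left for overhead
    return java_gb, buffer_gb


# The failing and the successful configurations described above.
for slurm_gb, utilization_pct in [(50, 95), (55, 90)]:
    java_gb, buffer_gb = memory_budget(slurm_gb, utilization_pct)
    print(f"{slurm_gb}GB at {utilization_pct}%: Java {java_gb}GB, buffer {buffer_gb}GB")
```

The same helper can be used to try other request and utilization pairs before submitting a job.&lt;br /&gt;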
&lt;br /&gt;
As the simulation continues, the memory consumption gradually creeps towards&lt;br /&gt;
the &amp;quot;max memory&amp;quot; allocated. This is normal and the memory consumed will not&lt;br /&gt;
exceed the maximum allocation. So long as there is sufficient buffer, the job&lt;br /&gt;
should not be killed.&lt;br /&gt;
&lt;br /&gt;
See the successful job execution with creeping memory consumption in the following log file.&lt;br /&gt;
&lt;br /&gt;
[[File:UCERF3 ETAS Tutorial 17 ospjob.log.txt]]&lt;/div&gt;</summary>
		<author><name>Bhatthal</name></author>
		
	</entry>
</feed>