I agree with your take on tuning RAM. I am running my system at 3200, which is the fastest that AMD specs for the memory bus on the Ryzen 9 CPUs. However, I tuned the timing parameters of the ASUS motherboard so that the memory modules themselves run at the same absolute speed (measured in nanoseconds) that they would have run at if you had simply set the XMP profile to 3600 MHz. (By the way, the XMP profile didn't work for my G.Skill 3600 DRAM either.)
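For anyone wondering how speeds at different clocks can come out the same in nanoseconds, here's the arithmetic (a quick sketch; the CL values below are common bin examples, not my actual timings):

```python
def first_word_latency_ns(transfer_rate_mts, cas_latency):
    """DDR transfers twice per clock, so one clock period in ns is
    2000 / transfer rate (MT/s); latency = CAS cycles * period."""
    return cas_latency * 2000 / transfer_rate_mts

# DDR4-3600 at CL18 and DDR4-3200 at CL16 have identical absolute latency:
print(first_word_latency_ns(3600, 18))  # 10.0 ns
print(first_word_latency_ns(3200, 16))  # 10.0 ns
```

That's why tightening the timings at 3200 can match what a 3600 XMP profile delivers in real terms.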
My own measurements showed the RAM timing only changed PI Benchmark CPU performance by about 3% to 5% between 3200 and 3600 MHz. A little extra could be eked out by running the memory controller at 3600 MHz, but I preferred to keep it within the AMD consumer spec for maximum stability. Many like to push the DRAM and the memory bus as fast as possible; I chose to stay within the specs the CPU and DRAM vendors recommend, tuning the motherboard timing parameters rather than using an XMP profile.
Overall, I think memory "MHz speed" is overrated. For most well-written applications, the CPU's cache hierarchy hides most of the system's memory timing. An X increase in memory bus speed almost always yields considerably less than an X increase in application performance (except in pathological benchmarks built specifically to highlight RAM speed).
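To put the "considerably less than X" point in concrete terms, an Amdahl's-law-style estimate (a sketch with an illustrative memory-bound fraction, not measured data):

```python
def app_speedup(memory_bound_fraction, memory_speedup):
    """Amdahl's law: only the memory-bound fraction of runtime
    benefits from a faster memory bus; the rest is untouched."""
    f = memory_bound_fraction
    return 1 / ((1 - f) + f / memory_speedup)

# A 12.5% bus speedup (3200 -> 3600) with, say, 30% of runtime memory-bound:
x = 3600 / 3200                        # 1.125x bus speed
print(round(app_speedup(0.30, x), 3))  # 1.034 -> only ~3.4% faster overall
```

Which is right in line with the 3% to 5% I measured in the PI Benchmark.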
Regarding RAMDisk, I disagree with your conclusion to never use a RAMDisk at all unless you have very fast storage, as I related in my prior post. With fast dual x4 PCIe 4.0 NVMe drives, I will agree that a RAMDisk can slow down post-processing in PI, such as the processes used in the PI Benchmark. For PI's post-processing steps (as opposed to pre-processing steps like ImageIntegration), the bulk of your RAM is not going to be used unless you are working on a large mosaic image from a full-frame camera.
In systems with a slower storage subsystem, a RAMDisk is useful for speeding up Undo/Redo on very large images. The benefit is marginal considering the amount of time used by the process itself, but it can make a nice, noticeable difference when you do things like apply a saved Process History from one image to another.
There are memory-bandwidth-limited tasks out there, but I'm not sure PI has any; at least, I haven't seen any evidence of it. Everything I have seen about PI performance bottlenecks indicates that it's very latency- and I/O-sensitive, but not terribly bandwidth-sensitive so long as your hardware is reasonably recent. For certain tasks, there appears to be a lot of core-to-core communication or data consistency/coherency bottlenecking, which is one reason ThreadRippers don't scale anywhere near as well as you would expect for an image processing application.
Since we're talking about AMD, the memory clock and the Infinity Fabric clock are shared. The real reason most applications see an improvement from overclocking the memory is that it also overclocks the Infinity Fabric. This provides some bandwidth increase, but the real benefit is significantly reduced latency, specifically between chiplets. In the Ryzen 5000/Zen 3 design, cache does help hide a lot of the memory latency, but there are still package-level, chiplet-to-chiplet data consistency and coherency issues to deal with. The L3 is a victim cache, which has certain implications, but it's only coherent on the same die/chiplet. Thus a 5950X, which has two 8-core chiplets, generates a lot of L3 coherency traffic between the dies/chiplets. That typically means a core on one chiplet must wait a relatively long time for any data consistency issue to be resolved if that memory location is dirty in another chiplet's L3 cache. Overclocking memory overclocks the Infinity Fabric, which reduces the latency of these L3 cache coherency communications, thereby increasing performance by reducing CPU idle cycles. ThreadRipper has up to eight 8-core chiplets, all needing L3 coherency across them at times, and that's why the scaling from 16 to 32 to 64 cores is so dismal. If PI required less memory consistency across the processing task (as with something like Cinebench), then you'd see far better scaling.
It's unclear to me, as an external observer, whether there's something inherent to PI processing that requires this or whether it's just unoptimized software. The two best insights into this question, and both point to unoptimized software, are: other image processing applications scale better on large datasets than PI does, and Apple's M1 performance via Rosetta 2. The M1 PI benchmarks are way higher than they should be for what is effectively a quad core. This is partly a latency thing, but also a direct, implicit data consistency result of ARM's weak memory model, in comparison to x86's roughly TSO (total store order) model. Avoiding too many details, let's just say the weak ARM memory model forces the software/programmer to deal with data consistency in a way that likely results in far more optimized code after a Rosetta 2 translation (which is how PI is running on M1 currently); x86's roughly TSO model likely hurts applications like PI that require high cache coherency due to programming design choices. However, that's just an external observer's guess at what is going on, from very limited data and no insider info.
Finally, there's some data to suggest that how AMD implemented the memory/IF clocks prefers certain multiples over others. DDR4-3200 is fine, but the data suggests there's a tiny bit more latency at DDR4-3600 than there should be. IIRC, the best scaling has been shown for integer multiples of 133 MHz (266 MT/s in DDR4 terms). Thus DDR4-3200, then DDR4-3466, then DDR4-3733 are the recommended AMD settings. Intel CPUs seem to prefer the 200/400 MHz DDR clock jumps. At least for now, any memory overclock beyond DDR4-3733 is not a good idea on AMD for real-world performance, because the IF can only clock that high before needing to run a clock divider. Thus, going beyond DDR4-3733 right now increases IF latency, and since most applications are not memory-bandwidth-bound, performance is reduced (although a few benchmarks are outliers).
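Those preferred speeds fall straight out of the base clock math (a sketch assuming the standard DDR base clock of exactly 400/3 ≈ 133.33 MHz):

```python
# DDR transfers twice per base clock, so the DDR4 rate for multiple n is
# n * 800/3 MT/s; the marketing names truncate the fraction (3466.67 -> "3466").
# Integer arithmetic avoids any floating-point rounding surprises.
ddr4_rates = [n * 800 // 3 for n in (12, 13, 14)]
print(ddr4_rates)  # [3200, 3466, 3733]
```

i.e., DDR4-3200, DDR4-3466, and DDR4-3733 are the 12x, 13x, and 14x multiples.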
Edited by LuxTerra, 14 January 2021 - 12:03 PM.