CNers have asked about a donation box for Cloudy Nights over the years, so here you go. Donation is not required by any means, so please enjoy your stay.


Thoughts On PixInsight Use of RAMDisk

45 replies to this topic

#26 deonb

deonb

    Mariner 2

  • *****
  • Posts: 286
  • Joined: 16 Jul 2020
  • Loc: Seattle,WA

Posted 13 January 2021 - 06:41 AM

If you want pure file I/O performance, the RAMDisk is the best deal regardless of hardware. If you want the best overall mixed CPU / file I/O performance, the RAMDisk may slow you down some if you already have a very fast storage subsystem.


Also, there's RAM, there's RAM and then there is RAM configuration.

A Ryzen 3900X like santafe has can actually drive 3600 MHz RAM at CL16 quite reasonably.

You need to set it up correctly in the motherboard BIOS, though; the default voltages are actually all wrong on an ASUS board, for both the RAM and the CPU (still, even with the latest BIOS from a couple of weeks ago), so the XMP profile doesn't work. But it can be fixed. (You need to undervolt it, not overvolt it.)

I think LTT did a comparison and going from something like 3200 to 3600 is only a 5% to 10% increase in performance (IIRC). However, I've never tried actual 3200. On 3600, the default-configured settings (which theoretically behave like 3200) vs. properly configured 3600 settings made about a 33% difference in my VC++ build times (a similarly memory- and CPU-intensive application). So I assume the same would hold for PixInsight. BUT again, I haven't tried actual 3200 with proper 3200 settings, so it may not be more than a few percentage points.

Having said that, don't do a RAMDisk [edit: addressing only santafe here because of his specific 2x PCIe4 combination]. But do properly optimize your RAM - it's still going to get used.

Edited by deonb, 13 January 2021 - 11:01 PM.


#27 jdupton

jdupton

    Gemini

  • *****
  • topic starter
  • Posts: 3,021
  • Joined: 21 Nov 2010
  • Loc: Central Texas, USA

Posted 13 January 2021 - 09:30 AM

deonb,

 

   I agree with your take on tuning RAM. I am running my system at 3200, which is the fastest that AMD specs for the memory bus on the Ryzen 9 CPUs. However, I tuned the timing parameters of the ASUS motherboard in order to run the memory modules themselves at the same speed (measured in nanoseconds) that the system would have run at if you simply blindly set the XMP profile to 3600 MHz. (By the way, the XMP profile didn't work for my G.Skill 3600 DRAM either.)

 

   My own measurements showed the RAM timing only changed PI Benchmark CPU performance by about 3% to 5% going between 3200 and 3600 MHz. A little extra could be eked out by running the memory controller at 3600 MHz, but I preferred to keep it within the AMD consumer spec for maximum stability. Many like to push the DRAM and the memory bus as fast as possible; I choose to stay within the specs the CPU and DRAM vendors recommend. Rather than use an XMP profile, I tuned the motherboard timing parameters instead on my system.

 

   Overall, I think memory "MHz speed" is overrated. For most well-written applications, the cache hierarchy of the CPU hides memory timing in a system. An X increase in memory bus speed almost always results in something considerably less than X application performance increase (except in pathological benchmarks specifically built to highlight RAM speed).

 

   Regarding RAMDisk, I disagree with your conclusion to never use a RAMDisk at all unless you have very fast storage as related in my prior post. With fast dual x4 PCIe-4.0 NVMe drives, I will agree that RAMDisk can slow down post-processing processes in PI such as those that are used in the PI Benchmark. For post-processing PI processes (as opposed to pre-processing processes like ImageIntegration), the bulk of your RAM is not going to be used unless you are working on a large mosaic image from a full frame camera.

 

   In systems with a slower storage subsystem, the RAMDisk is useful to speed up the Undo / Redo of very large images. The benefits are marginal considering the amount of time used by the process itself but can make a nice noticeable difference when you do things like applying a saved Process History from one image to another.

 

 

John



#28 santafe retiree

santafe retiree

    Viking 1

  • *****
  • Posts: 781
  • Joined: 23 Aug 2014

Posted 13 January 2021 - 10:24 AM

John,

 

What memory settings did you end up using?

 

Tom



#29 dghent

dghent

    Viking 1

  • *****
  • Vendors
  • Posts: 920
  • Joined: 10 Jun 2007

Posted 13 January 2021 - 11:32 AM

If you're on all-solid-state storage, and especially (older) cheap SSDs where the cells held in reserve are minimal, using a RAM disk is still a no-brainer. There is no sense in grinding down your SSD's write endurance with writes of large amounts of temporary data if you already have enough RAM that can be spared for an adequately-sized RAM disk for temp writes. You can get lost in the debate over minutiae such as PCIe bus speeds and memory clock levels and all that, but the grim reaper for the cell memory on your SSDs is omnipresent no matter what.


Edited by dghent, 13 January 2021 - 11:32 AM.


#30 deonb

deonb

    Mariner 2

  • *****
  • Posts: 286
  • Joined: 16 Jul 2020
  • Loc: Seattle,WA

Posted 13 January 2021 - 12:01 PM

If you're on all-solid-state storage, and especially (older) cheap SSDs where the cells held in reserve are minimal, using a RAM disk is still a no-brainer. There is no sense in grinding down your SSD's write endurance with writes of large amounts of temporary data if you already have enough RAM that can be spared for an adequately-sized RAM disk for temp writes. You can get lost in the debate over minutiae such as PCIe bus speeds and memory clock levels and all that, but the grim reaper for the cell memory on your SSDs is omnipresent no matter what.

 

If PixInsight specifies FILE_ATTRIBUTE_TEMPORARY | FILE_FLAG_DELETE_ON_CLOSE when creating their swap files (which I assume they do), then there is no physical write back to the SSD cells if there is enough available RAM.
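As a sketch of what that idiom looks like, here is the Windows temp-file pattern being described. This is an assumption about what PixInsight might do, not taken from its source; the flag values are those defined in the Windows SDK headers.

```cpp
// Hypothetical sketch of the Windows temporary-file idiom -- NOT from
// PixInsight's actual source. FILE_ATTRIBUTE_TEMPORARY tells the cache
// manager to keep the data in RAM and avoid flushing it to disk when memory
// allows; FILE_FLAG_DELETE_ON_CLOSE removes the file when the last handle is
// closed, so the data often never reaches the SSD cells at all.
#include <cstdint>

#ifdef _WIN32
#include <windows.h>
#else
// Flag values as defined in the Windows SDK, so the combination can be
// illustrated off-Windows as well.
constexpr uint32_t FILE_ATTRIBUTE_TEMPORARY  = 0x00000100;
constexpr uint32_t FILE_FLAG_DELETE_ON_CLOSE = 0x04000000;
#endif

// The flags a swap-file writer would pass as the 6th argument of CreateFileW.
constexpr uint32_t swap_file_flags()
{
    return FILE_ATTRIBUTE_TEMPORARY | FILE_FLAG_DELETE_ON_CLOSE;
}
```

On Windows the actual call would look something like `CreateFileW(path, GENERIC_READ | GENERIC_WRITE, 0, nullptr, CREATE_ALWAYS, swap_file_flags(), nullptr)`.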

 

And if there isn't enough RAM then a RAMDISK won't work either.



#31 jdupton

jdupton

    Gemini

  • *****
  • topic starter
  • Posts: 3,021
  • Joined: 21 Nov 2010
  • Loc: Central Texas, USA

Posted 13 January 2021 - 12:20 PM

Tom,

 

What memory settings did you end up using?

   The specific settings will be different for each CPU and DRAM choice and may have to account for motherboard options defined by your system BIOS. I can explain the methodology in detail via PM but, to avoid confusing others, would rather not post the details here; they can be specific to your build. The basic method consists of:

  • Convert the XMP profile specification into the detailed nanosecond timings required for the DRAM modules.
     
  • Calculate a new set of memory cycle timing numbers that never exceed either 1) the CPU memory controller specification or 2) the DRAM access timing (in nanoseconds) specification.

   Once these calculations are completed, the results are entered in the detailed (advanced) Memory Setup area of your system BIOS. When you are done, you will be running at the maximum frequency specified for the CPU memory bus, but the actual access speed of the memory will be as fast as it would have been had you successfully run the DRAM's XMP profile.
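As a hypothetical worked example of that conversion, assume DDR4-3600 CL16 modules run on a 3200 MT/s bus. DDR transfers twice per clock, so one transfer period is 2000 / (MT/s) nanoseconds:

```cpp
// Sketch of the XMP-to-nanoseconds conversion described above, using
// hypothetical DDR4-3600 CL16 modules as the example.
#include <cmath>

// Absolute CAS latency in nanoseconds for a given data rate (MT/s) and CL.
double cas_latency_ns(int mt_per_s, int cl)
{
    return cl * 2000.0 / mt_per_s;
}

// Largest CL at a target data rate that stays within a nanosecond budget.
int equivalent_cl(int target_mt_per_s, double budget_ns)
{
    return static_cast<int>(std::floor(budget_ns * target_mt_per_s / 2000.0));
}
```

DDR4-3600 CL16 works out to 16 * 2000 / 3600 ≈ 8.89 ns; the same budget at 3200 MT/s allows CL14 (8.75 ns), i.e. the modules access memory at the same absolute speed the XMP profile would have delivered.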

 

 

John


  • santafe retiree likes this

#32 jdupton

jdupton

    Gemini

  • *****
  • topic starter
  • Posts: 3,021
  • Joined: 21 Nov 2010
  • Loc: Central Texas, USA

Posted 13 January 2021 - 12:55 PM

deonb,

 

If PixInsight specifies FILE_ATTRIBUTE_TEMPORARY | FILE_FLAG_DELETE_ON_CLOSE when creating their swap files (which I assume they do), then there is no physical write back to the SSD cells if there is enough available RAM.

 

And if there isn't enough RAM then a RAMDISK won't work either.

   If I understand what you are saying, unfortunately, I don't think that is the case, though I cannot tell for sure. You can verify the following observations:

  • Look to see where PI is set to store its swap files. (These are entries in PI Preferences.)
  • Go to that directory and delete any files there while PI is not running.
  • Start PI and open an image frame.
  • Apply a HistogramTransformation stretch to the image. It must actually change the image; an STF visualization stretch is not the same.
  • Check the Swap Folder location and you should see files there now.
  • Close the stretched image without saving.
  • Observe that the Swap files are deleted.
     
  • Reopen an image frame.
  • Apply a HistogramTransformation stretch to the image. It must actually change the image; an STF visualization stretch is not the same.
  • Close PI.
  • Check for any files left in the Swap Folder. They disappear when PI closes but are there, and can be copied to other folders, while present. 

   I don't know if it is PI or the OS that deletes the files, but at least they do actually appear in the folder while in use.

 

 

John


Edited by jdupton, 13 January 2021 - 12:55 PM.


#33 dghent

dghent

    Viking 1

  • *****
  • Vendors
  • Posts: 920
  • Joined: 10 Jun 2007

Posted 13 January 2021 - 01:08 PM

If PixInsight specifies FILE_ATTRIBUTE_TEMPORARY | FILE_FLAG_DELETE_ON_CLOSE when creating their swap files (which I assume they do), then there is no physical write back to the SSD cells if there is enough available RAM.

 

And if there isn't enough RAM then a RAMDISK won't work either.

Yes, but not all PI installations run on Windows, so the same file-open flags might not be available, or might not behave the same, on other OSes. To boot, if the scratch space does happen to be backed by permanent storage, it can be used to recover in-process work in the event of PI or the system itself crashing. Obviously, having the scratch data backed by permanent storage would be required to rescue work after a system crash rather than just a PI one.

 

I really don't know why this is an actual debate. There are certainly bonuses in speed and storage wear (if SSDs are involved) to RAM-backed scratch space, and y'all are just here arguing over vanishingly small differences in an attempt to be for or against it. Threads like this just make non-technical users more confused and really don't benefit anyone in the end.



#34 deonb

deonb

    Mariner 2

  • *****
  • Posts: 286
  • Joined: 16 Jul 2020
  • Loc: Seattle,WA

Posted 13 January 2021 - 06:56 PM

deonb,
 
   If I understand what you are saying, unfortunately, I don't think that is the case, though I cannot tell for sure. You can verify the following observations:

  • Look to see where PI is set to store its swap files. (These are entries in PI Preferences.)
  • Go to that directory and delete any files there while PI is not running.
  • Start PI and open an image frame.
  • Apply a HistogramTransformation stretch to the image. It must actually change the image; an STF visualization stretch is not the same.
  • Check the Swap Folder location and you should see files there now.
  • Close the stretched image without saving.
  • Observe that the Swap files are deleted.

  • Reopen an image frame.
  • Apply a HistogramTransformation stretch to the image. It must actually change the image; an STF visualization stretch is not the same.
  • Close PI.
  • Check for any files left in the Swap Folder. They disappear when PI closes but are there, and can be copied to other folders, while present. 
 I don't know if it is PI or the OS that deletes the files, but at least they do actually appear in the folder while in use.
 
 
John


This by itself is not a sufficient test. Windows will show those files as being present while the application is running, based on the Windows cache, even if they're not actually on disk. You probably need to power-yank the device while PI is running, boot it up again, and then look at whether the files are still there before starting PI back up.

But even so, it may just be the directory entries and not any actual content, so you'd have to copy the files out and examine them to see if they look like valid data.

#35 jdupton

jdupton

    Gemini

  • *****
  • topic starter
  • Posts: 3,021
  • Joined: 21 Nov 2010
  • Loc: Central Texas, USA

Posted 13 January 2021 - 07:06 PM

deonb,

 

   Agreed. I was able to copy them out of the directory to another location and they appeared to contain data. Even that may not prove anything -- as you say, Windows may have just copied the cached data to a real file in the new location. The "power yank" and reboot is the better test. I don't think I will do that test on my main system at this time, though.

 

 

John



#36 deonb

deonb

    Mariner 2

  • *****
  • Posts: 286
  • Joined: 16 Jul 2020
  • Loc: Seattle,WA

Posted 13 January 2021 - 07:22 PM

deonb,

 

   Agreed. I was able to copy them out of the directory to another location and they appeared to contain data. Even that may not prove anything -- as you say, Windows may have just copied the cached data to a real file in the new location. The "power yank" and reboot is the better test. I don't think I will do that test on my main system at this time, though.

 

 

John

When you start up PixInsight for the first time, it shows the dialog below (highlighting mine). Based on the wording, it seems like they would be using temporary files specifically rather than just general files.
 

----

A system temporary folder is being used for storage of image swap files.

Be aware that files stored on system temporary folders may be deleted automatically
by the operating system and/or file management utilities, depending on specific
platforms and system configurations.

If you use one of these folders to store image swap files created by Pixinsight, you may
have problems if some swap files are removed while the associated images or projects
are still being used in a running instance of the PixInsight core application.

You can define swap file directories with the Preferences process, Directories and
Network section.
 



#37 deonb

deonb

    Mariner 2

  • *****
  • Posts: 286
  • Joined: 16 Jul 2020
  • Loc: Seattle,WA

Posted 13 January 2021 - 08:04 PM

If you're on all-solid-state storage, and especially (older) cheap SSDs where the cells held in reserve are minimal, using a RAM disk is still a no-brainer. There is no sense in grinding down your SSD's write endurance with writes of large amounts of temporary data if you already have enough RAM that can be spared for an adequately-sized RAM disk for temp writes. You can get lost in the debate over minutiae such as PCIe bus speeds and memory clock levels and all that, but the grim reaper for the cell memory on your SSDs is omnipresent no matter what.

That, however, is trading one theoretical concern for another.

 

A Sabrent Rocket is rated for 1800 write cycles. Thus on a 1 TB Rocket you can write 1.8 PB before you're out of cycles.

An ASI6200/QHY600 .fits file is 120 MB (the largest .fits I know of). So you can write 15 million of those within 1.8 PB.

Put another way, you can do a HistogramTransformation every minute, 24 hours per day, for 28 years before you run out of write cycles.

On 2x 1 TB drives that's 56 years. On 2x 2 TB drives, 112 years.

So in all practical terms, the drives will hit their MTBF long before they run out of write cycles.
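The arithmetic above can be checked directly (the 1800-cycle rating and 120 MB file size are the figures stated in the post; "1800 write cycles" is read as 1800 full drive writes on a 1 TB drive):

```cpp
// Back-of-the-envelope check of the endurance figures quoted above.
#include <cstdint>

constexpr double kTerabyte         = 1e12;                       // bytes
constexpr double write_budget      = 1800 * 1.0 * kTerabyte;     // 1.8 PB on a 1 TB drive
constexpr double fits_file_bytes   = 120e6;                      // ~120 MB ASI6200/QHY600 frame

constexpr double files_total             = write_budget / fits_file_bytes;       // ~15 million files
constexpr double years_at_one_per_minute = files_total / (60.0 * 24.0 * 365.0);  // one file per minute
```

One file per minute, around the clock, gives roughly 28.5 years on a single 1 TB drive, matching the post's numbers.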
 



#38 dghent

dghent

    Viking 1

  • *****
  • Vendors
  • Posts: 920
  • Joined: 10 Jun 2007

Posted 13 January 2021 - 09:17 PM

There are people with older, lower-quality storage, and there are a lot of cheap SSDs out there that use cheap cell tech and maximize available space over write cycles, or sit in laptops where storage (spinning rust or otherwise) operates in a low-power, low-performance mode. Either way, you are all arguing such trivial details that this debate will only go further into the weeds and end up talking only about highly optimized, exceedingly modern systems. We're already butwhaddabouting DRAM clock levels and PCIe 4.0, for gosh sakes. In practice, none of that will really matter. It just says to me that the larger picture has been lost.

 

If someone is technically savvy enough to weigh the benefits and impacts of having or not having a RAM disk for scratch space, combined with the non-performance drawbacks or benefits, then they will be able to make the call for themselves within the context of their hardware's capabilities. And you know what? It's not a big deal at all to try it, see if it's really a benefit, and revert if it's not earth-shatteringly better or if it yields worse results or a degraded experience. But this entire thread makes it seem like going down that path marks a point of no return that warrants intense debate. It's really silly.



#39 deonb

deonb

    Mariner 2

  • *****
  • Posts: 286
  • Joined: 16 Jul 2020
  • Loc: Seattle,WA

Posted 13 January 2021 - 10:31 PM

 Regarding RAMDisk, I disagree with your conclusion to never use a RAMDisk at all unless you have very fast storage as related in my prior post. With fast dual x4 PCIe-4.0 NVMe drives, I will agree that RAMDisk can slow down post-processing processes in PI such as those that are used in the PI Benchmark.


Sorry, I just meant that as a pile-on to your statement addressed to santafe, who was in the process of getting 2x PCIe4 drives - not as a general conclusion for all cases.
  • jdupton likes this

#40 deonb

deonb

    Mariner 2

  • *****
  • Posts: 286
  • Joined: 16 Jul 2020
  • Loc: Seattle,WA

Posted 13 January 2021 - 10:58 PM

There are people with older, lower-quality storage, and there are a lot of cheap SSDs out there that use cheap cell tech and maximize available space over write cycles, or sit in laptops where storage (spinning rust or otherwise) operates in a low-power, low-performance mode. Either way, you are all arguing such trivial details that this debate will only go further into the weeds and end up talking only about highly optimized, exceedingly modern systems. We're already butwhaddabouting DRAM clock levels and PCIe 4.0, for gosh sakes. In practice, none of that will really matter. It just says to me that the larger picture has been lost.
 
If someone is technically savvy enough to weigh the benefits and impacts of having or not having a RAM disk for scratch space, combined with the non-performance drawbacks or benefits, then they will be able to make the call for themselves within the context of their hardware's capabilities. And you know what? It's not a big deal at all to try it, see if it's really a benefit, and revert if it's not earth-shatteringly better or if it yields worse results or a degraded experience. But this entire thread makes it seem like going down that path marks a point of no return that warrants intense debate. It's really silly.


I was addressing one specific person, since he was in the process of buying hardware similar to what I have and had a question about whether it's worth upgrading a motherboard in order to support 3x instead of 2x NVMe drives.

My replies weren't an attempt to help people make the best of existing hardware - they're specifically about new hardware that santafe is in the process of buying. So he can't just try it and revert.

@endless-sky showed earlier in this thread that the difference isn't just trivial. Just by changing bus architecture from PCIe3 to PCIe4 he was able to get some processes in PixInsight running 2 to 8 times faster, and an overall 30-minute speedup in processing time.


And similarly, as far as the DRAM clock levels: my motherboard can't run the Ryzen 3900-recommended 3600 MHz CL16 DRAM at XMP - it will crash due to a bug in the ASUS BIOS defaults. By running at DRAM defaults instead, it's measurably 33% slower in my builds. Since santafe is going to end up with hardware similar to mine, I was just letting him know in case he runs into it. If he can run at XMP, then by all means he should do it and not worry about it. I can't.

Again not a general recommendation to everybody - just about this specific hardware combination.

Edited by deonb, 13 January 2021 - 10:59 PM.

  • santafe retiree likes this

#41 LuxTerra

LuxTerra

    Mariner 2

  • -----
  • Posts: 225
  • Joined: 29 Aug 2020

Posted 14 January 2021 - 11:57 AM

deonb,

 

   I agree with your take on tuning RAM. I am running my system at 3200, which is the fastest that AMD specs for the memory bus on the Ryzen 9 CPUs. However, I tuned the timing parameters of the ASUS motherboard in order to run the memory modules themselves at the same speed (measured in nanoseconds) that the system would have run at if you simply blindly set the XMP profile to 3600 MHz. (By the way, the XMP profile didn't work for my G.Skill 3600 DRAM either.)

 

   My own measurements showed the RAM timing only changed PI Benchmark CPU performance by about 3% to 5% going between 3200 and 3600 MHz. A little extra could be eked out by running the memory controller at 3600 MHz, but I preferred to keep it within the AMD consumer spec for maximum stability. Many like to push the DRAM and the memory bus as fast as possible; I choose to stay within the specs the CPU and DRAM vendors recommend. Rather than use an XMP profile, I tuned the motherboard timing parameters instead on my system.

 

   Overall, I think memory "MHz speed" is overrated. For most well-written applications, the cache hierarchy of the CPU hides memory timing in a system. An X increase in memory bus speed almost always results in something considerably less than X application performance increase (except in pathological benchmarks specifically built to highlight RAM speed).

 

   Regarding RAMDisk, I disagree with your conclusion to never use a RAMDisk at all unless you have very fast storage as related in my prior post. With fast dual x4 PCIe-4.0 NVMe drives, I will agree that RAMDisk can slow down post-processing processes in PI such as those that are used in the PI Benchmark. For post-processing PI processes (as opposed to pre-processing processes like ImageIntegration), the bulk of your RAM is not going to be used unless you are working on a large mosaic image from a full frame camera.

 

   In systems with a slower storage subsystem, the RAMDisk is useful to speed up the Undo / Redo of very large images. The benefits are marginal considering the amount of time used by the process itself but can make a nice noticeable difference when you do things like applying a saved Process History from one image to another.

 

 

John

There are memory-bandwidth-limited tasks out there, but I'm not sure PI has any. At least, I haven't seen any evidence of it. Everything I have seen about PI performance bottlenecks indicates that it's very latency- and I/O-sensitive, but not terribly bandwidth-sensitive so long as your hardware is reasonably recent. For certain tasks, there appears to be a lot of CPU core-to-core communication or data consistency/coherency bottlenecking, which is one reason ThreadRippers don't scale anywhere near as well as you would expect for an image processing application.

 

Since we're talking about AMD: the memory clock and the Infinity Fabric clock are shared. The real reason most applications see an improvement from overclocking the memory is that it also overclocks the Infinity Fabric. This provides some bandwidth increase, but the real benefit is significantly reduced latency, specifically between chiplets. In the Ryzen 5000/Zen 3 design, cache does help to hide a lot of the memory latency, but there are still package-level, chiplet-to-chiplet data consistency and coherency issues to deal with. The L3 is a victim cache, which has certain implications, but it's only coherent on the same die/chiplet. Thus a 5950X, which has two 8-core chiplets, has a lot of L3 coherency traffic between the dies/chiplets. That typically means a core on one chiplet must wait a relatively long time for data consistency issues to be resolved if that memory location is dirty in another chiplet's L3 cache. Overclocking memory overclocks the Infinity Fabric, which reduces the latency of these L3 cache coherency communications, thereby increasing performance by reducing CPU idle cycles. ThreadRipper has up to eight 8-core chiplets, all needing L3 coherency across them at times, and that's why the scaling from 16/32/64 cores is so dismal. If PI required less memory consistency across the processing task (as with something like Cinebench), you'd see far better scaling.
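The coherency cost being described can be seen at small scale even within one die, via false sharing: two atomics on the same cache line force that line to bounce between cores on every write. A minimal sketch (the 64-byte line size is an assumption typical of current x86 and Zen parts):

```cpp
// Illustration of cache-line coherency pressure via false sharing. When two
// threads hammer counters that share a cache line, every write forces the
// line to migrate between cores -- and, across chiplets, over the Infinity
// Fabric, which is exactly the traffic discussed above.
#include <atomic>

struct SharedLine {          // both counters land on one 64-byte line
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};

struct PaddedLines {         // each counter gets its own line; no ping-pong
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};
```

In a two-thread increment loop the padded layout is typically several times faster than the shared one; when the two threads sit on different chiplets, the penalty for the shared layout grows further.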

 

It's unclear to me, as an external observer, whether there's something inherent to PI processing that requires this or it's just unoptimized software. The two best insights into this question (and both point to unoptimized software) are: other image processing applications scale better on large datasets than PI, and Apple's M1 performance via Rosetta 2. The M1 PI benchmarks are way higher than they should be for what is effectively a quad-core. This is partly a latency thing, but also a direct result of ARM's weak memory model, in comparison to x86's roughly-TSO model. Avoiding too many details, let's just say the weak ARM memory model forces the software/programmer to deal with data consistency in a way that likely results in far more optimized code after a Rosetta 2 translation (which is how PI is currently running on M1); x86's roughly-TSO model likely hurts applications like PI which require high cache coherency due to programming design choices. However, that's just an external observer's guess at what is going on, from very limited data and no insider info.

 

Finally, there's some data to suggest that the way AMD implemented the memory/IF clocks prefers certain multiples over others. DDR4-3200 is fine, but the data suggests there's a tiny bit more latency at DDR4-3600 than there should be. IIRC, the best scaling has been shown for integer multiples of 133 MHz (266 MHz in DDR4 terms). Thus DDR4-3200, then DDR4-3466, then DDR4-3733 are the recommended AMD settings. Intel CPUs seem to prefer 200/400 MHz DDR clock jumps. At least for now, any memory overclock beyond DDR4-3733 is not a good idea on AMD for real-world performance, because the IF can only clock that high before needing a clock divider. Thus going beyond DDR4-3733 currently increases IF latency, and since most applications are not memory-bandwidth-bound, performance is reduced. (A few benchmarks are outliers, though.)


Edited by LuxTerra, 14 January 2021 - 12:03 PM.

  • jdupton likes this

#42 deonb

deonb

    Mariner 2

  • *****
  • Posts: 286
  • Joined: 16 Jul 2020
  • Loc: Seattle,WA

Posted 15 January 2021 - 06:52 PM

It's unclear to me, as an external observer, whether there's something inherent to PI processing that requires this or it's just unoptimized software. The two best insights into this question (and both point to unoptimized software) are: other image processing applications scale better on large datasets than PI, and Apple's M1 performance via Rosetta 2. The M1 PI benchmarks are way higher than they should be for what is effectively a quad-core. This is partly a latency thing, but also a direct result of ARM's weak memory model, in comparison to x86's roughly-TSO model. Avoiding too many details, let's just say the weak ARM memory model forces the software/programmer to deal with data consistency in a way that likely results in far more optimized code after a Rosetta 2 translation (which is how PI is currently running on M1); x86's roughly-TSO model likely hurts applications like PI which require high cache coherency due to programming design choices. However, that's just an external observer's guess at what is going on, from very limited data and no insider info.

 

Since PixInsight is written in C++, and they weren't really specifically targeting ARM at this point, they are probably just using sequentially consistent atomics everywhere rather than explicit acquire/release semantics for SC-DRF.

 

However, M1 (ARMv8) has exact instructions for SC-DRF, even better than Intel's. I haven't jumped into the details of what the Rosetta translator does, but it could be that stores in the translated code are actually better. Although they don't see the C++ source code - they just see an xchg instruction - so I'm not sure how they can tell whether they actually need a full barrier vs. just a store-release.

 

Maybe with some complex flow analysis, which would be impressive. Or maybe they can detect which compiler created the code and, based on that, recognize patterns where they can legally swap out an xchg for just an stlr? Or maybe they just 'cowboy' it and always do it :). Who knows.
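The distinction under discussion can be made concrete in C++. This is a sketch; the codegen notes in the comments reflect typical compiler output for x86-64 and AArch64, not a guaranteed mapping.

```cpp
// Sequentially consistent vs. release stores. On x86-64, compilers typically
// emit an xchg (or mov + mfence) for the seq_cst store but a plain mov for
// the release store. On ARMv8 both commonly map to the single stlr
// instruction -- the "exact instruction for SC-DRF" discussed above.
#include <atomic>

std::atomic<int> ready{0};

void publish_seq_cst() { ready.store(1, std::memory_order_seq_cst); } // x86: xchg
void publish_release() { ready.store(1, std::memory_order_release); } // x86: plain mov

int consume() { return ready.load(std::memory_order_acquire); }
```

A binary translator that only sees the xchg cannot easily tell whether the original program needed full sequential consistency or merely a release store, which is exactly the ambiguity in question.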


  • LuxTerra likes this

#43 LuxTerra

LuxTerra

    Mariner 2

  • -----
  • Posts: 225
  • Joined: 29 Aug 2020

Posted 15 January 2021 - 08:52 PM

Since PixInsight is written in C++, and they weren't really specifically targeting ARM at this point, they are probably just using sequentially consistent atomics everywhere rather than explicit acquire/release semantics for SC-DRF.

 

However, M1 (ARMv8) has exact instructions for SC-DRF, even better than Intel's. I haven't jumped into the details of what the Rosetta translator does, but it could be that stores in the translated code are actually better. Although they don't see the C++ source code - they just see an xchg instruction - so I'm not sure how they can tell whether they actually need a full barrier vs. just a store-release.

 

Maybe with some complex flow analysis, which would be impressive. Or maybe they can detect which compiler created the code and, based on that, recognize patterns where they can legally swap out an xchg for just an stlr? Or maybe they just 'cowboy' it and always do it :). Who knows.

Good insight. I hadn't gotten around to digging quite that deep yet but agree, it appears rather clever. Do you have some sources for that info? I would like to read them.



#44 deonb

deonb

    Mariner 2

  • *****
  • Posts: 286
  • Joined: 16 Jul 2020
  • Loc: Seattle,WA

Posted 15 January 2021 - 10:24 PM

Good insight. I hadn't gotten around to digging quite that deep yet but agree, it appears rather clever. Do you have some sources for that info? I would like to read them.


Herb Sutter did a very good series a while ago about the C++ standard memory model titled "Atomic Weapons" that goes through the design behind SC-DRF and the hardware implementation on different platforms. At that point ARMv8 wasn't released yet, but he knew it was coming and incorporated it, and explained it in a very simple to follow way:

Part 1:
https://www.youtube....h?v=A8eCGOqgvH4

Part 2:
https://www.youtube....h?v=KeLBd2EJLOU

Edited by deonb, 15 January 2021 - 10:25 PM.


#45 LuxTerra

LuxTerra

    Mariner 2

  • -----
  • Posts: 225
  • Joined: 29 Aug 2020

Posted 16 January 2021 - 11:48 AM

Herb Sutter did a very good series a while ago about the C++ standard memory model titled "Atomic Weapons" that goes through the design behind SC-DRF and the hardware implementation on different platforms. At that point ARMv8 wasn't released yet, but he knew it was coming and incorporated it, and explained it in a very simple to follow way:

Part 1:
https://www.youtube....h?v=A8eCGOqgvH4

Part 2:
https://www.youtube....h?v=KeLBd2EJLOU

Hadn’t seen those before. Thanks. 
 

The reverse engineering of the M1 GPU is moving along; short summary: https://rosenzweig.i...gpu-part-1.html



#46 LuxTerra

LuxTerra

    Mariner 2

  • -----
  • Posts: 225
  • Joined: 29 Aug 2020

Posted 17 January 2021 - 04:47 PM

Herb Sutter did a very good series a while ago about the C++ standard memory model titled "Atomic Weapons" that goes through the design behind SC-DRF and the hardware implementation on different platforms. At that point ARMv8 wasn't released yet, but he knew it was coming and incorporated it, and explained it in a very simple to follow way:

Part 1:
https://www.youtube....h?v=A8eCGOqgvH4

Part 2:
https://www.youtube....h?v=KeLBd2EJLOU

Ok, finally got around to watching them. About two-thirds of it is standard software memory models and how we finally defined, standardized, and converged on them. The hardware discussion starts around minute 28 of part 2. ARMv8 is only discussed briefly at approximately minute 54. Summary: ARMv8 is the first CPU architecture to explicitly target alignment with software memory models (i.e., the compiler). Apple's A14/M1 are custom ARMv8 processors. Thus, as mentioned, they should have a significant advantage for this type of workload. 





