Memory hierarchy

{{Memory types}}
[[File:ComputerMemoryHierarchy.svg|thumb|Diagram of the computer memory hierarchy]]

In [[computer architecture]], the '''memory hierarchy''' separates [[computer data storage|computer storage]] into a hierarchy based on response time. Since response time, complexity, and capacity are related, the levels may also be distinguished by their performance and controlling technologies. Memory hierarchy affects performance in computer architectural design, algorithm predictions, and lower-level [[programming]] constructs involving [[locality of reference]].

Designing for high performance requires considering the restrictions of the memory hierarchy, i.e. the size and capabilities of each component. Each of the various components can be viewed as part of a hierarchy of memories {{math|(''m''{{sub|1}}, ''m''{{sub|2}}, ..., ''m''{{sub|''n''}})}} in which each member {{math|''m''{{sub|''i''}}}} is typically smaller and faster than the next highest member {{math|''m''{{sub|''i''+1}}}} of the hierarchy. To limit waiting by higher levels, a lower level will respond by filling a buffer and then signaling to activate the transfer.

There are four major storage levels: internal ([[processor register]]s and [[CPU cache|cache]]), main ([[computer memory|system RAM]]), on-line mass storage ([[secondary storage]]), and off-line bulk storage ([[tertiary storage|tertiary]] and [[off-line storage]]).

This is a general memory hierarchy structuring. Many other structures are useful. For example, a paging algorithm may be considered as a level for [[virtual memory]] when designing a computer architecture, and one can include a level of [[nearline storage]] between online and offline storage.

Properties of the technologies in the memory hierarchy:
* Adding complexity slows the memory hierarchy.
* CMOx memory technology stretches the flash space in the memory hierarchy.
* One of the main ways to increase system performance is minimising how far down the memory hierarchy one has to go to manipulate data.
* Latency and bandwidth are two metrics associated with caches. Neither of them is uniform, but is specific to a particular component of the memory hierarchy.
* Predicting where in the memory hierarchy the data resides is difficult.
* The location in the memory hierarchy dictates the time required for the prefetch to occur.
==Examples==
[[File:Hwloc.png|thumb|right|300px|Memory hierarchy of an AMD Bulldozer server as detected by [[hwloc]]'s {{tt|lstopo}} tool]]


The number of levels in the memory hierarchy and the performance at each level have increased over time. The types of memory and storage components have also changed historically.<ref>{{cite web|url=http://www.computerhistory.org/timeline/memory-storage/|title=Memory & Storage – Timeline of Computer History – Computer History Museum|website=www.computerhistory.org}}</ref>
{{clear}}

{|class=wikitable
|+Cache, memory, and external storage hierarchy of a 2020s computer system (AMD [[Zen 4]])
|-
! colspan=2|Level !! Size !! Throughput !! Latency !! Notes
|-
| colspan=2| [[Register file]] || 18,432 bits || Up to 256&nbsp;GB/s (512 bits/cycle) || 0.25&nbsp;ns (1 cycle)<ref name=fog>{{cite web |last1=Fog |first1=Agner |title=The microarchitecture of Intel and AMD CPUs |url=https://www.agner.org/optimize/microarchitecture.pdf}} Chapters used: 24.16 Cache and memory access (Zen 4).</ref> || All CPU-related conversions assume a 4.0&nbsp;GHz clock, here and below. Full utilization of throughput is impossible on real workloads. Size is provided per core.
|-
| rowspan=3 | [[CPU cache]]
| L1 data || 32&nbsp;KiB || Up to 64&nbsp;GB/s (64 bytes/4 cycles) || 1&nbsp;ns (4 cycles)<ref name=fog/> || Hardware prefetching is required for maximum throughput. Size and throughput are per-core. The code cache is the same size but cannot be manipulated as data.
|-
| L2 || 1&nbsp;MB || Up to 18.3&nbsp;GB/s (64 bytes/14 cycles) || 3.5&nbsp;ns (14 cycles)<ref name=fog/> || Size and throughput are per-core.
|-
| L3 || 16&ndash;32&nbsp;MB || Up to 5.45&nbsp;GB/s (64 bytes/47 cycles) || 11.75&nbsp;ns (47 cycles)<ref name=fog/> || Size is shared among 8 cores. Throughput is per-core.
|-
| colspan=2 | [[Main memory]] ([[primary storage|primary]])
| 64&nbsp;GiB || ~60&nbsp;GB/s || 82.5&nbsp;ns || Size is shared among all cores. Latency depends on the memory clock and memory timings. In this case, a result from a pair of 32&nbsp;GB DDR5 DIMMs set to 6000&nbsp;MT/s via the factory EXPO profile is used.<ref>{{cite web |title=AMD Ryzen 7000/9000 DDR5 RAM OC Guide XPM and EXPO Profile Benchmarks |url=https://www.ocinside.de/workshop_en/amd_ryzen_7000_9000_ddr5_oc_guide/3/}}</ref>
 
Systems with multiple CPU sockets have an additional [[Non-uniform memory access|NUMA]] delay when a CPU tries to access memory under the control of another NUMA node.
|-
| rowspan=2|[[Mass storage]]<br>([[secondary storage|secondary]])
| [[Solid-state drive]]
| 2&nbsp;TB
| 2000&nbsp;MB/s
| 0.2&nbsp;ms
| Figures for a [[M.2]] [[NVMe]] SSD from 2017, the Samsung 960 Pro.<ref>{{cite web|url=http://www.storagereview.com/samsung_960_pro_m2_nvme_ssd_review|title=Samsung 960 Pro M.2 NVMe SSD Review|date=20 October 2016 |publisher=storagereview.com|access-date=2017-04-13}}</ref>
|-
| [[Hard disk drive]]
|18&nbsp;TB
|rowspan=2|500&nbsp;MB/s
| 4.16&nbsp;ms
| Per-drive figures for Exos 2X18 (ST18000NM0092), an enterprise-grade 3.5 inch SATA HDD.<ref>{{cite web |title=Datasheet Exos 2X18 |url=https://www.seagate.com/content/dam/seagate/migrated-assets/www-content/datasheets/pdfs/exos-2x18-DS2093-1-2202GB-en_SG.pdf}}</ref>
|-
| rowspan=2|[[Nearline storage|Nearline]]<br>([[tertiary storage|tertiary]])
| Spun-down HDDs ([[Non-RAID drive architectures#MAID|MAID]])
| Petabytes
| 25&nbsp;s
| Per-drive figures for Exos 2X18 (ST18000NM0092), from user manual entry for "start/stop times".<ref>{{cite web |title=2X18 SATA Product Manual |url=https://www.seagate.com/content/dam/seagate/migrated-assets/www-content/manuals/exos-x-2x18/pdf/203859600a.pdf}}</ref> In a typical MAID setup, hundreds of spun-down HDDs may be used for petabytes of storage.
|-
| [[Tape library]]
| Exabytes
| 160 MB/s<ref>{{cite web |url=http://www.lto.org/technology/generations.html |title=Ultrium – LTO Technology – Ultrium GenerationsLTO |publisher=Lto.org |access-date=2014-07-31 |url-status=dead |archive-url=https://web.archive.org/web/20110727052050/http://www.lto.org/technology/generations.html |archive-date=2011-07-27 }}</ref>
| Minutes
|
|-
| colspan=2| [[Offline storage]]
| Exabytes
| Depends on medium
| colspan=2| Depends on human operation
|}
 
Some CPUs include additional levels of cache between L3 and memory. For example, the [[Haswell microarchitecture]] includes an L4 cache of 128&nbsp;MB on mobile units.<ref>{{cite web|last=Crothers |first=Brooke |url=http://news.cnet.com/8301-13579_3-57609045-37/dissecting-intels-top-graphics-in-apples-15-inch-macbook-pro/ |title=Dissecting Intel's top graphics in Apple's 15-inch MacBook Pro – CNET |publisher=News.cnet.com |access-date=2014-07-31}}</ref><ref name=sisd_qa_f_mem_hsw>{{cite web|url=http://www.sisoftware.co.uk/?d=qa&f=mem_hsw |title=SiSoftware Zone |publisher=Sisoftware.co.uk |access-date=2014-07-31|archive-url=https://web.archive.org/web/20140913231938/http://www.sisoftware.co.uk/?d=qa&f=mem_hsw|archive-date=2014-09-13}}</ref>
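The practical effect of these numbers can be summarized with a simple average memory access time (AMAT) estimate. The following is an illustration only: it collapses everything below L1 into a single level and assumes a hypothetical 95% L1 hit rate (the hit rate is not taken from the sources above); the 1&nbsp;ns L1 latency and 82.5&nbsp;ns main-memory latency are from the table:

:<math>t_\text{avg} = t_\text{hit} + p_\text{miss} \cdot t_\text{miss} = 1\ \text{ns} + 0.05 \times 82.5\ \text{ns} \approx 5.1\ \text{ns}</math>

Even a 5% miss rate thus makes the average access roughly five times slower than an L1 hit, which is why the locality considerations discussed under [[#Programming|Programming]] below matter so much.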


The lower levels of the hierarchy{{dash}}from mass storage downwards{{dash}}are also known as [[tiered storage]]. The formal distinction between online, nearline, and offline storage is:<ref name="pearson2010">{{cite web|last=Pearson|first=Tony|year=2010|title=Correct use of the term Nearline.|url=https://www.ibm.com/developerworks/community/blogs/InsideSystemStorage/entry/the_correct_use_of_the_term_nearline2|url-status=dead|archive-url=https://web.archive.org/web/20181127020712/https://www.ibm.com/developerworks/community/blogs/InsideSystemStorage/entry/the_correct_use_of_the_term_nearline2?lang=en|archive-date=2018-11-27|access-date=2015-08-16|work=IBM Developerworks, Inside System Storage}}</ref>
* Online storage is immediately available for I/O.
* Nearline storage is not immediately available, but can be made online quickly without human intervention.
* Offline storage is not immediately available, and requires some human intervention to bring online.


For example, always-on spinning disks are online, while spinning disks that spin down, such as massive arrays of idle disk (MAID), are nearline. Removable media such as tape cartridges that can be automatically loaded, as in a [[tape library]], are nearline, while cartridges that must be manually loaded are offline.
 
==Programming==


Most modern [[Central processing unit|CPUs]] are so fast that, for most program workloads, the [[wikt:bottleneck|bottleneck]] is the [[locality of reference]] of memory accesses and the efficiency of the [[CPU cache|caching]] and memory transfer between different levels of the hierarchy{{Citation needed|date=September 2009}}. As a result, the CPU spends much of its time idling, waiting for memory I/O to complete. This is sometimes called the ''space cost'', as a larger memory object is more likely to overflow a small and fast level and require use of a larger, slower level. The resulting load on memory use is known as ''pressure'' (respectively ''register pressure'', ''cache pressure'', and (main) ''memory pressure''). Terms for data being missing from a higher level and needing to be fetched from a lower level are, respectively: [[register spilling]] (due to [[register pressure]]: register to cache), [[cache miss]] (cache to main memory), and (hard) [[page fault]] (''real'' main memory to ''virtual'' memory, i.e. mass storage, commonly referred to as ''disk'' regardless of the actual mass storage technology used).


Modern [[programming language]]s mainly assume two levels of memory, main (''working'') memory and mass storage. Exceptions are the relatively low-level [[assembly language]]s and the [[inline assembler]]s of higher-level languages such as [[C (programming language)|C]], in which registers can be accessed directly. Taking optimal advantage of the memory hierarchy requires the cooperation of programmers, hardware, and compilers (as well as underlying support from the operating system):
*''Programmers'' are responsible for moving data between disk and memory through file I/O (see the sketch after this list).
*''Hardware'' is responsible for moving data between memory and caches.
*''[[Optimizing compiler]]s'' are responsible for generating code that, when executed, will cause the hardware to use caches and registers efficiently.
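The first responsibility can be as simple as explicitly reading a file into a buffer in main memory. A minimal C sketch follows; the function name {{tt|load_file}} and its error handling are illustrative, not from any particular codebase:

<syntaxhighlight lang="c">
#include <stdio.h>
#include <stdlib.h>

/* Move data from mass storage (disk) into main memory.
 * This boundary of the hierarchy is managed explicitly by the
 * programmer through file I/O, unlike caches and registers. */
static char *load_file(const char *path, size_t *out_len)
{
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return NULL;

    fseek(f, 0, SEEK_END);       /* find the file size... */
    long size = ftell(f);
    rewind(f);                   /* ...and go back to the start */

    char *buf = malloc((size_t)size);
    if (buf != NULL && fread(buf, 1, (size_t)size, f) != (size_t)size) {
        free(buf);               /* short read: give the memory back */
        buf = NULL;
    }
    fclose(f);
    if (buf != NULL)
        *out_len = (size_t)size;
    return buf;
}
</syntaxhighlight>

Once the data is in main memory, its movement through caches and registers during subsequent processing is handled by the hardware and the compiler, per the remaining two responsibilities.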


Many programmers assume one level of memory. This works fine until the application hits a performance wall. At that point, the programmer needs to change the code's memory access patterns so that it works well with the cache resources available. A classic illustration of the effect of locality and caching is changing the order in which a three-dimensional array is iterated, as sketched below. ''Computer Systems: A Programmer's Perspective'' is a classic textbook that deals with this aspect of systems programming.<ref>{{cite web |title=A Programmer's Perspective: Memory Systems |url=https://csapp.cs.cmu.edu/3e/perspective.html}}</ref>
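A minimal sketch of that illustration in C, reduced to two dimensions for brevity (the array width is arbitrary; the same reordering applies to three-dimensional arrays):

<syntaxhighlight lang="c">
#define COLS 4096   /* one row = 4096 doubles = 32 KiB, the size of an L1 data cache above */

/* Row-major traversal: consecutive accesses are 8 bytes apart,
 * so every 64-byte cache line brought in from memory is fully used. */
double sum_fast(const double (*a)[COLS], int rows)
{
    double s = 0.0;
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < COLS; j++)
            s += a[i][j];
    return s;
}

/* Column-major traversal of the same (row-major) C array:
 * consecutive accesses are 32 KiB apart, so almost every access
 * touches a different cache line and misses far more often. */
double sum_slow(const double (*a)[COLS], int rows)
{
    double s = 0.0;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < rows; i++)
            s += a[i][j];
    return s;
}
</syntaxhighlight>

Both functions compute the same sum; only the iteration order, and hence the interaction with the memory hierarchy, differs. On typical hardware the row-major version can run several times faster.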


==See also==

==References==
{{Reflist}}