{{Short description|Computer memory design used in multiprocessing}}

[[File:HP Z820 motherboard.jpg|thumb|The motherboard of an [[HP Z|HP Z820]] workstation with two CPU sockets, each with its own set of eight [[DIMM]] slots surrounding the socket]]
 
'''Non-uniform memory access''' ('''NUMA''') is a [[computer storage|computer memory]] design used in [[multiprocessing]], where the memory access time depends on the memory location relative to the processor. Under NUMA, a processor can access its own [[local memory]] faster than non-local memory (memory local to another processor or memory shared between processors).<ref>{{FOLDOC|Non-uniform+memory+access}}</ref> NUMA is beneficial for workloads with high memory [[locality of reference]] and low [[lock contention]], because a processor may operate on a subset of memory mostly or entirely within its own cache node, reducing traffic on the memory bus.<ref name="nyu-numa">{{Cite web
  | url = http://cs.nyu.edu/~lerner/spring10/projects/NUMA.pdf
  | date = 2010-05-04
  | access-date = 2014-01-27
  |first1= Nakul |last1=Manchanda
  |first2= Karan |last2=Anand
  | publisher = New York University
  | archive-url = https://web.archive.org/web/20131228092942/http://www.cs.nyu.edu/~lerner/spring10/projects/NUMA.pdf
  | archive-date = 2013-12-28
}}</ref>

NUMA architectures logically follow in scaling from symmetric multiprocessing (SMP) architectures. They were developed commercially during the 1990s by Unisys, Convex Computer (later Hewlett-Packard), Honeywell Information Systems Italy (HISI) (later Groupe Bull), Silicon Graphics (later Silicon Graphics International), Sequent Computer Systems (later IBM), Data General (later EMC, now Dell Technologies), Digital (later Compaq, then HP, now HPE) and ICL. Techniques developed by these companies later featured in a variety of Unix-like operating systems, and to an extent in Windows NT.

The first commercial implementation of a NUMA-based Unix system was{{where|date=January 2022}}{{when|date=March 2023}} the Symmetrical Multi Processing XPS-100 family of servers, designed by Dan Gielan of VAST Corporation for [[Honeywell Information Systems]] Italy.


==Overview <span class="anchor" id="Basic concept"></span>==
[[File:NUMA.svg|thumb|One possible architecture of a NUMA system. The processors connect to the bus or crossbar by connections of varying number. This shows that different CPUs have different access priorities to memory based on their relative location.]]


Modern CPUs operate considerably faster than the main memory they use. In the early days of computing and data processing, the CPU generally ran slower than its own memory. The performance lines of processors and memory crossed in the 1960s with the advent of the first [[supercomputer]]s. Since then, CPUs increasingly have found themselves "starved for data" and forced to stall to wait for data to arrive from memory (e.g. for [[Von Neumann architecture|Von-Neumann architecture]]-based computers, see [[Von Neumann architecture#Von Neumann bottleneck|Von Neumann bottleneck]]). Many supercomputer designs of the 1980s and 1990s focused on providing high-speed memory access as opposed to faster processors, allowing the computers to work on large data sets at speeds other systems could not approach.


Limiting the number of memory accesses provided the key to extracting high performance from a modern computer. For commodity processors, this meant installing an ever-increasing amount of high-speed [[cache memory]] and using increasingly sophisticated algorithms to avoid [[cache miss]]es. But the dramatic increase in size of both the operating systems and the applications run on them has generally overwhelmed these cache-processing improvements. Multi-processor systems without NUMA make the problem considerably worse. Now a system can starve several processors at the same time, notably because only one processor can access the computer's memory at a time.<ref>{{cite web
  | url = https://www.usenix.org/legacy/event/atc11/tech/final_files/Blagodurov.pdf
  | title = A Case for NUMA-aware Contention Management on Multicore Systems
}}</ref>

NUMA attempts to address this problem by providing separate memory for each processor, avoiding the performance hit when several processors attempt to address the same memory. For problems involving spread data (common for servers and similar applications), NUMA can improve the performance over a single shared memory by a factor of roughly the number of processors (or separate memory banks).[4] Another approach to addressing this problem is the multi-channel memory architecture, in which a linear increase in the number of memory channels increases the memory access concurrency linearly.[5]

Of course, not all data ends up confined to a single task, which means that more than one processor may require the same data. To handle these cases, NUMA systems include additional hardware or software to move data between memory banks. This operation slows the processors attached to those banks, so the overall speed increase due to NUMA heavily depends on the nature of the running tasks.[4]
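
The local-versus-remote distinction can also be exercised explicitly by applications. The following minimal sketch (an illustration only, not code from the sources cited above) uses the Linux libnuma library to place one buffer on the calling thread's own NUMA node and another on a different node; the buffer size and the choice of the "next" node as the remote node are arbitrary, and a real program would measure access times rather than merely touch the memory.

<syntaxhighlight lang="c">
/* Illustrative sketch: allocate memory on the caller's own NUMA node and on a
 * remote node using libnuma (Linux).  Error handling is reduced to the
 * essentials.  Build with:  cc numa_alloc.c -lnuma                          */
#define _GNU_SOURCE
#include <numa.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not supported on this system\n");
        return EXIT_FAILURE;
    }

    int cpu         = sched_getcpu();           /* CPU this thread runs on      */
    int local_node  = numa_node_of_cpu(cpu);    /* its NUMA node                */
    int max_node    = numa_max_node();          /* highest node number          */
    int remote_node = (local_node + 1) % (max_node + 1);

    size_t len   = 64 * 1024 * 1024;            /* arbitrary 64 MiB buffers     */
    char *local  = numa_alloc_onnode(len, local_node);   /* local DRAM          */
    char *remote = numa_alloc_onnode(len, remote_node);  /* crosses interconnect */
    if (!local || !remote)
        return EXIT_FAILURE;

    memset(local, 1, len);                      /* touching the pages places them */
    memset(remote, 1, len);

    printf("running on CPU %d (node %d); remote buffer on node %d\n",
           cpu, local_node, remote_node);

    numa_free(local, len);
    numa_free(remote, len);
    return EXIT_SUCCESS;
}
</syntaxhighlight>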
==Implementations==
[[Advanced Micro Devices|AMD]] implemented NUMA with its [[Opteron]] processor (2003), using [[HyperTransport]]. [[Intel]] announced NUMA compatibility for its x86 and [[Itanium]] servers in late 2007 with its [[Nehalem (microarchitecture)|Nehalem]] and [[Tukwila (processor)|Tukwila]] CPUs.<ref>Intel Corp. (2008). Intel QuickPath Architecture [White paper]. Retrieved from http://www.intel.com/pressroom/archive/reference/whitepaper_QuickPath.pdf</ref> Both Intel CPU families share a common [[chipset]]; the interconnection is called Intel [[Intel QuickPath Interconnect|QuickPath Interconnect]] (QPI), which provides extremely high bandwidth to enable high on-board scalability and was replaced by a new version called Intel [[Intel UltraPath Interconnect|UltraPath Interconnect]] with the release of [[Skylake microarchitecture|Skylake]] (2017).<ref>{{cite press release | url = https://www.intel.com/pressroom/archive/releases/2007/20070918corp_b.htm | title = Gelsinger Speaks To Intel And High-Tech Industry's Rapid Technology Cadence | date = September 18, 2007 | publisher = Intel Corporation | accessdate = March 29, 2025}}</ref>


==Cache coherent NUMA (ccNUMA)<span class="anchor" id="CCNUMA"></span>==
[[File:Hwloc.png|thumb|Topology of a ccNUMA [[Bulldozer (microarchitecture)|Bulldozer]] server, extracted using hwloc's lstopo tool]]


{{details|Directory-based cache coherence}}
Nearly all CPU architectures use a small amount of very fast non-shared memory known as cache to exploit locality of reference in memory accesses. With NUMA, maintaining cache coherence across shared memory has a significant overhead. Although simpler to design and build, non-cache-coherent NUMA systems become prohibitively complex to program in the standard von Neumann architecture programming model.<ref>{{Cite web
  | title = ccNUMA: Cache Coherent Non-Uniform Memory Access
  | year = 2014 | access-date = 2014-01-27
  |website= slideshare.net
}}</ref>


Typically, ccNUMA uses inter-processor communication between cache controllers to keep a consistent memory image when more than one cache stores the same memory location. For this reason, ccNUMA may perform poorly when multiple processors attempt to access the same memory area in rapid succession. Support for NUMA in operating systems attempts to reduce the frequency of this kind of access by allocating processors and memory in NUMA-friendly ways and by avoiding scheduling and locking algorithms that make NUMA-unfriendly accesses necessary.<ref>{{Cite web
  | publisher = ACM }}</ref>
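
The cost of this kind of rapid shared access can be illustrated with an ordinary multithreaded program. In the sketch below (an illustrative microbenchmark, not taken from the cited sources, assuming a 64-byte cache line and that the scheduler places the two threads on different cores), both threads increment counters that either share a single cache line or occupy separate, padded lines; in the shared case every store forces the coherence protocol to move the line between caches, so that phase typically runs far slower, and the effect is amplified when the cores belong to different NUMA nodes.

<syntaxhighlight lang="c">
/* Cache-line ping-pong sketch.  Build with:  cc contention.c -lpthread */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITER 50000000UL
#define CACHE_LINE 64                    /* typical cache-line size (assumption) */

struct padded {
    volatile unsigned long value;
    char pad[CACHE_LINE - sizeof(unsigned long)];
};

static volatile unsigned long same_line[2];   /* both counters share one cache line */
static struct padded own_line[2];             /* each counter owns its own line     */

static void *bump_shared(void *arg) {
    long i = (long)arg;
    for (unsigned long n = 0; n < ITER; n++)
        same_line[i]++;                  /* every store invalidates the peer's copy */
    return NULL;
}

static void *bump_padded(void *arg) {
    long i = (long)arg;
    for (unsigned long n = 0; n < ITER; n++)
        own_line[i].value++;             /* no line bouncing between caches */
    return NULL;
}

static double run(void *(*fn)(void *)) {
    struct timespec a, b;
    pthread_t t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &a);
    pthread_create(&t0, NULL, fn, (void *)0);
    pthread_create(&t1, NULL, fn, (void *)1);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    clock_gettime(CLOCK_MONOTONIC, &b);
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void) {
    printf("same cache line: %.2f s\n", run(bump_shared));
    printf("separate lines:  %.2f s\n", run(bump_padded));
    return 0;
}
</syntaxhighlight>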


Alternatively, cache coherency protocols such as the [[MESIF protocol]] attempt to reduce the communication required to maintain cache coherency. [[Scalable Coherent Interface]] (SCI) is an [[IEEE]] standard defining a directory-based cache coherency protocol to avoid scalability limitations found in earlier multiprocessor systems. For example, SCI is used as the basis for the NumaConnect technology.<ref>{{Cite web |title= The Scalable Coherent Interface and Related Standards Projects |first=David B. |last=Gustavson |publisher=[[Stanford Linear Accelerator Center]] |date= September 1991 |work= SLAC Publication 5656 |url= http://www.slac.stanford.edu/cgi-wrap/getdoc/slac-pub-5656.pdf |archive-url=https://ghostarchive.org/archive/20221009/http://www.slac.stanford.edu/cgi-wrap/getdoc/slac-pub-5656.pdf |archive-date=2022-10-09 |url-status=live |access-date= January 27, 2014 }}</ref><ref>{{cite web |url=http://www.numascale.com/numa_technology.html |title=The NumaChip enables cache coherent low cost shared memory |website=Numascale.com |access-date=2014-01-27 |archive-url=https://web.archive.org/web/20140122115025/http://www.numascale.com/numa_technology.html |archive-date=2014-01-22 |url-status=dead }}</ref>
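
As a schematic illustration of what a directory-based protocol tracks (a simplified full-map variant; SCI itself keeps the sharing information as a distributed linked list), the sketch below stores, for each memory block, its state and the set of nodes holding a cached copy, so a write only needs to contact the nodes that actually share the block. It is an assumption-laden illustration, not code from the cited sources.

<syntaxhighlight lang="c">
/* Simplified full-map directory entry and write handling. */
#include <stdint.h>
#include <stdio.h>

enum dir_state { DIR_UNCACHED, DIR_SHARED, DIR_MODIFIED };

struct dir_entry {
    enum dir_state state;
    uint64_t sharers;   /* bit i set = node i holds a copy (full-map variant) */
    unsigned owner;     /* owning node when state == DIR_MODIFIED             */
};

/* On a write by 'writer', every other sharer must be invalidated; the entry
 * ends up MODIFIED and owned by the writer.  Returns the set of nodes that
 * had to be contacted.                                                      */
static uint64_t handle_write(struct dir_entry *e, unsigned writer)
{
    uint64_t invalidate = e->sharers & ~(UINT64_C(1) << writer);
    e->state   = DIR_MODIFIED;
    e->sharers = UINT64_C(1) << writer;
    e->owner   = writer;
    return invalidate;
}

int main(void)
{
    /* block currently cached read-only by nodes 0, 2 and 5 */
    struct dir_entry e = { DIR_SHARED, (1u << 0) | (1u << 2) | (1u << 5), 0 };
    uint64_t victims = handle_write(&e, 2);     /* node 2 wants to write */
    printf("invalidations sent to node bitmap 0x%llx\n",
           (unsigned long long)victims);        /* 0x21: nodes 0 and 5 */
    return 0;
}
</syntaxhighlight>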


==NUMA vs. cluster computing==
One can view NUMA as a tightly coupled form of cluster computing. The addition of virtual memory paging to a cluster architecture can allow the implementation of NUMA entirely in software. However, the inter-node latency of software-based NUMA remains several orders of magnitude greater (slower) than that of hardware-based NUMA.<ref name="nyu-numa"/>

==Software support==
Since NUMA largely influences memory access performance, certain software optimizations are needed to allow scheduling threads and processes close to their in-memory data.

* Microsoft Windows 7 and Windows Server 2008 R2 added support for NUMA architecture over 64 logical cores.<ref>NUMA Support (MSDN)</ref>
* [[Java 7]] added support for NUMA-aware memory allocator and [[Garbage collection (computer science)|garbage collector]].<ref>[http://docs.oracle.com/javase/7/docs/technotes/guides/vm/performance-enhancements-7.html#numa Java HotSpot Virtual Machine Performance Enhancements]</ref>
* [[Linux kernel]]:
** Version 2.5 provided basic NUMA support,<ref>{{Cite web
  | url = https://lse.sourceforge.net/numa/
  | title = Linux Scalability Effort: NUMA Group Homepage
  | date = 2002-11-20 | access-date = 2014-02-06
  | website = SourceForge.net
}}</ref> which was further improved in subsequent kernel releases.
** Version 3.8 of the Linux kernel brought a new NUMA foundation that allowed development of more efficient NUMA policies in later kernel releases.<ref>{{Cite web
  | url = http://kernelnewbies.org/Linux_3.8#head-c16d4288b51f0b50fbf615657e81b0db643fa7a0
  | title = Linux kernel 3.8, Section 1.8. Automatic NUMA balancing
}}</ref><ref>{{Cite web
  | title = NUMA in a hurry
  | date = 2012-11-14 | access-date = 2014-02-06
  | author = Jonathan Corbet |website= [[LWN.net]]
}}</ref>
** Version 3.13 of the Linux kernel brought numerous policies that aim at putting a process near its memory, together with the handling of cases such as having [[memory page]]s shared between processes, or the use of transparent [[huge page]]s; new [[sysctl]] settings allow NUMA balancing to be enabled or disabled, as well as the configuration of various NUMA memory balancing parameters (see the example following this list).<ref>{{Cite web
  | url = http://kernelnewbies.org/Linux_3.13#head-d29c7db2e73bc464eb67ed8de953d0bfc9841636
  | title = Linux kernel 3.13, Section 1.6. Improved performance in NUMA systems
}}</ref><ref>{{Cite web
  | title = Linux kernel documentation: Documentation/sysctl/kernel.txt
  | access-date = 2014-02-06
  |website= [[kernel.org]]
}}</ref><ref>{{Cite web
  | url = https://lwn.net/Articles/568870/
  | title = NUMA scheduling progress
  | date = 2013-10-01 | access-date = 2014-02-06
  | author = Jonathan Corbet |website= [[LWN.net]]
}}</ref>
* [[OpenSolaris]] models NUMA architecture with lgroups.
* [[FreeBSD]] added support for NUMA architecture in version 9.0.<ref>{{Cite web |title=numa(4) |url=https://www.freebsd.org/cgi/man.cgi?numa(4) |access-date=2020-12-03 |website=www.freebsd.org }}</ref>
* [[Silicon Graphics]] [[IRIX]] (discontinued as of 2021) supported ccNUMA architecture over 1240 CPUs with the Origin server series.
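
On Linux, the automatic balancing described above can be switched on or off through the kernel.numa_balancing sysctl, and applications can also place threads and memory explicitly through the libnuma library. The following minimal sketch of such explicit placement is an illustration only (the node number and buffer size are arbitrary), not code from the cited sources.

<syntaxhighlight lang="c">
/* Sketch of the per-node placement that NUMA-aware operating systems try to
 * achieve automatically: run the calling thread on one node and satisfy its
 * allocations from that node's memory (Linux, libnuma).
 * Build with:  cc place.c -lnuma                                            */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support\n");
        return 1;
    }

    int node = 0;                       /* target node chosen for the example    */
    numa_run_on_node(node);             /* restrict this thread to node 0's CPUs */
    numa_set_preferred(node);           /* prefer node 0 for future allocations  */

    size_t len = 16 * 1024 * 1024;
    char *buf = numa_alloc_local(len);  /* memory on the node we now run on      */
    if (!buf)
        return 1;
    memset(buf, 0, len);                /* first touch places the pages locally  */

    printf("thread and %zu-byte buffer both bound to node %d\n", len, node);
    numa_free(buf, len);
    return 0;
}
</syntaxhighlight>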


== Hardware support ==
{{As of|2011|post=,}} ccNUMA systems are multiprocessor systems based on the [[AMD Opteron]] processor, which can be implemented without external logic, and the Intel [[Itanium Processor Family|Itanium processor]], which requires the chipset to support NUMA. Examples of ccNUMA-enabled chipsets are the SGI Shub (Super hub), the Intel E8870, the [[Hewlett-Packard|HP]] sx2000 (used in the Integrity and Superdome servers), and those found in NEC Itanium-based systems. Earlier ccNUMA systems such as those from [[Silicon Graphics]] were based on [[MIPS architecture|MIPS]] processors and the [[Digital Equipment Corporation|DEC]] [[Alpha 21364]] (EV7) processor.


==See also==
{{Div col|colwidth=25em}}
* [[Cache-only memory architecture]] (COMA)
* {{Annotated link|CAS latency}}
* [[HiperDispatch]]
* {{Annotated link|Nodal architecture}}
* {{Annotated link|Partitioned global address space}}
* [[Scratchpad memory]] (SPM)
* [[Uniform memory access]] (UMA)
{{div col end}}


==References==
{{Reflist}}


== External links ==
{{Div col|colwidth=25em}}
* [http://lse.sourceforge.net/numa/faq/ NUMA FAQ]
* [https://web.archive.org/web/20060505025517/http://cs.gmu.edu/cne/modules/dsm/yellow/page_dsm.html Page-based distributed shared memory]
* [https://web.archive.org/web/20060924140915/http://opensolaris.org/os/community/performance/numa/ OpenSolaris NUMA Project]
* [https://web.archive.org/web/20040606042837/http://h18002.www1.hp.com/alphaserver/nextgen/overview.wmv Introduction video for the Alpha EV7 system architecture]
* [http://www.alphaprocessors.com/ More videos related to EV7 systems: CPU, IO, etc]
* [https://web.archive.org/web/20091108151203/http://developer.amd.com/pages/1162007106.aspx NUMA optimization in Windows Applications]
* [https://web.archive.org/web/20110128015336/http://oss.sgi.com/projects/numa/ NUMA Support in Linux at SGI]
* [http://www.realworldtech.com/page.cfm?NewsID=361&amp;date=05-05-2006#361/ Intel Tukwila]
* [http://www.realworldtech.com/page.cfm?ArticleID=RWT082807020032 Intel QPI (CSI) explained]
* [https://web.archive.org/web/20071103124627/http://www.sql-server-performance.com/articles/per/high_call_volume_NUMA_p1.aspx current Itanium NUMA systems]
{{div col end}}


{{Parallel Computing}}


[[Category:Computer memory]]
[[Category:Parallel computing]]
