Matrix multiplication algorithm
Template:Short description Because matrix multiplication is such a central operation in many numerical algorithms, much work has been invested in making matrix multiplication algorithms efficient. Applications of matrix multiplication in computational problems are found in many fields including scientific computing and pattern recognition and in seemingly unrelated problems such as counting the paths through a graph.[1] Many different algorithms have been designed for multiplying matrices on different types of hardware, including parallel and distributed systems, where the computational work is spread over multiple processors (perhaps over a network).
Directly applying the mathematical definition of matrix multiplication gives an algorithm that takes time on the order of n3Script error: No such module "Check for unknown parameters". field operations to multiply two n × nScript error: No such module "Check for unknown parameters". matrices over that field (Θ(n3)Script error: No such module "Check for unknown parameters". in big O notation). Better asymptotic bounds on the time required to multiply matrices have been known since the Strassen's algorithm in the 1960s, but the optimal time (that is, the computational complexity of matrix multiplication) remains unknown. As of April 2024[update]Template:Dated maintenance category (articles)Script error: No such module "Check for unknown parameters"., the best announced bound on the asymptotic complexity of a matrix multiplication algorithm is O(n2.371552)Script error: No such module "Check for unknown parameters". time, given by Williams, Xu, Xu, and Zhou.[2][3] This improves on the bound of O(n2.3728596)Script error: No such module "Check for unknown parameters". time, given by Alman and Williams.[4][5] However, this algorithm is a galactic algorithm because of the large constants and cannot be realized practically.
Iterative algorithm
The definition of matrix multiplication is that if C = ABScript error: No such module "Check for unknown parameters". for an n × mScript error: No such module "Check for unknown parameters". matrix Template:Mvar and an m × pScript error: No such module "Check for unknown parameters". matrix Template:Mvar, then Template:Mvar is an n × pScript error: No such module "Check for unknown parameters". matrix with entries
From this, a simple algorithm can be constructed which loops over the indices Template:Mvar from 1 through Template:Mvar and Template:Mvar from 1 through Template:Mvar, computing the above using a nested loop:
<templatestyles src="Framebox/styles.css" />
- Input: matrices Template:Mvar and Template:Mvar
- Let Template:Mvar be a new matrix of the appropriate size
- For Template:Mvar from 1 to Template:Mvar:
- For Template:Mvar from 1 to Template:Mvar:
- Let sum = 0Script error: No such module "Check for unknown parameters".
- For Template:Mvar from 1 to Template:Mvar:
- Set sum ← sum + Aik × BkjScript error: No such module "Check for unknown parameters".
- Set Cij ← sumScript error: No such module "Check for unknown parameters".
- For Template:Mvar from 1 to Template:Mvar:
- Return Template:Mvar
This algorithm takes time Θ(nmp)Script error: No such module "Check for unknown parameters". (in asymptotic notation).[1] A common simplification for the purpose of algorithm analysis is to assume that the inputs are all square matrices of size n × nScript error: No such module "Check for unknown parameters"., in which case the running time is Θ(n3)Script error: No such module "Check for unknown parameters"., i.e., cubic in the size of the dimension.[6]
Cache behavior
The three loops in iterative matrix multiplication can be arbitrarily swapped with each other without an effect on correctness or asymptotic running time. However, the order can have a considerable impact on practical performance due to the memory access patterns and cache use of the algorithm;[1] which order is best also depends on whether the matrices are stored in row-major order, column-major order, or a mix of both.
In particular, in the idealized case of a fully associative cache consisting of Template:Mvar bytes and Template:Mvar bytes per cache line (i.e. Template:Sfrac cache lines), the above algorithm is sub-optimal for Template:Mvar and Template:Mvar stored in row-major order. When n > Template:SfracScript error: No such module "Check for unknown parameters"., every iteration of the inner loop (a simultaneous sweep through a row of Template:Mvar and a column of Template:Mvar) incurs a cache miss when accessing an element of Template:Mvar. This means that the algorithm incurs Θ(n3)Script error: No such module "Check for unknown parameters". cache misses in the worst case. As of 2010[update]Template:Dated maintenance category (articles)Script error: No such module "Check for unknown parameters"., the speed of memories compared to that of processors is such that the cache misses, rather than the actual calculations, dominate the running time for sizable matrices.[7]
The optimal variant of the iterative algorithm for Template:Mvar and Template:Mvar in row-major layout is a tiled version, where the matrix is implicitly divided into square tiles of size √MScript error: No such module "Check for unknown parameters". by √MScript error: No such module "Check for unknown parameters".:[7][8]
<templatestyles src="Framebox/styles.css" />
- Input: matrices Template:Mvar and Template:Mvar
- Let Template:Mvar be a new matrix of the appropriate size
- Pick a tile size T = Θ(√M)Script error: No such module "Check for unknown parameters".
- For Template:Mvar from 1 to Template:Mvar in steps of Template:Mvar:
- For Template:Mvar from 1 to Template:Mvar in steps of Template:Mvar:
- For Template:Mvar from 1 to Template:Mvar in steps of Template:Mvar:
- Multiply AI:I+T, K:K+TScript error: No such module "Check for unknown parameters". and BK:K+T, J:J+TScript error: No such module "Check for unknown parameters". into CI:I+T, J:J+TScript error: No such module "Check for unknown parameters"., that is:
- For Template:Mvar from Template:Mvar to min(I + T, n)Script error: No such module "Check for unknown parameters".:
- For Template:Mvar from Template:Mvar to min(J + T, p)Script error: No such module "Check for unknown parameters".:
- Let sum = 0Script error: No such module "Check for unknown parameters".
- For Template:Mvar from Template:Mvar to min(K + T, m)Script error: No such module "Check for unknown parameters".:
- Set sum ← sum + Aik × BkjScript error: No such module "Check for unknown parameters".
- Set Cij ← Cij + sumScript error: No such module "Check for unknown parameters".
- For Template:Mvar from Template:Mvar to min(J + T, p)Script error: No such module "Check for unknown parameters".:
- For Template:Mvar from 1 to Template:Mvar in steps of Template:Mvar:
- For Template:Mvar from 1 to Template:Mvar in steps of Template:Mvar:
- Return Template:Mvar
In the idealized cache model, this algorithm incurs only Θ(Template:Sfrac)Script error: No such module "Check for unknown parameters". cache misses; the divisor b √MScript error: No such module "Check for unknown parameters". amounts to several orders of magnitude on modern machines, so that the actual calculations dominate the running time, rather than the cache misses.[7]
Divide-and-conquer algorithm
An alternative to the iterative algorithm is the divide-and-conquer algorithm for matrix multiplication. This relies on the block partitioning
which works for all square matrices whose dimensions are powers of two, i.e., the shapes are 2n × 2nScript error: No such module "Check for unknown parameters". for some Template:Mvar. The matrix product is now
which consists of eight multiplications of pairs of submatrices, followed by an addition step. The divide-and-conquer algorithm computes the smaller multiplications recursively, using the scalar multiplication c11 = a11b11Script error: No such module "Check for unknown parameters". as its base case.
The complexity of this algorithm as a function of Template:Mvar is given by the recurrence[6]
accounting for the eight recursive calls on matrices of size n/2Script error: No such module "Check for unknown parameters". and Θ(n2)Script error: No such module "Check for unknown parameters". to sum the four pairs of resulting matrices element-wise. Application of the master theorem for divide-and-conquer recurrences shows this recursion to have the solution Θ(n3)Script error: No such module "Check for unknown parameters"., the same as the iterative algorithm.[6]
Non-square matrices
A variant of this algorithm that works for matrices of arbitrary shapes and is faster in practice[7] splits matrices in two instead of four submatrices, as follows.[9] Splitting a matrix now means dividing it into two parts of equal size, or as close to equal sizes as possible in the case of odd dimensions.
<templatestyles src="Framebox/styles.css" />
- Inputs: matrices Template:Mvar of size n × mScript error: No such module "Check for unknown parameters"., Template:Mvar of size m × pScript error: No such module "Check for unknown parameters"..
- Base case: if max(n, m, p)Script error: No such module "Check for unknown parameters". is below some threshold, use an unrolled version of the iterative algorithm.
- Recursive cases:
- If max(n, m, p) = nScript error: No such module "Check for unknown parameters"., split Template:Mvar horizontally:
- Else, if max(n, m, p) = pScript error: No such module "Check for unknown parameters"., split Template:Mvar vertically:
- Otherwise, max(n, m, p) = mScript error: No such module "Check for unknown parameters".. Split Template:Mvar vertically and Template:Mvar horizontally:
Cache behavior
The cache miss rate of recursive matrix multiplication is the same as that of a tiled iterative version, but unlike that algorithm, the recursive algorithm is cache-oblivious:[9] there is no tuning parameter required to get optimal cache performance, and it behaves well in a multiprogramming environment where cache sizes are effectively dynamic due to other processes taking up cache space.[7] (The simple iterative algorithm is cache-oblivious as well, but much slower in practice if the matrix layout is not adapted to the algorithm.)
The number of cache misses incurred by this algorithm, on a machine with Template:Mvar lines of ideal cache, each of size Template:Mvar bytes, is bounded byTemplate:RTemplate:Rp
Sub-cubic algorithms
Script error: No such module "labelled list hatnote".
Algorithms exist that provide better running times than the straightforward ones. The first to be discovered was Strassen's algorithm, devised by Volker Strassen in 1969 and often referred to as "fast matrix multiplication". It is based on a way of multiplying two 2×2 matrices which requires only 7 multiplications (instead of the usual 8), at the expense of several additional addition and subtraction operations. Applying this recursively gives an algorithm with a multiplicative cost of . Strassen's algorithm is more complex, and the numerical stability is reduced compared to the naïve algorithm,[10] but it is faster in cases where n > 100Script error: No such module "Check for unknown parameters". or so[1] and appears in several libraries, such as BLAS.[11] It is very useful for large matrices over exact domains such as finite fields, where numerical stability is not an issue.
Since Strassen's algorithm is actually used in practical numerical software and computer algebra systems, improving on the constants hidden in the big-O notation has its merits. A table that compares key aspects of the improved version based on recursive multiplication of 2×2-block matrices via 7 block matrix multiplications follows. As usual, gives the dimensions of the matrix and designates the memory size.
| Year | Reference | #matrix multiplications per step | #matrix additions per step | total arithmetic operations | total I/O-complexity |
|---|---|---|---|---|---|
| 1969 | Strassen[12] | 7 | 18 | ||
| 1971 | Winograd[13] | 7 | 15 | ||
| 2017 | Karstadt, Schwartz[14] | 7 | 12 | ||
| 2023 | Schwartz, Vaknin[15] | 7 | 12 |
It is known that a Strassen-like algorithm with a 2×2-block matrix step requires at least 7 block matrix multiplications. In 1976 Probert[16] showed that such an algorithm requires at least 15 additions (including subtractions); however, a hidden assumption was that the blocks and the 2×2-block matrix are represented in the same basis. Karstadt and Schwartz computed in different bases and traded 3 additions for less expensive basis transformations. They also proved that one cannot go below 12 additions per step using different bases. In subsequent work Beniamini et el.[17] applied this base-change trick to more general decompositions than 2×2-block matrices and improved the leading constant for their run times.
It is an open question in theoretical computer science how well Strassen's algorithm can be improved in terms of asymptotic complexity. The matrix multiplication exponent, usually denoted , is the smallest real number for which any matrix over a field can be multiplied together using field operations. The current best bound on is , by Williams, Xu, Xu, and Zhou.[2][4] This algorithm, like all other recent algorithms in this line of research, is a generalization of the Coppersmith–Winograd algorithm, which was given by Don Coppersmith and Shmuel Winograd in 1990.[18] The conceptual idea of these algorithms is similar to Strassen's algorithm: a way is devised for multiplying two k × kScript error: No such module "Check for unknown parameters".-matrices with fewer than k3Script error: No such module "Check for unknown parameters". multiplications, and this technique is applied recursively. However, the constant coefficient hidden by the big-O notation is so large that these algorithms are only worthwhile for matrices that are too large to handle on present-day computers.[19][20] Victor Pan proposed so-called feasible sub-cubic matrix multiplication algorithms with an exponent slightly above 2.77, but in return with a much smaller hidden constant coefficient. [21]
Freivalds' algorithm is a simple Monte Carlo algorithm that, given matrices Template:Mvar, Template:Mvar and Template:Mvar, verifies in Θ(n2)Script error: No such module "Check for unknown parameters". time if AB = CScript error: No such module "Check for unknown parameters"..
AlphaTensor
In 2022, DeepMind introduced AlphaTensor, a neural network that used a single-player game analogy to invent thousands of matrix multiplication algorithms, including some previously discovered by humans and some that were not.[22] Operations were restricted to the Script error: No such module "Unsubst".(normal arithmetic) and finite field (mod 2 arithmetic). The best "practical" (explicit low-rank decomposition of a matrix multiplication tensor) algorithm found ran in O(n2.778).[23] Finding low-rank decompositions of such tensors (and beyond) is NP-hard; optimal multiplication even for 3×3 matrices remains unknown, even in commutative field.[23] On 4×4 matrices, AlphaTensor unexpectedly discovered a solution with 47 multiplication steps, an improvement over the 49 required with Strassen’s algorithm of 1969, albeit restricted to mod 2 arithmetic. Similarly, AlphaTensor solved 5×5 matrices with 96 rather than Strassen's 98 steps. Based on the surprising discovery that such improvements exist, other researchers were quickly able to find a similar independent 4×4 algorithm, and separately tweaked Deepmind's 96-step 5×5 algorithm down to 95 steps in mod 2 arithmetic and to 97[24] in normal arithmetic.[25] Some algorithms were completely new: for example, (4, 5, 5) was improved to 76 steps from a baseline of 80 in both normal and mod 2 arithmetic.
Parallel and distributed algorithms
The divide-and-conquer algorithm sketched earlier can be parallelized in two ways for shared-memory multiprocessors. These are based on the fact that the eight recursive matrix multiplications in
can be performed independently of each other, as can the four summations (although the algorithm needs to "join" the multiplications before doing the summations). Exploiting the full parallelism of the problem, one obtains an algorithm that can be expressed in fork–join style pseudocode:[26]
<templatestyles src="Framebox/styles.css" />
Procedure multiply(C, A, B)Script error: No such module "Check for unknown parameters".:
- Base case: if n = 1Script error: No such module "Check for unknown parameters"., set c11 ← a11 × b11Script error: No such module "Check for unknown parameters". (or multiply a small block matrix).
- Otherwise, allocate space for a new matrix Template:Mvar of shape n × nScript error: No such module "Check for unknown parameters"., then:
- Partition Template:Mvar into A11Script error: No such module "Check for unknown parameters"., A12Script error: No such module "Check for unknown parameters"., A21Script error: No such module "Check for unknown parameters"., A22Script error: No such module "Check for unknown parameters"..
- Partition Template:Mvar into B11Script error: No such module "Check for unknown parameters"., B12Script error: No such module "Check for unknown parameters"., B21Script error: No such module "Check for unknown parameters"., B22Script error: No such module "Check for unknown parameters"..
- Partition Template:Mvar into C11Script error: No such module "Check for unknown parameters"., C12Script error: No such module "Check for unknown parameters"., C21Script error: No such module "Check for unknown parameters"., C22Script error: No such module "Check for unknown parameters"..
- Partition Template:Mvar into T11Script error: No such module "Check for unknown parameters"., T12Script error: No such module "Check for unknown parameters"., T21Script error: No such module "Check for unknown parameters"., T22Script error: No such module "Check for unknown parameters"..
- Parallel execution:
- Fork multiply(C11, A11, B11)Script error: No such module "Check for unknown parameters"..
- Fork multiply(C12, A11, B12)Script error: No such module "Check for unknown parameters"..
- Fork multiply(C21, A21, B11)Script error: No such module "Check for unknown parameters"..
- Fork multiply(C22, A21, B12)Script error: No such module "Check for unknown parameters"..
- Fork multiply(T11, A12, B21)Script error: No such module "Check for unknown parameters"..
- Fork multiply(T12, A12, B22)Script error: No such module "Check for unknown parameters"..
- Fork multiply(T21, A22, B21)Script error: No such module "Check for unknown parameters"..
- Fork multiply(T22, A22, B22)Script error: No such module "Check for unknown parameters"..
- Join (wait for parallel forks to complete).
- add(C, T)Script error: No such module "Check for unknown parameters"..
- Deallocate Template:Mvar.
Procedure add(C, T)Script error: No such module "Check for unknown parameters". adds Template:Mvar into Template:Mvar, element-wise:
- Base case: if n = 1Script error: No such module "Check for unknown parameters"., set c11 ← c11 + t11Script error: No such module "Check for unknown parameters". (or do a short loop, perhaps unrolled).
- Otherwise:
- Partition Template:Mvar into C11Script error: No such module "Check for unknown parameters"., C12Script error: No such module "Check for unknown parameters"., C21Script error: No such module "Check for unknown parameters"., C22Script error: No such module "Check for unknown parameters"..
- Partition Template:Mvar into T11Script error: No such module "Check for unknown parameters"., T12Script error: No such module "Check for unknown parameters"., T21Script error: No such module "Check for unknown parameters"., T22Script error: No such module "Check for unknown parameters"..
- In parallel:
- Fork add(C11, T11)Script error: No such module "Check for unknown parameters"..
- Fork add(C12, T12)Script error: No such module "Check for unknown parameters"..
- Fork add(C21, T21)Script error: No such module "Check for unknown parameters"..
- Fork add(C22, T22)Script error: No such module "Check for unknown parameters"..
- Join.
Here, fork is a keyword that signal a computation may be run in parallel with the rest of the function call, while join waits for all previously "forked" computations to complete. partitionScript error: No such module "Check for unknown parameters". achieves its goal by pointer manipulation only.
This algorithm has a critical path length of Θ(log2 n)Script error: No such module "Check for unknown parameters". steps, meaning it takes that much time on an ideal machine with an infinite number of processors; therefore, it has a maximum possible speedup of Θ(n3/log2 n)Script error: No such module "Check for unknown parameters". on any real computer. The algorithm isn't practical due to the communication cost inherent in moving data to and from the temporary matrix Template:Mvar, but a more practical variant achieves Θ(n2)Script error: No such module "Check for unknown parameters". speedup, without using a temporary matrix.[26]
Communication-avoiding and distributed algorithms
On modern architectures with hierarchical memory, the cost of loading and storing input matrix elements tends to dominate the cost of arithmetic. On a single machine this is the amount of data transferred between RAM and cache, while on a distributed memory multi-node machine it is the amount transferred between nodes; in either case it is called the communication bandwidth. The naïve algorithm using three nested loops uses Ω(n3)Script error: No such module "Check for unknown parameters". communication bandwidth.
Cannon's algorithm, also known as the 2D algorithm, is a communication-avoiding algorithm that partitions each input matrix into a block matrix whose elements are submatrices of size √M/3Script error: No such module "Check for unknown parameters". by √M/3Script error: No such module "Check for unknown parameters"., where MScript error: No such module "Check for unknown parameters". is the size of fast memory.[27] The naïve algorithm is then used over the block matrices, computing products of submatrices entirely in fast memory. This reduces communication bandwidth to O(n3/√M)Script error: No such module "Check for unknown parameters"., which is asymptotically optimal (for algorithms performing Ω(n3)Script error: No such module "Check for unknown parameters". computation).[28][29]
In a distributed setting with Template:Mvar processors arranged in a √pScript error: No such module "Check for unknown parameters". by √pScript error: No such module "Check for unknown parameters". 2D mesh, one submatrix of the result can be assigned to each processor, and the product can be computed with each processor transmitting O(n2/√p)Script error: No such module "Check for unknown parameters". words, which is asymptotically optimal assuming that each node stores the minimum O(n2/p)Script error: No such module "Check for unknown parameters". elements.[29] This can be improved by the 3D algorithm, which arranges the processors in a 3D cube mesh, assigning every product of two input submatrices to a single processor. The result submatrices are then generated by performing a reduction over each row.[30] This algorithm transmits O(n2/p2/3)Script error: No such module "Check for unknown parameters". words per processor, which is asymptotically optimal.[29] However, this requires replicating each input matrix element p1/3Script error: No such module "Check for unknown parameters". times, and so requires a factor of p1/3Script error: No such module "Check for unknown parameters". more memory than is needed to store the inputs. This algorithm can be combined with Strassen to further reduce runtime.[30] "2.5D" algorithms provide a continuous tradeoff between memory usage and communication bandwidth.[31] On modern distributed computing environments such as MapReduce, specialized multiplication algorithms have been developed.[32]
Algorithms for meshes
There are a variety of algorithms for multiplication on meshes. For multiplication of two n×n on a standard two-dimensional mesh using the 2D Cannon's algorithm, one can complete the multiplication in 3n-2 steps although this is reduced to half this number for repeated computations.[33] The standard array is inefficient because the data from the two matrices does not arrive simultaneously and it must be padded with zeroes.
The result is even faster on a two-layered cross-wired mesh, where only 2n-1 steps are needed.[34] The performance improves further for repeated computations leading to 100% efficiency.[35] The cross-wired mesh array may be seen as a special case of a non-planar (i.e. multilayered) processing structure.[36]
In a 3D mesh with n3 processing elements, two matrices can be multiplied in using the DNS algorithm.[37]
See also
- Computational complexity of mathematical operations
- Computational complexity of matrix multiplication
- CYK algorithm § Valiant's algorithm
- Matrix chain multiplication
- Method of Four Russians
- Multiplication algorithm
- Sparse matrix–vector multiplication
References
<templatestyles src="Reflist/styles.css" />
- ↑ a b c d Script error: No such module "citation/CS1".
- ↑ a b Script error: No such module "citation/CS1".
- ↑ Script error: No such module "citation/CS1".
- ↑ a b Script error: No such module "citation/CS1".
- ↑ Script error: No such module "citation/CS1".
- ↑ a b c Script error: No such module "citation/CS1".
- ↑ a b c d e Script error: No such module "citation/CS1".
- ↑ Script error: No such module "citation/CS1".
- ↑ a b Template:Cite thesis
- ↑ Script error: No such module "citation/CS1".
- ↑ Script error: No such module "citation/CS1".
- ↑ Script error: No such module "Citation/CS1".
- ↑ Script error: No such module "Citation/CS1".
- ↑ Script error: No such module "citation/CS1".
- ↑ Script error: No such module "citation/CS1".
- ↑ Script error: No such module "Citation/CS1".
- ↑ Script error: No such module "citation/CS1".
- ↑ Script error: No such module "citation/CS1".
- ↑ Script error: No such module "citation/CS1".
- ↑ Script error: No such module "citation/CS1".
- ↑ Script error: No such module "citation/CS1".
- ↑ Script error: No such module "citation/CS1".
- ↑ a b Script error: No such module "Citation/CS1".
- ↑ Script error: No such module "citation/CS1".
- ↑ Script error: No such module "citation/CS1".
- ↑ a b Template:Cite thesis
- ↑ Template:Cite thesis
- ↑ Script error: No such module "citation/CS1".
- ↑ a b c Script error: No such module "Citation/CS1".
- ↑ a b Script error: No such module "Citation/CS1".
- ↑ Script error: No such module "citation/CS1".
- ↑ Script error: No such module "citation/CS1".
- ↑ Script error: No such module "Citation/CS1".
- ↑ Script error: No such module "Citation/CS1".
- ↑ Script error: No such module "citation/CS1".
- ↑ Script error: No such module "Citation/CS1".
- ↑ Script error: No such module "Citation/CS1".
Script error: No such module "Check for unknown parameters".
Further reading
<templatestyles src="Refbegin/styles.css" />
- Script error: No such module "Citation/CS1".
- Script error: No such module "Citation/CS1".
- Script error: No such module "Citation/CS1".
- How To Optimize GEMM