Strength Reduction of Multiplications by Integer Constants

Youfeng Wu, [email protected]
Sequent Computer Systems, Inc., D2-798
15450 S. W. Koll Parkway, Beaverton, OR 97006-6063

Abstract: Replacing a multiplication by an integer constant with a sequence of simple instructions can be beneficial because it reduces the total number of cycles and also provides more opportunity for instruction-level parallelism. In this paper we develop an algorithm that selectively replaces multiplications by integer constants with sequences of SUB, ADD, SHIFT, and LEA instructions. The algorithm runs fast and the generated instruction sequences are of high quality. The number of simple instructions generated for a multiplication by an integer constant is about 84.5% of the number of 1-bits in the binary representation of the constant. Also, for several popular processors, the simple instruction sequences significantly reduce the execution time needed to perform the multiplications.

Key words: Integer multiplication, code generation, code optimization, instruction-level parallelism.

1. Introduction

For pipelined or superscalar processors, a multiplication is usually many times slower than simple instructions such as ADD, SUB, SHIFT (X SHIFT Y = X * 2^Y), or LEA (LEA(B, I, S) = B + I * 2^S, where S = 1, 2, or 3). Replacing a multiplication with a sequence of simple instructions reduces the number of cycles taken by the operation and also provides more opportunity for instruction-level parallelism. For example, on the Intel Pentium processor [1], a multiplication of a 4-byte integer takes at least 10 cycles, but each ADD, SUB, SHIFT, or LEA takes only one cycle. If we replace the multiplication reg0*136 by the following two instructions (note that 136 = 128 + 8),

    reg1 = reg0 SHIFT 7          -- reg1 = reg0 * 128
    reg2 = LEA (reg1, reg0, 3)   -- reg2 = reg1 + reg0 * 8

the multiplication will take only 2 cycles (plus a possible address generation interlock cycle; see footnote 1).

Since the speed discrepancy between the multiplication and the sequence of simple instructions is so high, we can usually use a sequence of more than two instructions and still gain a performance improvement. For example, we can use the following five instructions to perform reg0 * 1950 and thereby use only 5 cycles (note that 1950 = 2048 - ((1 + 2) * 32 + 2)):

    reg1 = LEA (reg0, reg0, 1)   -- reg1 = reg0 + reg0 * 2
    reg2 = reg1 SHIFT 5          -- reg2 = reg1 * 32
    reg3 = LEA (reg2, reg0, 1)   -- reg3 = reg2 + reg0 * 2
    reg4 = reg0 SHIFT 11         -- reg4 = reg0 * 2048
    reg5 = reg4 SUB reg3         -- reg5 = reg4 - reg3

Besides saving CPU cycles, a sequence of simple instructions is the RISC style of coding and is more suitable for instruction-level parallelism. For example, the fourth instruction in the above sequence can be run in parallel with any of the first three instructions on a processor that can issue two or more instructions per cycle.

This paper discusses how to convert multiplications by integer constants into sequences of SUB/ADD/SHIFT/LEA instructions. More powerful instructions will not be considered. We also assume that the integer constant is positive; negative constants can be handled as suggested by Bernstein [4].

Section 2 formulates the problem. Sections 3 and 4 describe two previous solutions. Section 5 develops our techniques. Section 6 presents the performance experiments and the results. Section 7 briefly discusses the overflow issue. Section 8 concludes the paper.

2. Problem Formulation

If only ADD is allowed in the simple instruction sequence, an elegant mathematical formulation of the problem can be found in Knuth [2], namely the concept of an "addition-chain". For a multiplication X * N, where X is an unknown and N is a positive integer, an "addition-chain" for N is defined as a sequence of integers
    1 = a_0, a_1, a_2, ..., a_r = N
that are generated by the rule

    a_k = a_j + a_i, for some i, j such that k > j >= i,

for all k in [1...r]. Once an addition-chain is found, the calculation of X * N can be performed by the following sequence of additions:

    a_0 = X
    a_k = a_j + a_i,   k > j >= i
    X * N = a_r.

For example, an addition-chain for the number 9 can be "1, 2, 4, 5, 9" because 9 = 5 + 4, 5 = 4 + 1, 4 = 2 + 2, and 2 = 1 + 1. From this addition-chain, the following sequence of additions can be generated to calculate reg0*9:

    reg1 = reg0 + reg0
    reg2 = reg1 + reg1
    reg3 = reg2 + reg0
    reg4 = reg3 + reg2

It is not difficult to find an addition-chain for any integer constant N. For example, if N is an even number, N = N/2 + N/2, and if N is an odd number, N = (N/2 + 1) + N/2 (where N/2 denotes integer division), and we can repeat this decomposition on N/2 until it reaches 1. However, the problem of finding the optimal chain (the one with the smallest number of additions) is believed to be NP-complete.

In the case where ADD, SUB, SHIFT, and LEA instructions are all allowed, Daniel Magenheimer et al. [3] extended the concept of the addition-chain to a "SUB/ADD/SHIFT/LEA-chain" (we will call it a SASL-chain). That is, the rule
1 We will ignore the address generation interlock cycles [1] in the rest of the paper for ease of discussion.
ACM SIGPLAN Notices, Volume 30, No. 2, February 1995
    a_k = a_j + a_i, for some i, j such that k > j >= i, for all k in [1...r]

is extended to

    a_k = a_j + a_i, or
    a_k = a_j - a_i, or
    a_k = a_j SHIFT a_i, or
    a_k = LEA(a_i, a_j, 1), or
    a_k = LEA(a_i, a_j, 2), or
    a_k = LEA(a_i, a_j, 3),

for some i, j such that k > j >= i, for all k in [1...r].

So the problem of replacing a multiplication by an integer constant N with a sequence of SUB/ADD/SHIFT/LEA instructions is equivalent to the problem of finding a SASL-chain for N. For example, a SASL-chain for 9 is "1, 9" because 9 = LEA(1, 1, 3), and the multiplication reg0*9 can be performed with LEA(reg0, reg0, 3).

Although including SUB/SHIFT/LEA instructions can reduce the length of the simple instruction sequence, it makes the search space for the minimal sequence significantly larger than that for an optimal addition-chain, and it is believed that the problem of finding the shortest SASL-chain for an integer constant N remains NP-complete. Any practical solution for finding a good SASL-chain for a constant will probably involve heuristics.
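The chains discussed in Sections 1 and 2 can be checked mechanically. The following is a minimal Python sketch, not part of the original paper: the tuple encoding of instructions and the names run_chain and halving_chain are hypothetical, and SHIFT is assumed to take a constant shift count. It evaluates a SUB/ADD/SHIFT/LEA sequence on a sample operand and builds the simple (not necessarily optimal) addition-chain obtained by repeated halving.

```python
def run_chain(steps, x=1):
    """Evaluate an instruction sequence; regs[0] holds the operand X.

    Hypothetical encoding (an assumption of this sketch):
      ("ADD", i, j)    -> regs[i] + regs[j]
      ("SUB", i, j)    -> regs[i] - regs[j]
      ("SHIFT", i, s)  -> regs[i] * 2**s, with s a constant
      ("LEA", b, i, s) -> regs[b] + regs[i] * 2**s
    Each step appends its result as the next register.
    """
    regs = [x]
    for step in steps:
        op = step[0]
        if op == "ADD":
            _, i, j = step
            regs.append(regs[i] + regs[j])
        elif op == "SUB":
            _, i, j = step
            regs.append(regs[i] - regs[j])
        elif op == "SHIFT":
            _, i, s = step
            regs.append(regs[i] << s)
        elif op == "LEA":
            _, b, i, s = step
            regs.append(regs[b] + (regs[i] << s))
        else:
            raise ValueError(f"unknown op {op!r}")
    return regs[-1]


def halving_chain(n):
    """Addition-chain from the rule N = N/2 + N/2 or N = (N/2 + 1) + N/2."""
    if n == 1:
        return [1]
    half = n // 2
    chain = halving_chain(half)
    if n % 2 == 1:
        chain.append(half + 1)  # half + 1 = half + a_0
    chain.append(n)             # n = half + half, or (half + 1) + half
    return chain


# The sequences from the text, written in the hypothetical encoding.
MUL_136 = [("SHIFT", 0, 7), ("LEA", 1, 0, 3)]
MUL_1950 = [("LEA", 0, 0, 1), ("SHIFT", 1, 5), ("LEA", 2, 0, 1),
            ("SHIFT", 0, 11), ("SUB", 4, 3)]
MUL_9_ADD = [("ADD", 0, 0), ("ADD", 1, 1), ("ADD", 2, 0), ("ADD", 3, 2)]
MUL_9_LEA = [("LEA", 0, 0, 3)]
```

With x = 1 the returned value is the constant itself, so run_chain(MUL_9_LEA) should give 9 with a single instruction, whereas the addition-only sequence needs four.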
3. A Branch and Bound Heuristic

Robert Bernstein [4] provided a nice "branch and bound" algorithm for searching for a sequence of ADD/SUB/SHIFT instructions to replace a multiplication by an integer constant. For a number N, Bernstein's algorithm first recursively searches for the simple instruction sequences for N + 1, N - 1, and N / 2^i. Then it chooses the one with the smallest cost and applies an ADD, SUB, or SHIFT to obtain the sequence for N. A source program for this algorithm is available from Preston Briggs [5]. We extended it to search for the sequence for N / (2^i + 1) as well, so as to use LEA instructions as much as possible.

This heuristic can discover many good sequences. But the extended Bernstein algorithm still does not take full advantage of the LEA instruction. For example, the SASL-chain it discovered for 29 is (note that 29 = (8 - 1) * 4 + 1):

    8 = 1 SHIFT 3
    7 = 8 SUB 1
    28 = 7 SHIFT 2
    29 = 28 ADD 1.

However, 29 = 32 - 3, so the following three instructions are sufficient:

    3 = LEA (1, 1, 1)
    32 = 1 SHIFT 5
    29 = 32 SUB 3.

Also, the extended Bernstein algorithm is computationally expensive. Later we will compare the performance of this algorithm with that of our algorithm.

4. A Rule-based Heuristic

The rule-based system used by Daniel Magenheimer et al. [3] is very powerful. Many tricks that we humans do can be specified as rules, and they may be picked up by the inference engine. Magenheimer et al. [3] mentioned that the rule-based approach generated optimal sequences for all but 12 of the numbers under 10,000. We believe this can be done, and it should be simple to add more rules to cover the 12 cases. However, we speculate that the rule-based approach may not run as fast as a procedural approach. Also, it may not be cost-effective to include a rule-based system in an existing compiler solely for the purpose of reducing the strength of multiplications by integer constants. This situation may change because people are proposing rule-based approaches for several other compiler optimizations.

5. Our New Heuristics

For converting a multiplication by an integer constant to a sequence of ADD/SUB/SHIFT/LEA instructions, our algorithm consists of a basic part, which generates an LEA and possibly a SHIFT for each 1-bit in the binary representation of the constant, and two additional parts that try to reduce the number of 1-bits in the constant quickly. In the following discussions, we assume LEA (B, I, S) = B + I * 2^S, where S = 1, 2, ..., or MAX_S.

5.1. The Basic Algorithm

Each integer constant N can be represented as:

    N = \sum_{i=0}^{k} a_i * 2^i, where a_i = 0 or 1.

Assume a_u and a_v are the lowest two 1-bits and u < v. Namely, a_u = a_v = 1, and a_j = 0 for any j such that j < u or u < j < v. The basic algorithm considers the following three cases.

Case 1. u = 0. This means a_0 = 1. Assume t = min (v, MAX_S). We have N = 1 + 2^t * N'. We can first calculate N' and then obtain N using the following LEA instruction:

    N = LEA (1, N', t)

Case 2. 0 < u <= MAX_S. That is, N = 2^u + N'. In this case, we can first calculate N' and then obtain N using an LEA instruction, as follows:

    N = LEA (N', 1, u)

Case 3. u > MAX_S. That is, N = 2^u * N'. In this case, we can first calculate N' and then obtain N using a SHIFT instruction, as follows:

    N = N' SHIFT u.

To implement the basic algorithm efficiently, we introduce the following "condensed representation" of an integer number.
For an integer number N = \sum_{i=0}^{k} a_i * 2^i, assume the number of 1-bits in its binary representation is r, and the 1-bits in the binary sequence are at the following locations:

    L_0, L_1, ..., L_{r-1}, where L_{i+1} > L_i >= 0.

That is,

    N = \sum_{j=0}^{r-1} a_{L_j} * 2^{L_j}.

The condensed representation of the number N is R_N[0, ..., r-1], where R_N[0] = L_0 and R_N[i] = L_i - L_{i-1} for i = 1, ..., r-1. For example, 1951 = 2^0 + 2^1 + 2^2 + 2^3 + 2^4 + 2^7 + 2^8 + 2^9 + 2^10 can be represented as [0,1,1,1,1,3,1,1,1].

The basic algorithm is described in Algorithm 1, where the function decompose_mul() will be described in Section 5.4. At this point, we can assume it is a call to basic_decompose_mul itself and that the recursion will terminate.

Entry name: basic_decompose_mul (R, r, N)
Input:   N -- the constant to be multiplied
         r -- the number of 1-bits in N
         R[0...r-1] -- the condensed representation of N
Output:  emit LEA/SHIFT instructions for multiplying the constant N and return the root of the instruction tree emitted.
Process:
    if (R[0] == 0) {                 /* Case 1 */
        t = min (MAX_S, R[1]);
        R[1] -= t;
        N' = decompose_mul (R[1...r-1], r - 1, N);
        return emit_LEA (1, N', t);
    } else if (R[0] <= MAX_S) {      /* Case 2 */
        r0 = R[0];
        R[1] += r0;
        N' = decompose_mul (R[1...r-1], r - 1, N);
        return emit_LEA (N', 1, r0);
    } else {                         /* Case 3 */
        r0 = R[0];
        R[0] = 0;
        N' = decompose_mul (R, r, N);
        return emit_SHIFT (N', r0);
    }

[4,2]. R_80[0] = 4 requires that Case 3 be applied, and we have 80 = N' SHIFT 4, where N' = 5 = 2^0 + 2^2. R_5[0..1] = [0, R_80[1]] = [0, 2]. After applying Case 1 again, we have 5 = LEA (1, 1, 2). Altogether this algorithm produces the following instruction sequence (note that 657 = ((1 + 1*4)*16 + 2)*8 + 1):

    5 = LEA (1, 1, 2)
    80 = 5 SHIFT 4
    82 = LEA (80, 1, 1)
    657 = LEA (1, 82, 3).

Note that 657 has four 1-bits, and the simple instruction sequence generated for it by the basic algorithm uses three LEA instructions and one SHIFT instruction. In general, we have the following proposition about the simple instruction sequences generated by the basic algorithm.

Proposition: Assume a number N has r 1-bits. The basic algorithm uses r - 1 LEA instructions and no more than

    1 + \lfloor (\log_2 N - MAX_S) / (2*MAX_S + 2) \rfloor

SHIFT instructions.

Proof: For both Case 1 and Case 2, an LEA is used and N' has one fewer 1-bit than N. For Case 3, no LEA is used, and N' falls under Case 1 and has the same number of 1-bits as N. When N = 1, no LEA is needed. So the number of LEA instructions used to calculate N is r - 1.

Furthermore, SHIFT instructions are used only in Case 3. The first SHIFT is used only if u >= 1 + MAX_S. When a SHIFT is used in Case 3, N' falls under Case 1. For a number in Case 1 to reach Case 3, there are two paths: 1) reaching Case 3 directly, or 2) first reaching Case 2 and then Case 3. If the number reaches Case 3 directly from Case 1, we must have v - u >= 1 + 2*MAX_S. That is, at least 2*MAX_S + 2 bits are consumed in going from Case 1 to Case 3. If the number goes to Case 2 and then to Case 3, going from Case 1 to Case 2 consumes at least MAX_S + 1 bits, and going from Case 2 to Case 3 consumes at least MAX_S + 1 bits. Again, at least 2*MAX_S + 2 bits are consumed in going from Case 1 to Case 3.

Any number N has at most 1 + \log_2 N significant bits. When the basic algorithm is applied to it, the first SHIFT consumes at least MAX_S + 1 bits, and each subsequent SHIFT consumes at least 2*MAX_S + 2 bits. So the number of SHIFT instructions, S, satisfies:

    MAX_S + 1 + (S - 1) * (2*MAX_S + 2) <= 1 + \log_2 N

That is,

    S <= 1 + (\log_2 N - MAX_S) / (2*MAX_S + 2).
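As a rough illustration of the basic algorithm of Section 5.1, the following Python sketch computes the condensed representation and applies the three cases recursively. It is a sketch under stated assumptions, not the paper's implementation: the names condensed and basic_decompose, the tuple format of the emitted instructions, the choice MAX_S = 3, and in particular the base case for a constant with a single 1-bit are hypothetical, since the text defers termination to decompose_mul() in Section 5.4. Values are tracked as multiples of the operand, so the return value should equal N.

```python
MAX_S = 3  # assumption: largest LEA scale exponent, as in the Section 1 examples


def condensed(n):
    """Condensed representation R_N: R[0] is the lowest 1-bit position,
    R[i] the distance from the (i-1)-th 1-bit to the i-th 1-bit."""
    pos = [i for i in range(n.bit_length()) if (n >> i) & 1]
    return [pos[0]] + [b - a for a, b in zip(pos, pos[1:])]


def basic_decompose(R, out):
    """Append LEA/SHIFT tuples for the constant encoded by R to `out`
    and return the value computed (as a multiple of the operand)."""
    R = list(R)  # work on a copy so the caller's list is not modified
    if len(R) == 1:
        # Assumed base case (not spelled out in the text): a single 1-bit.
        if R[0] == 0:
            return 1                     # N == 1: the operand itself
        out.append(("SHIFT", 1, R[0]))   # N == 2^R[0]
        return 1 << R[0]
    if R[0] == 0:
        # Case 1: N = 1 + 2^t * N'
        t = min(MAX_S, R[1])
        R[1] -= t
        sub = basic_decompose(R[1:], out)
        out.append(("LEA", 1, sub, t))
        return 1 + (sub << t)
    if R[0] <= MAX_S:
        # Case 2: N = N' + 2^u
        u = R[0]
        R[1] += u
        sub = basic_decompose(R[1:], out)
        out.append(("LEA", sub, 1, u))
        return sub + (1 << u)
    # Case 3: N = N' * 2^u
    u = R[0]
    R[0] = 0
    sub = basic_decompose(R, out)
    out.append(("SHIFT", sub, u))
    return sub << u


def count(out, op):
    """Number of emitted instructions with the given opcode."""
    return sum(1 for ins in out if ins[0] == op)


seq_657 = []
value_657 = basic_decompose(condensed(657), seq_657)
```

Under these assumptions, seq_657 should list the same four instructions as the 657 example in the text (three LEA instructions and one SHIFT), with each tuple holding the opcode followed by its operand values and the shift or scale amount.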