Last time, in the performance pattern series, I wrote about how a bitwise equivalent implementation of the modulo operator was a significantly faster alternative to the modulo operator (around 5x faster). This time, I look at a possible ~20x speedup simply using multiplication in place of the Math.pow() operator. As a reference implementation I use the cumulative distribution function from the Black Scholes algorithm (and the Greeks) because it uses the Math.pow() function quite a bit. First – let’s look at the pow based implementation.

### Cumulative distribution function – Math.pow() based implementation

[java]

package name.dhruba.black.scholes.utils;

import static java.lang.Math.PI;

import static java.lang.Math.abs;

import static java.lang.Math.exp;

import static java.lang.Math.pow;

import static java.lang.Math.sqrt;

public enum CumulativeDistributionUsingPow {

_;

private static final double P = 0.2316419;

private static final double B1 = 0.319381530;

private static final double B2 = -0.356563782;

private static final double B3 = 1.781477937;

private static final double B4 = -1.821255978;

private static final double B5 = 1.330274429;

public static double cumulativeDistribution(double x) {

double t = 1 / (1 + P * abs(x));

double t1 = B1 * pow(t, 1);

double t2 = B2 * pow(t, 2);

double t3 = B3 * pow(t, 3);

double t4 = B4 * pow(t, 4);

double t5 = B5 * pow(t, 5);

double b = t1 + t2 + t3 + t4 + t5;

double cd = 1 – standardNormalDistribution(x) * b;

return x < 0 ? 1 – cd : cd;

}

public static double standardNormalDistribution(double x) {

double top = exp(-0.5 * pow(x, 2));

double bottom = sqrt(2 * PI);

return top / bottom;

}

}

[/java]

As you can see there are a number of calls to Math.pow() in both the cumulative distribution function and the standard normal distribution. Let us now replace those with equivalent multiplication operations.

### Cumulative distribution function – Multiplication based implementation

[java]

package name.dhruba.black.scholes.utils;

import static java.lang.Math.PI;

import static java.lang.Math.abs;

import static java.lang.Math.exp;

import static java.lang.Math.sqrt;

public enum CumulativeDistributionUsingMult {

_;

private static final double P = 0.2316419;

private static final double B1 = 0.319381530;

private static final double B2 = -0.356563782;

private static final double B3 = 1.781477937;

private static final double B4 = -1.821255978;

private static final double B5 = 1.330274429;

public static double cumulativeDistribution(double x) {

double t = 1 / (1 + P * abs(x));

double t1 = t;

double t2 = t * t;

double t3 = t2 * t;

double t4 = t3 * t;

double t5 = t4 * t;

double b1 = B1 * t1;

double b2 = B2 * t2;

double b3 = B3 * t3;

double b4 = B4 * t4;

double b5 = B5 * t5;

double b = b1 + b2 + b3 + b4 + b5;

double cd = 1 – standardNormalDistribution(x) * b;

return x < 0 ? 1 – cd : cd;

}

public static double standardNormalDistribution(double x) {

double top = exp(-0.5 * x * x);

double bottom = sqrt(2 * PI);

return top / bottom;

}

}

[/java]

### Benchmark – Math.pow() versus Multiplication

Benchmarks are generally speaking remarkably difficult to do correctly. Here I’ve done a best effort implementation. The first thing this test does is run both implementations for 10m iterations to forcibly enable JIT. It normally takes around 10k iterations for JIT to come on but I’ve deliberately overcompensated.

Then it runs both implementations for a variety of iteration counts ranging from 10k to 10m each time increasing iteration count by a factor of 10. For each iteration count it stores the results but does not print output until the very end to not involve the system libraries while benchmarking. At the end it prints the results for each iteration count.

[java]

package name.dhruba.black.scholes;

import java.util.Random;

import name.dhruba.black.scholes.utils.CumulativeDistributionUsingMult;

import name.dhruba.black.scholes.utils.CumulativeDistributionUsingPow;

import org.junit.Test;

public class TestCumulativeDistributionCalculator {

@Test

public void test() {

Random r = new Random();

// 10m iteration warmup to ensure jit is underway

for (int i = 0; i < Math.pow(10, 7); i++) {

double d = r.nextDouble();

CumulativeDistributionUsingPow.cumulativeDistribution(d);

CumulativeDistributionUsingMult.cumulativeDistribution(d);

}

// execute both for a variety of iterations and store results

long[][] times = new long[4][4];

for (int i = 0; i < 4; i++) {

int iterations = (int) Math.pow(10, i + 4);

times[i] = testHelper(iterations, r);

}

// computations complete; print times

for (int i = 0; i < times.length; i++) {

System.out.format("Over %1$9s iterations pow took %2$5s ms and "

+ "mult took %3$4s ms; mult was %4$3s faster.n",

times[i][0], times[i][1], times[i][2], times[i][3]);

}

}

private long[] testHelper(int iterations, Random r) {

// generate data for given iterations

double[] data = new double[iterations];

for (int i = 0; i < iterations; i++) {

data[i] = r.nextDouble();

}

// benchmark pow for given iterations

long t1 = System.currentTimeMillis();

for (int i = 0; i < iterations; i++) {

CumulativeDistributionUsingPow.cumulativeDistribution(data[i]);

}

t1 = System.currentTimeMillis() – t1;

// benchmark multiplication for given iterations

long t2 = System.currentTimeMillis();

for (int i = 0; i < iterations; i++) {

CumulativeDistributionUsingMult.cumulativeDistribution(data[i]);

}

t2 = System.currentTimeMillis() – t2;

// save times taken

return new long[] { iterations, t1, t2, t1 / t2 };

}

}

[/java]

### Running times and speedup factors

Running the above benchmark test prints the following output.

Over 10000 iterations pow took 8 ms and mult took 2 ms; mult was 4 faster. Over 100000 iterations pow took 82 ms and mult took 4 ms; mult was 20 faster. Over 1000000 iterations pow took 794 ms and mult took 36 ms; mult was 22 faster. Over 10000000 iterations pow took 9116 ms and mult took 422 ms; mult was 21 faster.

So, unless I’ve made mistakes above, you can see that simple multiplication can be ~20 faster than equivalent Math.pow() operations and this, I believe, is universal. It will be the same for C++ and probably other languages too. And this is because Math.pow() deals with the generic case of raising a number to any given power. The lookup involved in computing the result (via a native call in Java) is what takes additional time.

Once again, we’ve looked at a way of optimising performance that isn’t entirely obvious and cannot generally be found in official documentation. These are things one learns by working with others who know more than we do. And, by the way, for all the “premature optimisation is the root of all evil” hippie believers I don’t want to hear any of it. If you’re writing high latency code such as webapps or data access code this approach of remaining oblivious to performance may be acceptable but low latency requires an opposite mindset. In low latency you have to preempt implementations with faster ones and you have to question everything you see. It’s a dark art and one that takes a lifetime to learn and here I try to share with you whatever I learn one tip at a time.

The funny thing here is that the method is doing an approximation of the normal distribution using a fifth order polynomial, but when you call pow() you are forcing the CPU to interpolate and calculate another polynomial for each term in the original polynomial, plus unavoiable logic to handle negative numbers, infinities etc. Sadly I have yet to encounter a compiler that has a peephole optimisation for integral powers.

I liked your article on Black Scholes which led me here. Your intuition is correct. What you might want to look at is Horner’s rule. Sedgewick’s book probably has a nice writeup, but any numerical analysis text will cover the same material which you discovered via thoughtful experimentation. Improved algorithms not only run faster, but often result in more accuracy because roundoff error is reduced.

Funny thing about computers is that just writing out the formulas and coding in a straightforward approach can miss the forest for the trees. The history of the Fast Fourier Transform was an example where the faster algorithm was achieved by the way a clever person would perform the the calculation by hand. Follows the same principle of Horner’s rule: nesting of calculations via recursion or loops.

This is not completely fair, since you are ‘buffering’ the values and . You can only do that in some scenarios. I think this would be a better comparison:

double b1 = B1 * t;

double b2 = B2 * t*t;

double b3 = B3 * t*t*t;

double b4 = B4 * t*t*t*t;

double b5 = B5 * t*t*t*t*t;

Besides, the one in the example can easily be looped, while both the one you wrote and this one cannot (;

Good point. Well spotted. Agreed. 🙂

I’m sure that Math.pow is slower than multiplication (at least for elevate to square), but I wonder if the results are the same. There are the same accuracy using one or other?