How to print assembly for your Java code in OS X

0. Write a program.

[java]
package name.dhruba;

public class Main {

    public static void main(String[] args) {
        System.out.println("Hello World!");
    }

}
[/java]

1. Add the following JVM arguments to your program.

-XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly
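
If you run from the command line rather than an IDE, the equivalent invocation would be something along these lines (assuming the compiled classes are on the classpath):

java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly name.dhruba.Main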

2. Run your program. You will probably see the output below.

Java HotSpot(TM) 64-Bit Server VM warning: PrintAssembly is enabled; turning on DebugNonSafepoints to gain additional output
Could not load hsdis-amd64.dylib; library not loadable; PrintAssembly is disabled
Hello World!
Process finished with exit code 0

3. Download the missing library from here (the direct link to the lib file itself is here). I downloaded the one named ‘gnu-bsd-libhsdis-amd64.dylib’ as I’m running 64-bit; downloading it produces a file on your system called ‘hsdis-amd64.dylib’.

4. Move it to where the JDK you are running with looks for it. I was using Java 8.

sudo mv ~/Downloads/hsdis-amd64.dylib /Library/Java/JavaVirtualMachines/jdk1.8.0.jdk/Contents/Home/jre/lib/hsdis-amd64.dylib

5. Run your program again and see assembly! The output for Hello World is huge, so I can’t paste all of it here, but here’s the initial bit where you can see that the disassembler has been loaded.

Java HotSpot(TM) 64-Bit Server VM warning: PrintAssembly is enabled; turning on DebugNonSafepoints to gain additional output
Loaded disassembler from /Library/Java/JavaVirtualMachines/jdk1.8.0.jdk/Contents/Home/jre/lib/hsdis-amd64.dylib
Decoding compiled method 0x000000010f4c23d0:
Code:
[Disassembling for mach='i386:x86-64']
[Entry Point]
  # {method} {0x00000001280fe660} 'arraycopy' '(Ljava/lang/Object;ILjava/lang/Object;II)V' in 'java/lang/System'
  # parm0:    rsi:rsi   = 'java/lang/Object'
  # parm1:    rdx       = int
  # parm2:    rcx:rcx   = 'java/lang/Object'
  # parm3:    r8        = int
  # parm4:    r9        = int
  #           [sp+0x60]  (sp of caller)
  0x000000010f4c2540: mov    0x8(%rsi),%r10d
  0x000000010f4c2544: shl    $0x3,%r10
  0x000000010f4c2548: cmp    %r10,%rax
  0x000000010f4c254b: je     0x000000010f4c2558
  0x000000010f4c2551: jmpq   0x000000010f407b60  ;   {runtime_call}
  0x000000010f4c2556: xchg   %ax,%ax
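
Incidentally, if the full dump is overwhelming, HotSpot can restrict disassembly to a single method using the CompileCommand flag; as far as I recall, something like the line below prints just the chosen method (bearing in mind a method only shows up once the JIT actually compiles it).

-XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=print,name/dhruba/Main.main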

Credit: Nitsan Wakart.


Java pitfall: How to prevent Runtime.getRuntime().exec() from hanging

Runtime.getRuntime().exec() is used to execute a command-line program from within a Java program, as below.

[java]
import java.io.File;
import java.io.IOException;

public class ProcessExecutor {

    public static void main(String[] args) throws IOException, InterruptedException {

        String command = "c:\\my.exe";
        String workingDir = "c:\\myworkingdir";

        // start execution
        Process process = Runtime.getRuntime().exec(command, null, new File(workingDir));

        // wait for completion
        process.waitFor();

    }

}
[/java]

However, the command-line program being run above may block or deadlock, as it did for me on Windows 7. I was trying to run a program that produced a lot of output. I could run the program standalone, but through Java it hung indefinitely. Thread dumps showed nothing.

After being quite puzzled for a while as to why this was happening, I finally found the answer in the Java 7 API docs for Process.

Because some native platforms only provide limited buffer size for standard input and output streams, failure to promptly write the input stream or read the output stream of the subprocess may cause the subprocess to block, or even deadlock.

So, in fact, the fix for the above program is as follows.

[java]
import java.io.BufferedInputStream;
import java.io.File;
import java.io.IOException;

public class ProcessExecutor {

    public static void main(String[] args) throws IOException, InterruptedException {

        String command = "c:\\my.exe";
        String workingDir = "c:\\myworkingdir";

        // start execution
        Process process = Runtime.getRuntime().exec(command, null, new File(workingDir));

        // exhaust input stream
        BufferedInputStream in = new BufferedInputStream(process.getInputStream());
        byte[] bytes = new byte[4096];
        while (in.read(bytes) != -1) {}

        // wait for completion
        process.waitFor();

    }

}
[/java]

This is so bad. Not only is this behaviour unexpected, but it is also undocumented in the exec call itself. Another problem: if you are timing the total execution time of a given command and don’t care about the output, you have to read the output anyway, and probably subtract the reading time from the total execution time. I’m not sure how accurate that will be.

Surely there could have been a better way to handle this for the user in the API internals. So Windows 7 must be one of those OSs with small buffer sizes then. Anyway, at least you know now. Obviously you don’t have to read the output into nothing as I’m doing above; you can write it to stdout or a file.
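
One way to keep the timing concern out of the main thread is to drain the output on a background thread, so that the timed section wraps only the process itself. A rough sketch (the class name and paths are placeholders):

[java]
import java.io.File;
import java.io.IOException;
import java.io.InputStream;

public class TimedProcessExecutor {

    public static void main(String[] args) throws IOException, InterruptedException {

        long start = System.currentTimeMillis();

        // start execution (same placeholder command and working dir as above)
        final Process process = Runtime.getRuntime().exec(
                "c:\\my.exe", null, new File("c:\\myworkingdir"));

        // drain the output on a background thread so the timed thread
        // only ever waits on the process itself
        // (the error stream would need the same treatment; see the update below)
        Thread gobbler = new Thread(new Runnable() {
            public void run() {
                try {
                    InputStream in = process.getInputStream();
                    byte[] bytes = new byte[4096];
                    while (in.read(bytes) != -1) {}
                } catch (IOException ignored) {
                }
            }
        });
        gobbler.start();

        // wait for completion and report elapsed time
        process.waitFor();
        System.out.println("took " + (System.currentTimeMillis() - start) + " ms");

    }

}
[/java]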

Update: A commenter made a good point that I’d forgotten to read the error stream above. Don’t forget to do so in your own code!
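
A tidy way to take care of the error stream is ProcessBuilder, which can merge stderr into stdout so that a single draining loop covers both. A minimal sketch, again with placeholder paths:

[java]
import java.io.BufferedInputStream;
import java.io.File;
import java.io.IOException;

public class ProcessExecutor2 {

    public static void main(String[] args) throws IOException, InterruptedException {

        // same placeholder command and working directory as before
        ProcessBuilder builder = new ProcessBuilder("c:\\my.exe");
        builder.directory(new File("c:\\myworkingdir"));

        // merge stderr into stdout so one draining loop covers both streams
        builder.redirectErrorStream(true);

        // start execution
        Process process = builder.start();

        // exhaust the combined output stream
        BufferedInputStream in = new BufferedInputStream(process.getInputStream());
        byte[] bytes = new byte[4096];
        while (in.read(bytes) != -1) {}

        // wait for completion
        process.waitFor();

    }

}
[/java]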

Project Jigsaw is deferred to Java 9

Project Jigsaw, the modularity proposal in Java 8, has been deferred to Java 9 to prevent the Java 8 release from being delayed. It reminds me of Project Lambda previously being deferred to Java 8 so that Java 7 wouldn’t be delayed 🙂 Although this may appear to be a recurring trend, I think it is a good thing to strip releases of features that cannot be achieved within the allotted time so that major releases are not delayed. After all, this is the real world, and I’m sure none of us would want to wait any longer than necessary for lambdas in Java 8!

HotSpot PermGen removal commit made

As you may know, after Oracle’s takeover of Java and the subsequent decision to merge JRockit into HotSpot, PermGen was due to be removed from HotSpot entirely. Two days ago, a commit (openjdk list message) was made to the HotSpot GC repository to remove PermGen (the storage of class metadata) and use native memory in its place. According to the commit message, class metadata will now be represented as C++ classes and allocated in the metaspace linked to classloaders. [Source]

Performance Pattern: Multiplication is ~20x faster than Math.pow()

Last time, in the performance pattern series, I wrote about how a bitwise equivalent implementation of the modulo operator was a significantly faster alternative to the modulo operator itself (around 5x faster). This time, I look at a possible ~20x speedup gained simply by using multiplication in place of Math.pow(). As a reference implementation I use the cumulative distribution function from the Black Scholes algorithm (and the Greeks) because it uses the Math.pow() function quite a bit. First, let’s look at the pow-based implementation.

Cumulative distribution function – Math.pow() based implementation

[java]
package name.dhruba.black.scholes.utils;

import static java.lang.Math.PI;
import static java.lang.Math.abs;
import static java.lang.Math.exp;
import static java.lang.Math.pow;
import static java.lang.Math.sqrt;

public enum CumulativeDistributionUsingPow {

    _;

    private static final double P = 0.2316419;
    private static final double B1 = 0.319381530;
    private static final double B2 = -0.356563782;
    private static final double B3 = 1.781477937;
    private static final double B4 = -1.821255978;
    private static final double B5 = 1.330274429;

    public static double cumulativeDistribution(double x) {
        double t = 1 / (1 + P * abs(x));
        double t1 = B1 * pow(t, 1);
        double t2 = B2 * pow(t, 2);
        double t3 = B3 * pow(t, 3);
        double t4 = B4 * pow(t, 4);
        double t5 = B5 * pow(t, 5);
        double b = t1 + t2 + t3 + t4 + t5;
        double cd = 1 - standardNormalDistribution(x) * b;
        return x < 0 ? 1 - cd : cd;
    }

    public static double standardNormalDistribution(double x) {
        double top = exp(-0.5 * pow(x, 2));
        double bottom = sqrt(2 * PI);
        return top / bottom;
    }

}
[/java]
[/java]

As you can see, there are a number of calls to Math.pow() in both the cumulative distribution function and the standard normal distribution. Let us now replace those with equivalent multiplication operations.

Cumulative distribution function – Multiplication based implementation

[java]
package name.dhruba.black.scholes.utils;

import static java.lang.Math.PI;
import static java.lang.Math.abs;
import static java.lang.Math.exp;
import static java.lang.Math.sqrt;

public enum CumulativeDistributionUsingMult {

    _;

    private static final double P = 0.2316419;
    private static final double B1 = 0.319381530;
    private static final double B2 = -0.356563782;
    private static final double B3 = 1.781477937;
    private static final double B4 = -1.821255978;
    private static final double B5 = 1.330274429;

    public static double cumulativeDistribution(double x) {
        double t = 1 / (1 + P * abs(x));
        double t1 = t;
        double t2 = t * t;
        double t3 = t2 * t;
        double t4 = t3 * t;
        double t5 = t4 * t;
        double b1 = B1 * t1;
        double b2 = B2 * t2;
        double b3 = B3 * t3;
        double b4 = B4 * t4;
        double b5 = B5 * t5;
        double b = b1 + b2 + b3 + b4 + b5;
        double cd = 1 - standardNormalDistribution(x) * b;
        return x < 0 ? 1 - cd : cd;
    }

    public static double standardNormalDistribution(double x) {
        double top = exp(-0.5 * x * x);
        double bottom = sqrt(2 * PI);
        return top / bottom;
    }

}
[/java]
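
As an aside, the power chain above can be collapsed even further using Horner’s rule, which evaluates the polynomial with five multiplications and four additions in total. A sketch of what cumulativeDistribution could look like in that form, assuming the same constants and static imports as the class above (I haven’t benchmarked this variant, and it may round very slightly differently):

[java]
public static double cumulativeDistribution(double x) {
    double t = 1 / (1 + P * abs(x));
    // Horner form of B1*t + B2*t^2 + B3*t^3 + B4*t^4 + B5*t^5
    double b = t * (B1 + t * (B2 + t * (B3 + t * (B4 + t * B5))));
    double cd = 1 - standardNormalDistribution(x) * b;
    return x < 0 ? 1 - cd : cd;
}
[/java]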

Benchmark – Math.pow() versus Multiplication

Benchmarks are, generally speaking, remarkably difficult to do correctly. Here I’ve done a best-effort implementation. The first thing this test does is run both implementations for 10m iterations to ensure that JIT compilation has kicked in. It normally takes around 10k iterations for the JIT to kick in, but I’ve deliberately overcompensated.

Then it runs both implementations for a variety of iteration counts, ranging from 10k to 10m, each time increasing the iteration count by a factor of 10. For each iteration count it stores the results but does not print any output until the very end, so as not to involve the system libraries while benchmarking. At the end it prints the results for each iteration count.

[java]
package name.dhruba.black.scholes;

import java.util.Random;

import name.dhruba.black.scholes.utils.CumulativeDistributionUsingMult;
import name.dhruba.black.scholes.utils.CumulativeDistributionUsingPow;

import org.junit.Test;

public class TestCumulativeDistributionCalculator {

    @Test
    public void test() {

        Random r = new Random();

        // 10m iteration warmup to ensure jit is underway
        for (int i = 0; i < Math.pow(10, 7); i++) {
            double d = r.nextDouble();
            CumulativeDistributionUsingPow.cumulativeDistribution(d);
            CumulativeDistributionUsingMult.cumulativeDistribution(d);
        }

        // execute both for a variety of iterations and store results
        long[][] times = new long[4][4];
        for (int i = 0; i < 4; i++) {
            int iterations = (int) Math.pow(10, i + 4);
            times[i] = testHelper(iterations, r);
        }

        // computations complete; print times
        for (int i = 0; i < times.length; i++) {
            System.out.format("Over %1$9s iterations pow took %2$5s ms and "
                    + "mult took %3$4s ms; mult was %4$3s faster.\n",
                    times[i][0], times[i][1], times[i][2], times[i][3]);
        }

    }

    private long[] testHelper(int iterations, Random r) {

        // generate data for given iterations
        double[] data = new double[iterations];
        for (int i = 0; i < iterations; i++) {
            data[i] = r.nextDouble();
        }

        // benchmark pow for given iterations
        long t1 = System.currentTimeMillis();
        for (int i = 0; i < iterations; i++) {
            CumulativeDistributionUsingPow.cumulativeDistribution(data[i]);
        }
        t1 = System.currentTimeMillis() - t1;

        // benchmark multiplication for given iterations
        long t2 = System.currentTimeMillis();
        for (int i = 0; i < iterations; i++) {
            CumulativeDistributionUsingMult.cumulativeDistribution(data[i]);
        }
        t2 = System.currentTimeMillis() - t2;

        // save times taken
        return new long[] { iterations, t1, t2, t1 / t2 };

    }

}
[/java]

Running times and speedup factors

Running the above benchmark test prints the following output.

Over     10000 iterations pow took     8 ms and mult took    2 ms; mult was   4 faster.
Over    100000 iterations pow took    82 ms and mult took    4 ms; mult was  20 faster.
Over   1000000 iterations pow took   794 ms and mult took   36 ms; mult was  22 faster.
Over  10000000 iterations pow took  9116 ms and mult took  422 ms; mult was  21 faster.

So, unless I’ve made mistakes above, you can see that simple multiplication can be ~20x faster than equivalent Math.pow() operations, and this, I believe, is universal: it will be the same for C++ and probably other languages too. This is because Math.pow() deals with the generic case of raising a number to any given power. The extra work involved in computing that generic result (via a native call in Java) is what takes the additional time.

Once again, we’ve looked at a way of optimising performance that isn’t entirely obvious and cannot generally be found in official documentation. These are things one learns by working with others who know more than we do. And, by the way, for all the “premature optimisation is the root of all evil” hippie believers: I don’t want to hear any of it. If you’re writing high-latency code such as webapps or data access code, this approach of remaining oblivious to performance may be acceptable, but low latency requires the opposite mindset. In low latency you have to preempt implementations with faster ones and you have to question everything you see. It’s a dark art, one that takes a lifetime to learn, and here I try to share with you whatever I learn, one tip at a time.

The Black Scholes Algorithm: The Greeks in Java

Last time in the Black Scholes series I wrote about how to write the Black Scholes algorithm in Java in a maintainable and readable way by decomposing its constituent parts for easy reference. This time I look at modifying the previous implementation to incorporate The Greeks. Though, this time, there isn’t any great degree of decomposition to do, as each Greek has its own formula and each one is implemented separately anyway.

What are The Greeks and how do they relate to the Black Scholes algorithm? This is best answered by Risk Glossary. Below I merely present a decomposed implementation of Black Scholes and The Greeks. In terms of performance considerations, I’ve ensured that no equivalent operations are called more than once. I’ve also eliminated object allocation completely; the only allocation made is of a result array containing six doubles. However, I have not applied advanced optimisations such as the use of multiplication in place of the pow() function.

I want to add a disclaimer that there isn’t anything very new in this post. I’m sure this has been done lots of times before, but I’m posting it for two reasons: 1) I wanted to post my take on the implementation in a decomposed fashion and 2) I didn’t really find anything out there implementing the Greeks in Java. I hope it helps others looking for something similar.

The Greeks in Java

The inputs to The Greeks formulas are the same as those available to the Black Scholes algorithm, so both can be implemented in common scope. Most Greeks have different formulas for call and put options, with the exception of gamma and vega, which are implemented common to both. Theta has the most complex formula of them all, so I’ve broken it down into left and right halves. The calculate() method returns a six-element double array containing the following values in order: (1) price, (2) delta, (3) gamma, (4) vega, (5) theta, (6) rho. If you want to know what the inputs are, see my previous article; for the formulas implemented below, look here.

[java]
package name.dhruba.black.scholes;

import static java.lang.Math.PI;
import static java.lang.Math.abs;
import static java.lang.Math.exp;
import static java.lang.Math.log;
import static java.lang.Math.pow;
import static java.lang.Math.sqrt;

public enum BlackScholesGreeks2 {

    _;

    private static final double P = 0.2316419;
    private static final double B1 = 0.319381530;
    private static final double B2 = -0.356563782;
    private static final double B3 = 1.781477937;
    private static final double B4 = -1.821255978;
    private static final double B5 = 1.330274429;

    public static double[] calculate(boolean c,
            double s, double k, double r, double t, double v) {

        double[] p = new double[6];

        double d1 = d1(s, k, r, t, v);
        double d2 = d2(s, k, r, t, v);

        double sd1 = standardNormalDistribution(d1);
        double cd1 = cumulativeDistribution(d1, sd1);
        double thetaLeft = -(s * sd1 * v) / (2 * sqrt(t));

        if (c) {

            double cd2 = cumulativeDistribution(d2);

            // price
            p[0] = s * cd1 - k * exp(-r * t) * cd2;

            // delta
            p[1] = cd1;

            // theta
            double thetaRight = r * k * exp(-r * t) * cd2;
            p[4] = thetaLeft - thetaRight;

            // rho
            p[5] = k * t * exp(-r * t) * cd2;

        } else {

            double pcd1 = cumulativeDistribution(-d1);
            double pcd2 = cumulativeDistribution(-d2);

            // price
            p[0] = k * exp(-r * t) * pcd2 - s * pcd1;

            // delta
            p[1] = cd1 - 1;

            // theta
            double thetaRight = r * k * exp(-r * t) * pcd2;
            p[4] = thetaLeft + thetaRight;

            // rho
            p[5] = -k * t * exp(-r * t) * pcd2;

        }

        // gamma
        p[2] = sd1 / (s * v * sqrt(t));

        // vega
        p[3] = s * sd1 * sqrt(t);

        return p;

    }

    private static double d1(double s, double k, double r, double t, double v) {
        double top = log(s / k) + (r + pow(v, 2) / 2) * t;
        double bottom = v * sqrt(t);
        return top / bottom;
    }

    private static double d2(double s, double k, double r, double t, double v) {
        return d1(s, k, r, t, v) - v * sqrt(t);
    }

    public static double cumulativeDistribution(double x) {
        return cumulativeDistribution(x, standardNormalDistribution(x));
    }

    public static double cumulativeDistribution(double x, double sdx) {
        double t = 1 / (1 + P * abs(x));
        double t1 = B1 * pow(t, 1);
        double t2 = B2 * pow(t, 2);
        double t3 = B3 * pow(t, 3);
        double t4 = B4 * pow(t, 4);
        double t5 = B5 * pow(t, 5);
        double b = t1 + t2 + t3 + t4 + t5;
        double cd = 1 - sdx * b;
        return x < 0 ? 1 - cd : cd;
    }

    public static double standardNormalDistribution(double x) {
        double top = exp(-0.5 * pow(x, 2));
        double bottom = sqrt(2 * PI);
        return top / bottom;
    }

}
[/java]

Testing the implementation

I’ve written a simple test based on values assumed to be correct from an online calculator. Rather oddly, although my answers matched that online calculator, for some of the values they differ very slightly from those on Wolfram Alpha. I don’t know why Wolfram Alpha produces different values. If you know, let me know.

[java]
package name.dhruba.black.scholes;

import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class TestBlackScholesGreeks {

    @Test
    public void testGreeks() {

        boolean c;
        double s, k, r, t, v;
        double[] p;

        c = true;
        s = 56.25;
        k = 55;
        r = 0.0285;
        t = 0.34;
        v = 0.28;

        p = BlackScholesGreeks2.calculate(c, s, k, r, t, v);

        assertEquals(4.561, round(p[0], 3), 0);
        assertEquals(0.610, round(p[1], 3), 0);
        assertEquals(0.042, round(p[2], 3), 0);
        assertEquals(12.587, round(p[3], 3), 0);
        assertEquals(-6.030, round(p[4], 3), 0);
        assertEquals(10.110, round(p[5], 3), 0);

        c = false;
        p = BlackScholesGreeks2.calculate(c, s, k, r, t, v);

        assertEquals(2.781, round(p[0], 3), 0);
        assertEquals(-0.390, round(p[1], 3), 0);
        assertEquals(0.042, round(p[2], 3), 0);
        assertEquals(12.587, round(p[3], 3), 0);
        assertEquals(-4.478, round(p[4], 3), 0);
        assertEquals(-8.409, round(p[5], 3), 0);

    }

    static double round(double d, int places) {
        int factor = (int) Math.pow(10, places);
        return (double) Math.round(d * factor) / factor;
    }

}
[/java]

Oddities

One thing I noticed is that, in the price formula for a put option, reordering certain operations, even though the overall computation was equivalent, produced different digits in the final few decimal places. The following two statements, for example, although mathematically equivalent, produce slightly different values. If anyone knows for sure what’s going on here, do let me know.

p[0] = pcd2 * k * exp(-r * t) - pcd1 * s;
p[0] = k * exp(-r * t) * pcd2 - s * pcd1;
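
My best guess is that floating-point arithmetic is not associative: each operation rounds its result to the nearest representable double, so two mathematically equivalent orderings can accumulate different rounding along the way. A minimal self-contained demonstration:

[java]
public class FloatingPointOrder {

    public static void main(String[] args) {
        // mathematically identical, but rounded differently per evaluation order
        System.out.println((0.1 + 0.2) + 0.3); // prints 0.6000000000000001
        System.out.println(0.1 + (0.2 + 0.3)); // prints 0.6
    }

}
[/java]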

Did this help you or do you have any improvements or fixes? Let me know in the comments.