How to print assembly for your Java code in OS X

0. Write program.

package name.dhruba;
public class Main {
public static void main(String[] args) {
System.out.println("Hello World!");

1. Add JVM arguments to your program.

-XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly

2. Run your program. You will probably see the output below.

Java HotSpot(TM) 64-Bit Server VM warning: PrintAssembly is enabled; turning on DebugNonSafepoints to gain additional output
Could not load hsdis-amd64.dylib; library not loadable; PrintAssembly is disabled
Hello World!
Process finished with exit code 0

3. Download the missing library from here. The direct link to the lib file itself is here. I downloaded the one named ‘gnu-bsd-libhsdis-amd64.dylib’ as I’m running 64-bit. This produces a file on your system called ‘hsdis-amd64.dylib’.

4. Move it to where, the JDK that you are running with, looks for it. I was using Java8.

sudo mv ~/Downloads/hsdis-amd64.dylib /Library/Java/JavaVirtualMachines/jdk1.8.0.jdk/Contents/Home/jre/lib/hsdis-amd64.dylib

5. Run your program again and see assembly! The output for Hello World is huge so I can’t paste all of it here but here’s the initial bit where you can see that the disassembler has been loaded.

Java HotSpot(TM) 64-Bit Server VM warning: PrintAssembly is enabled; turning on DebugNonSafepoints to gain additional output
Loaded disassembler from /Library/Java/JavaVirtualMachines/jdk1.8.0.jdk/Contents/Home/jre/lib/hsdis-amd64.dylib
Decoding compiled method 0x000000010f4c23d0:
[Disassembling for mach='i386:x86-64']
[Entry Point]
  # {method} {0x00000001280fe660} 'arraycopy' '(Ljava/lang/Object;ILjava/lang/Object;II)V' in 'java/lang/System'
  # parm0:    rsi:rsi   = 'java/lang/Object'
  # parm1:    rdx       = int
  # parm2:    rcx:rcx   = 'java/lang/Object'
  # parm3:    r8        = int
  # parm4:    r9        = int
  #           [sp+0x60]  (sp of caller)
  0x000000010f4c2540: mov    0x8(%rsi),%r10d
  0x000000010f4c2544: shl    $0x3,%r10
  0x000000010f4c2548: cmp    %r10,%rax
  0x000000010f4c254b: je     0x000000010f4c2558
  0x000000010f4c2551: jmpq   0x000000010f407b60  ;   {runtime_call}
  0x000000010f4c2556: xchg   %ax,%ax

Credit: Nitsan Wakart.

Upcoming Coursera course: Principles of Reactive Programming

Just a quick note to say that a new Coursera course begins on 4th Nov 2013 that looks very interesting indeed titled ‘Principles of Reactive Programming‘ taught by Martin Odersky, Erik Meijer and Roland Kuhn. It is intended as a follow on course from its precursor course ‘Functional Programming Principles in Scala‘ also taught by Martin Odersky that concludes right when the new one starts. To hear the instructors give an introduction to the new course watch this video.

Scala Cookbook: Simplifying nested pattern matching

Pattern matching in Scala is a natural and intuitive way of expressing branched logic and can greatly simplify complex branched logic in terms of readability but what if you need to branch within a branch? This can result in unsightly code and duplication of logic. Let’s see this first hand with an example.

Let’s try to code a function removeAt(n, List) which removes the zero indexed nth element from the supplied list and returns the resulting list. Our first attempt, as we transcribe our thoughts step by step into code, may look as follows.

def removeAt[T](n: Int, xs: List[T]): List[T] = n match {
case 0 => xs match {
case Nil => Nil
case y :: ys => removeAt(n – 1, ys)
case _ => xs match {
case Nil => Nil
case y :: ys => y :: removeAt(n – 1, ys)

Here you can see that, although the logic is correct and reasonably easy to interpret, there is duplication of code and the function could be made even easier to read and a little shorter. On our second attempt we may try to extract the duplicate Nil => Nil cases outside of the match as below.

def removeAt[T](n: Int, xs: List[T]): List[T] =
if (xs.isEmpty) Nil else n match {
case 0 => xs match {
case y :: ys => removeAt(n – 1, ys)
case _ => xs match {
case y :: ys => y :: removeAt(n – 1, ys)

This also works and has in fact achieved our goals – the function is shorter and does not duplicate code. However, at this point, it is blending if/else logic with pattern matching and most importantly producing warnings in Scala IDE in Eclipse. The warnings state that in our secondary (internal) pattern match we do not cater for the Nil case and that’s correct.

Description	Resource	Path	Location	Type
match may not be exhaustive. It would fail on the following input: Nil	ListUtils.scala	/scala-test/src/week5	line 13	Scala Problem
match may not be exhaustive. It would fail on the following input: Nil	ListUtils.scala	/scala-test/src/week5	line 16	Scala Problem

Even though Scala IDE may not see that in the bigger picture the logic is actually correct it is correct in saying what it is saying specifically and I also consider it bad practice to leave warnings in code so let’s go through one more simplification attempt. How else can we restructure this function to achieve the same goals?

def removeAt[T](n: Int, xs: List[T]): List[T] =
xs match {
case Nil => Nil
case y :: ys => n match {
case 0 => removeAt(n – 1, ys)
case _ => y :: removeAt(n – 1, ys)

The answer is to invert the order of the high level pattern match subjects by making the subject that’s being pattern matched redundantly rise to become the primary pattern match with the other subject then becoming secondary. And there we have the end result of the simplification.


A couple of after thoughts for completion follow here. Firstly – the above progression in implementation strategies was chosen to allow illustration of how to incrementally simplify a function and does not suggest that one cannot in fact end up with the final solution the first time. Secondly – the implementation algorithm above which is needlessly verbose has also been chosen to be illustrative of nested pattern matching and isn’t the simplest and most concise possible solution. For that see below.

def removeAt[T](n: Int, xs: List[T]): List[T] =
(xs take n) ::: (xs drop n+1)

Scala Cookbook: Tail recursive factorial

The direct port of a human readable description of the factorial algorithm may look as follows.

def factorial(x: BigInt): BigInt = {
if (x == 0) 1 else x * factorial(x – 1)

However – this isn’t tail recursive and will fail for large inputs.

Exception in thread "main" java.lang.StackOverflowError
	at java.math.BigInteger.getInt(

The classic nested accumulator helper function idiom can be used to make this tail recursive.

def factorial(x: BigInt): BigInt = {
def f(x: BigInt, acc: BigInt): BigInt = {
if (x == 0) acc else f(x – 1, x * acc)
f(x, 1)

So what’s going on here? Well it’s quite simple really.

The first function uses a number of stack frames proportional to the size of the input due to the fact that it is using right expansion. In other words when it calls itself it multiplies the result by x tainting the stack so that it can’t be thrown away or reused. The stack is only flattened when fully expanded and then collapsed.

The second function is tail recursive – in other words – it calls itself as the final call. This is essentially compiled to a loop by the Scala compiler using only one stack frame regardless of the size of the input. It is therefore not possible to get a StackOverflowError with a tail recursive function.

Note the @tailrec annotation which when applied if it compiles then your function is certified tail recursive by the Scala compiler. It is a quick and easy way to check whether your function is tail recursive at compile time as well as serving as a useful hint for other developers maintaining your code.

A pattern matched equivalent of the tail recursive function for fun is below.

def factorial4(x: BigInt): BigInt = {
def f(x: BigInt, acc: BigInt): BigInt = x match {
case y: BigInt if y == 0 => acc
case _ => f(x – 1, x * acc)
f(x, 1)

Scala Cookbook: Pattern matching on BigInt

Let’s take the factorial algorithm as an example.

def factorial(x: BigInt): BigInt = {
if (x == 0) 1 else x * factorial(x – 1)

One might expect to convert this into a pattern matched equivalent as follows.

def factorial(x: BigInt): BigInt = x match {
case 0 => 1
case _ => x * factorial(x – 1)

In fact this will give you the error below.

type mismatch;  found   : Int(0)  required: BigInt	Factorial.scala

The reason is that BigInt is not a case class and does not have an unapply() method in its companion object.

However a small modification allows the pattern matching implementation to succeed.

def factorial(x: BigInt): BigInt = x match {
case y: BigInt if y == 0 => 1
case _ => x * factorial(x – 1)

Arguably the if/else equivalent is simpler but if you always use pattern matching you’ll want this version.

OpenCL Cookbook: 10 tips for high performance kernels

Today we adorn our performance hat in the OpenCL Cookbook series. OpenCL kernel optimisation is a black art and very very hard. Recently when I came across a list of 10 tips for high performance kernels in Matthew Scarpino’s excellent book: OpenCL in Action I just had to share it as it was a true gem. Bear in mind that this is an abridged account – if you want more detail read the appropriate section in the book.

  1. Unroll loops: Comparison operations are expensive – if you know how many iterations you need simply perform them individually.
  2. Disable processing of denormalised numbers: The processing of denormalised numbers takes time. Disable them using the -cl-denorms-are-zero option. You can also disable processing of infinite values and NaNs using -cl-finite-math-only.
  3. Transfer constant primitive values to the kernel with compiler defines instead of private memory parameters. Example: clBuildProgram(program, 0, NULL, "-DSIZE=128", NULL, NULL);.
  4. Store small variable values in private memory instead of local memory: Private memory is faster and should be used unless you need to share data between work items in which case use local memory.
  5. Avoid local memory bank conflicts by accessing local memory sequentially: Successive 32-bit elements are stored in successive local memory banks. Sequential bank access is parallel whereas contending on same bank access is serial.
  6. Avoid using modulo operator: This operator is slow and alternatives should be used instead.
  7. Reuse private variables throughout the kernel but use macros to distinguish each separate use.
  8. For multiply-and-add operations use the fma function if available.
  9. Inline non-kernel functions in a program file: This uses more memory but removes the context switches and stack operations associated with regular function calls.
  10. Avoid branch miss penalties by coding conditional statements to be true more often than false: Processors often optimise for the true case and this can result in a penalty for the false case known as the branch miss penalty. Code your conditionals to evaluate to true as often as possible.

It’s easy to list rules for optimising kernels but the truth is that optimising kernels is hard; very hard. The best way to approach it is, in my opinion, to profile your application and kernels as well as to experiment with various changes. There will certainly be optimisations that you’ll apply that will turn out to be ineffectual or even slower so trial and error is key.

AMD release APP SDK 2.8, CodeXL 1.0 and Bolt

This is old news now that AMD have released APP SDK 2.8, CodeXL 1.0 and Bolt. Bolt is a C++ abstraction of OpenCL which allows you to write OpenCL applications without actually writing OpenCL. I personally haven’t tried it as we are not going to be using it at work but it sounds a good step forward in making GPGPU more accessible to the mainstream. For anyone interested there are a number of blog posts by AMD on how to use Bolt.

OpenCL Cookbook: Multi-device utilisation strategies

A quick note on all the different ways in which one can utilise multiple devices in OpenCL. This overlaps with my previous post on the same subject in the series but is a higher level overview than that previous post.

Here are the different ways one can load balance across multiple devices.

  • Use CPU and GPU devices.
  • Use multiple GPU devices.

For each of the above one can utilise them in the following ways.

  • Single context – Use one context in total across all devices and one command queue per device.
  • Multiple contexts single-host-thread – Use one context and one command queue per device.
  • Multiple contexts multi-host-threaded – Same as multiple contexts above but use one host thread per device to enqueue to the command queue for that device. This can also be in the form of starting one host process instance per device which in essence is the same.

For an elaboration on the latter strategies see my previous post on the subject.

Project Rogue: Conception and specification

Project Rogue in action in Crysis 2
Project Rogue in action in Crysis 2


Project Rogue was conceived out of necessity in September 2012. Constituent components were researched in October 2012 and then subsequently ordered end of October 2012. A week later the project was complete in the sense that it functioned as a whole. It was alive and breathing. One month later, in December 2012, Project Rogue has now completed its extensive accessorisation phase and now I reveal it to you in all its glory.

How was Project Rogue conceived?

Early September 2012 I started working with some insanely powerful desktop workstation hardware – dual xeon 16 core 32 thread watercooled cpus, 8 water cooled gpus, 128GB ram, dual 1200W power supplies and immensely large cases. We were loading up these beasts with heavy computations that were utilising both multiple physical cpus and multiple gpus.

At that time my personal machine at home was a humble 11″ macbook air which although previously had been more than adequate for my needs it had now essentially been made suitably redundant and in a sense outdone. It was clear that I would need something a bit more ambitious to be able to do the kind of work at home that I was doing at work.

Something that would allow me to do uncompromising computations on the cpu and across multi-gpus but within the constraints of a personal budget (which I far exceeded in the end due to my own uncompromising nature). Something that within loose constraints would be uncompromising in its demands for high quality components and power and not settle for what was just good enough – something almost rogue in nature.

What is Project Rogue?

Sadly it’s nothing as grand as the preceding section makes it out to be where I got a little carried away but to me, actually, it is grand for reasons I’ll get into below. Project Rogue is a home computation and gaming rig with modern high quality components within the constraints of a home budget (sadly a characteristically generous one).

There are two things that make Project Rogue highly significant to me which is why it is being made into a series of posts on this blog. Firstly, it is a real world machine born out of a real world need created from scratch in the modern day and I’m sure that there are many that would benefit from an account of the entire process of creating such a rig. I certainly wish there was such a guide for me when I started.

Secondly, and more importantly, this was my first pc build ever which I admit without reservation and given that fact and the calibre of this machine I feel a sense of achievement and would like to catalogue this milestone of mine. I also think that the latter point makes, this account to come, even more valuable for those out there who haven’t done this before in making it that much more accessible and approachable to them.

I have to say, that whilst my assumption was (as always) that I was the only one who hadn’t done anything like this before, that could not have been further from the truth. I was really very surprised to learn that most ex-colleagues and some current colleagues over the years had either had no exposure to hardware to date or had simply become totally disconnected from it in mainstream commodity development.

I’ve been there and I know what it’s like. You’re developing on a virtual machine. You don’t need to know where it runs and you don’t need to care. The illusion of abstraction and portability reassures you that as long as you write your code to be fast it will automatically run faster when put on better hardware but in reality that is not necessarily true. Hardware is never proactively part of the development process and developers are generally unaware of what hardware they are running on in production.

Home computation and gaming rig component list

Here follows the component and accessory list for a home computation and gaming rig. Each component was picked with a particular motivation in mind and I’ll write about those motivations in a separate post. But generally I tried to go for the best, not unreasonably expensive, quality components that were released most recently.

For the experts out there if you are expecting a workstation or server specification build or a high end consumer build such as a multi cpu, xeon or sandybridge build you’ll be disappointed. This is, sadly, a bit simpler and a bit cheaper but at least up to date with ivybridge and the latest 7970 cards. It is a consumer grade build built with two gpu computations and gaming in mind. It is uncompromising in the gpus chosen however. If you’re expecting a full build watercooled solution you’ll also be sorely disappointed. I needed to start with air-cooling but no doubt next up is water.

In terms of cost – the total cost of the following component list comes to £4385 but a few parts having been bought have been replaced and the sales of the older parts are still pending and once they go through the total cost will reduce to £3533 which sounds a little less mindless.

# Component Price Model Vendor Notes
1 Case £143 Corsair Obsidian 650D Amazon, OC Windowed and matte black
2 Motherboard £231 Asus Maximus V Formula/ThunderFX Amazon, OC Multi-gpu PCI-E, dual band wifi
3 PSU £205 Corsair AX1200 Amazon, OC Fully modular and future proof
4 CPU £240 Intel IvyBridge i7-3770K Amazon, OC 4 cores, 8 threads
5 CPU Cooler £80 Corsair H100 Amazon, OC Sealed liquid cooler
6 Memory £200 Corsair 16GB Dominator Platinum Amazon, OC Pure aesthetic beauty
7 GPU 1 £395 Asus Matrix HD7970 Platinum Amazon, OC Top end triple slot AMD radeon card
8 GPU 2 £395 Asus Matrix HD7970 Platinum Amazon, OC Top end triple slot AMD radeon card
9 GPU 3 £380 MSI R7970 Lightning Boost Edition Amazon, OC Top end dual slot AMD radeon card
10 GPU 4 £380 MSI R7970 Lightning Boost Edition Amazon, OC Top end dual slot AMD radeon card
11 Monitor £489 Asus PB278Q Amazon, OC 27″ LED, fully adjustable
12 SSD 1 £171 Corsair Force GS 240GB Amazon, OC 90K IOPS, bright red
13 SSD 2 £171 Corsair Force GS 240GB Amazon, OC 90K IOPS, bright red
14 HDD 1 £66 WD Scorpio Blue 1TB (Spec) (PDF) Amazon, OC Fast 2.5″ 1TB HD
15 HDD 2 £66 WD Scorpio Blue 1TB (Spec) (PDF) Amazon, OC Fast 2.5″ 1TB HD
16 Bluray writer £80 Asus BW-12B1ST Amazon, OC Multi-purpose drive
17 4x SSD Rack £58 Icy Dock 4×2.5″ SSD/HD Rack Amazon, OC 4x drive hot swap rack
18 Fan controller £47 Lamptron FC5V2 Amazon, OC Multi-colour multi-output
19 120mm fans £18 Corsair SP120 Quiet Edition Amazon, OC Radiator fans with red fan rims!
20 120mm fans £10 Corsair AF120 Quiet Edition Amazon, OC Airflow fans with red rims!
21 200mm fan £15 Bitfenix Spectre Pro Amazon, OC Quiet black 200mm fan
22 Speakers £183 Corsair SP2500 Amazon, OC Top end 2.1 speakers
23 Keyboard 1 £11 MS Keyboard 600 Amazon, OC Cheap and nasty
24 Keyboard 2 £70 Logitech G510 Amazon, OC Soft touch with macros
25 Mouse 1 £45 Logitech Performance MX Amazon, OC Cordless
26 Mouse 2 £35 Logitech G500 Amazon, OC Corded
27 USB3 Stick £50 Corsair Survivor Stealth 64GB Amazon, OC Stunning and fast
28 USB3 adapter cable £7 Akasa USB3 adapter cable Amazon, OC Converts usb3 to mb header
29 Thermal paste £10 Artic Silver 5 Amazon, OC Thermal paste and cleaner
30 Cable ties 1 £3 Large velcro cable ties Amazon, OC Multi-coloured and chunky
31 Cable ties 2 £4 Large velcro cable ties Amazon, OC Black and chunky
32 Cable ties 3 £12 Small velcro cable ties Amazon, OC Thin ties bulk pack
33 Power meter £19 Efergy Socket 2.0 Amazon, OC Shows real time watts used
34 Sata cables 1 £5 Akasa 50cm sata3 Amazon, OC Fat sata cable
35 Sata cables 2 £17 4x Akasa super-slim 50cm sata3 Amazon, OC Super slim form factor
36 DVD-R £7 25x Verbatim DVD-R Amazon, OC System restore discs
37 CD/DVD pens £5 Fellowes CD/DVD Pens Amazon, OC No comment
38 External HDD £62 Samsung M3 1TB Amazon, OC Out of case backup

A sneak peek at Project Rogue
Sadly there was only room for two but the sight of four was nice while it lasted!

What’s next in the series?

Project Rogue will become a series starting with a step by step account of its own build in the form of a photo journal and then continuing to document its evolution in real time. I will also focus on important aspects of a computer build that would be relevant to anyone attempting any build. Post Project Rogue there will be another good few builds coming each offering a progression of some kind of previous builds and if I haven’t run out of cash by then and had to shut the site down I’ll document those too! Stay tuned!