With the JDK 7 developer preview out and a final release fast approaching, it’s important not only to become aware of what the new version offers but also, in areas where existing programming paradigms have radically changed, to make a mental shift in the way we think and to understand how best to leverage these new paradigms to our advantage. One such area is that of finding and exploiting latent parallelism using a fine grained parallelism approach.
As I mentioned in my previous post about the JDK 7 developer preview release, we’ve been using jsr166y and extra166y at work for some time now. This post stems from an impassioned discussion about finding and exploiting areas of latent parallelism in code, so here’s what I have to say on the matter (inspired, obviously, by Doug Lea, Brian Goetz and my esteemed colleagues). The traditional and very much outdated mindset understands only threads and, since Java 5, the executor framework on top of them. However, this mechanism is fundamentally limited by its design in the extent of parallelism it can offer.
Firstly, threads are expensive, not only in their creation and stack size allocation but also in terms of context switching between them. Deciding how many threads to have is also, at best, an educated guess. A particular service within a process may decide to use all available cores, but if every service in the process does the same then you end up with a disproportionately large number of threads; I have worked with applications with more than 150-200 threads operating at a time. Secondly, the executor framework has helped considerably by taking some of the decision making away from the developer and absorbing that complexity, but it still suffers from heavy contention between multiple threads on the single internal queue of tasks that it holds, which again adversely impacts performance. Thirdly, threads and executor frameworks normally do not scale up or down based on the hardware that they’re running on, and they certainly do not scale based on load. Their degree of parallelism is essentially fixed by their underlying design.
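To make the legacy approach concrete, here is a minimal sketch of the kind of code I have in mind. The class and method names (LegacyExecutorSketch, sum) are hypothetical and exist only for illustration. It sizes a fixed pool with the usual educated guess of one thread per core and hands every chunk of work to the pool’s single shared queue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration of the thread/executor mindset; not taken from any real code base.
class LegacyExecutorSketch {

    // Sums the array by cutting it into one chunk per pool thread.
    static long sum(final long[] data) {
        // The educated guess: one thread per core. If every service in the
        // process makes the same guess, the thread count multiplies.
        int threads = Runtime.getRuntime().availableProcessors();
        // A fixed pool keeps all pending tasks in one shared queue.
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            int chunk = (data.length + threads - 1) / threads;
            List<Future<Long>> parts = new ArrayList<Future<Long>>();
            for (int start = 0; start < data.length; start += chunk) {
                final int from = start;
                final int to = Math.min(data.length, start + chunk);
                parts.add(pool.submit(new Callable<Long>() {
                    public Long call() {
                        long s = 0;
                        for (int i = from; i < to; i++) {
                            s += data[i];
                        }
                        return s;
                    }
                }));
            }
            long total = 0;
            for (Future<Long> part : parts) {
                total += part.get(); // blocks until that chunk is done
            }
            return total;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Note that the chunking is decided up front from the thread count, so the shape of the work is tied to the guess about the hardware rather than adapting to it.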
Enter the fork/join framework and parallel arrays (FJ/PA for short). This is not a paragraph about how to use these new features but, in my opinion, a far more important note on how to rid ourselves of a legacy mindset on parallelism and make room for a new one. The fork/join framework and parallel arrays (which are backed internally by the fork/join framework and its fork/join pool) should not be perceived only as threading tools. That’s very dangerous, as it means that we are only likely to use them in those areas where we previously used threads. They can in fact help us find and exploit areas of latent parallelism.
What does that mean? In all applications there are areas of code that operate sequentially. This code may be thread confined or stack confined, and we almost never reconsider the way it performs. With FJ/PA we can now start making these areas concurrent. How is this an improvement? Well, FJ/PA offer the following key features, which make them an ideal fit for such a use case.
Firstly, the way they add value is fundamentally decoupled from the number of threads, which is a good thing. They tend to perform well regardless of how many threads they are using. Secondly, instead of using a single work queue for all threads, they use one work queue per thread. This means further decoupling between threads and the way tasks are stored. Thirdly, given multiple work queues and multiple threads, FJ/PA perform work stealing. Every queue is a double ended queue, and when one thread has completed all its own tasks it starts to process tasks from the tail of another thread’s queue. Because it is dequeuing off the tail, there is almost no contention with the owner of the queue, which is dequeuing from the head. Not only that, but because tasks are split recursively, the largest tasks tend to sit towards the tail of each queue, so when a thread does steal work it gets enough to keep itself busy for a while, which lengthens the interval before it needs to steal again and thereby reduces contention further. And finally, and most importantly, a given piece of FJ/PA code will not only scale up but also effectively scale down, based not only on the hardware it runs on but also on the load of the incoming work. When you understand this new paradigm, the legacy paradigm suddenly seems primitive and fundamentally stunted.
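The features above can be sketched with a small recursive sum. This is only an illustration under stated assumptions: the names ForkJoinSumSketch, SumTask and THRESHOLD are hypothetical, and the cutoff value is an arbitrary guess rather than a tuned number. It uses the JDK 7 package java.util.concurrent; on Java 6 with jsr166y the same class names live in the jsr166y package instead.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical illustration of recursive decomposition on a fork/join pool.
class ForkJoinSumSketch {

    // Assumed cutoff below which a range is summed sequentially.
    static final int THRESHOLD = 10000;

    static class SumTask extends RecursiveTask<Long> {
        private final long[] data;
        private final int from;
        private final int to;

        SumTask(long[] data, int from, int to) {
            this.data = data;
            this.from = from;
            this.to = to;
        }

        @Override
        protected Long compute() {
            if (to - from <= THRESHOLD) {
                // Small enough: do the work directly on this thread.
                long s = 0;
                for (int i = from; i < to; i++) {
                    s += data[i];
                }
                return s;
            }
            int mid = (from + to) >>> 1;
            SumTask left = new SumTask(data, from, mid);
            SumTask right = new SumTask(data, mid, to);
            // fork() pushes the left half onto this worker's own deque,
            // where an idle worker is free to steal it.
            left.fork();
            // Keep working on the right half on the current thread.
            long rightSum = right.compute();
            return left.join() + rightSum;
        }
    }

    static long sum(long[] data) {
        // The no-argument constructor sizes the pool to the available processors.
        ForkJoinPool pool = new ForkJoinPool();
        try {
            return pool.invoke(new SumTask(data, 0, data.length));
        } finally {
            pool.shutdown();
        }
    }
}
```

Notice that nothing in the task mentions a thread count. The code only describes how to split the work, and the pool decides how much of that potential parallelism to actually use.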
So the next time you are browsing your code, consider using jsr166y and extra166y to find and exploit latent areas of parallelism. As a rule of thumb, this approach works best for operations that are CPU intensive, while the legacy paradigm is better for IO or network bound operations. If operations are IO or network bound, threads spend most of their time waiting, so there is less contention and the limitations of the legacy paradigm are less exposed. Don’t forget that the two libraries above can be used in Java 6, so there’s no need to wait for Java 7!