A few days ago I was using Java Future for some parallel computation and noticed something interesting about the performance. Here are some findings.
Note: normally it is a bad idea to have Future threads update variables outside their own scope. This piece of code is for illustration purposes only.
Here is a simplified version of the code:
Basically it forks a number of threads based on the number of CPU cores, and each thread increments a variable repeatedly.
For reference, this was tested with OpenJDK 11 on a Ryzen 2200G CPU.
The variable being updated can either be an array defined within the lambda expression:
long[] local_result = new long[1]; // declared inside the lambda
for (long j = 0; j < 2500000000L; j++) {
    local_result[0] += 1;
}
return local_result[0];
Or it can be an array defined outside:
// main_result and slot are declared outside and captured by the lambda
for (long j = 0; j < 2500000000L; j++) {
    main_result[slot] += 1;
}
return main_result[slot];
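For context, the loop bodies above might sit in a harness like the following. This is a sketch reconstructed from the description, not the original source: the method names runLocal/runShared are my own, and the per-thread iteration count in main is reduced from the original 2,500,000,000 so the demo finishes quickly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class LocalVsParent {

    // Local-variable version: each lambda increments an array it allocates itself.
    static long runLocal(int threads, long iterations) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<Long>> futures = new ArrayList<>();
        for (int i = 0; i < threads; i++) {
            futures.add(pool.submit(() -> {
                long[] local_result = new long[1];
                for (long j = 0; j < iterations; j++) {
                    local_result[0] += 1;
                }
                return local_result[0];
            }));
        }
        long total = 0;
        for (Future<Long> f : futures) {
            total += f.get();
        }
        pool.shutdown();
        return total;
    }

    // External-variable version: each thread writes into its own slot of an
    // array owned by the calling thread (no two threads share a slot).
    static long runShared(int threads, long iterations) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        long[] main_result = new long[threads];
        List<Future<Long>> futures = new ArrayList<>();
        for (int i = 0; i < threads; i++) {
            final int slot = i;
            futures.add(pool.submit(() -> {
                for (long j = 0; j < iterations; j++) {
                    main_result[slot] += 1;
                }
                return main_result[slot];
            }));
        }
        long total = 0;
        for (Future<Long> f : futures) {
            total += f.get();
        }
        pool.shutdown();
        return total;
    }

    public static void main(String[] args) throws Exception {
        int cores = Runtime.getRuntime().availableProcessors();
        long n = 10_000_000L; // original benchmark used 2_500_000_000L per thread
        long start = System.nanoTime();
        long total = runShared(cores, n);
        System.out.println("Total is " + total);
        System.out.printf("Time taken: %.2fs%n", (System.nanoTime() - start) / 1e9);
    }
}
```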
And the two versions show a huge difference in performance: the local-variable version finished in around 5 seconds, while the external-variable version needed 35 seconds to complete.
What is going on? Let's compare the bytecode (generated with "javap -c -p classname") of the lambda functions between the two versions:
The version on the left uses variables outside the thread scope. Note that:
(1) for the external-variable version, the array and index are implicitly passed into the lambda function as parameters
(2) otherwise the two versions are basically the same, apart from some variable numbering (e.g. lstore_3 vs lstore_1 for the loop counter)
Then why is there such a big performance hit when using variables outside the lambda function?
My guess is that it comes from a much lower level: the CPU cache.
Using "perf" to measure the local variable version:
perf stat -e task-clock,cpu-migrations,page-faults,instructions,branch-misses,branches,cache-references,cache-misses java LocalVsParent
Total is 10000000000
Time taken: 4.96s
Performance counter stats for 'java LocalVsParent':
19,720.00 msec task-clock:u # 3.915 CPUs utilized
0 cpu-migrations:u # 0.000 K/sec
3,528 page-faults:u # 0.179 K/sec
60,434,426,159 instructions:u
4,149,399 branch-misses:u # 0.04% of all branches
10,084,402,945 branches:u # 511.379 M/sec
39,389,694 cache-references:u # 1.997 M/sec
9,439,462 cache-misses:u # 23.964 % of all cache refs
5.037206853 seconds time elapsed
19.687272000 seconds user
0.032057000 seconds sys
And the external variable version:
perf stat -e task-clock,cpu-migrations,page-faults,instructions,branch-misses,branches,cache-references,cache-misses java LocalVsParent
Total is 10000000000
Time taken: 35.28s
Performance counter stats for 'java LocalVsParent':
139,148.66 msec task-clock:u # 3.935 CPUs utilized
0 cpu-migrations:u # 0.000 K/sec
3,709 page-faults:u # 0.027 K/sec
60,463,724,838 instructions:u
4,343,415 branch-misses:u # 0.04% of all branches
10,091,903,050 branches:u # 72.526 M/sec
1,658,685,818 cache-references:u # 11.920 M/sec
1,626,364,192 cache-misses:u # 98.051 % of all cache refs
35.358457694 seconds time elapsed
138.982741000 seconds user
0.163984000 seconds sys
The numbers of instructions, branches, etc. are similar. But see that the external-variable version has a whopping 98% cache-miss rate? That is almost certainly why its performance is so poor.
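A plausible explanation for those misses (my reading, not stated in the perf output) is false sharing: the per-thread slots of main_result are adjacent 8-byte longs, so several of them sit on the same 64-byte cache line, and every write by one core invalidates that line in the other cores' caches. One way to probe this hypothesis is to pad the shared array so each counter gets a cache line of its own. The sketch below is my own variant, not part of the original benchmark; note too that Java does not guarantee array alignment to cache-line boundaries, so this is an approximation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Padded variant of the external-variable benchmark: slots are spread
// 8 longs (64 bytes) apart so no two counters share a cache line.
public class PaddedCounter {
    static final int PAD = 8; // 8 longs * 8 bytes = 64 bytes, a typical cache-line size

    static long run(int threads, long iterations) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        long[] main_result = new long[threads * PAD];
        List<Future<Long>> futures = new ArrayList<>();
        for (int i = 0; i < threads; i++) {
            final int slot = i * PAD; // each counter starts a new cache line
            futures.add(pool.submit(() -> {
                for (long j = 0; j < iterations; j++) {
                    main_result[slot] += 1;
                }
                return main_result[slot];
            }));
        }
        long total = 0;
        for (Future<Long> f : futures) {
            total += f.get();
        }
        pool.shutdown();
        return total;
    }

    public static void main(String[] args) throws Exception {
        long total = run(Runtime.getRuntime().availableProcessors(), 10_000_000L);
        System.out.println("Total is " + total);
    }
}
```

If false sharing is indeed the culprit, the padded version should run close to the local-variable timing; the JDK also offers the @jdk.internal.vm.annotation.Contended annotation for the same purpose inside the JDK itself.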
Conclusion? It is hard to predict cache behavior when you are stressing the CPU from a high-level language such as Java. Besides, when using Future or other threaded computation in Java, it is usually a good idea to avoid updating variables outside the thread scope, as otherwise you will need to be aware of all the "volatile", "atomic", and "synchronized" stuff.