Sunday, December 20, 2020

Java asynchronous computation and performance

A few days ago I was using Java Future for some parallel computation and noticed something interesting with the performance.  Here are some findings.

Note: normally it is a bad idea to have Future threads updating variables outside their own scope.  This piece of code is for illustration purposes only.

Here is a simplified version of the code:

Basically it forks a number of threads based on number of CPU cores and each thread will be incrementing a variable repeatedly.

For reference, this was tested with OpenJDK 11 on a Ryzen 2200G CPU.

The variable being updated can either be an array defined within the lambda expression:

                for (long j = 0; j < 2500000000L; j++) {
                    local_result[0] += 1;
                }
                return local_result[0];

Or it can be an array defined outside:

                for (long j = 0; j < 2500000000L; j++) {
                    main_result[slot] += 1;
                }
                return main_result[slot];

And the two versions have huge difference in performance.  The local variable version finished in around 5 seconds while the one using external variable needed 35 seconds to complete.

What is going on? Comparing the byte codes (generated with "javap -c -p classname") of the lambda function between the two:

The version on the left is using variables outside the thread scope.  Note that:

(1) for external variable version, the array and index are implicitly passed into the lambda function as parameters

(2) otherwise the two versions are basically the same, except some variable numbering (e.g. lstore_3 vs lstore_1 for the loop counter)

Then why there is a big performance hit when using variables outside the lambda function?

My guess is, it comes from much lower level... the CPU cache.

Using "perf" to measure the the local variable version:

perf stat -e task-clock,cpu-migrations,page-faults,instructions,branch-misses,branches,cache-references,cache-misses java LocalVsParent
Total is 10000000000
Time taken: 4.96s

 Performance counter stats for 'java LocalVsParent':

         19,720.00 msec task-clock:u              #    3.915 CPUs utilized          
                 0      cpu-migrations:u          #    0.000 K/sec                  
             3,528      page-faults:u             #    0.179 K/sec                  
    60,434,426,159      instructions:u                                              
         4,149,399      branch-misses:u           #    0.04% of all branches        
    10,084,402,945      branches:u                #  511.379 M/sec                  
        39,389,694      cache-references:u        #    1.997 M/sec                  
         9,439,462      cache-misses:u            #   23.964 % of all cache refs    

       5.037206853 seconds time elapsed

      19.687272000 seconds user
       0.032057000 seconds sys

 

And the external variable version:

perf stat -e task-clock,cpu-migrations,page-faults,instructions,branch-misses,branches,cache-references,cache-misses java LocalVsParent
Total is 10000000000
Time taken: 35.28s

 Performance counter stats for 'java LocalVsParent':

        139,148.66 msec task-clock:u              #    3.935 CPUs utilized          
                 0      cpu-migrations:u          #    0.000 K/sec                  
             3,709      page-faults:u             #    0.027 K/sec                  
    60,463,724,838      instructions:u                                              
         4,343,415      branch-misses:u           #    0.04% of all branches        
    10,091,903,050      branches:u                #   72.526 M/sec                  
     1,658,685,818      cache-references:u        #   11.920 M/sec                  
     1,626,364,192      cache-misses:u            #   98.051 % of all cache refs    

      35.358457694 seconds time elapsed

     138.982741000 seconds user
       0.163984000 seconds sys

 

The number of instructions and branches etc are similar.  But see that the external variable version has a whopping 98% cache miss?  That is probably why it has such a poor performance.

Conclusion?  It is hard to predict cache handling when you are stressing the CPU using a high-level language such as Java.  Besides, when using Future or thread computation in Java, it is usually a good idea to avoid updating variables outside the thread scope as you will need to be aware of all the "volatile", "atomic", and "synchronized" stuff.



No comments: