Monday, August 1, 2022

Java Collections.shuffle poor performance with parallelism

 Previously when using a Java program to simulate the 100 prisoners problem, noticed an interesting performance issue: the Collections.shuffle method will get slower when there are more threads to run the simulation in parallel. And running the VisualVM profiler does indicate that the performance issue came from Collections.shuffle when preparing a list of randomized integer.

This exercise is an attempt to collect more information and do an initial analysis.


The prisoner problem simulation needed a list of n integer from 1 to n (or 0 to n-1) randomly stored in it.  Here is a simplified implementation of two approaches.

The easy approach is to add 0..n-1 to the list and then use Collections.shuffle to randomize it.

private static List<Integer> initBoxesByShuffle() {
List<Integer> boxes = new ArrayList<>(LIST_SIZE);
for (int i = 0; i < LIST_SIZE; i++) {
return boxes;

A not-so-efficient approach is to loop from 0 to n-1 and insert each value at random places of the list. This is slower as the ArrayList.add(pos, val) method will need to shift all existing elements after the insertion point by one position (using other List implementation will help but let's just assume we are stuck with ArrayList).

private static List<Integer> initBoxesByRandInsert() {
List<Integer> boxes = new ArrayList<>(LIST_SIZE);
Random rand = new Random();
for (int i = 1; i < LIST_SIZE; i++) {
boxes.add(rand.nextInt(boxes.size()), i);
return boxes;

Performance was measured with the perf command, e.g.:

perf stat -e task-clock,cpu-migrations,page-faults,instructions,branch-misses
,branches,cache-misses,cache-references /usr/lib64/jvm/java-11-openjdk/bin/java -classpath . ShufflePerformance

Machine used was an AMD 5600G (6C12T) with 44GiB of memory running OpenSUSE Tumbleweed. JDK used was OpenJDK 11.0.16.


Performance data collected when trying to run the two approaches with different number of threads (range from 1 to 5):

A few things to note:

  • Row 1 and 6 show the two approaches executed with single thread (parallelism = 1). The time elapsed (i.e. execution time) of Collections.shuffle is much better than random insert (~10 seconds vs ~30 seconds).
  • However, when the parallelism is increased to divide the same amount of work among different threads, suddenly Collections.shuffle is getting slower. Looking at the time elapsed columns for Collections.shuffle (row 1 to row 5), the execution time roughly doubled every time an extra thread is added. In contrast, for the random insert approach (row 6 to row 10), the execution time went down steadily as more threads are added


  • And looking at the user time column (i.e. amount of work done by all CPU cores), the values stayed roughly the same for random insert approach. This explain why when more threads were added, the execution time decreased. But, the values for the Collections.shuffle kept increasing when more threads were added. That means the effort actually increased when we tried to divide the same task among more threads.
  • That was interesting because if we look at the instructions and branches columns, we can see that (1) the numbers stayed relatively the same for each approach no matter how many threads we were using; and (2) the random insert approach actually requires more works. 
  • The random insert approach has a much higher branch-misses (both by count and by %) but they stay relatively the same regardless of how many threads were used. However, the Collections.shuffle approach saw the branch-misses % figures going up with number of threads.


  • The cache-misses % stated relatively the same for random insert approach but reached over 80% for Collections.shuffle.


Initial analysis

If the high cache-misses rate is the reason of the poor performance, what is the cause? And will a different architecture (OS and CPU) yield a different result?

In theory the Collections.shuffle should be more efficient as it runs with linear time by swapping elements within the list without additional copying or moving. Could it be the case that the random access nature of the swapping needed the whole ArrayList to be in cache for it to run efficiently?