Friday, July 16, 2021

Cost of implicit casting in Java

Java performs implicit widening conversions (e.g. from float to double). I was curious about the performance impact. For example:

// direct cast to double
methodWithDoubleAsParam((double)Integer.parseInt(anInteger))


vs

// explicitly cast to float and implicitly to double
methodWithDoubleAsParam((float)Integer.parseInt(anInteger))


For the above example, when looking at the generated byte code, the direct casting used the "i2d" (integer to double) instruction. The indirect and implicit casting used "i2f" and "f2d" instructions.

......
  private void lambda$run$2(java.lang.String[], int);
    Code:
       0: aload_0
       1: aload_1
       2: iconst_0
       3: aaload
       4: invokestatic  #20                 // Method java/lang/Integer.parseInt:(Ljava/lang/String;)I
       7: i2f
       8: f2d
       9: invokevirtual #21                 // Method dummy:(D)V
      12: return

  private void lambda$run$1(java.lang.String[], int);
    Code:
       0: aload_0
       1: aload_1
       2: iconst_0
       3: aaload
       4: invokestatic  #20                 // Method java/lang/Integer.parseInt:(Ljava/lang/String;)I
       7: i2d
       8: invokevirtual #21                 // Method dummy:(D)V
      11: return
......

 

On my machine with OpenJDK 11, the extra cost per call was about 2 ns.

Full source code available.
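As a side note, the two conversion paths are not always numerically equivalent, because a float cannot represent every int exactly. The following is a minimal sketch of that difference (it is not the benchmark mentioned above; the class and method names are made up for illustration):

```java
public class CastPaths {
    // Direct conversion: compiles to a single i2d instruction.
    static double direct(int v) {
        return (double) v;
    }

    // Explicit cast to float, then implicit widening to double (i2f followed by f2d).
    static double viaFloat(int v) {
        return (float) v;
    }

    public static void main(String[] args) {
        int small = 42;
        int large = 16_777_217; // 2^24 + 1, not exactly representable as a float

        System.out.println(direct(small) == viaFloat(small)); // prints true
        System.out.println(direct(large));   // prints 1.6777217E7
        System.out.println(viaFloat(large)); // prints 1.6777216E7
    }
}
```

So besides the extra instruction, going through float can silently round large values.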



Saturday, June 19, 2021

Squeezelite with docker

Besides the dedicated Raspberry Pis running piCorePlayer, I sometimes want to simulate a Squeezebox device on my work PC to play music. The easiest way is to run Squeezelite with Docker.

I made a minor patch to the docker run script so that I can pass upsampling parameters to it, as my USB DAC can handle up to 768 kHz.

Then just start the container with something like this, where D50s is my DAC:

docker run --rm \
  --env SQUEEZELITE_AUDIO_DEVICE=hw:CARD=D50s \
  --env SQUEEZELITE_SPECIFY_SERVER=yes \
  --env SQUEEZELITE_SERVER_PORT=192.168.100.6:3483 \
  --env SQUEEZELITE_NAME=openSUSE_PC \
  --env SQUEEZELITE_OPTS='-r 705600,768000 -R vE::4:28:99:100:50' \
  --device /dev/snd --name squeezelite --net host -d giof71/squeezelite

Monday, June 7, 2021

Ripping audio tracks from a DVD

Quick notes on ripping audio tracks from a DVD (Unplugged by The Corrs) from the command line:


# rip dvd by track (total 17)
for i in {1..17}; do
  mplayer dvd:// -chapter $i-$i -dumpstream -dumpfile $i.vob;
done

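# find out which audio stream has the 2-channel pcm and copy it into a flac file
for i in {1..17}; do
  # stream index of the pcm_dvd stream (the "index=" line precedes "codec_name=pcm_dvd")
  tmp=$(ffprobe -v error -show_format -show_streams "$i.vob" | grep -B 1 pcm_dvd | head -n 1 | cut -d '=' -f 2-)
  # convert to an audio-relative index; this assumes the video stream is stream 0
  ai=$((tmp - 1))
  ffmpeg -i "$i.vob" -map "0:a:$ai" -vn -f flac "$i.flac"
done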

# rename flac files by track... TBD


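# set metadata (expects file names like "01 Title.flac")
ALBUM='Unplugged'
ARTIST='The Corrs'
for i in *.flac; do
  # track number is the first space-separated field, with leading zeros stripped
  tracknumber=$(echo "$i" | cut -d ' ' -f 1 | sed 's/^0*//')
  # title is the rest of the name, up to the first period
  title=$(echo "$i" | cut -d ' ' -f 2- | cut -f 1 -d '.')
  metaflac --set-tag="ALBUM=$ALBUM" --set-tag="ARTIST=$ARTIST" --set-tag="tracknumber=$tracknumber" --set-tag="title=$title" --remove-tag=encoder "$i"
done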

Sunday, May 30, 2021

My current audio setup

 

This is my current audio setup with free software.

  • The Logitech Media Server (LMS) is the music hub. It serves local music files and streams music from Tidal
  • A couple of Raspberry Pis running piCorePlayer as music players. The piCorePlayer is lightweight enough that it can run on the very first generation of Raspberry Pi with just 256MB of RAM
  • More piCorePlayer devices can be added for multi-room setup
  • The desktop setup has a Topping D50s DAC. I recently replaced the KRK Rokit 5 Gen3 active speakers with passive Triangle BR03 speakers and a class-T amplifier. I am thinking of replacing the amplifier with a proper integrated amplifier (or maybe even a pre-amp + power amp) so I can add a turntable
  • The bedside headphone setup has an old Mini USB DAC and a DIY headamp


Tuesday, April 27, 2021

Experiment with Dart and Flutter


 

Decided to try out Dart and Flutter for Android development. Some thoughts after re-implementing AndSafe:

  • For simple / standard UI, defining and implementing each page in code is a lot easier than using a UI designer
  • If you know JS / node.js, writing Dart code feels like home
  • Although Dart uses ahead-of-time compiled code, CPU- and memory-heavy tasks (e.g. scrypt key derivation) are still noticeably slow. I ended up using Dart FFI to call a C implementation of scrypt.

I first implemented Android Safe 10 years ago in 2010. This rewrite is certainly more enjoyable and easier, thanks to a more mature development environment.

Wednesday, March 17, 2021

Limit cloudflared upstream connections

I just archived my cloudflared patch repo. My patch was a hack to get around a runaway issue: when there is a sudden inrush of requests or a network delay, cloudflared creates lots of connections to the upstream DNS-over-HTTPS servers. This triggers the upstream to throttle cloudflared, which causes it to create even more connections. The machine ends up with high CPU usage and no DNS requests being resolved.

That is because cloudflared used Go's "http.Transport" for the connections without setting a maximum. My hack hard-coded the maximum number of connections to 2 to avoid the issue, but that is probably inappropriate if cloudflared is used in an enterprise environment.

Luckily someone worked on a fix that adds a command line parameter to specify the maximum number of connections. Just add "--max-upstream-conns num_con" as a parameter when starting cloudflared.

Sunday, February 28, 2021

Penney's Game

Just finished the book Humble Pi. It is an interesting read, especially if you like math trivia and want to read about how small math mistakes can cause serious disasters.

The chapter about probability mentions Penney's Game and how people's misconceptions about independent events make the game interesting.

I ended up writing a program that simulates coin flipping to verify the claims. The results checked out, but I did make a mistake when first implementing the logic.

When looping over the coin flip results, I used a counter to keep track of the number of matched flips for each player. When a mismatch happened, I incorrectly reset the counter to zero. Instead, I should have checked whether the last one or two flips still match the start of the player's pattern. For example, with the pattern HHT and the flips HHH, the last two flips still match HH, so the counter should stay at 2 rather than drop to 0.
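One way to sidestep the counter problem entirely is to compare each pattern against the tail of the flip history. The following is a minimal sketch of such a simulation, not the program described above; the class name, method names, and the seed are assumptions made for illustration:

```java
import java.util.Random;

public class PenneyGame {
    /** Plays one game; returns 0 if patternA shows up first, 1 if patternB does. */
    static int playOnce(String patternA, String patternB, Random rng) {
        StringBuilder flips = new StringBuilder();
        while (true) {
            flips.append(rng.nextBoolean() ? 'H' : 'T');
            // Check the tail of the whole history instead of keeping a match counter,
            // so a partial match is never thrown away by a reset.
            String history = flips.toString();
            if (history.endsWith(patternA)) return 0;
            if (history.endsWith(patternB)) return 1;
        }
    }

    /** Fraction of games won by the player holding patternB. */
    static double winRateOfB(String patternA, String patternB, int games, long seed) {
        Random rng = new Random(seed);
        int bWins = 0;
        for (int g = 0; g < games; g++) {
            bWins += playOnce(patternA, patternB, rng);
        }
        return (double) bWins / games;
    }

    public static void main(String[] args) {
        // HHH can only beat THH if the very first three flips are HHH,
        // so in theory THH wins 7 out of 8 games.
        System.out.println(winRateOfB("HHH", "THH", 100_000, 42L));
    }
}
```

Rebuilding the history string on every flip is wasteful, but games are short, so it keeps the sketch easy to verify.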


Tuesday, January 26, 2021

Stable matching

While cleaning my working directory, I found a Python function that I wrote a few months ago for solving the stable marriage problem. But I couldn't remember why I wrote it...

Anyway, refactored and pushed to github.
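For readers unfamiliar with the problem, the classic approach is the Gale-Shapley algorithm. Below is a minimal Java sketch of it. It is not a port of the Python function mentioned above, and the names and the sample preference lists are made up for illustration:

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

public class StableMatching {
    /**
     * proposerPrefs[p] lists receivers in p's order of preference (best first);
     * receiverPrefs[r] lists proposers likewise. Both sides have the same size n.
     * Returns an array where result[r] is the proposer matched to receiver r.
     */
    static int[] match(int[][] proposerPrefs, int[][] receiverPrefs) {
        int n = proposerPrefs.length;

        // rank[r][p] = position of proposer p in receiver r's list (lower is better)
        int[][] rank = new int[n][n];
        for (int r = 0; r < n; r++) {
            for (int i = 0; i < n; i++) {
                rank[r][receiverPrefs[r][i]] = i;
            }
        }

        int[] partnerOf = new int[n]; // receiver -> proposer, -1 while unmatched
        Arrays.fill(partnerOf, -1);
        int[] next = new int[n];      // proposer -> index of the next receiver to try
        Deque<Integer> free = new ArrayDeque<>();
        for (int p = 0; p < n; p++) free.push(p);

        while (!free.isEmpty()) {
            int p = free.pop();
            int r = proposerPrefs[p][next[p]++];
            int current = partnerOf[r];
            if (current == -1) {
                partnerOf[r] = p;            // receiver was unmatched: accept
            } else if (rank[r][p] < rank[r][current]) {
                partnerOf[r] = p;            // receiver prefers p: swap partners
                free.push(current);
            } else {
                free.push(p);                // rejected: p tries the next choice later
            }
        }
        return partnerOf;
    }

    public static void main(String[] args) {
        int[][] proposerPrefs = {{0, 1, 2}, {0, 2, 1}, {1, 0, 2}};
        int[][] receiverPrefs = {{1, 0, 2}, {0, 1, 2}, {0, 1, 2}};
        System.out.println(Arrays.toString(match(proposerPrefs, receiverPrefs))); // prints [1, 0, 2]
    }
}
```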

Thursday, December 24, 2020

Tamp down Logitech Media Server polling

The latest Logitech Media Server 8.0 now has online music library integration. However, by default, it polls the music service every hour. That is too much for my poor Raspberry Pi with only 512MB of memory and a slow CPU, not to mention that it is also running my DNS ad-blocking and tunneling services.

A simple fix is to modify the polling interval: edit the variable POLLING_INTERVAL in the file Plugin.pm, which can be found under /usr/share/perl5/Slim/Plugin/OnlineLibrary.

It also helps to lower the scanner priority: go to Server Settings, under Performance, change "Scanner Priority" to something lower than normal.


Sunday, December 20, 2020

Java asynchronous computation and performance

A few days ago I was using Java Future for some parallel computation and noticed something interesting with the performance.  Here are some findings.

Note: normally it is a bad idea to have Future threads updating variables outside their own scope.  This piece of code is for illustration purposes only.

Here is a simplified version of the code:

Basically it forks a number of threads based on the number of CPU cores, and each thread increments a variable repeatedly.

For reference, this was tested with OpenJDK 11 on a Ryzen 2200G CPU.

The variable being updated can either be an array defined within the lambda expression:

                for (long j = 0; j < 2500000000L; j++) {
                    local_result[0] += 1;
                }
                return local_result[0];

Or it can be an array defined outside:

                for (long j = 0; j < 2500000000L; j++) {
                    main_result[slot] += 1;
                }
                return main_result[slot];

The two versions have a huge difference in performance. The local variable version finished in around 5 seconds, while the one using the external variable needed 35 seconds to complete.
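For anyone who wants to try this, here is a minimal sketch of how such a test could be put together. It is a reconstruction from the description and the two loop fragments, not the original LocalVsParent source; the class name, the `run` helper, and the reduced iteration count are assumptions:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class LocalVsParentSketch {
    /** Runs one task per thread and sums the counters; useLocal picks which array is incremented. */
    static long run(int threads, long iterations, boolean useLocal) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        long[] mainResult = new long[threads]; // defined outside the lambda, one slot per task
        List<Future<Long>> futures = new ArrayList<>();
        try {
            for (int i = 0; i < threads; i++) {
                final int slot = i;
                futures.add(pool.submit(() -> {
                    if (useLocal) {
                        long[] localResult = new long[1]; // defined inside the lambda
                        for (long j = 0; j < iterations; j++) {
                            localResult[0] += 1;
                        }
                        return localResult[0];
                    }
                    for (long j = 0; j < iterations; j++) {
                        mainResult[slot] += 1;
                    }
                    return mainResult[slot];
                }));
            }
            long total = 0;
            for (Future<Long> f : futures) {
                total += f.get();
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int threads = Runtime.getRuntime().availableProcessors();
        long iterations = 100_000_000L; // far fewer than the 2,500,000,000 used in the post
        for (boolean useLocal : new boolean[] {true, false}) {
            long start = System.nanoTime();
            long total = run(threads, iterations, useLocal);
            System.out.printf("useLocal=%b total=%d time=%.2fs%n",
                    useLocal, total, (System.nanoTime() - start) / 1e9);
        }
    }
}
```

Timings will vary with the JVM and the hardware, so the gap may look different from the numbers above.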

What is going on? Comparing the bytecode (generated with "javap -c -p classname") of the lambda function between the two:

The version on the left is using variables outside the thread scope.  Note that:

(1) for the external variable version, the array and index are implicitly passed into the lambda function as parameters

(2) otherwise the two versions are basically the same, except for some variable numbering (e.g. lstore_3 vs lstore_1 for the loop counter)

Then why is there a big performance hit when using variables outside the lambda function?

My guess is that it comes from a much lower level... the CPU cache.

Using "perf" to measure the local variable version:

perf stat -e task-clock,cpu-migrations,page-faults,instructions,branch-misses,branches,cache-references,cache-misses java LocalVsParent
Total is 10000000000
Time taken: 4.96s

 Performance counter stats for 'java LocalVsParent':

         19,720.00 msec task-clock:u              #    3.915 CPUs utilized          
                 0      cpu-migrations:u          #    0.000 K/sec                  
             3,528      page-faults:u             #    0.179 K/sec                  
    60,434,426,159      instructions:u                                              
         4,149,399      branch-misses:u           #    0.04% of all branches        
    10,084,402,945      branches:u                #  511.379 M/sec                  
        39,389,694      cache-references:u        #    1.997 M/sec                  
         9,439,462      cache-misses:u            #   23.964 % of all cache refs    

       5.037206853 seconds time elapsed

      19.687272000 seconds user
       0.032057000 seconds sys

 

And the external variable version:

perf stat -e task-clock,cpu-migrations,page-faults,instructions,branch-misses,branches,cache-references,cache-misses java LocalVsParent
Total is 10000000000
Time taken: 35.28s

 Performance counter stats for 'java LocalVsParent':

        139,148.66 msec task-clock:u              #    3.935 CPUs utilized          
                 0      cpu-migrations:u          #    0.000 K/sec                  
             3,709      page-faults:u             #    0.027 K/sec                  
    60,463,724,838      instructions:u                                              
         4,343,415      branch-misses:u           #    0.04% of all branches        
    10,091,903,050      branches:u                #   72.526 M/sec                  
     1,658,685,818      cache-references:u        #   11.920 M/sec                  
     1,626,364,192      cache-misses:u            #   98.051 % of all cache refs    

      35.358457694 seconds time elapsed

     138.982741000 seconds user
       0.163984000 seconds sys

 

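The numbers of instructions, branches, etc. are similar. But see that the external variable version has a whopping 98% cache miss rate? That is probably why it performs so poorly. A likely explanation is false sharing: the slots of the shared array sit next to each other in memory, so several threads keep writing to the same cache line and invalidating each other's copies.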
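One common cause of this kind of cache-miss pattern is false sharing, where threads write to neighbouring memory locations that fall on the same cache line. A simple way to test that hypothesis is to space the per-thread slots further apart. The sketch below does that. It is an illustration, not something measured in this post; the class name, the `STRIDE` constant, and the assumption that a cache line is at most 128 bytes are mine:

```java
public class PaddedSlots {
    // 16 longs = 128 bytes between slots; assumes cache lines are no larger than that.
    static final int STRIDE = 16;

    /** Each thread increments its own slot; stride controls how far apart the slots are. */
    static long run(int threads, long iterations, int stride) {
        long[] counters = new long[threads * stride];
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int slot = t * stride;
            workers[t] = new Thread(() -> {
                for (long j = 0; j < iterations; j++) {
                    counters[slot] += 1;
                }
            });
            workers[t].start();
        }
        try {
            for (Thread w : workers) {
                w.join(); // join makes each worker's writes visible to this thread
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        long total = 0;
        for (int t = 0; t < threads; t++) {
            total += counters[t * stride];
        }
        return total;
    }

    public static void main(String[] args) {
        int threads = Runtime.getRuntime().availableProcessors();
        long iterations = 100_000_000L;
        for (int stride : new int[] {1, STRIDE}) {
            long start = System.nanoTime();
            long total = run(threads, iterations, stride);
            System.out.printf("stride=%d total=%d time=%.2fs%n",
                    stride, total, (System.nanoTime() - start) / 1e9);
        }
    }
}
```

If false sharing is the culprit, the padded run should be noticeably faster than the run with stride 1, although the JIT compiler may optimise both loops in ways that blur the difference.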

Conclusion?  It is hard to predict cache handling when you are stressing the CPU using a high-level language such as Java.  Besides, when using Future or thread computation in Java, it is usually a good idea to avoid updating variables outside the thread scope as you will need to be aware of all the "volatile", "atomic", and "synchronized" stuff.