Thursday, December 24, 2020

Tamp down Logitech Media Server polling

With the latest Logitech Media Server 8.0, it now has online music library integration. However, by default, it polls the music service every hour. That is too much for my poor Raspberry Pi with only 512MB of memory and a slow CPU.  Not to mention it is also running my DNS ads blocking and tunneling services.

A simple fix is to modify the polling interval (the file can be found under /usr/share/perl5/Slim/Plugin/OnlineLibrary. Edit the variable POLLING_INTERVAL).

It also helps to lower the scanner priority: go to Server Settings, under Performance, change "Scanner Priority" to something lower than normal.

Sunday, December 20, 2020

Java asynchronous computation and performance

A few days ago I was using Java Future for some parallel computation and noticed something interesting with the performance.  Here are some findings.

Note: normally it is a bad idea to have Future threads updating variables outside their own scope.  This piece of code is for illustration purposes only.

Here is a simplified version of the code:

Basically it forks a number of threads based on number of CPU cores and each thread will be incrementing a variable repeatedly.

For reference, this was tested with OpenJDK 11 on a Ryzen 2200G CPU.

The variable being updated can either be an array defined within the lambda expression:

                for (long j = 0; j < 2500000000L; j++) {
                    local_result[0] += 1;
                return local_result[0];

Or it can be an array defined outside:

                for (long j = 0; j < 2500000000L; j++) {
                    main_result[slot] += 1;
                return main_result[slot];

And the two versions have huge difference in performance.  The local variable version finished in around 5 seconds while the one using external variable needed 35 seconds to complete.

What is going on? Comparing the byte codes (generated with "javap -c -p classname") of the lambda function between the two:

The version on the left is using variables outside the thread scope.  Note that:

(1) for external variable version, the array and index are implicitly passed into the lambda function as parameters

(2) otherwise the two versions are basically the same, except some variable numbering (e.g. lstore_3 vs lstore_1 for the loop counter)

Then why there is a big performance hit when using variables outside the lambda function?

My guess is, it comes from much lower level... the CPU cache.

Using "perf" to measure the the local variable version:

perf stat -e task-clock,cpu-migrations,page-faults,instructions,branch-misses,branches,cache-references,cache-misses java LocalVsParent
Total is 10000000000
Time taken: 4.96s

 Performance counter stats for 'java LocalVsParent':

         19,720.00 msec task-clock:u              #    3.915 CPUs utilized          
                 0      cpu-migrations:u          #    0.000 K/sec                  
             3,528      page-faults:u             #    0.179 K/sec                  
    60,434,426,159      instructions:u                                              
         4,149,399      branch-misses:u           #    0.04% of all branches        
    10,084,402,945      branches:u                #  511.379 M/sec                  
        39,389,694      cache-references:u        #    1.997 M/sec                  
         9,439,462      cache-misses:u            #   23.964 % of all cache refs    

       5.037206853 seconds time elapsed

      19.687272000 seconds user
       0.032057000 seconds sys


And the external variable version:

perf stat -e task-clock,cpu-migrations,page-faults,instructions,branch-misses,branches,cache-references,cache-misses java LocalVsParent
Total is 10000000000
Time taken: 35.28s

 Performance counter stats for 'java LocalVsParent':

        139,148.66 msec task-clock:u              #    3.935 CPUs utilized          
                 0      cpu-migrations:u          #    0.000 K/sec                  
             3,709      page-faults:u             #    0.027 K/sec                  
    60,463,724,838      instructions:u                                              
         4,343,415      branch-misses:u           #    0.04% of all branches        
    10,091,903,050      branches:u                #   72.526 M/sec                  
     1,658,685,818      cache-references:u        #   11.920 M/sec                  
     1,626,364,192      cache-misses:u            #   98.051 % of all cache refs    

      35.358457694 seconds time elapsed

     138.982741000 seconds user
       0.163984000 seconds sys


The number of instructions and branches etc are similar.  But see that the external variable version has a whopping 98% cache miss?  That is probably why it has such a poor performance.

Conclusion?  It is hard to predict cache handling when you are stressing the CPU using a high-level language such as Java.  Besides, when using Future or thread computation in Java, it is usually a good idea to avoid updating variables outside the thread scope as you will need to be aware of all the "volatile", "atomic", and "synchronized" stuff.

Thursday, December 17, 2020

Estimating the value of pi

 Just for fun... and testing out the Python ray library for distributed computing: here are some simple scripts to estimate the value of pi.

Tuesday, December 15, 2020

Rederly demo

Glad to learn that Rederly is going to take the WeBWorK Open Problem Library and utilize it in a new platform.  The project is open source, but seems that they also sell it as a service.

Took it for a spin. The renderer seems solid. Backend has some strange design but overall ok.  At least it makes sense when you look at the code.  Some functionalities still missing (e.g. class enrollment by admin) or require tuning (e.g. listing the whole class of several hundred students is going to be slow) though.

Wrote some scripts to run the full stack (frontend, backend, renderer, and db) as containers.  Also tested it on Azure with docker-compose (with minor changes on disk volume mount).

Saturday, December 12, 2020

Google CTF 2020 - writeonly

Stumbled across a writeup on the Google CTF 2020 event.  Found it interesting that although the team used pwntools, they made it over-complicated by writing the shellcode in C and then extracted the assembly code for the exploit injection.  Isn't it the whole point of using pwntools is to help you generate the shellcode assembly?

Anyway, I looked up the challenge and found source code for the challenge itself and an implementation of a clean (official?) exploit.  Here are the results of me playing with that code.  Changes include:

- modified the Dockerfile so I can run the challenge locally.  Used socat to expose the executable via port 1337 of the container

- as for the actual exploit, instead of doing a complete shellcode injection, I modified to code to just dump the flag file.

- this modification also avoided overwriting the child code with bunch of NOPs. It injects code precisely at the start of the infinite loop of the child thread (check_flag+0x8). This can be found by looking at the end of the disassembled code of the check_flag function:

 4022d9:       bf 01 00 00 00          mov    $0x1,%edi
 4022de:       e8 fd cf 04 00          callq  44f2e0 <__sleep>
 4022e3:       e9 52 ff ff ff          jmpq   40223a <check_flag+0x8>


- commands used to build the docker image, disassembling child's function, and running the exploit etc can be found in the Makefile


Detailed description of the challenge and complete source code available on github.

Friday, December 4, 2020

How (not) to add another dimension to a relational database table

This is just a rant. I am not going to mention the name of the software, ok?

So I was working on an open source project, trying to add a new feature to it. The software can display some questions in random order and allow user to enter answers. Say we have five questions, Question A to Question E, randomly shown and the user entered answers as such:

What is being displayed on screen...
Question BAnswer B
Question EAnswer E
Question DAnswer D
Question CAnswer C
Question AAnswer A

And here are the corresponding rows in database when user saved those answers:

What is being stored on db...
1Question AAnswer B
2Question BAnswer E
3Question CAnswer D
4Question DAnswer C
5Question EAnswer A

Instead of using one row to store each question and answer pair, whoever wrote that code decided to utilize the "answer" column independently from other columns and store values in the order displayed on screen!!

They even included comments in the code:

# note that answers are stored in display order...

I mean... WTF!? How am I going to retrieve the data? Am I supposed to use the same random seed to see how questions are ordered on screen and match against the answers?

Sunday, November 29, 2020

A phono preamp

Here is my Muffy phono preamp and the matching power supply. They are not the offical PCBs though. The official online store was out of stock. So I downloaded the free (but older) design files and sent them to a PCB prototyping factory for printing. And yes, I picked white for the PCB color.



But the question is: I don't even own a turntable... why do I even bother to go all the way to have the PBCs printed, ordering all the resistors/capacitors/voltage regulators, and finally hunting down a suitable enclosure to house everything?   🙄

Monday, November 16, 2020

Retiring some of my PostgreSQL BuildFarm animals

To support the PostgreSQL community, I have been running multiple PostgreSQL BuildFarm animals since 2013. They run on my spare ARM SoC boards such as BeagleBone Black and Odroid C2.

Since the primary storage of these devices are micro SD cards, compiling programs on them put a heavy burden on the lifespan of the cards. Usually I needed to replace them every 2 years due to wearing.

Recently, two of the animals failed again, namely flier (AArch64 with clang) and mayfly (AArch64 with gcc).

I am thinking to retire these two. There are more and more people contributing to the BuildFarm running AARCH64 machines. Some of them are running on cloud machines / VMs and should provide better performance and stability. There seems to be immediate need to resurrect my AArch64 animals.

So instead of spending more $ on SD cards / external storage for my SoC boards, I will probably put my animals at rest.


Saturday, January 11, 2020

ROCm with Ryzen 2200G

For reference, here is the output from running /opt/rocm/bin/rocminfo on Ryzen 2200G (with vega 8) after installing ROCm: