Clarence's Wicked Mind: December 2020

Thursday, December 24, 2020

Tamp down Logitech Media Server polling

The latest Logitech Media Server 8.0 adds online music library integration. By default, however, it polls the music service every hour. That is too much for my poor Raspberry Pi, which has only 512MB of memory and a slow CPU, not to mention that it is also running my DNS ad-blocking and tunneling services.

A simple fix is to increase the polling interval: edit the variable POLLING_INTERVAL in the file Plugin.pm, which can be found under /usr/share/perl5/Slim/Plugin/OnlineLibrary.

It also helps to lower the scanner priority: go to Server Settings, under Performance, change "Scanner Priority" to something lower than normal.

Sunday, December 20, 2020

Java asynchronous computation and performance

A few days ago I was using Java Future for some parallel computation and noticed something interesting with the performance. Here are some findings.

Note: normally it is a bad idea to have Future threads updating variables outside their own scope. This piece of code is for illustration purposes only.

Here is a simplified version of the code:

Basically, it forks a number of threads based on the number of CPU cores, and each thread increments a variable repeatedly.

For reference, this was tested with OpenJDK 11 on a Ryzen 2200G CPU.

The variable being updated can either be an array defined within the lambda expression:

                for (long j = 0; j < 2500000000L; j++) {
                    local_result[0] += 1;
                }
                return local_result[0];

Or it can be an array defined outside:

                for (long j = 0; j < 2500000000L; j++) {
                    main_result[slot] += 1;
                }
                return main_result[slot];
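
A minimal sketch of a harness that runs both variants side by side could look like the following. The class name, the method names, and the much smaller loop count are assumptions for illustration, not the original code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical reconstruction of the benchmark; names and loop count are assumptions.
class LocalVsShared {
    // Runs `tasks` futures; each one increments its own counter `iterations` times.
    static long run(boolean useShared, int tasks, long iterations) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(tasks);
        long[] mainResult = new long[tasks];   // adjacent slots, defined outside the lambda
        List<Future<Long>> futures = new ArrayList<>();
        try {
            for (int i = 0; i < tasks; i++) {
                final int slot = i;
                futures.add(pool.submit(() -> {
                    if (useShared) {
                        for (long j = 0; j < iterations; j++) {
                            mainResult[slot] += 1;
                        }
                        return mainResult[slot];
                    }
                    long[] localResult = new long[1];   // defined inside the lambda
                    for (long j = 0; j < iterations; j++) {
                        localResult[0] += 1;
                    }
                    return localResult[0];
                }));
            }
            long total = 0;
            for (Future<Long> f : futures) {
                total += f.get();   // wait for each task and sum the counters
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        int tasks = Runtime.getRuntime().availableProcessors();
        long iterations = 100_000_000L;   // far fewer than the 2,500,000,000 used in the post
        for (boolean useShared : new boolean[] {false, true}) {
            long start = System.nanoTime();
            long total = run(useShared, tasks, iterations);
            double seconds = (System.nanoTime() - start) / 1e9;
            System.out.printf("%s: total=%d, time=%.2fs%n",
                    useShared ? "shared array" : "local array", total, seconds);
        }
    }
}
```

Both variants return the same total; only the timing differs, and the size of the gap will depend on the CPU and JVM.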

The two versions differ hugely in performance. The local variable version finished in around 5 seconds, while the one using the external variable needed 35 seconds to complete.

What is going on? Comparing the bytecode (generated with "javap -c -p classname") of the lambda function between the two:

The version on the left is using variables outside the thread scope. Note that:

(1) for the external variable version, the array and index are implicitly passed into the lambda function as parameters

(2) otherwise the two versions are basically the same, except some variable numbering (e.g. lstore_3 vs lstore_1 for the loop counter)

Then why is there a big performance hit when using variables outside the lambda function?

My guess is, it comes from a much lower level... the CPU cache.

Using "perf" to measure the local variable version:

perf stat -e task-clock,cpu-migrations,page-faults,instructions,branch-misses,branches,cache-references,cache-misses java LocalVsParent
Total is 10000000000
Time taken: 4.96s

Performance counter stats for 'java LocalVsParent':

         19,720.00 msec task-clock:u              #    3.915 CPUs utilized
                 0      cpu-migrations:u          #    0.000 K/sec
             3,528      page-faults:u             #    0.179 K/sec
    60,434,426,159      instructions:u
         4,149,399      branch-misses:u           #    0.04% of all branches
    10,084,402,945      branches:u                # 511.379 M/sec
        39,389,694      cache-references:u        #    1.997 M/sec
         9,439,462      cache-misses:u            #   23.964 % of all cache refs

       5.037206853 seconds time elapsed

      19.687272000 seconds user
       0.032057000 seconds sys

And the external variable version:

perf stat -e task-clock,cpu-migrations,page-faults,instructions,branch-misses,branches,cache-references,cache-misses java LocalVsParent
Total is 10000000000
Time taken: 35.28s

Performance counter stats for 'java LocalVsParent':

        139,148.66 msec task-clock:u              #    3.935 CPUs utilized
                 0      cpu-migrations:u          #    0.000 K/sec
             3,709      page-faults:u             #    0.027 K/sec
    60,463,724,838      instructions:u
         4,343,415      branch-misses:u           #    0.04% of all branches
    10,091,903,050      branches:u                #   72.526 M/sec
     1,658,685,818      cache-references:u        #   11.920 M/sec
     1,626,364,192      cache-misses:u            #   98.051 % of all cache refs

      35.358457694 seconds time elapsed

     138.982741000 seconds user
       0.163984000 seconds sys

The numbers of instructions and branches are similar. But see that the external variable version has a whopping 98% cache miss rate (versus about 24% for the local version), on roughly 40 times as many cache references? That is probably why it performs so poorly.

Conclusion? It is hard to predict cache handling when you are stressing the CPU using a high-level language such as Java. Besides, when using Future or thread computation in Java, it is usually a good idea to avoid updating variables outside the thread scope as you will need to be aware of all the "volatile", "atomic", and "synchronized" stuff.
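
One common explanation for this pattern is false sharing: the slots of main_result sit next to each other in memory, so the threads keep invalidating the same cache line. The measurements above do not confirm this, so the following is only a sketch for testing the hypothesis. It spaces the slots apart, assuming 64-byte cache lines (8 longs per line); the class name and the stride are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical variant of the external-array version with padded slots.
class PaddedSlots {
    // Assumption: 64-byte cache lines, i.e. 8 longs per line.
    static final int STRIDE = 8;

    static long run(int tasks, long iterations) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(tasks);
        // One extra stride at the front keeps the first slot away from the array header.
        long[] padded = new long[(tasks + 1) * STRIDE];
        List<Future<Long>> futures = new ArrayList<>();
        try {
            for (int i = 0; i < tasks; i++) {
                final int slot = (i + 1) * STRIDE;   // slots are 64 bytes apart
                futures.add(pool.submit(() -> {
                    for (long j = 0; j < iterations; j++) {
                        padded[slot] += 1;
                    }
                    return padded[slot];
                }));
            }
            long total = 0;
            for (Future<Long> f : futures) {
                total += f.get();
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        int tasks = Runtime.getRuntime().availableProcessors();
        long start = System.nanoTime();
        long total = run(tasks, 100_000_000L);
        System.out.printf("padded array: total=%d, time=%.2fs%n",
                total, (System.nanoTime() - start) / 1e9);
    }
}
```

If false sharing is the cause, the padded version should land much closer to the local variable timing; that has not been measured here.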

Thursday, December 17, 2020

Estimating the value of pi

Just for fun... and testing out the Python ray library for distributed computing: here are some simple scripts to estimate the value of pi.
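
The method used by the scripts is not stated here. One common approach is Monte Carlo sampling: throw random points into the unit square and count how many fall inside the quarter circle. Below is a minimal sketch of that idea, written in Java to match the other examples on this page rather than Python with ray; the class and method names are made up.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.stream.LongStream;

// Hypothetical Monte Carlo estimator; not the ray-based scripts from the post.
class PiEstimate {
    static double estimate(long samples) {
        long inside = LongStream.range(0, samples)
                .parallel()   // spread the sampling across the available cores
                .filter(i -> {
                    double x = ThreadLocalRandom.current().nextDouble();
                    double y = ThreadLocalRandom.current().nextDouble();
                    return x * x + y * y <= 1.0;   // point falls inside the quarter circle
                })
                .count();
        // The quarter circle covers pi/4 of the unit square.
        return 4.0 * inside / samples;
    }

    public static void main(String[] args) {
        System.out.println(estimate(10_000_000L));
    }
}
```

The result varies from run to run, and the error shrinks only with the square root of the sample count.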

Tuesday, December 15, 2020

Rederly demo

Glad to learn that Rederly is going to take the WeBWorK Open Problem Library and utilize it in a new platform. The project is open source, but it seems that they also sell it as a service.

Took it for a spin. The renderer seems solid. Backend has some strange design but overall ok. At least it makes sense when you look at the code. Some functionality is still missing (e.g. class enrollment by admin) or requires tuning (e.g. listing the whole class of several hundred students is going to be slow), though.

Wrote some scripts to run the full stack (frontend, backend, renderer, and db) as containers. Also tested it on Azure with docker-compose (with minor changes to the disk volume mount).

Saturday, December 12, 2020

Google CTF 2020 - writeonly

Stumbled across a writeup on the Google CTF 2020 event. Found it interesting that although the team used pwntools, they made it over-complicated by writing the shellcode in C and then extracting the assembly code for the exploit injection. Isn't the whole point of using pwntools to help you generate the shellcode assembly?

Anyway, I looked up the challenge and found source code for the challenge itself and an implementation of a clean (official?) exploit. Here are the results of me playing with that code. Changes include:

- modified the Dockerfile so I can run the challenge locally. Used socat to expose the executable via port 1337 of the container

- as for the actual exploit, instead of doing a complete shellcode injection, I modified the code to just dump the flag file.

- this modification also avoided overwriting the child code with a bunch of NOPs. It injects code precisely at the start of the infinite loop of the child thread (check_flag+0x8). This can be found by looking at the end of the disassembled code of the check_flag function:

...
4022d9:       bf 01 00 00 00          mov    $0x1,%edi
4022de:       e8 fd cf 04 00          callq 44f2e0 <__sleep>
4022e3:       e9 52 ff ff ff          jmpq   40223a <check_flag+0x8>

- the commands used to build the Docker image, disassemble the child's function, and run the exploit can be found in the Makefile

A detailed description of the challenge and the complete source code are available on GitHub.
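
As a sanity check on the disassembly above, the targets of the callq and jmpq can be recomputed from their raw bytes: both are 5-byte instructions with a little-endian 32-bit displacement relative to the address of the next instruction. The helper below is a hypothetical illustration and is not part of the exploit code.

```java
// Hypothetical helper for checking rel32 branch targets by hand.
class RelJump {
    // `bytes` holds the 5 instruction bytes (opcode e8 or e9, then rel32) as values 0..255.
    static long target(long address, int[] bytes) {
        // Assemble the little-endian displacement; the int result is already signed.
        int rel32 = bytes[1] | (bytes[2] << 8) | (bytes[3] << 16) | (bytes[4] << 24);
        return address + 5 + rel32;   // relative to the next instruction
    }

    public static void main(String[] args) {
        long jmp = target(0x4022e3L, new int[] {0xe9, 0x52, 0xff, 0xff, 0xff});
        long call = target(0x4022deL, new int[] {0xe8, 0xfd, 0xcf, 0x04, 0x00});
        System.out.println(Long.toHexString(jmp));    // prints 40223a (check_flag+0x8)
        System.out.println(Long.toHexString(call));   // prints 44f2e0 (__sleep)
    }
}
```

The jmpq displacement 0xffffff52 is -174, so the jump lands 174 bytes before 0x4022e8, which is 0x40223a.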

Friday, December 4, 2020

How (not) to add another dimension to a relational database table

This is just a rant. I am not going to mention the name of the software, ok?

So I was working on an open source project, trying to add a new feature to it. The software can display some questions in random order and allows the user to enter answers. Say we have five questions, Question A to Question E, randomly shown, and the user entered answers as such:

What is being displayed on screen...
Questions	Answers
Question B	Answer B
Question E	Answer E
Question D	Answer D
Question C	Answer C
Question A	Answer A

And here are the corresponding rows in the database when the user saved those answers:

What is being stored on db...
problem_id	question	answer
1	Question A	Answer B
2	Question B	Answer E
3	Question C	Answer D
4	Question D	Answer C
5	Question E	Answer A

Instead of using one row to store each question and answer pair, whoever wrote that code decided to utilize the "answer" column independently from other columns and store values in the order displayed on screen!!

They even included comments in the code:

# note that answers are stored in display order...

I mean... WTF!? How am I going to retrieve the data? Am I supposed to use the same random seed to see how questions are ordered on screen and match against the answers?
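
For contrast, here is a minimal sketch of what one would expect instead: map each answer back to its question before saving, so that every row holds a matching pair. The class and method names are hypothetical, and the sketch assumes the display order is available as a list of problem_id values.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper; not code from the project in question.
class AnswerRows {
    // displayOrder.get(i) is the problem_id shown at screen position i;
    // answersInDisplayOrder.get(i) is what the user entered at that position.
    static Map<Integer, String> keyByProblemId(List<Integer> displayOrder,
                                               List<String> answersInDisplayOrder) {
        if (displayOrder.size() != answersInDisplayOrder.size()) {
            throw new IllegalArgumentException("need exactly one answer per displayed question");
        }
        Map<Integer, String> rows = new TreeMap<>();   // sorted by problem_id
        for (int i = 0; i < displayOrder.size(); i++) {
            rows.put(displayOrder.get(i), answersInDisplayOrder.get(i));
        }
        return rows;
    }

    public static void main(String[] args) {
        // Questions shown as B, E, D, C, A, i.e. problem_id 2, 5, 4, 3, 1.
        Map<Integer, String> rows = keyByProblemId(
                List.of(2, 5, 4, 3, 1),
                List.of("Answer B", "Answer E", "Answer D", "Answer C", "Answer A"));
        rows.forEach((id, answer) -> System.out.println(id + "\t" + answer));
    }
}
```

With the example above, problem_id 1 gets "Answer A", 2 gets "Answer B", and so on, and nobody has to replay a random seed to read the data back.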