August 12, 2022

Redis Clone: Virtual Threads (Project Loom)

In the last post we profiled the naive Redis Clone. One of the things showing up was the flushing of the responses.

In this post we try out Virtual Threads and Structured Concurrency from the JDK 19 preview, and see if we can use them to improve on the flushing. This is certainly not a primary use case for these features, but let’s try it out anyway.

First, I installed a JDK 19 preview (build 19-ea+34-2229). Then I ran the existing app with it, to make sure the performance is still in the same ballpark:

ALL STATS
============================================================================================================================
Type         Ops/sec     Hits/sec   Misses/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec
----------------------------------------------------------------------------------------------------------------------------
Sets       342161.69          ---          ---         2.00145         1.47100        11.26300        59.13500    101542.08
Gets      3421575.73   1283516.83   2138058.90         1.99868         1.47100        11.26300        58.62300    457923.67
Waits           0.00          ---          ---             ---             ---             ---             ---          ---
Totals    3763737.42   1283516.83   2138058.90         1.99893         1.47100        11.26300        58.62300    559465.75

I also checked how many threads are used. The app uses a thread per connection, so with the benchmark’s 8 threads times 32 connections it needs over 256 of them. The extra threads are the JIT, GC, main, and other JVM-internal threads:

$ jps
> 1890 redis-experiment.jar
$ ls -l /proc/1890/task/ | wc -l
290

Virtual Threads

Java is on its way to getting virtual threads (also known as Project Loom). These are lightweight threads similar to goroutines: you can spin up tens of thousands, even millions of them. This allows Java to keep its existing thread-based concurrency approach but scale it to much finer-grained operations.

Figure 1. Too Many Threads
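
To get a feel for the scale (a minimal sketch, not code from the Redis clone): starting 100'000 virtual threads that each block for a second is no problem at all, while the same number of platform threads would exhaust the OS.

Spinning Up Many Virtual Threads (sketch):
import java.time.Duration;
import java.util.concurrent.Executors;

public class ManyVirtualThreads {
    public static void main(String[] args) {
        // One new virtual thread per submitted task; 100'000 of them are cheap to create.
        try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < 100_000; i++) {
                executor.submit(() -> {
                    Thread.sleep(Duration.ofSeconds(1)); // blocking only parks the virtual thread
                    return null;
                });
            }
        } // close() waits until all submitted tasks have finished
    }
}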

The first thing I did was to replace the existing thread pool executor with a virtual-thread-per-task executor; otherwise, I kept the existing blocking code. In JDK 19 you also need to enable this preview feature with the --enable-preview flag.

Using Virtual Threads:
var scheduler = Executors.newVirtualThreadPerTaskExecutor();
var socket = new ServerSocket();
socket.bind(new InetSocketAddress("0.0.0.0", 16379));
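
For context, the surrounding accept loop stays the same: each accepted connection is submitted to the executor, which now runs it on a new virtual thread instead of a pooled platform thread. A sketch, with handleConnection standing in for the existing per-connection command loop:

Accept Loop (sketch):
while (true) {
    var connection = socket.accept();
    // handleConnection() stands in for the existing read/execute/write loop
    // and is expected to handle IO errors and close the socket itself.
    scheduler.submit(() -> handleConnection(connection));
}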

Then I ran that version. And the performance is about the same:

ALL STATS
============================================================================================================================
Type         Ops/sec     Hits/sec   Misses/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec
----------------------------------------------------------------------------------------------------------------------------
Sets       353590.98          ---          ---         1.96611         1.80700         4.19100        36.60700    104933.91
Gets      3535869.61   1356612.30   2179257.31         1.96678         1.80700         4.22300        36.60700    480863.88
Waits           0.00          ---          ---             ---             ---             ---             ---          ---
Totals    3889460.58   1356612.30   2179257.31         1.96672         1.80700         4.22300        36.60700    585797.79

However, there is a big difference: the app only used around 56 OS threads. Virtual threads are multiplexed onto a small pool of carrier threads (by default roughly one per CPU core), so you can use a thread for each unit of work (like a web request) and have tons of them, without creating too many OS-level threads.

$ jps
> 3429 redis-experiment-classic.jar
> 7059 Jps
$ ls -l /proc/3429/task/ | wc -l
> 56
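
Another way to see this from inside the app is to log the current thread in the connection handler (a hypothetical debugging line, not part of the actual code). A virtual thread prints both its own id and the carrier thread it is currently mounted on:

// Prints something like "VirtualThread[#42]/runnable@ForkJoinPool-1-worker-3":
// the virtual thread plus the OS-level carrier thread it is mounted on.
System.out.println(Thread.currentThread());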

Trying Delayed Flushing with Structured Concurrency Constructs

Since virtual threads are cheap compared to OS-based threads, you can try to use them for small concurrent tasks. So, I experimented to see if I could use virtual threads to overlap some IO operations. The plan is this: instead of always flushing, I delay the flush and first try to read the next command from the client. Only if no new command arrives within a short timeout are the buffered results flushed. This way many pipelined client requests can be processed before a single flush.

I used the incubating StructuredTaskScope.ShutdownOnFailure API from JEP 428 for this. It lives in the jdk.incubator.concurrent module, so you also need to add the command-line option --add-modules jdk.incubator.concurrent.
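
The run command then looks roughly like this (jar name taken from the earlier jps output):

$ java --enable-preview --add-modules jdk.incubator.concurrent -jar redis-experiment.jar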

Delayed Flushing:
try (var scope = new StructuredTaskScope.ShutdownOnFailure()) {
    while (true) {
        args.clear();
        var lineReading = scope.fork(reader::readLine);
        String line;
        try {
            // Try to get the next command within a short timeout.
            // If it arrives within the timeout, we do not flush the writes yet:
            // this way the writes are batched up over multiple pipelined commands.
            line = lineReading.get(2, TimeUnit.MICROSECONDS);
        } catch (TimeoutException e) {
            // No new command arrived in time: flush out the written data
            scope.fork(() -> { writer.flush(); return null; });
            line = lineReading.get();
        }

        // Existing code ...

        var reply = executeCommand(args);
        if (reply == null) {
            writer.write("$-1\r\n");
        } else {
            writer.write("$" + reply.length() + "\r\n" + reply + "\r\n");
        }
        // We flush later, when we repeat the loop
    }
    // Ensure all background tasks are done
    scope.join();
    // Ensure we get any failures in background threads
    scope.throwIfFailed();
}

And the performance is… catastrophically bad, roughly one order of magnitude worse:

ALL STATS
============================================================================================================================
Type         Ops/sec     Hits/sec   Misses/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec
----------------------------------------------------------------------------------------------------------------------------
Sets        42444.92          ---          ---        16.39105        15.16700        49.91900       187.39100     12596.23
Gets       424402.63     25985.48    398417.16        16.44213        15.23100        50.17500       186.36700     23104.71
Waits           0.00          ---          ---             ---             ---             ---             ---          ---
Totals     466847.55     25985.48    398417.16        16.43749        15.23100        50.17500       186.36700     35700.94

I didn’t investigate why the performance is so bad. One thing that is hideous is that the solution has to catch a TimeoutException on every command, because the API offers no other way to wait with a timeout. That could well be part of the performance issue.
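
As a side note, one way to avoid the exception-based wait would be to briefly poll the forked future before deciding to flush. This trades the exception for a short busy-wait; a rough sketch (untested, and not necessarily any faster):

Polling Instead of Catching TimeoutException (sketch):
var lineReading = scope.fork(reader::readLine);
// Busy-poll for ~2 microseconds instead of waiting for a TimeoutException
long deadline = System.nanoTime() + TimeUnit.MICROSECONDS.toNanos(2);
while (!lineReading.isDone() && System.nanoTime() < deadline) {
    Thread.onSpinWait(); // hint to the runtime that we are spin-waiting
}
if (!lineReading.isDone()) {
    // No new command in time: flush the buffered replies now
    writer.flush();
}
String line = lineReading.get(); // blocks until the next command (or EOF) arrives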

So, I went back and tried a simpler approach: what if we just do the flush in the background? I removed the delayed flushing and instead forked the flush onto its own virtual thread:

Background Flushing:
try (var scope = new StructuredTaskScope.ShutdownOnFailure()) {
    while (true) {
        args.clear();
        var line = reader.readLine();

        // Original code

        var reply = executeCommand(args);
        if (reply == null) {
            writer.write("$-1\r\n");
        } else {
            writer.write("$" + reply.length() + "\r\n" + reply + "\r\n");
        }

        // Hand the flush off to a forked thread instead of blocking this one
        scope.fork(() -> {
            writer.flush();
            return null;
        });
    }
    scope.join();
    scope.throwIfFailed();
}

And the performance is… still no better. In fact, it is slightly worse than just flushing directly:

ALL STATS
============================================================================================================================
Type         Ops/sec     Hits/sec   Misses/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec
----------------------------------------------------------------------------------------------------------------------------
Sets       325339.40          ---          ---         2.13675         1.89500         6.59100        39.67900     96549.78
Gets      3253351.21   1176211.82   2077139.38         2.14078         1.90300         6.59100        39.67900    424230.01
Waits           0.00          ---          ---             ---             ---             ---             ---          ---
Totals    3578690.61   1176211.82   2077139.38         2.14041         1.90300         6.59100        39.67900    520779.79

Again, I didn’t investigate what hurts the performance here. I have two guesses:
- The classic Java IO streams synchronize on internal locks. Since we now flush in the background, we might create a lot of contention on the stream’s lock, making performance worse.
- Since we flush in the background, the flush might run on another physical core with cold caches, creating more cache/memory traffic between the cores.
- Or it is something completely different ;)

Conclusion: For this tight IO loop, 'just smearing' virtual threads over it does not work.

Virtual Threads Still Help with More Connections

So, I rolled back to use the virtual threads only for the connection handling and left the flushing as is.

However, virtual threads still shine when more connections come in. I ran the benchmark with more connections (8 threads times 128 connections, 1024 in total):

sudo docker run --name redisb --rm --network host memtier_benchmark \
--server=$SERVER_IP --port=16379 \
-t 8 -c 128 --test-time=30 --distinct-client-seed -d 256 --pipeline=30

With the classic thread pool, where each connection gets an OS-level thread, the latencies start to suffer:

ALL STATS
============================================================================================================================
Type         Ops/sec     Hits/sec   Misses/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec
----------------------------------------------------------------------------------------------------------------------------
Sets       370192.21          ---          ---         7.35006         5.66300        40.44700       127.99900    109860.62
Gets      3701749.16   1466524.81   2235224.35         7.34796         5.66300        40.44700       126.97500    515125.68
Waits           0.00          ---          ---             ---             ---             ---             ---          ---
Totals    4071941.37   1466524.81   2235224.35         7.34815         5.66300        40.44700       126.97500    624986.29

# Number of threads
$ ls -l /proc/5985/task/ | wc -l
> 1063

The virtual thread implementation has no trouble keeping up with more connections:

ALL STATS
============================================================================================================================
Type         Ops/sec     Hits/sec   Misses/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec
----------------------------------------------------------------------------------------------------------------------------
Sets       345859.81          ---          ---         8.06748         7.32700        27.00700        62.71900    102639.55
Gets      3458420.58   1306051.09   2152369.49         8.06871         7.35900        27.00700        62.46300    465058.51
Waits           0.00          ---          ---             ---             ---             ---             ---          ---
Totals    3804280.39   1306051.09   2152369.49         8.06860         7.35900        27.00700        62.46300    567698.06

$ ls -l /proc/7112/task/ | wc -l
> 56

Next Steps

Well, I still want to improve on the IO as discussed in the benchmark post. So, I’ll go down the Java NIO route and do some non-blocking IO.

Stay tuned ;)

Tags: Performance Java Development Redis-Clone