April 15, 2023

Mysterious Hanging Java Process

I recently had a Java (Bamboo Dev Instance) app just hanging when it started up. Since it was a dev setup, which sometimes has other issues, I did the usual things, like:

  • Turn it on and off again, aka restart it.

  • Clean up the working directory, like a mvn clean, make clean or what ever.

  • Check out a stable branch.

  • Run against different sets of dependency versions

The app kept on being stuck. So, I was forced to debug it =).

Waiting for Response
Figure 1. Hanging Process
  1. First I took a few thread dumps jstack -p {pid} of the hanging process. Then I looked at the stack to see if I can see a thread which seems to being stuck. One thread stuck out, doing IO:

"localhost-startStop-1@15993" daemon prio=5 tid=0x1a nid=NA runnable
  java.lang.Thread.State: RUNNABLE
	  at sun.nio.fs.UnixNativeDispatcher.stat0(UnixNativeDispatcher.java:-1)
	  at sun.nio.fs.UnixNativeDispatcher.stat(UnixNativeDispatcher.java:291)
	  at sun.nio.fs.UnixFileAttributes.get(UnixFileAttributes.java:70)
	  at sun.nio.fs.UnixFileStore.devFor(UnixFileStore.java:55)
	  at sun.nio.fs.UnixFileStore.<init>(UnixFileStore.java:70)
	  at sun.nio.fs.LinuxFileStore.<init>(LinuxFileStore.java:48)
	  at sun.nio.fs.LinuxFileSystem.getFileStore(LinuxFileSystem.java:128)
	  at sun.nio.fs.UnixFileSystem$FileStoreIterator.readNext(UnixFileSystem.java:213)
	  at sun.nio.fs.UnixFileSystem$FileStoreIterator.hasNext(UnixFileSystem.java:224)
	  - locked <0x3e8a> (a sun.nio.fs.UnixFileSystem$FileStoreIterator)
	  at org.apache.lucene.util.IOUtils.getFileStore(IOUtils.java:584)
	  at org.apache.lucene.util.IOUtils.spinsLinux(IOUtils.java:528)
	  at org.apache.lucene.util.IOUtils.spins(IOUtils.java:517)
	  at org.apache.lucene.util.IOUtils.spins(IOUtils.java:492)
  1. Ok, that looked like this thread iterates over files, indicated by the hasNext and readNext. It might iterate over a gigantic list of files?

  2. To confirm this theory, I added a break point to the UnixNativeDispatcher.stat method. However, the break point didn’t trigger. So, it seems stuck on the stat syscall.

  3. Unfortunately, the thread dump doesn’t give you the variable values involved. So, I looked at the UnixNativeDispatcher.stat code and saw that the code takes a UnixPath as the main variable:

    static void stat(UnixPath path, UnixFileAttributes attrs) throws UnixException {
        NativeBuffer buffer = copyToNativeBuffer(path);
        try {
            stat0(buffer.address(), attrs);
        } finally {
            buffer.release();
        }
    }
  1. So, I took a heap dump of the process. Since it is local, I used VisualVM.

  2. Then I searched for the UnixPath instances. Luckily, there were only a few.

  3. There I saw immediately: A path to ~/storage/Music:

Root Cause
Figure 2. Root cause, a classic.
  1. Well, this immediately explained the issue. This path is a mounted path to data on a NAS, and the NAS was down. So, the syscall was hanging waiting for response of the offline NAS.

  2. I unmounted the directories, restarted the process and it started as expected.

Summary & Lesson

  • Offline NAS can make your system calls hang / or very slow, file system calls might wait for the NAS to respond.

  • Don’t mount NAS stuff to your home directory, as apps are more likely to read from it for various reasons.

The cause for the waiting
Figure 3. Summary
Tags: Linux Java Development