Description
ConcurrentLruCache was extracted from MimeTypeUtils for use in NamedParameterJdbcTemplate to improve its concurrent performance (#24197). Unfortunately, the original logic, a synchronized, bounded LinkedHashMap, is faster than this new implementation, and once the cache reaches capacity the change introduced in 5.3.0 only exacerbates the problem.
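For reference, the pre-5.3 idiom is roughly a bounded, access-ordered LinkedHashMap behind synchronization. A minimal sketch of that idiom (the exact code in MimeTypeUtils may have differed in details):

```java
import java.util.LinkedHashMap;
import java.util.Map;

class SynchronizedLruCache<K, V> {

    private final Map<K, V> cache;

    SynchronizedLruCache(int maxSize) {
        // accessOrder = true keeps the map in LRU order; removeEldestEntry bounds its size
        this.cache = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxSize;
            }
        };
    }

    // get() mutates the access order, so reads must be synchronized as well
    synchronized V get(K key) {
        return cache.get(key);
    }

    synchronized void put(K key, V value) {
        cache.put(key, value);
    }
}
```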
ConcurrentLruCache is implemented as a combination of a read/write lock, a ConcurrentHashMap, and a ConcurrentLinkedDeque. While the cache is under capacity it performs only the map read, which makes it incredibly fast. Once it has reached capacity, however, it maintains the LRU eviction order by removing the key from the queue and re-adding it on every access, so the queue reflects access order: the tail holds the MRU key and the removal is an O(n) scan. This reordering is performed under the read lock, with the write lock providing exclusive access for a cache write, which may evict an entry. I am not sure the read/write lock is necessary; it might be replaced with an exclusive lock on the write path only, though perhaps it was added for pragmatic concerns as a failsafe.
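To make the read and write paths concrete, here is a condensed sketch of that design as described above (illustrative only, not the actual Spring source):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedDeque;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Function;

class LruCacheSketch<K, V> {

    private final int sizeLimit;
    private final Function<K, V> generator;
    private final ConcurrentHashMap<K, V> cache = new ConcurrentHashMap<>();
    private final ConcurrentLinkedDeque<K> queue = new ConcurrentLinkedDeque<>(); // tail = MRU
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    LruCacheSketch(int sizeLimit, Function<K, V> generator) {
        this.sizeLimit = sizeLimit;
        this.generator = generator;
    }

    V get(K key) {
        V cached = cache.get(key);
        if (cached != null) {
            if (cache.size() < sizeLimit) {
                return cached; // under capacity: a plain ConcurrentHashMap read
            }
            lock.readLock().lock();
            try {
                queue.remove(key); // O(n) scan to find the key
                queue.add(key);    // re-append at the tail (MRU position)
                return cached;
            } finally {
                lock.readLock().unlock();
            }
        }
        lock.writeLock().lock(); // exclusive access for the write path
        try {
            V value = cache.computeIfAbsent(key, generator);
            queue.remove(key);
            queue.add(key);
            if (cache.size() > sizeLimit) {
                K eldest = queue.poll(); // head = LRU key, evicted on overflow
                if (eldest != null) {
                    cache.remove(eldest);
                }
            }
            return value;
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```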
This design has several problems that lead to poor performance:
- The O(n) scan of the queue assumes a small cache, since the scan is performed on every access once the cache is full.
- A read/write lock is more expensive than an exclusive lock (it is often built on top of one) because it has to manage more complex state. This type of lock is reasonable when the critical section is very expensive, e.g. I/O, but it is a poor choice for short, fast critical sections.
- As LRU reorders the key to the tail of the queue on every access, the tail becomes a single point of contention. Fundamentally, performance is limited by how fast entries can be appended to the tail, and that is made worse by contention and bus traffic when using a lock-free algorithm.
Cache workloads typically follow a Zipfian (power-law) distribution, hence there are hot and cold entries. In a JMH benchmark with that distribution and a full cache, I observed ~270,000 reads/s with 8 threads on a 4-core SMT laptop. In comparison, a synchronized LinkedHashMap in LRU order performs at 10-11M reads/s, a 40x speedup. If #24197 is an observed bottleneck, then that problem still exists and was likely made worse.
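For anyone wanting to reproduce this kind of measurement, a rough JMH harness might look like the following. This is not the exact benchmark I ran; the Zipfian sampler, key space, and capacity are illustrative placeholders, and it assumes the org.springframework.util.ConcurrentLruCache(int, Function) API:

```java
import java.util.concurrent.ThreadLocalRandom;

import org.openjdk.jmh.annotations.*;
import org.springframework.util.ConcurrentLruCache;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.Throughput)
@Threads(8)
public class LruReadBenchmark {

    private static final int CAPACITY = 1_024;
    private static final int KEY_SPACE = 2_048; // larger than the capacity so the cache stays full

    private int[] zipfKeys; // precomputed Zipfian-distributed key sequence
    private ConcurrentLruCache<Integer, Integer> cache;

    @Setup
    public void setup() {
        cache = new ConcurrentLruCache<>(CAPACITY, key -> key);
        zipfKeys = sampleZipf(KEY_SPACE, 1.0, 1 << 20);
        for (int key : zipfKeys) {
            cache.get(key); // warm the cache up to its capacity
        }
    }

    @Benchmark
    public Integer read() {
        int index = ThreadLocalRandom.current().nextInt(zipfKeys.length);
        return cache.get(zipfKeys[index]);
    }

    // Inverse-CDF sampling of a Zipf(s) distribution over [0, n)
    private static int[] sampleZipf(int n, double s, int samples) {
        double[] cdf = new double[n];
        double sum = 0;
        for (int i = 0; i < n; i++) {
            sum += 1.0 / Math.pow(i + 1, s);
            cdf[i] = sum;
        }
        int[] keys = new int[samples];
        ThreadLocalRandom random = ThreadLocalRandom.current();
        for (int i = 0; i < samples; i++) {
            double u = random.nextDouble() * sum;
            int lo = 0;
            int hi = n - 1;
            while (lo < hi) { // binary search for the first index with cdf >= u
                int mid = (lo + hi) >>> 1;
                if (cdf[mid] < u) {
                    lo = mid + 1;
                } else {
                    hi = mid;
                }
            }
            keys[i] = lo;
        }
        return keys;
    }
}
```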
The approach that I use in my caches (ConcurrentLinkedHashMap, Guava, Caffeine) is to amortize the lock penalty. Instead of trying to make the LRU policy thread-safe, the work is enqueued into a buffer and replayed under the lock in batches. Readers are then penalized only by the cost of appending to the buffer, and multiple buffers spread out these writes to avoid contention. The replay is a simple tryLock that, if acquired, drains the buffer and applies the cheap LRU reordering operations. That approach can easily raise the throughput to 100M+ reads/s.
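As a toy illustration of that idea (not Caffeine's or ConcurrentLinkedHashMap's actual code), a single lossy read buffer drained under a tryLock could look like this; real implementations use several striped buffers, drain on thresholds, and handle writes more carefully:

```java
import java.util.LinkedHashMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReferenceArray;
import java.util.concurrent.locks.ReentrantLock;

class BufferedLruCache<K, V> {

    private static final int BUFFER_SIZE = 64; // power of two; a single buffer for brevity

    private final ConcurrentHashMap<K, V> data = new ConcurrentHashMap<>();
    private final LinkedHashMap<K, Boolean> policy = new LinkedHashMap<>(16, 0.75f, true); // guarded by lock
    private final ReentrantLock lock = new ReentrantLock();
    private final AtomicReferenceArray<K> buffer = new AtomicReferenceArray<>(BUFFER_SIZE);
    private final AtomicLong readCounter = new AtomicLong();
    private final int capacity;

    BufferedLruCache(int capacity) {
        this.capacity = capacity;
    }

    V get(K key) {
        V value = data.get(key);
        if (value != null) {
            recordRead(key); // cheap: append to a lossy buffer instead of touching the LRU policy
        }
        return value;
    }

    void put(K key, V value) {
        data.put(key, value);
        lock.lock();
        try {
            drainBuffer();
            policy.put(key, Boolean.TRUE);
            evictIfNeeded();
        } finally {
            lock.unlock();
        }
    }

    private void recordRead(K key) {
        long count = readCounter.getAndIncrement();
        buffer.lazySet((int) (count & (BUFFER_SIZE - 1)), key); // may overwrite an older read: acceptable
        if ((count & (BUFFER_SIZE - 1)) == 0 && lock.tryLock()) { // replay in batches; skip if contended
            try {
                drainBuffer();
            } finally {
                lock.unlock();
            }
        }
    }

    private void drainBuffer() {
        for (int i = 0; i < BUFFER_SIZE; i++) {
            K key = buffer.getAndSet(i, null);
            if (key != null && data.containsKey(key)) {
                policy.put(key, Boolean.TRUE); // the cheap LRU reordering, applied under the lock
            }
        }
    }

    private void evictIfNeeded() {
        while (policy.size() > capacity) {
            K eldest = policy.keySet().iterator().next(); // head of the access-ordered map is the LRU key
            policy.remove(eldest);
            data.remove(eldest);
        }
    }
}
```

The key point is that the hot read path never blocks on the eviction policy; a contended reader simply skips the drain and lets another thread replay the buffer later.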
For a simple cache to embed, you might consider including ConcurrentLinkedHashMap. It was used broadly during its lifetime and is small enough that it has often been copied into other projects. Groovy enhanced their copy to support the Java 8 compute methods, and that variant was subsequently embedded by Micronaut for similar use cases, so it may be an attractive option. I can't speak to the Groovy variant, but the original has proven to be robust and easily hackable, so there should be a low maintenance burden if either is used. Embedding avoids the dependency, and many projects, like mssql-jdbc, seem to have preferred doing that.