Description
Prior to torvalds/linux@8788f67, MADV_RANDOM had the documented, explicit behavior of preventing the kernel from performing readahead (see the posix_madvise(3) man page). It also hinted to the kernel to retain pages in the page cache. Since this commit, present in 6.4 and later, that implicit behavior is inverted: the kernel frees pages aggressively because the hint short-circuits the kernel's second-chance LRU mechanisms.
In moderately to heavily loaded OpenShift clusters, the net outcome is that an etcd compaction, which the Kube API Server triggers every five minutes and which normally takes no more than 900ms, could take up to 20s. Hosts that normally had near-zero major page faults were seeing upwards of 600 faults per second.
In our testing, removing the MADV_RANDOM hint restored the previous performance and caused no observable increase in overall memory usage. We recommend removing this mmap hint on all versions of Linux, as doing so seems to have no negative impact on 6.3 and earlier kernels. An alternative would be to keep MADV_RANDOM but enable mlock(2) at the same time.
The hint was originally added in 88f777f.
CC @dusk125