MADV_RANDOM causes significant increase in major page faults in Linux 6.4 and later #939

Open
@sdodson

Prior to torvalds/linux@8788f67, MADV_RANDOM had the documented, explicit behavior of preventing the kernel from performing readahead (see the posix_madvise(2) man page). It also hinted to the kernel to retain pages in the page cache. Since this commit, present in 6.4 and later, the implicit behavior is inverted: the kernel frees pages aggressively because the hint short-circuits the kernel's second-chance LRU mechanism.

On moderately to heavily loaded OpenShift clusters, the net outcome is that an etcd compaction, which the Kube API Server triggers every five minutes and which normally takes no more than 900ms, could take up to 20s. Hosts that normally had near-zero major page faults were seeing upwards of 600 faults per second.
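For anyone trying to reproduce this, here is a rough sketch of sampling the major page fault rate from inside a Go process via getrusage(2). Note that it only counts faults for the calling process, whereas the numbers above are host-wide; the helper names `majorFaults` and `faultRate` are made up for illustration and are not part of bbolt or etcd.

```go
package main

import (
	"fmt"
	"syscall"
	"time"
)

// majorFaults returns the number of major page faults this process has
// incurred so far, as reported by getrusage(2). Hypothetical helper.
func majorFaults() (int64, error) {
	var ru syscall.Rusage
	if err := syscall.Getrusage(syscall.RUSAGE_SELF, &ru); err != nil {
		return 0, err
	}
	return int64(ru.Majflt), nil
}

// faultRate converts two counter samples into faults per second.
func faultRate(before, after int64, seconds float64) float64 {
	if seconds <= 0 {
		return 0
	}
	return float64(after-before) / seconds
}

func main() {
	before, err := majorFaults()
	if err != nil {
		panic(err)
	}
	start := time.Now()

	// Stand-in for the real workload (for example, reads against the
	// mmap'd database file during a compaction).
	time.Sleep(100 * time.Millisecond)

	after, err := majorFaults()
	if err != nil {
		panic(err)
	}
	rate := faultRate(before, after, time.Since(start).Seconds())
	fmt.Printf("major faults/s (this process): %.1f\n", rate)
}
```

Comparing this figure with and without the hint, on a 6.4+ kernel under memory pressure, should show whether a given workload is affected.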

In our testing, removing the MADV_RANDOM hint restored previous performance and had no observable increase in overall memory usage. We recommend removing this mmap hint on all versions of Linux as it seems to have no negative impact on 6.3 and earlier kernels. An alternative would be to keep MADV_RANDOM but enable mlock(2) at the same time.
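To make the two options concrete, here is a minimal sketch (Linux only, standard `syscall` package) of a read-only file mapping where the MADV_RANDOM hint and mlock(2) are both opt-in. The `mapOptions` type, `mapReadOnly`, and `demo` are hypothetical names for this example and do not reflect bbolt's actual code or option names.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// mapOptions are hypothetical knobs for this sketch, not bbolt's API.
type mapOptions struct {
	adviseRandom bool // apply madvise(MADV_RANDOM) to the mapping
	lock         bool // pin the mapping in RAM with mlock(2)
}

// mapReadOnly maps the first size bytes of f read-only and applies the
// requested hints. With the zero-value options no hint is applied,
// which is the behavior recommended above.
func mapReadOnly(f *os.File, size int, opts mapOptions) ([]byte, error) {
	b, err := syscall.Mmap(int(f.Fd()), 0, size, syscall.PROT_READ, syscall.MAP_SHARED)
	if err != nil {
		return nil, fmt.Errorf("mmap: %w", err)
	}
	if opts.adviseRandom {
		if err := syscall.Madvise(b, syscall.MADV_RANDOM); err != nil {
			_ = syscall.Munmap(b)
			return nil, fmt.Errorf("madvise: %w", err)
		}
	}
	if opts.lock {
		// May fail if RLIMIT_MEMLOCK is too low for the mapping size.
		if err := syscall.Mlock(b); err != nil {
			_ = syscall.Munmap(b)
			return nil, fmt.Errorf("mlock: %w", err)
		}
	}
	return b, nil
}

// demo writes one page to a temp file, maps it without any hint, and
// returns the length of the mapping.
func demo() (int, error) {
	f, err := os.CreateTemp("", "mmap-sketch-*.db")
	if err != nil {
		return 0, err
	}
	defer os.Remove(f.Name())
	defer f.Close()

	if _, err := f.Write(make([]byte, 4096)); err != nil {
		return 0, err
	}
	b, err := mapReadOnly(f, 4096, mapOptions{})
	if err != nil {
		return 0, err
	}
	defer syscall.Munmap(b)
	return len(b), nil
}

func main() {
	n, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Println("mapped bytes:", n)
}
```

Setting both `adviseRandom` and `lock` corresponds to the alternative mentioned above; locking a large database file needs a sufficiently high RLIMIT_MEMLOCK, so it is probably only practical as an opt-in.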

The hint was originally added in 88f777f.

CC @dusk125
