
NVMe Read Performance Issues with ZFS (submit_bio to io_schedule) #8381

Open
@bwatkinson

Description


System information

Type                  Version/Name
Distribution Name     CentOS
Distribution Version  7.5
Linux Kernel          4.18.20-100
Architecture          x86_64
ZFS Version           0.7.12-1
SPL Version           0.7.12-1

Describe the problem you're observing

We are currently seeing poor read performance from ZFS 0.7.12 on our Samsung PM1725a devices in a Dell PowerEdge R7425.

Describe how to reproduce the problem

Briefly describing our setup: we currently have four PM1725a devices attached to the PCIe root complex in NUMA domain 1 on an AMD EPYC 7401 processor. To measure the read throughput of ZFS, XFS, and the raw devices, the XDD tool was used, which is available at:

git@github.com:bwatkinson/xdd.git

In all the cases presented here, kernel 4.18.20-100 was used and all CPUs not on socket 0 were disabled in the kernel. I issued asynchronous sequential reads to the file systems/devices while pinning all XDD threads to NUMA domain 1 and socket 0's memory banks. I conducted four tests measuring read throughput for the raw devices, XFS, and ZFS 0.7.12:

- Raw devices: 6 I/O threads per device, 1 MB request size, 32 GB read from each device using Direct I/O.
- XFS: a single XFS file system on each of the 4 devices; in each file system, a 32 GB file of random data was read using 6 I/O threads per file with 1 MB request sizes and Direct I/O.
- ZFS single ZPool: a single ZPool composed of 4 VDEVs; a 128 GB file of random data was read using 24 I/O threads with 1 MB request sizes.
- ZFS multiple ZPools: 4 separate ZPools, each consisting of a single VDEV; in each ZPool, a 32 GB file of random data was read using 6 I/O threads per file with 1 MB request sizes.

In both the single-ZPool and multiple-ZPool cases I set the record size for all pools to 1 MB and set primarycache=none. We decided to disable the ARC in all cases because we were reading 128 GB of data, which was exactly 2x the available memory on socket 0; even with the ARC enabled we saw no performance benefit. A sketch of the pool setup is shown below, followed by the throughput measurements I collected for each of these cases.
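
This is roughly how the pools were configured; the device and pool names are placeholders, and the xdd invocation is abbreviated since the exact flags depend on the xdd build:

    # ZFS single-pool case: one pool striped across all four NVMe devices
    zpool create tank nvme0n1 nvme1n1 nvme2n1 nvme3n1
    zfs set recordsize=1M tank
    zfs set primarycache=none tank

    # ZFS multiple-pool case: one single-VDEV pool per device
    for i in 0 1 2 3; do
        zpool create tank$i nvme${i}n1
        zfs set recordsize=1M tank$i
        zfs set primarycache=none tank$i
    done

    # XDD reader threads were pinned to NUMA domain 1 and socket 0's memory,
    # e.g. via numactl (the node numbering here is an assumption):
    numactl --cpunodebind=1 --membind=0-3 xdd ...   # exact xdd flags omitted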

Performance Results:
Raw Device - 12,7246.724 MB/s
XFS Direct I/O - 12,734.633 MB/s
ZFS Single ZPool - 6,452.946 MB/s
ZFS Multiple ZPool - 6,344.915 MB/s

To narrow down what was cutting ZFS read performance in half, I generated flame graphs using the following tool:

http://www.brendangregg.com/flamegraphs.html
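
The flame graphs were generated with the usual perf + FlameGraph workflow, roughly as follows (the sampling rate and duration are illustrative; stackcollapse-perf.pl and flamegraph.pl come from the FlameGraph repository linked above):

    # Sample kernel and user stacks system-wide while the XDD read test runs
    perf record -F 99 -a -g -- sleep 60
    perf script > out.perf

    # Fold the sampled stacks and render an interactive SVG flame graph
    ./stackcollapse-perf.pl out.perf > out.folded
    ./flamegraph.pl out.folded > zfs_reads.svg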

In general I found that most of the perf samples were occurring in zio_wait; it is in this call that io_schedule is called. Comparing the ZFS flame graphs to the XFS flame graphs, I found that the number of samples between submit_bio and io_schedule was significantly larger in the ZFS case. I decided to take timestamps of each call to io_schedule for both ZFS and XFS to measure the latency between the calls. Below is a link to histograms as well as the total elapsed time in microseconds between io_schedule calls for the tests described above. In total I collected 110,000 timestamps. In the plotted data, the first 10,000 timestamps were ignored to allow the file systems to reach a steady state.

https://drive.google.com/drive/folders/1t7tLXxyhurcxGjry4WYkw_wryh6eMqdh?usp=sharing
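
For anyone wanting to reproduce this kind of measurement, a bpftrace sketch along these lines collects a per-thread histogram of the time between successive io_schedule calls (the comm filter on the xdd process name and the microsecond bucketing are assumptions):

    # Histogram (in microseconds) of the gap between successive io_schedule
    # calls made by xdd threads
    bpftrace -e '
    kprobe:io_schedule /comm == "xdd"/ {
        $now = nsecs;
        if (@last[tid]) {
            @between_us = hist(($now - @last[tid]) / 1000);
        }
        @last[tid] = $now;
    }'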

In general, ZFS shows significantly higher latency between io_schedule calls. I have also verified that the output from iostat shows a larger r_await value for ZFS than for XFS in these tests. It appears that ZFS is letting requests sit in the hardware queues longer than XFS and the raw devices do, causing a huge performance penalty on ZFS reads (effectively cutting the available device bandwidth in half).
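
The iostat numbers are straightforward to reproduce while a test is running; r_await (average read request wait time, in milliseconds) is the column of interest. With placeholder device names:

    # Extended per-device statistics at 1-second intervals during the read test
    iostat -x nvme0n1 nvme1n1 nvme2n1 nvme3n1 1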

Has this issue been observed before with NVMe SSDs and ZFS, and is there a current fix? If there is no current fix, is the issue being worked on?

Also, I have tried to duplicate the XFS results using ZFS 0.8.0-rc2 with Direct I/O, but the 0.8.0 read performance almost exactly matched the ZFS 0.7.12 read performance without Direct I/O.

Include any warnings/errors/backtraces from the system logs
