macho: faster and more memory efficient linker #13260

Merged

andrewrk merged 18 commits into master from zld-sync on Oct 22, 2022

Conversation

@kubkon (Member) commented Oct 22, 2022

Closes #9764

This PR is the culmination of a general rewrite of our MachO linkers (traditional and incremental) that I set out to do just before SYCL, after I finished the first batch of work on the COFF linker. If I were to summarise the major achievement of this PR, it is the rewriting of the majority of the linker in the spirit of data-oriented design, which led to

  • significantly reduced link times to the point where, dare I say, we start to become competitive with lld and ld64 - you can find some numbers below, and
  • reduced memory usage by avoiding unnecessary allocs, and instead re-parsing data when actually needed.
Some benches

zld refers to our linker as a standalone binary, lld to LLVM's linker, and ld64 to Apple's linker. I should point out that we are still missing a number of optimisations in the linker, such as cstring deduplication, compression of the dynamic linker's relocations, and synthesising the unwind info section, so the difference between us and the other linkers will most likely shrink a little once those land. Also note that when linking larger files we are currently slowed down by our sha256 implementation, but once that is optimised, I think we should be able to reach/beat lld (which is currently the fastest of zld, lld and ld64). I haven't compared with mold yet, mainly because it used to break for me on macOS, but that was a while back and perhaps its stability has improved since.

  • redis-server
❯ hyperfine ./zld ./lld --warmup 60 
Benchmark 1: ./zld
  Time (mean ± σ):      43.4 ms ±   0.5 ms    [User: 33.7 ms, System: 8.2 ms]
  Range (min … max):    42.8 ms …  44.8 ms    66 runs
 
Benchmark 2: ./lld
  Time (mean ± σ):      48.9 ms ±   0.7 ms    [User: 42.5 ms, System: 17.4 ms]
  Range (min … max):    48.0 ms …  50.7 ms    58 runs
 
Summary
  './zld' ran
    1.13 ± 0.02 times faster than './lld'
❯ hyperfine ./zld ./ld64 --warmup 60 
Benchmark 1: ./zld
  Time (mean ± σ):      43.5 ms ±   0.5 ms    [User: 33.7 ms, System: 8.3 ms]
  Range (min … max):    42.7 ms …  45.6 ms    65 runs
 
Benchmark 2: ./ld64
  Time (mean ± σ):      47.1 ms ±   0.6 ms    [User: 60.0 ms, System: 14.4 ms]
  Range (min … max):    46.2 ms …  49.6 ms    62 runs
 
Summary
  './zld' ran
    1.08 ± 0.02 times faster than './ld64'
  • stage3
❯ hyperfine ./zld ./lld --warmup 5  
Benchmark 1: ./zld
  Time (mean ± σ):      3.057 s ±  0.011 s    [User: 2.644 s, System: 0.399 s]
  Range (min … max):    3.045 s …  3.084 s    10 runs
 
Benchmark 2: ./lld
  Time (mean ± σ):      1.152 s ±  0.014 s    [User: 1.285 s, System: 0.232 s]
  Range (min … max):    1.140 s …  1.178 s    10 runs
 
Summary
  './lld' ran
    2.65 ± 0.03 times faster than './zld'
❯ hyperfine ./zld ./ld64 --warmup 5 
Benchmark 1: ./zld
  Time (mean ± σ):      3.051 s ±  0.003 s    [User: 2.643 s, System: 0.396 s]
  Range (min … max):    3.045 s …  3.056 s    10 runs
 
Benchmark 2: ./ld64
  Time (mean ± σ):      2.347 s ±  0.009 s    [User: 3.876 s, System: 0.216 s]
  Range (min … max):    2.331 s …  2.359 s    10 runs
 
Summary
  './ld64' ran
    1.30 ± 0.01 times faster than './zld'

Motivation

Firstly, some motivation for the rewrite. While working on the COFF linker I realised that conflating the traditional and incremental states is a bad idea, since we then optimise for neither - the traditional context allows for optimisations that are unachievable, or at best awkward, in the incremental context. A perfect example is preallocating output sections: in the traditional context this can happen towards the end of the linking process, which simplifies the whole thing, whereas in the incremental context it is required upfront for obvious reasons.

No relocs/code pre-parsing per Atom

Prior to this rewrite, we would pre-parse the code and relocs for each Atom (aka a subsection of an input section in a relocatable object file) and store the results on the heap. This is not only slow but also completely unnecessary: we can delay the work until we actually need it. This approach is now followed throughout.
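
To illustrate, here is a minimal sketch in Zig of what "no pre-parsing" means in practice; `Object`, `Atom` and all field names are illustrative stand-ins, not the PR's actual types:

```zig
const std = @import("std");

/// Illustrative stand-in for a parsed relocatable object file.
const Object = struct {
    /// Raw bytes of the input file, read (or mmapped) once.
    contents: []const u8,
};

/// An Atom records *where* its code and relocs live in the input file
/// instead of owning heap-allocated copies of them.
const Atom = struct {
    object_index: u32,
    /// Offset and size of the atom's code within the input file.
    file_offset: usize,
    size: usize,
    /// Bounds into the object's relocation records; nothing is copied.
    first_reloc: u32,
    nrelocs: u32,

    /// Re-parse the code lazily, straight out of the input file, only
    /// at the point it is actually needed (e.g. when writing output).
    fn code(atom: Atom, object: Object) []const u8 {
        return object.contents[atom.file_offset..][0..atom.size];
    }
};

test "atom code is a view into the input file, not a copy" {
    const object = Object{ .contents = "\xc3junk" };
    const atom = Atom{
        .object_index = 0,
        .file_offset = 0,
        .size = 1,
        .first_reloc = 0,
        .nrelocs = 0,
    };
    try std.testing.expectEqualSlices(u8, "\xc3", atom.code(object));
}
```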

Linker now follows standard stages

Like lld, mold and ld64, we now also implement linking in stages: first comes symbol resolution, then we parse input sections into atoms, then we do dead code stripping (if desired), then we create synthetic atoms such as GOT cells, then we create thunks if required, etc. This significantly simplified the entire linker, as we do very specialised work per stage and no more.
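
In rough pseudocode, the staged driver reads like the sketch below; every helper name here is hypothetical, chosen to mirror the stage list above rather than the PR's actual function names:

```zig
/// A sketch of the staged linker driver. Each stage does one
/// specialised job and hands off to the next; `macho` stands in for
/// the linker's state (hence `anytype`).
fn link(macho: anytype) !void {
    try macho.resolveSymbols(); // 1. symbol resolution across all inputs
    try macho.parseIntoAtoms(); // 2. split input sections into atoms
    if (macho.options.dead_strip) {
        try macho.gcAtoms(); // 3. dead code stripping, if desired
    }
    try macho.createSyntheticAtoms(); // 4. GOT cells, stubs, ...
    try macho.createThunks(); // 5. range-extending thunks (arm64)
    try macho.allocateSegments(); // 6. lay out the output file
    try macho.writeAtoms(); // 7. generate code, relocate, write
}
```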

We do not store any code or relocs per synthetic atom

Instead of generating the code and relocs per synthetic atom (GOT cells, stubs, etc.), we only track their counts, VM addresses and targets, and generate the code and apply the relocations when writing the final image. In fact, we do not even need to track the addresses beyond the start and size of each synthetic section; I will refactor this in the future too.
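
A minimal sketch of the idea for a GOT section, with illustrative names (the PR's actual bookkeeping differs in detail): the section owns nothing but a list of targets, and the cell contents are produced while streaming out the image.

```zig
const std = @import("std");

/// A synthetic GOT section that stores no code or relocs at all,
/// only the targets of its cells.
const GotSection = struct {
    /// Symbol index targeted by each 8-byte GOT cell, in cell order.
    targets: std.ArrayListUnmanaged(u32) = .{},

    /// Everything else about a cell is derivable from the section's
    /// start address and the cell's position.
    fn cellAddress(got: GotSection, section_vmaddr: u64, cell: usize) u64 {
        _ = got;
        return section_vmaddr + cell * 8;
    }

    /// Generate the cell contents on the fly while writing the final
    /// image: one resolved pointer per target.
    fn write(got: GotSection, sym_vmaddrs: []const u64, writer: anytype) !void {
        for (got.targets.items) |sym_index| {
            try writer.writeInt(u64, sym_vmaddrs[sym_index], .little);
        }
    }
};
```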

Thunks

While at it, I also went ahead and implemented range-extending thunks, which means we can now link larger programs on arm64 without erroring out in the linker. For more info, see #9764. One word of explanation: contrary to what the issue suggested, we extend the jump range via thunks rather than branch islands. For those unfamiliar, both methods extend the jump range on a RISC ISA; however, a thunk materialises the unreachable target's address in a scratch register and branches via that register, so a thunk is 12 bytes on arm64. A branch island, on the other hand, is 4 bytes, as it is a simple b #next_label instruction. Branch islands are thus short-range extenders: to jump further in the file, we chain the jumps by hopping between islands until reaching the actual target.
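
For the curious, a 12-byte arm64 thunk boils down to three instructions. The sketch below hand-encodes them, assuming x16 as the scratch register; it is illustrative, not the PR's emission code.

```zig
const std = @import("std");

/// A 12-byte arm64 range-extending thunk: materialise the target's
/// address in the scratch register x16, then branch via the register.
///
///     adrp x16, target@PAGE
///     add  x16, x16, target@PAGEOFF
///     br   x16
fn thunkCode(thunk_vmaddr: u64, target_vmaddr: u64) [3]u32 {
    const src_page: i64 = @intCast(thunk_vmaddr >> 12);
    const dst_page: i64 = @intCast(target_vmaddr >> 12);
    // adrp reaches +/- 4 GiB: a signed 21-bit page delta.
    const pages: u21 = @bitCast(@as(i21, @intCast(dst_page - src_page)));
    const page_off: u12 = @truncate(target_vmaddr);

    // Hand-encoded per the Arm ARM; Rd = Rn = x16 throughout.
    const adrp: u32 = 0x9000_0010 | (@as(u32, pages & 0b11) << 29) | (@as(u32, pages >> 2) << 5);
    const add: u32 = 0x9100_0210 | (@as(u32, page_off) << 10);
    const br: u32 = 0xD61F_0200;
    return .{ adrp, add, br };
}

test "thunk to a target on the same page" {
    const code = thunkCode(0x1000, 0x1008);
    try std.testing.expectEqual(@as(u32, 0x9000_0010), code[0]); // adrp x16, 0
    try std.testing.expectEqual(@as(u32, 0x9100_2210), code[1]); // add x16, x16, #8
    try std.testing.expectEqual(@as(u32, 0xD61F_0200), code[2]); // br x16
}
```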

Future work

If you browse the changes, you will notice that I have introduced quite a bit of duplicated code. This is intentional but only temporary, and I will be deduping the common bits in-tree. In general, however, zld.zig will contain the main entry point and state tracking for the traditional linker, while MachO.zig will contain the incremental state tracking.

kubkon added 18 commits October 22, 2022 07:59
kubkon/zld gitrev 5733ed87abe2f07e1330c3232a252e9defec638a
1. If an object file was not compiled with `MH_SUBSECTIONS_VIA_SYMBOLS`
   (such as hand-written ASM on x86_64), treat the entire object file as
   not suitable for dead code stripping, i.e., as a GC root.
2. If there are non-extern relocs within a section, treat the entire
   section as a root, at least temporarily until we work out the exact
   conditions for marking the atoms live (see the sketch below).
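
A minimal sketch of these two rules, collapsing the per-object and per-section checks into one hypothetical function (the types come from `std.macho`; the function shape is illustrative):

```zig
const std = @import("std");
const macho = std.macho;

/// Returns true if the given section must be treated as a GC root.
fn treatAsGcRoot(header: macho.mach_header_64, relocs: []const macho.relocation_info) bool {
    // Rule 1: without MH_SUBSECTIONS_VIA_SYMBOLS (e.g. hand-written
    // x86_64 assembly), the object cannot be safely split into atoms,
    // so the whole file is a root.
    if (header.flags & macho.MH_SUBSECTIONS_VIA_SYMBOLS == 0) return true;
    // Rule 2: any non-extern reloc makes the whole section a root.
    for (relocs) |rel| {
        if (rel.r_extern == 0) return true;
    }
    return false;
}
```
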
@andrewrk (Member) commented Oct 22, 2022

You can count the drone CI run as a success. It passed all the tests and simply ran out of CI time while creating the tarball at the end. Furthermore, that failure mode is affecting the master branch too, and 10b8c4d should reduce the failure rate.

Edit: oh, also you already had a full successful run and only added docs after that :)

@andrewrk (Member) left a comment

What a beauty.

btw you can give 3 commands at the same time to hyperfine

@andrewrk andrewrk merged commit e67c756 into master Oct 22, 2022
@andrewrk andrewrk deleted the zld-sync branch October 22, 2022 17:14
@kubkon (Member, Author) commented Oct 22, 2022

> What a beauty.
>
> btw you can give 3 commands at the same time to hyperfine

Oh shit, did not know that! I'll keep that in mind for the future, thanks!

Successfully merging this pull request may close these issues.

macho: add branch islands support for arm64 (#9764)