Description
Apologies for the vague title and the long issue! I'm grateful for all the help you've already given me in getting to this point with your excellent work on the HAL. I very much appreciate your efforts, and my hope for this long letter is that it's received in that spirit of appreciation.
We've recently discovered something strange: our SPI DMA transfers would produce some meaningful data for a few bytes, and then more or less stop long before the configured datalen. We'd get maybe a half dozen to a dozen somewhat distinct values at the beginning before getting "stuck" transmitting the same byte over and over again. Initially, we suspected that something in our code was incompatible with #444, because we happened to pull those changes in at about the same time, but so far that has seemed more coincidental than anything else.
Indeed, this has been a particularly interesting one to track down, because it seems unrelated to (logical) code changes at all. We only need to transmit data, so we ran into this problem using dma_write (see footnote 1). However, changing to using dma_transfer with an empty receive buffer seems to avoid the problem entirely. From what I can tell, dma_transfer has an identical logical flow to dma_write, except for some additions that more or less no-op right out with an empty receiver (foreshadowing!). In fact, no amount of perturbing the source (e.g. git bisect, both on our project's code and the HAL's) revealed a difference in the behavior tied to any particular logic change. No matter the configuration of the peripheral, it seemed, dma_write would work inconsistently but dma_transfer would always behave as we expected. A big clue came when I discovered that we had a "load bearing" #[inline(always)]: removing that annotation would produce the expected DMA transfer with dma_write, and including it would "break" the output.
I believe the main difference there was that forcing inlining acted as a de-optimization. The function in question, transmit_chunk, is called from two places, so inlining transmit_chunk raised the "cost" of space-wise optimizations and thus de-inlined much of its body, especially dma_write and its transitive callees. This, along with the behavior difference when doing more "non-functional" work with dma_transfer, is what first suggested to me a timing problem.
Some further context that's probably helpful at this point: we've got a hard-real-time bound on getting our DMA transfer out the door (we're emitting pixel data for a video frame), so we've very aggressively optimized the code path through the HAL that initiates transfers for consistency & speed. Mostly that's been ensuring everything on the hot path is memory-resident so we aren't coupled through the cache: touching the flash, even just once or twice, seems to introduce enough variability that we blow our targets.
We've achieved that mainly by a combination of #[inline(always)] and #[link_section = ".rwtext"] (see footnote 2). The latter is a spelling of #[ram] that still allows LLVM to decide to inline the function (which will be important a little later). We've also turned on LTO and are always running in release mode (though, with debug symbols and -C force-frame-pointers). The upshot is that we've gotten the whole "start dma write" down to about 2000 cycles (see footnote 3) with almost no variability: either within a given execution or across whole reset/power cycles. It's even remarkably consistent across reflashes, when code "away" from the hot path is changed.
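For readers unfamiliar with the two attributes, here is a minimal sketch of how they can be combined. The function names and bodies (pack_pixel, fill_chunk) are hypothetical and are not taken from our project or the HAL; the cfg_attr wrapper is also my addition, so that the snippet builds on a host where no .rwtext section exists.

```rust
/// Hypothetical helper: packs one 16-bit pixel into two big-endian bytes.
/// `#[inline(always)]` asks the compiler to fold it into every caller.
#[inline(always)]
fn pack_pixel(pixel: u16) -> [u8; 2] {
    pixel.to_be_bytes()
}

/// Hypothetical hot-path function. On the riscv32 target it is placed in the
/// RAM-resident `.rwtext` section; unlike a hard "never inline" placement,
/// LLVM remains free to inline it into a caller instead. On other targets
/// (e.g. host-side tests) the attribute is skipped entirely.
#[cfg_attr(target_arch = "riscv32", link_section = ".rwtext")]
pub fn fill_chunk(pixels: &[u16], out: &mut [u8]) -> usize {
    let mut written = 0;
    // Each pixel occupies exactly two output bytes; stop at the shorter input.
    for (pixel, dst) in pixels.iter().zip(out.chunks_exact_mut(2)) {
        dst.copy_from_slice(&pack_pixel(*pixel));
        written += 2;
    }
    written
}
```

Whether fill_chunk ends up inlined or as a standalone symbol in RAM is LLVM's decision, which is exactly the degree of freedom that matters for the rest of this report.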
Ok, so that's the setup, now for the punchline: the most reliable way I've found to produce the desired operation of the DMA-driven SPI peripheral is to add a single nop after tx.prepare_transfer(..):
--- a/esp-hal-common/src/spi.rs
+++ b/esp-hal-common/src/spi.rs
@@ -1904,6 +1906,8 @@ where
reset_dma_before_load_dma_dscr(reg_block);
tx.prepare_transfer(self.dma_peripheral(), false, ptr, len)?;
+ unsafe { core::arch::asm!("nop") }
+
self.clear_dma_interrupts();
reset_dma_before_usr_cmd(reg_block);
Suffice it to say it took a lot of reading disassembly and failed attempts to figure that out, but so far it does seem resilient in the face of code changes in a way that none of my other attempts have been. The problem seems to only present itself when LLVM maximally inlines all the various traits and register access blocks &c, which only happens some of the time. The main variables, I believe, are those functions we said to "either inline or place in RAM, your choice." The outcome is very stable for a given pile of code: it seems to happen 100% of the time or 0% of the time for a given binary. However, I find it very hard to predict what changes to that code will cause LLVM to optimize differently, so none of my other attempts have been robust in terms of getting LLVM to optimize enough that we meet our performance targets, but not so much as to break the output.
The positioning of the nop is meant to act as a "buffer" between the DMA and SPI peripherals during CPU programming. When it does optimize "fully," LLVM produces disassembly that looks like this:
40381784: lui a1,0x6003f # set a1 to 0x6003_f000 , the gdma peripheral base address
# ... # ... most setup elided (it all seems to match the steps in the TRM) ...
403817c6: lw a0,224(a1) # read a0 from gdma_base + 224 (0xe0), which is GDMA_OUT_LINK_CH0_REG
403817ca: lui a2,0x200 # set bit 21 in a2
403817ce: or a0,a0,a2 # set bit 21 in a0
403817d0: sw a0,224(a1) # write a0 back to GDMA_OUT_LINK_CH0_REG, setting GDMA_OUTLINK_START_CH0
403817d4: lw a0,0(a1) # immediately read GDMA_INT_RAW_CH0_REG
403817d6: andi a0,a0,64 # check bit 6 (GDMA_OUT_DSCR_ERR_CH0_INT_RAW)
403817da: bnez a0,... # if it's set, bail
403817de: nop
403817e0: lui a0,0x61 # this pair...
403817e4: addi a0,a0,3 # ... sets a0 to 0x0006_1003
403817e6: lui s1,0x60024 # set s1 to 0x6002_4000 , the SPI2 peripheral base address
403817ea: sw a0,56(s1) # writes a0 to spi2_base + 56 (0x38), SPI_DMA_INT_CLR_REG
# setting bits 18, 17, 12, 1, and 0, i.e. clears:
# 18: SPI_MST_TX_AFIFO_REMPTY_ERR_INT_CLR
# 17: SPI_MST_RX_AFIFO_WFULL_ERR_INT_CLR
# 12: SPI_TRANS_DONE_INT_CLR
# 1: SPI_DMA_OUTFIFO_EMPTY_ERR_INT_CLR
# 0: SPI_DMA_INFIFO_FULL_ERR_INT_CLR
403817ec: lw a0,48(s1) # read a0 from spi2_base + 48 (0x30), SPI_DMA_CONF_REG
403817ee: lui a1,0xe0000 # set bits 31, 30, 29 in a1
403817f2: or a0,a0,a1 # set bits 31, 30, 29 in a0
403817f4: sw a0,48(s1) # write a0 back to spi2_base + 48 (0x30), SPI_DMA_CONF_REG
# i.e. triggers
# 31: SPI_DMA_AFIFO_RST
# 30: SPI_BUF_AFIFO_RST
# 29: SPI_RX_AFIFO_RST
403817f6: lw a0,0(s1) # read a0 from spi2_base + 0 , SPI_CMD_REG
403817f8: lui a1,0x1000 # ..
403817fc: or a0,a0,a1 # ..
403817fe: sw a0,0(s1) # set bit 24 in SPI_CMD_REG ( SPI_USR )
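The immediates in the annotations above can be double-checked mechanically. The following is a small sketch of the lui/addi arithmetic only; the helper names (lui, lui_addi, set_bits) are made up for this illustration and nothing here touches hardware.

```rust
/// Value placed in the destination register by `lui rd, imm20`:
/// the 20-bit immediate lands in bits 31..12, the low 12 bits are zero.
fn lui(imm20: u32) -> u32 {
    (imm20 & 0xF_FFFF) << 12
}

/// Value after `lui rd, hi` followed by `addi rd, rd, lo`, where `lo` is the
/// sign-extended 12-bit immediate of the `addi`.
fn lui_addi(hi: u32, lo: i32) -> u32 {
    lui(hi).wrapping_add(lo as u32)
}

/// Indices of the bits set in `value`, lowest first.
fn set_bits(value: u32) -> Vec<u32> {
    (0..32u32).filter(|&bit| value & (1u32 << bit) != 0).collect()
}
```

For example, set_bits(lui_addi(0x61, 3)) yields bits 0, 1, 12, 17 and 18, matching the SPI_DMA_INT_CLR_REG annotation, and lui(0x200) is bit 21 alone, matching GDMA_OUTLINK_START_CH0.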
The nop slots right in between writes to addresses at the 0x6003_f000 (DMA) base, and the subsequent interactions with the hardware registers based at 0x6002_4000 (SPI2). I started with a much larger number of nops (128), which also functioned as I hoped, and was more than a little surprised to find that even a single nop makes the difference.
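For anyone who wants to repeat the experiment with different nop counts, a helper along these lines may be convenient. This is only a sketch: nop_delay is a hypothetical name, not something in the HAL, and because it uses a loop the compiler may emit branch instructions between the nops unless it fully unrolls, so the delay is "at least N nops" rather than exactly N.

```rust
use core::arch::asm;

/// Hypothetical experiment helper: executes `N` `nop` instructions.
/// The `asm!` block is left without `nomem`/`pure` options on purpose, so the
/// compiler treats it as having side effects and does not move memory
/// accesses (including register writes) across it.
#[inline(always)]
pub fn nop_delay<const N: usize>() {
    for _ in 0..N {
        // SAFETY: `nop` takes no operands and changes no architectural state.
        unsafe { asm!("nop") };
    }
}
```

Calling nop_delay::<1>() in place of the bare asm! line in the diff above should reproduce the single-nop case, and nop_delay::<128>() the original one. It assumes a target whose assembler accepts the nop mnemonic, which includes riscv32.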
But, now I'm not sure where to go next. Probably adding a single nop isn't the right answer, but it does show how sensitive to change the issue is: it's very easy to "fix" it by accident. From what I can tell, the machine code faithfully implements the procedures described in the TRM for the esp32c3, but maybe there's a missing step? Some bit that needs to be cleared? There's no mention (nor mechanism) that I can find for waiting on the DMA's internal programming like there is with the update bit for SPI; is that not a concern in this case? It does seem strange that we immediately check the DMA's interrupt register for descriptor errors (is that guaranteed to be set within a single cycle of turning on the outlink?), but that can't be our problem: the descriptors are valid, they're just not being used.
I've also got little to no intuition about how broad this problem is: either across projects or esp32* platforms. It's plausible that cranking the optimization settings for one of the bog-standard examples would show the same behavior (I don't think we removed anything that would prohibit LLVM from inlining, just added mandates)?
Thanks again!
Footnotes
1. We haven't yet switched over to the new SpiDataMode + write functionality from Half-duplex SPI #444, but we're looking forward to it! We did run an experiment early on that suggested it suffered from the same problem, though I haven't gone back and checked.
2. As an aside, we'd be glad to take any suggestions y'all have for how to make this process a little more robust/less intrusive. So far the most reliable way I've found to do it is manually tracing through the disassembly for memory addresses that start with 0x42... for functions (or 0x3c... for data) and then manually annotating the things I discover. This works, but isn't exactly error-free or particularly reproducible (though I am eternally grateful to whichever wonderful person taught riscv objdump to recognize lui/addi pairs and put the resultant address in a comment). What's more, I have no idea how to implement that without forking whichever dependency I find myself in to add the annotations, which naturally comes with its own set of costs too.
3. 1783 cycles as of the time of this writing, as measured by a static cycle counter and mpccr (which adds about 16 cycles of measurement overhead).
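The manual scan described in footnote 2 could probably be partly automated. Below is a rough sketch that flags hex literals in a line of disassembly whose top byte is 0x42 (functions in flash) or 0x3c (data in flash), using only the prefixes mentioned in that footnote. The names FlashRef, classify and flash_refs are hypothetical, and the assumption that objdump prints the resolved address as a full 0x-prefixed literal in a comment is mine.

```rust
/// A reference into flash-mapped memory, by kind (hypothetical type).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum FlashRef {
    Code(u32),
    Data(u32),
}

/// Classifies an address purely by its top byte: 0x42 is treated as code in
/// flash, 0x3c as data in flash, anything else as not flash.
pub fn classify(addr: u32) -> Option<FlashRef> {
    match addr >> 24 {
        0x42 => Some(FlashRef::Code(addr)),
        0x3c => Some(FlashRef::Data(addr)),
        _ => None,
    }
}

/// Finds every `0x...` literal in one line of disassembly text and returns
/// the ones that classify as flash references. Short immediates such as
/// `0x42001` shift down to a zero top byte and are therefore ignored.
pub fn flash_refs(line: &str) -> Vec<FlashRef> {
    line.split(|c: char| !c.is_ascii_alphanumeric())
        .filter_map(|token| token.strip_prefix("0x"))
        .filter_map(|hex| u32::from_str_radix(hex, 16).ok())
        .filter_map(classify)
        .collect()
}
```

Running flash_refs over each line of the hot path's disassembly would give a list of candidates to annotate, though it does nothing about the harder problem of applying annotations inside dependencies.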