Closed
Description
Currently, Flash Attention is available in the CUDA and Metal backends via #5021.
From the paper: Flash attention is an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high bandwidth memory (HBM) and GPU on-chip SRAM. [...] it requires fewer HBM accesses than standard attention, and is optimal for a range of SRAM sizes. [...]
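For reference, here is a minimal NumPy sketch of the tiling / online-softmax idea the paper describes: attention is accumulated over key/value tiles with a running max and running denominator, so the full N x N score matrix is never materialized. The function name, shapes, and block size are illustrative only and do not correspond to any llama.cpp backend code.

```python
import numpy as np

def tiled_attention(Q, K, V, block_size=64):
    """O = softmax(Q K^T / sqrt(d)) V, computed one key/value tile at a time.
    Illustrative sketch of the Flash Attention tiling idea, not real backend code."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros_like(Q, dtype=np.float64)
    row_max = np.full(n, -np.inf)   # running max of scores per query row
    row_sum = np.zeros(n)           # running softmax denominator per query row

    for start in range(0, K.shape[0], block_size):
        Kb = K[start:start + block_size]          # one tile of keys
        Vb = V[start:start + block_size]          # matching tile of values
        S = (Q @ Kb.T) * scale                    # scores for this tile only

        new_max = np.maximum(row_max, S.max(axis=1))
        correction = np.exp(row_max - new_max)    # rescale previously accumulated results
        P = np.exp(S - new_max[:, None])          # tile-local softmax numerator

        row_sum = row_sum * correction + P.sum(axis=1)
        O = O * correction[:, None] + P @ Vb
        row_max = new_max

    return O / row_sum[:, None]

# Quick check against naive attention that materializes the full score matrix.
rng = np.random.default_rng(0)
Q = rng.standard_normal((128, 32))
K = rng.standard_normal((128, 32))
V = rng.standard_normal((128, 32))

S_full = (Q @ K.T) / np.sqrt(Q.shape[1])
P_full = np.exp(S_full - S_full.max(axis=1, keepdims=True))
reference = (P_full / P_full.sum(axis=1, keepdims=True)) @ V
assert np.allclose(tiled_attention(Q, K, V), reference)
```

The tiles are what let the working set fit in fast on-chip memory instead of round-tripping the whole score matrix through HBM; whether Intel GPUs see a similar win depends on how their local memory maps onto this scheme.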
The question is whether dedicated Intel GPUs can benefit from it as well, and it will be interesting to see how much the performance improves.