Closed
Description
Currently, Flash Attention is available in the CUDA and Metal backends via #5021.
From the paper: Flash attention is an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high bandwidth memory (HBM) and GPU on-chip SRAM. [...] it requires fewer HBM accesses than standard attention, and is optimal for a range of SRAM sizes. [...]
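For reference, here is a minimal NumPy sketch of the tiling / online-softmax idea the paper describes: attention is accumulated over key/value tiles with a running max and running denominator, so the full N x N score matrix is never materialized. The function name, shapes, and block size are illustrative only and do not correspond to any llama.cpp backend code.

```python
import numpy as np

def tiled_attention(Q, K, V, block_size=64):
    """O = softmax(Q K^T / sqrt(d)) V, computed one key/value tile at a time.
    Illustrative sketch of the Flash Attention tiling idea, not real backend code."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros_like(Q, dtype=np.float64)
    row_max = np.full(n, -np.inf)   # running max of scores per query row
    row_sum = np.zeros(n)           # running softmax denominator per query row

    for start in range(0, K.shape[0], block_size):
        Kb = K[start:start + block_size]          # one tile of keys
        Vb = V[start:start + block_size]          # matching tile of values
        S = (Q @ Kb.T) * scale                    # scores for this tile only

        new_max = np.maximum(row_max, S.max(axis=1))
        correction = np.exp(row_max - new_max)    # rescale previously accumulated results
        P = np.exp(S - new_max[:, None])          # tile-local softmax numerator

        row_sum = row_sum * correction + P.sum(axis=1)
        O = O * correction[:, None] + P @ Vb
        row_max = new_max

    return O / row_sum[:, None]

# Quick check against naive attention that materializes the full score matrix.
rng = np.random.default_rng(0)
Q = rng.standard_normal((128, 32))
K = rng.standard_normal((128, 32))
V = rng.standard_normal((128, 32))

S_full = (Q @ K.T) / np.sqrt(Q.shape[1])
P_full = np.exp(S_full - S_full.max(axis=1, keepdims=True))
reference = (P_full / P_full.sum(axis=1, keepdims=True)) @ V
assert np.allclose(tiled_attention(Q, K, V), reference)
```

The tiles are what let the working set fit in fast on-chip memory instead of round-tripping the whole score matrix through HBM; whether Intel GPUs see a similar win depends on how their local memory maps onto this scheme.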
The question is whether dedicated Intel GPUs can benefit from it as well, and it will be interesting to see how much the performance improves.