
[SYCL] Implement Flash attention. #7141

Closed
@qnixsynapse

Description


Currently, Flash attention is available in the CUDA and Metal backends via #5021.

From the paper: Flash attention is an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high bandwidth memory (HBM) and GPU on-chip SRAM. [...] it requires fewer HBM accesses than standard attention, and is optimal for a range of SRAM sizes. [...]

The question is whether dedicated Intel GPUs can benefit from it; it will be interesting to see how much performance improves.
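
For context, here is a minimal CPU sketch of the tiled, online-softmax computation the quoted passage describes. The function and parameter names (`attn_row_tiled`, `n_kv`, `d_head`, `TILE`) are illustrative only and not part of the ggml/llama.cpp or SYCL APIs; a real SYCL kernel would parallelize across query rows and keep each K/V tile in work-group local memory rather than relying on the cache.

```cpp
// Minimal CPU sketch of the tiled, online-softmax attention idea behind
// Flash attention. All names here are illustrative, not llama.cpp/ggml API.
#include <cmath>
#include <vector>
#include <algorithm>

// Process one query row against all keys/values, one tile at a time, so only a
// TILE-sized block of K/V needs to sit in fast on-chip memory at any moment.
void attn_row_tiled(const float *q,          // [d_head]
                    const float *k,          // [n_kv, d_head], row-major
                    const float *v,          // [n_kv, d_head], row-major
                    float *out,              // [d_head]
                    int n_kv, int d_head, int TILE = 64) {
    const float scale = 1.0f / std::sqrt((float) d_head);

    float m = -INFINITY;                     // running max of the logits
    float l = 0.0f;                          // running softmax denominator
    std::vector<float> acc(d_head, 0.0f);    // running weighted sum of V rows

    for (int t0 = 0; t0 < n_kv; t0 += TILE) {
        const int t1 = std::min(t0 + TILE, n_kv);

        for (int j = t0; j < t1; ++j) {
            // logit for this key
            float s = 0.0f;
            for (int c = 0; c < d_head; ++c) s += q[c] * k[j*d_head + c];
            s *= scale;

            // online softmax: rescale the previous state when a new max appears,
            // so no full n_kv-sized score row is ever materialized
            const float m_new = std::max(m, s);
            const float corr  = std::exp(m - m_new);
            const float p     = std::exp(s - m_new);

            l = l * corr + p;
            for (int c = 0; c < d_head; ++c)
                acc[c] = acc[c] * corr + p * v[j*d_head + c];
            m = m_new;
        }
    }

    // final normalization gives exactly softmax(q·K^T / sqrt(d)) · V
    for (int c = 0; c < d_head; ++c) out[c] = acc[c] / l;
}
```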
