
KV Cache getting prefilled and emptied during training phase #92

Open
@adityavipradas

Description


Although the KV cache implementation looks good for inference, block_kv_cache is also getting prefilled (and then emptied) during the training phase, which increases compute and memory consumption. The call site in the training forward pass:

logits, _ = self.decoder(combined_embd, attention_mask=attention_mask) # Not logits yet, but easier to return like this
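One possible fix, sketched below: gate every cache read/write on an explicit use_cache flag combined with self.training, so that training forwards never touch the cache at all. This is a minimal illustration, not the repository's actual API; the DecoderBlock class, the use_cache flag, and the attention internals are stand-ins, and the kv_cache attribute models block_kv_cache.

import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    # Illustrative stand-in for a decoder block; kv_cache models block_kv_cache.

    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.kv_cache = None  # only populated on the cached inference path

    def forward(self, x, use_cache=False):
        # Cache writes happen only when explicitly requested AND the module is
        # in eval mode, so training forwards never allocate cache tensors.
        if use_cache and not self.training:
            self.kv_cache = (x, x)  # placeholder for the real key/value tensors
        out, _ = self.attn(x, x, x, need_weights=False)
        return out

block = DecoderBlock()
x = torch.randn(2, 16, 64)

block.train()
_ = block(x)                   # training: kv_cache stays None
assert block.kv_cache is None

block.eval()
_ = block(x, use_cache=True)   # inference: cache path is taken
assert block.kv_cache is not None

With this gating, the decoder call quoted above would pass something like use_cache=not self.training, leaving inference behavior unchanged while keeping training forwards cache-free.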
