Release Release v4.53.0 · huggingface/transformers

Release v4.53.0

Gemma3n

Gemma 3n models are designed for efficient execution on low-resource devices. They are capable of multimodal input, handling text, image, video, and audio input, and generating text outputs, with open weights for pre-trained and instruction-tuned variants. These models were trained with data in over 140 spoken languages.

Gemma 3n models use selective parameter activation technology to reduce resource requirements. This technique allows the models to operate at an effective size of 2B and 4B parameters, which is lower than the total number of parameters they contain. For more information on Gemma 3n's efficient parameter management technology, see the Gemma 3n page.

from transformers import pipeline
import torch

pipe = pipeline(
    "image-text-to-text",
    torch_dtype=torch.bfloat16,
    model="google/gemma-3n-e4b",
    device="cuda",
)
output = pipe(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
    text="<image_soft_token> in this image, there is"
)

print(output)

Dia

Dia is an opensource text-to-speech (TTS) model (1.6B parameters) developed by Nari Labs.
It can generate highly realistic dialogue from transcript including nonverbal communications such as laughter and coughing.
Furthermore, emotion and tone control is also possible via audio conditioning (voice cloning).

Model Architecture:
Dia is an encoder-decoder transformer based on the original transformer architecture. However, some more modern features such as
rotational positional embeddings (RoPE) are also included. For its text portion (encoder), a byte tokenizer is utilized while
for the audio portion (decoder), a pretrained codec model DAC is used - DAC encodes speech into discrete codebook
tokens and decodes them back into audio.

Add Dia model by @buttercrab in #38405

Kyutai Speech-to-Text

Kyutai STT is a speech-to-text model architecture based on the Mimi codec, which encodes audio into discrete tokens in a streaming fashion, and a Moshi-like autoregressive decoder. Kyutai’s lab has released two model checkpoints:

kyutai/stt-1b-en_fr: a 1B-parameter model capable of transcribing both English and French
kyutai/stt-2.6b-en: a 2.6B-parameter model focused solely on English, optimized for maximum transcription accuracy

Add kyutai stt by @eustlb in #38909

Read more about the model in the documentation

V-JEPA 2

V-JEPA 2 is a self-supervised approach to training video encoders developed by FAIR, Meta. Using internet-scale video data, V-JEPA 2 attains state-of-the-art performance on motion understanding and human action anticipation tasks. V-JEPA 2-AC is a latent action-conditioned world model post-trained from V-JEPA 2 (using a small amount of robot trajectory interaction data) that solves robot manipulation tasks without environment-specific data collection or task-specific training or calibration.

Add V-JEPA 2 by @qubvel in #38746

Read more about the model in the documentation.

Arcee

Arcee is a decoder-only transformer model based on the Llama architecture with a key modification: it uses ReLU² (ReLU-squared) activation in the MLP blocks instead of SiLU, following recent research showing improved training efficiency with squared activations. This architecture is designed for efficient training and inference while maintaining the proven stability of the Llama design.

The Arcee model is architecturally similar to Llama but uses x * relu(x) in MLP layers for improved gradient flow and is optimized for efficiency in both training and inference scenarios.

Add Arcee model support by @Crystalcareai in #38621

Read more about the model in the documentation.

ColQwen2

ColQwen2 is a variant of the ColPali model designed to retrieve documents by analyzing their visual features. Unlike traditional systems that rely heavily on text extraction and OCR, ColQwen2 treats each page as an image. It uses the Qwen2-VL backbone to capture not only text, but also the layout, tables, charts, and other visual elements to create detailed multi-vector embeddings that can be used for retrieval by computing pairwise late interaction similarity scores. This offers a more comprehensive understanding of documents and enables more efficient and accurate retrieval.

Add ColQwen2 to 🤗 transformers by @tonywu71 in #35778

Read more about the model in the documentation.

MiniMax

MiniMax is a powerful language model with 456 billion total parameters, of which 45.9 billion are activated per token. To better unlock the long context capabilities of the model, MiniMax adopts a hybrid architecture that combines Lightning Attention, Softmax Attention and Mixture-of-Experts (MoE). Leveraging advanced parallel strategies and innovative compute-communication overlap methods—such as Linear Attention Sequence Parallelism Plus (LASP+), varlen ring attention, Expert Tensor Parallel (ETP), etc., MiniMax's training context length is extended to 1 million tokens, and it can handle a context of up to 4 million tokens during the inference. On various academic benchmarks, MiniMax also demonstrates the performance of a top-tier model.

The architecture of MiniMax is briefly described as follows:

Total Parameters: 456B
Activated Parameters per Token: 45.9B
Number Layers: 80
Hybrid Attention: a softmax attention is positioned after every 7 lightning attention.
- Number of attention heads: 64
- Attention head dimension: 128
Mixture of Experts:
- Number of experts: 32
- Expert hidden dimension: 9216
- Top-2 routing strategy
Positional Encoding: Rotary Position Embedding (RoPE) applied to half of the attention head dimension with a base frequency of 10,000,000
Hidden Size: 6144
Vocab Size: 200,064

For more details refer to the release blog post.

Add support for MiniMax's MiniMax-Text-01 by @geetu040 in #35831

Read more about the model in the documentation.

Encoder-Decoder Gemma

T5Gemma (aka encoder-decoder Gemma) was proposed in a research paper by Google. It is a family of encoder-decoder large langauge models, developed by adapting pretrained decoder-only models into encoder-decoder. T5Gemma includes pretrained and instruction-tuned variants. The architecture is based on transformer encoder-decoder design following T5, with improvements from Gemma 2: GQA, RoPE, GeGLU activation, RMSNorm, and interleaved local/global attention.

T5Gemma has two groups of model sizes: 1) Gemma 2 sizes (2B-2B, 9B-2B, and 9B-9B), which are based on the offical Gemma 2 models (2B and 9B); and 2) T5 sizes (Small, Base, Large, and XL), where are pretrained under the Gemma 2 framework following T5 configuration. In addition, we also provide a model at ML size (medium large, ~2B in total), which is in-between T5 Large and T5 XL.

The pretrained varaints are trained with two objectives: prefix language modeling with knowledge distillation (PrefixLM) and UL2, separately. We release both variants for each model size. The instruction-turned varaints was post-trained with supervised fine-tuning and reinforcement learning.

Encoder-Decoder Gemma by @bzhangGo in #38332

Read more about the model in the documentation.

GLM-4.1V

The GLM-4.1V model architecture is added to transformers; no models have yet been released with that architecture. Stay tuned for the GLM team upcoming releases!

GLM-4.1V Model support by @zRzRzRzRzRzRzR in #38431

Read more about the model in the documentation.

Falcon H1

The FalconH1 model was developed by the TII Pretraining team. A comprehensive research paper covering the architecture, pretraining dynamics, experimental results, and conclusions is forthcoming. You can read more about this series in this website.

[MODEL] Add Falcon H1 by @younesbelkada in #38249

Read more about the model in the documentation.

LightGlue

The LightGlue model was proposed in LightGlue: Local Feature Matching at Light Speed
by Philipp Lindenberger, Paul-Edouard Sarlin and Marc Pollefeys.

Similar to SuperGlue, this model consists of matching
two sets of local features extracted from two images, its goal is to be faster than SuperGlue. Paired with the
SuperPoint model, it can be used to match two images and
estimate the pose between them. This model is useful for tasks such as image matching, homography estimation, etc.

The abstract from the paper is the following:

We introduce LightGlue, a deep neural network that learns to match local features across images. We revisit multiple
design decisions of SuperGlue, the state of the art in sparse matching, and derive simple but effective improvements.
Cumulatively, they make LightGlue more efficient - in terms of both memory and computation, more accurate, and much
easier to train. One key property is that LightGlue is adaptive to the difficulty of the problem: the inference is much
faster on image pairs that are intuitively easy to match, for example because of a larger visual overlap or limited
appearance change. This opens up exciting prospects for deploying deep matchers in latency-sensitive applications like
3D reconstruction. The code and trained models are publicly available at this https URL

Add LightGlue model by @sbucaille in #31718

Read more about the model in the documentation.

dots.llm1

The abstract from the report is the following:

Mixture of Experts (MoE) models have emerged as a promising paradigm for scaling language models efficiently by activating only a subset of parameters for each input token. In this report, we present dots.llm1, a large-scale MoE model that activates 14B parameters out of a total of 142B parameters, delivering performance on par with state-of-the-art models while reducing training and inference costs. Leveraging our meticulously crafted and efficient data processing pipeline, dots.llm1 achieves performance comparable to Qwen2.5-72B after pretraining on high-quality corpus and post-training to fully unlock its capabilities. Notably, no synthetic data is used during pretraining. To foster further research, we open-source intermediate training checkpoints spanning the entire training process, providing valuable insights into the learning dynamics of large language models.

[Model] add dots1 by @redmoe-moutain in #38143

Read more about the model in the documentation.

SmolLM3

SmolLM3 is a fully open, compact language model designed for efficient deployment while maintaining strong performance. It uses a Transformer decoder architecture with Grouped Query Attention (GQA) to reduce the kv cache, and no RoPE, enabling improved performance on long-context tasks. It is trained using a multi-stage training approach on high-quality public datasets across web, code, and math domains. The model is multilingual and supports very large context lengths. The instruct variant is optimized for reasoning and tool use.

Add SmolLM3 by @anton-l in #38755

Read more about the model in the documentation.

Performance optimizations

Kernels

In previous versions, installing the kernels library would automatically activate the custom kernels added to transformers, because the @use_kernel_forward_from_the_hub decorator directly swapped out the model’s forward method. This implicit behavior caused several issues for users — including problems with torch.compile, non-determinism, and inconsistent outputs.

To address this, we've introduced a new opt-in mechanism called kernelize. You can now enable kernel usage explicitly by passing use_kernels=True to from_pretrained. The use_kernel_forward_from_the_hub decorator now simply stores the kernel name that the user wants to use — and kernelize handles the rest under the hood.

Example

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    use_kernels=True
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

input = "Hello"
input_ids = tokenizer(input, return_tensors="pt").to(model.device).input_ids
output = model.generate(input_ids, max_new_tokens=100)

print(tokenizer.decode(output[0], skip_special_tokens=True))

More kernels will be added over time — this will be a collaborative, community-driven effort to make transformers lighter and faster 🤗

Add kernelize to transformers by @MekkCyber in #38205

Flash Attention 3

Support for Flash Attention 3 is added across the most popular models.

Support for Flash Attention 3 by @EduardDurech in #38972

Notable repository maintenance & refactors

Several efforts refactoring the repository are happening in parallel. The direction is to greatly simplify the library, removing unnecessary codepaths. Whilst the efforts are spread across the library, they're particularly visible in each individual models; where non-modeling-specific code will be simplified and eventually removed.

We take the assumption that model-agnostic utilities shouldn't be in the modeling code. Things like the output of attentions, hidden states, router logits, are important for end-users but don't need to be explicitely displayed in the modeling code.

Apply GradientCheckpointingLayer to the whole repo by @qubvel in #38913
No more Tuple, List, Dict by @Rocketknight1 in #38797
Deprecate TF + JAX by @Rocketknight1 in #38758

Breaking changes

Several minimal breaking changes aiming to bring clearer defaults while greatly simplifying the library have been merged.

🔴 Update default dtype for pipelines to auto by @Vaibhavs10 in #38882
🚨🚨 Fix initialization of Mask2Former by @Cyrilvallez in #38864
🚨 🚨 Inherited CausalLM Tests by @Rocketknight1 in #37590
🚨Early-error🚨 config will error out if output_attentions=True and the attn implementation is wrong by @ArthurZucker in #38288
🔴 [VLM] modeling updates by @zucchini-nlp in #38317
🚨 🚨 Fix custom code saving by @Rocketknight1 in #37716
🚨🚨[core] Completely rewrite the masking logic for all attentions by @Cyrilvallez in #37866
🔴🔴🔴 [Attention] Refactor Attention Interface for Bart-based Models by @vasqu in #38108
🔴[Attention] Attention refactor for Whisper-based models by @vasqu in #38235
Add CB by @ArthurZucker in #38085

Bugfixes and improvements

CI reporting improvements by @ydshieh in #38230
Revert parallelism temporarily by @LysandreJik in #38240
tp plan should not be NONE by @ArthurZucker in #38255
[Falcon H1] Fix Typo in Integration Test by @dhiaEddineRhaiem in #38256
[compile] re-enable for Qwen-VL models by @zucchini-nlp in #38127
fix multi-image case for llava-onevision by @cyr0930 in #38084
Add tearDown method to Quark to solve OOM issues by @MekkCyber in #38234
Clearer error on import failure by @LysandreJik in #38257
[whisper] small changes for faster tests by @gante in #38236
Simplify DTensor Check for modeling_utils.py by @amd-xiaoyu12 in #38245
Improve typing in TrainingArgument by @cyyever in #36944
Fix: missing else branch to handle "--load_best_model_at_end" in training_args.py by @danielyxyang in #38217
assign the correct torchao data layout for xpu by @jiqing-feng in #37781
Remove Japanese sequence_classification doc and update references by @ritsumei-aoi in #38246
Protect ParallelInterface by @ArthurZucker in #38262
Update Model Card for Mamba by @ParagEkbote in #37863
docs(swin): Update Swin model card to standard format by @BryanBradfo in #37628
add XPU info print in print_env by @yao-matrix in #38282
[whisper] move processor test into processor test file 🧹 by @gante in #38266
[Whisper] handle deprecation of forced_decoder_ids by @gante in #38232
add liger-kernel to docker file by @ydshieh in #38292
Fix tp error when torch distributed is already initialized by @SunMarc in #38294
More typing in src/transformers/training_args.py by @cyyever in #38106
refine transformers env output by @yao-matrix in #38274
Update CI Docker base image for AMD tests by @ahadnagy in #38261
Fix HybridChunedCache & Llama4 by @Cyrilvallez in #38299
Oups typo for HybridChunkedCache by @Cyrilvallez in #38303
[Tests] Cleanup Janus Testcase by @yaswanth19 in #38311
[emu3] fix conversion script by @zucchini-nlp in #38297
Fix run_slow by @cyyever in #38314
Fix typo: change 'env' to 'environment' in .circleci/config.yml by @AbdessamadEnabih in #38273
Adds use_repr to model_addition_debugger_context by @RyanMullins in #37984
[tf/flax] handle forced_decoder_ids deletion by @gante in #38316
[Whisper + beam search] fix usage of beam_indices by @gante in #38259
Expose AutoModelForTimeSeriesPrediction for import by @jinan-zhou in #38307
[custom_generate] don't forward custom_generate and trust_remote_code by @gante in #38304
add vasqu to self-comment-ci.yml by @ydshieh in #38324
Fix some tests (especially compile with fullgraph=True on Python<3.11) by @Cyrilvallez in #38319
[performance_optim] reduce frequency of declaring attention_mask in Ascend NPU flash attention by @FightingZhen in #38278
refactor can_save_slow_tokenizer by @itazap in #37722
[FlexAttention] Reenable flex for encoder-decoder and make the test more robust by @vasqu in #38321
Enhance Model Loading By Providing Parallelism, Uses Optional Env Flag by @inf3rnus in #36835
Use Gradient Checkpointing Layer in Jamba & Blip Related Models by @alex-jw-brooks in #38310
Never fallback to eager implicitly by @Cyrilvallez in #38327
Remove duplicate docstring: resample by @qqii in #38305
Update BioGPT model card by @Aguedoom in #38214
docs(swinv2): Update SwinV2 model card to new standard format by @BryanBradfo in #37942
[docs]: update roformer.md model card by @KsuParkhamchuk in #37946
new failure CI reports for all jobs by @ydshieh in #38298
Hot fix for AMD CI workflow by @ydshieh in #38349
Uninstall kernels for AMD docker images by @ydshieh in #38354
[VLMs] add helpers for get/set embedding by @zucchini-nlp in #38144
switch to device agnostic device calling for test cases by @yao-matrix in #38247
[OPT] Fix attention scaling by @vasqu in #38290
Fix all import errors based on older torch versions by @Cyrilvallez in #38370
Fix incorrect batching audio index calculation for Phi-4-Multimodal by @Isotr0py in #38103
Protect get_default_device for torch<2.3 by @Cyrilvallez in #38376
[Falcon H1] Fix slow path forward pass by @dhiaEddineRhaiem in #38320
Improved cache docs by @manueldeprada in #38060
for now disable compile by @ArthurZucker in #38383
Use one utils/notification_service.py by @ydshieh in #38379
Better check in initialize_weights by @Cyrilvallez in #38382
fix typos by @DeVikingMark in #38336
fix typo: tokenizer -> tokenize by @foldl in #38357
Stop TF weight rename reDOS by @Rocketknight1 in #38325
[cli] cli usable without torch by @gante in #38386
update gemma tests by @ydshieh in #38384
Stop autoconverting custom code checkpoints by @Rocketknight1 in #37751
Add AMD MI300 CI caller leveraging self-hosted runner scale set workflow in hf-workflows by @jitesh-gupta in #38132
Fix image token mask in Gemma3 by @Cyrilvallez in #38295
[transformers x vLLM] standardize processors by @zucchini-nlp in #37915
[paligemma] fix processor with suffix by @zucchini-nlp in #38365
[video utils] group and reorder by number of frames by @zucchini-nlp in #38374
[aya vision] fix processor for vLLM by @zucchini-nlp in #38371
guard size mismatch check to only quantized models by @SunMarc in #38397
[chat] improvements for thinking models and reduce default verbosity by @gante in #38322
Fix convert to original state dict for VLMs by @hiyouga in #38385
[chat] use the checkpoint's generation_config.json as base parameterization by @gante in #38330
Fix Qwen2.5-VL Video Processor by @yeliudev in #38366
[CSM] infer codec model with no_grad + audio eos label by @eustlb in #38215
Add report_repo_id to mi300 workflow by @ivarflakstad in #38401
[CSM] update model id by @eustlb in #38211
[cleanup] delete deprecated kwargs in qwen2_audio 🧹 by @gante in #38404
[tests] remove overload for deleted test (test_offloaded_cache_implementation) by @gante in #37896
[mllama] Allow pixel_values with inputs_embeds by @dxoigmn in #38334
Update Model Card for Mamba-2 by @ParagEkbote in #37951
Updated Zoedepth model card by @miniMaddy in #37898
Updated BigBird Model card as per #36979. by @RogerSinghChugh in #37959
Updated BERTweet model card. by @RogerSinghChugh in #37981
New bart model card by @RogerSinghChugh in #37858
Update granite.md by @Tanuj-rai in #37791
Falcon-H1 - Fix auto_docstring and add can_return_tuple decorator by @yonigozlan in #38260
Updated model card for OLMo2 by @andyvu923 in #38394
Add mi300 to amd daily ci workflows definition by @ivarflakstad in #38415
Change slack channel for mi250 CI by @ivarflakstad in #38410
Fix an error in verify_tp_plan for keys without '.' by @liwii in #38420
[qwen-vl] Look for vocab size in text config by @zucchini-nlp in #38372
Update CsmForConditionalGenerationIntegrationTest by @ydshieh in #38424
enable large_gpu and torchao cases on XPU by @yao-matrix in #38355
Disable mi210 scheduled CI by @ivarflakstad in #38411
Update error when using additional and/or masks by @Cyrilvallez in #38429
Fix CircleCI not triggered when PR is opened from a branch of huggingface/transformers by @ydshieh in #38413
make Llama4TextMoe forward more readable by @JJJYmmm in #37529
[core] support tensor-valued _extra_state values in from_pretrained by @pstjohn in #38155
Fix typo in tokenization_utils_base.py docstring by @cwngan in #38418
Fix convert weights for InternVL by @yonigozlan in #38233
Trigger doc-builder job after style bot by @ydshieh in #38398
Remove redundant test_sdpa_equivalence test by @Rocketknight1 in #38436
Fix MoE gradient test by @Rocketknight1 in #38438
Fix from_args_and_dict ProcessorMixin by @yonigozlan in #38296
Fix handling of slow/fast image processors in image_processing_auto.py by @yonigozlan in #38161
Updated the Model docs - for the ALIGN model by @1himan in #38072
Updated the model card for ViTMAE by @mreraser in #38302
Model card for mobilenet v1 and v2 by @yuanjua in #37948
Merge type hints from microsoft/python-type-stubs (post dropping support for Python 3.8) by @Avasam in #38335
Fix GLM4 checkpoints by @ydshieh in #38412
feat: add cache retention for requests by @McPatate in #38446
[Tests] Clean up test cases for few models by @yaswanth19 in #38315
Fix TypeError in save_pretrained error handling (fixes #38422) by @rahulrshetty45 in #38449
Cleanup BatchFeature and BatchEncoding by @lgeiger in #38459
Fix Gemma3IntegrationTest by @ydshieh in #38471
[Qwen2.5-Omni] Fix dtype of cos,sin when used with flash attention by @HarryHsing in #38453
fix: handle no scheduler passed by user by @McPatate in #38407
make it go brrrr by @ArthurZucker in #38409
Fix convert_internvl_weights_to_hf.py to support local paths by @xvyv99 in #38264
Fix incorrect bbox_embed initialization when decoder_bbox_embed_share=False in GroundingDINO by @islemyakoubi in #38238
[Tests] Reduced model size for albert-test model by @saqlain2204 in #38480
Align TP check by @SunMarc in #38328
protect dtensor import by @SunMarc in #38496
[docs] add xpu environment variable for gpu selection by @faaany in #38194
Remove deprecated use_flash_attention_2 parameter by @cyyever in #37131
Fix setting FLASH_ATTENTION_DETERMINISTIC after importing by @HollowMan6 in #37185
[seamless_m4t] Skip some tests when speech is not available by @remi-or in #38430
Update Loss Functions to Accept Tensor num_items_in_batch by @NEREUScode in #38029
[generate] add soft deprecations on custom generation methods by @gante in #38406
[generate] move SinkCache to a custom_generate repo by @gante in #38399
remove unhandled parameter by @itazap in #38145
Fix amp deprecation issue by @SunMarc in #38100
[flax/mistral] support sliding_window: null in config by @yiding in #37402
Num parameters in model.safetensors.index.json by @LysandreJik in #38531
Remove type annotation in Siglip Attention Module by @yaswanth19 in #38503
Fix Gemma2IntegrationTest by @ydshieh in #38492
Fix blip2 tests by @ydshieh in #38510
[tests] expand flex-attn test for vision models by @zucchini-nlp in #38434
Don't use default attn if pre-set in sub-config by @zucchini-nlp in #38526
update emu3 test by @jiqing-feng in #38543
Update docker image to use av by @ydshieh in #38548
[bugfix] [WIP] fix apply_rotary_emb error on Ascend NPU by @FightingZhen in #38491
[TP] Change command in tests to python3 by @S1ro1 in #38555
Explicitly setting encoding in tokenization_utils_base.py by @Muqi1029 in #38553
Fix utils/notification_service.py by @ydshieh in #38556
Name change AOPermod -> ModuleFqn by @drisspg in #38456
Fix hqq issue by @SunMarc in #38551
[docs] Format fix by @stevhliu in #38414
[janus] Fix failing tests on mi3XX by @remi-or in #38426
Fix chameleon tests by @ydshieh in #38565
update utils/notification_service.py for AMD vs Nvidia by @ydshieh in #38563
Fix deepseekv3 by @ydshieh in #38562
[FlexAttn] Fix models with unique characteristics by @vasqu in #38433
fix(attention_visualizer): add default value for image_seq_length by @IceGiraffe in #38577
allow custom head_dim for qwen2_moe by @bzantium in #37188
Docs: fix code formatting in torchao docs by @Manalelaidouni in #38504
feat: add repository field to benchmarks table by @McPatate in #38582
[Dinov2] Enable device_map="auto" support by @aryanchauhan31 in #38487
tests/roformer: fix couple roformer tests on gpus by @dvrogozh in #38570
New gpt neo model card by @RogerSinghChugh in #38505
Updated deprecated typing imports with equivalents for Python 3.9+ by @Sai-Suraj-27 in #38546
added fast image processor for ZoeDepth and expanded tests accordingly by @henrikm11 in #38515
[qwen-omni] fix sliding window by @zucchini-nlp in #38525
Remove custom pytest and pluggy by @ydshieh in #38589
pin pandas by @ydshieh in #38605
Allow mlm_probability to be set to None when mlm=False in DataCollatorForLanguageModeling by @KameniAlexNea in #38522)
Avoid overwrite existing local implementation when loading remote custom model by @Isotr0py in #38474
fix spelling errors by @davidjsonn in #38608
Remove isort from dependencies by @Sai-Suraj-27 in #38616
Fix return_dict=False giving errors in a few VLM models by @ydshieh in #38519
docs: fix dark mode logo display. by @johncaged in #38586
Fix typo in LLaVa documentation by @mynameismon in #38618
[Nit] Add Note on SigOpt being in Public Archive Mode by @ParagEkbote in #38610
Updated Aria model card by @1himan in #38472
Fix MiniMax (docs and integration tests checkpoint) by @geetu040 in #38575
enable more test cases on xpu by @yao-matrix in #38572
Improve test_initialization by @ydshieh in #38607
Use torch 2.7.1 on CircleCI jobs by @ydshieh in #37856
[generation] bring back tests on vision models by @zucchini-nlp in #38603
update ColQwen2ModelIntegrationTest by @ydshieh in #38583
Improve test_initialization for SwiftFormer by @ydshieh in #38636
fix: support grad clipping for TP through replicating non-sharded modules by @kmehant in #36132
Don't run AriaForConditionalGenerationModelTest on CircleCI by @ydshieh in #38615
fix total batch size calculation in trainer by @inkcherry in #38286
fix torch_dtype on awq by @jiqing-feng in #38463
Better CI by @ydshieh in #38552
remove ipex_optimize_model usage by @yao-matrix in #38632
Skip torchscript tests for 2 models by @ydshieh in #38643
Fix InternVL integration test by @ydshieh in #38612
Use torch 2.7.1 on daily CI by @ydshieh in #38620
Fix qwen2-audio chat template audio placeholder insertion by @Isotr0py in #38640
Fixed modeling_auto.py MODEL_FOR_MASK_GENERATION_MAPPING_NAMES variable by @sbucaille in #38664
fix: "check out" as verb by @DePasqualeOrg in #38678
Fix attention mask expansion when converting to executorch by @pweglik in #38637
Fix some models import by @nicelulu in #38694
Fix retrieve function signature and remove faiss requirement by @Fiona-Waters in #38624
Fix TypeError: 'NoneType' object is not iterable for esm by @dbleyl in #38667)
Docs: update bitsandbytes torch.compile compatibility by @matthewdouglas in #38651
Drop as_target_processor from the call and pad methods by @marcndo in #38642
Created model card for XLM model by @AshAnand34 in #38595
Update XLM-RoBERTa model documentation with enhanced usage examples and improved layout by @AshAnand34 in #38596
Created model card for xlm-roberta-xl by @AshAnand34 in #38597
Fix aya_vision test by @ydshieh in #38674
Standardize ByT5 model card format by @yanamis in #38699
Fix smart resize by @rdonggroq in #38706
Update some tests for torch 2.7.1 by @ydshieh in #38701
Logging message for is_bitsandbytes_available() by @ved1beta in #38528
Fix llava tests by @ydshieh in #38722
Use OSError by @cyyever in #38712
[add-new-model-like] Robust search & proper outer '),' in tokenizer mapping by @alexzms in #38703
Fix typo in Language Modeling example scripts and update TPU type by @framoncg in #38652
Add AGENTS.md by @Rocketknight1 in #38734
New canine model card by @RogerSinghChugh in #38631
Fixed a multiple-devices issue in SmolVLM model by @remi-or in #38736
[llava] fix integration tests with Siglip by @zucchini-nlp in #38732
fix: Add method to get image features in PaliGemmaForConditionalGeneration by @YushunXiang in #38730
from 1.11.0, torchao.prototype.low_bit_optim is promoted to torchao.optim by @yao-matrix in #38689
fix: bf16 with TPU is allowed in configuration by @yevvonlim in #38670
[DeepSeek-V3] implement when q_lora_rank is None by @bzantium in #38743
Revert "Trigger doc-builder job after style bot" by @ydshieh in #38735
Add z-loss to Bamba for v2 by @daviswer in #37842
Better typing for num_items_in_batch by @SunMarc in #38728
Prepare for TF+Jax deprecation by @Rocketknight1 in #38760
Remove IPEX requirement for bitsandbytes on CPU by @matthewdouglas in #38594
Update repo consistency check by @Rocketknight1 in #38763
fix(qwen3_moe): pass kwargs to self_attn by @llllvvuu in #38691
Update pegasus model card by @dross20 in #38675
Make style bot trigger CI after push by @ydshieh in #38754
chore(pixtral): emit block attention mask when using flash attention by @starcatmeow in #38741
Update altCLIP model card by @EmileAydar in #38306
Add Qwen2 MoE model card by @rileyafox in #38649
[masking utils] check None instead of try/except by @zucchini-nlp in #38561
[Hotfix] Fix style bot by @ydshieh in #38779
Fix masking utils by @Cyrilvallez in #38783
[video processors] support frame sampling within processors by @zucchini-nlp in #38105
Skip some export tests on torch 2.7 by @ydshieh in #38677
Reduce verbosity for average_tokens_across_devices=True and world size = 1 by @qgallouedec in #38785
Update PULL_REQUEST_TEMPLATE.md by @qgallouedec in #38770
[docs] Add int4wo + 2:4 sparsity example to TorchAO README by @jcaip in #38592
Fix qwen_2_5 omni by @ydshieh in #38658
Fix llava_onevision tests by @ydshieh in #38791
Reword README in light of model definitions by @LysandreJik in #38762
Fix Typos in Comments: "quantitation" → "quantization", "averege" → "average" by @leopardracer in #38766
Initialize flash attn flag by @farnasirim in #38768
Fix mllama by @ydshieh in #38704
build: 📌 Remove upper bound on PyTorch by @KyleMylonakisProtopia in #38789
Remove all traces of low_cpu_mem_usage by @Cyrilvallez in #38792
[Docs] New DiT model card by @yushi2006 in #38721
Add missing div in Pegasus model card by @dross20 in #38773
Updated moonshine modelcard by @SohamPrabhu in #38711
refactor create_token_type_ids_from_sequences by @itazap in #37681
[docs] update cache docs with new info by @zucchini-nlp in #38775
Fix erroneous docstring for the ordering of SWA layers by @norpadon in #38794
Fix configs and doc for the Qwens by @Cyrilvallez in #38808
Unbreak optimum-executorch by @guangy10 in #38646
Disable custom MRA kernels for ROCm by @ahadnagy in #38738
Use HF papers by @qgallouedec in #38184
Simplify and update trl examples by @qgallouedec in #38772
Better pipeline type hints ✨ by @qubvel in #38049
Fix llava_next tests by @ydshieh in #38813
Expectation fixes and added AMD expectations by @remi-or in #38729
Use wandb.run.url instead of wandb.run.get_url() (deprecated) by @qgallouedec in #38817
Refactor DBRX tests to use CausalLMModelTest base classes by @Rocketknight1 in #38475
change fsdp_strategy to fsdp in TrainingArguments in accelerate doc by @PT-10 in #38807
Fix a minor security issue by @ydshieh in #38815
Fix trainer.py not showing signature columns by @nenesekai in #38465
Add V-JEPA for video classification model by @qubvel in #38788
fixed docstring in modular_qwen2_5_vl.py by @lawrencefeng17 in #38798
[docs] Update docs moved to the course by @stevhliu in #38800
[docs] updated roberta model card by @allmight05 in #38777
Updated Albert model Card by @souvikchand in #37753
[internvl] fix video inference by @zucchini-nlp in #38811
Fix redundant code in Janus by @yaswanth19 in #38826
bugfix: propage weight key_mapping to peft to fix 3.52 VLM renaming by @ManuelFay in #38627
Fix peft integration by @Cyrilvallez in #38841
Fix broken notebooks link in Italian training docs by @VolodymyrBg in #38834
Fix broken tag in Longformer model card by @dross20 in #38828
[BugFix] QA pipeline edge case: align_to_words=True in QuestionAnsweringPipeline can lead to duplicate answers by @yushi2006 in #38761
GraniteMoeHybrid: Allow for only shared expert case. by @shawntan in #38801
Updated aya_vision.md by @1himan in #38749
Remove merge conflict artifacts in Albert model doc by @druvdub in #38849
[video processor] fix BC when no video config if found by @zucchini-nlp in #38840
Fix incorrect width ratio calculation in Llama4 image processor by @Jingxiang-Zhang in #38842
Allow customization of sdpa in executorch.py by @kimishpatel in #38827
Fix qwen2_5_vl tests by @ydshieh in #38845
Improve auxiliary_in_channels default behavior in UperNet by @simonreise in #37540
Fix qwen3 tests by @ydshieh in #38862
Update CvT documentation with improved usage examples and additional … by @sezan92 in #38731
Update roc bert docs by @SohamPrabhu in #38835
Post-PR fixes! by @Rocketknight1 in #38868
enable misc test cases on XPU by @yao-matrix in #38852
Fix phi4_multimodal tests by @ydshieh in #38816
Fix qwen3_moe tests by @ydshieh in #38865
Fix HQQ model param device transfer issue by @HighCWu in #38466
Fixed markdown for BertTokenizer's '[CLS]' token. by @eu90h in #38506
null deepspeed_plugin in args for wandb callback fake trainer by @winglian in #38867
More PYUP fixes by @cyyever in #38883
Fix loop var naming by @Rocketknight1 in #38885
[bugfix] fix ATTN_MASK_NPU device mismatch error on multi-device NPU … by @qykong in #38876
log: Add logging when using split_batches and per_device_train_batch_size by @KeshavSingh29 in #38633
Docs: Add custom fine-tuning tutorial to TrOCR model page by @Ashutosh-4485 in #38847
36978 | Fast image processor for DPT model by @samrae7 in #37481
[video processor] fix slow tests by @zucchini-nlp in #38881
Update bamba model card by @druvdub in #38853
Add support for specifying revisions when pushing to Hub via internal Trainer call by @IsaacBreen in #36852
Use raise from e in hub.py utility by @Wauplin in #37241
[phi-4] use mel filters from audio utils by @eustlb in #36966
Fix fsmt tests by @ydshieh in #38904
Fix unnecessary super calls by @cyyever in #38897
align xpu's autocast behavior w/ cuda by using device agnostic torch APIs by @yao-matrix in #38284
Fix FalconMambaIntegrationTests by @ydshieh in #38566
Skip sdpa tests if submodule does not support sdpa by @ivarflakstad in #38907
Fix ReDOS in tokenizer digit substitution by @Rocketknight1 in #38844
feat: Add granite architectures to auto tokenizer name mappings by @gabe-l-hart in #38802
Allow make-fixup on main branch, albeit slowly by @Rocketknight1 in #38892
feat: add flexible Liger Kernel configuration to TrainingArguments by @hamza-hcompany in #38911
Remove deprecated classes in modeling_utils.py by @Cyrilvallez in #38919
Skip some tests for now by @ydshieh in #38931
Modernbert fixes by @remi-or in #38912
add pytorch-xpu Dockerfile by @yao-matrix in #38875
Remove ALL_LAYERNORM_LAYERS by @Cyrilvallez in #38922
[static cache] fix device map per layer in VLMs by @zucchini-nlp in #38488
Add kwargs for timm.create_model in TimmWrapper by @qubvel in #38860
Pin PyTorch extras for AMD containers by @ahadnagy in #38941
Correctly raise error for awq quantization by @Cyrilvallez in #38945
Fix more flaky test_initialization by @ydshieh in #38932
Switch to use A10 progressively by @ydshieh in #38936
Fix custom generate from local directory by @manueldeprada in #38916
Update blip model card by @devkade in #38513
Gaudi3 CI by @IlyasMoutawwakil in #38790
Fix DTensor import compatibility for PyTorch < 2.5 by @Benoqtr in #38836
Fix(informer): Correct tensor shape for input_size=1 by @Flink-ddd in #38856
[modular] CLI allows positional arguments, and more defaults names for the optional arg by @Cyrilvallez in #38979
Remove dead protected imports by @Cyrilvallez in #38980
Break tie in Expectations and gemma3 fixes by @remi-or in #38943
Add Idefics2/3 and SmolVLM Fast image processors + improvements for fast image processors by @yonigozlan in #38157
fix: add bool operator to tokenizer to avoid bloated asserts by @kallewoof in #38899
Add support for auto_docstring with model outputs by @yonigozlan in #38242
fix mistral and mistral3 tests by @ydshieh in #38978
[Feature] Support is_split_into_words in the TokenClassificationPipeline. by @yushi2006 in #38818
Fix rag by @ydshieh in #38585
[docs] Typos - Single GPU efficient training features by @casinca in #38964
[qwen] refactor attentions for vision/audio by @zucchini-nlp in #38930
Removing extra space in large command for speech-pretraining example by @dggaytan in #38705
[Attention] Small fix on output attentions by @vasqu in #38948
Fixes for Arcee model by @Cyrilvallez in #39001
Added scikit-learn to the example image-classification requirements.txt by @mylonjones in #37506
Update attention_visualizer.py by @Tanuj-rai in #37860
Skip non-selected experts for qwen3_moe by @seven-mile in #38133
Fix undeterministic order in modular dependencies by @Cyrilvallez in #39005
Granite speech - minor fixes to support training with the HF trainer by @avihu111 in #38833
Fix bugs in DynamicCache by @tugsbayasgalan in #37880
Update self-comment-ci.yml user list by @ivarflakstad in #39014
Skip sdpa dispatch on flash test due to unsupported head dims by @ivarflakstad in #39010
[HPU][Critical Issue Fix] ThreadPool instead of Pool for parallel pre-processing by @dsmertin in #39002
Add Hugging Face authentication procedure for IDEs (PyCharm, VS Code,… by @marcndo in #38954
[LightGlue] Fixed attribute usage from descriptor_dim to keypoint_detector_descriptor_dim by @sbucaille in #39021
Add zero dim tensor check when using flash_attention by @ranzhejiang in #38280
Fix graph break in torch.compile when using FA2 with attention_mask=None and batch size > 1 by @efsotr in #37332
[AutoModelForMaskGeneration] Remove duplicate code by @NielsRogge in #38622
[video processor] support torchcodec and decrease cuda memory usage by @zucchini-nlp in #38880
Drop unnecessary tokens in GPT2Model generation by @null-pointer-access in #39016
Fix the seamless_m4t cannot work on Gaudi by @yuanwu2017 in #38363
fix: astronomical loss with ModernBERT when using gradient checkpointing by @umarbutler in #38982)
fix gemma3 grad acc by @SunMarc in #37208
Remove script datasets in tests by @lhoestq in #38940
Fix grammatical error in models documentation by @marcndo in #39019
refactor: remove custom BarkLayerNorm by @eginhard in #39003
[Kyutai-STT] correct model type + model id by @eustlb in #39035
Two ReDOS fixes by @Rocketknight1 in #39013
[tests] remove TF tests (uses of require_tf) by @gante in #38944
Granite speech speedup + model saving bugfix by @avihu111 in #39028
Fix Bad Outputs in Fast Path for GraniteMoeHybrid by @alex-jw-brooks in #39033

Significant community contributions

The following contributors have made significant changes to the library over the last release:

@ydshieh
- CI reporting improvements (#38230)
- add liger-kernel to docker file (#38292)
- add vasqu to self-comment-ci.yml (#38324)
- new failure CI reports for all jobs (#38298)
- Hot fix for AMD CI workflow (#38349)
- Uninstall kernels for AMD docker images (#38354)
- Use one utils/notification_service.py (#38379)
- update gemma tests (#38384)
- Update CsmForConditionalGenerationIntegrationTest (#38424)
- Fix CircleCI not triggered when PR is opened from a branch of huggingface/transformers (#38413)
- Trigger doc-builder job after style bot (#38398)
- Fix GLM4 checkpoints (#38412)
- Fix Gemma3IntegrationTest (#38471)
- Fix Gemma2IntegrationTest (#38492)
- Fix blip2 tests (#38510)
- Update docker image to use av (#38548)
- Fix utils/notification_service.py (#38556)
- Fix chameleon tests (#38565)
- update utils/notification_service.py for AMD vs Nvidia (#38563)
- Fix deepseekv3 (#38562)
- Remove custom pytest and pluggy (#38589)
- pin pandas (#38605)
- Fix return_dict=False giving errors in a few VLM models (#38519)
- Improve test_initialization (#38607)
- Use torch 2.7.1 on CircleCI jobs (#37856)
- update ColQwen2ModelIntegrationTest (#38583)
- Improve test_initialization for SwiftFormer (#38636)
- Don't run AriaForConditionalGenerationModelTest on CircleCI (#38615)
- Better CI (#38552)
- Skip torchscript tests for 2 models (#38643)
- Fix InternVL integration test (#38612)
- Use torch 2.7.1 on daily CI (#38620)
- Fix aya_vision test (#38674)
- Update some tests for torch 2.7.1 (#38701)
- Fix llava tests (#38722)
- Revert "Trigger doc-builder job after style bot" (#38735)
- Make style bot trigger CI after push (#38754)
- [Hotfix] Fix style bot (#38779)
- Skip some export tests on torch 2.7 (#38677)
- Fix qwen_2_5 omni (#38658)
- Fix llava_onevision tests (#38791)
- Fix mllama (#38704)
- Fix llava_next tests (#38813)
- Fix a minor security issue (#38815)
- Fix qwen2_5_vl tests (#38845)
- Fix qwen3 tests (#38862)
- Fix phi4_multimodal tests (#38816)
- Fix qwen3_moe tests (#38865)
- Fix fsmt tests (#38904)
- Fix FalconMambaIntegrationTests (#38566)
- Skip some tests for now (#38931)
- Fix more flaky test_initialization (#38932)
- Switch to use A10 progressively (#38936)
- fix mistral and mistral3 tests (#38978)
- Fix rag (#38585)
@ArthurZucker
- tp plan should not be NONE (#38255)
- Protect ParallelInterface (#38262)
- Add CB (#38085)
- 🚨Early-error🚨 config will error out if output_attentions=True and the attn implementation is wrong (#38288)
- for now disable compile (#38383)
- make it go brrrr (#38409)
@younesbelkada
- [MODEL] Add Falcon H1 (#38249)
@cyr0930
- fix multi-image case for llava-onevision (#38084)
@cyyever
- Improve typing in TrainingArgument (#36944)
- More typing in src/transformers/training_args.py (#38106)
- Fix run_slow (#38314)
- Remove deprecated use_flash_attention_2 parameter (#37131)
- Use OSError (#38712)
- More PYUP fixes (#38883)
- Fix unnecessary super calls (#38897)
@ritsumei-aoi
- Remove Japanese sequence_classification doc and update references (#38246)
@yao-matrix
- add XPU info print in print_env (#38282)
- refine transformers env output (#38274)
- switch to device agnostic device calling for test cases (#38247)
- enable large_gpu and torchao cases on XPU (#38355)
- enable more test cases on xpu (#38572)
- remove ipex_optimize_model usage (#38632)
- from 1.11.0, torchao.prototype.low_bit_optim is promoted to torchao.optim (#38689)
- enable misc test cases on XPU (#38852)
- align xpu's autocast behavior w/ cuda by using device agnostic torch APIs (#38284)
- add pytorch-xpu Dockerfile (#38875)
@vasqu
- 🔴🔴🔴 [Attention] Refactor Attention Interface for Bart-based Models (#38108)
- [FlexAttention] Reenable flex for encoder-decoder and make the test more robust (#38321)
- [OPT] Fix attention scaling (#38290)
- 🔴[Attention] Attention refactor for Whisper-based models (#38235)
- [FlexAttn] Fix models with unique characteristics (#38433)
- [Attention] Small fix on output attentions (#38948)
@itazap
- refactor can_save_slow_tokenizer (#37722)
- remove unhandled parameter (#38145)
- refactor create_token_type_ids_from_sequences (#37681)
@eustlb
- [CSM] infer codec model with no_grad + audio eos label (#38215)
- [CSM] update model id (#38211)
- [phi-4] use mel filters from audio utils (#36966)
- Add kyutai stt (#38909)
- [Kyutai-STT] correct model type + model id (#39035)
@RogerSinghChugh
- Updated BigBird Model card as per #36979. (#37959)
- Updated BERTweet model card. (#37981)
- New bart model card (#37858)
- New gpt neo model card (#38505)
- New canine model card (#38631)
@1himan
- Updated the Model docs - for the ALIGN model (#38072)
- Updated Aria model card (#38472)
- Updated aya_vision.md (#38749)
@Avasam
- Merge type hints from microsoft/python-type-stubs (post dropping support for Python 3.8) (#38335)
@remi-or
- [seamless_m4t] Skip some tests when speech is not available (#38430)
- [janus] Fix failing tests on mi3XX (#38426)
- Fixed a multiple-devices issue in SmolVLM model (#38736)
- Expectation fixes and added AMD expectations (#38729)
- Modernbert fixes (#38912)
- Break tie in Expectations and gemma3 fixes (#38943)
@tonywu71
- Add ColQwen2 to 🤗 transformers (#35778)
@geetu040
- Add support for MiniMax's MiniMax-Text-01 (#35831)
- Fix MiniMax (docs and integration tests checkpoint) (#38575)
@sbucaille
- Fixed modeling_auto.py MODEL_FOR_MASK_GENERATION_MAPPING_NAMES variable (#38664)
- Add LightGlue model (#31718)
- [LightGlue] Fixed attribute usage from descriptor_dim to keypoint_detector_descriptor_dim (#39021)
@samrae7
- 36978 | Fast image processor for DPT model (#37481)
@Crystalcareai
- Add Arcee model support (#38621)
@zRzRzRzRzRzRzR
- GLM-4.1V Model support (#38431)
@bzhangGo
- Encoder-Decoder Gemma (#38332)
@redmoe-moutain
- [Model] add dots1 (#38143)
@EduardDurech
- Support for Flash Attention 3 (#38972)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Release v4.53.0

Release v4.53.0

Gemma3n

Dia

Kyutai Speech-to-Text

V-JEPA 2

Arcee

ColQwen2

MiniMax

Encoder-Decoder Gemma

GLM-4.1V

Falcon H1

LightGlue

dots.llm1

SmolLM3

Performance optimizations

Kernels

Example

Flash Attention 3

Notable repository maintenance & refactors

Breaking changes

Bugfixes and improvements

Significant community contributions

Contributors

Uh oh!