Release v4.53.0
Gemma3n
Gemma 3n models are designed for efficient execution on low-resource devices. They are capable of multimodal input, handling text, image, video, and audio, and generating text outputs, with open weights for pre-trained and instruction-tuned variants. These models were trained on data in over 140 spoken languages.
Gemma 3n models use selective parameter activation technology to reduce resource requirements. This technique allows the models to operate at an effective size of 2B and 4B parameters, which is lower than the total number of parameters they contain. For more information on Gemma 3n's efficient parameter management technology, see the Gemma 3n page.
```python
from transformers import pipeline
import torch

pipe = pipeline(
    "image-text-to-text",
    torch_dtype=torch.bfloat16,
    model="google/gemma-3n-e4b",
    device="cuda",
)
output = pipe(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
    text="<image_soft_token> in this image, there is",
)
print(output)
```
Dia
Dia is an open-source text-to-speech (TTS) model (1.6B parameters) developed by Nari Labs.
It can generate highly realistic dialogue from a transcript, including nonverbal communication such as laughter and coughing.
Furthermore, emotion and tone control is also possible via audio conditioning (voice cloning).
Model Architecture:
Dia is an encoder-decoder transformer based on the original transformer architecture. However, more modern features such as rotary positional embeddings (RoPE) are also included. For its text portion (encoder), a byte tokenizer is utilized, while for the audio portion (decoder), a pretrained codec model, DAC, is used: DAC encodes speech into discrete codebook tokens and decodes them back into audio.
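A rough generation sketch (the checkpoint id and the processor's `batch_decode`/`save_audio` helpers below are assumptions to verify against the Dia documentation):
```python
import torch
from transformers import AutoProcessor, DiaForConditionalGeneration

checkpoint = "nari-labs/Dia-1.6B-0626"  # assumed transformers-format checkpoint id
processor = AutoProcessor.from_pretrained(checkpoint)
model = DiaForConditionalGeneration.from_pretrained(checkpoint, torch_dtype=torch.bfloat16).to("cuda")

text = ["[S1] Dia is an open weights text to dialogue model. [S2] Nice! (laughs)"]
inputs = processor(text=text, padding=True, return_tensors="pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens=256)
audio = processor.batch_decode(outputs)       # decode codec tokens back to waveforms (assumed helper)
processor.save_audio(audio, "dialogue.wav")   # assumed helper; check the Dia docs
```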
- Add Dia model by @buttercrab in #38405
Kyutai Speech-to-Text
Kyutai STT is a speech-to-text model architecture based on the Mimi codec, which encodes audio into discrete tokens in a streaming fashion, and a Moshi-like autoregressive decoder. Kyutai’s lab has released two model checkpoints:
- kyutai/stt-1b-en_fr: a 1B-parameter model capable of transcribing both English and French
- kyutai/stt-2.6b-en: a 2.6B-parameter model focused solely on English, optimized for maximum transcription accuracy
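A minimal transcription sketch with the ASR pipeline (whether these checkpoints plug directly into the pipeline, and the audio path used, are assumptions to verify against the documentation):
```python
from transformers import pipeline

# Transcribe a local audio file with the 2.6B English checkpoint.
asr = pipeline(
    "automatic-speech-recognition",
    model="kyutai/stt-2.6b-en",
    device="cuda",
)
print(asr("sample.wav")["text"])
```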
Read more about the model in the documentation.
V-JEPA 2
V-JEPA 2 is a self-supervised approach to training video encoders developed by FAIR, Meta. Using internet-scale video data, V-JEPA 2 attains state-of-the-art performance on motion understanding and human action anticipation tasks. V-JEPA 2-AC is a latent action-conditioned world model post-trained from V-JEPA 2 (using a small amount of robot trajectory interaction data) that solves robot manipulation tasks without environment-specific data collection or task-specific training or calibration.
Read more about the model in the documentation.
Arcee
Arcee is a decoder-only transformer model based on the Llama architecture with a key modification: it uses ReLU² (ReLU-squared) activation in the MLP blocks instead of SiLU, following recent research showing improved training efficiency with squared activations. This architecture is designed for efficient training and inference while maintaining the proven stability of the Llama design.
The Arcee model is architecturally similar to Llama but uses x * relu(x) in MLP layers for improved gradient flow and is optimized for efficiency in both training and inference scenarios.
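For illustration only, a minimal PyTorch sketch of an MLP block with the ReLU² activation described above (layer names and sizes are made up; this is not the actual modeling code):
```python
import torch
from torch import nn

class ReLUSquaredMLP(nn.Module):
    """Toy MLP block using ReLU² (x * relu(x)) instead of SiLU gating."""

    def __init__(self, hidden_size: int = 2048, intermediate_size: int = 8192):
        super().__init__()
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.up_proj(x)
        # ReLU²: equal to relu(h) ** 2 — zero for negative inputs, quadratic otherwise
        h = h * torch.relu(h)
        return self.down_proj(h)

mlp = ReLUSquaredMLP()
out = mlp(torch.randn(1, 16, 2048))  # (batch, seq_len, hidden_size)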
- Add Arcee model support by @Crystalcareai in #38621
Read more about the model in the documentation.
ColQwen2
ColQwen2 is a variant of the ColPali model designed to retrieve documents by analyzing their visual features. Unlike traditional systems that rely heavily on text extraction and OCR, ColQwen2 treats each page as an image. It uses the Qwen2-VL backbone to capture not only text, but also the layout, tables, charts, and other visual elements to create detailed multi-vector embeddings that can be used for retrieval by computing pairwise late interaction similarity scores. This offers a more comprehensive understanding of documents and enables more efficient and accurate retrieval.
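For illustration only, a small sketch of the late-interaction (MaxSim-style) scoring mentioned above, with made-up tensor shapes rather than the library's retrieval API:
```python
import torch

# Hypothetical multi-vector embeddings: (num_query_tokens, dim) and (num_page_patches, dim)
query_emb = torch.nn.functional.normalize(torch.randn(20, 128), dim=-1)
page_emb = torch.nn.functional.normalize(torch.randn(600, 128), dim=-1)

# Late interaction: for each query token, take its best-matching page patch, then sum
sim = query_emb @ page_emb.T          # (20, 600) pairwise cosine similarities
score = sim.max(dim=-1).values.sum()  # scalar relevance score for this (query, page) pair
print(score)
```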
Read more about the model in the documentation.
MiniMax
MiniMax is a powerful language model with 456 billion total parameters, of which 45.9 billion are activated per token. To better unlock the model's long-context capabilities, MiniMax adopts a hybrid architecture that combines Lightning Attention, softmax attention, and Mixture-of-Experts (MoE). Leveraging advanced parallel strategies and innovative compute-communication overlap methods, such as Linear Attention Sequence Parallelism Plus (LASP+), varlen ring attention, and Expert Tensor Parallel (ETP), MiniMax extends its training context length to 1 million tokens and can handle a context of up to 4 million tokens during inference. On various academic benchmarks, MiniMax also demonstrates the performance of a top-tier model.
The architecture of MiniMax is briefly described as follows:
- Total Parameters: 456B
- Activated Parameters per Token: 45.9B
- Number of Layers: 80
- Hybrid Attention: a softmax attention layer is positioned after every 7 lightning attention layers (see the sketch after this list).
- Number of attention heads: 64
- Attention head dimension: 128
- Mixture of Experts:
- Number of experts: 32
- Expert hidden dimension: 9216
- Top-2 routing strategy
- Positional Encoding: Rotary Position Embedding (RoPE) applied to half of the attention head dimension with a base frequency of 10,000,000
- Hidden Size: 6144
- Vocab Size: 200,064
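For illustration only, a tiny sketch of the hybrid attention schedule implied by the numbers above (not taken from the modeling code):
```python
# Illustrative only: which of the 80 layers would use softmax vs. lightning attention
num_layers = 80
layer_types = ["softmax" if (i + 1) % 8 == 0 else "lightning" for i in range(num_layers)]
print(layer_types[:8])               # seven 'lightning' layers followed by one 'softmax' layer
print(layer_types.count("softmax"))  # 10 softmax attention layers in total
```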
For more details refer to the release blog post.
Read more about the model in the documentation.
Encoder-Decoder Gemma
T5Gemma (aka encoder-decoder Gemma) was proposed in a research paper by Google. It is a family of encoder-decoder large language models, developed by adapting pretrained decoder-only models into encoder-decoder models. T5Gemma includes pretrained and instruction-tuned variants. The architecture is based on the transformer encoder-decoder design following T5, with improvements from Gemma 2: GQA, RoPE, GeGLU activation, RMSNorm, and interleaved local/global attention.
T5Gemma has two groups of model sizes: 1) Gemma 2 sizes (2B-2B, 9B-2B, and 9B-9B), which are based on the official Gemma 2 models (2B and 9B); and 2) T5 sizes (Small, Base, Large, and XL), which are pretrained under the Gemma 2 framework following the T5 configuration. In addition, we also provide a model at ML size (medium large, ~2B in total), which sits between T5 Large and T5 XL.
The pretrained variants are trained with two objectives: prefix language modeling with knowledge distillation (PrefixLM) and UL2, separately. We release both variants for each model size. The instruction-tuned variants were post-trained with supervised fine-tuning and reinforcement learning.
Read more about the model in the documentation.
GLM-4.1V
The GLM-4.1V model architecture has been added to transformers; no models have yet been released with that architecture. Stay tuned for the GLM team's upcoming releases!
- GLM-4.1V Model support by @zRzRzRzRzRzRzR in #38431
Read more about the model in the documentation.
Falcon H1
The Falcon H1 model was developed by the TII Pretraining team. A comprehensive research paper covering the architecture, pretraining dynamics, experimental results, and conclusions is forthcoming. You can read more about this series on this website.
- [MODEL] Add Falcon H1 by @younesbelkada in #38249
Read more about the model in the documentation.
LightGlue
The LightGlue model was proposed in LightGlue: Local Feature Matching at Light Speed
by Philipp Lindenberger, Paul-Edouard Sarlin and Marc Pollefeys.
Similar to SuperGlue, this model matches two sets of local features extracted from two images, with the goal of being faster than SuperGlue. Paired with the SuperPoint model, it can be used to match two images and estimate the pose between them. This model is useful for tasks such as image matching, homography estimation, etc.
The abstract from the paper is the following:
We introduce LightGlue, a deep neural network that learns to match local features across images. We revisit multiple
design decisions of SuperGlue, the state of the art in sparse matching, and derive simple but effective improvements.
Cumulatively, they make LightGlue more efficient - in terms of both memory and computation, more accurate, and much
easier to train. One key property is that LightGlue is adaptive to the difficulty of the problem: the inference is much
faster on image pairs that are intuitively easy to match, for example because of a larger visual overlap or limited
appearance change. This opens up exciting prospects for deploying deep matchers in latency-sensitive applications like
3D reconstruction. The code and trained models are publicly available at this https URL
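A quick usage sketch for matching two images (the checkpoint id and output fields are assumptions to verify against the LightGlue documentation):
```python
import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

# Matching an image against itself here, purely as a smoke test.
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"
image1 = Image.open(requests.get(url, stream=True).raw)
image2 = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained("ETH-CVG/lightglue_superpoint")
model = AutoModel.from_pretrained("ETH-CVG/lightglue_superpoint")

inputs = processor([image1, image2], return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)  # keypoints and matches between the two images
```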
- Add LightGlue model by @sbucaille in #31718
Read more about the model in the documentation.
dots.llm1
The abstract from the report is the following:
Mixture of Experts (MoE) models have emerged as a promising paradigm for scaling language models efficiently by activating only a subset of parameters for each input token. In this report, we present dots.llm1, a large-scale MoE model that activates 14B parameters out of a total of 142B parameters, delivering performance on par with state-of-the-art models while reducing training and inference costs. Leveraging our meticulously crafted and efficient data processing pipeline, dots.llm1 achieves performance comparable to Qwen2.5-72B after pretraining on high-quality corpus and post-training to fully unlock its capabilities. Notably, no synthetic data is used during pretraining. To foster further research, we open-source intermediate training checkpoints spanning the entire training process, providing valuable insights into the learning dynamics of large language models.
- [Model] add dots1 by @redmoe-moutain in #38143
Read more about the model in the documentation.
SmolLM3
SmolLM3 is a fully open, compact language model designed for efficient deployment while maintaining strong performance. It uses a Transformer decoder architecture with Grouped Query Attention (GQA) to reduce the KV cache, and NoPE (rotary position embeddings removed in some layers), enabling improved performance on long-context tasks. It is trained using a multi-stage training approach on high-quality public datasets across web, code, and math domains. The model is multilingual and supports very large context lengths. The instruct variant is optimized for reasoning and tool use.
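A minimal generation sketch (the `HuggingFaceTB/SmolLM3-3B` checkpoint id is an assumption; swap in the checkpoint you want to use):
```python
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="HuggingFaceTB/SmolLM3-3B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
print(generator("Briefly explain grouped query attention.", max_new_tokens=64)[0]["generated_text"])
```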
Read more about the model in the documentation.
Performance optimizations
Kernels
In previous versions, installing the `kernels` library would automatically activate the custom kernels added to `transformers`, because the `@use_kernel_forward_from_the_hub` decorator directly swapped out the model's forward method. This implicit behavior caused several issues for users, including problems with `torch.compile`, non-determinism, and inconsistent outputs.
To address this, we've introduced a new opt-in mechanism called `kernelize`. You can now enable kernel usage explicitly by passing `use_kernels=True` to `from_pretrained`. The `use_kernel_forward_from_the_hub` decorator now simply stores the kernel name that the user wants to use, and `kernelize` handles the rest under the hood.
Example
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    use_kernels=True,
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

input = "Hello"
input_ids = tokenizer(input, return_tensors="pt").to(model.device).input_ids

output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
More kernels will be added over time — this will be a collaborative, community-driven effort to make transformers lighter and faster 🤗
- Add kernelize to transformers by @MekkCyber in #38205
Flash Attention 3
Support for Flash Attention 3 is added across the most popular models.
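A hedged opt-in sketch (the `"flash_attention_3"` identifier and the need for compatible hardware and installed FA3 kernels are assumptions to verify against the attention interface docs):
```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_3",  # assumed identifier for FA3
    device_map="cuda",
)
```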
- Support for Flash Attention 3 by @EduardDurech in #38972
Notable repository maintenance & refactors
Several efforts to refactor the repository are happening in parallel. The direction is to greatly simplify the library, removing unnecessary code paths. While the efforts are spread across the library, they're particularly visible in individual models, where non-modeling-specific code will be simplified and eventually removed.
We work under the assumption that model-agnostic utilities shouldn't live in the modeling code. Things like attention outputs, hidden states, and router logits are important for end users but don't need to be explicitly displayed in the modeling code.
- Apply GradientCheckpointingLayer to the whole repo by @qubvel in #38913
- No more Tuple, List, Dict by @Rocketknight1 in #38797
- Deprecate TF + JAX by @Rocketknight1 in #38758
Breaking changes
Several minimal breaking changes aiming to bring clearer defaults while greatly simplifying the library have been merged.
- 🔴 Update default `dtype` for pipelines to `auto` by @Vaibhavs10 in #38882
- 🚨🚨 Fix initialization of Mask2Former by @Cyrilvallez in #38864
- 🚨 🚨 Inherited CausalLM Tests by @Rocketknight1 in #37590
- 🚨Early-error🚨 config will error out if `output_attentions=True` and the attn implementation is wrong by @ArthurZucker in #38288
- 🔴 [VLM] modeling updates by @zucchini-nlp in #38317
- 🚨 🚨 Fix custom code saving by @Rocketknight1 in #37716
- 🚨🚨[core] Completely rewrite the masking logic for all attentions by @Cyrilvallez in #37866
- 🔴🔴🔴 [`Attention`] Refactor Attention Interface for Bart-based Models by @vasqu in #38108
- 🔴[`Attention`] Attention refactor for Whisper-based models by @vasqu in #38235
- Add CB by @ArthurZucker in #38085
Bugfixes and improvements
- CI reporting improvements by @ydshieh in #38230
- Revert parallelism temporarily by @LysandreJik in #38240
- tp plan should not be NONE by @ArthurZucker in #38255
- [Falcon H1] Fix Typo in Integration Test by @dhiaEddineRhaiem in #38256
- [`compile`] re-enable for Qwen-VL models by @zucchini-nlp in #38127
- fix multi-image case for llava-onevision by @cyr0930 in #38084
- Add tearDown method to Quark to solve OOM issues by @MekkCyber in #38234
- Clearer error on import failure by @LysandreJik in #38257
- [whisper] small changes for faster tests by @gante in #38236
- Simplify DTensor Check for modeling_utils.py by @amd-xiaoyu12 in #38245
- Improve typing in TrainingArgument by @cyyever in #36944
- Fix: missing else branch to handle "--load_best_model_at_end" in training_args.py by @danielyxyang in #38217
- assign the correct torchao data layout for xpu by @jiqing-feng in #37781
- Remove Japanese sequence_classification doc and update references by @ritsumei-aoi in #38246
- Protect ParallelInterface by @ArthurZucker in #38262
- Update Model Card for Mamba by @ParagEkbote in #37863
- docs(swin): Update Swin model card to standard format by @BryanBradfo in #37628
- add XPU info print in print_env by @yao-matrix in #38282
- [whisper] move processor test into processor test file 🧹 by @gante in #38266
- [Whisper] handle deprecation of `forced_decoder_ids` by @gante in #38232
- add `liger-kernel` to docker file by @ydshieh in #38292
- Fix tp error when torch distributed is already initialized by @SunMarc in #38294
- More typing in src/transformers/training_args.py by @cyyever in #38106
- refine `transformers env` output by @yao-matrix in #38274
- Update CI Docker base image for AMD tests by @ahadnagy in #38261
- Fix HybridChunedCache & Llama4 by @Cyrilvallez in #38299
- Oups typo for HybridChunkedCache by @Cyrilvallez in #38303
- [Tests] Cleanup Janus Testcase by @yaswanth19 in #38311
- [emu3] fix conversion script by @zucchini-nlp in #38297
- Fix run_slow by @cyyever in #38314
- Fix typo: change 'env' to 'environment' in .circleci/config.yml by @AbdessamadEnabih in #38273
- Adds use_repr to model_addition_debugger_context by @RyanMullins in #37984
- [tf/flax] handle `forced_decoder_ids` deletion by @gante in #38316
- [Whisper + beam search] fix usage of `beam_indices` by @gante in #38259
- Expose AutoModelForTimeSeriesPrediction for import by @jinan-zhou in #38307
- [custom_generate] don't forward `custom_generate` and `trust_remote_code` by @gante in #38304
- add `vasqu` to `self-comment-ci.yml` by @ydshieh in #38324
- Fix some tests (especially compile with fullgraph=True on Python<3.11) by @Cyrilvallez in #38319
- [performance_optim] reduce frequency of declaring attention_mask in Ascend NPU flash attention by @FightingZhen in #38278
- refactor can_save_slow_tokenizer by @itazap in #37722
- [`FlexAttention`] Reenable flex for encoder-decoder and make the test more robust by @vasqu in #38321
- Enhance Model Loading By Providing Parallelism, Uses Optional Env Flag by @inf3rnus in #36835
- Use Gradient Checkpointing Layer in Jamba & Blip Related Models by @alex-jw-brooks in #38310
- Never fallback to eager implicitly by @Cyrilvallez in #38327
- Remove duplicate docstring: resample by @qqii in #38305
- Update BioGPT model card by @Aguedoom in #38214
- docs(swinv2): Update SwinV2 model card to new standard format by @BryanBradfo in #37942
- [docs]: update roformer.md model card by @KsuParkhamchuk in #37946
- new failure CI reports for all jobs by @ydshieh in #38298
- Hot fix for AMD CI workflow by @ydshieh in #38349
- Uninstall `kernels` for AMD docker images by @ydshieh in #38354
- [VLMs] add helpers for get/set embedding by @zucchini-nlp in #38144
- switch to device agnostic device calling for test cases by @yao-matrix in #38247
- [`OPT`] Fix attention scaling by @vasqu in #38290
- Fix all import errors based on older torch versions by @Cyrilvallez in #38370
- Fix incorrect batching audio index calculation for Phi-4-Multimodal by @Isotr0py in #38103
- Protect `get_default_device` for torch<2.3 by @Cyrilvallez in #38376
- [Falcon H1] Fix slow path forward pass by @dhiaEddineRhaiem in #38320
- Improved cache docs by @manueldeprada in #38060
- for now disable compile by @ArthurZucker in #38383
- Use one `utils/notification_service.py` by @ydshieh in #38379
- Better check in `initialize_weights` by @Cyrilvallez in #38382
- fix typos by @DeVikingMark in #38336
- fix typo: `tokenizer` -> `tokenize` by @foldl in #38357
- Stop TF weight rename reDOS by @Rocketknight1 in #38325
- [cli] cli usable without torch by @gante in #38386
- update gemma tests by @ydshieh in #38384
- Stop autoconverting custom code checkpoints by @Rocketknight1 in #37751
- Add AMD MI300 CI caller leveraging self-hosted runner scale set workflow in hf-workflows by @jitesh-gupta in #38132
- Fix image token mask in Gemma3 by @Cyrilvallez in #38295
- [transformers x vLLM] standardize processors by @zucchini-nlp in #37915
- [paligemma] fix processor with suffix by @zucchini-nlp in #38365
- [video utils] group and reorder by number of frames by @zucchini-nlp in #38374
- [aya vision] fix processor for vLLM by @zucchini-nlp in #38371
- guard size mismatch check to only quantized models by @SunMarc in #38397
- [chat] improvements for thinking models and reduce default verbosity by @gante in #38322
- Fix convert to original state dict for VLMs by @hiyouga in #38385
- [chat] use the checkpoint's `generation_config.json` as base parameterization by @gante in #38330
- Fix Qwen2.5-VL Video Processor by @yeliudev in #38366
- [CSM] infer codec model with no_grad + audio eos label by @eustlb in #38215
- Add report_repo_id to mi300 workflow by @ivarflakstad in #38401
- [CSM] update model id by @eustlb in #38211
- [cleanup] delete deprecated kwargs in qwen2_audio 🧹 by @gante in #38404
- [tests] remove overload for deleted test (`test_offloaded_cache_implementation`) by @gante in #37896
- [mllama] Allow `pixel_values` with `inputs_embeds` by @dxoigmn in #38334
- Update Model Card for Mamba-2 by @ParagEkbote in #37951
- Updated Zoedepth model card by @miniMaddy in #37898
- Updated BigBird Model card as per #36979. by @RogerSinghChugh in #37959
- Updated BERTweet model card. by @RogerSinghChugh in #37981
- New bart model card by @RogerSinghChugh in #37858
- Update granite.md by @Tanuj-rai in #37791
- Falcon-H1 - Fix auto_docstring and add can_return_tuple decorator by @yonigozlan in #38260
- Updated model card for OLMo2 by @andyvu923 in #38394
- Add mi300 to amd daily ci workflows definition by @ivarflakstad in #38415
- Change slack channel for mi250 CI by @ivarflakstad in #38410
- Fix an error in verify_tp_plan for keys without '.' by @liwii in #38420
- [qwen-vl] Look for vocab size in text config by @zucchini-nlp in #38372
- Update `CsmForConditionalGenerationIntegrationTest` by @ydshieh in #38424
- enable large_gpu and torchao cases on XPU by @yao-matrix in #38355
- Disable mi210 scheduled CI by @ivarflakstad in #38411
- Update error when using additional and/or masks by @Cyrilvallez in #38429
- Fix CircleCI not triggered when PR is opened from a branch of `huggingface/transformers` by @ydshieh in #38413
- make Llama4TextMoe forward more readable by @JJJYmmm in #37529
- [core] support tensor-valued _extra_state values in `from_pretrained` by @pstjohn in #38155
- Fix typo in tokenization_utils_base.py docstring by @cwngan in #38418
- Fix convert weights for InternVL by @yonigozlan in #38233
- Trigger doc-builder job after style bot by @ydshieh in #38398
- Remove redundant test_sdpa_equivalence test by @Rocketknight1 in #38436
- Fix MoE gradient test by @Rocketknight1 in #38438
- Fix `from_args_and_dict` ProcessorMixin by @yonigozlan in #38296
- Fix handling of slow/fast image processors in image_processing_auto.py by @yonigozlan in #38161
- Updated the Model docs - for the ALIGN model by @1himan in #38072
- Updated the model card for ViTMAE by @mreraser in #38302
- Model card for mobilenet v1 and v2 by @yuanjua in #37948
- Merge type hints from `microsoft/python-type-stubs` (post dropping support for Python 3.8) by @Avasam in #38335
- Fix GLM4 checkpoints by @ydshieh in #38412
- feat: add cache retention for requests by @McPatate in #38446
- [Tests] Clean up test cases for few models by @yaswanth19 in #38315
- Fix TypeError in save_pretrained error handling (fixes #38422) by @rahulrshetty45 in #38449
- Cleanup `BatchFeature` and `BatchEncoding` by @lgeiger in #38459
- Fix `Gemma3IntegrationTest` by @ydshieh in #38471
- [Qwen2.5-Omni] Fix dtype of cos,sin when used with flash attention by @HarryHsing in #38453
- fix: handle no scheduler passed by user by @McPatate in #38407
- make it go brrrr by @ArthurZucker in #38409
- Fix convert_internvl_weights_to_hf.py to support local paths by @xvyv99 in #38264
- Fix incorrect bbox_embed initialization when decoder_bbox_embed_share=False in GroundingDINO by @islemyakoubi in #38238
- [Tests] Reduced model size for albert-test model by @saqlain2204 in #38480
- Align TP check by @SunMarc in #38328
- protect dtensor import by @SunMarc in #38496
- [docs] add xpu environment variable for gpu selection by @faaany in #38194
- Remove deprecated use_flash_attention_2 parameter by @cyyever in #37131
- Fix setting FLASH_ATTENTION_DETERMINISTIC after importing by @HollowMan6 in #37185
- [seamless_m4t] Skip some tests when speech is not available by @remi-or in #38430
- Update Loss Functions to Accept Tensor num_items_in_batch by @NEREUScode in #38029
- [generate] add soft deprecations on custom generation methods by @gante in #38406
- [generate] move `SinkCache` to a `custom_generate` repo by @gante in #38399
- remove unhandled parameter by @itazap in #38145
- Fix amp deprecation issue by @SunMarc in #38100
- [flax/mistral] support sliding_window: null in config by @yiding in #37402
- Num parameters in model.safetensors.index.json by @LysandreJik in #38531
- Remove type annotation in Siglip Attention Module by @yaswanth19 in #38503
- Fix `Gemma2IntegrationTest` by @ydshieh in #38492
- Fix blip2 tests by @ydshieh in #38510
- [tests] expand flex-attn test for vision models by @zucchini-nlp in #38434
- Don't use default attn if pre-set in sub-config by @zucchini-nlp in #38526
- update emu3 test by @jiqing-feng in #38543
- Update docker image to use `av` by @ydshieh in #38548
- [bugfix] [WIP] fix apply_rotary_emb error on Ascend NPU by @FightingZhen in #38491
- [TP] Change command in tests to `python3` by @S1ro1 in #38555
- Explicitly setting encoding in tokenization_utils_base.py by @Muqi1029 in #38553
- Fix `utils/notification_service.py` by @ydshieh in #38556
- Name change AOPermod -> ModuleFqn by @drisspg in #38456
- Fix hqq issue by @SunMarc in #38551
- [docs] Format fix by @stevhliu in #38414
- [janus] Fix failing tests on mi3XX by @remi-or in #38426
- Fix `chameleon` tests by @ydshieh in #38565
- update `utils/notification_service.py` for AMD vs Nvidia by @ydshieh in #38563
- Fix `deepseekv3` by @ydshieh in #38562
- [`FlexAttn`] Fix models with unique characteristics by @vasqu in #38433
- fix(attention_visualizer): add default value for image_seq_length by @IceGiraffe in #38577
- allow custom head_dim for qwen2_moe by @bzantium in #37188
- Docs: fix code formatting in torchao docs by @Manalelaidouni in #38504
- feat: add `repository` field to benchmarks table by @McPatate in #38582
- [Dinov2] Enable device_map="auto" support by @aryanchauhan31 in #38487
- tests/roformer: fix couple roformer tests on gpus by @dvrogozh in #38570
- New gpt neo model card by @RogerSinghChugh in #38505
- Updated deprecated typing imports with equivalents for Python 3.9+ by @Sai-Suraj-27 in #38546
- added fast image processor for ZoeDepth and expanded tests accordingly by @henrikm11 in #38515
- [qwen-omni] fix sliding window by @zucchini-nlp in #38525
- Remove custom pytest and pluggy by @ydshieh in #38589
- pin pandas by @ydshieh in #38605
- Allow `mlm_probability` to be set to `None` when `mlm=False` in DataCollatorForLanguageModeling by @KameniAlexNea in #38522
- Avoid overwrite existing local implementation when loading remote custom model by @Isotr0py in #38474
- fix spelling errors by @davidjsonn in #38608
- Remove `isort` from dependencies by @Sai-Suraj-27 in #38616
- Fix `return_dict=False` giving errors in a few VLM models by @ydshieh in #38519
- docs: fix dark mode logo display. by @johncaged in #38586
- Fix typo in LLaVa documentation by @mynameismon in #38618
- [Nit] Add Note on SigOpt being in Public Archive Mode by @ParagEkbote in #38610
- Updated Aria model card by @1himan in #38472
- Fix `MiniMax` (docs and integration tests checkpoint) by @geetu040 in #38575
- enable more test cases on xpu by @yao-matrix in #38572
- Improve `test_initialization` by @ydshieh in #38607
- Use torch 2.7.1 on CircleCI jobs by @ydshieh in #37856
- [generation] bring back tests on vision models by @zucchini-nlp in #38603
- update `ColQwen2ModelIntegrationTest` by @ydshieh in #38583
- Improve `test_initialization` for `SwiftFormer` by @ydshieh in #38636
- fix: support grad clipping for TP through replicating non-sharded modules by @kmehant in #36132
- Don't run `AriaForConditionalGenerationModelTest` on CircleCI by @ydshieh in #38615
- fix total batch size calculation in trainer by @inkcherry in #38286
- fix torch_dtype on awq by @jiqing-feng in #38463
- Better CI by @ydshieh in #38552
- remove ipex_optimize_model usage by @yao-matrix in #38632
- Skip torchscript tests for 2 models by @ydshieh in #38643
- Fix `InternVL` integration test by @ydshieh in #38612
- Use torch 2.7.1 on daily CI by @ydshieh in #38620
- Fix qwen2-audio chat template audio placeholder insertion by @Isotr0py in #38640
- Fixed modeling_auto.py MODEL_FOR_MASK_GENERATION_MAPPING_NAMES variable by @sbucaille in #38664
- fix: "check out" as verb by @DePasqualeOrg in #38678
- Fix attention mask expansion when converting to executorch by @pweglik in #38637
- Fix some models import by @nicelulu in #38694
- Fix retrieve function signature and remove faiss requirement by @Fiona-Waters in #38624
- Fix TypeError: 'NoneType' object is not iterable for esm by @dbleyl in #38667
- Docs: update bitsandbytes torch.compile compatibility by @matthewdouglas in #38651
- Drop as_target_processor from the call and pad methods by @marcndo in #38642
- Created model card for XLM model by @AshAnand34 in #38595
- Update XLM-RoBERTa model documentation with enhanced usage examples and improved layout by @AshAnand34 in #38596
- Created model card for xlm-roberta-xl by @AshAnand34 in #38597
- Fix `aya_vision` test by @ydshieh in #38674
- Standardize ByT5 model card format by @yanamis in #38699
- Fix smart resize by @rdonggroq in #38706
- Update some tests for torch 2.7.1 by @ydshieh in #38701
- Logging message for `is_bitsandbytes_available()` by @ved1beta in #38528
- Fix `llava` tests by @ydshieh in #38722
- Use OSError by @cyyever in #38712
- [add-new-model-like] Robust search & proper outer '),' in tokenizer mapping by @alexzms in #38703
- Fix typo in Language Modeling example scripts and update TPU type by @framoncg in #38652
- Add AGENTS.md by @Rocketknight1 in #38734
- New canine model card by @RogerSinghChugh in #38631
- Fixed a multiple-devices issue in SmolVLM model by @remi-or in #38736
- [llava] fix integration tests with Siglip by @zucchini-nlp in #38732
- fix: Add method to get image features in PaliGemmaForConditionalGeneration by @YushunXiang in #38730
- from 1.11.0, torchao.prototype.low_bit_optim is promoted to torchao.optim by @yao-matrix in #38689
- fix: bf16 with TPU is allowed in configuration by @yevvonlim in #38670
- [DeepSeek-V3] implement when q_lora_rank is None by @bzantium in #38743
- Revert "Trigger doc-builder job after style bot" by @ydshieh in #38735
- Add z-loss to Bamba for v2 by @daviswer in #37842
- Better typing for num_items_in_batch by @SunMarc in #38728
- Prepare for TF+Jax deprecation by @Rocketknight1 in #38760
- Remove IPEX requirement for bitsandbytes on CPU by @matthewdouglas in #38594
- Update repo consistency check by @Rocketknight1 in #38763
- fix(qwen3_moe): pass kwargs to self_attn by @llllvvuu in #38691
- Update pegasus model card by @dross20 in #38675
- Make style bot trigger CI after push by @ydshieh in #38754
- chore(pixtral): emit block attention mask when using flash attention by @starcatmeow in #38741
- Update altCLIP model card by @EmileAydar in #38306
- Add Qwen2 MoE model card by @rileyafox in #38649
- [masking utils] check `None` instead of try/except by @zucchini-nlp in #38561
- [Hotfix] Fix style bot by @ydshieh in #38779
- Fix masking utils by @Cyrilvallez in #38783
- [video processors] support frame sampling within processors by @zucchini-nlp in #38105
- Skip some export tests on torch 2.7 by @ydshieh in #38677
- Reduce verbosity for `average_tokens_across_devices=True` and `world size = 1` by @qgallouedec in #38785
- Update PULL_REQUEST_TEMPLATE.md by @qgallouedec in #38770
- [docs] Add int4wo + 2:4 sparsity example to TorchAO README by @jcaip in #38592
- Fix `qwen_2_5 omni` by @ydshieh in #38658
- Fix `llava_onevision` tests by @ydshieh in #38791
- Reword README in light of model definitions by @LysandreJik in #38762
- Fix Typos in Comments: "quantitation" → "quantization", "averege" → "average" by @leopardracer in #38766
- Initialize flash attn flag by @farnasirim in #38768
- Fix `mllama` by @ydshieh in #38704
- build: 📌 Remove upper bound on PyTorch by @KyleMylonakisProtopia in #38789
- Remove all traces of `low_cpu_mem_usage` by @Cyrilvallez in #38792
- [Docs] New DiT model card by @yushi2006 in #38721
- Add missing div in Pegasus model card by @dross20 in #38773
- Updated moonshine modelcard by @SohamPrabhu in #38711
- refactor create_token_type_ids_from_sequences by @itazap in #37681
- [docs] update cache docs with new info by @zucchini-nlp in #38775
- Fix erroneous docstring for the ordering of SWA layers by @norpadon in #38794
- Fix configs and doc for the Qwens by @Cyrilvallez in #38808
- Unbreak optimum-executorch by @guangy10 in #38646
- Disable custom MRA kernels for ROCm by @ahadnagy in #38738
- Use HF papers by @qgallouedec in #38184
- Simplify and update trl examples by @qgallouedec in #38772
- Better pipeline type hints ✨ by @qubvel in #38049
- Fix `llava_next` tests by @ydshieh in #38813
- Expectation fixes and added AMD expectations by @remi-or in #38729
- Use `wandb.run.url` instead of `wandb.run.get_url()` (deprecated) by @qgallouedec in #38817
- Refactor DBRX tests to use CausalLMModelTest base classes by @Rocketknight1 in #38475
- change fsdp_strategy to fsdp in TrainingArguments in accelerate doc by @PT-10 in #38807
- Fix a minor security issue by @ydshieh in #38815
- Fix trainer.py not showing signature columns by @nenesekai in #38465
- Add V-JEPA for video classification model by @qubvel in #38788
- fixed docstring in modular_qwen2_5_vl.py by @lawrencefeng17 in #38798
- [docs] Update docs moved to the course by @stevhliu in #38800
- [docs] updated roberta model card by @allmight05 in #38777
- Updated Albert model Card by @souvikchand in #37753
- [internvl] fix video inference by @zucchini-nlp in #38811
- Fix redundant code in Janus by @yaswanth19 in #38826
- bugfix: propage weight key_mapping to peft to fix 3.52 VLM renaming by @ManuelFay in #38627
- Fix peft integration by @Cyrilvallez in #38841
- Fix broken notebooks link in Italian training docs by @VolodymyrBg in #38834
- Fix broken tag in Longformer model card by @dross20 in #38828
- [BugFix] QA pipeline edge case: `align_to_words=True` in `QuestionAnsweringPipeline` can lead to duplicate answers by @yushi2006 in #38761
- GraniteMoeHybrid: Allow for only shared expert case. by @shawntan in #38801
- Updated aya_vision.md by @1himan in #38749
- Remove merge conflict artifacts in Albert model doc by @druvdub in #38849
- [video processor] fix BC when no video config if found by @zucchini-nlp in #38840
- Fix incorrect width ratio calculation in Llama4 image processor by @Jingxiang-Zhang in #38842
- Allow customization of sdpa in executorch.py by @kimishpatel in #38827
- Fix `qwen2_5_vl` tests by @ydshieh in #38845
- Improve `auxiliary_in_channels` default behavior in UperNet by @simonreise in #37540
- Fix `qwen3` tests by @ydshieh in #38862
- Update CvT documentation with improved usage examples and additional … by @sezan92 in #38731
- Update roc bert docs by @SohamPrabhu in #38835
- Post-PR fixes! by @Rocketknight1 in #38868
- enable misc test cases on XPU by @yao-matrix in #38852
- Fix `phi4_multimodal` tests by @ydshieh in #38816
- Fix `qwen3_moe` tests by @ydshieh in #38865
- Fix HQQ model param device transfer issue by @HighCWu in #38466
- Fixed markdown for BertTokenizer's '[CLS]' token. by @eu90h in #38506
- null deepspeed_plugin in args for wandb callback fake trainer by @winglian in #38867
- More PYUP fixes by @cyyever in #38883
- Fix loop var naming by @Rocketknight1 in #38885
- [bugfix] fix ATTN_MASK_NPU device mismatch error on multi-device NPU … by @qykong in #38876
- log: Add logging when using split_batches and per_device_train_batch_size by @KeshavSingh29 in #38633
- Docs: Add custom fine-tuning tutorial to TrOCR model page by @Ashutosh-4485 in #38847
- 36978 | Fast image processor for DPT model by @samrae7 in #37481
- [video processor] fix slow tests by @zucchini-nlp in #38881
- Update bamba model card by @druvdub in #38853
- Add support for specifying revisions when pushing to Hub via internal Trainer call by @IsaacBreen in #36852
- Use `raise from e` in `hub.py` utility by @Wauplin in #37241
- [phi-4] use mel filters from audio utils by @eustlb in #36966
- Fix `fsmt` tests by @ydshieh in #38904
- Fix unnecessary super calls by @cyyever in #38897
- align xpu's autocast behavior w/ cuda by using device agnostic torch APIs by @yao-matrix in #38284
- Fix `FalconMambaIntegrationTests` by @ydshieh in #38566
- Skip sdpa tests if submodule does not support sdpa by @ivarflakstad in #38907
- Fix ReDOS in tokenizer digit substitution by @Rocketknight1 in #38844
- feat: Add granite architectures to auto tokenizer name mappings by @gabe-l-hart in #38802
- Allow make-fixup on main branch, albeit slowly by @Rocketknight1 in #38892
- feat: add flexible Liger Kernel configuration to TrainingArguments by @hamza-hcompany in #38911
- Remove deprecated classes in modeling_utils.py by @Cyrilvallez in #38919
- Skip some tests for now by @ydshieh in #38931
- Modernbert fixes by @remi-or in #38912
- add pytorch-xpu Dockerfile by @yao-matrix in #38875
- Remove `ALL_LAYERNORM_LAYERS` by @Cyrilvallez in #38922
- [static cache] fix device map per layer in VLMs by @zucchini-nlp in #38488
- Add kwargs for timm.create_model in TimmWrapper by @qubvel in #38860
- Pin PyTorch extras for AMD containers by @ahadnagy in #38941
- Correctly raise error for awq quantization by @Cyrilvallez in #38945
- Fix more flaky `test_initialization` by @ydshieh in #38932
- Switch to use A10 progressively by @ydshieh in #38936
- Fix custom generate from local directory by @manueldeprada in #38916
- Update blip model card by @devkade in #38513
- Gaudi3 CI by @IlyasMoutawwakil in #38790
- Fix DTensor import compatibility for PyTorch < 2.5 by @Benoqtr in #38836
- Fix(informer): Correct tensor shape for input_size=1 by @Flink-ddd in #38856
- [modular] CLI allows positional arguments, and more defaults names for the optional arg by @Cyrilvallez in #38979
- Remove dead protected imports by @Cyrilvallez in #38980
- Break tie in Expectations and gemma3 fixes by @remi-or in #38943
- Add Idefics2/3 and SmolVLM Fast image processors + improvements for fast image processors by @yonigozlan in #38157
- fix: add bool operator to tokenizer to avoid bloated asserts by @kallewoof in #38899
- Add support for auto_docstring with model outputs by @yonigozlan in #38242
- fix `mistral` and `mistral3` tests by @ydshieh in #38978
- [Feature] Support `is_split_into_words` in the `TokenClassificationPipeline`. by @yushi2006 in #38818
- Fix `rag` by @ydshieh in #38585
- [docs] Typos - Single GPU efficient training features by @casinca in #38964
- [qwen] refactor attentions for vision/audio by @zucchini-nlp in #38930
- Removing extra space in large command for speech-pretraining example by @dggaytan in #38705
- [`Attention`] Small fix on output attentions by @vasqu in #38948
- Fixes for Arcee model by @Cyrilvallez in #39001
- Added scikit-learn to the example image-classification requirements.txt by @mylonjones in #37506
- Update attention_visualizer.py by @Tanuj-rai in #37860
- Skip non-selected experts for qwen3_moe by @seven-mile in #38133
- Fix undeterministic order in modular dependencies by @Cyrilvallez in #39005
- Granite speech - minor fixes to support training with the HF trainer by @avihu111 in #38833
- Fix bugs in DynamicCache by @tugsbayasgalan in #37880
- Update self-comment-ci.yml user list by @ivarflakstad in #39014
- Skip sdpa dispatch on flash test due to unsupported head dims by @ivarflakstad in #39010
- [HPU][Critical Issue Fix] ThreadPool instead of Pool for parallel pre-processing by @dsmertin in #39002
- Add Hugging Face authentication procedure for IDEs (PyCharm, VS Code,… by @marcndo in #38954
- [LightGlue] Fixed attribute usage from descriptor_dim to keypoint_detector_descriptor_dim by @sbucaille in #39021
- Add zero dim tensor check when using flash_attention by @ranzhejiang in #38280
- Fix graph break in torch.compile when using FA2 with attention_mask=None and batch size > 1 by @efsotr in #37332
- [AutoModelForMaskGeneration] Remove duplicate code by @NielsRogge in #38622
- [video processor] support torchcodec and decrease cuda memory usage by @zucchini-nlp in #38880
- Drop unnecessary tokens in GPT2Model generation by @null-pointer-access in #39016
- Fix the seamless_m4t cannot work on Gaudi by @yuanwu2017 in #38363
- fix: astronomical loss with ModernBERT when using gradient checkpointing by @umarbutler in #38982
- fix gemma3 grad acc by @SunMarc in #37208
- Remove script datasets in tests by @lhoestq in #38940
- Fix grammatical error in models documentation by @marcndo in #39019
- refactor: remove custom BarkLayerNorm by @eginhard in #39003
- [Kyutai-STT] correct model type + model id by @eustlb in #39035
- Two ReDOS fixes by @Rocketknight1 in #39013
- [tests] remove TF tests (uses of `require_tf`) by @gante in #38944
- Granite speech speedup + model saving bugfix by @avihu111 in #39028
- Fix Bad Outputs in Fast Path for GraniteMoeHybrid by @alex-jw-brooks in #39033
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @ydshieh
- CI reporting improvements (#38230)
- add `liger-kernel` to docker file (#38292)
- add `vasqu` to `self-comment-ci.yml` (#38324)
- new failure CI reports for all jobs (#38298)
- Hot fix for AMD CI workflow (#38349)
- Uninstall `kernels` for AMD docker images (#38354)
- Use one `utils/notification_service.py` (#38379)
- update gemma tests (#38384)
- Update `CsmForConditionalGenerationIntegrationTest` (#38424)
- Fix CircleCI not triggered when PR is opened from a branch of `huggingface/transformers` (#38413)
- Trigger doc-builder job after style bot (#38398)
- Fix GLM4 checkpoints (#38412)
- Fix `Gemma3IntegrationTest` (#38471)
- Fix `Gemma2IntegrationTest` (#38492)
- Fix blip2 tests (#38510)
- Update docker image to use `av` (#38548)
- Fix `utils/notification_service.py` (#38556)
- Fix `chameleon` tests (#38565)
- update `utils/notification_service.py` for AMD vs Nvidia (#38563)
- Fix `deepseekv3` (#38562)
- Remove custom pytest and pluggy (#38589)
- pin pandas (#38605)
- Fix `return_dict=False` giving errors in a few VLM models (#38519)
- Improve `test_initialization` (#38607)
- Use torch 2.7.1 on CircleCI jobs (#37856)
- update `ColQwen2ModelIntegrationTest` (#38583)
- Improve `test_initialization` for `SwiftFormer` (#38636)
- Don't run `AriaForConditionalGenerationModelTest` on CircleCI (#38615)
- Better CI (#38552)
- Skip torchscript tests for 2 models (#38643)
- Fix `InternVL` integration test (#38612)
- Use torch 2.7.1 on daily CI (#38620)
- Fix `aya_vision` test (#38674)
- Update some tests for torch 2.7.1 (#38701)
- Fix `llava` tests (#38722)
- Revert "Trigger doc-builder job after style bot" (#38735)
- Make style bot trigger CI after push (#38754)
- [Hotfix] Fix style bot (#38779)
- Skip some export tests on torch 2.7 (#38677)
- Fix `qwen_2_5 omni` (#38658)
- Fix `llava_onevision` tests (#38791)
- Fix `mllama` (#38704)
- Fix `llava_next` tests (#38813)
- Fix a minor security issue (#38815)
- Fix `qwen2_5_vl` tests (#38845)
- Fix `qwen3` tests (#38862)
- Fix `phi4_multimodal` tests (#38816)
- Fix `qwen3_moe` tests (#38865)
- Fix `fsmt` tests (#38904)
- Fix `FalconMambaIntegrationTests` (#38566)
- Skip some tests for now (#38931)
- Fix more flaky `test_initialization` (#38932)
- Switch to use A10 progressively (#38936)
- fix `mistral` and `mistral3` tests (#38978)
- Fix `rag` (#38585)
- @ArthurZucker
- @younesbelkada
- [MODEL] Add Falcon H1 (#38249)
- @cyr0930
- fix multi-image case for llava-onevision (#38084)
- @cyyever
- @ritsumei-aoi
- Remove Japanese sequence_classification doc and update references (#38246)
- @yao-matrix
- add XPU info print in print_env (#38282)
- refine `transformers env` output (#38274)
- switch to device agnostic device calling for test cases (#38247)
- enable large_gpu and torchao cases on XPU (#38355)
- enable more test cases on xpu (#38572)
- remove ipex_optimize_model usage (#38632)
- from 1.11.0, torchao.prototype.low_bit_optim is promoted to torchao.optim (#38689)
- enable misc test cases on XPU (#38852)
- align xpu's autocast behavior w/ cuda by using device agnostic torch APIs (#38284)
- add pytorch-xpu Dockerfile (#38875)
- @vasqu
- 🔴🔴🔴 [`Attention`] Refactor Attention Interface for Bart-based Models (#38108)
- [`FlexAttention`] Reenable flex for encoder-decoder and make the test more robust (#38321)
- [`OPT`] Fix attention scaling (#38290)
- 🔴[`Attention`] Attention refactor for Whisper-based models (#38235)
- [`FlexAttn`] Fix models with unique characteristics (#38433)
- [`Attention`] Small fix on output attentions (#38948)
- @itazap
- @eustlb
- @RogerSinghChugh
- @1himan
- @Avasam
- Merge type hints from `microsoft/python-type-stubs` (post dropping support for Python 3.8) (#38335)
- @remi-or
- [seamless_m4t] Skip some tests when speech is not available (#38430)
- [janus] Fix failing tests on mi3XX (#38426)
- Fixed a multiple-devices issue in SmolVLM model (#38736)
- Expectation fixes and added AMD expectations (#38729)
- Modernbert fixes (#38912)
- Break tie in Expectations and gemma3 fixes (#38943)
- @tonywu71
- Add ColQwen2 to 🤗 transformers (#35778)
- @geetu040
- @sbucaille
- @samrae7
- 36978 | Fast image processor for DPT model (#37481)
- @Crystalcareai
- Add Arcee model support (#38621)
- @zRzRzRzRzRzRzR
- GLM-4.1V Model support (#38431)
- @bzhangGo
- Encoder-Decoder Gemma (#38332)
- @redmoe-moutain
- [Model] add dots1 (#38143)
- @EduardDurech
- Support for Flash Attention 3 (#38972)