Skip to content

Releases: huggingface/transformers

ModernBERT Decoder (based on v4.53.2)

16 Jul 11:59
0e4b793
Compare
Choose a tag to compare

A new model is added to transformers: ModernBERT Decoder
It is added on top of the v4.53.2 release, and can be installed from the following tag: v4.53.2-modernbert-decoder-preview.

In order to install this version, please install with the following command:

pip install git+https://github.com/huggingface/[email protected]

If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.

As the tag implies, this tag is a preview of the ModernBERT Decoder model. This tag is a tagged version of the main branch and does not follow semantic versioning. This model will be included in the next minor release: v4.54.0.

ModernBERT Decoder

ModernBERT Decoder is the same architecture as ModernBERT but trained from scratch with a causal language modeling (CLM) objective. This allows for using the same architecture for comparing encoders and decoders. This is the decoder architecture implementation of ModernBERT, designed for autoregressive text generation tasks.

Like the encoder version, ModernBERT Decoder incorporates modern architectural improvements such as rotary positional embeddings to support sequences of up to 8192 tokens, unpadding to avoid wasting compute on padding tokens, GeGLU layers, and alternating attention patterns. However, it uses causal (unidirectional) attention to enable autoregressive generation.

Usage example

ModernBERT Decoder can be found on the Huggingface Hub.

Using pipeline:

import torch
from transformers import pipeline

generator = pipeline(
    task="text-generation",
    model="blab-jhu/test-32m-dec",
    torch_dtype=torch.float16,
    device=0
)
generator("The future of artificial intelligence is", max_length=50, num_return_sequences=1)

# For sequence classification
classifier = pipeline(
    task="text-classification",
    model="blab-jhu/test-32m-dec",
    torch_dtype=torch.float16,
    device=0
)
classifier("This movie is really great!")

Using AutoModel:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("blab-jhu/test-32m-dec")
model = AutoModelForCausalLM.from_pretrained(
    "blab-jhu/test-32m-dec",
    torch_dtype=torch.float16,
    device_map="auto",
)

prompt = "The future of artificial intelligence is"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_length=50,
        num_return_sequences=1,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Generated text: {generated_text}")

# For sequence classification
from transformers import AutoModelForSequenceClassification

classifier_model = AutoModelForSequenceClassification.from_pretrained(
    "blab-jhu/test-32m-dec",
    torch_dtype=torch.float16,
    device_map="auto",
    num_labels=2
)

text = "This movie is really great!"
inputs = tokenizer(text, return_tensors="pt").to("cuda")

with torch.no_grad():
    outputs = classifier_model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = torch.argmax(predictions, dim=-1)

print(f"Predicted class: {predicted_class.item()}")
print(f"Prediction probabilities: {predictions}")

Using the transformers CLI:

echo "The future of artificial intelligence is" | transformers run --task text-generation --model your-username/modernbert-decoder-base --device 0

Patch Release v4.53.2

11 Jul 12:39
37f8b0b
Compare
Choose a tag to compare

This patch contains the following bug fixes:

  • Fix some bug for finetune and batch infer For GLM-4.1V (#39090)
  • [bugfix] fix flash attention 2 unavailable error on Ascend NPU (#39166)
  • Fix errors when use verl to train GLM4.1v model (#39199)
  • [pagged-attention] fix off-by-1 error in pagged attention generation (#39258)
  • [smollm3] add tokenizer mapping for smollm3 (#39271)
  • [sliding window] revert and deprecate (#39301)
  • fix Glm4v batch videos forward (#39172)
  • Add a default value for position_ids in masking_utils (#39310)

Patch Release v4.53.1

04 Jul 08:29
896e9ce
Compare
Choose a tag to compare

This patch contains several bug fixes. The following commits are included:

  • Fix: unprotected import of tp plugin (#39083)
  • Fix key mapping for VLMs (#39029)
  • Several fixes for Gemma3n(#39135)
  • [qwen2-vl] fix FA2 inference (#39121)
  • [smolvlm] fix video inference (#39147)
  • Fix multimodal processor get duplicate arguments when receive kwargs for initialization (#39125)
  • when delaying optimizer creation only prepare the model (#39152)
  • Add packed tensor format support for flex/sdpa/eager through the mask! (#39194)

Release v4.53.0

26 Jun 16:07
Compare
Choose a tag to compare

Release v4.53.0

Gemma3n

Gemma 3n models are designed for efficient execution on low-resource devices. They are capable of multimodal input, handling text, image, video, and audio input, and generating text outputs, with open weights for pre-trained and instruction-tuned variants. These models were trained with data in over 140 spoken languages.

Gemma 3n models use selective parameter activation technology to reduce resource requirements. This technique allows the models to operate at an effective size of 2B and 4B parameters, which is lower than the total number of parameters they contain. For more information on Gemma 3n's efficient parameter management technology, see the Gemma 3n page.

image

from transformers import pipeline
import torch

pipe = pipeline(
    "image-text-to-text",
    torch_dtype=torch.bfloat16,
    model="google/gemma-3n-e4b",
    device="cuda",
)
output = pipe(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
    text="<image_soft_token> in this image, there is"
)

print(output)

Dia

image

Dia is an opensource text-to-speech (TTS) model (1.6B parameters) developed by Nari Labs.
It can generate highly realistic dialogue from transcript including nonverbal communications such as laughter and coughing.
Furthermore, emotion and tone control is also possible via audio conditioning (voice cloning).

Model Architecture:
Dia is an encoder-decoder transformer based on the original transformer architecture. However, some more modern features such as
rotational positional embeddings (RoPE) are also included. For its text portion (encoder), a byte tokenizer is utilized while
for the audio portion (decoder), a pretrained codec model DAC is used - DAC encodes speech into discrete codebook
tokens and decodes them back into audio.

Kyutai Speech-to-Text

Kyutai STT is a speech-to-text model architecture based on the Mimi codec, which encodes audio into discrete tokens in a streaming fashion, and a Moshi-like autoregressive decoder. Kyutai’s lab has released two model checkpoints:

  • kyutai/stt-1b-en_fr: a 1B-parameter model capable of transcribing both English and French
  • kyutai/stt-2.6b-en: a 2.6B-parameter model focused solely on English, optimized for maximum transcription accuracy

Read more about the model in the documentation

V-JEPA 2

drawing

V-JEPA 2 is a self-supervised approach to training video encoders developed by FAIR, Meta. Using internet-scale video data, V-JEPA 2 attains state-of-the-art performance on motion understanding and human action anticipation tasks. V-JEPA 2-AC is a latent action-conditioned world model post-trained from V-JEPA 2 (using a small amount of robot trajectory interaction data) that solves robot manipulation tasks without environment-specific data collection or task-specific training or calibration.

Read more about the model in the documentation.

Arcee

image

Arcee is a decoder-only transformer model based on the Llama architecture with a key modification: it uses ReLU² (ReLU-squared) activation in the MLP blocks instead of SiLU, following recent research showing improved training efficiency with squared activations. This architecture is designed for efficient training and inference while maintaining the proven stability of the Llama design.

The Arcee model is architecturally similar to Llama but uses x * relu(x) in MLP layers for improved gradient flow and is optimized for efficiency in both training and inference scenarios.

Read more about the model in the documentation.

ColQwen2

ColQwen2 is a variant of the ColPali model designed to retrieve documents by analyzing their visual features. Unlike traditional systems that rely heavily on text extraction and OCR, ColQwen2 treats each page as an image. It uses the Qwen2-VL backbone to capture not only text, but also the layout, tables, charts, and other visual elements to create detailed multi-vector embeddings that can be used for retrieval by computing pairwise late interaction similarity scores. This offers a more comprehensive understanding of documents and enables more efficient and accurate retrieval.

image

Read more about the model in the documentation.

MiniMax

image

MiniMax is a powerful language model with 456 billion total parameters, of which 45.9 billion are activated per token. To better unlock the long context capabilities of the model, MiniMax adopts a hybrid architecture that combines Lightning Attention, Softmax Attention and Mixture-of-Experts (MoE). Leveraging advanced parallel strategies and innovative compute-communication overlap methods—such as Linear Attention Sequence Parallelism Plus (LASP+), varlen ring attention, Expert Tensor Parallel (ETP), etc., MiniMax's training context length is extended to 1 million tokens, and it can handle a context of up to 4 million tokens during the inference. On various academic benchmarks, MiniMax also demonstrates the performance of a top-tier model.

The architecture of MiniMax is briefly described as follows:

  • Total Parameters: 456B
  • Activated Parameters per Token: 45.9B
  • Number Layers: 80
  • Hybrid Attention: a softmax attention is positioned after every 7 lightning attention.
    • Number of attention heads: 64
    • Attention head dimension: 128
  • Mixture of Experts:
    • Number of experts: 32
    • Expert hidden dimension: 9216
    • Top-2 routing strategy
  • Positional Encoding: Rotary Position Embedding (RoPE) applied to half of the attention head dimension with a base frequency of 10,000,000
  • Hidden Size: 6144
  • Vocab Size: 200,064

For more details refer to the release blog post.

Read more about the model in the documentation.

Encoder-Decoder Gemma

image

T5Gemma (aka encoder-decoder Gemma) was proposed in a research paper by Google. It is a family of encoder-decoder large langauge models, developed by adapting pretrained decoder-only models into encoder-decoder. T5Gemma includes pretrained and instruction-tuned variants. The architecture is based on transformer encoder-decoder design following T5, with improvements from Gemma 2: GQA, RoPE, GeGLU activation, RMSNorm, and interleaved local/global attention.

T5Gemma has two groups of model sizes: 1) Gemma 2 sizes (2B-2B, 9B-2B, and 9B-9B), which are based on the offical Gemma 2 models (2B and 9B); and 2) T5 sizes (Small, Base, Large, and XL), where are pretrained under the Gemma 2 framework following T5 configuration. In addition, we also provide a model at ML size (medium large, ~2B in total), which is in-between T5 Large and T5 XL.

The pretrained varaints are trained with two objectives: prefix language modeling with knowledge distillation (PrefixLM) and UL2, separately. We release both variants for each model size. The instruction-turned varaints was post-trained with supervised fine-tuning and reinforcement learning.

Read more about the model in the documentation.

GLM-4.1V

The GLM-4.1V model architecture is added to transformers; no models have yet been released with that architecture. Stay tuned for the GLM team upcoming releases!

Read more about the model in the documentation.

Falcon H1

image

The FalconH1 model was developed by the TII Pretraining team. A comprehensive research paper covering the architecture, pretraining dynamics, experimental results, and conclusions is forthcoming. You can read more about this series in this website.

  • [MODEL] Add Falcon H1 by @you...
Read more

Kyutai-STT (based on v4.52.4)

24 Jun 16:05
6bdd4ec
Compare
Choose a tag to compare

A new model is added to transformers: Kyutai-STT
It is added on top of the v4.52.4 release, and can be installed from the following tag: v4.52.4-Kyutai-STT-preview.

In order to install this version, please install with the following command:

pip install git+https://github.com/huggingface/[email protected]

If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.

As the tag implies, this tag is a preview of the Kyutai-STT model. This tag is a tagged version of the main branch and does not follow semantic versioning. This model will be included in the next minor release: v4.53.0.

Kyutai-STT

Kyutai STT is a speech-to-text model architecture based on the Mimi codec, which encodes audio into discrete tokens in a streaming fashion, and a Moshi-like autoregressive decoder. Kyutai’s lab has released two model checkpoints:

  • kyutai/stt-1b-en_fr: a 1B-parameter model capable of transcribing both English and French
  • kyutai/stt-2.6b-en: a 2.6B-parameter model focused solely on English, optimized for maximum transcription accuracy

Usage example

Kyutai-STT can be found on the Huggingface Hub.

Inference

import torch
from datasets import load_dataset, Audio
from transformers import KyutaiSpeechToTextProcessor, KyutaiSpeechToTextForConditionalGeneration

# 1. load the model and the processor
torch_device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "kyutai/stt-2.6b-en"

processor = KyutaiSpeechToTextProcessor.from_pretrained(model_id)
model = KyutaiSpeechToTextForConditionalGeneration.from_pretrained(model_id, device_map=torch_device)

# 2. load audio samples
ds = load_dataset(
    "hf-internal-testing/librispeech_asr_dummy", "clean", split="validation"
)
ds = ds.cast_column("audio", Audio(sampling_rate=24000))

# 3. prepare the model inputs
inputs = processor(
    ds[0]["audio"]["array"],
)
inputs.to(torch_device)

# 4. infer the model
output_tokens = model.generate(**inputs)

# 5. decode the generated tokens
print(processor.batch_decode(output_tokens, skip_special_tokens=True))

Batched Inference

import torch
from datasets import load_dataset, Audio
from transformers import KyutaiSpeechToTextProcessor, KyutaiSpeechToTextForConditionalGeneration

# 1. load the model and the processor
torch_device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "kyutai/stt-2.6b-en"

processor = KyutaiSpeechToTextProcessor.from_pretrained(model_id)
model = KyutaiSpeechToTextForConditionalGeneration.from_pretrained(model_id, device_map=torch_device)

# 2. load audio samples
ds = load_dataset(
    "hf-internal-testing/librispeech_asr_dummy", "clean", split="validation"
)
ds = ds.cast_column("audio", Audio(sampling_rate=24000))

# 3. prepare the model inputs
audio_arrays = [ds[i]["audio"]["array"] for i in range(4)]
inputs = processor(audio_arrays, return_tensors="pt", padding=True)
inputs = inputs.to(torch_device)

# 4. infer the model
output_tokens = model.generate(**inputs)

# 5. decode the generated tokens
decoded_outputs = processor.batch_decode(output_tokens, skip_special_tokens=True)
for output in decoded_outputs:
    print(output)

V-JEPA 2 (based on v4.52.4)

11 Jun 14:59
84710a4
Compare
Choose a tag to compare

A new model is added to transformers: V-JEPA 2
It is added on top of the v4.52.4 release, and can be installed from the following tag: v4.52.4-VJEPA-2-preview.

In order to install this version, please install with the following command:

pip install git+https://github.com/huggingface/[email protected]

If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.

As the tag implies, this tag is a preview of the VJEPA-2 model. This tag is a tagged version of the main branch and does not follow semantic versioning. This model will be included in the next minor release: v4.53.0.

VJEPA-2

drawing

V-JEPA 2 is a self-supervised approach to training video encoders developed by FAIR, Meta. Using internet-scale video data, V-JEPA 2 attains state-of-the-art performance on motion understanding and human action anticipation tasks. V-JEPA 2-AC is a latent action-conditioned world model post-trained from V-JEPA 2 (using a small amount of robot trajectory interaction data) that solves robot manipulation tasks without environment-specific data collection or task-specific training or calibration.

Usage example

VJEPA-2 can be found on the Huggingface Hub. V-JEPA 2 is intended to represent any video (and image) to perform video classification, retrieval, or as a video encoder for VLMs.

The snippet below shows how to load the V-JEPA 2 model using the AutoModel class.

import torch
from torchcodec.decoders import VideoDecoder
import numpy as np

processor = AutoVideoProcessor.from_pretrained("facebook/vjepa2-vitl-fpc64-256")
model = AutoModel.from_pretrained(
    "facebook/vjepa2-vitl-fpc64-256",
    torch_dtype=torch.float16,
    device_map="auto",
    attn_implementation="sdpa"
)

video_url = "https://huggingface.co/datasets/nateraw/kinetics-mini/resolve/main/val/archery/-Qz25rXdMjE_000014_000024.mp4"

vr = VideoDecoder(video_url)
frame_idx = np.arange(0, 64) # choosing some frames. here, you can define more complex sampling strategy
video = vr.get_frames_at(indices=frame_idx).data  # T x C x H x W
video = processor(video, return_tensors="pt").to(model.device)
outputs = model(**video)

# V-JEPA 2 encoder outputs, same as calling `model.get_vision_features()`
encoder_outputs = outputs.last_hidden_state

# V-JEPA 2 predictor outputs
predictor_outputs = outputs.predictor_output.last_hidden_state

ColQwen2 (based on v4.52.4)

02 Jun 13:07
c72ba69
Compare
Choose a tag to compare

A new model is added to transformers: ColQwen2
It is added on top of the v4.52.4 release, and can be installed from the following tag: v4.52.4-ColQwen2-preview.

In order to install this version, please install with the following command:

pip install git+https://github.com/huggingface/[email protected]

If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.

As the tag implies, this tag is a preview of the ColQwen2 model. This tag is a tagged version of the main branch and does not follow semantic versioning. This model will be included in the next minor release: v4.53.0.

ColQwen2

image

ColQwen2 is a variant of the ColPali model designed to retrieve documents by analyzing their visual features. Unlike traditional systems that rely heavily on text extraction and OCR, ColQwen2 treats each page as an image. It uses the Qwen2-VL backbone to capture not only text, but also the layout, tables, charts, and other visual elements to create detailed multi-vector embeddings that can be used for retrieval by computing pairwise late interaction similarity scores. This offers a more comprehensive understanding of documents and enables more efficient and accurate retrieval.

Usage example

ColQwen2 can be found on the Huggingface Hub.

import requests
import torch
from PIL import Image

from transformers import ColQwen2ForRetrieval, ColQwen2Processor
from transformers.utils.import_utils import is_flash_attn_2_available

# Load the model and the processor
model_name = "vidore/colqwen2-v1.0-hf"

model = ColQwen2ForRetrieval.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # "cpu", "cuda", or "mps" for Apple Silicon
    attn_implementation="flash_attention_2" if is_flash_attn_2_available() else "sdpa",
)
processor = ColQwen2Processor.from_pretrained(model_name)

# The document page screenshots from your corpus
url1 = "https://upload.wikimedia.org/wikipedia/commons/8/89/US-original-Declaration-1776.jpg"
url2 = "https://upload.wikimedia.org/wikipedia/commons/thumb/4/4c/Romeoandjuliet1597.jpg/500px-Romeoandjuliet1597.jpg"

images = [
    Image.open(requests.get(url1, stream=True).raw),
    Image.open(requests.get(url2, stream=True).raw),
]

# The queries you want to retrieve documents for
queries = [
    "When was the United States Declaration of Independence proclaimed?",
    "Who printed the edition of Romeo and Juliet?",
]

# Process the inputs
inputs_images = processor(images=images).to(model.device)
inputs_text = processor(text=queries).to(model.device)

# Forward pass
with torch.no_grad():
    image_embeddings = model(**inputs_images).embeddings
    query_embeddings = model(**inputs_text).embeddings

# Score the queries against the images
scores = processor.score_retrieval(query_embeddings, image_embeddings)

print("Retrieval scores (query x image):")
print(scores)

If you have issue with loading the images with PIL, you can use the following code to create dummy images:

images = [
    Image.new("RGB", (128, 128), color="white"),
    Image.new("RGB", (64, 32), color="black"),
]

Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the Quantization overview for more available quantization backends.

The example below uses bitsandbytes to quantize the weights to int4.

import requests
import torch
from PIL import Image

from transformers import BitsAndBytesConfig, ColQwen2ForRetrieval, ColQwen2Processor

model_name = "vidore/colqwen2-v1.0-hf"

# 4-bit quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = ColQwen2ForRetrieval.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="cuda",
).eval()

processor = ColQwen2Processor.from_pretrained(model_name)

url1 = "https://upload.wikimedia.org/wikipedia/commons/8/89/US-original-Declaration-1776.jpg"
url2 = "https://upload.wikimedia.org/wikipedia/commons/thumb/4/4c/Romeoandjuliet1597.jpg/500px-Romeoandjuliet1597.jpg"

images = [
    Image.open(requests.get(url1, stream=True).raw),
    Image.open(requests.get(url2, stream=True).raw),
]

queries = [
    "When was the United States Declaration of Independence proclaimed?",
    "Who printed the edition of Romeo and Juliet?",
]

# Process the inputs
inputs_images = processor(images=images, return_tensors="pt").to(model.device)
inputs_text = processor(text=queries, return_tensors="pt").to(model.device)

# Forward pass
with torch.no_grad():
    image_embeddings = model(**inputs_images).embeddings
    query_embeddings = model(**inputs_text).embeddings

# Score the queries against the images
scores = processor.score_retrieval(query_embeddings, image_embeddings)

print("Retrieval scores (query x image):")
print(scores)

Patch release: v4.52.4

30 May 09:18
Compare
Choose a tag to compare

The following commits are included in that patch release:

  • [qwen-vl] Look for vocab size in text config (#38372)
  • Fix convert to original state dict for VLMs (#38385)
  • [video utils] group and reorder by number of frames (#38374)
  • [paligemma] fix processor with suffix (#38365)
  • Protect get_default_device for torch<2.3 (#38376)
  • [OPT] Fix attention scaling (#38290)

Patch release v4.52.3

22 May 14:40
Compare
Choose a tag to compare

Patch release v4.52.3

We had to protect the imports again, a series of bad events.
Here are the two prs for the patch:

Patch release v4.52.2

21 May 13:47
237c7c3
Compare
Choose a tag to compare

Patch release v4.52.2

We had to revert #37877 because of a missing flag that was overriding the device map. We re-introduced the changes because they allow native 3D parallel training in Transformers. Sorry everyone for the troubles! 🤗