Overview

The LLM-based ASR models combine a Wav2Vec2 encoder with a Llama decoder for autoregressive text generation. These models support optional language conditioning and offer the highest transcription accuracy across 1,600+ languages. The December 2025 update introduced “Unlimited” variants that can process audio of any length.
LLM models achieve state-of-the-art performance with character error rates (CER) below 10% for 78% of the 1,600+ supported languages when using the 7B variant.

Architecture

The LLM model family uses an encoder-decoder architecture:
[Audio 16kHz] → Wav2Vec2 Feature Extractor → Wav2Vec2 Encoder → Linear Projection → Llama Decoder → [Vocab Logits]
                (CNN downsampling ~320x)       (Transformer)     (to 4096-dim)       (Transformer)
                                               (1024/1280/2048)                      (4096-dim)

Key Components

  • Wav2Vec2 Encoder: Produces contextualized audio embeddings (1024/1280/2048-dim depending on model size)
  • Linear Projection: Projects audio embeddings to match Llama decoder’s 4096-dimensional input space
  • Llama Decoder: Autoregressive transformer decoder for text generation
  • Final Projection: Maps decoder outputs to vocabulary logits
  • Beam Search: Generates multiple hypotheses and selects the best transcription
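The dimensions in the diagram translate into rough sequence-length arithmetic. The sketch below assumes the ~320x downsampling figure applies uniformly (the exact frame count depends on CNN kernel and stride settings not given here) and that the projection is a plain linear layer with a bias term. The function names are illustrative and not part of the library.

```python
SAMPLE_RATE_HZ = 16_000   # input sample rate from the diagram
DOWNSAMPLE = 320          # approximate CNN downsampling factor from the diagram
DECODER_DIM = 4096        # Llama decoder input width


def approx_encoder_frames(duration_s: float) -> int:
    """Approximate number of audio embeddings the encoder emits."""
    return int(duration_s * SAMPLE_RATE_HZ) // DOWNSAMPLE


def projection_weight_count(encoder_dim: int, bias: bool = True) -> int:
    """Parameter count of a linear layer encoder_dim -> DECODER_DIM (bias is an assumption)."""
    return encoder_dim * DECODER_DIM + (DECODER_DIM if bias else 0)


print(approx_encoder_frames(30.0))      # 1500, i.e. about 50 embeddings per second
print(projection_weight_count(1024))    # 4198400
```

Under these assumptions the decoder sees on the order of 1,500 audio embeddings for a 30-second clip, each 4096-dimensional after projection.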

Model Variants

Standard LLM Models (with Language Conditioning)

omniASR_LLM_300M / omniASR_LLM_300M_v2
  • Parameters: 1,627,603,584
  • Download Size: 6.1 GiB (FP32)
  • Inference VRAM: ~5 GiB (BF16, batch=1, 30s audio)
  • Speed: RTF 0.090 (~1x relative speed; about 11x faster than real time)
  • Audio Embedding: 1024-dim
  • Decoder Dimension: 4096-dim
  • Vocabulary Size: 9,812 (v1) / 10,288 (v2)
  • Features: Optional language conditioning
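The download size follows directly from the parameter count. A quick check, assuming the checkpoint stores only the weights at 4 bytes each (FP32) and no other metadata of significant size:

```python
GIB = 1024 ** 3  # bytes per GiB


def weights_gib(num_params: int, bytes_per_param: int) -> float:
    """Size of the raw weights in GiB for a given numeric precision."""
    return num_params * bytes_per_param / GIB


params_300m = 1_627_603_584
print(f"FP32: {weights_gib(params_300m, 4):.1f} GiB")  # FP32: 6.1 GiB
print(f"BF16: {weights_gib(params_300m, 2):.1f} GiB")  # BF16: 3.0 GiB
```

The BF16 weights alone account for about 3 GiB of the ~5 GiB inference figure; the remainder presumably goes to activations, decoding state, and framework overhead.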

Unlimited Length Models

Released in December 2025, these variants support transcription of unlimited audio length:
omniASR_LLM_Unlimited_300M_v2
  • Parameters: 1,627,603,584
  • Max Audio Length: Unlimited
  • Segment Size: 15 seconds
  • Context Window: 1 previous segment
  • Speed (30s): RTF 0.092 (~1x)
  • Speed (15min): RTF 0.206 (~0.5x)
  • VRAM: ~5 GiB
Unlimited Model Notes:
  • Not described in the original research paper (released after publication)
  • Accuracy comparable to standard LLM models
  • Fine-tuning recipes currently not supported
  • Can be extended for real-time/streaming applications

Language Conditioning

LLM models accept an optional language code that conditions the decoder and can improve transcription quality:

Without Language Conditioning

from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline

pipeline = ASRInferencePipeline(model_card="omniASR_LLM_7B_v2")

# Audio-only transcription (language auto-detected)
audio_files = ["/path/to/audio1.flac", "/path/to/audio2.wav"]
transcriptions = pipeline.transcribe(audio_files, batch_size=2)
The models were trained with an 80/20 split of samples with and without language IDs, so they perform robustly in both scenarios. Providing language codes is still recommended for best results.

With Language Conditioning

from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline

pipeline = ASRInferencePipeline(model_card="omniASR_LLM_Unlimited_7B_v2")

audio_files = [
    "/path/to/english.wav",
    "/path/to/mandarin.flac",
    "/path/to/russian.wav"
]

# Provide language codes for better accuracy
lang_codes = ["eng_Latn", "cmn_Hans", "rus_Cyrl"]

transcriptions = pipeline.transcribe(
    audio_files,
    lang=lang_codes,
    batch_size=3
)

Language Code Format

Languages follow the format {language_code}_{script}:
  • eng_Latn - English (Latin script)
  • cmn_Hans - Mandarin Chinese (Simplified)
  • cmn_Hant - Mandarin Chinese (Traditional)
  • rus_Cyrl - Russian (Cyrillic script)
  • ara_Arab - Arabic (Arabic script)
  • hin_Deva - Hindi (Devanagari script)
# Access supported languages programmatically
from omnilingual_asr.models.wav2vec2_llama.lang_ids import supported_langs

print(f"Total languages: {len(supported_langs)}")  # 1600+
print("eng_Latn" in supported_langs)  # True
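Before sending a batch, it can help to catch malformed codes early. The sketch below assumes every code looks like the examples listed above: three lowercase letters, an underscore, and a four-letter script tag with a capital initial. `parse_lang_code` is a hypothetical helper, not part of the library, and passing this check does not mean a language is supported; membership in `supported_langs` remains the authoritative test.

```python
import re

# Assumed shape, inferred from the examples above: "eng_Latn", "cmn_Hans", ...
_LANG_CODE = re.compile(r"([a-z]{3})_([A-Z][a-z]{3})")


def parse_lang_code(code: str) -> tuple[str, str]:
    """Split a code into (language, script), raising ValueError if malformed."""
    match = _LANG_CODE.fullmatch(code)
    if match is None:
        raise ValueError(f"expected '{{language_code}}_{{script}}', got {code!r}")
    return match.group(1), match.group(2)


print(parse_lang_code("cmn_Hans"))  # ('cmn', 'Hans')
```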

Unlimited Length Models

How It Works

Unlimited models use a segmented approach with context:
  1. Segmentation: Audio split into 15-second segments
  2. Contextual Decoding: Each segment uses embeddings from the previous segment
  3. Iterative Processing: Segments decoded sequentially with rolling context
  4. Text Accumulation: Transcriptions concatenated to form complete output
# Internal processing (from models/README.md)
# Training: N=15 seconds per segment, M=1 previous segment for context
# Inference: Process segments iteratively with context window
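The four steps above can be sketched as a simple loop. This is an illustration of the control flow only: `decode_segment` is a hypothetical stand-in for the model, and it receives the raw previous segment where the real model uses that segment's embeddings.

```python
from typing import Callable, List, Optional, Sequence

SEGMENT_S = 15  # segment length from the description above


def split_into_segments(samples: Sequence[float], sample_rate: int = 16_000) -> List[Sequence[float]]:
    """Cut a waveform into consecutive 15-second pieces (the last may be shorter)."""
    step = SEGMENT_S * sample_rate
    return [samples[i:i + step] for i in range(0, len(samples), step)]


def transcribe_with_rolling_context(
    segments: List[Sequence[float]],
    decode_segment: Callable[[Sequence[float], Optional[Sequence[float]]], str],
) -> str:
    """Decode segments in order, handing each call the one previous segment as context."""
    previous = None
    pieces = []
    for segment in segments:
        pieces.append(decode_segment(segment, previous))
        previous = segment  # context window of one previous segment
    return " ".join(pieces)


def fake_decoder(segment, previous):
    # Placeholder that only reports what it was given.
    return f"[{len(segment)} samples, context={'yes' if previous is not None else 'no'}]"


audio = [0.0] * (40 * 16_000)  # 40 seconds of silence
segments = split_into_segments(audio)
print(len(segments))  # 3 (15 s, 15 s, 10 s)
print(transcribe_with_rolling_context(segments, fake_decoder))
```

Because each segment depends only on its predecessor, memory use stays roughly constant as the audio grows, which is consistent with the flat VRAM figure quoted for the Unlimited variants.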

Usage Example

from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline

# Load unlimited length model
pipeline = ASRInferencePipeline(model_card="omniASR_LLM_Unlimited_3B_v2")

# Transcribe long-form audio (e.g., 10-minute podcast)
long_audio = ["/path/to/podcast.wav"]  # 10 minutes
transcriptions = pipeline.transcribe(
    long_audio,
    lang=["eng_Latn"],
    batch_size=1
)

print(transcriptions[0])  # Full 10-minute transcription
Standard LLM Models: Maximum audio length is 40 seconds. For longer audio, use Unlimited variants or split into segments.

Usage Patterns

Basic Transcription

from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline

pipeline = ASRInferencePipeline(model_card="omniASR_LLM_1B_v2")

audio_files = ["/path/to/audio.flac"]
transcriptions = pipeline.transcribe(audio_files, batch_size=1)
print(transcriptions[0])

Mixed Language Batch

from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline

# Different languages in the same batch
audio_files = [
    "/path/to/spanish.wav",
    "/path/to/japanese.flac",
    "/path/to/swahili.wav"
]

lang_codes = ["spa_Latn", "jpn_Jpan", "swa_Latn"]

pipeline = ASRInferencePipeline(model_card="omniASR_LLM_7B_v2")
transcriptions = pipeline.transcribe(audio_files, lang=lang_codes, batch_size=3)

HuggingFace Dataset Integration

from datasets import load_dataset
from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline

# Load dataset
dataset = load_dataset(
    "facebook/omnilingual-asr-corpus",
    "lij_Latn",  # Ligurian
    split="train",
    streaming=True
)
batch = next(dataset.iter(5))

# Convert to pipeline format
audio_data = [
    {"waveform": x["array"], "sample_rate": x["sampling_rate"]}
    for x in batch["audio"]
]

pipeline = ASRInferencePipeline(model_card="omniASR_LLM_Unlimited_1B_v2")
transcriptions = pipeline.transcribe(audio_data, batch_size=2)

for orig, pred in zip(batch["raw_text"], transcriptions):
    print(f"Ground Truth: {orig}")
    print(f"Predicted:    {pred}\n")

Custom Beam Search Configuration

from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline
from omnilingual_asr.models.wav2vec2_llama.model import Wav2Vec2LlamaBeamSearchConfig

# Configure beam search
beam_config = Wav2Vec2LlamaBeamSearchConfig(
    nbest=1,           # Number of hypotheses
    length_norm=False  # Length normalization
)

pipeline = ASRInferencePipeline(
    model_card="omniASR_LLM_3B_v2",
    beam_search_config=beam_config
)

audio_files = ["/path/to/audio1.flac", "/path/to/audio2.wav"]
transcriptions = pipeline.transcribe(audio_files, batch_size=2)

Autoregressive Generation

Unlike CTC models, LLM models generate text sequentially (token-by-token):
  1. Audio Encoding: Wav2Vec2 encoder processes full audio
  2. Projection: Audio embeddings projected to Llama space (4096-dim)
  3. Decoder Context: Optional language ID or previous segments added
  4. Beam Search: Generate multiple hypotheses autoregressively
  5. Selection: Best hypothesis selected based on beam search score
# Internal generation flow (simplified from pipeline.py:377-398)
decoder_context, decoder_context_seq_lens, audio_embeddings = model(
    batch, return_decoder_inputs=True
)

hypothesis_tokens, hypothesis_lens = beam_search_generator.generate_hypotheses(
    decoder_context_inputs=decoder_context,
    decoder_context_seq_lens=decoder_context_seq_lens,
    audio_embeddings=audio_embeddings,
    batch=None
)

# Decode tokens to text
for i in range(hypothesis_tokens.shape[0]):
    tokens = hypothesis_tokens[i, :hypothesis_lens[i]]
    text = token_decoder(tokens)
This autoregressive approach enables:
  • Language modeling: Better fluency and grammar
  • Context awareness: Uses previous tokens to inform generation
  • Flexibility: Supports language conditioning and context examples
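To make the beam search step concrete, here is a toy version over a hand-written next-token table. It is a generic textbook beam search, not the library's generator: the `toy_model` table, the `EOS` token, and the way `length_norm` is applied (dividing the score by sequence length) are all assumptions made for illustration.

```python
import math
from typing import Callable, Dict, List, Tuple

EOS = "</s>"  # hypothetical end-of-sequence token for this toy example
Hypothesis = Tuple[Tuple[str, ...], float]


def beam_search(
    next_log_probs: Callable[[Tuple[str, ...]], Dict[str, float]],
    beam_size: int = 2,
    max_len: int = 5,
    length_norm: bool = False,
) -> List[Hypothesis]:
    """Return finished hypotheses, best first, scored by summed log-probability."""
    beams: List[Hypothesis] = [((), 0.0)]
    finished: List[Hypothesis] = []
    for _ in range(max_len):
        # Extend every live hypothesis by every possible next token.
        candidates = [
            (prefix + (token,), score + logp)
            for prefix, score in beams
            for token, logp in next_log_probs(prefix).items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            (finished if seq[-1] == EOS else beams).append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # keep unfinished hypotheses if max_len was reached

    def rank(item: Hypothesis) -> float:
        seq, score = item
        return score / len(seq) if length_norm else score

    return sorted(finished, key=rank, reverse=True)


def toy_model(prefix: Tuple[str, ...]) -> Dict[str, float]:
    table = {
        (): {"a": math.log(0.6), "b": math.log(0.4)},
        ("a",): {"c": math.log(0.55), EOS: math.log(0.45)},
        ("b",): {EOS: math.log(1.0)},
    }
    return table.get(prefix, {EOS: 0.0})  # unknown prefixes end immediately


best_seq, best_score = beam_search(toy_model)[0]
print(best_seq)                         # ('b', '</s>')
print(round(math.exp(best_score), 2))   # 0.4
```

In this toy distribution a greedy decoder would commit to "a" first (probability 0.6) and end with a sequence of probability 0.33 at best, while the beam keeps "b" alive and finds the complete sequence with probability 0.4. Turning on the length normalization used here flips the ranking toward the longer hypothesis, which shows why that setting can change the selected transcription.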

Performance Characteristics

Speed vs. Accuracy Trade-off

Model Size   RTF (30s)   Accuracy   VRAM     Best Use Case
300M         0.090       Good       5 GiB    Edge deployment, cost-sensitive
1B           0.091       Better     6 GiB    Balanced production
3B           0.093       Great      10 GiB   High-quality production
7B           0.092       Best       17 GiB   Research, maximum accuracy
RTF (Real-Time Factor): processing time divided by audio duration. An RTF of ~0.09 means the model processes 1 second of audio in ~0.09 seconds, roughly 11x faster than real time.
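Since RTF is processing time divided by audio duration, the quoted figures convert directly into wall-clock estimates. A small sketch using numbers from this page (0.090 for a 30-second clip, and the 0.206 figure quoted for the Unlimited 300M model on 15 minutes of audio); actual timings will vary with hardware and batch size.

```python
def processing_time_s(audio_s: float, rtf: float) -> float:
    """Estimated processing time for a clip of the given length."""
    return audio_s * rtf


def speed_vs_real_time(rtf: float) -> float:
    """How many seconds of audio are processed per second of compute."""
    return 1.0 / rtf


print(f"{processing_time_s(30, 0.090):.1f} s")        # 2.7 s
print(f"{speed_vs_real_time(0.090):.1f}x")            # 11.1x
print(f"{processing_time_s(15 * 60, 0.206):.1f} s")   # 185.4 s
```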

CER Performance

The 7B LLM model achieves:
  • CER < 10% for 78% of 1,600+ languages
  • State-of-the-art results across diverse language families
  • Improved performance with language conditioning
See per-language results for detailed metrics.
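For readers who want to score their own transcriptions, CER is the character-level edit distance between reference and hypothesis divided by the reference length. The sketch below is a plain Levenshtein implementation; it applies no text normalization, and whatever normalization the project's own evaluation uses is not described here, so numbers may not match the published results exactly.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (or match, cost 0)
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate of hyp against a non-empty reference."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


print(cer("kitten", "sitting"))  # 0.5
```

A CER below 10% therefore means fewer than one character edit per ten reference characters.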

Input Validation

The model performs validation at every forward pass to ensure correct inputs:
# From models/wav2vec2_llama/model.py
class Wav2Vec2LlamaModel:
    def ensure_valid_forward_inputs(self, batch):
        # LLM+LID: Audio + optional language ID
        # LLM+ZS: Audio + exactly 10 context examples
        ...
  • Standard LLM Models: Accept audio with optional language codes
  • Zero-Shot Model: Requires exactly 10 context examples (see Zero-Shot page)
  • Unlimited Models: No audio length restriction
  • Batch Format: Uses fairseq2 Seq2SeqBatch with optional .example fields
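The first two rules can be illustrated with a plain-Python check. This is only a sketch of the constraints as stated above: `ToyRequest` and `check_request` are hypothetical names, and the real model validates a fairseq2 `Seq2SeqBatch` rather than this container.

```python
from dataclasses import dataclass, field
from typing import List, Optional

ZERO_SHOT_CONTEXT_EXAMPLES = 10  # "exactly 10 context examples"


@dataclass
class ToyRequest:  # hypothetical container, not the fairseq2 batch type
    audio: List[float]
    lang: Optional[str] = None  # language code is optional for standard models
    context_examples: List[str] = field(default_factory=list)


def check_request(request: ToyRequest, zero_shot: bool) -> None:
    """Raise ValueError if the request breaks the rules listed above."""
    if not request.audio:
        raise ValueError("audio is required")
    if zero_shot and len(request.context_examples) != ZERO_SHOT_CONTEXT_EXAMPLES:
        raise ValueError(
            f"zero-shot model needs exactly {ZERO_SHOT_CONTEXT_EXAMPLES} "
            f"context examples, got {len(request.context_examples)}"
        )


check_request(ToyRequest(audio=[0.0] * 16_000), zero_shot=False)  # passes without a language code
```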

Model Selection Guide

Standard LLM

Use when:
  • Audio is under 40 seconds
  • Language is known or auto-detectable
  • Need maximum accuracy
  • Throughput around RTF 0.09 is acceptable
Recommended: omniASR_LLM_7B_v2

Unlimited LLM

Use when:
  • Audio is >40 seconds (podcasts, lectures)
  • Processing long-form content
  • Need streaming capability (custom integration)
  • Need accuracy comparable to standard models
Recommended: omniASR_LLM_Unlimited_7B_v2

Smaller Models (300M/1B)

Use when:
  • Limited GPU memory (under 8 GiB)
  • Cost-sensitive deployment
  • Faster processing preferred
  • Moderate accuracy acceptable

CTC Models

Use when:
  • Speed is critical (CTC models are 16x-96x faster)
  • Language conditioning not needed
  • On-device deployment
  • Batch processing large volumes
See: CTC Models
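The guide above can be condensed into a small helper that picks a model card from audio duration and available GPU memory. `choose_model_card` is a hypothetical function, the VRAM thresholds are the approximate figures from this page, and it assumes the Unlimited variants need about the same memory as their standard counterparts (only the 300M figure is stated).

```python
STANDARD_MAX_AUDIO_S = 40  # limit for standard LLM models
# Approximate inference VRAM in GiB, smallest model first (figures from this page).
VRAM_GIB = {"300M": 5, "1B": 6, "3B": 10, "7B": 17}


def choose_model_card(duration_s: float, free_vram_gib: float) -> str:
    """Pick the largest LLM variant that fits, switching to Unlimited for long audio."""
    fitting = [size for size, need in VRAM_GIB.items() if need <= free_vram_gib]
    if not fitting:
        raise ValueError("no LLM variant fits in memory; consider a CTC model")
    size = fitting[-1]  # dicts keep insertion order, so the last entry is the largest
    family = "omniASR_LLM" if duration_s <= STANDARD_MAX_AUDIO_S else "omniASR_LLM_Unlimited"
    return f"{family}_{size}_v2"


print(choose_model_card(20, 24))    # omniASR_LLM_7B_v2
print(choose_model_card(600, 8))    # omniASR_LLM_Unlimited_1B_v2
```

The helper always prefers accuracy over speed; a deployment that is cost-sensitive would cap the size explicitly instead of taking the largest model that fits.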

Advanced Features

Custom Model Loading

import torch
from fairseq2.models.hub import load_model
from fairseq2.data.tokenizers.hub import load_tokenizer
from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline

# Load model and tokenizer separately
model = load_model("omniASR_LLM_3B_v2", device="cuda", dtype=torch.bfloat16)
tokenizer = load_tokenizer("omniASR_LLM_3B_v2")

# Pass to pipeline
pipeline = ASRInferencePipeline(
    model_card=None,
    model=model,
    tokenizer=tokenizer
)

Batch Size Optimization

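import time

import torch

# Find optimal batch size for your GPU
# (assumes `pipeline` and `audio_files` are already defined, as in the examples above)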
for batch_size in [1, 2, 4, 8]:
    try:
        start = time.time()
        pipeline.transcribe(audio_files[:batch_size], batch_size=batch_size)
        elapsed = time.time() - start
        print(f"Batch {batch_size}: {elapsed:.2f}s")
    except torch.cuda.OutOfMemoryError:
        print(f"Batch {batch_size}: OOM")
        break

Next Steps

Zero-Shot Models

Learn about in-context learning for unseen languages

Model Specifications

Detailed comparison of all model variants

CTC Models

Fast parallel generation for production

Inference Guide

Complete transcription workflows and examples