AI ResearchKR

Why GPT-4o Is So Fast: The Critical Difference Between Multimodal and Omni Models

A token-level analysis comparing the pipeline approach (STT→LLM→TTS) text bottleneck with native omni model token fusion. Explains why GPT-4o and MiniCPM-o are fundamentally faster.

When GPT-4o launched, what surprised most people wasn't its performance. It was the speed. Ask it something by voice, and it responds in near real-time with emotion in its voice. It felt fundamentally different from every voice AI before it.

And then MiniCPM-o 4.5 matched that GPT-4o-level performance with just 9B parameters. How?

The answer lies in the "Omni architecture." More precisely, it comes down to how different modalities of data are tokenized and mixed inside a single model.

In this article, we dissect the difference between the pipeline approach and the native Omni approach at the token level.

The Pipeline Approach: Why It's Slow and Awkward

Before 2024, most "voice AI" systems looked like this:

User speech → [Whisper] → text → [LLM] → text → [TTS] → AI speech

Three independent models running sequentially. Let's look at what happens at each stage.

Pipeline Approach: Information Loss and Latency from the Text Bottleneck

Stage 1: Speech → Text (STT)

An STT model like Whisper converts speech to text. This is where the first information loss occurs.

Say the user says "Really?" That single word carries rich information:

  • Tone of surprise (pitch rising)
  • A hint of skepticism (specific intonation pattern)
  • Speaking speed (fast = excitement, slow = doubt)

STT compresses all of this into the text string "Really?" — seven characters. Tone, emotion, speed, speaker characteristics — all gone.

Stage 2: Text → Text (LLM)

The LLM receives only the text "Really?" It can't distinguish whether it was said in surprise, sarcastically, or with genuine curiosity. It can only guess from context.

Stage 3: Text → Speech (TTS)

TTS reads aloud the text generated by the LLM. But TTS also has no information about what emotion to convey. The result: flat, robotic speech.

The Fundamental Problem: The Text Bottleneck

The root issue is that text is a bottleneck.

In human conversation, the words themselves convey only a fraction of the total information. Often-cited psychology research (Albert Mehrabian's studies, which are frequently overstated) suggests that nonverbal elements such as tone, speed, and facial expressions can account for over 90% of emotional communication.

In the pipeline, this 90% evaporates at the STT stage.

Then there's the latency problem. Three models running sequentially:

  • STT: ~300ms
  • LLM: ~500ms (to first token)
  • TTS: ~200ms
  • Total: 1 second or more

In natural conversation, a 1-second delay feels deeply unnatural. Think about how even a 0.5-second delay on a phone call makes you say "Hello? Can you hear me?"
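The latency arithmetic above can be sketched directly; the millisecond figures are the illustrative numbers from this article, not measurements of any specific system:

```python
# Sequential pipeline: each stage must finish (or emit its first output)
# before the next stage can start, so per-stage latencies simply add up.
stage_latency_ms = {
    "stt": 300,  # speech -> text (illustrative figure)
    "llm": 500,  # time to first token (illustrative figure)
    "tts": 200,  # text -> speech (illustrative figure)
}

pipeline_latency = sum(stage_latency_ms.values())
print(pipeline_latency)  # 1000 ms: at the ~1 second threshold users notice
```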

The Omni Approach: Everything Becomes Tokens

Let's see how GPT-4o and MiniCPM-o solve this.

The core idea is simple: convert text, speech, and images all into the same format of tokens, then process them together in a single transformer.

Native Omni: All Modalities Processed in One Token Space

User speech + video → [Unified Encoder] → mixed token sequence → [Single LLM] → mixed tokens → [Decoder] → AI speech + text

The critical difference from the pipeline: there's no intermediate step of compressing everything into text.

How Each Modality Gets Tokenized

Let's get specific about how omni models turn different types of data into "tokens."

Text Tokenization: Nothing New Here

Text is tokenized with BPE (Byte Pair Encoding) or SentencePiece. Words get split into subword tokens, each mapped to an integer ID. No innovation here.
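BPE builds its vocabulary by repeatedly fusing the most frequent adjacent symbol pair. A minimal sketch of one merge loop on a toy corpus (not a production tokenizer):

```python
from collections import Counter

def bpe_merge_step(tokens):
    """One BPE step: fuse the most frequent adjacent pair into a new symbol."""
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return tokens
    (a, b), _ = pairs.most_common(1)[0]
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
            merged.append(a + b)  # the pair becomes a single subword token
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Start from characters; repeated merges grow frequent subwords like "real".
tokens = list("really really real")
for _ in range(4):
    tokens = bpe_merge_step(tokens)
print(tokens)
```

A trained tokenizer simply replays merges like these in a fixed order, then maps each resulting subword to an integer ID.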

Image Tokenization: Patches to Embeddings

Images are processed by ViT (Vision Transformer) family encoders. MiniCPM-o uses SigLip2.

The process:

  1. Split the image into small patches (e.g., 14x14 pixels)
  2. Vectorize each patch via linear projection
  3. Add position embeddings
  4. Pass through transformer encoder → generate visual tokens

Key insight: higher resolution means more patches, and more patches mean more visual tokens. Because SigLip2 handles images up to about 1.8M pixels, the model produces highly detailed visual tokens, which is precisely why its OCR performance is so strong.
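The four steps above can be sketched in a few lines of NumPy. The patch size matches the 14x14 example; the model dimension and weight initialization are illustrative, not SigLip2's actual values:

```python
import numpy as np

rng = np.random.default_rng(0)

def image_to_visual_tokens(image, patch=14, d_model=512):
    """ViT-style patch tokenization sketch (illustrative dims, not SigLip2's)."""
    H, W, C = image.shape
    gh, gw = H // patch, W // patch
    # 1. split the image into non-overlapping patch x patch squares
    patches = image[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * C)
    # 2. vectorize each patch via a linear projection
    W_proj = rng.normal(size=(patch * patch * C, d_model)) * 0.02
    tokens = patches @ W_proj
    # 3. add position embeddings (randomly initialized here; learned in practice)
    pos = rng.normal(size=(gh * gw, d_model)) * 0.02
    return tokens + pos  # 4. in a real model this sequence enters the encoder

img = rng.random((224, 224, 3))
tokens = image_to_visual_tokens(img)
print(tokens.shape)  # (256, 512): a 16x16 grid of 14x14 patches
```

Note how the token count scales: a 224x224 image yields 256 tokens, while a 1.8M-pixel image tiled the same way would yield thousands.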

Audio Tokenization: This Is Where It Gets Interesting

There are two main approaches to audio tokenization.

Audio Tokenization: Continuous Embeddings vs Discrete Tokens

Method 1: Continuous Embeddings

Use an encoder like Whisper to convert audio into a continuous vector sequence.

  • 16kHz audio → mel spectrogram → encoder → embedding sequence
  • Advantage: rich speech information (tone, emotion, speed) preserved in the vectors
  • This is what MiniCPM-o uses for speech input

Method 2: Discrete Tokens

Use audio codecs (EnCodec, SoundStream, etc.) to convert speech into a discrete token sequence.

  • 16kHz audio → codec encoder → discrete token ID sequence
  • Uses RVQ (Residual Vector Quantization) with multiple codebook levels
  • Advantage: next-token prediction works the same way as for text
  • This is the approach GPT-4o is believed to use
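The RVQ idea in Method 2 can be sketched with NumPy: each codebook level quantizes the residual that the previous levels failed to capture. Codebooks here are random for illustration; a real codec like EnCodec learns them:

```python
import numpy as np

rng = np.random.default_rng(0)

def rvq_encode(frames, codebooks):
    """Residual VQ sketch: each level quantizes what the previous levels missed."""
    ids, residual = [], frames
    for cb in codebooks:
        # nearest codeword for every frame at this level
        dists = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        idx = dists.argmin(1)
        ids.append(idx)
        # subtract the chosen codeword; the leftover goes to the next level
        residual = residual - cb[idx]
    return np.stack(ids, 1), residual

frames = rng.normal(size=(50, 8))                          # 50 audio frames, 8-dim features
codebooks = [rng.normal(size=(256, 8)) for _ in range(4)]  # 4 RVQ levels, 256 codewords each
ids, residual = rvq_encode(frames, codebooks)
print(ids.shape)  # (50, 4): four discrete token IDs per frame
```

The resulting integer IDs are what makes next-token prediction over audio look exactly like next-token prediction over text.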

MiniCPM-o combines both approaches:

  • Input: Whisper's continuous embeddings
  • Output: Streaming Flow Matching Decoder + Interleaved Speech Token Decoder
  • This hybrid approach enables Full-Duplex (simultaneous input/output)

When Tokens Meet Inside the LLM

This is the most important part. How are tokens from different modalities processed inside a single transformer?

[visual_token1][visual_token2]...[visual_tokenN][audio_token1][audio_token2]...[text_token1][text_token2]...

This mixed sequence passes through the transformer's self-attention as a whole. At this point:

  • Audio tokens can attend to visual tokens
  • Text tokens can attend to audio tokens
  • All modalities directly reference each other

This is impossible in a pipeline. When you say "describe this image" by voice, an omni model simultaneously references both your vocal tone and the image content to generate its response.
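Cross-modal attention over a mixed sequence can be demonstrated concretely. The sketch below runs single-head self-attention over concatenated visual, audio, and text tokens (tiny illustrative dimensions, random weights):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # shared model dimension (illustrative)

# Tokens from three modalities, already projected into the same d-dim space.
visual = rng.normal(size=(5, d))  # 5 visual tokens
audio = rng.normal(size=(3, d))   # 3 audio tokens
text = rng.normal(size=(4, d))    # 4 text tokens

x = np.concatenate([visual, audio, text])  # one mixed sequence of 12 tokens

# Single-head self-attention over the whole mixed sequence.
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.05 for _ in range(3))
q, k, v = x @ Wq, x @ Wk, x @ Wv
scores = q @ k.T / np.sqrt(d)
attn = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)

# Rows 8-11 are text tokens; columns 0-4 are visual, 5-7 are audio.
# Nonzero weight there means a text token attends directly across modalities.
print(attn.shape, attn[9, :5].sum() > 0)
```

Nothing in the attention computation knows or cares which modality a token came from; the mixing falls out of plain self-attention.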

Dimension Alignment: Making All Tokens Speak the Same Language

Tokens from different encoders may have different dimensions. SigLip2's output dimension, Whisper's output dimension, and Qwen3's embedding dimension could all differ.

Projection layers handle this alignment:

Visual tokens (SigLip2 output) → Linear Projection → LLM dimension
Audio tokens (Whisper output) → Linear Projection → LLM dimension
Text tokens → Embedding table → LLM dimension

This projection is crucial. Once tokens from different modalities are mapped into the same vector space, the transformer processes them without distinction. It's similar to how Korean and English can be understood without translation once they're mapped into the same embedding space.
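The three projection paths can be sketched as follows. All dimensions and the vocabulary size are illustrative stand-ins, not the actual SigLip2, Whisper, or Qwen3 values:

```python
import numpy as np

rng = np.random.default_rng(0)

d_llm = 1024             # illustrative LLM hidden size
d_vis, d_aud = 768, 512  # illustrative encoder output dims

# One linear projection per modality, plus the text embedding table.
proj_vis = rng.normal(size=(d_vis, d_llm)) * 0.02
proj_aud = rng.normal(size=(d_aud, d_llm)) * 0.02
embed = rng.normal(size=(1000, d_llm)) * 0.02  # toy 1000-entry vocabulary

visual = rng.normal(size=(256, d_vis)) @ proj_vis  # vision encoder output -> LLM space
audio = rng.normal(size=(50, d_aud)) @ proj_aud    # audio encoder output -> LLM space
text = embed[rng.integers(0, 1000, size=12)]       # token IDs -> LLM space

# After projection, everything lives in one sequence with one dimension.
seq = np.concatenate([visual, audio, text])
print(seq.shape)  # (318, 1024)
```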

Full-Duplex: Listening and Speaking Simultaneously

MiniCPM-o's Full-Duplex capability is the pinnacle of this tokenization architecture.

Traditional voice AI is half-duplex. VAD (Voice Activity Detection) senses when the user is speaking, and the system responds only after the user stops. No interruptions possible.

In Full-Duplex:

  • User speech is tokenized in real-time and fed into the model
  • The model generates response tokens while simultaneously receiving new input tokens
  • When the user interrupts, new input tokens are immediately reflected through the model's attention
  • The model recognizes "oh, the user interrupted" and stops or adjusts its response

This works because both input and output exist in the same token space. In a pipeline, STT must fully complete before the LLM can start, making Full-Duplex structurally impossible.
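The interruption behavior can be illustrated as toy control flow: input and output tokens share one stream, so the model can notice new input between its own output steps. This is purely a sketch of the idea, not MiniCPM-o's actual scheduler:

```python
# Toy full-duplex loop (illustrative control flow, not a real scheduler).
def full_duplex(incoming, planned_reply):
    context, spoken = [], []
    reply = iter(planned_reply)
    for chunk in incoming:              # user audio arrives in chunks
        context.append(chunk)           # new input tokens enter the context
        if chunk == "<interrupt>":      # the model "hears" the barge-in...
            spoken.append("<stop>")     # ...and cuts its own output short
            break
        tok = next(reply, None)         # otherwise emit the next output token
        if tok is not None:
            spoken.append(tok)
    return spoken

print(full_duplex(
    ["hi", "<listen>", "<interrupt>", "<listen>"],
    ["Hello", "there", "how", "are", "you"],
))  # ['Hello', 'there', '<stop>']
```

A half-duplex pipeline has no equivalent of this loop: STT must hand a finished transcript to the LLM before any output can begin.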

Why 9B Is Enough

"But how can 9B match GPT-4o?" — a fair question.

1. The Power of Specialist Encoders

MiniCPM-o's 9B refers to the LLM parameters alone. On top of this, pretrained specialist encoders like SigLip2 (vision) and Whisper (speech) are added. These are models already trained on hundreds of millions of data points in their respective domains.

The LLM only needs to "think." SigLip2 handles seeing, Whisper handles hearing. 9B of reasoning capacity is enough to synthesize this information.

2. Token Efficiency

In the pipeline approach, speech information is compressed into text, losing information. In the omni approach, rich tokens flow directly into the model, allowing the same parameters to leverage more information.

An analogy: the pipeline analyzes a full-color photo after converting it to black and white. Omni analyzes the full-color original.

3. End-to-End Optimization

Each module in a pipeline is trained independently. The STT is optimized only for "converting speech to accurate text" — it doesn't know what the LLM needs.

In an omni model, the entire system is optimized end-to-end. The speech encoder learns to tokenize speech "in a form the LLM can best understand." This alignment enables high performance even in smaller models.

What This Means in Practice

Secure Data Processing

Consider a scenario where you're verbally discussing a contract while simultaneously analyzing the document image.

Pipeline: voice → external STT API → text → external LLM API → external TTS API. Sensitive contract details pass through multiple external services.

Omni model: process speech and document image locally in one pass. Data never leaves the device. MiniCPM-o's 11GB VRAM is sufficient.

Real-Time Interpretation

In interpretation, a 0.5-second difference changes the flow of conversation.

Pipeline: speech recognition (300ms) + translation (500ms) + speech synthesis (200ms) = a minimum delay of about 1 second

Omni model: speech tokens → translated speech tokens emitted directly, with a time to first token (TTFT) of about 0.6 seconds

Multimodal Analysis

"I see something unusual in this X-ray, can you explain it?" — requesting by voice while showing the image.

In a pipeline, speech and image are processed separately. In an omni model, the "something unusual" audio tokens and X-ray image tokens connect directly through attention. The model even reads the "where is it unusual?" context from your vocal tone.

Current Limitations and Future Directions

Let's be honest about current limitations.

Token Count Explosion

High-resolution image + long audio input simultaneously causes token counts to spike dramatically. Visual tokens from a 1.8M pixel image plus audio tokens from 30 seconds of speech can total thousands. Since transformer self-attention scales quadratically with token count, this can become a bottleneck.
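The quadratic blow-up is easy to make concrete. The token counts below are rough illustrations in the spirit of the figures above, not measured values:

```python
# Self-attention cost grows with the square of sequence length, so mixing
# a high-resolution image with long audio inflates compute quickly.
visual_tokens = 2500  # e.g. a large image tiled into many patches (illustrative)
audio_tokens = 750    # e.g. tens of seconds of speech (illustrative)
text_tokens = 100

n = visual_tokens + audio_tokens + text_tokens
print(n, n * n)  # the attention score matrix has n^2 entries
```

Doubling the resolution roughly quadruples the visual token count, and the attention matrix grows with the square of the total, which is why long multimodal contexts become a bottleneck.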

Cross-Modal Interference

Processing all modalities in one transformer can cause "interference" where training on one modality degrades performance on another. Research into MoE (Mixture of Experts) and modality-specific adapters is ongoing to mitigate this.

Training Data

Omni model training requires modality-aligned data — for example, "(while looking at this image) explain it like this" with image, speech, and text simultaneously present. Such data is far scarcer than text-only data.

Summary

Aspect | Pipeline | Omni (Native)
Information Preservation | Compressed to text; emotion and tone lost | Original information preserved
Latency | Sequential modules (~1s+) | Single model (~0.6s)
Full-Duplex | Structurally impossible | Possible
Cross-Modal Reasoning | Limited | Direct reference via attention
Security | Multiple external API dependencies | Local processing possible
Model Size | Sum of 3 models | Single unified model

An omni model isn't simply "a model that handles multiple modalities." It's a model that fuses tokens from different senses in a unified space.

GPT-4o is fast not because "the servers are powerful," but because the architecture itself is designed to eliminate latency. And MiniCPM-o matching this with 9B parameters is proof that the power of this architecture transcends model size.
