Capabilities
Multimodality
AI that can see, hear, and reason across text, images, audio, and more
What it is
Multimodal models process more than just text: they handle images, audio, video, and other data types within a single model architecture. Each modality is tokenized by a modality-specific encoder (e.g., image patches become visual tokens) before being fed into the transformer alongside text tokens.
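To make the "image patches become visual tokens" step concrete, here is a minimal sketch of ViT-style patch embedding: the image is cut into a grid of fixed-size patches, each patch is flattened, and a linear projection maps it to an embedding vector. The random projection matrix below is a stand-in for the learned weights a real model would use; the function name and sizes are illustrative, not from any specific library.

```python
import numpy as np

def image_to_patch_tokens(image, patch_size=16, d_model=64, seed=0):
    """Split an image into non-overlapping patches and project each
    flattened patch to a d_model-dim embedding (a "visual token").
    The projection is random here, standing in for learned weights."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    # Carve the image into a (rows, cols) grid of patch_size x patch_size patches.
    patches = image.reshape(h // patch_size, patch_size,
                            w // patch_size, patch_size, c)
    # Reorder so each patch is contiguous, then flatten each patch to a vector.
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch_size * patch_size * c)
    rng = np.random.default_rng(seed)
    projection = rng.normal(size=(patches.shape[1], d_model))
    return patches @ projection  # shape: (num_patches, d_model)

image = np.zeros((224, 224, 3))          # a 224x224 RGB image
tokens = image_to_patch_tokens(image)
print(tokens.shape)                      # (196, 64): a 14x14 grid of visual tokens
```

These 196 visual tokens can then be concatenated with text tokens and processed by the same transformer layers.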
This allows a model to:
- analyze images and describe them
- answer questions about photos
- generate images from text descriptions
- transcribe and reason about audio
- understand documents that mix text and visuals
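In practice, "answering questions about photos" means sending text and image content in the same request. The sketch below assembles such a mixed message as a plain dictionary, following the content-block convention of the Anthropic Messages API (an image block with base64 data alongside a text block); it only builds the payload and does not call any API, so the helper name is illustrative.

```python
import base64

def build_multimodal_message(question, image_bytes, media_type="image/png"):
    """Assemble a single user message that mixes an image and text,
    in the content-block style used by multimodal chat APIs
    (field names follow the Anthropic Messages API convention)."""
    return {
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64",
                        "media_type": media_type,
                        "data": base64.b64encode(image_bytes).decode("ascii")}},
            {"type": "text", "text": question},
        ],
    }

msg = build_multimodal_message("What is in this photo?", b"<raw PNG bytes>")
print([block["type"] for block in msg["content"]])  # ['image', 'text']
```

On the model side, the image block is encoded into visual tokens and the text block into text tokens; both enter the same transformer context, which is what lets the answer refer to details of the photo.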
GPT-4o, Claude 3.5 and later, and Gemini are all multimodal. Image generation models such as DALL-E use related but distinct architectures (diffusion models) rather than a single multimodal transformer.