Capabilities
Multimodality
AI that can see, hear, and reason across text, images, audio, and more
What it is
Multimodal models process more than just text: they handle images, audio, video, and other data types within a single model architecture. Each modality is tokenized by a modality-specific encoder (e.g., image patches become visual tokens) before being fed into the transformer alongside text tokens.
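To make the "image patches become visual tokens" step concrete, here is a minimal sketch of ViT-style patch embedding: the image is cut into a grid of fixed-size patches, each patch is flattened, and a linear projection maps it to an embedding vector. The random projection matrix below is a stand-in for the learned weights a real model would use; the function name and sizes are illustrative, not from any specific library.

```python
import numpy as np

def image_to_patch_tokens(image, patch_size=16, d_model=64, seed=0):
    """Split an image into non-overlapping patches and project each
    flattened patch to a d_model-dim embedding (a "visual token").
    The projection is random here, standing in for learned weights."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    # Carve the image into a (rows, cols) grid of patch_size x patch_size patches.
    patches = image.reshape(h // patch_size, patch_size,
                            w // patch_size, patch_size, c)
    # Reorder so each patch is contiguous, then flatten each patch to a vector.
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch_size * patch_size * c)
    rng = np.random.default_rng(seed)
    projection = rng.normal(size=(patches.shape[1], d_model))
    return patches @ projection  # shape: (num_patches, d_model)

image = np.zeros((224, 224, 3))          # a 224x224 RGB image
tokens = image_to_patch_tokens(image)
print(tokens.shape)                      # (196, 64): a 14x14 grid of visual tokens
```

These 196 visual tokens can then be concatenated with text tokens and processed by the same transformer layers.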
This allows a model to:
- analyze images and describe them
- answer questions about photos
- generate images from text descriptions
- transcribe and reason about audio
- understand documents that mix text and visuals
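In practice, "answering questions about photos" means sending text and image content in the same request. The sketch below assembles such a mixed message as a plain dictionary, following the content-block convention of the Anthropic Messages API (an image block with base64 data alongside a text block); it only builds the payload and does not call any API, so the helper name is illustrative.

```python
import base64

def build_multimodal_message(question, image_bytes, media_type="image/png"):
    """Assemble a single user message that mixes an image and text,
    in the content-block style used by multimodal chat APIs
    (field names follow the Anthropic Messages API convention)."""
    return {
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64",
                        "media_type": media_type,
                        "data": base64.b64encode(image_bytes).decode("ascii")}},
            {"type": "text", "text": question},
        ],
    }

msg = build_multimodal_message("What is in this photo?", b"<raw PNG bytes>")
print([block["type"] for block in msg["content"]])  # ['image', 'text']
```

On the model side, the image block is encoded into visual tokens and the text block into text tokens; both enter the same transformer context, which is what lets the answer refer to details of the photo.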
GPT-4o, Claude 3.5 and later, and Gemini are all multimodal. Image generation models such as DALL-E use related but distinct architectures (diffusion models) rather than a single multimodal transformer.