Official

openai / gpt-4o

OpenAI's high-intelligence chat model

  • Public
  • 34.7K runs
  • License

GPT‑4o is OpenAI’s most advanced flagship model, offering natively multimodal capabilities across text, vision, and audio. It delivers GPT-4‑level performance with faster response times and lower cost, making it ideal for real-time, high-volume applications. GPT‑4o supports audio inputs and outputs, handles images and text simultaneously, and is designed to feel conversational and responsive — like interacting with a human assistant in real time.


Key Capabilities

  • Multimodal input & output: accepts text, image, and audio input; produces text and audio output (see the sketch after this list)
  • Real-time audio responsiveness: Latency as low as 232 ms
  • 128K token context window for deep reasoning over long content (API)
  • High performance across reasoning, math, and code tasks
  • Unified model for all modalities—no need to switch between specialized models
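
For illustration, here is a minimal sketch of an image-plus-text request through the OpenAI Python SDK. The image URL is a placeholder, and the client is assumed to read `OPENAI_API_KEY` from the environment.

```python
# A minimal sketch of an image-plus-text request to GPT-4o via the OpenAI
# Python SDK; the image URL is a placeholder, not a real asset.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What trend does this chart show?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/chart.png"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```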

Benchmark Highlights

MMLU (Language understanding):        87.2%
HumanEval (Python coding):            90.2%
GSM8K (Math word problems):           94.4%
MMMU (Vision QA):                     74.1%
VoxCeleb (Speaker ID):                95%+ (est.)
Audio latency (end-to-end):           ~232–320 ms

Use Cases

  • Real-time voice assistants and spoken dialogue agents
  • Multimodal document Q&A (PDFs with diagrams, charts, or images)
  • Code writing, explanation, and debugging
  • High-volume summarization and extraction from audio, text, and images (see the extraction sketch after this list)
  • Tutoring, presentations, and interactive education tools
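
As a rough sketch of the extraction use case, the call below uses JSON mode (`response_format={"type": "json_object"}`) to pull structured fields out of free text. The invoice fields are purely illustrative, not a documented schema.

```python
# A rough sketch of structured extraction with GPT-4o's JSON mode; the
# invoice fields below are illustrative, not part of any documented schema.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_object"},
    messages=[
        {
            "role": "system",
            "content": "Extract the invoice number, total, and due date as JSON.",
        },
        {
            "role": "user",
            "content": "Invoice #4821, total $1,250.00, due 2024-07-15.",
        },
    ],
)
print(response.choices[0].message.content)
```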

Developer Notes

  • Available via OpenAI API and ChatGPT (Free, Plus, Team, Enterprise)
  • In ChatGPT, GPT‑4o is now the default GPT-4-level model
  • Audio input/output is supported only in ChatGPT for now
  • Image and text input supported via both API and ChatGPT
  • Supports streaming, function calling, tool use, and vision via the API (see the sketch below)
  • Context window of 128K tokens via the API; ChatGPT applies lower limits on some plans
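
A hedged sketch of function calling with GPT-4o follows; the `get_weather` tool is hypothetical and only illustrates the request and response shape. Streaming works the same way by passing `stream=True` and iterating over the returned chunks.

```python
# A sketch of tool use (function calling) with GPT-4o; get_weather is a
# hypothetical tool used only to show the request and response shape.
import json
from openai import OpenAI

client = OpenAI()

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What's the weather in Lisbon?"}],
    tools=tools,
)

# If the model decided to call the tool, the arguments arrive as a JSON string.
message = response.choices[0].message
if message.tool_calls:
    call = message.tool_calls[0]
    print(call.function.name, json.loads(call.function.arguments))
else:
    print(message.content)
```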