
openai / gpt-4.1

OpenAI's Flagship GPT model for complex tasks.


Pricing

Official model
Pricing for official models works differently from other models. Instead of being billed by time, you’re billed by input and output tokens, making pricing more predictable.

This model is priced by how many input tokens are sent and how many output tokens are generated.

Check out our docs for more information about how per-token pricing works on Replicate.
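
As a rough illustration of per-token billing, the sketch below estimates the cost of a single prediction from its token counts. The rates used here are placeholder values, not this model's actual prices; check the pricing docs for current per-million-token rates.

```python
# Illustrative per-token cost estimate. The rates below are placeholders,
# not this model's actual prices; check the pricing docs for current values.
INPUT_PRICE_PER_MTOK = 2.00    # assumed USD per 1M input tokens
OUTPUT_PRICE_PER_MTOK = 8.00   # assumed USD per 1M output tokens

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one prediction from its token counts."""
    return (input_tokens * INPUT_PRICE_PER_MTOK
            + output_tokens * OUTPUT_PRICE_PER_MTOK) / 1_000_000

# e.g. a 12,000-token prompt that generates a 1,500-token answer
print(f"${estimate_cost(12_000, 1_500):.4f}")
```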

Readme

GPT-4.1 is a high-performance language model optimized for real-world applications, delivering major improvements in coding, instruction following, and long-context comprehension. It supports up to 1 million tokens of context, features a June 2024 knowledge cutoff, and is designed to be more reliable and cost-effective across a wide range of use cases — from building intelligent agents to processing large codebases and documents. GPT‑4.1 offers improved reasoning, faster output, and significantly enhanced formatting fidelity.
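
To call the model programmatically, a minimal request through the Replicate Python client looks roughly like the sketch below. The input field names ("prompt", "system_prompt") are assumptions for illustration; check this model's API schema for the exact inputs, and set REPLICATE_API_TOKEN in your environment first.

```python
# Minimal sketch with the Replicate Python client (pip install replicate).
# Input field names are assumptions; see the model's API schema for the
# actual inputs. Requires REPLICATE_API_TOKEN in the environment.
import replicate

output = replicate.run(
    "openai/gpt-4.1",
    input={
        "system_prompt": "You are a concise technical writer.",
        "prompt": "Summarize the trade-offs between REST and gRPC in five bullets.",
    },
)
# Language models on Replicate typically return output as an iterator of strings.
print("".join(output))
```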


Key Capabilities

  • 1M token context window for large document/code handling
  • Improved instruction following, including format adherence, content control, and negative/ordered instructions
  • Top-tier performance in coding tasks and diffs
  • Optimized for agentic workflows, long-context reasoning, and tool use
  • Real-world tested across legal, financial, engineering, and developer tools

Benchmark Highlights

  • SWE-bench Verified (Coding): 54.6%
  • MultiChallenge (Instruction): 38.3%
  • IFEval (Format compliance): 87.4%
  • Video-MME (Long video QA): 72.0%
  • Aider Diff Format Accuracy: 53%
  • Graphwalks (Multi-hop reasoning): 62%

Use Cases

  • Building agentic systems with strong multi-turn coherence
  • Editing and understanding large codebases or diff formats
  • Complex data extraction from lengthy documents (see the sketch after this list)
  • Highly structured content generation
  • Multimodal reasoning tasks (e.g., charts, diagrams, videos)
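
For the document-extraction use case, one common pattern is to ask the model for strict JSON and parse the reply. The sketch below is illustrative only: the prompt wording, the JSON keys, and the "prompt" input field are all assumptions, and production code would also handle the case where the reply is not valid JSON.

```python
# Hypothetical extraction sketch: request strict JSON and parse it.
# The JSON keys and the "prompt" input field are assumptions.
import json
import replicate

contract_text = open("contract.txt").read()  # a long source document

output = replicate.run(
    "openai/gpt-4.1",
    input={
        "prompt": (
            "From the contract below, extract the parties, the effective date, "
            "and the termination clause. Respond with JSON only, using the keys "
            '"parties", "effective_date", and "termination_clause".\n\n'
            + contract_text
        ),
    },
)
record = json.loads("".join(output))
print(record["effective_date"])
```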

🔧 Developer Notes

  • Available via OpenAI API only
  • Supports up to 32,768 output tokens (see the sketch after this list)
  • Compatible with prompt caching and Batch API
  • Designed for production-scale performance and reliability
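
When calling the model directly through the OpenAI API, the output-token ceiling can be set explicitly. This is a minimal sketch assuming the chat completions endpoint and the "gpt-4.1" model name; it requires OPENAI_API_KEY in the environment.

```python
# Sketch using the OpenAI Python SDK (pip install openai); assumes the chat
# completions endpoint and the "gpt-4.1" model name. Requires OPENAI_API_KEY.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4.1",
    max_tokens=32_768,  # cap generation at the documented output-token limit
    messages=[
        {"role": "system", "content": "You are a careful code reviewer."},
        {"role": "user", "content": "Explain what a three-way merge is."},
    ],
)
print(response.choices[0].message.content)
```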

🧪 Real-World Results

  • Windsurf: 60% higher accuracy on internal code benchmarks; smoother tool usage
  • Qodo: Better suggestions in 55% of pull request reviews, with higher precision and focus
  • Blue J: 53% more accurate on complex tax scenarios
  • Thomson Reuters: 17% improvement in long-document legal review
  • Carlyle: 50% better retrieval accuracy across large financial files