GPT-5 Review: Key Features, Insights, and Performance Breakdown

The launch of GPT-5 presents a fascinating crossroads in the evolution of artificial intelligence, showcasing remarkable capabilities while still leaving much to be desired. OpenAI's latest iteration, now available for free for nearly a billion users, opens doors to more intelligent interactions, yet it also exposes the limitations that come with technological advancements.

Free Access for Nearly a Billion Users

OpenAI's introduction of GPT-5 in the free tier of ChatGPT marks a landmark moment for AI accessibility. This strategic step allows nearly a billion users to interact with a more sophisticated AI model, although some limitations still apply to the free version. The competitive pricing of its API further enhances its appeal, offering a cost-effective alternative to rivals like Anthropic's Claude.

Despite some technical hiccups during the official livestream, including mathematically impossible bar graphs and moments where the model hallucinated during the segment on its hallucination-reduction capabilities, GPT-5's potential merits a closer look.

Benchmark Performance: Mixed Results

Logic and Reasoning Capabilities

In a test of its logic and reasoning skills, GPT-5 scored 9 out of 10 on the widely circulated public questions from SimpleBench. However, this achievement comes with a crucial caveat: many of these public questions may have leaked into the training data, so the score says little about how well the model generalizes to unseen problems.

When assessed on the complete, non-public SimpleBench, GPT-5's performance dipped, achieving only a 57–58% accuracy rate. While commendable, this figure falls short of the 70% threshold that would suggest a genuine leap forward in AI capabilities. Such results reaffirm that GPT-5 is not a catalyst for the much-anticipated leap towards Artificial General Intelligence (AGI).

Hallucination Reduction: Marginal Improvements

OpenAI asserts that GPT-5 generates 44% fewer responses with significant factual errors. However, a closer examination using established benchmarks like SimpleQA reveals a more conservative improvement, as GPT-5 appears to marginally outpace GPT-4 on hallucination metrics. Major factual inaccuracies still surface around 5% of the time in everyday user interactions, highlighting persistent challenges in reliability.
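
To see why a 44% reduction can coexist with a roughly 5% error rate, it helps to separate relative from absolute change. The sketch below is illustrative arithmetic only: it assumes both figures describe the same metric, which the article does not confirm, and the function names are hypothetical.

```python
def rate_after_relative_reduction(baseline_rate: float, relative_reduction: float) -> float:
    """Error rate left over after cutting a baseline rate by a relative fraction."""
    if not 0.0 <= relative_reduction < 1.0:
        raise ValueError("relative_reduction must be in [0, 1)")
    return baseline_rate * (1.0 - relative_reduction)


def implied_baseline(new_rate: float, relative_reduction: float) -> float:
    """Baseline rate that would produce new_rate after the given relative reduction."""
    if not 0.0 <= relative_reduction < 1.0:
        raise ValueError("relative_reduction must be in [0, 1)")
    return new_rate / (1.0 - relative_reduction)


# Figures quoted in the text; treating them as the same metric is an assumption.
current_rate = 0.05   # "around 5% of the time"
reduction = 0.44      # "44% fewer responses"

baseline = implied_baseline(current_rate, reduction)
print(f"Implied earlier rate: {baseline:.1%}")
print(f"Absolute change: {(baseline - current_rate) * 100:.1f} percentage points")
```

Under that assumption, the earlier rate would have been about 8.9%, so the headline 44% corresponds to a drop of roughly four percentage points.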

Software Engineering: A Direct Challenge to Anthropic

One area where GPT-5 exhibits standout performance is software engineering. OpenAI has positioned GPT-5 as a strong competitor to Anthropic's Claude models, surpassing them on various coding benchmarks, including SWE-bench Verified. In practical coding scenarios, GPT-5's superior bug detection may have profound implications for professionals relying on AI assistance, potentially affecting the revenue Anthropic derives from developer-focused services.

Visual Understanding and Multimodality

GPT-5 also demonstrates exceptional skill in understanding complex visual data. In evaluations involving images, charts, and tables, it has surpassed Gemini Deep Think on the MMMU multimodal understanding benchmark. This is notable because Gemini Deep Think sits behind a subscription that costs $250 per month, underscoring GPT-5's accessibility even in specialized domains.

Disappointing Context Window Limitations

Despite its myriad advancements, GPT-5's context window remains relatively constrained compared to its competitors. For example, models like Gemini 2.5 Pro are capable of processing nearly one million tokens, while GPT-5 is still restricted to a few hundred thousand tokens. This limitation significantly hampers the model's ability to engage with lengthy documents, thereby diminishing its utility in comprehensive analyses.
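
A rough way to see what this means in practice is to estimate a document's token count and compare it against a context limit. The following is a minimal sketch, not a real tokenizer: it assumes the common rule of thumb of about four characters per token for English text, and the two limits are illustrative stand-ins for "nearly one million" and "a few hundred thousand" rather than official figures. The function names are hypothetical.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate using an assumed characters-per-token ratio."""
    return math.ceil(len(text) / chars_per_token)


def fits_in_context(text: str, context_limit: int, reserve_for_output: int = 0) -> bool:
    """True if the estimated prompt tokens plus reserved output tokens fit the limit."""
    return estimate_tokens(text) + reserve_for_output <= context_limit


# Illustrative limits only; consult each provider's documentation for real values.
LARGE_WINDOW = 1_000_000
SMALLER_WINDOW = 300_000

# Synthetic stand-in for a long report: 2,000,000 characters.
document = "word " * 400_000

print(estimate_tokens(document))
print(fits_in_context(document, LARGE_WINDOW))
print(fits_in_context(document, SMALLER_WINDOW))
```

With these assumed numbers, the document fits the larger window but not the smaller one, so it would have to be split or summarized before the smaller-window model could work with it.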

Health and Medical Applications

Promisingly, GPT-5 appears to be making strides in health-related applications, potentially facilitating expert-level text-based diagnoses in various scenarios. Interestingly, the GPT-5 Mini model even secured a higher score on the HealthBench Consensus benchmark than its larger counterpart. This anomaly suggests complexities in optimization that could benefit from further exploration.

Lack of Progress in Key Areas

Some aspects of GPT-5 show limited advancement over previous models. Notably, translation abilities remain mostly unchanged from GPT-4, a missed opportunity in an area with clear potential for practical AI applications.

Further, OpenAI's internal benchmarks reflect stagnation in critical AI-enhancing capabilities:

MLE-bench: Minimal progress in machine learning engineering tasks.
OpenAI Pull Requests: Little improvement in handling tasks conducted at OpenAI.
PaperBench: No notable progress in replicating cutting-edge AI research.
Internal Research Bottlenecks: A success rate of only 2% on a set of 20 internal research and engineering bottlenecks, on par with GPT-4's performance.

These benchmarks highlight areas where advancements in self-improvement remain elusive, casting doubt on the trajectory of AI development.

Safety Improvements: A New Approach

In addressing safety concerns, OpenAI has adopted a "safe completions" approach for managing potentially problematic queries. Instead of making a binary comply-or-refuse decision based on perceived user intent, the model focuses on whether the response itself is safe. The aim is to deliver as much useful information as can be given safely without having to guess at the reasons behind a user's question, reflecting an evolution in how AI handles sensitive topics.

Specialized Performance Highlights

Performance across specialized benchmarks varies notably:

GPT-5 achieved a score of 10% on ARC-AGI-2 (pattern recognition), trailing Grok-4, which scored 16%.
A commendable 88.4% on GPQA (the "Google-proof" graduate-level science question benchmark) signals significant potential in the realm of scientific inquiry.
The model also performs admirably on "Humanity's Last Exam," although concerns about benchmark maxing (tuning a model to score well on one specific test) hang over that result.

Model Simplification and Accessibility

In simplifying user experience, OpenAI has deprecated all prior models in favor of GPT-5, streamlining the interface by removing the complex model selection feature. While this may enhance accessibility, it effectively removes choices for users who previously preferred specific iterations of the technology.

The Future of AI Development

Overall, GPT-5 represents considerable progress while simultaneously raising questions about the pace of AI advancement. Even proponents of rapid AI progress have revised their predictions following the launch, pointing to the model's lack of substantial self-improvement capabilities, something many had expected to see on the path towards AGI in the near future.

The incremental enhancements showcased by GPT-5 suggest that although the field continues to evolve, the transformative breakthroughs anticipated may take longer and require innovative approaches that extend beyond mere architectural scaling.

While GPT-5 showcases significant advancements, it also highlights areas for potential growth and improvement within AI. The mix of strong performance in specific domains and notable limitations suggests that the journey towards truly revolutionary breakthroughs is ongoing.