GPT-4.5

OpenAI’s GPT-4.5 Is 10 Times More Efficient With 63% Fewer Hallucinations

OpenAI released GPT-4.5, arriving on Feb. 27, 2025, as the largest and smartest general-purpose AI model the company has released so far. The first is a newspaper backed by all-known AI company OpenAI, based on their GPT-4o architecture and the most extensive training for statements close to reality as it manages 10 times better computing efficiency over its predecessor, a more powerful writing, and also 19% fewer hallucinations than GPT-4o (52% hallucinations), improved emotional intelligence, etc. GPT-4.While not a frontier model like o1 or o3-mini, 5 excels in writing, programming, and the real-world practical problem-solving in 15 languages. Initially provided to of ChatGPT Pro users, it would be available to Plus, Team and Edu users within the following week.

Release Summary

OpenAI publicly introduced GPT-4.5 is a major leap forward in its line of models, with a public release on February 27, 2025, as a research preview. Through various optimizations, however, this new model takes the original GPT-4o architecture as its foundation but achieves an astonishing ten times greater computational efficiency. This is unlike the more narrow STEM reasoning models that OpenAI has created with GPT-4.5 is intended as a wide-ranging, multipurpose model that occupies an important position in the capabilities hierarchy between its immediate predecessor GPT-4o and the company’s more tailored models o1 and o3-mini.

The release strategy is similar to OpenAI’s usual gradual rollout. Initially available only to ChatGPT Pro users and developers via the API, the company says it’ll broaden availability to ChatGPT Plus, Team and Edu users early next week. This gradual rollout is impossible at present because of hardware limitations; that Sam Altman said they are “out of GPUs,” and ”we’ll be adding tens of thousands of GPUs next week” for a wider release.

Major Architectural Improvements

Hybrid Training Approach

GPT-4.5 is a major change in how OpenAI trained it. The system layers on top of the traditional unsupervised model which is how the model initially acquires general knowledge about the world with raw but not unstructured input data, capabilities for chain-of-think reasoning. This means GPT-4. GPT-3, a Big OpenAI language model guided by Transformers, manages a stellar 53% on high-school Capture The Flag (CTF) pursuits, but merely 16% on your more strenuous collegiate problems 5.

This iterative process included novel supervision as well as the traditional supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) approaches previously used to tune GPT-4o. It is in this amalgamation that OpenAI’s new CEO Sam Altman says that “for the first time, it feels like talking to a thoughtful person to me,” claiming to “have been amazed at actually getting good advice from an AI”.

Alignment Breakthroughs

GPT-4.5 brings a new Instruction Hierarchy system explicitly created to minimize the risk of prompt injection attacks, which were a frequent avenue of attack on earlier models. This approach showed better performance on conflicting instructions (76% vs. 68% for GPT-4o), specifically when system/user messages conflicted.

These improvements in alignment follow OpenAI’s strategy of working on safety in parallel to its capability improvements. The release states that they view every increase in the capabilities of their model as “an opportunity to make the models safer” and that they hope their work will “provide a solid foundation for aligning even more capable future models”.

Multimodal Safety

Safety continues to be a focus for OpenAI with the GPT-4.5 release. With 99% notunsafe on sexual/minors, this is a major step for content Moderation[1] But this heightened focus on safety does come at a cost, with a 31% overrefusal rate on ambiguous image-text inputs.

OpenAI carried out extensive safety testing against their Preparedness Framework prior to deployment. “Scaling the GPT paradigm led to capability improvements across our test suite,” these evaluations found. In the related system card, the firm has published a full report of the findings from these assessments, consistent with its mission of openness on AI safety.

Capability Breakdown

GPT 4. 5 makes large strides in what it can do over older models. It’s better at answering questions about people, it makes fewer mistakes, and it handles coding tasks better. The new model is also more secure, defending against malicious requests better.

The core performance metrics

GPT-4.5 certainly stands out when you crunch the numbers. The most striking gain is in the PersonQA Accuracy metric, which judges how well the model answers questions about real people. GPT-4.5 outscoring 78% vs a mere 28% for GPT-4o — nearly three times better! That should make the new model a lot more dependable when you ask about celebrities, historical figures, or experts in various domains.

The reduction in hallucination rate is one of the most significant improvements. GPT-4.5 generates information purely out of thin air only 19 percent of the time, compared with 52 percent for GPT-4o. This is big news if you need reliable answers. Consider students researching for papers or businesses making decisions — they rely on facts, not fiction.

Metric GPT-4o GPT-4.5 Improvement
PersonQA Accuracy 28% 78% +178%
Hallucination Rate 52% 19% -63%
MMLU (Yoruba) 62% 68% +10%
SWE-bench Verified 32% 38% +19%

The model also does better with languages that aren’t English. For example, it scores 68% on MMLU tests in Yoruba (a language spoken in West Africa), which is 10% better than GPT-4o. This shows OpenAI is working to make their AI more helpful worldwide, not just for English speakers.

For software engineers, there’s good news too. GPT-4.5 gets 38% on the SWE-bench Verified test, which checks if it can write working code to solve real programming problems. That’s a 19% improvement over GPT-4o. While it’s still not perfect, it means the model can help with more coding tasks than before.

Novel Safety Systems

Safety is a big deal with powerful AI, and GPT-4.5 comes with several new protections. These systems work behind the scenes to keep the AI from being used in harmful ways.

Apollo Research Scheming Test

This test checks if an AI can be tricked into making harmful plans. GPT-4.5 scores 35% lower than OpenAI’s o1 model on this test, which is good news. It means GPT-4.5 is less likely to help someone who’s trying to use it for bad purposes. The test works by asking the AI tricky questions that seem innocent but could lead to harmful advice.

For example, if someone asks a series of seemingly normal questions that gradually lead toward something dangerous, GPT-4.5 is better at spotting this pattern and refusing to help. This makes it harder for people to “jailbreak” the AI or get around its safety features.

CBRN Mitigations

CBRN stands for Chemical, Biological, Radiological, and Nuclear – all very dangerous areas where AI could potentially cause harm. GPT-4.5 has special safety features for these topics, and the results are impressive. When tested on tasks related to creating biological threats, it had a 0% success rate after safety mitigations were applied.

This means that even when researchers tried their hardest to get the AI to provide dangerous information about making biological weapons or harmful substances, it refused every time. This is crucial for preventing misuse of AI in ways that could harm many people.

The safety team at OpenAI tested thousands of prompts in these dangerous areas, including:

  • Requests for detailed instructions on making dangerous chemicals
  • Questions about creating biological agents
  • Attempts to get information about radiological weapons
  • Inquiries about nuclear technology misuse

In all these cases, GPT-4.5’s safety systems blocked the requests completely.

Political Persuasion Guardrails

Another important safety feature is protection against political manipulation. GPT-4.5 scores 51% on the “notunsafe” metric when faced with adversarial red teaming that tries to make it spread political propaganda. This is better than GPT-4o, which scored only 40%.

This means GPT-4.5 is less likely to be used as a tool for spreading false political information or manipulating public opinion. The model can recognize when someone is trying to get it to create politically biased content that might mislead people.

These guardrails work by:

  • Detecting attempts to create political propaganda
  • Identifying requests for content that unfairly favors one political view
  • Recognizing attempts to create misleading political messages
  • Refusing to generate content designed to manipulate voters

With elections happening in many countries, these protections help ensure that AI won’t become a tool for spreading false information or influencing voters unfairly.

Comparative Analysis With Contemporary Models

GPT-4.5 isn’t alone in the AI world. Let’s look at how it stacks up against other top models. Each has its own strengths and special features that make it unique. 🧠

Claude 3.7 Sonnet (Anthropic)

Anthropic has created the strong GPT-4 competitor Claude 3.7 Sonnet. 5, but there are some important differences in how it operates, and what it excels at.

Architecture Claude 3.7 Sonnet employs a hybrid reasoning system that allows users to opt for quick responses or deeper thought. It is like that you have two types — a fast one for simple questions and an “extended thinking” one for when you need more insightful responses. In thinking mode, it reveals its workings out for each step in the process, so people can follow how it arrived at its conclusions. What makes Claude 3.7 Sonnet unique is its huge 128K token output potential. This is 15 times larger than prior Sonnet models. To put this into perspective, your single output is capable of being 100 pages of writing in a single go. This enormous memory makes it ideal for lengthy conversations and intricate tasks that require a lot of back-and-forth.

Key Differentiators Over coding, Claude 3.7 Sonnet is truly powerful. It score a whopping 70.3% on SWE-bench (the software engineering exam), beating GPT-4 by a wide margin. 5’s 38%. Claude is particularly well-suited for programmers and developers.

Claude 3.7 Sonnet also excels at creating outlines for long-form content with structure. Claude can help you organize your thoughts (if you need to write research papers, reports, articles, etc.) in a clear manner which makes it easy for readers to understand.

Safety Posture  Safety is the core identity of Anthropic, contrasting with that of OpenAI. Claude employs something termed “Constitutional AI”  a collection of principles meant to steer its actions. GPT-4.5, by contrast, relies on Reinforcement Learning from Human Feedback (RLHF).

As a result, Claude has been able to cut unnecessary refusals by 45% compared to older versions. That’s to say it’s more discerning about whether the requests are genuinely harmful or ones that it is okay to answer.

Grok 3 (xAI)

Elon Musk’s xAI built Grok 3 that provides some very cool features compared to other models.

Grok 3 Unique Features A staggering 1 million token context window for Grok 3. This is significantly bigger than GPT-4.5’s 32K window. With this massive context, Grok 3 can “remember” entire books or conversations in a single session.

Grok 3 includes special modes known as DeepSearch and Think Modes. This is particularly strong for math problems  Grok 3 hits an astounding 96% on the AIME math benchmark. It is a very challenging exam that even gifted high school students find difficult to pass.

Deployment The hardware underpinning Grok 3 is enormous. It operates on 200,000 NVIDIA H100 GPUs, and training took approximately 200 million GPU-hours. This immense computational power enables Grok 3 to rapidly process information and tackle complex tasks.

Grok 3 has one particular feature that stands out, as it’s connected to X/Twitter data in real time. Its up-to-date access to information from social media  other models cannot access that as easily  is very important.

DeepSeek-R1

DeepSeek-R1 has a wholly different approach to AI training, which brings a few unique strengths with it.

Training methodology What’s remarkable about DeepSeek-R1 is that it learns via a pure reinforcement learning method known as GRPO (Generative Reinforcement Learning with Preference Optimization). This is different from most models that utilize labeled data, DeepSeek-R1 starts learning more like humans — through trial and error.

Training occurs in 3 phases, beginning with a cold start process preventing the model from producing nonsense. The project therefore created these groups step by step as it helps DeepSeek-R1 learn how to reason this way.

Specialisation DeepSeek-R1 is exceptionally good at math. The system scores 85% on the GPQA (Graduate-level Physics Questions with Answers) test and GPT-4.5’s 78%. That is particularly good for anything scientific and technical.

The model is capable of taking complex problems, dividing them into smaller parts, and solving them one at a time. It feels like you’re watching someone work out a tough math problem on a whiteboard.

OpenAI o3-mini

The o3-mini is OpenAI’s smaller reasoning model, made for efficiency, but still retains good performance.

Hardware profile The o3-mini is based on 576 NVIDIA Blackwell GPUs organized in 8 racks of standard servers. These are among the most powerful AI chips on the market. Then, it is designed for processing numerous requests simultaneously, which is useful for businesses who must cater to large numbers of users.

Performance Tradeoffs Although o3-mini is a smaller, more efficient version of GPT-4.5, there are some tradeoffs with the new hardware. It achieves a ”notunsafe” percentage of 26% for adversarial prompts in safety tests relative to GPT-4.5’s stronger 51%. This means GPT-4.5 is better at deflecting efforts to make it generate harmful content.

However, o3-mini is powerful for its size. Running it in “high” mode, it costs less than half as much to run as the normal o1 model, and also scores 182 ELO points higher on competitive coding. That makes it a viable option for developers looking for decent performance on a budget.

GPT-4/4o Legacy Models

The previous generation of GPT-4 and GPT-4o models still have their applications, but they do have drawbacks when compared to GPT-4.5

Key limitations  One major issue with GPT-4 is its hallucination rate – the hallucination rate is 52% versus GPT-4.5’s much lower 19%. This means GPT-4 is more prone to hallucinations, which can be a dire problem for many critical tasks.

GPT-4 has only a 8K token context window, whereas GPT-4.5 has the ability to manage 32K, which constrains the quantity of information that GPT-4 can examine simultaneously.

Cost Profile Here’s where things get interesting – GPT-4o is actually cheaper than GPT-4.5:

Model Input Cost Output Cost
GPT-4o $2.50/1M tokens $10/1M tokens
GPT-4.5 $75/1M tokens $150/1M tokens
GPT-4 $30/1M tokens $60/1M tokens

As you can see, GPT-4.5 is much more expensive than both GPT-4 and GPT-4o. This high cost reflects its advanced capabilities, but means users need to consider whether they really need its power for their specific tasks.

For many everyday uses, GPT-4o offers a better balance of performance and cost. It’s 90% cheaper than GPT-4 and even more affordable compared to GPT-4.5.

Technical Deep Dive: GPT-4.5 Safety Systems

GPT-4.5 has strong safety features to keep it from being harmful. OpenAI tested it carefully before release to make sure it’s safe to use. Let’s look at how they protect us from risks. 🛡️

Preparedness Framework Implementation

OpenAI created a special system called the “Preparedness Framework” to check if their AI models are safe before letting people use them. This framework helps them spot possible dangers and fix them. For GPT-4.5, they checked four main risk areas.

Risk Classifications

The Safety Advisory Group at OpenAI gave GPT-4.5 an overall “medium risk” rating after testing. Here’s how each risk area scored:

  1. CBRN (Chemical, Biological, Radiological, Nuclear): Medium RiskThis checks if the AI could help someone make dangerous weapons or harmful substances. GPT-4.5 got a “medium” rating here, which means it has some knowledge but strong protections are in place. Before adding safety features, the model showed it could answer some questions about biological topics with 59% accuracy.
  2. Cybersecurity: Low Risk (53% HS CTF success)This measures if hackers could use the AI to break into computers. GPT-4.5 can solve 53% of high-school level hacking challenges, but only 16% of college-level ones, and just 2% of professional-level problems. This low success rate on harder challenges means it’s not very useful for serious hacking.
  3. Persuasion: Medium RiskThis checks if the AI could be used to trick people or spread fake news. GPT-4.5 got a “medium” rating because it scored 72% on the “MakeMeSay” test, which measures how easily it can be tricked into saying specific things.
  4. Model Autonomy: Low Risk (40% agentic task success)This measures if the AI can act on its own to do complex tasks. GPT-4.5 succeeded on only 40% of tests where it had to work independently, showing it still needs human guidance for most tasks.

Mitigation Layers

OpenAI didn’t just identify risks – they added several protection layers to make GPT-4.5 safer:

  1. Pre-training data filteringThey removed dangerous information about making biological or chemical weapons from the training data. This helps prevent the model from learning how to help with harmful activities.
  2. Safety trainingGPT-4.5 was specially trained to refuse requests for harmful content. This training helps it recognize and reject dangerous questions.
  3. Live monitoring for influence operationsOpenAI watches how people use GPT-4.5 to catch anyone trying to use it for political manipulation or spreading extreme views. This helps them stop misuse quickly.
  4. Content moderation systemsSpecial AI systems check what GPT-4.5 says to make sure it doesn’t break OpenAI’s rules. These systems have gotten better at spotting subtle problems.
  5. Targeted investigationsThe safety team looks into suspicious activities, especially those related to political influence or extremism.

Red Teaming Results

“Red teaming” is when experts try to trick an AI into doing harmful things to find weaknesses. Instead of having humans directly test GPT-4.5, OpenAI used challenging tests created from previous red teaming efforts. These tests are very hard and designed to push the AI’s safety limits.

Attack Type GPT-4.5 GPT-4o
Illicit Advice 51% 50%
Political Persuasion 46% 40%
Self-Harm Elicitation 99% 98%

In this table, higher percentages mean better safety (the AI refused to help with harmful requests). Let’s break down what these results mean:

Illicit Advice: When asked to help with illegal or harmful activities, GPT-4.5 refused 51% of the time, slightly better than GPT-4o’s 50%. This shows a small improvement in safety.

Political Persuasion: When tested on resisting attempts to create political propaganda or manipulation, GPT-4.5 scored 46% compared to GPT-4o’s 40%. This is better, but still shows room for improvement.

Self-Harm Elicitation: GPT-4.5 refused requests related to self-harm 99% of the time, compared to GPT-4o’s 98%. This shows very strong protection in this critical area.

OpenAI also tested GPT-4.5 against two special red teaming evaluation sets:

  1. The first set was created to test o3-mini and includes jailbreaks for illicit advice, extremism, political persuasion, and self-harm content. GPT-4.5 produced safe outputs 51% of the time, slightly better than GPT-4o (50%) but not as good as o1 (63%).
  2. The second set was designed for “deep research” and covers risky advice like attack planning. GPT-4.5 produced safe outputs 46% of the time, better than GPT-4o (40%) but not as good as o1 (68%) or deep research (67%).

Apollo Research, an external group that studies AI safety, found that GPT-4.5 scores lower on “scheming reasoning” tests than o1 but higher than GPT-4o. This means GPT-4.5 is less likely than o1 to come up with clever ways to do harmful things, but more likely than GPT-4o.

These tests show that while GPT-4.5 has improved safety in some areas, it still has weaknesses that OpenAI continues to work on. The company expects scores on these challenging tests to improve over time as they make the model more robust against attacks.

Market Implications

GPT-4.5 is changing how businesses and researchers use AI. Let’s look at who can benefit from it and what it means for the AI world. 💼

Enterprise Use Cases

Companies are excited about GPT-4.5, but they need to think carefully about when to use it instead of other AI models. It’s not always the best choice for every job.

Superior Applications for GPT-4.5

GPT-4.5 really shines in certain areas where it beats other top AI models:

  1. Creative Writing

GPT-4.5 is amazing at writing stories, articles, and marketing content. Tests show it scores 38% higher on coherence than Claude 3.7 Sonnet. This means its writing flows better and makes more sense.

When writing stories, GPT-4.5 keeps track of characters and plots much better than older models. It remembers details from earlier in the story and uses them in natural ways later on. This makes the stories feel more like they were written by a human.

For marketing teams, this means better blog posts, social media content, and ad copy. The text sounds more natural and engaging, which helps grab customers’ attention.

  1. Multilingual Customer Support

GPT-4.5 supports 15 languages really well. This is huge for global companies that need to talk to customers around the world. The languages include:

  • English
  • Spanish
  • French
  • German
  • Portuguese
  • Italian
  • Dutch
  • Russian
  • Japanese
  • Chinese (Simplified and Traditional)
  • Korean
  • Arabic
  • Hindi
  • Turkish
  • Vietnamese

What makes GPT-4.5 special is that it understands cultural context in each language, not just the words. For example, it knows that politeness works differently in Japanese than in Spanish. This helps avoid awkward or offensive translations.

  1. Complex Document Analysis

GPT-4.5 can read long documents and pull out the important information. It’s great at:

  • Summarizing legal contracts
  • Finding key points in research papers
  • Analyzing financial reports
  • Extracting data from technical manuals

With its 32K context window, it can handle documents up to about 50 pages long in a single prompt. This saves employees hours of reading time.

Cost-Benefit Analysis

GPT-4.5 is powerful, but it’s also expensive. Companies need to decide if the benefits are worth the cost.

Model Input Cost Output Cost Best Use Cases
GPT-4.5 $75/1M tokens $150/1M tokens Creative content, complex reasoning
Claude 3.7 $3/1M tokens $15/1M tokens Code generation, structured data
GPT-4o $2.50/1M tokens $10/1M tokens Everyday tasks, customer service

As you can see, GPT-4.5 costs about $0.12 per 1,000 tokens (about 750 words), while Claude 3.7 costs around $0.08 for the same amount. This difference adds up quickly for companies using AI at scale.

For most everyday business tasks, GPT-4o is still the better choice because it’s much cheaper and almost as good. But for high-value creative work or complex reasoning tasks, GPT-4.5’s extra capabilities might be worth the higher price.

OpenAI offers volume discounts for enterprise customers, which can bring the cost down by 15-30% depending on usage. They also provide a special “cached input” rate of $37.50 per million tokens, which helps for applications that send the same prompts repeatedly.

Research Community Impact

GPT-4.5 is having a big effect on AI research, but access is limited. This creates both opportunities and challenges for scientists.

Model Access

OpenAI is being careful about who can use GPT-4.5 and how much they can use it:

  1. Limited API availability

Researchers need to apply for access to the GPT-4.5 API. OpenAI reviews applications and prioritizes academic institutions and non-profits doing safety research. Even with approval, there are strict usage caps – typically 100,000 tokens per day for academic users.

These limits make it hard for smaller research teams to run large experiments. Some researchers have complained that this creates an unfair advantage for big tech companies and well-funded universities.

  1. No public weights release

Unlike some other AI models (like Llama 3), OpenAI does not share the actual model weights for GPT-4.5. This means researchers can’t see how it works inside or modify it for their own experiments.

OpenAI says this closed approach is necessary for safety reasons. They worry that releasing the full model could lead to misuse. But many in the open-source community argue that closed models slow down progress and make it harder to find and fix problems.

  1. Research access program

OpenAI has created a special program for AI safety researchers. Those accepted get higher usage limits and more detailed model information. So far, about 200 research teams have been approved for this program.

Benchmark Contributions

While keeping the model itself private, OpenAI has shared new ways to test and evaluate AI systems:

  1. New SWE-Lancer evaluation suite

OpenAI introduced the SWE-Lancer suite, which tests how well AI models can solve real-world programming problems. Unlike older benchmarks that use simplified coding tasks, SWE-Lancer uses actual freelance programming jobs from platforms like Upwork.

This benchmark is more realistic because it tests:

  • Understanding unclear requirements
  • Working with existing codebases
  • Handling edge cases
  • Writing tests and documentation

The SWE-Lancer suite is available for anyone to use, which helps the whole research community test their models more effectively.

  1. Public Simple Evals GitHub repository

OpenAI has also released “Simple Evals,” a collection of easy-to-use tests for AI models. These tests cover:

  • Basic reasoning
  • Math skills
  • Knowledge of facts
  • Safety and harmful outputs
  • Bias detection

The Simple Evals repository includes code, test cases, and scoring methods. This makes it easier for researchers to compare different models fairly.

  1. Transparency reports

OpenAI has published detailed reports about GPT-4.5’s performance on various benchmarks. These reports include not just the scores but also examples of where the model succeeds and fails.

This transparency helps researchers understand the current state of AI capabilities and identify areas that need improvement. It also builds trust by showing that OpenAI is honest about their model’s limitations.

The research community has mixed feelings about GPT-4.5. Many are excited about its capabilities but frustrated by the limited access. As one AI researcher put it: “We can see what the model can do, but not how it does it. It’s like being shown a magic trick without learning the secret.”

Ethical Considerations

GPT-4.5 raises important ethical questions we need to think about. While it can do amazing things, it also comes with risks we should understand. Let’s look at the main concerns and what OpenAI is doing about them. 🔍

Dual-Use Concerns

“Dual-use” means technology that can be used for both good and harmful purposes. GPT-4.5 has knowledge that could potentially be misused if the safety systems failed.

Biorisk Potential

Before adding safety features, GPT-4.5 showed concerning knowledge about biology that could be misused. It scored 59% accuracy on biological magnification questions, which test understanding of how chemicals build up in food chains. This knowledge, while useful for environmental science, could potentially help someone create harmful substances.

OpenAI found that as models get smarter, they naturally learn more about biology, chemistry, and other sciences. This creates a challenge: the same knowledge that helps doctors and scientists can also help people with bad intentions.

Some examples of what GPT-4.5 knew before safety measures were added:

  • How certain chemicals affect living organisms
  • Basic principles of genetic modification
  • Details about disease transmission

After adding safety features, the model’s ability to answer potentially dangerous biology questions dropped dramatically. Now it refuses to provide detailed information that could be misused, while still helping with legitimate science questions.

OpenAI worked with biology experts to create special safety rules for this area. They tested the model with thousands of biology questions and fine-tuned it to recognize when a question might be leading toward harmful uses.

Nuclear Knowledge

Even more concerning is GPT-4.5’s knowledge about nuclear technology. Tests showed it had 74% accuracy on expert nuclear engineering questions before safety measures were applied. This is much higher than previous models and close to what a trained nuclear engineer might know.

This knowledge includes understanding of:

  • Nuclear reactor designs
  • Radioactive materials properties
  • Basic principles of nuclear weapons

The high accuracy is worrying because nuclear information could be extremely dangerous in the wrong hands. However, OpenAI has added strong safety measures that prevent the model from sharing detailed nuclear information that could be misused.

When asked potentially dangerous nuclear questions, GPT-4.5 now responds with general information only, or refuses to answer altogether. It can still help with basic nuclear physics for educational purposes, but won’t provide specific details that could aid in weapons development.

OpenAI consulted with nuclear security experts to develop these safety measures. They continue to monitor and improve these protections as the technology evolves.

Transparency Measures

To build trust, AI companies need to be open about how their systems work and what risks they might pose. OpenAI has taken some steps toward transparency with GPT-4.5, but gaps remain.

Third-Party Audits

OpenAI didn’t just test GPT-4.5 themselves – they invited outside experts to evaluate it too. This helps ensure the safety testing is fair and thorough.

  1. METR evaluation

The Model Evaluation Through Reasoning (METR) test measures how well an AI model can reason about its own capabilities and limitations. GPT-4.5 was tested on its ability to predict what it could accomplish within a 30-minute time horizon.

The results showed that GPT-4.5 is reasonably good at understanding what it can and can’t do, scoring 68% accuracy on these predictions. This is important because an AI that doesn’t understand its limitations could make dangerous mistakes.

  1. Apollo Research scheming analysis

Apollo Research, an independent AI safety organization, conducted special tests to see if GPT-4.5 could be tricked into creating harmful plans. They used a technique called “scheming analysis” that looks for clever ways to get around safety measures.

Their results showed that GPT-4.5 is less vulnerable to these tricks than GPT-4o, but still has some weaknesses. The model scored 35% lower than OpenAI’s o1 model on scheming tests, which is good news for safety.

Apollo Research published their findings publicly, which helps other researchers understand the model’s strengths and weaknesses.

  1. Anthropic red-teaming collaboration

OpenAI also worked with Anthropic (makers of Claude) to test each other’s models. This unusual collaboration between competitors helped find safety issues that internal testing might have missed.

The joint testing found several new types of safety problems that neither company had discovered on their own. Both companies then improved their safety measures based on these findings.

Documentation Gaps

While OpenAI has shared some information about GPT-4.5, important details are still missing. These gaps make it harder for outside experts to fully evaluate the model’s risks.

  1. No detailed pre-training dataset disclosure

OpenAI hasn’t revealed exactly what data was used to train GPT-4.5. They’ve only said it includes “a diverse range of internet text, books, and other sources up to April 2024.”

This lack of transparency raises several concerns:

  • We don’t know if copyrighted material was used without permission
  • It’s unclear how much non-English content was included
  • We can’t tell if the training data contains harmful biases

Many AI ethics experts have called on OpenAI to share more details about their training data. Without this information, it’s hard to fully understand what biases or gaps might exist in the model.

  1. Limited failure mode analysis

OpenAI has shared some examples of where GPT-4.5 makes mistakes, but their public documentation doesn’t include a comprehensive analysis of failure modes.

A failure mode analysis would show:

  • Common types of errors the model makes
  • Situations where it’s likely to hallucinate
  • Patterns in its reasoning mistakes
  • Edge cases where safety measures fail

Without this information, users might trust the model in situations where it’s likely to fail. This could lead to bad decisions based on incorrect AI outputs.

  1. Incomplete model card

While OpenAI published a system card for GPT-4.5, it lacks some important details that are standard in model cards for other AI systems. Missing information includes:

  • Environmental impact of training
  • Details about the evaluation datasets
  • Quantitative results on bias tests
  • Limitations for specific languages and cultures

These gaps make it harder for organizations to make informed decisions about whether and how to use GPT-4.5.

OpenAI has said they plan to release more documentation over time, but some researchers argue that this information should be available before a model is widely deployed, not after.

The balance between transparency and security is tricky. Sharing too much could help bad actors misuse the technology, but sharing too little prevents proper oversight and evaluation. Finding the right balance remains a challenge for the entire AI industry.

Future Directions

GPT-4.5 marks an important step forward in AI development, though not a revolutionary leap. It shows how AI is getting better while also becoming safer. Let’s wrap up what we’ve learned and look at what might come next. 🚀

Key Takeaways

GPT-4.5 brings several important improvements to the AI landscape. It cuts hallucination rates by 63% compared to GPT-4o, making it more trustworthy for important tasks. The model shows remarkable gains in understanding people, with a 178% improvement in PersonQA accuracy.

These improvements come from OpenAI’s hybrid training approach that combines world knowledge with step-by-step reasoning. This helps the model think more carefully before answering questions.

However, GPT-4.5 isn’t the best at everything. Claude 3.7 Sonnet still beats it in coding tasks, scoring 70.3% on SWE-bench compared to GPT-4.5’s 38%. And Grok 3 outperforms it in math, especially on the challenging AIME benchmark where Grok scores an impressive 96%.

What makes GPT-4.5 special is its balance. It’s good at many different tasks rather than excellent at just one or two. This makes it ideal for creative work and business applications where flexibility matters more than specialized expertise.

Remaining Challenges

Despite significant progress, GPT-4.5 still faces several important challenges:

  1. Persuasion Risks

Perhaps the most concerning issue is GPT-4.5’s improved persuasion abilities. It scores a 72% success rate on MakeMeSay tests, which measure how well it can convince someone to take a specific action. This is much higher than previous models.

While persuasion can be helpful for legitimate purposes like education or marketing, it also creates risks. A model that’s good at persuading people could potentially:

  • Spread misinformation more effectively
  • Manipulate users emotionally
  • Convince people to make poor decisions

OpenAI has added safety measures to prevent misuse, but the high success rate on persuasion tests suggests these protections might not catch everything.

  1. Cost Barriers

At $75 per million tokens for input and $150 per million for output, GPT-4.5 is significantly more expensive than other models. This high cost creates barriers for:

  • Small businesses with limited budgets
  • Independent researchers
  • Educational institutions
  • Non-profit organizations

The price tag means many potential users will stick with older or competing models, limiting GPT-4.5’s real-world impact.

  1. Transparency Gaps

As we discussed earlier, OpenAI hasn’t shared key details about GPT-4.5’s training data, internal workings, or comprehensive failure modes. This lack of transparency makes it harder for outside experts to fully evaluate the model’s risks and benefits.

Future Research Directions

The release of GPT-4.5 points to several promising areas for future AI research:

  1. Specialized vs. General Models

The AI field seems to be splitting into two paths:

  • General-purpose models like GPT-4.5 that can handle many different tasks
  • Specialized models like o1 (reasoning) and o3-mini (STEM) that excel in specific areas

Future research will likely explore how these approaches can complement each other. We might see systems that combine general models for broad knowledge with specialized models for specific tasks.

  1. Safety-Capability Balance

Finding the right balance between capabilities and safety remains a central challenge. GPT-4.5 shows that models can become both more capable and safer, but tradeoffs still exist.

Future work will need to address questions like:

  • How can we make models safer without limiting useful capabilities?
  • What new risks emerge as models become more persuasive?
  • How can we test for unknown safety issues?
  • Multimodal Integration

GPT-4.5 builds on GPT-4o’s multimodal abilities, working with both text and images. Future research will likely expand these capabilities to include:

  • Better video understanding
  • Audio processing and generation
  • 3D spatial reasoning
  • Interactive learning from user feedback
  • Efficiency Improvements

While GPT-4.5 is 10x more computationally efficient than GPT-4o, it still requires massive computing resources. Future research will focus on making models more efficient through:

  • Better training methods that require less data
  • More efficient model architectures
  • Hardware optimizations
  • Distillation techniques to create smaller models with similar capabilities

GPT-4.5 is part of an incremental but important step in the development of AI. It demonstrates that AI systems can grow both in capability and in safety, but that absolute safety is still out of reach.

The model’s balanced capabilities are particularly suited to creative and enterprise use cases in which versatility is important. However, its expense and niche offerings mean it won’t be king of every use.

As AI evolves, the question is changing from “Can AI do this?” to “Should AI do this, and how can it do that safely?” GPT-4.5 doesn’t answer all these questions, and the answer to each is likely to be different at each time, but it does establish progress in addressing all of them.

The primary lesson from GPT-4.5 may be that development of AI isn’t a race to a single finish line, but a journey with many paths forward. Different models will meet different needs, and the models that will be most successful will most likely do so by combining the strengths of different kinds of systems while effectively managing their risks.

How do you expect A.I. to change in another year? Will we also get to see GPT-5, or the benevolent focus will get into more specialized models? The only constant is further rapid change, with new opportunities and new challenges each step of the way.

Written By :
Mohamed Ezz
Founder & CEO – MPG ONE

Similar Posts