OpenAI Introduces the o1 Family of Models Capable of Thinking and Advanced Reasoning

AI Ecosystem Updates | Issue #7 [September 30, 2024]

What?

OpenAI recently introduced a new Large Language Model (LLM) family – o1. The AI giant launched the OpenAI o1 series with OpenAI o1-preview – an early version of o1 – and OpenAI o1-mini. In its technical research post, OpenAI also shared performance results for o1 itself, the next model update in the series. The company confirmed that the new series of models is still under development and that it plans to update the models iteratively, likely incorporating user feedback gathered through preview versions such as o1-preview. Here is DeepLearning.AI’s coverage of the launch.

The o1 family of models focuses on advanced reasoning, far surpassing previous models such as GPT-4. The models were developed to tackle complex challenges across science, coding, and math by mimicking human-like problem-solving strategies using Reinforcement Learning (RL). Applications include annotating cell sequencing data in healthcare, generating quantum optics formulas in physics, running multi-step workflows for developers, and a lot more. Released to ChatGPT and trusted API users, the o1 models represent a significant leap in AI capabilities. While this blog focuses on the capabilities of o1, which is still in development, and its preview version – o1-preview, more information about o1-mini can be found here.

How it Works – Capabilities

The o1-preview and o1-mini models were trained on a combination of web-scraped data, open-source datasets, and proprietary data contributed by OpenAI and its partners. Using RL, the o1 models are designed to think deeply and productively, like a human, before responding, following a “chain of thought” approach. Behind the scenes, the models generate “reasoning tokens”, which OpenAI treats as output tokens and bills users for, even though these tokens are never shown to users.
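As a concrete illustration, here is how hidden reasoning tokens surface in a response’s token-usage accounting. The field names follow the OpenAI API’s usage object (completion_tokens_details.reasoning_tokens); the values below are made up for illustration:

```python
# Sketch: how hidden reasoning tokens appear in a response's usage payload.
# The models "think" in reasoning tokens that are billed as output tokens
# but never shown in the response text.

usage = {
    "prompt_tokens": 2_000,
    "completion_tokens": 1_000,  # billed as output tokens...
    "completion_tokens_details": {"reasoning_tokens": 900},  # ...but 900 are invisible
}

reasoning = usage["completion_tokens_details"]["reasoning_tokens"]
visible = usage["completion_tokens"] - reasoning
print(f"{reasoning} reasoning tokens billed, {visible} tokens actually shown")
```

In this hypothetical call, the user pays for 1,000 output tokens but sees only 100 of them in the answer.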

The “chain of thought” reasoning is a key feature that allows the models to break down complex tasks. RL helps the models learn from their mistakes and refine their thinking process and strategies during a highly data-efficient training process. This makes the models adept at handling difficult reasoning benchmarks like the AIME, a qualifying exam for the U.S.A. Mathematical Olympiad, where o1 ranks among the top 500 students nationally. It also outperforms PhD-level human experts on specialized tasks such as chemistry, biology, and physics problems.

It is worth noting that OpenAI has decided not to display the raw “chain of thought” to users: keeping it hidden allows the model to express its reasoning freely, without policy or preference constraints altering it, and monitoring this hidden chain could help detect manipulative behavior in the future. Instead, OpenAI provides a model-generated summary of the thought process, balancing user experience with competitive advantage while ensuring valuable insights still make it into the model’s answers.

Performance and Evaluation

OpenAI o1 achieved impressive results across numerous evaluations, outperforming o1-preview, while both models outperform GPT-4o and the state of the art across several ML benchmarks. See below.

  • Coding: A variant of o1 scored in the 49th percentile in the 2024 International Olympiad in Informatics (IOI). With a relaxed constraint of 10,000 submissions per problem, the model achieved a score of 362.14 – above the gold medal threshold – even without any test-time selection strategy. In competitive programming on Codeforces, o1 scored in the 89th percentile, while o1-preview ranked in the 62nd percentile. Specifically, the further-trained variant achieved an Elo rating of 1807, outperforming o1, GPT-4o (Elo rating of 808), and 93% of human competitors.
  • Math: In a qualifying exam for the International Mathematics Olympiad (IMO), GPT-4o solved just 13% of the problems, while the advanced reasoning model o1 achieved a much higher success rate of 83%. On AIME, o1 solved 74% of problems on average with a single sample per problem, rising to 83% with consensus among 64 samples and 93% when re-ranking 1,000 samples with a learned scoring function – placing o1 among the top 500 students in the U.S.A. In comparison, GPT-4o solved only 12% of AIME problems.
  • Science and Other Categories: o1 achieved PhD-level competency in scientific tasks and surpassed human experts on GPQA-diamond questions in physics, chemistry, and biology. It also outperformed GPT-4o on 54 of 57 MMLU subcategories. With its visual perception capabilities, o1 scored 78.2% on MMMU, making it competitive with human experts.
  • Safety: o1-preview showed robust safety behavior, improving on benchmarks for resisting jailbreaks and harmful content, with scores such as 93.4% on challenging refusal cases. On one of OpenAI’s toughest jailbreak evaluations, the model scored 84, while GPT-4o managed just 22.
  • Human Preference Evaluation: The o1-preview model is strongly favored over GPT-4o in tasks requiring extensive reasoning, such as data analysis, coding, and mathematics. However, it is less preferred by human trainers for certain natural language tasks, indicating that it may not be optimal for every application.
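The sampling strategies cited for AIME above – consensus (majority vote) over many samples and re-ranking with a learned scorer – can be sketched generically. This is a toy illustration: the sampled answers and the scoring function are made up, and o1’s actual scoring model is not public.

```python
from collections import Counter

def consensus_answer(samples):
    """Majority vote ('consensus') over independently sampled answers."""
    return Counter(samples).most_common(1)[0][0]

def reranked_answer(samples, score_fn):
    """Pick the sampled answer that a (learned) scoring function rates highest."""
    return max(samples, key=score_fn)

# Toy stand-in: 8 sampled answers to one AIME-style problem (integers 0-999).
samples = [204, 204, 197, 204, 550, 204, 197, 204]
print(consensus_answer(samples))  # the most frequent answer wins
# Toy scorer standing in for a learned re-ranker (here it favors 204).
print(reranked_answer(samples, lambda a: -abs(a - 204)))
```

Consensus works well when independent samples agree on the correct answer more often than on any single wrong one; re-ranking trades that simplicity for a trained model’s judgment.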

OpenAI o1-mini

For developers requiring an efficient and cost-effective alternative to o1-preview, OpenAI offers OpenAI o1-mini – a faster, cheaper model that is particularly good at STEM reasoning, especially coding and math. OpenAI o1-mini is a powerful yet efficient option for use cases that require advanced reasoning but not comprehensive world knowledge. More here.

Safety and Alignment

OpenAI’s o1-preview model introduces advancements in safety by integrating human-aligned behavior and values directly into its chain of thought. This approach allows the company to monitor o1-preview’s thought process and also allows the model to reason through safety guidelines, making it more robust in unpredictable scenarios. Extensive testing using the company’s Preparedness Framework, including difficult jailbreak evaluations, demonstrates o1-preview’s superior adherence to safety protocols (see Performance and Evaluations).

OpenAI has strengthened its internal governance, red teaming, and safety training and review processes to match the capabilities of the o1 models. To further its commitment towards a safer and more responsible AI ecosystem, the leading AI company has also forged partnerships with the federal government and AI Safety Institutes in the U.S.A. and U.K., facilitating further research and safety testing for future models. More here. Additionally, here is the system card outlining the safety work undertaken by OpenAI prior to the release of the o1-preview and o1-mini models.

Availability and Usage

ChatGPT Plus and Team users can access both o1-preview and o1-mini in the model picker, with a rate limit of 50 queries per week for each model. ChatGPT Enterprise and Edu users also have access to both models. API developers who qualify for OpenAI’s tier 5 usage can prototype with both models at 20 RPM. API documentation here. OpenAI plans to raise these limits and is gradually extending o1-mini access to ChatGPT Free users.
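For developers working within the 20 RPM tier-5 limit, a simple client-side guard can avoid hammering the API. This is an illustrative sketch only – the server enforces the real limit, and the class and its parameters are our own, not part of any OpenAI SDK:

```python
from collections import deque

class RateLimiter:
    """Client-side sliding-window limiter, e.g. for the 20 RPM tier-5 cap."""

    def __init__(self, max_requests=20, window_s=60.0):
        self.max_requests = max_requests
        self.window_s = window_s
        self.sent = deque()  # timestamps of requests inside the current window

    def allow(self, now: float) -> bool:
        """Return True (and record the request) if one more call fits the window."""
        while self.sent and now - self.sent[0] >= self.window_s:
            self.sent.popleft()  # expire timestamps older than the window
        if len(self.sent) < self.max_requests:
            self.sent.append(now)
            return True
        return False

limiter = RateLimiter()
# A burst of 25 requests at t=0: only the first 20 pass.
print(sum(limiter.allow(0.0) for _ in range(25)))
```

Passing the clock in explicitly (rather than calling time.time() inside) keeps the sketch deterministic and easy to test.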

In terms of cost, o1-preview costs $15/$60 per million input/output tokens, much higher than GPT-4o’s $5/$15. o1-mini, on the other hand, costs $3/$12 per million input/output tokens. Additionally, the maximum output for o1-preview is 32,768 tokens, including reasoning tokens, while that of o1-mini is 65,536. OpenAI recommends budgeting about 25,000 tokens specifically for reasoning. The input context window for both the preview and mini models is 128,000 tokens.
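Putting these numbers together, a small calculator shows how the pricing and the output limit interact. The prices and limits come from this section; the helper names are ours:

```python
# Per-million-token prices (input, output) in USD, as quoted above.
PRICES = {
    "o1-preview": (15.0, 60.0),
    "o1-mini": (3.0, 12.0),
    "gpt-4o": (5.0, 15.0),
}
O1_PREVIEW_OUTPUT_LIMIT = 32_768       # includes hidden reasoning tokens
RECOMMENDED_REASONING_BUDGET = 25_000  # tokens OpenAI suggests reserving

def call_cost(model, input_tokens, output_tokens):
    """Dollar cost of one call at the quoted per-million-token rates."""
    inp, out = PRICES[model]
    return (input_tokens * inp + output_tokens * out) / 1_000_000

# Reserving the recommended reasoning budget leaves only ~7,768 tokens
# of visible answer inside o1-preview's output limit.
visible_budget = O1_PREVIEW_OUTPUT_LIMIT - RECOMMENDED_REASONING_BUDGET
print(visible_budget)
# A maxed-out o1-preview call vs the same token counts on GPT-4o:
print(call_cost("o1-preview", 10_000, 32_768))
print(call_cost("gpt-4o", 10_000, 32_768))
```

The comparison makes the cost gap concrete: the same token counts cost roughly 4x more on o1-preview than on GPT-4o, before accounting for the fact that much of o1-preview’s output budget goes to reasoning the user never sees.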

Limitations and Next Steps

Being early versions, the o1 models currently lack several features that enhance ChatGPT’s functionality, such as browsing the web for information or handling file and image uploads. OpenAI acknowledges that GPT-4o will remain more effective for many typical use cases. To improve the user experience, the company is working on adding browsing, file and image uploads, and other capabilities in future iterations of the o1 models. Additionally, the API for these models does not yet support function calling, streaming, system messages, and other features – something OpenAI plans to change in the future.

OpenAI reports that o1’s performance improves consistently with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute). The constraints on this scaling approach differ significantly from those of LLM pretraining, and the company plans to continue investigating them. Lastly, OpenAI is committed to continuing to improve the robustness and safety of the o1 models and their alignment with human values and principles.

Thoughts

Recently, OpenAI’s co-founder and CEO Sam Altman expressed confidence that superintelligence is achievable within a few thousand days and said that OpenAI is working towards it. The thinking o1 models are undoubtedly a step towards that goal. Agentic workflows tailored for particular use cases enhance the ability of an AI system to reflect, reason, and improve its output iteratively. As this blog rightly points out, incorporating the ability to iteratively think, reason, and reflect directly into responses to even general-purpose questions opens up new opportunities for improved reasoning in LLMs.

The use of reasoning tokens makes the o1 series slower and costlier than GPT-4o. We are confident that OpenAI will optimize on that front in the next iterations of the o1 models and deliver more compute- and cost-efficient models for its users. We expect the cost of the models to drop over time and await eased rate limits and broader feature availability. Alongside the full release of o1, we eagerly await support for multimodal capabilities, browsing, function calling, streaming, system messages, and other features in the future versions of the o1 models.

The o1 models represent a significant step towards reasoning-based AI systems. This opens up a whole new world of opportunities for applications (as well as research) across domains. As mentioned earlier, this would directly impact progress in science (healthcare, physics, education, and more) and software engineering. However, hiding the chain of thought makes the models less explainable and transparent. We do understand that OpenAI had to strike a balance between competitive edge and user experience. We are confident that the AI behemoth will work on making the models more explainable in the future.

The future of AI is exciting, and OpenAI is leading from the front with rapid, game-changing innovations. The “chain-of-thought” reasoning paradigm will trigger more open-source innovation aiming to match or surpass o1’s performance. This healthy competition is sure to accelerate innovation and growth in AI. To that end, we wish OpenAI the very best in their persistent endeavours to achieve superintelligence.

If you are interested to learn more, feel free to check out this blog post and this technical research post by OpenAI, and this article by DeepLearning.AI.

Thank you for reading through! I genuinely hope you found the content useful. Feel free to reach out to us at [email protected] and share your feedback and thoughts to help us make it better for you next time.


Acronyms used in the blog that have not been defined earlier: (a) Artificial Intelligence (AI), (b) Application Programming Interface (API), (c) United States of America (U.S.A.), (d) American Invitational Mathematics Examination (AIME), (e) Doctor of Philosophy (PhD), (f) Machine Learning (ML), (g) Science, Technology, Engineering, and Mathematics (STEM), (h) United Kingdom (U.K.), and (i) Requests Per Minute (RPM).