OpenAI released two major new preview models today: o1-preview and o1-mini (that mini one is also a preview, despite the name)—previously rumored as having the codename “strawberry”. There’s a lot to understand about these models—they’re not as simple as the next step up from GPT-4o, instead introducing some major trade-offs in terms of cost and performance in exchange for improved “reasoning” capabilities.
- Trained for chain of thought
- Low-level details from the API documentation
- Hidden reasoning tokens
- Examples
- What’s new in all of this
## Trained for chain of thought

OpenAI's elevator pitch is a good starting point:

> We've developed a new series of AI models designed to spend more time thinking before they respond.

One way to think about these new models is as a specialized extension of the chain of thought prompting pattern—the "think step by step" trick that we've been exploring as a community for a couple of years now, first introduced in the paper Large Language Models are Zero-Shot Reasoners in May 2022.

OpenAI's article Learning to Reason with LLMs explains how the new models were trained:

> Our large-scale reinforcement learning algorithm teaches the model how to think productively using its chain of thought in a highly data-efficient training process. We have found that the performance of o1 consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute). The constraints on scaling this approach differ substantially from those of LLM pretraining, and we are continuing to investigate them.
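To make the contrast concrete, here's a minimal sketch (assuming the official `openai` Python package, an `OPENAI_API_KEY` in the environment, and a question prompt of my own invention) comparing the manual "think step by step" pattern against simply asking o1-preview, which does its chain-of-thought reasoning internally before it answers:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = (
    "A bat and a ball cost $1.10 together. The bat costs $1.00 "
    "more than the ball. How much does the ball cost?"
)

# Classic chain-of-thought prompting: explicitly ask a general-purpose
# model to show its working before committing to an answer.
cot = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": f"{question}\n\nThink step by step, then give the final answer.",
    }],
)
print(cot.choices[0].message.content)

# o1-preview: no "think step by step" instruction needed; the model spends
# hidden reasoning tokens thinking before producing its visible reply.
o1 = client.chat.completions.create(
    model="o1-preview",
    messages=[{"role": "user", "content": question}],
)
print(o1.choices[0].message.content)
```

The second call will generally take longer and bill for those invisible reasoning tokens on top of the visible output, which is part of the cost and performance trade-off mentioned above.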