When I tell people that I have a degree in math, they have one of two responses: either “I hated math,” or “Why would you do that?!”
I get it; the world is not really filled with math lovers. That’s probably why so many people want to outsource their math to AI tools like ChatGPT. But you should avoid relying too heavily on AI for math—Large Language Models (LLMs) are still only B students at best. For instance, ChatGPT-4 scores 76.6% in math problem solving, while Claude 3.5 Sonnet manages 71.1%. And at the moment, these two models are the best LLMs out there.
But even if you aren’t using AI tools for math problems, there are some reasons this particular issue should be important to you: most importantly, it’s an illustration of how you can’t know where your LLM will succeed and where it’ll fail miserably.
We expect computers to be good at math, but LLMs aren't
In the 1950s, as AI was just emerging and the term "Artificial Intelligence" was freshly coined, researchers pondered what constituted the most human kind of thought. Language? Science? Creativity? Strategy? All were top contenders, but the pinnacle of human intelligence was widely agreed to be…mathematics.
It might sound strange now, but it makes sense if you consider that humans are the only species to have invented things like geometry proofs, calculus, and linear algebra. These special reasoning skills have enabled us to understand and predict so much about our natural world.
As a result, mathematical reasoning was one of the first kinds of "human thought" to be tackled by AI. Fast forward to today, and math is still everywhere in computers: Excel formulas, Python code, cryptography, data analysis tools, financial modeling software, and even in the algorithms that power your favorite social media platforms.
Math is one of those things that we expect computers to excel at because they've been such calculating powerhouses in the past. However, when working with LLMs, all these expectations need to be revisited. We need to learn a new way of thinking about what this technology is good at—and what it isn't.
Why ChatGPT is bad at math
When Anthropic's new model Claude 3.5 Sonnet came out, many people shared the model's stats in comparison to other frontier models like GPT-4o and Gemini. While most people were oohing and ahhing over the model's scores as compared to GPT-4o, another thing caught my eye: we're still pretty far off from 100% in graduate-level reasoning and math problem-solving.
At their core, LLMs are text-prediction engines with a degree of randomness. They rely on probability. Even if they seem to be reasoning logically, they’re actually using pattern recognition instead of showing true mathematical understanding. Meanwhile, mathematics has specific structures: formulas, equations, and relationships between concepts. LLMs don't (yet) capture these features in the way a purpose-built math system would.
LLMs can also be sensitive to how questions are worded. This is both an advantage and disadvantage—it means smart prompting can help LLMs like ChatGPT be better at math. But it also means that their abilities are inconsistent, and as a result, you can't rely on their answers without double-checking them.
We're on the “Jagged Technological Frontier”
Even if math isn't your thing, if you use LLMs for various tasks, you should know how LLMs' struggles with math relate to the "Jagged Technological Frontier."
The Jagged Technological Frontier refers to the uneven landscape of an AI's abilities. Imagine a border between two places that’s not a smooth, straight line, but a jagged, unpredictable edge. This edge represents the boundary between what AI can and can't do reliably. The "jaggedness" comes from the fact that these abilities aren't uniform or easily predictable.
One of the biggest challenges of working with LLMs is knowing when you're pushing up against this frontier. After all, from your perspective, you're just writing a prompt. It seems like it should be pretty much the same process regardless of what you're asking the AI to do. So it can be surprising and frustrating when LLMs excel at some tasks that seem complex, but then stumble on others that appear simple.
Here's the thing: none of us, not even the developers of these models, know exactly where this jagged frontier lies. It's not a fixed boundary, and it can shift depending on factors like how a prompt is phrased or what the model has been trained on. This uncertainty is part of what makes working with LLMs challenging.
Adding to this complexity is the fact that different LLMs have different strengths and weaknesses. The jagged frontier for ChatGPT-4 isn't the same as the one for Claude 3.5 Sonnet or Google's Gemini. A task that falls within the capabilities of one model might be beyond the frontier for another.
So, what does this mean for you as a user of these technologies? In essence, it calls for a balanced approach of curiosity and caution. Stay vigilant and pay attention to the output you're getting, especially for critical tasks where accuracy is important. Don't assume that because an LLM handled one task well, it’ll be just as good at everything else. It's often worth experimenting with different models, as each has its own strengths and weaknesses. As you work more with these tools, you'll start to develop a feel for what kinds of tasks they excel at and where they tend to falter. Finally, remember that clear, specific prompts often yield the most useful results.
Which brings us to…
How to make ChatGPT better at math right now
Out of the box, LLMs generally perform poorly at most math and reasoning problems, but there are several advanced prompting methods that can give them a boost. We covered some of them in more detail in our advanced prompts article.
Here are some techniques you can try:
1. "Let's think step by step"
This technique sounds so simplistic that it's hard to believe it works. But just adding "Let's think step by step" to your prompt leads to an improvement. Wording the prompt this way makes the LLM spell out its reasoning before giving an answer, rather than jumping straight to a final response. It's as if asking the LLM to "show its work" leads it to reason more effectively.
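In practice, this is nothing more than appending the magic phrase to your question. Here's a minimal sketch in Python (the helper name and the sample question are just illustrations):

```python
def zero_shot_cot(question: str) -> str:
    """Append the zero-shot chain-of-thought trigger to a question.

    The trailing phrase nudges the model to write out its intermediate
    reasoning before committing to a final answer.
    """
    return f"{question}\n\nLet's think step by step."

prompt = zero_shot_cot("A pen costs $3. How much do 7 pens cost?")
```

You'd then send `prompt` to whichever model you're using, exactly as you would any other message.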
2. Chain of Thought
Instead of asking the LLM to make up its own steps, like in the previous prompt style, Chain of Thought asks the LLM to go through a specific set of intermediate steps to get an answer. The catch is, you need to know the right steps in advance, so it doesn't work for all kinds of problems.
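For example, a Chain of Thought prompt for a distance-rate-time problem might spell out the steps up front. This sketch assumes you already know a step sequence that fits the problem type; the particular steps here are just one illustrative choice:

```python
COT_TEMPLATE = """Solve the problem below by working through these steps:
1. List the quantities the problem gives you.
2. Write the formula that relates them.
3. Substitute the values into the formula.
4. Compute the result and state the final answer.

Problem: {problem}"""

def chain_of_thought(problem: str) -> str:
    """Wrap a problem in a fixed sequence of intermediate steps."""
    return COT_TEMPLATE.format(problem=problem)

prompt = chain_of_thought(
    "A train travels 150 miles in 2.5 hours. What is its average speed?"
)
```

Because the steps are baked into the prompt, this works best when every problem you send follows the same recipe.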
3. Tree of Thought
If the LLM doesn't know the path to follow to get the correct answer, or if there isn't one set way, Tree of Thought can come in handy. This prompt style elaborates on Chain of Thought prompting by encouraging the LLM to explore multiple options, and then backtrack if it hits a dead end.
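A simple single-prompt version of this idea asks the model to propose several candidate approaches, rate them, and prune the weak ones. (More elaborate Tree of Thought setups make multiple model calls and search over the branches; the template below is a lightweight sketch, and its wording is just one illustrative choice.)

```python
TOT_TEMPLATE = """Propose three different approaches to the problem below.
For each approach, carry it one step forward, then rate it as
"promising", "uncertain", or "dead end". Abandon any dead ends,
backtrack to another approach if your current one stops working,
and finish with a single final answer.

Problem: {problem}"""

def tree_of_thought(problem: str) -> str:
    """Encourage the model to explore and prune multiple solution paths."""
    return TOT_TEMPLATE.format(problem=problem)

prompt = tree_of_thought(
    "How many positive integers less than 100 are divisible by both 3 and 4?"
)
```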
4. Few-shot prompting
A fancy name for "just give me some examples," few-shot prompting is a technique where you show the LLM a few worked examples of the reasoning and results you want. The model then uses those examples as a pattern to imitate when it attempts the new problem.
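Concretely, a few-shot prompt is just worked examples followed by the new question. The two speed problems below are illustrative examples, not anything special:

```python
FEW_SHOT_TEMPLATE = """Q: A train travels 60 miles in 1.5 hours. What is its speed?
A: Speed = distance / time = 60 / 1.5 = 40 mph.

Q: A cyclist rides 45 miles in 3 hours. What is their speed?
A: Speed = distance / time = 45 / 3 = 15 mph.

Q: {question}
A:"""

def few_shot(question: str) -> str:
    """Prepend worked examples so the model can imitate the pattern."""
    return FEW_SHOT_TEMPLATE.format(question=question)

prompt = few_shot("A car drives 120 miles in 2 hours. What is its speed?")
```

Ending the prompt with a bare `A:` invites the model to complete the pattern in the same format as the examples.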
5. Have it write computer code for the problem
It sounds complicated, but it's a simple concept: ask the LLM to write and execute code to find the solution to the math problem. This plays to the strengths of LLMs: the prompt stays in natural language, while the exact calculation is handed off to code, which produces consistent, accurate results.
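For instance, asked "What is 17.5% of 2,348?", a model might produce and run a snippet like this rather than doing the arithmetic token by token (the question and code are illustrative, not output from any particular model):

```python
# The kind of code an LLM might generate for:
# "What is 17.5% of 2,348?"
result = round(2348 * 0.175, 2)  # exact arithmetic, not text prediction
print(result)
```

The arithmetic is now done by the Python interpreter, so the same prompt yields the same number every time.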
While these prompting techniques can certainly boost LLMs' mathematical performance, they're not a silver bullet. They still require human oversight and often specialized knowledge to use effectively. This brings us to a crucial question: Is this as good as it gets, or can we expect LLMs to improve in the future?
Will ChatGPT eventually pass algebra?
Over the past few years, LLMs have been steadily improving, sometimes in remarkable ways. The way LLMs are trained is a hurdle to their math ability, but it doesn't make math impossible for them to learn.
Researchers are constantly working on new techniques to improve LLMs' mathematical reasoning. These include incorporating symbolic reasoning systems, developing specialized math modules, and enhancing training data with more mathematical content. And, as always, prompting techniques keep improving.
That said, for now, my suggestion is to rely on plugin tools for any high-stakes math. And as always, double-check the work, because subtle errors can easily slip through if you aren't paying attention. Remember, even when LLMs seem confident in their mathematical answers, they can still make mistakes.