“Elephant...harmony...whisper...desert...wisdom…”
That's GPT-4 trying to ace the Divergent Association Task (DAT), a key measure of creativity that a team of Canadian researchers used to explore AI's creative potential—and how it stacks up against humans.
Their findings offer valuable insights into how you can coax more creative outputs from AI tools.
TL;DR
Here are some lessons from the experiment that can help you get more creative results from an LLM:
- Use the most creative model you can. In this experiment, OpenAI’s GPT-4 was the most creative.
- You can judge the creativity of an LLM yourself by prompting it with the Divergent Association Task.
- Adjusting the “temperature” can increase an LLM’s creativity. This is tricky, but can sometimes be done just through prompting: experiment with suggesting different temperature values (the researchers saw gains at 1.5) and see what you get.
- A bad strategy is worse than no strategy. Make sure the strategy you’re suggesting to the LLM is one that it can use to be more creative.
- For maximum creativity, use multiple LLMs. In this experiment, the different LLMs gave very different answers, with little overlap between them. With more models at your disposal, you have more chances for a good output.
The experiment
The DAT is designed to measure divergent thinking, which is defined as the ability to come up with new and different solutions to open-ended problems—stuff like finding 10 ways to use a shoebox, or brainstorming a dozen approaches to marketing a new product. In fact, divergent thinking is a key element in the first stage of the creative process, where you have to explore a bunch of ideas before you find the best ones.
Put simply, divergent thinking is a test of creativity, and humans tend to be really good at it. LLMs, on the other hand? We’ll see.
In the DAT, test-takers (whether people or AIs) are asked to generate a list of 10 nouns that are "semantically distant" from one another, which is a fancy way of saying that the words have to be as different in meaning as possible.
So “elephant” and “tiger”? Not so different.
But “elephant” and “harmony”? Very different.
In this experiment, researchers had 100,000 humans and nine different AI models (across 500 sessions each) generate 10 words, then assessed their semantic distance.
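If you're curious how "semantic distance" gets turned into a number, here's a minimal sketch of DAT-style scoring: average the pairwise cosine distances between word embeddings and scale to roughly 0-100. It assumes GloVe vectors loaded through gensim, and it's an approximation, not the researchers' exact scoring pipeline.

```python
# A rough approximation of DAT scoring: the average pairwise cosine
# distance between word embeddings, scaled to ~0-100. The official
# scorer differs in details (word cleanup, which vectors it uses).
from itertools import combinations

import numpy as np
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-300")  # pretrained GloVe word vectors

def dat_score(words):
    embeddings = [vectors[w.lower()] for w in words if w.lower() in vectors]
    distances = [
        1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        for a, b in combinations(embeddings, 2)
    ]
    return 100 * float(np.mean(distances))

# The top-scoring human list from the study, as a sanity check.
print(dat_score([
    "javelin", "haemoglobin", "citrus", "gangrene", "upstairs",
    "microphone", "numbat", "tarantula", "question", "paraglider",
]))
```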
To test whether a good score on the DAT was linked to a good score on other creative tasks, the researchers also compared the AI models to humans in writing haikus, movie synopses, and flash fiction.
Here’s what they discovered.
The results
GPT-4 takes the gold and humans get silver
Well, the robots won. The models differed in performance: GPT-4, the top model, surpassed the humans, while GeminiPro was on par with them. GPT-4-turbo did notably worse than even GPT-3, GPT-4's predecessor.
Large LLMs do well on the DAT, but tend to generate the same answers
Still, humans won out in overall diversity of the words they came up with.
Most LLMs have words they absolutely love to use in the DAT. Claude 3 overwhelmingly included “whisper,” “cactus,” and “labyrinth” in its responses, while GPT-4 had a thing for “microscope,” “elephant,” and “volcano.” Because of this, their scores tightly clustered around the average, with less variation between attempts.
Meanwhile, some of the lesser-known models, like RedPajama, an open-source effort to reproduce the LLaMA dataset and models, and Pythia, part of EleutherAI's project to understand how knowledge develops during training, had an extremely wide set of scores, from very low to very high. These models also used a much more diverse set of words on the test: the top picks were only used 20% of the time or less.
Humans, however, all had different ideas for how to ace the test, picking very different answers from one another.
Here's the top-performing human DAT, with a score of 95.7.
- Javelin
- Haemoglobin
- Citrus
- Gangrene
- Upstairs
- Microphone
- Numbat
- Tarantula
- Question
- Paraglider
Among the 100,000 human responses, the most common words (“car,” “dog,” and “tree”) appeared in less than 2% of the sample.
Upping the temperature also ups creativity
“Temperature” is a parameter that controls the randomness of an LLM’s responses. Since higher temperatures increase randomness, the researchers hypothesized that creativity scores would rise if they upped the temperature as well.
And rise they did. With a temperature of 1.5, the average creativity score was over 85, higher than 72% of the human scores. As expected, the repetitiveness of the words used also decreased.
This suggests that the top-performing model isn't just repeating the same good response, the researchers say. It's succeeding by generating "more nuanced and diverse responses."
Good prompting strategies work—and bad strategies don't
What about using a specific strategy to do the task? When the researchers gave various strategies to GPT-3.5 and GPT-4, the strategies had an impact, sometimes for the better and sometimes for the worse.
When asked to generate the word list "using a strategy that relies on varying etymology," the models scored slightly higher on average than with no strategy.
Here's an example of GPT-4's results using the etymology strategy.
- Elephant
- Galaxy
- Love
- Quantum
- Cathedral
- Fear
- Cycle
- Oxygen
- Plantation
- Hamburger
One of the most interesting results came when the researchers asked the LLM to complete the task "using a strategy that relies on meaning opposition." The model dutifully generated pairs of opposites.
Here's an example of what it generated:
- Freedom
- Slavery
- Truth
- Fiction
- Destruction
- Creation
- Love
- Hatred
- Peace
- Chaos
Since opposites are semantically close to one another, this was a bad strategy. For both GPT-3 and GPT-4, this strategy resulted in a score even worse than generating 10 words at random. Yikes.
Beyond the DAT
To find out whether a good DAT score could predict performance on other creative tasks, and how the LLMs fared against humans yet again, the researchers tested the top three models (GPT-3, Vicuna, and GPT-4) on three additional tasks: generating haikus, movie synopses, and flash fiction.
Here, the human examples emerged victorious. GPT-4 again outperformed the other two models, which suggested that a good score on the DAT meant the model would also score well on other creative tasks.
Again, increasing the temperature resulted in enhanced creativity scores, especially in the flash fiction and synopsis writing tasks.
Key takeaways
So how do you apply all of this to your own work? Here are some tips.
1. Use the most creative models
For creativity, the researchers found that some models (namely GPT-4) are quite a bit better than others. Like with so many tasks, the largest frontier models—that is, the models that are pushing the boundaries of what’s possible—tended to perform better.
That said, size isn't everything. Vicuna, the relatively small model fine-tuned from LLaMA that reportedly cost $300 to train, performed better than some of the larger models.
2. It’s easy to test an LLM's creativity
Since the Divergent Association Task correlates with performance on the other creative writing tasks, it can be used to compare the creativity of different models. I've started including the DAT under a capability benchmarking section in my prompt library so I can quickly benchmark the new models that come out.
But model scores can vary, so make sure you do multiple tests to get an accurate measure.
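For example, here's a rough benchmarking sketch that sends a DAT-style prompt to a model and scores the reply with the dat_score() helper from the earlier snippet. The prompt wording is my paraphrase of the task instructions, not the researchers' exact phrasing, and the model name is a placeholder; swap in whatever you want to test.

```python
# Benchmark one model on the DAT: prompt it, extract the 10 words,
# then score them with dat_score() from the earlier scoring sketch.
import re

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

DAT_PROMPT = (
    "Please list 10 single-word nouns that are as different from each "
    "other as possible, in all meanings and uses of the words. "
    "No proper nouns, no specialized vocabulary. One word per line."
)

reply = client.chat.completions.create(
    model="gpt-4o",  # placeholder: swap in the model you want to benchmark
    messages=[{"role": "user", "content": DAT_PROMPT}],
).choices[0].message.content

# Pull out the words, ignoring any list numbering the model adds.
words = [re.sub(r"[^A-Za-z]", "", line) for line in reply.splitlines()]
words = [w for w in words if w]
print(words, dat_score(words))
```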
3. Fiddle with the temperature
Upping the temperature can make the outputs more creative, but how do you do this yourself? Model temperature is usually hidden behind the scenes, so if you're using the app or website, you can't directly fiddle with it. Instead, you need to use the API, or log into a platform account and use the playground interface.
Annoying, but if you’re really itching to get something very different, it's doable. (Note, though, that in my tests, there were a few times where the high-temperature settings made the response devolve into gibberish before it completed the task, so getting a balance is key.)
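If you do go the API route, here's a minimal sketch using the OpenAI Python SDK. The model name and prompt are placeholders; OpenAI currently accepts temperature values from 0 to 2, and other providers expose the same knob with different ranges.

```python
# A minimal sketch of raising the temperature through the OpenAI API.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",   # placeholder model name
    temperature=1.5,  # higher = more random; push too far and you get gibberish
    messages=[{
        "role": "user",
        "content": "Write a piece of flash fiction about a lighthouse keeper.",
    }],
)
print(response.choices[0].message.content)
```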
But I have a hot tip for you: you can get a similar effect just by asking for a higher temperature in the prompt. The actual sampling parameter doesn't change, but the output often does. I don't know why this works, but it does.
While the results in ChatGPT-4o differed somewhat when I included my desired temperature setting in the prompt, the results in Claude 3.5 Sonnet were phenomenally different when I asked for high temperature results. Your mileage may vary, but if you're really looking for creativity, try it out.
4. A bad strategy is worse than no strategy
I was surprised to see how strongly the strategy impacted the output, in both positive and negative directions. A bad strategy really reduced the quality of the results, making them even worse than using no strategy, while a good strategy improved them, if only slightly.
I think this is promising: It means that it's worthwhile to think about what strategy might actually work to solve the problem, and good prompting is worth taking seriously as a method to improve results. But be warned—you need to find a good strategy the LLM can actually implement.
5. Mix & match
When you compare the top words across models, there's hardly any overlap. To me, this suggests that if you're trying to get the most creative result, using two different LLMs and picking the best response can give you an additional edge. Even better, add your own ideas into the mix before asking the LLMs so you can get the best of human and machine.
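Here's a rough sketch of that mix-and-match idea, assuming the official OpenAI and Anthropic Python SDKs and placeholder model names; how you judge or combine the two responses is up to you.

```python
# Send the same prompt to two different providers and keep whichever
# response you prefer (or merge the best ideas from both).
from openai import OpenAI
from anthropic import Anthropic

prompt = "Brainstorm 10 unusual marketing angles for a new running shoe."

openai_reply = OpenAI().chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
).choices[0].message.content

claude_reply = Anthropic().messages.create(
    model="claude-3-5-sonnet-latest",  # placeholder model name
    max_tokens=1024,
    messages=[{"role": "user", "content": prompt}],
).content[0].text

# Compare the two side by side, ideally alongside your own list.
print("GPT:\n", openai_reply, "\n\nClaude:\n", claude_reply)
```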
Conclusion
I'm always impressed by the capabilities of the new LLMs coming out, and this research showed that there’s a lot to be excited about. GPT-4 in particular did a good job on these tests of creativity, so it's worthwhile to see how you can use it to push your creative projects to new heights.