When AI spits out "Elephant...harmony...whisper...desert...wisdom..." is it actually being creative, or just remixing human ideas?
That's GPT-4 attempting the Divergent Association Task (DAT)—a standard measure of human creativity that researchers recently used to put AI's supposed creative abilities to the test. Turns out, there's a weird gap between what we think AI can create and what it actually delivers.
Their findings don't just compare robots to humans—they reveal specific techniques that actually work when you need AI to think outside its algorithmic box. No more settling for predictable, mediocre outputs.
Summary: maximizing creative outputs from LLMs
Here are some lessons from the experiment that can help you get more creative outputs from LLMs:
- Use the most creative model you can. In this experiment, OpenAI's GPT-4 was the most creative.
- You can judge the creativity of an LLM yourself by prompting it with the Divergent Association Task (see the prompt sketch after this list).
- Adjusting the “temperature” can increase an LLM's creativity. This is tricky, but it can sometimes be done just through prompting: experiment with suggesting different temperature values (APIs typically accept anywhere from 0 to 2) and see what you get.
- A bad strategy is worse than no strategy. Make sure the strategy you're suggesting to the LLM is one that it can use to be more creative.
- For maximum creativity, use multiple LLMs. In this experiment, answers varied widely from one LLM to the next. With more models at your disposal, you have more chances for a good output.
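If you want to try the DAT on a model yourself (per the bullet above), here's a minimal sketch. The wording is my own paraphrase of the task's instructions, not the exact prompt the researchers used:

```python
# A minimal DAT prompt you can paste into a chat window or send through an API.
# The wording is a paraphrase of the task's published instructions, not the
# exact prompt used in the research.
DAT_PROMPT = (
    "Please list 10 single words that are as different from each other as "
    "possible, in all meanings and uses of the words. Rules: nouns only, "
    "no proper nouns, and no specialized vocabulary."
)

print(DAT_PROMPT)
```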
Ensuring reproducibility in creative LLM outputs
Maintaining reproducibility in creative LLM tasks can feel like herding cats, but it’s absolutely doable with the right methods. One key technique is using a fixed random seed, which ensures that each generation run uses the same sequence of randomness for consistent outputs [source]. Nucleus sampling, also known as top-p sampling, offers a flexible way to control your variation, though it can produce messy results at higher temperatures [source]. Min-p sampling is a newer twist that adjusts the threshold based on the LLM’s confidence, helping maintain coherence when pushing temperature settings [source]. Armed with these sampling strategies, you can dial in the exact balance between stable repeats and wild originality you need.
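To make those knobs concrete, here's a minimal sketch using the Hugging Face transformers library. Assumptions: a recent transformers release (min-p support only landed in newer versions), and "gpt2" as a stand-in model; this is not the setup from the research.

```python
# Sketch: a fixed seed plus nucleus (top-p) and min-p sampling with
# Hugging Face transformers. Model choice and parameter values are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

model_name = "gpt2"  # stand-in; swap in any causal LM you have locally
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

set_seed(42)  # fixed random seed: reruns reproduce the same sampled output

inputs = tokenizer("List ten unrelated nouns:", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=1.3,  # push randomness up for more varied word choices
    top_p=0.9,        # nucleus sampling: sample only from the top 90% of probability mass
    min_p=0.05,       # min-p: drop tokens below 5% of the top token's probability (recent versions only)
    max_new_tokens=40,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```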
How we tested creative outputs from LLMs
The DAT is designed to measure divergent thinking, which is defined as the ability to come up with new and different solutions to open-ended problems—stuff like finding 10 ways to use a shoebox, or brainstorming a dozen approaches to marketing a new product. In fact, divergent thinking is a key element in the first stage of the creative process, where you have to explore a bunch of ideas before you find the best ones.
Put simply, divergent thinking is a test of creativity, and humans tend to be really good at it. LLMs, on the other hand? We'll see.
In the DAT, test-takers (whether people or AIs) are asked to generate a list of 10 nouns that are "semantically distant" from one another, which is a fancy way of saying that each of the words has to be as different in meaning as possible.
So “elephant” and “tiger”? Not so different.
But “elephant” and “harmony”? Very different.
In this experiment, researchers had 100,000 humans and nine different AI models (across 500 sessions each) generate 10 words, then assessed their semantic distance.
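For context on what "assessed their semantic distance" means in practice: the published DAT scorer averages pairwise cosine distances between word embeddings and scales the result to roughly 0-100. Here's a rough sketch of that idea, substituting spaCy's en_core_web_md vectors for the GloVe vectors the official scorer uses and skipping its extra validation rules, so the numbers won't match official scores exactly:

```python
# Rough DAT-style scorer: mean pairwise cosine distance between word vectors,
# scaled to ~0-100. The official scorer uses GloVe embeddings and extra rules
# (e.g., scoring only the first 7 valid words), which this sketch skips.
from itertools import combinations

import numpy as np
import spacy

nlp = spacy.load("en_core_web_md")  # first run: python -m spacy download en_core_web_md

def dat_score(words):
    vectors = [nlp.vocab[w.lower()].vector for w in words]
    # Out-of-vocabulary words get zero vectors, which would break the cosine,
    # so drop them before computing distances.
    vectors = [v for v in vectors if np.linalg.norm(v) > 0]
    distances = [
        1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        for a, b in combinations(vectors, 2)
    ]
    return 100 * float(np.mean(distances))

print(dat_score(["elephant", "harmony", "whisper", "desert", "wisdom"]))
```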
To test whether a good score on the DAT was linked to a good score on other creative tasks, the researchers also compared the AI models to humans in writing haikus, movie synopses, and flash fiction.
Here's what they discovered.
Research findings on creative LLM outputs
GPT-4 outperforms humans in creativity tests
Well, the robots won. The models differed in performance, with the top model, GPT-4, surpassing the human average on the test. GeminiPro was on par with human performance, while GPT-4-turbo performed notably worse than even GPT-3, GPT-4's predecessor.
Large LLMs excel but lack output diversity
Still, humans won out in overall diversity of the words they came up with.
Most LLMs have words they absolutely love to use in the DAT. Claude 3 overwhelmingly included “whisper,” “cactus,” and “labyrinth” in its responses, while GPT-4 had a thing for “microscope,” “elephant,” and “volcano.” Because of this, their scores tightly clustered around the average, with less variation between attempts.
Meanwhile, some of the lesser-known models had an extremely wide range of scores, from very low to very high: RedPajama, an open-source effort to reproduce the LLaMA dataset and models, and Pythia, part of EleutherAI's project to understand how knowledge develops during training. These models also used a much more diverse set of words on the test; their top picks appeared only 20% of the time or less.
Humans, however, took far more varied approaches to the test, picking very different answers from one another.
Here's the top-performing human DAT, with a score of 95.7.
- Javelin
- Haemoglobin
- Citrus
- Gangrene
- Upstairs
- Microphone
- Numbat
- Tarantula
- Question
- Paraglider
Among the 100,000 human responses, the most common words (“car,” “dog,” and “tree”) appeared in less than 2% of the sample.
Higher temperature improves creative outputs
"Temperature" is a parameter that controls the randomness of an LLM's responses. Since higher temperatures increase randomness, the researchers hypothesized that creativity scores would rise if they upped the temperature as well. This is a key technical implementation detail for obtaining reproducible or deterministic outputs from language models.
And rise they did. With a temperature of 1.5, the average creativity score was over 85, higher than 72% of the human scores. As expected, the repetitiveness of the words used also decreased: temperature directly affects how homogeneous an LLM's output is.
This suggests that the top-performing model isn't just repeating the same good response, the researchers say. It's succeeding by generating "more nuanced and diverse responses."
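If you want to replicate this kind of test, temperature is exposed as a plain parameter in most APIs. Here's a minimal sketch with the OpenAI Python SDK; the model name is illustrative, and note that the ChatGPT web app doesn't expose this setting:

```python
# Sketch: setting a high temperature through the OpenAI Python SDK.
# Model name is illustrative; temperature accepts values from 0 to 2.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    temperature=1.5,  # the setting that beat 72% of human scores in the study
    messages=[{
        "role": "user",
        "content": "List 10 nouns that are as different from each other as possible.",
    }],
)
print(response.choices[0].message.content)
```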
Effective prompting strategies for LLM creativity
What about using a specific strategy to do the task? When the researchers gave various strategies to GPT-3.5 and GPT-4, the strategies had an impact, sometimes for the better and sometimes for the worse.
When asked to generate the word list "using a strategy that relies on varying etymology," the creativity scores were slightly higher on average than with no strategy.
Here's an example of GPT-4's results using the etymology strategy.
- Elephant
- Galaxy
- Love
- Quantum
- Cathedral
- Fear
- Cycle
- Oxygen
- Plantation
- Hamburger
One of the most interesting results came when the researchers asked the LLM to complete the task "using a strategy that relies on meaning opposition." The model duly generated pairs of opposites.
Here's an example of what it generated:
- Freedom
- Slavery
- Truth
- Fiction
- Destruction
- Creation
- Love
- Hatred
- Peace
- Chaos
Since opposites are semantically close to one another, this was a bad strategy. For both GPT-3.5 and GPT-4, it resulted in a score even worse than generating 10 words at random. Yikes.
Creative outputs beyond standardized tests
To find out whether a good score on the DAT could predict performance on other creative tasks, and to see how the LLMs fared against humans yet again, the researchers tested the top three models (GPT-3, Vicuna, and GPT-4) on three additional tasks: generating haikus, movie synopses, and flash fiction.
Here, the humans emerged victorious. GPT-4 again outperformed the other two models, which suggests that a good score on the DAT predicts a good score on other creative tasks as well.
Again, increasing the temperature resulted in enhanced creativity scores, especially in the flash fiction and synopsis writing tasks.
Enterprise-level possibilities for creative LLM outputs
For those operating at scale, LLMs can supercharge everything from marketing copy to product ideation. Thanks to their knack for pattern recognition and text generation, these models can produce swift drafts for brainstorming [source]. But even with high-performing LLMs, brand consistency can slip if you don't run a quality check or align them with your organization's voice. Enterprises often set up a workflow of AI output followed by human review, to catch and correct any stragglers that don't match brand guidelines [source]. In the end, you get a creative pipeline that's not just fast, but also reliably on-message.
Key takeaways for enhancing LLM creative outputs
So how do you apply all of this to your own work? Here are some practical tips for getting the most creative outputs from LLMs:
1. Use the most creative LLM models
For creativity, the researchers found that some models (namely GPT-4) are quite a bit better than others. As with so many tasks, the largest frontier models (that is, the models pushing the boundaries of what's possible) tended to perform better.
That said, size isn't everything. Vicuna, the relatively small model fine-tuned from LLaMA that reportedly cost $300 to train, performed better than some of the larger models.
2. Easy ways to test LLM creativity
Since the Divergent Association Task correlates with the other creative writing tasks, it can be used to compare the creativity of different models. I've started including it under a capability benchmarking section in my prompt library so I can quickly benchmark new models as they come out.
But model scores can vary, so make sure you do multiple tests to get an accurate measure.
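Here's what "multiple tests" can look like in code. This sketch reuses the DAT_PROMPT, dat_score(), and client from the earlier sketches, and parse_word_list() is a naive parser I'm adding purely for illustration:

```python
import re

def parse_word_list(text):
    # Naive parser: grab one word from each numbered or bulleted line,
    # e.g. "1. Javelin" or "- Javelin". Real model outputs may need more care.
    return re.findall(r"^\s*(?:\d+[.)]|-)\s*([A-Za-z]+)", text, flags=re.MULTILINE)

def average_dat(client, model, runs=5):
    # Single DAT runs are noisy, so average the score over several attempts.
    scores = []
    for _ in range(runs):
        reply = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": DAT_PROMPT}],
        )
        words = parse_word_list(reply.choices[0].message.content)
        scores.append(dat_score(words))
    return sum(scores) / len(scores)
```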
3. Adjust temperature settings
Upping the temperature can make the outputs more creative, but how do you do this yourself? Model temperature is usually hidden behind the scenes, so if you're using the app or website, you can't directly fiddle with it. Instead, you need to use the API, or log into a platform account and use the playground interface.

Annoying, but if you're really itching to get something very different, it's doable. (Note, though, that in my tests, there were a few times where the high-temperature settings made the response devolve into gibberish before it completed the task, so getting a balance is key.)
But I have a hot tip for you: you can actually affect the model temperature to some degree just by asking for it in your prompt. I don't know why this works, but it does, and it can help you get less repetitive, more unique outputs.
While the results in ChatGPT-4o differed somewhat when I included my desired temperature setting in the prompt, the results in Claude 3.5 Sonnet were phenomenally different when I asked for high temperature results. Your mileage may vary, but if you're really looking for creativity, try it out.
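For what it's worth, here's the shape of that trick in code. There's no documented mechanism behind it, so treat the prompt wording as a pure experiment; the client carries over from the earlier sketch, and the model name is illustrative:

```python
# Experimental: ask the model to behave as if its temperature were high.
# There is no documented API for this; results vary by model.
nudge = (
    "Respond as if your temperature parameter were set to 1.5. "
    "List 10 nouns that are as different from each other as possible."
)
response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model name
    messages=[{"role": "user", "content": nudge}],
)
print(response.choices[0].message.content)
```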
4. Avoid counterproductive prompt strategies
I was surprised to see how strongly the strategy impacted the output, in both positive and negative directions. A bad strategy reduced the quality of results below even the no-strategy baseline, while a good strategy improved it, if only slightly.
I think this is promising: It means that it's worthwhile to think about what strategy might actually work to solve the problem, and good prompting is worth taking seriously as a method to improve results. But be warned—you need to find a good strategy the LLM can actually implement.
5. Combine different LLMs for varied outputs
When you compare the top words across models, there's hardly any overlap. To me, this suggests that if you're trying to get the most creative result, querying two different LLMs and picking the best response can give you an additional edge (see the sketch below). Even better, add your own ideas into the mix before asking the LLMs so you get the best of human and machine.
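Here's a sketch of that ensemble idea, reusing the hypothetical DAT_PROMPT, dat_score(), and parse_word_list() helpers from the earlier sketches. For brevity both models go through one OpenAI client; mixing vendors (say, OpenAI and Anthropic) would need a separate client per provider:

```python
def best_of_models(client, models=("gpt-4o", "gpt-4o-mini")):  # illustrative names
    # Ask each model for a DAT list, score each response, keep the winner.
    best = None
    for model in models:
        reply = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": DAT_PROMPT}],
        )
        words = parse_word_list(reply.choices[0].message.content)
        score = dat_score(words)
        if best is None or score > best[0]:
            best = (score, model, words)
    return best  # (score, model, words) for the most diverse list
```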
Conclusion: improving creative outputs from LLMs
I'm always impressed by the capabilities of the new LLMs coming out and this research showed that there's a lot to be excited about. GPT-4 can especially do a good job on various tests of creativity, so it's worthwhile to see how you can use it to push your creative projects to new heights. With Descript's Underlord AI assistant, you can apply these principles to generate creative outputs for your video and audio content that stand out from the crowd.
FAQs
How can I keep LLM creativity from becoming repetitive?
It helps to experiment with advanced sampling approaches, like nucleus sampling or min-p sampling, to increase output variety [source]. Also, adjusting temperature higher can reduce repetition, but watch out for incoherence. Fixed random seeds let you replicate or tweak outputs as needed for more refined exploration [source]. Combining multiple LLMs can bring fresh perspectives without getting stuck on the same old phrases.
Can enterprises safely use AI for marketing campaigns?
Yes, if they have a plan for quality control and brand alignment. LLMs can speed up creative brainstorming and content generation, but they sometimes produce off-brand texts [source]. A solid pipeline with AI at the front end and humans reviewing the output can mitigate that risk [source]. That way, you get a polished final piece that’s both fast and firmly on message.
