The Turing Jest

or, why computers struggle so much with comedy.

Thank you to the 1 new subscriber since the last post. With that, we are 360 strong!

I bring to you today my favorite article of the year so far, from Business Insider: “DeepMind researchers realize AI is really, really unfunny. That's a problem.”

Here are the main paragraphs:

In a study published earlier this month, Google DeepMind researchers concluded that artificial-intelligence chatbots are simply not funny.

Last year, four researchers from the UK and Canada asked 20 professional comedians who used AI for their work to experiment with OpenAI's ChatGPT and Google's Gemini. The comedians, who were anonymized in the study, played around with the large language models to write jokes. They reported a slew of limitations. The chatbots produced "bland" and "generic" jokes even after prompting. Responses stayed away from any "sexually suggestive material, dark humor, and offensive jokes."

The participants also found that the chatbots' overall creative abilities were limited and that the humans had to do most of the work.

Shubhangi Goel from Business Insider (republished by Yahoo! News)

Are we surprised? As someone who has spent far too much time studying humor (see: book), not at all.

The underlying Google DeepMind paper quotes one comedian’s review of the LLMs: “cruise ship comedy material from the 1950s, but a bit less racist”.


I want to offer a few more perspectives on why this happened: one observation, one mathematical point, and one theory for the future.

Observation

Think about what the LLMs were trained on: all the text, including all the comedy, ever.

Charlie Chaplin. “Eh, what’s up, doc?” Lucille Ball. “Alrighty then!” Richard Pryor. “More Cowbell!” Amy Schumer. “I don’t get no respect.” Aristophanes. “Bazinga!” Groucho Marx.

LLMs use this corpus to predict the next word from the patterns in their training data, and they tend to favor the most frequently occurring phrases and ideas. As a result, they gravitate toward the most common comedy tropes, which, over the past century, have unfortunately been moderately sexist. That reliance on common tropes means LLMs can perpetuate outdated and potentially harmful stereotypes: cruise ship comedy, if you will.
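To make that concrete, here is a toy sketch in Python of what “predict the most frequent next word” looks like. The miniature corpus is invented, and real LLMs use neural networks trained on billions of documents rather than bigram counts, but the failure mode is the same: the statistically common continuation wins.

```python
from collections import Counter, defaultdict

# A hypothetical four-"joke" training corpus, leaning hard on one tired trope.
corpus = (
    "take my wife please . "
    "take my advice please . "
    "take my wife seriously . "
    "take my wife please ."
).split()

# Count how often each word follows each other word (a bigram model).
next_word_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    next_word_counts[prev][nxt] += 1

def predict_next(word: str) -> str:
    """Greedy prediction: always choose the most frequent continuation."""
    return next_word_counts[word].most_common(1)[0][0]

print(predict_next("my"))    # -> "wife"   (3 of its 4 continuations)
print(predict_next("wife"))  # -> "please" (2 of its 3 continuations)
```

Scale that up a few billion times and you get the 1950s cruise ship: the model isn't choosing the trope, the trope is simply the statistical center of gravity.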

Mathematical Point

There’s a phrase in stand-up, a “tight five”: a well-crafted and polished five-minute comedy set. It should start strong, each joke should build on the last, the pacing must be perfect, and it should carry key themes the comic can drive home with a clever, memorable finale. An LLM struggles with each of these in the following ways:

Starting Strong

LLMs tend to generate responses that are safe and statistically common rather than bold and attention-grabbing. Common decoding settings (low temperature, top-p truncation) trim the surprising tail of the ‘next word’ probability distribution, which leads to more generic and, frankly, boring outputs. Moreover, strong openings often draw on personal anecdotes or experiences unique to a particular comedian, which LLMs inherently lack: personal storytelling integrates layers of context, memory, and emotion that can't be replicated from training data alone. (Not to mention, when one comic tried to write a set from the perspective of an Asian woman, the LLMs refused, but did in fact write a set from the perspective of a white man.)
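Here is a rough illustration of that effect, assuming standard temperature sampling; the words and scores below are invented for the example. As the temperature drops, probability mass piles onto the safest token, and the surprising one (often where the joke lives) effectively vanishes.

```python
import math

# Hypothetical model scores (logits) for the next word after "My landlord is so..."
logits = {"cheap": 3.0, "lazy": 2.5, "nosy": 2.0, "radioactive": 0.1}

def softmax_with_temperature(scores: dict, temperature: float) -> dict:
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    scaled = {w: s / temperature for w, s in scores.items()}
    total = sum(math.exp(v) for v in scaled.values())
    return {w: math.exp(v) / total for w, v in scaled.items()}

for t in (1.0, 0.3):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}:",
          ", ".join(f"{w}={p:.4f}" for w, p in probs.items()))

# temperature=1.0: cheap~0.49, lazy~0.30, nosy~0.18, radioactive~0.03
# temperature=0.3: cheap~0.82, and "radioactive" drops below 0.0001;
# the unexpected punchline word is sampled essentially never.
```

Top-p and top-k truncation do the same thing more bluntly: they delete the tail outright.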

Jokes Building on Each Other

LLMs struggle to build a set where jokes compound, for a few key reasons. They can relate words within a short context but lose coherence over longer stretches: the attention mechanism in transformers handles short- to medium-length text well but degrades over extended context. Interconnected jokes also require an overall narrative structure, which LLMs lack; rather than planning at multiple levels, they predict each next word from the previous ones, producing material that is locally coherent but globally disjointed. A callback only lands if the setup was planted minutes earlier, and that kind of sentence-by-sentence generation, with no overarching plan, makes a consistent and engaging storyline across a whole set hard to maintain.
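Attention degradation itself is hard to demonstrate in a few lines, but a blunter cousin of the same problem is easy to sketch: deployed chatbots work within a finite context window, and once the conversation outgrows it, the earliest material is simply gone. In this toy version (the window size and the “set” are made up), the setup for a callback has already been evicted by the time the callback arrives.

```python
MAX_CONTEXT_TOKENS = 50  # real models allow far more; the cliff is the same shape

def count_tokens(text: str) -> int:
    return len(text.split())  # crude stand-in for a real tokenizer

history: list[str] = []

def add_line(line: str) -> None:
    history.append(line)
    # Evict the oldest lines once the context window overflows.
    while sum(count_tokens(l) for l in history) > MAX_CONTEXT_TOKENS:
        dropped = history.pop(0)
        print(f"[dropped from context: {dropped!r}]")

add_line("Setup: my landlord thinks my parrot is a smoke detector.")
for i in range(8):
    add_line(f"Riff {i}: unrelated material padding out the middle of the set.")
add_line("Callback: anyway, the parrot started beeping at 3 a.m.")

# By the time the callback arrives, the setup is no longer visible at all:
print("Setup still in context?", any("Setup" in l for l in history))  # -> False
```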

Perfect Pacing

British comedian Jimmy Carr has proposed that 92 beats per minute is the optimal rhythm for stand-up comedy. This tempo is comparable to well-known songs like Radiohead's "Creep" or Simon & Garfunkel's "Mrs. Robinson." However, the nuances of comedic timing—such as pacing, pauses, and delivery—remain a significant challenge for LLMs.

A Theory

When AI gets funny, that’s it. As in, AI will have conquered what I consider to be the final hurdle. I don’t yet know how we would measure “funny,” but I’m proposing a version of a Turing Test that I will dub “the Turing Jest”:

  1. One hundred up-and-coming (or virtually unknown) comedians perform their bits and record them in front of a live audience. We could use Netflix’s taped specials or old Comedy Central specials; what matters is that they are real comedians doing real jokes.

  2. We have AI create one hundred thirty- to sixty-minute-long sets and “record” them. There must be video and audio. Everything about the “specials” must be computer generated — not just the comedian, but the audience, the camera movements, even the slightly off-putting titles.

  3. We secretly release all of them on Netflix, Hulu, or another already-established streaming service and wait a year for the reviews to come in. If the reviews for the AI-generated specials beat those for the real ones, then we declare defeat. If not, then we keep our Roombas. (A toy sketch of this scoring rule follows below.)
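For what it’s worth, the decision rule in step 3 is simple enough to write down. A back-of-the-envelope sketch, assuming each special ends the year with one average review score (all numbers below are invented):

```python
import statistics

# Hypothetical average review scores (1-5 stars) after the one-year window.
human_scores = [3.8, 4.1, 2.9, 4.4, 3.5]  # one per real special (x100 in practice)
ai_scores    = [3.2, 2.7, 3.9, 3.0, 2.5]  # one per AI-generated special

def turing_jest_verdict(human: list[float], ai: list[float]) -> str:
    """The deliberately crude rule from step 3: compare average reviews."""
    h, a = statistics.mean(human), statistics.mean(ai)
    if a > h:
        return f"AI specials rated higher ({a:.2f} vs {h:.2f}): declare defeat."
    return f"Humans hold the line ({h:.2f} vs {a:.2f}): keep our Roombas."

print(turing_jest_verdict(human_scores, ai_scores))
```

A real version would want something sturdier than a difference of means (review counts differ, ratings are noisy), but the shape of the test is exactly this.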

Is this method perfect? Probably not — but we’ve yet to truly explore how to ‘measure’ artificial comedy, and I believe it’s time to start seriously thinking about it.

-Zev