Why AI-generated videos feel hypnotic, fluid, and uncanny

The strengths and weaknesses of being an impartial number cruncher

From Narrow To General AI
9 min read · May 8, 2024
https://www.youtube.com/watch?v=UXlPKKg4Md0
https://youtu.be/HK6y8DAPN_0?t=187

The stunning animations above were produced by Sora, the most advanced AI-based video generator to come out of OpenAI so far. They are undeniably impressive, even for someone who works in AI and has seen many such leaps forward in the last few decades. As you watch the videos you may be astonished by how seamlessly the AI blends together the various themes and objects in the clips, and how coherent and physically plausible the results look. Watching them, the average person would be baffled as to how AI technology has arrived at this level of realism.

After a while, and once the novelty wears off, you may begin to notice something else. The videos have an odd, hypnotic vibe. Watch them long enough and your eyes — and perhaps your mind — may start to glaze over. It feels like the video is pushing on your brain with a subtle and consistent pressure. There is something uncanny about the experience — not unpleasant per se, but a sense that something remains perpetually out of reach, hidden, so to speak, behind your eyes, or somewhere else.

This experience can be unbalancing, possibly because the clips look so perfectly crafted. Their representations are delicate and soft, polished, well-oiled, with no harsh edges, regardless of the content. Yet though I was certainly mesmerized by them, I came away from the encounters with a sense of fogginess in my brain. This is partly the result of the surreal content of the videos. But that by itself doesn’t explain the phenomenon completely — many a human-created video is surreal, and doesn’t have the same uncanny, lingering effect. Nor can it simply be attributed to my own unconscious prejudices against the originating technology. In fact, imagining that these clips were filmed by a human actually heightens their uncanniness rather than diminishing it.

The clips are eye-candy to be sure, and their enjoyment comes from the visual delight of their crisp, colourful imagery. And just like candy, they can ultimately feel unsatisfying. They always seem like they are about to “deliver” something, but never do. Or to put it another way, they have no intention of delivering anything; the momentary experience is the point.

Which of the cafe-goers is the focal character in this clip? It seems to switch between all four, from left to right, as the video progresses. The “story” being told also shifts as it changes characters.

Watching them is like watching a fractal: entrancing both in itself, and as a style of presentation. Or perhaps it’s like hearing a story without a climax. As soon as you think you’ve got the setup figured out, it subtly shifts and reveals a different story, which now anticipates its own payoff, and so on, like a run-on sentence. The videos are always on the cusp of a revelation, but they never cross over — and you don’t expect them to either. They are liminal, and prefer to stay that way.

https://giphy.com/gifs/animation-trippy-psychedelic-xUPGcxdhS5isNwqsuY

The hypnotic effect comes from the sense that the video is going somewhere, and at every turn it makes a new promise, holding you in suspense. The viewer follows along, and feels they can’t let go until they see something delivered. Ultimately they realize it never will be and just enjoy the experience in the moment. And so it meanders from goal to goal, as undirected as scrolling on social media.

To be clear, the videos do have an underlying idea, but it’s revealed at the start and maintained consistently throughout. There is no arc of setup and delivery, just a constant pressure. This is why the content of the videos is chosen to be visually captivating, and is often presented in slow motion. Cute puppies playing in snow, boats in a coffee cup, mammoths striding down a plateau. Like the stock footage they are trained on, their visual appeal is their whole point. The message of the clips matches the medium.

(I just added this clip cause it’s gorgeous)

Perhaps the best way to understand this phenomenon is to ask: what is the clip trying to say? What is its intent, its theme, its thesis? Why did the creator want us to see this, and specifically this? When you watch any video, you expect to quickly pick up the intent or purpose behind its creation. For AI-generated videos the answer shouldn’t be difficult to discover, since once a generative model has been trained, all specified content comes from the text prompt it is conditioned on. So whatever unique message a video has compared to other videos from the same model must come exclusively from that prompt. The text prompt should, ultimately, encapsulate the thesis… but it doesn’t. The prompt represents the content, but it doesn’t represent the purpose.

To understand what I mean, remember that any prompt you give an AI doesn’t come out of nowhere. You, as its author, have a history of motives and feelings, goals and intents, all trying to find expression through that prompt. They are your hidden purpose, not the concrete content you are literally asking for. None of that backstory makes it to the AI. The AI is being shortchanged, since it is not being given all the facts.

Your intent is not visible in the prompt itself, unless the AI could somehow infer it.

It is difficult to tell an AI something like “I want the viewer to feel the urge to help the poor” or “I want them to feel excited about exploring the world” or “I want them to be full of child-like wonder”. Often you yourself aren’t fully aware of what you are really trying to say — at least not enough to put it into English words.

So whenever an AI’s output misses the mark it is because, like an unintentionally evil genie, it is delivering what you asked for, but not what you wanted. The mistakes the AI makes reveal its alienation from the user’s hidden intent. It is frustratingly literal, and doesn’t suss out when it should emphasize or downplay some part of the prompt, nor even when to add something missing that should rightly be there.

As a user you subsequently have to edit the prompt to try to shift it closer to what you had intended, but didn’t actually say. All the while, the AI is being pushed towards your vision from behind, prompt by prompt, rather than being pulled by a driving motive. And like a sandcastle that you have to keep pulling back into shape, any part of it that you neglect slowly crumbles.

I didn’t feel Dall-E really understood my intent. The focus shouldn’t be on the “homeless” part, but on the wonder in the child’s eyes. Of course, I could add this to the prompt, but then it misunderstands something else.

The experience of creating art using a generator often feels like filming a chimpanzee for a movie role. The on-set animal trainer and the cameraman must capture enough shots so that the resulting reel can be edited to make it look like the chimp intended what the script wanted. The animal is unlikely to, say, look at a sad interaction and intuit that it should also look sad, then steel itself with renewed resolve. That all has to be approximated piecemeal through other trained behaviours, then stitched together as though the chimpanzee meant it.

For an AI to be aligned with your intentions it must either have its own intents, or somehow extract them from the user’s prompt — and the latter is exceptionally difficult. In contrast, when you commission a human artist you assume that they will understand your intent from the request. If they don’t, you can explain it to them through emotional or metaphorical language.

Any “intent” or meaning the AI displays arises not from the prompt itself, but from fragments borrowed from the training videos. Train it on only cheerful videos, and the result will be cheerful. AI are not opinionated or picky about what they represent or what their art communicates. You could say they are completely open-minded, using whatever imagery they are given during training.

This clip doesn’t so much say “puppies are cute” as “you think puppies are cute. I don’t feel anything. Enjoy”.

AI generators are good at merging a diversity of content into a coherent whole — in a sense, that is all that they do. The generator takes the prompt you give it, then branches out in several directions to “find” other clips it has seen during training which correlate with pieces of the prompt.

By design, the resulting image borrows its style or “feel” from the training images.
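To make this “correlate and merge” picture concrete, here is a deliberately toy sketch in Python. Everything in it is a stand-in invented for illustration: the tiny theme vectors, the word-count “embedding”, the softmax blend. Real video generators are vastly more sophisticated (diffusion models conditioned on learned text embeddings), but the structural point survives: every training source that correlates with the prompt contributes a little, and the output is a weighted average of other people’s choices.

```python
import numpy as np

# Hypothetical "training clips", each reduced to a tiny theme vector.
# The four dimensions stand in for visual themes: [puppies, snow, boats, coffee].
TRAINING_CLIPS = {
    "puppies playing in snow": np.array([1.0, 1.0, 0.0, 0.0]),
    "boats in a coffee cup":   np.array([0.0, 0.0, 1.0, 1.0]),
    "snowy street scene":      np.array([0.0, 1.0, 0.0, 0.0]),
}

def embed_prompt(prompt: str) -> np.ndarray:
    """Toy prompt 'embedding': how strongly each theme word appears in the prompt."""
    themes = ["puppies", "snow", "boats", "coffee"]
    words = prompt.lower().split()
    return np.array([float(words.count(t)) for t in themes])

def generate(prompt: str) -> np.ndarray:
    """Blend the training clips in proportion to how well they correlate with the prompt."""
    q = embed_prompt(prompt)
    feats = np.stack(list(TRAINING_CLIPS.values()))
    # Cosine similarity between the prompt and each training clip.
    sims = feats @ q / (np.linalg.norm(feats, axis=1) * np.linalg.norm(q) + 1e-9)
    # Softmax: every clip gets a nonzero weight, so every source leaks in.
    weights = np.exp(sims) / np.exp(sims).sum()
    # The output is a weighted average of the sources, not any one maker's choice.
    return weights @ feats

print(generate("puppies in snow"))
```

Notice that nothing in this pipeline asks what any source was for: content is averaged, and whatever purpose each source had is averaged away with it.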

In contrast, human creators are significantly more limited, and it can be difficult for an artist to break out of their peculiar style of representation. This is both a strength and a weakness. The constrained style of a piece of human-made art shows a focus; more importantly, it shows what its maker values. All intentional human behaviour has values embedded in it. For an artist, these values give rise to the themes in their work by making them selective about what they draw or paint.

Human communication seeks more than just coherence; it tries to move its viewer in a certain direction, to make them think or feel a certain way. This is embedded in the resulting expression. Consequently, each piece of human art an AI trains on, and subsequently borrows from, has a different intent. And to mix those in a giant pot necessarily muddies the outcome. This is why generative AI are more successful when they copy one particular artist’s style, since they borrow that artist’s implicit motives — what they look at, what they emphasize, what they ignore — without adulterating it with noise from the rest of the internet.

This fuzzball’s eyes start by looking eagerly in front of it, and so you follow its gaze. Then the mushroom unexpectedly appears from the right, at which point the creature looks there. Why did it look excited to begin with, then? A human artist would have known what she intended to portray all along, and focused all parts of the clip in support of that aim.

When you merge multiple viewpoints together, the result won’t say anything specific, nor can it make a “point”. An AI-generated video is like the combined sound of a dozen human voices speaking in unison through the cultural artifacts we have created. There is no single value or motive uniting the result. AI are exceptionally good at merging content but not intent, simply because intent shouldn’t be merged at all. “Intent” is always biased in favour of what it wants — it brooks no compromises. The only way to correct the blending of intents in AI art is for the AI to first learn the effect its choices have on the viewer, then use that knowledge to drive the art in the direction it wants.
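That last sentence suggests a loop worth sketching, with everything below hypothetical: a model that predicts the effect a candidate output will have on a viewer, and a selection step that drives generation toward an intended effect rather than blending candidates together. This is not how any current generator works; it is only the shape of the correction described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Eight hypothetical candidate "clips", each reduced to a small feature vector.
candidates = rng.normal(size=(8, 4))

def predicted_viewer_effect(clip: np.ndarray) -> np.ndarray:
    """A learned viewer-effect model would go here; this toy one is a fixed linear map."""
    effect_matrix = np.array([[0.9, 0.1, 0.0, 0.0],
                              [0.0, 0.8, 0.2, 0.0],
                              [0.1, 0.0, 0.7, 0.2],
                              [0.0, 0.1, 0.0, 0.9]])
    return effect_matrix @ clip

# The intent: the effect the system wants to produce in its viewer.
intended_effect = np.array([1.0, 0.0, 0.5, 0.0])

# Drive the output by intent: keep the candidate whose predicted effect lands
# closest to the intended one, instead of averaging all candidates together.
scores = [np.linalg.norm(predicted_viewer_effect(c) - intended_effect)
          for c in candidates]
chosen = candidates[int(np.argmin(scores))]
print("chosen candidate:", chosen)
```

The design choice that matters here is the refusal to merge: one candidate wins outright, because, as argued above, intent brooks no compromises.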

It’s hard to know where the camera expects you to look — the focus of the shot is the vanishing point down the street, and the cherry blossoms are eye-catching, yet the man’s gaze draws your eye to the snow on the awnings.

For now, generative AI is a blank slate of sorts, an impartial data cruncher. This has its appeal, to be sure. It lacks many of the annoyances of working with humans, particularly their stubbornness and partiality. It is a compliant and obsequious mimic of social artifacts. This also means its failures can be confusing, almost like the software is “mocking”¹ you. It imitates your representations without showing that it understands why you care about the message or the subject matter. You want the AI ‘artist’ to agree with your message, and to express that through the resulting creation. But it can’t agree with you; it can only ape what it’s seen. So it ends up generating “mock” human artifacts — that is, facsimiles that lack the driving voice of their originals.

¹ This is a simile; the AI isn’t mocking anything.

What is it thinking? …What does it want?!
