Adventures in Veo 3.1: Teaching AI to Remember What Characters Look Like

Here's a problem I didn't expect to spend my Friday night solving: how do you make an AI video generator remember what a cartoon elk looks like across multiple scenes?

I've been building Vibecaster, which now generates AI videos using Google's Veo 3.1. The videos are surprisingly good—Veo handles dialogue, camera movement, and even generates matching audio. But there's a catch: if you're telling a story with recurring characters, each scene might render them completely differently.

The Test Prompt

A cozy New Year's Eve scene with three characters:

  • David: A 42-year-old British man in a burgundy sweater, sitting in a leather armchair
  • Elky: A 4-foot cartoon elk mascot with big antlers and a red scarf
  • Loggy: A 2-foot cartoon wooden log with dot eyes, rosy cheeks, and a green woolly hat

The punchline? Loggy is terrified of the roaring fireplace. (He's made of wood. It tracks.)

The problem: across three scenes, Elky's antlers kept changing shape. Loggy's face drifted between cute and cursed. David somehow gained and lost a beard between cuts. The story was there, but the characters weren't consistent.

Reference Images

Veo 3.1 lets you pass up to 3 reference images to guide video generation. Google's documentation says it's for character consistency.

Except there's a wrinkle: reference images and "first-frame" mode are mutually exclusive. You can either:

  • Give Veo an image and say "animate this" (first-frame mode)
  • Give Veo reference images and say "include these characters" (ingredients mode)

But not both. And we were using first-frame mode because it gives you more control—generate a scene image with Nano Banana Pro (Google's image model), then animate it with Veo. The references would be ignored.
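
To make the constraint concrete, here's a minimal sketch of the guard in our Veo wrapper. generate_veo_clip and its parameters are hypothetical names of my own, not the actual SDK call:

# Hypothetical wrapper around the Veo 3.1 call. The point is the constraint:
# you can pass a first frame to animate OR up to 3 reference images, not both.
def generate_veo_clip(prompt: str, first_frame=None, reference_images=None):
    if first_frame is not None and reference_images:
        raise ValueError(
            "Veo 3.1: first-frame mode and reference images are mutually exclusive"
        )
    if reference_images and len(reference_images) > 3:
        raise ValueError("Veo 3.1 accepts at most 3 reference images per call")
    ...  # actual Veo API call omitted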

The Solution: A Two-Stage Pipeline

Here's what we built:

User Prompt (with character descriptions)
    ↓
LLM Analysis → Extract characters + which scenes they appear in
    ↓
Nano Banana Pro → Generate reference portrait for EACH character
    ↓
Scene 1: Generate image WITH character refs → Veo animates
    ↓
Scene 2+: Veo extends previous video + uses character refs

The workaround: Nano Banana Pro can also take reference images when generating scene images. So we generate character portraits first, then use those as references when generating each scene's first frame. Scene 1's characters match the portraits, and Veo's video extension chain maintains continuity from there.
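
In code, the orchestration looks roughly like this. Every helper (analyze_prompt, generate_portrait, generate_scene_image, veo_animate, veo_extend) is a placeholder for one of our internal functions, not a real SDK call, and I'm assuming the analysis also returns per-scene prompts, which the JSON in the next section doesn't show:

# Rough sketch of the two-stage pipeline; all helper names are placeholders.
def render_story(user_prompt: str) -> list:
    plan = analyze_prompt(user_prompt)        # LLM -> characters + scene map (JSON below)
    portraits = {
        c["id"]: generate_portrait(c)         # Nano Banana Pro, one portrait per character
        for c in plan["characters"]
    }
    clips, previous_clip = [], None
    for scene_num, scene_prompt in enumerate(plan["scenes"], start=1):
        refs = [portraits[cid] for cid in plan["scene_characters"][str(scene_num)]]
        if previous_clip is None:
            # Scene 1: first frame generated WITH character refs, then animated.
            first_frame = generate_scene_image(scene_prompt, reference_images=refs)
            clip = veo_animate(scene_prompt, first_frame=first_frame)
        else:
            # Scene 2+: extend the previous clip, passing character refs along.
            clip = veo_extend(previous_clip, scene_prompt, reference_images=refs)
        clips.append(clip)
        previous_clip = clip
    return clips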

The Implementation

First, we analyze the user's prompt to extract structured character data:

{
  "characters": [
    {
      "id": "david",
      "name": "David",
      "description": "42yo British man, short brown hair, burgundy sweater...",
      "style": "storybook_human",
      "priority": 1
    },
    {
      "id": "elky",
      "name": "Elky",
      "description": "4ft cartoon elk, upright, tan fur, big antlers, red scarf...",
      "style": "pixar_3d",
      "priority": 2
    },
    {
      "id": "loggy",
      "name": "Loggy",
      "description": "2ft cartoon wooden log, bark texture, dot eyes, rosy cheeks...",
      "style": "pixar_3d",
      "priority": 3
    }
  ],
  "scene_characters": {
    "1": ["david", "elky"],
    "2": ["david", "loggy"],
    "3": ["david", "loggy"]
  }
}

Then we generate a reference portrait for each character. The LLM detected that David is a stylized human while Elky and Loggy are Pixar-style 3D mascots—so each gets appropriate style prompts.
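
The style handling is a small lookup keyed on the LLM's style tag. The actual prompt templates are longer than this, so treat these strings as trimmed stand-ins:

# Illustrative only: the real portrait prompt templates are longer.
STYLE_PROMPTS = {
    "storybook_human": "warm storybook illustration, slightly painterly, soft lighting",
    "pixar_3d": "Pixar-style 3D character render, friendly proportions, soft studio lighting",
}

def portrait_prompt(character: dict) -> str:
    style = STYLE_PROMPTS.get(character["style"], "storybook illustration")
    return (
        f"Character reference portrait of {character['name']}: {character['description']} "
        f"Full body, neutral pose, plain background, {style}."
    )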

When generating each scene, we pass only the relevant character references. Scene 1 gets David and Elky. Scenes 2 and 3 get David and Loggy. Veo's limit is 3 references per call, but we can have unlimited characters total—different combinations per scene.
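
Selecting references per scene is a straightforward filter over the analysis JSON, plus a guard for Veo's 3-reference cap; the priority field decides who gets dropped in a crowded scene. A sketch:

MAX_REFS_PER_CALL = 3  # Veo 3.1 limit on reference images

def refs_for_scene(scene_num: int, plan: dict, portraits: dict) -> list:
    """Return the portrait images for this scene, highest-priority first."""
    char_ids = plan["scene_characters"].get(str(scene_num), [])
    priority = {c["id"]: c["priority"] for c in plan["characters"]}
    by_priority = sorted(char_ids, key=lambda cid: priority[cid])
    return [portraits[cid] for cid in by_priority[:MAX_REFS_PER_CALL]]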

The Fun Part: Mixed Styles

What makes this interesting is mixed-style scenes. David is a semi-realistic human. Elky and Loggy are cartoon mascots. They need to coexist in the same frame without looking like a bad Photoshop composite.

The trick is the "global style" hint. We tell both Nano Banana and Veo that this is a "storybook" aesthetic—warm lighting, slightly painterly quality. David gets rendered in a Pixar-adjacent style that matches the mascots. Everyone lives in the same visual universe.
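
In practice that's one string appended to every prompt, image and video alike. The exact wording here is illustrative, not the production prompt:

# One global style hint shared by Nano Banana Pro and Veo prompts, so the
# semi-realistic human and the cartoon mascots render in one visual universe.
GLOBAL_STYLE = (
    "Storybook aesthetic: warm lighting, slightly painterly textures, "
    "cozy color palette, all characters share a consistent art style."
)

def with_global_style(prompt: str) -> str:
    return f"{prompt}\n\nStyle: {GLOBAL_STYLE}"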

What I Learned

LLMs handle structured extraction well. Give Gemini a messy prompt with character descriptions scattered throughout, ask for JSON output, and it reliably pulls out the right data. The "which characters appear in which scene" detection worked on the first try.
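
For reference, the extraction call is roughly this, a sketch using the google-genai SDK: the model name is a placeholder and the real instruction prompt is much longer.

import json
from google import genai

client = genai.Client()  # reads the API key from the environment

def extract_characters(user_prompt: str) -> dict:
    """Ask Gemini for structured character data as JSON."""
    instruction = (
        "Extract every recurring character from this video prompt. Return JSON with "
        "'characters' (id, name, description, style, priority) and "
        "'scene_characters' mapping scene numbers to character ids.\n\n"
    )
    response = client.models.generate_content(
        model="gemini-2.5-flash",  # placeholder: use whatever model your project targets
        contents=instruction + user_prompt,
        config={"response_mime_type": "application/json"},
    )
    return json.loads(response.text)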

Video extension is the primary consistency mechanism. Reference images help guide generation, but each scene extends from the previous video's last frames. The characters are already on screen—they just keep doing things.

API limitations force creative solutions. The mutual exclusivity between first-frame and reference modes seemed like a blocker. Using references at image generation (Nano Banana) instead of video generation (Veo) works around it.

The Expensive Mistake

When testing, I accidentally ran the full integration test suite instead of just the unit tests. Four Veo video generations, several image generations, all hitting real APIs. Lesson learned: pytest -m "not integration" exists for a reason.

At least the tests passed.
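
For anyone setting up something similar: registering the marker is all it takes to keep the costly tests out of the default run (file name assumed, marker name taken from the command above):

# pytest.ini
[pytest]
markers =
    integration: hits the real Veo / Nano Banana APIs (slow, and it costs money)

Day to day, pytest -m "not integration" runs only the cheap unit tests; the real-API suite runs only when you explicitly ask for it with pytest -m integration.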

What's Next

The multi-character reference system is live on Vibecaster. The next obvious feature is letting users upload their own reference images—skip the portrait generation and just say "this is what my character looks like."

But for now, I'm just happy that Elky's antlers stay the same shape throughout the video. It's the little things.

Built live with Claude Code at 2 AM because apparently that's when I do my best work.