WorldDirector: Building Controllable World Simulators with Persistent Dynamic Memory

1HKUST   2Ant Group   3ZJU   4CUHK
*Corresponding author.

Abstract

We present WorldDirector, a highly controllable video world model framework designed for persistent dynamic object memory and unrestricted viewpoint exploration. Unlike existing world models that entangle physical dynamics with pixel rendering and rely on continuous visual observation to sustain motion, our framework explicitly decouples semantic motion orchestration from visual generation. By leveraging an LLM to coordinate 3D trajectories with camera movements and subsequently employing these orchestrated trajectories as control signals for video generation, our approach ensures strict physical logic and appearance stability, successfully preserving the exact visual identities of dynamic entities even when they re-enter the scene after prolonged periods out of view. Experimental results demonstrate that our method supports the synthesis of complex and extended events with unprecedented controllability and persistent dynamic object memory.
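To make the decoupling concrete, here is a minimal sketch of why a trajectory defined in world space survives periods out of view. It is an illustration under stated assumptions, not the authors' implementation: the `Camera` class, the `in_view` helper, the 60-degree field of view, and the ground-plane simplification (x, z only) are all hypothetical. The object's position advances at every step regardless of whether the camera sees it, so its state is still known when the camera pans back.

```python
import math
from dataclasses import dataclass


@dataclass
class Camera:
    """Hypothetical ground-plane camera; yaw 0 looks along +z, yaw 90 along +x."""
    x: float
    z: float
    yaw_deg: float
    fov_deg: float = 60.0  # assumed horizontal field of view


def in_view(cam: Camera, px: float, pz: float) -> bool:
    """Return True if the world-space point (px, pz) lies inside the camera's horizontal FOV."""
    dx, dz = px - cam.x, pz - cam.z
    yaw = math.radians(cam.yaw_deg)
    # Project the world offset onto the camera's right and forward axes.
    x_cam = math.cos(yaw) * dx - math.sin(yaw) * dz
    z_cam = math.sin(yaw) * dx + math.cos(yaw) * dz
    if z_cam <= 0:  # behind the camera
        return False
    return abs(math.degrees(math.atan2(x_cam, z_cam))) <= cam.fov_deg / 2


# Object trajectory in world space: it keeps moving at every time step.
trajectory = [(0.0, 5.0), (0.5, 5.0), (1.0, 5.0), (1.5, 5.0)]
# Camera pans away (yaw 90) and then back (yaw 0) while staying at the origin.
yaws = [0.0, 90.0, 90.0, 0.0]

visible = [
    in_view(Camera(0.0, 0.0, yaw), px, pz)
    for (px, pz), yaw in zip(trajectory, yaws)
]
print(visible)
```

The object is out of view for the two middle steps, yet its world-space position has advanced from x = 0.0 to x = 1.5 by the time the camera returns. The sketch covers only this bookkeeping; how appearance is preserved across the gap is the job of the video generator and is not modeled here.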

Dynamic Memory

3D Trajectories
Result
Global: A woman in an elegant white dress walks along a coastal road beside a black sports car that first comes to a halt and then starts to pull away, set against a stunning ocean sunset.
[0-10s]: A black sports car was parked in place. A woman walks forward on the road.
[10-15s]: A black sports car is driving on the road. A woman walks forward on the road.
3D Trajectories
Result
Global: A truck is captured in motion from behind as it cruises through a large parking area, heading past a 'QUICK MART' convenience store under a twilight sky painted with shades of purple and orange.
[0-10s]: A truck turns at the intersection and drives on the road.
[10-20s]: A car was parked in place.
[20-25s]: The truck drove into the distance.
3D Trajectories
Result
Global: Beside a small campfire surrounded by stones, a man in a yellow hoodie and a red beanie gazed at the empty desert road and distant mountains as twilight enveloped the area.
[0-5s]: A man stands by the campfire, gazing into the distance.
[5-10s]: The man walks into the distance.
3D Trajectories
Result
Global: A full-length photograph capturing a woman and a man with contrasting styles, one in a vibrant pink and turquoise tracksuit checking her phone and the other in a bright yellow jacket and red cargo pants, standing in front of a wall.
[0-15s]: The man walks back and forth in the open space. The woman walks back and forth in the open space.
3D Trajectories
Result
Global: A man in a red hoodie and white cargo pants walks down the center of a suburban residential street.
[0-15s]: A man walks down the center of a suburban residential street.

Promptable Events

3D Trajectories
Result
Global: In a vast, open space, there are a few trees by the roadside.
[0-15s]: A horse was walking on the road.
[15-20s]: A bus drove past the camera.
[20-35s]: On a vast, empty field, a person stood still.
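The captions on this page share a simple layout: one `Global:` line followed by `[start-end s]:` event lines. As a rough sketch of how such a script could be turned into a timeline, the snippet below parses that layout and looks up the event active at a given time. The `Segment` class and the `parse_script` and `active_segment` helpers are hypothetical names introduced for illustration; the page does not describe the actual input format the system consumes.

```python
import re
from dataclasses import dataclass


@dataclass
class Segment:
    start_s: int
    end_s: int
    text: str


# Matches lines such as "[15-20s]: A bus drove past the camera."
SEGMENT_RE = re.compile(r"^\[(\d+)-(\d+)s\]:\s*(.+)$")


def parse_script(lines):
    """Split caption lines into a global prompt and a list of timed segments."""
    global_prompt = None
    segments = []
    for raw in lines:
        line = raw.strip()
        if line.startswith("Global:"):
            global_prompt = line[len("Global:"):].strip()
            continue
        match = SEGMENT_RE.match(line)
        if match:
            start, end, text = match.groups()
            segments.append(Segment(int(start), int(end), text))
    # Require each segment to begin where the previous one ends.
    for prev, nxt in zip(segments, segments[1:]):
        if prev.end_s != nxt.start_s:
            raise ValueError(f"gap or overlap between {prev} and {nxt}")
    return global_prompt, segments


def active_segment(segments, t):
    """Return the segment covering time t (seconds), or None if t is outside the script."""
    for seg in segments:
        if seg.start_s <= t < seg.end_s:
            return seg
    return None


script = [
    "Global: In a vast, open space, there are a few trees by the roadside.",
    "[0-15s]: A horse was walking on the road.",
    "[15-20s]: A bus drove past the camera.",
    "[20-35s]: On a vast, empty field, a person stood still.",
]
global_prompt, segments = parse_script(script)
print(active_segment(segments, 17).text)
```

Treating each interval as half-open (`start <= t < end`) means a boundary time such as 15 s belongs to exactly one event, which avoids ambiguity when consecutive events share an endpoint.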

Viewpoints Control

3D Trajectories
Result
Global: A ginger cat with a small backpack, amidst a vast green field with a giant stone bust and a gnarled tree.
[0-15s]: A ginger cat with a small backpack walks around. A stone statue keeps still in place. A tree keeps still in place.
3D Trajectories
Result
Global: A man wearing a white robe stands in a grand ancient courtyard, gazing at majestic ruins illuminated by flaming braziers under a twilight sky.
[0-5s]: A man walks forward.
[5-10s]: A man turns around.
[10-15s]: A man stands still on the ground.

Comparisons

Ablation

Appearance Condition

No Appearance Condition
No Appearance Condition + Self-attention Routing
Ours
Global: A full-length photograph capturing a woman and a man with contrasting styles, one in a vibrant pink and turquoise tracksuit checking her phone and the other in a bright yellow jacket and red cargo pants, standing in front of a wall.
[0-15s]: The man walks back and forth in the open space. The woman walks back and forth in the open space.

Dynamic Context

No Dynamic Context
Ours

Appearance Condition Drop Mechanism

No Appearance Condition Drop
Ours