Models - Mar 14, 2026

The Physics of Sora 2: How OpenAI is Simulating the Physical World


Introduction

One of the most striking aspects of Sora 2 — OpenAI’s text-to-video model released on September 30, 2025 — is not just that it generates video, but that the generated video appears to understand physics. Objects fall with realistic trajectories. Water splashes in plausible patterns. Light interacts with surfaces in ways that feel physically grounded.

But does Sora 2 truly understand physics, or is it performing an extraordinarily sophisticated form of pattern matching? This question sits at the intersection of computer science, physics, and philosophy — and the answer has implications far beyond video generation.

Diffusion Transformers: The Architecture Behind the Magic

Sora 2’s architecture is built on diffusion transformers, an approach that combines two of the most successful paradigms in modern AI: diffusion models (as used in DALL-E 3 and Stable Diffusion) and transformer networks (as used in GPT-4 and other large language models).

In a standard diffusion model, the system learns to generate images by learning to reverse a noise-adding process. Starting from pure noise, the model iteratively removes noise to produce a coherent image. Sora extends this into the temporal dimension — instead of denoising a single image, it denoises an entire video sequence simultaneously.
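The denoising loop can be sketched in a few lines. This is a deliberately simplified toy sampler, not Sora's actual algorithm (real DDPM-style samplers use a learned noise schedule and stochastic updates); the `oracle` denoiser below is a stand-in I invented so the loop has something to converge toward.

```python
import numpy as np

def toy_reverse_diffusion(denoise_fn, shape, steps=100, step_size=0.1, rng=None):
    """Toy sampler: start from pure Gaussian noise and repeatedly
    subtract a fraction of the predicted noise. Illustrative only."""
    rng = rng or np.random.default_rng(0)
    x = rng.standard_normal(shape)           # pure noise
    for _ in range(steps):
        predicted_noise = denoise_fn(x)      # the learned network's job
        x = x - step_size * predicted_noise  # one small denoising step
    return x

# A perfect "oracle" denoiser that reports the noise relative to a target
# image; the sampler then walks from random noise toward that target.
target = np.full((4, 4), 2.0)
oracle = lambda x: x - target
sample = toy_reverse_diffusion(oracle, (4, 4))
```

In Sora the `denoise_fn` role is played by the transformer, and `x` is not a single image but an entire latent video sequence denoised jointly.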

The transformer component is what allows Sora to reason about relationships between different parts of the video across both space and time. Each frame is divided into patches, and the transformer’s attention mechanism allows every patch to attend to every other patch — including patches from different frames. This means the model can, in principle, maintain coherence between the beginning and end of a video, ensuring that objects that move off-screen reappear correctly.
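The patching step above can be made concrete. The sketch below (my illustration; the patch sizes are arbitrary, and OpenAI has not published Sora 2's exact tokenization) splits a video tensor into non-overlapping spacetime patches, each flattened into one token that the attention mechanism can then relate to every other token:

```python
import numpy as np

def patchify_video(video, patch=4, tpatch=2):
    """Split a (T, H, W, C) video into non-overlapping spacetime patches
    of tpatch frames x patch x patch pixels, each flattened into one
    token. A transformer then lets every token attend to every other,
    across frames as well as within them."""
    T, H, W, C = video.shape
    tokens = (video
              .reshape(T // tpatch, tpatch, H // patch, patch, W // patch, patch, C)
              .transpose(0, 2, 4, 1, 3, 5, 6)   # group the patch axes together
              .reshape(-1, tpatch * patch * patch * C))
    return tokens

video = np.zeros((8, 16, 16, 3))   # 8 frames of 16x16 RGB
tokens = patchify_video(video)
# (8/2) * (16/4) * (16/4) = 64 tokens, each 2*4*4*3 = 96 values
```

Because a token from frame 1 and a token from frame 8 sit in the same attention window, coherence across time falls out of the same mechanism that handles coherence across space.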

How Sora 2 Handles Gravity

Gravity is one of the most fundamental physical forces, and getting it wrong is immediately noticeable to human viewers. In Sora 1 (released December 9, 2024, for Plus and Pro users in the US and Canada), objects occasionally floated unnaturally or fell at incorrect speeds.

Sora 2 shows marked improvement. Dropped objects accelerate at rates that approximate real gravitational acceleration. Bouncing objects lose energy on each bounce. Thrown objects follow parabolic trajectories.
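These regularities are exactly what a few lines of explicit Newtonian integration produce. The sketch below is a conventional simulation, shown for contrast with what follows: it makes explicit the equations that Sora 2 only approximates statistically. The restitution value is an arbitrary illustrative choice.

```python
def bounce_peaks(y0=10.0, g=9.81, restitution=0.8, dt=1e-4, n_bounces=3):
    """Drop a ball from height y0 under constant gravity g and record
    the apex height reached after each bounce. Since kinetic energy
    scales with v**2, each apex shrinks by roughly restitution**2."""
    y, v = y0, 0.0
    peaks = []
    while len(peaks) < n_bounces:
        v -= g * dt                       # constant downward acceleration
        y_new = y + v * dt
        if y_new <= 0.0 and v < 0.0:      # floor impact
            v = -v * restitution          # fixed fraction of speed lost
            peaks.append((v * v) / (2 * g))  # apex the rebound will reach
            y_new = 0.0
        y = y_new
    return peaks

peaks = bounce_peaks()
# successive apexes shrink by ~restitution**2 = 0.64
```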

Importantly, Sora 2 did not learn an equation for gravity. It learned statistical regularities from millions of video frames that happen to be consistent with gravitational physics. The model has internalized a correlation that looks like gravity without possessing a causal model of gravitational force.

This distinction matters. When you push Sora 2 into unusual scenarios — zero gravity, extreme slow motion, microscopic scales — the illusion of physical understanding can break down because the training data becomes sparse in those regimes.

Fluid Dynamics and Soft Body Physics

Fluid simulation is one of the most computationally expensive tasks in traditional visual effects. Simulating water, smoke, or fire using physics-based methods (like Navier-Stokes solvers) requires enormous computational resources and expert setup.
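A single substep already shows where the cost comes from. The sketch below is one implicit viscous-diffusion solve of the kind used in grid-based fluid solvers (in the style of Jos Stam's "stable fluids" method), using Jacobi iteration on a 2D periodic grid; the parameters are illustrative, not from any production solver.

```python
import numpy as np

def jacobi_diffuse(field, diffusion, dt, iters=40):
    """One implicit viscous-diffusion solve on a periodic 2D grid via
    Jacobi iteration. Every simulated frame needs many sweeps like this
    over the whole grid; a 3D smoke sim at film resolution touches
    millions of cells per sweep, which is where the expense comes from."""
    a = dt * diffusion * field.size
    x = field.copy()
    for _ in range(iters):
        neighbors = (np.roll(x, 1, 0) + np.roll(x, -1, 0) +
                     np.roll(x, 1, 1) + np.roll(x, -1, 1))
        x = (field + a * neighbors) / (1 + 4 * a)
    return x

# Diffuse a single point of density: mass spreads but is conserved.
field = np.zeros((16, 16))
field[8, 8] = 1.0
out = jacobi_diffuse(field, diffusion=1e-4, dt=0.1)
```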

Sora 2 sidesteps this entirely. Rather than solving differential equations, it generates fluid behavior that looks convincing because it has observed millions of examples of how fluids actually behave on camera. The results include:

  • Water that splashes, pools, and reflects light in plausible ways
  • Smoke that dissipates with realistic turbulence patterns
  • Fire that flickers and casts dynamic light on surrounding surfaces
  • Cloth that drapes and folds with reasonable fabric behavior

The quality is not yet indistinguishable from high-end VFX simulation, but it is dramatically better than anything available in AI video generation even 12 months ago. For many use cases — social media content, pre-visualization, concept exploration — it is more than sufficient.

Object Permanence and Scene Consistency

One of the most challenging aspects of video generation is maintaining consistent objects across frames. In early AI video models, objects would appear and disappear randomly, change shape between frames, or merge with their backgrounds.

Sora 2 addresses this through the transformer’s attention mechanism, which allows the model to track objects across the temporal dimension. The improvement is significant:

  • Objects that move behind occluders tend to reappear correctly
  • Scene geometry remains largely stable across frames
  • Character appearance is more consistent (though still imperfect over long sequences)
  • Lighting conditions evolve plausibly as the camera moves

However, Sora 2 still fails at object permanence in complex scenarios. When multiple similar objects interact, the model sometimes “merges” them or loses track of individual items. This is a known limitation of the current architecture and a major focus of ongoing research.

Camera Physics and Cinematography

Beyond simulating physical objects, Sora 2 also simulates camera physics:

  • Depth of field varies with apparent focal length
  • Motion blur appears on fast-moving objects
  • Lens distortion is subtly present, especially at wide angles
  • Exposure adjusts plausibly when transitioning between bright and dark areas

These camera effects are not added as post-processing — they emerge from the model’s training on real camera footage. Because Sora learned from videos shot with real cameras, it implicitly learned how cameras mediate the visual experience.
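The depth-of-field behavior Sora mimics has a simple optical ground truth. The sketch below applies the standard thin-lens circle-of-confusion formula (not anything from Sora itself) to show how blur depends jointly on focal length, aperture, and distances:

```python
def circle_of_confusion(f_mm, N, focus_mm, subject_mm):
    """Thin-lens blur-spot diameter (mm) on the sensor for a point at
    subject_mm, with a lens of focal length f_mm at aperture f/N
    focused at focus_mm."""
    A = f_mm / N                                  # aperture diameter
    return (A * (f_mm / (focus_mm - f_mm))
              * abs(subject_mm - focus_mm) / subject_mm)

# An 85mm f/1.8 focused at 2m blurs a 3m background far more
# than a 24mm f/8 does -- the shallow-DoF "portrait look".
blur_tele = circle_of_confusion(85, 1.8, 2000, 3000)
blur_wide = circle_of_confusion(24, 8, 2000, 3000)
```

Sora never evaluates such a formula; it reproduces the formula's consequences because nearly every frame it trained on was shaped by them.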

This has interesting implications for cinematography. By specifying different visual styles in the prompt, users can effectively “choose” different camera and lens characteristics without any technical knowledge of optics.

The “World Simulator” Claim

OpenAI has described Sora as a “world simulator,” implying that the model builds internal representations of 3D scenes and physical laws. This claim is controversial in the research community.

Arguments for the “world simulator” interpretation:

  • The model maintains 3D consistency across viewpoint changes
  • Physical behaviors emerge without explicit physics programming
  • Scene elements interact in ways that suggest spatial awareness

Arguments against:

  • The model fails on out-of-distribution scenarios that a true physics simulator would handle
  • There is no evidence of explicit 3D representations in the model’s latent space
  • Many apparent physics effects could be explained by sophisticated 2D pattern matching

The truth likely lies somewhere in between. Sora 2 has learned something about 3D structure and physical causality, but that “something” is not a physics engine in any traditional sense. It is a statistical approximation that works remarkably well within the distribution of its training data and degrades outside of it.

Implications for Scientific Visualization

If Sora and its successors continue to improve their physical plausibility, there are fascinating implications for scientific visualization:

  • Molecular dynamics could be visualized from text descriptions
  • Geological processes operating over millions of years could be rendered as time-lapse videos
  • Engineering prototypes could be stress-tested visually before physical construction
  • Weather patterns could be visualized in intuitive ways for public communication

However, for any of these applications to be trustworthy, the model would need to be validated against known physical ground truth — something that current AI video models are not designed for.

The Gap Between Appearance and Understanding

The most philosophically interesting aspect of Sora 2’s physics is the gap between appearing to understand physics and actually understanding physics. Sora 2 generates videos where objects behave as if they are subject to gravity, friction, and elasticity. But the model has no concept of mass, no concept of force, and no concept of energy conservation.

This gap matters for several reasons:

  1. Reliability: A physics engine will always produce physically correct results within its domain. Sora 2 will produce physically plausible results most of the time, but will occasionally produce impossible outputs.

  2. Extrapolation: A physics engine can simulate scenarios it has never encountered. Sora 2 can only generate scenarios similar to its training data.

  3. Control: A physics engine allows precise control over physical parameters. Sora 2 offers only indirect control through natural language prompts.
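The extrapolation and control points can be made concrete with a one-line physical model. In a physics engine, moving from Earth to the Moon is a parameter change; for a learned video model, lunar gravity is a sparse, out-of-distribution regime. The closed-form projectile range below illustrates this (standard kinematics, not anything internal to Sora):

```python
import math

def projectile_range(v0, angle_deg, g):
    """Closed-form range of a projectile launched at v0 m/s: a physics
    model extrapolates to any gravity g by changing one parameter,
    whereas a learned video model can only interpolate within the
    gravity regimes it has seen on film."""
    theta = math.radians(angle_deg)
    return v0 * v0 * math.sin(2 * theta) / g

earth = projectile_range(20, 45, 9.81)  # Earth gravity
moon = projectile_range(20, 45, 1.62)   # same law, new regime
```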

Where Sora 2 Falls Short

Despite its impressive capabilities, Sora 2 has clear limitations in physics simulation:

  • Counting: Objects sometimes spontaneously multiply or disappear
  • Complex mechanical interactions: Gears, pulleys, and linkages often behave incorrectly
  • Long-term consistency: Physics degrades noticeably in videos longer than 20-30 seconds
  • Text and symbols: Words on objects still frequently become garbled
  • Hands and fingers: A persistent challenge inherited from image generation

The Road Ahead

The trajectory from Sora 1 to Sora 2 suggests that physical plausibility will continue to improve rapidly. Each generation of the model learns from more data, uses more compute, and benefits from architectural improvements.

The eventual convergence of AI video generation with physics simulation engines is likely. Future models may combine learned visual priors with explicit physics constraints, producing outputs that are both visually stunning and physically accurate.

For now, Sora 2 occupies a fascinating middle ground — a system that has learned enough about physics to fool human viewers most of the time, without understanding physics in any meaningful sense.

For creators and researchers exploring AI-generated video and its intersection with physical simulation, Flowith offers a multi-model workspace where you can experiment with different AI tools and orchestrate complex generation workflows seamlessly.
