Nima Ajam Gard • January 22, 2026 • 13 min read
What’s Missing from Robot Learning (Part I)

In this series, I want to examine what is missing from today’s dominant approaches to robot learning.
I’ll leave the long-term debate about “how to solve AGI for robotics” to others. My focus is more pragmatic: what is the next meaningful step toward scalable, reliable advanced AI automation systems that actually work in the real world.
Robot learning is, by nature, an applied subfield of AI. Much of its progress comes from borrowing components developed elsewhere and assembling them into systems. Vision-Language-Action (VLA) models are a good example. They are currently the standard solution for long-horizon robotic tasks.
A typical VLA system combines:
a vision-language model, usually pretrained for captioning, question answering, or instruction following, and
an action model, often an expert head such as a diffusion policy trained with a flow-matching objective [1].
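To make that division of labor concrete, here is a minimal sketch (PyTorch; all module names are hypothetical, not any specific system's architecture): a pretrained vision-language backbone summarizes the observation and instruction into a latent, and an action "expert" head is trained with a flow-matching objective on expert action chunks.

```python
# Minimal sketch (PyTorch; names are illustrative) of the VLA pattern above.
import torch
import torch.nn as nn

class ActionExpert(nn.Module):
    """Stand-in flow-matching head; real systems use diffusion/flow transformers."""
    def __init__(self, latent_dim: int, action_dim: int, horizon: int, hidden: int = 512):
        super().__init__()
        self.horizon, self.action_dim = horizon, action_dim
        self.net = nn.Sequential(
            nn.Linear(latent_dim + horizon * action_dim + 1, hidden),
            nn.GELU(),
            nn.Linear(hidden, horizon * action_dim),
        )

    def velocity(self, latent, noisy_actions, t):
        # Predict the velocity field v(a_t, t | latent) along the noise-to-action path.
        x = torch.cat([latent, noisy_actions.flatten(1), t], dim=-1)
        return self.net(x).view(-1, self.horizon, self.action_dim)

def flow_matching_loss(expert, latent, actions):
    # latent: (B, latent_dim) from the VLM; actions: (B, H, action_dim) demonstration chunk.
    t = torch.rand(actions.shape[0], 1, device=actions.device)
    noise = torch.randn_like(actions)
    a_t = (1 - t.view(-1, 1, 1)) * noise + t.view(-1, 1, 1) * actions  # linear interpolation path
    target_v = actions - noise                                         # velocity along that path
    pred_v = expert.velocity(latent, a_t, t)
    return ((pred_v - target_v) ** 2).mean()

# Usage (hypothetical): latent = vlm(images, instruction); loss = flow_matching_loss(expert, latent, chunk)
```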
VLAs are arguably the best solutions we have today, not because they are fundamentally correct, but because they check several boxes required for real-world deployment. That said, they are still missing critical components.
A Learning Analogy from the Physical World
Let me ground this discussion with a personal experience.
Several years ago, I decided to learn how to snowboard. I grew up in a hot climate where winter sports simply didn’t exist, so this was entirely new to me. Before ever stepping on a board, I watched hours of snowboarding videos. I wanted to see whether it was possible to learn a physically demanding and coordination-heavy skill purely from observation. I assumed that being generally athletic would help, and that all I needed was to understand the physics.
Armed with YouTube videos, new gear, and excitement, I went to Perfect North Slopes right over the Ohio border in Indiana. I got on the board and immediately fell. Repeatedly. I spent far more time on the ground than on the board.
This wasn’t surprising in hindsight. I was trying to activate muscle groups in unfamiliar ways. The videos showed what to do and what success looked like, but not how it felt to do it, nor how to correct mistakes in real time. They couldn’t because that information depends on embodiment.
Learning a physical skill like snowboarding requires more than observation; it requires embodiment.
On my next visit, I signed up for beginner lessons. Immediately, things changed. I learned how to balance, received explicit feedback on what I was doing wrong, and developed a basic sense of control. I still fell often, but I was now spending meaningful time upright.
When I went home, I rewatched many of the same videos. This time, they made much more sense. I noticed details I had previously ignored and could mentally simulate movements I now had some physical reference for. Concepts like shifting my center of gravity or steering with my back foot were no longer abstract.
On my third trip, I mostly experimented. I had some instruction, a partial physical model, and a desire to test hypotheses. After a few falls, I managed to ride down the bunny hill. I had finally connected actions (specific body commands) to outcomes I had previously only seen in videos.
I am not a good snowboarder, and this is not a prescription for how people should learn sports. But the experience provides a useful reference frame for how robots might learn new skills.
Mapping the Analogy to Robot Learning
The three components of robot learning map to different aspects of skill acquisition.
In this framing:
Instruction and feedback are analogous to imitation learning with dense supervision.
Watching videos corresponds to learning a world model, which means understanding what outcomes are possible and how the world behaves.
Trying things out and mapping actions to outcomes is reinforcement learning, grounded in embodiment-specific control.
I’ve intentionally left out memory, planning, and continual improvement here, even though they are critical, because adding them would complicate the comparison. In this article, I will focus only on the world model; I’ll cover the other components in later articles.
So how do VLAs stack up?
What VLAs Get Right and Where They Fall Short
The core problem in robot learning is mapping high-level goals, which may span seconds to minutes, into sequences of embodied actions. Vision-language models are convenient tools for decomposing high-level tasks. Trained on vast internet-scale datasets, they are good at semantic understanding and instruction following.
In that sense, VLAs provide some form of long-horizon planning. Compared to approaches like pure diffusion policies or action-chunking transformers, VLAs at least attempt to reason over extended task structure.
But are they good world models for robotics? I’m skeptical.
This skepticism comes from two related issues:
Task misalignment: The pretraining objectives of vision-language models (captioning, QA, instruction following) are weakly aligned with the requirements of physical interaction [2].
Latent usefulness: Task formulation and training objectives strongly affect the semantics and utility of learned latent representations [3, 4].
In most VLA architectures, the vision-language model is asked to serve both as a world model and as a planner. Alternative formulations, such as System 1 / System 2 distinctions or hierarchical architectures, still rely on heuristics to connect high-level reasoning to low-level control [5, 6].
Sometimes every observation is routed through the vision-language model before reaching the expert head; sometimes a high-level latent is refreshed at a slower rate. Although the details of Harmonic reasoning [7] are not clear, it hints at this problem and suggests that thinking and acting simultaneously is necessary.
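A rough sketch of this dual-rate pattern (every component name here is hypothetical): a slow "System 2" vision-language model refreshes a task latent every N control steps, while a fast "System 1" action head consumes every observation. The refresh heuristic is exactly the kind of glue I am skeptical about.

```python
# Hypothetical dual-rate loop: slow high-level reasoning, fast low-level control.
def control_loop(vlm, action_head, get_observation, send_action, instruction, refresh_every=10):
    latent, step = None, 0
    while True:
        obs = get_observation()
        if latent is None or step % refresh_every == 0:
            latent = vlm(obs, instruction)   # slow, expensive "System 2" update
        action = action_head(obs, latent)    # fast, reactive "System 1" control
        send_action(action)
        step += 1
```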
So why are VLAs so popular?
Simply put: because vision-language models are abundant and open-source. They are easy to plug in, and they have been trained on internet-scale data.
I don’t want to discount the success and impressive results we’ve seen from VLAs. My point is that perhaps there is another formulation we need to consider.
World Models for Robotics
If VLAs are not the right world models, what is?
I argue that we need forward dynamics models trained across diverse scenarios. These are models that explicitly learn how the world evolves under a given action. Humans think (plan) not by generating language tokens [8], but by mentally simulating outcomes grounded in physics and embodiment. Chomsky argues the opposite about the role of language in thinking [9]; however, I favor the former viewpoint.
https://the company-robotics-1.wistia.com/medias/i13jzgw6ot
The video above shows multiple visualizations (I wish video generation were more consistent at the time of writing) of a robot's internal planning. Humans can imagine only one simulation at a time; with world models, we can spin up many in parallel. But what is the fidelity of these simulations? Does our internal model of the world learn only an abstract representation of what we have observed, or is it a high-fidelity reconstruction of what has happened and is about to happen? Should the internal visualization be a detailed rollout of an event or just a sketch of it? The cartoon from the 2018 World Models paper [10] shows a cyclist imagining a simplified version of himself riding his bike.
From Scott McCloud’s “Understanding Comics” [11] – our awareness of self flows outward to include objects of our extended identity.
A world model is conditioned on state and action and predicts future states, where state could include recent visual observations, context, or other sensory inputs. Whether such models should predict pixels is an open question; task formulation for world models remains an open problem. Some believe in predicting pixels, while others believe in predicting embeddings. Pixel reconstruction may lead to different learned representations than predicting embeddings, but perhaps it is better viewed as a tool for visualizing the internal thinking of the model. The dynamics module in Genie [12] takes visual latents and latent actions, which are learned from unlabeled videos, and predicts the visual latents for the next step; pixel reconstruction is only used to train the video tokenizer. JEPA-style world models [13] are similar to Genie's dynamics module, but their training objective is formulated differently.
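As a concrete reference point, here is a minimal latent world model sketch (PyTorch; the names and architecture are illustrative assumptions, not Genie's or V-JEPA's actual implementation): an encoder maps observations to latents, a dynamics module predicts the next latent from the current latent and an action, and an optional decoder exists only to visualize rollouts.

```python
import torch
import torch.nn as nn

class LatentWorldModel(nn.Module):
    """Illustrative latent dynamics model: predict future latents, not future pixels."""
    def __init__(self, obs_dim: int, latent_dim: int, action_dim: int, hidden: int = 512):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden), nn.GELU(), nn.Linear(hidden, latent_dim))
        self.dynamics = nn.Sequential(
            nn.Linear(latent_dim + action_dim, hidden), nn.GELU(), nn.Linear(hidden, latent_dim))
        # Optional decoder: pixel reconstruction as a window into the model's "internal thinking".
        self.decoder = nn.Sequential(nn.Linear(latent_dim, hidden), nn.GELU(), nn.Linear(hidden, obs_dim))

    def step(self, z, action):
        return self.dynamics(torch.cat([z, action], dim=-1))

    def rollout(self, obs, actions):
        # obs: (B, obs_dim); actions: (B, T, action_dim) -> imagined latents (B, T, latent_dim).
        z, latents = self.encoder(obs), []
        for t in range(actions.shape[1]):
            z = self.step(z, actions[:, t])
            latents.append(z)
        return torch.stack(latents, dim=1)

def next_latent_loss(model, obs, action, next_obs):
    # Embedding-prediction objective (JEPA/Genie-flavored); a real system would use a
    # target encoder (e.g., EMA) to avoid representation collapse.
    z_next_pred = model.step(model.encoder(obs), action)
    with torch.no_grad():
        z_next_target = model.encoder(next_obs)
    return ((z_next_pred - z_next_target) ** 2).mean()
```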

What Roles Can a World Model Play?
World models can be used in the gaming industry, in virtual worlds for VR, or for a specific process such as fluid simulation. In robot learning, a world model can serve multiple roles:
A simulator for evaluation
An agent’s internal predictive model
A data generator and augmenter
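To illustrate the second role, here is a minimal planning-by-imagination sketch under the same assumptions as the earlier LatentWorldModel example (a hypothetical encoder/step interface and a user-supplied reward function): sample many candidate action sequences, roll them out in latent space, and execute the first action of the best one.

```python
import torch

@torch.no_grad()
def plan(world_model, reward_fn, obs, action_dim, horizon=16, num_candidates=256):
    # Random-shooting MPC over imagined rollouts; CEM or gradient-based planners have the same shape.
    z = world_model.encoder(obs.expand(num_candidates, -1))        # replicate the current state
    candidates = torch.randn(num_candidates, horizon, action_dim)  # candidate action sequences
    returns = torch.zeros(num_candidates)
    for t in range(horizon):
        z = world_model.step(z, candidates[:, t])                  # imagine one step forward
        returns += reward_fn(z)                                    # score the imagined outcome
    return candidates[returns.argmax(), 0]                         # execute only the first action
```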
To better understand the landscape of simulators, we will take a look at NVIDIA’s ecosystem. We can evaluate current offerings along three axes:
Physics simulation
Asset generation
Skill training
The NVIDIA ecosystem spans three axes: Physics Simulation (Newton/Warp, Isaac Sim), Skill Training (Isaac Lab), and Asset Generation (Replicator).
Physics Simulation
Physics simulation enables policy rollout, imagination, evaluation, and synthetic data generation. Accuracy and ease of setup are critical: poor fidelity leads to a sim-to-real gap, and heavy setup effort raises the barrier to entry for using physics-based simulators in robot learning. A well-known simulator such as Isaac Sim integrates physics, rendering, sensors, and robotics workflows. Isaac Sim, which uses PhysX (a simulator for rigid bodies, articulations, soft bodies, and friction), is becoming easier to use, but it still leans heavily on an expert user for setup. Warp is a GPU-accelerated spatial computing and simulation framework (not itself a full physics engine) that compiles Python kernels into fast CUDA code, with differentiable and vectorized primitives. MuJoCo is a lightweight, high-performance physics engine focused on articulated-body dynamics and contact simulation, widely used in research and control. Newton is a new open-source, GPU-accelerated physics engine that targets robotics; it is built on Warp and integrates MuJoCo-Warp solvers. All of these simulators and frameworks require someone to build assets and set up environments on top of defining the physics logic, the physical properties of materials, and so on. A user has to either define the workflow through a graph or write code to perform a task.
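As a rough illustration of the Warp programming model mentioned above (a simplified sketch; consult Warp's documentation for the authoritative API), a kernel written in Python is compiled and launched on the GPU:

```python
import warp as wp

wp.init()

@wp.kernel
def integrate(positions: wp.array(dtype=wp.vec3),
              velocities: wp.array(dtype=wp.vec3),
              dt: float):
    # One explicit Euler step under gravity for each particle, executed in parallel on the GPU.
    tid = wp.tid()
    velocities[tid] = velocities[tid] + wp.vec3(0.0, -9.8, 0.0) * dt
    positions[tid] = positions[tid] + velocities[tid] * dt

n = 1024
positions = wp.zeros(n, dtype=wp.vec3)
velocities = wp.zeros(n, dtype=wp.vec3)
wp.launch(integrate, dim=n, inputs=[positions, velocities, 1.0 / 60.0])
```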
Skill Training
Skill training frameworks define tasks, train policies, evaluate policy performance, and deploy models. This layer is relatively new but tightly coupled to the simulator beneath it. Isaac Lab, for instance, is built on the same simulators offered by NVIDIA, which is understandable given that it is part of the same ecosystem. However, to experiment with different models, a more modular skill training framework is needed. Such a framework needs an abstraction around the simulator (a.k.a. the world model), support for open-source models such as LeRobot, configuration management, native experiment tracking, and, more importantly, experiment analysis.
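A sketch of what that abstraction could look like (illustrative Python only; these interfaces are my assumptions, not an existing framework's API): the training loop talks to a simulator-or-world-model interface rather than to a specific engine, so backends can be swapped freely.

```python
from typing import Protocol, Tuple
import numpy as np

class WorldModelEnv(Protocol):
    """Anything that can reset to a scene and step an action: a physics simulator or a learned world model."""
    def reset(self, scene_config: dict) -> np.ndarray: ...
    def step(self, action: np.ndarray) -> Tuple[np.ndarray, float, bool, dict]: ...

def collect_rollout(env: WorldModelEnv, policy, scene_config: dict, max_steps: int = 500):
    # The skill-training loop depends only on the interface, not on the engine behind it.
    obs, trajectory = env.reset(scene_config), []
    for _ in range(max_steps):
        action = policy(obs)
        next_obs, reward, done, info = env.step(action)
        trajectory.append((obs, action, reward))
        obs = next_obs
        if done:
            break
    return trajectory
```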
Asset Generation
Asset generation is the artistic and historically most brittle part of robot learning. Scenes, materials, textures, lighting, and geometry all live here. A great tool for creating individual scenes and programmatically augmenting them is Replicator. Creating diverse, realistic assets has traditionally required specialized skill sets (graphics, simulation expertise) and significant manual effort. As a result, dataset diversity is bottlenecked by asset creation speed.
While physics simulators can support photorealistic rendering, asset pipelines remain difficult to scale. This has limited the diversity and realism of simulated training environments. Therefore, a solution that can easily scale scene creation and augment the diversity of assets is a natural step in getting over the data bottleneck in robot learning.
Why World Models Matter Now
We are at an inflection point. Learned world models allow us to collapse the boundary between physics simulation and asset generation.
World models are, in fact, neural simulators. They learn both dynamics and appearance from data. While they do not fully solve asset generation, they dramatically reduce the dependency on hand-authored scenes.
Today, there are two emerging approaches:
Synthetic-first: Generate an initial scene using an image generator, then use that image as the starting state for a world model rollout. This approach can be effective for long-tail problems or for pre- and post-training agents.
Real-first: Capture a real image, apply generative augmentation, then pass the result into a world model. This is an effective approach to harnessing the abundance of images on the internet.
Lucid Sim [14] generating simulated scenes.
Both approaches can generate effectively unbounded data for robot learning.
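The two recipes differ only in where the starting frame comes from. Here is a schematic sketch (all function names are placeholders for whatever image generator, augmentation pipeline, and learned world model you use):

```python
def synthetic_first(prompt, action_plan, image_generator, world_model):
    start_frame = image_generator(prompt)                  # imagine a long-tail scene from scratch
    return world_model.rollout(start_frame, action_plan)   # unroll learned dynamics from that frame

def real_first(camera_frame, action_plan, augment, world_model):
    varied_frame = augment(camera_frame)                   # relight, retexture, or restyle a real image
    return world_model.rollout(varied_frame, action_plan)  # unroll learned dynamics from the variation
```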
The Remaining Gap: Interaction and Control
One major issue remains: how we interact with these environments.
Even though we can generate many scenarios, interaction with and control of the world model remain a challenge. One approach today is neural trajectories: a series of pseudo actions extracted through an inverse dynamics model, or latent actions learned from unlabeled videos [12, 15]. The obvious next step is to augment trajectories in a way that remains feasible across different scenes and dynamics. In other words, trajectories should become templates that can be adapted, transformed, and recomposed, serving as structured priors. This is best viewed as exploiting human-generated trajectories, not merely replaying them.
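A minimal sketch of the neural-trajectory idea (PyTorch; the model and encodings are illustrative assumptions): an inverse dynamics model infers the pseudo action connecting consecutive frames, turning unlabeled video into (state, action) pairs that a policy or world model can consume.

```python
import torch
import torch.nn as nn

class InverseDynamicsModel(nn.Module):
    def __init__(self, obs_dim: int, action_dim: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * obs_dim, hidden), nn.GELU(), nn.Linear(hidden, action_dim))

    def forward(self, obs_t, obs_next):
        # Infer the (pseudo) action that explains the transition obs_t -> obs_next.
        return self.net(torch.cat([obs_t, obs_next], dim=-1))

def label_video(idm, frames):
    # frames: (T, obs_dim) encoded video frames -> (T-1, action_dim) pseudo actions.
    return idm(frames[:-1], frames[1:])
```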
Final Thoughts
World models are simulators without hand-built physics, internal predictive models for planning, testbeds for policy evaluation, and large-scale data augmentation and generation engines. We are seeing early results from groups such as 1X [16] and Wayve [17] that show the effectiveness of world models.
What’s missing today is a widely accessible world model suite that integrates:
Scene creation (real or synthetic)
Learned dynamics
Trajectory-conditioned interaction
Data augmentation and evaluation
The algorithms used in world models need to improve to better handle:
Single- and multi-agent settings
Long context
Memory and consistency
Self-recognition
Faster-than-real-time rollouts
That gap is what’s currently missing in robot learning.
A world model training workflow: Build scenes from real-world videos, add robots for interaction, and train in parallel with various textures, scenarios, and lighting.
About Nima Gard
I’m Director of AI in the industry, where I lead our Perception, Weld Intelligence, and Robot Learning teams. We build advanced AI automation that empowers legacy manufacturing technologies to see, understand, and weld without programming, powered by Obsidian, the first foundational AI model for welding. My current focus is on building the first world model for manufacturing.
References
[1] Intelligence, P., Amin, A., Aniceto, R., Balakrishna, A., Black, K., Conley, K., … & Zhou, Z. (2025). π*0.6: a VLA That Learns From Experience. arXiv preprint arXiv:2511.14759.
[2] https://x.com/DrJimFan/status/2005340845055340558
[3] Kumar, Akarsh, et al. “Questioning Representational Optimism in Deep Learning: The Fractured Entangled Representation Hypothesis.” arXiv preprint arXiv:2505.11581 (2025).
[4] Dieleman, Sander. “Generative Modelling in Latent Space.” Sander.ai, 15 Apr. 2025, https://sander.ai/2025/04/15/latents.html
[5] “Helix: A Vision-Language-Action Model for Generalist Humanoid Control.” Figure AI, 20 Feb. 2025, https://www.figure.ai/news/helix
[6] Bjorck, Johan, et al. “Gr00t n1: An open foundation model for generalist humanoid robots.” arXiv preprint arXiv:2503.14734 (2025).
[7] Generalist AI Team. “GEN-0 / Embodied Foundation Models That Scale with Physical Interaction.” Generalist AI Blog, 4 Nov. 2025, generalistai.com/blog/nov-04-2025-GEN-0
[8] Fedorenko E, Piantadosi ST, Gibson EAF. “Language is primarily a tool for communication rather than thought.” Nature. 2024 Jun;630(8017):575-586.
[9] Chomsky, Noam. What Kind of Creatures Are We? Columbia University Press, 2016.
[10] Ha, David, and Jürgen Schmidhuber. “World models.” arXiv preprint arXiv:1803.10122 2.3 (2018).
[11] E, M. More Thoughts from Understanding Comics by Scott McCloud, 2012. https://goo.gl/5Tndi4
[12] Bruce, Jake, et al. “Genie: Generative interactive environments.” Forty-first International Conference on Machine Learning. 2024.
[13] Assran, Mido, et al. “V-jepa 2: Self-supervised video models enable understanding, prediction and planning.” arXiv preprint arXiv:2506.09985 (2025).
[14] Yu, Alan, et al. “Learning visual parkour from generated images.” 8th Annual Conference on Robot Learning. 2024.
[15] Jang, Joel, et al. “DreamGen: Unlocking Generalization in Robot Learning through Video World Models.” arXiv preprint arXiv:2505.12705 (2025).
[16] AI Team. “1X World Model | From Video to Action: A New Way Robots Learn.” 1X, 12 Jan. 2026, www.1x.tech/discover/world-model-self-learning
[17] Wayve. “GAIA-3: Scaling World Models to Power Safety and Evaluation.” Wayve, 2 Dec. 2025, wayve.ai/thinking/gaia-3/