Research Essay · Neural Computers

By Mingchen Zhuge

Neural Computer: A New Machine Form Is Emerging

TL;DR: we are starting to expect the machine itself to learn how to run.

Neural Computer teaser diagram

If you have ever wondered whether AI might ultimately become a kind of computer, this essay is for you.

Over the past few decades, computers gradually became an important medium through which people get things done. In the last few years, AI has started to move into that role as well: it no longer just answers questions. AI systems now call tools, operate interfaces, and participate in real workflows. The question changes with it: do we want AI to use computers, or to become a kind of computer? This is also the question behind what I call the Neural Computer (NC).

Here, Neural Computer does not simply refer to the NTM / DNC line associated with Alex Graves[1][2], nor to newer hardware directions such as Taalas[15]. It goes beyond these earlier ideas. It is also not aiming at a stronger agent, a world model for computer environments, or an extra layer of intelligence added on top of conventional computers. What matters here is whether some of the responsibilities now carried by the program stack, toolchain, and control layer could gradually move into the runtime the model actually depends on.

I suspect this idea has crossed many people's minds over the past year, so I call it a pre-consensus.


1. Why now: a new machine form is starting to emerge

Three things are happening at once.

First, agents are getting better and better at real work. In 2023, MetaGPT, one of the early coding-agent prototypes[3], could barely produce a few hundred lines of code. By 2025, Cursor, Codex, and Claude Code had already become default productivity tools for many programmers. Today OpenClaw[4] has started bringing these systems to non-programmers too. The question is no longer whether an agent can occasionally pull off a task. The question is now whether it can enter real production and daily life and handle things for you reliably.

For agents, the main bottlenecks now are: (1) how to stay stable over long-horizon tasks, (2) how capabilities can be retained and reused, and (3) how workflows can be reused over time. The dominant path still adds structure on the scaffold or harness side: stronger memory, longer workflows, and tighter action loops, all in service of raising task completion rates. Push that further and the more aggressive path becomes recursive self-improvement: models training the next generation of models, agents continuously rewriting themselves[5].

Agents are making the transition from prototype experiments to professional productivity tools and, increasingly, to everyday infrastructure.[3][4]

Second, world models are getting better at modeling dynamic environments. Over the last year, projects such as GameNGen and Genie 2 / 3 have made more people believe that a model can do more than represent the current state. It can also maintain an internal structure for what is likely to happen next. More importantly, this ability has already entered a few real closed loops. This is especially true in corner cases that are hard to collect repeatedly and cheaply in the real world. In those settings, rollout is already being used directly for prediction, planning, control, and training. Along this trajectory, from Jürgen Schmidhuber's 1990 vision in Making the World Differentiable[6], to the 2018 paper World Models[7], and now to Waymo using world models in autonomous-driving simulation and training[8][9], this line is already entering concrete system roles in simulation, training, and interactive environment generation.

A world model is no longer just used to represent the world. It is also being used to generate possible future states and feed planning or action. Today this line has already split into several recognizable directions. In autonomous driving and physical AI, world models act as simulation and synthetic-data engines for expensive, dangerous, or rare slices of the real world, as in Waymo World Model and NVIDIA Cosmos[8][10]. In spatial intelligence, they target 3D worlds that can be generated, entered, and interacted with persistently, such as World Labs' Marble[11]. On the more real-time interactive side, generative models are moving from static content generation toward controllable, explorable environments, with examples such as GameNGen's real-time neural simulation of DOOM[12] and Google DeepMind's Genie 2 / Genie 3[13][14]. These directions look different on the surface, but they address the same underlying problem: how to learn, inside the system itself, the rules by which environments evolve through time, action, and constraint.

From 1990 to 2018 to today: world models evolved from early ideas of differentiable-world modeling to system-level use in simulation and training, exemplified by Waymo World Model.[6][7][8][9]

Third, conventional computers are starting to show more obvious structural friction in the age of AI. More and more tasks today are open-ended, long-horizon, and continuously interactive. That is exactly where the traditional software stack begins to feel heavy. Its stability is still a real advantage, but in settings dominated by natural language, demonstrations, interface operations, and weak constraints, the cost of organizing and driving the task keeps going up.

Conventional computers are already rewriting their own substrate for AI. Chips, compilers, memory systems, and software stacks are all becoming more model-friendly. Most of these changes, however, still happen inside the existing computational paradigm: they make the old machine better for AI, without redefining what the machine is. In that trend, projects like Taalas push a little further by turning specific models into deployment units of their own. The model is no longer just a payload running on the machine; hardware itself begins to organize more directly around the model[15]. Even so, that is still a deployment-level change. It is not yet a new general machine form.

Taken together, these three shifts point to the same question.

If agents are getting better at real work, world models are getting better at internal simulation, and conventional computers are already rebuilding their substrate for AI, could there be a new runtime that brings execution, rollout, and capability retention into the same learning machine? Seen this way, the main human-machine relationship shifts. In the conventional era, people mainly interact with computers. In the agent era, they increasingly interact with agents, which then call the computer on their behalf. World models occupy a parallel position: they can serve humans or agents, but they do not themselves close the loop of getting work done. NC asks whether some of the responsibilities now split across computers, agents, and world models can be drawn back into the same learning machine. At that point, the object in front of the user would no longer be an agent using a computer for them. It would be a Neural Computer.

How the human-machine relation changes: in the conventional era, the relation looked more like Human → Computer; in the agent era, it looks more like Human → Agent → Computer, while World Model appears more as a parallel predictive layer; if NC matures, humans would face a Neural Computer more directly.

This also means that interaction starts to look a little more like programming. Today, natural-language instructions, keyboard and mouse traces, screen transitions, and task feedback are mostly logs of what happened. Under the NC framing, they may become inputs that shape later behavior. Today we install capabilities mainly through code. Later, demonstrations, interaction traces, and constraints may themselves become ways for capabilities to enter runtime.


2. What is a Neural Computer, and what would count as it actually working?

Start with this table. It places conventional computers, agents, world models, and Neural Computers on the same scale. It makes the differences easier to see: what each one organizes around, where its source of truth lives, and what role it mainly plays.

| Form | Organized around | Where the source of truth lives | Main role |
| --- | --- | --- | --- |
| Conventional computer | Explicit programs | Explicit programs and explicit state | Reliably execute explicit programs |
| Agent | Tasks | External environments, toolchains, and workflows | Complete tasks inside an existing environment |
| World Model | Environments | State-evolution models | Predict and simulate environmental change |
| Neural Computer | Runtime | Capabilities and state inside runtime | Keep the machine running, accumulate capabilities, and govern updates |

Now imagine what using an NC would actually feel like. With a conventional computer, you install software. With an agent, you describe the task. With an NC, what you do is closer to installing capabilities into the machine itself, and expecting them to remain there afterward.

That is why runtime here does not mean a particular software component. It means the layer that keeps a system recognizably the same machine over time: what gets to stay, what pushes state forward, what kinds of input truly change the machine, and what kinds of change amount to rewriting it. For NC, the practical question is not whether we can add yet another external layer, but whether capabilities and state can actually enter the same learned runtime.

If it works, what might the machine actually look like?

First, it may not keep growing along today's foundation-model path. The default instinct today is to keep pushing toward stronger dense or MoE foundation models in roughly the 1B-10T range, and a great deal of progress will continue to happen that way. My own guess is that a mature NC points toward a different substrate: something more like a 10T-1000T machine that is sparser, more addressable, and a little more circuit-like. A future Completely Neural Computer (CNC) may look less like an ever-denser cloud of continuous representations and more like a composable, routable substrate whose parts can be inspected locally. It may borrow less from brains or animal perception than people expect, and more from the logic of a NAND-style machine: discrete, sparse, and locally verifiable. That path is still far from developed. Recent work such as OpenAI's research on weight-sparse transformers is one sign of it, but the underlying idea is much older and richer in AI, especially in RL, where sparse structure, local specialization, and routing have long mattered for how systems learn and act[16].

Second, it may not always upgrade itself by globally changing parameters. On today's path, the natural upgrade cycle is still to train a larger dense or MoE model and swap in a new block of weights. NC points to a different path: through sustained interaction, runtime may gradually acquire new internal structures. User inputs stop looking like one-shot triggers and start acting more like ways of installing, invoking, composing, and preserving reusable neural routines, perhaps even internal executors that can be called again later. Functionally, that is closer to memory than to a processor. Upgrading the machine would no longer always mean rewriting the whole thing; it could mean writing new structures into an internal state that is addressable, callable, and persistent. In that picture, progress no longer looks like swapping in a larger model, but like continuously adding new components into the machine. Older ideas such as NPI and HyperNetworks can be read as suggestive precursors here: the former tried to decompose complex programs into callable, composable subprograms[17]; the latter hinted that machines might generate downstream neural modules to extend their own capability boundary[18]. Taken far enough, a strong Neural Computer could generate new sub-networks directly and attach them internally in a plug-and-play way, much as we install or uninstall software today, but without handwritten code and compilation as intermediaries.

Third, it may gradually pull world-model-style rollout into runtime itself. At that point, rollout becomes part of the machine's normal operating process. People may provide an input and an expected output, or simply specify evaluation criteria ahead of time. In some rounds they may provide nothing at all, and runtime could still continue with internal self-play, self-testing, candidate filtering, and compression, then turn useful improvements into the next round of capability updates. The change is not just that more context is stored; the internal capability structure itself is updated. None of this implies silent, unguided drift; the entire update path has to remain governable.

At that point, the idea of NC as a machine form becomes easier to describe. The core test is whether capabilities can really enter runtime, and whether they can be installed, reused, executed, and governed there. CNC is the name for the state in which that project is genuinely completed. In the original paper, an NC instance counts as a CNC only if it satisfies four conditions at once: it must be Turing complete, universally programmable, behavior-consistent unless explicitly reprogrammed, and it must exhibit architecture and programming semantics native to NC rather than inherited from conventional computers. The table below restates those four requirements more directly.

| CNC condition | Meaning | What we would probably need to see in engineering terms |
| --- | --- | --- |
| Turing complete | It should not be limited to a few fixed task types; in principle, it should be able to express general computation. | But expressivity alone is not enough. The real test is whether the same NC can stably carry longer and more complex algorithmic processes as effective memory and context grow, rather than simply failing in a different way when tasks get longer. |
| Universally programmable | Inputs should not just trigger one-off behavior; they should be installable as routines or internal executors that can be invoked again later. | Capabilities should be installable, callable, composable, and retainable, and once they enter runtime they should remain reusable across tasks. |
| Behavior-consistent | Ordinary use should not silently mutate the machine. Behavioral change should only come from explicit updates. | Behavior should be reproducible within the same version; execution and update traces should be trackable; failures should support replay and rollback; long-term drift should be measurable and governable. |
| Machine-native semantics | It should not merely imitate old computers with neural nets; it should begin to form its own machine semantics and its own way of being programmed. | The neural substrate should gain capabilities through composition, routing, continuous state, and internal execution structures that conventional stacks are poor at; meanwhile, instructions, demonstrations, traces, and constraints themselves begin to act as programming inputs alongside handwritten code. |
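
The behavior-consistent condition can be made concrete with a toy sketch. The Python below is only an illustration under stated assumptions: `ToyRuntime`, `install`, `invoke`, `rollback`, and `replay` are hypothetical names, and plain Python callables stand in for the learned routines an NC would hold. It shows ordinary use leaving the machine unchanged, explicit updates creating new versions, and rollback restoring an earlier one.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class ToyRuntime:
    """Hypothetical stand-in for an NC runtime; callables play the role of learned routines."""
    # Each version is a snapshot of the installed capabilities.
    versions: List[Dict[str, Callable[[int], int]]] = field(default_factory=lambda: [{}])
    # Execution trace: (version, capability name, input, output), kept for replay.
    trace: List[Tuple[int, str, int, int]] = field(default_factory=list)

    @property
    def version(self) -> int:
        return len(self.versions) - 1

    def install(self, name: str, routine: Callable[[int], int]) -> int:
        """Explicit update: the only path that changes behavior. Returns the new version."""
        snapshot = dict(self.versions[-1])
        snapshot[name] = routine
        self.versions.append(snapshot)
        return self.version

    def invoke(self, name: str, x: int) -> int:
        """Ordinary use: reads the current snapshot and never mutates it."""
        y = self.versions[-1][name](x)
        self.trace.append((self.version, name, x, y))
        return y

    def rollback(self, version: int) -> None:
        """Discard every snapshot newer than `version`."""
        if not 0 <= version <= self.version:
            raise ValueError("unknown version")
        del self.versions[version + 1:]

    def replay(self) -> bool:
        """Re-run traced calls from surviving versions and compare outputs."""
        return all(
            self.versions[v][name](x) == y
            for v, name, x, y in self.trace
            if v <= self.version
        )


rt = ToyRuntime()
v1 = rt.install("double", lambda x: 2 * x)
before = rt.invoke("double", 21)
v2 = rt.install("double", lambda x: 2 * x + 1)  # explicit reprogramming
after = rt.invoke("double", 21)
rt.rollback(v1)
restored = rt.invoke("double", 21)
```

In this toy, calling `invoke` any number of times never changes what the machine does; only `install` and `rollback` do. A real NC would need the same separation to hold for learned internal structure, which is a much harder problem than snapshotting a dictionary.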

3. The paper's prototype: what it shows, and what is still missing

My estimate is that a real Neural Computer is still about three years away. Relative to the NC I actually have in mind, the work in our paper is still an early step. For now, I think the most convenient unified container is the class of neural architectures built for video generation and world modeling; if the goal is to put pixels, actions, and temporal rollout into the same end-to-end prototype, they are also the fastest path. What we are using them to validate is only a subset of NC's key capabilities. They are better read as transitional prototypes than as NC's final structure; reaching CNC would still require a much deeper rebuild from the bottom up.

GUIWorld takes the same question from CLI into full GUI. At this point the main issue is no longer text and commands, but real keyboard-and-mouse actions: the cursor has to land correctly, hovering has to trigger feedback, clicks have to change buttons, dropdowns, modals, and text fields in the right way, and keyboard input has to push the interface forward frame by frame.

The data setup here is already a fairly complete interaction rig. We fixed the environment to Ubuntu 22.04 with XFCE4, 1024×768 resolution, and 15 FPS capture (following the setup recommended by NeuralOS[35]), then built the full pipeline for desktop execution, recording, and action replay so that every click, hover, input, and interface change could be recorded stably. The dataset has three parts: roughly 1,000 hours of Random Slow, roughly 400 hours of Random Fast, and roughly 110 hours of real goal-directed trajectories driven by Claude CUA. The first two probe how open-world noise such as mouse acceleration, pauses, hovering, and window switching affects the model. The third gives cleaner action-response pairs and asks a simpler question: after this action, does the interface actually make the right next move?
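
For a sense of scale, the capture settings above translate into frame counts with simple arithmetic. The sketch below is a back-of-the-envelope calculation, not the paper's tooling: the `CaptureConfig`, `SPLIT_HOURS`, and `frames_for` names are hypothetical, the hour figures are the approximate ones quoted above, and it assumes every recorded hour was captured at a constant 15 FPS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaptureConfig:
    """Capture settings quoted in the text; the class itself is illustrative."""
    os_name: str = "Ubuntu 22.04"
    desktop: str = "XFCE4"
    width: int = 1024
    height: int = 768
    fps: int = 15


# Approximate hours per split, as stated in the text.
SPLIT_HOURS = {
    "random_slow": 1000,
    "random_fast": 400,
    "claude_cua": 110,
}


def frames_for(hours: float, cfg: CaptureConfig) -> int:
    """Frames recorded in `hours` of capture, assuming a constant frame rate."""
    return int(hours * 3600 * cfg.fps)


cfg = CaptureConfig()
frames = {name: frames_for(h, cfg) for name, h in SPLIT_HOURS.items()}
random_hours = SPLIT_HOURS["random_slow"] + SPLIT_HOURS["random_fast"]
total_hours = sum(SPLIT_HOURS.values())
goal_directed_share = SPLIT_HOURS["claude_cua"] / total_hours

for name, count in frames.items():
    print(f"{name}: ~{count:,} frames")
print(f"random data: ~{random_hours} h; goal-directed share: {goal_directed_share:.1%}")
```

Under these assumptions, the 110 goal-directed hours come to about 5.9 million frames, or roughly 7% of all recorded hours.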

On the model side, we did not try just one action-injection scheme. We trained four variants in parallel. The main difference between them is not whether they receive actions at all, but how deeply actions enter the trunk and where they begin to participate in state evolution. Figure 7 in the paper lays out the four designs clearly:

Figure 7. Four ways of injecting GUI actions into the diffusion transformer. These correspond to Models 1 through 4 in the table below.
| Model | Name | Injection mode | Related line |
| --- | --- | --- | --- |
| Model 1 | External | Input-side latent modulation | Shallow action-conditioned baseline |
| Model 2 | Contextual | Action tokens merged into the main sequence | WHAM[33] |
| Model 3 | Residual | Injected through a side residual branch | ControlNet[34] |
| Model 4 | Internal | Action cross-attention inside each block | Matrix-Game 2.0[32] |

From the final results (details omitted here): among the four designs, Model 4 works best. In GUI environments with fine-grained timing and local interaction, injecting actions directly inside the block is the most effective way to teach the backbone how the interface should continue after an action. The data story is just as clear: 110 hours of supervised data beat roughly 1,400 hours of random data, and explicit visual supervision of the cursor works far better than pure coordinate supervision. The practical takeaway is straightforward: progress on GUI depends on stronger action semantics, clearer state transitions, and treating the cursor as a visual object to supervise.
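
To make "action cross-attention inside each block" more tangible, here is a minimal pure-Python sketch of the general technique. It is not the paper's Model 4 implementation: `action_cross_attention` and `toy_block` are hypothetical names, the token values are made up, learned query/key/value projections are replaced by the identity, and the self-attention and MLP parts of a real block are omitted.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def action_cross_attention(frame_tokens: List[Vector], action_tokens: List[Vector]) -> List[Vector]:
    """Each frame token queries the action tokens and adds the result residually.

    Assumption: identity projections stand in for learned Q/K/V weights.
    """
    dim = len(frame_tokens[0])
    out = []
    for q in frame_tokens:
        # Scaled dot-product scores between this frame token and every action token.
        weights = softmax([dot(q, k) / math.sqrt(dim) for k in action_tokens])
        attended = [
            sum(w * v[i] for w, v in zip(weights, action_tokens))
            for i in range(dim)
        ]
        out.append([x + a for x, a in zip(q, attended)])  # residual connection
    return out


def toy_block(frame_tokens: List[Vector], action_tokens: List[Vector]) -> List[Vector]:
    """One hypothetical trunk block containing only the action cross-attention step."""
    return action_cross_attention(frame_tokens, action_tokens)


frames = [[1.0, 0.0], [0.0, 1.0]]
actions = [[4.0, 0.0], [0.0, 4.0]]  # made-up embeddings for two actions

first = toy_block(frames, actions)
state = frames
for _ in range(2):  # actions re-enter at every block, not only at the input
    state = toy_block(state, actions)
```

The point of the sketch is the loop at the end: the same action tokens are consulted inside every block, so they can influence state evolution at each depth instead of only modulating the input once.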

Very few people initially expected video models to handle computer scenes this discrete, text-heavy, and action-sensitive. But once the task and data are organized well, they already produce interesting results on interface rendering, page transitions, short-term state continuation, local interaction, execution echo, and even some very early signs of working memory. Video models are still nowhere near the endpoint, but as an early prototype container they are already good enough to turn several otherwise abstract NC questions into concrete ones.

3.4 From prototype NC to CNC: what is still missing?

If we bring back the CNC condition table from Section 2, the overall conclusion about the current prototype is fairly clear: Turing complete has only been touched at the edge, universally programmable has barely appeared as an entry point, behavior-consistent holds only locally in controlled settings, and machine-native semantics is still clearer as a direction than as a result. NC is not about stacking agents, world models, and conventional computers on top of one another. It is about pulling some of the responsibilities now scattered across those objects back into the same learned runtime. What matters about the prototype is not that it is close to the endpoint, but that it makes several key constraints visible early.


4. If Neural Computer takes hold, software, hardware, and even “programs” will change

Put more plainly, Neural Computer is one view of what the next generation of computers may become. The most direct competition is likely to come from personalized super agents with strong memory, strong tool use, and persistent online presence. The table below places the three side by side.

Quick read: start with “what you actually get,” “how experience accumulates,” and “what gets installed.”

| | Conventional Computer | Personalized Super Agent | Completely Neural Computer |
| --- | --- | --- | --- |
| Basic positioning | | | |
| What you actually get | A machine that precisely executes the programs you write | A persistent agent with strong memory and strong tool use that handles things on your behalf | A machine shaped by accumulated experience, with more capabilities staying inside the system |
| Organized around | Explicit programs | Task flow. Persistent operation, but capability still comes from the external stack | Runtime. Persistent operation, with capabilities themselves inside the machine |
| How experience accumulates | You manually translate it into code, configuration, and rules | It gets written into memory, vector stores, workflows, skill files, MCPs, and prompts, then retrieved, injected, and orchestrated next time | It enters runtime directly and starts participating in later execution, instead of staying in an external retrieval layer |
| Installation and evolution | | | |
| What gets installed | Software, libraries, scripts, and services | Tools, workflows, memory entries, skill descriptions | Capabilities themselves, along with installable, callable, composable sub-NNs |
| How it evolves | Through abstraction, interfaces, and program reuse; the machine itself changes little | Through foundation-model generalization and ongoing interaction; the system improves along the external stack | Through runtime updates and ongoing interaction; the machine changes along its internal capability structure |
| Substrate form | N/A | Closer to today's path: dense or MoE foundation models in the 1B-10T range | Closer to a next-generation substrate: a 10T-1000T machine that is sparser, more addressable, and more circuit-like |
| Position in the stack | | | |
| Where it sits in the AI stack | Mainly the chips / infrastructure layer | Mainly spans the models and applications layers | Most directly changes the boundary between models and applications, and may push parts of infrastructure to reorganize around runtime |
| Current maturity | Fully mature. Backed by 70+ years of engineering and still the substrate of most systems | Already usable, and likely to keep improving quickly. Systems like Claude, Cursor, and OpenClaw already show the early form | The direction is plausible, and formal prototypes have appeared, but nothing close to a usable prototype yet. The four conditions of Completely Neural Computer are still unmet |

If CNC really works, the first things to change would be what gets delivered and how the stack is organized. Today what gets installed is still software, tools, workflows, and memory entries. On the NC path, what gets installed would look more like capability itself. Code would still matter, but it would stop being the only doorway in. Instructions, demonstrations, interaction traces, and constraints would begin to do part of the installation work themselves. Even the word “program” would start to shift: it would no longer mean only a block of code, but a capability object that can be installed, composed, versioned, and updated over time.

From there the change would propagate into the stack and into the boundary of the machine itself. Software layout, hardware interfaces, update governance, and debugging would increasingly reorganize around the same continuously running machine. Phones, browsers, IDEs, and terminals would still remain, but they would feel more like different windows into that same machine. In the end, the change would reach not only the tool stack, but also the meaning of the word “computer.”

Note and acknowledgements: the content and views in this essay represent Mingchen Zhuge alone. Thanks to Wenyi Wang, Haozhe Liu, Shuming Liu, Yuandong Tian, Dylan R. Ashley, and Yutian Deng for thoughtful review comments. Some figures and materials are adapted from the original paper and related public sources.

References

If you want to cite this piece, you can use either the arXiv entry or the blog entry below.

arXiv BibTeX

@misc{zhuge2026neuralcomputers,
  title         = {Neural Computers},
  author        = {Mingchen Zhuge and Changsheng Zhao and Haozhe Liu and Zijian Zhou and Shuming Liu and Wenyi Wang and Ernie Chang and Gael Le Lan and Junjie Fei and Wenxuan Zhang and Yasheng Sun and Zhipeng Cai and Zechun Liu and Yunyang Xiong and Yining Yang and Yuandong Tian and Yangyang Shi and Vikas Chandra and J{\"u}rgen Schmidhuber},
  year          = {2026},
  eprint        = {2604.06425},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG},
  url           = {https://arxiv.org/abs/2604.06425}
}

Blog BibTeX

@online{zhuge2026neuralcomputerblog,
  author  = {Mingchen Zhuge},
  title   = {Neural Computer: A New Machine Form Is Emerging},
  year    = {2026},
  month   = apr,
  day     = {7},
  url     = {https://metauto.ai/neuralcomputer/},
  note    = {Research essay},
  urldate = {2026-04-09}
}

Reference List