Research Essay · Neural Computers

By Mingchen Zhuge · Published April 7, 2026 · Updated April 7, 2026

Neural Computer: A New Machine Form Is Emerging

TL;DR: we are starting to expect the machine itself to learn how to run.

Paper (arXiv) · GitHub · Chinese Version (中文版)

If you'd like to continue the conversation, feel free to reach out via:

Email: mczhuge [AT] gmail.com
X: @MingchenZhuge
WeChat

If you have ever wondered whether AI might ultimately become a kind of computer, this essay is for you.

Over the past few decades, computers gradually became an important medium through which people get things done. In the last few years, AI has started to move into that role as well: it no longer just answers questions. AI systems now call tools, operate interfaces, and participate in real workflows. The question changes with it: do we want AI to use computers, or to become a kind of computer? This is also the question behind what I call the Neural Computer (NC).

Here, Neural Computer does not simply refer to the NTM / DNC line associated with Alex Graves[1][2], nor to newer hardware directions such as Taalas[15]. It goes beyond these earlier ideas. It is also not aiming at a stronger agent, a world model for computer environments, or an extra layer of intelligence added on top of conventional computers. What matters here is whether some of the responsibilities now carried by the program stack, toolchain, and control layer could gradually move into the runtime the model actually depends on.

I suspect this idea has crossed many people's minds over the past year, so I call it a pre-consensus.

Big picture

Neural Computer (NC) asks whether models can start taking on some runtime responsibilities that still belong to the machine itself.
Conventional computers organize around explicit programs, agents around tasks, world models around environments, and NC around runtime.
Completely Neural Computer (CNC) is the completed form of NC.
Current prototypes already show early hints of runtime primitives.
If capabilities can enter runtime and remain installable, reusable, and governable there, the Neural Computer could change what we mean by a computer.

1. Why now: a new machine form is starting to emerge

Three things are happening at once.

First, agents are getting better and better at real work. In 2023, MetaGPT, one of the early coding-agent prototypes[3], could barely produce a few hundred lines of code. By 2025, Cursor, Codex, and Claude Code had already become default productivity tools for many programmers. Today OpenClaw[4] has started bringing these systems to non-programmers too. The question is no longer whether an agent can occasionally pull off a task. The question is now whether it can enter real production and daily life and handle things for you reliably.

For agents, the main bottlenecks now are: (1) how to stay stable over long-horizon tasks, (2) how capabilities can be retained and reused, and (3) how workflows can be reused over time. The dominant path still adds structure on the scaffold or harness side: stronger memory, longer workflows, and tighter action loops, all in service of raising task completion rates. Push that further and the more aggressive path becomes recursive self-improvement: models training the next generation of models, agents continuously rewriting themselves[5].

Figures: MetaGPT paper front page; OpenClaw repository popularity chart. Agents are making the transition from prototype experiments to professional productivity tools and, increasingly, to everyday infrastructure.[3][4]

Second, world models are getting better at modeling dynamic environments. Over the last year, projects such as GameNGen and Genie 2 / 3 have made more people believe that a model can do more than represent the current state. It can also maintain an internal structure for what is likely to happen next. More importantly, this ability has already entered a few real closed loops. This is especially true in corner cases that are hard to collect repeatedly and cheaply in the real world. In those settings, rollout is already being used directly for prediction, planning, control, and training. Along this trajectory, from Jürgen Schmidhuber's 1990 vision in Making the World Differentiable[6], to the 2018 paper World Models[7], and now to Waymo using world models in autonomous-driving simulation and training[8][9], this line is already entering concrete system roles in simulation, training, and interactive environment generation.

A world model is no longer just used to represent the world. It is also being used to generate possible future states and feed planning or action. Today this line has already split into several recognizable directions. In autonomous driving and physical AI, world models act as simulation and synthetic-data engines for expensive, dangerous, or rare slices of the real world, as in Waymo World Model and NVIDIA Cosmos[8][10]. In spatial intelligence, they target 3D worlds that can be generated, entered, and interacted with persistently, such as World Labs' Marble[11]. On the more real-time interactive side, generative models are moving from static content generation toward controllable, explorable environments, with examples such as GameNGen's real-time neural simulation of DOOM[12] and Google DeepMind's Genie 2 / Genie 3[13][14]. These directions look different on the surface, but they address the same underlying problem: how to learn, inside the system itself, the rules by which environments evolve through time, action, and constraint.

Figures: diagram from Jürgen Schmidhuber's 1990 Making the World Differentiable paper; World Models paper front page. From 1990 to 2018 to today: world models evolved from early ideas of differentiable-world modeling to system-level use in simulation and training, exemplified by Waymo World Model.[6][7][8][9]

Third, conventional computers are starting to show more obvious structural friction in the age of AI. More and more tasks today are open-ended, long-horizon, and continuously interactive. That is exactly where the traditional software stack begins to feel heavy. Its stability is still a real advantage, but in settings dominated by natural language, demonstrations, interface operations, and weak constraints, the cost of organizing and driving the task keeps going up.

Conventional computers are already rewriting their own substrate for AI. Chips, compilers, memory systems, and software stacks are all becoming more model-friendly. Most of these changes, however, still happen inside the existing computational paradigm: they make the old machine better for AI, without redefining what the machine is. In that trend, projects like Taalas push a little further by turning specific models into deployment units of their own. The model is no longer just a payload running on the machine; hardware itself begins to organize more directly around the model[15]. Even so, that is still a deployment-level change. It is not yet a new general machine form.

Taken together, these three shifts point to the same question.

If agents are getting better at real work, world models are getting better at internal simulation, and conventional computers are already rebuilding their substrate for AI, could there be a new runtime that brings execution, rollout, and capability retention into the same learning machine? Seen this way, the main human-machine relationship shifts. In the conventional era, people mainly interact with computers. In the agent era, they increasingly interact with agents, which then call the computer on their behalf. World models occupy a parallel position: they can serve humans or agents, but they do not themselves close the loop of getting work done. NC asks whether some of the responsibilities now split across computers, agents, and world models can be drawn back into the same learning machine. At that point, the object in front of the user would no longer be an agent using a computer for them. It would be a Neural Computer.

How the human-machine relation changes: in the conventional era, the relation looked more like Human → Computer; in the agent era, it looks more like Human → Agent → Computer, while World Model appears more as a parallel predictive layer; if NC matures, humans would face a Neural Computer more directly.

This also means that interaction starts to look a little more like programming. Today, natural-language instructions, keyboard and mouse traces, screen transitions, and task feedback are mostly logs of what happened. Under the NC framing, they may become inputs that shape later behavior. Today we install capabilities mainly through code. Later, demonstrations, interaction traces, and constraints may themselves become ways for capabilities to enter runtime.

2. What is a Neural Computer, and what would count as it actually working?

Start with this table. It places conventional computers, agents, world models, and Neural Computers on the same scale. It makes the differences easier to see: what each one organizes around, where its source of truth lives, and what role it mainly plays.

Form	Organized around	Where the source of truth lives	Main role
Conventional computer	Explicit programs	Explicit programs and explicit state	Reliably execute explicit programs
Agent	Tasks	External environments, toolchains, and workflows	Complete tasks inside an existing environment
World Model	Environments	State-evolution models	Predict and simulate environmental change
Neural Computer	Runtime	Capabilities and state inside runtime	Keep the machine running, accumulate capabilities, and govern updates

Now imagine what using an NC would actually feel like. With a conventional computer, you install software. With an agent, you describe the task. With an NC, what you do is closer to installing capabilities into the machine itself, and expecting them to remain there afterward.

That is why runtime here does not mean a particular software component. It means the layer that keeps a system recognizably the same machine over time: what gets to stay, what pushes state forward, what kinds of input truly change the machine, and what kinds of change amount to rewriting it. For NC, the practical question is not whether we can add yet another external layer, but whether capabilities and state can actually enter the same learned runtime.

If it works, what might the machine actually look like?

First, it may not keep growing along today's foundation-model path. The default instinct today is to keep pushing toward stronger dense or MoE foundation models in roughly the 1B-10T range, and a great deal of progress will continue to happen that way. My own guess is that a mature NC points toward a different substrate: something more like a 10T-1000T machine that is sparser, more addressable, and a little more circuit-like. A future CNC may look less like an ever-denser cloud of continuous representations and more like a composable, routable substrate whose parts can be inspected locally. It may borrow less from brains or animal perception than people expect, and more from the logic of a NAND-style machine: discrete, sparse, and locally verifiable. That path is still far from developed. Recent work such as OpenAI's research on weight-sparse transformers is one sign of it, but the underlying idea is much older and richer in AI, especially in RL, where sparse structure, local specialization, and routing have long mattered for how systems learn and act[16].

Second, it may not always upgrade itself by globally changing parameters. On today's path, the natural upgrade cycle is still to train a larger dense or MoE model and swap in a new block of weights. NC points to a different path: through sustained interaction, runtime may gradually acquire new internal structures. User inputs stop looking like one-shot triggers and start acting more like ways of installing, invoking, composing, and preserving reusable neural routines, perhaps even internal executors that can be called again later. Functionally, that is closer to memory than to a processor. Upgrading the machine would no longer always mean rewriting the whole thing; it could mean writing new structures into an internal state that is addressable, callable, and persistent. In that picture, progress no longer looks like swapping in a larger model, but like continuously adding new components into the machine. Older ideas such as NPI and HyperNetworks can be read as suggestive precursors here: the former tried to decompose complex programs into callable, composable subprograms[17]; the latter hinted that machines might generate downstream neural modules to extend their own capability boundary[18]. Taken far enough, a strong Neural Computer could generate new sub-networks directly and attach them internally in a plug-and-play way, much as we install or uninstall software today, but without handwritten code and compilation as intermediaries.

Third, it may gradually pull world-model-style rollout into runtime itself. At that point, rollout becomes part of the machine's normal operating process. People may provide an input and an expected output, or simply specify evaluation criteria ahead of time. In some rounds they may provide nothing at all, and runtime could still continue with internal self-play, self-testing, candidate filtering, and compression, then turn useful improvements into the next round of capability updates. The change is not just that more context is stored; the internal capability structure itself is updated. None of this implies silent, unguided drift; the entire update path has to remain governable.
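To make the "self-testing, candidate filtering, governed update" idea slightly more concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption rather than anything from the paper: the names `passes`, `propose_update`, and `governed_commit` are hypothetical, routines are reduced to plain integer functions, and the "evaluation criteria" are reduced to input/expected-output pairs.

```python
from typing import Callable, Optional, Sequence, Tuple

Routine = Callable[[int], int]   # toy stand-in for an internal capability
Example = Tuple[int, int]        # (input, expected output) supplied ahead of time


def passes(candidate: Routine, examples: Sequence[Example]) -> bool:
    """Self-test: a candidate must reproduce every expected output."""
    return all(candidate(x) == y for x, y in examples)


def propose_update(candidates: Sequence[Routine],
                   examples: Sequence[Example]) -> Optional[Routine]:
    """Candidate filtering: return the first candidate that passes, if any."""
    for candidate in candidates:
        if passes(candidate, examples):
            return candidate
    return None


def governed_commit(current: Routine,
                    proposal: Optional[Routine],
                    approve: Callable[[Routine], bool]) -> Routine:
    """Governance gate: behavior changes only if a proposal is approved."""
    if proposal is not None and approve(proposal):
        return proposal
    return current


examples = [(1, 2), (3, 6)]
candidates = [lambda x: x + 1, lambda x: 2 * x]   # only the second fits both examples
current = lambda x: x

proposal = propose_update(candidates, examples)
kept = governed_commit(current, proposal, approve=lambda r: False)
updated = governed_commit(current, proposal, approve=lambda r: True)
print(kept(3), updated(3))
```

The point of the split is that passing the self-test is not enough to change behavior: the candidate only replaces the current routine when the explicit approval step agrees.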

At that point, the idea of NC as a machine form becomes easier to describe. The core test is whether capabilities can really enter runtime, and whether they can be installed, reused, executed, and governed there. CNC is the name for the state in which that project is genuinely completed. In the original paper, an NC instance counts as a CNC only if it satisfies four conditions at once: it must be Turing complete, universally programmable, behavior-consistent unless explicitly reprogrammed, and it must exhibit architecture and programming semantics native to NC rather than inherited from conventional computers. The table below restates those four requirements more directly.

CNC condition	Meaning	What we would probably need to see in engineering terms
Turing complete	It should not be limited to a few fixed task types; in principle, it should be able to express general computation.	But expressivity alone is not enough. The real test is whether the same NC can stably carry longer and more complex algorithmic processes as effective memory and context grow, rather than simply failing in a different way when tasks get longer.
Universally programmable	Inputs should not just trigger one-off behavior; they should be installable as routines or internal executors that can be invoked again later.	Capabilities should be installable, callable, composable, and retainable, and once they enter runtime they should remain reusable across tasks.
Behavior-consistent	Ordinary use should not silently mutate the machine. Behavioral change should only come from explicit updates.	Behavior should be reproducible within the same version; execution and update traces should be trackable; failures should support replay and rollback; long-term drift should be measurable and governable.
Machine-native semantics	It should not merely imitate old computers with neural nets; it should begin to form its own machine semantics and its own way of being programmed.	The neural substrate should gain capabilities through composition, routing, continuous state, and internal execution structures that conventional stacks are poor at; meanwhile, instructions, demonstrations, traces, and constraints themselves begin to act as programming inputs alongside handwritten code.
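The "universally programmable" and "behavior-consistent" rows can be illustrated with a toy example. The sketch below is not the paper's design: `ToyRuntime` and its methods are hypothetical names, and capabilities are ordinary Python callables instead of learned structures. It only shows the contract described above: installs are the sole path that changes behavior, ordinary use leaves a trace but mutates nothing, and an update can be rolled back.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Routine = Callable[[int], int]


@dataclass
class ToyRuntime:
    """Hypothetical stand-in for an NC runtime (an assumption, not the paper's)."""
    routines: Dict[str, Routine] = field(default_factory=dict)
    version: int = 0
    # Snapshots of the routine table taken before each explicit update.
    history: List[Dict[str, Routine]] = field(default_factory=list)
    # (version, routine name, argument, result) for every invocation.
    trace: List[Tuple[int, str, int, int]] = field(default_factory=list)

    def install(self, name: str, routine: Routine) -> int:
        """Explicit update: the only operation that changes behavior."""
        self.history.append(dict(self.routines))
        self.routines[name] = routine
        self.version += 1
        return self.version

    def invoke(self, name: str, arg: int) -> int:
        """Ordinary use: recorded in the trace, never mutates the routine table."""
        result = self.routines[name](arg)
        self.trace.append((self.version, name, arg, result))
        return result

    def rollback(self) -> int:
        """Restore the routine table from before the latest update."""
        self.routines = self.history.pop()
        self.version -= 1
        return self.version


rt = ToyRuntime()
rt.install("double", lambda x: 2 * x)
# Composition: a new routine that calls an already installed one.
rt.install("double_then_inc", lambda x: rt.invoke("double", x) + 1)

first = rt.invoke("double_then_inc", 5)
second = rt.invoke("double_then_inc", 5)   # same version, same behavior
rt.install("double", lambda x: 3 * x)       # explicit reprogramming
changed = rt.invoke("double_then_inc", 5)
rt.rollback()
restored = rt.invoke("double_then_inc", 5)
print(first, second, changed, restored)
```

In this toy, reproducibility within a version and rollback across versions are trivial because routines are pure functions. The hard part for a real NC is getting the same guarantees when the "routines" live inside learned state.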

3. The paper's prototype: what it shows, and what is still missing

My estimate is that a real Neural Computer is still about three years away. Relative to the NC I actually have in mind, the work in our paper is still an early step. For now, I think the most convenient unified container is the class of neural architectures built for video generation and world modeling; they are also the fastest path if the goal is to put pixels, actions, and temporal rollout into the same end-to-end prototype. What we are using them to validate is only a subset of NC's key capabilities. They are better read as transitional prototypes than as NC's final structure; reaching CNC would still require a much deeper rebuild from the bottom up.

3.1 CLIGen (General): an imitation game for computers

First ask whether terminal rendering holds up at all: color, cursor behavior, scrolling, TUI layout, and overall pacing.

Look at the first batch of generations. If you do not inspect closely, some of them already pass a quick glance. What CLIGen (General) shows first is simply that video models can already render terminal behavior convincingly enough to look real at first sight. Mainstream video models were never trained for text-dense computer scenes that depend heavily on discrete layout, but after additional training this “imitation game for computers” does begin to work.

Neural Computer (CLIGen General 1): The user types the command CREATE TABLE posts (ID INTEGER), with the terminal displaying the command in a dark background with colored syntax highlighting, including green and yellow text, and the cursor moving character-by-character as the user types, with some corrections and backspacing along the way. The output shows the command being executed, with key words like CREATE and TABLE in distinct colors, and the filename posts appearing in the command line.

Neural Computer (CLIGen General 2): The terminal displays a series of ANSI escape code formatted texts with changing background and foreground colors, executing commands like \u001b[48;2;255;128;128;38;2;0;0;0m which set the background to a shade of pink and text to black, and printing numbered lists with colors. The output includes specific numbers, such as "1", "5", "7", and "9", in different colors, creating a visually dynamic and colorful display, but the exact username, hostname, and path are not specified in the provided terminal session content.

Neural Computer (CLIGen General 3): At the root@localhost:~# prompt, the user types the date command, which displays the current date and time in a plain text format as "2021. 10. 11. 22:47:43 KST", then begins typing the cat command.

Neural Computer (CLIGen General 4): The terminal displays progress bars, package names like pillow, notebook, and tzlocal, and version changes in green and red text. The output shows downloading and installing statuses, including percentages, for packages like smmap, tomli, and protobuf, with the terminal scrolling through the output rapidly.

Neural Computer (CLIGen General 5): At the unspecified username@hostname prompt, the terminal displays a partition editor with a disk image file named "sd.img" (128 MiB) and the user interacts with it, creating a new Linux partition from free space, with key output content showing partition details in a table format, including "sd.img1" and "sd.img2" with their respective sizes and types, and a new partition "sd.img3" with 55M size and Linux type (83). The terminal shows a mix of black and colored text, including blue and red, with a cursor that blinks and moves to different parts of the screen as the user navigates through the partition editor options, such as "New", "Quit", and "Write", with specific prompts like "Partition type: Linux (83)" and "Create new partition from free space".

Neural Computer (CLIGen General 6): The terminal displays a progress bar with the command output "Evaluating" and percentages from 60% to 85%, showing yellow progress bars with increasing completion, such as "│████████████████████▍ │" to "│████████████████████████▉ │", alongside item counts "24/40" to "34/40" and time estimates "0:00:20" to "0:00:07". The output includes specific item completion and estimated time remaining, with the yellow-colored progress bar indicating the evaluation progress.

What gets learned first here is the outer layer of the terminal: how colors shift, how the cursor blinks, whether the window ratio stays stable, how long logs scroll, and how full-screen TUIs, progress bars, and status bars appear. In other words, the terminal's surface and rhythm stabilize before anything else. In the language of the previous section, the model is still learning the appearance of runtime.

Seen from September 2025, this result was surprising. With only about 1,100 hours of noisy terminal data, Wan2.1[31] went from a model that barely understood computer interfaces and struggled with even slightly small text to one that could generate stable terminal scenes, with nontrivial shallow alignment to common commands, echoes, and log formats. For video generation, this is among the hardest classes of scenes: dense text, rapid changes, blinking cursors, and almost no natural motion. The result exceeded what many people expected at the time. The data here still came from general terminal videos, with lots of style variation and very mixed scenes. Once terminal rendering started to hold up, it became natural to push toward harder questions inside the computer: memory, reasoning, programming, and execution.

3.2 REPL and math: it is no longer just drawing terminals

Here the target is a harder execution structure: input, enter, echo, local editing, and state continuation.

After the initial terminal-rendering experiments, the more interesting question is whether the terminal can be treated as a small local machine that is stably driven by actions. If you type a command, does the buffer advance? If you press enter, does the echo follow? If you make a mistake, edit, and retype, does the state continue coherently? REPL and math are really two views of the same question here: has the model started to learn some of the terminal's state-transition rules?

Neural Computer (CLIGen Clean 1):
Sleep 200ms
Type "env | head -n 5"
Enter
Sleep 600ms
Hide

Neural Computer (CLIGen Clean 2):
Sleep 200ms
Type "date"
Enter
Sleep 300ms
Type "whoami"
Enter
Sleep 300ms

Neural Computer (CLIGen Clean 3):
Sleep 200ms
Type "date"
Enter
Sleep 300ms
Type "whomai"
Enter
Sleep 300ms
Type "whomai"
Enter
Sleep 300ms
Hide

Neural Computer (CLIGen Clean 4):
Sleep 200ms
Type "top"
Enter
Sleep 2s
Down 3
Sleep 600ms
Up 2
Hide

Neural Computer (CLIGen Clean 5):
Sleep 500ms
Type "echo $HOME"
Sleep 90ms
Enter
Sleep 1442ms
Hide

Neural Computer (CLIGen Clean 6):
Sleep 200ms
Type "id"
Enter
Sleep 400ms
Hide

Neural Computer (CLIGen Clean 7):
Sleep 200ms
Type "pwd"
Enter
Sleep 400ms
Hide

Neural Computer (CLIGen Clean 8):
Sleep 400ms
Type "python - <<'PY'"
Enter
Type "import time"
Enter
Type "for i in range(18):"
Enter
Type " print(f'Frame {i:02d} ::' + '>' * (i % 20))"
Enter
Type " time.sleep(0.2)"
Enter
Type "PY"
Enter
Sleep 4000ms
Hide

Neural Computer (CLIGen Clean 9):
Sleep 400ms
Type "seq 1 28 | paste -d',' - - - - | column -t -s',' | tee metrics_7x4.txt"
Enter
Sleep 2000ms
Hide

Neural Computer (CLIGen Clean 10):
Sleep 180ms
Type "echo History size: $HISTSIZE"
Sleep 120ms
Enter
Sleep 400ms
Type "cal"
Sleep 120ms
Enter
Sleep 400ms
Type "echo Home: $HOME"
Sleep 120ms
Enter
Sleep 400ms
Sleep 400ms
Hide

Neural Computer (CLIGen Clean 11):
Sleep 800ms
Sleep 180ms
Type "echo Learning shell basics"
Sleep 120ms
Enter
Sleep 400ms
Type "date +%Y-%m-%d"
Sleep 120ms
Enter
Sleep 400ms
Type "echo Login shell: $0"
Sleep 120ms
Enter
Sleep 400ms
Type "uname -r"
Sleep 120ms
Enter
Sleep 400ms
Sleep

Neural Computer (CLIGen Clean 12):
Sleep 200ms
Type "python"
Enter
Sleep 400ms
Type "5"
Enter
Sleep 400ms
Type "exit()"
Enter
Sleep 400ms
Hide

Neural Computer (CLIGen Clean 13):
Sleep 200ms
Type "python"
Enter
Sleep 1s
Type "10+15"
Enter
Sleep 800ms
Hide

Neural Computer (CLIGen Clean 14):
Sleep 200ms
Type "python"
Enter
Sleep 1s
Type "40/1"
Enter
Sleep 800ms
Hide

Here the center of gravity shifts toward the causal structure of command execution. This training set comes from cleaner, more reproducible scripted traces: we generated these terminal videos ourselves through scripts and Docker so that input, enter, echo, errors, and local edits all happen inside a much more stable terminal environment.
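The scripted traces above follow a small, regular command vocabulary (Sleep, Type, Enter, Up, Down, Hide). As a rough illustration of how such a script could be turned into a timed action stream for conditioning or evaluation, here is a minimal Python parser. The grammar is inferred only from the samples shown in this section and is not the paper's tooling. The names `Event`, `parse_duration_ms`, and `parse_script` are hypothetical, typing is assumed to take zero time, and `Hide` is kept as an opaque marker because its exact semantics are not described here.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Event:
    at_ms: int                     # time offset at which the action fires
    kind: str                      # "type", "key", or "hide"
    payload: Optional[str] = None  # typed text or key name


_DURATION = re.compile(r"^(\d+(?:\.\d+)?)(ms|s)$")


def parse_duration_ms(token: str) -> int:
    """Convert '200ms' or '2s' into integer milliseconds."""
    match = _DURATION.match(token)
    if match is None:
        raise ValueError(f"unrecognized duration: {token!r}")
    scale = 1000 if match.group(2) == "s" else 1
    return int(round(float(match.group(1)) * scale))


def parse_script(text: str) -> List[Event]:
    """Parse a script into events; Sleep only advances the clock."""
    events: List[Event] = []
    clock_ms = 0
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        command, _, rest = line.partition(" ")
        rest = rest.strip()
        if command == "Sleep":
            clock_ms += parse_duration_ms(rest)
        elif command == "Type":
            if len(rest) < 2 or rest[0] != '"' or rest[-1] != '"':
                raise ValueError(f"Type needs a quoted string: {line!r}")
            events.append(Event(clock_ms, "type", rest[1:-1]))
        elif command in {"Enter", "Up", "Down"}:
            repeat = int(rest) if rest else 1   # e.g. 'Down 3' presses Down three times
            events.extend(Event(clock_ms, "key", command) for _ in range(repeat))
        elif command == "Hide":
            events.append(Event(clock_ms, "hide"))
        else:
            raise ValueError(f"unknown command: {line!r}")
    return events


script = """Sleep 200ms
Type "top"
Enter
Sleep 2s
Down 3
Sleep 600ms
Up 2
Hide"""
events = parse_script(script)
for event in events:
    print(event)
```

The usage example is the CLIGen Clean 4 script. A stricter parser would also model per-character typing delay, which this sketch deliberately ignores.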

The results already show that the model has learned some of the most basic operating regularities of a computer terminal. For very simple commands such as pwd, date, whoami, echo $HOME, and env | head -n 5, the typed input, the enter key, the echoed output, and the final display are already fairly close to reality; different commands also produce output shapes that match the corresponding terminal scenario. Relative to the previous section, the commands themselves are now driving character updates, echo generation, and local state changes, and the terminal unfolds more according to its own operating logic.

Pushed further along this line, the model has begun to pick up something in simple arithmetic scenes as well, but reasoning itself is still far from solved. Even at the level of two-digit addition, current models still struggle to compute stably. Part of that is surely a data issue: we have not yet given the model enough hard training data to force out stable reasoning. But there is also a deeper possibility: asking current DiT-based video models to carry stable reasoning may simply be the wrong bet. The more reliable conclusion for now is that terminal execution has started to hold; symbolic reasoning has not.
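One simple way to quantify "struggles with two-digit addition" is an exact-match probe. The sketch below is an assumption about how such a check could be set up, not the paper's evaluation protocol: `make_addition_probes` and `exact_match_accuracy` are hypothetical helpers, the probe scripts copy the shape of the CLIGen Clean 13 sample, and the predicted strings are assumed to have already been read off the generated frames by some external step (for example OCR) that is not shown.

```python
import random
from typing import List, Tuple


def make_addition_probes(n: int, seed: int = 0) -> List[Tuple[str, str]]:
    """Build (action script, expected answer) pairs for two-digit addition."""
    rng = random.Random(seed)
    probes = []
    for _ in range(n):
        a, b = rng.randint(10, 99), rng.randint(10, 99)
        script = "\n".join([
            "Sleep 200ms",
            'Type "python"',
            "Enter",
            "Sleep 1s",
            f'Type "{a}+{b}"',
            "Enter",
            "Sleep 800ms",
            "Hide",
        ])
        probes.append((script, str(a + b)))
    return probes


def exact_match_accuracy(expected: List[str], predicted: List[str]) -> float:
    """Fraction of probes whose read-off answer equals the true sum exactly."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    if not expected:
        return 0.0
    hits = sum(e == p.strip() for e, p in zip(expected, predicted))
    return hits / len(expected)


probes = make_addition_probes(4)
expected = [answer for _, answer in probes]
predicted = list(expected)
predicted[0] = "0"   # simulate one wrong read-off; true sums are at least 20
accuracy = exact_match_accuracy(expected, predicted)
print(accuracy)
```

Exact match is deliberately strict: a rendered answer that looks plausible but is off by one digit counts as a failure, which is the distinction between rendering and computing that this section is drawing.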

3.3 Then GUIWorld: interface control starts to work too

The final question is whether actions can genuinely drive interface state: whether clicks, hovers, typing, and window feedback form a closed loop.

By the CLI stage, one thing was already clear: video models are strong at rendering, and some basic memory and execution ability had begun to show up, while the lowest layer of symbolic reasoning remained weak. GUIWorld shifts the emphasis again. Now the question is whether actions can actually push interface state forward.

Conventional Computer (GT)	Neural Computer (Generation)
"thinking": "Now I'll click on the Term icon to open the terminal:" "action": "left_click", "x": 82, "y": 520	"thinking": "Now I'll click on the Term icon to open the terminal:" "action": "left_click", "x": 82, "y": 520
"thinking": "\"Now I'll type in 768 for the height:" "action": "type", "text": "512" "thinking": "Now I'll click the OK button to create the new image:" "action": "left_click", "x": 763, "y": 514	"thinking": "\"Now I'll type in 768 for the height:" "action": "type", "text": "512" "thinking": "Now I'll click the OK button to create the new image:" "action": "left_click", "x": 763, "y": 514
"thinking": "I can see the Firefox icon on the desktop. Now I'll click on it to open Firefox:" "action": "double_click", "x": 224, "y": 70 "thinking": "I see Firefox has opened with a welcome page. Now I'll click on the address bar and navigate to Wikipedia:" "action": "double_click", "x": 487, "y": 81	"thinking": "I can see the Firefox icon on the desktop. Now I'll click on it to open Firefox:" "action": "double_click", "x": 224, "y": 70 "thinking": "I see Firefox has opened with a welcome page. Now I'll click on the address bar and navigate to Wikipedia:" "action": "double_click", "x": 487, "y": 81
"thinking": "Now I'll press Enter to execute the command:" "action": "type", "text": "cat ~/Desktop/datetime.txt"	"thinking": "Now I'll press Enter to execute the command:" "action": "type", "text": "cat ~/Desktop/datetime.txt"
"thinking": "I can see the border_photo.png file that we created with ffmpeg. I'll double-click on it to open it in GIMP:" "action": "double_click", "x": 310, "y": 143 "thinking": "Great! Now the image is open in GIMP. I can see it's a peachpuff colored image as requested. Now I'll add a border to it. In GIMP, I'll go to the \"ilters\" menu and look for a border option:" "action": "left_click", "x": 547, "y": 43	"thinking": "I can see the border_photo.png file that we created with ffmpeg. I'll double-click on it to open it in GIMP:" "action": "double_click", "x": 310, "y": 143 "thinking": "Great! Now the image is open in GIMP. I can see it's a peachpuff colored image as requested. Now I'll add a border to it. In GIMP, I'll go to the \"ilters\" menu and look for a border option:" "action": "left_click", "x": 547, "y": 43
"thinking": "I can see that the theme has changed to \"smooth\" and I can already see the tiles in the background have a different style. Now I need to click the \"Close\" button to apply these changes and close the preferences dialog:" "action": "left_click", "x": 629, "y": 423	"thinking": "I can see that the theme has changed to \"smooth\" and I can already see the tiles in the background have a different style. Now I need to click the \"Close\" button to apply these changes and close the preferences dialog:" "action": "left_click", "x": 629, "y": 423
"thinking": "I see Wikipedia appears in the suggestions. I'll click on it to go to Wikipedia:" "action": "left_click", "x": 240, "y": 150	"thinking": "I see Wikipedia appears in the suggestions. I'll click on it to go to Wikipedia:" "action": "left_click", "x": 240, "y": 150
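The action records in the comparison table share a small set of fields ("thinking", "action", and then either "x"/"y" or "text"). As an illustration of how such a record could be validated before it is used to condition generation, here is a minimal Python sketch. It assumes each step is available as a complete JSON object, which is an assumption about the underlying data, since the table shows only fragments. `GuiAction` and `parse_action` are hypothetical names, and the set of pointer actions covers only the two kinds visible above.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GuiAction:
    thinking: str
    action: str
    x: Optional[int] = None
    y: Optional[int] = None
    text: Optional[str] = None


# Only the pointer actions that appear in the table above; a real trace may have more.
POINTER_ACTIONS = {"left_click", "double_click"}


def parse_action(record: str) -> GuiAction:
    """Parse one JSON-encoded step and check the fields its action type needs."""
    data = json.loads(record)
    step = GuiAction(
        thinking=data.get("thinking", ""),
        action=data["action"],
        x=data.get("x"),
        y=data.get("y"),
        text=data.get("text"),
    )
    if step.action in POINTER_ACTIONS and (step.x is None or step.y is None):
        raise ValueError(f"{step.action} needs both x and y")
    if step.action == "type" and step.text is None:
        raise ValueError("type needs a text field")
    return step


record = json.dumps({
    "thinking": "Now I'll click on the Term icon to open the terminal:",
    "action": "left_click",
    "x": 82,
    "y": 520,
})
step = parse_action(record)
print(step.action, step.x, step.y)
```

Checking each step against its action type up front makes it easier to tell a malformed trace apart from a genuine failure of the model to push interface state forward.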

More Comparisons

The seven pairs above are the main comparisons; below are more direct visual side-by-side samples as a supplementary gallery for quick browsing.