Research note May 17 sharing deck May 17, 2026

Learning Machines: A System View of TLM

May 17 Sharing: The Machine Learning

Learning Machines Lab

This sharing reframes "learning" as a full system rather than a single better answer. Starting from the classic task-experience-performance view of machine learning, it defines TLM as a continuously learning agent system operating in dynamic real environments: it executes long-horizon tasks, diagnoses its capability boundaries, asks for feedback, and consolidates that feedback into an updatable agent harness so the agent's cognitive core becomes stronger over time.

Mind map of the May 17 Learning Machines sharing — The sharing connects learning contracts, cognitive core, agent harness, feedback, persistent memory, and long-term measurement into one TLM system view.

In Brief

1. TLM inherits the machine-learning loop, but moves the loop from static data to real long-horizon agent work.
2. The agent harness is the practical learning carrier: memory, state, workflow, experience base, tools, and quality standards can all be updated without relying only on model-weight changes.
3. The key learning behavior is active: the agent must know when it does not know, locate the missing feedback, and ask the right human, tool, verifier, or environment for help.
4. Long-term evaluation should measure cognitive-core strengthening, experience reuse, feedback efficiency, and transfer, not only one-shot task success.

Start From Learning

The first move in the deck is conceptual discipline: learning is not "the answer looks better this time." A learning system needs a task or objective, an environment that supplies experience, a learning subject, a learning carrier, a feedback signal, an update mechanism, and a measurable performance criterion. This matters because agent learning can otherwise become a vague metaphor. If we cannot say what task the agent is improving on, what feedback it received, where the update was stored, and how later performance improved, then we have not yet shown learning. We have only shown execution.

Diagram of learning as a system loop — Learning becomes an explicit contract: task, experience, carrier, feedback, update, and measurable performance improvement.
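
The four questions in the paragraph above (which task, what feedback, where the update was stored, how later performance improved) can be read as a checklist. The following is a minimal sketch of that checklist, not something taken from the deck: the class name `LearningClaim`, its field names, and the "higher score is better" convention are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LearningClaim:
    """Hypothetical record of one claimed learning event (names are assumptions)."""
    task: Optional[str] = None             # what the agent is improving on
    feedback: Optional[str] = None         # what feedback it received
    update_location: Optional[str] = None  # where the update was stored
    score_before: Optional[float] = None   # performance before the update
    score_after: Optional[float] = None    # performance after the update

    def missing(self) -> list[str]:
        # Names of the contract fields that have not been filled in.
        return [name for name, value in vars(self).items() if value is None]

    def shows_learning(self) -> bool:
        # Without every field we have only shown execution, not learning.
        if self.missing():
            return False
        # Assumes a higher score means better performance.
        return self.score_after > self.score_before
```

A claim that names only the task would report the other four fields as missing and would not count as learning under this sketch.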

Define TLM as a Learning System

TLM transfers the machine-learning loop into the setting of agents that work in dynamic real environments. The task is no longer a labeled dataset; it is a long-horizon task with changing context, tools, humans, constraints, and external state. The learning subject is an LLM plus an agent harness. The learning carrier is not only model parameters, but also durable state, workflow graphs, memory, experience bases, tool policies, and quality standards. The update process becomes a loop of absorbing, compressing, validating, generalizing, and reusing feedback.

Diagram mapping the learning-system contract onto TLM — TLM relocates the learning loop from static datasets into long-horizon agent work in dynamic real environments.

Cognitive Core Is a Capability, Not a Fact Store

The deck introduces "cognitive core" by first grounding it in human cognition: perceiving, remembering, understanding, judging, solving problems, and creating. For an agent, the cognitive core is the mechanism that organizes and updates cognitive resources. Strengthening it does not mean adding more isolated facts to memory. It means the agent becomes better at understanding the problem, decomposing it, deciding what evidence matters, calling the right resources, forming judgments, and absorbing corrections. In Bloom-style terms, TLM should move agents from recognition and understanding toward application, analysis, evaluation, and creation.

Mind map of cognitive-core capabilities — Cognitive-core strengthening means better organization, judgment, resource use, and feedback absorption, not merely a larger fact store.

Agent Harness Becomes the Runtime for Learning

The agent harness is the bidirectional runtime between model reasoning and environmental action. It turns model outputs into executable tool calls, workflows, recoverable actions, and observable traces; it also turns environment, tool, verifier, and human feedback into signals that can be inspected and reused. This is why the harness is not a thin wrapper. It carries runtime responsibilities such as context assembly, tool invocation, error recovery, observation management, and state tracking. It also carries governance and learning responsibilities: safety constraints, validation, feedback normalization, process inspection, memory updates, and quality standards.

Diagram of the agent harness as a bidirectional runtime — The harness translates model reasoning into environmental action and turns observations back into reusable learning signals.
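
To make the two directions concrete, here is a minimal sketch of a harness that turns a model-chosen action into a tool call and turns the result, including failures, into an inspectable signal. It covers only tool invocation, error recovery, and trace keeping. The `Harness` class, the `act` method, and the signal dictionary layout are hypothetical names chosen for illustration, not an interface described in the deck.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Harness:
    """Hypothetical bidirectional runtime: actions go out, signals come back."""
    tools: dict[str, Callable[[str], str]]
    trace: list[dict] = field(default_factory=list)  # observable record of every step

    def act(self, tool_name: str, argument: str) -> dict:
        tool = self.tools.get(tool_name)
        if tool is None:
            # An unknown tool is reported as a signal instead of crashing the run.
            signal = {"tool": tool_name, "ok": False, "observation": "unknown tool"}
        else:
            try:
                signal = {"tool": tool_name, "ok": True, "observation": tool(argument)}
            except Exception as exc:
                # Error recovery: the failure itself becomes reusable feedback.
                signal = {"tool": tool_name, "ok": False, "observation": repr(exc)}
        self.trace.append(signal)
        return signal
```

For example, `Harness(tools={"upper": str.upper}).act("upper", "abc")` would return a signal whose observation is `"ABC"`, and every call, successful or not, stays in `trace` for later inspection.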

Active Learning Moves From Samples to Capability Boundaries

Classical online active learning asks which incoming sample is worth labeling. TLM asks a different question: where is the agent's capability boundary, why is it uncertain, and who or what can provide the feedback that closes the gap? The learning loop therefore begins during task execution. The agent acts in a real environment, exposes uncertainty or risk, asks an expert, tool, verifier, or environment for help, then consolidates the answer into memory, workflow, standards, or self-diagnosis. A good TLM system must not wait passively for feedback; it must decide when, what, and whom to ask.

Loop for active feedback and capability-boundary diagnosis — TLM active learning asks where the agent is uncertain, what feedback would close the gap, and who or what can provide it.
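
One way to picture the "when and whom to ask" decision is a small policy function. This is a sketch under stated assumptions: the expected-cost rule `(1 - confidence) * risk`, the threshold value, the gap categories, and the `SOURCES` mapping are all hypothetical and are not proposed in the deck.

```python
from typing import Optional

# Hypothetical mapping from the kind of gap to the source best placed to close it.
SOURCES = {
    "factual": "tool",
    "correctness": "verifier",
    "preference": "human expert",
    "state": "environment",
}


def whom_to_ask(confidence: float, risk: float, gap_kind: str,
                threshold: float = 0.5) -> Optional[str]:
    """Return a feedback source, or None if the agent should proceed alone.

    confidence and risk are assumed to lie in [0, 1].
    """
    # Assumed rule: ask only when the expected cost of being wrong is high enough.
    expected_cost = (1.0 - confidence) * risk
    if expected_cost < threshold:
        return None
    # Unrecognized gaps fall back to a human expert in this sketch.
    return SOURCES.get(gap_kind, "human expert")
```

Under these assumptions a confident, low-risk step proceeds without asking, while an uncertain, high-risk step about correctness is sent to a verifier.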

The Technical Agenda Has Eight Coupled Questions

The deck organizes the research agenda around the same dimensions used to define learning. The questions are not independent modules. Defining and measuring the cognitive core shapes the target. Building dynamic environments determines what experience is available. Supporting lifelong execution depends on context management. Designing persistent harness state determines where learning lives. Diagnosing capability boundaries determines when feedback is requested. Converting natural language feedback into updates determines whether help becomes reusable. Scaling-law research asks which task, feedback, and memory variables drive capability growth. Long-term evaluation checks whether the whole loop actually improves the agent over time.

Map of eight coupled technical questions for TLM — The agenda mirrors the definition of learning: target, environment, subject, carrier, paradigm, signal, update, and evaluation.

Persistent Harness State Is the Memory of Learning

A practical TLM does not have to update model weights every time it learns. Much of the learning can live in persistent harness structures: state, workflow graphs, memory, experience bases, tool strategies, and quality standards. The important design problem is granularity. The system needs factual memory, cases, procedures, failure patterns, and criteria for quality; it also needs retrieval that injects the right experience at the right task stage. The deck's "consolidation / dream" language points to a second-order update: raw feedback should be compressed, verified, and generalized into stable experience rather than stored as noisy transcript fragments.

Diagram of persistent harness memory and consolidation — Persistent harness state gives learning a place to live: memory, workflows, cases, standards, and generalized experience.
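
The granularity and retrieval points above can be illustrated with a small in-memory store. This is only a sketch: `Experience`, `ExperienceBase`, the `kind` and `stage` labels, and the consolidation rule are assumptions made for illustration. In particular, consolidation here only drops unverified entries and exact duplicates; the compression and generalization steps the deck describes are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Experience:
    """Hypothetical unit of stored experience."""
    kind: str    # e.g. "fact", "case", "procedure", "failure_pattern", "quality_criterion"
    stage: str   # task stage at which this experience is useful
    text: str
    verified: bool = False


class ExperienceBase:
    def __init__(self) -> None:
        self._items: list[Experience] = []

    def add(self, item: Experience) -> None:
        self._items.append(item)

    def consolidate(self) -> None:
        # Keep only verified entries and drop exact duplicates, preserving order.
        seen: set[Experience] = set()
        kept: list[Experience] = []
        for item in self._items:
            if item.verified and item not in seen:
                seen.add(item)
                kept.append(item)
        self._items = kept

    def retrieve(self, stage: str, kind: Optional[str] = None) -> list[str]:
        # Stage-aware retrieval: return only experience relevant to the current stage.
        return [i.text for i in self._items
                if i.stage == stage and (kind is None or i.kind == kind)]
```

Keeping `stage` on every entry is one simple way to inject experience at the right point in a task; a real system would likely need richer matching than an exact string comparison.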

Feedback Must Become an Update Signal

TLM's feedback is richer than scalar rewards or class labels. It includes expert advice, process criticism, tool verification, environment observations, and task success or failure. The hard part is turning that natural language feedback into an actionable update. The deck proposes a decomposition: separate fact corrections, standard changes, strategy suggestions, and failure causes; attribute each piece to the part of the system it should affect; verify whether the feedback is reliable; then route it to memory, workflow, quality standards, or self-diagnosis. TextGrad-like ideas matter here because they treat text itself as a gradient-like signal for improvement.

Diagram showing how natural language feedback is routed into updates — Feedback becomes useful when it is decomposed, attributed, verified, and routed into the right update surface.
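
The verify-and-route half of that decomposition can be sketched as a small function. The sketch assumes the feedback has already been split into typed pieces, which is the hard natural-language step and is not attempted here. The one-to-one `ROUTES` mapping from feedback kind to update surface is an illustrative assumption; the source lists the four kinds and the four surfaces but does not state which goes where.

```python
from dataclasses import dataclass

# Assumed mapping from feedback kind to update surface (not specified in the source).
ROUTES = {
    "fact_correction": "memory",
    "standard_change": "quality_standards",
    "strategy_suggestion": "workflow",
    "failure_cause": "self_diagnosis",
}


@dataclass
class FeedbackPiece:
    """Hypothetical decomposed unit of feedback."""
    kind: str
    text: str
    reliable: bool  # outcome of a separate verification step


def route(pieces: list[FeedbackPiece]) -> tuple[dict[str, list[str]], list[str]]:
    """Send reliable pieces to their update surface; collect the rest as rejected."""
    updates: dict[str, list[str]] = {surface: [] for surface in ROUTES.values()}
    rejected: list[str] = []
    for piece in pieces:
        if not piece.reliable or piece.kind not in ROUTES:
            rejected.append(piece.text)
        else:
            updates[ROUTES[piece.kind]].append(piece.text)
    return updates, rejected
```

Returning the rejected pieces separately keeps unreliable or unclassifiable feedback visible for later review instead of silently discarding it.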

Improvement Requires Scaling Laws and Long-Term Evaluation

The final part of the deck pushes beyond architecture into measurement. TLM capability growth may depend on task length, task diversity, feedback density, expert quality, memory capacity, retrieval noise, and experience reuse rate. These variables form a different scaling-law surface from model size alone. Evaluation must therefore be longitudinal: long-horizon task success, product quality, gap-diagnosis accuracy, feedback utilization efficiency, transfer ability, memory reuse, robustness, and continuous-running stability. The lesson is simple but demanding: a learning machine should be judged by whether yesterday's work makes tomorrow's work measurably better.

Diagram connecting TLM scaling variables with long-term evaluation — Scaling and evaluation ask whether task experience, feedback, memory, and reuse compound into durable capability growth.
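
A longitudinal check of "does yesterday's work make tomorrow's work better" can be as simple as comparing early and late episodes. The sketch below is one possible reading, not a metric defined in the deck: the episode dictionary keys `success` and `feedback_requests`, the half-and-half split, and the two summary numbers are all assumptions.

```python
from statistics import mean


def longitudinal_summary(episodes: list[dict]) -> dict:
    """Compare the first and second half of a chronologically ordered episode list.

    Each episode is assumed to look like {"success": bool, "feedback_requests": int}.
    """
    if len(episodes) < 2:
        raise ValueError("need at least two episodes to compare early and late work")
    half = len(episodes) // 2
    early, late = episodes[:half], episodes[half:]

    def success_rate(chunk: list[dict]) -> float:
        return mean(1.0 if e["success"] else 0.0 for e in chunk)

    def feedback_load(chunk: list[dict]) -> float:
        return mean(e["feedback_requests"] for e in chunk)

    return {
        # Positive values mean later episodes succeed more often.
        "success_gain": success_rate(late) - success_rate(early),
        # Positive values mean later episodes needed less help.
        "feedback_reduction": feedback_load(early) - feedback_load(late),
    }
```

A rising success rate combined with a falling feedback load is the pattern this sketch treats as evidence of compounding capability; it says nothing yet about transfer or robustness, which would need separate task sets.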