Learning Machines: A System View of TLM
May 17 Sharing: The Machine Learning
Learning Machines Lab
This talk reframes "learning" as a property of a whole system rather than a single better answer. Starting from the classic task-experience-performance view of machine learning, it defines TLM as a continuously learning agent system operating in dynamic real environments: it executes long-horizon tasks, diagnoses its capability boundaries, asks for feedback, and consolidates that feedback into an updatable agent harness so the agent's cognitive core becomes stronger over time.
In Brief
1. TLM inherits the machine-learning loop, but moves the loop from static data to real long-horizon agent work.
2. The agent harness is the practical learning carrier: memory, state, workflow, experience base, tools, and quality standards can all be updated without relying only on model-weight changes.
3. The key learning behavior is active: the agent must know when it does not know, locate the missing feedback, and ask the right human, tool, verifier, or environment for help.
4. Long-term evaluation should measure cognitive-core strengthening, experience reuse, feedback efficiency, and transfer, not only one-shot task success.
Start From Learning
The first move in the deck is conceptual discipline: learning is not "the answer looks better this time." A learning system needs a task or objective, an environment that supplies experience, a learning subject, a learning carrier, a feedback signal, an update mechanism, and a measurable performance criterion. This matters because agent learning can otherwise become a vague metaphor. If we cannot say what task the agent is improving on, what feedback it received, where the update was stored, and how later performance improved, then we have not yet shown learning. We have only shown execution.
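The four questions in this paragraph can be read as a checklist. The sketch below is one possible encoding, not something taken from the deck: the class `LearningClaim`, its field names, and the assumption that a higher score means better performance are all illustrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LearningClaim:
    """One claimed instance of agent learning (all field names are illustrative)."""
    task: Optional[str] = None             # what the agent is improving on
    feedback: Optional[str] = None         # what feedback it received
    update_location: Optional[str] = None  # where the update was stored
    score_before: Optional[float] = None   # performance before the update
    score_after: Optional[float] = None    # performance on a later attempt


def missing_evidence(claim: LearningClaim) -> list[str]:
    """Return the names of the questions the claim cannot yet answer."""
    gaps = [name for name in ("task", "feedback", "update_location")
            if not getattr(claim, name)]
    if claim.score_before is None or claim.score_after is None:
        gaps.append("measured_performance")
    elif claim.score_after <= claim.score_before:
        # Assumption: higher score is better; no gain means no shown learning.
        gaps.append("improvement")
    return gaps


def shows_learning(claim: LearningClaim) -> bool:
    """True only when every question is answered and performance improved."""
    return not missing_evidence(claim)
```

A claim that names only the task would report the other three items as missing, which corresponds to the "we have only shown execution" case.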
Define TLM as a Learning System
TLM transfers the machine-learning loop into the setting of agents that work in dynamic real environments. The task is no longer a labeled dataset; it is a long-horizon task with changing context, tools, humans, constraints, and external state. The learning subject is an LLM plus an agent harness. The learning carrier is not only model parameters, but also durable state, workflow graphs, memory, experience bases, tool policies, and quality standards. The update process becomes a loop of absorbing, compressing, validating, generalizing, and reusing feedback.
Cognitive Core Is a Capability, Not a Fact Store
The deck introduces "cognitive core" by first grounding it in human cognition: perceiving, remembering, understanding, judging, solving problems, and creating. For an agent, the cognitive core is the mechanism that organizes and updates cognitive resources. Strengthening it does not mean adding more isolated facts to memory. It means the agent becomes better at understanding the problem, decomposing it, deciding what evidence matters, calling the right resources, forming judgments, and absorbing corrections. In terms of Bloom's taxonomy, TLM should move agents from recognition and understanding toward application, analysis, evaluation, and creation.
Agent Harness Becomes the Runtime for Learning
The agent harness is the bidirectional runtime between model reasoning and environmental action. It turns model outputs into executable tool calls, workflows, recoverable actions, and observable traces; it also turns environment, tool, verifier, and human feedback into signals that can be inspected and reused. This is why the harness is not a thin wrapper. It carries runtime responsibilities such as context assembly, tool invocation, error recovery, observation management, and state tracking. It also carries governance and learning responsibilities: safety constraints, validation, feedback normalization, process inspection, memory updates, and quality standards.
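To make the two directions concrete, here is a minimal sketch of a harness that turns a requested action into a tool call with error capture and a trace, and turns incoming feedback into a normalized record. The class `Harness`, its methods, and the record fields are hypothetical names chosen for illustration; a real harness would also handle context assembly, safety constraints, and validation, which are omitted here.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Harness:
    """Toy bidirectional runtime; every name here is hypothetical."""
    tools: dict[str, Callable[..., Any]]
    trace: list[dict] = field(default_factory=list)     # observable record of actions
    feedback: list[dict] = field(default_factory=list)  # normalized feedback signals

    def act(self, tool: str, **args: Any) -> dict:
        """Model output -> executable tool call, with error capture."""
        try:
            if tool not in self.tools:
                raise KeyError(f"unknown tool: {tool}")
            result = {"ok": True, "value": self.tools[tool](**args)}
        except Exception as exc:  # recoverable: report the error, keep the run alive
            result = {"ok": False, "error": str(exc)}
        self.trace.append({"tool": tool, "args": args, **result})
        return result

    def observe(self, source: str, text: str) -> dict:
        """Environment, tool, verifier, or human feedback -> inspectable signal."""
        signal = {"source": source, "text": text.strip(), "step": len(self.trace)}
        self.feedback.append(signal)
        return signal
```

Recording the trace length in each signal is one simple way to tie feedback back to the step that produced it, so that later inspection can attribute a correction to a specific action.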
Active Learning Moves From Samples to Capability Boundaries
Classical online active learning asks which incoming sample is worth labeling. TLM asks a different question: where is the agent's capability boundary, why is it uncertain, and who or what can provide the feedback that closes the gap? The learning loop therefore begins during task execution. The agent acts in a real environment, exposes uncertainty or risk, asks an expert, tool, verifier, or environment for help, then consolidates the answer into memory, workflow, standards, or self-diagnosis. A good TLM system must not wait passively for feedback; it must decide when, what, and whom to ask.
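The "when, what, and whom" decision can be sketched as a small policy. The version below is an assumption-laden illustration, not the deck's method: it asks whenever the expected loss of acting alone, taken here as (1 - confidence) * risk, reaches a threshold, and it picks the source from a hypothetical routing table. The names `Gap`, `SOURCES`, and `plan_request`, the gap kinds, and the default threshold are all invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Gap:
    """A diagnosed capability gap at one task step (fields are illustrative)."""
    question: str      # what the agent does not know
    kind: str          # "fact", "correctness", "preference", or "state"
    confidence: float  # agent's own estimate in [0, 1]
    risk: float        # cost of acting wrongly, in [0, 1]


# Hypothetical routing table: which source can close which kind of gap.
SOURCES = {
    "fact": "tool",
    "correctness": "verifier",
    "preference": "human expert",
    "state": "environment",
}


def plan_request(gap: Gap, threshold: float = 0.25) -> Optional[dict]:
    """Decide when, what, and whom to ask; return None to proceed unaided."""
    # Assumed rule: ask when the expected loss of acting alone is high enough.
    expected_loss = (1.0 - gap.confidence) * gap.risk
    if expected_loss < threshold:
        return None
    # Unknown gap kinds fall back to a human expert.
    return {"whom": SOURCES.get(gap.kind, "human expert"), "what": gap.question}
```

The hard part in practice is producing a trustworthy `confidence` value, which is the gap-diagnosis problem the paragraph describes; the sketch simply takes it as given.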
The Technical Agenda Has Eight Coupled Questions
The deck organizes the research agenda around the same dimensions used to define learning. The questions are not independent modules. Defining and measuring the cognitive core shapes the target. Building dynamic environments determines what experience is available. Supporting lifelong execution depends on context management. Designing persistent harness state determines where learning lives. Diagnosing capability boundaries determines when feedback is requested. Converting natural language feedback into updates determines whether help becomes reusable. Scaling-law research asks which task, feedback, and memory variables drive capability growth. Long-term evaluation checks whether the whole loop actually improves the agent over time.
Persistent Harness State Is the Memory of Learning
A practical TLM does not have to update model weights every time it learns. Much of the learning can live in persistent harness structures: state, workflow graphs, memory, experience bases, tool strategies, and quality standards. The important design problem is granularity. The system needs factual memory, cases, procedures, failure patterns, and criteria for quality; it also needs retrieval that injects the right experience at the right task stage. The deck's "consolidation / dream" language points to a second-order update: raw feedback should be compressed, verified, and generalized into stable experience rather than stored as noisy transcript fragments.
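A minimal sketch of this idea follows, assuming a two-tier store: raw experience is buffered first, and a consolidation pass promotes only verified, non-duplicate items to stable memory that retrieval can see. The names `Experience` and `ExperienceStore` and the stage labels are hypothetical, and exact-match deduplication stands in for the compression and generalization a real system would need.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Experience:
    kind: str    # "fact", "case", "procedure", "failure", or "criterion"
    stage: str   # task stage where it applies, e.g. "plan" or "review"
    lesson: str  # compressed, generalized statement


@dataclass
class ExperienceStore:
    raw: list[tuple[Experience, bool]] = field(default_factory=list)  # (item, verified?)
    stable: list[Experience] = field(default_factory=list)

    def record(self, item: Experience, verified: bool) -> None:
        """Buffer feedback-derived experience; it is not retrievable yet."""
        self.raw.append((item, verified))

    def consolidate(self) -> int:
        """Promote verified, non-duplicate items to stable memory; drop the rest."""
        added = 0
        for item, verified in self.raw:
            if verified and item not in self.stable:
                self.stable.append(item)
                added += 1
        self.raw.clear()
        return added

    def retrieve(self, stage: str) -> list[str]:
        """Return only the lessons relevant to the current task stage."""
        return [e.lesson for e in self.stable if e.stage == stage]
```

Keeping unverified items out of `stable` is the design choice that matters here: retrieval never injects a lesson that has not passed the consolidation step.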
Feedback Must Become an Update Signal
TLM's feedback is richer than scalar rewards or class labels. It includes expert advice, process criticism, tool verification, environment observations, and task success or failure. The hard part is turning that natural language feedback into an actionable update. The deck proposes a decomposition: separate fact corrections, standard changes, strategy suggestions, and failure causes; attribute each piece to the part of the system it should affect; verify whether the feedback is reliable; then route it to memory, workflow, quality standards, or self-diagnosis. TextGrad-like ideas matter here because they treat text itself as a gradient-like signal for improvement.
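The attribute-verify-route steps can be sketched as below. This assumes the decomposition into typed pieces and the reliability score have already been produced by some upstream step (for example a model or a verifier); the names `ROUTES`, `FeedbackPiece`, `route_feedback`, the `needs_review` bucket, and the 0.7 threshold are illustrative assumptions rather than anything specified in the deck.

```python
from dataclasses import dataclass

# Hypothetical mapping from feedback type to the harness component it updates.
ROUTES = {
    "fact_correction": "memory",
    "standard_change": "quality_standards",
    "strategy_suggestion": "workflow",
    "failure_cause": "self_diagnosis",
}


@dataclass
class FeedbackPiece:
    category: str       # one of the ROUTES keys
    text: str           # the extracted statement
    reliability: float  # verification score in [0, 1], assumed given


def route_feedback(pieces: list[FeedbackPiece],
                   min_reliability: float = 0.7) -> dict[str, list[str]]:
    """Group reliable pieces by target component; hold back unreliable ones."""
    updates: dict[str, list[str]] = {}
    for piece in pieces:
        if piece.category not in ROUTES:
            raise ValueError(f"unknown feedback category: {piece.category}")
        if piece.reliability >= min_reliability:
            target = ROUTES[piece.category]
        else:
            target = "needs_review"  # not applied until verified
        updates.setdefault(target, []).append(piece.text)
    return updates
```

For instance, a reliable fact correction would land under `memory`, while a speculative failure cause with low reliability would be parked under `needs_review` instead of changing self-diagnosis.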
Improvement Requires Scaling Laws and Long-Term Evaluation
The final part of the deck pushes beyond architecture into measurement. TLM capability growth may depend on task length, task diversity, feedback density, expert quality, memory capacity, retrieval noise, and experience reuse rate. These variables form a different scaling-law surface from model size alone. Evaluation must therefore be longitudinal: long-horizon task success, product quality, gap-diagnosis accuracy, feedback utilization efficiency, transfer ability, memory reuse, robustness, and continuous-running stability. The lesson is simple but demanding: a learning machine should be judged by whether yesterday's work makes tomorrow's work measurably better.
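As a rough illustration of longitudinal measurement, the sketch below compares the first and second half of a chronological run on three of the quantities named above: task success, feedback requested per task, and memory reuse. The split-half comparison, the `Episode` fields, and the metric names are simplifying assumptions made for this example; the deck's fuller metric list (transfer, robustness, stability, and so on) is not covered.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    success: bool
    feedback_requests: int  # how many times the agent asked for help
    memories_reused: int    # stored experiences retrieved and applied


def longitudinal_report(episodes: list[Episode]) -> dict[str, float]:
    """Compare the first and second half of a chronologically ordered run."""
    if len(episodes) < 2:
        raise ValueError("need at least two episodes to measure change over time")
    mid = len(episodes) // 2
    early, late = episodes[:mid], episodes[mid:]

    def success_rate(eps: list[Episode]) -> float:
        return sum(e.success for e in eps) / len(eps)

    def asks_per_task(eps: list[Episode]) -> float:
        return sum(e.feedback_requests for e in eps) / len(eps)

    return {
        # Positive means later tasks succeed more often.
        "success_gain": success_rate(late) - success_rate(early),
        # Negative means the agent needs less help per task over time.
        "feedback_per_task_change": asks_per_task(late) - asks_per_task(early),
        # Share of episodes in which at least one stored experience was reused.
        "reuse_rate": sum(e.memories_reused > 0 for e in episodes) / len(episodes),
    }
```

A run where success rises while feedback per task falls is the pattern the closing sentence describes: earlier work making later work measurably better.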