Research note, May 17 sharing deck, May 17, 2026

Learning Machines: A System View of TLM

May 17 Sharing: The Machine Learning

Learning Machines Lab

This sharing reframes "learning" as a property of a full system rather than a single better answer. Starting from the classic task-experience-performance view of machine learning, it defines TLM as a continuously learning agent system operating in dynamic real environments: it executes long-horizon tasks, diagnoses its capability boundaries, asks for feedback, and consolidates that feedback into an updatable agent harness so that the agent's cognitive core becomes stronger over time.

Mind map of the May 17 Learning Machines sharing
The sharing connects learning contracts, cognitive core, agent harness, feedback, persistent memory, and long-term measurement into one TLM system view.

In Brief

  1. TLM inherits the machine-learning loop, but moves the loop from static data to real long-horizon agent work.

  2. The agent harness is the practical learning carrier: memory, state, workflow, experience base, tool policies, and quality standards can all be updated without relying only on model-weight changes.

  3. The key learning behavior is active: the agent must know when it does not know, locate the missing feedback, and ask the right human, tool, verifier, or environment for help.

  4. Long-term evaluation should measure cognitive-core strengthening, experience reuse, feedback efficiency, and transfer, not only one-shot task success.

Start From Learning

The first move in the deck is conceptual discipline: learning is not "the answer looks better this time." A learning system needs at least a task or objective, an environment that supplies experience, a learning subject, a learning carrier that holds the change, a feedback signal, an update mechanism, and a measurable performance criterion. This matters because agent learning can otherwise become a vague metaphor. If we cannot say what task the agent is improving on, what feedback it received, where the update was stored, and how later performance reliably improved, then we have not yet shown learning. We have only shown execution.

Diagram of learning as a system loop
Learning becomes an explicit contract: task, experience, carrier, feedback, update, and measurable performance improvement.
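One way to make the contract concrete is to treat each claimed instance of learning as a record whose fields must all be filled in. The sketch below is only an illustration: the class `LearningClaim`, its field names, and the two helper functions are assumptions made for this example and do not come from the deck.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class LearningClaim:
    """One claimed instance of learning (hypothetical field names)."""
    task: Optional[str] = None            # what the agent is improving on
    environment: Optional[str] = None     # where the experience came from
    subject: Optional[str] = None         # who or what is learning
    carrier: Optional[str] = None         # where the update was stored
    feedback: Optional[str] = None        # the signal that was received
    update: Optional[str] = None          # how the carrier was changed
    score_before: Optional[float] = None  # measured performance before the update
    score_after: Optional[float] = None   # measured performance after the update


def missing_elements(claim: LearningClaim) -> list[str]:
    """Return the contract fields that the claim leaves unspecified."""
    return [f.name for f in fields(claim) if getattr(claim, f.name) is None]


def shows_learning(claim: LearningClaim) -> bool:
    """True only if every element is specified and the measured score went up."""
    if missing_elements(claim):
        return False
    return claim.score_after > claim.score_before


# A good-looking output with no stated feedback or carrier is execution, not learning.
execution_only = LearningClaim(task="summarize report", score_after=0.9)
print(shows_learning(execution_only), missing_elements(execution_only))
```

A single before/after score pair is the weakest possible form of the performance criterion; a real check would presumably compare repeated measurements on held-out tasks.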

Define TLM as a Learning System

TLM transfers the machine-learning loop into the setting of agents that work in dynamic real environments. The task is no longer just a labeled dataset; it is a long-horizon task with changing context, tools, humans, constraints, and external state. The learning subject is the agent formed by an LLM together with an agent harness. The learning carrier is not only model parameters, but also durable state, workflow graphs, memory, experience bases, tool policies, and quality standards. The update process becomes a loop of absorbing, compressing, validating, generalizing, and reusing feedback.

Diagram mapping the learning-system contract onto TLM
TLM relocates the learning loop from static datasets into long-horizon agent work in dynamic real environments.

Cognitive Core Is a Capability, Not a Fact Store

The deck introduces "cognitive core" by first grounding it in human cognition: perceiving, remembering, understanding, judging, solving problems, and creating. For an agent, the cognitive core is not a single piece of knowledge, a tool, or a prompt; it is the core mechanism that organizes, schedules, and updates cognitive resources. Strengthening it does not mean adding more isolated facts to memory. It means the agent becomes better at understanding the problem, decomposing it, deciding what evidence matters, calling the right resources, forming judgments, and absorbing corrections. In Bloom-style terms, TLM should move agents from remembering and understanding toward application, analysis, evaluation, and creation.

Mind map of cognitive-core capabilities
Cognitive-core strengthening means better organization, judgment, resource use, and feedback absorption, not merely a larger fact store.

Agent Harness Becomes the Runtime for Learning

The agent harness is the bidirectional runtime between model reasoning and environmental action. It turns model outputs into executable tool calls, workflows, recoverable actions, and observable traces; it also turns environment, tool, verifier, and human feedback into signals that can be inspected and reused. This is why the harness is not a thin wrapper. It carries runtime responsibilities such as context assembly, tool invocation, error recovery, observation management, and state tracking. It also carries governance and learning responsibilities: safety constraints, validation, feedback normalization, process inspection, memory updates, and quality standards.

Diagram of the agent harness as a bidirectional runtime
The harness translates model reasoning into environmental action and turns observations back into reusable learning signals.
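The two directions of the harness can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the deck's design: the names `Harness`, `Signal`, `act`, `ingest`, and `context` are hypothetical, the model itself is left out, and only tool invocation, error recovery, trace recording, feedback normalization, and context assembly are shown.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Signal:
    """A normalized observation the harness can inspect and reuse later."""
    source: str   # e.g. "tool", "verifier", "human", "environment"
    ok: bool
    content: str


@dataclass
class Harness:
    tools: dict[str, Callable[..., Any]]
    trace: list[dict] = field(default_factory=list)      # observable record of actions
    signals: list[Signal] = field(default_factory=list)  # everything that came back

    def act(self, tool_name: str, **kwargs: Any) -> Signal:
        """Model -> environment: run a tool call and record what happened."""
        try:
            result = self.tools[tool_name](**kwargs)
            signal = Signal("tool", True, str(result))
        except Exception as exc:  # error recovery: a failure becomes a signal, not a crash
            signal = Signal("tool", False, f"{type(exc).__name__}: {exc}")
        self.trace.append({"tool": tool_name, "args": kwargs, "ok": signal.ok})
        self.signals.append(signal)
        return signal

    def ingest(self, source: str, ok: bool, content: str) -> Signal:
        """Environment -> model: normalize external feedback into the same form."""
        signal = Signal(source, ok, content.strip())
        self.signals.append(signal)
        return signal

    def context(self, last_n: int = 3) -> str:
        """Context assembly: render the most recent signals for the next model call."""
        recent = self.signals[-last_n:]
        return "\n".join(
            f"[{s.source}:{'ok' if s.ok else 'fail'}] {s.content}" for s in recent
        )


harness = Harness(tools={"divide": lambda a, b: a / b})
harness.act("divide", a=6, b=3)
harness.act("divide", a=1, b=0)  # fails, but is captured as a signal
harness.ingest("human", False, "  Check for zero before dividing.  ")
print(harness.context())
```

Keeping tool results and human feedback in one `Signal` shape is the design choice that matters here: it is what lets later stages inspect and reuse them uniformly.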

Active Learning Moves From Samples to Capability Boundaries

Classical online active learning asks which incoming sample is worth labeling. TLM asks a different question: where is the agent's capability boundary, why is it uncertain, and who or what can provide the feedback that closes the gap? The learning loop therefore begins during task execution. The agent acts in a real environment, exposes uncertainty or risk, asks an expert, tool, verifier, or environment for help, then consolidates the answer into memory, workflow, quality standards, or self-diagnosis. A good TLM system must not wait passively for feedback; it must decide when to ask, what to ask, and whom to ask.

Loop for active feedback and capability-boundary diagnosis
TLM active learning asks where the agent is uncertain, what feedback would close the gap, and who or what can provide it.
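The "when, what, and whom" decision can be sketched as a small policy. Everything specific here is an assumption made for illustration: the expected-loss rule (uncertainty times risk), the threshold value, the gap categories, and the `SOURCE_FOR_GAP` routing table are hypothetical, and the self-reported `confidence` stands in for whatever boundary diagnosis a real system would use.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StepAssessment:
    """The agent's self-assessment of one step in a long-horizon task."""
    step: str
    confidence: float  # self-estimated, 0.0 to 1.0
    risk: float        # cost of being wrong, 0.0 to 1.0
    gap_kind: str      # "fact", "procedure", "quality", or "environment"


@dataclass
class HelpRequest:
    ask_whom: str
    question: str


# Hypothetical routing table: which source can close which kind of gap.
SOURCE_FOR_GAP = {
    "fact": "tool",
    "procedure": "expert",
    "quality": "verifier",
    "environment": "environment",
}


def decide_help(a: StepAssessment, threshold: float = 0.3) -> Optional[HelpRequest]:
    """Ask only when expected loss (uncertainty times risk) exceeds the threshold."""
    expected_loss = (1.0 - a.confidence) * a.risk
    if expected_loss <= threshold:
        return None  # treated as inside the capability boundary: proceed
    source = SOURCE_FOR_GAP.get(a.gap_kind, "expert")  # unknown gaps go to an expert
    question = f"Unsure about '{a.step}' ({a.gap_kind} gap). What should I check?"
    return HelpRequest(source, question)


routine = decide_help(StepAssessment("rename a temp file", 0.9, 0.2, "procedure"))
risky = decide_help(StepAssessment("drop a production table", 0.4, 0.9, "procedure"))
print(routine, risky)
```

Weighting uncertainty by risk is one simple way to avoid asking about every uncertain but harmless step; other rules would fit the text equally well.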

The Technical Agenda Has Eight Coupled Questions

The deck organizes the research agenda around the same dimensions used to define learning. The questions are not independent modules. Defining and measuring the cognitive core shapes the target. Building dynamic environments determines what experience is available. Supporting lifelong execution depends on context management. Designing persistent harness state determines where learning lives. Diagnosing capability boundaries determines when feedback is requested. Converting natural language feedback into updates determines whether help becomes reusable. Scaling-law research asks which task, feedback, and memory variables drive capability growth. Long-term evaluation checks whether the whole loop actually improves the agent over time.

Map of eight coupled technical questions for TLM
The agenda mirrors the definition of learning: target, environment, subject, carrier, paradigm, signal, update, and evaluation.

Persistent Harness State Is the Memory of Learning

A practical TLM does not have to update model weights every time it learns. Much of the learning can live in persistent harness structures: state, workflow graphs, memory, experience bases, tool policies, and quality standards. The important design problem is granularity. The system needs factual memory, cases, procedures, failure patterns, and criteria for quality; it also needs retrieval that injects the right experience at the right task stage. The deck's "consolidation / dream" language points to a second-order update: raw feedback should be compressed, verified, and generalized into stable experience rather than stored as noisy transcript fragments.

Diagram of persistent harness memory and consolidation
Persistent harness state gives learning a place to live: memory, workflows, cases, standards, and generalized experience.
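The granularity and consolidation ideas can be illustrated with a small experience base. This is a sketch under assumptions, not the deck's mechanism: `Experience`, `ExperienceBase`, and the entry kinds are hypothetical names, and consolidation is reduced to its simplest stand-in, merging repeated notes after text normalization and keeping only those seen at least `min_support` times. Real compression, verification, and generalization would need far more than this.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class Experience:
    kind: str         # "fact", "case", "procedure", "failure_pattern", "quality_criterion"
    stage: str        # task stage where it applies, e.g. "planning" or "review"
    text: str
    support: int = 1  # how many raw notes back this entry


@dataclass
class ExperienceBase:
    items: list[Experience] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize so the state can outlive a single run."""
        return json.dumps([asdict(e) for e in self.items])

    @classmethod
    def from_json(cls, payload: str) -> "ExperienceBase":
        return cls([Experience(**d) for d in json.loads(payload)])

    def consolidate(self, raw_notes: list[Experience], min_support: int = 2) -> None:
        """Merge repeated notes; keep only those seen at least min_support times."""
        counts = Counter((n.kind, n.stage, n.text.strip().lower()) for n in raw_notes)
        for (kind, stage, text), seen in counts.items():
            if seen >= min_support:
                self.items.append(Experience(kind, stage, text, support=seen))

    def retrieve(self, stage: str, kind: Optional[str] = None) -> list[Experience]:
        """Return entries for one task stage, best-supported first."""
        hits = [e for e in self.items if e.stage == stage and kind in (None, e.kind)]
        return sorted(hits, key=lambda e: e.support, reverse=True)


raw = [
    Experience("failure_pattern", "review", "Forgot to rerun tests"),
    Experience("failure_pattern", "review", " forgot to rerun tests "),
    Experience("fact", "planning", "Seen only once, so not kept"),
]
base = ExperienceBase()
base.consolidate(raw)
restored = ExperienceBase.from_json(base.to_json())  # survives a restart
print(restored.retrieve("review"))
```

Filtering by `stage` at retrieval time is the part that corresponds to injecting the right experience at the right task stage.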

Feedback Must Become an Update Signal

TLM's feedback is richer than scalar rewards or class labels. It includes expert advice, process criticism, tool verification, environment observations, and task success or failure. The hard part is turning that natural language feedback into an actionable update. The deck proposes a decomposition: separate fact corrections, standard changes, strategy suggestions, and failure causes; attribute each piece to the part of the system it should affect; verify whether the feedback is reliable; then route it to memory, workflow, quality standards, or self-diagnosis. TextGrad-like ideas matter here because they treat text itself as a gradient-like signal for improvement.

Diagram showing how natural language feedback is routed into updates
Feedback becomes reusable learning only when it is decomposed, attributed, verified, and routed into the right update surface.
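The decompose, attribute, verify, and route steps can be sketched as a small router. Several things are assumed for illustration: the feedback is taken to be already split into labeled pieces (in practice that step would likely need a model), the one-to-one mapping in `ROUTES` from piece kind to update surface is a guess at a natural pairing rather than something the deck specifies, and `reliability` is a hypothetical score from some upstream verifier.

```python
from dataclasses import dataclass, field

# Assumed attribution: which update surface each kind of feedback should affect.
ROUTES = {
    "fact_correction": "memory",
    "standard_change": "quality_standards",
    "strategy_suggestion": "workflow",
    "failure_cause": "self_diagnosis",
}


@dataclass
class FeedbackPiece:
    kind: str           # one of the keys in ROUTES
    text: str
    reliability: float  # 0.0 to 1.0, assumed to come from a verification step


@dataclass
class UpdateSurfaces:
    memory: list[str] = field(default_factory=list)
    workflow: list[str] = field(default_factory=list)
    quality_standards: list[str] = field(default_factory=list)
    self_diagnosis: list[str] = field(default_factory=list)
    rejected: list[str] = field(default_factory=list)  # unverified or unroutable


def route(pieces: list[FeedbackPiece], surfaces: UpdateSurfaces,
          min_reliability: float = 0.7) -> UpdateSurfaces:
    """Send each verified piece to its surface; park everything else in rejected."""
    for piece in pieces:
        target = ROUTES.get(piece.kind)
        if target is None or piece.reliability < min_reliability:
            surfaces.rejected.append(piece.text)
        else:
            getattr(surfaces, target).append(piece.text)
    return surfaces


pieces = [
    FeedbackPiece("fact_correction", "The API limit is 100 requests, not 1000.", 0.95),
    FeedbackPiece("strategy_suggestion", "Validate inputs before calling the API.", 0.80),
    FeedbackPiece("standard_change", "Reports now need a summary table.", 0.40),
]
surfaces = route(pieces, UpdateSurfaces())
print(surfaces)
```

Keeping rejected pieces instead of discarding them leaves room for later re-verification, which is one plausible reading of the verification step.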

Improvement Requires Scaling Laws and Long-Term Evaluation

The final part of the deck pushes beyond architecture into measurement. TLM capability growth may depend on task length, task diversity, feedback density, expert quality, memory capacity, retrieval noise, and experience reuse rate. These variables form a different scaling-law surface from model size alone. Evaluation must therefore be longitudinal, covering long-horizon task success, product quality, gap-diagnosis accuracy, feedback utilization efficiency, transfer ability, memory reuse, robustness, and continuous-running stability. The lesson is simple but demanding: a learning machine should be judged by whether yesterday's work makes tomorrow's work measurably better.

Diagram connecting TLM scaling variables with long-term evaluation
Scaling and evaluation ask whether task experience, feedback, memory, and reuse compound into durable capability growth.
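A longitudinal check of "does yesterday's work make tomorrow's work better" can be sketched over a log of episodes. The metric definitions below are assumptions chosen for illustration, not the deck's: `success_gain` compares the success rate of the last `window` episodes with the first `window`, `memory_reuse_rate` is used memories over retrieved memories, and `successes_per_feedback_request` is a crude proxy for feedback utilization efficiency.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Episode:
    """One long-horizon task run, in chronological order (hypothetical log schema)."""
    success: bool
    feedback_requests: int
    memories_retrieved: int
    memories_used: int


def success_rate(episodes: list[Episode]) -> float:
    return mean(1.0 if e.success else 0.0 for e in episodes)


def longitudinal_report(episodes: list[Episode], window: int = 3) -> dict[str, float]:
    """Compare early and late performance and summarize reuse and feedback use."""
    if len(episodes) < 2 * window:
        raise ValueError("need at least two non-overlapping windows of episodes")
    early, late = episodes[:window], episodes[-window:]
    retrieved = sum(e.memories_retrieved for e in episodes)
    used = sum(e.memories_used for e in episodes)
    asks = sum(e.feedback_requests for e in episodes)
    wins = sum(1 for e in episodes if e.success)
    return {
        "success_gain": success_rate(late) - success_rate(early),
        "memory_reuse_rate": used / retrieved if retrieved else 0.0,
        "successes_per_feedback_request": wins / asks if asks else 0.0,
    }


log = [
    Episode(False, 2, 0, 0),
    Episode(False, 2, 2, 1),
    Episode(True, 1, 2, 1),
    Episode(True, 1, 2, 2),
    Episode(True, 0, 2, 2),
    Episode(True, 0, 2, 2),
]
report = longitudinal_report(log)
print(report)
```

A positive `success_gain` on a fixed task distribution is only weak evidence; separating learning from easier late tasks would need held-out or repeated tasks, which this sketch does not model.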