Method and apparatus for improved reward-based learning using nonlinear dimensionality reduction

Invention Grant

US08060454B2 Method and apparatus for improved reward-based learning using nonlinear dimensionality reduction (status: lapsed)

Patent Title: Method and apparatus for improved reward-based learning using nonlinear dimensionality reduction
Application No.: US11870698

Application Date: 2007-10-11
Publication No.: US08060454B2

Publication Date: 2011-11-15
Inventor: Rajarshi Das, Gerald J. Tesauro, Kilian Q. Weinberger
Applicant: Rajarshi Das, Gerald J. Tesauro, Kilian Q. Weinberger
Applicant Address: US NY Armonk
Assignee: International Business Machines Corporation
Current Assignee: International Business Machines Corporation
Current Assignee Address: US NY Armonk
Main IPC: G06F15/18
IPC: G06F15/18

Abstract:

The present invention is a method and an apparatus for reward-based learning of management policies. In one embodiment, a method for reward-based learning includes receiving a set of one or more exemplars, where at least two of the exemplars comprise a (state, action) pair for a system, and at least one of the exemplars includes an immediate reward responsive to a (state, action) pair. A distance measure between pairs of exemplars is used to compute a Non-Linear Dimensionality Reduction (NLDR) mapping of (state, action) pairs into a lower-dimensional representation, thereby producing embedded exemplars, wherein one or more parameters of the NLDR are tuned to minimize a cross-validation Bellman error on a holdout set taken from the set of one or more exemplars. The mapping is then applied to the set of exemplars, and reward-based learning is applied to the embedded exemplars to obtain a learned management policy.
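The pipeline in the abstract (distance measure, NLDR embedding, parameter tuning against a holdout Bellman error, then reward-based learning on the embedded exemplars) can be illustrated with a minimal sketch. Everything specific here is an assumption and not taken from the patent: a one-dimensional kernel PCA stands in for the NLDR step, with kernel width `sigma` as the tuned parameter; a k-nearest-neighbor fitted-Q evaluation stands in for the reward-based learner; the exemplars are assumed to form a single time-ordered trajectory of (state, action, reward) tuples; and all function names and the toy data are hypothetical.

```python
import math
import random

GAMMA = 0.9  # discount factor (assumed; not specified in the abstract)

def distance(p, q):
    """Euclidean distance between two (state, action) vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kernel_pca_1d(points, sigma, iters=200):
    """1-D RBF kernel PCA (a stand-in NLDR), solved by power iteration."""
    n = len(points)
    K = [[math.exp(-distance(p, q) ** 2 / (2 * sigma ** 2)) for q in points]
         for p in points]
    row = [sum(r) / n for r in K]
    total = sum(row) / n
    # Double-center the (symmetric) kernel matrix.
    Kc = [[K[i][j] - row[i] - row[j] + total for j in range(n)]
          for i in range(n)]
    v = [float(i + 1) for i in range(n)]  # fixed, non-constant start vector
    lam = 0.0
    for _ in range(iters):
        w = [sum(Kc[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = math.sqrt(sum(x * x for x in w))
        if lam == 0.0:
            return [0.0] * n
        v = [x / lam for x in w]
    # Embedded coordinate: eigenvector scaled by sqrt of its eigenvalue.
    return [math.sqrt(lam) * x for x in v]

def knn_q(z, train_z, values, k=3):
    """Average the stored values of the k nearest embedded training points."""
    nearest = sorted(range(len(train_z)), key=lambda i: abs(train_z[i] - z))[:k]
    return sum(values[i] for i in nearest) / len(nearest)

def fit_q(transitions, sweeps=50):
    """Fitted-Q evaluation on (z, reward, z_next) triples; returns Q(z)."""
    train_z = [t[0] for t in transitions]
    values = [t[1] for t in transitions]
    for _ in range(sweeps):
        values = [r + GAMMA * knn_q(zn, train_z, values)
                  for _, r, zn in transitions]
    return lambda z: knn_q(z, train_z, values)

def bellman_error(q, transitions):
    """Mean squared one-step Bellman residual of q on the transitions."""
    return sum((r + GAMMA * q(zn) - q(z)) ** 2
               for z, r, zn in transitions) / len(transitions)

def select_sigma(exemplars, sigmas, holdout_every=4):
    """Pick the kernel width whose embedding gives the lowest holdout error."""
    pairs = [(s, a) for s, a, _ in exemplars]
    best = None
    for sigma in sigmas:
        # The embedding uses only (state, action) pairs, never rewards.
        z = kernel_pca_1d(pairs, sigma)
        trans = [(z[t], exemplars[t][2], z[t + 1]) for t in range(len(z) - 1)]
        holdout = trans[::holdout_every]
        train = [t for i, t in enumerate(trans) if i % holdout_every]
        q = fit_q(train)
        err = bellman_error(q, holdout)
        if best is None or err < best[0]:
            best = (err, sigma, q, z)
    return best

# Toy trajectory (hypothetical): scalar state in [0, 1], binary action,
# reward penalizing distance from the midpoint.
rng = random.Random(0)
exemplars, s = [], 0.5
for _ in range(40):
    a = rng.choice([0, 1])
    exemplars.append((s, float(a), -abs(s - 0.5)))
    s = min(1.0, max(0.0, s + (0.1 if a else -0.1)))

err, sigma, q, z = select_sigma(exemplars, [0.1, 0.5, 2.0])
print("selected sigma:", sigma, "holdout Bellman error:", err)
```

The sketch stops at the learned value function. Deriving a management policy from it would additionally require embedding unseen (state, action) pairs, which this toy version does not attempt.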

Public/Granted literature

US20090098515A1 Method and apparatus for improved reward-based learning using nonlinear dimensionality reduction Public/Granted day: 2009-04-16
