System for automated data engineering for large scale machine learning

Invention Grant

US11301438B2 System for automated data engineering for large scale machine learning (Active)

Patent Title: System for automated data engineering for large scale machine learning
Application No.: US17009883

Application Date: 2020-09-02
Publication No.: US11301438B2

Publication Date: 2022-04-12
Inventor: Wei Dai, Weiren Yu, Eric Xing
Applicant: Petuum Inc.
Applicant Address: US PA Pittsburgh
Assignee: Petuum Inc.
Current Assignee: Petuum Inc.
Current Assignee Address: US PA Pittsburgh
Agency: MagStone Law, LLP
Agent: Enshan Hong
Main IPC: G06F16/21
IPC: G06F16/21; G06N20/00; G06F16/27; G06F15/76

Abstract:

A data engineering system for machine learning at scale is disclosed. In one embodiment, the data engineering system includes an ingest processing module, a record store, and a transformation module. The ingest processing module has a schema update submodule, which is configured to discover new features and add them to a schema, and a feature statistics update submodule, which collects statistics for each feature to be used in an online transformation. The record store stores data from a data source. The transformation module receives a low dimensional data instance from the record store, receives the schema and feature statistics from the ingest processing module, and transforms the low dimensional data instance into a high dimensional representation. Another embodiment provides a method for data engineering for machine learning at scale. The method includes calling a built-in feature transformation or defining a new transformation, specifying a data source and compressing and storing the data, providing ingest-time processing by automatically analyzing necessary statistics for features, and then generating a schema for a dataset for subsequent data engineering. Other embodiments are disclosed herein.
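
The system embodiment described in the abstract can be illustrated with a minimal Python sketch. This is not the patented implementation: the names `IngestProcessor`, `RecordStore`, and `transform` are hypothetical, and the specific choices (JSON plus zlib for compressed storage, z-score scaling for numeric features, one-hot encoding for categorical features) are assumptions made only to show how a schema and feature statistics gathered at ingest time could drive a later low-to-high dimensional transformation. It also assumes each feature keeps a consistent type across records.

```python
import json
import math
import zlib


class IngestProcessor:
    """Hypothetical ingest-time module: schema update plus feature statistics update."""

    def __init__(self):
        self.schema = {}  # feature name -> "numeric" or "categorical"
        self.stats = {}   # feature name -> running statistics

    def ingest(self, record):
        for name, value in record.items():
            # Schema update: register a feature the first time it is seen.
            if name not in self.schema:
                is_num = isinstance(value, (int, float)) and not isinstance(value, bool)
                self.schema[name] = "numeric" if is_num else "categorical"
                self.stats[name] = (
                    {"count": 0, "sum": 0.0, "sumsq": 0.0} if is_num else {"vocab": {}}
                )
            # Feature statistics update: running moments or a growing vocabulary.
            s = self.stats[name]
            if self.schema[name] == "numeric":
                x = float(value)
                s["count"] += 1
                s["sum"] += x
                s["sumsq"] += x * x
            else:
                s["vocab"].setdefault(str(value), len(s["vocab"]))


class RecordStore:
    """Hypothetical record store keeping each raw record as compressed JSON."""

    def __init__(self):
        self._blobs = []

    def put(self, record):
        self._blobs.append(zlib.compress(json.dumps(record).encode("utf-8")))
        return len(self._blobs) - 1  # index of the stored record

    def get(self, index):
        return json.loads(zlib.decompress(self._blobs[index]).decode("utf-8"))


def transform(record, schema, stats):
    """Expand a low dimensional record into a wider numeric vector.

    Assumed encoding: z-score for numeric features, one-hot for categorical ones.
    Features missing from the record contribute zeros.
    """
    vector = []
    for name in sorted(schema):  # fixed feature order so vectors are comparable
        s = stats[name]
        if schema[name] == "numeric":
            if name not in record:
                vector.append(0.0)
                continue
            mean = s["sum"] / s["count"]
            var = max(s["sumsq"] / s["count"] - mean * mean, 0.0)
            std = math.sqrt(var) or 1.0  # avoid dividing by zero for constant features
            vector.append((float(record[name]) - mean) / std)
        else:
            one_hot = [0.0] * len(s["vocab"])
            if name in record:
                idx = s["vocab"].get(str(record[name]))
                if idx is not None:
                    one_hot[idx] = 1.0
            vector.extend(one_hot)
    return vector
```

In this sketch a record such as `{"age": 30, "city": "Pittsburgh"}` is passed to both `IngestProcessor.ingest` and `RecordStore.put`, and is later read back and expanded with `transform(store.get(i), ingest.schema, ingest.stats)`. Because the categorical vocabulary grows as new values arrive, the output vector widens over time, which is one simple way to picture a two-field record becoming a higher dimensional representation.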

Public/Granted literature

US20210026818A1 System for Automated Data Engineering for Large Scale Machine Learning (publication date: 2021-01-28)

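IPC Classification:

G	Physics
G06	Computing; calculating or counting
G06F	Electric digital data processing (computer systems based on specific computational models: see G06N)
G06F16/00	Information retrieval; database structures therefor; file system structures therefor
G06F16/20	. of structured data, e.g. relational data
G06F16/21	.. Design, administration or maintenance of databases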