Systems and methods for classifying documents for data loss prevention

Invention Grant

US09043247B1 Systems and methods for classifying documents for data loss prevention (in force)

Patent Title: Systems and methods for classifying documents for data loss prevention
Application No.: US13405293

Application Date: 2012-02-25
Publication No.: US09043247B1

Publication Date: 2015-05-26
Inventor: Michael Hart , Kushal Tayal , Phillip DiCorpo
Applicant: Michael Hart , Kushal Tayal , Phillip DiCorpo
Applicant Address: US CA Mountain View
Assignee: Symantec Corporation
Current Assignee: Symantec Corporation
Current Assignee Address: US CA Mountain View
Agency: ALG Intellectual Property, LLC
Main IPC: G06F15/18
IPC: G06F15/18 ; G06F17/30 ; G06N5/02

Abstract:

A computer-implemented method for classifying documents for data loss prevention may include 1) identifying a set of training documents for a machine learning classifier configured for data loss prevention, 2) performing a semantic analysis on the set of training documents to identify a plurality of topics within the set of training documents, 3) applying a similarity metric to the plurality of topics to identify at least one unrelated topic with a similarity to the other topics within the plurality of topics, as determined by the similarity metric, that falls below a similarity threshold, 4) identifying, based on the semantic analysis, at least one irrelevant training document within the set of training documents in which a predominance of the unrelated topic is above a predominance threshold, and 5) excluding the irrelevant training document from the set of training documents based on the predominance of the unrelated topic within the irrelevant training document. Various other methods, systems, and computer-readable media are also disclosed.
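
Steps 3 through 5 of the abstract can be illustrated with a minimal sketch. It is not the patented implementation. It assumes the semantic analysis of step 2 (for example, a topic model) has already produced two inputs: a term-weight vector per topic and a topic-weight vector per document. The use of cosine similarity, the averaging of a topic's similarity over all other topics, the function names, the threshold values, and the sample data are all assumptions made for illustration.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def unrelated_topics(topic_term, similarity_threshold):
    """Step 3 (assumed metric): return indices of topics whose mean cosine
    similarity to every other topic falls below the similarity threshold."""
    unrelated = []
    for i, topic in enumerate(topic_term):
        sims = [cosine(topic, other)
                for j, other in enumerate(topic_term) if j != i]
        if sims and sum(sims) / len(sims) < similarity_threshold:
            unrelated.append(i)
    return unrelated


def filter_training_set(doc_topic, topic_term,
                        similarity_threshold=0.2,
                        predominance_threshold=0.5):
    """Steps 4 and 5: split documents into (kept, excluded), excluding any
    document in which an unrelated topic's weight exceeds the
    predominance threshold."""
    bad = unrelated_topics(topic_term, similarity_threshold)
    kept, excluded = [], []
    for doc_id, weights in doc_topic.items():
        if any(weights[t] > predominance_threshold for t in bad):
            excluded.append(doc_id)
        else:
            kept.append(doc_id)
    return kept, excluded


# Hypothetical output of step 2. Vocabulary order (assumed):
# ["merger", "revenue", "forecast", "lunch"]
topic_term = [
    [0.50, 0.40, 0.10, 0.00],  # topic 0: financial terms
    [0.30, 0.30, 0.40, 0.00],  # topic 1: financial terms
    [0.00, 0.00, 0.05, 0.95],  # topic 2: dominated by "lunch"
]
doc_topic = {
    "q3_report.txt":      [0.60, 0.35, 0.05],
    "deal_memo.txt":      [0.30, 0.60, 0.10],
    "cafeteria_menu.txt": [0.05, 0.05, 0.90],
}

kept, excluded = filter_training_set(doc_topic, topic_term)
print("kept:", kept)
print("excluded:", excluded)
```

With this sample data, topic 2 shares almost no vocabulary with the other two topics, so it is flagged as unrelated, and the one document dominated by it is dropped from the training set before the classifier is trained.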

