Publication number: US20170300565A1
Publication date: 2017-10-19
Application number: US15098856
Filing date: 2016-04-14
Applicant: Xerox Corporation
Inventor: Ioan Calapodescu, Nicolas Guerin, Fanchon Jacques
CPC classification number: G06F16/353, G06F16/278, G06F16/30, G06F16/3325, G06F16/3344, G06F16/35, G06F16/93, G06N20/00
Abstract: A method for extracting entities from a text document includes, for at least a section of the text document, providing a first set of entities extracted from the section, and clustering at least a subset of the extracted entities in the first set into clusters, based on locations of the entities in the document. Complete ones of the clusters of entities are identified. Patterns for extracting new entities are learned based on the complete clusters. New entities are extracted from incomplete clusters based on the learned patterns.
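
The four steps named in the abstract (cluster by location, identify complete clusters, learn patterns, extract from incomplete clusters) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent: the `Entity` record, the character-gap clustering rule (`max_gap`), the definition of a complete cluster as one containing every kind in `required_kinds`, and the use of the label text preceding an entity as the learned pattern. The record does not say how any of these steps are actually implemented.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    kind: str   # entity type, e.g. "name" or "phone" (hypothetical types)
    text: str
    start: int  # character offset of the entity in the document
    end: int


def cluster_by_location(entities, max_gap=40):
    """Group entities whose gap to the previous entity is at most max_gap characters."""
    clusters = []
    for ent in sorted(entities, key=lambda e: e.start):
        if clusters and ent.start - clusters[-1][-1].end <= max_gap:
            clusters[-1].append(ent)
        else:
            clusters.append([ent])
    return clusters


def is_complete(cluster, required_kinds):
    """Assumed completeness rule: the cluster holds every required entity kind."""
    return required_kinds <= {e.kind for e in cluster}


def learn_patterns(doc, complete_clusters, context=12):
    """Per entity kind, build regexes from the same-line text just before each entity."""
    patterns = {}
    for cluster in complete_clusters:
        for ent in cluster:
            line_start = doc.rfind("\n", 0, ent.start) + 1
            prefix = doc[max(line_start, ent.start - context):ent.start]
            if prefix.strip():
                regex = re.escape(prefix) + r"(\S[^\n]*)"
                patterns.setdefault(ent.kind, set()).add(regex)
    return patterns


def extract_new(doc, cluster, patterns, required_kinds, window=40):
    """Search near an incomplete cluster for the entity kinds it is missing."""
    lo = max(0, min(e.start for e in cluster) - window)
    hi = min(len(doc), max(e.end for e in cluster) + window)
    present = {e.kind for e in cluster}
    found = []
    for kind in sorted(required_kinds - present):
        for regex in sorted(patterns.get(kind, ())):
            match = re.compile(regex).search(doc, lo, hi)
            if match:
                found.append(Entity(kind, match.group(1).strip(),
                                    match.start(1), match.end(1)))
                break
    return found


# Toy document: two contact records separated by a ruler line.
doc = ("Name: Alice Martin\nPhone: 555-0101\n" + "=" * 50 +
       "\nName: Bob Stone\nPhone: 555-0102\n")


def ent(kind, text):
    start = doc.index(text)
    return Entity(kind, text, start, start + len(text))


# First set of entities: the upstream extractor missed Bob's phone number.
first_set = [ent("name", "Alice Martin"), ent("phone", "555-0101"),
             ent("name", "Bob Stone")]
required_kinds = {"name", "phone"}

clusters = cluster_by_location(first_set)
complete = [c for c in clusters if is_complete(c, required_kinds)]
incomplete = [c for c in clusters if not is_complete(c, required_kinds)]
patterns = learn_patterns(doc, complete)
new_entities = [e for c in incomplete
                for e in extract_new(doc, c, patterns, required_kinds)]
print([(e.kind, e.text) for e in new_entities])  # [('phone', '555-0102')]
```

In this toy run, the complete cluster (Alice's record) yields a `Phone: ` prefix pattern, which then recovers the phone number missing from the incomplete cluster (Bob's record). A realistic system would likely use richer location features and pattern types than a character gap and a literal prefix.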