Invention Grant
- Patent Title: System and method for automatically extracting metadata from unstructured electronic documents
- Patent Title (中): 从非结构化电子文档自动提取元数据的系统和方法
- Application No.: US13258484
- Application Date: 2010-01-18
- Publication No.: US08843815B2
- Publication Date: 2014-09-23
- Inventor: Sheng-Wen Yang, Yuhong Xiong, Wei Liu
- Applicant: Sheng-Wen Yang, Yuhong Xiong, Wei Liu
- Applicant Address: US TX Houston
- Assignee: Hewlett-Packard Development Company, L.P.
- Current Assignee: Hewlett-Packard Development Company, L.P.
- Current Assignee Address: US TX Houston
- International Application: PCT/CN2010/070243 WO 20100118
- International Announcement: WO2011/085562 WO 20110721
- Main IPC: G06F17/00
- IPC: G06F17/00 ; G06F17/27 ; G06F17/30

Abstract:
A system and method for automatically extracting meta data from unstructured electronic documents is disclosed. In one embodiment, the unstructured electronic document is converted into a plain text document. Further, a document header of the unstructured electronic document is extracted from the plain text document using a rule-based document header extractor, where the rule-based document header extractor may be based on a rule that includes determining a ratio of a number of words with their initial letters capitalized in a text line over a total number of words in the text line in the plain text document. Moreover, meta data is extracted from the extracted document header using a heuristic approach.
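The capitalization rule named in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the function names, the whitespace tokenization, and the 0.6 threshold are all assumptions made for illustration, since the abstract only states that the rule involves the ratio of words with capitalized initial letters to the total number of words in a text line.

```python
def capitalized_ratio(line: str) -> float:
    """Return the share of words in a line whose first character is uppercase.

    Words are split on whitespace (an assumption; the abstract does not
    specify a tokenization). An empty line yields 0.0.
    """
    words = line.split()
    if not words:
        return 0.0
    capitalized = sum(1 for word in words if word[0].isupper())
    return capitalized / len(words)


def looks_like_header_line(line: str, threshold: float = 0.6) -> bool:
    """Flag a line as a header candidate when its ratio meets the threshold.

    The threshold value is hypothetical and chosen only for this example.
    """
    return capitalized_ratio(line) >= threshold


if __name__ == "__main__":
    title_like = "Automatic Metadata Extraction from Documents"
    body_like = "the method is described in the next section"
    print(capitalized_ratio(title_like), looks_like_header_line(title_like))
    print(capitalized_ratio(body_like), looks_like_header_line(body_like))
```

A title-like line scores high on this ratio while ordinary body text scores low, which is why such a ratio can plausibly help separate header lines from the rest of a plain text document.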
Public/Granted literature
- US20120278705A1 System and Method for Automatically Extracting Metadata from Unstructured Electronic Documents. Public/Granted day: 2012-11-01