Invention Grant
- Patent Title: Method for organizing structurally similar web pages from a web site
- Application No.: US11838351
- Application Date: 2007-08-14
- Publication No.: US07941420B2
- Publication Date: 2011-05-10
- Inventor: Krishna Prasad Chitrapura, Krishna Leela Poola
- Applicant: Krishna Prasad Chitrapura, Krishna Leela Poola
- Applicant Address: US CA Sunnyvale
- Assignee: Yahoo! Inc.
- Current Assignee: Yahoo! Inc.
- Current Assignee Address: US CA Sunnyvale
- Agency: Hickman Palermo Truong & Becker LLP
- Main IPC: G06F17/30
- IPC: G06F17/30

Abstract:
Techniques are described for organizing structurally similar web pages for a website. A fingerprint of each web page's structure is computed by shingling: the page's HTML tags and attributes are placed in sequence and encoded using a standard encoding technique. Fixed-size portions of the encoded sequence are taken, and a set of values is extracted using independent hash functions to compute the shingles. Alternatively, a DOM tree representation of the web page's HTML is generated, each path of the DOM tree is encoded, and values are extracted using independent hash functions to compute the shingles. A specified number of shingles are retained as the fingerprint. The pages are then clustered based upon the URL and the similarity of the shingles. The clustered hierarchical organization of pages is further pruned by various criteria, including similarity of shingles or support of the cluster node in the hierarchy.
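
The tag-sequence variant described in the abstract can be illustrated with a minimal sketch. This is not the patented implementation. The names `TagSequenceParser`, `structural_fingerprint`, and `similarity`, the window size, the number of hash functions, and the use of seeded SHA-1 with a min-hash selection rule are all assumptions made for illustration. The abstract does not say which encoding, which hash functions, or which retention rule is used.

```python
import hashlib
from html.parser import HTMLParser


class TagSequenceParser(HTMLParser):
    """Collects start tags and their attribute names in document order."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Page text is ignored; only structure (tags, attribute names) is kept.
        self.tokens.append(tag)
        self.tokens.extend(name for name, _ in attrs)


def _seeded_hash(seed, text):
    # Assumption: seeded SHA-1 stands in for one "independent hash function".
    digest = hashlib.sha1(f"{seed}:{text}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def structural_fingerprint(html, window=4, num_hashes=8):
    """Return a tuple of num_hashes shingle values for the page structure."""
    parser = TagSequenceParser()
    parser.feed(html)
    tokens = parser.tokens
    # Fixed-size portions (sliding windows) of the tag/attribute sequence.
    count = max(1, len(tokens) - window + 1)
    windows = [" ".join(tokens[i:i + window]) for i in range(count)]
    # Assumption: keep the minimum value under each hash function (min-hash).
    return tuple(
        min(_seeded_hash(seed, w) for w in windows)
        for seed in range(num_hashes)
    )


def similarity(fp_a, fp_b):
    """Fraction of positions at which two equal-length fingerprints agree."""
    matches = sum(a == b for a, b in zip(fp_a, fp_b))
    return matches / len(fp_a)
```

In this sketch, two pages that share the same markup but differ only in their text produce identical fingerprints, because the parser discards text content. A clustering step could then group pages whose `similarity` exceeds a chosen threshold, in combination with URL-based grouping as the abstract describes.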
Public/Granted literature
- US20090049062A1 Method for Organizing Structurally Similar Web Pages from a Web Site (Public/Granted day: 2009-02-19)