Invention Grant
- Patent Title: Techniques for clustering structurally similar web pages based on page features
- Application No.: US11481809
- Application Date: 2006-07-05
- Publication No.: US07676465B2
- Publication Date: 2010-03-09
- Inventor: Krishna Leela Poola
- Applicant: Krishna Leela Poola
- Applicant Address: US CA Sunnyvale
- Assignee: Yahoo! Inc.
- Current Assignee: Yahoo! Inc.
- Current Assignee Address: US CA Sunnyvale
- Agency: Hickman Palermo Truong & Becker LLP
- Main IPC: G06F7/00
- IPC: G06F7/00 ; G06F17/30

Abstract:
Web page clustering techniques described herein are URL Clustering and Page Clustering, whereby clustering algorithms cluster together pages that are structurally similar. Regarding URL clustering, because similarly structured pages have similar patterns in their URLs, grouping similar URL patterns will group structurally similar pages. Embodiments of URL clustering may involve: (a) URL normalization and (b) URL variation computation. Regarding page clustering, page feature-based techniques further cluster any given set of homogenous clusters, reducing the number of clusters based on the underlying page code. Embodiments of page clustering may reduce the number of clusters based on the tag probabilities and the tag sequence, utilizing an Approximate Nearest Neighborhood (ANN) graph along with evaluation of intra-cluster and inter-cluster compactness.
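
The URL clustering idea in the abstract (normalize URLs, then group those sharing a pattern) can be illustrated with a minimal Python sketch. The abstract does not say how normalization or variation computation is performed, so the rules below are assumptions made for illustration only: numeric path segments are collapsed to a placeholder, and query parameter values are dropped. The function names `normalize_url` and `cluster_urls` are hypothetical and do not come from the patent.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def normalize_url(url: str) -> str:
    """Reduce a URL to a coarse pattern (hypothetical normalization rules)."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    segments = [s for s in parts.path.split("/") if s]
    # Assumption: purely numeric path segments are record IDs that vary per page.
    pattern = ["<num>" if s.isdigit() else s for s in segments]
    # Assumption: query parameter names matter, their values do not.
    keys = sorted(p.split("=", 1)[0] for p in parts.query.split("&") if p)
    query = "?" + "&".join(keys) if keys else ""
    return host + "/" + "/".join(pattern) + query


def cluster_urls(urls):
    """Group URLs whose normalized patterns are identical."""
    clusters = defaultdict(list)
    for url in urls:
        clusters[normalize_url(url)].append(url)
    return dict(clusters)


if __name__ == "__main__":
    sample = [
        "http://example.com/product/123",
        "http://example.com/product/456",
        "http://example.com/about",
    ]
    for pattern, members in cluster_urls(sample).items():
        print(pattern, members)
```

Exact pattern matching is the simplest possible grouping rule. The page clustering stage described in the abstract (tag probabilities, tag sequences, and the ANN graph) would operate on the resulting clusters and is not attempted here.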
Public/Granted literature
- US20080010292A1 Techniques for clustering structurally similar webpages based on page features (Public/Granted day: 2008-01-10)