Invention Grant
- Patent Title: Techniques for categorizing web pages
- Application No.: US12652624
- Application Date: 2010-01-05
- Publication No.: US08768926B2
- Publication Date: 2014-07-01
- Inventor: Ashwin Tengli, Rajeev Rastogi, Jeyashankher Ramamirtham, Srinivasan H Sengamedu, Sandeepkumar Bhuramal Satpal
- Applicant: Ashwin Tengli, Rajeev Rastogi, Jeyashankher Ramamirtham, Srinivasan H Sengamedu, Sandeepkumar Bhuramal Satpal
- Applicant Address: US CA Sunnyvale
- Assignee: Yahoo! Inc.
- Current Assignee: Yahoo! Inc.
- Current Assignee Address: US CA Sunnyvale
- Agency: Hickman Palermo Truong Becker Bingham Wong LLP
- Main IPC: G06F7/00
- IPC: G06F7/00 ; G06F17/30

Abstract:
Web pages are efficiently categorized in a data processor without analyzing the content of the web pages. According to at least one embodiment, data is maintained that represents sample URLs grouped into a plurality of clusters. The sample URLs of a cluster are used to produce a URL regular expression pattern (“URL-regex”) that differentiates the sample URLs of the cluster from the sample URLs of other clusters and that covers at least a specified percentage of the sample URLs in the cluster. The process of producing a URL-regex is repeated for each of the clusters producing a URL-regex for each cluster. Web pages are then categorized into one of the clusters by determining which of the URL-regex patterns produced for the clusters match URLs that refer to the web pages. Thus, a web page may be categorized based on a URL that refers to the web page without having to obtain and analyze the content of the web page.
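The flow described in the abstract can be illustrated with a minimal Python sketch. It is not the patented method: the generalization heuristic (keep a path segment literal when every sample agrees, otherwise widen it to digits or to any non-slash run), the helper names `build_url_regex`, `build_all`, and `categorize`, the 0.8 coverage threshold, and the `shop.example.com` URLs are all assumptions made for illustration. The sketch only shows the three properties the abstract names: one pattern per cluster, a minimum coverage of that cluster's samples, and no matches against the samples of other clusters.

```python
import re
from collections import Counter
from urllib.parse import urlsplit


def _key(url):
    # Reduce a URL to "host/seg1/seg2/..." (scheme, query, fragment dropped).
    parts = urlsplit(url)
    return "/".join([parts.netloc] + [s for s in parts.path.split("/") if s])


def _generalize(column):
    # One regex fragment for a single segment position (assumed heuristic).
    values = set(column)
    if len(values) == 1:
        return re.escape(column[0])      # every sample agrees: keep literal
    if all(re.fullmatch(r"[0-9]+", v) for v in values):
        return r"[0-9]+"                 # numeric identifiers
    return r"[^/]+"                      # anything else within one segment


def build_url_regex(samples, other_samples, min_coverage=0.8):
    # Return a compiled pattern for one cluster, or None if no pattern
    # meets the coverage threshold while excluding the other clusters.
    if not samples:
        return None
    rows = [_key(u).split("/") for u in samples]
    # Use the most common segment count as the template shape.
    length = Counter(len(r) for r in rows).most_common(1)[0][0]
    rows = [r for r in rows if len(r) == length]
    pattern = re.compile("/".join(_generalize(col) for col in zip(*rows)))
    covered = sum(1 for u in samples if pattern.fullmatch(_key(u)))
    if covered / len(samples) < min_coverage:
        return None
    if any(pattern.fullmatch(_key(u)) for u in other_samples):
        return None                      # fails to differentiate clusters
    return pattern


def build_all(clusters, min_coverage=0.8):
    # Repeat the pattern-building step for every cluster.
    patterns = {}
    for name, samples in clusters.items():
        others = [u for n, s in clusters.items() if n != name for u in s]
        pattern = build_url_regex(samples, others, min_coverage)
        if pattern is not None:
            patterns[name] = pattern
    return patterns


def categorize(url, patterns):
    # Categorize by URL alone; the page content is never fetched.
    key = _key(url)
    for name, pattern in patterns.items():
        if pattern.fullmatch(key):
            return name
    return None


# Hypothetical sample clusters.
clusters = {
    "product": [
        "http://shop.example.com/item/101",
        "http://shop.example.com/item/2047",
        "http://shop.example.com/item/9",
    ],
    "review": [
        "http://shop.example.com/reviews/101/page",
        "http://shop.example.com/reviews/55/page",
    ],
}
patterns = build_all(clusters)
print(categorize("http://shop.example.com/item/777?ref=home", patterns))
```

A URL that matches none of the patterns returns `None` here; how unmatched URLs should be handled is not stated in the abstract, so that behavior is likewise an assumption of the sketch.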
Public/Granted literature:
- US20110167063A1 TECHNIQUES FOR CATEGORIZING WEB PAGES, Public/Granted day: 2011-07-07