Method and system for page construct detection based on sequential regularities
Abstract:
Disclosed is a method and system that generates a page construct structure associated with a sequentially-ordered set of pages, each being characterized by a set of page construct features. N-grams, i.e., a sequence of n features, are computed from a set of page construct features for n contiguous pages, and n-grams which are repetitive are selected. Pages matching the most frequent repetitive n-ram are grouped together under a new node, and a new sequence is created. The method is iteratively applied to this new sequence. The output is an ordered set of trees.
Information query
Patent Agency Ranking
0/0