Methods and systems for detecting duplicate document using document similarity measuring model based on deep learning
Abstract:
Disclosed is a method and system, the method including extracting similar and dissimilar document pair sets from a document database, the similar document pair set including similar document pairs having a common attribute, and the dissimilar document pair set including dissimilar document pairs extracted randomly, calculating a mathematical similarity for each of the similar and dissimilar document pairs using a mathematical measure to obtain a first and second mathematical similarities, calculating a semantic similarity for each of the similar and dissimilar document pairs to obtain a first and second semantic similarities, the first semantic similarities being higher than the first mathematical similarities, and the second semantic similarities being lower than the second mathematical similarities, training a similarity model based on the similar and dissimilar document pairs, and the first and second semantic similarities to obtain a trained similarity model, and detecting a duplicate document using the trained similarity model.
Information query
Patent Agency Ranking
0/0