Journal of Guangxi Academy of Sciences

Cite this article:

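ZHOU Xiao-ping, HUANG Jia-yu, LIU Lian-fang, LIANG Yi-ping, SHEN Wen-ming. A Web Page Deduplication Algorithm Based on Page Body Topic and Summary [J]. Journal of Guangxi Academy of Sciences, 2009, 25(4): 251-253.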
ZHOU Xiao-ping, HUANG Jia-yu, LIU Lian-fang, LIANG Yi-ping, SHEN Wen-ming. The Detection on Duplicated Web Pages from Meta Search [J]. Journal of Guangxi Academy of Sciences, 2009, 25(4): 251-253.


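A Web Page Deduplication Algorithm Based on Page Body Topic and Summary
ZHOU Xiao-ping¹, HUANG Jia-yu², LIU Lian-fang¹,², LIANG Yi-ping¹, SHEN Wen-ming¹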
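(1. School of Computer, Electronic and Information, Guangxi University, Nanning, Guangxi 530004, China; 2. Pingsoft New Technology Co. Ltd. of Nanning, Nanning, Guangxi 530003, China)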

Abstract (translated from the Chinese):

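To address duplicate web pages returned by meta search that have identical content but very different aliases, a web page deduplication algorithm based on the topic of the page body and the page summary is proposed, and its effectiveness is verified experimentally. The algorithm first processes the page titles returned by each member search engine to extract the topic information of each page, then segments the summary into words and computes the similarity of the summaries. Combining the two reflects the content of the article summary more faithfully and achieves deduplication. The algorithm is effective, shows a clear advantage over traditional signature-based algorithms, and comes closer to the results of manual counting.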

Keywords: deduplication; web pages; word segmentation; similarity; meta search

DOI：

Submitted: 2009-10-10

Funding: Supported by the National Innovation Fund for Small and Medium-sized Enterprises (No. 08c26224501313).

The Detection on Duplicated Web Pages from Meta Search

ZHOU Xiao-ping¹, HUANG Jia-yu², LIU Lian-fang¹,², LIANG Yi-ping¹, SHEN Wen-ming¹

(1. School of Computer, Electronic and Information, Guangxi University, Nanning, Guangxi, 530004, China; 2. Pingsoft New Technology Co. Ltd. of Nanning, Nanning, Guangxi, 530004, China)

Abstract:

To address duplicated web pages returned by a meta-search engine that have the same content but different names, an algorithm for detecting duplicated web pages that combines the title and the summary of each page is proposed. The effectiveness of the algorithm is verified through experiments. First, the algorithm analyzes the page titles returned by the individual search engines; second, the thematic information of each page is extracted and word segmentation is carried out on the summary; finally, the similarity is calculated. By combining the thematic information from the page title with the similarity of the segmented summaries, the algorithm better reflects the content of the article summary and can detect and eliminate duplicated web pages. The algorithm has clear advantages over the traditional signature-based algorithm, and its results are closer to those of manual counting.
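
The pipeline outlined in the abstract (topic from the title, segmentation of the summary, similarity, combined decision) might look roughly like the following Python sketch. This is an illustration under stated assumptions, not the authors' implementation: the separator heuristic in `extract_topic`, the use of character bigrams in place of a real Chinese word segmenter, the Jaccard measure, and the `threshold` value of 0.6 are all hypothetical choices, since the abstract does not specify them.

```python
import re

# Hypothetical separators that often divide a page's topic from a site name
# in result titles (an assumption, not taken from the paper).
_TITLE_SEPARATORS = re.compile(r"\s+-\s+|[_|]")


def extract_topic(title):
    """Keep the longest title segment as the page 'topic' (assumed heuristic)."""
    parts = [p.strip().lower() for p in _TITLE_SEPARATORS.split(title) if p.strip()]
    return max(parts, key=len) if parts else ""


def segment(summary):
    """Stand-in for word segmentation: character bigrams with whitespace removed."""
    text = re.sub(r"\s+", "", summary.lower())
    if len(text) < 2:
        return {text} if text else set()
    return {text[i:i + 2] for i in range(len(text) - 1)}


def summary_similarity(a, b):
    """Jaccard similarity of the two segmented summaries (assumed measure)."""
    sa, sb = segment(a), segment(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def is_duplicate(r1, r2, threshold=0.6):
    """Two results count as duplicates if topics match and summaries are similar."""
    same_topic = extract_topic(r1["title"]) == extract_topic(r2["title"])
    similar = summary_similarity(r1["summary"], r2["summary"]) >= threshold
    return same_topic and similar


def deduplicate(results, threshold=0.6):
    """Keep the first result of every duplicate group, preserving input order."""
    kept = []
    for r in results:
        if not any(is_duplicate(r, k, threshold) for k in kept):
            kept.append(r)
    return kept
```

Each result here is assumed to be a dictionary with `title` and `summary` keys, as a merged meta-search result list might provide. In practice the bigram stand-in would likely be replaced by a proper word segmenter, and the threshold would need tuning against manually labeled duplicates.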

Key words: duplicate detection; Web pages; Chinese word segmentation; repetition rate; meta search engine
