Cite this article
  • 申文明,黄家裕,刘连芳.平行语料库的相似语句去重算法[J].广西科学院学报,2009,25(4):248-250,256.
  • SHEN Wen-ming,HUANG Jia-yu,LIU Lian-fang.Algorithm for Removing Similar Sentence on Parallel Corpus[J].Journal of Guangxi Academy of Sciences,2009,25(4):248-250,256.
平行语料库的相似语句去重算法
申文明1, 黄家裕2, 刘连芳1,2
(1.广西大学计算机与电子信息学院, 广西南宁 530004;2.南宁平方软件新技术有限公司, 广西南宁 530003)
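Abstract:
This paper classifies the kinds of similarity among Chinese sentences that need to be deduplicated in a parallel corpus and computes sentence similarity from a global similarity factor and a local similarity factor. Borrowing the match-and-skip idea of the KMP algorithm, it proposes a KMP-like algorithm for Chinese string matching and verifies the algorithm experimentally. The results show that the algorithm performs well and can deduplicate similar sentences in a parallel corpus. In open testing, recall reaches 94% and deduplication precision reaches 84%. The algorithm can be applied to sentence comparison at any sentence length and is widely applicable.
Keywords:  duplicate removal  similar sentences  parallel corpus  KMP-like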
DOI:
Submitted: 2009-10-10
Funding: Supported by the Nanning Municipal Talent Highland (人才小高地) Fund project (No. 2007007).
Algorithm for Removing Similar Sentence on Parallel Corpus
SHEN Wen-ming1, HUANG Jia-yu2, LIU Lian-fang1,2
(1.School of Computer, Electronic and Information, Guangxi University, Nanning, Guangxi, 530004, China;2.Pingsoft New Technology Co. Ltd. of Nanning, Nanning, Guangxi, 530004, China)
Abstract:
The kinds of similarity between Chinese sentences are classified, and duplicate sentences are removed. Sentence similarity is computed from a global (whole-sentence) similarity factor and a local (partial) similarity factor. Borrowing the jump idea of the KMP algorithm, a KMP-like matching algorithm for Chinese strings is used. The experimental results show that the algorithm is effective: in large-scale testing, the recall of duplicate removal reaches 94% and the precision reaches 84%.
Key words:  duplicate removal  similar sentence  parallel corpus  similar KMP
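
The abstract refers to the "jump" idea of KMP without describing the KMP-like variant itself. For orientation only, the following is a minimal Python sketch of the classic KMP matcher applied to Chinese text; it is not the authors' algorithm, and the function names `build_failure` and `kmp_find_all` are hypothetical. It assumes sentences are plain Unicode strings compared character by character.

```python
def build_failure(pattern: str) -> list[int]:
    """Classic KMP failure table: fail[i] is the length of the longest
    proper prefix of pattern[:i+1] that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0  # length of the currently matched prefix
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]  # fall back to a shorter border
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_find_all(text: str, pattern: str) -> list[int]:
    """Return the start index of every occurrence of pattern in text.
    On a mismatch the pattern 'jumps' using the failure table, so no
    character of text is examined again from scratch."""
    if not pattern:
        return []
    fail = build_failure(pattern)
    hits = []
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = fail[k - 1]  # keep going to find overlapping matches
    return hits
```

Because Python strings are sequences of Unicode code points, a call such as `kmp_find_all("平行语料库的相似语句去重算法", "相似")` compares whole Chinese characters rather than bytes. How the paper adapts the jump step to similarity scoring is not stated in the abstract and is not modelled here.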