Journal of Guangxi Academy of Sciences

Cite this article:

Wuyuntana, Wangsiriguleng. Research on Mongolian Word Vectors Evaluation[J]. Journal of Guangxi Academy of Sciences, 2018, 34(1): 68-71.


Research on Mongolian Word Vectors Evaluation (蒙古语词向量评测研究)
Wuyuntana (乌云塔那), Wangsiriguleng (王斯日古楞)
(College of Computer and Information Engineering, Inner Mongolia Normal University, Hohhot, Inner Mongolia 010022)

Abstract (translated from the Chinese):

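Word vectors have good semantic properties and can be used to improve and simplify many natural language processing applications. This study trains Mongolian word vectors with two model architectures, CBOW and Skip-gram, on different data sets and at different dimensions. It then designs a combined semantic and syntactic test set that reflects the characteristics of Mongolian, and evaluates word vector quality on this test set by semantic and syntactic similarity. The results show that on the Mongolian semantic and syntactic similarity tasks the Skip-gram model outperforms the CBOW model, that word vector quality is best when the Skip-gram window size is 5, and that quality improves markedly as the vector dimension or the amount of training data increases.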

Keywords: word vectors; CBOW model; Skip-gram model; word vector quality; semantic and syntactic similarity

DOI：10.13657/j.cnki.gxkxyxb.20180320.006

Received: 2017-11-01; Revised: 2017-12-15

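Funding: Supported by the Natural Science Foundation of Inner Mongolia Autonomous Region project "Research on Mongolian Named Entity Recognition Based on Conditional Random Fields" (2016MS0623) and the National Natural Science Foundation of China project "Research on Neural Network-Based Mongolian-Chinese Machine Translation" (61762072).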

Research on Mongolian Word Vectors Evaluation

Wuyuntana, Wangsiriguleng

(Computer and Information Engineering College, Inner Mongolia Normal University, Hohhot, Inner Mongolia, 010022, China)

Abstract:

Word vectors have good semantic properties and can be used to improve and simplify many natural language processing applications. This study trains Mongolian word vectors with two model architectures, CBOW (continuous bag-of-words) and Skip-gram, on different data sets and at different vector dimensions. We then design a combined semantic and syntactic test set based on features of the Mongolian language, and on this test set we use semantic and syntactic similarity to evaluate the quality of the word vectors. The results indicate that the Skip-gram model outperforms the CBOW model on Mongolian semantic and syntactic similarity tasks, and that word vector quality is best when the Skip-gram window size is 5. Moreover, the quality of the word vectors improves markedly as the vector dimension or the amount of training data increases.

Key words: word vectors; CBOW model; Skip-gram model; quality of the word vectors; semantic and syntactic similarity
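
The abstract does not spell out how similarity on the test set is scored. As a minimal sketch, assuming an analogy-style test ("a is to b as c is to ?") answered by cosine similarity, the evaluation could look like the following. The function names, the toy vectors, and the English placeholder words are all hypothetical and are not taken from the paper; in practice the vectors would come from the trained CBOW or Skip-gram models and the questions from the Mongolian test set.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def solve_analogy(vectors, a, b, c):
    """Return the word closest (by cosine) to b - a + c, excluding a, b and c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))

def analogy_accuracy(vectors, questions):
    """Fraction of (a, b, c, expected) questions answered correctly.

    Questions containing out-of-vocabulary words are skipped (an assumption).
    """
    answerable = [q for q in questions if all(w in vectors for w in q)]
    if not answerable:
        return 0.0
    correct = sum(solve_analogy(vectors, a, b, c) == d
                  for a, b, c, d in answerable)
    return correct / len(answerable)

# Toy 2-dimensional vectors, invented purely for illustration.
toy_vectors = {
    "king": [0.9, 0.8],
    "queen": [0.9, 0.2],
    "man": [0.5, 0.9],
    "woman": [0.5, 0.3],
    "apple": [-0.7, 0.1],
}
toy_questions = [("man", "woman", "king", "queen")]
accuracy = analogy_accuracy(toy_vectors, toy_questions)
```

Running the same accuracy function over vectors from each architecture, window size, dimension, and training-set size would give the kind of comparison the abstract reports, under the assumptions stated above.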
