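Abstract: |
Word vectors have good semantic properties and can be used to improve and simplify many natural language information processing applications. This study trains Mongolian word vectors with two model architectures, CBOW and Skip-gram, on different data and at different dimensions. It then designs a combined semantic and syntactic test set based on features of the Mongolian language and uses semantic and syntactic similarity on this test set to evaluate word vector quality. The results show that, on Mongolian semantic and syntactic similarity tasks, the Skip-gram model outperforms the CBOW model, that word vector quality is best when the Skip-gram window size is 5, and that word vector quality improves markedly as the word vector dimension or the amount of training data increases. |
Key words: word vectors; CBOW model; Skip-gram model; word vector quality; semantic and syntactic similarity |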
DOI:10.13657/j.cnki.gxkxyxb.20180320.006 |
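Received: 2017-11-01; Revised: 2017-12-15 |
Funding: Supported by the Natural Science Foundation of Inner Mongolia Autonomous Region project "Research on Mongolian Named Entity Recognition Based on Conditional Random Fields" (2016MS0623) and the National Natural Science Foundation of China project "Research on Mongolian-Chinese Machine Translation Based on Neural Networks" (61762072). |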
|
Research on Mongolian Word Vectors Evaluation |
Wuyuntana, Wangsiriguleng
|
(Computer and Information Engineering College, Inner Mongolia Normal University, Hohhot, Inner Mongolia, 010022, China) |
Abstract: |
Word vectors have good semantic properties and can be used to improve and simplify many natural language processing applications. This study uses two model architectures, CBOW (continuous bag-of-words) and Skip-gram, to train Mongolian word vectors on different data and at different dimensions. We then design a comprehensive semantic and syntactic test set based on features of the Mongolian language, and on this test set we use semantic and syntactic similarity to evaluate the quality of the word vectors. The results indicate that the Skip-gram model outperforms the CBOW model on Mongolian semantic and syntactic similarity tasks, and that word vector quality is best when the Skip-gram window size is 5. Moreover, as the word vector dimension or the amount of training data increases, the quality of the word vectors improves markedly. |
Key words: word vectors; CBOW model; Skip-gram model; quality of word vectors; semantic and syntactic similarity |
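
To illustrate how an evaluation of this kind can be scored, here is a minimal sketch. It assumes the test set consists of analogy-style questions ("a is to b as c is to d") answered by vector arithmetic and cosine similarity; the abstract does not state the question format, so this is an assumption. The toy vectors, the English placeholder words, and the function names `cosine`, `solve_analogy`, and `analogy_accuracy` are hypothetical and are not taken from the study.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def solve_analogy(vectors, a, b, c):
    """Return the word closest to vec(b) - vec(a) + vec(c).

    The three question words are excluded from the candidates.
    """
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


def analogy_accuracy(vectors, questions):
    """Fraction of (a, b, c, d) questions whose predicted answer equals d."""
    correct = sum(1 for a, b, c, d in questions
                  if solve_analogy(vectors, a, b, c) == d)
    return correct / len(questions)


# Hypothetical toy vectors (not trained, not from the paper).
# Dimensions: royalty, gender, verb identity, past tense.
TOY_VECTORS = {
    "king":   [1.0,  1.0,  0.0, 0.0],
    "queen":  [1.0, -1.0,  0.0, 0.0],
    "man":    [0.0,  1.0,  0.0, 0.0],
    "woman":  [0.0, -1.0,  0.0, 0.0],
    "walk":   [0.0,  0.0,  1.0, 0.0],
    "walked": [0.0,  0.0,  1.0, 1.0],
    "jump":   [0.0,  0.0, -1.0, 0.0],
    "jumped": [0.0,  0.0, -1.0, 1.0],
}

# One semantic and one syntactic question, scored separately.
SEMANTIC_QUESTIONS = [("man", "woman", "king", "queen")]
SYNTACTIC_QUESTIONS = [("walk", "walked", "jump", "jumped")]

if __name__ == "__main__":
    print("semantic accuracy:",
          analogy_accuracy(TOY_VECTORS, SEMANTIC_QUESTIONS))
    print("syntactic accuracy:",
          analogy_accuracy(TOY_VECTORS, SYNTACTIC_QUESTIONS))
```

Under this scheme, vectors trained with CBOW and with Skip-gram could be compared by loading each set into the same dictionary format and scoring it against the same question lists, with semantic and syntactic accuracy reported separately.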