Guangxi Sciences

Cite this article:

ZHENG Jie, HUANG Hui, QIN Yongbin. Sentence Prediction Method Based on Noisy Pretraining[J]. Guangxi Sciences, 2023, 30(1): 71-78.


Sentence Prediction Method Based on Noisy Pretraining
ZHENG Jie¹, HUANG Hui², QIN Yongbin²
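(1. Department of Information Science, Guiyang Vocational and Technical College, Guiyang, Guizhou, 550081, China; 2. College of Computer Science and Technology, Guizhou University, Guiyang, Guizhou, 550025, China)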

Abstract:

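Sentence prediction models use natural language processing to automatically predict a recommended prison term for a given case, which helps improve the efficiency of judicial work, safeguard fair and impartial trials, and achieve consistent judgments in similar cases. Existing studies typically build sentence prediction models on pre-trained language models, but the task remains challenging because judgment documents are long and highly specialized, and because annotated data are scarce for some causes of action. To address these problems, this paper proposes a sentence prediction method based on noisy pre-training. First, a sentence prediction model that incorporates charge information is designed according to the characteristics of the task. Second, the Masked Language Model (MLM) task is combined with a self-distillation strategy to reduce the influence of noise in the pre-training data for sentence prediction. Finally, the position embeddings of the RoBERTa-wwm model are modified to strengthen the model's ability to handle long texts. Experimental results show that the proposed pre-training method greatly improves the accuracy of sentence prediction and also performs well in few-shot settings.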

Keywords: sentence prediction|language model|self-distillation|long text modeling|pre-training

DOI: 10.13656/j.cnki.gxkx.20230308.008

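Funding: Supported by the National Natural Science Foundation of China (62066008) and the Key Project of the Guizhou Provincial Science and Technology Foundation (黔科合基础[2020]1Z055).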

Sentence Prediction Method Based on Noisy Pretraining

ZHENG Jie¹, HUANG Hui², QIN Yongbin²

(1. Department of Information Science, Guiyang Vocational and Technical College, Guiyang, Guizhou, 550081, China; 2. College of Computer Science and Technology, Guizhou University, Guiyang, Guizhou, 550025, China)

Abstract:

Sentence prediction models use natural language processing to automatically predict the recommended sentence for a given case, which is of great significance for improving the efficiency of judicial work, maintaining the fairness and justice of judicial trials, and ensuring that similar cases receive similar judgments. Existing studies usually build sentence prediction models on pre-trained language models. However, because judgment documents are long and highly specialized, and because labeled data are insufficient for some causes of action, sentence prediction remains quite challenging. To address these problems, this paper proposes a sentence prediction method based on noisy pre-training. First, according to the characteristics of the sentence prediction task, a sentence prediction model that integrates charge information is designed. Second, the influence of noise in the pre-training data for sentence prediction is reduced by combining the Masked Language Model (MLM) task with a self-distillation strategy. Finally, the position embedding in the RoBERTa-wwm model is improved to enhance the model's ability to model long texts. The experimental results show that the proposed pre-training method greatly improves the accuracy of the sentence prediction task and also performs well in few-shot settings.

Key words: sentence prediction|language model|self-distillation|long text modeling|pre-training
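
The second step described in the abstract, combining the MLM task with self-distillation to dampen noise in the pre-training labels, can be pictured with a generic loss of the following shape. This is only a minimal sketch and is not taken from the paper: the function names, the weights `alpha` and `mlm_weight`, the temperature, and the exact way the three terms are mixed are all assumptions made for illustration. Here a "teacher" distribution (for example, the model's own earlier predictions) softens the noisy hard label.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax with temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)


def noisy_pretraining_loss(
    mlm_loss: float,
    student_logits: Sequence[float],
    teacher_logits: Sequence[float],
    noisy_label: int,
    alpha: float = 0.5,        # hypothetical weight on the distillation term
    mlm_weight: float = 1.0,   # hypothetical weight on the MLM term
    temperature: float = 2.0,  # hypothetical softening temperature
) -> float:
    """Illustrative mix of hard-label, self-distillation, and MLM losses."""
    # Cross-entropy against the (possibly noisy) sentence-length label.
    hard_probs = softmax(student_logits)
    hard_loss = -math.log(max(hard_probs[noisy_label], 1e-12))

    # Self-distillation: pull the student toward the softened teacher output.
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    distill_loss = kl_divergence(teacher_soft, student_soft)

    return (1.0 - alpha) * hard_loss + alpha * distill_loss + mlm_weight * mlm_loss
```

With `alpha` close to 1 the noisy label is nearly ignored in favor of the teacher's soft targets, and with `alpha` at 0 the loss falls back to ordinary supervised training plus the MLM term.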
