Research on Unknown Word Recognition in Modern Chinese


Abstract: With the development of artificial intelligence, natural language understanding has found increasingly wide application, and almost every system that works on Chinese text must first pass through word segmentation. Chinese word segmentation is the technique of splitting Chinese sentences into words; it is the foundation on which a computer grasps the meaning of Chinese characters and the most important preprocessing technique in Chinese information processing systems. The recognition of unknown (out-of-vocabulary) words is, in turn, an important factor in the accuracy of Chinese word segmentation. Unknown words are chiefly those words that are not recorded in the segmentation system's dictionary of common words. Chinese unknown words come in many types, follow diverse structural patterns, and are very numerous, and new ones are constantly being coined, so they can never be fully collected into a common-word dictionary. When a text contains unknown words that go unrecognized, the precision and recall of segmentation suffer directly. Although many segmentation systems exist at home and abroad and the precision and recall of unknown word recognition have improved, misjudged and missed unknown words still interfere with Chinese information retrieval and with correct Chinese word segmentation.

   This thesis first takes People's Daily text (2001-2004) as the experimental corpus and segments it with the word segmentation software of the Chinese Academy of Sciences. The work concentrates on the scattered strings formed by consecutive single-character tokens (segmentation fragments) and judges whether they are likely to be unknown words. The segmented corpus is analyzed with Professor Chen Xiaohe's package of algorithms: the single-character probability, the single-character-as-word probability, and the single-character-non-word probability are estimated from the large-scale corpus, and the algorithm is implemented on top of these statistics. Three test corpora are used; from the total number of extracted unknown-word candidates, the number of correct unknown words among them, and the number of unknown words that were not extracted, the precision and recall on the three corpora are 84.61% and 91.67%, 81.66% and 98.0%, and 83.33% and 90.91%, respectively. The results show that the system achieves a relatively high recognition rate for unknown words.
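
The fragment-extraction and character-statistics steps described above can be pictured with a short Python sketch. This is only a minimal illustration under assumptions of our own (the function names, the requirement of at least two consecutive single characters, the toy sentence, and the use of the complement of the word probability as a crude non-word probability are all illustrative choices); it is not Professor Chen Xiaohe's actual package of algorithms, and in the thesis the statistics are estimated from the large-scale People's Daily corpus rather than a toy example.

from collections import Counter

def extract_fragments(tokens):
    """Collect runs of two or more consecutive single-character tokens,
    i.e. the 'segmentation fragments' the segmenter leaves behind."""
    fragments, run = [], []
    for tok in tokens:
        if len(tok) == 1:
            run.append(tok)
        else:
            if len(run) >= 2:
                fragments.append(run)
            run = []
    if len(run) >= 2:
        fragments.append(run)
    return fragments

def single_char_word_probability(segmented_sentences):
    """For every character, estimate the probability that it occurs as a
    single-character word rather than inside a longer word; its complement
    can serve as a crude single-character non-word probability (an
    illustrative simplification, not the thesis' exact definition)."""
    as_word, total = Counter(), Counter()
    for tokens in segmented_sentences:
        for tok in tokens:
            for ch in tok:
                total[ch] += 1
                if len(tok) == 1:
                    as_word[ch] += 1
    return {ch: as_word[ch] / total[ch] for ch in total}

# Toy usage: with the word "微博" absent from the dictionary, the segmenter
# emits a run of single characters, which the extractor picks up.
tokens = ["我们", "在", "微", "博", "上", "看", "新闻"]
print(["".join(f) for f in extract_fragments(tokens)])       # ['在微博上看']
print(single_char_word_probability([tokens])["新"])           # 0.0 (never a one-character word in this toy corpus)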

Keywords: unknown words, segmentation fragments, single-character non-word probability, single-character word probability

 

Abstract: With the development of artificial intelligence, applications of natural language understanding have become increasingly widespread, and almost any system based on Chinese text must go through the step of word segmentation. Chinese word segmentation is the technique of cutting Chinese sentences into words; it is the foundation for a computer to understand the meaning of Chinese characters and the most important pre-processing technique in Chinese information processing systems. The identification of unknown words is an important factor in the accuracy of Chinese word segmentation. So-called unknown words mainly refer to words that are not included in the segmentation system's dictionary of common words. Unknown words in Chinese are of many types, follow varied structural rules, and are large in number; moreover, new ones keep appearing, so they cannot all be included in a common-word dictionary. If an article contains unknown words that cannot be identified, this directly affects the precision and recall of Chinese word segmentation. Although there is plenty of segmentation software at home and abroad, and the precision and recall of unknown word recognition have improved, the misjudgment and omission of unknown words still interfere with Chinese information retrieval and with correct Chinese word segmentation.

   First, we select the People's Daily corpus (2001-2004) as the experimental corpus and use the word segmentation software of the Chinese Academy of Sciences to segment it. This paper deals with the scattered strings made up of consecutive single characters (segmentation fragments) and determines whether they are likely to be unknown words. The segmented corpus is analyzed with the package of algorithms of Professor Chen Xiaohe: the single-character probability, the single-character-as-word probability, and the single-character-non-word probability are computed from a large corpus, and the algorithm is implemented from these data. Three test corpora are selected; based on the total number of extracted unknown words, the number of correctly extracted unknown words, and the number of unknown words that were not extracted, the precision and recall are 84.61% and 91.67%, 81.66% and 98.0%, and 83.33% and 90.91%, respectively. The results show that the system's recognition rate for unknown words is relatively high.
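
For reference, the precision and recall figures quoted above follow the standard definitions: precision is the share of extracted candidates that really are unknown words, and recall is the share of all unknown words in the test text that were extracted. The short sketch below shows only this arithmetic, using hypothetical counts rather than the counts actually tallied in the thesis.

def precision_recall(extracted_total, extracted_correct, missed):
    """Compute precision and recall from the three counts used in the
    evaluation: candidates extracted, correct ones among them, and genuine
    unknown words that were not extracted."""
    precision = extracted_correct / extracted_total
    recall = extracted_correct / (extracted_correct + missed)
    return precision, recall

# Hypothetical counts for illustration (not the thesis' actual numbers):
# 13 candidates extracted, 11 of them correct, 1 unknown word missed.
p, r = precision_recall(13, 11, 1)
print(f"precision = {p:.2%}, recall = {r:.2%}")   # precision = 84.62%, recall = 91.67%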

Key words: unknown words, segmentation fragments, single-character non-word probability, single-character word probability

 
