A Chinese Similarity Matching Method

Technology ID: 6080551  Updated: 2012-04-11 18:40

The invention provides a Chinese similarity matching method. Using the edit distance formula together with keyboard fingering rules, it derives an edit similarity for the pinyin corresponding to the Chinese text, which reflects how easily the two strings are confused while typing. From the pronunciation rules of the initials and finals of Chinese characters, it derives an initial similarity and a final similarity for the strings and, taking into account the fuzzy sounds common in dialects and in everyday pronunciation, computes the pronunciation similarity between the strings. Because glyph shape is an important feature of Chinese, a glyph encoding, the Wubi (five-stroke) input code, is used to compute the glyph similarity between the strings. Information is collected during the computation and used to update the data. These similarities are then combined into an overall similarity for Chinese words. The method thus takes into account Chinese spelling habits, users' input habits and the keyboard layout, the pronunciation rules of Mandarin, the influence of dialects and common mispronunciations, and the shapes of Chinese characters, and combines them with statistical regularities to evaluate the similarity between Chinese words comprehensively.
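The abstract does not give the edit-similarity formula itself. As a rough illustration of the idea only, the sketch below computes a weighted edit distance over pinyin in which substituting a neighbouring key is cheaper than substituting a distant one. The QWERTY adjacency rule, the costs 0.5 and 1.0, the normalisation by the longer string, and all function names are assumptions made for this example, not details taken from the patent.

```python
# Illustrative sketch only: costs, adjacency rule and names are assumptions.
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
POS = {ch: (r, c) for r, row in enumerate(ROWS) for c, ch in enumerate(row)}


def sub_cost(x: str, y: str) -> float:
    """Substitution cost: 0 for equal keys, 0.5 for neighbouring keys, else 1."""
    if x == y:
        return 0.0
    if x in POS and y in POS:
        (r1, c1), (r2, c2) = POS[x], POS[y]
        if abs(r1 - r2) <= 1 and abs(c1 - c2) <= 1:
            return 0.5  # assumed discount for keys that are easy to mistype
    return 1.0


def weighted_distance(a: str, b: str) -> float:
    """Edit distance with unit insert/delete cost and keyboard-aware substitution."""
    prev = [float(j) for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [float(i)]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1.0,                  # delete ca
                curr[j - 1] + 1.0,              # insert cb
                prev[j - 1] + sub_cost(ca, cb)  # substitute or match
            ))
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Map the distance to [0, 1]; 1.0 means identical pinyin strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - weighted_distance(a, b) / longest
```

With these assumed costs, `edit_similarity("zhang", "zhanh")` scores higher than `edit_similarity("zhang", "zhanp")`, because `h` sits next to `g` on the keyboard while `p` does not.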


【Summary of Technical Implementation Steps】

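The invention relates to the technical field of Chinese similarity matching in search, and in particular to a Chinese similarity matching method.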
Background
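A string similarity function measures how close two strings are to each other and is a basic technique in string matching, text comparison, and information extraction. Its input is usually two strings, identical or different, and it returns a definite integer value; the more similar the two strings, the larger the returned value. The technique is also widely used in computational biology and signal processing. Many classic similarity functions are available for different applications. Edit distance (Levenshtein distance), for example, considers three edit operations, insertion, deletion, and substitution, and uses the minimum number of edit operations needed to transform one string into the other as the similarity measure for the two strings. The Smith-Waterman algorithm finds similar regions in two sequences and is often used in computational biology to align nucleotide and amino acid sequences; it, too, involves only three operations: insertion, deletion, and substitution. Besides these algorithms, which compute the difference between two strings exactly, there are simpler approximate methods based on statistics, such as the Dice coefficient and the Jaccard coefficient (Jaccard index, Jaccard similarity coefficient). These two...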
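For reference, the classic measures named above can be sketched in a few lines. This is a minimal illustration of the textbook definitions, not code from the patent; computing the Dice and Jaccard coefficients over character bigrams is one common choice and is an assumption here.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,        # deletion
                curr[j - 1] + 1,    # insertion
                prev[j - 1] + cost  # substitution or match
            ))
        prev = curr
    return prev[-1]


def bigrams(s: str) -> set:
    """Set of adjacent character pairs (an assumed tokenisation)."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice(a: str, b: str) -> float:
    """Dice coefficient: 2|X ∩ Y| / (|X| + |Y|) over bigram sets."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0
    return 2 * len(x & y) / (len(x) + len(y))


def jaccard(a: str, b: str) -> float:
    """Jaccard coefficient: |X ∩ Y| / |X ∪ Y| over bigram sets."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0
    return len(x & y) / len(x | y)
```

Note that edit distance is a dissimilarity (0 for identical strings), whereas the Dice and Jaccard coefficients grow with similarity and lie between 0 and 1.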

【Protected Technical Points】
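1. A Chinese similarity matching method, characterized in that the method comprises: obtaining two strings A and B to be compared; computing the similarity of strings A and B at editing time; obtaining the pronunciation similarity of strings A and B; obtaining the glyph similarity of strings A and B; counting the word frequency and error information of strings A and B against a pre-built Chinese character word-frequency statistics table Table3 and a Chinese character error-information statistics table Table4; and, according to the statistics, determining the weights of the editing-time similarity, the pronunciation similarity, and the glyph similarity of strings A and B, and computing the matching degree of the two Chinese strings A and B to be compared.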
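The claim does not state how the pronunciation similarity is scored or how the weights are derived from Table3 and Table4. The sketch below is therefore only a hypothetical illustration: it scores one pinyin syllable pair by its initial and final, gives partial credit to a few common fuzzy-sound pairs, and combines three component scores with fixed example weights. The fuzzy pairs chosen, the 0.5 partial credit, the default weights, and all names are assumptions; the glyph similarity is passed in as a number because computing it needs a Wubi code table that is not reproduced here.

```python
# Hypothetical sketch: fuzzy pairs, scores, weights and names are assumptions.
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l", "g",
            "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]
FUZZY_INITIALS = {frozenset(p) for p in
                  [("z", "zh"), ("c", "ch"), ("s", "sh"), ("l", "n"), ("f", "h")]}
FUZZY_FINALS = {frozenset(p) for p in
                [("an", "ang"), ("en", "eng"), ("in", "ing")]}


def split_syllable(s: str) -> tuple:
    """Split a toneless pinyin syllable into (initial, final)."""
    for ini in INITIALS:  # two-letter initials are listed first
        if s.startswith(ini):
            return ini, s[len(ini):]
    return "", s  # zero-initial syllable such as "an"


def part_score(x: str, y: str, fuzzy: set) -> float:
    """1.0 for equal parts, 0.5 for an assumed fuzzy-sound pair, else 0.0."""
    if x == y:
        return 1.0
    if frozenset((x, y)) in fuzzy:
        return 0.5
    return 0.0


def sound_similarity(a: str, b: str) -> float:
    """Average of initial similarity and final similarity for one syllable pair."""
    (ia, fa), (ib, fb) = split_syllable(a), split_syllable(b)
    return 0.5 * part_score(ia, ib, FUZZY_INITIALS) \
        + 0.5 * part_score(fa, fb, FUZZY_FINALS)


def combined_similarity(edit_sim: float, sound_sim: float, glyph_sim: float,
                        weights=(0.4, 0.4, 0.2)) -> float:
    """Weighted average of the three scores; weights here are placeholders."""
    w_e, w_s, w_g = weights
    total = w_e + w_s + w_g
    return (w_e * edit_sim + w_s * sound_sim + w_g * glyph_sim) / total
```

For example, `sound_similarity("zang", "zhang")` gives 0.75 under these assumptions: the initials `z` and `zh` count as a fuzzy pair, and the finals match exactly. In the claimed method the weights would come from the word-frequency and error statistics rather than being fixed constants.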

【Summary of Technical Features】

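【Patent Technical Attributes】
Inventors: Li Guoliang (李国良), Huang Weihuang (黄维篁), Feng Jianhua (冯建华)
Applicant (patent holder): Tsinghua University (清华大学)
Type: Invention
Country/province/city: 11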
