A Classification Model Training Method Based on Crowdsourcing Technology

Technology ID: 16380329. Updated: 2017-10-15 15:03

The invention provides a classification model training method based on crowdsourcing technology. The level at which each user provides annotations is estimated from the crowdsourced annotations of a small number of samples. The observed annotator levels are used as prior knowledge to determine the annotation used for each training sample. A classification model is trained on the training samples and their annotations. The model is then used to select the training sample that minimizes the model's expected error and to predict the class of that sample. The selected sample, together with the annotation supplied for it by the user with the highest annotation level on that class, is added to the training set. These steps are repeated on the updated training set until the accuracy of the classification model or the number of training samples reaches a preset threshold. The effect of the invention is that it avoids the adverse influence of low-quality annotations, supplied by users with low annotation levels, on the training of the classification model, and ensures that classification models with high generalization ability can be trained effectively in a crowdsourcing environment.
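The sample-selection step described above can be illustrated with a minimal sketch. The source does not reproduce the expected-error formula, so this example assumes a standard expected-error-reduction criterion: each candidate is tentatively added with each possible label, the model is retrained, and the resulting uncertainty on the remaining pool is averaged using the current model's class probabilities. The logistic-regression learner, the error proxy (1 minus the largest class probability), and the function names `train`, `predict`, `expected_error`, and `select_sample` are all assumptions made for illustration, not definitions taken from the patent.

```python
import math


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(w, x):
    """Return p(y=1 | x; w); the last entry of w is the bias."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + w[-1])


def train(data, dim, epochs=200, lr=0.5):
    """Fit binary logistic regression by batch gradient descent (assumed learner)."""
    w = [0.0] * (dim + 1)
    for _ in range(epochs):
        grad = [0.0] * (dim + 1)
        for x, y in data:
            err = predict(w, x) - y
            for d in range(dim):
                grad[d] += err * x[d]
            grad[dim] += err
        w = [wi - lr * g / len(data) for wi, g in zip(w, grad)]
    return w


def expected_error(data, pool, candidate, w, dim):
    """Expected error on the rest of the pool after adding `candidate`.

    Assumption: the error of a retrained model is approximated by the sum of
    (1 - max class probability) over the remaining unlabeled samples.
    """
    p1 = predict(w, candidate)
    rest = [x for x in pool if x is not candidate]
    total = 0.0
    for label, weight in ((1, p1), (0, 1.0 - p1)):
        w_new = train(data + [(candidate, label)], dim)
        probs = (predict(w_new, x) for x in rest)
        total += weight * sum(1.0 - max(p, 1.0 - p) for p in probs)
    return total


def select_sample(data, pool, dim):
    """Pick the pool sample with minimum expected error and predict its class."""
    w = train(data, dim)
    best = min(pool, key=lambda x: expected_error(data, pool, x, w, dim))
    predicted_class = 1 if predict(w, best) > 0.5 else 0
    return best, predicted_class


if __name__ == "__main__":
    # Hypothetical 1-D toy data: (features, label) pairs and an unlabeled pool.
    labeled = [([-2.0], 0), ([-1.0], 0), ([1.0], 1), ([2.0], 1)]
    unlabeled = [[-1.5], [0.2], [1.5]]
    print(select_sample(labeled, unlabeled, dim=1))
```

In the method as described, the class predicted for the selected sample would then determine which user's annotation is added to the training set: the one with the highest estimated annotation level on that class.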


【Summary of Technical Implementation Steps】
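The invention relates to a classification model training method.
Background
At present, under the supervised learning framework in machine learning, training a classification model requires collecting in advance a set of data samples with annotations. The quantity and quality of the collected training data directly determine the generalization performance of the classification model. In the traditional data collection process, experts with domain knowledge must provide the single correct annotation for each data sample, to ensure that the trained classification model generalizes well. The challenge with this traditional practice is that, in real tasks, people with a professional background are scarce, so obtaining annotations is expensive and slow. Consequently, with the development of network and data storage technologies, using crowdsourcing to quickly obtain large numbers of inexpensive annotations for training samples has become one effective way to reduce the time and monetary cost of annotation. In a crowdsourcing environment, the annotation task is not completed by traditional professionals but is outsourced, on a free and voluntary basis, to an unspecified crowd of network users; that is, non-professional individuals or open-source contributors complete the annotation task quickly and cheaply, working independently or collaboratively. Because annotations obtained through crowdsourcing come from many online users, their quality is hard to guarantee. Moreover, because there are no correct annotations from professionals to serve as a "gold standard", it is also difficult, regarding these users' experience and the accuracy with which they complete annotation tasks, to...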

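【Protected Technical Points】
A classification model training method based on crowdsourcing technology, characterized in that the m collected samples and the crowdsourced annotation information provided by k users are [formula omitted]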

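【Summary of Technical Features】
1. A classification model training method based on crowdsourcing technology, characterized in that, given m collected samples and crowdsourced annotation information [formula omitted] provided by k users, the method proceeds according to the following steps:
Step 1: randomly draw n samples and their corresponding crowdsourced annotation information [formula omitted] from the collected samples and their crowdsourced annotation data.
Step 2: construct the training data set [formula omitted], where y_i = 1 when [condition omitted], and y_i = 0 otherwise.
Step 3: learn a classification model with parameter w on the training data set.
Step 4: the annotation level of the j-th user on a set of annotations provided for class c is [formula omitted], where [symbols omitted] denote the number of times that user gave a correct annotation and an incorrect annotation, respectively.
Step 5: according to the annotation levels of the multiple users who provided annotations for sample x_i, the annotation used for training on that sample is estimated by [formula omitted].
Step 5: use the classification model to predict the class of each of the remaining m-n samples, and compute the expected error of the classification model for each sample as [formula omitted], where U denotes the set of remaining samples and I(D, x) denotes the error of the classification model after the sample is added to the training set.
Step 6: select the class for which p(y|x*; w) > 0.5, and...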
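Step 4 and the first Step 5 of the claim (a per-class annotation level for each user, then a level-based estimate of the training label) can be sketched as follows. Both formulas are missing from the source, so this sketch assumes a Laplace-smoothed accuracy for the level and a level-weighted vote for the label. The data layout and the names `annotator_levels`, `aggregate_label`, and `best_annotator` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def annotator_levels(crowd_labels, reference, prior=1.0):
    """Estimate each user's level on each class from correct/incorrect counts.

    crowd_labels: {sample_id: {user_id: label}}
    reference:    {sample_id: label currently used for training}
    Assumption: level = (correct + prior) / (correct + wrong + 2 * prior).
    """
    correct = defaultdict(float)
    wrong = defaultdict(float)
    for sample_id, votes in crowd_labels.items():
        cls = reference[sample_id]
        for user, label in votes.items():
            if label == cls:
                correct[(user, cls)] += 1
            else:
                wrong[(user, cls)] += 1
    keys = set(correct) | set(wrong)
    return {
        k: (correct[k] + prior) / (correct[k] + wrong[k] + 2 * prior)
        for k in keys
    }


def aggregate_label(votes, levels, default=0.5):
    """Level-weighted vote: each user adds their level on the class they chose."""
    scores = defaultdict(float)
    for user, label in votes.items():
        scores[label] += levels.get((user, label), default)
    return max(scores, key=scores.get)


def best_annotator(levels, cls):
    """Return the user with the highest estimated level on class `cls`."""
    candidates = {user: v for (user, c), v in levels.items() if c == cls}
    return max(candidates, key=candidates.get)


if __name__ == "__main__":
    # Hypothetical crowd annotations for three samples from users a, b, c.
    crowd = {
        "s1": {"a": 1, "b": 1, "c": 0},
        "s2": {"a": 0, "b": 1, "c": 0},
        "s3": {"a": 1, "b": 0, "c": 0},
    }
    reference = {"s1": 1, "s2": 0, "s3": 1}
    levels = annotator_levels(crowd, reference)
    print(best_annotator(levels, 1))
    print(aggregate_label({"a": 1, "c": 0}, levels))
```

The smoothing prior keeps a user with very few annotations from receiving a level of exactly 0 or 1; whether the patent's formula does the same cannot be determined from the truncated text.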

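【Patent Attributes】
Inventor: Wu Weining (吴伟宁)
Applicant (patentee): Harbin Engineering University (哈尔滨工程大学)
Type: Invention
Country/province: Heilongjiang, 23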
