
哈尔滨工业大学计算机科学与技术学院/国家示范性软件学院研究生考研导师简介-关毅


基本信息

男,1970年生,黑龙江省宁安市人。博士,教授,博士生导师。AAAS会员,ACM会员,IEEE高级会员,中国计算机学会高级会员,中国自动化学会会员。主要研究方向:医疗健康信息学、网络挖掘、自然语言处理及电子商务。提出具有重大理论突破意义的系统相似度测度理论,面向移动平台的智能输入法WI输入法的发明人,WI输入法目前在国内外拥有数十万用户,曾获得2010年中国互联网产品评选最佳技术创新提名奖。曾主持、参与并完成了二十余项国家自然科学基金、国家863、国际合作等项目,在国内外期刊和会议上发表学术论文100余篇,获得中国专利授权五项,美国专利授权两项,参与编写教材1本。

工作经历
时间 / 工作经历
2000年 香港科技大学电气与电子工程系人类语言技术中心任副研究员
2001年 香港Weniwen有限公司任研究科学家
2001年5月 哈尔滨工业大学任教
2001年10月 特评为哈尔滨工业大学副教授
2006年 任哈尔滨工业大学教授
2007年 任哈尔滨工业大学博士生导师


教育经历
1988年至1992年在天津大学计算机科学与工程系软件专业获工学学士学位;1992年至1995年在哈尔滨工业大学计算机应用专业获得硕博连读资格;1995年至1999年在哈尔滨工业大学计算机应用专业获得博士学位。


科研项目

项目名称面向语句间语义相似度计算基于词主体自治学习的强化学习机制研究

项目来源国家自然科学基金

开始时间2009-01-01

结束时间2012-12-01

项目经费32万

担任角色负责

项目类别纵向项目

项目状态完成



项目名称非常规突发事件网络舆情分析方法和预警机制的研究

项目来源国家自然科学基金重点

开始时间2009-01-01

结束时间2012-12-01

项目经费35万

担任角色参与

项目类别纵向项目

项目状态完成



项目名称下一代信息检索系统

项目来源国家自然科学基金重点

开始时间2008-01-01

结束时间2011-12-01

项目经费190万

担任角色参与

项目类别纵向项目

项目状态完成



项目名称基于一种新的系统相似度度量的文本情感倾向性研究

项目来源微软教育部语言语音重点实验室开放基金项目

开始时间2010-01-01

结束时间2012-06-01

项目经费4万

担任角色负责

项目类别横向项目

项目状态完成



项目名称面向IOS平台的语句输入系统WI 输入法研究

项目来源自选

开始时间2010-11-01

结束时间2019-12-01

担任角色负责

项目类别横向项目

项目状态进行中



项目名称淘宝购物网站中针对产品节点的信息挖掘技术研究

项目来源淘宝网

开始时间2010-09-01

结束时间2011-03-01

项目经费15万

担任角色负责

项目类别横向项目

项目状态完成



项目名称阿里巴巴浅层句法分析技术研究

项目来源阿里巴巴公司

开始时间2009-01-01

结束时间2009-11-01

项目经费15万

担任角色负责

项目类别横向项目

项目状态完成



项目名称富士通博客或bbs情感倾向性分析技术研究

项目来源富士通研发中心

开始时间2008-10-01

结束时间2009-06-01

项目经费15万

担任角色负责

项目类别横向项目

项目状态完成



项目名称隐式用户兴趣挖掘技术研究

项目来源myspace公司

开始时间2007-12-01

结束时间2008-12-01

项目经费10万

担任角色负责

项目类别横向项目

项目状态完成



项目名称问答式信息检索的理论与方法研究

项目来源国家自然科学基金重点

开始时间2006-01-01

结束时间2009-12-01

项目经费190万

担任角色参与

项目类别纵向项目

项目状态完成



项目名称面向智能化信息检索的危险式人工免疫网络理论与方法研究

项目来源国家自然科学基金青年基金

开始时间2006-01-01

结束时间2009-12-01

项目经费24万

担任角色负责

项目类别纵向项目

项目状态完成



项目名称新加坡词法分析国际合作项目

项目来源新加坡信息通信研究院

开始时间2008-01-01

结束时间2008-12-01

项目经费25万

担任角色负责

项目类别纵向项目

项目状态完成



项目名称网站主题分析、标引与检索技术研究

项目来源微软基金

开始时间2006-06-01

结束时间2007-06-01

项目经费5万

担任角色负责

项目类别纵向项目

项目状态完成



项目名称面向特定领域的词典获取和统计语言模型的建立

项目来源微软基金

开始时间2004-06-01

结束时间2006-06-01

项目经费6.5万

担任角色负责

项目类别纵向项目

项目状态完成



项目名称网络信息的通用开放语义类名实体自动识别与标注研究

项目来源哈工大校基金

开始时间2003-06-01

结束时间2005-06-01

项目经费2万

担任角色负责

项目类别纵向项目

项目状态完成



项目名称基于粗糙集大规模语料库语言学知识发现模型研究

项目来源国家自然科学基金

开始时间2002-01-01

结束时间2004-12-01

项目经费19万

担任角色参与

项目类别纵向项目

项目状态完成



项目名称面向奥运智能信息服务的语料加工、文摘、检索技术研究

项目来源863重点项目

开始时间2003-12-01

结束时间2005-12-01

项目经费30万

担任角色参与

项目类别纵向项目

项目状态完成



项目名称联通客服问答系统

项目来源八达集团

开始时间2002-06-01

结束时间2003-06-01

项目经费12万

担任角色负责

项目类别横向项目

项目状态完成



项目名称手机操作系统智能输入

项目来源富士通公司

开始时间2002-03-01

结束时间2003-06-01

项目经费1800万日元

担任角色参与

项目类别横向项目

项目状态完成



项目名称基于内容的网络信息压缩及摘要自动生成技术

项目来源网络安全项目

开始时间2001-10-01

结束时间2002-10-01

项目经费60万

担任角色参与

项目类别纵向项目

项目状态完成



项目名称智能化中文信息处理平台

项目来源863

开始时间2001-10-01

结束时间2002-10-01

项目经费60万

担任角色参与

项目类别纵向项目

项目状态完成



奖项成果

奖项名称WI 输入法

获奖时间2011

完成人关毅 阎于闻 周春波 贾祯 田作辉等

所获奖项2010中国互联网创新产品评选最佳技术创新提名奖

简单介绍WI输入法是哈尔滨工业大学计算机学院语言技术研究中心网络智能研究室开发的iPhone/iPad/iPod touch平台上的智能拼音语句输入法。它支持语句输入、全拼智能按键纠错、模糊音输入、简拼输入以及多种双拼输入方式。


研究领域
健康信息学、智能化信息检索、网络挖掘、自然语言处理、认知语言学


讲授课程
研究生专业必修课《自然语言处理》


论文期刊

论文标题基于电子商务用户行为的同义词识别

作者张书娟,董喜双,关毅

期刊名称中文信息学报

期卷第26卷,第3期

简单介绍本文研究了电子商务领域同义词的自动识别问题。针对该领域新词多、错别字多、近义词多的用词特点,提出基于用户行为的同义词识别方法。首先通过并列关系符号切分商品标题和基于SimRank思想聚集查询两种方法获取候选集合,进而获取两词的字面特征以及标题、查询、点击等用户行为特征,然后借助Gradient Boosting Decision Tree(GBDT)模型判断是否同义。实验表明同义词识别准确率达到了54.46%,高于SVM近4个百分点。
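下面给出一个用梯度提升决策树(GBDT)做词对同义判别的最小示意代码(假设使用 scikit-learn;其中的字面特征与用户行为特征均为示例性占位,并非论文的特征体系):
```python
# -*- coding: utf-8 -*-
# 示意:用 GBDT 判断两个商品词是否同义(特征为示例性占位)
from sklearn.ensemble import GradientBoostingClassifier

def pair_features(w1, w2, click_sim=0.0, query_cooccur=0.0):
    """构造词对特征:字面重合度 + 假设已统计好的用户行为量。"""
    union = set(w1) | set(w2)
    return [
        len(set(w1) & set(w2)) / max(len(union), 1),  # 字面 Jaccard 相似度
        abs(len(w1) - len(w2)),                       # 长度差
        click_sim,                                    # 点击行为相似度(假设值)
        query_cooccur,                                # 查询共现强度(假设值)
    ]

# 训练样本:词1, 词2, 两个行为特征, 是否同义
train = [
    ("连衣裙", "连身裙", 0.8, 0.6, 1),
    ("手机", "移动电话", 0.7, 0.5, 1),
    ("手机", "手机壳", 0.2, 0.4, 0),
]
X = [pair_features(a, b, c, q) for a, b, c, q, _ in train]
y = [label for *_, label in train]

clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1)
clf.fit(X, y)
print(clf.predict([pair_features("外套", "大衣", 0.6, 0.5)]))
```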


论文标题基于最大熵模型和最小割模型的中文词与句褒贬极性分析

作者董喜双,邹启波,关毅,高翔,闫铭

期刊名称第三届中文倾向性分析评测(COAE2011)

简单介绍本文运用最大熵模型和最小割模型预测中文词和句子的褒贬极性。词级情感分析首先构建领域情感词典,然后根据领域情感词典提取候选词,并使用最大熵模型预测候选词的极性,最后采用最小割模型优化极性结果。句级情感分析首先根据领域情感词典识别观点句,将观点句切分成短句并基于规则提取特征,应用最大熵模型预测短句的极性,最后根据短句的极性预测长句的极性。


论文标题基于购物网站用户搜索日志的商品词发现

作者杨锦锋,吕新波,关毅,周春波

期刊名称计算机应用与软件

期卷2011,28(11

简单介绍商品词是电子商务领域描述商品的新词。本文主要介绍了基于购物网站用户搜索日志的商品词发现的方法。该方法从搜索日志中提取用户查询,对查询进行分词,采用N元递增分步算法和串频统计,计算候选串的条件概率,选择候选商品词。为了降低人工审核的成本,我们只对产出商品词的准确率进行评价。我们利用该方法在手机、面霜和香水三类商品的搜索日志上进行了实验,最高准确率达到92.58%。
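下面用纯 Python 给出"n 元串频加条件概率筛选候选商品词"这一思路的简化示意(阈值与统计细节为假设,并非论文原始算法):
```python
# -*- coding: utf-8 -*-
# 示意:从用户查询中统计 n 元串频,用条件概率筛选候选商品词
from collections import Counter

def ngrams(text, n):
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def candidate_terms(queries, max_n=4, min_freq=2, min_cond_prob=0.5):
    counts = Counter()
    for q in queries:
        for n in range(1, max_n + 1):
            counts.update(ngrams(q, n))
    cands = []
    for s, f in counts.items():
        if len(s) < 2 or f < min_freq:
            continue
        # 条件概率 P(s | s 的前缀),近似衡量串的内部结合强度
        prefix = s[:-1]
        cond = f / counts[prefix] if counts[prefix] else 0.0
        if cond >= min_cond_prob:
            cands.append((s, f, round(cond, 2)))
    return sorted(cands, key=lambda x: -x[1])

queries = ["诺基亚手机", "苹果手机壳", "手机贴膜", "苹果手机"]
for term, freq, prob in candidate_terms(queries):
    print(term, freq, prob)
```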


论文标题Automatically Generating Questions from Queries for Community-based Question Answering

作者Shiqi Zhao, Haifeng Wang, Chao Li, Ting Liu, Yi Guan

期刊名称Proceedings of 5th International Joint Conference on Natural Language Processing

简单介绍This paper proposes a method that automatically generates questions from queries for community-based question answering (cQA) services. Our query-to-question generation model is built upon templates induced from search engine query logs. In detail, we first extract pairs of queries and user-clicked questions from query logs, with which we induce question generation templates. Then, when a new query is submitted, we select proper templates for the query and generate questions through template instantiation. We evaluated the method with a set of short queries randomly selected from query logs, and the generated questions were judged by human annotators. Experimental results show that, the precision of 1-best and 5- best generated questions is 67% and 61%, respectively, which outperforms a baseline method that directly retrieves questions for queries in a cQA site search engine. In addition, the results also suggest that the proposed method can improve the search of cQA archives.


论文标题电子商务中针对产品的摘要挖掘技术研究

作者季知祥,董喜双,关毅

期刊名称2011信息技术与管理科学国际学术研讨会

简单介绍In this paper, we present a novel approach for mining the summary of e-commerce products by using their description text. A product summary is composed of phrases from different aspects having independent meanings, which is different from traditional multi-document summarization constructed by selecting sentences. Firstly, after extracting the body from the text, splitting the text into sentences and removing repeated sentences, the sentences are clustered into sub-topics used for describing the product from various aspects. Then the sentences divided by segmentation words are used to obtain candidate phrases. Finally, the phrases are classified by the Maximum Entropy model, and the highest-scoring phrase in each category is extracted to form the product summary. The experiment indicated a precision as high as 90%.


论文标题基于X2统计和词情感分类相结合的中文情感词挖掘

作者张书娟,朱力,关毅,董喜双

期刊名称2011信息技术与管理科学国际学术研讨会

简单介绍Sentiment lexicon is constructed by sentiment score counting of Chinese characters, semantic similarity calculation, and Pointwise Mutual Information. To enrich the lexicon, we combine Chi-square statistics and word sentiment classification to mine sentiment words that are not contained in the lexicon. The average precision of polarity judgment of sentiment words is improved by 3%.


论文标题基于最大熵马尔科夫模型和条件随机域模型的汉语组块分析技术研究

作者李超,关毅,李生

期刊名称2011信息技术与管理科学国际学术研讨会

简单介绍In this paper, we present a Chinese chunking method in which the chunking problem is transformed into a sequential labeling process by applying Maximum Entropy Markov Models and Conditional Random Fields. The Maximum Entropy Markov Model achieved an F-measure of 93.2% with the help of candidate tag selection, which significantly reduces errors caused by label bias and saves testing time. When using Conditional Random Fields, Maximum Entropy Markov Models took the place of Conditional Random Fields to select effective feature templates; this method saves more than 80% of the time. Conditional Random Fields achieved an F-measure of 93.4%.


论文标题中文情感词倾向消歧

作者孙慧 关毅 董喜双

期刊名称第六届全国信息检索学术会议论文集(CCIR 2010)

简单介绍文本情感倾向性分析的基础是词汇情感倾向分析,本文针对基于词典的词汇情感倾向性分析方法中对情感词倾向绝对化标注问题,提出了一种获取上下文相关的词汇情感倾向方法。同时针对目前缺少包含上下文相关情感词标注资源的问题,使用最大熵交叉验证和手工校正结合的方法加以构造,并在此基础上构造了上下文相关的特征集合用来预测情感词在上下文中的情感倾向。实验表明,此种方法与基于词典的词语情感倾向性分析方法相比,F值提高了4.9%。


论文标题Selecting Optimal Feature Template Subset for CRFs

作者Xingjun Xu, Guanglu Sun, Yi Guan, Xishuang Dong and Sheng Li

期刊名称 Proceedings of CIPS-SIGHAN Joint Conference on Chinese Language Processing (CLP2010)

简单介绍Conditional Random Fields (CRFs) are the state-of-the-art models for sequential labeling problems. A critical step is to select the optimal feature template subset before employing CRFs, which is a tedious task. To improve the efficiency of this step, we propose a new method that adopts the maximum entropy (ME) model and maximum entropy Markov models (MEMMs) instead of CRFs, considering the homology between ME, MEMMs, and CRFs. Moreover, empirical studies on the efficiency and effectiveness of the method are conducted in the field of Chinese text chunking, whose performance is ranked the first place in task two of CIPS-ParsEval-2009.
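下面是一个示意性草图:按给定特征模板训练并评估一个 CRF 序列标注模型(假设安装了 sklearn-crfsuite;特征模板与玩具数据仅作演示)。论文的思路是先用更快的 ME/MEMM 近似比较模板,再把选出的模板用于 CRF:
```python
# 示意:给定一个特征模板子集,训练并评估 CRF 组块标注模型
import sklearn_crfsuite
from sklearn_crfsuite import metrics

def word2features(sent, i, use_prev_word=True):
    w = sent[i][0]
    feats = {"w": w, "is_digit": w.isdigit()}
    if use_prev_word:                       # 模板开关:是否使用前一个词
        feats["w-1"] = sent[i - 1][0] if i > 0 else "<BOS>"
    return feats

def sent2features(sent, **kw):
    return [word2features(sent, i, **kw) for i in range(len(sent))]

# 每个句子是 (词, 组块标记) 序列;此处用两句玩具数据代替真实语料
train = [[("我", "B-NP"), ("喜欢", "B-VP"), ("苹果", "B-NP")],
         [("他", "B-NP"), ("吃", "B-VP"), ("香蕉", "B-NP")]]
X = [sent2features(s) for s in train]
y = [[tag for _, tag in s] for s in train]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X, y)
pred = crf.predict(X)
print(metrics.flat_accuracy_score(y, pred))   # 不同模板可比较此分数
```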


论文标题HIT_LTRC at TREC 2010 Blog Track: Faceted Blog Distillation

作者Jinfeng Yang, Xishuang Dong, Yi Guan, Chengzhen Huang, Sheng Wang

期刊名称Proceedings of TREC 2010

简单介绍This paper describes our participation in the faceted blog distillation task at Blog Track 2010. In our approach, the Indri toolkit is applied for basic topic relevance retrieval. Then the Maximum Entropy (ME) model is adopted to judge the relevance of each blog to a specified facet. Feed faceted relevance is calculated by integrating the average relevance of all blogs within a feed and the average relevance of the most relevant N blogs. Two implementations are applied to calculate feed faceted relevance. Experimental results on the Blogs08 dataset show the effectiveness of our approach.


论文标题Complete Syntactic Analysis Based on Multi-level Chunking

作者ZhiPeng Jiang, Yu Zhao, Yi Guan, Chao Li and Sheng Li

期刊名称Proceedings of CIPS-SIGHAN Joint Conference on Chinese Language Processing (CLP2010)

简单介绍This paper describes a complete syntactic analysis system based on multi-level chunking. On the basis of the correct sequences of Chinese words provided by CLP2010, the system firstly performs Part-of-speech (POS) tagging with Conditional Random Fields (CRFs), then does base chunking and complex chunking with Maximum Entropy (ME), and finally generates a complete syntactic analysis tree. The system took part in the Complete Sentence Parsing Track of Task 2 (Chinese Parsing) in CLP2010, achieved an F-1 measure of 63.25% on the overall analysis, ranked sixth, and achieved a POS accuracy rate of 89.62%, ranked third.


论文标题Learning of humanoid robot walk parameters based on FSR

作者Yuan, Quan-De, Hong, Bing-Rong, Guan, Yi, Ke, Wen-De

期刊名称 China Journal of Harbin Institute of Technology (New Series)

期卷2010 17 SU



论文标题网页结构树相似度计算

作者祁钰;关毅;吕新波;岳淑珍

期刊名称黑龙江大学自然科学学报

期卷2009年第05期

简单介绍本文提出了一种针对网页结构树的相似度计算方法,首先把网页标签结构表示成树,然后通过动态规划算法,计算两棵树之间的距离,以此来衡量两个网页之间的相似程度。实验证明本方法可以正确区分同类网页和不同类网页。
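下面给出一个简化的递归加动态规划的树距离草图,示意"把标签结构表示成树、用动态规划计算树间距离"的做法(并非论文的原始算法,代价定义为演示用假设):
```python
# 示意:把网页标签结构表示成树,用动态规划递归计算两棵树的编辑距离
class Node:
    def __init__(self, tag, children=()):
        self.tag, self.children = tag, tuple(children)

def size(t):
    return 1 + sum(size(c) for c in t.children)

def tree_dist(a, b):
    """简化的树编辑距离:结点标签替换代价 + 子结点序列的编辑距离。"""
    cost = 0 if a.tag == b.tag else 1
    ca, cb = a.children, b.children
    # 经典序列编辑距离 DP:删除/插入代价取子树规模,替换代价递归计算
    d = [[0] * (len(cb) + 1) for _ in range(len(ca) + 1)]
    for i in range(1, len(ca) + 1):
        d[i][0] = d[i - 1][0] + size(ca[i - 1])
    for j in range(1, len(cb) + 1):
        d[0][j] = d[0][j - 1] + size(cb[j - 1])
    for i in range(1, len(ca) + 1):
        for j in range(1, len(cb) + 1):
            d[i][j] = min(d[i - 1][j] + size(ca[i - 1]),
                          d[i][j - 1] + size(cb[j - 1]),
                          d[i - 1][j - 1] + tree_dist(ca[i - 1], cb[j - 1]))
    return cost + d[len(ca)][len(cb)]

page1 = Node("html", [Node("body", [Node("div"), Node("table")])])
page2 = Node("html", [Node("body", [Node("div"), Node("div")])])
print(tree_dist(page1, page2))   # 距离越小,两个网页结构越相似
```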


论文标题基于最大熵模型的汉语基本块分析技术研究

作者李超 孙健 关毅 徐兴军 侯磊 李生

期刊名称中文信息学会句法分析评测(CIPS-ParsEval-2009)

简单介绍本文论述了一个应用最大熵马尔科夫模型序列化标注块的边界、成分信息和应用最大熵模型分类识别块的关系信息的汉语基本块分析方法。为有效减少识别错误,重点探讨了候选标签筛选、难点关系识别等改进措施。集成上述方法的系统,边界、成分标记识别F 值达到93.196%,关系标记识别F 值达到92.103%,在中文信息学会句法分析评测(CIPS-ParsEval-2009)任务2:汉语基本块分析中取得第一名。


论文标题基于最大熵模型的中文词与句情感分析研究

作者董喜双 关毅 李本阳 陈志杰

期刊名称第二届中文倾向性分析评测(COAE2009)

简单介绍本文将研究焦点对准喜、怒、哀、惧四类情感分析问题,重点解决中文词、句的情感分析问题。将词的情感分析处理为候选词情感分类问题。首先通过词性过滤获得候选词,进而根据特征模板获取候选词情感特征,然后应用最大熵模型判断候选词情感类别,最后应用中性词典、倾向性词典、复句词表、否定词表过滤候选情感词分类错误得到情感词集合。句的情感分析首先根据情感词典和倾向词典提取词特征,并采用规则提取词序列特征,然后采用最大熵模型对句子进行情感分类。在COAE2009评测中词与句情感分析取得较好结果。


论文标题An Overview of Learning to Rank for Information Retrieval

作者Dong, X.; Chen, X.; Guan, Y.; Xu, Z.; Li, S.

期刊名称Proc. WRI World Congress on Computer Science and Information Engineering

期卷2009年

简单介绍This paper presents an overview of learning to rank. It includes three parts: related concepts including the definitions of ranking and learning to rank; a summary of pointwise models, pairwise models, and listwise models; and estimation measures such as Normalized Discounted Cumulative Gain and Mean Average Precision, respectively. Considering the deficiency that current learning to rank models lack continual learning ability, we present a new continual learning idea that combines a multi-agent autonomy learning mechanism with a molecular immune mechanism for ranking.
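文中提到的 NDCG 与平均准确率两类排序评价指标可按常见定义用几行代码计算,下面是一个纯 Python 的最小示意:
```python
# 示意:计算排序学习常用的 NDCG@k 与平均准确率(AP)
import math

def dcg(rels, k):
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels[:k]))

def ndcg(rels, k):
    ideal = dcg(sorted(rels, reverse=True), k)
    return dcg(rels, k) / ideal if ideal > 0 else 0.0

def average_precision(rels):
    """rels 为按模型得分排序后的 0/1 相关性序列。"""
    hits, score = 0, 0.0
    for i, r in enumerate(rels, 1):
        if r:
            hits += 1
            score += hits / i
    return score / hits if hits else 0.0

ranked = [3, 2, 0, 1]                       # 分级相关度(越大越相关)
print(round(ndcg(ranked, k=4), 4))
print(round(average_precision([1, 0, 1, 1]), 4))
```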


论文标题基于Swarm的人工免疫网络算法研究

作者杜新凯 关毅 岳淑珍 徐兴军

期刊名称微计算机信息

期卷2008年18期

简单介绍智能化信息检索是网络时代最重要的应用之一。现有的机器学习理论与方法难以适应网络环境下数据的动态性和用户兴趣的多样性,成为智能化信息检索研究中的一个薄弱环节。本文通过学习和借鉴自然免疫系统的特征和原理,利用Swarm软件平台,设计和实现了一个人工免疫网络算法。该算法建立在对自然免疫系统的现有理解之上,具备自然免疫系统的主要特征,并被成功的应用于解决一个简单的模式识别问题。最后展望了将人工免疫系统这一新的机器学习机制应用到智能化信息检索系统中的前景。


论文标题基于词聚类特征的统计中文组块分析模型

作者孙广路 王晓龙 关毅

期刊名称电子学报

期卷2008,36(12

简单介绍提出了一种基于信息熵的层次词聚类算法,并将该算法产生的词簇作为特征应用到中文组块分析模型中.词聚类算法基于信息熵的理论,利用中文组块语料库中的词及其组块标记作为基本信息,采用二元层次聚类的方法形成具有一定句法功能的词簇.在聚类过程中,设计了优化算法节省聚类时间.用词簇特征代替传统的词性特征应用到组块分析模型中,并引入名实体和仿词识别模块,在此基础上构建了基于最大熵马尔科夫模型的中文组块分析系统.实验表明,本文的算法提升了聚类效率,产生的词簇特征有效地改进了中文组块分析系统的性能.


论文标题A New Measurement of Systematic Similarity

作者Yi Guan, Xiaolong Wang, and Qiang Wang

期刊名称IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART A: SYSTEMS AND HUMANS

期卷VOL. 38, N

简单介绍The relationship of similarity may be the most universal relationship that exists between every two objects in either the material world or the mental world. Although similarity modeling has been the focus of cognitive science for decades, many theoretical and realistic issues are still under controversy. In this paper, a new theoretical framework that conforms to the nature of similarity and incorporates the current similarity models into a universal model is presented. The new model, i.e., the systematic similarity model, which is inspired by the contrast model of similarity and structure mapping theory in cognitive psychology, is the universal similarity measurement that has many potential applications in text, image, or video retrieval. The text relevance ranking experiments undertaken in this research tentatively show the validity of the new model.


论文标题Recent advances on NLP research in Harbin Institute of Technology

作者Tiejun Zhao, Yi Guan, Ting Liu, Qiang Wang

期刊名称Frontiers of Computer Science in China

期卷1(4): 413-

简单介绍In the sixties of the last century, the researchers of Harbin Institute of Technology (HIT) attempted to do relevant research in natural language processing. After a more-than-40-year effort, HIT has already established 3 research laboratories for Chinese information processing, i.e. the Machine Intelligence and Translation Laboratory (MI&T Lab), the Intelligent Technology and Natural Language Processing Laboratory (ITNLP) and the Information Retrieval Laboratory (IR-Lab). At present it has a well-balanced research team of over 200 persons, including tutors of Ph.D candidates, professors, associate professors, lecturers, Ph.D and Master candidates etc., and the research interests have extended to language processing, machine translation, text retrieval and other fields. Besides, during the course of the scientific research, HIT has accumulated a batch of key techniques and data resources, won many prizes in the technical evaluations at home and abroad, and has become one of the most important natural language processing bases for teaching and scientific research in China now.


论文标题基于多知识源的中文词法分析系统

作者姜维 王晓龙 关毅 赵健

期刊名称计算机学报

期卷2007年第1期

简单介绍汉语词法分析是中文自然语言处理的首要任务。文中深入研究中文分词、词性标注、命名实体识别所面临的问题及相互之间的协作关系,并阐述了一个基于混合语言模型构建的实用汉语词法分析系统。该系统采用了多种语言模型,有针对性地处理词法分析所面临的各个问题。其中分词系统参加了2005年第二届国际汉语分词评测,在微软亚洲研究院、北京大学语料库开放测试中,分别获得F量度为97.2%与96.7%。而在北京大学标注的《人民日报》语料库的开放评测中,词性标注获得96.1%的精确率,命名实体识别获得的F量度值为88.6%。
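作为词法分析任务输入输出形式的直观说明,下面用开源工具 jieba 演示分词加词性标注(仅为示意,并非本文所述系统):
```python
# 示意:分词 + 词性标注任务的输入输出形式(以开源工具 jieba 演示,非论文系统)
import jieba.posseg as pseg

sentence = "哈尔滨工业大学位于黑龙江省哈尔滨市"
for item in pseg.lcut(sentence):
    print(item.word, item.flag)   # 输出:词 词性标记
```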


论文标题基于特征类别属性分析的文本分类器分类噪声裁剪方法

作者王强 关毅 王晓龙

期刊名称自动化学报

期卷2007年08期

简单介绍提出一种应用文本特征的类别属性进行文本分类过程中的类别噪声裁剪(Eliminating class noise, ECN) 的算法. 算法通过分析文本关键特征中蕴含的类别指示信息, 主动预测待分类文本可能归属的类别集, 从而减少参与决策的分类器数目,降低分类延迟, 提高分类精度. 在中、英文测试语料上的实验表明, 该算法的F 值分别达到0.76 与0.93, 而且分类器运行效率也有明显提升, 整体性能较好. 进一步的实验表明, 此算法的扩展性能较好, 结合一定的反馈学习策略, 分类性能可进一步提高, 其F 值可达到0.806 与0.943.


论文标题A Probabilistic Approach to Syntax-based Reordering for Statistical Machine Translation

作者Chi-Ho Li, Dongdong Zhang, Mu Li, Ming Zhou,Minghui Li, Yi Guan

期刊名称Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics

简单介绍Inspired by previous preprocessing approaches to SMT, this paper proposes a novel, probabilistic approach to reordering which combines the merits of syntax and phrase-based SMT. Given a source sentence and its parse tree, our method generates, by tree operations, an n-best list of reordered inputs, which are then fed to a standard phrase-based decoder to produce the optimal translation. Experiments show that, for the NIST MT-05 task of Chinese-to-English translation, the proposal leads to a BLEU improvement of 1.56%.


论文标题基于标题类别语义识别的文本分类算法研究

作者王强 关毅 王晓龙

期刊名称电子与信息学报

期卷第29卷第12期

简单介绍本文提出了一种基于标题类别语义识别的文本分类算法。算法利用基于类别信息的特征选择策略构造分类的特征空间,通过识别文本标题中的特征词的类别语义来预测文本的候选类别,最后在候选类别空间中用分类器执行分类操作。实验表明该算法在有效降低分类候选数目的基础上可显著提高文本分类的精度,通过对类别空间表示效率指标的验证,进一步表明该算法有效地提高了文本表示空间的性能。


论文标题Using Maximum Entropy Model to Extract Protein-Protein Interaction Information from Biomedical Literature

作者Chengjie Sun, Lei Lin, Xiaolong Wang, Yi Guan

期刊名称Lecture Notes in Computer Science of Advanced Intelligent Computing Theories and Applications with Aspects of Theoretical and Me

期卷 Volume 46

简单介绍Protein-Protein interaction (PPI) information plays a vital role in biological research. This work proposes a two-step machine learning based method to extract PPI information from biomedical literature. Both steps use the Maximum Entropy (ME) model. The first step is designed to estimate whether a sentence in the literature contains PPI information. The second step is to judge whether each protein pair in a sentence has an interaction. The two steps are combined by adding the outputs of the first step to the model of the second step as features. Experiments show the method achieves a total accuracy of 81.9% on the BC-PPI corpus and the outputs of the first step can effectively improve the performance of PPI information extraction.


论文标题Rich features based Conditional Random Fields for biological named entities recognition

作者Chengjie Sun, Yi Guan, Xiaolong Wang, Lei Lin

期刊名称Computers in Biology and Medicine archive

期卷Volume 37,

简单介绍Biological named entity recognition is a critical task for automatically mining knowledge from biological literature. In this paper, this task is cast as a sequential labeling problem and the Conditional Random Fields model is introduced to solve it. Under the framework of the Conditional Random Fields model, rich features including literal, context and semantics are involved. Among these features, shallow syntactic features are first introduced, which effectively improve the model's performance. Experiments show that our method can achieve an F-measure of 71.2% on an open evaluation dataset, which is better than most state-of-the-art systems.


论文标题Exploiting residue-level and profile-level interface propensities for usage in binding sites prediction of proteins

作者Qiwen Dong, Xiaolong Wang, Lei Lin, Yi Guan

期刊名称BMC Bioinformatics

期卷2007; 8

简单介绍BackgroundRecognition of binding sites in proteins is a direct computational approach to the characterization of proteins in terms of biological andbiochemical function. Residue preferences have been widely used in many studies but the results are often not satisfactory. Althoughdifferent amino acid compositions among the interaction sites of different complexes have been observed, such differences have notbeen integrated into the prediction process. Furthermore, the evolution information has not been exploited to achieve a more powerfulpropensity.ResultIn this study, the residue interface propensities of four kinds of complexes (homo-permanent complexes, homo-transient complexes,hetero-permanent complexes and hetero-transient complexes) are investigated. These propensities, combined with sequence profilesand accessible surface areas, are inputted to the support vector machine for the prediction of protein binding sites. Such propensities are further improved by takingevolutional information into consideration, which results in a class of novel propensities at the profile level, i.e. the binary profiles interface propensities. Experimentis performed on the 1139 non-redundant protein chains. Although different residue interface propensities among different complexes are observed, the improvementof the classifier with residue interface propensities can be negligible in comparison with that without propensities. The binary profile interface propensities cansignificantly improve the performance of binding sites prediction by about ten percent in term of both precision and recall.ConclusionAlthough there are minor differences among the four kinds of complexes, the residue interface propensities cannot provide efficient discrimination for thecomplicated interfaces of proteins. The binary profile interface propensities can significantly improve the performance of binding sites prediction of protein, whichindicates that the propensities at the profile level are more accurate than those at the residue level.


论文标题基于支持向量机的音字转换模型

作者姜维 关毅 王晓龙 刘秉权

期刊名称中文信息学报

简单介绍针对n—gram在音字转换中不易融合更多特征,本文提出了一种基于支持向量机(svm)的音字转换模型,有效提供可以融合多种知识源的音字转换框架。同时,svm优越的泛化能力减轻了传统模型易于过度拟合的问题,而通过软间隔分类又在一定程度上克服小样本中噪声问题。此外,本文利用粗糙集理论提取复杂特征以及长距离特征,并将其融合于svm模型中,克服了传统模型难于实现远距离约束的问题。实验结果表明,基于svm音字转换模型比传统采用绝对平滑算法的trigram模型精度提高了1.2%;增加远距离特征的


论文标题一个基于免疫机制的在线机器学习算法

作者何晏成 关毅 岳淑珍

期刊名称第三届全国信息检索与内容安全学术会议

简单介绍本文在免疫应答机制和免疫网络理论等人体免疫原理的基础上,提出了一个新的在线机器学习算法,并将其运用在智能化信息检索系统的知识库参数调节上,实验结果表明该算法具有很好的适应性和从动态环境中持续学习的能力。


论文标题A Maximum Entropy Chunking Model with N-fold Template Correction

作者Sun Guanglu, Guan Yi, Wang Xiaolong

期刊名称 Journal of Electronics(China)

期卷2007 24 (5

简单介绍This letter presents a new chunking method based on a Maximum Entropy (ME) model with an N-fold template correction model. First, two types of machine learning models are described. Based on the analysis of the two models, a chunking model which combines the profits of the conditional probability model and the rule-based model is proposed. The selection of features and rule templates in the chunking model is discussed. Experimental results on the CoNLL-2000 corpus show that this approach achieves impressive accuracy in terms of the F-score: 92.93%. Compared with the ME model and the ME Markov model, the new chunking model achieves better performance.


论文标题An Improved Feature Representation Method for Maximum Entropy Model Data Mining

作者Guan Yi, Zhao Jian

期刊名称Data Mining Workshops, 2006. ICDM Workshops 2006. Sixth IEEE International Conference

简单介绍In maximum entropy model (MEM), features are typically represented by either 0-1 binary-valued function or real-valued function. However, both representations only examine the impact of specific value of some attributes but not their types. Such negligence not only causes the decreasing of classification precision, but also slows the convergence speed of the generalized iterative scaling (GIS) algorithm, as will become more apparent to incomplete data. In this paper, an improved feature representation method is presented. The feature is composed of two parts: the first one is for specific value of an attribute; the second one is for the type of corresponding attribute. The experimental results on Mushroom dataset of UCI data repository showed that the average classifying precisions on incomplete dataset and complete dataset were improved by 1.5% and 3.0% respectively, and the average convergence speed was improved by 42.9% and 90.7% respectively.


论文标题Biomedical Named Entities Recognition Using Conditional Random Fields Model

作者Chengjie Sun, Yi Guan, Xiaolong Wang, Lei Lin

期刊名称Lecture Notes in Computer Science of Fuzzy Systems and Knowledge Discovery

期卷Volume 422

简单介绍Biomedical named entity recognition is a critical task for automatically mining knowledge from biomedical literature. In this paper, we introduce Conditional Random Fields model to recognize biomedical named entities from biomedical literature. Rich features including literal, context and semantics are involved in Conditional Random Fields model. Shallow syntactic features are first introduced to Conditional Random Fields model and do boundary detection and semantic labeling at the same time, which effectively improve the model’s performance. Experiments show that our method can achieve an F-measure of 71.2% in JNLPBA test data and which is better than most of state-of-the-art system.


论文标题Exploring Efficient Feature Inference and Compensation In Text Classification

作者Qiang Wang, Yi Guan and Xiaolong Wang

期刊名称Journal of Chinese Language and Computing

期卷2006. 16 (

简单介绍This paper explores the feasibility of constructing an integrated framework for featureinference and compensation (FIC) in text classification. In this framework, featureinference is to devise intelligent pre-fetching mechanisms that allow for prejudging thecandidate class labels to unseen documents using the category information linked tofeatures, while feature compensation is to revise the current accepted feature set bylearning new or removing incorrect feature values through the classifier results. Thefeasibility of the novel approach has been examined with SVM classifiers on ChineseLibrary Classification (CLC) and Reuters-21578 dataset. The experimental results arereported to evaluate the effectiveness and efficiency of the proposed FIC approach.


论文标题A Novel Feature Selection Method Based on Category Information Analysis for Class Prejudging in Text Classification

作者Qiang Wang, Yi Guan, XiaoLong Wang and Zhiming Xu

期刊名称International Journal of Computer Science and Network Security

期卷2006. 6

简单介绍This paper presents a new feature selection algorithm withthe category information analysis in text classification.The algorithm obscure or reduce the noises of text featuresby computing the feature contribution with word anddocument frequency and introducing variance mechanismto mine the latent category information. The algorithm isdistinguished from others by providing a pre-fetchingtechnique for classifier while it is compatible withefficient feature selection, which means that the classifiercan actively prejudge the candidate class labels to unseendocuments using the category information linked tofeatures and classify them in the candidate class space toretrench time expenses. The experimental results onChinese and English corpus show that the algorithmachieves a high performance. The F measure is 0.73 and0.93 respectively and the run efficiency of classifier isimproved greatly.


论文标题SVM-Based Spam Filter with Active and Online Learning

作者Qiang Wang, Yi Guan, Xiaolong Wang

期刊名称Proceeding of Text REtrieval Conference on Spam Filtering Task(TREC2006)

简单介绍A realistic classification model for spam filtering should not only take account of the fact that spam evolves over time, but also that labeling a large number of examples for initial training can be expensive in terms of both time and money. This paper addresses the problem of separating legitimate emails from unsolicited ones with an active and online learning algorithm, using a Support Vector Machine (SVM) as the base classifier. We evaluate its effectiveness using a set of goodness criteria on the TREC2006 spam filtering benchmark datasets, and promising results are reported.
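在线学习部分可以用线性 SVM 的增量训练来示意,下面是一个假设使用 scikit-learn(HashingVectorizer 与 SGDClassifier)的最小草图,并非参评系统的实现:
```python
# 示意:SVM(hinge 损失)+ 在线增量更新的垃圾邮件过滤
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vec = HashingVectorizer(n_features=2 ** 18, alternate_sign=False)
clf = SGDClassifier(loss="hinge")            # 线性 SVM 的在线训练形式

batches = [
    (["win money now", "cheap pills"], [1, 1]),                  # 1 = spam
    (["meeting at 10am", "project report attached"], [0, 0]),    # 0 = ham
]
for texts, labels in batches:                # 分批到达的数据,增量更新模型
    clf.partial_fit(vec.transform(texts), labels, classes=[0, 1])

print(clf.predict(vec.transform(["free money offer"])))
```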


论文标题Answer Extraction Based on System Similarity Model and Stratified Sampling Logistic Regression in Rare Data

作者Peng Li, Yi Guan, Xiaolong Wang, Yongdong Xu

期刊名称IJCSNS International Journal of Computer Science and Network Security

期卷VOL.6 No.3

简单介绍This paper provides a novel and efficient method forextracting exact textual answers from the returned documentsthat are retrieved by traditional IR system in large-scalecollection of texts. The main intended contribution of this paperis to propose System Similarity Model (SSM), which can beconsidered as an extension of vector space model (VSM) to rankpassages, and to apply binary logistic regression model (LRM),which seldom be used in IE to extract special information fromcandidate data sets. The parameters estimated for the datagathered with serious problem of data sparse, therefore we takestratified sampling method, and improve traditional logisticregression model parameters estimated methods. The series ofexperimental results show that the overall performance of oursystem is good and our approach is effective. Our system,Insun05QA1, participated in QA track of TREC 2005 andobtained excellent results.


论文标题应用粗糙集理论提取特征的词性标注模型

作者姜维 王晓龙 关毅 徐志明

期刊名称高技术通讯

期卷 2006年10期

简单介绍针对词性标注中的复杂特征提取问题,应用粗糙集理论(rough sets),有效地挖掘了包括长距离特征在内的复杂特征,并有效地处理了语料库噪声问题。最后,将这些特征融合于最大熵模型中,训练时按模型整体性能为其分配权重。开放实验表明:增加粗规则后获得96.29%的标注精度,相比原有模型提高了0.83%。


论文标题A Pragmatic Chinese Word Segmentation Approach Based on Mixing Models

作者Jiang Wei, Guan Yi, Wang Xiao-Long

期刊名称International Journal of Computational Linguistics and Chinese Language Processing

期卷Volume 11(

简单介绍A pragmatic Chinese word segmentation approach is presented in this paper basedon mixing language models. Chinese word segmentation is composed of severalhard sub-tasks, which usually encounter different difficulties. The authors apply thecorresponding language model to solve each special sub-task, so as to takeadvantage of each model. First, a class-based trigram is adopted in basic wordsegmentation, which applies the Absolute Discount Smoothing algorithm toovercome data sparseness. The Maximum Entropy Model (ME) is also used toidentify Named Entities. Second, the authors propose the application of rough setsand average mutual information, etc. to extract special features. Finally, somefeatures are extended through the combination of the word cluster and thethesaurus. The authors’ system participated in the Second International ChineseWord Segmentation Bakeoff, and achieved 96.7 and 97.2 in F-measure in the PKUand MSRA open tests, respectively.


论文标题Conditional Random Fields Based Label Sequence and Information Feedback

作者Wei Jiang, Yi Guan, Xiao-Long Wang

期刊名称Lecture Notes in Computer Science of Natural Language Processing and Expert Systems

期卷 Volume 41

简单介绍Part-of-speech (POS) tagging and shallow parsing are sequencemodeling problems. While HMM and other generative models are not the mostappropriate for the task of labeling sequential data. Compared with HMM,Maximum Entropy Markov models (MEMM) and other discriminative finitestatemodels can easily fused more features, however they suffer from the labelbias problem. This paper presents a method of Chinese POS tagging andshallow parsing based on conditional random fields (CRF), as newdiscriminative sequential models, which may incorporate many rich featuresand well avoid the label bias problem. Moreover, we propose the informationfeedback from syntactical analysis to lexical analysis, since natural languageshould be a multi-knowledge interaction in nature. Experiments show that CRFapproach achieves 0.70% F-score improvement in POS tagging and 0.67%improvement in shallow parsing. And we also confirm the effectiveness ofinformation feedback to some complicated multi-class words.


论文标题Applying Rough Sets in Word Segmentation Disambiguation Based on Maximum Entropy Model

作者Jiang, W., X.-L. Wang, Y. Guan, and G.-H. Liang

期刊名称Journal of Harbin Institute of Technology (New Series)

期卷13(1)

简单介绍To solve the complicated feature extraction and longdistance dependency problem in Word SegmentationDisambiguation (WSD), this paper proposes to apply roughsets in WSD based on the Maximum Entropy model. Firstly,rough set theory is applied to extract the complicatedfeatures and long distance features, even from noise orinconsistent corpus. Secondly, these features are added intothe Maximum Entropy model, consequently, the featureweights can be assigned according to the performance ofthe whole disambiguation model. Finally, the semanticlexicon is adopted to build class-based rough set features toovercome data sparseness. The experiment indicated thatour method performed better than previous models, whichgot top rank in WSD in 863 Evaluation in 2003. Thissystem ranked first and second respectively in MSR andPKU open test in the Second International Chinese WordSegmentation Bakeoff held in 2005.


论文标题Improving Feature extraction in Named Entity Recognition based on Maximum Entropy Model

作者Jiang, W., Y. Guan, and X.-L. Wang

期刊名称2006 International Conference on Machine Learning and Cybernetics (ICMLC2006)

简单介绍A new method of improving feature extraction for NamedEntity Recognition is proposed in this paper. First of all, thecontext features and the entity features are extracted by thecorresponding algorithm. The triggers extracted by MutualInformation, Information Gain, Average Mutual Informationetc, are adopted to enhance the context features. And roughset theory is used to extract the entity features. Secondly, wordcluster method is presented to improve the approach ofexpanding features, which make us select features more easily,and overcome the sparse data problem effectively. Finally, allthe features are added into the maximum entropy model. Theexperiments have confirmed that our method is effective. Theabove method has been used in our word segmenter, whichparticipated in the International SIGHAN-2005 Evaluation,and ranked first in open test in MSR corpus.


论文标题Improving Sequence Tagging using Machine-Learning Techniques

作者Wei Jiang, Xiao-Long Wang, Yi Guan

期刊名称2006 International Conference on Machine Learning and Cybernetics (ICMLC2006)

简单介绍This paper presents an excel sequence tagging approachbased on the combined machine learning methods. Firstly,conditional random fields (CRF) is presented as a new kind ofdiscriminative sequential model, it can incorporate many richfeatures, and well avoid the label bias problem that is thelimitation of maximum entropy Markov models (MEMM) andother discriminative finite-state models. Secondly, supportvector machine is improved to adapt the sequential taggingtask. Finally, these improved models and other existing modelsare combined together, which have achieved thestate-of-the-art performance. Experimental results show thatCRF approach achieves 0.70% improvement in POS taggingand 0.67% improvement in shallow parsing. Moreover, ourcombination method achieves F-measure 93.73% and 93.69%in above two tasks respectively, which is better than anysub-model.


论文标题An Improved Unknown Word Recognition Model based on Multi-Knowledge Source Method

作者Jiang, W., Y. Guan, and X.-L. Wang

期刊名称6th International Conference on Intelligent Systems Design and Applications (ISDA'06)

期卷 vol 2, 20

简单介绍Unknown word recognition (UWR) is a difficultand foundational task in lexical processing and content-basedunderstanding. And it can improve many text-based processingapplications, such as Information Extraction, Question Answersystem, Electronic Meeting System. However the unified dealingapproach is difficult to exploit more domain knowledge features,so the performance cannot be further improved easily, sinceUWR has been proved to be NP-hard problem. This paperpresents a novel method for UWR task, which divides the UWRinto several hard sub-tasks that usually encountering differentdifficulties, accordingly, several language models are adopted tosolve the special sub-tasks, so as to exert the ability of each modelin addressing special problems. Firstly, a class-based trigram isused in basic word segmentation, aided with absolute smoothingalgorithm to overcome data sparseness. And Maximum EntropyModel (ME) is used to recognize Named Entity. New worddetection adopts variance and Conditional Random Fieldsalgorithm. Secondly, Multi-Knowledge features are effectivelyextracted and utilized in whole processing. Our systemparticipated in the Second International Chinese WordSegmentation Bakeoff (SIGHAN2005), and got the overallperformance 97.2% F-measure in MSRA open test.


论文标题A Pragmatic Chinese Word Segmentation System

作者Jiang, W., Y. Guan, and X.-L. Wang

期刊名称proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing

简单介绍This paper presents our work for participationin the Third International ChineseWord Segmentation Bakeoff. We applyseveral processing approaches accordingto the corresponding sub-tasks, which areexhibited in real natural language. In oursystem, Trigram model with smoothingalgorithm is the core module in wordsegmentation, and Maximum Entropymodel is the basic model in Named EntityRecognition task. The experimentindicates that this system achieves Fmeasure96.8% in MSRA open test in thethird SIGHAN-2006 bakeoff.


论文标题基于条件随机域的词性标注模型

作者姜维 关毅 王晓龙

期刊名称计算机工程与应用

期卷21期

简单介绍词性标注主要面临兼类词消歧以及未知词标注的难题,传统隐马尔科夫方法不易融合新特征,而最大熵马尔科夫模型存在标注偏置等问题。本文引入条件随机域建立词性标注模型,易于融合新的特征,并能解决标注偏置的问题。此外,又引入长距离特征有效地标注复杂兼类词,以及应用后缀词与命名实体识别等方法提高未知词的标注精度。在条件随机域模型框架下,本文进一步探讨了融合模型的方法及性能。词性标注开放实验表明,条件随机域模型获得了96.10%的标注精度。


论文标题A Novel Dynamic Adaptive Method Based on Artificial Immune System in Chinese Named Entity Recognition

作者Wei Jiang, Yi Guan, XiaoLong Wang

期刊名称International Journal of Computer Science and Network Security

期卷Vol. 6 No.

简单介绍Named Entity Recognition (NER), as a task of providing important semantic information, is a critical first step in information extraction and question answering systems. NER has been proved to be an NP-hard problem, and the existing methods usually adopt supervised or unsupervised learning models; as a result, there is still a distance away from the required performance in real applications, and the system can hardly be improved once the model has been applied. The paper proposes a novel method based on an artificial immune system (AIS) for NER. We apply the clonal selection principle and affinity maturation of the vertebrate immune response, where the secondary immune response has higher performance than the primary immune response, and similar antigens may have good immunity. We also introduce the reinforcement learning method into our system to tune the immune response, and the context features are exploited by the maximum entropy principle. The experimental results indicate that our method exhibits good performance and implements the dynamic learning behavior.


论文标题InsunQA06 on QA track of TREC2006

作者Zhao, Y., Xu, Z., Li, P., & Guan, Y

期刊名称Fifteenth Text REtrieval Conference (TREC 2006).

简单介绍This is the second time that our group takes part in the QA track of TREC. We developed a question-answering system, named InsunQA06, based on our Insun05QA system, and with InsunQA06 we participated in the Main Task, which submitted answers to three types of questions: factoid questions, list questions and others questions.The structure of InsunQA06 is similar with the structure of Insun05QA. Towards Insun05QA, the main difference of InsunQA06 is that new methods are developed and used in answer extraction module, for factoid and “others” questions. And external knowledge such as knowledge from Internet plays more important role in answer extraction. Besides that, we accomplished our documents retrieval module based on Indri, instead of SMART in InsunQA06.


论文标题Classifying Incomplete Data based on Maximum Entropy Model with New Feature Compensating

作者Zhao Jian, Xiao-long Wang, Guan Yi, Lin Lei

期刊名称 Journal of Electronics

期卷2006, Vol.

简单介绍For incomplete data classifying, MEM (Maximum Entropy Model) trained by GIS(Generalized Iterative Scaling) algorithm utilizes a global unique compensating feature to offsetthe effect of missing attributes of some samples in order to satisfy the constraint of GIS. However,this kind of compensating strategy neglects a fact that different features have different effects onclassification result. Hence, in this paper, an improved compensating strategy, taking effects ofboth different feature types and label types into accounts, is firstly proposed to overcome theshortage of traditional method. Experimental results on Mushroom data set from UCI datarepository show that the new method is feasible and effective. The average error rate is reduced byabout 68.3% and 33.5% respectively on two kinds of experimental datasets.


论文标题Research on Chinese Named Entity Recognition Base on Conditional Random Fields

作者Zhao Jian , Xiao-long Wang, Guan Yi, Xu Zhiming

期刊名称Journal of Electronics

期卷2006, Vol.

简单介绍Chinese named entity recognition (CNER) is an important and difficult task in Chineseinformation processing domain. In this paper, a new probabilistic model, conditional randomfields (CRF), which is very fit for labeling sequence data, is firstly introduced to the task of CNER.Unlike the generative model, CRF does not make effort on the observation modeling and canutilize rich overlapped features; moreover it can avoid the label bias problem of discriminativemodel. In order to perform CNER, special features including morphology Features, n-gramfeatures, lexicon features and combined features are selected to capture informative trait ofChinese language. Experiments of 6-fold cross validation on half-year People’s Daily show that,among four typical kinds of probabilistic models, CRF outperform the other three models. Thisapproach can achieve an overall F-measure around 85%.


论文标题一种改进的Wu-Manber 多模式匹配算法及应用

作者孙晓山 王强 关毅 王晓龙

期刊名称中文信息学报

期卷2006年02期

简单介绍本文针对Wu-Manber多模式匹配算法在处理后缀模式情况下的不足,给出了一种改进的后缀模式处理算法,减少了匹配过程中字符比较的次数,提高了算法的运行效率。本文在随机选择的TREC2000的52,067篇文档上进行了全文检索实验, 对比了Wu-Manber算法、使用后缀模式的改进算法、不使用后缀模式的简单改进等三种算法的匹配过程中字符比较的次数。实验结果说明,本文的改进能够比较稳定的减少匹配过程中字符比较的次数,提高匹配的速度和效率。


论文标题中文名实体识别:基于词触发对的条件随机域方法

作者赵健 王晓龙 关毅 徐志明

期刊名称高技术通讯

期卷2006年08期

简单介绍首次把条件随机域(CRF)模型应用到了中文名实体识别中,且根据中文的特点,定义了多种特征模板。同时,为了解决长距离约束问题,将词语触发对融合到了CRF模型中。提出了基于词语方差(word variance)的选词方法,在词语相关性计算上,采用了平均互信息(AMI)方法和 统计量方法。通过在半年人民日报上的测试,结果表明在采用相同特征集合的条件下,条件随机域模型较其他概率模型有更好的性能表现;融合长距离触发对的条件随机域模型可以使系统的F量度提高约1.38%。


论文标题Chinese Word Segmentation based on Mixing Model

作者Jiang, W., J. Zhao, Y. Guan, and Z.-M. Xu

期刊名称The 4th SIGHAN Workshop

简单介绍This paper presents our recent work for participation in the Second International Chinese Word Segmentation Bakeoff. According to difficulties, we divide word segmentation into several sub-tasks, which are solved by mixed language models, so as to take advantage of each approach in addressing special problems. The experiment indicated that this system achieved 96.7% and 97.2% in F-measure in PKU and MSR open test respectively.


论文标题基于数据挖掘思想的网页正文抽取方法的研究

作者蒲宇达,关毅,王强

期刊名称第三届学生计算语言学研讨会论文集

简单介绍为了把自然语言处理技术有效的运用到网页文档中,本文提出了一种依靠数据挖掘思想,从中文新闻类网页中抽取正文内容的方法.


论文标题融合聚类触发对特征的最大熵词性标注模型

作者赵岩,王晓龙,刘秉权,关毅

期刊名称计算机研究与发展

期卷2006,43(00

简单介绍为解决传统HMM词性标注模型不能包含远距离词特征的问题,提出了形如"WA→WB/TB"的触发对来承载远距离词特征信息,并采用平均互信息量度对触发对特征进行选择.在最大熵框架下,将选择后的触发对特征加入到词性标注系统中.


论文标题文档聚类综述

作者刘远超,王晓龙,徐志明,关毅

期刊名称中文信息学报

期卷2006,20(00

简单介绍聚类作为一种自动化程度较高的无监督机器学习方法 ,近年来在信息检索、多文档自动文摘等领域获得了广泛的应用。本文首先讨论了文档聚类的应用背景和体系结构 ,然后对文档聚类算法、聚类空间的构造和降维方法、文档聚类中的语义问题进行了综述。最后还介绍了聚类质量评测问题


论文标题K-NN 与 SVM 相融合的文本分类技术研究

作者王强,王晓龙,关毅,徐志明

期刊名称高技术通讯

期卷2005,15(00

简单介绍本文提出了一种改进的 K-NN (K Nearest Neighbor)与 SVM (Support Vector Machine)相融合的文本分类算法。该算法利用文本聚类描述 K-NN 算法中文本类别的内部结构,用 sigmoid 函数对 SVM输出结果进行概率转换,同时引入 CLA(Classifier’s Local Accuracy)技术进行分类可信度分析以实现两种算法的融合。实验表明该算法综合了 K-NN 与 SVM 在分类问题中的优势,既有效地降低了分类候选的数目,又相应地提高了文本分类的精度,具有较好的性能。


论文标题基于矢量空间模型和最大熵模型的词义问题解决策略

作者赵岩,王晓龙,刘秉权,关毅

期刊名称高技术通讯

期卷2005,15(00

简单介绍词义问题是自然语言处理中的核心问题之一,尤其在汉语这种轻语法、重意义的语言中更是如此。本文针对单义词的词义问题构建了融合触发对(trigger pair)的矢量空间模型用来进行词义相似度的计算,并以此为基础进行了词语的聚类;针对多义词的词义问题应用融合远距离上下文信息的最大熵模型进行了有导词义消歧的研究。为克服以往词义消歧评测中通过人工构造带有词义标记的测试例句而带来的覆盖程度小、主观影响大等问题,本文将模型的评测直接放到了词语聚类和分词歧义这两个实际的应用中。分词歧义的消解正确率达到了 92%,词语聚类的结果满足进一步应用的需要。


论文标题论系统相似的度量

作者关毅,王晓龙,王强

期刊名称全国第八届计算语言学联合学术会议 (JSCL-2005) 论文集

简单介绍本文阐明了系统相似度计算的基本原理,提出了一种新的系统相似度计算函数,论证了该函数的代数特点。作为系统相似度计算的应用之一,本文进而提出了一种新的信息检索模型-系统相似模型,论证了向量空间模型为该模型的特例,且该模型能有效地弥补向量空间模型的缺陷。


论文标题基于 Cover 级别的中文信息检索技术的研究

作者包刚,关毅,王强,赵健

期刊名称计算机工程与应用

期卷2005,41(02

简单介绍信息检索系统如果能较精确地定位于文章中用户关心的部分必将提高用户的检索效率。基于 Cover级别的检索策略就是针对上述问题提出的。基于 Cover 级别的检索策略以用户查询的关键词集合作为输入,在被检索文档中找到包含关键词集合的最短文本片断集作为输出。本文采用了一种经过改进的基于 Cover 级别的检索策略,对系统返回的文本片断作了限制,并在检索过程中使用了贪心算法(Greedy Algorithm)的思想,最后将其应用到中文信息检索系统中。实验证明,采用改进的策略比原有的基于 Cover 级别的检索策略在返回有效结果个数和平均排序倒数(MRR)等指标上都有了提高。


论文标题多文档文摘中基于语义相似度的最大边缘相关技术研究

作者刘寒磊,关毅,徐永东

期刊名称全国第八届计算语言学联合学术会议 (JSCL-2005)

简单介绍多文档自动文摘致力于从多篇文档中将全面、简洁的摘要性文档呈现给用户,提高用户获取信息的效率。本文提出了基于语句级语义相似度的最大边缘相关方法来选取文摘句,为生成高质量的文摘提供文摘单元支持。实验结果表明,与基于相关度大小排序选择文摘句的方法相比,系统的精确率和召回率明显提高;直观的评测可以看出该方法使生成文摘内容间的冗余度大大降低,信息覆盖面更广,概括性和可读性较强,能够达到较好的质量。


论文标题Automatic Text Summarization Based on Lexical Chains

作者Yanmin Chen, Xiaolong Wang, Guan Yi

期刊名称ICNC (1)

简单介绍The method of lexical chains is introduced for the first time to generate summaries from Chinese texts. The algorithm which computes lexical chains based on the HowNet knowledge database is modified to improve the performance and suit Chinese summarization. Moreover, the construction rules of lexical chains are extended, and the relationship among more lexical items is used. The algorithm constructs lexical chains first, and then strong chains are identified and significant sentences are extracted from the text to generate the summary. Evaluation results show that the performance of the system has a notable improvement both in precision and recall compared to the original system.


论文标题基于上下文平均互信息的问句查询扩展模型

作者邵兵,关毅,王强,王晓龙,任瑞春

期刊名称第二届全国学生计算语言学研讨会

简单介绍信息检索中存在用词歧义的问题,在中文自然语言查询处理中,表达差异问题更加突出。提出了一种基于上下文互信息的问句查询扩展模型,模型首先对训练集文档中的词或词组进行相关分析,计算每对词或词组间的互信息,然后再利用中文语义网与同义词资源进行中文信息检索的查询扩展。实验结果表明,该方法适宜改进Web 上的信息检索,相对一般的查询扩展算法可以大幅度提高各项指标。
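互信息的计算可用下面的纯 Python 草图示意:在文档集合上统计词共现并计算逐点互信息(PMI),取互信息最高的词作为扩展词(语料与参数均为占位假设,并非论文实现):
```python
# 示意:基于共现互信息(PMI)的查询扩展
import math
from collections import Counter
from itertools import combinations

docs = [["笔记本", "电脑", "价格"], ["笔记本", "电脑", "促销"], ["手机", "价格"]]

word_cnt, pair_cnt, n_docs = Counter(), Counter(), len(docs)
for d in docs:
    uniq = set(d)
    word_cnt.update(uniq)
    pair_cnt.update(frozenset(p) for p in combinations(sorted(uniq), 2))

def pmi(a, b):
    p_ab = pair_cnt[frozenset((a, b))] / n_docs
    if p_ab == 0:
        return float("-inf")
    return math.log2(p_ab / ((word_cnt[a] / n_docs) * (word_cnt[b] / n_docs)))

def expand(query_word, topk=2):
    scores = [(w, pmi(query_word, w)) for w in word_cnt if w != query_word]
    return sorted(scores, key=lambda x: -x[1])[:topk]

print(expand("电脑"))   # 输出与"电脑"互信息最高的扩展词
```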


论文标题A Study of Semi-discrete Matrix Decomposition for LSI in Automated Text Categorization

作者Qiang Wang, Xiaolong Wang, Guan Yi

期刊名称 A Study of Semi-discrete Matrix Decomposition for LSI in Automated Text Categorization

期卷IJCNLP

简单介绍This paper proposes the use of Latent Semantic Indexing (LSI) techniques,decomposed with semi-discrete matrix decomposition (SDD) method,for text categorization. The SDD algorithm is a recent solution to LSI, whichcan achieve similar performance at a much lower storage cost. In this paper,LSI is used for text categorization by constructing new features of category ascombinations or transformations of the original features. In the experiments ondata set of Chinese Library Classification we compare accuracy to a classifierbased on k-Nearest Neighbor (k-NN) and the result shows that k-NN based onLSI is sometimes significantly better. Much future work remains, but the resultsindicate that LSI is a promising technique for text categorization.


论文标题A Maximum Entropy Markov Model for Chunking

作者Guang-Lu Sun, Yi Guan, Xiao-Long Wang, Jian Zha

期刊名称Proceedings of the Fourth International Conference on Machine Learning and Cybernetics

简单介绍This paper presents a new chunking method based on maximum entropy Markov models (MEMM). MEMM is described in detail that combines transition probabilities and conditional probabilities of states effectively. The conditional probabilities of states are estimated by maximum entropy (ME) theory. The transition probabilities of the states are estimated by N-gram model in which interpolation smoothing algorithm is utilized on the basis of analyzing chunking spec. Experiment results show that this approach achieves an impressive performance: 92.53% in F-score on the open data sets of CoNLL-2000 shared task. The performance of the algorithm is close to the state-of-the-art.


论文标题 Automatic and efficient recognition of proper nouns based on maximum entropy model

作者Peng Li, Yi Guan, Xiao Long Wang, Jun Sun

期刊名称ICMLC2005

简单介绍This paper presents a high performance method to identify English proper nouns (PNs) based on maximum entropy model (MaxEnt). Most traditional PNs recognition systems use lexical resources such as name list, as new names are constantly coming into existence, these are necessarily incomplete. Therefore machine learning methods are used to identify PNs automatically. In the framework of MaxEnt model, semantic and lexical information of surrounding words and word itself acting as atomic features comprises feature templates and forms feature without requiring extra expert knowledge. The test on WSJ of Penn Treebank II shows that this method guarantees high precision and recall, and at the same time it can reduce the quantity of features dramatically, downsize system space consumption, and decrease the time of training and testing, so as to improve the efficiency considerably. The method in this paper can be transformed to identify other specific noun easily because the principle of methods is universal.


论文标题Extracting answers to natural language questions from large-scale corpus

作者Peng Li, Xiao Long Wang, Yi Guan, Yu Ming Zhao

期刊名称Proceedings of 2005 IEEE International Conference on Natural Language Processing and Knowledge Engineering

简单介绍This paper provides a novel and tractable method for extracting exact textual answers from the returned documents that are retrieved by traditional IR system in large-scale collection of texts. In our approach, WordNet and Web information are employed to improve the performance as external auxiliary resources, then some NLP technologies are used to constitute the empirical answer ranking formula, such as POS tagging, Named Entity recognition, and parser etc. The method involves automatically ranking passages with System Similarity Model, automatically downloading related Web pages by means of Web crawler, and automatically mining answers with empirical formula from candidate answer sets. The series of experimental results show that the overall performance of our system is good and the structure of the system is reasonable.


论文标题Analyzing the Incomplete Data based on the Improved Maximum Entropy Model

作者Jian Zhao, XiaoLong Wang, Yi Guan, Lei Lin

期刊名称 International Journal ofInformation Technology

期卷vol.11, no

简单介绍:. When the MEM (Maximum Entropy Model) trained by GIS (Generalized Iterative Scaling) algorithm was used to analyze the incomplete data, in order to satisfy the constraint of GIS, a global unique compensating feature was introduced to offset the effect of missing attributes of some samples on classification result. However, this kind of compensating strategy neglected a basic fact that different features had different effect on classification result. In this paper, an improved compensating strategy was proposed to overcome the shortage of traditional method: took effects of both different feature types and label types into account. Experiment results on Mushroom data set coming from UCI data repository showed that the new method was feasible and effective. The average error rate was reduced by about 33.5%.


论文标题Using category-based semantic field for text categorization

作者Qiang Wang, XiaoLong Wang, Yi Guan, ZhiMing Xu

期刊名称The 4th International Conference on Machine Learning and Cybernetics(ICMLC)

简单介绍This paper proposes a new document representation method for text categorization. It applies Category-based Semantic Field (CBSF) theory for text categorization to gain a more efficient representation of documents. The lexical chain is introduced to compute CBSF and HowNet is used as a lexical database. In particular, the title of each document functions as a clue to forecast the potential CBSF of the test document. Combined with a classifier, this approach is examined in text categorization and the result indicates that it performs better than conventional methods with features chosen on the basis of the bag-of-words (BOW) system, on the same task.


论文标题Domain-Specific Term Extraction and Its Application in Text Classification

作者Tao Liu, Xiao-long Wang, Yi Guan, Zhi-ming Xu, Qiang Wang,

期刊名称Proceedings of 8th Joint Conference on Information Sciences (JCIS2005)

简单介绍A statistical method is proposed for domain-specific term extraction from domain comparative corpora. It takes distribution of a candidate word among domains and within a domain into account. Entropy impurity is used to measure distribution of a word among domains and within a domain. Normalization step is added into the extraction process to cope with unbalanced corpora. So it characterizes attributes of domain-specific term more precisely and more effectively than previous term extraction approaches. Domain-specific terms are applied in text classification as the feature space. Experiments show that it achieves better performance than traditional methods for feature selection.


论文标题蛋白质二级结构预测: 基于词条的最大熵马尔科夫方法

作者董启文,王晓龙,林磊,关毅,赵健

期刊名称中国科学 C 辑 生命科学 2005

期卷 35 (1): 8

简单介绍提出了一种新的蛋白质二级结构预测方法.该方法从氨基酸序列中提取出和自然语言中的“词”类似的与物种相关的蛋白质二级结构词条,这些词条形成了蛋白质二级结构词典,该词典描述了氨基酸序列和蛋白质二级结构之间的关系.预测蛋白质二级结构的过程和自然语言中的分词和词性标注一体化的过程类似.该方法把词条序列看成是马尔科夫链,通过Viterbi算法搜索每个词条被标注为某种二级结构类型的最大概率,其中使用词网格描述分词的结果,使用最大熵马尔科夫模型计算词条的二级结构概率.蛋白质二级结构预测的结果是最优的分词所对应的二级结构类型.在 4 个物种的蛋白质序列上对这种方法进行测试,并和 PHD 方法进行比较.试验结果显示,这种方法的 Q3 准确率比 PHD 方法高 3.9%, SOV 准确率比 PHD 方法高 4.6%. 结合BLAST 搜索的局部相似的序列可以进一步提高预测的准确率.在 50 个 CASP5 目标蛋白质序列上进行测试的结果是: Q3 准确率为 78.9%, SOV 准确率为 77.1%.


论文标题Insun05QA on QA Track of TREC 2005

作者Yuming Zhao,Yi Guan, ZhiMing Xu, Peng Li

期刊名称Proceedings of TREC 2005

简单介绍This is the first time that our group takes part in the QA track. At TREC2005, the system we developed, Insun05QA, participated in the Main Task, which submitted answers to three types of questions: factoid questions, list questions and others questions. And we also submitted the document ranking which our answers are generated from. A new sentence similarity calculating method is used in our Insun05QA system. It can be considered as an extension of vector space model. And our QA system incorporates several useful tools. These tools include WordNet, developed by Princeton University, Minipar by Dekang Lin, and GATE, developed by University of Sheffield. Moreover, external knowledge such as knowledge from Internet is also widely used in our system. Since it is the first time that we take part in QA track and the preparing time is limited, we concentrate on the processing of factoid questions. And the methods we developed to process list and others questions are generated from the method used to process factoid questions.


论文标题一种基于粗糙集增量式规则学习的问题分类方法研究

作者李鹏;王晓龙;关毅

期刊名称电子与信息学报

期卷2008年05期

简单介绍该文提出一种基于粗糙集增量式规则自动学习来实现问题分类的方法,通过深入提取问句特征并采用决策表形式构建训练语料,利用机器学习的方法自动获取分类规则。与其他方法相比优势在于,用于分类的规则自动生成,并采用粗糙集理论的简约方法获得优化的最小规则集;首次在问题分类中引入增量式学习理念,不但提高了分类精度,而且避免了繁琐的重新训练过程,大大提高了学习速度,并且提高了分类的可扩展性和适应性。对比实验表明,该方法分类精度高,适应性好。在国际TREC2005Q/A实际评测中表现良好。


论文标题基于统计的网页正文信息抽取方法的研究

作者孙承杰, 关毅

期刊名称中文信息学报

期卷2004年第18卷0

简单介绍为了把自然语言处理技术有效的运用到网页文档中,本文提出了一种依靠统计信息,从中文新闻类网页中抽取正文内容的方法。该方法先根据网页中的HTML 标记把网页表示成一棵树,然后利用树中每个结点包含的中文字符数从中选择包含正文信息的结点。该方法克服了传统的网页内容抽取方法需要针对不同的数据源构造不同的包装器的缺点,具有简单、准确的特点,试验表明该方法的抽取准确率可以达到95 %以上。采用该方法实现的网页文本抽取工具目前为一个面向旅游领域的问答系统提供语料支持,很好的满足了问答系统的需求。
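下面是"按结点所含中文字符数选取正文结点"这一统计思路的简化草图(假设使用 lxml 解析 HTML;结点选择策略做了简化,并非论文的完整方法):
```python
# 示意:按结点所含中文字符数选取网页正文结点
from lxml import html

def chinese_chars(text):
    return sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")

def extract_main_text(page_source):
    tree = html.fromstring(page_source)
    # 去掉脚本与样式结点,避免干扰统计
    for bad in tree.xpath("//script | //style"):
        bad.getparent().remove(bad)
    best = max(tree.iter("div", "td", "p"),
               key=lambda el: chinese_chars(el.text_content()),
               default=tree)
    return best.text_content().strip()

page = """<html><body>
<div id="nav">首页 新闻 体育</div>
<div id="content"><p>这里是一段较长的新闻正文内容,包含了报道的主要信息。</p></div>
</body></html>"""
print(extract_main_text(page))
```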


论文标题面向专业网站的中文问答系统研究

作者关毅,王晓龙,赵岩,赵健

期刊名称Proceedings of the 20th International Conference on Computer Processing of Oriental Languages

简单介绍问答系统是一种大量运用自然语言处理技术的新型信息检索系统,正在成为自然语言处理领域和信息检索领域中的一个引人注目的研究热点。本文在论述了面向专业网站的中文问答系统的几个基本问题:定义、概况、国内外研究现状之后,介绍了哈工大计算机应用教研室开发的问答系统实验平台,提出了以系统相似为基础的问答系统的基本原理,从而把应用于这一特定信息检索技术的各项自然语言处理技术理顺到系统化、理论化的轨道。


论文标题基于统计的汉语词汇间语义相似度计算

作者关毅,王晓龙

期刊名称语言计算与基于内容的文本处理——全国第七届计算语言学联合学术会议论文集,

简单介绍语义相似是词汇间的基本关系之一,汉语词汇间语义相似的定量化研究对于信息检索、统计语言模型等自然语言处理的应用技术具有重要的指导意义。本文定义了语义相似度的数学模型,进而描述了基于相关熵的汉语词汇间语义相似度计算方法。初步实验表明,该方法是一种理论基础严整,实践上行之有效的方法。


论文标题基于短语的汉语N-gram语言模型研究

作者刘秉权,王晓龙,王轩,关毅

期刊名称863计划智能计算机主题学术会议

简单介绍N-gram统计语言模型因其鲁棒性强、简洁、有效等特点成为当前的主流语言建模技术,但其本身存在难以克服的缺点:不能有效处理长距离语言约束;统计信息有时也不能反映真实的语言规律.


论文标题汉语大词表 N—gram 统计语言模型构造算法

作者徐志明,王晓龙,关毅

期刊名称计算机应用研究

期卷1999,16(00

简单介绍本文提出了汉语大词表的N-gram统计语言模型构造技术,根据信息论的观点,给出了自然语言处理中各种应用中的统计语言建模的统一框架描述,提出了一种汉语大词表的Trigram语言模型构造算法。把构造的Trigram语言模型应用于大词表非特定人孤立词语音识别系统中,系统识别率达到82%。
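下面给出 trigram 统计语言模型构造的最小示意(最大似然计数加简单加一平滑;仅说明建模思路,与文中大词表系统的工程实现无关):
```python
# 示意:从已分词语料统计 trigram,并用加一平滑估计条件概率
from collections import Counter

corpus = [["我", "喜欢", "自然", "语言", "处理"],
          ["我", "喜欢", "语音", "识别"]]

trigram, bigram, vocab = Counter(), Counter(), set()
for sent in corpus:
    toks = ["<s>", "<s>"] + sent + ["</s>"]
    vocab.update(toks)
    for i in range(2, len(toks)):
        trigram[(toks[i - 2], toks[i - 1], toks[i])] += 1
        bigram[(toks[i - 2], toks[i - 1])] += 1

def p(w, w1, w2):
    """P(w | w1 w2),加一平滑。"""
    return (trigram[(w1, w2, w)] + 1) / (bigram[(w1, w2)] + len(vocab))

print(p("自然", "我", "喜欢"))
print(p("语音", "我", "喜欢"))
```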


论文标题基于转移的音字转换纠错规则获取技术

作者关毅,王晓龙,张凯

期刊名称计算机研究与发展

期卷1999,36(00

简单介绍文中描述了一种在音字转换系统中从规模不限的在线文本中自动获取纠错规则的机器学习技术.该技术从音字转换结果中自动获取误转换结果及其相应的上下文信息,从而生成转移规则集.该转移规则集应用于音字转换的后处理模块,使音字转换系统的转换正确率进一步提高,并使系统具备了很强的灵活性和可扩展性.


论文标题基于统计的计算语言模型

作者关毅,张凯,付国宏

期刊名称计算机应用研究

期卷1999,16(00



论文标题语音识别语言理解模型

作者徐志明,王晓龙,张凯,关毅,孙玉琦

期刊名称第五届全国人机语音通讯学术会议论文集

简单介绍本文提出了一种规则与统计相结合的计算语言模型应用于语言识别后端处理的技术,把基于统计的大词表Markov统计模型与语言规则量化模型集成在一个语言理解系统, 讨论了两种计算语言模型的互补性与结合机制


论文标题基于统计与规则相结合的汉语计算语言模型及其在语音识别中的应用

作者关毅,王晓龙.

期刊名称高技术通讯

期卷1998,8(004

简单介绍把基于统计的语料概率统计方法与基于规则的自然语言理解方法结合起来, 提出了一种新的汉语计算语言模型, 并把该模型应用于语音识别后处理模块中, 取得了较理想的结果


论文标题现代汉语计算语言模型中语言单位的频度—频级关系

作者关毅,王晓龙,张凯

期刊名称中文信息学报

期卷1999年02期

简单介绍Zipf 定律是一个反映英文单词词频分布情况的普适性统计规律。我们通过实验发现 ,在现代汉语的字、词、二元对等等语言单位上 ,其频度与频级的关系也近似地遵循 Zipf定律 ,说明了 Zipf 定律对于汉语的不同层次的语言单位也是普遍适用的。本文通过实验证实了 Zipf 定律所反映的汉语语言单位频度 —频级关系 ,并进而深入讨论了它对于汉语自然语言处理的各项技术 ,尤其是建立现代汉语基于统计的计算语言模型所具有的重要指导意义。
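频度与频级的 Zipf 关系(频度乘频级近似为常数)可以用几行代码在语料上粗略检验,以下为示意(语料为占位,需替换为真实的字或词频统计):
```python
# 示意:统计词频并检验"频度 × 频级 ≈ 常数"的 Zipf 关系
from collections import Counter

tokens = "的 了 是 我 的 在 是 的 了 我 他 的 在 是".split()   # 占位语料
freqs = Counter(tokens).most_common()

for rank, (word, freq) in enumerate(freqs, 1):
    print(rank, word, freq, rank * freq)   # 末列大致为常数即近似满足 Zipf 定律
```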


论文标题Clinical-decision support based on medical literature: A complex network approach

作者Jingchi Jiang, Jichuan Zheng, Chao Zhao, Jia Su, Yi Guan, Qiubin Yu

期刊名称Physica A: Statistical Mechanics and its Applications

期卷Volume 459, 1 October 2016, Pages 42–54

简单介绍In making clinical decisions, clinicians often review medical literature to ensure the reliability of diagnosis, test, and treatment because the medical literature can answer clinical questions and assist clinicians making clinical decisions. Therefore, finding the appropriate literature is a critical problem for clinical-decision support (CDS). First, the present study employs search engines to retrieve relevant literature about patient records. However, the result of the traditional method is usually unsatisfactory. To improve the relevance of the retrieval result, a medical literature network (MLN) based on these retrieved papers is constructed. Then, we show that this MLN has small-world and scale-free properties of a complex network. According to the structural characteristics of the MLN, we adopt two methods to further identify the potential relevant literature in addition to the retrieved literature. By integrating these potential papers into the MLN, a more comprehensive MLN is built to answer the question of actual patient records. Furthermore, we propose a re-ranking model to sort all papers by relevance. We experimentally find that the re-ranking model can improve the normalized discounted cumulative gain of the results. As participants of the Text Retrieval Conference 2015, our clinical-decision method based on the MLN also yields higher scores than the medians in most topics and achieves the best scores for topics: #11 and #12. These research results indicate that our study can be used to effectively assist clinicians in making clinical decisions, and the MLN can facilitate the investigation of CDS.


论文标题中文电子病历命名实体和实体关系标注体系及语料库构建

作者杨锦锋,关毅,何彬,曲春燕,于秋滨,刘雅欣,赵永杰

期刊名称软件学报

简单介绍 电子病历是由医务人员撰写的面向患者个体描述医疗活动的记录, 蕴含了大量的医疗知识和患者的健康信息. 电子病历命名实体识别和实体关系抽取等信息抽取研究对于临床决策支持、循证医学实践和个性化医疗服务等具有重要意义, 而电子病历命名实体和实体关系标注语料库的构建是首当其冲的. 本文在调研了国内外电子病历命名实体和实体关系标注语料库构建的基础上, 结合中文电子病历特点, 提出适合中文电子病历的命名实体和实体关系的标注体系, 在医生的指导和参与下, 制定了命名实体和实体关系的详细标注规范, 构建了标注体系完整、 规模较大且一致性较高的标注语料库. 语料库包含病历文本992 份, 命名实体标注一致性达到0.922, 实体关系一致性达到0.895. 我们的工作为中文电子病历信息抽取后续研究打下了坚实的基础.


论文标题中文电子病历命名实体语料库构建

作者曲春燕,关毅,杨锦锋,赵永杰,刘雅欣

期刊名称高技术通讯

期卷2015, 25(2)

简单介绍针对中文电子病历命名实体语料标注空白的现状,研究了中文电子病历命名实体标注语料库的构建.参考2010年美国国家集成生物与临床信息学研究中心(I2B2)给出的电子病历命名实体类型及修饰类型的定义,在专业医生的指导下制定了详尽的中文电子病历标注规范;通过对大量中文电子病历的分析,提出了一套完整的中文电子病历命名实体标注方案,而且采用预标注和正式标注的方法,建立了一定规模的中文电子病历命名实体标注语料库,其标注语料的一致性达到了92%以上.该工作对中文电子病历的命名实体识别及信息抽取研究提供了可靠的数据支持,对医疗知识挖掘也有重要意义.


论文标题CRFs based de-identification of medical records

作者He B, Guan Y, Cheng J, et al

期刊名称Journal of biomedical informatics

简单介绍De-identification is a shared task of the 2014 i2b2/UTHealth challenge. The purpose of this task is to remove protected health information (PHI) from medical records. In this paper, we propose a novel de-identifier, WI-deId, based on conditional random fields (CRFs). A preprocessing module, which tokenizes the medical records using regular expressions and an off-the-shelf tokenizer, is introduced, and three groups of features are extracted to train the de-identifier model. The experiment shows that our system is effective in the de-identification of medical records, achieving a micro-F1 of 0.9232 at the i2b2 strict entity evaluation level.
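作为补充说明,下面用正则给出一个识别日期、电话等 PHI 候选片段的规则基线草图(仅为示意,并非 WI-deId 的 CRF 模型或其预处理实现,模式为假设样例):
```python
# 示意:用正则标出病历文本中的日期、电话等 PHI 候选片段
import re

PHI_PATTERNS = {
    "DATE": re.compile(r"\b\d{4}[-/]\d{1,2}[-/]\d{1,2}\b"),
    "PHONE": re.compile(r"\b\d{3}[- ]?\d{3,4}[- ]?\d{4}\b"),
    "ID": re.compile(r"\b[A-Z]{2}\d{6,}\b"),
}

def find_phi(text):
    spans = []
    for label, pat in PHI_PATTERNS.items():
        spans += [(m.start(), m.end(), label, m.group()) for m in pat.finditer(text)]
    return sorted(spans)

record = "Patient admitted on 2014-07-21, contact 555-123-4567, MRN AB123456."
for start, end, label, value in find_phi(record):
    print(label, value, (start, end))
```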


出版物
王晓龙、关毅,《计算机自然语言处理》,清华大学出版社,2005年


论著成果
1995年,微软拼音输入法(与微软公司合作)主要参加人

1996年,Macintosh用BOPOMOFO智能语句输入法(与日本佳能泰克(佳能公司子公司)公司合作)主要参加人

2000年,Weniwen智能中文搜索引擎 主要参加人

2002年,智能化中文信息处理平台 主要参加人

2003年,Insun_TC文本分类系统 主要负责人

2004年,面向体育、旅游领域的智能中文问答系统InsunTourQA 主要负责人

2005年,ICSU词法分析系统 主要负责人

2005年,InsunQA英文问答系统 主要负责人

2008年,面向博客bbs的中文情感极性分析系统(与富士通中国研发中心合作)第一负责人

2008年,myspace隐式用户兴趣挖掘系统(与myspace公司聚友网合作)第一负责人

2009年,中文浅层句法分析系统(与阿里巴巴公司合作)第一负责人

2010年,面向IOS的中文智能语句输入法WI输入法 第一负责人

2010年,电子病历管理系统(与哈尔滨医科大学第二附属医院合作)第一负责人
















