Data Is Art

Designing a Dictionary-Based Sentiment Orientation Analysis Algorithm for Chinese Text

shenhao — Wed, 04 Jun 2014 06:32:26 +0000

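Sentiment orientation can be regarded as a subject's inner like or dislike of some object, a tendency in its internal evaluation. It is measured along two dimensions: the direction of the orientation and its degree.

The direction is also called sentiment polarity. On Weibo it can be understood as whether the attitude a user takes when expressing a view about some object is supportive, opposed, or neutral, which is what is usually meant by positive, negative, and neutral sentiment. For example, "赞美" (praise) and "表扬" (commend) are both commendatory words that express positive sentiment, while "龌龊" (filthy) and "丑陋" (ugly) are derogatory words that express negative sentiment.

The degree is how strongly the subject expresses positive or negative sentiment toward the object. Different intensities are usually conveyed through different sentiment words or tones. For example, "敬爱" (revered) and "亲爱" (dear) are both commendatory words expressing positive sentiment, but "敬爱" is far stronger than "亲爱". To capture this difference, sentiment orientation research usually assigns each sentiment word its own weight.

Current methods for sentiment orientation analysis fall into two main classes: methods based on a sentiment dictionary, and methods based on machine learning, such as learning from a large-scale corpus. The former needs an annotated sentiment dictionary. There are many for English; for Chinese the main ones are the HowNet sentiment dictionary and NTUSD, compiled and released by National Taiwan University. The open-source 《同义词词林》 (Tongyici Cilin, a synonym thesaurus) from the Information Retrieval Lab at Harbin Institute of Technology can also be used to extend a sentiment dictionary. Machine learning methods need a large manually annotated corpus as a training set; they extract text features and build a classifier to classify sentiment.

The unit of analysis can be a word, a sentence, a paragraph, or a whole document. Paragraph- and document-level analysis mainly judges the orientation toward a particular topic or event, and usually calls for a sentiment dictionary built for that domain. For movie reviews, for instance, a dictionary built for the film industry works better than a general-purpose one; alternatively, a classifier can be built from a large set of manually annotated reviews. Sentence-level analysis is mostly done by averaging over all the sentiment words a sentence contains. Document-level sentiment can in turn be computed by aggregating the orientation of every sentence in the document. Sentence-level analysis therefore both handles short texts and serves as the basis for document-level analysis. The algorithm in this post is designed along exactly these lines. It has three main parts:

1. Text segmentation and conversion

The largest object the algorithm analyzes is a document and the smallest is a sentence. A sentence can be treated as a special case, a document with a single sentence, so the object of analysis is a document D.

Paragraph = Document.split("\n")   ## split the document into paragraphs P on the newline character "\n"
punc = ["。", "；", "？", "！"]
Sentence = Paragraph.split(punc)   ## split each paragraph into sentences L on the marks that commonly delimit a complete thought in Chinese: full stop, semicolon, question mark, exclamation mark
Group = Sentence.split("，")   ## split each sentence on commas into sense groups (the smallest unit that expresses sentiment)
Seg(each Group)   ## call an online segmentation tool or a local segmentation function to segment each sense group into words

There are many open-source Chinese word segmenters, such as the online SCWS (PHP), NLPIR (C, Python, Java) developed by Dr. Zhang Huaping's team, LTP (C++, Python) from Harbin Institute of Technology, and the R package RWordseg (an R interface to NLPIR). Each has its own characteristics, which are not covered here; readers can look them up.

The purpose of text segmentation is to turn the text into the format needed for the later analysis. For example, the sentence "我今天很不高兴。" (I am very unhappy today.) becomes:

[(1, "我", "r"), (2, "今天", "t"), (3, "很", "d"), (4, "不", "d"), (5, "高兴", "a")]

Different segmenters provide different word attributes. SCWS also returns the IDF value of each word; LTP returns the sentence's dependency relations, semantic roles, and so on. All of these help when computing sentence sentiment later. This post uses only the part of speech; interested readers can consider how the other attributes could improve the analysis.

2. Sentiment location

Starting from existing Chinese sentiment lexicons, we built a sentiment word table. The text is segmented, and each resulting word is looked up in the table in turn. If it is found, it is a sentiment word, and its polarity and weight are read; otherwise it is not a sentiment word and we move on to the next candidate, until the whole sentence has been processed. The procedure can be written as follows: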


For each Paragraph in Document:
   For each Line in Paragraph:
      For each Group in Line:
         For each Word in Group:
            If Word in senDict:
               senWord = (position in sentence, sentiment polarity, sentiment strength)
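The splitting and lookup steps can be sketched in Python roughly as follows. This is a minimal illustration, not the author's implementation: the two-entry `SEN_DICT`, its polarity labels and strengths, and the function names are all hypothetical, and word segmentation is assumed to have been done already by one of the tools mentioned in the post.

```python
import re

# Sentence-ending punctuation listed in the post.
SENTENCE_PUNC = "。；？！"

# Toy sentiment dictionary (hypothetical entries): word -> (polarity, strength).
SEN_DICT = {"高兴": ("positive", 4), "疲惫": ("negative", 3)}


def split_document(document):
    """Split raw text into paragraphs -> sentences -> sense groups (strings)."""
    result = []
    for paragraph in document.split("\n"):
        if not paragraph.strip():
            continue  # skip empty lines
        sentences = [s for s in re.split(f"[{SENTENCE_PUNC}]", paragraph) if s.strip()]
        # Split each sentence on full-width or ASCII commas into sense groups.
        result.append([[g for g in re.split("[，,]", s) if g.strip()] for s in sentences])
    return result


def locate_sentiment_words(tokens, sen_dict=SEN_DICT):
    """Return (position, polarity, strength) for every sentiment word.

    `tokens` is one already-segmented sense group; positions are 1-based,
    matching the numbering used in the post's example.
    """
    return [
        (pos, *sen_dict[word])
        for pos, word in enumerate(tokens, start=1)
        if word in sen_dict
    ]
```

For instance, `locate_sentiment_words(["我", "今天", "很", "不", "高兴"])` finds "高兴" at position 5.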

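Sentiment analysis of text starts by finding the sentiment words in a sentence. Their polarity and degree determine the sentiment of the sentence, and from there the sentiment of the whole text. In practice, however, a negation word that modifies a sentiment word changes its polarity. Take "我今天很不高兴" (I am very unhappy today): "高兴" (happy) is a commendatory word, but the negation word "不" (not) modifies it and turns it into negative sentiment. Chinese also allows multiple negation: an odd number of negation words expresses negation, and an even number expresses affirmation. We built a separate negation dictionary, notDict, and set its weight to -1. Common negation words include: 不, 没, 无, 非, 莫, 弗, 毋, 勿, 未, 否, 别, 無, 休. The handling of negation words can be simplified to: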


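For each Paragraph in Document:
   For each Line in Paragraph:
      For each Group in Line:
         LastSenWordPosition = 0  ## position of the previous sentiment word in the group
         For each Word in Group:
            If Word in senDict:
               senWord = (position in sentence, sentiment polarity, sentiment strength)
               ## scan backward from the sentiment word to the previous sentiment word
               for i in range(senWord[0] - 1, LastSenWordPosition, -1):
                  if Group[i] in notDict:
                     notWord.append( (position in sentence, -1) )
               LastSenWordPosition = senWord[0]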
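One possible Python rendering of this negation scan is sketched below. The function name and the return shape (a dict keyed by the sentiment word's position) are illustrative choices, not something the post specifies; positions are assumed to be 1-based, and the window examined for each sentiment word is the run of words between it and the previous sentiment word.

```python
# Negation words listed in the post; each carries a weight of -1.
NOT_DICT = {"不", "没", "无", "非", "莫", "弗", "毋", "勿", "未", "否", "别", "無", "休"}


def find_negations(tokens, sen_positions, not_dict=NOT_DICT):
    """Collect the negation words that precede each sentiment word.

    tokens        -- one segmented sense group (list of words)
    sen_positions -- 1-based positions of its sentiment words, ascending
    Returns {sentiment word position: [(negation position, -1), ...]}.
    """
    found = {}
    last = 0  # position of the previous sentiment word; 0 means "none yet"
    for pos in sen_positions:
        # Words strictly between the previous sentiment word and this one.
        found[pos] = [
            (i, -1) for i in range(last + 1, pos) if tokens[i - 1] in not_dict
        ]
        last = pos
    return found
```

With the running example, `find_negations(["我", "今天", "很", "不", "高兴"], [5])` attaches the negation word at position 4 to the sentiment word at position 5.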

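In addition, when a degree adverb modifies a sentiment word, the strength of that word's orientation changes. Take "今天坐了12个小时的车，身体极度疲惫。" (I sat in a car for 12 hours today and am utterly exhausted.): "疲惫" (exhausted) is a derogatory word, and the preceding degree adverb "极度" (utterly) makes it stronger than it would be unmodified. To express the orientation of the text accurately, the weight has to be adjusted accordingly. The degree adverbs used here come from HowNet, specifically the 219 "Chinese degree-level words" in its "Sentiment analysis word set (beta)". Lin Huang et al. proposed dividing degree adverbs into six levels; the author assigned a weight to each degree adverb, and the weight of a sentiment word modified by a degree adverb is adjusted accordingly. The degree adverbs are shown in the table below: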

Examples of degree adverbs

Type	Weight	Count
超|over	1.5	30
很|very	1.25	42
极其|extreme / 最|most	2	69
较|more	1.2	37
欠|insufficiently	0.5	12
稍|-ish	0.8	29

Degree adverbs are handled much like negation words. Simplified, the procedure is:


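For each Paragraph in Document:
   For each Line in Paragraph:
      For each Group in Line:
         LastSenWordPosition = 0  ## position of the previous sentiment word in the group
         For each Word in Group:
            If Word in senDict:
               senWord = (position in sentence, sentiment polarity, sentiment strength)
               ## scan backward from the sentiment word to the previous sentiment word
               for i in range(senWord[0] - 1, LastSenWordPosition, -1):
                  if Group[i] in degreeDict:
                     degreeWord = (position in sentence, modifier strength)
               LastSenWordPosition = senWord[0]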
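A degree-adverb lookup along the same lines might look like the sketch below. The level weights come from the table in the post; the word-to-level assignments are illustrative (in particular, placing "极度" in the extreme/most level is an assumption), and the function name is hypothetical.

```python
# Weights per level, taken from the table in the post.
LEVEL_WEIGHTS = {
    "over": 1.5,
    "very": 1.25,
    "extreme/most": 2.0,
    "more": 1.2,
    "insufficiently": 0.5,
    "-ish": 0.8,
}

# A few example words per level (assumed assignments, not the full 219-word list).
WORD_LEVELS = {
    "超": "over",
    "很": "very",
    "极其": "extreme/most",
    "最": "extreme/most",
    "极度": "extreme/most",
    "较": "more",
    "欠": "insufficiently",
    "稍": "-ish",
}


def find_degree_words(tokens, sen_positions):
    """Return {sentiment word position: [(degree word position, weight), ...]}.

    Positions are 1-based; only words between the previous sentiment word
    and the current one are examined.
    """
    found = {}
    last = 0
    for pos in sen_positions:
        found[pos] = [
            (i, LEVEL_WEIGHTS[WORD_LEVELS[tokens[i - 1]]])
            for i in range(last + 1, pos)
            if tokens[i - 1] in WORD_LEVELS
        ]
        last = pos
    return found
```

Keeping the level weights separate from the word list makes it easy to retune a whole level without touching individual words.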

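After this processing the text is converted one step further:

"我今天很不高兴。" (I am very unhappy today.)

① After text segmentation and conversion:

[(1, "我", "r"), (2, "今天", "t"), (3, "很", "d"), (4, "不", "d"), (5, "高兴", "a")]

② After sentiment location:

[(5, "Happy", 4), [(4, -1)], (3, 1.25)]  ## [sentiment word, negation words, degree adverb]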

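3. Sentiment aggregation

As noted earlier, document-level sentiment orientation is computed by aggregating the orientation of every sentence in the document, and sentence-level orientation is computed from the sentiment words the sentence contains. The previous two steps split sentences into sense groups and extracted the sentiment words, negation words, and degree adverbs in each group. With these, we first compute the sentiment value of a sense group:

group sentiment value = negation word (-1) * degree word weight * sentiment word weight

In practice we also found that when a negation word and a degree word appear in the same sentence, their relative order changes the sentiment. For example:

"我很不高兴" (I am very unhappy), segmented as: 我 很 不 高兴
"我不很高兴" (I am not very happy), segmented as: 我 不 很 高兴

The first sentence expresses strong negative sentiment, while the second expresses fairly weak positive sentiment. So a negation word before the degree word weakens the sentiment, and a negation word after the degree word reverses it. We therefore adjusted the algorithm:

W = 1
If position(negation word) > position(degree word):
   W = -1
   group sentiment value = W * degree word weight * sentiment word weight
If position(negation word) < position(degree word):
   W = 0.5
   group sentiment value = W * degree word weight * sentiment word weight

If a sense group contains several negation words, they are handled as follows: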


For n in notWord:
   W = -1 * W
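Putting the position rule and the multiple-negation rule together, a sense-group score could be computed as in the sketch below. The post does not spell out how the two rules combine, so this is one possible reading: each negation word before the degree word multiplies W by 0.5, and every other negation word flips the sign. The signed `sen_weight` (positive for commendatory words, negative for derogatory ones) and the function name are assumptions.

```python
def group_score(sen_weight, negations=(), degree=None):
    """Score one sense group as W * degree weight * sentiment word weight.

    sen_weight -- signed strength of the sentiment word (assumed convention)
    negations  -- iterable of (position, -1) tuples
    degree     -- (position, weight) of the degree adverb, or None
    """
    deg_pos, deg_weight = degree if degree else (None, 1.0)
    w = 1.0
    for neg_pos, _ in negations:
        if deg_pos is not None and neg_pos < deg_pos:
            w *= 0.5  # negation before the degree word: weakens the sentiment
        else:
            w *= -1   # negation after the degree word, or no degree word: flips it
    return w * deg_weight * sen_weight
```

Under this reading, "我很不高兴" with a strength of 4 for "高兴" scores -1 * 1.25 * 4 = -5.0, and "我不很高兴" scores 0.5 * 1.25 * 4 = 2.5.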

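A sentence is made up of sense groups, so its sentiment can simply be written as:

sentence sentiment value = sum(group sentiment value 1, group sentiment value 2, ...)

A paragraph is made up of sentences, but because paragraph length varies widely we use the average rather than the sum:

paragraph sentiment value = average(sentence 1 sentiment value, sentence 2 sentiment value, ...)

A document is made up of paragraphs, and different documents have different numbers of paragraphs, so again we take the average:

document sentiment value = average(paragraph 1 sentiment value, paragraph 2 sentiment value, ...)

That covers the sentiment value. As for the orientation, the sign of the value first tells us whether it is positive or negative. If the orientation is divided more finely than positive, negative, and neutral, the finer categories can be summarized according to whether the value is positive or negative.

This is the simplest possible approach. It largely ignores differences between sentences and the varying importance of paragraphs within a document, and there is much to improve. A sentence is built from words according to the rules of the language, so the dependency relations between its words should enter the computation: starting from the root of the sentence's dependency structure, compute the orientation of each word, and derive the orientation and value of the sentence from those relations. Document sentiment should likewise reflect how important each sentence is, by weighting each sentence's contribution accordingly. A sentence's importance can be estimated from its position in the document, the amount of information it carries, the number of keywords it contains, and so on.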
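The aggregation rules above (sum over sense groups, average over sentences, average over paragraphs) can be sketched as follows. The nested-list input format, the function names, and the treatment of a score of exactly zero as neutral are assumptions made for illustration.

```python
from statistics import mean


def document_sentiment(document):
    """Aggregate sense-group scores into one document score.

    `document` is a list of paragraphs; each paragraph is a list of
    sentences; each sentence is a list of sense-group scores.
    Sentence = sum of groups, paragraph = mean of sentences,
    document = mean of paragraphs. Empty paragraphs or documents
    raise statistics.StatisticsError.
    """
    return mean(
        mean(sum(groups) for groups in paragraph) for paragraph in document
    )


def polarity(score):
    """Map a sentiment value to an orientation by its sign."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

For example, a document with paragraphs `[[-5.0], [2.5, 1.5]]` and `[[3.0]]` has paragraph scores -0.5 and 3.0, so the document score is 1.25 and the orientation is positive.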

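References:
Chen, Xiaodong. (2012). 基于情感词典的中文微博情感倾向分析研究 [Research on sentiment orientation analysis of Chinese microblogs based on a sentiment dictionary] (Master's thesis, Huazhong University of Science and Technology).
Wang, Feiyue, Li, Xiaochen, Mao, Wenji, & Wang, Tao. (2013). 社会计算的基本方法与应用 [Basic methods and applications of social computing] (pp. 36-49). Zhejiang University Press.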

We Don't Train Data Scientists, We Train Data Artisans!

shenhao — Thu, 22 May 2014 17:15:25 +0000

WHY DATA ARTISANS ARE THE NEW DATA SCIENTISTS

NUMBERS AREN'T JUST NUMBERS. THEY ALSO TELL STORIES.

The era of Big Data is upon us. From retailers to shipping companies to marketing agencies, more businesses are using Big Data to discover patterns and trends. Many companies even have specialized teams of people who work on Big Data projects, solely focusing on analyzing and manipulating the troves of information generated and stored in the Internet age.

But who are these people behind the Big Data platforms? Typically, Big Data is associated with data scientists, the “geeks” who boast the statistical, mathematical, and database knowledge required for working with large unstructured datasets. While they are often seen as the faces behind Big Data, data scientists are not the only ones who work with data on a daily basis. In fact, there is a new type of employee emerging: the data artisan.

The term data artisan was first coined by Alteryx, a software company interviewed for this story. Data artisans are employees who possess a blend of technical skills and business acumen that enables them to extract actionable insight from the huge volumes of data that exist--despite their lack of experience with it--demonstrating that businesses don’t always need a data scientist to interpret data effectively. The qualifications and requirements of the role may vary across companies, but one thing’s for certain: the data artisan will have a significant impact on the enterprise of the future.

Some companies are already ahead of the curve. The three highlighted below are utilizing data artisans to get value out of Big Data, and it may not be long before other companies follow suit.

MENDICANT MARKETING

Mendicant Marketing, an Internet marketing firm, is leveraging Big Data to develop effective marketing strategies for its clients. The firm analyzes massive data sets to return meaningful patterns and results that can be applied to digital marketing campaigns. Kevin Milani is a digital marketing data artisan at the company, performing data implementation and analysis on a daily basis. While he calls himself a “geek at heart,” Milani isn’t the stereotypical data nerd. He spent the majority of his early career working in the business marketing industry, specializing in search engine marketing and Google AdWords. He never worked with Big Data until he founded Mendicant Marketing in 2007. Despite his inexperience with Big Data, Milani is able to utilize his marketing knowledge to identify connections between data points. He also frequently experiments with the data to find creative ways to gain insight. By merging Big Data with his digital marketing expertise, he helps businesses target the right prospect with the right message at the right time. “It’s incredible what you can achieve when working with data,” said Milani. “Anytime you have large amounts of customer information, there’s an opportunity to use Big Data analytics to discover new things about your customers and your business.” The key to leveraging this type of data, however, is the data artisan. “Data artisans are going to play a huge role in the future,” said Milani. “The profession is going to explode, and if businesses don’t keep up, they’re going to fail.”

FINDTHEBEST

FindTheBest is a research hub that helps consumers think like experts. The site boasts hundreds of comparisons on topics from colleges to cars to ski resorts to dog breeds (disclosure: I work here). Pooja Sohoni is a Product Associate at the company and helps design and edit the website’s health pages. She is not an engineer or a statistician but an anthropology major who has become a self-taught expert in Big Data. Since she lacks a robust technical background, she relies on her knowledge of medical anthropology to analyze health data. In the health vertical on FindTheBest, more than 4,800 hospitals in the United States can be searched according to location, rating, and type, among other criteria. Go to the page for a particular hospital and a wealth of information appears in summaries, charts, and graphics--down to the average costs for procedures and the mortality rates for common serious conditions. Sohoni worked on the comparison and chose which data points to include based on what would be most relevant to consumers. The data is presented in graphs and pie charts so consumers can easily understand the implications of the data. In addition to designing the initial charts and graphs, Sohoni selects which visualizations will represent certain pieces of information. She also writes summaries and explanations for key terms. Sohoni even compiles an introductory guide, which helps the consumer further understand the topic. “Consumers want to be able to explain why they picked the product they did,” Sohoni said. “I give them the data and enable them to see the context and importance of that data.”

ALTERYX

Alteryx is a business information software company known for its analytics platform. The platform is specifically designed for front-line business decision makers--who typically lack a strong background in statistics and data analytics--so they can identify problems and opportunities more efficiently. Dan Putler, the data artisan in residence at the company, says Alteryx is helping data artisans like himself by giving them a sophisticated analytics platform with capabilities that enable them to truly tell the story behind the data. “Alteryx empowers data artisans to do things they haven’t done in the past with analytics tools they have never used before,” said Putler. “These tools allow them to use their knowledge to develop sophisticated mathematical models and implications for the business.”

As a data artisan, Putler is largely involved with both product development and high-level strategy development. He also works with external companies in sales and marketing processes, helping them to create segmentation strategies and improve their promotional material. Putler believes all companies can leverage Big Data to discover business opportunities--and they don’t need a data scientist to do it. He predicts that data artisans will partially take over the functions of data scientists and will play an integral role in the near future. “Data artisans allow businesses to do deeper, more predictive analytics than they would otherwise be able to do,” said Putler. “This means businesses will be able to make better decisions with data artisans with greater efficiency.”

--Alejandra Saragoza is a recent graduate of UC Santa Barbara, and currently serves as a marketing associate for FindTheBest, a research hub that helps consumers think like experts.

Weibo Check-ins at Haidilao's Beijing Branches, Visualized

admin — Thu, 27 Mar 2014 06:03:58 +0000

Tableau makes it easy to do BI and to generate reports very quickly.

Want to Become a Data Artist (Data Artisan)?

shenhao — Sat, 15 Mar 2014 13:51:30 +0000

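Want to become a data artist, a Data Artisan? Master Alteryx, Tableau, R, and D3, which are my favorite software tools! Big data fields: data science, network science, geospatial science, visualization technology. Data journalism: journalists are starting to use data to present the true stories they want to tell! Source: the official Weibo account of the Nandu (Southern Metropolis) Omnimedia Group.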

Recreating World-Famous Paintings with Tableau

shenhao — Fri, 14 Mar 2014 17:10:10 +0000

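I was bored today and came across, on a Tableau blog, a technique for recreating world-famous paintings in Tableau. Quite fun! I tracked down the data, and it turns out to come from a famous mathematical problem, the TSP.

The Travelling Salesman Problem (TSP) is an optimization problem with many local optima: given n cities, a salesman starts from one of them, visits every city exactly once, and returns to the city he started from, and the task is to find the shortest route. In other words, find a shortest Hamiltonian cycle.

I found TSP data for six world-famous paintings at http://www.math.uwaterloo.ca/tsp/data/art/index.html

Building the chart: 1) scatter plot, 2) adjust the size, 3) set the shape, 4) tune the colors.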
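To make the TSP definition concrete, here is a small brute-force sketch that finds a shortest Hamiltonian cycle for a handful of points. It is purely illustrative: the function names are hypothetical, and exhaustive search is only feasible for about ten cities, so it is not how instances as large as the painting datasets are handled.

```python
from itertools import permutations
from math import dist


def tour_length(points, order):
    """Length of the closed tour that visits `points` in `order`."""
    n = len(order)
    return sum(
        dist(points[order[i]], points[order[(i + 1) % n]]) for i in range(n)
    )


def shortest_tour(points):
    """Try every visiting order that starts at city 0 and keep the shortest."""
    candidates = ((0,) + rest for rest in permutations(range(1, len(points))))
    best = min(candidates, key=lambda order: tour_length(points, order))
    return best, tour_length(points, best)
```

For the four corners of a unit square, `shortest_tour([(0, 0), (0, 1), (1, 1), (1, 0)])` walks the perimeter, for a total length of 4.0.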

Tableau Chinese Tutorial, with Many Worked Examples

shenhao — Fri, 14 Mar 2014 01:51:14 +0000

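#Announcement · Tableau Chinese tutorial# Dear friends, the importance of data visualization goes without saying, and Tableau, a tool for very fast visual analysis, is much loved! We are therefore publicly releasing the electronic edition of the Chinese tutorial that 博易智讯 wrote mainly for Tableau 7.0, supplemented with the features of version 8.0. The book runs to 300 pages and is free to download. Discussion is welcome. Click http://t.cn/zH9Yo9a and select the Tableau tutorial.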

Showing the Airports Around MH370 with Tableau

shenhao — Wed, 12 Mar 2014 15:53:57 +0000

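Reports say that Malaysian air force chief Rodzali Daud confirmed that MH370 disappeared from the air force's radar at Butterworth at 2:40 a.m. on the 8th, at Pulau Perak in the Strait of Malacca. I downloaded the OpenFlights global airport data and marked the airports of Malaysia and neighboring countries. With airports packed this densely and a 777-200 being such a giant, how could radar not have noticed? Save the file as Airpots.csv and import it into Tableau, choose a specific WMS map link, select the main countries, and display the airport locations in Malaysia and neighboring countries!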

Showing Global Terrorist Incidents with Tableau

shenhao — Wed, 12 Mar 2014 15:48:17 +0000

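[The Global Terrorism Database records 113,113 terrorist incidents from 1970 through the end of 2012.] The data is shown in Tableau on maps, broken down by country, city, attack type, and target type, as in the figures. Click http://t.cn/8FdJFIO; the charts are linked and support interactive drill-down. In addition, when you hover over a country on the home page, clicking the link drills straight down to the city-level detail for that country.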

Figure 1

Figure 2

Figure 3

Alteryx: Software That Combines ETL, Mining, Blending, and Spatial Analysis

shenhao — Wed, 12 Mar 2014 15:46:36 +0000

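Alteryx's mission is to become a one-stop platform for statistical data analysis. Like Tableau, its software joins data computation with polished graphics, and like SAS and R it can run statistics and analyze data, so Alteryx can be described as a hybrid of those three. Also worth noting is Alteryx's lead in analyzing geospatial data:

Alteryx can quickly preprocess spatial data, including geocoding, data cleaning, and data blending. Through a drag-and-drop interface, with no coding at all, it performs geospatial analysis quickly and presents the result as a map.
Alteryx can connect to all the mainstream spatial data sources, such as Map Inputs, ESRI files, MapInfo files, and spatial databases. Its spatial analysis is much faster than traditional analysis tools such as SAS.
In addition, geospatial analysis can be combined with Alteryx's predictive analytics to gain deeper insight into the data. The results can then be passed to third-party tools such as Tableau, ESRI, or MapInfo to build further maps or map layers for a specific visual presentation.

Alteryx is mainly engaged in designing, developing, producing, and distributing intelligent software and platforms, while 博易智讯 is a company specializing in IT services, business intelligence products and services, and data analysis and data mining products and services. Under their cooperation agreement, Alteryx supplies the software products and 博易智讯 provides local service and customer management, and together they aim to better serve the business management and decision support needs of Chinese enterprises. In January 2014, 博易智讯 formally signed a cooperation agreement with Alteryx, the big data analytics company from California. As Alteryx's only current agent in China, it resells Alteryx products and provides product-related technical support. The two companies will pursue a comprehensive strategic partnership in the Chinese market on the principles of long-term cooperation, complementary strengths, and mutual benefit.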

Explaining Data Science

shenhao — Sun, 09 Mar 2014 05:08:03 +0000

One way to explain data science. Source: https://s3.amazonaws.com/aws.drewconway.com/viz/venn_diagram/data_science.html

Explaining data science:

Math & Statistics Knowledge	Once you have acquired and cleaned the data, the next step is to actually extract insight from it. You need to apply appropriate math and statistics methods, which requires at least a baseline familiarity with these tools.
Hacking Skills	Data is a commodity traded electronically; therefore, in order to be in this market you need to speak hacker. Far from 'black hat' activities, data hackers must be able to manipulate text files at the command line, think algorithmically, and be interested in learning new tools.
Substantive Expertise	Science is about discovery and building knowledge, which requires some motivating questions about the world and hypotheses that can be brought to data and tested with statistical methods. Questions first, then data.
Machine Learning	Data plus math is machine learning, which is fantastic if that is what you are interested in, but not if you are doing data science.
Traditional Research	Substantive expertise plus math and statistics knowledge is where most traditional researchers fall. Doctoral-level researchers spend most of their time acquiring expertise in these areas, but very little time learning about technology.
Danger Zone	This is where I place people who 'know enough to be dangerous,' and it is the most problematic area of the diagram. It is from this part of the diagram that the phrase 'lies, damned lies, and statistics' emanates, because either through ignorance or malice this overlap of skills gives people the ability to create what appears to be a legitimate analysis without any understanding of how they got there or what they have created.