Bayesian Opinion Mining  

2012-09-17 17:38:19 | Category: Recommender Systems

http://phpir.com/bayesian-opinion-mining

The web is a great place for people to express their opinions, on just about any subject. Even the professionally opinionated, like movie reviewers, have blogs where the public can comment and respond with what they think, and there are a number of sites that deal in nothing more than this. The ability to automatically extract people's opinions from all this raw text can be a very powerful one, and it's a well studied area - no doubt because of the commercial possibilities.

Opinion mining, or sentiment analysis, is far from a solved problem though. People often express more than one opinion "the movie was terrible, but DeNiro's performance was superb, as always", use sarcasm "this is probably the best laptop Dell could come up with", or use negation and complex devices that can be hard to parse "not that I'm saying this was a bad experience".

On top of this, expressions of sentiment tend to be very topic focused - what works for one subject might not work for another. To use a well worn example, it's a good thing to say that the plot of a movie is unpredictable, but a bad thing to say it about the steering of a car. Even within a single product, the same words can describe opposite feelings about different features - it's bad for the start-up time of a digital camera to be long, but good for the battery life to be long. This is why a great deal of work, particularly on product reviews, goes into classifying which element of a product is being talked about before starting the opinion mining process.

At the movies

We'll start with a simpler approach, and look at movie reviews. Luckily for us these are fairly easily available online from places like Rotten Tomatoes and IMDB, and indeed a convenient data set of sentences expressing positive and negative opinions has already been compiled. We're using opinions expressed at the sentence level in order to give ourselves a little more granularity - while most movie reviews are longer than this, they will also usually express more than one opinion, and keeping our document unit small helps us avoid muddying the waters.

The data is supplied as two files, one for positive opinions and the other negative, with one sentence per line, which makes it easy to parse. To actually extract the opinion, we're going to make use of a classic and well known tool, a Naive Bayesian classifier. These were all the rage for spam filters a couple of years back, and are still a hugely popular way of doing filtering. They have the advantage that they're easy to implement, pretty effective, and quick to classify with.

Naive Bayes

Bayesian classifiers are based around Bayes' rule, a way of looking at conditional probabilities that allows you to flip the condition around in a convenient way. A conditional probability is the probability that an event X will occur, given the evidence Y; it is normally written P(X | Y). Bayes' rule allows us to determine this probability when all we have is the probability of the reverse condition, P(Y | X), and of the two events individually: P(X | Y) = P(X)P(Y | X) / P(Y). This restatement is very helpful when we're trying to estimate the probability of something based on examples of it occurring.
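To make the rule concrete, here's a quick sketch with made-up numbers (the figures are purely illustrative, not from any real corpus):

```php
<?php
// Suppose 20% of emails are spam, 60% of spam emails contain the
// word "offer", and 25% of all emails contain "offer".
$pSpam = 0.2;             // P(spam)
$pOfferGivenSpam = 0.6;   // P(offer | spam)
$pOffer = 0.25;           // P(offer)

// Bayes rule: P(spam | offer) = P(spam)P(offer | spam) / P(offer)
echo $pSpam * $pOfferGivenSpam / $pOffer; // 0.48
?>
```

Seeing "offer" raised the probability of spam from 0.2 to 0.48, using only quantities we could count from examples.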

In this case, we're trying to estimate the probability that a document is positive or negative, given its contents. We can restate that in terms of the probability of the document occurring given that it has been predetermined to be positive or negative. This is convenient, because we have examples of positive and negative opinions in the data set above.

The thing that makes this a "naive" Bayesian process is the big assumption we make about how to calculate the probability of the document occurring: that it is equal to the product of the probabilities of each word within it occurring, which implies there is no link between one word and another. This independence assumption is clearly not true - there are lots of words which occur together more frequently than either does individually or with other words - but this convenient fiction massively simplifies things for us, and makes it straightforward to build a classifier.

We can estimate the probability of a word occurring given a positive or negative sentiment by looking through a series of examples of positive and negative sentiments and counting how often it occurs in each class. This is what makes this supervised learning - the requirement for pre-classified examples to train on.

So, our initial formula looks like this.

P(sentiment | sentence) = P(sentiment)P(sentence | sentiment) / P(sentence)

We can drop the P(sentence) denominator, as it's the same for both classes, and we just want to rank them rather than calculate a precise probability. The independence assumption lets us treat P(sentence | sentiment) as the product of P(token | sentiment) across all the tokens in the sentence. So, we estimate P(token | sentiment) as

(count(this token in class) + 1) / (count(all tokens in class) + count(all tokens))

The extra 1 and count of all tokens is called 'add one' or Laplace smoothing, and stops a 0 finding its way into the multiplications. If we didn't have it, any sentence containing an unseen token would score zero.
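To see what the smoothing does, here's a small sketch with hypothetical counts (the numbers are invented for illustration):

```php
<?php
// Illustrative counts: suppose "superb" was seen 40 times in the
// positive class, the positive class holds 10000 tokens, and the
// index holds 21000 tokens overall.
$count = 40; $classTokens = 10000; $allTokens = 21000;

// Smoothed estimate for a seen token:
echo ($count + 1) / ($classTokens + $allTokens); // ~0.00132

// An unseen token ($count = 0) still gets a small non-zero
// probability, rather than zeroing out the whole product:
echo (0 + 1) / ($classTokens + $allTokens);      // ~0.000032
?>
```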

We've implemented the above in PHP, in the classify function of the following class:

<?php
class Opinion {
        private $index = array();
        private $classes = array('pos', 'neg');
        private $classTokCounts = array('pos' => 0, 'neg' => 0);
        private $tokCount = 0;
        private $classDocCounts = array('pos' => 0, 'neg' => 0);
        private $docCount = 0;
        private $prior = array('pos' => 0.5, 'neg' => 0.5);

        public function addToIndex($file, $class, $limit = 0) {
                // Validate the class before opening the file, so we don't
                // leak an open handle on bad input
                if(!in_array($class, $this->classes)) {
                        echo "Invalid class specified\n";
                        return;
                }
                $fh = fopen($file, 'r');
                $i = 0;
                while($line = fgets($fh)) {
                        if($limit > 0 && $i >= $limit) { // stop after exactly $limit lines
                                break;
                        }
                        $i++;
                        
                        $this->docCount++;
                        $this->classDocCounts[$class]++;
                        $tokens = $this->tokenise($line);
                        foreach($tokens as $token) {
                                if(!isset($this->index[$token][$class])) {
                                        $this->index[$token][$class] = 0;
                                }
                                $this->index[$token][$class]++;
                                $this->classTokCounts[$class]++;
                                $this->tokCount++;
                        }
                }
                fclose($fh);
        }
        
        public function classify($document) {
                $this->prior['pos'] = $this->classDocCounts['pos'] / $this->docCount;
                $this->prior['neg'] = $this->classDocCounts['neg'] / $this->docCount; 
                $tokens = $this->tokenise($document);
                $classScores = array();

                foreach($this->classes as $class) {
                        $classScores[$class] = 1;
                        foreach($tokens as $token) {
                                $count = isset($this->index[$token][$class]) ? 
                                        $this->index[$token][$class] : 0;

                                $classScores[$class] *= ($count + 1) / 
                                        ($this->classTokCounts[$class] + $this->tokCount);
                        }
                        $classScores[$class] = $this->prior[$class] * $classScores[$class];
                }
                
                arsort($classScores);
                return key($classScores);
        }

        private function tokenise($document) {
                $document = strtolower($document);
                preg_match_all('/\w+/', $document, $matches);
                return $matches[0];
        }
}
?>

The classify function starts by calculating the prior probability (the chance of a document belonging to one class or the other before any of its tokens are looked at) based on the number of positive and negative examples - in this example that'll always be 0.5, as we have the same amount of data for each class. We then tokenise the incoming document and, for each class, multiply together the likelihood of each word being seen in that class. Finally, we sort the result and return the highest scoring class.
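One practical caveat with a product of probabilities: on long documents, multiplying many values below 1 can underflow to zero, at which point every class scores 0 and the ranking becomes meaningless. A common fix, sketched here as a standalone helper rather than a change to the class above, is to sum log probabilities instead - the ranking is unchanged because log is monotonic:

```php
<?php
// Log-space score for one class. $tokenCounts maps token => count for
// that class; the other arguments mirror the fields of the Opinion class.
function logScore($prior, $tokenCounts, $classTokCount, $totalTokCount, $tokens) {
        $score = log($prior);
        foreach($tokens as $token) {
                $count = isset($tokenCounts[$token]) ? $tokenCounts[$token] : 0;
                $score += log(($count + 1) / ($classTokCount + $totalTokCount));
        }
        return $score; // highest (least negative) score wins, as before
}
?>
```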

The other important method here is addToIndex. All this does is loop over the data, tokenising the documents and storing counts of the terms for later use.

We can generate a slightly rough test set by not quite using all of the data for training, and testing with the examples we held back.

<?php
$op = new Opinion();
$op->addToIndex('opinion/rt-polaritydata/rt-polarity.neg', 'neg', 5000);
$op->addToIndex('opinion/rt-polaritydata/rt-polarity.pos', 'pos', 5000);
$i = 0; $t = 0; $f = 0;
$fh = fopen('opinion/rt-polaritydata/rt-polarity.neg', 'r');
while($line = fgets($fh)) {
        if($i++ > 5001) {
                if($op->classify($line) == 'neg') {
                        $t++;
                } else {
                        $f++;
                }
        }
}
fclose($fh);
echo "Accuracy: " . ($t / ($t+$f));
?>

This gives an accuracy of around 0.8, which isn't bad really! To demonstrate it, we can chuck a couple of example sentences in:

<?php
$op = new Opinion();
$op->addToIndex('opinion/rt-polaritydata/rt-polarity.neg', 'neg');
$op->addToIndex('opinion/rt-polaritydata/rt-polarity.pos', 'pos');
$string = "Avatar had a surprisingly decent plot, and genuinely incredible special effects";
echo "Classifying '$string' - " . $op->classify($string) . "\n";
$string = "Twilight was an atrocious movie, filled with stumbling, awful dialogue, and ridiculous story telling.";
echo "Classifying '$string' - " . $op->classify($string) . "\n";
?>

Which returns as expected:

Classifying 'Avatar had a surprisingly decent plot, and genuinely incredible special effects' - pos
Classifying 'Twilight was an atrocious movie, filled with stumbling, awful dialogue, and ridiculous story telling.' - neg

We can even use it on a longer review, as long as we split it into sentences first. I grabbed the review of Avatar from The Scientific Indian.

<?php
// … snip … article contents as $op setup
$sentences = explode(".", $doc);
$score = array('pos' => 0, 'neg' => 0);
foreach($sentences as $sentence) {
        if(strlen(trim($sentence))) {
                $class = $op->classify($sentence);
                echo "Classifying: \"" . trim($sentence) . "\" as " . $class . "\n";
                $score[$class]++;
        }
}
var_dump($score);
?>
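Splitting on "." is about as naive as sentence detection gets: it breaks on abbreviations like "Mr." and decimals like "3.5", and ignores "!" and "?" entirely. A slightly more careful sketch - still far from a real sentence boundary detector - splits on sentence-final punctuation followed by whitespace:

```php
<?php
// Split on ., ! or ? when followed by whitespace; the lookbehind keeps
// the punctuation attached to its sentence.
$doc = "I enjoyed the movie. The ending is poignant! A must see?";
$sentences = preg_split('/(?<=[.!?])\s+/', $doc);
// $sentences now holds three strings:
// "I enjoyed the movie.", "The ending is poignant!", "A must see?"
?>
```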

Just to give a snippet of the output, we get:

Classifying: "Fortunately, the movie's moral premise plays second fiddle to the technical feats" as neg
Classifying: "I enjoyed the movie" as pos
Classifying: "The ending is especially poignant" as pos
Classifying: "The visual effects are spectacular and a lot of the production techniques are a first in the craft of movie making" as pos
Classifying: "For that alone, the movie is a must see" as pos

array(2) {
  ["pos"]=>
  int(25)
  ["neg"]=>
  int(11)
}

So, broadly positive, which is the right direction!

More Opinions

There's a lot we haven't addressed in our classifier. We could pass the sentences through a couple of other classifiers first, using Bayesian techniques again, in order to determine some more useful facts. For example, is this even a review? If we just start processing blog posts, for example, we'll find a lot that mention a movie without actually saying whether it's good or bad, and we may as well discard those.

Then, for each sentence, which part of the movie is it talking about? We might be able to correctly interpret a review which slams the actor, slates the script, but was impressed with the special effects. At each stage, the process would be the same as this time - find or create training data, train a classifier, and let it go to work.

We could also look at more complicated language models and named entity extractors, that allow us to map the odd phrases that sometimes occur, and associate opinions with the appropriate parts of a sentence. This can be a lot more work, but can also lead to higher accuracy and reliability.

Photo Credit: Grégory Tonon
