Monday, June 29, 2015

Sentiment analysis : aggregate scoring

There has already been a lot of advice on how to do sentiment analysis, but not much on how to aggregate the polarity scores for multiple pieces of text on which sentiment analysis has been done. Below is my research on how to address this problem.

There are multiple ways to aggregate scores. The most commonly used methods are:
1) Sum
2) Arithmetic mean
3) Median
4) Mode

While the methods above are the most easily understood, they do not quite fit the problem. Let's look at the issues with each.

1) Sum - Sum is the simplest of the lot, and it can be used to aggregate sentiment scores over any corpus. In fact, sum is the right metric for scoring a single piece of text made up of multiple sentences of diverse polarity. However, if the results for multiple pieces of text are compared on a sum scale, the differences between them may get lost. This happens mostly when the text is a mixture of highly subjective and objective sentences. A scale-based comparison may not yield the right results, because some scores may lie outside the normal range and the totals can spread over a very wide range.

2) Arithmetic Mean (Average) - The arithmetic mean is the hammer for most nails. It's a great metric if the data points lie within a reasonable radius of a centroid. As soon as values move farther away from the centroid, the mean starts tilting towards the farthest data point.

3) Median - The median of a set is the middle value when its members are sorted into numeric order. The median has the benefit of being relatively unaffected by outliers, but the flip side is that it is completely insensitive to changes in value that do not affect the ordering, unless it is the middle values that change.

4) Mode - The mode is the number that appears most often in a set of numbers. It's evident from the definition that it represents the highest-frequency data point. That makes it a good metric for showing majority sentiment, but still not good enough on a comparative scale.
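
To make the differences concrete, here is a minimal Python sketch using the standard library's statistics module. The scores are made-up illustrative values (not real analyzer output), assumed to lie between -1 and 1, with one strongly negative outlier.

```python
import statistics

# Hypothetical polarity scores for five texts; -0.9 is a deliberate outlier.
scores = [0.2, 0.3, 0.3, 0.4, -0.9]

total = sum(scores)                  # depends on how many texts there are
mean = statistics.mean(scores)       # pulled toward the outlier
median = statistics.median(scores)   # middle value; ignores how extreme -0.9 is
mode = statistics.mode(scores)       # most frequent value only

print(total, mean, median, mode)
```

With these values the median and the mode both land on 0.3, while the mean is dragged well below every non-outlier score.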

So the nearest fit so far is the arithmetic mean, and if the outlier issue can be addressed it would be the perfect metric to use. Over the years many mathematicians have tried to address this issue, so there are various alternative means with different formulas. The best known are:
a) Geometric mean
b) Harmonic mean
c) Trimmed mean (dropping the highest and lowest scores before averaging, as is done in judged Olympic events such as diving)
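
For strictly positive values, the geometric and harmonic means can be compared against the arithmetic mean with Python's statistics module (geometric_mean needs Python 3.8 or later). This is only a sketch on illustrative numbers, where one score is much larger than the rest.

```python
import statistics

# Hypothetical positive-only scores; 0.9 is much larger than the others.
scores = [0.1, 0.1, 0.2, 0.9]

am = statistics.mean(scores)            # sum of scores / n
gm = statistics.geometric_mean(scores)  # n-th root of the product
hm = statistics.harmonic_mean(scores)   # n / sum of reciprocals

print(am, gm, hm)
```

For positive values that are not all equal, the harmonic mean is smaller than the geometric mean, which is smaller than the arithmetic mean, so both alternatives are less influenced by the large value.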

A very good comparison of the arithmetic mean, harmonic mean (and epsilon-adjusted), and geometric mean (and epsilon-adjusted) is given by Ravana & Moffat.

If you have read this paper, you would agree that the geometric mean makes a far stronger case for sentiment score aggregation. Now the important part: how to calculate the geometric mean for an entire corpus when you have a tri-polarity analyzer, i.e. positive, negative and neutral. For neutral texts the polarity score is always zero, and a single zero makes the total GM (geometric mean) zero, since the geometric mean formula is:

GM = (x1 * x2 * ... * xn)^(1/n)

This is still not as straightforward as it may seem, given that polarity scores are 0 for neutral and negative for negative polarity. If the number of negative scores is odd, the product is negative, and its nth root can be a complex number rather than a real one (it always is when n is even).
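
Both failure modes are easy to reproduce. The sketch below applies the textbook definition (n-th root of the product) directly; naive_geometric_mean is a hypothetical helper written for illustration, and the scores are made up.

```python
import math

def naive_geometric_mean(scores):
    """Textbook definition: n-th root of the product of all scores."""
    product = math.prod(scores)  # math.prod needs Python 3.8+
    return product ** (1 / len(scores))

# One neutral (0.0) text wipes out the whole result.
with_neutral = naive_geometric_mean([0.5, 0.8, 0.0, 0.6])

# One negative score makes the product negative; Python then returns
# a complex number for the fractional power.
with_negative = naive_geometric_mean([0.5, -0.4, 0.8, 0.6])

print(with_neutral)                         # 0.0
print(isinstance(with_negative, complex))   # True
```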

A solution to this problem is very well explained by Elsayed A. E. Habib. Here it is.

Aggregate Geometric Mean = (N1*G1 - N2*G2)/N
N = N1+N2+N3

N1 = Number of positive scores
N2 = Number of negative scores
N3 = Number of neutral scores
G1 = Geometric mean of the positive scores
G2 = Geometric mean of the absolute values of the negative scores

The geometric mean of the neutral scores is always zero, so its term drops out of the numerator. The neutral texts still count towards N.

Geometric mean for negative numbers:
Case 1: The number of negative items is even. This is the easy case, as all the numbers multiplied together give a positive product, so the nth root is real.

Case 2: The number of negative items is odd. Calculate the geometric mean of the negative numbers from their absolute values.
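
Putting the formula and both cases together, here is a minimal Python sketch. The function name aggregate_geometric_mean is hypothetical, and it takes absolute values of the negative scores in both cases, which gives the same G2 as Case 1 when the count is even. Treating the geometric mean of an empty group as 0 is an assumption made here so that a missing group simply contributes nothing.

```python
import math

def aggregate_geometric_mean(scores):
    """Aggregate GM = (N1*G1 - N2*G2) / N, with G2 taken over absolute values."""
    positives = [s for s in scores if s > 0]
    negatives = [abs(s) for s in scores if s < 0]
    n = len(scores)  # N = N1 + N2 + N3, neutral (zero) scores included
    if n == 0:
        raise ValueError("scores must not be empty")

    def gm(values):
        # Geometric mean computed through logarithms; an empty group is
        # treated as 0.0 (assumption) so that its term drops out.
        if not values:
            return 0.0
        return math.exp(sum(math.log(v) for v in values) / len(values))

    return (len(positives) * gm(positives) - len(negatives) * gm(negatives)) / n

# Hypothetical corpus: two positive texts, one negative, one neutral.
# N1 = 2, G1 = 0.5, N2 = 1, G2 = 0.25, N = 4, so (2*0.5 - 1*0.25) / 4.
print(aggregate_geometric_mean([0.5, 0.5, -0.25, 0.0]))  # about 0.1875
```

The result stays a real number whatever the mix of polarities, and an all-neutral corpus scores exactly zero.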
