Monday, June 29, 2015

Sentiment analysis: aggregate scoring

There has already been a lot of advice on how to do sentiment analysis, but not much on how to aggregate the polarity scores for multiple pieces of text on which sentiment analysis has been done. Below is my research on how to address this problem.

There are multiple ways to aggregate scores. The most commonly used methods are:
1) Sum
2) Arithmetic mean
3) Median
4) Mode

While the above methods are the most easily understood ones, none of them fits this problem exactly. Let's look at the problems with each of them.

1) Sum - Sum is the simplest of the lot and it can be used if you want to aggregate the sentiment score on any corpus. In fact, sum is the right metric for scoring a single piece of text with multiple sentences of diverse polarity. However, if the results for multiple pieces of text are compared on a sum scale, the differences between them can get lost. This happens mostly when the text is a mixture of highly subjective and objective sentences. A scale-based comparison may not yield the right results, because some scores may lie outside the normal range and the sums will be spread over a very wide range.

2) Arithmetic Mean (Average) - The arithmetic mean is mostly the hammer for all nails. It's a great metric if the data points lie within a reasonable radius of a centroid. As soon as values move farther away from the centroid, the mean starts tilting towards the farthest data point.

3) Median - The median of a set is the middle value when its members are sorted into numeric order. The median has the benefit of being relatively unaffected by outliers, but the flip side is that it is completely insensitive to changes of value that do not affect the ordering, except when it is the middle values that change.

4) Mode - It is the number that appears most often in a set of numbers. It's evident from the definition that it represents the highest-frequency data point. That makes it a good metric for showing majority sentiment, but still not good enough on a comparative scale.
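To make the comparison concrete, here is a minimal sketch using Python's standard `statistics` module. The scores are made up for illustration (four mildly positive texts and one strong outlier), and `summarize` is just a hypothetical helper name.

```python
import statistics

def summarize(scores):
    """Return the four basic aggregates for a list of polarity scores."""
    return {
        "sum": sum(scores),
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        "mode": statistics.mode(scores),
    }

# Hypothetical scores: four mildly positive texts and one strong outlier.
scores = [0.1, 0.2, 0.2, 0.3, 0.95]
summary = summarize(scores)

for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

With this data the single outlier pulls the mean (about 0.35) well above the median and the mode (both 0.2), while the sum simply grows with the number of texts.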

So the closest fit so far is the arithmetic mean, and if the outlier issue could be addressed it would be the right metric to use. Over the years many mathematicians have tried to address this issue, which is why there are several alternative means with different formulas. The best known of them are:
a) Geometric mean
b) Harmonic mean
c) Trimmed mean (a version of this is used for scoring judged Olympic events such as diving)

A very good comparison of Arithmetic mean, harmonic mean (and epsilon adjusted), and geometric mean (and epsilon adjusted) is done by Ravana & Moffat.

If you have read this paper, you would agree that the geometric mean makes a far stronger case for sentiment score aggregation. Now the important part: how to calculate the geometric mean for an entire corpus when you have a tri-polarity analyzer, i.e. positive, negative and neutral. For neutral cases the polarity score is always zero, and that can make the total GM (geometric mean) zero, as the geometric mean formula is:

GM = (x1 * x2 * ... * xn)^(1/n)

This is still not as straightforward as it may seem. Polarity scores are 0 for neutral text and negative for negative text. A single zero makes the whole product, and therefore the geometric mean, zero. And if the number of negative scores is odd, the product is negative, so its nth root is not a real number whenever n is even.
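Both problems are easy to reproduce with a naive implementation. The sketch below is only an illustration: `naive_geometric_mean` is a hypothetical helper and the scores are made up.

```python
import math

def naive_geometric_mean(scores):
    """n-th root of the product of the scores, with no special handling."""
    if not scores:
        raise ValueError("need at least one score")
    n = len(scores)
    product = math.prod(scores)
    if product < 0 and n % 2 == 0:
        # An even root of a negative number is not a real number.
        raise ValueError("even root of a negative product is not real")
    # For an odd n a negative product still has a real root; keep its sign.
    return math.copysign(abs(product) ** (1.0 / n), product)

# A single neutral (zero) score wipes out everything else.
print(naive_geometric_mean([0.5, 0.8, 0.0]))

# One negative score among two texts: the product is negative and n is even.
try:
    naive_geometric_mean([0.5, -0.5])
except ValueError as error:
    print("failed:", error)
```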

A solution to this problem is explained very well by Elsayed A. E. Habib. Here it is.

Aggregate Geometric Mean = (N1*G1 - N2*G2)/N
N = N1+N2+N3

N1 = Number of positive scores
N2 = Number of negative scores
N3 = Number of neutral scores
G1 = Geometric mean of the positive scores
G2 = Geometric mean of the absolute values of the negative scores

The geometric mean of the neutral scores is always zero, so it is not a term in the formula. The neutral texts still count towards N.

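Geometric mean for negative numbers:
Case 1: The number of negative items is even. This is the easy case: the product of all the numbers is positive, so its nth root is real and equals the geometric mean of the absolute values.

Case 2: The number of negative items is odd. The product is negative, so calculate the geometric mean of the negative numbers from their absolute values. The minus sign in front of N2*G2 in the formula restores the direction.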
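The formula above can be sketched in a few lines of Python. This is my reading of the formula, not code from Habib's paper. The function names are hypothetical, and the sketch assumes G2 is always computed on absolute values, which gives the same result in both cases.

```python
import math

def geometric_mean(values):
    """Geometric mean of strictly positive values, computed through logs."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def aggregate_geometric_mean(scores):
    """(N1*G1 - N2*G2) / N for positive, negative and neutral scores."""
    if not scores:
        raise ValueError("need at least one score")
    positives = [s for s in scores if s > 0]
    negatives = [-s for s in scores if s < 0]  # absolute values
    n = len(scores)                            # N = N1 + N2 + N3
    g1 = geometric_mean(positives) if positives else 0.0
    g2 = geometric_mean(negatives) if negatives else 0.0
    return (len(positives) * g1 - len(negatives) * g2) / n

# Hypothetical corpus: two positive texts, one negative, one neutral.
scores = [0.5, 0.5, -0.25, 0.0]
print(aggregate_geometric_mean(scores))
```

For this corpus N1 = 2, G1 = 0.5, N2 = 1, G2 = 0.25 and N = 4, so the result is (2*0.5 - 1*0.25)/4 = 0.1875, up to floating-point rounding. The neutral text lowers the aggregate by increasing N instead of forcing it to zero.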

Friday, June 12, 2015

Gartner BI, Analytics & Information Mgmt. Summit–Mumbai 2015

 

This was my first conference with Gartner. It was arranged very well, with a succinct agenda. It’s a good place to be if you are looking for cues on the next wave of technology. I attended with mixed expectations of what I was going to get out of it; however, it’s not the place to be if you are looking for exact technical information on the application of BI and analytics.

As it was impossible to attend all the sessions, I opted for a few that interested me. Here is the list:

9th June

1. Master Data Management (MDM) for Beginners

Guido De Simoni

It was mostly common sense. However, I liked the representation below of the different data types in an organization.

image

2. Gartner Opening Keynote: Crossing the analytical divide: New Skills, New Technologies

Ted Friedman, Kurt Schlegel, Rita Sallam

They provided very high-level information, centered mostly around management, governance, architecture and flow of information. Most of the attendees were middle-level and executive managers. The analysts at Gartner are great speakers with a liking for dramatics. I got a taste of it in the opening keynote, where the main theme was built around three subjects:

image

After the presentation, which they did rather well, the bottom line emphasized was: use a bimodal mechanism. There is no final, agreed cookbook on how to run a BI and analytics org. It requires a little of everything, e.g. while IT could be a centralized org with all businesses coming to it for solutions, there is a need for some decentralized pockets within the company to focus on specific business use cases. There are no certain solutions; the solution/tools space is very vibrant, with multiple offerings from numerous vendors, each with its own strengths and weaknesses. So you might have identified a certain solution for now, but it’s not going to be long before a disruptive technology barges into the industry and makes you look old-fashioned. While organizations are betting big on sharing their data publicly and hosting data in external cloud services, there is skepticism and risk lurking around, so protect the data that is the lifeblood of the organization. These messages were delivered through three short movies with the Gartner team as the cast. It was fun to watch techies taking on Hollywood.

The picture below explains the bright and dark sides of these dilemmas.

image

3. Industry Panel: An Expert view on Business Intelligence, Analytics and Information Management

Ian A Bertram, Neil Chandler, Donald Farmer, Deepak Ghodke

4. Building a Successful Business Analytics Strategy

Douglas Laney

image

image

5. Applying the Big Data Ecosystem

Arun Chandrasekaran

I couldn’t attend the full session. I went in for the last 20 minutes, but what I heard was very informative. I also got a chance to speak to Arun and discuss some of my current ideas and issues. Although he didn’t point to any specific solution, he helped me gain confidence in what I was doing.

6. Workshop: Foundation skills of a BI and Analytics Change Agent Leader

Ian A Bertram

Spoiler alert: if you are going to a Gartner event, don’t expect technical knowledge transfer. The topics discussed in this workshop were nowhere near BI or analytics. The ideas were very generic and common to any kind of change. It was about how to handle change, convince people to change and manage situations.

7. Interactive visualization: Insights for everyone

Rita L Sallam

A very informative session. Rita is a very seasoned speaker. Here are my key takeaways from this talk.

image

image

image

8. The High functioning Analytics center of Excellence: As analytics now pervades every corner of your organization, coordinating, collaborating and governing business intelligence and data science functions has become critical. Many such CoEs are well-intentioned but lack key ingredients.

Douglas Laney

A waste of time. It is a very important subject; however, the short time and the diversity of views resulted in a very immature discussion.

10th June

9. Gartner Magic Quadrant Power session: Data Integration, Business Intelligence, Advanced Analytics, DBMS for Data Warehouse

Ian A Bertram, Ted Friedman, Alexander Linden, Rita L Sallam, Roxane Edjlali

image

image

image

image

10. How to Monetize Your Information Assets

Douglas Laney

Not exactly rocket science, but a good session for beginners. Laney made a convincing case that “Infonomics” is gaining ground and that information may become a balance sheet asset in future.

image

11. Workshop: How to Develop and Evolve an Enterprise Information Management Strategy

Guido De Simoni

I could have skipped it. But it was a good session for those who don’t have a defined doctrine on how they want to manage data in the organization.

image

12. The Enterprise Information Management Framework: Building Blocks for IM Success

Michael Patrick Moran

This session was an extension of the previous session. It was a content rich session with clear guidelines on how an EIM framework should be created and nurtured.

image

image

image

image

image

image

image

13. Innovating with Analytics: 40 Real World Examples in 40 Minutes

Douglas Laney

A very fun session. While we have all heard many different use cases, Laney was still able to inspire some awe. Here’s my favorite one.

image

14. Big Data Discovery – The Next Generation of BI Self-Service

Rita L. Sallam

My favorite session. This is exactly what I was looking for over the whole two days, and it came last. Nevertheless, it was full of information. The most interesting part is that she talked about exactly the same things that I’ve been thinking about for quite a few days. It’s the tipping point. It won’t be long before people get over big data fever and start taking note of the remainder of the data.

image

image

image

Besides these, there were two more sessions. One was by Ketan Bhagat, the younger brother of popular author Chetan Bhagat and an author himself. He spoke about “Operant Conditioning”; it was more of a book launch event. There was also a closing keynote, but nothing worthwhile in it.

I also noted some keywords that are going to be buzzwords in the analytics industry in the coming years. They are:

Citizen Developer, Analytics Continuum, Data Lake, Lambda Architecture, Citizen Data Scientist

Overall it was time well spent. I made great connections.