Sunday, August 16, 2015

Flat file to hierarchical tree - Python Way

import json


class hierarchy_tree(object):
    """
    Converts a flat file into a JSON hierarchical tree.
    Input: a pandas data frame with the flat file loaded.
    """

    def __init__(self, data):
        self.__data = data
        self.__result = {'name': None, 'children': [], 'result': None}

    def create(self, start=0, finish=None, callback=None):
        """
        Call this public method to convert the flat data frame into a
        JSON tree. Columns start..finish of each row form one
        root-to-leaf path (a "lineage") in the tree.
        """
        data = self.__data
        # keep the callback only if it is actually callable
        self.__callback = callback if callable(callback) else None

        # default to using all columns
        if finish is None:
            finish = len(data.columns)

        # iterate over the rows; iterrows() yields (index, Series) tuples
        for _, row in data.iterrows():
            lineage = list(row[start:finish])
            self.__build_path(lineage)
        return json.dumps(self.__result)

    def __build_path(self, lineage):
        """
        Builds the nested dictionary that is later dumped as JSON.
        """
        parent = self.__result
        for depth, item in enumerate(lineage):
            # check whether the current item already exists as a child node
            index = -1
            for child in range(len(parent['children'])):
                if parent['children'][child]['name'] == item:
                    index = child
                    break
            # if no existing node was found, append a new one
            if index == -1:
                parent['children'].append({'name': item, 'children': [], 'result': None})
                index = len(parent['children']) - 1
            # invoke the callback with the lineage and the item's depth
            # (enumerate avoids picking the wrong index when a value repeats)
            if self.__callback is not None:
                parent['children'][index]['result'] = self.__callback(lineage, depth)
            # descend one level
            parent = parent['children'][index]


Example Usage:

def callbackfunc(lineage, index):
    return lineage[index] + str(index)

data = pandas.read_json(<json file path>)
tree = hierarchy_tree(data)
print(tree.create(start=0, finish=3, callback=callbackfunc))
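
For a quick end-to-end check, here is a minimal self-contained sketch; the column names and sample values are made up purely for illustration:

import pandas

sample = pandas.DataFrame({
    'country': ['India', 'India', 'USA'],
    'state': ['MH', 'KA', 'CA'],
    'city': ['Mumbai', 'Bangalore', 'San Francisco'],
})

tree = hierarchy_tree(sample[['country', 'state', 'city']])
print(tree.create())

Each row contributes one country -> state -> city path, and rows that share a prefix (the two India rows here) are merged under the same branch of the tree.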

Monday, June 29, 2015

Sentiment analysis : aggregate scoring

There has already been a lot of advice on how to do sentiment analysis, but not much on how to aggregate the polarity scores for multiple pieces of text on which sentiment analysis has been done. Below is my research on how to address this problem.

There are multiple ways to aggregate scores. The most commonly used methods are:
1) Sum
2) Arithmetic mean
3) Median
4) Mode

While the above methods are the easiest to understand, none of them exactly fits this problem. Let's look at the issues with each.

1) Sum - The simplest of the lot. Sum can be used to aggregate the sentiment score over any corpus, and it is in fact the right metric for scoring a single piece of text with multiple sentences of diverse polarity. However, if the results of multiple pieces of text are evaluated and compared on a sum scale, the differences can get lost. This happens mostly when the text is a mixture of highly subjective and objective sentences: some score points may lie outside the normal range, and the distribution range becomes very large, so a scale-based comparison may not yield the right results.

2) Arithmetic Mean (Average) - The hammer for all nails. It is a great metric if the data points lie around a centroid within a reasonable radius, but as soon as values move farther from the centroid, the mean starts tilting towards the farthest data point (see the short demo after this list).

3) Median - The middle value of the set when the values are sorted into numeric order. The median has the benefit of being relatively unaffected by outliers, but the flip side is that it is completely insensitive to changes of value that do not affect the ordering, except when it is the middle values that change.

4) Mode - The value that appears most often in the set. By definition it represents the highest-frequency data point: a good metric to show the majority sentiment, but still not good enough on a comparative scale.
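
To make the outlier problem concrete, here is a quick Python sketch; the sample polarity scores are made up for illustration:

from statistics import mean, median, mode

# hypothetical polarity scores: mostly mildly positive, one extreme outlier
scores = [0.2, 0.3, 0.25, 0.2, 0.2, 9.0]

print(sum(scores))     # 10.15 -> dominated by the outlier
print(mean(scores))    # ~1.69 -> pulled far above the typical 0.2-0.3 range
print(median(scores))  # 0.225 -> barely notices the outlier
print(mode(scores))    # 0.2   -> the most frequent value only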

So the closest fit so far is the arithmetic mean: if the outlier issue can be addressed, it becomes the right metric to use. Over the years many mathematicians have tried to address this issue, so there are several variants of the mean with different formulas. The most famous are:
a) Geometric mean
b) Harmonic mean
c) Trimean (this is what is used for Olympic swimming scoring)

A very good comparison of the arithmetic mean, the harmonic mean (plain and epsilon-adjusted), and the geometric mean (plain and epsilon-adjusted) was done by Ravana & Moffat.

If you read the paper, you will likely agree that the geometric mean makes a far stronger case for sentiment score aggregation. Now the important part: how do you calculate the geometric mean for an entire corpus when you have a tri-polarity analyzer, i.e. positive, negative and neutral? For neutral texts the polarity score is always zero, and a single zero makes the total GM (geometric mean) zero, because the geometric mean formula is:

GM = (x1 * x2 * ... * xn)^(1/n)

This is still not as straightforward as it may seem, given that polarity scores are 0 for neutral texts and negative for negative ones. If the count of negative polarity scores is odd, the product under the nth root is negative and the result would be an imaginary number.
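
You can see the problem directly in Python:

# the product of an odd number of negative scores is negative,
# and a fractional power of a negative number is complex in Python 3
print((-0.4 * -0.5 * -0.6) ** (1 / 3))  # (0.247+0.427j): not a usable score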

The solution to this problem is explained very well by Elsayed A. E. Habib. Here it is:

Aggregate Geometric Mean = (N1*G1 - N2*G2) / N

where
N = N1 + N2 + N3
N1 = count of positive scores
N2 = count of negative scores
N3 = count of neutral scores
G1 = geometric mean of the positive scores
G2 = geometric mean of the absolute values of the negative scores

The geometric mean of the neutral texts is always zero, so it does not appear in the formula.

Geometric mean for negative numbers:

Case 1: the count of negative items is even - easy, since the product of an even number of negatives is positive and the nth root is real (it equals the geometric mean of the absolute values).

Case 2: the count of negative items is odd - calculate the geometric mean of the absolute values of the negative scores; the minus sign is supplied by the subtraction in the formula above.
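
As a sanity check, here is a minimal Python sketch of the aggregation described above (a straightforward reading of the formula, not Habib's exact implementation):

import math

def geometric_mean(values):
    # log-domain geometric mean; returns 0.0 for an empty list
    if not values:
        return 0.0
    return math.exp(sum(math.log(v) for v in values) / len(values))

def aggregate_sentiment(scores):
    positives = [s for s in scores if s > 0]
    negatives = [abs(s) for s in scores if s < 0]  # absolute values, per Case 2
    n = len(scores)                                # N = N1 + N2 + N3
    g1 = geometric_mean(positives)                 # G1
    g2 = geometric_mean(negatives)                 # G2
    return (len(positives) * g1 - len(negatives) * g2) / n

# hypothetical mix of positive, negative and neutral polarity scores
print(aggregate_sentiment([0.8, 0.5, -0.4, -0.9, 0.0, 0.0]))  # ~0.011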

Friday, June 12, 2015

Gartner BI, Analytics & Information Mgmt. Summit–Mumbai 2015

This was my first conference with Gartner. It was arranged very well, with a succinct agenda. It's a good place to be if you are looking for cues on the next wave of technology. I attended with a mixed sense of what I was going to get out of it; however, it's not the place to be if you are looking for exact technical information on applying BI and analytics.

As it was impossible to attend all the sessions, I opted for a few that interested me. Here is the list:

9th June

1. Master Data Management (MDM) for Beginners

Guido De Simoni

It was mostly common sense; however, I liked their representation of the different data types in an organization.

2. Gartner Opening Keynote: Crossing the analytical divide: New Skills, New Technologies

Ted Friedman, Kurt Schlegel, Rita Sallam

The analysts provide very high-level information, centered mostly on management, governance, architecture and the flow of information; most of the attendees are middle-level and executive managers. Gartner speakers are great presenters with a liking for dramatics. I got a taste of this in the opening keynote, where the main theme revolved around three subjects, described below.

After the presentations, which they do rather well, the bottom line emphasized was: use a bi-modal mechanism. There is no final, settled cookbook on how to run a BI and analytics organization; it requires a little of everything. For example, while IT could be a centralized organization that all businesses come to for solutions, there is also a need for some decentralized pockets within the company that focus on specific business use cases. There are no certain solutions either: the solution/tool space is very vibrant, with multiple offerings from numerous vendors, each with its own strengths and weaknesses. You might have identified a certain solution for now, but it won't be long before a disruptive technology barges into the industry and makes you look old-fashioned. And while organizations are betting big on sharing their data publicly and hosting data in external cloud services, there is skepticism and risk lurking around, so protect the data that is the lifeblood of the organization. These messages were delivered through three short movies cast with the Gartner team; fun to watch techies taking on Hollywood.

One slide summarized the bright and dark sides of these dilemmas.

3. Industry Panel: An Expert view on Business Intelligence, Analytics and Information Management

Ian A Bertram, Neil Chandler, Donald Farmer, Deepak Ghodke

4. Building a Successful Business Analytics Strategy

Douglas Laney

5. Applying the Big Data Ecosystem

Arun Chandrasekaran

I couldn’t attend the full session; I went in for the last 20 minutes, but what I heard was very informative. I also got a chance to speak to Arun and discuss some of my current ideas and issues. Although he didn’t point me to any specific solution, he helped me gain confidence in what I was doing.

6. Workshop: Foundation skills of a BI and Analytics Change Agent Leader

Ian A Bertram

Spoiler alert: if you are going to a Gartner event, don’t expect technical knowledge transfer. The topics discussed in this workshop were nowhere related to BI or analytics. The ideas were very generic and apply to any kind of change: how to handle change, convince people to change, and manage situations.

7. Interactive visualization: Insights for everyone

Rita L Sallam

A very informative session. Rita is a very seasoned speaker.

8. The High-Functioning Analytics Center of Excellence: As analytics now pervades every corner of your organization, coordinating, collaborating and governing business intelligence and data science functions has become critical. Many such CoEs are well-intentioned but lack key ingredients.

Douglas Laney

A waste of time. It is a very important subject, but the short duration and the diversity of views resulted in a very immature discussion.

10th June

9. Gartner Magic Quadrant Power session: Data Integration, Business Intelligence, Advanced Analytics, DBMS for Data Warehouse

Ian A Bertram, Ted Friedman, Alexander Linden, Rita L Sallam, Roxane Edjlali

10. How to Monetize Your Information Assets

Douglas Laney

Not exactly rocket science, but a good session for beginners. Laney made a convincing case for how “Infonomics” is gaining ground and how information may become a balance-sheet asset in the future.

11. Workshop: How to Develop and Evolve an Enterprise Information Management Strategy

Guido De Simoni

I could have skipped it, but it was a good session for those who don’t yet have a defined doctrine on how they want to manage data in their organization.

12. The Enterprise Information Management Framework: Building Blocks for IM Success

Michael Patrick Moran

This session was an extension of the previous session. It was a content rich session with clear guidelines on how an EIM framework should be created and nurtured.

13. Innovating with Analytics: 40 Real World Examples in 40 Minutes

Douglas Laney

A very fun session. We have all heard many different use cases, but Laney still managed to inspire some awe.

14. Big Data Discovery – The Next Generation of BI Self-Service

Rita L. Sallam

My favorite session. This is exactly what I was looking for across the whole two days, and it came last. Nevertheless, it was full of information. The most interesting part: she talked about exactly the things I have been thinking about for quite a few days. It’s the tipping point; it won’t be long before people get over the big data fever and start taking note of the rest of their data.

Besides these, there were two more sessions. One was by Ketan Bhagat, the younger brother of popular author Chetan Bhagat and an author himself; he spoke about “Operant Conditioning”, though it was more of a book launch event. There was also a closing keynote, but nothing worthwhile in it.

I also noted some keywords that are going to be buzzwords in the analytics industry in the coming years:

Citizen Developer, Analytics Continuum, Data Lake, Lambda Architecture, Citizen Data Scientist

Overall, it was time well spent, and I made great connections.

Monday, February 2, 2015

Lost MS SQL Server admin access? No Problem

Today I accidentally removed my own Windows account from my local MS SQL Server installation, which meant I couldn’t use the database anymore. I didn’t have any other account set up on this instance because it is a local setup. The installation was effectively useless, and on the face of it the only way out was to uninstall the whole DB server and install it again.

However, after some searching I found the following way to recover access:

Step 1: Stop the SQL Server service (from an administrator command prompt, or via services.msc):

net stop mssqlserver

Step 2: From the same administrator command prompt, start SQL Server in single-user mode:

net start mssqlserver /f /T3608

The /f flag starts a minimal single-user configuration, and trace flag 3608 skips recovery of all databases except master.

Step 3: Connect to the instance:

sqlcmd


Step 4: Create a new Windows login:

CREATE LOGIN [<<DOMAIN\USERNAME>>] FROM WINDOWS;
EXEC sys.sp_addsrvrolemember @loginame = N'<<DOMAIN\USERNAME>>', @rolename = N'sysadmin';
GO

OR, create a SQL Server login:

CREATE LOGIN [testAdmin] WITH PASSWORD=N'test@1234', DEFAULT_DATABASE=[master];
EXEC sys.sp_addsrvrolemember @loginame = N'testAdmin', @rolename = N'sysadmin';
GO

Step 5: Exit sqlcmd with CTRL+C.


Step 6: Restart the service normally and log in to SQL Server via SQL Server Management Studio.

The curious case of SharePoint integrated SSRS Subscription

Some time ago I found that whenever I create an SSRS subscription with multiple cascaded filters in SharePoint integrated mode, the time it takes to create the subscription after selecting all the filter values is longer than the time the report itself would take to execute.

To understand why this happens, let’s look at what happens when you create a subscription.

When you click “Add Subscription”, all the report’s datasets are executed. You might ask why; the answer is that the parameter and filter values need to be populated. This seems normal, but the curious case I’m talking about is something else.

The problem is with cascaded filters, i.e. filters whose values depend on the values of other filters.

By default, all the datasets are fired when “Add Subscription” is clicked. Because a child filter’s values are derived from the selected values of its parent in the cascade, the dataset queries are fired multiple times, depending on how many cascaded children you have.

In a scenario where the datasets have complex, long-running queries, the problem becomes very evident in the time it takes to create a subscription.

But why does it happen? I asked Microsoft. The response: because of POSTBACK. In SharePoint integrated mode all calls are routed through SharePoint, and the postback re-fires them. I also asked why it was designed this way; they didn’t have an answer, but they were kind enough to look further and fix the problem.

So here is the problem statement as defined by Microsoft, and its solution:

Problem Description:

You have a report in SharePoint integrated mode which contains parameters. When you add a subscription, the page takes a long time to come up. Also, when you change any of the parameters, the postback takes a really long time to reload the page.

Analysis:

1. We took profiler traces and found that a single query belonging to one dataset was executing four times. This was causing the issue.

2. We engaged our product team to get a better understanding of the behavior and how to mitigate it.

3. The reason behind the multiple executions of the query is the postback within ASP.NET.

4. We reviewed our code and found scope for improvement that would remove at least one level of execution.

5. Our product team has agreed to release the improved code as part of SQL Server 2012 SP2 CU5.

Root cause:

Due to the postback behavior of ASP.NET, the queries were executed multiple times.

Resolution:

A fix will be released as part of SQL Server 2012 SP2 CU5. The tentative release date is March 16, 2015.