Sunday, August 16, 2015

Flat file to hierarchical tree - Python Way

import json

"""
"This class converts a flat file into a json hierarchical tree
"Inputs: a pandas data frame with flat file loaded
""""
class hierarchy_tree(object):
def __init__(self, data, ):
# self.__file = json_file_path
self.__data = data
self.__result = {'name':None, 'children':[], 'result':None}


"""
Call this public function to convert a json flat file to json tree
"""
def create(self, start=0, finish=None, callback=None):
data = self.__data
self.__callback = None
# check if callback is a function
if callable(callback):
self.__callback = callback

# iterate on each item
for row in data.iterrows():
#each row is a tuple
if finish == None:
finish = len(row[1])

row = row[1][start:finish]
lineage = []
for x in range(len(row)):
lineage.append(row[x])
self.__build_path(lineage)
return json.dumps(self.__result)

"""
This function actually creates nested dictionary
that is later dumped as json
"""
def __build_path(self, lineage):
parent = self.__result

for item in lineage:
# check if the current item exists as dictionary name
index = -1
for child in range(len(parent['children'])):
if parent['children'][child]['name'] == item:
# reset index if item found
index = child
break
# if existing item was not found
if index == -1:
# update as last item in dictionary
parent['children'].append({'name':item, 'children':[], 'result':None})
#
# implement callback
#pass arguments - Item text and its index in lineage
#
if callable(self.__callback):
parent['children'][index]['result'] = self.__callback(lineage, lineage.index(item))
# reset parent
parent = parent['children'][index]


Example Usage:

import pandas

def callbackfunc(lst, index):
    return lst[index] + str(index)

data = pandas.read_json(<json file path>)
tree = hierarchy_tree(data)
print(tree.create(start=0, finish=3, callback=callbackfunc))
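
For illustration, here is a minimal sketch of what the output looks like on a tiny hypothetical flat table (the region/country/city columns and values below are made up for this example, not from any real data set):

import json
import pandas

# hypothetical flat data: each row is one root-to-leaf path
df = pandas.DataFrame([
    ['EMEA', 'Germany', 'Berlin'],
    ['EMEA', 'Germany', 'Munich'],
    ['APAC', 'Japan', 'Tokyo'],
], columns=['region', 'country', 'city'])

tree = hierarchy_tree(df)
result = json.loads(tree.create(start=0, finish=3))

# the root gets two children ('EMEA' and 'APAC'); under 'EMEA',
# 'Germany' has two children ('Berlin' and 'Munich')
print(len(result['children']))                                  # 2
print(result['children'][0]['name'])                            # EMEA
print(len(result['children'][0]['children'][0]['children']))    # 2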

Monday, June 29, 2015

Sentiment analysis : aggregate scoring

There has already been a lot of advice on how to do sentiment analysis, but not much on how to aggregate the polarity scores for multiple pieces of text on which sentiment analysis has been done. Below is my research on how to address this problem.

There are multiple ways to aggregate scores. The most commonly used methods are:
1) Sum
2) Arithmetic mean
3) Median
4) Mode

While the methods above are the most easily understood, they do not fit this problem exactly. Let's look at the problems with each of them.

1) Sum - The simplest of the lot; it can be used to aggregate the sentiment score over any corpus. In fact, sum is the right metric for scoring a single piece of text made up of multiple sentences of diverse polarity. However, if the results for multiple pieces of text are evaluated and compared on a sum scale, the differences between them may get lost. This happens mostly when the text is a mixture of highly subjective and objective sentences: some score points may lie outside the normal range, and the sums span a very large distribution range, so a scale-based comparison may not yield the right results.

2) Arithmetic Mean (Average) - The arithmetic mean is usually the hammer for all nails. It's a great metric if the data points lie around a centroid within a reasonable radius. As soon as values move farther away from the centroid, the mean starts tilting towards the farthest data points.

3) Median - The median of a set is the middle value when the values are sorted into numeric order. The median has the benefit of being relatively unaffected by outliers, but the flip side is that it is completely insensitive to changes in values that do not affect the ordering, except when it is the middle values themselves that change.

4) Mode - The value that appears most often in a set of numbers. As is evident from the definition, it represents the highest-frequency data point; a good metric for showing the majority sentiment, but still not good enough on a comparative scale.

So the nearest fit so far is the arithmetic mean, and if the outlier issue can be addressed it would be the perfect metric. Over the years mathematicians have tried to address this issue, so there are several variants of the mean with different formulas. The most famous of them are:
a) Geometric mean
b) Harmonic mean
c) Trimean (a trimmed mean of this kind is what is used for Olympic judged scoring)

A very good comparison of the arithmetic mean, the harmonic mean (plain and epsilon-adjusted), and the geometric mean (plain and epsilon-adjusted) was done by Ravana & Moffat.

If you have read this paper, you will agree that the geometric mean makes a far stronger case for sentiment score aggregation. Now the important part: how do you calculate the geometric mean over an entire corpus when you have a tri-polarity analyzer, i.e. positive, negative and neutral? For neutral cases the polarity score is always zero, and a single zero makes the total GM (geometric mean) zero, because the geometric mean formula is:

GM = (x1 * x2 * ... * xn)^(1/n)

It is still not as straightforward as it may seem, given that polarity scores are 0 for neutral text and negative for negative polarity. If the count of negative polarity scores is odd, the product is negative and its n-th root is an imaginary number.

The solution to this problem is very well explained by Elsayed A. E. Habib. So here's the solution.

Aggregate Geometric Mean = (N1*G1 - N2*G2) / N

N  = N1 + N2 + N3
N1 = count of positive scores
N2 = count of negative scores
N3 = count of neutral scores
G1 = geometric mean of the positive scores
G2 = geometric mean of the absolute values of the negative scores

The geometric mean of the neutral texts is always zero, so it is not factored into the formula.

Geometric mean of the negative numbers:
Case 1: The count of negative items is even – easy, since the product of an even number of negative values is positive, so the n-th root is a real number.

Case 2: The count of negative items is odd – compute the geometric mean of the absolute values of the negative scores; the minus sign in the aggregate formula accounts for the negative polarity.
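
As a minimal sketch (the function name and the example scores at the bottom are made up for illustration), the aggregate geometric mean described above could be computed roughly like this in Python; G2 is computed from absolute values in all cases, which matches the even/odd reasoning above:

import math

def aggregate_geometric_mean(scores):
    # scores: list of polarity values (positive, negative, or zero for neutral)
    pos = [s for s in scores if s > 0]
    neg = [abs(s) for s in scores if s < 0]   # work with absolute values
    n = len(scores)
    if n == 0:
        return 0.0

    def gmean(values):
        # geometric mean via logs to avoid overflow; an empty set contributes 0
        if not values:
            return 0.0
        return math.exp(sum(math.log(v) for v in values) / len(values))

    g1 = gmean(pos)   # geometric mean of positive scores
    g2 = gmean(neg)   # geometric mean of |negative| scores
    return (len(pos) * g1 - len(neg) * g2) / n

print(aggregate_geometric_mean([0.8, 0.5, 0.0, -0.4, -0.9]))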

Friday, June 12, 2015

Gartner BI, Analytics & Information Mgmt. Summit–Mumbai 2015

 

This was my first conference with Gartner. It was arranged very well, with a succinct agenda. It's a good place to be if you are looking for cues on the next wave of technology. I attended with a mixed sense of what I was going to get out of it; however, it is not the place to be if you are looking for exact technical information on applying BI and analytics.

As it was impossible to attend all the sessions, I opted for a few that interested me. Here is the list:

9th June

1. Master Data Management (MDM) for Beginners

Guido De Simoni

It was mostly common sense. However, I liked the representation below of the different data types in an organization.

[image]

2. Gartner Opening Keynote: Crossing the analytical divide: New Skills, New Technologies

Ted Friedman, Kurt Schlegel, Rita Sallam

They provide very high-level information, centered mostly around management, governance, architecture and the flow of information. Most of the attendees are middle-level and executive managers. Gartner's speakers/analysts are great presenters with a liking for dramatics. I got a taste of it in the opening keynote, where the main theme revolved around three subjects:

[image]

After presenting, which they do rather well, the bottom line emphasized was: use a bi-modal mechanism. There is no final, identified cookbook on how to run a BI and analytics organization; it requires a little of everything. For example, while IT can be a centralized organization that all businesses come to for solutions, there is also a need for decentralized pockets within the company to focus on specific business use cases. There are no certain solutions: the solution/tool space is very vibrant, with multiple offerings from numerous vendors, each with its own strengths and weaknesses. You might have identified a certain solution for now, but it will not be long before a disruptive technology barges right into the industry and makes you look old-fashioned. And while organizations are betting big on sharing their data publicly and hosting data in external cloud services, there is skepticism and risk lurking around, so protect the data that is the lifeblood of the organization. These messages were delivered through three short movies with the Gartner team as the cast. Fun to watch techies taking on Hollywood.

The picture below explains the bright and dark sides of these dilemmas.

[image]

3. Industry Panel: An Expert view on Business Intelligence, Analytics and Information Management

Ian A Bertram, Neil Chandler, Donald Farmer, Deepak Ghodke

4. Building a Successful Business Analytics Strategy

Douglas Laney


5. Applying the Big Data Ecosystem

Arun Chandrasekaran

I couldn’t attend the full session; I went in for the last 20 minutes, but what I heard was very informative. I also got a chance to speak to Arun and discuss some of my current ideas and issues. Although he didn’t point to any specific solution, he helped me gain confidence in what I was doing.

6. Workshop: Foundation skills of a BI and Analytics Change Agent Leader

Ian A Bertram

Spoiler alert: if you are going to a Gartner event, don’t expect a technical knowledge transfer. The topics discussed in this workshop were nowhere related to BI or analytics. The ideas were very generic and apply to any kind of change: how to handle change, convince people to change, and manage situations.

7. Interactive visualization: Insights for everyone

Rita L Sallam

A very informative session. Rita is a very seasoned speaker. Here are my key takeaways from this talk.

[images]

8. The High-Functioning Analytics Center of Excellence: As analytics now pervades every corner of your organization, coordinating, collaborating on and governing business intelligence and data science functions has become critical. Many such CoEs are well-intentioned but lack key ingredients.

Douglas Laney

A waste of time. It is a very important subject; however, the short time and the diversity of views resulted in a very immature discussion.

10th June

9. Gartner Magic Quadrant Power session: Data Integration, Business Intelligence, Advanced Analytics, DBMS for Data Warehouse

Ian A Bertram, Ted Friedman, Alexander Linden, Rita L Sallam, Roxane Edjlali


10. How to Monetize Your Information Assets

Douglas Laney

Not exactly rocket science, but a good session for beginners. Laney was able to convince the audience that “Infonomics” is gaining ground and may become a balance-sheet asset in the future.


11. Workshop: How to Develop and Evolve an Enterprise Information Management Strategy

Guido De Simoni

I could rather have skipped it, but it was a good session for those who don’t have a defined doctrine on how they want to manage data in their organization.


12. The Enterprise Information Management Framework: Building Blocks for IM Success

Michael Patrick Moran

This session was an extension of the previous one: content-rich, with clear guidelines on how an EIM framework should be created and nurtured.


13. Innovating with Analytics: 40 Real World Examples in 40 Minutes

Douglas Laney

A very fun session. While we have all heard many different use cases, Laney was still able to inspire some awe. Here’s my favorite one.

[image]

14. Big Data Discovery – The Next Generation of BI Self-Service

Rita L. Sallam

My favorite session. This is exactly what I was looking for over the whole two days, and it came last. Nevertheless, it was full of information. The most interesting part is that she talked about exactly the same things I have been thinking about for quite a few days. It’s the tipping point: there are not many days left before people get over the big data fever and start taking note of the remainder of the data.


Besides these, there were two more sessions. One was by Ketan Bhagat, the younger brother of the popular author Chetan Bhagat and an author himself; he spoke about “operant conditioning”. It was more of a book launch event. There was also a closing keynote, but nothing worthwhile in it.

I also noted some keywords that are going to be buzzwords in the analytics industry in the coming years. They are:

Citizen Developer, Analytics Continuum, Data Lake, Lambda Architecture, Citizen Data Scientist

Overall, it was time well spent, and I made great connections.

Monday, February 2, 2015

Lost MS SQL Server admin access? No Problem

Today I removed my own Windows account from my local MS SQL Server installation by mistake. That meant I couldn’t use the DB anymore, and I didn’t have any other account set up on this instance because it is a local setup. In effect the SQL Server installation was useless, and the only obvious way to use it again was to uninstall the whole DB server and install it afresh.

However, I did some searching and found the following solution to recover access.

Step 1 :

net stop mssqlserver

or stop it from services.msc. Now open a command prompt as an administrator and type:

Step 2 :

net start mssqlserver /f /T3608

This starts MS SQL Server in single-user (minimal configuration) mode.

Step 3 :

sqlcmd 


Step 4 : Create a new Windows login

CREATE LOGIN [<<DOMAIN\USERNAME>>] FROM WINDOWS;
EXEC sys.sp_addsrvrolemember @loginame = N'<<DOMAIN\USERNAME>>', @rolename = N'sysadmin';
GO

OR, create a SQL Server login

CREATE LOGIN [testAdmin] WITH PASSWORD=N'test@1234', DEFAULT_DATABASE=[master];
EXEC sys.sp_addsrvrolemember @loginame = N'testAdmin', @rolename = N'sysadmin';
GO

Step 5 : Exit sqlcmd with CTRL+C.


Step 6 : Restart SQL Server normally (net stop mssqlserver, then net start mssqlserver without the flags) and log in via SQL Server Management Studio with the new login.

The curious case of SharePoint integrated SSRS Subscription

Some time ago I found that whenever I create an SSRS subscription with multiple cascaded filters in SharePoint integrated mode, the time it takes to create the subscription after selecting all the filter values is longer than the total time it would take to simply execute the SSRS report.

In order to understand why this happens, let's look at what happens when you create a subscription.

[image]

When you click on “Add Subscription”, all of the report's datasets are executed. You might ask why? The answer is that the parameter and filter values need to be populated. This seems normal, but the curious case I'm talking about is something else.

The problem is with cascaded filters, i.e. filter values that depend on the values of other filters.

The default behavior is that all the datasets are fired when “Add Subscription” is clicked. Because each child filter in the cascade must derive its values from the values of its parent filter, the dataset queries are fired multiple times, depending on how many cascaded children you have.

In a scenario where the data sets have complex and long queries, the problem can become very evident in terms of the time it takes to create a subscription.

But why does it happen? I asked Microsoft. The response was: because of POSTBACK. Since it is SharePoint integrated mode, all calls are diverted through SharePoint, and the postback re-fires them. Another question I asked Microsoft was why it was designed this way. They didn't have an answer to that, but they were kind enough to look further and fix the problem.

So here is the problem statement as defined by Microsoft, and its resolution:

Problem Description:

You have a report in SharePoint integrated mode which contains parameters. When you add a subscription, the page takes a long time to come up. Also, when you change any of the parameters, the postback takes a really long time to reload the page.

Analysis:

1. We took the profiler traces and found that a single query belonging to one dataset executes 4 times. This has been causing the issue.

2. We engaged our product team to get a better understanding of the behavior and how we can mitigate it.

3. The reason behind the multiple executions of the query is the postback within ASP.NET.

4. We reviewed our code and found scope for improvement that would remove at least one level of execution.

5. Our product team has agreed to release the improved code as part of SQL Server 2012 SP2 CU5.

Root cause:

1. Due to the postback behavior of ASP.NET, the queries were executed multiple times.

Resolution:

A fix will be released as part of SQL Server 2012 SP2 CU5. The tentative release date is March 16, 2015.

Tuesday, December 2, 2014

SSRS tweaks for external facing environment

An external-facing SSRS environment (for consumption by non-corporate users, e.g. customers, vendors etc.) requires some extra tweaking in order to make it functional and compliant with organizational policies.

1. SSRS session expiring even before SharePoint session expires (for SharePoint integrated mode)

SSRS has its own session settings. Make sure you have set them on par with the SharePoint session settings; I've discussed the SharePoint settings in my previous posts.

[image]

If you do not want confusion between the SSRS and SharePoint session settings, uncheck “Use Session Cookies”. If there is a mismatch between the SharePoint and SSRS session timeout settings, you may face problems while creating subscriptions: every time the session expires, you may get errors on the manage subscription page saying the subscription list doesn't exist.

2. Disable “Open with Report Builder”

When SSRS reports are executed, they show an “Open with Report Builder” option under the Actions button. In an external-facing environment you may not want this, because external users are mostly non-Windows users while the Report Builder tool works only with Windows authentication, so it is of no use to them anyway. To remove this option from under the Actions button, run the following SQL query on the main SSRS database.

UPDATE dbo.ConfigurationInfo
SET Value = 'False'
WHERE Name = 'EnableReportDesignClientDownload'

Sunday, November 23, 2014

SharePoint External Architecture & Implementation - Part 6

Previous Blog - Part 5
Note: this is a long post.
This post deals specifically with the Microsoft BI scenario in an externally facing environment. As all roads lead to Rome, all problems in externally facing environments lead us to a single cause – authentication.
When the end users of an externally facing Microsoft BI environment are vendors and customers, you do not want them to see each other's data. However, you also don't want to create a separate set of reports for each of these different users. The best, optimized and accepted solution is to provide row-level security and handle the user's authorization at the data source level. Different databases have different mechanisms to handle row-level security. I'm going to talk about how to enable row-level security for SSAS (SQL Server Analysis Services) in the absence of Active Directory, with SharePoint as the front end.

I’ve always wondered why Microsoft didn't allow any authentication other than Windows authentication in SSAS. It would have given architects some legroom to try it with non-Microsoft products. Nevertheless, they left a window open with the “CustomData” implementation as query banding: optional parameters can be passed in the connection string by binding them to CustomData. Once this CustomData value is retrieved on the SSAS server, it can be used to run logic using DAX (Data Analysis Expressions).

An important thing to note here is that only two services can be used in an external Microsoft BI environment (assuming it is a non-Windows-authentication environment) with row-level security, namely SSRS (SQL Server Reporting Services) and PPS (PerformancePoint Services). Other services such as Excel Services, Power Pivot, Power View etc. don't provide a way to delegate the end user's credentials to the data source.

Now let's look at the implementation of row-level security for a Microsoft BI solution with SSAS as the data source. The problem has to be divided into two parts.
  1. Send the end user's credentials to the data source
  2. Return only those rows where the end user has access

Send end user’s credentials to data source

a) SSRS - Use the following connection string within the SSRS report data source. Make sure it is an embedded connection string, i.e. a connection string passed as a text string within the SSRS data source, because only then can you use the built-in SSRS User!UserID parameter to pass the end user's identity. If you choose to specify the connection string in the SharePoint UI instead, this built-in variable will not be available. This is how the connection string looks:
="Data Source=<SSAS Server Name>:<Port>;Initial Catalog=<Database Name>; Character Encoding=UTF-8; Customdata=" & User!UserID

b) PPS – PPS provides a direct way to insert CustomData into the connection string. All you have to do is check the box “Provide the authenticated user as the value of the connection string property "CustomData"” on the PPS data connection properties page.


Because SharePoint keeps the login name in an encoded (claims) format, the CustomData value passed to SSAS will also be in that encoded format.

Return only those rows where end user has access

On the SSAS cube, create a role and add as members the accounts that were supplied in the SSRS data source and the PPS unattended service account. Make sure you don't give these accounts admin access on SSAS; otherwise the whole row-level security exercise would fail and all users would see all the data.

Because CustomData has been passed via the connection string, you can use its value and apply logic to it. The first piece of logic you need is to decode the encoded string passed by SharePoint. This can be achieved with DAX in the role created for cube access.



Right(
    right(CustomData(), len(CustomData()) - search("|", CustomData())),
    len(right(CustomData(), len(CustomData()) - search("|", CustomData())))
        - search("|", right(CustomData(), len(CustomData()) - search("|", CustomData())))
)


The DAX expression above strips everything up to and including the second “|” in the claims-encoded value, returning the actual user login ID from the encoded value passed by SharePoint. You can then use that login ID in the role's row filters to return only those rows that match certain values.
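
For illustration only, here is a minimal Python sketch of the same string manipulation. The claims-encoded login shown ("i:0#.f|membershipprovider|user@contoso.com") and the provider/user names in it are assumptions for the example, not values from the original post:

# hypothetical claims-encoded login, as SharePoint might pass it in CustomData
custom_data = "i:0#.f|membershipprovider|user@contoso.com"

# drop everything up to and including the first "|"
after_first = custom_data[custom_data.index("|") + 1:]   # "membershipprovider|user@contoso.com"

# drop everything up to and including the second "|"
login_id = after_first[after_first.index("|") + 1:]      # "user@contoso.com"

# this is the value you would compare against a user/permission column in the model
print(login_id)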