Tuesday, April 11, 2017

Collated Insights & a way to resolve current BI problems

In my previous blog I highlighted the problems that exist because of the current approach to Business Intelligence (BI). Read Problems with current BI ecosystem.

In this article I will talk about the approach I took to work around those problems and deliver a better user experience. I'll start with the design thinking that was adopted in the search for a better BI solution. Like any business scenario, solution design always starts with defining a problem or opportunity statement. Because BI is a universal need, this search for a problem statement didn't require much soul searching. From a BI professional's perspective, the omnipresent problem statement is:

How can we support better business decision making by applying technology, applications and best practices to the collection, integration, analysis and presentation of business information?

From here on, the exercise takes the shape of an interview or interrogation:

1. How are business decisions taken?
2. What information do business users need to make decisions?
3. How frequently do they refer to facts?
4. How should information be delivered for most effective consumption?
5. What methods or analysis should be applied to data to simplify the judgement process?

Some of the answers were evident while others needed consultation with expert users. Because my focus is on IT Service Management within my organization, these answers were to be applied to a large organization divided into various IT services managing different IT businesses. One thing that clearly stood out in the search for answers was the need for a standard set of metrics and KPIs. The purpose of BI is to measure the performance of metrics/KPIs against goals, describe their attributes and diagnose their behaviour. Well-groomed metrics are the bedrock of an effective BI process.

Design Paradigm
I treat BI like any other physical product or service. It requires input and processing, and it is delivered as a product with the aim of reaching the maximum consumer base. The best model for selling any product to the maximum number of consumers is the e-Commerce model: make products available through a catalog and let people choose what they want. For maximum satisfaction, make products configurable so that consumers can choose the customizations that suit their needs. This customization ensures that while users don't have to bother about how products were built, they retain control over how they are consumed. We live in an era of heightened consciousness, so providing transparency about the processing mechanism adds to the value.

By now you would have already made the connection that I decided to offer BI in an e-Commerce model. I call this combination of e-Commerce and BI "Collated Insights". In the Collated Insights world, metrics/KPIs are offered as products from a catalog of metrics, where they are neatly categorized by business process for easy search. All a consumer has to do is select the required metrics. Once selected, these metrics start showing up in a pre-configured dashboard template, such as a card-based view that shows absolute numerical values and a graph over a time scale. Before I start explaining how my team achieved this, let's spend some time understanding the philosophy of Collated Insights.
Collated Insights is a way of analysis that can be distributed and reproduced on demand. As the name signifies, it follows a methodology of collecting and combining insights. This methodology has two primary components: collate and deliver.
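
To make the catalog-and-card idea concrete, here is a minimal, purely illustrative Python sketch of what a metrics catalog and the selection flow could look like. The process names, metric names and card fields below are hypothetical placeholders for illustration, not the actual implementation.

# Hypothetical sketch: a metrics catalog, categorized by business process,
# from which a consumer simply selects the metrics they need.
catalog = {
    "Incident Management": [
        {"metric": "Time to Resolve (TTR) Incidents", "unit": "Hours",
         "measures": ["Total TTR", "Mean TTR", "TTR Percentile"]},
    ],
    "Change Management": [
        {"metric": "Change Success Rate", "unit": "Percent",
         "measures": ["Successful Changes %"]},
    ],
}

def select_metrics(catalog, chosen):
    # Return pre-configured dashboard "cards" for the chosen metrics:
    # each card shows the absolute value and a trend over time.
    cards = []
    for process, metrics in catalog.items():
        for m in metrics:
            if m["metric"] in chosen:
                cards.append({"title": m["metric"], "unit": m["unit"],
                              "view": ["current_value", "trend_over_time"]})
    return cards

print(select_metrics(catalog, {"Time to Resolve (TTR) Incidents"}))

In the real solution the catalog entries are backed by the metric definitions described under Collate below; the sketch only shows the consumer-facing shape of the idea.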

Collate
This is the most important of the two processes. Here the focus is on collecting, compiling and codifying the organizational metrics. It should ideally be done by someone who has a good understanding of the business and an inclination for continual improvement. Metrics must be properly designed, understood and collected; otherwise they can create a trust deficit, which is very dangerous. Metrics should be comprehended in their business context, with a very clear understanding of which metrics are used for improvement versus performance. Wherever possible, metrics should be defined top-down, i.e. start with business measures such as profit, turnover etc. and follow the trail down to operational metrics. It is paramount to establish a metrics registry that is enduring and available for reference.
This process is akin to establishing a business vision and architecture, where the rationale and implications should be clearly marked out. A good metric is one that can be tied to an action. Clearly detail metric attributes such as data source, unit, access requirements, thresholds, targets, boundary conditions etc. Each metric must have associated measures which are used to value it. As an example, Time To Resolve (TTR) for support tickets is a metric whose possible measures include Total TTR, Mean TTR, TTR percentile etc.
Here's a template for building metrics:

Metric Name: Time to Resolve (TTR) Incidents
Description: This KPI indicates the total amount of time taken (in hours) to resolve incidents.
Source DB Server:
Source Database Name:
Source Database Table:
Column: DurationMinutes
YTD Field (time attribute by which duration is calculated): Incident Resolved Date
Unit: Hours
Measures:
a) Total TTR
Measure Description: This measure is calculated as the sum of total time (in hours) spent in resolving incidents.
Formula: SUM(DurationMinutes)/60
Conditions: <boundary conditions e.g. SQL where clause statements>

b) Mean TTR
Measure Description: This measure is calculated as the average of total time (in hours) spent in resolving incidents.
Formula: AVERAGE(DurationMinutes)/60
Conditions: <boundary conditions e.g. SQL where clause statements>
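
For illustration only, here is a minimal pandas sketch of how the two measures above could be computed. The sample data and DataFrame name are assumptions made for the example; only the DurationMinutes column and the minutes-to-hours conversion come from the template.

# Illustrative sketch: Total TTR and Mean TTR (in hours) from a
# hypothetical incidents dataset with a DurationMinutes column.
import pandas as pd

incidents = pd.DataFrame({
    "IncidentResolvedDate": ["2017-03-01", "2017-03-02", "2017-03-05"],
    "DurationMinutes": [120, 45, 300],
})

# Boundary conditions (the template's WHERE-clause equivalent) would be
# applied here, e.g. keeping only incidents resolved in the reporting period.
total_ttr_hours = incidents["DurationMinutes"].sum() / 60
mean_ttr_hours = incidents["DurationMinutes"].mean() / 60
print(total_ttr_hours, mean_ttr_hours)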

Deliver
This is where technology plays the major role. Emphasis needs to be placed on the way these collated metrics and their associated measures are delivered. BI and analytics are used for business transformation, so the delivery mechanism must cover the three important aspects of people, process and technology. The underlying delivery mechanism should have the following attributes:
a) People (Adaptive)
        Easy to Use/Intuitive
        No specialized skill required
b) Process (Excellence)
        Codified
        Single Version of truth
        Fast Response
c) Technology (Leadership)
        Cutting Edge
        Scalable

While Collated Insights shares many aspects with the traditional BI approach, the differentiating factor is how metrics are delivered. A well-designed Collated Insights solution reduces time to market, cost and the trust deficit in the organizational BI ecosystem. In the next and final blog I'll explain the architecture that was used to deliver it.

Monday, March 13, 2017

Problems with current BI ecosystem

What is it that first comes to mind when you hear "decision support system" or "business intelligence"? To an enterprise knowledge worker, these are synonymous with dashboards and reports. With the increasing hype and commoditization of data processing, pivoting and visualization tools, BI and insights applications have penetrated many enterprises. Among these applications there are three primary market segments: 1) generic BI platforms such as the Microsoft BI platform, MicroStrategy etc., 2) industry-specific BI platforms such as Salesforce Wave Analytics, and 3) visual insights platforms such as Tableau, QlikView etc.

A BI ecosystem in an enterprise typically works in a bi-modal fashion: an enterprise BI team hosts the BI platforms, while the businesses use those platforms to build reports and dashboards for their analysis. For the most part this is an excellent modus operandi, except that it overlooks an important BI need. In any organization where business units standardize their structure, processes and activities, this bi-modal approach to BI becomes inefficient, because services within the same business unit end up building siloed solutions for common BI needs arising from similar processes and activities. A good example of this scenario is IT Service Management. In IT Service Management, IT performance is tracked via process indicators for processes such as incident management, change management etc., which are delivered through enterprise ITSM tools. This common business structure leads to common metrics and underlying data. To meet these common BI needs, organizations build management dashboards that are used by executives for performance tracking, while service operations teams build their own reports for ad-hoc analysis.

This bi-modal BI approach to a common business structure leads to the following issues:
1. The same metrics, when delivered by different business services, end up producing different results due to differences in understanding, biases, lack of codification and overlooked data quality issues.
2. Duplicated development cost and delays
3. Unequal performance comparisons

In fact, with every organizational restructuring these BI needs re-emerge, leading to perpetual BI development of a similar order. Additionally, the standard BI ecosystem still suffers from three major problems:
1. High development cost and turnaround time on canned BI reporting
2. Data understanding and reporting skills required for self-service reporting
3. BI Reporting & Advanced Analytics are more or less still two separate silos

Let’s understand these issues in more detail.
1. High development cost and turnaround time on canned BI reporting
Let’s first understand how a BI report [dashboard] is developed. 
  • Identify business process metrics to be monitored
  • Identify data sources
  • Identify data elements
  • Build datasets
  • Build data models (optional)
  • Develop report template
  • Bind report to data elements from datasets/models  

This is a standard approach to building a report or dashboard, with a few possible variations. The process may take from weeks to a few quarters depending on data quality and report complexity. Now if a business needs multiple reports or dashboards, this whole process is repeated for each report required, even when the same metrics appear in multiple reports. Not only is the process repeated, but the datasets also get cast in a particular shape owing to the reporting style. The next time the data schema changes, the risk of report breakage looms large.
 
2. Data understanding and reporting skills required for self-service reporting
While self-service reporting addresses the cost and time-to-market issues, it brings in a fresh set of requirements. Self-service reporting requires business people to stay constantly up to date on data models, data elements and reporting tools. Unlike canned reporting, self-service reporting also demands discipline from users to share results with all concerned stakeholders so that everybody is aware of them. In practice this is hardly achievable, not to mention that different business units may want to see results in their own way. When this discipline is not maintained and every business unit builds its own self-service report, a "multiple versions of truth" problem arises, because different users have their own interpretations of how a metric is calculated and which boundary conditions should be observed.
 
3. BI Reporting & Advanced Analytics are more or less still two separate silos
While advanced analytics can deal with abstract information, BI reporting requires a structured data approach. Usually advanced analytics results are displayed through BI reporting, but if a demand arises to go back to analytics and recalculate results on the fly, it becomes next to impossible. Let's understand this with an example.
Suppose we want to understand how a service is performing on "Time to Resolve (TTR)" for the incidents it is responsible for. The standard way would be to monitor the average, which is simple to calculate and report. However, the average is not an accurate indicator when data values span a large range. The right way is to understand the distribution of TTR and determine what percentage of incidents is contributing how much TTR. The simplest way to do this is through percentiles/quartiles. However, most standard BI reporting solutions are not capable of providing percentiles. They might deliver them for the overall volume, but they can't really perform once slicing and dicing comes into the picture. BI reporting's inability to work in sync with advanced analytics solutions keeps the two in silos.
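
To make the point concrete, here is a minimal pandas sketch of the percentile approach described above; the service and ttr_hours columns and the sample values are hypothetical.

# Illustrative sketch: TTR percentiles per service, the kind of slicing
# most canned BI reports struggle with.
import pandas as pd

tickets = pd.DataFrame({
    "service":   ["Email", "Email", "Email", "VPN", "VPN", "VPN"],
    "ttr_hours": [2.0, 3.5, 40.0, 1.0, 1.5, 2.0],
})

# The mean hides the long tail; per-slice quantiles expose it.
print(tickets.groupby("service")["ttr_hours"].mean())
print(tickets.groupby("service")["ttr_hours"].quantile([0.5, 0.75, 0.9]))

The 90th percentile per service immediately exposes the long tail that the average hides, and the same call keeps working as you slice by service, priority or any other dimension.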
 
These problems are more of a legacy design flaw than technical roadblocks. BI vendors mostly focus on reporting and dashboards rather than advanced analytics, because advanced analytics is very domain specific and use-case centric. Existing BI solutions try to expose raw data to end users because domain-specific metrics are not universally pertinent and tend to differ across industries.
At first these issues may seem like universal problems with no known solution, but in fact they are problems associated with the current approach to BI, and they are resolved when an alternative approach is considered. In the next blog I'll explain how I fixed these issues and delivered a solution that scores well against every drawback of the current BI ecosystem.

Sunday, August 16, 2015

Flat file to hierarchical tree - Python Way

import json

"""
"This class converts a flat file into a json hierarchical tree
"Inputs: a pandas data frame with flat file loaded
""""
class hierarchy_tree(object):
def __init__(self, data, ):
# self.__file = json_file_path
self.__data = data
self.__result = {'name':None, 'children':[], 'result':None}


"""
Call this public function to convert a json flat file to json tree
"""
def create(self, start=0, finish=None, callback=None):
data = self.__data
self.__callback = None
# check if callback is a function
if callable(callback):
self.__callback = callback

# iterate on each item
for row in data.iterrows():
#each row is a tuple
if finish == None:
finish = len(row[1])

row = row[1][start:finish]
lineage = []
for x in range(len(row)):
lineage.append(row[x])
self.__build_path(lineage)
return json.dumps(self.__result)

"""
This function actually creates nested dictionary
that is later dumped as json
"""
def __build_path(self, lineage):
parent = self.__result

for item in lineage:
# check if the current item exists as dictionary name
index = -1
for child in range(len(parent['children'])):
if parent['children'][child]['name'] == item:
# reset index if item found
index = child
break
# if existing item was not found
if index == -1:
# update as last item in dictionary
parent['children'].append({'name':item, 'children':[], 'result':None})
#
# implement callback
#pass arguments - Item text and its index in lineage
#
if callable(self.__callback):
parent['children'][index]['result'] = self.__callback(lineage, lineage.index(item))
# reset parent
parent = parent['children'][index]


Example Usage:


import pandas

def callbackfunc(items, index):
    return items[index] + str(index)

data = pandas.read_json(<json file path>)
tree = hierarchy_tree(data)
print(tree.create(start=0, finish=3, callback=callbackfunc))

Monday, June 29, 2015

Sentiment analysis : aggregate scoring

There has already been a lot of advice on how to do sentiment analysis, but not much on how to aggregate the polarity scores for multiple pieces of text on which sentiment analysis has been done. Below is my research on how to address this problem.

There are multiple ways to aggregate scores. The most commonly used methods are:
1) Sum
2) Arithmetic mean
3) Median
4) Mode

While the above methods are the most easily understood, they do not exactly fit this problem. Let's understand the issues with each of them.

1) Sum - The simplest of the lot; it can be used to aggregate the sentiment score over any corpus. In fact, sum is the right metric for scoring a single piece of text with multiple sentences of diverse polarity. However, if the results for multiple pieces of text are evaluated and compared on a sum scale, the differences may get lost. This happens mostly when the text is a mixture of highly subjective and objective sentences. A scale-based comparison may not yield the right results, as some of the score points may lie outside the normal range, and it also produces a very large distribution range.

2) Arithmetic Mean (Average) - The arithmetic mean is often the hammer for all nails. It's a great metric if the data points lie around a centroid within a reasonable radius. As soon as values move farther from the centroid, the mean starts tilting towards the farthest data points.

3) Median - The median of a set is the middle value when the values are sorted into numeric order. The median has the benefit of being relatively unaffected by outliers, but the flip side is that it is completely insensitive to changes in values that do not affect the set's ordering, except when the middle values themselves change.

4) Mode - The mode is the value that appears most often in a set of numbers. It is quite evident from the definition that it represents the highest-frequency data point; a good metric to show majority sentiment, but still not good enough on a comparative scale.

So the nearest fit so far is the arithmetic mean, and if the outlier issue can be addressed it would be the perfect metric to use. Over the years mathematicians have tried to address this issue, so there are various alternatives to the arithmetic mean with different formulas. The most famous of them are:
a) Geometric mean
b) Harmonic mean
c) Trimean (this is what is used for Olympic swimming scoring)

A very good comparison of Arithmetic mean, harmonic mean (and epsilon adjusted), and geometric mean (and epsilon adjusted) is done by Ravana & Moffat.

If you have read this paper, you would agree that the geometric mean makes a far stronger case for sentiment score aggregation. Now the important part: how to calculate the geometric mean for an entire corpus when you have a tri-polarity analyzer, i.e. positive, negative and neutral. For neutral cases the polarity score is always zero, which can drive the total GM (geometric mean) to zero, since the geometric mean formula is:

GM = (x1 * x2 * ... * xn)^(1/n), i.e. the nth root of the product of the n values.

This is still not as straightforward as it may seem, given that polarity scores are 0 for neutral text and negative for negative polarity. If the number of negative polarity scores is odd, the product is negative and the nth root may be an imaginary number.

The solution to this problem is very well explained by Elsayed A. E. Habib. So here's the solution.

Aggregate Geometric Mean = (N1*G1 - N2*G2)/N
N = N1 + N2 + N3

N1 = number of positive scores
N2 = number of negative scores
N3 = number of neutral scores
G1 = geometric mean of the positive scores
G2 = geometric mean of the absolute values of the negative scores

The geometric mean of the neutral texts is always zero, so it is not factored into the formula.

Geometric mean for negative numbers:
Case 1: The number of negative items is even – straightforward, since the product of an even count of negative numbers is positive and the nth root is real.

Case 2: The number of negative items is odd – calculate the geometric mean of the negative scores using their absolute values.
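
Putting the formula and the two cases together, here is a minimal Python sketch of the aggregation. It assumes polarity scores in a symmetric range such as -1 to 1 and simply follows the (N1*G1 - N2*G2)/N formula above; it is not taken verbatim from Habib's paper.

import math

def aggregate_geometric_mean(scores):
    # Aggregate polarity scores using GM_total = (N1*G1 - N2*G2) / N.
    positives = [s for s in scores if s > 0]
    negatives = [abs(s) for s in scores if s < 0]   # absolute values of negative scores
    n = len(scores)                                 # N = N1 + N2 + N3

    def gmean(values):
        # geometric mean = nth root of the product of the values
        if not values:
            return 0.0
        return math.exp(sum(math.log(v) for v in values) / len(values))

    g1, g2 = gmean(positives), gmean(negatives)
    return (len(positives) * g1 - len(negatives) * g2) / n if n else 0.0

# Example: a mix of positive, negative and neutral polarity scores
print(aggregate_geometric_mean([0.8, 0.6, -0.4, 0.0, -0.7, 0.0]))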

Friday, June 12, 2015

Gartner BI, Analytics & Information Mgmt. Summit–Mumbai 2015

 

This was my first conference with Gartner. It was arranged very well, with a succinct agenda. It's a good place to be if you are looking for cues on the next wave of technology. I attended with a mixed sense of what I was going to get out of it; however, it's not the place to be if you are looking for exact technical information on the application of BI and analytics.

As it was impossible to attend all the sessions, I opted for a few that interested me. Here is the list:

9th June

1. Master Data Management (MDM) for Beginners

Guido De Simoni

It was mostly common sense; however, I liked their representation of the different data types in an organization.

2. Gartner Opening Keynote: Crossing the analytical divide: New Skills, New Technologies

Ted Friedman, Kurt Schlegel, Rita Sallam

They provide very high-level information, centered mostly around management, governance, architecture and the flow of information. Most of the attendees are middle-level and executive managers. The speakers/analysts at Gartner are great presenters with a liking for dramatics. I got a taste of this in the opening keynote, where the main theme was built around three main subjects.

After the presentation, which they do rather well, the bottom line emphasized was: use a bi-modal mechanism. There is no final, identified cookbook on how to run a BI and analytics organization. It requires a little of everything; for example, while IT could be a centralized organization with all businesses coming to it for solutions, there is also a need for some decentralized pockets within the company to focus on specific business use cases. There are no certain solutions; the solution/tool space is very vibrant, with multiple offerings from numerous vendors, each with its own strengths and weaknesses. So you might have identified a certain solution for now, but it won't be long before a disruptive technology barges into the industry and makes you look old-fashioned. While organizations are betting big on sharing their data publicly and hosting data in external cloud services, there is skepticism and risk lurking around, so protect the data that is the lifeblood of the organization. The above messages were delivered through three short movies cast with the Gartner team. Fun to watch techies taking on Hollywood.

The keynote also presented the bright and dark sides of these dilemmas.

3. Industry Panel: An Expert view on Business Intelligence, Analytics and Information Management

Ian A Bertram, Neil Chandler, Donald Farmer, Deepak Ghodke

4. Building a Successful Business Analytics Strategy

Douglas Laney


5. Applying the Big Data Ecosystem

Arun Chandrasekaran

I couldn't attend the full session; I went in for the last 20 minutes, but what I heard was very informative. I also got a chance to speak to Arun and discuss some of my current ideas and issues. Although he didn't point to any specific solution, he helped me gain confidence in what I was doing.

6. Workshop: Foundation skills of a BI and Analytics Change Agent Leader

Ian A Bertram

Spoiler alert: if you are going to a Gartner event, don't expect technical knowledge transfer. The topics discussed in this workshop were nowhere related to BI or analytics. The ideas were very generic and common to any kind of change. It was about how to handle change, convince people to change and manage situations.

7. Interactive visualization: Insights for everyone

Rita L Sallam

A very informative session. Rita is a very seasoned speaker, and the talk was full of useful takeaways.


8. The High functioning Analytics center of Excellence: As analytics now pervades every corner of your organization, coordinating, collaborating and governing business intelligence and data science functions has become critical. Many such CoEs are well-intentioned but lack key ingredients.

Douglas Laney

A waste of time. It is a very important subject; however, the short time and diversity of views resulted in a very immature discussion.

10th June

9. Gartner Magic Quadrant Power session: Data Integration, Business Intelligence, Advanced Analytics, DBMS for Data Warehouse

Ian A Bertram, Ted Friedman, Alexander Linden, Rita L Sallam, Roxane Edjlali


10. How to Monetize Your Information Assets

Douglas Laney

Not exactly rocket science, but a good session for beginners. Laney was able to convince the audience that "Infonomics" is gaining ground and that information may become a balance sheet asset in the future.


11. Workshop: How to Develop and Evolve an Enterprise Information Management Strategy

Guido De Simoni

I could have skipped this one, but it was a good session for those who don't have a defined doctrine on how they want to manage data in their organization.


12. The Enterprise Information Management Framework: Building Blocks for IM Success

Michael Patrick Moran

This session was an extension of the previous one. It was a content-rich session with clear guidelines on how an EIM framework should be created and nurtured.


13. Innovating with Analytics: 40 Real World Examples in 40 Minutes

Douglas Laney

A very fun session. While we have been hearing many different use cases for years, Laney was still able to inspire some awe.


14. Big Data Discovery – The Next Generation of BI Self-Service

Rita L. Sallam

My favorite session of all. This is exactly what I was looking for during the whole two days, and it came last. Nevertheless, it was full of information. The most interesting part is that she talked about exactly the same things I have been thinking about for quite a few days. It's the tipping point: it won't be long before people get over the big data fever and start taking note of the rest of their data.


Besides these, there were two more sessions. One was by Ketan Bhagat, the younger brother of popular author Chetan Bhagat and an author himself; he spoke about "Operant Conditioning", though it was more of a book launch event. There was also a closing keynote, but nothing worthwhile in it.

I also noted some keywords that are going to be buzzwords in the analytics industry in the coming years. They are:

Citizen Developer, Analytics Continuum, Data Lake, Lambda Architecture, Citizen Data Scientist

Overall it was time well spent, and I made great connections.

Monday, February 2, 2015

Lost MS SQL Server admin access? No Problem

Today I mistakenly removed my own Windows account from my local MS SQL Server installation, which meant I couldn't use the database anymore. I didn't have any other account set up on this instance because it is a local setup. In effect the SQL Server installation was useless, and the only obvious way out was to uninstall the whole DB server and install it again.

However, I did some searching and found the following solution to recover access:

Step 1: Open a command prompt as an administrator and stop the SQL Server service:

net stop mssqlserver

(or stop it from services.msc)

Step 2: Start SQL Server in single-user (minimal configuration) mode:

net start mssqlserver /f /T3608

This will start MS SQL Server in single-user mode.

Step 3: Connect to the instance with sqlcmd:

sqlcmd 


Step 4: Create a new Windows login

CREATE LOGIN [<<DOMAIN\USERNAME>>] FROM WINDOWS;
EXEC sys.sp_addsrvrolemember @loginame = N'<<DOMAIN\USERNAME>>', @rolename = N'sysadmin';
GO

OR, create a SQL Server login

CREATE LOGIN [testAdmin] WITH PASSWORD=N'test@1234', DEFAULT_DATABASE=[master];
EXEC sys.sp_addsrvrolemember @loginame = N'testAdmin', @rolename = N'sysadmin';
GO

Step 5: Exit sqlcmd with CTRL+C, then restart the SQL Server service normally (net stop mssqlserver followed by net start mssqlserver)


Step 6: Log in to SQL Server via SQL Server Management Studio using the new login

The curious case of SharePoint integrated SSRS Subscription

Some time ago I found that whenever I create an SSRS subscription with multiple cascaded filters in SharePoint integrated mode, the time it takes to create the subscription after selecting all filter values is longer than the time it would take to simply execute the SSRS report.

In order to understand why this happens, lets understand what happens when you create a subscription.


When you click on "Add Subscription", all the datasets present in the report are executed. You might ask why; the answer is that the parameter and filter values need to be populated. This seems normal, but the curious case I'm talking about is something else.

The problem is with cascaded filters, i.e. filter values that depend on other filter values.

The default behavior is that all datasets are fired when "Add Subscription" is clicked. Because a child filter's values are derived from the values of its parent filter in the cascade, the dataset queries are fired multiple times, depending on how many cascaded children you have.

In a scenario where the datasets have long, complex queries, the problem becomes very evident in the time it takes to create a subscription.

But why does it happen? I asked Microsoft. The response was: because of POSTBACK. Because it is SharePoint integrated mode, all calls are routed through SharePoint and the postback re-fires them. Another question I asked Microsoft was why it was designed this way. They didn't have an answer to that; however, they were kind enough to look further and fix the problem.

So here is the problem statement as defined by Microsoft, and its solution:

Problem Description:

You have a report in SharePoint integrated mode which contains parameters. When you add a subscription, the page takes a lot of time to come up. Also, when you change any of the parameters, the postback takes a really long time to reload the page.

Analysis:

1. We took profiler traces and found that a single query belonging to a dataset was executing 4 times. This was causing the issue.

2. We engaged our product team to get a better understanding of the behavior and how we can mitigate it.

3. The reason behind the multiple executions of the query is the postback within ASP.NET.

4. We reviewed our code and found scope for improvement that would remove at least one level of execution.

5. Our product team has agreed to release the improved code as part of SQL Server 2012 SP2 CU5.

Root cause:

1. Due to the postback behavior of ASP.NET, the queries are executed multiple times.

Resolution:

A fix will be released as part of SQL Server 2012 SP2 CU5. The tentative release date is March 16, 2015.