Flawless Excel in Business Intelligence

Somewhat prompted by some recent posts about the inherent error associated with Excel as a BI tool (and more importantly how some magical tool solves this problem), I thought I’d discuss Excel as a BI tool. In fact, it’s one of my favourite pondering points: “what role should Excel play in business intelligence?”

As I’ve stated above, it’s a discussion that’s often prompted by claims that a product mitigates the inherent risk associated with Excel (and, to a lesser extent, Access). Vendors often suggest that their Excel support mitigates this risk because it adds corporate security and authorised data capabilities to the user’s trusted tool (often it is argued that Excel is the only tool business users want to use, or perhaps can use).

There are several flavours presented in the pitch (often several of them appear in the same discussion), so let’s look at a few of the key selling (or scaring) points;

  1. Excel has inherent risk and is error prone. For this, I’ll take an extreme case since it suits my discussion. The nature of these stories is that Excel caused some monumental loss because of an error in a formula or some other error in the spreadsheet. There’s nothing like a bit of fear to make you sit up and pay attention, is there? If you have not had this error yet, don’t worry, because you will … it’s only a matter of time (in fact, we’ll even throw in some statistics to show just how common this can be). It’s no big deal if you haven’t found it yet because you just haven’t looked hard enough. In fact some vendors even go so far as to claim that users are arrogant (now that’s going to mend the bridge between IT and Business). But then again, there’s nothing like creating a bit of division is there – ‘oh yeah, now I remember, divide and conquer’.
  2. The Excel (or Access) system is undocumented and the removal of the key person that developed the tool would cause monumental loss to the company. How could any responsible manager allow a system or some reporting tool to operate if it is reliant on a key person … this should never happen, should it? If you want to be a good manager (and there’s no backing out of this argument) and ensure business continuity, you need a system that’s IT sanctioned with proper documentation – so who wouldn’t want or need that?
  3. Oh yes – did I mention that IT hasn’t sanctioned your Excel (or Access) work? They don’t support it and you can’t go to them with problems. This really reinforces the risk associated with points 1 & 2. If there’s an issue then you are on your own. This is a very compelling argument that enforces a CYA mentality – perhaps we should rewrite the argument as ‘if you can’t transfer blame, you’d better not take responsibility for it’.
  4. End users should focus on data analysis and not data preparation. Here we are presented with the classic position that too much time is spent on data preparation and not on analysis. Could this possibly be a hint that the employee is unproductive and could get so much more done if only they used the tool?
  5. IT takes too long or does not do it right. This is such a common selling point that it is almost taken for granted (especially in the realm of reporting). A pitch that focuses on the ‘corporate lifecycle of reporting’ may go something like this;
    1. The business user describes the report to the IT analyst
    2. Sometime later the IT guys get around to writing the spec for the report and pass it to the dev queue, where the report writer creates it. This doesn’t happen straight away, though, because the report writer is so busy, so, say, six months later the user gets their report.
    3. The report is inevitably wrong and the cycle begins again.
    4. By now the business has moved on and the business user needs a new report.
  6. Users need Excel interaction. If there’s been no justification presented above we need one resounding point – the final nail in the coffin (as it were). The pitch here focuses on user needs. We often hear very compelling statements along the lines that users just want to use Excel because that’s what they know or it’s just the way they work. Picture the preparation of a presentation where the analyst grabs some information from one report, creates a LOOKUP from data in another and then creates a chart to paste into PowerPoint. Based on this story it’s easy to see why users are always going to use Excel (well, so the pitch goes).

I have listed more than a few points here, but I think they can be summarised into two categories.

  1. Relying on Excel for the manipulation of data is a risky proposition (so you need some involvement of IT to provide corporate data to manipulate) and
  2. Users need Excel.

Now the vendor presents their tool – something to solve all the problems listed above, but how appropriate are such tools in reality? Aside from the idea that they present a very narrow perspective of what BI can do, we will now focus on how the vendor’s tool satisfies the gaping hole in the end user’s tool set.

Now (the vendor) presents the age of enlightenment in their product … our (their) tool solves all these problems because it;

  1. Is managed by IT – hosted so that there will not be any issues associated with lost or deleted files.
  2. Has gone through a rigorous testing process.
  3. Can be used by anyone. The software is so intuitive that any user can pick it up and use it.
  4. Delivers data to users in the format they want, that is, it’s accessible in Excel.

As convincing as these arguments might appear on the surface, there’s a real problem with this line of thinking and we can sum this up in two ways;

Firstly, to paraphrase Lyndsay Wise (W|T) – if users need Excel then the BI tool is not doing what it should. While I could go into a long-winded discussion, I think the statement is succinct. Where users have to bring data out into Excel to manipulate it (you don’t tend to hear the buzzword Mash Up anymore, do you?) then there’s a pretty convincing argument that the sanctioned solution just doesn’t cut it (sanctioned as in corporate structure). If we think about it for a minute longer and push the case for Excel, we’d realise that what we are really saying is that our business model is so inflexible it needs an additional layer of manipulation in order to provide any real value. This may be the case, but if so shouldn’t the problem be attacked at the source and not downstream, that is, when it enters the model and not when it’s spat out?

Now for the second point – the one where the tool somehow removes error where some form of manipulation is required. I am very interested to know exactly how any tool achieves this magical proposition (and have never seen a convincing argument by a vendor). At best, you’ve got a cell location (or multiple) that links back to corporate (and supposedly agreed) data. However, if we look at the process of what Excel is being used for, we are faced with exactly the same issue that would be present had they not had a tool. That is, users are free to use the data how they wish. After all, that’s the reason why Excel is so popular – it allows the user (of any skill level) a programmatic interface for working with data. The tool adds no more comfort than would otherwise be available from Excel.

If we would like to suggest that the availability of corporate data in Excel is the answer to Excel’s woes as a BI tool (and those nasty errors that are so well publicised) then we are simply adding another layer for users to get wrong. After all, the placement of data relies on the end user finding the right information to begin with!

Not surprisingly, the argument for safety around Excel as a BI tool resolves to the user and not the tool. Once it’s placed in Excel, data is at the direction of the user, and it is their ability that we seek mitigation against.

 


Microsoft Tabular Modeling Cookbook

I am pleased to announce that my tabular modelling book is finished. The title is succinct – ‘Microsoft Tabular Modeling Cookbook’ – and it’s available from Packt (and others) at http://www.packtpub.com/microsoft-tabular-modeling-cookbook/book

There is (naturally) a lot in the book – but it is designed to get you up and running fast. We look at all flavours of modelling – both Power Pivot and SSAS – and pretty much everything in between.

Of course, any comments and feedback are welcome!

SQL Server and Power Pivot – Tabular Modeling Cookbook

I have not been blogging lately, truth is, I’ve been busy writing a book on Tabular Modelling!

The book is quite comprehensive, and really focuses on what I perceive the lifecycle of tabular modelling to be (from an analytical perspective at least). That is, modelling and answering questions in Power Pivot (Excel 2010) and then migrating the model to servers, maintenance and management.

If you are interested in checking out the preview, the link is here; http://www.packtpub.com/sql-server-and-power-pivot-tabular-modeling-cookbook/book

Managing Connections In Power Pivot

Demonstrations that show importing data into Power Pivot usually do this from the database icon in the Home window (as identified below). However, once data is in your model and you wish to add new tables, it is wiser to reuse the existing connection rather than create a new connection for each import. Each time tables are added through this dialogue, a new connection is created within the model, so multiple connections may end up referring to the same source. For model simplicity, it is better to have a single connection referring to a single data source.

Standards for Import

The standard method of importing data into Power Pivot is to use the import from database icon as shown in the diagram below (when a database is being used). This utility has features that allow you to detect relationships between tables and may suffice for a new model, that is, when there are no existing tables in the model.

What this utility does is create a new connection to the data source each time the wizard is invoked. Consider, for example, an existing model that has the product table from Adventure Works imported. The initial import did not include the ProductCategory and ProductSubCategory tables. Naturally, we would want to include these in the model.

If we use this wizard to reimport the tables, a new connection is set up. This connection is associated with the two new tables. We can see this by selecting the “Existing Connections” icon in the Design tab.

Note that there are two connections in the model (as below), which was caused by the “re-import” of data once the initial population of the model had occurred.

Altering an Existing Connection

We can add data to the existing model and reuse an existing connection at the same time. If we wish to do this, we simply reopen the existing connection’s properties and append tables (or add a query). Click the ‘Open’ button from the dialogue and select either ‘Select from a list of tables…’ or ‘Write a query that will….’.

 

This will append any new tables to the model utilising the existing connection.

Perhaps the main reason for reusing the connection is that it keeps the model succinct and manageable. Once a table has been imported through a connection, the connection properties for that table cannot be changed without first deleting all the tables using the connection. Naturally, this would increase the amount of rework required to the model should consolidation of connections be required at a later stage.

Finally, connections are retained within the model even after all the tables using the connection have been deleted. The connection must be manually deleted from the ‘Existing Connections’ dialogue.

 

 

 

OLAP Rigidity wins against Tabular?

There are many reasons why you might choose a Tabular model over a Multidimensional one. Marco Russo discusses some of the pros of tabular in his blog here, which generally relate to the flexibility that the tabular model gives you. However, one surprising reason for choosing multidimensional may be its rigid dimensional structure and the certainty that this structure gives in changing data environments.

A Multidimensional dimension provides many ways to record information against an attribute. Most importantly, these are the attribute’s KEY, NAME and VALUE, which are set in the attribute’s source properties. For example, we can see the [State-Province] attribute from the Geography dimension in Adventure Works as below. The KEY is the only property that must be set for an attribute, so if no name is specified, the key will automatically be applied as the name.

Note that in the tabular engine, there is no concept of a KEY or NAME for an attribute. These are not specified because native DAX refers to column values directly.

Referring to Attribute Members

When an MDX query refers to members of an attribute, it can do so by using the member’s name or the member’s unique name. Thus, the two MDX statements below are equivalent;

-- query using the member's unique name
select
    [Measures].[Internet Sales Amount] on 0,
    [Product].[Product Categories].[Category].&[4] on 1
from [Direct Sales]

-- query using the member's name
select
    [Measures].[Internet Sales Amount] on 0,
    [Product].[Product Categories].[Category].[Accessories] on 1
from [Direct Sales]

 

There are many reasons why you would choose the first query over the second; namely, the second can give unexpected results when there is more than one member with the name being sought. That is, if there were two product categories with different keys and the same name (Accessories), the query would return only the first member. One might expect that the query would aggregate the results, but this does not occur.

Once the query has been written, the member’s unique name is confusing, and many people complain that it lacks reference to the underlying data (after all, it’s impossible to know what .&[4] means anyway). However, it is still the best way to refer to a member, and this is the way most client tools generate MDX (using the member’s unique name).

Note that since the tabular engine has no concept of a key for an attribute (you don’t explicitly specify the key or name for an attribute), the MDX equivalent passed to the tabular engine uses what we would otherwise consider the name as the key. Thus, as in the example above, the tabular equivalent for this member is [Product].[Category].[Category].&[Accessories]. To an OLAP (multidimensional) developer, this type of key (i.e. the name as the key) is generally considered a real NO NO and contravenes best practice.
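To make this concrete, here is a minimal sketch of the equivalent query against a tabular model. The cube name [Model], the [Product].[Category] naming and the measure name are assumptions based on a default tabular deployment of Adventure Works; the point is simply that the ‘key’ inside the unique name is the member’s name.

select
    [Measures].[Internet Sales Amount] on 0,
    [Product].[Category].[Category].&[Accessories] on 1
from [Model]

If the Accessories category were renamed, this query would no longer resolve the member, whereas the key-based multidimensional query above (.&[4]) would continue to work.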

 

Client Queries and the interesting stuff.

For tabular models, the unique name for a member is generated from the name. Furthermore, most client tools refer to the member by this unique name in their script, and I have even seen some tools hard code the published member unique name for parameters. So, what happens when the underlying attribute name changes? The change could break your report.

We can easily demonstrate how this error would occur using Excel as a client. Suppose I have a pivot based on a multidimensional cube and I convert the simple pivot to formulas (as below). The Accessories member in cell A2 is hardcoded (the formula is shown in D2 and uses the key &[4] as above). If the name of this member changes, the spreadsheet still works fine, and the new name is displayed.

However, for the tabular example (as below), when the hardcoded name is changed, the CUBEMEMBER function returns a #N/A error.

In the second picture, the name of the member has changed to [Accessories & Other]; however, the ‘report’ still relies on the hard-coded value .&[Accessories] and the error occurs.
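As a sketch of what those two hard-coded formulas look like (the connection names below are placeholders rather than the actual workbook connections), the first survives a rename because it references the surrogate key, while the second returns #N/A once Accessories is renamed;

Multidimensional (key-based reference):
    =CUBEMEMBER("Adventure Works","[Product].[Product Categories].[Category].&[4]")

Tabular (the name is the key):
    =CUBEMEMBER("Tabular Model","[Product].[Category].[Category].&[Accessories]")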

Conclusion

The use of a ‘key’ for each member offers a way to solve the problem described above, and this is only available in multidimensional dimensions. To be fair though, the problem is as much a function of design as it is of structure. A multidimensional dimension would still raise the issue if the attribute’s name had been used as its key (but no one does that, do they?).

PASS Day 1 – MS Announces Hekaton, PolyBase

Today at PASS, Microsoft announced three initiatives for SQL Server. These are;

  • Project Hekaton
  • Support for Big Data
  • The integration of self-service BI in Office

For those that are actively engaged in Business Intelligence, the integration of the xVelocity engine and Power View into Excel is not surprising; it’s been on the road map for some time and has previously been available through the CTP of Excel 2013. There is also the shift to less pervasive BI through additional functionality offered in Excel (for example, chart prompting and data learning for lists). I think that a lot of analysts or users that are involved in data manipulation will also enjoy using Power View visualisations, and the ability to interact with data in real time will improve data competencies. Clearly Microsoft is delivering on its goal of bringing ‘BI to the masses’, and its flagship is Excel (with inbuilt Power View for visualisation). Most importantly, Power View in Excel 2013 supports visualisations against OLAP databases, so most companies’ investment in OLAP technologies can be utilised with Power View in Excel. The update to the OLAP SSAS engine will occur at some date in the future.

For those that are engaged with SQL Server, Project Hekaton is an in-memory improvement to SQL Server objects. The ability to hold these objects in memory will give massive improvements in performance and is targeted towards OLTP environments. The impressive thing about Hekaton is that the technology is embedded into the SQL Server engine and so will become part of core functionality. Additionally, the in-memory component (load) can be customised to the database (and workload) so that there is complete control over what is held in memory. In fact, the engine suggests objects (tables and stored procedures) that should have in-memory storage applied. There is no doubt that this will have a major impact on the performance of OLTP systems and reduce reliance on hardware-based solutions (albeit the engine consumes CPU).

Finally, it has been announced that, in a move to support big data, Microsoft will support its own Hadoop-type instance through HDInsight. This is not really surprising given the industry’s direction on big data (it seems that all vendors have some big data solution). What is interesting from a Microsoft point of view is that the big data component of the engine will support SQL queries. This technology (PolyBase) allows the PDW (Parallel Data Warehouse) relational and big data components to be combined in a single SQL query.

One noticeable exclusion from today’s keynote was the lack of clarity (or any announcement) around Microsoft’s mobile BI strategy. At last year’s summit, it was announced that mobile BI would be supported, and it was demonstrated on an iPad. A similar announcement was not given today; however, there are sessions in the summit that specifically address a mobile solution (from MS personnel), so it will be interesting to see this content.

DAX III – Filtering Queries with CALCULATETABLE and FILTER

So far, the queries that we have created in DAX have focused on the entire result set (with the exception of the START AT clause in EVALUATE). Naturally, we would want to restrict the rowset in some way, and this is the focus of this post – how to filter row-set tables in DAX queries. There are two ways to do this: firstly, we can use the CALCULATETABLE syntax and, secondly, we can wrap the table expression in EVALUATE with a FILTER.

BOL defines CALCULATETABLE as “Evaluates a table expression in a context modified by the given filters”, taking the form;

 

CALCULATETABLE(<expression>,<filter1>,<filter2>,…)

 

There are many applications of this; however, let’s look first at reducing the rows returned, say to the first six months of 2000. We can write this query as follows.

evaluate
(
    calculatetable
    (
        addcolumns
        (
            summarize
            (
                'dates'
                , 'dates'[year]
                , 'dates'[Month]
            )
            , "sum_amount", calculate(sum('Sales'[amount]))
        )
        , dates[year]=2000
        , dates[month]>=1
        , dates[month]<=6
    )
)

 

One of the restrictions of using CALCULATETABLE to restrict rows is that the filter arguments cannot be applied to calculated measures. So, if we extended the filter arguments to include a condition on the sales amount, we would get an error.

evaluate
(
    calculatetable
    (
        addcolumns
        (
            summarize
            (
                'dates'
                , 'dates'[year]
                , 'dates'[Month]
            )
            , "sum_amount", calculate(sum('Sales'[amount]))
        )
        , dates[year]=2000
        , dates[month]>=1
        , dates[month]<=6
        , calculate(sum('Sales'[amount])) > 10000
    )
)

A function ‘CALCULATE’ has been used in a True/False expression that is used as a table filter expression. This is not allowed.

 

Note that the filter arguments in CALCULATETABLE are applied to the underlying table regardless of the fields returned (projected or selected). So, if we wanted to sum the first six months’ sales for the year 2000, we could simply remove the 'dates'[Month] field from the summarize, and the calculation is applied only for Jan to Jun.

evaluate
(
    calculatetable
    (
        addcolumns
        (
            summarize
            (
                'dates'
                , 'dates'[year]
            )
            , "sum_amount", calculate(sum('Sales'[amount]))
        )
        , dates[year]=2000
        , dates[month]>=1
        , dates[month]<=6
    )
)

 

We could also include this restriction in the definition of the calculation and achieve the same results (see sum_amount2 below).

evaluate
(
    calculatetable
    (
        addcolumns
        (
            summarize
            (
                'dates'
                , 'dates'[year]
            )
            , "sum_amount", calculate(sum('Sales'[amount]))
            , "sum_amount2", calculate(sum('Sales'[amount])
                                        , all(dates)
                                        , 'dates'[Year]=2000
                                        , 'dates'[Month] < 6
                                        , 'dates'[Month] >= 1
                                        )
        )
        , dates[year]=2000
        , 'dates'[Month] < 6
        , 'dates'[Month] >= 1
    )
)

 

From the rowset point of view, we are still faced with the problem of how to restrict rows based on a calculated value. In order to do this, we can apply the FILTER function against the table. Filter returns a subset of the table and takes the form;

 

FILTER(<table>,<filter>)

 

Consider yearly sales for the first six months of the year as below;

evaluate
(
    calculatetable
    (
        addcolumns
        (
            summarize
            (
                'dates'
                , 'dates'[year]
            )
            , "sum_amount", calculate(sum('Sales'[amount]))
        )
        , 'dates'[Month] < 6
        , 'dates'[Month] >= 1
    )
)

 

A filter can be applied to this rowset so that only sales over 50,000 are returned. Note that it is wrapped around the CALCULATETABLE.

evaluate
(
    filter
    (
        calculatetable
        (
            addcolumns
            (
                summarize
                (
                    'dates'
                    , 'dates'[year]
                )
                , "sum_amount", calculate(sum('Sales'[amount]))
            )
            , 'dates'[Month] < 6
            , 'dates'[Month] >= 1
        )
        , calculate(sum('Sales'[amount])) > 50000
    )
)

 

In these situations, it may be best to define the measure explicitly;

define measure 'dates'[sum_amount] = calculate(sum('Sales'[amount]))

evaluate
(
    filter
    (
        calculatetable
        (
            addcolumns
            (
                summarize
                (
                    'dates'
                    , 'dates'[year]
                )
                , "sum_amount", 'dates'[sum_amount]
            )
            , 'dates'[Month] < 6
            , 'dates'[Month] >= 1
        )
        , 'dates'[sum_amount] > 50000
    )
)