Temporal Tables != Data Warehouse

With the release of SQL Server 2016 just around the corner, there are an increasing number of posts appearing about new features that mean I don’t have to work.  One post (which motivated me to write this) claimed that this breakthrough would mean we no longer needed to store historical values – an interesting interpretation of a Data Warehouse, to say the least.  Microsoft’s own content (here) suggests that there are huge productivity benefits for Slowly Changing Dimensions because you can compare the same key at two points in time.

The idea of temporal tables (recording every change to a table in a table_History table) is a cool feature – don’t get me wrong, I think it’s great.  However, there is (sadly) little discussion about how the feature can be incorporated into the Data Warehouse.  Those who blindly follow the sales pitch – “I don’t need a data warehouse because I’ve got history tracking”, or perhaps “yeah, I’ve got a data warehouse – it’s a history-tracked OLTP” – will ensure their users can’t get the data they need.

So let’s call out some issues with reliance on temporal tables as a data warehouse replacement (and bring some data warehouse assumptions to the surface).  I will focus on the star schema design, since most references explicitly refer to slowly changing dimensions (however, we can apply these ideas to other methodologies).

A fundamental construct of the star schema is the surrogate key (almost as fundamental as the concepts of dimension and fact).  The surrogate key uniquely identifies an instance of a dimension at a point in time and, therefore, the state of the dimension can be precisely identified for the fact record.  For example, if I sold a product on a date, I need to look up the product dimension and determine which version of the product was applicable on that date.  The product’s surrogate key (not the Product Id or Code) is then used in the fact.  This is the fundamental design of the star schema.  A temporal table does not provide you the capacity to do this – all it can do is provide the data to construct the star.

How could you solve this with temporal tables?  Well, there may be the thought that you could concatenate the table’s primary key and the record’s start date for uniqueness, and then determine which (dimension) record is applicable to a fact record via a query.  Interesting idea, but not something I’d take on lightly.  I suspect that performance would degenerate so quickly that both the BI users and the Ops users (remember that this is occurring on the OLTP) would walk away in droves.  Remember that this has to occur for every record in the fact (and yes, facts are those LONG, narrow tables)!
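To make that lookup concrete, here is a minimal C# sketch (the type and member names are hypothetical, not from any real system) of the point-in-time match that would have to run for every fact record: find the history row whose validity window covers the transaction date and take its surrogate key.

using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical shapes for illustration only.
public class ProductHistoryRow
{
    public int ProductSK;       // surrogate key for this version of the product
    public int ProductId;       // business key shared by all versions
    public DateTime ValidFrom;  // when this version became current
    public DateTime ValidTo;    // when this version was superseded (exclusive)
}

public static class SurrogateLookup
{
    // Resolve the surrogate key of the product version that was current on the transaction date.
    // This business-key-plus-date-range match is the work the star schema does once, at load time,
    // rather than in every user query.
    public static int Resolve(IEnumerable<ProductHistoryRow> history, int productId, DateTime transactionDate)
    {
        return history.Single(h => h.ProductId == productId
                                && h.ValidFrom <= transactionDate
                                && transactionDate < h.ValidTo)
                      .ProductSK;
    }
}

In SQL this would be a range join between the incoming fact rows and the history table – which is exactly the per-row work the surrogate key avoids at query time.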

So let’s leave it to the presentation tool – pick one: Power BI, SSAS, Tableau, Qlik, Jedox, …  All these products rely on uniqueness in the relationships between tables, so we still require the surrogate key to enforce and deliver that uniqueness.  The star (or at least the principle) is still required.

The real power of the dimension (and, to a lesser extent, the fact) is that it adds business context that does not exist in the source (or cannot be easily calculated). Of course this is my opinion, but think about it for a moment. Forget about historic values for a moment – the raw information is in the source; if the user wanted that, you could give it to them, no problem. What the star gives is a modeled perspective of a particular part of the business. Consider a customer dimension – what adds value in analysis? It is often the supplementary data (age group, segment profile, status classification, targeted customer …) and all of these things are defined and stored in the dimension. So, if we are going to add value (as data warehousing professionals), we still need the dimension to provide this.

All business applications offer some form of reporting – if you’ve ever seen an information maturity chart, reporting is the first stage of maturity (see the chart below, thanks to an EMC2 slide deck).

big-data-business-model-maturity-chart

Riddle me this then: if the source application (OLTP) provides reporting, why do we need a data warehouse? To show reports at a particular point in time? Maybe (although a lot of users struggle with this and tend to think in current terms). There are a lot of tools that provide ad hoc query (OLAP) capabilities, so the performance of analysis isn’t a real consideration either (after all, they could just use an OLAP engine over the OLTP, right?).

I think one of the primary reasons is integration. We want to either integrate data from different systems (creating an enterprise view) or supplement current data with other, richer information (which is really just integration anyway, isn’t it?). We could even say that business rules and derived information fall into this category.

Here, too, temporal tables do not negate the need for the data warehouse. The data warehouse is responsible for delivering consistent, conformed, business-verified data that incorporates information from various sources. Nothing has changed there (and so the need for a data warehouse remains).

Finally, let’s consider the dimension – that subject-orientated view of an object. It’s the Product table that tells me everything I need to know about a product: its category, groupings, margin positions and the like. The dimension is notorious for redundancy and de-normalisation, but that’s the price we are prepared to pay for delivering a single, concise view to the user, because it breaks down the complexity of the model for them (they don’t have to join products to product categories in a query). The fact that we de-normalise breaks the basic OLTP conventions, which force normalisation (after all, we expect 3rd normal form).

The data warehouse is designed to do this work for us and present the view to the user. Essentially, it’s just another integration problem, but one that’s handled (and hidden) by the data warehouse. Our BI tools (again with the presentation layers) may be able to create these consolidations for us; however, we are still presented with the issue of uniqueness in related table records (that is, we have to identify which category related to a product at a point in time, and the BI tools will not do that for us).

So, are temporal tables a replacement for a data warehouse? I think not. Sure, they may be able to assist with record change tracking (and we haven’t even discussed the shift in OLTP resource management). Temporal tables are only a tool, and I’d be very careful of anyone who claims they could replace a data warehouse.

MySql Import Issues with Tabular and OldGuids

In my last post, I looked at a MySql connection property that causes loads to fail in Power BI and how you can work around that issue (see here).  In this post, I’ll look at the implications of that property in SSAS Tabular.  In this example, I’ve downloaded the MySql ODBC drivers (both 32-bit and 64-bit) from here and created ODBC connections (32-bit and 64-bit with the same name) to my MySQL database.

Data is initially imported through the Table Import Wizard, where I have to choose the relational source Others (OLEDB/ODBC).  This brings me to the window below, where I can choose my ODBC connection.  Testing the connection proves successful, so you click Next a few times until you can choose to import from a list of tables or write a query.

image

If you choose to select tables and any of the tables have Guids in them, the entire load will fail and you are presented with the following errors.

image

Here the error is raised because the generated SQL is not compliant with MySQL (the generated SQL is “Select [calls].* from [calls]” and MySQL does not like square brackets).

Fair enough, so you continue.  Now that the connection has been created in the model, all you need to do to import data is open the existing connection (existing connections in the model can be viewed by clicking the Connections button in the ribbon (image) or by selecting “Existing Connections…” from the Model menu).  You can then open that connection (either double-click it or highlight it and press the Open button) and you are again presented with the choice to select from a list of tables or write a query.  This time you specify the query (Select * From Calls).  The query validates; however, a new error is presented when you try to load data.  Although the error does not say so, it relates to the OldGuids property.

image

Now we have to manually edit the connection to specify its properties.  When we do so, we are presented with the following window.  I want to build the connection string, so I choose the connection string option and select the Build option.  Then I can choose the machine data source (by the way, I set up my ODBC connections as machine data sources).

image

Now I am taken to the driver’s configuration window (as below).  I don’t need to change anything here (since the connection works anyway), so I will accept the defaults and hit OK.  (Note that the OldGuids property is not an option in the details section of the driver’s configuration.)

image

My Connection String has now been specified (note the additional properties in the window and snippet below) and I can add other properties as I need.

image

Before:

DSN=dsnname;SERVER=server;UID=username;PWD=password;DATABASE=databasename;PORT=3306

After:

DSN=dsnname;SERVER=server;UID=username;PWD=password;DATABASE=databasename;PORT=3306;OldGuids=true;

Then I simply specify my impersonation and I am right to go (noting that I must write a query).
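As a side note, a quick way to sanity-check the amended connection string (and the query) outside of the wizard is a few lines of ADO.NET against the same DSN.  This is only a sketch – the DSN, credentials and table name are the placeholders used above, not real values.

using System;
using System.Data.Odbc;

class ConnectionStringCheck
{
    static void Main()
    {
        // The amended string from above, placeholders and all.
        string connectionString =
            "DSN=dsnname;SERVER=server;UID=username;PWD=password;" +
            "DATABASE=databasename;PORT=3306;OldGuids=true;";

        using (OdbcConnection connection = new OdbcConnection(connectionString))
        using (OdbcCommand command = new OdbcCommand("SELECT * FROM Calls", connection))
        {
            connection.Open();
            using (OdbcDataReader reader = command.ExecuteReader())
            {
                Console.WriteLine("Query returned {0} columns.", reader.FieldCount);
            }
        }
    }
}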

Biml XIII – The Biml Script for the Entire SSAS Database

This final post on biml SSAS processing provides the full script to produce two packages.  It ties together the first two posts to provide a complete solution for our processing scenario.  Remember, our goal is to point our biml at an SSAS database and have a processing package built that only processes the last two fact partitions of each measure group, based on their last processed date.

The First Method

Firstly, we look at a very conventional arrangement that would process dimensions, partitions and indexes.  This looks like the image below.  In order to get the most performance (lowest processing time), we need to specify that each processing task uses parallelisation.

image

The code is similar to prior posts, so we won’t dwell on it too much.  Basically, we use AMO (note the reference to the AMO dll in the first line) to connect to our cube, discover dimensions and partitions, and add these in an iterative manner to each processing task.  The two important methods in the C# code are DimensionKeys (which returns all dimensions in the database) and Partitions (which returns the measure group’s partitions sorted by last processed date, so that the last two processed can be selected).

<#@ assembly name="C:\Program Files (x86)\Microsoft SQL Server\110\SDK\Assemblies\Microsoft.AnalysisServices.dll" #> 

<#@ import namespace="System.Collections.Generic" #>
<#@ import namespace="Microsoft.AnalysisServices" #> 

<#@ template language="C#" #> 

<#
    String _ssas_server = @".\SQL2008R2";
    String _ssas_database = "Adventure Works DW 2008R2";
    int _max_partitions = 2;
#> 

<#+ 
Dictionary<string, string> DimensionKeys(string Server, string Database)
{
     Dictionary<string, string> _dimensions = new Dictionary<string, string>();
     Server _ssas_server = new Server();
     _ssas_server.Connect(Server); 

     Database _db = _ssas_server.Databases[Database];
     foreach (Dimension _dim in _db.Dimensions)
         _dimensions.Add(_dim.ID, _dim.Name); 

     _ssas_server.Disconnect();
     _ssas_server.Dispose(); 

     return _dimensions;
} 

public static List<PartitionStruct> Partitions(string ServerName, string DatabaseName, string CubeId, string MeasureGroupId)
{
        var _ret = new List<PartitionStruct>(); 

        Server _ssas_server = new Server();
        _ssas_server.Connect(ServerName);
        Database _db = _ssas_server.Databases[DatabaseName];
        Cube _c = _db.Cubes[CubeId];
        MeasureGroup _mg = _c.MeasureGroups[MeasureGroupId]; 

        foreach (Partition _p in _mg.Partitions)
        {
                _ret.Add(new PartitionStruct(_p.Name.ToString(), _p.ID.ToString(), _p.LastProcessed));
        }
        _ssas_server.Disconnect();
        _ssas_server.Dispose(); 

        _ret.Sort((y, x) => x.PROCESS_DATE.CompareTo(y.PROCESS_DATE)); 

        return _ret;
} 

public struct PartitionStruct
{
        public readonly string PARTITION_NAME;
        public readonly string PARTITION_ID;
        public readonly DateTime PROCESS_DATE;
        public PartitionStruct(string partition_NAME, string partition_ID, DateTime process_DATE)
        {
                PARTITION_NAME = partition_NAME;
                PARTITION_ID = partition_ID;
                PROCESS_DATE = process_DATE;
        } 

} 

#>

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Connections>
    <AnalysisServicesConnection Name="olap" ConnectionString="Data Source=<#= _ssas_server #>;PROVIDER=MSOLAP;Impersonation Level=Impersonate;" Server="<#= _ssas_server #>" />
  </Connections>
  <Packages>
    <Package Name="SSAS_PROCESSING_ssas_process_all" ConstraintMode="Linear" ProtectionLevel="EncryptAllWithUserKey">
         <Tasks>
            <AnalysisServicesProcessing Name="Process_Dimensions_Update" ConnectionName="olap" ProcessingOrder="Parallel" >
              <ProcessingConfigurations>
                <# 
                Dictionary<string, string> _dimensions = DimensionKeys("Data Source=" + _ssas_server , _ssas_database);
                foreach (string _dim in _dimensions.Keys){ #>            
                        <DimensionProcessingConfiguration DatabaseId="<#= _ssas_database #>" ProcessingOption="ProcessUpdate"  DimensionId="<#= _dim #>" />
                <# } #>
                  </ProcessingConfigurations>
            </AnalysisServicesProcessing>
            <AnalysisServicesProcessing Name="Process_Partitions_Data" ConnectionName="olap" ProcessingOrder="Parallel">
                <ProcessingConfigurations>
                <#    
                Server _ssas_server_ = new Server();
                _ssas_server_.Connect(_ssas_server); 

                Database _db = _ssas_server_.Databases[_ssas_database];
                foreach (Cube _c in _db.Cubes) 
                { 
                    foreach (MeasureGroup _mg in _c.MeasureGroups)
                    {  
                        List<PartitionStruct> _listPartitions = Partitions(_ssas_server, _ssas_database, _c.ID, _mg.ID);
                        for (int _i = 0; _i < _max_partitions & _i < _listPartitions.Count; _i++ ) { #>
                        <PartitionProcessingConfiguration DatabaseId="<#= _db.ID #>" CubeId="<#= _c.ID #>"  MeasureGroupId="<#= _mg.ID #>" PartitionId="<#= _listPartitions[_i].PARTITION_ID #>" ProcessingOption="ProcessData" />
                     <# }
                    }
                    } 
                    #>
                </ProcessingConfigurations>
            </AnalysisServicesProcessing>
            <AnalysisServicesProcessing Name="Process_Indexes" ConnectionName="olap" ProcessingOrder="Parallel">
                <ProcessingConfigurations>
                    <# 
                    _db = _ssas_server_.Databases[_ssas_database];
                    foreach (Cube _c in _db.Cubes) { #>
                    <CubeProcessingConfiguration ProcessingOption="ProcessIndexes"  DatabaseID="<#= _db.ID #>" CubeId="<#= _c.ID #>"/>
                    <# } 
                       _ssas_server_.Disconnect();
                    #>
                </ProcessingConfigurations>
            </AnalysisServicesProcessing>
        </Tasks>
    </Package>
  </Packages>
</Biml>

A slight word of caution … If you look at the biml code, you will see that we explicitly set the ProcessingOrder to Parallel.  This specifies that the processing order (mode) for the task should be parallel and that the XMLA batch sent to your SSAS server should include that option.  Unfortunately, this may not materialise in all versions of BIDS and Biml, and the default (sequential) may be used.  I would check the outcome before implementation.

Method 2

Our second approach is essentially the same as the first; however, in this approach we encapsulate each measure group’s processing commands in its own container, with sequential constraints between each partition process.  Why?  Because we want to ensure that our partitions are processed according to their last processed date.  That is, the last two processed partitions should be processed, and we should process them in the order that we originally processed them.  If we add a new partition, we want to ensure that the one with the oldest data drops off the list.  You’ll also notice that we use SSIS containers to gain parallelisation across cubes and measure groups.

This code relies on the same methods (DimensionKeys and Partitions) used in the first method to build its output, with Partitions now also taking the number of partitions to return.
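As an aside, the sort, truncate and re-sort inside the new Partitions method is equivalent to the following LINQ (a sketch only, reusing the PartitionStruct type from the script): keep the n most recently processed partitions, then order them oldest-first so they are processed in their original order.

using System.Collections.Generic;
using System.Linq;

static List<PartitionStruct> LastProcessed(IEnumerable<PartitionStruct> partitions, int partitionNum)
{
    return partitions
        .OrderByDescending(p => p.PROCESS_DATE) // most recently processed first
        .Take(partitionNum)                     // keep only the last n processed
        .OrderBy(p => p.PROCESS_DATE)           // re-sort so they process in their original order
        .ToList();
}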

image

<#@ assembly name="C:\Program Files (x86)\Microsoft SQL Server\110\SDK\Assemblies\Microsoft.AnalysisServices.dll" #>
<#@ import namespace="System.Collections.Generic" #>
<#@ import namespace="Microsoft.AnalysisServices" #> 
<#@ template language="C#" #> 

<#
    String _ssas_server_name = @".\SQL2008R2";
    String _ssas_database_name = "Adventure Works DW 2008R2";
    int _max_partitions = 2;
#>
<#+ 

Dictionary<string, string> DimensionKeys(string Server, string Database)
{
     Dictionary<string, string> _dimensions = new Dictionary<string, string>();
     Server _ssas_server = new Server();
     _ssas_server.Connect(Server); 

     Database _db = _ssas_server.Databases[Database];
     foreach (Dimension _dim in _db.Dimensions)
         _dimensions.Add(_dim.ID, _dim.Name); 

     _ssas_server.Disconnect();
     _ssas_server.Dispose(); 

     return _dimensions;
} 

public static List<PartitionStruct> Partitions(string ServerName, string DatabaseName, string CubeId, string MeasureGroupId, int PartitionNum)
{
        /* returns a number of partitions (PartitionNum)
         * based on their processed_date -> ie last n processed partitions
         * sorted in asc order
         */ 
        List<PartitionStruct> _ret = new List<PartitionStruct>(); 

        Server _ssas_server = new Server();
        _ssas_server.Connect(ServerName);
        Database _db = _ssas_server.Databases[DatabaseName];
        Cube _c = _db.Cubes[CubeId];
        MeasureGroup _mg = _c.MeasureGroups[MeasureGroupId]; 

        foreach (Partition _p in _mg.Partitions)
        {
                _ret.Add(new PartitionStruct(_p.Name.ToString(), _p.ID.ToString(), _p.LastProcessed));
        }
        _ssas_server.Disconnect();
        _ssas_server.Dispose(); 

        _ret.Sort((y, x) => x.PROCESS_DATE.CompareTo(y.PROCESS_DATE)); 

        /* get first PartitionNum */
        while (_ret.Count > PartitionNum)
                _ret.RemoveAt(PartitionNum); 

        // ret asc
        _ret.Sort((x, y) => x.PROCESS_DATE.CompareTo(y.PROCESS_DATE)); 
        return _ret;
} 

public struct PartitionStruct
{
        public readonly string PARTITION_NAME;
        public readonly string PARTITION_ID;
        public readonly DateTime PROCESS_DATE;
        public PartitionStruct(string partition_NAME, string partition_ID, DateTime process_DATE)
        {
                PARTITION_NAME = partition_NAME;
                PARTITION_ID = partition_ID;
                PROCESS_DATE = process_DATE;
        }
}
#> 

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
    <Connections>
        <AnalysisServicesConnection Name="olap" ConnectionString="Data Source=<#= _ssas_server_name #>;PROVIDER=MSOLAP;Impersonation Level=Impersonate;" Server="<#= _ssas_server_name #>" />
    </Connections>
    <Packages>
        <Package Name="SSAS_PROCESSING <#= _ssas_database_name #>" ConstraintMode="Linear">
            <Tasks> 

                <AnalysisServicesProcessing Name="Process_Dimensions_Update" ConnectionName="olap" ProcessingOrder="Parallel" >
                      <ProcessingConfigurations>
                        <# 
                            Dictionary<string, string> _dimensions = DimensionKeys("Data Source=" + _ssas_server_name , _ssas_database_name);
                                foreach (string _dim in _dimensions.Keys){ #>            
                                    <DimensionProcessingConfiguration DatabaseId="<#= _ssas_database_name #>" ProcessingOption="ProcessUpdate"  DimensionId="<#= _dim #>" />
                        <# } #>
                      </ProcessingConfigurations>
                </AnalysisServicesProcessing> 

                <Container Name="FACT_PROCESSING" ConstraintMode="Parallel">
                    <Tasks>
                    <#
                    Server _ssas_server_ = new Server();
                    _ssas_server_.Connect(_ssas_server_name);
                    Database _db = _ssas_server_.Databases[_ssas_database_name];
                    foreach (Cube _c in _db.Cubes) { #>
                        <Container Name="CUBE <#= _c.Name #>" ConstraintMode="Parallel">
                            <Tasks>
                                <# foreach (MeasureGroup _mg in _c.MeasureGroups){#>
                                <Container Name="MEASURE GROUP <#= _mg.Name #>" ConstraintMode="Linear">
                                    <Tasks>
                                        <# List<PartitionStruct> _listPartitions = Partitions(_ssas_server_name, _ssas_database_name, _c.ID, _mg.ID, _max_partitions);
                                           foreach (PartitionStruct _partition in _listPartitions) { #>
                                              <AnalysisServicesProcessing Name="PARTITION <#= _partition.PARTITION_NAME #>" ConnectionName="olap">
                                                  <ProcessingConfigurations>
                                                      <PartitionProcessingConfiguration CubeId="<#= _c.ID #>" DatabaseId="<#= _db.ID #>" MeasureGroupId="<#= _mg.ID #>" PartitionId="<#= _partition.PARTITION_ID #>" ProcessingOption="ProcessData" />
                                                  </ProcessingConfigurations>
                                              </AnalysisServicesProcessing> 
                                        <# }#>
                                    </Tasks>
                                </Container>
                                <# }  #>
                            </Tasks>
                        </Container>
                    <#} #>
                    </Tasks>
                </Container> 

            <AnalysisServicesProcessing Name="Process_Indexes" ConnectionName="olap" ProcessingOrder="Parallel">
                <ProcessingConfigurations>
                    <# 
                    _db = _ssas_server_.Databases[_ssas_database_name];
                    foreach (Cube _c in _db.Cubes) { #>
                    <CubeProcessingConfiguration ProcessingOption="ProcessIndexes"  DatabaseID="<#= _db.ID #>" CubeId="<#= _c.ID #>"/>
                    <# } 
                       _ssas_server_.Disconnect();
                    #>
                </ProcessingConfigurations>
            </AnalysisServicesProcessing> 

            </Tasks>
        </Package>
    </Packages>
</Biml>

Conclusion

You might suggest that this is a very long way to build a processing schedule – and you may be right.  If you only had to build it once, to create a single package that does some processing, you might have achieved it faster by hand.  However, here’s the thing about automation … you don’t have to code it again; it’s there for you, you can use it to build out your packages, and now package creation takes 10 seconds!

Biml XXII – A Framework for Partition Processing

Following on from the last post on dimension processing, the next operation in our SSAS processing strategy is to process measure group partitions.  If we remember our scenario, our SSAS database takes too long to process (when we try a full process), so we are forced to process dimensions using a ProcessUpdate and then specific partitions using ProcessData.  Of course, we only want to process those partitions that have the most recent data, and we want to automate this using biml.  That is, we want to point our biml package at an SSAS database and have the package generated by some logic.

The Logic

image Let’s assume that we have a group of partitions in each measure group (see left).  We will assume that the last partitions processed for the measure group are the most current and, therefore, we should choose the last n partitions to process (this is based on each partition’s last processed date).

If you think about how you would manage this in a system, consider adding 2009 data.  When 2009 comes along, we add a partition and process it.  Now that this ‘new’ partition has been added, we want to continue processing it along with the (last processed) partition (ie 2008).

In summary, our processing logic needs to look for the last n partitions based on the last process date of the measure group partitions.

I will explain this in the code below, but there is one little (perhaps very confusing) issue that I’ve found with Adventure Works, and you may find this in your version.  If I look at the properties for the Internet Sales 2008 partition, I can see the date it was last processed, its name and its ID.  In my code I’ll be using the ID, but take a look at the ID and the Name.  The ID is Internet_Sales_2004; however, the name is Internet_Sales_2008 … confusing, right?

image
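If you want to check this in your own database, a few lines of AMO (a sketch using the same server and database strings as the rest of this post) will list every partition’s Name alongside its ID.

using System;
using Microsoft.AnalysisServices;

class ListPartitionIds
{
    static void Main()
    {
        Server server = new Server();
        server.Connect(@"Data Source=.\SQL2008R2");

        Database db = server.Databases["Adventure Works DW 2008R2"];
        foreach (Cube cube in db.Cubes)
            foreach (MeasureGroup mg in cube.MeasureGroups)
                foreach (Partition p in mg.Partitions)
                    // Name is what you see in the designer; ID is what the processing configuration needs.
                    Console.WriteLine("{0} | {1} : Name = {2}, ID = {3}", cube.Name, mg.Name, p.Name, p.ID);

        server.Disconnect();
        server.Dispose();
    }
}

The processing configurations generated later use the ID, so it pays to confirm whether the name and the ID actually line up before you trust either.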

What I need is a way to iterate over the partitions in an SSAS database, determining which ones were processed last (say, for example, the last two partitions).  I can demonstrate this in the following pseudo code;

 
server <- connect to SSAS server
database <- connect to a database on server
foreach (cube in database)
{
    foreach (measure group in cube)
    {
        foreach (partition in measure group)
            get the last n processed partitions
    }
} 

If I run this against my Adventure Works instance, I may get something like this (assuming that I simply print the last two partitions).  We can see the Cube, Measure Group and Partitions at their individual levels.

image

Here’s the code to produce that output.  We can see the iteration over cubes, measure groups and partitions.  Note that I’ve also defined a function, Partitions, to return a sorted list of partitions with the PARTITION_NAME, PARTITION_ID and (last) PROCESS_DATE.  I also have to include a reference to AMO.


static void Main(string[] args)
{
    string _server_name = @"Data Source=.\SQL2008R2";
    string _database_name = "Adventure Works DW 2008R2";
    int _max_partitions = 2;
    

    Server _ssas_server = new Server();
    _ssas_server.Connect(_server_name); 

    Database _db = _ssas_server.Databases[_database_name];
    foreach (Cube _c in _db.Cubes)
    {
        Console.WriteLine(_c.Name);
        foreach (MeasureGroup _mg in _c.MeasureGroups)
        {
            Console.WriteLine("\t{0}", _mg.Name);
            var _listPartitions = Partitions(_server_name, _database_name, _c.ID, _mg.ID);
            for (int _i = 0; _i < _max_partitions & _i < _listPartitions.Count; _i++ )
                Console.WriteLine("\t\t{0},{1}", _listPartitions[_i].PARTITION_NAME, _listPartitions[_i].PROCESS_DATE);


            Console.WriteLine();
        }
    } 

} 



public static List<PartitionStruct> Partitions(string ServerName, string DatabaseName, string CubeId, string MeasureGroupId)
{
        var _ret = new List<PartitionStruct>(); 

        Server _ssas_server = new Server();
        _ssas_server.Connect(ServerName);
        Database _db = _ssas_server.Databases[DatabaseName];
        Cube _c = _db.Cubes[CubeId];
        MeasureGroup _mg = _c.MeasureGroups[MeasureGroupId]; 

        foreach (Partition _p in _mg.Partitions)
        {
                _ret.Add(new PartitionStruct(_p.Name.ToString(), _p.ID.ToString(), _p.LastProcessed));
        }
        _ssas_server.Disconnect();
        _ssas_server.Dispose(); 

        _ret.Sort((y, x) => x.PROCESS_DATE.CompareTo(y.PROCESS_DATE)); 

        return _ret;
} 

public struct PartitionStruct
{
        public readonly string PARTITION_NAME;
        public readonly string PARTITION_ID;
        public readonly DateTime PROCESS_DATE;
        public PartitionStruct(string partition_NAME, string partition_ID, DateTime process_DATE)
        {
                PARTITION_NAME = partition_NAME;
                PARTITION_ID = partition_ID;
                PROCESS_DATE = process_DATE;
        }
}

Over To Biml

Using biml, I simply want to create a package that has a single Analysis Services Processing task in it.  Partition processing is just a processing configuration within the AnalysisServicesProcessing task, and its use should become apparent once we see the generated biml.  We simply specify the partitions that we want to process.  Our complete biml script is;


<#@ assembly name="C:\Program Files (x86)\Microsoft SQL Server\110\SDK\Assemblies\Microsoft.AnalysisServices.dll" #> 

<#@ import namespace="System.Data" #>
<#@ import namespace="Microsoft.AnalysisServices" #> 

<#@ template language="C#"  #>
<#
    String _ssas_server = @".\SQL2008R2";
    String _ssas_database = "Adventure Works DW 2008R2";
    String _process_type = "ProcessData";
    int _max_partitions = 2;
#> 

<#+
       public static List<PartitionStruct> Partitions(string ServerName, string DatabaseName, string CubeId, string MeasureGroupId)
        {
            var _ret = new List<PartitionStruct>(); 

            Server _ssas_server = new Server();
            _ssas_server.Connect(ServerName);
            Database _db = _ssas_server.Databases[DatabaseName];
            Cube _c = _db.Cubes[CubeId];
            MeasureGroup _mg = _c.MeasureGroups[MeasureGroupId]; 

            foreach (Partition _p in _mg.Partitions)
            {
                _ret.Add(new PartitionStruct(_p.Name.ToString(), _p.ID.ToString(), _p.LastProcessed));
            }
            _ssas_server.Disconnect();
            _ssas_server.Dispose(); 

            _ret.Sort((y, x) => x.PROCESS_DATE.CompareTo(y.PROCESS_DATE)); 

            return _ret;
        } 

        public struct PartitionStruct
        {
            public readonly string PARTITION_NAME;
            public readonly string PARTITION_ID;
            public readonly DateTime PROCESS_DATE;
            public PartitionStruct(string partition_NAME, string partition_ID, DateTime process_DATE)
            {
                PARTITION_NAME = partition_NAME;
                PARTITION_ID = partition_ID;
                PROCESS_DATE = process_DATE;
            } 

        }
#>
<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Connections>
    <AnalysisServicesConnection Name="olap" ConnectionString="Data Source=<#= _ssas_server #>;PROVIDER=MSOLAP;Impersonation Level=Impersonate;" Server="<#= _ssas_server #>" />
  </Connections>
  <Packages>
    <Package Name="ProcessSassPartitions" ConstraintMode="Linear">
         <Tasks>
            <AnalysisServicesProcessing Name="Process Partitions" ConnectionName="olap" >
                <ProcessingConfigurations>     
                <#
                Server _ssas_server_ = new Server();
                _ssas_server_.Connect(_ssas_server); 

                Database _db = _ssas_server_.Databases[_ssas_database];
                foreach (Cube _c in _db.Cubes) 
                { 
                    foreach (MeasureGroup _mg in _c.MeasureGroups)
                    {  
                        List<PartitionStruct> _listPartitions = Partitions(_ssas_server, _ssas_database, _c.ID, _mg.ID);
                        for (int _i = 0; _i < _max_partitions & _i < _listPartitions.Count; _i++ ) { #>
                        <PartitionProcessingConfiguration DatabaseId="<#= _db.ID #>" CubeId="<#= _c.ID #>"  MeasureGroupId="<#= _mg.ID #>" PartitionId="<#= _listPartitions[_i].PARTITION_ID #>" ProcessingOption="ProcessData" />
                     <# }
                    }
                    } #>
                   </ProcessingConfigurations> 
            </AnalysisServicesProcessing>
        </Tasks>
    </Package>
  </Packages>
</Biml>

To see how this code generates biml, I want to validate the output, so let’s test it. In Adventure Works, the 2005 and 2006 partitions of the Internet Sales measure group were the last two processed.  These have the IDs Internet_Sales_2001 and Internet_Sales_2002 (the measure group ID is Fact Internet Sales 1).  Below, I can see the generated biml and verify that the correct partitions have been chosen.

image

Now I’ll process the 2008 partition (through SSMS), which has an ID of Internet_Sales_2004.  Assuming the 2006 partition was the last one processed in the prior step, I’d expect the 2006 and 2008 partitions to be in my processing list.  That’s exactly what I see (as in the screenshot below). 

image

Of course my generated package will show the names I expect (Internet_Sales_2008 and Internet_Sales_2006).

image

Conclusion

So what’s the purpose of this snippet?  It provides a way to automate the creation of an SSAS processing task by examining the metadata of existing database partitions.  We assume that we want to continue to process each measure group’s last n partitions, and the package is generated for us.

Biml XXI – A Framework for Dimension Processing

This post looks at building an SSIS package for dimension processing – we assume that you want fine control over the processing of your SSAS cube and so use different processing activities for dimensions, facts and the database (more on facts and databases later).    The reason why you would want to use a different processing option for each object relates to the processing window for your database.  If you have a large amount of fact data, it may not be possible to process the entire database within the allocated time, and so we move to an approach where you process the dimensions first as an update and then the measure groups (facts) that have changed data.  Dimensions have to be processed using an update so that the existing fact data is not cleared.

At the end of the post, we’ll have two snippets.  The first puts all dimensions into a single processing task; through that, you can specify the degree of parallelisation (it is part of the SSAS processing task).  The second puts individual processing tasks into a sequence container (with a parallel constraint mode).  The idea of the post, though, is just to specify the SSAS server and the database and then have biml generate the package for you.

Database Dimensions

In a previous post, I looked at extracting dimension data from a database.  It’s important to remember that (in a well-designed system) dimensions are shared across cubes (the dimension is a major object for the database), and therefore the start of our code is identifying the dimensions in the database.  In order to do that, I have a function that returns a dictionary of dimension IDs and display names.  The code requires a reference to the AMO assembly (on my system this was found at):

C:\Program Files (x86)\Microsoft SQL Server\110\SDK\Assemblies\Microsoft.AnalysisServices.DLL


using System.Collections.Generic;
using Microsoft.AnalysisServices; 

 

static Dictionary<string, string> DimensionKeys(string Server, string Database)
{
     Dictionary<string, string> _dimensions = new Dictionary<string, string>();
     Server _ssas_server = new Server();
     _ssas_server.Connect(Server); 

     Database _db = _ssas_server.Databases[Database];
     foreach (Dimension _dim in _db.Dimensions)
         _dimensions.Add(_dim.ID, _dim.Name); 

     _ssas_server.Disconnect();
     _ssas_server.Dispose(); 

     return _dimensions;
} 

The Analysis Services Processing Task

The SSIS control flow includes an SSAS processing task.  In biml, we can implement this with a simple example that processes the Account dimension of Adventure Works (as in the following).

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Connections>
    <AnalysisServicesConnection Name="olap" ConnectionString="Data Source=.\SQL2008R2;PROVIDER=MSOLAP;Impersonation Level=Impersonate;" Server=".\SQL2008R2" />
  </Connections>
  <Packages>
    <Package Name="ProcessSass" ConstraintMode="Linear">
         <Tasks>
            <AnalysisServicesProcessing Name="Process_Account" ConnectionName="olap">
                  <ProcessingConfigurations>
                    <DimensionProcessingConfiguration DatabaseId="Adventure Works DW 2008R2" ProcessingOption="ProcessUpdate"  DimensionId="Dim Account" />
                  </ProcessingConfigurations>
            </AnalysisServicesProcessing>
        </Tasks>
    </Package>
  </Packages>
</Biml> 

The important parts of the processing task are the processing option and the dimension ID (not its name).  Of course, the dimension ID may be different from its name (as the following screen image shows … remember that this is Adventure Works).

image

Naturally, if we want to process other dimensions as part of this task, we would include more DimensionProcessingConfiguration tags within the ProcessingConfigurations node.

Populating Configuration Tags

The first snippet (I’ve included the full snippet for ease of use) populates the configuration tags (as shown below)

image

I think the key takeaways for the code are the inclusion of the AnalysisServices assembly (we have not looked at this before) and the use of server, database and processing types as ‘control’ strings at the start of the code (meaning they will only need to be set once in the code).  The iteration over each dimension to add a processing configuration (dimension processing command) should not be unfamiliar if you have been following this series.


<#@ assembly name="C:\Program Files (x86)\Microsoft SQL Server\110\SDK\Assemblies\Microsoft.AnalysisServices.dll" #> 

<#@ import namespace="System.Collections.Generic" #>
<#@ import namespace="Microsoft.AnalysisServices" #> 

<#@ template language="C#" #> 

<#
    String _ssas_server = @".\SQL2008R2";
    String _ssas_database = "Adventure Works DW 2008R2";
    String _process_type = "ProcessUpdate";
#> 

<#+ 
Dictionary<string, string> DimensionKeys(string Server, string Database)
{
     Dictionary<string, string> _dimensions = new Dictionary<string, string>();
     Server _ssas_server = new Server();
     _ssas_server.Connect(Server); 

     Database _db = _ssas_server.Databases[Database];
     foreach (Dimension _dim in _db.Dimensions)
         _dimensions.Add(_dim.ID, _dim.Name); 

     _ssas_server.Disconnect();
     _ssas_server.Dispose(); 

     return _dimensions;
} 

#>

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Connections>
    <AnalysisServicesConnection Name="olap" ConnectionString="Data Source=<#= _ssas_server #>;PROVIDER=MSOLAP;Impersonation Level=Impersonate;" Server="<#= _ssas_server #>" />
  </Connections>
  <Packages>
    <Package Name="ProcessSass" ConstraintMode="Linear">
         <Tasks>
            <AnalysisServicesProcessing Name="Process_Dimensions" ConnectionName="olap">
                  <ProcessingConfigurations>
                    <# 
                    Dictionary<string, string> _dimensions = DimensionKeys("Data Source=" + _ssas_server , _ssas_database);
                    foreach (string _dim in _dimensions.Keys){ #>
                        <DimensionProcessingConfiguration DatabaseId="<#= _ssas_database #>" ProcessingOption="<#= _process_type #>"  DimensionId="<#= _dim #>" />
                    <# } #>
                  </ProcessingConfigurations>
            </AnalysisServicesProcessing>
        </Tasks>
    </Package>
  </Packages>
</Biml>

A Single Processing Task For Each Dimension

Alternatively, our strategy could be to create a processing task for each dimension and then wrap that in a container.  Note that the container has a parallel constraint which allows some parallel processing.  Our package will look like the following.

image


<#@ assembly name="C:\Program Files (x86)\Microsoft SQL Server\110\SDK\Assemblies\Microsoft.AnalysisServices.dll" #> 

<#@ import namespace="System.Collections.Generic" #>
<#@ import namespace="Microsoft.AnalysisServices" #> 

<#@ template language="C#" #> 

<#
    String _ssas_server = @".\SQL2008R2";
    String _ssas_database = "Adventure Works DW 2008R2";
    String _process_type = "ProcessUpdate";
#> 

<#+ 
Dictionary<string, string> DimensionKeys(string Server, string Database)
{
     Dictionary<string, string> _dimensions = new Dictionary<string, string>();
     Server _ssas_server = new Server();
     _ssas_server.Connect(Server); 

     Database _db = _ssas_server.Databases[Database];
     foreach (Dimension _dim in _db.Dimensions)
         _dimensions.Add(_dim.ID, _dim.Name); 

     _ssas_server.Disconnect();
     _ssas_server.Dispose(); 

     return _dimensions;
} 

#>

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Connections>
    <AnalysisServicesConnection Name="olap" ConnectionString="Data Source=<#= _ssas_server #>;PROVIDER=MSOLAP;Impersonation Level=Impersonate;" Server="<#= _ssas_server #>" />
  </Connections>
  <Packages>
    <Package Name="ProcessSass" ConstraintMode="Linear">
         <Tasks>
            <Container Name="Process Dimensions" ConstraintMode="Parallel">
            <Tasks>
            <# 
                Dictionary<string, string> _dimensions = DimensionKeys("Data Source=" + _ssas_server , _ssas_database);
                foreach (string _dim in _dimensions.Keys){ #>            
                <AnalysisServicesProcessing Name="Process_<#= _dim #>" ConnectionName="olap">
                  <ProcessingConfigurations>
                        <DimensionProcessingConfiguration DatabaseId="<#= _ssas_database #>" ProcessingOption="<#= _process_type #>"  DimensionId="<#= _dim #>" />
                  </ProcessingConfigurations>
            </AnalysisServicesProcessing>
            <# } #>
            </Tasks>
            </Container>
        </Tasks>
    </Package>
  </Packages>
</Biml>

Conclusion

This post includes two snippets for building dimension processing packages.  The approach (using ProcessUpdate) is required as the first stage of processing when the database is too big to be processed within the allowed window.

Understanding the LIFT CHART

The lift chart is synonymous with evaluating data mining model performance and the predictive power of one model against another. Often, in presentations and training sessions, it is suggested that the chart is indicative of the model’s ability to accurately predict within a training population. For example, the following explanation is provided:

“the lift chart shows that this model is good because it only needs to evaluate 30% of data in order to correctly predict all the target outcomes”

This type of statement is simply not true – it is INCORRECT, WRONG, MISLEADING and shows a lack of understanding about what the chart represents. This post looks at explaining the chart by examining how it is created –  seeking to remove some of the misconceptions about the use of the chart.

Consider the following example. In this chart it would be argued that an ideal model (red line) would only need ~55% of the population in order to predict all the target states. Without a model, we would need to use the entire population, and so our model (being somewhat useful) lies between the best model and a random guess. These values can be determined by the point at which each line reaches the top of the chart (note that at ~55% of the population, the best model has found 100% of the correct predictions).

Another common question arising from the interpretation of this chart occurs when we know that the target (predicted) state is found in only 55% of the population. The question is: “why does the chart show 100% when only 55% of the population can be in the predicted state – surely the Y axis should have a maximum of 55%?”

For my own analysis, I shall ask a question of the reader so that the construction of the chart can better be understood. The question is simple.

If my model does not predict the correct result 100% of the time, how could my accuracy ever reach 100%? Let’s be realistic, it would have to be a pretty impressive model to never be wrong – and yet that is what the chart always shows: 100% accuracy!

Now let’s look at construction

In order to create a lift chart (also referred to as a cumulative gains chart), the data mining model needs to be able to give the probability of its prediction. For example, we predict a state and the probability of that state. Within SSAS, this is achieved with the PredictProbability function.

Now, since we can include the probability of each prediction, we would naturally order the cases by their predicted probability, since that is the model’s estimate of how likely each case is to be in the predicted target state. Or, put another way, if I could only choose 10 cases (that is, observations from the testing data) to nominate as the predicted value, I would choose the top 10 cases based on their predicted probability – after all, the model is suggesting that these cases have the highest probability of being in the predicted state.

As we move through the data (and the predicted probability decreases), it is natural to expect the model to become less accurate – it will make more false predictions. So let’s summarise this data. For convenience, I have grouped the data into 10 bins, and each bin has ~320 cases (the red line below). Working with the assumption that the predictive power of my model decreases with probability, the number of correct predictions also decreases as we move through more of the data. This is clearly visible in the chart and data below – the first bin has high predictive power (275 correct predictions) while the last bin has only 96 correct predictions.

If I focus on the model’s ability to correctly predict values, I will notice that it can predict 1,764 correct results – but now let’s turn to the cumulative power of the model. If, from my sample data, I could only choose 322 cases (coincidentally, this is the number of cases in bin 1), I would choose all the cases from bin 1 and get 275 correct (or 16% of the possible correct values). If I had to choose 645 cases, I would choose the cases from bins 1 and 2 and get 531 correct (30% of the possible correct values). This continues as I make more predictions and is summarised in the following table.

This data is transposed onto the lift chart – the bin on the X axis (representing the % of the population) and the Percent Running Correct on the Y axis (representing the cumulative percentage of correct predictions). As we can see, the chart is indicative of the model’s ability to quickly make accurate predictions rather than its overall predictive accuracy.

Best and Random Cases

The chart also includes best and random cases as guidelines for prediction – let’s focus on these. These lines are theoretical – really ‘what if’ type of scenarios.

Suppose that I had an ideal model. If this were the case, my model would predict 322 correct in bin 1, 323 in bin 2 and so on – it must, because we have ordered the data by PredictProbability and, in a perfect world, we would get them all correct! However, there are only 1,764 correct values to be found – we know this from the actual results. Because of this, we would only need up to bin 6 to get all our correct values (see the column ‘Running Correct Best Case’ in the following table). Just as we did for the model prediction, we can convert this to a percentage of the total correct values and chart it with the model.

Now for the random guess – again this is theoretical. I know that I can only predict 1,764 correct values so, if these were evenly distributed amongst my bins, I would have ~176 correct predictions in each bin. This is then added to the chart.
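If it helps to see the arithmetic behind the three lines, here is a short C# sketch that rebuilds them from per-bin correct counts. Only the figures quoted above are real (275 correct in bin 1, 531 by bin 2, 96 in the last bin, 1,764 in total and ~322 cases per bin); the counts for bins 3 to 9 are invented filler so the example adds up.

using System;
using System.Linq;

class LiftChartConstruction
{
    static void Main()
    {
        // Correct predictions per bin: bins 1, 2 and 10 come from the post; bins 3-9 are
        // illustrative filler chosen so the total is the post's 1,764 correct predictions.
        int[] correctPerBin = { 275, 256, 230, 205, 180, 155, 135, 120, 112, 96 };
        int binSize = 322;                      // ~320 cases in each bin
        int totalCorrect = correctPerBin.Sum(); // 1,764

        int runningModel = 0;
        for (int bin = 1; bin <= correctPerBin.Length; bin++)
        {
            runningModel += correctPerBin[bin - 1];

            // Ideal model: every case chosen so far is a correct prediction, capped at the
            // total number of correct values that exist.
            int runningIdeal = Math.Min(bin * binSize, totalCorrect);

            // Random guess: the 1,764 correct values spread evenly across the 10 bins.
            double runningRandom = totalCorrect * bin / (double)correctPerBin.Length;

            Console.WriteLine("Bin {0,2}: model {1,4:P0}  ideal {2,4:P0}  random {3,4:P0}",
                bin,
                runningModel / (double)totalCorrect,
                runningIdeal / (double)totalCorrect,
                runningRandom / totalCorrect);
        }
    }
}

Running this reproduces the 16% and 30% figures above for bins 1 and 2, and shows the ideal line reaching 100% at bin 6 while the random line only gets there at bin 10.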

What is the Problem?

Now we can see that the chart is essentially just a view of how quickly the model makes accurate predictions.  Perhaps there is nothing wrong with that, but what happens when we compare models?  Well, in this case, the comparison is relative.  Those steps are reproduced for each model, and what you essentially see is relative, comparative performance.  Thus, the comparison of two models in the chart gives NO indication of predictive accuracy – after all, how could it, since each line plots the running percentage of its own correct predictions?

For this reason, relying on this chart as a sole measure of accuracy is just dangerous and really shows very little about the total accuracy of the model.

Conclusion

Understanding how the lift chart has been constructed can help in understanding how to interpret it. We can see that it indicates the cumulative power of the model to give predictions – or, perhaps more correctly, how quickly the model accumulates its correct predictions.