Biml XXIII – The Biml Script for the Entire SSAS Database

This final post on biml SSAS processing provides the full script to produce two packages.  It ties together the first two posts to give a complete solution for our processing scenario.  Remember, our goal is to point our biml at an SSAS database and have a processing package built that only processes the last two fact partitions of each measure group, based on their last processed date.

The First Method

Firstly, we look at a very conventional arrangement that processes dimensions, partitions and indexes.  This looks like the image below.  To get the best performance (the lowest processing time), we need to specify that each processing task uses parallelisation.

image

The code is similar to prior posts so we won't dwell on it too much.  Basically, we use AMO (note the reference to the AMO dll in the first line) to connect to our cube, discover dimensions and partitions, and add these in an iterative manner to each processing task.  The two important methods in the C# code are DimensionKeys (which returns all dimensions in the database) and Partitions (which returns the last two processed partitions for the intended measure group).

<#@ assembly name="C:\Program Files (x86)\Microsoft SQL Server\110\SDK\Assemblies\Microsoft.AnalysisServices.dll" #> 

<#@ import namespace="System.Collections.Generic" #>
<#@ import namespace="Microsoft.AnalysisServices" #> 

<#@ template language="C#" #> 

<#
    String _ssas_server = @".\SQL2008R2";
    String _ssas_database = "Adventure Works DW 2008R2";
    int _max_partitions = 2;
#> 

<#+ 
Dictionary<string, string> DimensionKeys(string Server, string Database)
{
     Dictionary<string, string> _dimensions = new Dictionary<string, string>();
     Server _ssas_server = new Server();
     _ssas_server.Connect(Server); 

     Database _db = _ssas_server.Databases[Database];
     foreach (Dimension _dim in _db.Dimensions)
         _dimensions.Add(_dim.ID, _dim.Name); 

     _ssas_server.Disconnect();
     _ssas_server.Dispose(); 

     return _dimensions;
} 

public static List<PartitionStruct> Partitions(string ServerName, string DatabaseName, string CubeId, string MeasureGroupId)
{
        var _ret = new List<PartitionStruct>(); 

        Server _ssas_server = new Server();
        _ssas_server.Connect(ServerName);
        Database _db = _ssas_server.Databases[DatabaseName];
        Cube _c = _db.Cubes[CubeId];
        MeasureGroup _mg = _c.MeasureGroups[MeasureGroupId]; 

        foreach (Partition _p in _mg.Partitions)
        {
                _ret.Add(new PartitionStruct(_p.Name.ToString(), _p.ID.ToString(), _p.LastProcessed));
        }
        _ssas_server.Disconnect();
        _ssas_server.Dispose(); 

        _ret.Sort((y, x) => x.PROCESS_DATE.CompareTo(y.PROCESS_DATE)); 

        return _ret;
} 

public struct PartitionStruct
{
        public readonly string PARTITION_NAME;
        public readonly string PARTITION_ID;
        public readonly DateTime PROCESS_DATE;
        public PartitionStruct(string partition_NAME, string partition_ID, DateTime process_DATE)
        {
                PARTITION_NAME = partition_NAME;
                PARTITION_ID = partition_ID;
                PROCESS_DATE = process_DATE;
        } 

} 

#>

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Connections>
    <AnalysisServicesConnection Name="olap" ConnectionString="Data Source=<#= _ssas_server #>;PROVIDER=MSOLAP;Impersonation Level=Impersonate;" Server="<#= _ssas_server #>" />
  </Connections>
  <Packages>
    <Package Name="SSAS_PROCESSING_ssas_process_all" ConstraintMode="Linear" ProtectionLevel="EncryptAllWithUserKey">
         <Tasks>
            <AnalysisServicesProcessing Name="Process_Dimensions_Update" ConnectionName="olap" ProcessingOrder="Parallel" >
              <ProcessingConfigurations>
                <# 
                Dictionary<string, string> _dimensions = DimensionKeys("Data Source=" + _ssas_server , _ssas_database);
                foreach (string _dim in _dimensions.Keys){ #>            
                        <DimensionProcessingConfiguration DatabaseId="<#= _ssas_database #>" ProcessingOption="ProcessUpdate"  DimensionId="<#= _dim #>" />
                <# } #>
                  </ProcessingConfigurations>
            </AnalysisServicesProcessing>
            <AnalysisServicesProcessing Name="Process_Partitions_Data" ConnectionName="olap" ProcessingOrder="Parallel">
                <ProcessingConfigurations>
                <#    
                Server _ssas_server_ = new Server();
                _ssas_server_.Connect(_ssas_server); 

                Database _db = _ssas_server_.Databases[_ssas_database];
                foreach (Cube _c in _db.Cubes) 
                { 
                    foreach (MeasureGroup _mg in _c.MeasureGroups)
                    {  
                        List<PartitionStruct> _listPartitions = Partitions(_ssas_server, _ssas_database, _c.ID, _mg.ID);
                        for (int _i = 0; _i < _max_partitions && _i < _listPartitions.Count; _i++ ) { #>
                        <PartitionProcessingConfiguration DatabaseId="<#= _db.ID #>" CubeId="<#= _c.ID #>"  MeasureGroupId="<#= _mg.ID #>" PartitionId="<#= _listPartitions[_i].PARTITION_ID #>" ProcessingOption="ProcessData" />
                     <# }
                    }
                    } 
                    #>
                </ProcessingConfigurations>
            </AnalysisServicesProcessing>
            <AnalysisServicesProcessing Name="Process_Indexes" ConnectionName="olap" ProcessingOrder="Parallel">
                <ProcessingConfigurations>
                    <# 
                    _db = _ssas_server_.Databases[_ssas_database];
                    foreach (Cube _c in _db.Cubes) { #>
                    <CubeProcessingConfiguration ProcessingOption="ProcessIndexes"  DatabaseID="<#= _db.ID #>" CubeId="<#= _c.ID #>"/>
                    <# } 
                       _ssas_server_.Disconnect();
                    #>
                </ProcessingConfigurations>
            </AnalysisServicesProcessing>
        </Tasks>
    </Package>
  </Packages>
</Biml>

A slight word of caution … if you look at the biml code, you will see that we explicitly set the ProcessingOrder to Parallel.  This specifies that the processing order (mode) for the task should be parallel and that the batch XMLA sent to your SSAS server should include that option.  Unfortunately, this may not materialise on all versions of BIDS and BIML, and the default (sequential) may be used instead.  I would check the outcome before implementation.
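If parallelism does not come through in the generated task, one fallback (outside of the generated package, and not part of the original script) is to issue the processing batch yourself through AMO, where parallel execution can be requested explicitly.  A minimal sketch, assuming the same server and database names used above:

using System;
using Microsoft.AnalysisServices;

static void ProcessInParallel()
{
    Server _srv = new Server();
    _srv.Connect(@".\SQL2008R2");
    Database _db = _srv.Databases["Adventure Works DW 2008R2"];

    _srv.CaptureXml = true;                       // queue the XMLA rather than executing it
    foreach (Dimension _dim in _db.Dimensions)
        _dim.Process(ProcessType.ProcessUpdate);  // commands are captured, not run

    _srv.CaptureXml = false;

    // send the captured commands as a single batch: transactional = true, parallel = true
    _srv.ExecuteCaptureLog(true, true);
    _srv.Disconnect();
}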

Method 2

Our second approach is essentially the same as the first; however, in this approach we encapsulate each measure group's processing commands in its own container, with sequential constraints between each partition process.  Why?  Because we want to ensure that our partitions are processed according to their last processed date.  That is, the last two processed partitions should be processed, and we should process them in the order that we originally processed them.  If we add a new partition, we want to ensure that the one with the oldest data drops off the list.  You’ll also notice that we use SSIS containers to gain parallelisation for each cube and measure group.

This code relies on the same methods (DimensionKeys and Partitions) used in the prior post to build its output.

image

<#@ assembly name="C:\Program Files (x86)\Microsoft SQL Server\110\SDK\Assemblies\Microsoft.AnalysisServices.dll" #>
<#@ import namespace="System.Collections.Generic" #>
<#@ import namespace="Microsoft.AnalysisServices" #> 
<#@ template language="C#" #> 

<#
    String _ssas_server_name = @".\SQL2008R2";
    String _ssas_database_name = "Adventure Works DW 2008R2";
    int _max_partitions = 2;
#>
<#+ 

Dictionary<string, string> DimensionKeys(string Server, string Database)
{
     Dictionary<string, string> _dimensions = new Dictionary<string, string>();
     Server _ssas_server = new Server();
     _ssas_server.Connect(Server); 

     Database _db = _ssas_server.Databases[Database];
     foreach (Dimension _dim in _db.Dimensions)
         _dimensions.Add(_dim.ID, _dim.Name); 

     _ssas_server.Disconnect();
     _ssas_server.Dispose(); 

     return _dimensions;
} 

public static List<PartitionStruct> Partitions(string ServerName, string DatabaseName, string CubeId, string MeasureGroupId, int PartitionNum)
{
        /* returns a number of partitions (PartitionNum)
         * based on their processed_date -> ie last n processed partitions
         * sorted in asc order
         */ 
        List<PartitionStruct> _ret = new List<PartitionStruct>(); 

        Server _ssas_server = new Server();
        _ssas_server.Connect(ServerName);
        Database _db = _ssas_server.Databases[DatabaseName];
        Cube _c = _db.Cubes[CubeId];
        MeasureGroup _mg = _c.MeasureGroups[MeasureGroupId]; 

        foreach (Partition _p in _mg.Partitions)
        {
                _ret.Add(new PartitionStruct(_p.Name.ToString(), _p.ID.ToString(), _p.LastProcessed));
        }
        _ssas_server.Disconnect();
        _ssas_server.Dispose(); 

        _ret.Sort((y, x) => x.PROCESS_DATE.CompareTo(y.PROCESS_DATE)); 

        /* get first PartitionNum */
        while (_ret.Count > PartitionNum)
                _ret.RemoveAt(PartitionNum); 

        // ret asc
        _ret.Sort((x, y) => x.PROCESS_DATE.CompareTo(y.PROCESS_DATE)); 
        return _ret;
} 

public struct PartitionStruct
{
        public readonly string PARTITION_NAME;
        public readonly string PARTITION_ID;
        public readonly DateTime PROCESS_DATE;
        public PartitionStruct(string partition_NAME, string partition_ID, DateTime process_DATE)
        {
                PARTITION_NAME = partition_NAME;
                PARTITION_ID = partition_ID;
                PROCESS_DATE = process_DATE;
        }
}
#> 

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
    <Connections>
        <AnalysisServicesConnection Name="olap" ConnectionString="Data Source=<#= _ssas_server_name #>;PROVIDER=MSOLAP;Impersonation Level=Impersonate;" Server="<#= _ssas_server_name #>" />
    </Connections>
    <Packages>
        <Package Name="SSAS_PROCESSING <#= _ssas_database_name #>" ConstraintMode="Linear">
            <Tasks> 

                <AnalysisServicesProcessing Name="Process_Dimensions_Update" ConnectionName="olap" ProcessingOrder="Parallel" >
                      <ProcessingConfigurations>
                        <# 
                            Dictionary<string, string> _dimensions = DimensionKeys("Data Source=" + _ssas_server_name , _ssas_database_name);
                                foreach (string _dim in _dimensions.Keys){ #>            
                                    <DimensionProcessingConfiguration DatabaseId="<#= _ssas_database_name #>" ProcessingOption="ProcessUpdate"  DimensionId="<#= _dim #>" />
                        <# } #>
                      </ProcessingConfigurations>
                </AnalysisServicesProcessing> 

                <Container Name="FACT_PROCESSING" ConstraintMode="Parallel">
                    <Tasks>
                    <#
                    Server _ssas_server_ = new Server();
                    _ssas_server_.Connect(_ssas_server_name);
                    Database _db = _ssas_server_.Databases[_ssas_database_name];
                    foreach (Cube _c in _db.Cubes) { #>
                        <Container Name="CUBE <#= _c.Name #>" ConstraintMode="Parallel">
                            <Tasks>
                                <# foreach (MeasureGroup _mg in _c.MeasureGroups){#>
                                <Container Name="MEASURE GROUP <#= _mg.Name #>" ConstraintMode="Linear">
                                    <Tasks>
                                        <# List<PartitionStruct> _listPartitions = Partitions(_ssas_server_name, _ssas_database_name, _c.ID, _mg.ID, _max_partitions);
                                           foreach (PartitionStruct _partition in _listPartitions) { #>
                                              <AnalysisServicesProcessing Name="PARTITION <#= _partition.PARTITION_NAME #>" ConnectionName="olap">
                                                  <ProcessingConfigurations>
                                                      <PartitionProcessingConfiguration CubeId="<#= _c.ID #>" DatabaseId="<#= _db.ID #>" MeasureGroupId="<#= _mg.ID #>" PartitionId="<#= _partition.PARTITION_ID #>" ProcessingOption="ProcessData" />
                                                  </ProcessingConfigurations>
                                              </AnalysisServicesProcessing> 
                                        <# }#>
                                    </Tasks>
                                </Container>
                                <# }  #>
                            </Tasks>
                        </Container>
                    <#} #>
                    </Tasks>
                </Container> 

            <AnalysisServicesProcessing Name="Process_Indexes" ConnectionName="olap" ProcessingOrder="Parallel">
                <ProcessingConfigurations>
                    <# 
                    _db = _ssas_server_.Databases[_ssas_database_name];
                    foreach (Cube _c in _db.Cubes) { #>
                    <CubeProcessingConfiguration ProcessingOption="ProcessIndexes"  DatabaseID="<#= _db.ID #>" CubeId="<#= _c.ID #>"/>
                    <# } 
                       _ssas_server_.Disconnect();
                    #>
                </ProcessingConfigurations>
            </AnalysisServicesProcessing> 

            </Tasks>
        </Package>
    </Packages>
</Biml>

Conclusion

You might suggest that this process is a very long way to build a processing schedule – and you may be right.  If you had to build it from scratch once, just to create a single package that does some processing, you might have achieved it faster by hand.  However, here’s the thing about automation … you don’t have to code it again: it’s there for you, you can use it to build out your packages, and now package creation takes 10 seconds!

Biml XXII – A Framework for Partition Processing

Following on from the last post on dimension processing, the next operation in our SSAS processing strategy is to process measure group partitions.  Remember our scenario: our SSAS database takes too long to process (when we try a full process), so we are forced to process dimensions using a ProcessUpdate and then specific partitions using ProcessData.  Of course, we only want to process those partitions that have the most recent data, and we want to automate this using biml.  That is, we want to point our biml at an SSAS database and have the package generated by some logic.

The Logic

image

Let’s assume that we have a group of partitions in each measure group (see left).  We will assume that the partitions processed last for the measure group are the most current and, therefore, we should choose the last n partitions to process (based on each partition’s last processed date).

If you think about how you would manage this in a system, consider adding 2009 data.  When 2009 comes along, we add a partition and process it.  Now that this ‘new’ partition has been added, we want to continue processing it along with the (last processed) partition (ie 2008).

In summary, our processing logic needs to look for the last n partitions based on the last process date of the measure group partitions.

I will explain this in the code below, but there is one little (perhaps very confusing) issue that I’ve found with Adventure Works, and you may find it in your version too.  If I look at the properties for the Internet Sales 2008 partition, I can see the date it was last processed, its name and its ID.  In my code I’ll be using the ID, but take a look at the ID and the Name: the ID is Internet_Sales_2004 whereas the name is Internet_Sales_2008 … confusing, right?

image
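If you want to check whether your own database shows the same mismatch, a short AMO sketch (using the server and database names from this post) will print each partition's ID alongside its Name and last processed date:

Server _ssas_server = new Server();
_ssas_server.Connect(@".\SQL2008R2");
Database _db = _ssas_server.Databases["Adventure Works DW 2008R2"];
foreach (Cube _c in _db.Cubes)
    foreach (MeasureGroup _mg in _c.MeasureGroups)
        foreach (Partition _p in _mg.Partitions)
            Console.WriteLine("{0}\t{1}\t{2}", _p.ID, _p.Name, _p.LastProcessed);
_ssas_server.Disconnect();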

What I need is a way to iterate over the partitions in an SSAS database, determining which ones were processed last (say, for example, the last 2 partitions).  I can demonstrate this in the following pseudo code:

 
server <- connect to SSAS server
database <- connect to a database on server
foreach (cube in database)
{
    foreach (measure group in cube)
    {
        foreach (partition in measure group)
            get the last n processed partitions
    }
} 

If I run this against my Adventure Works instance, I may get something like this (assuming that I simply print the last 2 partitions).  We can see the Cube, Measure Group and Partitions at their individual levels.

image

Here’s the code to produce that output.  We can see the iteration of cubes, measure groups and partitions.  Note that I’ve also defined a function Partitions to return a sorted list of partitions with the PARTITION_NAME, PARTITION_ID and (last) PROCESS_DATE.  I’ll also have to include a reference to AMO.


using System;
using System.Collections.Generic;
using Microsoft.AnalysisServices; 

static void Main(string[] args)
{
    string _server_name = @"Data Source=.\SQL2008R2";
    string _database_name = "Adventure Works DW 2008R2";
    int _max_partitions = 2;
    

    Server _ssas_server = new Server();
    _ssas_server.Connect(_server_name); 

    Database _db = _ssas_server.Databases[_database_name];
    foreach (Cube _c in _db.Cubes)
    {
        Console.WriteLine(_c.Name);
        foreach (MeasureGroup _mg in _c.MeasureGroups)
        {
            Console.WriteLine("\t{0}", _mg.Name);
            var _listPartitions = Partitions(_server_name, _database_name, _c.ID, _mg.ID);
            for (int _i = 0; _i < _max_partitions && _i < _listPartitions.Count; _i++ )
                Console.WriteLine("\t\t{0},{1}", _listPartitions[_i].PARTITION_NAME, _listPartitions[_i].PROCESS_DATE);


            Console.WriteLine();
        }
    } 

} 



public static List<PartitionStruct> Partitions(string ServerName, string DatabaseName, string CubeId, string MeasureGroupId)
{
        var _ret = new List<PartitionStruct>(); 

        Server _ssas_server = new Server();
        _ssas_server.Connect(ServerName);
        Database _db = _ssas_server.Databases[DatabaseName];
        Cube _c = _db.Cubes[CubeId];
        MeasureGroup _mg = _c.MeasureGroups[MeasureGroupId]; 

        foreach (Partition _p in _mg.Partitions)
        {
                _ret.Add(new PartitionStruct(_p.Name.ToString(), _p.ID.ToString(), _p.LastProcessed));
        }
        _ssas_server.Disconnect();
        _ssas_server.Dispose(); 

        _ret.Sort((y, x) => x.PROCESS_DATE.CompareTo(y.PROCESS_DATE)); 

        return _ret;
} 

public struct PartitionStruct
{
        public readonly string PARTITION_NAME;
        public readonly string PARTITION_ID;
        public readonly DateTime PROCESS_DATE;
        public PartitionStruct(string partition_NAME, string partition_ID, DateTime process_DATE)
        {
                PARTITION_NAME = partition_NAME;
                PARTITION_ID = partition_ID;
                PROCESS_DATE = process_DATE;
        }
}

Over To Biml

In using biml, I simply want to create a package that has a single PartitionProcessing task in it.  Partition processing is just a processing configuration for the AnalysisServicesProcessing task and its use should become apparent once we see the generated biml.  We simply specify the partition that we want to process.  Our complete biml script is;


<#@ assembly name="C:\Program Files (x86)\Microsoft SQL Server\110\SDK\Assemblies\Microsoft.AnalysisServices.dll" #> 

<#@ import namespace="System.Data" #>
<#@ import namespace="Microsoft.AnalysisServices" #> 

<#@ template language="C#"  #>
<#
    String _ssas_server = @".\SQL2008R2";
    String _ssas_database = "Adventure Works DW 2008R2";
    String _process_type = "ProcessData";
    int _max_partitions = 2;
#> 

<#+
       public static List<PartitionStruct> Partitions(string ServerName, string DatabaseName, string CubeId, string MeasureGroupId)
        {
            var _ret = new List<PartitionStruct>(); 

            Server _ssas_server = new Server();
            _ssas_server.Connect(ServerName);
            Database _db = _ssas_server.Databases[DatabaseName];
            Cube _c = _db.Cubes[CubeId];
            MeasureGroup _mg = _c.MeasureGroups[MeasureGroupId]; 

            foreach (Partition _p in _mg.Partitions)
            {
                _ret.Add(new PartitionStruct(_p.Name.ToString(), _p.ID.ToString(), _p.LastProcessed));
            }
            _ssas_server.Disconnect();
            _ssas_server.Dispose(); 

            _ret.Sort((y, x) => x.PROCESS_DATE.CompareTo(y.PROCESS_DATE)); 

            return _ret;
        } 

        public struct PartitionStruct
        {
            public readonly string PARTITION_NAME;
            public readonly string PARTITION_ID;
            public readonly DateTime PROCESS_DATE;
            public PartitionStruct(string partition_NAME, string partition_ID, DateTime process_DATE)
            {
                PARTITION_NAME = partition_NAME;
                PARTITION_ID = partition_ID;
                PROCESS_DATE = process_DATE;
            } 

        }
#>
<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Connections>
    <AnalysisServicesConnection Name="olap" ConnectionString="Data Source=<#= _ssas_server #>;PROVIDER=MSOLAP;Impersonation Level=Impersonate;" Server="<#= _ssas_server #>" />
  </Connections>
  <Packages>
    <Package Name="ProcessSassPartitions" ConstraintMode="Linear">
         <Tasks>
            <AnalysisServicesProcessing Name="Process Partitions" ConnectionName="olap" >
                <ProcessingConfigurations>     
                <#
                Server _ssas_server_ = new Server();
                _ssas_server_.Connect(_ssas_server); 

                Database _db = _ssas_server_.Databases[_ssas_database];
                foreach (Cube _c in _db.Cubes) 
                { 
                    foreach (MeasureGroup _mg in _c.MeasureGroups)
                    {  
                        List<PartitionStruct> _listPartitions = Partitions(_ssas_server, _ssas_database, _c.ID, _mg.ID);
                        for (int _i = 0; _i < _max_partitions && _i < _listPartitions.Count; _i++ ) { #>
                        <PartitionProcessingConfiguration DatabaseId="<#= _db.ID #>" CubeId="<#= _c.ID #>"  MeasureGroupId="<#= _mg.ID #>" PartitionId="<#= _listPartitions[_i].PARTITION_ID #>" ProcessingOption="ProcessData" />
                     <# }
                    }
                    } #>
                   </ProcessingConfigurations> 
            </AnalysisServicesProcessing>
        </Tasks>
    </Package>
  </Packages>
</Biml>

To see how this code generates biml, I want to validate the output, so let’s test it.  In Adventure Works, the 2005 and 2006 partitions in the Internet Sales measure group were the last two processed.  These have the IDs Internet_Sales_2001 and Internet_Sales_2002 (the measure group ID is Fact Internet Sales 1).  Below, I can see the generated biml and verify that the correct partitions have been chosen.

image

Now I’ll process the 2008 partition (through SSMS), which has an ID of Internet_Sales_2004.  Assuming I processed the 2006 partition last (in the prior step), I’d expect the 2006 and 2008 partitions to be in my processing list.  That’s exactly what I see (as in the screen shot below).

image

Of course my generated package will show the names I expect (Internet_Sales_2008 and Internet_Sales_2006).

image

Conclusion

So what’s the purpose of this snippet?  It provides a way to automate the creation of an SSAS processing task by examining the metadata of existing database partitions.  We assume that we want to continue to process each measure group’s last n partitions, and the package is generated for us.

Biml XXI – A Framework for Dimension Processing

This post looks at building an SSIS package for dimension processing – we assume that you want fine control over the processing of your SSAS cube and so use different processing activities for dimensions, facts and the database (more on facts and databases later).  The reason why you would want to use a different processing option for each object relates to the processing window for your database.  If you have a large amount of fact data, it may not be possible to process the entire database within the allocated time, so we move to an approach where you process the dimensions first as an update and then the measure groups (facts) that have changed data.  Dimensions have to be processed using an update so that the existing fact data is not cleared.

At the end of the post, we’ll have 2 snippets.  The first puts all dimensions into a single processing task; through that, you can specify the degree of parallelisation (it is part of the SSAS processing control).  The second puts individual processing tasks into a sequence container (with a parallel constraint mode).  The idea of the post, though, is just to specify the SSAS server and the database and then have biml generate the package for you.

Database Dimensions

In a previous post, I looked at extracting dimension data from a database.  It’s important to remember that (in a well designed system) dimensions are shared across cubes (the dimension is a major object for the database), and therefore the start of our code is identifying the dimensions in the database.  In order to do that, I have a function that returns a dictionary of dimension unique names (IDs) and display names.  The code requires a reference to the AMO assembly, which on my system was found at

C:\Program Files (x86)\Microsoft SQL Server\110\SDK\Assemblies\Microsoft.AnalysisServices.DLL .


using System.Collections.Generic;
using Microsoft.AnalysisServices; 

 

static Dictionary<string, string> DimensionKeys(string Server, string Database)
{
     Dictionary<string, string> _dimensions = new Dictionary<string, string>();
     Server _ssas_server = new Server();
     _ssas_server.Connect(Server); 

     Database _db = _ssas_server.Databases[Database];
     foreach (Dimension _dim in _db.Dimensions)
         _dimensions.Add(_dim.ID, _dim.Name); 

     _ssas_server.Disconnect();
     _ssas_server.Dispose(); 

     return _dimensions;
} 

The Analysis Services Processing Task

The SSIS control flow includes an SSAS Processing task.  In biml, we can implement this with a simple example that processes the Account dimension of Adventure Works (as follows).

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Connections>
    <AnalysisServicesConnection Name="olap" ConnectionString="Data Source=.\SQL2008R2;PROVIDER=MSOLAP;Impersonation Level=Impersonate;" Server=".\SQL2008R2" />
  </Connections>
  <Packages>
    <Package Name="ProcessSass" ConstraintMode="Linear">
         <Tasks>
            <AnalysisServicesProcessing Name="Process_Account" ConnectionName="olap">
                  <ProcessingConfigurations>
                    <DimensionProcessingConfiguration DatabaseId="Adventure Works DW 2008R2" ProcessingOption="ProcessUpdate"  DimensionId="Dim Account" />
                  </ProcessingConfigurations>
            </AnalysisServicesProcessing>
        </Tasks>
    </Package>
  </Packages>
</Biml> 

The important parts of the processing task are the processing option and the dimension ID (not its name).  Of course, the dimension ID may be different from its name, as the following screen image shows (remember that this is Adventure Works).

image
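To see the ID/Name pairs for your own database, the DimensionKeys function above can be used directly (the dictionary key is the dimension ID and the value is its display name).  A small usage sketch, assuming the connection details used in this post:

Dictionary<string, string> _dimensions = DimensionKeys(@"Data Source=.\SQL2008R2", "Adventure Works DW 2008R2");
foreach (KeyValuePair<string, string> _dim in _dimensions)
    Console.WriteLine("ID: {0}   Name: {1}", _dim.Key, _dim.Value);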

Naturally, if we want to process other dimensions as part of this task, we would include more DimensionProcessingConfiguration tags within the ProcessingConfigurations node.

Populating Configuration Tags

The first snippet (I’ve included the full snippet for ease of use) populates the configuration tags (as shown below).

image

I think the key takeaways from the code are the inclusion of the AnalysisServices assembly (we have not looked at this before) and the use of the server, database and processing type as ‘control’ strings at the start of the code (meaning they only need to be set once).  The iteration over each dimension to add a processing configuration (dimension processing command) should not be unfamiliar if you have been following this series.


<#@ assembly name="C:\Program Files (x86)\Microsoft SQL Server\110\SDK\Assemblies\Microsoft.AnalysisServices.dll" #> 

<#@ import namespace="System.Collections.Generic" #>
<#@ import namespace="Microsoft.AnalysisServices" #> 

<#@ template language="C#" #> 

<#
    String _ssas_server = @".\SQL2008R2";
    String _ssas_database = "Adventure Works DW 2008R2";
    String _process_type = "ProcessUpdate";
#> 

<#+ 
Dictionary<string, string> DimensionKeys(string Server, string Database)
{
     Dictionary<string, string> _dimensions = new Dictionary<string, string>();
     Server _ssas_server = new Server();
     _ssas_server.Connect(Server); 

     Database _db = _ssas_server.Databases[Database];
     foreach (Dimension _dim in _db.Dimensions)
         _dimensions.Add(_dim.ID, _dim.Name); 

     _ssas_server.Disconnect();
     _ssas_server.Dispose(); 

     return _dimensions;
} 

#>

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Connections>
    <AnalysisServicesConnection Name="olap" ConnectionString="Data Source=<#= _ssas_server #>;PROVIDER=MSOLAP;Impersonation Level=Impersonate;" Server="<#= _ssas_server #>" />
  </Connections>
  <Packages>
    <Package Name="ProcessSass" ConstraintMode="Linear">
         <Tasks>
            <AnalysisServicesProcessing Name="Process_Dimensions" ConnectionName="olap">
                  <ProcessingConfigurations>
                    <# 
                    Dictionary<string, string> _dimensions = DimensionKeys("Data Source=" + _ssas_server , _ssas_database);
                    foreach (string _dim in _dimensions.Keys){ #>
                        <DimensionProcessingConfiguration DatabaseId="<#= _ssas_database #>" ProcessingOption="<#= _process_type #>"  DimensionId="<#= _dim #>" />
                    <# } #>
                  </ProcessingConfigurations>
            </AnalysisServicesProcessing>
        </Tasks>
    </Package>
  </Packages>
</Biml>

A Single Processing Task For Each Dimension

Alternatively, our strategy could be to create a processing task for each dimension and then wrap them in a container.  Note that the container has a parallel constraint mode, which allows some parallel processing.  Our package will look like the following.

image


<#@ assembly name="C:\Program Files (x86)\Microsoft SQL Server\110\SDK\Assemblies\Microsoft.AnalysisServices.dll" #> 

<#@ import namespace="System.Collections.Generic" #>
<#@ import namespace="Microsoft.AnalysisServices" #> 

<#@ template language="C#" #> 

<#
    String _ssas_server = @".\SQL2008R2";
    String _ssas_database = "Adventure Works DW 2008R2";
    String _process_type = "ProcessUpdate";
#> 

<#+ 
Dictionary<string, string> DimensionKeys(string Server, string Database)
{
     Dictionary<string, string> _dimensions = new Dictionary<string, string>();
     Server _ssas_server = new Server();
     _ssas_server.Connect(Server); 

     Database _db = _ssas_server.Databases[Database];
     foreach (Dimension _dim in _db.Dimensions)
         _dimensions.Add(_dim.ID, _dim.Name); 

     _ssas_server.Disconnect();
     _ssas_server.Dispose(); 

     return _dimensions;
} 

#>

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Connections>
    <AnalysisServicesConnection Name="olap" ConnectionString="Data Source=<#= _ssas_server #>;PROVIDER=MSOLAP;Impersonation Level=Impersonate;" Server="<#= _ssas_server #>" />
  </Connections>
  <Packages>
    <Package Name="ProcessSass" ConstraintMode="Linear">
         <Tasks>
            <Container Name="Process Dimensions" ConstraintMode="Parallel">
            <Tasks>
            <# 
                Dictionary<string, string> _dimensions = DimensionKeys("Data Source=" + _ssas_server , _ssas_database);
                foreach (string _dim in _dimensions.Keys){ #>            
                <AnalysisServicesProcessing Name="Process_<#= _dim #>" ConnectionName="olap">
                  <ProcessingConfigurations>
                        <DimensionProcessingConfiguration DatabaseId="<#= _ssas_database #>" ProcessingOption="<#= _process_type #>"  DimensionId="<#= _dim #>" />
                  </ProcessingConfigurations>
            </AnalysisServicesProcessing>
            <# } #>
            </Tasks>
            </Container>
        </Tasks>
    </Package>
  </Packages>
</Biml>

Conclusion

This post includes two snippets for building dimension processing packages.  The approach (using ProcessUpdate) is required as the first stage of processing when the database is too big to be processed within the allowed window.

Biml XX – Automating Lookups (a fundamental dimensional process)

Perhaps the most fundamental activity in the creation of a dimension is the de-normalisation of data.  That is, we take the OLTP source data and consecutively build a wide (de-normalised) table describing the entity or subject area.  This has a distinct build process whereby we start with the lowest level of granularity and incrementally add fields to the structure by looking up attributes based on a key in the source data.  This post looks at automating that process in biml.

For example, consider creating a product dimension that has groupings of products.  We start with a base list of products (identified by ID) and include important attributes at that level (like name, colour, size, category grouping key etc).  Products are grouped into categories, so we would then look up attributes for the category (perhaps including a key for the next level of grouping).  Those categories could be grouped again, and so the process continues.

The base process that we are conducting is essentially a sequential lookup of data to append to our original extract.  I say sequential because we are often reliant on levels where hierarchies exist (and therefore order is important).

In SSIS, we might expect such a process to look like that on the right.

Of course, the package is very simple and includes no consideration of changing attributes.  This is for demonstration purposes.

image

A Different Approach based on SQL metadata?

If we want to automate this process (or, perhaps more correctly, automate the creation of a package that uses it), we could summarise our activity in 3 steps.  Firstly, we retrieve a base list of items to start with; then (while there is a need to) we add fields to that data based on the results of another query; finally, we do something with that list (like send it to a destination, as in the example above).

As far as a metadata (or instruction layer) approach goes, this can be defined through a header/detail structure that identifies an activity and the lookups that must be conducted in order for it to be completed.  For example;

Lookup a product base list
select 
ProductAlternateKey 
, EnglishProductName 
, ProductSubcategoryKey  
from dbo.DimProduct 

Lookup subcategory information (the first level in the hierarchy).  Note that there is an implicitly defined relationship between our ‘existing’ data and this query through the common field ProductSubcategoryKey.
select 
ProductSubcategoryKey 
, EnglishProductSubcategoryName 
, ProductCategoryKey  
from DimProductSubcategory 
Lookup category information (the second level).  Again, there is an implicit relationship, based on the ProductCategoryKey field, between our existing (sequentially built) table and this dataset.

select 
ProductCategoryKey , EnglishProductCategoryName 
from DimProductCategory  

An antagonist would suggest the same output could be achieved with a single query that joins the three tables.  True; however, this example is illustrative, and our goal in ETL is to combine data which may reside in disparate systems and therefore could not be combined through a single command.

Apart from the (already stated) assumption that data will be appended based on the first column of the ‘append query’ (for example, joining by ProductSubcategoryKey), we are also assuming that all fields in the lookup query will be added to the table (for example, subcategory name and category key will be added).  Also, since we already have the field ProductSubcategoryKey in the table, we don’t need to re-add it.

Finally, the logic for our entire ‘work object’ can be summarised in a tabular output, as the two images below show.  Here, a master table defines a job (later this will be transformed into a package) and an alternate table defines the list of lookups.  Note that the second dataset is returned via a procedure which is dependent on the name of the first.  Our entire build can be defined by these 3 lines :)

image 

image

Over to Biml

The entire biml required to generate the package(s) is given below.  I say package(s) here because package development is driven by the rows in our master and child procs (as shown above) – as many packages are created as there are rows in the master output.  Of course, package names, task names etc are defined (or derived) from the name fields in those outputs.

<#@ import namespace="System.Data" #>
<#@ import namespace="System.Data.SqlClient" #>
<#
    string _src_con_string = @"Data Source=.\SQL2012;Persist Security Info=true;Integrated Security=SSPI;Initial Catalog=AdventureWorks2012";
    string _stage_con_string = @"Data Source=.\SQL2012;Persist Security Info=true;Integrated Security=SSPI;Initial Catalog=stage";
    DataTable _transforms = new DataTable();
    SqlDataAdapter _transforms_da = new SqlDataAdapter("exec transforms_entities", _stage_con_string);
    _transforms_da.Fill(_transforms);    
#> 
<#+
    public string[] SQLCols(string SelectSql)
    {
        /* returns an array of column strings from a select statement */ 
        // remove the select
        SelectSql = SelectSql.Substring(SelectSql.ToUpper().IndexOf("SELECT") + 6); 
        // remove the From
        SelectSql = SelectSql.Substring(0, SelectSql.ToUpper().IndexOf("FROM ")); 
        // remove all spaces
        while (SelectSql.IndexOf(" ") != -1)
            SelectSql = SelectSql.Replace(" ", ""); 
        return SelectSql.Split(',');
    }
#>
<Biml xmlns="http://schemas.varigence.com/biml.xsd">
      <Connections>
        <OleDbConnection Name="adventure_works" ConnectionString="Provider=SQLNCLI11;Data Source=.\SQL2012;Persist Security Info=False;Integrated Security=SSPI;Initial Catalog=AdventureWorksDW2012" />
        <OleDbConnection Name="stage" ConnectionString="Provider=SQLNCLI11;Data Source=.\SQL2012;Persist Security Info=False;Integrated Security=SSPI;Initial Catalog=stage" />
      </Connections>
    <Packages>
        <# foreach (DataRow _transform in _transforms.Rows) { #>
            <Package Name="DIM_<#= _transform["name"] #>" ConstraintMode="Linear">
                <Tasks>
                    <Dataflow Name="DFT_build_<#= _transform["name"] #>">
                        <Transformations>
                            <OleDbSource Name="SRC_<#= _transform["name"] #>" ConnectionName="<#= _transform["connection"] #>">
                                <DirectInput>
                                    <#= _transform["src_command"] #>
                                </DirectInput>
                            </OleDbSource> 
                            <#  DataTable _lookups = new DataTable();
                                SqlDataAdapter _lookups_da = new SqlDataAdapter("exec [dbo].[transforms_lookups]  @transforms_entity='" + _transform["name"] + "'  ", _stage_con_string);
                                _lookups_da.Fill(_lookups);    
                                foreach (DataRow _lookup in _lookups.Rows){ #>
                                    <Lookup Name="LK_<#= _lookup["name"] #>" OleDbConnectionName="<#= _lookup["connection"] #>" NoMatchBehavior="IgnoreFailure">
                                    <DirectInput><#= _lookup["src_cmd"] #></DirectInput>
                                    <# string[] _col_list = SQLCols(_lookup["src_cmd"].ToString());#>
                                    <Inputs>
                                        <Column SourceColumn="<#= _col_list[0] #>" />
                                    </Inputs>
                                    <Outputs>
                                        <# for(int _i=1; _i<_col_list.Length; _i++){ #>
                                               <Column SourceColumn="<#= _col_list[_i] #>" />
                                        <# }#>
                                    </Outputs>
                                    </Lookup>
                            <# } #>
                                                                <OleDbDestination Name="TGT_<#=  _transform["tgt_connection"] #>" ConnectionName="<#= _transform["tgt_connection"] #>">
                                        <ExternalTableOutput Table="<#=  _transform["tgt_table"] #>" />
                                    </OleDbDestination>
                        </Transformations>
                    </Dataflow>
                </Tasks>
            </Package>
        <# } #> 
    </Packages>
</Biml> 

A quick walk through

The requirements for the lookup are addressed further below.  In our code, we iterate over the list of jobs (creating a package for each job) and define data flows (and subsequent lookups) based on the job (and child job) data.  Viewed in its entirety (and there is no requirement to implement it this way), the code may look a little confusing.  As pseudo code, I think it’s a little more readable.

job_list -> list of packages to create /* data from proc transforms_entities */ 
foreach (job in job_list)
{ 
    create a package (name = DIM_job_name)
    add a dataflow to the package (name = DFT_build_job_name) 
        add the source to the data flow (job_list.src_command)
        lookups -> results of proc transforms_lookup(job_list.name)
        foreach (lookup in lookups)
        {
            add a lookup 
                (name = LK_lookup.name)
                (source = lookup.src_cmd && lookup.connection)
                lookup_fields -> get list of lookup.src_cmd fields
                Add Input (source column) = lookup_fields.first_column
                foreach (lookup_field from the second position of lookup_fields)
                    Add Outputcolumn = lookup_field
        }
} 

Flexible lookup requirements

In our post on lookups, we talked about the minimum requirements for the lookup.  Apart from the connection and the query, each lookup needs to define the Inputs (how to match) and Outputs (what to return).  Given our SQL command (ie SQL Select), we know that the first field will be the join key/field and the other fields will be return fields.

Practically, there are several ways to do this.  We have used a very simple technique that deconstructs the select syntax to derive field names (no, it’s not bulletproof).  In our code, we rely on the function SQLCols to return an array of names and then derive each column’s role based on its position in the array.  The first position (index 0) is the lookup field and the remaining columns are returned as outputs, looping over the length of the array.
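As an illustration of what SQLCols returns (a sketch using one of the lookup queries from earlier in this post), the first element becomes the lookup's Input column and the rest become Output columns:

string[] _col_list = SQLCols("select ProductSubcategoryKey, EnglishProductSubcategoryName, ProductCategoryKey from DimProductSubcategory");
// _col_list[0] = "ProductSubcategoryKey"          -> lookup Input (the join column)
// _col_list[1] = "EnglishProductSubcategoryName"  -> lookup Output column
// _col_list[2] = "ProductCategoryKey"             -> lookup Output column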

Conclusion

This post continues to explore the concepts of data driven (or metadata driven) ETL development which provides a flexible method to generate consistent ETL packages.

Looking back over this post, I suspect that some would question the effort of the biml code for the output that we have achieved … there seems to be a lot of code to produce a single package with a dataflow task and a few lookups.  I agree that, to build one package, this may be overkill; however, there are several other considerations for the use of this approach.

  1. It’s a simple example to demonstrate a concept.
  2. Any automation technique has a setup-to-production trade-off.  That is, the more packages and tasks we create, the better the return we get from automation (that is, the more productive and faster our development becomes).
  3. The code produces consistent output (which can be changed relatively simply).
  4. The code is not dependent on an environment and can be used as a toolkit.  For example, if you are a consultant, you can reuse the snippet.  Once you have the code, all you have to do is update the table and any number of packages/transforms can be created.

A Look at BIReady – Data Warehouse Automation

I had previously discussed some of the considerations, implications and issues associated with the automation of data warehousing.  That article addressed some common discussion points that I have witnessed around resistance from IT/BI professionals who refuse to concede that their expertise could be replaced by an automation tool.  Of course, there are always extremes of opinion, application and circumstance, so there is never any clear-cut rule that can be applied in all circumstances (and for all clients).  For me, the question revolves around where the value-adding activities are, and manually developing ETL does not rank too high on my priority list.  So, the general benefits I see in warehouse automation revolve around 3 areas;

1. Management and a consistent approach since we expect the tool to transform our business based model to a physical model (and manage everything in between).

2. The people (expertise) factor – An ability to remove and reorganise the typical layers of development. Since the traditional development stages are moved to an automated process, the reliance on specific roles of the business intelligence life-cycle is also removed. We may have one person (perhaps not an IT specialist) responsible for development and management.

3. Return on investment – The ability to deliver data faster to those that need it. Delivering a prototype data warehouse within a week offers tangible benefits and reduces the capacity constraint associated with development.

Now to the point of this article – BIReady.

BIReady is a data warehouse automation tool (developed in the Netherlands) that has recently been purchased by Attunity (NASDAQ: ATTU). Attunity needs no introduction and so I expect there are plans on the horizon for the product through additional investment and support. Prior to that, one of the most noticeable observations about the company was their inclusion of Bill Inmon on their advisory board – great kudos there!

The application is managed through a (compiled) Access application. To be clear, this is just the front end that the user interacts with. Actually, there are two parts to the application: the BIReady Manager (the developer’s front end) and the BIReady Engine (a Java application), which is accessed by the front end and basically does all the work.

Application data is stored in a relational connection (you can choose from various vendor databases: SQL Server, Oracle, Teradata, DB2, MySQL); however, it is important to note that the front end does not store data in the Access engine. At first, I found the front end to be a bit ugly and dated – sorry, Access. For someone who works in IT, the mere mention of Access tends to send shivers down your spine. Form design (in Access) is generally unattractive; however, the operation of the application (menu design around distinct operational activities) mimics how one would expect to build a data warehouse. This is an important consideration because it permits fast-track development for novices and fits the development cycle for BIReady.

Perhaps the most robust feature of BIReady is the approach they’ve adopted in the development cycle. Here, they adopt a CIF (Corporate Information Factory) approach that builds a true data warehouse in 3rd Normal Form (or, alternatively, Data Vault) and then spins data marts off that for end user consumption. The data warehouse is designed around the data model (which can be reverse engineered or built from scratch), with the user identifying which attributes should be historically retained. Perhaps the best way to analyse its operation is therefore to look at its core operating modules: the data model, the data warehouse, the data mart and the ETL (Processing Instruction) Generator.

The data model defines entities and the relationships between them (screen shot shown below). Here, the entities and their attributes can be seen, along with the historical treatment of those attributes in the model (e.g. type 1 for not recording changes to the attribute, or type 2 for recording the change in values). An interesting (and noteworthy) addition here is the block feature for each attribute. The block specifies how the attributes are materialised in a physical table (with each block number defining a table). A single entity can therefore be broken up into multiple tables depending on how quickly we expect that data to change.

clip_image002[13]

Should you wish to jump-start development, the data model can be reverse engineered from source systems. Here, entities (you can choose which) and relationships in the source system can be reproduced in the BIReady model with only a few mouse clicks (regardless of the stage of development). There is (of course) a bone of contention as to whether this is good practice, since it places a heavy reliance on the design of the source (and the model may not be sustainable on a system change) … however, those discussions are for another day :).

The data warehouse is the physical implementation of the data model and can occur using either the Third Normal Form or Data Vault architecture. If you peek under the hood of the product, it is interesting to note that the entity and attribute names from the model are not materialised in the data warehouse.

For example, the Categories entity above is stored in two tables (given the two blocks described above) with system generated names. We can (should we desire) trace this structure from the model to the physical implementation; however, I think it shows the intent of BIReady to remove the user from that layer. The user is expected to access the data through a data mart.

clip_image004[7]

clip_image006[7]

 

A data mart modeller uses the data warehouse to build star schema(s). The application can support the creation of multiple data marts from the same data model, so it’s entirely possible (and arguably desirable) to siphon off functional schemas and deliver departmental reporting. The data mart editor (below) works against the data model (discussed above) to show the physical implementation of the data mart in a hierarchical manner (working from facts to dimensions). There is some flexibility in the designer: it allows entities to be combined (say, for example, including category data in the products table) and the attributes for that design to be selected. Naturally, the user can choose which attributes (and measures) they wish to show in the dimension and fact.

clip_image008[8]

Since the data mart is built using the data warehouse as its base, the designer can choose to show type 1 (or type 2) dimension versions regardless of the structure chosen in the warehouse design. This of course relies on type 2 attributes being recorded in the warehouse (why would we choose anything different?). The key point is that, if the data warehouse records type 2 changes, the designer has the ability to show a ‘current’ view of the data or an historic view with minimal effort.

The ETL (Processing Instruction) Generator manages the extraction of data from source systems, the population of the data warehouse and the subsequent population of the data marts (additionally, a data mart can be a virtual data mart, meaning it is built over views and no additional data processing is required). The generator can create multiple jobs (depending on how you want to configure it), which can be run through the application’s interface or as a command line batch file (calling on the BIReady engine to execute).

Of the product, I expect most of the work has gone into this part of the engine, and one of the benefits it brings is the ability to determine and apply parallel processing to its commands. That is, the engine determines what dependencies exist between objects and can then build parallel activities (should such an opportunity exist).

clip_image010[7]

The ability to manipulate processing instructions is where I have found the most power in the product (and, sadly, the most fun). Essentially, the processing instructions automate all of the manual work that an ETL developer would do. The product builds commands to extract source tables into a staging area (you can define the mapping) and then uses that area to populate the data warehouse. However, the product also allows you to create and execute scripts as part of that process. These occur against each entity at defined build stages (pre staging, post staging, pre data warehouse load, post data warehouse load and so on), so you can alter the underlying data as you see fit – for example, splitting names, salutation variation, pivot deconstruction etc. Perhaps not an option for a novice developer, but an option nevertheless.

A Quick Note on Data Warehouse Automation

Having seen BIReady implemented in a production environment, I think there is certainly a need to identify it as a data warehouse automation tool. This may sound obvious, but there are some implications of that function which need to be identified (or at least brought to attention).

BIReady manages everything related to its data warehouse, it builds and maintains all its tables, ETL and logging and you do have the opportunity to influence that process. It even manages its own versioning.

It is not an ETL tool as one might associate with traditional ETL development. Sounds obvious, right? However, the implication of this is that the data movement tasks generated by the product relate specifically to the generation and manipulation of data within the data warehouse and not to the individual components of extraction, transformation or loading. For example, you cannot create a job to simply move data from one table to another (other than where it relates to processing required by the model).

Final Thoughts?

Firstly, I’ve found BIReady to be extremely productive. A project could deliver a functional data warehouse and data marts in under a week – no problem. That’s a real bonus in my eyes.

BIReady is reliant on relational sources (it supports 7 RDBMS platforms), and source data must exist in one of those platforms in order to be consumed. This may be an issue for some; however, having seen many BI implementations where the sole purpose is to transform transactional system data into an informational perspective (aka a data mart), I don’t really see a problem there. In fact, I can easily imagine how the approach of building a data warehouse first and then the subsequent data marts could save a lot of pain on any BI project (of course, you have to have the foresight to retain history).

There is no doubt in my mind that there is a niche market for BIReady. Is it one size fits all? Certainly not, but then no product is – is it?

BIML XIX – Creating a Staging Area using Biml

Our prior posts (Part I and Part II) looked at using biml to extract data from a source system into staging tables. In those posts, we focused on using defined content to create the extracts and override the default column mapping that is available in biml. That approach relied on the existence of the target environment (which is a given). However, how do we create the staging environment? This post looks at how to do that using biml.

The goal of this post, therefore, is to create a staging environment using biml based on the system specification of a source system. Using system tables is handy for several reasons. Firstly, every database has them, so we can create a generic framework to define a staging environment. Secondly, we can use the information in them to alter our output (for example, an automatic conversion from an Oracle varchar2 type to a SQL Server nvarchar type). Thirdly, we can use biml script (more specifically, code functions in our script) to alter the values that are provided (for example, naming standards – the conversion of a column name ColumnName to column_name). Fourthly, we can override metadata (for example, substituting schemas). Finally, we can use biml to append fields to our target environment that do not exist in our source (for example, adding additional fields).
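To make the second and third points concrete, here is a sketch of two helper functions that could sit in a <#+ #> block of the biml script. Neither appears in the original script – the names (ToSnakeCase, MapDataType) and the type mappings are illustrative assumptions only:

// convert ColumnName -> column_name (a naming standard example)
static string ToSnakeCase(string ColumnName)
{
    System.Text.StringBuilder _sb = new System.Text.StringBuilder();
    for (int _i = 0; _i < ColumnName.Length; _i++)
    {
        if (char.IsUpper(ColumnName[_i]) && _i > 0)
            _sb.Append('_');
        _sb.Append(char.ToLower(ColumnName[_i]));
    }
    return _sb.ToString();
}

// map a source data type to a SQL Server staging type (e.g. Oracle varchar2 -> nvarchar)
static string MapDataType(string SourceType)
{
    switch (SourceType.ToLower())
    {
        case "varchar2": return "nvarchar";
        case "number": return "numeric";
        default: return SourceType;   // pass through anything we do not map
    }
}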

Table Iteration

The basis of our package is the ExecuteSql task. Our output will create a package with an ExecuteSql task for each table that we want to create. Therefore our proforma biml would look like the following.

<Packages>
	<Package Name="001_Create_Staging" ConstraintMode="Linear">
		<Tasks>
			<ExecuteSQL .... some create table statement … />
		</Tasks>
	</Package>
</Packages>

As a first step, we want to create a task for each table in our source system. Here, we don’t worry about defining columns – we just create the ExecuteSql placeholder. In order to do that, we retrieve a table of source schemas and table names (in SQL Server, we can use the INFORMATION_SCHEMA.TABLES system view for this). Then we iterate over each row, creating an ExecuteSql task for it (as below).

<#@ import namespace="System.Data" #>
<#@ import namespace="System.Data.SqlClient" #>
<#@ template language="C#" tier="2" #>
<#
	string _stage_con_string = @"Data Source=.\SQL2012;Persist Security Info=true;Integrated Security=SSPI;Initial Catalog=AdventureWorks2012";
	string _table_name_sql = "select TABLE_SCHEMA, TABLE_NAME from INFORMATION_SCHEMA.TABLES where TABLE_TYPE = 'BASE TABLE'";
	
	DataTable _table_names = new DataTable();
	SqlDataAdapter _table_name_da = new SqlDataAdapter(_table_name_sql, _stage_con_string);
    _table_name_da.Fill(_table_names);	
	
#>


<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Connections>
    <OleDbConnection Name="staging_db" ConnectionString="Provider=SQLNCLI11;Data Source=.\SQL2012;Persist Security Info=False;Integrated Security=SSPI;Initial Catalog=stage" />
  </Connections>
	<Packages>
		<Package Name="001_Create_Staging" ConstraintMode="Linear">
			<Tasks>
				<# foreach(DataRow _table in _table_names.Rows) {	#>			
				<ExecuteSQL Name="create_import_<#= _table["TABLE_SCHEMA"] #>_<#= _table["TABLE_NAME"] #>" ConnectionName="staging_db">
					<DirectInput>						 
create table import.[<#= _table["TABLE_SCHEMA"] + "_" + _table["TABLE_NAME"] #>] ()
					</DirectInput>
				</ExecuteSQL>
				<# } #>
			</Tasks>
		</Package>
	</Packages>
</Biml>

Here, a task is created for each table in our source, named create_import_ followed by the source schema and table name. The SQL we are planning to generate (see the DirectInput tag) creates a corresponding table in an import schema in our target environment. The name in the target environment is driven by the source table’s schema and name. For example, consider the results of our information_schema query.
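
The original post illustrates these results with an image; for reference, a few representative rows from AdventureWorks2012 would look something like this:

TABLE_SCHEMA    TABLE_NAME
Production      ScrapReason
Production      Product
Person          Person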

The table Production.ScrapReason would be created as import.Production_ScrapReason in our staging environment.

Column Iteration

Naturally, our next requirement is to retrieve a list of columns for each table that we are iterating through and append this to our create table syntax. This is a similar concept to the prior post where we used an inner loop – a loop that relies on the current table to produce a column list. We can do this by adding the biml script below. Note that at this stage we only retrieve the column specs for the current table and do not yet do anything with them.

<DirectInput>
	create table import.[<#= _table["TABLE_SCHEMA"] + "_" + _table["TABLE_NAME"] #>]
	(
	<#
		string _col_name_sql = "select COLUMN_NAME, DATA_TYPE,  CHARACTER_MAXIMUM_LENGTH, NUMERIC_PRECISION, NUMERIC_SCALE from INFORMATION_SCHEMA.COLUMNS where TABLE_SCHEMA='" + _table["TABLE_SCHEMA"] + "' and TABLE_NAME='"+ _table["TABLE_NAME"] + "' order by ORDINAL_POSITION ";
		
		DataTable _col_names = new DataTable();
		SqlDataAdapter _col_names_da = new SqlDataAdapter(_col_name_sql, _stage_con_string);
		_col_names_da.Fill(_col_names);

		// iterate over each column and get the definition 
		
	#>
	)
</DirectInput>

We can now see how to use the column specification to define the rest of the create table syntax. For example, consider the column specification of the table ScrapReason. By iterating over each row of that specification, we can append the table’s columns (including datatypes) to our create table syntax.

We should already have an idea of how to do this iteration – add some logic that looks at the values in each column to produce a string. For example, the specification for ScrapReason might produce something like this:

Create table import.production_scrapreason
(
	ScrapReasonId smallint(5,0)
	, Name nvarchar(50)
	, ModifiedDate datetime
)

However, generating the syntax like this is not without issue, and the above create table will fail because of the smallint(5,0) specification. What we need is some logic that can take a row and, based on the datatype, produce the appropriate column specification. This would also allow us to choose what is created (as a result of the source data type) and to translate between datatypes. Consider the following function, which does exactly this. It outputs a string which uses the column name and datatype to determine what precision and length values to apply: a numeric type uses both the precision and scale, a float type uses only the precision, and text types use the character maximum length.

public string RowConversion(DataRow Row)
{
    string _ret = "[" + Row["COLUMN_NAME"] + "] " + Row["DATA_TYPE"];


    switch (Row["DATA_TYPE"].ToString().ToUpper())
    {
        case "NVARCHAR":
        case "VARCHAR":
        case "NCHAR":
        case "CHAR":
        case "BINARY":
        case "VARBINARY":
            if (Row["CHARACTER_MAXIMUM_LENGTH"].ToString() == "-1")
                _ret += "(max)";
            else
                _ret += "(" + Row["CHARACTER_MAXIMUM_LENGTH"] + ")";
            break;

        case "NUMERIC":
            _ret += "(" + Row["NUMERIC_PRECISION"] + "," + Row["NUMERIC_SCALE"] + ")";
            break;

        case "FLOAT":
            _ret += "(" + Row["NUMERIC_PRECISION"] + ")";
            break;
    }
  
    return _ret;
}

What we would like to do is include this function in our biml script, and we can do that by adding another code block (identified by <#+ …code… #>). Since the same function will be used regardless of where we are in the code (the function applies to the entire file), the snippet can be added to the top section of our biml file. Our create table syntax then becomes a matter of calling that function (as below).

<DirectInput>
	create table import.[<#= _table["TABLE_SCHEMA"] + "_" + _table["TABLE_NAME"] #>]
	(
	<#
		string _col_name_sql = "select COLUMN_NAME, DATA_TYPE,  CHARACTER_MAXIMUM_LENGTH, NUMERIC_PRECISION, NUMERIC_SCALE from INFORMATION_SCHEMA.COLUMNS where TABLE_SCHEMA='" + _table["TABLE_SCHEMA"] + "' and TABLE_NAME='"+ _table["TABLE_NAME"] + "' order by ORDINAL_POSITION ";
		
		DataTable _col_names = new DataTable();
		SqlDataAdapter _col_names_da = new SqlDataAdapter(_col_name_sql, _stage_con_string);
		_col_names_da.Fill(_col_names);

		for (int _i=0; _i<_col_names.Rows.Count ; _i++  )
		{
			DataRow _r = _col_names.Rows[_i];
			
			if (_i == 0)
				WriteLine(RowConversion(_r));
			else
				WriteLine(", " + RowConversion(_r));								
		}
		
	#>
	)
</DirectInput>

Here I have used a slightly different iteration technique, but the key piece of code is the function call RowConversion(_r). Since I need to include commas between column definitions, and a definition may require a leading comma, I use the row’s position to determine whether a comma is required: the first row has no comma and all subsequent rows have a leading comma. This produces the following biml snippet.

<DirectInput>
	create table import.[Production_ScrapReason]
	(
	[ScrapReasonID] smallint
	, [Name] nvarchar(50)
	, [ModifiedDate] datetime
	)
</DirectInput>
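
As an aside, the same comma handling could be achieved by collecting the column definitions first and letting string.Join supply the separators. A minimal alternative sketch (assuming an additional import of System.Collections.Generic) is below; it is not the approach used in the final solution.

// Alternative sketch: build all column definitions, then join them so the
// separating commas do not depend on row position.
List<string> _col_defs = new List<string>();
foreach (DataRow _r in _col_names.Rows)
    _col_defs.Add(RowConversion(_r));

WriteLine(string.Join("\n\t, ", _col_defs));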

Custom Columns

If we want to add additional columns to each table (for example, an append_dt column), we can simply add those columns as static biml in our definition (notice that the complete solution below includes an append_dt field in the DirectInput tag). Naturally, column names must be unique within each table, so the command will fail should a column of the same name already exist in the source table. We can solve this with script, and that may be the subject of a future post; a possible sketch follows.
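
One possible (hypothetical) guard, assuming it runs inside the column code block where _col_names is already in scope, would be to emit the extra column only when the source table does not already contain one of the same name:

// Hypothetical guard: only emit the additional column if the source table
// does not already have a column called append_dt.
bool _has_append_dt = false;
foreach (DataRow _c in _col_names.Rows)
    if (_c["COLUMN_NAME"].ToString().ToLower() == "append_dt")
        _has_append_dt = true;

if (!_has_append_dt)
    WriteLine(", append_dt datetime");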

Final Solution

For completeness, I have included the complete solution below.

<#@ import namespace="System.Data" #>
<#@ import namespace="System.Data.SqlClient" #>
<#@ template language="C#" tier="2" #>
<#
	string _stage_con_string = @"Data Source=.\SQL2012;Persist Security Info=true;Integrated Security=SSPI;Initial Catalog=AdventureWorks2012";
	string _table_name_sql = "select TABLE_SCHEMA, TABLE_NAME from INFORMATION_SCHEMA.TABLES where TABLE_TYPE = 'BASE TABLE'";
	
	DataTable _table_names = new DataTable();
	SqlDataAdapter _table_name_da = new SqlDataAdapter(_table_name_sql, _stage_con_string);
    _table_name_da.Fill(_table_names);	
	
#>
<#+ 
public string RowConversion(DataRow Row)
{
    string _ret = "[" + Row["COLUMN_NAME"] + "] " + Row["DATA_TYPE"];


    switch (Row["DATA_TYPE"].ToString().ToUpper())
    {
        case "NVARCHAR":
        case "VARCHAR":
        case "NCHAR":
        case "CHAR":
        case "BINARY":
        case "VARBINARY":
            if (Row["CHARACTER_MAXIMUM_LENGTH"].ToString() == "-1")
                _ret += "(max)";
            else
                _ret += "(" + Row["CHARACTER_MAXIMUM_LENGTH"] + ")";
            break;

        case "NUMERIC":
            _ret += "(" + Row["NUMERIC_PRECISION"] + "," + Row["NUMERIC_SCALE"] + ")";
            break;

        case "FLOAT":
            _ret += "(" + Row["NUMERIC_PRECISION"] + ")";
            break;
    }
  
    return _ret;
}
#> 

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Connections>
    <OleDbConnection Name="staging_db" ConnectionString="Provider=SQLNCLI11;Data Source=.\SQL2012;Persist Security Info=False;Integrated Security=SSPI;Initial Catalog=stage" />
  </Connections>
	<Packages>
		<Package Name="001_Create_Staging" ConstraintMode="Linear">
			<Tasks>
				<# foreach(DataRow _table in _table_names.Rows) {	#>			
				<ExecuteSQL Name="create_import_<#= _table["TABLE_SCHEMA"] #>_<#= _table["TABLE_NAME"] #>" ConnectionName="staging_db">
					<DirectInput>
						create table import.[<#= _table["TABLE_SCHEMA"] + "_" + _table["TABLE_NAME"] #>]
						(
						<#
							
							string _col_name_sql = "select COLUMN_NAME, DATA_TYPE,  CHARACTER_MAXIMUM_LENGTH, NUMERIC_PRECISION, NUMERIC_SCALE from INFORMATION_SCHEMA.COLUMNS where TABLE_SCHEMA='" + _table["TABLE_SCHEMA"] + "' and TABLE_NAME='"+ _table["TABLE_NAME"] + "' order by ORDINAL_POSITION ";

							DataTable _col_names = new DataTable();
							SqlDataAdapter _col_names_da = new SqlDataAdapter(_col_name_sql, _stage_con_string);
							_col_names_da.Fill(_col_names);

							for (int _i = 0; _i < _col_names.Rows.Count; _i++)
							{
								DataRow _r = _col_names.Rows[_i];

								if (_i == 0)
									WriteLine(RowConversion(_r));
								else
									WriteLine(", " + RowConversion(_r));
							}
							
						#>
						, append_dt datetime
						)
					</DirectInput>
				</ExecuteSQL>
				<# } #>
			</Tasks>
		</Package>
	</Packages>
</Biml>

Conclusion

This post has looked at using biml to create a staging area. A package is created that includes an ExecuteSql task for each table found in the source database’s information_schema. The approach is flexible and offers many advantages, which revolve around how the target tables can be altered (a different target schema, additional columns, datatype conversion and so on).