Biml XX – Automating Lookups (a fundamental dimensional process)


Perhaps the most fundamental activity in the creation of a dimension is the de-normalisation of data.  That is, we take the OLTP source data and progressively build a wide (de-normalised) table describing the entity or subject area.  The build follows a distinct pattern: we start with the lowest level of granularity and incrementally add fields to the structure by looking up attributes based on a key in the source data.  This post looks at automating that process in biml.

For example, consider creating a product dimension that has groupings of products.  We start with a base list of products (each identified by its ID) and include the important attributes at that level (like name, colour, size, a grouping key etc).  Products are grouped into categories, so we would then look up attributes for the category (perhaps including a key for the next level of grouping).  Those categories could be grouped again and so the process continues.

The base process that we are conducting is essentially a sequential lookup of data to append to our original extract.  I say sequential because, where hierarchies exist, each level relies on the one before it (and therefore order is important).

In SSIS, we might expect such a process to look like the package shown below.

Of course, the package is very simple and includes no consideration of changing attributes.  This is for demonstration purposes.

[image: SSIS data flow with a source, sequential lookups and a destination]

A Different Approach based on SQL metadata?

If we want to automate this process (or, perhaps more correctly, automate the creation of a package that implements it), we can summarise the activity in 3 steps.  Firstly, we retrieve a base list of items.  Then (while there is a need to) we add fields to that data based on the results of another query.  Finally, we do something with the result (like send it to a destination, as in the example above).

As far as a metadata (or instruction layer) approach goes, this can be defined through a header/detail structure that identifies an activity and the lookups that must be conducted in order for it to be completed.  For example:

Look up a product base list.
select 
ProductAlternateKey 
, EnglishProductName 
, ProductSubcategoryKey  
from dbo.DimProduct 

Look up subcategory information (the first level in the hierarchy).  Note that there is an implicitly defined relationship between our ‘existing’ data and this query through the common field ProductSubcategoryKey.
select 
ProductSubcategoryKey 
, EnglishProductSubcategoryName 
, ProductCategoryKey  
from DimProductSubcategory 
Look up category information (the second level).  Again, there is an implicit relationship based on the ProductCategoryKey field between our existing data (the sequentially built table) and this dataset.

select 
ProductCategoryKey 
, EnglishProductCategoryName 
from DimProductCategory  

An antagonist would suggest that the same output could be achieved with a single query that joins the three tables.  True; however, this example is illustrative, and our goal in ETL is to combine data which may reside in disparate systems and therefore could not be combined through a single command.
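
For reference, if all three tables did live in the same database, the equivalent single statement might look something like the sketch below (the left joins mirror the lookups’ ignore-failure behaviour):

select 
p.ProductAlternateKey 
, p.EnglishProductName 
, s.EnglishProductSubcategoryName 
, c.EnglishProductCategoryName 
from dbo.DimProduct p 
left join DimProductSubcategory s on s.ProductSubcategoryKey = p.ProductSubcategoryKey 
left join DimProductCategory c on c.ProductCategoryKey = s.ProductCategoryKey 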

Apart from the (already stated) assumption that data will be appended based on the first column of the ‘append query’ (for example, joining by ProductSubcategoryKey), we are also assuming that all remaining fields in the lookup query will be added to the table (for example, EnglishProductSubcategoryName and ProductCategoryKey will be added).  Also, since the field ProductSubcategoryKey already exists in the table, we don’t need to re-add it.

Finally, the logic for our entire ‘work object’ can be summarised in tabular output, as the two images below show.  Here, a master table defines a job (which will later be transformed into a package) and a second (detail) table defines the list of lookups.  Note that the second dataset is returned via a procedure which is dependent on the name of the first.  Our entire build can be defined by these 3 rows :).

[image: the master (job) table output]

[image: the lookup (detail) table output]
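
For completeness, a minimal sketch of what might sit behind those two procedures is shown below.  The procedure names and the column names (name, connection, src_command, tgt_connection, tgt_table and src_cmd) are the ones the biml below expects; the backing table names and data types are purely illustrative.

create table dbo.transforms_entity 
( 
  name varchar(50)                -- job name; becomes the package name (DIM_<name>) 
  , connection varchar(50)        -- source connection name (eg adventure_works) 
  , src_command varchar(max)      -- the base select 
  , tgt_connection varchar(50)    -- destination connection name (eg stage) 
  , tgt_table varchar(128)        -- destination table 
); 

create table dbo.transforms_lookup 
( 
  transforms_entity varchar(50)   -- links the lookup to its job (transforms_entity.name) 
  , name varchar(50)              -- lookup name; becomes LK_<name> 
  , connection varchar(50)        -- lookup connection name 
  , src_cmd varchar(max)          -- the lookup select (first column is the join key) 
); 
go 

create proc dbo.transforms_entities as 
  select name, connection, src_command, tgt_connection, tgt_table 
  from dbo.transforms_entity; 
go 

create proc dbo.transforms_lookups @transforms_entity varchar(50) as 
  select name, connection, src_cmd 
  from dbo.transforms_lookup 
  where transforms_entity = @transforms_entity; 
go 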

Over to Biml

The entire biml required to generate this package(s) is given below.  I say package(s) here because as many packages are created as there are rows in our master proc, with each package’s lookups driven by the child proc (as shown above).  Of course, package names, task names etc are defined (or derived) from the name fields in those outputs.

<#@ import namespace="System.Data" #>
<#@ import namespace="System.Data.SqlClient" #>
<#
    string _src_con_string = @"Data Source=.\SQL2012;Persist Security Info=true;Integrated Security=SSPI;Initial Catalog=AdventureWorks2012";
    string _stage_con_string = @"Data Source=.\SQL2012;Persist Security Info=true;Integrated Security=SSPI;Initial Catalog=stage";
    DataTable _transforms = new DataTable();
    SqlDataAdapter _transforms_da = new SqlDataAdapter("exec transforms_entities", _stage_con_string);
    _transforms_da.Fill(_transforms);    
#> 
<#+
    public string[] SQLCols(string SelectSql)
    {
        /* returns an array of column names from a simple select statement 
           (assumes a flat column list with no aliases or functions) */ 
        // remove everything up to and including the select keyword
        SelectSql = SelectSql.Substring(SelectSql.ToUpper().IndexOf("SELECT") + 6); 
        // remove the From
        SelectSql = SelectSql.Substring(0, SelectSql.ToUpper().IndexOf("FROM ")); 
        // remove all spaces
        while (SelectSql.IndexOf(" ") != -1)
            SelectSql = SelectSql.Replace(" ", ""); 
        return SelectSql.Split(',');
    }
#>
<Biml xmlns="http://schemas.varigence.com/biml.xsd">
      <Connections>
        <OleDbConnection Name="adventure_works" ConnectionString="Provider=SQLNCLI11;Data Source=.\SQL2012;Persist Security Info=False;Integrated Security=SSPI;Initial Catalog=AdventureWorksDW2012" />
        <OleDbConnection Name="stage" ConnectionString="Provider=SQLNCLI11;Data Source=.\SQL2012;Persist Security Info=False;Integrated Security=SSPI;Initial Catalog=stage" />
      </Connections>
    <Packages>
        <# foreach (DataRow _transform in _transforms.Rows) { #>
            <Package Name="DIM_<#= _transform["name"] #>" ConstraintMode="Linear">
                <Tasks>
                    <Dataflow Name="DFT_build_<#= _transform["name"] #>">
                        <Transformations>
                            <OleDbSource Name="SRC_<#= _transform["name"] #>" ConnectionName="<#= _transform["connection"] #>">
                                <DirectInput>
                                    <#= _transform["src_command"] #>
                                </DirectInput>
                            </OleDbSource> 
                            <#  DataTable _lookups = new DataTable();
                                SqlDataAdapter _lookups_da = new SqlDataAdapter("exec [dbo].[transforms_lookups]  @transforms_entity='" + _transform["name"] + "'  ", _stage_con_string);
                                _lookups_da.Fill(_lookups);    
                                foreach (DataRow _lookup in _lookups.Rows){ #>
                                    <Lookup Name="LK_<#= _lookup["name"] #>" OleDbConnectionName="<#= _lookup["connection"] #>" NoMatchBehavior="IgnoreFailure">
                                    <DirectInput><#= _lookup["src_cmd"] #></DirectInput>
                                    <# string[] _col_list = SQLCols(_lookup["src_cmd"].ToString());#>
                                    <Inputs>
                                        <Column SourceColumn="<#= _col_list[0] #>" />
                                    </Inputs>
                                    <Outputs>
                                        <# for(int _i=1; _i<_col_list.Length; _i++){ #>
                                               <Column SourceColumn="<#= _col_list[_i] #>" />
                                        <# }#>
                                    </Outputs>
                                    </Lookup>
                            <# } #>
                            <OleDbDestination Name="TGT_<#= _transform["tgt_connection"] #>" ConnectionName="<#= _transform["tgt_connection"] #>">
                                <ExternalTableOutput Table="<#= _transform["tgt_table"] #>" />
                            </OleDbDestination>
                        </Transformations>
                    </Dataflow>
                </Tasks>
            </Package>
        <# } #> 
    </Packages>
</Biml> 

A quick walk through

The requirements for the lookup are discussed in the next section.  In our code, we iterate over the list of jobs (creating a package for each job) and define data flows (and their lookups) based on the job (and child job) data.  Viewed in its entirety (and there is no requirement to implement it this way), the code may look a little confusing.  As pseudo code, I think it’s a little more readable.

job_list -> list of packages to create /* data from proc transforms_entities */ 
foreach (job in job_list)
{ 
    create a package (name = DIM_job_name)
    add a dataflow to the package (name = DFT_build_job_name) 
        add the source to the data flow (job_list.src_command)
        lookups -> results of proc transforms_lookups(job_list.name)
        foreach (lookup in lookups)
        {
            add a lookup 
                (name = LK_lookup.name)
                (source = lookup.src_cmd && lookup.connection)
                lookup_fields -> get list of lookup.src_cmd fields
                Add Input (source column) = lookup_fields.first_column
                foreach (lookup_field from the second position of lookup_fields)
                    Add Outputcolumn = lookup_field
        }
} 

Flexible lookup requirements

In our post on lookups, we talked about the minimum requirements for the lookup.  Apart from the connection and the query, each lookup needs to define the Inputs (how to match) and Outputs (what to return).  Given our SQL command (i.e. the SQL select), we know that the first field will be the join key and the other fields will be the return fields.
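
Applying that convention to the subcategory query above, the generated lookup would look something like this (the Name and OleDbConnectionName values come from the detail row, so the ones shown here are illustrative):

<Lookup Name="LK_subcategory" OleDbConnectionName="adventure_works" NoMatchBehavior="IgnoreFailure">
    <DirectInput>select ProductSubcategoryKey, EnglishProductSubcategoryName, ProductCategoryKey from DimProductSubcategory</DirectInput>
    <Inputs>
        <!-- first column in the select: how to match -->
        <Column SourceColumn="ProductSubcategoryKey" />
    </Inputs>
    <Outputs>
        <!-- remaining columns: what to return -->
        <Column SourceColumn="EnglishProductSubcategoryName" />
        <Column SourceColumn="ProductCategoryKey" />
    </Outputs>
</Lookup>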

Practically, there are several ways to do this.  We have used a very simple technique that deconstructs the select syntax to derive field names (no, it’s not bullet proof).  In our code, we rely on the function SQLCols to return an array of column names and then derive each column’s role from its position in the array.  The first position (index 0) is the lookup field, and the remaining columns are returned as outputs through a loop based on the length of the array.
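
As a quick sanity check, this is roughly what SQLCols gives back for the subcategory query (note the assumption of a flat select list: no aliases, functions or nested selects):

string[] _cols = SQLCols("select ProductSubcategoryKey, EnglishProductSubcategoryName, ProductCategoryKey from DimProductSubcategory");
// _cols[0] = "ProductSubcategoryKey"          -> lookup Input (the join field)
// _cols[1] = "EnglishProductSubcategoryName"  -> Output column
// _cols[2] = "ProductCategoryKey"             -> Output column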

Conclusion

This post continues to explore the concepts of data-driven (or metadata-driven) ETL development, which provides a flexible method to generate consistent ETL packages.

Looking back over this post, I suspect that some would question the effort of the biml code for the output that we have achieved … there seems to be a lot of code to produce a single package with a data flow task and a few lookups.  I agree that, to build one package, this may be overkill; however, there are several other considerations for the use of this approach.

  1. It’s a simple example to demonstrate a concept.
  2. Any automation technique has a setup-to-production trade-off.  That is, the more packages and tasks we create, the better the return we get from automation (that is, the more productive and faster our development becomes).
  3. The code produces consistent output (which can be changed relatively simply).
  4. The code is not dependent on an environment and can be used as a toolkit.  For example, if you are a consultant, you can reuse the snippet.  Once you have the code, all you have to do is update the tables and any number of packages/transforms can be created.