Biml XX – Automating Lookups (a fundamental dimensional process)

Perhaps the most fundamental activity in the creation of a dimension is the de-normalisation of data.  That is, we take the OLTP’s source data and progressively build a wide (de-normalised) table describing the entity or subject area.  The build follows a distinct pattern: we start with the lowest level of granularity and incrementally add fields to the structure by looking up attributes based on a key in the source data.  This post looks at automating that process in biml.

For example, consider creating a product dimension that has groupings of products.  We start with a base list of products (identified by ID), and include important attributes at that level (like name, colour, size, the category grouping key etc).  Products are grouped into categories, so we would then look up attributes for the category (perhaps including a key for the next level of grouping).  Those categories could be grouped again, and so the process continues.

The base process that we are conducting is essentially a sequential lookup of data to append to our original extract.  I say sequential because, where hierarchies exist, we are reliant on levels (and therefore order is important).

In SSIS, we might expect such a process to look like the package shown below.

Of course, the package is very simple and includes no consideration of changing attributes.  This is for demonstration purposes.

[Image: SSIS data flow – source, sequential lookups, destination]

A Different Approach based on SQL metadata?

If we want to automate this process (or, more correctly, automate the creation of a package that implements it), we can summarise our activity in 3 steps.  Firstly, we retrieve a base list of items; then (while there is a need to) we add fields to that data based on the results of another query; finally, we do something with that list (like send it to a destination as in the example above).

As far as a metadata (or instruction layer) approach goes, this can be defined through a header/detail structure that identifies an activity and the lookups that must be conducted in order for it to be completed.  For example:

Lookup a product base list
select 
ProductAlternateKey 
, EnglishProductName 
, ProductSubcategoryKey  
from dbo.DimProduct 

Lookup subcategory information (the first level in the hierarchy).  Note that there is an implicitly defined relationship between our ‘existing’ data and this query through the common field ProductSubcategoryKey.
select 
ProductSubcategoryKey 
, EnglishProductSubcategoryName 
, ProductCategoryKey  
from DimProductSubcategory 
Lookup category information (second level).  Again, there is an implicit relationship, based on the ProductCategoryKey field, between our existing data (the sequentially built table) and this dataset.

select 
ProductCategoryKey , EnglishProductCategoryName 
from DimProductCategory  

An antagonist would suggest the same output could be achieved with a single query that joins the three tables.  True; however, this example is illustrative and our goal in ETL is to combine data which may reside in disparate systems and therefore could not be combined through a single command.

Apart from the (already stated) assumption that data will be appended based on the first column of the ‘append query’ (for example, joining by ProductSubcategoryKey), we also assume that all remaining fields in the lookup query will be added to the table (for example, EnglishProductSubcategoryName and ProductCategoryKey will be added).  Also, since the field ProductSubcategoryKey already exists in the table, we don’t need to re-add it.

Finally, the logic for our entire ‘work object’ can be summarised in a tabular output as the two images below show.  Here, a master table defines a job (later this will be transformed into a package) and a detail table defines the list of lookups.  Note that the second dataset is returned via a procedure which is dependent on the name of the first.  Our entire build can be defined by these 3 lines :).

[Image: master table output – the row defining the job]

[Image: detail table output – the rows defining the lookups]
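Since the screenshots of those two result sets aren’t reproduced here, the following is a rough sketch of the metadata that could sit behind them.  Only the column names are taken from the fields the biml script below actually reads; the table names, data types and the ordering column are assumptions, so treat this as an illustration rather than the definitive schema.

create table dbo.transform_entity
(
    name             varchar(50)   not null primary key   -- e.g. 'product'; drives package/task names
  , [connection]     varchar(50)   not null                -- biml connection used by the source
  , src_command      varchar(max)  not null                -- the base (lowest grain) select
  , tgt_connection   varchar(50)   not null                -- biml connection used by the destination
  , tgt_table        varchar(128)  not null                -- e.g. 'dbo.dim_product'
);

create table dbo.transform_lookup
(
    transforms_entity varchar(50)  not null references dbo.transform_entity(name)
  , lookup_order      int          not null                -- lookups are applied sequentially
  , name              varchar(50)  not null                -- e.g. 'subcategory'
  , [connection]      varchar(50)  not null
  , src_cmd           varchar(max) not null                -- first column is the join key
);
go

create procedure dbo.transforms_entities as
    select name, [connection], src_command, tgt_connection, tgt_table
    from dbo.transform_entity;
go

create procedure dbo.transforms_lookups @transforms_entity varchar(50) as
    select name, [connection], src_cmd
    from dbo.transform_lookup
    where transforms_entity = @transforms_entity
    order by lookup_order;
go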

Over to Biml

The entire biml required to generate the package(s) is given below.  I say package(s) because package development is driven by the rows returned from our master and child procs (as shown above) – one package per row in the master output.  Of course, package names, task names etc. are defined (or derived) from the name fields in those outputs.

<#@ import namespace="System.Data" #>
<#@ import namespace="System.Data.SqlClient" #>
<#
    string _src_con_string = @"Data Source=.\SQL2012;Persist Security Info=true;Integrated Security=SSPI;Initial Catalog=AdventureWorks2012";
    string _stage_con_string = @"Data Source=.\SQL2012;Persist Security Info=true;Integrated Security=SSPI;Initial Catalog=stage";
    DataTable _transforms = new DataTable();
    SqlDataAdapter _transforms_da = new SqlDataAdapter("exec transforms_entities", _stage_con_string);
    _transforms_da.Fill(_transforms);    
#> 
<#+
    public string[] SQLCols(string SelectSql)
    {
        /* returns an array of column strings from a select statement */ 
        /* remove the select keyword (and anything before it) */
        int _index_of_select = SelectSql.ToUpper().IndexOf("SELECT");
        SelectSql = SelectSql.Substring(_index_of_select + 6);
        // remove the From
        SelectSql = SelectSql.Substring(0, SelectSql.ToUpper().IndexOf("FROM ")); 
        // remove all whitespace so the column list splits cleanly
        SelectSql = SelectSql.Replace(" ", "").Replace("\t", "").Replace("\r", "").Replace("\n", "");
        return SelectSql.Split(',');
    }
#>
<Biml xmlns="http://schemas.varigence.com/biml.xsd">
      <Connections>
        <OleDbConnection Name="adventure_works" ConnectionString="Provider=SQLNCLI11;Data Source=.\SQL2012;Persist Security Info=False;Integrated Security=SSPI;Initial Catalog=AdventureWorksDW2012" />
        <OleDbConnection Name="stage" ConnectionString="Provider=SQLNCLI11;Data Source=.\SQL2012;Persist Security Info=False;Integrated Security=SSPI;Initial Catalog=stage" />
      </Connections>
    <Packages>
        <# foreach (DataRow _transform in _transforms.Rows) { #>
            <Package Name="DIM_<#= _transform["name"] #>" ConstraintMode="Linear">
                <Tasks>
                    <Dataflow Name="DFT_build_<#= _transform["name"] #>">
                        <Transformations>
                            <OleDbSource Name="SRC_<#= _transform["name"] #>" ConnectionName="<#= _transform["connection"] #>">
                                <DirectInput>
                                    <#= _transform["src_command"] #>
                                </DirectInput>
                            </OleDbSource> 
                            <#  DataTable _lookups = new DataTable();
                                SqlDataAdapter _lookups_da = new SqlDataAdapter("exec [dbo].[transforms_lookups]  @transforms_entity='" + _transform["name"] + "'  ", _stage_con_string);
                                _lookups_da.Fill(_lookups);    
                                foreach (DataRow _lookup in _lookups.Rows){ #>
                                    <Lookup Name="LK_<#= _lookup["name"] #>" OleDbConnectionName="<#= _lookup["connection"] #>" NoMatchBehavior="IgnoreFailure">
                                    <DirectInput><#= _lookup["src_cmd"] #></DirectInput>
                                    <# string[] _col_list = SQLCols(_lookup["src_cmd"].ToString());#>
                                    <Inputs>
                                        <Column SourceColumn="<#= _col_list[0] #>" />
                                    </Inputs>
                                    <Outputs>
                                        <# for(int _i=1; _i<_col_list.Length; _i++){ #>
                                               <Column SourceColumn="<#= _col_list[_i] #>" />
                                        <# }#>
                                    </Outputs>
                                    </Lookup>
                            <# } #>
                                                                <OleDbDestination Name="TGT_<#=  _transform["tgt_connection"] #>" ConnectionName="<#= _transform["tgt_connection"] #>">
                                        <ExternalTableOutput Table="<#=  _transform["tgt_table"] #>" />
                                    </OleDbDestination>
                        </Transformations>
                    </Dataflow>
                </Tasks>
            </Package>
        <# } #> 
    </Packages>
</Biml> 

A quick walk through

The requirements for each lookup are derived within the script itself.  In our code, we iterate over the list of jobs (creating a package for each job) and define data flows (and their lookups) based on the job (and child job) data.  Viewed in its entirety (and there is no requirement to implement it this way), the code may look a little confusing.  As pseudo code, I think it’s a little more readable.

job_list -> list of packages to create /* data from proc transforms_entities */ 
foreach (job in job_list)
{ 
    create a package (name = DIM_job_name)
    add a dataflow to the package (name = DFT_build_job_name) 
        add the source to the data flow (job_list.src_command)
        lookups -> results of proc transforms_lookups(job_list.name)
        foreach (lookup in lookups)
        {
            add a lookup 
                (name = LK_lookup.name)
                (source = lookup.src_cmd && lookup.connection)
                lookup_fields -> get list of lookup.src_cmd fields
                Add Input (source column) = lookup_fields.first_column
                foreach (lookup_field from the second position of lookup_fields)
                    Add Outputcolumn = lookup_field
        }
} 

Flexible lookup requirements

In our post on lookups, we talked about the minimum requirements for the lookup.  Apart from the connection and the query, each lookup needs to define the Inputs (how to match) and the Outputs (what to return).  Given our SQL command (i.e. the SQL select), we know that the first field will be the join key/field and the other fields will be return fields.

Practically, there are several ways to do this.  We have used a very simple technique that deconstructs the select syntax to derive field names (no, it’s not bullet proof).  In our code, we rely on the function SQLCols to return an array of names and then derive each column’s role based on its position in the array.  The first position (index=0) is the lookup field and the remaining columns are returned as outputs by looping over the rest of the array.  For example, passing the subcategory query above to SQLCols returns {ProductSubcategoryKey, EnglishProductSubcategoryName, ProductCategoryKey}: the first becomes the join column and the other two become output columns.

Conclusion

This post continues to explore the concepts of data driven (or metadata driven) ETL development which provides a flexible method to generate consistent ETL packages.

Looking back over this post, I suspect that some would question the effort of the biml code for the output that we have achieved … there seems to be a lot of code to produce a single package with a data flow task and a few lookups.  I agree that, to build one package, this may be overkill; however, there are several other considerations for the use of this approach.

  1. It’s a simple example to demonstrate a concept.
  2. Any automation technique has a setup-to-production trade-off.  That is, the more packages & tasks we create, the better the return we get from automation (that is, the more productive and faster our development becomes).
  3. The code produces consistent output (which can be changed relatively simply).
  4. The code is not dependent on an environment and can be used as a toolkit.  For example, if you are a consultant, you can reuse the snippet.  Once you have the code, all you have to do is update the table and any number of packages/transforms can be created.

A Look at BIReady – Data Warehouse Automation

I had previously discussed some of the considerations, implications and issues associated with the automation of data warehousing. That article addressed some common discussion points that I have witnessed around resistance from IT/BI professionals who refuse to concede that their expertise could be replaced by an automation tool. Of course, there are always extremes of opinion, application and circumstance, so there is never any clear cut rule that can be applied in all circumstances (and for all clients). For myself, the question revolves around where the value adding activities are, and manually developing ETL does not rank too high on my priority list. So, the general benefits I see in warehouse automation fall into 3 areas:

1. Management and a consistent approach, since we expect the tool to transform our business-based model to a physical model (and manage everything in between).

2. The people (expertise) factor – An ability to remove and reorganise the typical layers of development. Since the traditional development stages are moved to an automated process, the reliance on specific roles of the business intelligence life-cycle is also removed. We may have one person (perhaps not an IT specialist) responsible for development and management.

3. Return on investment – The ability to deliver data faster to those that need it. Delivering a prototype data warehouse within a week offers tangible benefits and reduces the capacity constraint associated with development.

Now to the point of this article – BIReady.

BIReady is a data warehouse automation tool (developed in the Netherlands) that has recently been purchased by Attunity (NASDAQ: ATTU). Attunity needs no introduction and so I expect there are plans on the horizon for the product through additional investment and support. Prior to that, one of the most noticeable observations about the company was their inclusion of Bill Inmon on their advisory board – great kudos there!

The application is managed through a (compiled) Access application. To be clear, this is just the front end that the user interacts with. Actually, there are two parts to the application: the BIReady Manager (the developer’s front end) and the BIReady Engine (a Java application) that is accessed by the front end and basically does all the work.

Application data is stored in a relational connection (you can choose from various vendor databases: SQL Server, Oracle, Teradata, DB2, MySQL); however, it is important to note that the front end does not store data in the Access engine. At first, I found the front end to be a bit ugly and dated – sorry Access. For someone that works in IT, the mere mention of Access tends to send shivers down your spine. Form design (in Access) is generally unattractive; however, the operation of the application (menu design around distinct operational activities) mimics how one would expect to build a data warehouse. This is an important consideration because it permits fast-track development for novices and suits the BIReady development cycle.

Perhaps the most robust feature of BIReady is the approach they’ve adopted in the development cycle. Here, they adopt a CIF (Corporate Information Factory) approach that builds a true data warehouse in 3rd Normal Form (alternatively Data Vault) and then spins data marts from that for end user consumption. The data warehouse is designed around the data model (which can be reverse engineered or built from scratch) with the user identifying what attributes should be historically retained. Perhaps the best way to analyse its operation is therefore to look at its core operating modules: the data model, the data warehouse, the data mart and the ETL (Processing Instruction) Generator.

The data model defines entities and the relationships between them (screen shot shown below). Here, the entities and their attributes can be seen, along with the historical treatment of those attributes in the model (e.g. type 1 for not recording changes to the attribute, or type 2 for recording the change in values). An interesting (and noteworthy) addition here is the block feature for each attribute. The block specifies how the attributes are materialised in a physical table (with each block number defining a table). A single entity can therefore be broken up into multiple tables depending on how quickly we expect the data to change.

[Image: BIReady data model editor]

Should you wish to jump-start development, the data model can be reverse engineered from source systems. Here, entities (you can choose which) and relationships in the source system can be reproduced in the BIReady model with only a few mouse clicks (regardless of the stage of development). There is (of course) a bone of contention as to whether this is a good practice since it places a heavy reliance on the design of the source (and the model may not be sustainable on a system change) … however, those discussions are for another day :).

The data warehouse is the physical implementation of the data model and can occur using either the Third Normal Form or Data Vault architecture. If you peek under the hood of the product, it is interesting to note that the names of entities and attributes in the data model are not materialised in the data warehouse.

For example, the categories entity above is stored in two tables (given the two blocks as described above) with system-generated names. We can (should we desire) trace this structure from the model to the physical implementation; however, I think it shows the intent of BIReady to remove the user from that layer. The user is expected to access the data through a data mart.

[Images: data warehouse tables with system-generated names]

A data mart modeller uses the data warehouse to build star schema(s). The application can support the creation of multiple data marts from the same data model, so it’s entirely possible (and arguably desirable) to siphon off functional schemas and deliver departmental reporting. The data mart editor (below) works against the data model (which was discussed above) to show the physical implementation of the data mart in a hierarchical manner (working from facts to dimensions). There is some flexibility in the designer: it allows entities to be combined (say, for example, including category data in the products table) and attributes to be selected for that design. Naturally, the user can choose what attributes (and measures) they wish to show in the dimension and fact.

[Image: BIReady data mart editor]

Since the data mart is built using the data warehouse as its base, the designer can choose to show type 1 (or type 2) dimension versions regardless of the structure that was chosen in the warehouse design. This of course relies on type 2 attributes being recorded in the warehouse (why would we choose anything different?). The key point is that if the data warehouse records type 2 changes, the designer has the ability to show a ‘current’ view of the data or an historic view with minimal effort on their behalf.
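As a generic illustration of that point (this is not BIReady’s internal implementation – the schemas, table and column names below are purely hypothetical), a warehouse table that records type 2 validity ranges can expose a type 1 style ‘current’ presentation with nothing more than a filtered view:

-- hypothetical type 2 table in the warehouse layer
create table dw.customer_history
(
    customer_key   int           not null
  , customer_id    varchar(20)   not null
  , customer_name  varchar(100)  not null
  , valid_from     datetime      not null
  , valid_to       datetime      null      -- null marks the current version of the row
);
go

-- a 'current' (type 1 style) view is just the latest version of each row
create view dm.customer_current as
    select customer_id, customer_name
    from dw.customer_history
    where valid_to is null;
go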

The ETL (Processing Instruction) Generator manages the extraction of data from source systems, the population of the data warehouse and the subsequent population of data marts (additionally, a data mart can be a virtual data mart, meaning it is built over views and no additional data processing is required). The generator can create multiple jobs (depending on how you want to configure it) which can be run through the application’s interface or as a command line batch file (calling on the BIReady engine to execute).

I expect most of the work on the product has gone into this part of the engine, and one of the benefits it brings is the ability to determine and apply parallel processing to its commands. That is, the engine determines what dependencies are required between objects and can then build parallel activities (should such an opportunity exist).

[Image: processing instruction generator]

The ability to manipulate processing instructions is where I have found the most power in the product (and, sadly, the most fun). Essentially, the processing instructions automate all of the manual work that an ETL developer would do. It builds commands to extract source tables into a staging area (you can define the mapping) and then uses that area to populate the data warehouse. However, the product allows you to create and execute scripts as part of that process. These occur against each entity in build stages (pre staging, post staging, pre data warehouse load, post data warehouse load and so on), so you can alter the underlying data as you see fit – for example splitting names, salutation variation, pivot deconstruction etc. Perhaps not an option for a novice developer, but an option nevertheless.

A Quick Note on Data Warehouse Automation

Having seen BIReady implemented in a production environment, I think there is certainly a need to identify it as a data warehouse automation tool. This may sound obvious, but there are some implications of that function which need to be identified (or at least brought to attention).

BIReady manages everything related to its data warehouse: it builds and maintains all its tables, ETL and logging, and you do have the opportunity to influence that process. It even manages its own versioning.

It is not an ETL tool as one might associate with traditional ETL development. Sounds obvious, right? However, the implication of this is that data movement tasks generated by the product relate specifically to the generation and manipulation of data within the data warehouse and not to the individual components of ETL (Extraction, Transformation or Loading). For example, you cannot create a job to simply move data from one table to another (other than where it relates to processing required by the model).

Final Thoughts?

Firstly, I’ve found BIReady to be extremely productive. A project could deliver a functional data warehouse and data marts in under a week – no problem. That’s a real bonus in my eyes.

BIReady is reliant on relational sources (they support 7 RDBMS) and source data must exist in one of those platforms in order to be consumed. This may be an issue for some; however, having seen many BI implementations where the sole purpose is to transform transactional system data into an informational data perspective (aka a data mart), I don’t really see a problem there. In fact, I can easily imagine how their approach of building a data warehouse first and then the subsequent data marts could save a lot of pain on any BI project (of course, you have to have the foresight to retain history).

There is no doubt in my mind that there is a niche market for BIReady. Is it one size fits all? Certainly not, but then no product is – is it?

BIML XIX – Creating a Staging Area using Biml

Our prior posts (Part I and Part II) looked at using biml to extract data from a source system into staging tables. In those posts, we focused on using defined content to create the extracts and override the default column mapping that is available in biml. That approach relied on the existence of the target environment (which is a given). However, how do we create the staging environment? This post looks at how to do that using biml.

The goal of this post therefore is to create a staging environment using biml, based on the system specification of a source system. Using system tables is handy for several reasons. Firstly, every database has them, so we can create a generic framework to define a staging environment. Secondly, we can use the information in them to alter our output (for example, an automatic conversion from an Oracle varchar2 type to a SQL Server nvarchar type). Thirdly, we can use biml script (more specifically, code functions in our script) to alter the values provided (for example naming standards – the conversion of a column name ColumnName to column_name). Fourthly, we can override metadata (for example substituting schemas). Finally, we can use biml to append fields to our target environment that do not exist in our source (for example adding additional fields).

Table Iteration

The basis of our package is the ExecuteSql task. Our output will create a package with an ExecuteSql task for each table that we want to create. Therefore our proforma biml would look like the following.

<Packages>
	<Package Name="001_Create_Staging" ConstraintMode="Linear">
		<Tasks>
			<ExecuteSQL .... some create table statement … />
		</Tasks>
	</Package>
</Packages>

As a first step, we want to create a task for each table in our source system. Here, we don’t worry about defining columns – we just create the ExecuteSql place holder. In order to do that, we retrieve a table showing source schema and tables (in SQL Server, we can use the INFORMATION_SCHEMA.TABLES system table for this). Then we iterate over each row creating the ExecuteSql task (as below).

<#@ import namespace="System.Data" #>
<#@ import namespace="System.Data.SqlClient" #>
<#@ template language="C#" tier="2" #>
<#
	string _stage_con_string = @"Data Source=.\SQL2012;Persist Security Info=true;Integrated Security=SSPI;Initial Catalog=AdventureWorks2012";
	string _table_name_sql = "select  TABLE_SCHEMA , table_name from INFORMATION_SCHEMA.TABLES  where TABLE_TYPE='BASE TABLE'"; 
	
	DataTable _table_names = new DataTable();
	SqlDataAdapter _table_name_da = new SqlDataAdapter(_table_name_sql, _stage_con_string);
    _table_name_da.Fill(_table_names);	
	
#>


<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Connections>
    <OleDbConnection Name="staging_db" ConnectionString="Provider=SQLNCLI11;Data Source=.\SQL2012;Persist Security Info=False;Integrated Security=SSPI;Initial Catalog=stage" />
  </Connections>
	<Packages>
		<Package Name="001_Create_Staging" ConstraintMode="Linear">
			<Tasks>
				<# foreach(DataRow _table in _table_names.Rows) {	#>			
				<ExecuteSQL Name="create_import_<#= _table["TABLE_SCHEMA"] #>_<#= _table["TABLE_NAME"] #>" ConnectionName="staging_db">
					<DirectInput>						 
create table import.[<#= _table["TABLE_SCHEMA"] + "_" + _table["TABLE_NAME"] #>] ()
        				</DirectInput>
				</ExecuteSQL>
				<# } #>
            </Tasks>
        </Package>
    </Packages>
</Biml>

Here, a task is created (called create_import_schema_table) for each table in our source. The SQL we are planning to generate (see the DirectInput tag) moves each table in our source to an import schema in our target environment. The name in the target environment is driven by the source table’s schema and name; for example, consider the results of our INFORMATION_SCHEMA query (as below).
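The screenshot of that result set is not reproduced here, but running the INFORMATION_SCHEMA query above against AdventureWorks2012 returns schema/table pairs along these lines (abbreviated; the exact rows will depend on your database):

select TABLE_SCHEMA, TABLE_NAME
from INFORMATION_SCHEMA.TABLES
where TABLE_TYPE = 'BASE TABLE';

-- TABLE_SCHEMA    TABLE_NAME
-- --------------  -------------
-- Production      ScrapReason
-- Production      Product
-- HumanResources  Department
-- ...             ...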

The table Production.ScrapReason would be created as import.Production_ScrapReason in our staging environment.

Column Iteration

Naturally, our next requirement is to retrieve a list of columns for each table that we are iterating through and append this to our create table syntax. This is a similar concept to the prior post where we used an inner loop – a loop that relies on the current table to produce a column list. We can do this by adding the biml script below. Note that, at this stage, we only retrieve the column specs for the current table and do not do anything with them.

<DirectInput>
	create table import.[<#= _table["TABLE_SCHEMA"] + "_" + _table["TABLE_NAME"] #>]
	(
	<#
		string _col_name_sql = "select COLUMN_NAME, DATA_TYPE,  CHARACTER_MAXIMUM_LENGTH, NUMERIC_PRECISION, NUMERIC_SCALE from INFORMATION_SCHEMA.COLUMNS where TABLE_SCHEMA='" + _table["TABLE_SCHEMA"] + "' and TABLE_NAME='"+ _table["TABLE_NAME"] + "' order by ORDINAL_POSITION ";
		
		DataTable _col_names = new DataTable();
		SqlDataAdapter _col_names_da = new SqlDataAdapter(_col_name_sql, _stage_con_string);
		_col_names_da.Fill(_col_names);

		// iterate over each column and get the definition 
		
	#>
	)
</DirectInput>

We can now see how to use the column specification to define the rest of the create table syntax. For example, consider the table ScrapReason and its column specification. By iterating over each row, we can append that table’s columns (including data types) to our create table syntax.

We should have an idea of how to do this iteration already – add some logic that looks at the values in each column to produce a string. For example, the above specification might look something like this:

Create table import.production_scrapreason
(
	ScrapReasonID smallint(5,0)
	, Name nvarchar(50)
	, ModifiedDate datetime
)

However, generating the syntax like this is not without issue, and the above create table will fail because of the smallint(5,0) specification. What we need to do is include some logic that can take a row and, based on the datatype, produce the appropriate column specification. This would allow us to choose what is created (as a result of the source data type) and translate between datatypes. For example, consider the following function which does this. Here we output a string which uses the column name and datatype to determine what precision and length values to use: a numeric type uses both the precision and scale, a FLOAT type uses only the precision, and text types use the character maximum length.

public string RowConversion(DataRow Row)
{
    string _ret = "[" + Row["COLUMN_NAME"] + "] " + Row["DATA_TYPE"];


    switch (Row["DATA_TYPE"].ToString().ToUpper())
    {
        case "NVARCHAR":
        case "VARCHAR":
        case "NCHAR":
        case "CHAR":
        case "BINARY":
        case "VARBINARY":
            if (Row["CHARACTER_MAXIMUM_LENGTH"].ToString() == "-1")
                _ret += "(max)";
            else
                _ret += "(" + Row["CHARACTER_MAXIMUM_LENGTH"] + ")";
            break;

        case "NUMERIC":
            _ret += "(" + Row["NUMERIC_PRECISION"] + "," + Row["NUMERIC_SCALE"] + ")";
            break;

        case "FLOAT":
            _ret += "(" + Row["NUMERIC_PRECISION"] + ")";
            break;
    }
  
    return _ret;
}

What we would like to do is include this function in our biml script, and we can do that by adding another code block (identified as <#+ …code… #>). Since the same function will be used regardless of where we are in code (the function is the same for the entire file), the snippet can be added to the top section of our biml file. Our create table syntax then becomes a matter of calling that function (as below).

<DirectInput>
	create table import.[<#= _table["TABLE_SCHEMA"] + "_" + _table["TABLE_NAME"] #>]
	(
	<#
		string _col_name_sql = "select COLUMN_NAME, DATA_TYPE,  CHARACTER_MAXIMUM_LENGTH, NUMERIC_PRECISION, NUMERIC_SCALE from INFORMATION_SCHEMA.COLUMNS where TABLE_SCHEMA='" + _table["TABLE_SCHEMA"] + "' and TABLE_NAME='"+ _table["TABLE_NAME"] + "' order by ORDINAL_POSITION ";
		
		DataTable _col_names = new DataTable();
		SqlDataAdapter _col_names_da = new SqlDataAdapter(_col_name_sql, _stage_con_string);
		_col_names_da.Fill(_col_names);

		for (int _i=0; _i<_col_names.Rows.Count ; _i++  )
		{
			DataRow _r = _col_names.Rows[_i];
			
			if (_i == 0)
				WriteLine(RowConversion(_r));
			else
				WriteLine(", " + RowConversion(_r));								
		}
		
	#>
	)
</DirectInput>

Here I have used a slightly different technique for iteration – but the main snippet of code lies in the function call RowConversion(_r). Since I need to include commas in my column definition, and that definition may require a leading or trailing comma, I can use the row’s position to determine if a comma is required (or not). That is, the first row has no comma and all subsequent rows have a leading comma. This produces the following biml snippet.

<DirectInput>
	create table import.[Production_ScrapReason]
	(
	[ScrapReasonID] smallint
	, [Name] nvarchar(50)
	, [ModifiedDate] datetime
	)
</DirectInput>

Custom Columns

If we want to add additional columns to each table (for example, an append_dt column), we can simply add those columns as static biml in our definition (notice that the complete solution below includes an append_dt field in the DirectInput tag). Naturally, column names must be unique for each table, and the command will fail should a column of the same name already exist in the source table. We can solve this with script, and that may be the subject of the next post.
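As a hint of how such a script might work (a sketch only – the check could equally be performed inside the C# loop), the clash can be detected up front by asking the source’s INFORMATION_SCHEMA whether the reserved column name already exists:

-- a non-zero count means the extra column name clashes with an existing source column
-- (Production.ScrapReason is used purely as an example here)
select count(*) as name_clash
from INFORMATION_SCHEMA.COLUMNS
where TABLE_SCHEMA = 'Production'
  and TABLE_NAME   = 'ScrapReason'
  and COLUMN_NAME  = 'append_dt';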

Final Solution

For completeness, I have included the complete solution below.

<#@ import namespace="System.Data" #>
<#@ import namespace="System.Data.SqlClient" #>
<#@ template language="C#" tier="2" #>
<#
	string _stage_con_string = @"Data Source=.\SQL2012;Persist Security Info=true;Integrated Security=SSPI;Initial Catalog=AdventureWorks2012";
	string _table_name_sql = "select  TABLE_SCHEMA , table_name from INFORMATION_SCHEMA.TABLES  where TABLE_TYPE='BASE TABLE'"; 
	
	DataTable _table_names = new DataTable();
	SqlDataAdapter _table_name_da = new SqlDataAdapter(_table_name_sql, _stage_con_string);
    _table_name_da.Fill(_table_names);	
	
#>
<#+ 
public string RowConversion(DataRow Row)
{
    string _ret = "[" + Row["COLUMN_NAME"] + "] " + Row["DATA_TYPE"];


    switch (Row["DATA_TYPE"].ToString().ToUpper())
    {
        case "NVARCHAR":
        case "VARCHAR":
        case "NCHAR":
        case "CHAR":
        case "BINARY":
        case "VARBINARY":
            if (Row["CHARACTER_MAXIMUM_LENGTH"].ToString() == "-1")
                _ret += "(max)";
            else
                _ret += "(" + Row["CHARACTER_MAXIMUM_LENGTH"] + ")";
            break;

        case "NUMERIC":
            _ret += "(" + Row["NUMERIC_PRECISION"] + "," + Row["NUMERIC_SCALE"] + ")";
            break;

        case "FLOAT":
            _ret += "(" + Row["NUMERIC_PRECISION"] + ")";
            break;
    }
  
    return _ret;
}
#> 

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Connections>
    <OleDbConnection Name="staging_db" ConnectionString="Provider=SQLNCLI11;Data Source=.\SQL2012;Persist Security Info=False;Integrated Security=SSPI;Initial Catalog=stage" />
  </Connections>
	<Packages>
		<Package Name="001_Create_Staging" ConstraintMode="Linear">
			<Tasks>
				<# foreach(DataRow _table in _table_names.Rows) {	#>			
				<ExecuteSQL Name="create_import_<#= _table["TABLE_SCHEMA"] #>_<#= _table["TABLE_NAME"] #>" ConnectionName="staging_db">
					<DirectInput>
						create table import.[<#= _table["TABLE_SCHEMA"] + "_" + _table["TABLE_NAME"] #>]
						(
						<#
							
	        				string _col_name_sql = "select COLUMN_NAME, DATA_TYPE,  CHARACTER_MAXIMUM_LENGTH, NUMERIC_PRECISION, NUMERIC_SCALE from INFORMATION_SCHEMA.COLUMNS where TABLE_SCHEMA='" + _table["TABLE_SCHEMA"] + "' and TABLE_NAME='"+ _table["TABLE_NAME"] + "' order by ORDINAL_POSITION ";

				            DataTable _col_names = new DataTable();
				            SqlDataAdapter _col_names_da = new SqlDataAdapter(_col_name_sql, _stage_con_string);
				            _col_names_da.Fill(_col_names);

				            for (int _i=0; _i<_col_names.Rows.Count ; _i++  )
							{
								DataRow _r = _col_names.Rows[_i];
								
								if (_i == 0)
									WriteLine(RowConversion(_r));
	                			else
	                    			WriteLine(", " + RowConversion(_r));								
                            }
							
						#>
						, append_dt datetime
						)
          </DirectInput>
				</ExecuteSQL>
				<# } #>
      </Tasks>
		</Package>
	</Packages>
</Biml>

Conclusion

This post has looked at using biml to create a staging area. A package is created that includes an ExecuteSql statement for each table in a source database’s information_schema. The approach is flexible and offers many advantages which revolve around table alteration (a different target schema, additional columns, datatype alteration etc).

Biml XVII – Building Extraction ETL from Metadata (Database Definitions) (Part 2)

This post is the 2nd post (of 2) on using biml to create a content driven extraction layer. The first post can be found at https://paultebraak.wordpress.com/2015/01/11/biml-xiii-building-extraction-etl-from-metadata-database-definitions/

Introduction

Our previous posts on extraction (see here and here) looked at using biml (and biml script) to create a package driven by metadata which was defined in biml. We introduced the concept of modularisation and showed how a group of biml files could be compiled sequentially to separate out the logic of solution components.

Including asset definitions in biml sounds like a good idea; you can have objects defined as structures in code so the entire solution is defined and managed within a single set of project files. However, one issue with this approach (and the problem relates to the SSDT or BIDS representation of biml) is that object definitions are only available for package generation and you can’t visualise and edit the table as you can in MIST (a biml tool from Varigence).

As an alternative to this approach, we can define our ETL as content in a relational model. Then, we can use biml script to build a package based on that content. Note that we are not talking about a data driven modelling approach here. A data driven approach builds ETL and target schema based on the design of a source system. This approach simply uses database content to define how data is moved. The benefits are that there can be any number of interfaces to the model and packages are built with consistency in a very fast manner. We have already defined the target environment and now want some managed way of creating a package(s) to populate it.

Package Biml

Now that we’ve got an idea about how to use iteration in C# code and embed that with biml, we can look at how to define the package that moves the data. We’ll look at this as a two-step process: firstly, create the biml (and script) to iterate over each extraction (the output from the stored proc staging_extractions) and create a container with a truncate task and an empty data flow (as below). Then, in the second step, we can populate the data flow with transforms.

Our proc (staging_extractions) returns a list of extraction objects. All we have to do is iterate over those and create the containers and their tasks. This is not dissimilar to the way that we created connections – it’s just that the biml contained in the foreach operator is a little longer. We fully qualify names with the extraction_id in order to uniquely identify each task.

<#@ import namespace="System.Data" #>
<#@ import namespace="System.Data.SqlClient" #>
<#@ template language="C#" tier="2" #>
<#
	string _con = @"Data Source=.\SQL2012;Persist Security Info=true;Integrated Security=SSPI;Initial Catalog=stage";
	DataTable _extract_table = new DataTable();
	SqlDataAdapter _con_da = new SqlDataAdapter(" exec dbo.staging_extractions", _con);
    _con_da.Fill(_extract_table);	
#>
<Biml xmlns="http://schemas.varigence.com/biml.xsd">

	<Packages>
		<Package Name="002_Extractions" ConstraintMode="Parallel">
			<Tasks>
				<# foreach (DataRow _extraction in _extract_table.Rows){ #>
				
				<Container Name="<#= _extraction["extraction_id"] #>_<#= _extraction["entity_definition_tgt"] #>" ConstraintMode="Linear">
					
					<Tasks>
						<ExecuteSQL Name="truncate_<#=  _extraction["extraction_id"] #>_<#= _extraction["entity_definition_tgt"] #>" ConnectionName="<#=  _extraction["connection_tgt"] #>" >
							<DirectInput>Select 1</DirectInput>
						</ExecuteSQL>
						<Dataflow Name="extract_<#=  _extraction["extraction_id"] #>_<#= _extraction["entity_definition_tgt"] #>">
						</Dataflow>
                    </Tasks>
					
                </Container>
				<#} #>
				
            </Tasks>
        </Package>
    </Packages>
    
</Biml>

I think there is also an interesting use of constraint declaration that’s worth discussing. Notice that the package definition specifies parallel execution? This ensures that each container can execute in parallel. However, for each container, I’ve specified its ConstraintMode as Linear, which means that tasks will have dependencies in the order that they are created. This is a neat trick so that we can enjoy the benefits of parallel execution across extractions but enforce constraints within each extraction without the added burden of coding them.

Now that we’ve populated the control flow, we just need to populate the data flow with transformations (get source data, append the date, define the destination). For now, we will just add the transforms and look at the column mapping later.

All we have to do is replace the (empty) data flow task above with the following snippet. You’ll notice that all the script uses data from the current _extraction row. The framework simply adds the data source, then the derived column and the destination, with names (and tables) that relate to the extraction row currently in iteration.


<Dataflow Name="extract_<#= _extraction["extraction_id"] #>_<#= _extraction["entity_definition_tgt"] #>">

	<Transformations>

		<OleDbSource 
		Name="src_<#= _extraction["extraction_id"] + "_" + _extraction["entity_schema_source"] + "_" + _extraction["entity_definition_source"] #>" 
		ConnectionName="<#= _extraction["connection_source"] #>">
			<DirectInput>
				Select * from <#= _extraction["entity_schema_source"] #>.<#= _extraction["entity_definition_source"] #>;
			</DirectInput>
		</OleDbSource>

		<DerivedColumns Name="append_dt">
				<Columns>
					<Column Name="append_dt" DataType="DateTime">@[System::StartTime]</Column>
				</Columns>
		</DerivedColumns>
		
		<OleDbDestination 
			Name="tgt_<#= _extraction["extraction_id"]+ "_" + _extraction["entity_schema_tgt"] + "_" + _extraction["entity_definition_tgt"] #>" 
			ConnectionName="<#= _extraction["connection_tgt"] #>">
				<ExternalTableOutput Table="<#= _extraction["entity_schema_tgt"] #>.<#= _extraction["entity_definition_tgt"] #>"></ExternalTableOutput>
		</OleDbDestination>
	
	</Transformations>

</Dataflow>


Now to turn our attention to column mapping in the destination. If we generate the package and look at the mappings for a task, you’ll note that the columns are not mapped. Recall from our post on destinations (here) that we can specify source to target column mappings with the following sample.


<OleDbDestination Name="some_name" ConnectionName="some_connection">
	<ExternalTableOutput Table="target_table"></ExternalTableOutput>
	<Columns>
		<Column SourceColumn="ProductAlternateKey" TargetColumn="product_id" />
	</Columns>
</OleDbDestination>

Since we are already familiar with the idea of iteration, all we have to do is get a table of mappings for the current extraction and populate the column references. Remember that if we execute staging_column_mappings 1 (where 1 refers to the current extraction_id) we would get a set of results as below.
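That result set isn’t reproduced here, but based on the column names the script reads (column_source and column_tgt) and the mapping sample above, the proc and its output for extraction 1 might be shaped something like the following; the underlying table name is an assumption and the second output row is purely illustrative.

create procedure dbo.staging_column_mappings @extraction_id int as
    select column_source, column_tgt
    from dbo.metadata_column_mapping
    where extraction_id = @extraction_id;
go

exec dbo.staging_column_mappings 1;
-- column_source        column_tgt
-- -------------------  ------------
-- ProductAlternateKey  product_id
-- EnglishProductName   product_name   (illustrative row)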

Now, all we have to do is use script to add a Columns tag to our OleDbDestination. This can be done by adding the following snippet.


<Columns>
	<#
		DataTable _col_maps = new DataTable();
		string _col_sql = "exec [dbo].[staging_column_mappings] " + _extraction["extraction_id"];
		SqlDataAdapter _col_da = new SqlDataAdapter(_col_sql, _con);
		_col_da.Fill(_col_maps);
		foreach (DataRow _map in _col_maps.Rows){ #>
		<Column SourceColumn="<#= _map["column_source"] #>" TargetColumn="<#= _map["column_tgt"] #>" />	
    <# }  #>	
</Columns>

Perhaps what’s not so apparent (it may be more so when you examine the entire script below) is that the generation of the column list is dependent on the extraction_id of the extraction iterator. This is really just a nested loop where the second (inner) loop is dependent on the outer loop. The pseudo code for this is shown as:


Foreach (record _r in [dbo].[staging_extractions])
{
	.. do some stuff
	Foreach (record _map in [dbo].[staging_column_mappings](_r.extraction_id))
	{
		Define the mappings
	}
}

For completeness, here is the entire file for package creation.


<#@ import namespace="System.Data" #>
<#@ import namespace="System.Data.SqlClient" #>
<#@ template language="C#" tier="2" #>
<#
	string _con = @"Data Source=.\SQL2012;Persist Security Info=true;Integrated Security=SSPI;Initial Catalog=stage";
	DataTable _extract_table = new DataTable();
	SqlDataAdapter _con_da = new SqlDataAdapter(" exec dbo.staging_extractions", _con);
    _con_da.Fill(_extract_table);	
#>
<Biml xmlns="http://schemas.varigence.com/biml.xsd">

	<Packages>
		<Package Name="002_Extractions" ConstraintMode="Parallel" ProtectionLevel="EncryptSensitiveWithUserKey">
			<Tasks>
				<# foreach (DataRow _extraction in _extract_table.Rows){ #>
				
				<Container Name="<#= _extraction["extraction_id"] #>_<#= _extraction["entity_definition_tgt"] #>" ConstraintMode="Linear">
					
					<Tasks>
						<ExecuteSQL Name="truncate_<#=  _extraction["extraction_id"] #>_<#= _extraction["entity_definition_tgt"] #>" ConnectionName="<#=  _extraction["connection_tgt"] #>" >
							<DirectInput>truncate table <#= _extraction["entity_schema_tgt"] #>.<#= _extraction["entity_definition_tgt"] #>;</DirectInput>
						</ExecuteSQL>
						<Dataflow Name="extract_<#= _extraction["extraction_id"] #>_<#= _extraction["entity_definition_tgt"] #>">
							<Transformations>
								<OleDbSource Name="src_<#= _extraction["extraction_id"] + "_" + _extraction["entity_schema_source"] + "_" + _extraction["entity_definition_source"] #>" 
								ConnectionName="<#= _extraction["connection_source"] #>">
									<DirectInput>
										Select * from <#= _extraction["entity_schema_source"] #>.<#= _extraction["entity_definition_source"] #>;
                                    </DirectInput>
                                </OleDbSource>
								<DerivedColumns Name="append_dt">
									<Columns>
										<Column Name="append_dt" DataType="DateTime">@[System::StartTime]</Column>
                                    </Columns>
                                </DerivedColumns>
								<OleDbDestination Name="tgt_<#= _extraction["extraction_id"]+ "_" + _extraction["entity_schema_tgt"] + "_" + _extraction["entity_definition_tgt"] #>" 
								ConnectionName="<#= _extraction["connection_tgt"] #>">
									<ExternalTableOutput Table="<#= _extraction["entity_schema_tgt"] #>.<#= _extraction["entity_definition_tgt"] #>"></ExternalTableOutput>
									<Columns>
										<#
											DataTable _col_maps = new DataTable();
											string _col_sql = "exec [dbo].[staging_column_mappings] " + _extraction["extraction_id"];
											SqlDataAdapter _col_da = new SqlDataAdapter(_col_sql, _con);
											_col_da.Fill(_col_maps);
											foreach (DataRow _map in _col_maps.Rows){ #>
											<Column SourceColumn="<#= _map["column_source"] #>" TargetColumn="<#= _map["column_tgt"] #>" />	
                                        <# }  #>	
                                    </Columns>
								</OleDbDestination>
                            </Transformations>
						</Dataflow>
                    </Tasks>
					
                </Container>
				<#} #>
				
            </Tasks>
        </Package>
    </Packages>
    
</Biml>

The entire code may be a little much to take in all at once, and you could suggest that such an approach is quite easy to create by hand. True – for three ‘tasks’, dragging and dropping may be easier than setting up this solution. Remember, though, that the goal of the post(s) is to build packages based on content. Adding tasks would be as simple as adding records to our extract table and creating additional mappings (although the mappings are only needed to the extent that columns must be mapped).

Biml XVII – Building Extraction ETL from Metadata (Database Definitions) (Part I)

This post is the 1st post (of 2) on using biml to create a content driven extraction layer. The second post can be found at https://paultebraak.wordpress.com/2015/01/11/biml-xiii-building-extraction-etl-from-metadata-database-definitions-part-2/

Our previous posts on extraction (see here and here) looked at using biml (and biml script) to create a package driven by metadata which was defined in biml. We introduced the concept of modularisation and showed how a group of biml files could be compiled sequentially to separate out the logic of solution components.

Including asset definitions in biml sounds like a good idea; you can have objects defined as structures in code so the entire solution is defined and managed within a single set of project files. However, one issue with this approach (and the problem relates to the SSDT or BIDS representation of biml) is that object definitions are only available for package generation and you can’t visualise and edit the table as you can in MIST (a biml tool from Varigence).

As an alternative to this approach, we can define our ETL as content in a relational model. Then, we can use biml script to build a package based on that content. Note that we are not talking about a data driven modelling approach here. A data driven approach builds ETL and target schema based on the design of a source system. This approach simply uses database content to define how data is moved. The benefits are that there can be any number of interfaces to the model and packages are built with consistency in a very fast manner. We have already defined the target environment and now want some managed way of creating a package(s) to populate it.

The Data Definition

We will continue with the same theme from our previous example (plus a few twists) where we want to extract Adventure Works data into our ‘staging’ area. It’s very simple, demonstrates our intent, and is something that we can build on later. So, to revisit: we have a source database (AdventureWorksDW2012) and we want to transfer data to our staging database (stage). Unlike the prior example, our source and target field names do not match, so we cannot rely on auto mapping in package generation. We can also assume that (unlike the prior example) the target and source table names do not follow a convenient naming convention, and so we must explicitly specify the source and target tables. For example, we may want the stage.dbo.import_product table to be called p_im_01 (and we could not derive that it relates to DimProduct in the source). Additionally, we have included an append_dt field in the target which should be populated by the ETL (ie not using a table default). We can see the required mapping in the image below.

The Proforma Package

If we consider how a single table would be populated (say, for example, import_product), we might propose a template package structure (as below). Our process is similar to the prior post except that, this time, our tasks are bounded by a container, we append the field append_dt and we explicitly map columns in the OleDb destination.

There is one other thing to remember – the creation of this package should be driven by the content stored in a database.

Definition of a Metadata Environment (database content)

I will not go into how the metadata repository is physically defined. For our purposes, all we need to be concerned about is what we can get out of it. For that, I have a few procedures that produce some output.

[dbo].[staging_connections] which lists connection information (below)

[dbo].[staging_extractions] which lists source and target objects (including their schema and connection)

[dbo].[staging_column_mappings] which lists any column mappings that should be applied (as identified by an extraction_id). For example, extraction 1 (as above) applies the following mapping
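The screenshots of those result sets are not reproduced here. As a sketch only (the column names are the ones consumed by the biml in these posts, while the table names and data types are assumptions), the three procedures might be shaped something like this:

create procedure dbo.staging_connections as
    select connection_name, connection_definition      -- biml connection name + full provider string
    from dbo.metadata_connection;
go

create procedure dbo.staging_extractions as
    select extraction_id
         , connection_source, entity_schema_source, entity_definition_source
         , connection_tgt,    entity_schema_tgt,    entity_definition_tgt
    from dbo.metadata_extraction;
go

create procedure dbo.staging_column_mappings @extraction_id int as
    select column_source, column_tgt                    -- e.g. ProductAlternateKey maps to product_id for extraction 1
    from dbo.metadata_column_mapping
    where extraction_id = @extraction_id;
go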

Incremental Build Methodology

Just as in the prior post, we’ll break our biml out into 2 distinct files. The first defines connections and the second defines the extraction package.

Connections Biml

Our connections biml is shown below.

<#@ import namespace="System.Data" #>
<#@ import namespace="System.Data.SqlClient" #>
<#@ template language="C#" tier="1" #>
<#
	string _con = @"Data Source=.\SQL2012;Persist Security Info=true;Integrated Security=SSPI;Initial Catalog=stage";
	DataTable _con_table = new DataTable();
	SqlDataAdapter _con_da = new SqlDataAdapter("exec [dbo].[staging_connections];", _con);
    _con_da.Fill(_con_table);	
#>
<Biml xmlns="http://schemas.varigence.com/biml.xsd">

    <Connections>
			
        <#
		foreach (DataRow _r in _con_table.Rows){ #>
			<Connection Name="<#= _r["connection_name"].ToString() #>" ConnectionString="<#= _r["connection_definition"] #>"   />
		<# } #>

	</Connections>
    
</Biml>

To understand how the inclusion of biml script interacts with (static) biml, it may be a good idea to look at what actually happens when the above code is run. This can be considered a two-step process whereby the script is executed to produce an ‘all biml’ version of the file and, secondly, that ‘all biml’ file is compiled to produce our packages. The output of the first step is shown below.

<Biml xmlns="http://schemas.varigence.com/biml.xsd">

	<Connections>
		<Connection Name="staging_db" ConnectionString="Provider=SQLNCLI11;Data Source=.\SQL2012;Persist Security Info=False;Integrated Security=SSPI;Initial Catalog=stage"   />
		<Connection Name="adventure_works_2012" ConnectionString="Provider=SQLNCLI11;Data Source=.\SQL2012;Persist Security Info=False;Integrated Security=SSPI;Initial Catalog=AdventureWorksDW2012"   />
	</Connections>

</Biml>

We can see that when the script executes, it defines our connections. Now let’s make some observations about its operation. I will not go into the directives (ie the stuff between the <#@ … #> tags) and will focus only on the code.

The first block (reproduced below) fills a data table with the results of my connection stored proc. The code block is defined as the space between the opening and closing tags (<# and #>). Anything defined in this block is available (since it’s already been defined) throughout the entire file and is not restricted to the space between the script tags. For example, I could access the data table _con_table in any other code block, and that’s exactly what we do later in the file.

<#
	string _con = @"Data Source=.\SQL2012;Persist Security Info=true;Integrated Security=SSPI;Initial Catalog=stage";
	DataTable _con_table = new DataTable();
	SqlDataAdapter _con_da = new SqlDataAdapter("exec [dbo].[staging_connections];", _con);
    _con_da.Fill(_con_table);	
#>

We also need to recognise that variables may have a limited scope. As stated above, they are still available regardless of how the script block is broken up. Consider how we iterate over rows using foreach (as in the connections file above). We declare _r as each row in the table (and then loop over each row); _r’s scope is defined (in C# code) between the two braces ({ }). However, note that we close off the scripting block after we create the first { and start using static biml again. Since we have defined _r in code between the braces, its value can be used in any code snippet that lies between them. That’s how we can use the connection_name and connection_definition fields from _r to mesh our static biml with script values (as below).

As the saying goes, there are many ways to skin a cat, and we have options for producing biml (remember that we are considering execution as a two-step process that first produces an all-biml file and then compiles it). How you output that biml is up to you and you could (if you like) output it directly from code. For example, consider how we define the connection in the row iteration (above).

Notice that the actual connection definition (the tag <Connection …..) mixes script tags and biml? We could (if we like) output that string entirely from C# code (as below) and not worry about mixing the script with the biml. I guess this raises the question: why not generate the entire biml from C#? You could, and the choice is yours 🙂

<#
foreach (DataRow _r in _con_table.Rows){
		WriteLine("<Connection Name=\"" + _r["connection_name"] + "\" ConnectionString=\"" + _r["connection_definition"] +"\" /> ")	;
}#>

Part 2 continues at https://paultebraak.wordpress.com/2015/01/11/biml-xiii-building-extraction-etl-from-metadata-database-definitions-part-2/

BIML XVI – Script and Reuse (Part II)

This post continues a prior post about scripting and asset definition in biml. You can find the first part here.

For clarification, we are using biml to define assets (objects in our BI environment) and then using that definition in our biml code. We also introduce biml script as a method of reusing biml code templates.

Biml – its More than Just a Package Generator

The complete Biml for the file 01_Env.biml is shown below. In summary, it specifies the objects in our environment (connections, databases, schemas and tables). Although SSIS offers us no ability to define and see tables (perhaps you could argue that a database project does), Biml does, and we can define them within the XML.

It is also interesting to note the relationship between tables and their containing database. That is, a table belongs to a schema, which (of course) belongs to a database, which exists in a connection. The definition of those properties is required if we are to use ‘lower’ level objects. That is, the database requires a connection (and so on).

It would seem intuitive that a table definition requires columns. After all, could you have a table without them? However, there is no requirement to specify columns within a biml table, and this is demonstrated in the tables below.

Finally, the astute reader will notice that I’ve referred to a tier in my code (note the first line). This has not been done before, and the purpose of a tier is to guarantee build order in the biml compiler. Up until now, we have only generated biml from one file because all requirements have existed in that file. When we submit multiple files (we can build more than one biml file at a time), we want to guarantee the execution order, and the tier allows this. We couldn’t (after all) use a table before it’s defined.

 
<#@ template tier="1" #>
<Biml xmlns="http://schemas.varigence.com/biml.xsd">

  <Connections>
    <OleDbConnection Name="stage" ConnectionString="Provider=SQLOLEDB;Data Source=.\SQL2012;Integrated Security=SSPI;Initial Catalog=stage" />
    <OleDbConnection Name="adventure_works" ConnectionString="Provider=SQLOLEDB;Data Source=.\SQL2012;Integrated Security=SSPI;Initial Catalog=AdventureWorks2012" />
  </Connections>
  
  <Databases>
    <Database ConnectionName="stage" Name="stage"  />
  </Databases>
  
  <Schemas>
    <Schema DatabaseName="stage" Name="dbo"  />
  </Schemas>
  
  <Tables>
    
    <!-- ah yes .... of course i should specify columns 🙂 -->  
    <Table Name="Product" SchemaName="stage.dbo" />
    <Table Name="ProductSubCategory" SchemaName="stage.dbo" />
    
    <Table Name="ProductCategory" SchemaName="stage.dbo">
      <Columns>
        <Column Name="ProductCategoryID" DataType="Int32" />
        <Column Name="Name" DataType="String" Length="50" />
      </Columns>
    </Table>
  
  </Tables>
  
</Biml>

Biml Script, Package Creation and the ASP Comparison

The second file in our biml example defines a package. To reiterate how it works, it uses a generic template (described in Part I) and iterates over each table defined in the first biml file, substituting the name of each table into the template. For this reason, biml script is often compared to classic ASP – an early server-side scripting technology for producing HTML. In fact, biml is often described as the HTML, with biml script playing the role of the ASP code.

Of course this is only possible in this instance because we have employed a consistent design.

<#@ template tier="2" #>
<Biml xmlns="http://schemas.varigence.com/biml.xsd">
      <Packages>
        <Package Name="build_stage" ConstraintMode="Parallel" ProtectionLevel="EncryptSensitiveWithUserKey">
            <Tasks>
                <# foreach( var _table in RootNode.Tables ) { #>
                <ExecuteSQL Name="TRUNC_<#= _table #>" ConnectionName="stage">
                    <DirectInput>Truncate Table dbo.import_<#= _table #>;</DirectInput>
                </ExecuteSQL>
				
                <Dataflow Name="LOAD_<#= _table #>">

                    <Transformations>

                        <OleDbSource Name="SRC_<#= _table #>" ConnectionName="adventure_works">
                            <DirectInput>Select * from Production.<#= _table #>;</DirectInput>
                        </OleDbSource>

                        <OleDbDestination Name="TGT_<#= _table #>" ConnectionName="stage">
                            <ExternalTableOutput Table="dbo.import_<#= _table #>"></ExternalTableOutput>
                        </OleDbDestination>

                    </Transformations>

                    <PrecedenceConstraints>
                        <Inputs>
                            <Input OutputPathName="TRUNC_<#= _table #>.Output">
                            </Input>
                        </Inputs>
                    </PrecedenceConstraints>

                </Dataflow>

                <#  }#>
            </Tasks>
        </Package>
    </Packages>
</Biml>

In examining the script, we’ll look at two items. Firstly, the loop. It occurs as the first child of the Tasks node (after all, we want to generate the code for each table) and is defined with the following statement, which is just C# code.

foreach( var _table in RootNode.Tables ) {}

If you are not familiar with C#, we are just creating a loop over the Tables node in the biml definition (that is, the environment file). The use of {} means that anything between those braces will be repeated for each table. You may remember back to the first post, where we said there are many root child node types.

We have to have some way of telling the compiler (the thing that actually builds the packages) that this is code (and not biml), so any code needs to be enclosed in its special tag (ie <# #>).

Secondly, we need to refer to the table name within each iteration. This is done with an equals sign following the opening tag (that is, <#= #>) and is just the same as saying ‘print the value of whatever lies between the opening (<#=) and closing (#>) tags’. Since we have defined the _table variable (in the loop), we can use it here to specify the name.

If you’re having trouble understanding what’s going on, the biml script is just re-running the template code (the one with Product) and replacing Product with each table name (as defined in the first biml file).
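If it helps, here is a contrived, stripped-down sketch of the same idea (not part of the package above, and using a hard-coded list of names instead of RootNode.Tables). The control nuggets drive the loop, while the expression nugget stamps each name into the template:

<# foreach (var _name in new[] { "Product", "ProductSubCategory", "ProductCategory" }) { #>
<ExecuteSQL Name="TRUNC_<#= _name #>" ConnectionName="stage">
  <DirectInput>Truncate Table dbo.import_<#= _name #>;</DirectInput>
</ExecuteSQL>
<# } #>

This produces three ExecuteSQL tasks, exactly as if the static template had been copied three times with the name changed.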

Also note that this file is on the second tier, which means it will be processed after the first.

Execution

We can execute more than one biml file at a time. Simply highlight both and select ‘Generate SSIS Packages’. The definition of tiers resolves an otherwise arbitrary execution order and makes sure that the second file can use objects that are defined in the first.

Conclusion

I think there’s a lot in this post. We’ve looked into using biml to specify our BI assets and then used that definition (or its master data) to create packages. We have also compared Biml to asp and demonstrated how to use script to iterate over template code.

BIML XVI – Script and Reuse (Part I)

All our posts on Biml so far have focused on how to use Biml to mimic SSIS tasks. That is, they’ve looked at how to use Biml to generate a package and focused on task configuration. One post looked at how to generate the XML automatically; however, that’s not quite the same as using script to generate code. Of course, the purpose of using script is to reuse a template and generate multiple objects based on master data. If the master data changes, we can simply regenerate those objects.

The purpose of this post is to introduce Biml script as a way of automation (iterating over a template) and examine the other benefits that Biml can bring to a project. Remember that Biml is an acronym for Business Intelligence Markup Language and hence we would expect it to provide an environment for business intelligence asset definition. In that way we can use Biml to define our environment, rather than just generating SSIS packages from an XML dialect. There are several benefits in this approach;

  1. Multiple people can work on a solution at once.
  2. The solution/project can fully encompass an environment.
  3. The objects defined within the environment can be referenced from it as master data.

To illustrate this process we will look at an example that’s common in building data warehouses/marts – the extraction of staging data. To be fair, the solution presented is by no means “production ready”. We are only interested in demonstrating some concepts, introducing scripting and trying to keep the post concise. So let’s get started.

What Are We Trying To Do Here?

The scenario that we are demonstrating is quite simple. We have three tables in our staging environment that we wish to populate from our production database. You can see an image of the production database (AdventureWorks2012) below on the left hand side, along with the mapping of tables to the staging database. The fields in the staging database are named in the same way; however, the tables are in a different schema (dbo) and are prefixed with import_. Also (for those not too familiar with adventure works), we’ll call out that not all fields are duplicated in the staging environment. Note that the table import_Product only has 3 fields whereas there are 25 fields in Production.Product.

If you want to create the staging tables in your own environment, you can use the following snippet.

CREATE TABLE dbo.import_ProductCategory(
	[ProductCategoryID] [int]  NOT NULL,
	[Name] nvarchar(50) NOT NULL
); 

CREATE TABLE dbo.import_ProductSubcategory(
	[ProductSubcategoryID] [int] NOT NULL,
	[ProductCategoryID] [int] NOT NULL,
	[Name] nvarchar(50) NOT NULL
);

CREATE TABLE dbo.import_Product(
	[ProductID] [int] NOT NULL,
	[Name] nvarchar(50) NOT NULL,
	[ProductSubcategoryID] [int] NULL
);

Package Template/Output

What I’d like to achieve with my Biml is to generate a package that moves data from each production table to its target. Each movement naturally requires a truncation of the existing (staging) table and then a dataflow to reload it from production data. An overview of the package is shown below. Note that each movement exists within its own constraints and, in that way, the movements can execute in parallel, with each dataflow dependent only on its truncation step.

If we look at the first load (ie Product), we could use the Biml snippet below.

Since all objects (source and destination tables) follow a consistent naming convention, we know the schema and table name variations required to produce this code for any table. That is, we would only have to replace Product with ProductCategory (or ProductSubCategory) and we could extract each of those tables. We are relying on the auto-mapping feature that was discussed here, which depends on consistent column names, but baby steps first 🙂.

 
<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Packages>
    
    <Package Name="build_stage" ConstraintMode="Parallel" ProtectionLevel="EncryptSensitiveWithUserKey">

      <Tasks>
        
        <ExecuteSQL Name="TRUNC_Product" ConnectionName="stage">
          <DirectInput>Truncate Table dbo.import_Product;</DirectInput>
        </ExecuteSQL>
        
        <Dataflow Name="LOAD_Product">

          <Transformations>
            <OleDbSource Name="SRC_Product" ConnectionName="adventure_works">
              <DirectInput>Select * from Production.Product;</DirectInput>
            </OleDbSource>
            
            <OleDbDestination Name="TGT_Product" ConnectionName="stage">
              <ExternalTableOutput Table="dbo.import_Product">
              </ExternalTableOutput>
            </OleDbDestination>
            
          </Transformations>
          
          <!-- this constraint is for the data flow -->
          <PrecedenceConstraints>
            <Inputs>
              <Input OutputPathName="TRUNC_Product.Output"></Input>
            </Inputs>
          </PrecedenceConstraints>
        
        </Dataflow>
        
      </Tasks>
    </Package>
  </Packages>
</Biml> 

It may be worth pointing out two conditions that may not be that apparent in the snippet (they become obvious once pointed out and the package is generated). Firstly, the package uses a parallel constraint mode. This ensures that each task is placed on the control flow without any precedence constraints between them. Secondly, the data flow must execute after the truncation, so we need to specify a precedence constraint between it and the (prior) truncation. You can see its position based on the comment in the code; its purpose may not be obvious straight away, but just keep in mind that it belongs to the dataflow task.
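Before moving on, and as a taste of the substitution point made earlier (the scripted version in Part II does this properly by looping over the table definitions), a single hypothetical script variable is all that would need to change between the three hand-written loads; the sketch below shows just the truncation step:

<#
  // hypothetical: swap in Product or ProductSubCategory here
  var _table = "ProductCategory";
#>
<ExecuteSQL Name="TRUNC_<#= _table #>" ConnectionName="stage">
  <DirectInput>Truncate Table dbo.import_<#= _table #>;</DirectInput>
</ExecuteSQL>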

Project & Build Logic

It makes a lot of sense to separate out objects into modular components. If we think about it, there is an inherent relationship between objects that dictates a natural order in the task we are trying to achieve. For example, we have an environment with connection (definitions) that specifies target tables, and finally we have a work specification for our ETL (which, as we have said, will rely on the prior environment’s metadata to generate code).

Aside from that logical sequence for building something, modularisation also provides the ability to separate work among different people – we could have someone responsible for connections, someone different responsible for table definitions and finally a developer working on package snippets.

In our example, we’ll demonstrate this by breaking our ‘project’ into 2 biml files. The first (01_Env.biml) defines the environment and the second (02_Biml) will be used to build our staging package. We can see this in the image below.

What’s Next?

Continue reading the next instalment at https://paultebraak.wordpress.com/2015/01/05/biml-xvi-script-and-reuse-part-ii/