BIML XVI – Script and Reuse (Part I)


All our posts on Biml so far have focused on how to use Biml to mimic SSIS tasks. That is, they’ve looked at how to use Biml to generate a package and focused on task configuration. One post looked at how to generate XML automatically; however, that’s not quite the same as using script to generate code. The purpose of using script, of course, is to reuse a template and generate multiple objects based on master data. If the master data changes, we can automatically regenerate those objects.

The purpose of this post is to introduce Biml script as a means of automation (iterating over a template) and to examine the other benefits that Biml can bring to a project. Remember that Biml is an acronym for Business Intelligence Markup Language, and hence we would expect it to provide an environment for business intelligence asset definition. In that way we can use Biml to define our environment, rather than just generating SSIS packages from an XML dialect. There are several benefits to this approach (a small scripting sketch follows the list):

  1. Multiple people can work on a solution at once.
  2. The solution/project can fully encompass an environment.
  3. The objects defined within the environment can be referenced from it as master data.
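
The third benefit is the one scripting exploits: once objects (tables, connections) are declared in Biml, script can iterate over them as master data. Purely as a first taste of the syntax, the BimlScript sketch below embeds a C# nugget that loops over every table declared in the project and emits an (empty) package per table:

<#@ template language="C#" #>
<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Packages>
    <!-- RootNode exposes everything declared elsewhere in the project;
         one package is emitted per declared table -->
    <# foreach (var table in RootNode.Tables) { #>
    <Package Name="load_<#=table.Name#>" ConstraintMode="Linear" />
    <# } #>
  </Packages>
</Biml>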

To illustrate this process we will look at an example that’s common in building data warehouses/marts – the extraction of staging data. To be fair, the solution presented is by no means “production ready”. We are only interested in demonstrating some concepts, introducing scripting and trying to keep the post concise. So let’s get started.

What Are We Trying To Do Here?

The scenario that we are demonstrating is quite simple. We have three tables in our staging environment that we wish to populate from our production database. You can see an image of the production database (AdventureWorks2012) below on the left hand side and the mapping of tables to the staging database. The fields in the staging database are named in the same way; however, the tables are in a different schema (dbo) and are prefixed with import_. Also (for those not too familiar with AdventureWorks), we’ll call out that not all fields are duplicated in the staging environment. Note that the table import_Product only has 3 fields whereas there are 25 fields in Production.Product.
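
In summary, the mapping is:

  Production.Product → dbo.import_Product (3 of the 25 source columns)
  Production.ProductCategory → dbo.import_ProductCategory
  Production.ProductSubcategory → dbo.import_ProductSubcategory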

If you want to create the staging tables in your own environment, you can use the following snippet.

CREATE TABLE dbo.import_ProductCategory(
	[ProductCategoryID] [int] NOT NULL,
	[Name] [nvarchar](50) NOT NULL
);

CREATE TABLE dbo.import_ProductSubcategory(
	[ProductSubcategoryID] [int] NOT NULL,
	[ProductCategoryID] [int] NOT NULL,
	[Name] [nvarchar](50) NOT NULL
);

CREATE TABLE dbo.import_Product(
	[ProductID] [int] NOT NULL,
	[Name] [nvarchar](50) NOT NULL,
	[ProductSubcategoryID] [int] NULL
);

Package Template/Output

What I’d like to achieve with my Biml is to generate a package that moves data from each production table to its target. Each movement naturally requires a truncation of the existing (staging) table followed by a data flow to reload it from production data. An overview of the package is shown below. Note that each movement exists within its own constraint chain; in that way, the movements can execute in parallel while each data flow remains dependent on its truncation step.
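
Following the naming convention used in the snippet further down, the control flow amounts to three independent truncate-then-load chains:

  TRUNC_Product            → LOAD_Product
  TRUNC_ProductCategory    → LOAD_ProductCategory
  TRUNC_ProductSubcategory → LOAD_ProductSubcategory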

If we look at the first load (i.e. Product), we could use the Biml snippet below.

Since all objects (source and destination tables) have a consistent naming convention, we know the schema and table name variations that would be required to produce this code for any table. That is, we would only have to replace Product with ProductCategory (or ProductSubcategory) and we could extract each of those tables. We are relying on the auto-mapping feature that was discussed here, which requires consistent column names, but baby steps first :).

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Packages>
    
    <Package Name="build_stage" ConstraintMode="Parallel" ProtectionLevel="EncryptSensitiveWithUserKey">

      <Tasks>
        
        <ExecuteSQL Name="TRUNC_Product" ConnectionName="stage">
          <DirectInput>Truncate Table dbo.import_Product;</DirectInput>
        </ExecuteSQL>
        
        <Dataflow Name="LOAD_Product">

          <Transformations>
            <OleDbSource Name="SRC_Product" ConnectionName="adventure_works">
              <DirectInput>Select * from Production.Product;</DirectInput>
            </OleDbSource>
            
            <OleDbDestination Name="TGT_Product" ConnectionName="stage">
              <ExternalTableOutput Table="dbo.import_Product">
              </ExternalTableOutput>
            </OleDbDestination>
            
          </Transformations>
          
          <!-- this constraint is for the data flow -->
          <PrecedenceConstraints>
            <Inputs>
              <Input OutputPathName="TRUNC_Product.Output"></Input>
            </Inputs>
          </PrecedenceConstraints>
        
        </Dataflow>
        
      </Tasks>
    </Package>
  </Packages>
</Biml> 

It may be worth pointing out two conditions that may not be that apparent in the snippet (they become obvious once pointed out and the package is generated). Firstly, the package specifies a parallel constraint mode. This ensures that each task is placed on the control flow without any precedence constraints between tasks. Secondly, the data flow must execute after the truncation, so we need to specify a precedence constraint between it and the (prior) truncation. You can see its position based on the comment in the code – its purpose may not be obvious straight away, but keep in mind that it relates to the data flow task.
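
Since the only thing that changes between the three loads is the table name, the snippet is an obvious candidate for scripting. Purely as a sketch of where this is heading (the three table names are hard-coded here for illustration), a BimlScript version could wrap the two tasks in a loop:

<#@ template language="C#" #>
<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Packages>
    <Package Name="build_stage" ConstraintMode="Parallel" ProtectionLevel="EncryptSensitiveWithUserKey">
      <Tasks>
        <!-- one truncate/load pair is emitted per table name -->
        <# foreach (var table in new[] { "Product", "ProductCategory", "ProductSubcategory" }) { #>
        <ExecuteSQL Name="TRUNC_<#=table#>" ConnectionName="stage">
          <DirectInput>Truncate Table dbo.import_<#=table#>;</DirectInput>
        </ExecuteSQL>
        <Dataflow Name="LOAD_<#=table#>">
          <Transformations>
            <OleDbSource Name="SRC_<#=table#>" ConnectionName="adventure_works">
              <DirectInput>Select * from Production.<#=table#>;</DirectInput>
            </OleDbSource>
            <OleDbDestination Name="TGT_<#=table#>" ConnectionName="stage">
              <ExternalTableOutput Table="dbo.import_<#=table#>" />
            </OleDbDestination>
          </Transformations>
          <!-- each data flow waits only on its own truncation -->
          <PrecedenceConstraints>
            <Inputs>
              <Input OutputPathName="TRUNC_<#=table#>.Output" />
            </Inputs>
          </PrecedenceConstraints>
        </Dataflow>
        <# } #>
      </Tasks>
    </Package>
  </Packages>
</Biml>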

Project & Build Logic

It makes a lot of sense to separate objects into modular components. If we think about it, there is an inherent relationship between objects that dictates a natural order for the task we are trying to achieve. For example, we have an environment with connection definitions, which specifies target tables, and finally a work specification for our ETL (which, as we have said, will rely on the prior environment’s metadata to generate code).

Aside from that logical sequence for building something, modularisation also provides the ability to separate work among different people – we could have someone responsible for connections, someone else responsible for table definitions and, finally, a developer working on package snippets.

In our example, we’ll demonstrate this by breaking our ‘project’ into two Biml files. The first (01_Env.biml) defines the environment and the second (02_Biml) will be used to build our staging package. We can see this in the image below.
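
To give a feel for the split, 01_Env.biml might declare the two connections that the package snippet referenced by name (the connection strings here are illustrative placeholders only, and the staging database name is assumed):

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Connections>
    <!-- placeholder connection strings; adjust server/catalog to suit -->
    <Connection Name="adventure_works" ConnectionString="Provider=SQLNCLI11;Data Source=.;Initial Catalog=AdventureWorks2012;Integrated Security=SSPI;" />
    <Connection Name="stage" ConnectionString="Provider=SQLNCLI11;Data Source=.;Initial Catalog=staging;Integrated Security=SSPI;" />
  </Connections>
</Biml>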

What’s Next?

Continue reading the next instalment at https://paultebraak.wordpress.com/2015/01/05/biml-xvi-script-and-reuse-part-ii/
