Biml XVII – Building Extraction ETL from Metadata (Database Definitions) (Part I)


This post is the 1st post (in 2) for using biml to create a content driven extraction layer. The second post can be found at https://paultebraak.wordpress.com/2015/01/11/biml-xiii-building-extraction-etl-from-metadata-database-definitions-part-2/

Our previous post on extraction (see here and here) looked at using biml (and biml script) to create a package driven by metadata which was defined in biml. We introduced the concepts of modularisation and showed how a group of biml files could be compiled sequentially to separate out the logic of solution components.

Including asset definitions in biml sounds like a good idea, you can have objects defined as structures in code so the entire solution is defined and managed within a single set of project files. However, one issue with this approach (and the problem relates to the SSDT or BIDS representation of biml) is that that object definitions are only available for package generation and you can’t visualise and edit the table as you can in MIST (a biml tool from Varigence).

As an alternative to this approach, we can define our ETL as content in a relational model. Then, we can use biml script to build a package based on that content. Note that we are not talking about a data driven modelling approach here. A data driven approach builds ETL and target schema based on the design of a source system. This approach simply uses database content to define how data is moved. The benefits are that there can be any number of interfaces to the model and packages are built with consistency in a very fast manner. We have already defined the target environment and now want some managed way of creating a package(s) to populate it.

The Data Definition

We will continue with the same theme from our previous example (plus a few twists) where we want to extract Adventure Works data into our ‘staging’ area. It’s very simple and demonstrates our intent and something that we can build on later. So to revisit, we have a source database (AdventureWorksDW2012) and we want to transfer data to our staging database (stage). Unlike the prior example, our source and target field names do not match so we cannot rely on auto mapping in package generation. We can also assume that (unlike the prior example), the target and source table names do not follow a convenient naming convention as so we must explicitly specify the source and target tables. For example, we may want the stage.dbo.import_product table to be called p_im_01 (and we would not derive that it related to DimProduct) in the source. Additionally, we have included an append_dt field in the target which should be populated by the ETL (ie not using a table default). We can see the required mapping in the image below.

The Proforma Package

If we consider how a single table would be populated (say for example import_product), we might propose a template package structure (as below). Our process is similar to the prior post except that this time, our tasks are bounded by a container, we append the field append_dt and explicitly map columns in the oledb destination.

There is one other thing to remember – the creation of this package should be driven by the content stored in a database.

Definition of a Metadata Environment (database content)

I will not go into how the metadata repository is physically defined. For our purposes, all we need to be concerned about it what we can get out of it. For that, I have a few procedures that produce some output.

[dbo].[staging_connections] which lists connection information (below)

[dbo].[staging_extractions] which lists an source and target objects (including their schema and connection)

[dbo].[staging_column_mappings] which lists any column mappings that should be applied (as identified by an extraction_id). For example, extraction 1 (as above) applies the following mapping

Incremental Build Methodology

Just as in the prior post, we’ll break our biml files out into 2 distinct files. The first defines connections and the second defines the extraction package.

Connections Biml

Our connections biml is shown below.

<#@ import namespace="System.Data" #>
<#@ import namespace="System.Data.SqlClient" #>
<#@ template language="C#" tier="1" #>
<#
	string _con = @"Data Source=.\SQL2012;Persist Security Info=true;Integrated Security=SSPI;Initial Catalog=stage";
	DataTable _con_table = new DataTable();
	SqlDataAdapter _con_da = new SqlDataAdapter("exec [dbo].[staging_connections];", _con);
    _con_da.Fill(_con_table);	
#>
<Biml xmlns="http://schemas.varigence.com/biml.xsd">

    <Connections>
			
        <#
		foreach (DataRow _r in _con_table.Rows){ #>
			<Connection Name="<#= _r["connection_name"].ToString() #>" ConnectionString="<#= _r["connection_definition"] #>"   />
		<# } #>

	</Connections>
    
</Biml>

To understand how the inclusion of biml script interacts with (static) biml, it may be a good idea to look at what actually happens when the above code is run. This can be considered a twostep process whereby the script is executed to produce an ‘all biml’ version of the file and secondly, that ‘all biml’ script is executed to produce our packages. The output of the first step is shown below.

<Biml xmlns="http://schemas.varigence.com/biml.xsd">

	<Connections>
		<Connection Name="staging_db" ConnectionString="Provider=SQLNCLI11;Data Source=.\SQL2012;Persist Security Info=False;Integrated Security=SSPI;Initial Catalog=stage"   />
		<Connection Name="adventure_works_2012" ConnectionString="Provider=SQLNCLI11;Data Source=.\SQL2012;Persist Security Info=False;Integrated Security=SSPI;Initial Catalog=AdventureWorksDW2012"   />
	</Connections>

</Biml

We can see that when the script executes, it defines our connections. Now let’s make some observations about the operation of it. I will not go into the declarations (ie the stuff between the tags <#@ declaration #>) and focus only on code.

The first block (reproduced below) fills a data table with the results of my connection stored proc. The code block is defined as the space between the opening and closing tags (<# and #> ). Anything defined in this block is available (since its already been defined) through the entire file and not restricted to the space between the script tags. For example, I could access the data table _con_table in any other code block and that’s exactly what we do later in the file.

<#
	string _con = @"Data Source=.\SQL2012;Persist Security Info=true;Integrated Security=SSPI;Initial Catalog=stage";
	DataTable _con_table = new DataTable();
	SqlDataAdapter _con_da = new SqlDataAdapter("exec [dbo].[staging_connections];", _con);
    _con_da.Fill(_con_table);	
#>

We also need to recognise that variables may have a limited scope. As stated above, they are still are available in regardless of how the script block is broken. Consider how we iterate over rows using foreach (which is reproduced below). We declare _r as each row in the table (and then loop over each row), _r’s scope is defined (in C# code) between the two parenthesis ({ }). However, note that we close off the scripting block after we create the first { and start using static biml again. Since we have defined _r in code between the parenthesis, its value can be used in any code snippet (that lies between those {} ). That’s how we can use the connection_name and connection_definition fields from _r to mesh our static biml with script values (as below).

As the saying goes, there are many ways to skin a cat and have options for producing biml (remember that we considering execution as a twostep process that first produces an all biml file and then compiles it). How you output that biml is up to you and you could (if you like) output it directly from code. For example, consider how we define the connection in row iteration (above).

Notice that the actual connection definition (the tag <Connection …..) mixes script tags and biml? We could (if we like) output that string entirely from C# code and not worry about mixing the script with the biml. I guess this raises the question why not generate the entire biml from c#? You could and the choice is yours 🙂

<#
foreach (DataRow _r in _con_table.Rows){
		WriteLine("<Connection Name=\"" + _r["connection_name"] + "\" ConnectionString=\"" + _r["connection_definition"] +"\" /> ")	;
}#>

Part 2 continues at https://paultebraak.wordpress.com/2015/01/11/biml-xiii-building-extraction-etl-from-metadata-database-definitions-part-2/

Advertisements

3 thoughts on “Biml XVII – Building Extraction ETL from Metadata (Database Definitions) (Part I)

  1. Pingback: Biml XIII – Building Extraction ETL from Metadata (Database Definitions) (Part 2) | Paul te Braak

  2. Pingback: BIML XIX – Creating a Staging Area using Biml | Paul te Braak

  3. Pingback: BIML - Stage pro DWH snadno, rychle a automaticky - BI Notes

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s