BIML IV – Extracting All Tables to Raw Files

Perhaps the first task of any DW/DM project is to extract data – after all, it's the E in ETL, isn't it? I know there's a lot more that goes on here, but it's the most boring job I can think of doing and one which inevitably has to be done. In a previous post (Biml III), I looked at using BIML to extract a table to a Raw File. This post will extend that so that we can automatically generate the code required to do the task – that is, we want to take a source database and automatically create a package that extracts all the tables from that database.

To be honest, the Biml framework was put in place in the prior post. What I like about what we are going to do here is the quick generation of a package to achieve this. We will also look at some SQL that generates XML and, while you could argue that the process is a bit of a hack, it gets a result very fast. We could also extend this practice to other database platforms, since all we need to do is generate Biml code. So let's begin.

Overview

I’ve already stated our objective is to generate Biml so that we can create a package which will extract all tables from our database. In order to do this, we are going to create the Xml specification through a set of queries that populate the proforma for a single table extract. Here’s the proforma;

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Connections>
    <OleDbConnection Name="Source" ConnectionString="Provider=SQLOLEDB;Data Source=localhost\SQL2012;Integrated Security=SSPI;Initial Catalog=AdventureWorksDW2012" />
    <FileConnection Name="Target" FilePath="F:\\Output.raw"  />
  </Connections>
   
  <FileFormats>
    <RawFileFormat Name="Output" >
      <Columns>
	…. (column list)
      </Columns>
    </RawFileFormat>
  </FileFormats>
 
  <Packages>
    <Package Name="02_Copy_Table_To Raw_File" ConstraintMode="Parallel" ProtectionLevel="DontSaveSensitive" >
      <Tasks>
        <Dataflow Name="Copy_Dim_Account">
          <Transformations>
            <OleDbSource Name="Src_Dim_Account" ConnectionName="Source" >
              <DirectInput>Select AccountKey, AccountCodeAlternateKey, AccountDescription, AccountType, Operator from DimAccount;</DirectInput>
            </OleDbSource>
            <RawFileDestination Name="Dest_Dim_Account" RawFileFormatName="Output"  WriteOption="CreateAlways" >
              <FileOutput ConnectionName ="Target" />
            </RawFileDestination>
          </Transformations>
        </Dataflow>
      </Tasks>
    </Package>
  </Packages>
</Biml>

 

There are three major things we'll need to define. They are;

  1. The output file connection, which is held in the Connections node. Let's assume that its name should include the schema and table name. For example, the extract for Person.Address should output a file called Person_Address.
  2. The file format for a raw file, which is simply the list of columns that we want to export (held in the FileFormats node). The column list has a parent node which is defined by name and, in our case, we will name it by schema and table.
  3. The data flow transform with an OLEDB source and raw file destination (mapping will be managed by name matching). The target uses the file connection (defined in step 1) and the file format (step 2), which is defined by schema and table name. Since we want to give our data flow components a meaningful name, we'll also prefix them with src (for source) and dest (for destination).

Now, let's look at how we can generate the XML for each of our three components.

Target Files

The generation of target files is pretty straightforward. All we need is a query that produces a FileConnection node (the file output), one for each table in the database. The query below does this. There's nothing too complicated here, but if you've never heard of the information schema on SQL Server, you may want to read a little more about the views here. Basically, they define the database (and what we are interested in is tables and columns).

The query,

/*	PTB, 10-Nov-2014
	SQL to Generate Biml FileConnections (all tables in a DB)
 */

declare @output_directory nvarchar(50) = 'F:\\DB_DUMP'

select 
'<FileConnection Name="' 
	+ TABLE_SCHEMA + '_' 
	+ TABLE_NAME + '" FilePath="' 
	+ @output_directory + '\\' 
	+ TABLE_SCHEMA + '_' + TABLE_NAME + '.raw" />'
from INFORMATION_SCHEMA.TABLES 
where TABLE_TYPE = 'BASE TABLE'

It produces the following output.
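For an AdventureWorks style source, the rows look something like this (one FileConnection element per base table – the names here are illustrative and will differ for your database);

<FileConnection Name="Person_Address" FilePath="F:\\DB_DUMP\\Person_Address.raw" />
<FileConnection Name="Person_Person" FilePath="F:\\DB_DUMP\\Person_Person.raw" />
<FileConnection Name="Production_Product" FilePath="F:\\DB_DUMP\\Production_Product.raw" />
<FileConnection Name="Sales_SalesOrderHeader" FilePath="F:\\DB_DUMP\\Sales_SalesOrderHeader.raw" />
…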


All we want to do is take the output from the query and paste it into our Connections node.



File Formats

All the file format needs to do is specify a list of columns under a RawFileFormat node. Of course, the format must have a name, but in essence we need to produce some XML that looks like the following snippet. And it needs to be done for every table that we are going to extract.

    <RawFileFormat Name="Person_Address">
      <Columns>
        <Column Name="AddressID" />
        <Column Name="AddressLine1" />
        <Column Name="AddressLine2" />
        <Column Name="City" />
        <Column Name="StateProvinceID" />
        <Column Name="PostalCode" />
        <Column Name="rowguid" />
        <Column Name="ModifiedDate" />
      </Columns>
    </RawFileFormat>

We are also a little restricted by the capabilities of the raw file output. This destination does not accept certain data types (e.g. geography, XML or images) and so these need to be excluded from the definition. If they are not, our package will try to map source columns to these fields and the transform will fail (actually, it won't validate). This is not an issue with Biml, rather a limitation of the destination.

Again, if you've not used the INFORMATION_SCHEMA before, you may want to have a quick look at the INFORMATION_SCHEMA.COLUMNS view. This view lists all the columns (and their tables) in the database. Since I want XML returned, I can use the FOR XML clause to have the query generate an XML node for me (rather than a standard row set). There is a lot to the FOR XML clause (if you want more information, you can read about it here). My query;

/*	PTB, 10-Nov-2014
	SQL to Generate Biml FileFormats for all tables in a DB
 */

select 

  TABLE_SCHEMA + '_' + TABLE_NAME "@Name"
,  (select COLUMN_NAME "@Name" 
	from INFORMATION_SCHEMA.COLUMNS 
	where TABLE_NAME = t.TABLE_NAME and TABLE_SCHEMA=t.TABLE_SCHEMA
		and  
		( 
			DATA_TYPE <> 'geography'
			and not ( DATA_TYPE='nvarchar' and CHARACTER_MAXIMUM_LENGTH = -1 ) 
			and not ( DATA_TYPE='varbinary' and CHARACTER_MAXIMUM_LENGTH = -1 ) 
			and DATA_TYPE <> 'Xml'
		)
	for xml path('Column'), type, root('Columns')
	)
from INFORMATION_SCHEMA.TABLES t 
WHERE TABLE_NAME in
					(
						SELECT TABLE_NAME 
						from INFORMATION_SCHEMA.TABLES 
						where TABLE_TYPE = 'BASE TABLE'
					)

for XML PATH('RawFileFormat'), type , ROOT('FileFormats')  

This returns an XML node, which is shown in the results view like this.


Now, when we click the XML (note that it looks like a URL), a new output window opens up with the following code. This is exactly what we need for the file formats and we can paste it over the proforma FileFormats node.

The XML output from the File Formats query
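An abridged version of that output (assuming AdventureWorks style tables; your tables and columns will differ) has the following shape;

<FileFormats>
  <RawFileFormat Name="HumanResources_Department">
    <Columns>
      <Column Name="DepartmentID" />
      <Column Name="Name" />
      <Column Name="GroupName" />
      <Column Name="ModifiedDate" />
    </Columns>
  </RawFileFormat>
  <RawFileFormat Name="Person_Address">
    <Columns>
      <Column Name="AddressID" />
      <Column Name="AddressLine1" />
      …
    </Columns>
  </RawFileFormat>
  …
</FileFormats>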

 

 

 

Data Flows

Since we want to include all our data flows in a single package (with one dataflow for each table), we need to generate the XML for the data flows inside the package node. We know that the package has a Tasks child which specifies the control flow, and each data flow resides within it. For example, a single data flow would look like this.

      <Tasks>
 
        <Dataflow Name="EXTRACT_Person_Address">
    
          <Transformations>
          <OleDbSource Name="src_Person_Address" ConnectionName="Source">
            <DirectInput>Select * from Person.Address</DirectInput>
          </OleDbSource>
          <RawFileDestination Name="dest_Person_Address" RawFileFormatName="Person_Address" WriteOption="CreateAlways">
            <FileOutput ConnectionName="Person_Address" />
          </RawFileDestination>
        </Transformations>
       
        </Dataflow>
     
      </Tasks>

Now, we need a query to generate the XML for all tables (and here it is). It gives us the full Tasks node for our package.

/*	PTB, 10-Nov-2014
	SQL to Generate Dataflow's for Biml output (all tables in a DB)
 */

select 

'EXTRACT_' + TABLE_SCHEMA + '_' + TABLE_NAME as "@Name"
, 'src_' + TABLE_SCHEMA + '_' + TABLE_NAME as "Transformations/OleDbSource/@Name"
, 'Source' as "Transformations/OleDbSource/@ConnectionName"
, 'Select * from ' + TABLE_SCHEMA + '.' + TABLE_NAME  as "Transformations/OleDbSource/DirectInput"
, 'dest_' + TABLE_SCHEMA + '_' + TABLE_NAME as "Transformations/RawFileDestination/@Name"
, TABLE_SCHEMA + '_' + TABLE_NAME as "Transformations/RawFileDestination/@RawFileFormatName"
, 'CreateAlways' as "Transformations/RawFileDestination/@WriteOption"
, TABLE_SCHEMA + '_' + TABLE_NAME as "Transformations/RawFileDestination/FileOutput/@ConnectionName"

from INFORMATION_SCHEMA.TABLES 
where TABLE_TYPE = 'BASE TABLE' 
order by TABLE_SCHEMA, TABLE_NAME
for XML path('Dataflow'), root('Tasks')

Putting it All Together

Because the generated XML is so big, I will not include the full output. However, I think it's good to have an indication of what it should look like, so a subset of the output is shown below.
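For an AdventureWorks style source (the table names here are purely illustrative), the generated Tasks node looks something like this;

<Tasks>
  <Dataflow Name="EXTRACT_HumanResources_Department">
    <Transformations>
      <OleDbSource Name="src_HumanResources_Department" ConnectionName="Source">
        <DirectInput>Select * from HumanResources.Department</DirectInput>
      </OleDbSource>
      <RawFileDestination Name="dest_HumanResources_Department" RawFileFormatName="HumanResources_Department" WriteOption="CreateAlways">
        <FileOutput ConnectionName="HumanResources_Department" />
      </RawFileDestination>
    </Transformations>
  </Dataflow>
  <Dataflow Name="EXTRACT_Person_Address">
    <Transformations>
      <OleDbSource Name="src_Person_Address" ConnectionName="Source">
        <DirectInput>Select * from Person.Address</DirectInput>
      </OleDbSource>
      <RawFileDestination Name="dest_Person_Address" RawFileFormatName="Person_Address" WriteOption="CreateAlways">
        <FileOutput ConnectionName="Person_Address" />
      </RawFileDestination>
    </Transformations>
  </Dataflow>
  …
</Tasks>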

Conclusion

When we build the Biml, we will get a single SSIS package with a data flow for each table (as shown below), with all the effort of generating the transforms handled by Biml. I'd like to go into the FOR XML clause a little further, but I think it would complicate what we are trying to achieve (after all, this post is pretty long as it is).

Now … I have to ask … is this the most elegant solution, and could it be done better?

Perhaps not, but I think it's a pretty cheeky (and extremely fast) way to generate a package that does an awful lot of copying.

 

BIML III – Extracting Database Tables to Raw Files

In a previous post, we looked at extracting a table to a flat file. Personally, I am not a huge fan of flat file output because you lose so much metadata when a table with all its data types is dumped to text. To preserve this information, I would much rather use a raw file output. This post looks at the Biml requirements to output to a raw file.

Overview

If you haven't read the post about output to text files, it may be worth a quick read. Basically, we described the process used to extract a table as the following list of activities.

  1. Create a connection to SQL Server
  2. Add a dataflow task
  3. Specify a source (let’s just use an OLEDB source) and provide the query.
  4. Add a flat file as a destination output then join the source to destination and then configure the output by creating a new flat file connection manager.

Of course, we could replace flat file here with raw file and we would pretty much have the same method. That’s exactly what is required – the only real difference is that the raw file format does not require delimiter specifications. Our proforma template remains the same;

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Connections>
    <!-- Placeholder for connection -->
  </Connections>
  <FileFormats>
    <!-- Placeholder for structures-->
  </FileFormats>
  <Packages>
    <Package Name="CopyAccount" ConstraintMode="Linear">
    <!-- Place Holder for a Package -->
    </Package>
  </Packages>
</Biml>

Connections

Our database connection remains the same. The output (raw file) is specified as a file connection.

<Connections>
  <OleDbConnection Name="Source" ConnectionString="Provider=SQLOLEDB;Data Source=localhost;Integrated Security=SSPI;Initial Catalog=AdventureWorksDW2012" />  
  <FileConnection Name="Target" FilePath="F:\\Output.raw"  />
</Connections>

 

FileFormat

The file format (remembering that it's a child of the FileFormats node and defines a table structure) only needs to specify the format name and the column output.

  
  <FileFormats>
    <RawFileFormat Name="Output" >
      <Columns>
        <Column Name="AccountKey"   />
        <Column Name="AccountCodeAlternateKey"   />
        <Column Name="AccountDescription"   />
        <Column Name="AccountType"    />
        <Column Name="Operator"   />
      </Columns>
    </RawFileFormat>
  </FileFormats> 

 

As with text formats, the order is not important, only the Name. We will utilise auto mapping between the source and target.

The Package/ Data Flow

For the text file extract, our package contained a dataflow which simply had a source and destination. The source specified an SQL statement and the destination specified the type (FlatFile) which had an embedded file format. This is an important point… the flat file destination requires a file connection and the connection requires a format.

This is not the same for the raw file output. Here the raw file destination requires both a connection and the file format. If we have a quick look at the dataflow snippet, you’ll see what I mean.

<Dataflow Name="Copy_Dim_Account">
  <Transformations>
	<OleDbSource Name="Src_Dim_Account" ConnectionName="Source" >
	  <DirectInput>Select AccountKey, AccountCodeAlternateKey, AccountDescription, AccountType, Operator from DimAccount;</DirectInput>
	</OleDbSource>
	<RawFileDestination Name="Dest_Dim_Account" RawFileFormatName="Output"  WriteOption="CreateAlways" >
	  <FileOutput ConnectionName ="Target" />
	</RawFileDestination>
  </Transformations>
</Dataflow> 

Now compare this to the flat file and notice that the file format is not required (for text).

<Dataflow Name="Copy_Dim_Account">
  <Transformations>
	<OleDbSource Name="Src_Dim_Account" ConnectionName="Source" >
	  <DirectInput>Select AccountKey, AccountCodeAlternateKey, AccountDescription, AccountType, Operator from DimAccount;</DirectInput>
	</OleDbSource>
	<FlatFileDestination Name="Dest_Dim_Account" ConnectionName="Target" Overwrite="true">
	</FlatFileDestination>
  </Transformations>
</Dataflow> 

 

Complete Code

For completeness, the full code to extract a table (or, more precisely, a SQL query) to a raw file is;

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Connections>
    <OleDbConnection Name="Source" ConnectionString="Provider=SQLOLEDB;Data Source=localhost\SQL2012;Integrated Security=SSPI;Initial Catalog=AdventureWorksDW2012" />
    <FileConnection Name="Target" FilePath="F:\\Output.raw"  />
  </Connections>
  
  <FileFormats>
    <RawFileFormat Name="Output" >
      <Columns>
        <Column Name="AccountKey"   />
        <Column Name="AccountCodeAlternateKey"   />
        <Column Name="AccountDescription"   />
        <Column Name="AccountType"    />
        <Column Name="Operator"   />
      </Columns>
    </RawFileFormat>
  </FileFormats>

  <Packages>
    <Package Name="02_Copy_Table_To Raw_File" ConstraintMode="Parallel" ProtectionLevel="DontSaveSensitive" >
      <Tasks>
        <Dataflow Name="Copy_Dim_Account">
          <Transformations>
            <OleDbSource Name="Src_Dim_Account" ConnectionName="Source" >
              <DirectInput>Select AccountKey, AccountCodeAlternateKey, AccountDescription, AccountType, Operator from DimAccount;</DirectInput>
            </OleDbSource>
            <RawFileDestination Name="Dest_Dim_Account" RawFileFormatName="Output"  WriteOption="CreateAlways" >
              <FileOutput ConnectionName ="Target" />
            </RawFileDestination>
          </Transformations>
        </Dataflow>
      </Tasks>
    </Package>
  </Packages>
</Biml>

 

Conclusion

There is not a great deal of variation here between a raw file output and a text file, so you may ask why include it? In a future post we'll look at a simple solution to output an entire database to raw files. Our goal in that post is to mimic the staging steps of data warehouse creation. However, in order to do that, we need the basics of file output.

 

To Automate or Not? – The Pros and Cons of Data Warehouse Automation

I have been thinking about data warehouse automation for at least 2 years now. That's just long enough for me to see the pros and cons, examine the state of affairs, consider the consequences of both camps and lastly consider my reaction to those consequences. I've come to realise that the last point (my reaction) plays on my mind because the subject is filled more with emotion than rational decisions. What I hope to achieve with this post is to materialise some of my thoughts on the matter and present a few ideas that may be considered by others (if they are considering utilising this technology).

Before we get into the grit of the discussion, it is probably worth noting that I consider the data warehouse in the more traditional sense. That is, a relational engine that stores and delivers agreed data. Recently, the additional possibilities arising from the big data ecosystem and the concepts of a data lake or swamp, along with the possibilities of schema on read (where data doesn't necessarily have to conform to a model on load), add a layer of complexity to this discussion which we just don't need. We could also add data virtualisation, but that would also cloud the waters in what can be simplified as a decision to automate or not.

In my traditional view of this world, the data warehouse needs to deliver two things. Firstly, it needs to store a historical record of data (as it existed) and secondly, it needs to present information to the consumer in the manner that they require. Those two points might seem like real no-brainers but I thought it's better to call them out now because they create some flow-on implications, which are;

  • the historical record of data may not be what the consumer currently wants
  • the ‘delivery’ requirements of the system will change
  • the system must be capable of accommodating changing requirements

Data Warehouse Development

In traditional development, the system is designed at a conceptual level, a data modeller designs and builds the relational framework (the schema) and then ETL developers populate it. The last stage (ETL development) requires the largest effort on a project and has been documented to require more than 60% of total effort.

Now, let’s examine the first two steps. I suggest that the conceptual design is driven by understood practices#1 – to the extent of a cookie cutter template. Regardless of the methodology used, most employ a series of processes and silos that do not change from implementation to implementation. Data is extracted into a staging area, transformed into a warehouse, deployed to a presentation server/layer and delivered via a user interface (reporting/analytic layer). We expect the modeller to use a tool to manage and maintain the modelling process (eg ERwin, DeZign) and would think it unprofessional and prone to error if components were hand coded manually.

In contrast, when it comes to ETL development, most processes are manual. Sure, there's a tool that is used (e.g. SSIS, Kettle, Talend) which acts as an IDE; however, the development process is still labour intensive. The developer links together components to perform some action or task and each task/package must be manually developed and configured (for example, you would write 10 tasks to extract 10 tables into a staging area).

Is the process of ETL development using an IDE similar to that of the modeller (and therefore arguably automated)? I think yes and no. Yes, because there is a layer of abstraction between the development process and the output (today we would think it insane to manually write code in a text editor) and no, because the abstraction layer is not high enough (meaning that it is still very task oriented).

Automation

Enter the realm of automation technologies. To summarise these tools, they;

  1. Generally use a model (and data) driven approach to define the conceptual layer required. For example, you point the tool at a source system to reverse engineer the underlying model, which you can later adjust (for example, add other entities, relationships etc.).
  2. Define a schema based on the prior step (and generate the DDL to create that target schema).
  3. Map source systems to the target schema and generate the code and processes to do this.
  4. Manage the execution of the code/processes derived in step 3.
  5. Provide some presentation layer.

The Arguments

So why are automation products considered by purists and coders as second-rate citizens in the world of data warehouse development? There are a few key arguments that I’ve come across;

  • The tool does not produce the schema that we want or we can’t change it the way we want. This contention presents itself in several ways.
    • The modeller suggests that there is a correct way to build the schema and the tool should conform to that (or be capable of alteration to the correct schema). I've italicised the word correct here because that can really be a matter of conjecture. Consider storing related data in a different table which is directly related to the first (i.e. a 1:1 relationship). You can convincingly argue both sides (there should be 1 table or 2). I think the key point of this argument is that the tool is not flexible for schema alteration. Sometimes this is true, sometimes it's just nonsense based on hearsay or a single product evaluation.
  • Another point is that the data is not retained correctly (for example, history is not recorded and is incapable of change). Most people that present this have not completed an in-depth analysis of the products available. Just like other process-based software, not every product is created equal and it would be foolish to judge the class of products on a limited scope (I am sure there's a product that suits your needs). What is interesting about this point is how much bias is targeted at automation software. The same scenarios can be readily found in traditional implementations where changed requirements (or, dare I suggest, an oversight in design) render the existing schema redundant. Change and rework is inevitable in this scenario and it's surprising how that circumstance is eliminated from our evaluation process.
  • The tool is a black box and I can write better transformations. This is perhaps the most common discussion point I see amongst developers. They argue that the tool produces generic code which could be improved had it been created manually (by them). Perhaps this is true and there are some instances where code could be written that outperforms and is more elegant than the generated code. However, the converse may also be true – that is, manual development may be more prone to poor performance. How can this be? Well, for a start you assume that the developer can deliver the results they claim (and that they actually know what they're talking about). But this is not always the case, since some developers obtain their positions through bluffs on their resume which are all too often found out too late.
    • When presented with this argument, there is a natural assumption that the solution requires elegant and well-performing code. Sometimes this is true and sometimes it's not. Performance improvements that look good on paper may have absolutely no benefit to the overall solution even if they deliver a tangible result. For example, a generated process taking 25 minutes compared to a manually created process taking 20 minutes may show an improvement; however, is there really an improvement if the entire process completes within the processing window?
  • It's too expensive. When an automation tool is evaluated, its price is naturally considered as part of the evaluation equation. Let's say the typical range of prices is between $50K and $80K. That is evaluated against a consultant for six to ten weeks or a full-time employee for nine months. More than enough time to build it manually, right? I'm not so sure. ETL projects are notorious for overruns and while we expect our projects to go smoothly, the chances are that there will be overruns and these are not factored into the cost equation.
    • I'd also like to consider the situation where the developer(s) leave. Here, there is often more unplanned rework through either a lack of knowledge or a different consultant's approach.
  • Further considering the “it's too expensive” line, I would also like to introduce the concept of the time value of data (whoops, information). I have never seen this addressed in a cost-benefit analysis, but to my mind it must play a role. The time value of data relates to the increase in productivity, revenue or output gained by having the solution sooner (let's assume that an automated tool can deliver this). What does it mean in monetary terms (to the business) to have data now (and not in six/nine months)? Can they utilise the data warehouse for competitive advantage to increase profitability?

The Plusses – Viva Automation

What can automation deliver that has not been addressed above? On the pro side of automation software we could suggest;

  • A consistent approach. Since the automation software develops everything for us, its output is delivered in a consistent manner. Of course this has good and bad points, but the key is that there is a consistent approach. Developer peculiarities should not exist since the tools operate through metadata (and generation). Theoretically, one developer is equivalent to another, so there may be no need for sliding pay scales based on experience or special skills.
  • Since the operator manages the software, one might suggest that developers are not needed (and neither are their specialist skills). This may make resourcing a lot simpler and remove the barriers between IT and the business (provided we accept those barriers exist). There is the potential for a business user with training in the software to build and manage the data warehouse rather than a specialist IT resource. This type of transformation has occurred with accounting software that does not require an accountant to operate it (perhaps only a data entry clerk).

The Emotion

Every discussion I have had around the use of automation tools involves some emotion (in some cases I'd go so far as to say that the conversation is dictated by emotion). To get straight to the point, the idea that specialist developers can be replaced by software is so confrontational that we (myself included) simply refuse to accept it and so live in a state of denial. The denial says “I've been doing this for XX years, have specialist skills and no machine can take my place”.

This has happened in so many industries and is such a predictable response, I think all we need do is identify its (potential) presence.

Finally

There have been lots of points in this post. If you think I’ve missed something, want to add your own point or expand on something I’ve mentioned, then just comment or drop me a line. As usual comments are open and welcome.

I also say that all comments will be published (provided they are not spam, contribute to the discussion and are vendor agnostic). If you would like to insert the comment in a section so that the flow of the article remains intact, just let me know and I’ll add it.

The floor is now open. :)

****

#1 As a side note, it's time to revise these practices – perhaps more in a later post.

BIML II – Extracting Database Tables to Text Files

In this post we look at the Biml required to generate an SSIS package that copies data from a table to a flat file (that is, a text file output). It's a pretty simple example but one that builds on our prior knowledge. We have already gone over the basics of Biml in our first post so we won't worry about an overview of Biml, how to install it or how to generate packages. What's worth remembering, though, is that Biml is an XML specification and packages and connections are child nodes of the XML root. We'll also deviate a little in this post by using an OLEDB connection to SQL Server and specifying a file format.

Overview

If we think about the way our package would be built in SSIS, we would probably follow a simple set of steps;

  1. Create a connection to SQL Server
  2. Add a dataflow task
  3. Specify a source (let’s just use an OLEDB source) and provide the query.
  4. Add a flat file as a destination output then join the source to destination and then configure the output by creating a new flat file connection manager.

What's somewhat hidden in this process is that the output file structure is defined automatically when we create a new connection – that is, in the destination wizard (it is handled by the wizard when you create a new connection manager). Keep in mind that Biml has no idea about this, so we need to specify the file structure somewhere in the Biml code.

So if we are going to achieve this in Biml, we would need to;

  1. Add the connections
  2. Specify the file format
  3. Create a package that has a data flow task
    1. Uses the source connection (with a query) to produce a dataset … and
    2. Places that dataset into an output connection.

So we would expect our structure to be something like this (I’m using a little prior knowledge and know that the FileFormats tag is a root element);

<Biml xmlns="http://schemas.varigence.com/biml.xsd">

  <Connections>
    <!-- Placeholder for connection -->
  </Connections>

  <FileFormats>
    <!-- Placeholder for structures-->
  </FileFormats>

  <Packages>
    <Package Name="CopyAccount" ConstraintMode="Linear">
    <!-- Place Holder for a Package -->
    </Package>
  </Packages>

</Biml>

Connections

If we follow our previous example, we can simply specify our connections under the Connections node as in the following code (again, Intellisense can work wonders here). The OLEDB connection should be pretty straightforward, but what is interesting is the simplicity of the FlatFile connection. As you would expect, it's got a Name and a FilePath (or destination), but note the FileFormat specification. This is a required property for a flat file and specifies the structure of the stream.

<Connections>
  <OleDbConnection Name="Source" ConnectionString="Provider=SQLOLEDB;Data Source=localhost;Integrated Security=SSPI;Initial Catalog=AdventureWorksDW2012" />  
  <FlatFileConnection Name="Target" FilePath="F:\\Output.txt" FileFormat="AccountTable" />
</Connections>

FileFormat

We have briefly discussed the creation of a format. For our purposes we could summarise it as the definition of a dataflow (somewhat akin to the definition of a table). As you would expect (and as is logical for an XML table definition), the structure defines a header (that is, the table container) and the columns within that table.

<FileFormats>
    <FlatFileFormat Name="AccountTable" ColumnNamesInFirstDataRow="true" >
      <Columns>
        <Column Name="AccountKey" Delimiter="Comma" />
        <Column Name="AccountCodeAlternateKey" Delimiter="Comma" />
        <Column Name="AccountDescription" Delimiter="Comma" />
        <Column Name="AccountType" Delimiter="Comma" />
        <Column Name="Operator" Delimiter="LF" />
      </Columns>
    </FlatFileFormat>
</FileFormats>

Naturally, each node has to have a Name, and columns also require a Delimiter. You'll notice that the last column's delimiter is a Line Feed (LF, or new line). This just ensures that each row appears on its own line.

We could also specify data types for columns and other properties; however, these are not mandatory for our example. What is required is that the field names in the FileFormat match those from our table. When the SSIS package is generated, the mappings from the source (the OLEDB data source) to the destination (the FlatFile) will be based on name. The order is not important, only the Name.

The Package

If we consider that we have the framework for moving data, we would expect that our package requires the following tasks with a precedence constraint between them;

  1. A Data Flow Task
    1. OLEDB Source (which specifies the connection and the SQL Command)
    2. A flat file destination (specifying the file and format).

This is exactly what we need to specify in our package. The Biml for this is;

    <Package Name="Copy_Dim_Account" ConstraintMode="Linear" >
      <Tasks>
        <Dataflow Name="Copy_Dim_Account">
          <Transformations>
            <OleDbSource Name="Src_Dim_Account" ConnectionName="Source" >
              <DirectInput>Select AccountKey, AccountCodeAlternateKey, AccountDescription, AccountType, Operator from DimAccount;</DirectInput>
            </OleDbSource>
            <FlatFileDestination Name="Dest_Dim_Account" ConnectionName="Target" Overwrite="true">
            </FlatFileDestination>
          </Transformations>
        </Dataflow>
      </Tasks>
    </Package>

We can easily see how these tasks relate to both the source and destination of the dataflow and how they reference the objects that we previously defined (ie the connection and FileFormat).

Previously, we also examined the use of the package's ConstraintMode, a property which dictates how tasks on the Control Flow relate to each other. That is, a Linear mode creates constraints between tasks whereas a Parallel mode does not.
This property is not materialised in the DataFlow. For example, even if the package has a Parallel constraint mode, the tasks (or transformations) within the data flow will automatically be connected (in the order that they appear in the code).

Also note that the source uses a DirectInput tag as a child of the OleDbSource, which is the same syntax as for the ExecuteSQL task.

The Complete Solution

For completeness, the full code is ;

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Connections>
    <OleDbConnection Name="Source" ConnectionString="Provider=SQLOLEDB;Data Source=localhost;Integrated Security=SSPI;Initial Catalog=AdventureWorksDW2012" />  
    <FlatFileConnection Name="Target" FilePath="F:\\Output.txt" FileFormat="AccountTable" />
  </Connections>
  <FileFormats>
    <FlatFileFormat Name="AccountTable" ColumnNamesInFirstDataRow="true" RowDelimiter="LF" >
      <Columns>
        <Column Name="AccountKey" Delimiter="Comma" />
        <Column Name="AccountCodeAlternateKey" Delimiter="Comma" />   
        <Column Name="AccountDescription" Delimiter="Comma" />
        <Column Name="AccountType" Delimiter="Comma"  />
        <Column Name="Operator" Delimiter="LF" />
      </Columns>
    </FlatFileFormat>
  </FileFormats>

  <Packages>
    <Package Name="Copy_Dim_Account" ConstraintMode="Linear" >
      <Tasks>
        <Dataflow Name="Copy_Dim_Account">
          <Transformations>
            <OleDbSource Name="Src_Dim_Account" ConnectionName="Source" >
              <DirectInput>Select AccountKey, AccountCodeAlternateKey, AccountDescription, AccountType, Operator from DimAccount;</DirectInput>
            </OleDbSource>
            <FlatFileDestination Name="Dest_Dim_Account" ConnectionName="Target" Overwrite="true">
            </FlatFileDestination>
          </Transformations>
        </Dataflow>
      </Tasks>
    </Package>
  </Packages>
  </Biml>

Conclusion

In this post, we have looked at the generation of packages that move data from a table to a text file. This is a simple scenario that demonstrates the minimum requirements for tag specification and utilises automatic mapping of source fields to target.

Data Vault and Data Mining (AusDM)

What does the data vault data warehousing methodology have in common with data mining?

There are two conferences / courses being run in Australia in November, 2014.

Data Vault

The Data Vault data warehousing methodology has gained a lot of traction over the past few years. If you’ve never heard of it, I would suggest that it is somewhat of a cross between normalisation and star schema design (of course I am leaving myself open to a bit of criticism with that definition but it is only half a sentence).

Apart from the mandatory recognition of the inventor (Dan Linstedt), the best book that I've read on the methodology was written by Hans Hultgren and it is simply a must-have if you are interested in learning the method (you can check it out on Amazon here). I cannot recommend this book highly enough. Anyway, I digress.

Hans will be in Sydney conducting Data Vault training on 12th – 14th November (that is, training to become a certified Data Vault Modeller). He has partnered with Analytics8 and MIP (don't ask me why the prices are different) and, while I would never usually promote vendors' training solutions, I can make an exception for training delivered by Hans. If you are interested, you can find out more here.

AusDM

AusDM, or the Australasian Data Mining Conference, is being run in Brisbane at Queensland University of Technology (Gardens Point) on 27th-28th November. The first time I went to this conference, it was purely an academic conference on data mining. If you've never been to an academic conference, they are a lot like other conferences, except that research papers are presented, so the presentations are focused and specific (you'll also need a bit of background in the subject area). To be honest, most industry participants find this pretty dry and boring. Research findings lead technical implementation by a good number of years and the work seems mostly theoretical.

What I like about AusDM is that they have an industry focus in addition to the academic presentations. For example, there is a workshop on R which (IMO) makes the price of admission inconsequential (when compared to other R training).

If you would like to know more about AusDM, please check out the site http://ausdm14.ausdm.org/home

BIML I – Getting Started with BIML

Traditional SSIS development requires the ETL developer to create a package and then manually add and configure tasks. While this method of visual development is fine, it suffers from poor reproducibility and a maintenance overhead. For example, imagine (a very contrived scenario) where you create a package with a single ‘Execute SQL Task’ and configure it to fire a procedure. Simple enough, but now suppose you want to create a new package and fire a different proc – here you have to manually go in and follow all the same steps again. Perhaps not so bad, but what if you create 10 packages and then decide that the Execute SQL Task should be prefixed EXSQL? You would have to manually go into each package and rename the task. This could be very cumbersome and it (by nature) carries a maintenance burden.

BIML (Business Intelligence Markup Language) is an XML specification that (amongst other things) allows you to specify the creation of SSIS packages. One of the great things about BIML is that it can also include code snippets that amplify the ability to generate code. This first post forms part of a series that shows how to use BIML to create packages and what type of outcomes we can achieve. But (of course) first things first, so let's start with a simple goal: using BIML to generate a package that executes an SQL command.

Before we continue, let's quickly revisit my comments above about maintainability for a moment.

An SSIS package is really just an XML file, right? So why can't I just edit that?

If you’ve ever played around with an SSIS package, you’ll probably know that it is an XML file. You can see this by viewing the code of your packages as shown below (right click on the package and view code). So you may think that you can generate or alter this pretty easily.

However, I have found this XML to be very finicky. Suppose you wanted to redefine a connection by editing the script (something which you'd think should be simple, and which I've discussed here). You would expect that you could just examine the code, find the connection string, alter it and save it. But I've never had any success with that approach. When you try to open the file again (using the visual designer) it is invalid, so you're pretty much helpless with the changes you tried to make (sorry, but there's no comeback). That's only to change an existing package, so the creation of a package this way is (IMO, of course) just not possible. You could use an API but that just seems out of reach for most.

Of course, newer versions of SSIS have project-wide objects (like connections). Using these would naturally require less maintenance; however, we are still faced with task management (altering and creating) and package development. Further, if we want to define a group of packages by metadata (say, for example, a list of tables to extract), there is no current way to do it in SSIS. A much simpler approach is BIML, which gives us a generic specification for development.

What do I Need to get running with BIML?

BIML is available when BIDS Helper is installed. BIDS Helper is a community add-in for SQL Server Data Tools (SSDT and BIDS) and is available on CodePlex. If you haven't heard of BIDS Helper before, take some time to look around its features. It really is a must-have for business intelligence development in the Microsoft space.

Once BIDS Helper is installed, you can add a new BIML file by right-clicking on the Project (or SSIS Packages) node and selecting ‘Add New Biml File’. The file is created in the Miscellaneous folder (as shown below).

 

A single BIML file can contain the code that creates many packages (as will become apparent), but for now we will just work with the basics and use it to create a package that fires an SQL statement.

Adding Code

Opening the BimlScript file shows the XML in the file (we should probably rename it to give it a decent name, but we won't worry about that this time). When you start, the file just contains the root node (as below).

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
</Biml>

What BIML now allows us to do is create objects within the root (and later other nodes). For example, since we want to create a package, we naturally expect that we need to add a package child. Naming is pretty intuitive since it follows the SSIS object names. I've already hinted that we can use the same file to create multiple packages, so it seems natural that a Package tag should be embedded within a Packages tag. Something like the following.

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Packages>
    <Package></Package>
  </Packages>
</Biml>

Actually, I did cheat a little (and so can you). The Intellisense functionality lets you know what tags are allowed at the position you are in. Remember that in XML a tag is just an identifier for an object, so since I want to do something within the Biml tag (where else would I put it?), I can just open a tag and see what options are available. I want to create a package (and a search for package shows I can add a Packages tag and, of course, within that, a Package).

Validation and Creation

Back to our package snippet. We've got the packages declaration (as above) to specify a package within a Packages node. You'd expect there to be a way to validate the XML to be sure it has the required information to create packages, right? Well, just right click on the file (in Solution Explorer) and select ‘Check Biml for Errors’. A dialog opens with a list of errors.

 

If we think about it, it makes sense, doesn't it? Of course we need to give our packages a name! In contrast, the error for ConstraintMode is something that we take for granted when we generate packages visually. We define the order of tasks in a package through constraints (or connectors). However, Biml has no indication of the layout of tasks other than the order in which they appear in the XML. How they relate to each other must be defined in the package declaration and this is the purpose of the ConstraintMode property. There are two allowable values for this property (either Parallel or Linear) which (as the names suggest) specify whether tasks will be left unconnected or connected sequentially.

So, in order to create a package, we update the Biml code as follows (note that since we will only have one task in this package, the constraint mode is irrelevant);

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
<Packages>
  <Package ConstraintMode="Linear" Name="package_01"></Package>
</Packages>
</Biml>

We now have everything that we need to generate packages from the script. Simply right click on the Biml file and select ‘Generate SSIS Packages’ (as above). You'll see a new package created under the SSIS Packages node in Solution Explorer. Of course, there are no tasks in our package yet.

Adding the Execute SQL task to the Package.

We have already seen that the XML nodes and naming conventions are very similar to those of SSIS (both in structure and definition). If we think about adding an Execute SQL task to the package, there are a few points we could make (or at least assume);

  1. a task would be embedded within a tasks node
  2. the task would probably be defined by name (or a task type)
  3. an execute SQL task requires a connection
  4. an execute SQL task requires a command
  5. the connection would be required before the command.

It turns out that the Biml to add this task to our package is pretty straightforward. Of course, you can use the IntelliSense to suggest what nodes are available and, when you look at the code and think about what you're trying to do, it seems to make sense (that is, create a connection, create a package, add a task to the package that uses the connection and has some SQL syntax).

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Connections>
    <Connection Name="SQL01" ConnectionString="Provider=SQLNCLI10.1;Data Source=.;Persist Security Info=False;Integrated Security=SSPI;Initial Catalog=AdventureWorksDW"/>
  </Connections>
  <Packages>
    <Package ConstraintMode="Linear" Name="package_01">
      <Tasks>
        <ExecuteSQL Name="fire_proc_01" ConnectionName="SQL01" >
          <DirectInput>Select 1</DirectInput>
        </ExecuteSQL>
      </Tasks>
    </Package>
  </Packages>
</Biml>

Perhaps the most interesting thing about this snippet is the connection object. We'd expect it to be embedded within Connections and, since that node sits at the root, we get the advantage of reuse – adding another task to the package, or additional packages and tasks, can reuse that connection. For example, to create 2 packages with 2 Execute SQL commands in each (which require the first to complete before the next), we could write;

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Connections>
    <Connection Name="SQL01" ConnectionString="Provider=SQLNCLI10.1;… (as above) "/>
  </Connections>
  <Packages>
    <Package ConstraintMode="Linear" Name="package_01">
      <Tasks>
        <ExecuteSQL Name="SQL_CMD_1" ConnectionName="SQL01" >
          <DirectInput>Select 1</DirectInput>
        </ExecuteSQL>
        <ExecuteSQL Name="SQL_CMD_2" ConnectionName="SQL01" >
          <DirectInput>Select 1</DirectInput>
        </ExecuteSQL>
      </Tasks>
    </Package>
    <Package ConstraintMode="Parallel" Name="package_02">
      <Tasks>
        <ExecuteSQL Name="SQL_CMD_1" ConnectionName="SQL01" >
          <DirectInput>Select 1</DirectInput>
        </ExecuteSQL>
        <ExecuteSQL Name="SQL_CMD_2" ConnectionName="SQL01" >
          <DirectInput>Select 1</DirectInput>
        </ExecuteSQL>
      </Tasks>
    </Package>
  </Packages>
</Biml>
This produces package_01 (shown below) and we can see how the specification of ConstraintMode has caused SQL_CMD_1 to link to SQL_CMD_2.

 

Note that the output from the second package definition (where ConstraintMode is Parallel) does not introduce precedence constraints between the two tasks.

Conclusion

This post has looked at using Biml to create packages in SSIS. It's a pretty simple example, but we can see how the Biml XML can be used to automatically create (and recreate) packages. If we wanted to change a property (say, for example, the SQL syntax), we could just change it in the code and regenerate the packages.

In later posts, we’ll extend this functional knowledge for more complicated scenarios.

 

 

Changing Excel Table Connections in VBA

In previous posts, I've discussed using VBA to extract connection information from Pivot Tables and hacking a table's command so that you can programmatically change the results of a query. But what if you have a table and you want to change its connection and its command?

This can be achieved pretty easily using the table's .QueryTable object. In fact, we can change the entire table using this class.

So how does it work? Suppose I've inserted some data into a worksheet from SQL Server. The menu path is DATA -> From Other Sources -> From SQL Server. Next, I get the familiar connection windows to connect to a server and then select a table (as below).

I'll give the table a friendly name (adventure_works) and return the data to a table. I get a connected table which is sourced from a query and can be updated (either right click somewhere in the table and select Refresh from the popup menu, or select the Refresh button from the TABLE TOOLS -> DESIGN (ribbon) menu).

I can also use the DESIGN tab of the table tools to give my table a decent name by just replacing its name (as in the following image).

Now I want to change the database server (and potentially the query) that my table uses. There are 2 things to keep in mind here. Firstly, we need to change the query’s connection string (no brainer eh) and secondly, the table (in Excel) was added as a table.

As in the previous post, I can refer to the table as a list object in VBA; this is a property of the sheet, so in order to reference the table I'll have to use some code like this (note that the table is on Sheet2).

Dim li As ListObject

Set li = Sheet2.ListObjects("aw_employees")

For the connection string, I can refer to it through the ListObject's QueryTable class. So I can determine the current connection string, or set it, using the following code;

Debug.Print li.QueryTable.Connection ' retrieve the connection string

li.QueryTable.Connection = "OLEDB;Provider=SQLOLEDB.1;…(this is my connection string) .."

If I then refresh the table (using li.Refresh), one may think that's the end of the story and our work is completed – but that's not quite true just yet.

Remember that we added the table as an SQL table? Because of this, the ‘query’ for the .QueryTable actually refers to a table (including the SQL database) and has its .CommandType set as a table. If we debug the code to determine the command, we can see this.
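For example, printing the command details in the Immediate window gives something along these lines (the exact three-part name is illustrative and depends on your server and database);

Debug.Print li.QueryTable.CommandType   ' prints 3 (xlCmdTable)
Debug.Print li.QueryTable.CommandText   ' e.g. "AdventureWorksDW2012"."dbo"."DimEmployee"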

After we change the connection, we could set the table to use the employee table (without the database prefix) with something simple like;

li.QueryTable.CommandText = "dbo.DimEmployee"

li.Refresh

Or, we could change the command type to an SQL command and provide its text. This would suffice for that;

li.QueryTable.CommandType = xlCmdSql

li.QueryTable.CommandText = "select * from dbo.DimEmployee"

li.Refresh

That's about it – we can change the table's connection and its query, and have those changes reflected in our table.
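Putting those pieces together, a minimal sketch looks like this (the sheet, table name, connection string and query are assumptions carried over from the example above – substitute your own);

Sub RepointTable()

    Dim li As ListObject
    Set li = Sheet2.ListObjects("aw_employees")    ' the table created earlier

    ' point the query at the new server (assumed connection string)
    li.QueryTable.Connection = "OLEDB;Provider=SQLOLEDB.1;Data Source=NEW_SERVER;" & _
        "Integrated Security=SSPI;Initial Catalog=AdventureWorksDW2012"

    ' switch to an SQL command and provide the query text
    li.QueryTable.CommandType = xlCmdSql
    li.QueryTable.CommandText = "select * from dbo.DimEmployee"

    li.Refresh

End Sub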