BIML IV – Extracting All Tables to Raw Files


Perhaps the first task of any data warehouse (DW/DM) project is to extract data – after all, it’s the E in ETL, isn’t it? I know there’s a lot more that goes on here, but it’s the most boring job I can think of doing and one which inevitably has to be done. In a previous post (Biml III), I looked at using Biml to extract a table to a raw file. This post extends that so that we can automatically generate the code required to do the task – that is, given a source database, we want to automatically create a package that extracts all of its tables.

To be honest, the Biml framework was put in place in the prior post. What I like about what we are going to do here is the quick generation of a package to achieve this. We will also look at some XML-generating SQL and, while you could argue that the process is a bit of a hack, it gets a result very fast. We could also extend this practice to other database platforms – all we need to do is generate Biml code. So let’s begin.

Overview

I’ve already stated that our objective is to generate Biml so that we can create a package which will extract all tables from our database. In order to do this, we are going to create the XML specification through a set of queries that populate the proforma for a single table extract. Here’s the proforma:

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Connections>
    <OleDbConnection Name="Source" ConnectionString="Provider=SQLOLEDB;Data Source=localhost\SQL2012;Integrated Security=SSPI;Initial Catalog=AdventureWorksDW2012" />
    <FileConnection Name="Target" FilePath="F:\Output.raw"  />
  </Connections>
   
  <FileFormats>
    <RawFileFormat Name="Output" >
      <Columns>
	…. (column list)
      </Columns>
    </RawFileFormat>
  </FileFormats>
 
  <Packages>
    <Package Name="02_Copy_Table_To_Raw_File" ConstraintMode="Parallel" ProtectionLevel="DontSaveSensitive" >
      <Tasks>
        <Dataflow Name="Copy_Dim_Account">
          <Transformations>
            <OleDbSource Name="Src_Dim_Account" ConnectionName="Source" >
              <DirectInput>Select AccountKey, AccountCodeAlternateKey, AccountDescription, AccountType, Operator from DimAccount;</DirectInput>
            </OleDbSource>
            <RawFileDestination Name="Dest_Dim_Account" RawFileFormatName="Output"  WriteOption="CreateAlways" >
              <FileOutput ConnectionName ="Target" />
            </RawFileDestination>
          </Transformations>
        </Dataflow>
      </Tasks>
    </Package>
  </Packages>
</Biml>

 

There are three major things we’ll need to define. They are:

  1. The output file connection, which is held in the Connections node. Let’s assume that its name should include the schema and table name. For example, the extract for Person.Address should output a file called Person_Address.
  2. The file format for a raw file, which is simply the list of columns that we want to export (held in the FileFormats node). The column list has a parent node which is defined by name and, in our work, we will specify this by schema and table.
  3. The data flow transform, with an OleDb source and a raw file destination (mapping will be managed by name matching). The destination uses the file connection (defined in step 1) and the file format (step 2), both of which are named by schema and table name. Since we want to give our data flow components meaningful names, we’ll also prefix them with src (for source) and dest (for destination).

Now, let’s look at how we can generate the XML for each of our three components.

Target Files

The generation of target files is pretty straightforward. All we need is a query that produces a FileConnection node (with the file path) – one for each table in the database. The query below does this. There’s nothing too complicated here, but if you’ve never heard of the INFORMATION_SCHEMA views on SQL Server, you may want to read a little more about them here. Basically, they describe the database (and what we are interested in is tables and columns).

The query:

/*	PTB, 10-Nov-2014
	SQL to Generate Biml FileConnections (all tables in a DB)
 */

declare @output_directory nvarchar(50) = 'F:\DB_DUMP'

select 
'<FileConnection Name="' 
	+ TABLE_SCHEMA + '_' 
	+ TABLE_NAME + '" FilePath="' 
	+ @output_directory + '\' 
	+ TABLE_SCHEMA + '_' + TABLE_NAME + '.raw" />'
from INFORMATION_SCHEMA.TABLES 
where TABLE_TYPE = 'BASE TABLE'

This produces the following output (one FileConnection node per table).


All we want to do is take the output from the query and paste it into our Connections node.
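For illustration, the output looks something like the snippet below (using Person schema tables as in the example above – your list will reflect the tables in your own database):

```xml
<FileConnection Name="Person_Address" FilePath="F:\DB_DUMP\Person_Address.raw" />
<FileConnection Name="Person_Person" FilePath="F:\DB_DUMP\Person_Person.raw" />
<FileConnection Name="HumanResources_Employee" FilePath="F:\DB_DUMP\HumanResources_Employee.raw" />
```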



File Formats

All the file format needs to do is specify a list of columns under a RawFileFormat node. Of course the format must have a name but, in essence, we need to produce some XML that looks like the following snippet. And it needs to be done for every table that we are going to extract.

    <RawFileFormat Name="Person_Address">
      <Columns>
        <Column Name="AddressID" />
        <Column Name="AddressLine1" />
        <Column Name="AddressLine2" />
        <Column Name="City" />
        <Column Name="StateProvinceID" />
        <Column Name="PostalCode" />
        <Column Name="rowguid" />
        <Column Name="ModifiedDate" />
      </Columns>
    </RawFileFormat>

We are also a little restricted by the capabilities of the raw file destination. This destination does not accept certain data types (e.g. geography, XML or large binary/image columns), so these need to be excluded from the definition. If they are not, our package will try to map source columns to these fields and the transform will fail (actually, it won’t even validate). This is not an issue with Biml; rather, it’s a limitation of the destination.

Again, if you’ve not used the INFORMATION_SCHEMA before, you may want to have a quick look at the INFORMATION_SCHEMA.COLUMNS view. This view lists all the columns (and tables) of the database. Since I want XML returned, I can use the FOR XML clause to have the query generate an XML node for me (rather than a standard row set). There is a lot in the FOR XML clause (if you want more information, you can read about it here). My query:

/*	PTB, 10-Nov-2014
	SQL to Generate Biml FileFormats for all tables in a DB
 */

select 

  TABLE_SCHEMA + '_' + TABLE_NAME "@Name"
,  (select COLUMN_NAME "@Name" 
	from INFORMATION_SCHEMA.COLUMNS 
	where TABLE_NAME = t.TABLE_NAME and TABLE_SCHEMA=t.TABLE_SCHEMA
		and  
		( 
			DATA_TYPE <> 'geography'
			and not ( DATA_TYPE='nvarchar' and CHARACTER_MAXIMUM_LENGTH = -1 ) 
			and not ( DATA_TYPE='varbinary' and CHARACTER_MAXIMUM_LENGTH = -1 ) 
			and DATA_TYPE <> 'Xml'
		)
	for xml path('Column'), type, root('Columns')
	)
from INFORMATION_SCHEMA.TABLES t 
WHERE TABLE_NAME in
					(
						SELECT TABLE_NAME 
						from INFORMATION_SCHEMA.TABLES 
						where TABLE_TYPE = 'BASE TABLE'
					)

for XML PATH('RawFileFormat'), type , ROOT('FileFormats')  

This returns an XML node, which is shown in the results view like this.


Now, when we click the XML (note that it looks like a URL), a new output window opens with the following code. This is exactly what we need for the file formats, and we can paste it over the proforma FileFormats node.
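As a quick aside (this is a minimal, illustrative query of my own, not part of the extract), the trick that makes the query above work is FOR XML PATH’s aliasing convention: a column aliased with a leading @ becomes an attribute of the PATH element, while a plain alias becomes a child element.

```sql
-- "@Name" maps to an attribute of the <Column> element
select 'AddressID' as "@Name"
for xml path('Column'), root('Columns');
-- returns: <Columns><Column Name="AddressID"/></Columns>
```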

The Xml output from the File Formats query
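An indicative subset (based on the Person_Address format shown earlier – your formats will match your own tables) looks like this. Note the FileFormats root element added by the ROOT('FileFormats') clause:

```xml
<FileFormats>
  <RawFileFormat Name="Person_Address">
    <Columns>
      <Column Name="AddressID" />
      <Column Name="AddressLine1" />
      <!-- ... remaining columns ... -->
    </Columns>
  </RawFileFormat>
  <!-- ... one RawFileFormat per table ... -->
</FileFormats>
```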
Data Flows

Since we want to include all our data flows in a single package (with one data flow for each table), we need to generate the XML for the data flows inside the package’s Tasks node. We know that the package has a Tasks child which specifies the control flow, and each data flow should reside within it. For example, a single data flow would look like this.

<Tasks>
  <Dataflow Name="EXTRACT_Person_Address">
    <Transformations>
      <OleDbSource Name="src_Person_Address" ConnectionName="Source">
        <DirectInput>Select * from Person.Address</DirectInput>
      </OleDbSource>
      <RawFileDestination Name="dest_Person_Address" RawFileFormatName="Person_Address" WriteOption="CreateAlways">
        <FileOutput ConnectionName="Person_Address" />
      </RawFileDestination>
    </Transformations>
  </Dataflow>
</Tasks>

Now, we need a query to generate the XML for all tables… and here it is. It gives us the full Tasks node for our package.

/*	PTB, 10-Nov-2014
	SQL to Generate Dataflows for Biml output (all tables in a DB)
 */

select 

'EXTRACT_' + TABLE_SCHEMA + '_' + TABLE_NAME as "@Name"
, 'src_' + TABLE_SCHEMA + '_' + TABLE_NAME as "Transformations/OleDbSource/@Name"
, 'Source' as "Transformations/OleDbSource/@ConnectionName"
, 'Select * from ' + TABLE_SCHEMA + '.' + TABLE_NAME  as "Transformations/OleDbSource/DirectInput"
, 'dest_' + TABLE_SCHEMA + '_' + TABLE_NAME as "Transformations/RawFileDestination/@Name"
, TABLE_SCHEMA + '_' + TABLE_NAME as "Transformations/RawFileDestination/@RawFileFormatName"
, 'CreateAlways' as "Transformations/RawFileDestination/@WriteOption"
, TABLE_SCHEMA + '_' + TABLE_NAME as "Transformations/RawFileDestination/FileOutput/@ConnectionName"

from INFORMATION_SCHEMA.TABLES 
where TABLE_TYPE = 'BASE TABLE' 
order by TABLE_SCHEMA, TABLE_NAME
for XML path('Dataflow'), root('Tasks')

Putting it All Together

Because the generated XML is so big, I will not include the full output. However, I think it’s good to have an indication of what the output should look like, so a subset of the output will look like this.
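By way of illustration only, here is an abbreviated assembly – the three query outputs pasted into the proforma, with the lists truncated for brevity (the package name here is just a placeholder):

```xml
<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Connections>
    <OleDbConnection Name="Source" ConnectionString="Provider=SQLOLEDB;Data Source=localhost\SQL2012;Integrated Security=SSPI;Initial Catalog=AdventureWorksDW2012" />
    <FileConnection Name="Person_Address" FilePath="F:\DB_DUMP\Person_Address.raw" />
    <!-- ... one FileConnection per table ... -->
  </Connections>
  <FileFormats>
    <RawFileFormat Name="Person_Address">
      <Columns>
        <Column Name="AddressID" />
        <!-- ... remaining columns ... -->
      </Columns>
    </RawFileFormat>
    <!-- ... one RawFileFormat per table ... -->
  </FileFormats>
  <Packages>
    <Package Name="01_Extract_All_Tables" ConstraintMode="Parallel" ProtectionLevel="DontSaveSensitive">
      <Tasks>
        <Dataflow Name="EXTRACT_Person_Address">
          <Transformations>
            <OleDbSource Name="src_Person_Address" ConnectionName="Source">
              <DirectInput>Select * from Person.Address</DirectInput>
            </OleDbSource>
            <RawFileDestination Name="dest_Person_Address" RawFileFormatName="Person_Address" WriteOption="CreateAlways">
              <FileOutput ConnectionName="Person_Address" />
            </RawFileDestination>
          </Transformations>
        </Dataflow>
        <!-- ... one Dataflow per table ... -->
      </Tasks>
    </Package>
  </Packages>
</Biml>
```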

Conclusion

When we build the Biml, we get a single SSIS package with a data flow for each table, with all the effort of generating the transforms handled by Biml. I’d like to go into the FOR XML clause a little further, but I think it would complicate what we are trying to achieve (after all, this post is pretty long as it is).

Now… I have to ask… Is this the most elegant solution, and could it be done better? Perhaps not, but I think it’s a pretty cheeky (and extremely fast) way to generate a package to do an awful lot of copying.

 
