Perhaps the first task of any data warehouse (DW/DM) project is to extract data – after all, it's the E in ETL, isn't it? I know there's a lot more that goes on here, but it's the most boring job I can think of doing and one which inevitably has to be done. In a previous post (Biml III), I looked at using Biml to extract a table to a raw file. This post will extend that so that we can automatically generate the code required to do the task – that is, we want to take a source database and automatically create a package that extracts all the tables from that database.
To be honest, the Biml framework was put in place in the prior post. What I like about what we are going to do here is how quickly we can generate a package to achieve this. We will also look at some SQL that generates XML and, while you could argue that the process is a bit of a hack, it gets a result very fast. We could also extend this practice to other database platforms; all we need to do is generate Biml code. So let's begin.
I've already stated that our objective is to generate Biml so that we can create a package which will extract all tables from our database. In order to do this, we are going to create the XML specification through a set of queries that populate the proforma for a single table extract. Here's the proforma:
<Biml xmlns="http://schemas.varigence.com/biml.xsd">
    <Connections>
        <OleDbConnection Name="Source" ConnectionString="Provider=SQLOLEDB;Data Source=localhost\SQL2012;Integrated Security=SSPI;Initial Catalog=AdventureWorksDW2012" />
        <FileConnection Name="Target" FilePath="F:\\Output.raw" />
    </Connections>
    <FileFormats>
        <RawFileFormat Name="Output">
            <Columns>
                .... (column list)
            </Columns>
        </RawFileFormat>
    </FileFormats>
    <Packages>
        <Package Name="02_Copy_Table_To Raw_File" ConstraintMode="Parallel" ProtectionLevel="DontSaveSensitive">
            <Tasks>
                <Dataflow Name="Copy_Dim_Account">
                    <Transformations>
                        <OleDbSource Name="Src_Dim_Account" ConnectionName="Source">
                            <DirectInput>Select AccountKey, AccountCodeAlternateKey, AccountDescription, AccountType, Operator from DimAccount;</DirectInput>
                        </OleDbSource>
                        <RawFileDestination Name="Dest_Dim_Account" RawFileFormatName="Output" WriteOption="CreateAlways">
                            <FileOutput ConnectionName="Target" />
                        </RawFileDestination>
                    </Transformations>
                </Dataflow>
            </Tasks>
        </Package>
    </Packages>
</Biml>
There are three major things we'll need to define. They are:
- The output file connection, which is held in the Connections node. Let's assume that its name should include the schema and table name. For example, the extract for Person.Address should output a file called Person_Address.
- The file format for a raw file, which is simply the list of columns that we want to export (held in the FileFormats node). The column list has a parent node which is identified by name and, in our work, we will name it by schema and table.
- The data flow transform, with an OleDb source and raw file destination (mapping will be managed by name matching). The destination uses the file connection (defined in step 1) and the file format (step 2), both of which are named by schema and table. Since we want to give our data flow components a meaningful name, we'll also prefix them with src (for source) and dest (for destination).
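To make the mechanics concrete, here's a minimal sketch (in Python, purely as an illustration) of how the three generated fragments slot into the proforma. The package name and the fragment contents here are hypothetical stand-ins for the query output we'll build below:

```python
# Minimal sketch: the three generated XML fragments simply replace the
# corresponding slots in the proforma. The fragment strings below are
# hypothetical stand-ins for the query output shown later in the post.
PROFORMA = """<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Connections>
    <OleDbConnection Name="Source" ConnectionString="..." />
    {file_connections}
  </Connections>
  <FileFormats>
    {file_formats}
  </FileFormats>
  <Packages>
    <Package Name="Extract_All_Tables" ConstraintMode="Parallel">
      <Tasks>
        {dataflows}
      </Tasks>
    </Package>
  </Packages>
</Biml>"""

biml = PROFORMA.format(
    file_connections='<FileConnection Name="Person_Address" FilePath="F:\\DB_DUMP\\Person_Address.raw" />',
    file_formats='<RawFileFormat Name="Person_Address">...</RawFileFormat>',
    dataflows='<Dataflow Name="EXTRACT_Person_Address">...</Dataflow>',
)
print(biml)
```

The three queries that follow each produce one of those fragments, one element per table.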
Now, let's look at how we can generate the XML for each of our three components.
The generation of target file connections is pretty straightforward. All we need is a query that produces a file connection node (with the file output), one for each table in the database. The query below does this and produces the output. There's nothing too complicated here, but if you've never heard of the information schema in SQL Server, you may want to read a little more about the views here. Basically, they describe the database (and what we are interested in is tables and columns).
/* PTB, 10-Nov-2014
   SQL to generate Biml FileConnections (all tables in a DB) */
declare @output_directory nvarchar(50) = 'F:\\DB_DUMP'

select '<FileConnection Name="' + TABLE_SCHEMA + '_' + TABLE_NAME
     + '" FilePath="' + @output_directory + '\\' + TABLE_SCHEMA + '_' + TABLE_NAME + '.raw" />'
from INFORMATION_SCHEMA.TABLES
where TABLE_TYPE = 'BASE TABLE'
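If the string concatenation in the query looks opaque, here's the same naming convention sketched in Python. The table list is hypothetical; the real query reads it from INFORMATION_SCHEMA.TABLES:

```python
# A minimal sketch of the naming convention the query above applies.
# The table list here is hypothetical; the real query reads
# INFORMATION_SCHEMA.TABLES for every base table in the database.
output_directory = "F:\\DB_DUMP"

tables = [("Person", "Address"), ("Sales", "SalesOrderHeader")]

connections = [
    '<FileConnection Name="{0}_{1}" FilePath="{2}\\{0}_{1}.raw" />'.format(
        schema, table, output_directory
    )
    for schema, table in tables
]

for line in connections:
    print(line)
```

Each row of the query's result set is one such FileConnection element, named schema_table as per our convention.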
This produces the following output.
All we need to do is take the output from the query and paste it into our Connections node.
All the file format needs to do is specify a list of columns under a RawFileFormat node. Of course, the format must have a name, but in essence we need to produce some XML that looks like the following snippet. And it needs to be done for every table that we are going to extract.
<RawFileFormat Name="Person_Address">
    <Columns>
        <Column Name="AddressID" />
        <Column Name="AddressLine1" />
        <Column Name="AddressLine2" />
        <Column Name="City" />
        <Column Name="StateProvinceID" />
        <Column Name="PostalCode" />
        <Column Name="rowguid" />
        <Column Name="ModifiedDate" />
    </Columns>
</RawFileFormat>
We are also a little restricted by the capabilities of the raw file output. This destination does not accept certain data types (e.g. geography, XML, or large object types such as varbinary(max)), and so these need to be excluded from the definition. If they are not, our package will try to map source columns to these fields and the transform will fail (actually, it won't validate). This is not an issue with Biml; rather, it's a limitation of the destination.
Again, if you've not used the INFORMATION_SCHEMA before, you may want to have a quick look at the INFORMATION_SCHEMA.COLUMNS view. This view lists all the columns (and their tables) in the database. Since I want XML returned, I can use the FOR XML clause to have the query generate an XML node for me (rather than a standard row set). There is a lot to the FOR XML clause (if you want more information, you can read about it here). My query:
/* PTB, 10-Nov-2014
   SQL to generate Biml FileFormats for all tables in a DB */
select TABLE_SCHEMA + '_' + TABLE_NAME "@Name"
, (
    select COLUMN_NAME "@Name"
    from INFORMATION_SCHEMA.COLUMNS
    where TABLE_NAME = t.TABLE_NAME
      and TABLE_SCHEMA = t.TABLE_SCHEMA
      and (
            DATA_TYPE <> 'geography'
            and not (DATA_TYPE = 'nvarchar' and CHARACTER_MAXIMUM_LENGTH = -1)
            and not (DATA_TYPE = 'varbinary' and CHARACTER_MAXIMUM_LENGTH = -1)
            and DATA_TYPE <> 'xml'
          )
    for xml path('Column'), type, root('Columns')
  )
from INFORMATION_SCHEMA.TABLES t
where TABLE_NAME in (
    select TABLE_NAME
    from INFORMATION_SCHEMA.TABLES
    where TABLE_TYPE = 'BASE TABLE'
)
for xml path('RawFileFormat'), type, root('FileFormats')
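To see the shape the nested FOR XML PATH subquery produces, here's a sketch of the same structure built with Python's standard xml.etree module. The column metadata is hypothetical; the real query reads it from INFORMATION_SCHEMA.COLUMNS and applies the same type exclusions:

```python
# A sketch of the nested structure FOR XML PATH builds: a RawFileFormat
# element with a Columns root wrapping one Column per supported field.
# The column metadata below is hypothetical; the real query reads
# INFORMATION_SCHEMA.COLUMNS.
import xml.etree.ElementTree as ET

# (column_name, data_type, character_maximum_length)
columns = [
    ("AddressID", "int", None),
    ("SpatialLocation", "geography", None),   # excluded: geography
    ("AddressLine1", "nvarchar", 60),
    ("Notes", "nvarchar", -1),                # excluded: nvarchar(max)
]

def supported(data_type, max_len):
    """Mirror the WHERE clause: drop geography, xml and (max) types."""
    if data_type in ("geography", "xml"):
        return False
    if data_type in ("nvarchar", "varbinary") and max_len == -1:
        return False
    return True

fmt = ET.Element("RawFileFormat", Name="Person_Address")
cols = ET.SubElement(fmt, "Columns")
for name, dtype, max_len in columns:
    if supported(dtype, max_len):
        ET.SubElement(cols, "Column", Name=name)

print(ET.tostring(fmt, encoding="unicode"))
```

The SQL version does exactly this per table, with the inner FOR XML PATH('Column'), ROOT('Columns') producing the Columns node and the outer FOR XML PATH('RawFileFormat') wrapping each table's list.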
The query returns an XML node, which is shown in the results view like this.
Now, when we click the XML (note that it looks like a URL), a new output window opens with the following code. This is exactly what we need for the file formats, and we can paste it over the proforma FileFormats node.
The XML output from the file formats query
Since we want to include all our data flows in a single package (with one data flow for each table), we need to generate the XML for the data flows within a single package node. We know that the package has a Tasks child which specifies the control flow, and each data flow should reside within it. For example, a single data flow would look like this.
<Tasks>
    <Dataflow Name="EXTRACT_Person_Address">
        <Transformations>
            <OleDbSource Name="src_Person_Address" ConnectionName="Source">
                <DirectInput>Select * from Person.Address</DirectInput>
            </OleDbSource>
            <RawFileDestination Name="dest_Person_Address" RawFileFormatName="Person_Address" WriteOption="CreateAlways">
                <FileOutput ConnectionName="Person_Address" />
            </RawFileDestination>
        </Transformations>
    </Dataflow>
</Tasks>
Now, we need a query to generate this XML for all tables (and here it is). It gives us the full Tasks node for our package.
/* PTB, 10-Nov-2014
   SQL to generate Dataflows for Biml output (all tables in a DB) */
select 'EXTRACT_' + TABLE_SCHEMA + '_' + TABLE_NAME as "@Name"
, 'src_' + TABLE_SCHEMA + '_' + TABLE_NAME as "Transformations/OleDbSource/@Name"
, 'Source' as "Transformations/OleDbSource/@ConnectionName"
, 'Select * from ' + TABLE_SCHEMA + '.' + TABLE_NAME as "Transformations/OleDbSource/DirectInput"
, 'dest_' + TABLE_SCHEMA + '_' + TABLE_NAME as "Transformations/RawFileDestination/@Name"
, TABLE_SCHEMA + '_' + TABLE_NAME as "Transformations/RawFileDestination/@RawFileFormatName"
, 'CreateAlways' as "Transformations/RawFileDestination/@WriteOption"
, TABLE_SCHEMA + '_' + TABLE_NAME as "Transformations/RawFileDestination/FileOutput/@ConnectionName"
from INFORMATION_SCHEMA.TABLES
where TABLE_TYPE = 'BASE TABLE'
order by TABLE_SCHEMA, TABLE_NAME
for xml path('Dataflow'), root('Tasks')
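The slash-and-@ column aliases in the query map straight onto nested elements and attributes. A sketch of the equivalent construction with Python's xml.etree (hypothetical table list) makes the mapping explicit:

```python
# A sketch of the per-table Dataflow node the query above emits. The
# "@X" aliases become attributes and the "A/B" aliases become nested
# elements. The table list here is hypothetical; the real query reads
# INFORMATION_SCHEMA.TABLES.
import xml.etree.ElementTree as ET

tables = [("Person", "Address")]

tasks = ET.Element("Tasks")
for schema, table in tables:
    st = "{}_{}".format(schema, table)
    df = ET.SubElement(tasks, "Dataflow", Name="EXTRACT_" + st)
    tr = ET.SubElement(df, "Transformations")
    src = ET.SubElement(tr, "OleDbSource", Name="src_" + st,
                        ConnectionName="Source")
    ET.SubElement(src, "DirectInput").text = \
        "Select * from {}.{}".format(schema, table)
    dest = ET.SubElement(tr, "RawFileDestination", Name="dest_" + st,
                         RawFileFormatName=st, WriteOption="CreateAlways")
    ET.SubElement(dest, "FileOutput", ConnectionName=st)

print(ET.tostring(tasks, encoding="unicode"))
```

Note how the destination's RawFileFormatName and FileOutput ConnectionName both use the schema_table name, which is what lets the format (step 2) and the file connection (step 1) match up by name.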
Putting it All Together
Because the generated XML is so big, I will not include the full output. However, I think it's good to have an indication of what the output should look like, so a subset of the output will look like this.
When we build the Biml, we get a single SSIS package with a data flow for each table (as shown below), with all the effort of generating the transforms handled by Biml. I'd like to go into SQL's FOR XML clause a little further, but I think it would complicate what we are trying to achieve (after all, this post is pretty long as it is).
Now … I have to ask … Is this the most elegant solution and could it be done better?
Perhaps not, but I think it's a pretty cheeky (and extremely fast) way to generate a package that does an awful lot of copying.