Schema Definitions from an ODBC Text Source

Recently, I worked on a disjointed migration project where things just didn’t work the way you’d expect.  For example, you’d usually like to access data directly from its source (straight from a relational database) but in this situation that was not allowed and we were given a load of text files (hundreds) to load.  Perhaps not so bad, you may say; unfortunately, however, the documentation around those files was incomplete and incorrect.  We spent far too much time confirming valid formats and going to and fro with vendors – it’s not really value adding and, let’s face it, it’s not really a good use of time to transfer and manually configure text imports.  So what could we do?  All the files were deposited in a directory, so why not use the schema collections of an ODBC source?

The idea is to set up an ODBC source that points to the directory containing the files, create a connection to that source and then use the metadata of that connection to iterate over each table in the source and get the column information for that table.  Note that for a text driver each file represents a table, which is why I refer to tables above.

1. Setting Up the Driver

I think this is relatively straightforward, so I’ll just add the key points with a word of warning: there is no 64-bit driver installed by default on Windows, so you may want to download the Access one from here if you need to.  Access the ODBC drivers from your settings (just type ODBC in the search bar and then choose to set up a connection or access ODBC sources), then add a new source and choose the text driver.

[Image: selecting the text driver when adding a new ODBC data source]

When you click Finish you’ll be presented with the window below, where you can specify the directory you want to use and give the source a name.  There’s no need to look at the options but, if you want to, you can restrict the source to specific file extensions; for now let’s assume that everything in the folder is for import.

[Image: the text driver setup window, showing the data source name and directory selection]
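
As an aside, if you’d rather not click through the wizard at all, the text driver also accepts a DSN-less connection string that points straight at the directory.  A minimal sketch is below; the driver name depends on which driver you have installed, and the path is hypothetical.

using System.Data.Odbc;

// A DSN-less connection: Dbq points the text driver at the directory.
// "Microsoft Access Text Driver (*.txt, *.csv)" is the x64 Access driver
// mentioned above; the 32-bit Jet driver is named
// "Microsoft Text Driver (*.txt; *.csv)". C:\data\import is a hypothetical path.
OdbcConnection _con = new OdbcConnection(
    @"Driver={Microsoft Access Text Driver (*.txt, *.csv)};Dbq=C:\data\import;");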

2. Data Access

As a side note, after you’ve set up the driver you can import the data as a table into Excel.  The menu path is (Data tab) Get External Data > From Other Sources > From Microsoft Query.  I know you could import the file directly (i.e. Get External Data > From Text); however, you want to use the text driver and test your ODBC source.  You’ll notice that this source is available as a database (see below), and each file in the directory is presented as a table.  The source that I previously created is shown below along with its list of tables.

[Image: the ODBC source listed as an available database in Microsoft Query]
[Image: each file in the directory presented as a table]
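
Because each file really is a table to the driver, you can also query a file directly by name.  A minimal sketch, assuming the DSN created earlier and a hypothetical customers.csv in the source directory;

using System;
using System.Data.Odbc;

// Query one of the files as if it were a table. The file name
// (customers.csv) is hypothetical; depending on the driver you may
// need to bracket the name.
using (OdbcConnection _con = new OdbcConnection("DSN=biml_text_files"))
using (OdbcCommand _cmd = new OdbcCommand("select * from customers.csv", _con))
{
    _con.Open();
    using (OdbcDataReader _reader = _cmd.ExecuteReader())
    {
        while (_reader.Read())
            Console.WriteLine(_reader[0]);   // print the first column of each row
    }
}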

3. Using the ODBC Schema Object

As a general rule, when we create a database connection in code, the connection object is accompanied by a schema and we can use this to interrogate the source.  As you can see from the snippet, all we do is acquire the “Tables” schema (i.e. list the tables), then iterate over each table to get the columns in that table.  The sample output is shown below and, as you can see, it defines the column names and types.

using System;
using System.Data;
using System.Data.Odbc;

OdbcConnection _con = new OdbcConnection("DSN=biml_text_files");
_con.Open();

// GetSchema() lists the available metadata collections, and the
// "Restrictions" collection shows which restrictions each collection
// accepts; neither is used below, but both are handy when exploring
// an unfamiliar driver.
DataTable _schema = _con.GetSchema();
DataTable _restrictions = _con.GetSchema("Restrictions");

// List the tables in the source (for the text driver, one table per file).
DataTable _tables = _con.GetSchema("Tables");

foreach (DataRow _tablerow in _tables.Rows)
{
    Console.WriteLine("{0}", _tablerow["TABLE_NAME"]);

    // Restrict the "Columns" collection to the current table; the
    // restrictions are catalog, schema, table (and optionally column).
    string[] _table_restrictions = { null, null, _tablerow["TABLE_NAME"].ToString() };
    DataTable _columns = _con.GetSchema("Columns", _table_restrictions);
    foreach (DataRow _colrow in _columns.Rows)
    {
        Console.WriteLine("\t{0},{1},{2}", _colrow["COLUMN_NAME"], _colrow["TYPE_NAME"], _colrow["COLUMN_SIZE"]);
    }
}

_con.Close();

[Image: sample output listing each table (file) with its column names, types and sizes]

Actually, you don’t even need to go to this trouble.  You could just get the “Columns” schema without any restrictions, which gives you a complete list of all columns for every table.  However, in our implementation the data was sent back to a database so others could use it.
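
For reference, here’s a minimal sketch of that unrestricted call, using the same DSN as above;

using System;
using System.Data;
using System.Data.Odbc;

using (OdbcConnection _con = new OdbcConnection("DSN=biml_text_files"))
{
    _con.Open();

    // With no restrictions, "Columns" returns every column of every
    // table (file) in a single DataTable.
    DataTable _all_columns = _con.GetSchema("Columns");
    foreach (DataRow _colrow in _all_columns.Rows)
        Console.WriteLine("{0}: {1} ({2})",
            _colrow["TABLE_NAME"], _colrow["COLUMN_NAME"], _colrow["TYPE_NAME"]);
}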

Conclusion

Working with the schema information objects allows us to derive information about the source.  It’s conceptually similar to the INFORMATION_SCHEMA views found in most databases; however, when we apply it to text files, we can use that metadata to build our ETL, as the sketch below suggests.
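
To give a flavour of what building the ETL from this metadata might look like, the sketch below emits a staging CREATE TABLE statement per file.  The staging schema name and the blanket nvarchar typing are illustrative assumptions only, not what any particular project would use verbatim.

using System;
using System.Collections.Generic;
using System.Data;
using System.Data.Odbc;

using (OdbcConnection _con = new OdbcConnection("DSN=biml_text_files"))
{
    _con.Open();
    foreach (DataRow _tablerow in _con.GetSchema("Tables").Rows)
    {
        string _table = _tablerow["TABLE_NAME"].ToString();
        string[] _restrictions = { null, null, _table };

        // Build one column definition per schema row; mapping every type
        // to nvarchar is a simplification for illustration only.
        List<string> _cols = new List<string>();
        foreach (DataRow _colrow in _con.GetSchema("Columns", _restrictions).Rows)
            _cols.Add(string.Format("\t[{0}] nvarchar({1})",
                _colrow["COLUMN_NAME"], _colrow["COLUMN_SIZE"]));

        Console.WriteLine("create table staging.[{0}]\n(\n{1}\n);",
            _table, string.Join(",\n", _cols));
    }
}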


BIML VII – Extracting Data Using ODBC Destinations

In previous posts we looked at using Biml to extract data from our database tables to flat files and raw files. While these destinations can be handy, the majority of extracts that I use are to staging tables in a relational engine. This post looks at the Biml required to send our data to ODBC destinations.

The Environment

For the purpose of demonstration, I will use the DimCustomer table from Adventure Works. There’s nothing too complicated about that table; all I want to do is extract some data from it, so we’ll just use a simple query (with code, first name and last name);

select 
CustomerAlternateKey
, FirstName
, LastName 
from dbo.DimCustomer;

For the destination, we’ll look at two options. First, we will rely on the auto-mapping feature (so that we do not specify column mappings between the source and destination). Second, we will override that feature and specify column-to-column mappings. For this I’ll create two tables in a staging database, called customer_automapping (with the same field names as the source) and customer_manualmapping (with different names). The SQL for these tables is given below;

create table dbo.customer_automapping
(
	CustomerAlternateKey nvarchar(25)
	, FirstName nvarchar(50)
	, LastName nvarchar(50)
);

create table dbo.customer_manualmapping
(
	customer_cd nvarchar(25)
	, first_name nvarchar(50)
	, last_name nvarchar(50)
);

Proforma Template

We’ll use Biml to create two parallel dataflows in a package (the package will be called copy_customer); the dataflows will be called customer_automapping and customer_manualmapping. Our template package looks like the one below;

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Connections>
    <OdbcConnection Name="dest_stage" ConnectionString="Driver={SQL Server Native Client 11.0};...." />
    <OdbcConnection Name="src_adventure_works" ConnectionString="Driver={SQL Server Native Client 11.0};..." />
  </Connections>
  <Packages>
    <Package Name="copy_customer" ConstraintMode="Parallel">
      <Tasks>
        <Dataflow Name="customer_manualmapping">
			<Transformations>
			</Transformations>
        </Dataflow>
        <Dataflow Name="customer_automapping">
			<Transformations>
			</Transformations>
        </Dataflow>
      </Tasks>
    </Package>
  </Packages>
</Biml>

Note that we have already defined placeholders for each dataflow; all we have to do is populate those (transformation) nodes. We have also defined our source and destination connections.

The ODBC Source

We have previously addressed an ODBC source, so for simplicity I’ll just add the code. This needs to be added to each dataflow’s Transformations node;

<OdbcSource Name="src_adventure_works" Connection="src_adventure_works">
  <DirectInput>
	select CustomerAlternateKey, FirstName, LastName from dbo.DimCustomer;
  </DirectInput>
</OdbcSource>

The ODBC Destination

We’d expect the ODBC destination to require a name (since every item requires one), a connection and a table (name). These are the only requirements for the auto-mapping feature (where source fields are automatically mapped to destination fields on the basis of name). The snippet for the auto-mapping dataflow is therefore;

<Dataflow Name="customer_automapping">
  <Transformations>
	<OdbcSource Name="src_adventure_works" Connection="src_adventure_works">
	  <DirectInput>
		select CustomerAlternateKey, FirstName, LastName from dbo.DimCustomer;
	  </DirectInput>
	</OdbcSource>
	<OdbcDestination Name="dest_customer" Connection="dest_stage" >
	  <ExternalTableOutput Table="[dbo].[customer_automapping]" />
	</OdbcDestination>
  </Transformations>
</Dataflow>

Note that the table is defined by the ExternalTableOutput tag. There is no need to fully qualify the table as I have done (by including the schema), and

<ExternalTableOutput Table="customer_automapping" />

would suffice, provided the default schema for the connection was the same as the table’s. As expected, our dataflow creates the source and destination components and automatically maps fields from the source to the destination based on name.

To explicitly set the source-to-target mappings, we need to include a Columns tag (with Column child nodes) as a child of the destination. The Biml for this is below, but it is worth noting that a mapping is only overridden if the column appears in this tag. For example, if two columns have the same name and there is no Column specification for them, the two fields will still be auto-matched. Therefore, to fully specify the mapping in the ODBC destination we would use the following snippet;

<OdbcDestination Name="dest_customer" Connection="dest_stage" >
  <ExternalTableOutput Table="[dbo].[customer]" />
  <Columns>
	<Column SourceColumn="CustomerAlternateKey" TargetColumn="customer_cd"  />
	<Column SourceColumn="FirstName" TargetColumn="first_name"  />
	<Column SourceColumn="LastName" TargetColumn="last_name"  />
  </Columns>
</OdbcDestination>

It’s clear to see that the mapping occurs between the SourceColumn and TargetColumn attributes of each Column node.

The Complete Biml File

For completeness, here is the complete Biml code;

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Connections>
    <OdbcConnection Name="dest_stage" ConnectionString="Driver={SQL Server Native Client 11.0};…." />
    <OdbcConnection Name="src_adventure_works" ConnectionString="Driver={SQL Server Native Client 11.0};…;" />
  </Connections>
  <Packages>
    <Package Name="copy_customer" ConstraintMode="Parallel">
      <Tasks>
        <Dataflow Name="customer_manualmapping">
          <Transformations>
            <OdbcSource Name="src_adventure_works" Connection="src_adventure_works">
              <DirectInput>
                select CustomerAlternateKey, FirstName, LastName from dbo.DimCustomer;
              </DirectInput>
            </OdbcSource>
            <OdbcDestination Name="dest_customer" Connection="dest_stage" >
              <ExternalTableOutput Table="[dbo].[customer]" />
              <Columns>
                <Column SourceColumn="CustomerAlternateKey" TargetColumn="customer_cd"  />
                <Column SourceColumn="FirstName" TargetColumn="first_name"  />
                <Column SourceColumn="LastName"   />
              </Columns>
            </OdbcDestination>
          </Transformations>
        </Dataflow>
        <Dataflow Name="customer_automapping">
          <Transformations>
            <OdbcSource Name="src_adventure_works" Connection="src_adventure_works">
              <DirectInput>
                select CustomerAlternateKey, FirstName, LastName from dbo.DimCustomer;
              </DirectInput>
            </OdbcSource>
            <OdbcDestination Name="dest_customer" Connection="dest_stage" >
              <ExternalTableOutput Table="[dbo].[customer_automapping]" />
            </OdbcDestination>
          </Transformations>
        </Dataflow>
      </Tasks>
    </Package>
  </Packages>
</Biml>

Overwriting the Automap

There may be a situation where you want to auto-map some fields but not others (say, for example, we did not want to populate the CustomerAlternateKey field). In this situation we can rely on auto-mapping and mark the column as not being used through the IsUsed attribute. For example, we could use;

<OdbcDestination Name="dest_customer" Connection="dest_stage" >
  <ExternalTableOutput Table="[dbo].[customer_automapping]" />
  <Columns>
	<Column SourceColumn="CustomerAlternateKey"  IsUsed="false" />
  </Columns>
</OdbcDestination>
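
With this snippet, FirstName and LastName are still auto-matched on name, while CustomerAlternateKey is simply excluded from the mapping, leaving that destination column unpopulated.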