To Automate or Not? – The Pros and Cons of Data Warehouse Automation

I have been thinking about data warehouse automation for at least two years now. That's long enough to see the pros and cons, examine the state of affairs, consider the consequences for both camps and, lastly, consider my reaction to those consequences. I've come to realise that the last point (my reaction) plays on my mind because the subject is driven more by emotion than by rational decision-making. What I hope to achieve with this post is to materialise some of my thoughts on the matter and present a few ideas that others may find useful if they are considering this technology.

Before we get into the grit of the discussion, it is probably worth noting that I consider the data warehouse in the more traditional sense. That is, a relational engine that stores and delivers agreed data. The additional possibilities arising from the big data ecosystem, the concepts of a data lake (or swamp) and schema on read (where data doesn't necessarily have to conform to a model on load) would add a layer of complexity to this discussion which we just don't need. We could also add data virtualisation, but that would further cloud the water in what can be simplified to a decision to automate or not.

In my traditional view of this world, the data warehouse needs to deliver two things. Firstly, it needs to store a historical record of data (as it existed) and secondly, it needs to present information to the consumer in the manner that they require. Those two points might seem like real no-brainers but I thought it better to call them out now because they create some flow-on implications, which are;

  • the historical record of data may not be what the consumer currently wants
  • the ‘delivery’ requirements of the system will change
  • the system must be capable of accommodating changing requirements

Data Warehouse Development

In traditional development, the system is designed at a conceptual level, a data modeller designs and builds the relational framework (the schema) and then ETL developers populate it. The last stage (ETL development) demands the largest effort on a project and has been documented to consume more than 60% of total effort.

Now, let's examine the first two steps. I suggest that the conceptual design is driven by understood practices#1 – to the extent of a cookie-cutter template. Regardless of the methodology used, most employ a series of processes and silos that do not change from implementation to implementation. Data is extracted into a staging area, transformed into a warehouse, deployed to a presentation server/layer and delivered via a user interface (reporting/analytic layer). We expect the modeller to use a tool to manage and maintain the modelling process (eg ERwin, DeZign) and would think it unprofessional and prone to error if components were hand coded.

In contrast, when it comes to ETL development, most processes are manual. Sure, there's a tool that is used (eg SSIS, Kettle, Talend) which acts as an IDE; however, the development process is still labour intensive. The developer links together components to perform some action or task, and each task/package must be manually developed and configured (for example, you would write 10 tasks to extract 10 tables into a staging area).

Is the process of ETL development using an IDE similar to that of the modeller (and can it therefore be argued to be automated)? I think yes and no. Yes, because there is a layer of abstraction between the development process and the output (today we would think it insane to manually write code in a text editor), and no, because the abstraction layer is not high enough (meaning that it is still very task orientated).

Automation

Enter the realm of automation technologies. To summarise these tools, they;

  1. Generally use a model (and data) driven approach to define the conceptual layer required. For example, you point the tool at a source system to reverse engineer the underlying model, which you can later adjust (for example, to add other entities, relationships etc).
  2. Define a schema based on the prior step (and generate the DDL to create the target schema).
  3. Map source systems to the target schema and generate the code and processes to do this.
  4. Manage the execution of the code/processes derived in step 3.
  5. Provide some presentation layer.

The Arguments

So why are automation products considered by purists and coders as second-rate citizens in the world of data warehouse development? There are a few key arguments that I’ve come across;

  • The tool does not produce the schema that we want or we can’t change it the way we want. This contention presents itself in several ways.
    • The modeller suggests that there is a correct way to build the schema and the tool should conform to that (or be capable of alteration to the correct schema). I've italicised the word correct here because that can really be a matter of conjecture. Consider storing related data in a second table which is directly related to the first (ie a 1:1 relationship). You can convincingly argue both sides (there should be 1 table or 2). I think the key point of this argument is that the tool is not flexible enough for schema alteration. Sometimes this is true; sometimes it's just nonsense based on hearsay or a single product evaluation.
  • Another point is that the data is not retained correctly (for example, history is not recorded and is incapable of change). Most people that present this have not completed an in-depth analysis of the products available. Just like other process-based software, not all products are created equal, and it would be foolish to judge the class of products on a limited scope (I am sure there's a product that suits your needs). What is interesting about this point is how much bias is targeted at automation software. The same scenarios can be readily found in traditional implementations, where a change in requirements (or, dare I suggest, an oversight in design) renders the existing schema redundant. Change and rework is inevitable in this scenario and it's surprising how that circumstance is eliminated from our evaluation process.
  • The tool is a black box and I can write better transformations. This is perhaps the most common discussion point I see amongst developers. They argue that the tool produces generic code which could be improved had it been created manually (by them). Perhaps this is true, and there are some instances where code could be written that outperforms and is more elegant than generated code. However, the converse may also be true – that is, manual development may be more prone to poor performance. How can this be? Well, for a start you assume that the developer can deliver the results they say they can (and that they actually know what they're talking about). This is not always the case: some developers obtain their positions through bluffs on their resume which are all too often found out too late.
    • When presented with this argument, there is a natural assumption that the solution requires elegant and well-performing code. Sometimes this is true and sometimes it's not. Performance improvements that look good on paper may have absolutely no benefit to the overall solution even if they deliver a tangible result. For example, a generated process taking 25 minutes compared to a manually created process taking 20 minutes may show an improvement; however, is there really an improvement if the entire process completes within the processing window either way?
  • It's too expensive. When an automation tool is evaluated, its price is naturally considered as part of the evaluation equation. Let's say the typical range of prices is between $50K and $80K. That is evaluated against a consultant for six to ten weeks, or a full-time employee for nine months. More than enough time to build it manually, right? I'm not so sure. ETL projects are notorious for overruns and, while we expect our projects to go smoothly, the chances are that there will be overruns and these are not factored into the cost equation.
    • I'd also like to consider the situation where the developer(s) leave. In this situation, there is often more unplanned rework through either a lack of knowledge or a different consultant's approach.
  • Further considering the "it's too expensive" line, I would also like to introduce the concept of the time value of data (whoops, information). I have never seen this addressed in a cost-benefit analysis, but to my mind it must play a role. The time value of data relates to the increase in productivity, revenue or output gained by having the solution sooner (let's assume that an automated tool can deliver this). What does it mean in monetary terms (to the business) to have data now (and not in six/nine months)? Can they utilise the data warehouse for competitive advantage to increase profitability?

The Plusses – Viva Automation

What can automation deliver that has not been addressed above? On the pro side of automation software we could suggest;

  • A consistent approach. Since the automation software develops everything for us, its output is delivered in a consistent manner. Of course this has good and bad points, but the key is the consistency. Developer peculiarities should not exist since the tools operate through metadata (and generation). Theoretically, one developer is equivalent to another, so there may be no need for sliding pay scales based on experience or special skills.
  • Since an operator manages the software, one might suggest that developers are not needed (and neither are their specialist skills). This may make resourcing a lot simpler and remove the barriers between IT and the business (provided we accept those barriers exist). There is the potential for a business user, with training in the software, to build and manage the data warehouse rather than a specialist IT resource. This type of transformation has occurred with accounting software that does not require an accountant to operate it (perhaps only a data entry clerk).

The Emotion

Every discussion I have had around the use of automation tools involves some emotion (in some cases I'd go so far as to say that the conversation is dictated by emotion). To get straight to the point, the idea that specialist developers can be replaced by software is so confrontational that we (myself included) simply refuse to accept it and so live in a state of denial. The denial says "I've been doing this for XX years, have specialist skills and no machine can take my place".

This has happened in so many industries and is such a predictable response, I think all we need do is identify its (potential) presence.

Finally

There have been lots of points in this post. If you think I’ve missed something, want to add your own point or expand on something I’ve mentioned, then just comment or drop me a line. As usual comments are open and welcome.

I'll also say that all comments will be published (provided they are not spam, contribute to the discussion and are vendor agnostic). If you would like your comment inserted in a section so that the flow of the article remains intact, just let me know and I'll add it.

The floor is now open :)

****

#1 As a side note, it's time to revise these practices – perhaps more in a later post.

BIML II – Extracting Database Tables to Text Files

In this post we look at the Biml required to generate an SSIS package that copies data from a table to a flat file (that is, a text file output). It's a pretty simple example but one that builds on our prior knowledge. We have already gone over the basics of Biml in our first post so we won't worry about an overview of Biml, how to install it or how to generate packages. What's worth remembering, though, is that Biml is an XML specification, and packages and connections are child nodes of the XML root. We'll also deviate a little bit in this post by using an OLEDB connection to SQL Server and specifying a file format.

Overview

If we think about the way our package would be built in SSIS, we would probably follow a simple set of steps;

  1. Create a connection to SQL Server
  2. Add a dataflow task
  3. Specify a source (let’s just use an OLEDB source) and provide the query.
  4. Add a flat file destination, join the source to the destination and then configure the output by creating a new flat file connection manager.

What's somewhat hidden in this process is that the output file structure is defined automatically when we create a new connection – that is, it is handled by the destination wizard when you create a new connection manager. Keep in mind that Biml has no idea about this, so we need to specify the file structure somewhere in the Biml code.

So if we are going to achieve this in Biml, we would need to;

  1. Add the connections
  2. Specify the file format
  3. Create a package that has a data flow task
    1. Uses the source connection (with a query) to produce a dataset … and
    2. Places that dataset into an output connection.

So we would expect our structure to be something like this (I'm using a little prior knowledge and know that the FileFormats tag is a root-level element);

<Biml xmlns="http://schemas.varigence.com/biml.xsd">

  <Connections>
    <!-- Placeholder for connection -->
  </Connections>

  <FileFormats>
    <!-- Placeholder for structures-->
  </FileFormats>

  <Packages>
    <Package Name="CopyAccount" ConstraintMode="Linear">
    <!-- Place Holder for a Package -->
    </Package>
  </Packages>

</Biml>

Connections

If we follow our previous example, we can simply specify our connections under the Connections node as in the following code (again, Intellisense can work wonders here). The OLEDB connection should be pretty straightforward, but what is interesting is the simplicity of the FlatFile connection. As you would expect, it's got a Name and a Path (or destination), but what's really interesting is the FileFormat specification. This is a required property for a flat file and specifies the structure of the stream.

<Connections>
  <OleDbConnection Name="Source" ConnectionString="Provider=SQLOLEDB;Data Source=localhost;Integrated Security=SSPI;Initial Catalog=AdventureWorksDW2012" />  
  <FlatFileConnection Name="Target" FilePath="F:\\Output.txt" FileFormat="AccountTable" />
</Connections>

FileFormat

We have briefly discussed the creation of a format. For our purposes we could summarise it as the definition of a dataflow (somewhat akin to the definition of a table). As you would expect (and as is logical for an XML table definition), the structure defines a header (that is, the table container) and the columns within that table.

<FileFormats>
    <FlatFileFormat Name="AccountTable" ColumnNamesInFirstDataRow="true" >
      <Columns>
        <Column Name="AccountKey" Delimiter="Comma" />
        <Column Name="AccountCodeAlternateKey" Delimiter="Comma" />
        <Column Name="AccountDescription" Delimiter="Comma" />
        <Column Name="AccountType" Delimiter="Comma" />
        <Column Name="Operator" Delimiter="LF" />
      </Columns>
    </FlatFileFormat>
</FileFormats>

Naturally each node has to have a Name, and columns also require a Delimiter. You'll notice that the last column's delimiter is a Line Feed (LF, or new line). This just ensures that each row appears on its own line.

We could also specify data types for columns and other properties; however, these are not mandatory for our example. What is required is that the field names in the FileFormat match those from our table. When the SSIS package is generated, the mappings from the source (OLEDB data source) to the destination (the FlatFile) will be based on name. The order is not important, only the Name.
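
For example (a sketch only – the DataType and Length values below are illustrative assumptions and should be checked against your source table), a more explicit column definition might look like this;

<Columns>
  <!-- DataType and Length are optional for our example but make the output structure explicit -->
  <Column Name="AccountKey" DataType="Int32" Delimiter="Comma" />
  <Column Name="AccountDescription" DataType="AnsiString" Length="50" Delimiter="Comma" />
  <Column Name="Operator" DataType="AnsiString" Length="10" Delimiter="LF" />
</Columns>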

The Package

If we consider that we have the framework for moving data, we would expect that our package requires the following tasks with a precedence constraint between them;

  1. A Data Flow Task
    1. OLEDB Source (which specifies the connection and the SQL Command)
    2. A flat file destination (specifying the file and format).

This is exactly what we need to specify in our package. The Biml for this is;

    <Package Name="Copy_Dim_Account" ConstraintMode="Linear" >
      <Tasks>
        <Dataflow Name="Copy_Dim_Account">
          <Transformations>
            <OleDbSource Name="Src_Dim_Account" ConnectionName="Source" >
              <DirectInput>Select AccountKey, AccountCodeAlternateKey, AccountDescription, AccountType, Operator from DimAccount;</DirectInput>
            </OleDbSource>
            <FlatFileDestination Name="Dest_Dim_Account" ConnectionName="Target" Overwrite="true">
            </FlatFileDestination>
          </Transformations>
        </Dataflow>
      </Tasks>
    </Package>

We can easily see how these tasks relate to both the source and destination of the dataflow and how they reference the objects that we previously defined (ie the connection and FileFormat).

Previously, we also examined the use of the package's ConstraintMode property, which dictates how tasks on the Control Flow relate to each other. That is, a Linear mode creates constraints between tasks whereas a Parallel mode does not.
This property is not materialised in the DataFlow. For example, even if the package has a Parallel ConstraintMode, the tasks (or transformations) within the data flow will automatically be connected (in the order that they appear in the code).

Also note that the source uses a DirectInput tag as a child of the OleDbSource, which is the same syntax used for the ExecuteSQL tag.

The Complete Solution

For completeness, the full code is;

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Connections>
    <OleDbConnection Name="Source" ConnectionString="Provider=SQLOLEDB;Data Source=localhost;Integrated Security=SSPI;Initial Catalog=AdventureWorksDW2012" />  
    <FlatFileConnection Name="Target" FilePath="F:\\Output.txt" FileFormat="AccountTable" />
  </Connections>
  <FileFormats>
    <FlatFileFormat Name="AccountTable" ColumnNamesInFirstDataRow="true" RowDelimiter="LF" >
      <Columns>
        <Column Name="AccountKey" Delimiter="Comma" />
        <Column Name="AccountCodeAlternateKey" Delimiter="Comma" />   
        <Column Name="AccountDescription" Delimiter="Comma" />
        <Column Name="AccountType" Delimiter="Comma"  />
        <Column Name="Operator" Delimiter="LF" />
      </Columns>
    </FlatFileFormat>
  </FileFormats>

  <Packages>
    <Package Name="Copy_Dim_Account" ConstraintMode="Linear" >
      <Tasks>
        <Dataflow Name="Copy_Dim_Account">
          <Transformations>
            <OleDbSource Name="Src_Dim_Account" ConnectionName="Source" >
              <DirectInput>Select AccountKey, AccountCodeAlternateKey, AccountDescription, AccountType, Operator from DimAccount;</DirectInput>
            </OleDbSource>
            <FlatFileDestination Name="Dest_Dim_Account" ConnectionName="Target" Overwrite="true">
            </FlatFileDestination>
          </Transformations>
        </Dataflow>
      </Tasks>
    </Package>
  </Packages>
</Biml>

Conclusion

In this post, we have looked at the generation of packages that move data from a table to a text file. This is a simple scenario that demonstrates the minimum requirements for tag specification and utilises automatic mapping of source fields to target.

Data Vault and Data Mining (AusDM)

What does the data vault data warehousing methodology have in common with data mining?

There are two conferences / courses being run in Australia in November, 2014.

Data Vault

The Data Vault data warehousing methodology has gained a lot of traction over the past few years. If you’ve never heard of it, I would suggest that it is somewhat of a cross between normalisation and star schema design (of course I am leaving myself open to a bit of criticism with that definition but it is only half a sentence).

Apart from the mandatory recognition of the inventor (Dan Linstedt), the best book that I've read on the methodology was written by Hans Hultgren and it is simply a must-have if you are interested in learning the method (you can check it out on Amazon here). I cannot recommend this book highly enough. Anyway, I digress.

Hans will be in Sydney conducting Data Vault training on 12th – 14th November (that is, training to become a certified Data Vault Modeller). He has partnered with Analytics8 and MIP (don't ask me why the prices are different) and, while I would never usually promote vendors' training solutions, I can make an exception for training delivered by Hans. If you are interested, you can find out more here.

AusDM

AusDM, or the Australasian Data Mining Conference, is being run in Brisbane at Queensland University of Technology (Gardens Point) on 27th-28th November. The first time I went to this conference, it was purely an academic conference on data mining. If you've never been to an academic conference, they are a lot like other conferences, except that research papers are presented, so the presentations are focused and specific (you'll also need a bit of background in the subject area). To be honest, most industry participants find this pretty dry and boring. Research findings lead technical implementation by a good number of years and the work seems mostly theoretical.

What I like about AusDM is that they have an industry focus in addition to the academic presentations. For example, there is a workshop on R which (IMO) makes the price of admission inconsequential (when compared to other R training).

If you would like to know more about AusDM, please check out the site http://ausdm14.ausdm.org/home

BIML I – Getting Started with BIML

Traditional SSIS development requires the ETL developer to create a package and then manually add and configure tasks. While this method of visual development is fine, it suffers from poor reproducibility and a maintenance overhead. For example, imagine (a very contrived scenario) that you create a package with a single 'Execute SQL Task' and configure it to fire a procedure. Simple enough, but now suppose you want to create a new package and fire a different proc – here you have to manually go in and follow all the same steps again. Perhaps not so bad, but what if you create 10 packages and then decide that the Execute SQL Task should be prefixed EXSQL? You would have to manually go into each package and rename the task. This is very cumbersome and it (by nature) carries a maintenance burden.

BIML (Business Intelligence Markup Language) is an XML specification that (amongst other things) allows you to specify the creation of SSIS packages. One of the great things about BIML is that it can also include code snippets that amplify the ability to generate code. This first post forms part of a series that shows how to use BIML to create packages and what type of outcome we can achieve. But (of course) first things first, so let's start with a simple goal: using BIML to generate a package that executes an SQL command.

Before we continue, let's quickly revisit my comments above about maintainability.

An SSIS package is really just an XML file right, so why can’t I just edit that?

If you’ve ever played around with an SSIS package, you’ll probably know that it is an XML file. You can see this by viewing the code of your packages as shown below (right click on the package and view code). So you may think that you can generate or alter this pretty easily.

However, I have found this XML to be very finicky. Suppose you wanted to redefine a connection by editing the script (something which you'd think should be simple and which I've discussed here). You would expect that you could just examine the code, find the connection string, alter it and save it. But I've never had any success with that approach. When you try to open the file again (using the visual designer) it is invalid, so you're pretty much helpless with the changes you tried to make (sorry, but there's no comeback). That's only to change an existing package, so the creation of a package this way is (IMO of course) just not possible. You could use an API but that just seems out of reach for most.

Of course, newer versions of SSIS have project-wide objects (like connections). Using these would naturally require less maintenance; however, we are still faced with task management (altering and creation) and package development. Further, if we want to define a group of packages by metadata (say, for example, a list of tables to extract), there is no current way to do it in SSIS. A much simpler approach is BIML, which gives us a generic specification for development.
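
To give a flavour of what that looks like (a sketch only – the table names here are hypothetical and the Packages/Package syntax is covered below), a BimlScript code nugget can loop over a list of tables and emit one package per table;

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Packages>
    <!-- one (empty) package is generated for each table name in the list -->
    <# foreach (var table in new[] { "DimAccount", "DimEmployee" }) { #>
    <Package Name="Extract_<#=table#>" ConstraintMode="Linear" />
    <# } #>
  </Packages>
</Biml>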

What do I Need to get running with BIML?

BIML is available when BIDS Helper is installed. BIDS Helper is a community add-in for SQL Server Data Tools (SSDT and BIDS) and is available on CodePlex. If you haven't heard of BIDS Helper before, take some time to look around its features. It really is a must-have for business intelligence development in the Microsoft space.

Once BIDS helper is installed, you can add a new BIML file by right clicking on the Project (or SSIS Packages) node and selecting ‘Add New Biml File‘. The file is created in the Miscellaneous Folder (as shown below).

 

A single BIML file can contain the code that creates many packages (as will become apparent) but for now we will just work with the basics and use it to create a package that fires an SQL statement.

Adding Code

Opening the BimlScript file shows the XML in the file (we should probably give it a decent name but we will not worry about that this time). When you start, the file just contains the root node (as below).

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
</Biml>

What BIML now allows us to do is create objects within the root (and later other nodes). For example, since we want to create a package, we naturally expect that we need to add a package child. Naming is pretty intuitive since it follows the SSIS object names. I’ve already hinted that we can use the same file to create multiple packages so it seems natural that a package tag should be embedded within a packages tag. Something like the following.

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Packages>
    <Package></Package>
  </Packages>
</Biml>

Actually, I did cheat a little (and so can you). The Intellisense functionality lets you know what tags are allowed at the position you are in. Remember that in XML a tag is just an identifier for an object, so since I want to do something within my Biml tag (where else would I put it), I can just open a tag and see what options are available. I want to create a package (so a search for package shows I can add Packages and, of course, within that I can add a Package).

Validation and Creation

Back to our package snippet. We've got the declaration (as above) to specify a package within a Packages node. You'd expect there to be a way to validate the XML to be sure it has the required information to create packages? Well, just right click on the file (in Solution Explorer) and select 'Check Biml for Errors'. A dialog opens with a list of errors.

 

If we think about it, it makes sense doesn't it? Of course we need to give our packages a name! In contrast, the error for ConstraintMode is something that we take for granted when we generate packages visually. We define the order of tasks in a package through constraints (or connectors). However, Biml has no indication of the layout of tasks other than the order in which they appear in the XML. How they relate to each other must be defined in the package declaration, and this is the purpose of the ConstraintMode property. There are two allowable values for this property (either Parallel or Linear) which (as the names suggest) specify that tasks will either not be connected or will be connected sequentially.

So in order to create a package, we update the Biml code as follows (note that since we will only have one task in this package, the constraint mode is irrelevant);

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
<Packages>
  <Package ConstraintMode="Linear" Name="package_01"></Package>
</Packages>
</Biml>

We now have everything that we need to generate packages from the script. Simply right click on the Biml file and select Generate SSIS Packages (as above). You'll see a new package created under the SSIS Packages node in Solution Explorer. Of course, there are no tasks in our package yet.

Adding the Execute SQL task to the Package.

We've already seen that the XML nodes and naming conventions are very similar to those of SSIS (both in structure and definition). If we think about adding an Execute SQL task to the package, there are a few points we could make (or at least assume);

  1. a task would be embedded within a tasks node
  2. the task would probably be defined by name (or a task type)
  3. an execute SQL task requires a connection
  4. an execute SQL task requires a command
  5. the connection would be required before the command.

It turns out that the Biml to add this task to our package is pretty straightforward. Of course you can use the IntelliSense to suggest what nodes are available and, when you look at the code and think about what you're trying to do, it seems to make sense (that is, create a connection, create a package, add a task to the package that uses the connection and has some syntax).

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Connections>
    <Connection Name="SQL01" ConnectionString="Provider=SQLNCLI10.1;Data Source=.;Persist Security Info=False;Integrated Security=SSPI;Initial Catalog=AdventureWorksDW"/>
  </Connections>
  <Packages>
    <Package ConstraintMode="Linear" Name="package_01">
      <Tasks>
        <ExecuteSQL Name="fire_proc_01" ConnectionName="SQL01" >
          <DirectInput>Select 1</DirectInput>
        </ExecuteSQL>
      </Tasks>
    </Package>
  </Packages>
</Biml>

Perhaps the most interesting thing about this snippet is the connection object. We'd expect it to be embedded within a Connections node; however, the location of that node at the root gives us the advantage of reuse, so another task within the package, or additional packages and tasks, can reuse that connection. For example, to create two packages with two Execute SQL commands in each (where, in the first package, the second command requires the first to complete), we could write;

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Connections>
    <Connection Name="SQL01" ConnectionString="Provider=SQLNCLI10.1;… (as above) "/>
  </Connections>
  <Packages>
    <Package ConstraintMode="Linear" Name="package_01">
      <Tasks>
        <ExecuteSQL Name="SQL_CMD_1" ConnectionName="SQL01" >
          <DirectInput>Select 1</DirectInput>
        </ExecuteSQL>
        <ExecuteSQL Name="SQL_CMD_2" ConnectionName="SQL01" >
          <DirectInput>Select 1</DirectInput>
        </ExecuteSQL>
      </Tasks>
    </Package>
    <Package ConstraintMode="Parallel" Name="package_02">
      <Tasks>
        <ExecuteSQL Name="SQL_CMD_1" ConnectionName="SQL01" >
          <DirectInput>Select 1</DirectInput>
        </ExecuteSQL>
        <ExecuteSQL Name="SQL_CMD_2" ConnectionName="SQL01" >
          <DirectInput>Select 1</DirectInput>
        </ExecuteSQL>
      </Tasks>
    </Package>
  </Packages>
</Biml>
This produces package_01, which looks like the image below, and we can see how the specification of ConstraintMode has caused SQL_CMD_1 to link to SQL_CMD_2.

 

Note that the output from the second package definition (where ConstraintMode is Parallel) does not introduce precedence constraints between the 2 tasks.

Conclusion

This post has looked at using Biml to create packages in SSIS. It's a pretty simple example, but we can see how the Biml XML can be used to automatically create (and recreate) packages. If we wanted to change a property (say, for example, the SQL syntax), we could just change it in the code and regenerate the packages.

In later posts, we’ll extend this functional knowledge for more complicated scenarios.

 

 

Changing Excel Table Connections in VBA

In previous posts, I've discussed using VBA to extract connection information from Pivot Tables and hacking a table's command so that you can programmatically change the results of a query. But what if you have a table and you want to change its connection and its command?

This can be achieved pretty easily using the table's QueryTable object. In fact, we can change the entire table using this class.

So how does it work? Suppose I've inserted some data into a worksheet from SQL Server. The menu path is DATA → From Other Sources → From SQL Server. Next I get the familiar connection windows to connect to a server and then select a table (as below).

I'll give the table a friendly name (adventure_works) and return the data to a table. I get a connected table which is sourced from a query and can be updated (either right click somewhere in the table and select Refresh from the popup menu, or select the Refresh button from the TABLE TOOLS → DESIGN (ribbon) menu).

I can also use the DESIGN tab of the table tools to give my table a decent name by just replacing its name (as in the following image).

Now I want to change the database server (and potentially the query) that my table uses. There are 2 things to keep in mind here. Firstly, we need to change the query's connection string (no brainer eh) and secondly, the table (in Excel) was added as a table (rather than a query).

As in the previous post, I can refer to the table as a ListObject in VBA. This is a property of the sheet, so in order to reference the table I'll have to use some code like this (note the table was on Sheet2).

Dim li As ListObject

Set li = Sheet2.ListObjects("aw_employees")

For the connection string, I can refer to it through the ListObject's QueryTable class. So I could determine what the current connection string is, or set it, by using the following code;

Debug.Print li.QueryTable.Connection 'retrieve the connection string

li.QueryTable.Connection = "OLEDB;Provider=SQLOLEDB.1;…(this is my connection string) .."

If I then refresh the table (using li.Refresh), one may think that's the end of the story and our work is complete – but that's not quite true just yet.

Remember that we added the table as an SQL table? Because of this, the 'query' for the QueryTable actually refers to a table (including the SQL database) and has its .CommandType set to a table. If we debug the code to determine the command (as below), we can see this.
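
For example (a sketch only – the exact values returned will depend on your connection and table), printing the current settings to the Immediate window shows the command type and the fully qualified table reference;

Debug.Print li.QueryTable.CommandType 'e.g. xlCmdTable when the data was added as a table

Debug.Print li.QueryTable.CommandText 'e.g. the table reference, including the database name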

After we change the connection, we could set the new table to use the employee table (without the database name) with something simple like;

li.QueryTable.CommandText = "dbo.DimEmployee"

li.Refresh

Or, we could change the command type to an SQL command and provide its text. The following would suffice for that;

li.QueryTable.CommandType = xlCmdSql

li.QueryTable.CommandText = "select * from dbo.DimEmployee"

li.Refresh

That's about it – we can change the table's connection and its query, and have these changes reflected in our table.

The Wisdom of the (Tableau) Crowd?

Ok … I’ll admit it … Tableau has some great features.

One of the standout features of the new release (8.2) is the new data access method – the ability to model data by dropping sheets from a file and manipulating them visually as a data source. The benefit of this is that it removes the manual connection work that is typically associated with end-user reporting tools and combining data. Prior versions could do this to some extent by linking data sources; however, the new version allows it to happen in what could effectively be called a model designer. Another slightly different approach (and one which is way cool) was developed by Targit (they used a folder share as a drop directory for files, which could then be consumed by the application automatically). There's an obvious benefit to the tight integration of data and its downstream analysis – especially for business users who are not so concerned with technical model design but with the analysis of data and using it in the tool. Removing layers of integration delivers on the promise of traditional pitches that use catch cries of

spend less time working on data and more time understanding it

To be fair though, there is an assumption (IMO) that the data being used is pretty clean to begin with … all we are really doing is defining aggregation paths and streamlining the import of data. A common objection to this type of activity (where users manage and consume their own data) is that the production of data is not verified by IT or officially sanctioned (most noticeably, it is raised from an IT-centric standpoint). There could be lots written about that discussion point but for now we will just accept that it is going to happen anyway and leave it.

Another great feature is the idea of story boards – where you can trace (for want of a better word) the thought patterns used in an investigation and present a complete view of the analysis – I've heard it dubbed a PowerPoint killer (and there may be some truth in that where PowerPoint is used with a heavy reliance on data integration). Again, the concept is not new (Yellowfin has had it enabled for at least one version now).

But I digress and should move on to crowds … or more importantly the vendor conference. I was staggered to discover that the Tableau conference in Seattle was sold out early with 5,200 attendees (yes, that's right, 5,200). That's a remarkable number when you consider that the PASS Summit on SQL Server (incorporating the full MS stack) gets (an advertised more than) 5,000 attendees (I am assuming that it's just over the 5,000, so somewhere in the range of 5,000 – 5,100). Since this includes both the database administration and the BI arm, those dedicated to BI would fall far short of the 5,000. In contrast, the PASS BA conference, which specifically targets Business Analytics, gets 750+ attendees (again, I'll assume we're talking 750-760 attendees). Given the ubiquity of the MS platform as a tool, the similar number of attendees at the Tableau conference is astonishing.

 

Brisbane SQL Saturday

SQL Saturday in Brisbane is Back!!

I'm happy to announce that we are concreting arrangements for a Brisbane SQL Saturday .. it's simply the best way to get SQL Server and BI training from some of Australia's (and perhaps some international) best trainers and consultants. Of course, the great thing is, IT'S FREE and run by SQL community members. So tell your boss, family and friends.

We've scoped out a great venue at Queensland University of Technology and are in the process of finalising some paperwork now – so leave the 20th September open for some great SQL and BI goodness.

Like to know more? Keep an eye on the SQL Saturday site for the event, or ping me and I'll add you to the distribution list.

Update : 21-07-2013

Ok …. we could not host the event on the SQL Saturday site – will that stop us?

NO, we've created our own website to manage the event. Check out www.sqlgroup.com.au/home.aspx to find out more.

Same great content, same great day, same venue – Just another name