To Automate or Not? – The Pros and Cons of Data Warehouse Automation


I have been thinking about data warehouse automation for at least two years now. That’s just long enough for me to see the pros and cons, examine the state of affairs, consider the consequences of both camps and, lastly, consider my reaction to those consequences. I’ve come to realise that the last point (my reaction) plays on my mind because the subject is filled more with emotion than with rational decision-making. What I hope to achieve with this post is to materialise some of my thoughts on the matter and present a few ideas that may be useful to others considering this technology.

Before we get into the grit of the discussion, it is probably worth noting that I consider the data warehouse in the more traditional sense. That is, a relational engine that stores and delivers agreed data. The additional possibilities arising from the big data ecosystem – the concepts of a data lake (or swamp) and schema on read, where data doesn’t necessarily have to conform to a model on load – would add a layer of complexity to this discussion which we just don’t need. We could also add data virtualisation, but that would only muddy the water in what can be simplified to a decision to automate or not.

In my traditional view of this world, the data warehouse needs to deliver two things. Firstly, it needs to store an historical record of data (as it existed) and secondly, it needs to present information to the consumer in the manner that they require. Those two points might seem like real no-brainers, but I think it’s better to call them out now because they create some flow-on implications:

  • the historical record of data may not be what the consumer currently wants
  • the ‘delivery’ requirements of the system will change
  • the system must be capable of accommodating changing requirements

Data Warehouse Development

In traditional development, the system is designed at a conceptual level, a data modeller designs and builds the relational framework (the schema) and then ETL developers populate it. The last stage (ETL development) demands the largest effort on a project and has been documented to consume more than 60% of total effort.

Now, let’s examine the first two steps. I suggest that the conceptual design is driven by understood practices#1 – to the extent of a cookie-cutter template. Regardless of the methodology used, most employ a series of processes and silos that do not change from implementation to implementation. Data is extracted into a staging area, transformed into a warehouse, deployed to a presentation server/layer and delivered via a user interface (reporting/analytic layer). We expect the modeller to use a tool to manage and maintain the modelling process (e.g. ERwin, DeZign) and would think it unprofessional and prone to error if schema components were coded by hand.

In contrast, when it comes to ETL development, most of the process is manual. Sure, there is a tool (e.g. SSIS, Kettle, Talend) which acts as an IDE; however, the development process is still labour intensive. The developer links together components to perform some action or task, and each task/package must be manually developed and configured (for example, you would write 10 tasks to extract 10 tables into a staging area).

Is the process of ETL development using an IDE similar to that of the modeller (and can it therefore be argued to be automated)? I think yes and no. Yes, because there is a layer of abstraction between the development process and the output (today we would think it insane to write the code by hand in a text editor), and no, because the abstraction layer is not high enough (meaning that the work is still very task orientated).
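To make that abstraction argument concrete, here is a minimal sketch (in Python, purely for illustration) of the difference between hand-building one task per table and deriving every task from metadata. The table names, schema names and template are hypothetical, not taken from any tool.

    # Hypothetical sketch: per-table ETL tasks derived from metadata instead of
    # being hand-built one at a time. Table and schema names are made up.

    STAGING_TABLES = ["Customer", "Product", "SalesOrder"]  # in practice, read from a metadata store

    def build_extract_sql(table: str) -> str:
        """Generate the staging extract statement for a single source table."""
        return f"INSERT INTO staging.{table} SELECT * FROM source.{table};"

    # The manual approach writes (and maintains) one package per table by hand.
    # The metadata-driven approach derives each task from the same template, so
    # adding an eleventh table is a metadata change, not a new development effort.
    for table in STAGING_TABLES:
        print(build_extract_sql(table))

The point is not the code itself but where the effort sits: in the manual world it is spent per task, in the generated world it is spent once on the template and the metadata.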

Automation

Enter the realm of automation technologies. To summarise, these tools (a rough sketch of the approach follows the list):

  1. Generally use a model (and data) driven approach to define the conceptual layer required. For example, you point the tool at a source system to reverse engineer the underlying model, which you can later adjust (for example, adding other entities, relationships etc.).
  2. Define a target schema based on the prior step (and generate the DDL to create it).
  3. Map source systems to the target schema and generate the code and processes to perform the load.
  4. Manage the execution of the code/processes derived in step 3.
  5. Provide some presentation layer.
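As a rough sketch of steps 1 to 3 (and not a description of how any particular product works), the core of the approach is a generator driven by reverse-engineered metadata. The entity definitions, type mapping and naming conventions below are assumptions for illustration only.

    # Hypothetical, simplified view of a metadata-driven generator (steps 1-3 above).
    # A real tool would read this metadata from the source system's catalogue.

    source_model = {
        "Customer": {"CustomerID": "INT", "Name": "VARCHAR(100)", "City": "VARCHAR(50)"},
        "Product":  {"ProductID": "INT", "Title": "VARCHAR(100)", "Price": "DECIMAL(10,2)"},
    }

    def generate_ddl(entity: str, columns: dict) -> str:
        """Step 2: derive the target (staging) schema from the reverse-engineered model."""
        cols = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns.items())
        return f"CREATE TABLE stage_{entity} (\n  {cols}\n);"

    def generate_load(entity: str, columns: dict) -> str:
        """Step 3: map source to target and generate the load process."""
        col_list = ", ".join(columns)
        return f"INSERT INTO stage_{entity} ({col_list}) SELECT {col_list} FROM {entity};"

    for entity, columns in source_model.items():
        print(generate_ddl(entity, columns))
        print(generate_load(entity, columns))

Execution management (step 4) and the presentation layer (step 5) then sit on top of the same metadata.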

The Arguments

So why are automation products considered by purists and coders to be second-class citizens in the world of data warehouse development? There are a few key arguments that I’ve come across:

  • The tool does not produce the schema that we want, or we can’t change it in the way we want. This contention presents itself in several ways.
    • The modeller suggests that there is a correct way to build the schema and that the tool should conform to it (or be capable of alteration to the correct schema). I’ve italicised the word correct here because that can really be a matter of conjecture. Consider storing related data in a separate table which is directly related to the first (i.e. a 1:1 relationship). You can convincingly argue both sides (that there should be one table or two). I think the key point of this argument is that the tool is not flexible enough for schema alteration. Sometimes this is true; sometimes it’s just nonsense based on hearsay or a single product evaluation.
  • Another point is that the data is not retained correctly (for example, history is not recorded and is incapable of change). Most people who present this have not completed an in-depth analysis of the products available. Just like other process-based software, not every product is created equal, and it would be foolish to judge the whole class of products on a limited scope (I am sure there’s a product that suits your needs). What is interesting about this point is how much bias is targeted at automation software. The same scenarios can readily be found in traditional implementations, where a change in requirements (or, dare I suggest, an oversight in design) renders the existing schema redundant. Change and rework are inevitable in that scenario, and it’s surprising how often that circumstance is excluded from our evaluation process.
  • The tool is a black box and I can write better transformations. This is perhaps the most common discussion point I see amongst developers. They argue that the tool produces generic code which could be improved had it been created manually (by them). Perhaps this is true, and there are some instances where hand-written code outperforms and is more elegant than generated code. However, the converse may also be true – that is, manual development may be more prone to poor performance. How can this be? Well, for a start, you assume that the developer can deliver the results they claim (and that they actually know what they’re talking about). This is not always the case; some developers obtain their positions through bluffs on their resume which are all too often found out too late.
    • When presented with this argument, there is a natural assumption that the solution requires elegant and well-performing code. Sometimes this is true and sometimes it’s not. Performance improvements that look good on paper may have absolutely no benefit to the overall solution even if they deliver a tangible result. For example, a generated process taking 25 minutes compared to a manually created process taking 20 minutes may show an improvement; however, is there really an improvement if the entire process completes within the processing window either way?
  • It’s too expensive. When an automation tool is evaluated, its price is naturally considered as part of the evaluation equation. Let’s say the typical range of prices is between $50K and $80K. That is evaluated against a consultant for six to ten weeks, or a full-time employee for nine months. More than enough time to build it manually, right? I’m not so sure. ETL projects are notorious for overruns, and while we expect our projects to go smoothly, the chances are that there will be overruns, and these are rarely factored into the cost equation (a rough back-of-envelope comparison follows this list).
    • I’d also like to consider the situation where the developer(s) leave. In this case, there is often more unplanned rework through either a loss of knowledge or a different consultant’s approach.
  • Further to the “it’s too expensive” line, I would also like to introduce the concept of the time value of data (whoops, information). I have never seen this addressed in a cost-benefit analysis, but to my mind it must play a role. The time value of data relates to the increase in productivity, revenue or output gained by having a solution sooner (let’s assume that an automated tool can deliver this). What does it mean in monetary terms (to the business) to have data now (and not in six or nine months)? Can they utilise the data warehouse for competitive advantage to increase profitability?
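To put some (admittedly assumed) numbers around the cost argument above: the tool price range and the build durations come from the discussion, but the day rate, salary and overrun factor below are placeholder figures of my own, so treat the result as a thought experiment rather than a business case.

    # Back-of-envelope comparison of the options discussed above.
    # Only the tool price range and build durations come from the post;
    # the day rate, salary and overrun factor are assumed placeholders.

    tool_cost = 65_000            # midpoint of the $50K-$80K range
    consultant_day_rate = 1_200   # assumed
    consultant_weeks = 8          # midpoint of six to ten weeks
    fte_annual_cost = 120_000     # assumed fully-loaded cost
    overrun_factor = 1.5          # assumed: ETL projects are notorious for overruns

    consultant_cost = consultant_day_rate * 5 * consultant_weeks * overrun_factor
    fte_cost = fte_annual_cost * (9 / 12) * overrun_factor

    print(f"Automation tool:    ${tool_cost:,.0f}")
    print(f"Consultant build:   ${consultant_cost:,.0f}")
    print(f"Nine-month FTE:     ${fte_cost:,.0f}")
    # Neither manual figure prices the time value of data - the months during
    # which the business has no warehouse at all.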

The Plusses – Viva Automation

What can automation deliver that has not been addressed above? On the pro side of automation software, we could suggest:

  • A consistent approach. Since the automation software develops everything for us, its output is delivered in a consistent manner. Of course this has good and bad points, but the key is consistency. Developer peculiarities should not exist, since the tools operate through metadata (and generation). Theoretically, one developer is equivalent to another, so there may be no need for sliding pay scales based on experience or special skills.
  • Since the operator manages the software, one might suggest that developers are not needed (and neither are their specialist skills). This may make resourcing a lot simpler and remove the barriers between IT and the business (provided we accept those barriers exist). There is the potential for a business user trained in the software to build and manage the data warehouse, rather than a specialist IT resource. This type of transformation has occurred with accounting software that no longer requires an accountant to operate it (perhaps only a data entry clerk).

The Emotion

Every discussion I have had around the use of automation tools involves some emotion (in some cases I’d go so far as to say the conversation is dictated by emotion). To get straight to the point, the idea that specialist developers can be replaced by software is so confrontational that we (myself included) simply refuse to accept it and so live in a state of denial. The denial says, “I’ve been doing this for XX years, I have specialist skills and no machine can take my place”.

This has happened in so many industries, and it is such a predictable response, that I think all we need do is identify its (potential) presence.

Finally

I have covered a lot of points in this post. If you think I’ve missed something, want to add your own point or expand on something I’ve mentioned, then just comment or drop me a line. As usual, comments are open and welcome.

I’ll also say that all comments will be published (provided they are not spam, contribute to the discussion and are vendor agnostic). If you would like your comment inserted in a particular section so that the flow of the article remains intact, just let me know and I’ll add it.

The floor is now open :).

****

#1 As a side note, it’s time to revise these practices – perhaps more in a later post.

6 thoughts on “To Automate or Not? – The Pros and Cons of Data Warehouse Automation”

  1. Many different approaches towards Data (Warehouse) Automation exist today. In my view, they all beat manual coding.

    Automated coding approaches like BIML (requires coding knowledge, SQL Server platform), tool-based approaches like TimeXtender (SQL Server platform), Wherescape (focus on Teradata) and BIReady (multi-platform, black-box approach), Data Virtualization techniques (linking DW and Big Data), some Data Warehouse appliances (unproven, expensive), etc.

    They all make similar promises: to cut back time to market and allow for more control with less technical skill.

    I agree with both statements in the article you wrote, and would like to offer a question:

    1) Yes, tool-based automation has proven superior time to market (by a factor), and time to market is often key.

    2) Yes – hand coding will be, by definition, better tailored to the specific environment (just as a tailored application is better fitted than a standard one like SAP), hence hand coding still has some merit.

    Question:
    Would a combined approach, where you generate code and then manually alter parts of it, not be a best-of-both-worlds alternative?

    • Hi Dirk,

      I’d like to say yes … I really would – it’s the tech bias (I think), that little nagging man that says “challenge”, and I wonder: do I see the need because I want to retain control, or because I actually need a performance improvement? For the genuine performance challenges, there are always options.

      Are we assuming an injection point that’s leveraged through manual code? Some tools will allow that (to an extent), and that may be enough, so it becomes a practice of working the tool.

      What I suspect, though, is that an injection point may introduce a bias of its own – you’d look at the set of problems and say ‘now where’s my hammer – I have some boards and nails’. For example, I once worked with a developer who created SSIS packages that would just execute a series of SQL tasks, and that was their ETL – not at all optimised, and the problem was that that’s how they saw the tool (as a product for linking steps).
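      As a purely hypothetical sketch of what such an injection point could look like (not any particular tool’s mechanism – the entity names and the override procedure are made up): the generator produces a default load, and hand-written code replaces it only where it is genuinely needed.

      # Hypothetical sketch of an injection point: generated defaults with a
      # manual override hook. Entity names and the stored procedure are made up.
      def default_load(entity: str) -> str:
          return f"INSERT INTO stage_{entity} SELECT * FROM {entity};"

      # Manual code is registered only where the generated version falls short.
      manual_overrides = {
          "SalesOrder": "EXEC dbo.usp_LoadSalesOrder_Optimised;",  # assumed hand-tuned proc
      }

      def build_load(entity: str) -> str:
          """Prefer injected manual code; otherwise fall back to the generated default."""
          return manual_overrides.get(entity, default_load(entity))

      for entity in ["Customer", "Product", "SalesOrder"]:
          print(build_load(entity))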

  2. A very interesting article, thank you for your thoughts.
    While trying to automate first a 3NF data warehouse and later on (with more success) a Data Vault warehouse, we had the discussion whether to tag every attribute of the source or simply make the ‘modelling attribute after attribute’ approach easier. We decided on the latter.
    In that way we can build a model and think about it while it emerges. To me that is far easier than correcting a full-blown model. And as a data vault model has more than twice as many entities as the 3NF version of it, it is a time-saving approach.
    Building versions 2, 3, 4 and 5 of the model is easier as well, because with an approach like this we can evolve, grow and understand more while modelling.
    Out of the model we then generate the code for filling it, automatically and in one step.
    With the arguments in your article I feel reassured in our thoughts on automation – thank you.

    • Hi Michael,

      It sounds like you’re a bit of a convert! I also like the incremental/agile build approach – it delivers results faster to those that need them. This is something we desperately need in our projects.

  3. Pingback: A Look at BIReady – Data Warehouse Automation | Paul te Braak
