I have been thinking about data warehouse automation for at least two years now. That’s just long enough for me to see the pros and cons, examine the state of affairs, consider the consequences of both camps and, lastly, consider my reaction to those consequences. I’ve come to realise that the last point (my reaction) plays on my mind because the subject is filled more with emotion than rational decision-making. What I hope to achieve with this post is to materialise some of my thoughts on the matter and present a few ideas that may be considered by others who are weighing up this technology.
Before we get into the grit of the discussion, it is probably worth noting that I consider the data warehouse in the more traditional sense. That is, a relational engine that stores and delivers agreed data. The additional possibilities arising from the big data ecosystem, the concepts of a data lake or swamp, and schema on read (where data doesn’t necessarily have to conform to a model on load) would add a layer of complexity to this discussion which we just don’t need. We could also add data virtualisation, but that would further muddy the waters in what can be simplified to a decision to automate or not.
In my traditional view of this world, the data warehouse needs to deliver two things. Firstly, it needs to store a historical record of data (as it existed) and secondly, it needs to present information to the consumer in the manner that they require. Those two points might seem like real no-brainers, but I thought it better to call them out now because they create some flow-on implications, which are:
- the historical record of data may not be what the consumer currently wants
- the ‘delivery’ requirements of the system will change
- the system must be capable of accommodating changing requirements
Data Warehouse Development
In traditional development, the system is designed at a conceptual level, a data modeller designs and builds the relational framework (the schema) and then ETL developers populate it. The last stage (ETL development) requires the largest effort on a project and has been documented to consume more than 60% of total effort.
Now, let’s examine the first two steps. I suggest that the conceptual design is driven by well-understood practices – to the extent of a cookie-cutter template. Regardless of the methodology used, most employ a series of processes and silos that do not change from implementation to implementation. Data is extracted into a staging area, transformed into a warehouse, deployed to a presentation server/layer and delivered via a user interface (reporting/analytic layer). We expect the modeller to use a tool to manage and maintain the modelling process (eg ERwin, DeZign) and would think it unprofessional and prone to error if components were hand coded.
In contrast, when it comes to ETL development, most processes are manual. Sure, there’s a tool that is used (eg SSIS, Kettle, Talend) which acts as an IDE; however, the development process is still labour intensive. The developer links together components to perform some action or task, and each task/package must be manually developed and configured (for example, you would write 10 tasks to extract 10 tables into a staging area).
Is the process of ETL development using an IDE similar to that of the modeller (and therefore arguably automated)? I think yes and no. Yes, because there is a layer of abstraction between the development process and the output (today we would think it insane to manually write code in a text editor), and no, because the abstraction layer is not high enough (meaning that it is still very task oriented).
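To make the abstraction argument concrete, here is a minimal sketch of the difference. Hand-built ETL means one task per table; a higher abstraction treats the table list as metadata and derives every extract from it. The table names, schema names and SQL shape are all illustrative assumptions, not taken from any particular tool.

```python
# Hypothetical metadata: the tables to stage (names are illustrative).
TABLES = ["customer", "orders", "product"]

def build_extract_sql(table, staging_schema="staging"):
    """Generate the extract statement for one table from metadata."""
    return f"INSERT INTO {staging_schema}.{table} SELECT * FROM source.{table}"

# One loop over metadata replaces N hand-configured tasks.
extract_tasks = [build_extract_sql(t) for t in TABLES]
for sql in extract_tasks:
    print(sql)
```

Adding an eleventh table now means adding one entry to the metadata list, not building and configuring an eleventh task.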
Enter the realm of automation technologies. To summarise these tools, they:
- Generally use a model (and data) driven approach to define the conceptual layer required. For example, you point the tool at a source system to reverse engineer the underlying model, which you can later adjust (for example, adding other entities, relationships etc).
- Define a target schema based on the prior step (and generate the DDL to create that schema).
- Map source systems to the target schema and generate the code and processes to do this.
- Manage the execution of the code/processes derived in step 3.
- Provide some presentation layer.
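The first four steps above can be sketched in toy form: a reverse-engineered model drives both the DDL generation and the source-to-target mapping code. The model structure, table and column names are assumptions for illustration only; real tools hold far richer metadata.

```python
# Hypothetical model captured from a source system (step 1).
MODEL = {
    "dim_customer": {
        "columns": {"customer_id": "INT", "name": "VARCHAR(100)"},
        "source": "crm.customers",
    },
}

def generate_ddl(model):
    """Step 2: emit CREATE TABLE statements from the model."""
    ddl = []
    for table, spec in model.items():
        cols = ", ".join(f"{c} {t}" for c, t in spec["columns"].items())
        ddl.append(f"CREATE TABLE {table} ({cols});")
    return ddl

def generate_mappings(model):
    """Step 3: emit source-to-target load statements."""
    return [
        f"INSERT INTO {table} SELECT {', '.join(spec['columns'])} FROM {spec['source']};"
        for table, spec in model.items()
    ]

print(generate_ddl(MODEL))
print(generate_mappings(MODEL))
```

The point is not the (deliberately trivial) SQL, but that schema and ETL both fall out of one metadata definition, which is what the execution manager in step 4 then orchestrates.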
So why are automation products considered by purists and coders as second-rate citizens in the world of data warehouse development? There are a few key arguments that I’ve come across:
The tool does not produce the schema that we want, or we can’t change it in the way we want. This contention presents itself in several ways.
- The modeller suggests that there is a *correct* way to build the schema and the tool should conform to that (or be capable of alteration to the correct schema). I’ve italicised the word *correct* here because what is correct can really be a matter of conjecture. Consider storing related data in a separate table which is directly related to the first (ie a 1:1 relationship). You can convincingly argue both sides (there should be one table or two). I think the key point of this argument is that the tool is not flexible enough for schema alteration. Sometimes this is true; sometimes it’s just nonsense based on hearsay or a single product evaluation.
- Another point is that the data is not retained correctly (for example, history is not recorded and is incapable of change). Most people who present this have not completed an in-depth analysis of the products available. Just like other process-based software, each product is not created equal, and it would be foolish to judge the class of products on a limited scope (I am sure there’s a product that suits your needs). What is interesting about this point is how much bias is targeted at automation software. The same scenarios can readily be found in traditional implementations, where changed requirements (or dare I suggest an oversight in design) render the existing schema redundant. Change and rework are inevitable in this scenario, and it’s surprising how often that circumstance is excluded from our evaluation process.
The tool is a black box and I can write better transformations. This is perhaps the most common discussion point I see amongst developers. They argue that the tool produces generic code which could be improved had it been created manually (by them). Perhaps this is true, and there are some instances where code could be written that outperforms and is more elegant than generated code. However, the converse may also be true – that is, manual development may be more prone to poor performance. How can this be? Well, for a start, you assume that the developer can deliver the results they claim (and that they actually know what they’re talking about). This is not always the case, since some developers obtain their positions through bluffs on their resume which are all too often found out too late.
- When presented with this argument, there is a natural assumption that the problem requires an elegant and well-performing solution. Sometimes this is true and sometimes it’s not. Performance improvements that look good on paper may have absolutely no benefit to the overall solution even if they deliver a tangible result. For example, a generated process taking 25 minutes compared to a manually created process taking 20 minutes may show an improvement; however, is there really an improvement if the entire process completes within the processing window?
It’s too expensive. When an automation tool is evaluated, its price is naturally considered as part of the evaluation equation. Let’s say the typical range of prices is between $50K and $80K. That is evaluated against a consultant for six to ten weeks, or a full-time employee for nine months. More than enough time to build it manually, right? I’m not so sure. ETL projects are notorious for overruns, and while we expect our projects to go smoothly, the chances are that there will be overruns, and these are not factored into the cost equation.
- I’d also like to consider the situation where the developer(s) leave. In this situation, there is often more unplanned rework through either a lack of knowledge or a different consultant’s approach.
- Further considering the “it’s too expensive” line, I would also like to introduce the concept of the time value of data (whoops, information). I have never seen this addressed in a cost-benefit analysis, but to my mind it must play a role. The time value of data relates to the increase in productivity, revenue or output gained by having the solution sooner (let’s assume that an automated tool can deliver this). What does it mean in monetary terms (to the business) to have data now (and not in six/nine months)? Can they utilise the data warehouse for competitive advantage to increase profitability?
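The cost argument above can be put into a back-of-envelope sketch. The $50K–$80K tool price and the six-to-ten-week consultant engagement come from the text; the consultant rate, the overrun factor and the monthly benefit figure are assumptions invented purely to show how the time value of data changes the equation.

```python
tool_cost = 65_000            # midpoint of the $50K-$80K range from the text
consultant_weekly = 8_000     # assumed consultant rate (illustrative)
manual_weeks = 8              # midpoint of the six-to-ten-week estimate
overrun_factor = 1.5          # assumed: ETL projects are notorious for overruns

manual_cost = consultant_weekly * manual_weeks * overrun_factor

# Time value of data: an assumed monthly benefit of having the warehouse
# live, multiplied by the months saved by delivering sooner.
monthly_benefit = 10_000      # assumed (illustrative)
months_saved = 6              # assumed: live now vs. six months of manual build

manual_total = manual_cost + monthly_benefit * months_saved
print(f"tool: ${tool_cost:,.0f}  manual (with overrun + delay cost): ${manual_total:,.0f}")
```

With these (admittedly made-up) numbers the comparison is no longer $65K versus $64K of consulting; the overrun and the delayed benefit push the manual option well past the tool price, which is exactly the term the typical evaluation leaves out.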
The Plusses – Viva Automation
What can automation deliver that has not been addressed above? On the pro side of automation software, we could suggest:
A consistent approach. Since the automation software develops everything for us, its output is delivered in a consistent manner. Of course this has good and bad points, but the key is that there is a consistent approach. Developer peculiarities should not exist, since the tools operate through metadata (and generation). Theoretically, one developer is equivalent to another, so there may be no need for sliding pay scales based on experience or special skills.
- Since an operator manages the software, one might suggest that developers are not needed (and neither are their specialist skills). This may make resourcing a lot simpler and remove the barriers between IT and the business (provided we accept those barriers exist). There is the potential for a business user trained in the software to build and manage the data warehouse rather than a specialist IT resource. This type of transformation has occurred with accounting software, which no longer requires an accountant to operate it (perhaps only a data entry clerk).
Every discussion I have had around the use of automation tools involves some emotion (in some cases I’d go so far as to say that the conversation is dictated by emotion). To get straight to the point, the idea that specialist developers can be replaced by software is so confrontational that we (myself included) simply refuse to accept it and so live in a state of denial. The denial says: “I’ve been doing this for XX years, I have specialist skills and no machine can take my place”.
This has happened in so many industries and is such a predictable response that I think all we need do is identify its (potential) presence.
There have been lots of points in this post. If you think I’ve missed something, want to add your own point or expand on something I’ve mentioned, then just comment or drop me a line. As usual comments are open and welcome.
I’ll also say that all comments will be published (provided they are not spam, contribute to the discussion and are vendor agnostic). If you would like your comment inserted in a section so that the flow of the article remains intact, just let me know and I’ll add it.
The floor is now open. :)