Cleaning Your Database with BIML

Blow it away and start again 🙂

When we are working in a dynamic environment, we need to push through changes quickly, and a prerequisite for this is being able to start from scratch: starting with a blank database and deploying the build from a clean state.  If you can’t drop and recreate the database, then the next best option is to clean it out: delete all the objects and recreate them as part of your build.

You can scrub tables in several ways.  The one that I propose here is a sequential (and dynamic) method that is relatively straightforward (and of course we can build it in BIML).  The template for the pattern is given in the pseudo code below:

Assume there are tables to delete
While there are tables to delete then
  table_list <- get the list of tables
  foreach table in table_list, try to delete the table
  determine if there are (still) tables to delete
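
As a rough illustration of this pattern outside SSIS, here is a minimal Python sketch using the standard-library sqlite3 module.  The table names (parent, child), the helper drop_all_tables and the max_passes guard are my own assumptions for the example and are not part of the package described in this post.  The guard simply stops the loop if the tables never go away, which the plain pseudo code would not do.

```python
import sqlite3


def list_tables(conn):
    """Return the names of the user tables currently in the database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
    ).fetchall()
    return [row[0] for row in rows]


def drop_all_tables(conn, max_passes=10):
    """Repeatedly try to drop every table until none remain.

    Returns the number of passes made.  max_passes is a hypothetical
    safety guard so the loop cannot spin forever.
    """
    passes = 0
    while list_tables(conn) and passes < max_passes:
        passes += 1
        for name in list_tables(conn):
            try:
                conn.execute(f'DROP TABLE "{name}"')
            except sqlite3.Error:
                # A drop can fail because of dependencies between tables;
                # ignore the failure and retry on the next pass.
                pass
    return passes


# Demo: an in-memory database where the drop order matters.
conn = sqlite3.connect(":memory:", isolation_level=None)  # autocommit mode
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE parent (id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE child (id INTEGER PRIMARY KEY, "
    "parent_id INTEGER NOT NULL REFERENCES parent(id))"
)
conn.execute("INSERT INTO parent (id) VALUES (1)")
conn.execute("INSERT INTO child (id, parent_id) VALUES (1, 1)")

passes = drop_all_tables(conn)
remaining = list_tables(conn)
print(passes, remaining)
```

In this demo the attempt to drop parent should be rejected while child rows still reference it, so the loop needs a further pass once child is gone.  That is the same reason the SSIS package recounts the tables and goes around again.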

In SSIS, this looks like the following pattern, and I have included a list of the variables that I used to run the package.  There are some more advanced features of SSIS being used here, which I will call out:

  1. We use TABLE_COUNT (initial value 1) to control the continued execution of the package (that is, the WHILE_TABLES_EXIST container).  This container runs while the TABLE_COUNT value is greater than or equal to 1 (that is, while my database still has tables).
  2. TABLES_IN_DB is an object variable that holds a list of table names; it is the result of the SQL command GET_TABLE_LIST.  The purpose here is to query the database metadata in order to determine the list of names.
  3. The iterator FOREACH_RECORD_IN_TABLE_LIST enumerates over each record in TABLES_IN_DB (assigning the table name to the variable TABLE_NAME).  Within that container, we generate the SQL to execute (ie the drop command) in the GENERATE_SQL expression by assigning it to the SQL variable.  Then we execute that variable via an Execute SQL task.  Since there may be dependencies between tables (so the execution order matters), we do not want the package to fail if a drop command does not work, and the task is configured accordingly (FailPackageOnFailure is false).
  4. Finally, after the first batch of executions has run (and hopefully all the tables are deleted), we recount the tables in the database, storing the value in the TABLE_COUNT variable.  Control is then passed back to WHILE_TABLES_EXIST to see if there are still tables in the database and determine whether the process should begin again.

image

In BIML

This process is very generic and can be applied to any database.  There’s also no need for BIML Script in the code (since we do not generate tasks that depend on specific data).  Here is the full snippet:


<Biml xmlns="http://schemas.varigence.com/biml.xsd">

<Connections>
    <OleDbConnection Name="FM_MYSQL" ConnectionString="provider=MSDataShape;server=foodmart_mysql;uid=foodmart_admin;pwd=whats_my_password?" DelayValidation="true"  />
</Connections>

    <Packages>
        <Package Name="01-Clean_MySQL" ProtectionLevel="EncryptAllWithUserKey" ConstraintMode="Linear">

            <!-- these variables are created in the packages scope -->
            <Variables>
                <Variable Name="TABLES_IN_DB" DataType="Object">
                </Variable>
                <Variable Name="TABLE_COUNT" DataType="Int32">1</Variable>
            </Variables>

            <Tasks>

                <!-- the first container (while records exist); note the evaluation expression -->
                <ForLoop Name="WHILE TABLES EXIST" ConstraintMode="Linear">
                    <LoopTestExpression>@TABLE_COUNT>=1</LoopTestExpression>

                    <!-- tasks within the container are contained in a tasks tag -->
                    <Tasks>

                        <!-- get the list of table (names) .. note the record set is assigned to the variable TABLES_IN_DB -->
                        <ExecuteSQL Name="GET_TABLE_LIST" ResultSet="Full" ConnectionName="FM_MYSQL">
                            <DirectInput>SELECT TABLE_NAME FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA='foodmart';</DirectInput>
                            <Results>
                                <Result Name="0" VariableName="User.TABLES_IN_DB" />
                            </Results>
                        </ExecuteSQL>

                        <!-- Iterate over each record in the TABLES_IN_DB variable
                            Note how we assign the current record to the TABLE_NAME variable -->
                        <ForEachAdoLoop Name="FOREACH_RECORD_IN_TABLE_LIST" SourceVariableName="User.TABLES_IN_DB" ConstraintMode="Linear">
                            <Variables>
                                <Variable Name="TABLE_NAME" DataType="String">
                                </Variable>
                                <Variable Name="SQL" DataType="String">
                                </Variable>
                            </Variables>
                            <VariableMappings>
                                <VariableMapping Name="0" VariableName="User.TABLE_NAME" />
                            </VariableMappings>

                            <Tasks>

                                <!-- Generate the SQL Statement -->
                                <Expression Name="GENERATE SQL" Expression="@[User::SQL]=&quot;DROP TABLE foodmart.&quot; + @[User::TABLE_NAME]"/>
                                <!-- Execute the SQL Statement (based on the user variable (SQL)) -->

                                <ExecuteSQL Name="DELETE TABLE" ConnectionName="FM_MYSQL" DelayValidation="true" FailPackageOnFailure="false">
                                    <VariableInput VariableName="User.SQL" />
                                </ExecuteSQL>

                            </Tasks>

                        </ForEachAdoLoop>

                        <!-- Recount Tables and store in the variable (which is passed back to the parent container)  -->
                        <ExecuteSQL Name="GET_TABLES_COUNT" ResultSet="SingleRow" ConnectionName="FM_MYSQL">
                            <DirectInput>SELECT COUNT(*) FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA='foodmart';</DirectInput>
                            <Results>
                                <Result Name="0" VariableName="User.TABLE_COUNT" />
                            </Results>
                        </ExecuteSQL>

                    </Tasks>
                </ForLoop>
            </Tasks>
        </Package>
    </Packages>

</Biml>

You might notice that I am connecting to a MySQL ODBC source (through the MSDataShape OLE DB provider).  Other features are commented in the code.

As stated, there is no reliance on BIML Script here, so you may ask ‘Why use BIML?’.  The reason for this is that the snippet includes samples of a few controls and how they are configured in different circumstances.

Barry Devlin & the Business un-Intelligence Course

I really enjoy Barry Devlin’s approach to Business Intelligence.  He is an under-appreciated name in the industry (he was one of the originators of ‘Data Warehouse’ way back in the mid-’80s), so when he has something to say, it’s worth a listen – especially when he challenges ‘modern’ perceptions of Data Warehousing and the BI Industry.

And this brings me to the purpose of this post.  For those in the Asia Pacific region, Barry is presenting a course on Big Data Analytics & Reinventing Business Intelligence in Singapore July 6-8th, 2015 (http://bit.ly/1E3I90e).

So, what’s to be expected from the course? You can get the full agenda here (http://bit.ly/1c7hHIv).  Briefly, the course covers the complete range of topics on how to design a modern information/data environment, ranging from the fundamental principles of how all types of data should be absorbed and used within the organisation to an overview/assessment of tool choices and the implications of new neurobiological thinking about how decisions are really made.

Having read Barry’s latest book “Business unIntelligence – Insight and Innovation Beyond Analytics and Big Data” (http://bit.ly/BunI_Book), I suggest that this is a perfect opportunity to investigate the strategic direction of BI (from an organizational or consulting perspective) and address the implementation of BI.

Oh, and by the way – I’m told that foreign attendees get special discounts, all you have to do is ask 🙂

SSIS – Creating an OLEDB Connection to an ODBC Source and SSIS parameters

This post was originally going to be about using the advanced features of an SSIS OLEDB command to set parameters.  I’ve been using Oracle sources lately and the OLEDB command did not recognise the parameters for my query.

For example, in my work environment, the statement (below) just did not recognise the key_field parameter, so I had to add it manually through the component’s advanced editor.

Update some_table 
set some_field = new_value 
where key_field = ?

We’ll look into the advanced features a little later.  However, when I tried to mimic the behaviour (using a MySQL connection), the only issue I had was creating an OLEDB connection in SSIS.  I have ODBC connections on my machine; however (naturally enough, given the component’s name), the OLEDB command requires an OLEDB connection.

So, to create an OLEDB connection (and have it reference an ODBC source), simply use the Native OLE DB\MSDataShape provider and use the ODBC name as the server (as in the image below).  The UserId, Password and catalogue/schema can all be set through the connection manager’s interface.

image

 

Parameter Management under OLEDB

I can now use that OLEDB connection to execute an OLEDB Command.  As discussed above, my original intent was to show how to add a parameter when SSIS did not recognise the parameters.  In my situation, this was against an Oracle source.  In this demo, I’ll use a simple command

update adventure_works_2012.customer 
set territory_id = ? 
where account_number = ? 

and the column mappings tab of the OLEDB command allows me to map fields in my source (input data flow) to my command (ie parameters in the destination).  We can see this in the mapping screen shot below.

image

You will notice that the names are generic (Param_0 …) and refer to the placement of the ? symbols in the source query.  That’s not really user friendly, and if I want to give these meaningful names, you would think that I could set them in the Input and Output Properties (using the Name property as below).  However, this is not the way the UI works, and the name defaults to the placement of the ? in the query.  Long story short, you can’t control the names of the parameters.

image

However, this part of the advanced UI does give us a little indication of how to add parameters.  If no parameters are recognised, we can add them through the ‘Add Column’ button on the External Columns node of the Inputs and Outputs tab.  This is what I had to do with my Oracle command.
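
To show why the parameters are identified by their position and not by a name, here is a small Python sketch using the standard-library sqlite3 module, which also uses ? as a positional marker.  The customer table and its values are made up for the example.  It is only an analogy for the binding order and says nothing about how the SSIS component works internally.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer ("
    "account_number TEXT PRIMARY KEY, territory_id INTEGER)"
)
conn.execute("INSERT INTO customer VALUES ('AW001', 1)")

# Values are bound in order: the first value goes to the first ? and the
# second value to the second ?, much like Param_0 and Param_1 in the
# column mappings.
params = (5, "AW001")  # (territory_id, account_number)
conn.execute(
    "UPDATE customer SET territory_id = ? WHERE account_number = ?",
    params,
)
conn.commit()

new_territory = conn.execute(
    "SELECT territory_id FROM customer WHERE account_number = ?",
    ("AW001",),
).fetchone()[0]
print(new_territory)
```

Swapping the two values in params would bind them to the wrong markers, which is why the order of the ? symbols in the query matters more than any name you might want to give them.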