Cleaning Your Database with BIML

Blow it away and start again :)

When we are working in a dynamic environment, we need to push through changes quickly, and a prerequisite for this is often starting from scratch: beginning with a blank database and deploying the build from a clean state.  If you can’t drop and recreate the database, then the next best option is to clean it out, that is, delete all the objects and recreate them as part of your build.

You can scrub tables in several ways.  The one I propose here is a sequential (and dynamic) method that is relatively straightforward (and, of course, we can look at the BIML for it).  The template for the pattern is given in the pseudo code below:

Assume there are tables to delete
While there are tables to delete then
  table_list <- get the list of tables
  foreach table in table_list, try to delete the table
  determine if there are (still) tables to delete
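
To make the pattern concrete, the statements issued on each pass look something like the following. This is only a sketch against the MySQL foodmart schema used later in this post, and the table name in the DROP is a placeholder:

-- get the list of tables still in the schema (this feeds the iterator)
SELECT TABLE_NAME FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'foodmart';

-- attempt to drop each table returned; failures are tolerated and retried on the next pass
DROP TABLE foodmart.some_table;

-- recount the remaining tables to decide whether another pass is needed
SELECT COUNT(*) FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'foodmart';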

In SSIS, this looks like the following pattern, and I have included a list of variables that I used to run the package.  There are some more advanced features of SSIS being used here which I will call out:

  1. We use TABLE_COUNT (initial value 1) to control the continued execution of the package (that is, the WHILE_TABLES_EXIST container).  This container runs while the TABLE_COUNT value is greater than or equal to 1 (that is, while my database still has tables).
  2. TABLES_IN_DB is an object variable that holds a list of table names; it is populated by the result of the SQL command GET_TABLE_LIST.  The purpose here is to query the database metadata in order to determine a list of table names.
  3. The iterator FOREACH_RECORD_IN_TABLE_LIST enumerates over each record in TABLES_IN_DB (assigning the table name to the variable TABLE_NAME).  Within that container, we generate the SQL to execute (ie the drop command) in the GENERATE_SQL expression by assigning it to the SQL variable.  Then we execute that variable via an Execute SQL task.  We do not want the task to fail if the command does not work (after all, there may be dependencies between tables that affect execution order), so the task is configured not to fail the package.
  4. Finally, after the first batch of executions has run (and hopefully all the tables have been deleted), we recount the tables in the database, storing the value in the TABLE_COUNT variable.  Control is then passed back to WHILE_TABLES_EXIST to check whether there are still tables in the database and determine whether the process should begin again.

[image: the SSIS control flow for the pattern, with the package variables used]

In BIML

This process is very generic and can be applied to any database.  There’s also no need for BIML Script in the code (since we do not rely on generating tasks that depend on data).  Here is the full snippet:


<Biml xmlns="http://schemas.varigence.com/biml.xsd">

<Connections>
    <OleDbConnection Name="FM_MYSQL" ConnectionString="provider=MSDataShape;server=foodmart_mysql;uid=foodmart_admin;pwd=whats_my_password?" DelayValidation="true"  />
</Connections>

    <Packages>
        <Package Name="01-Clean_MySQL" ProtectionLevel="EncryptAllWithUserKey" ConstraintMode="Linear">

            <!-- these variables are created in the packages scope -->
            <Variables>
                <Variable Name="TABLES_IN_DB" DataType="Object">
                </Variable>
                <Variable Name="TABLE_COUNT" DataType="Int32">1</Variable>
            </Variables>

            <Tasks>

                <!-- the first container (while records exist); note the evaluation expression -->
                <ForLoop Name="WHILE TABLES EXIST" ConstraintMode="Linear">
                    <LoopTestExpression>@TABLE_COUNT>=1</LoopTestExpression>

                    <!-- tasks within the container are contained in a tasks tag -->
                    <Tasks>

                        <!-- get the list of table (names) .. note the record set is assigned to the variable TABLES_IN_DB -->
                        <ExecuteSQL Name="GET_TABLE_LIST" ResultSet="Full" ConnectionName="FM_MYSQL">
                            <DirectInput>SELECT TABLE_NAME FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA='foodmart';</DirectInput>
                            <Results>
                                <Result Name="0" VariableName="User.TABLES_IN_DB" />
                            </Results>
                        </ExecuteSQL>

                        <!-- Iterate over each record in the TABLES_IN_DB variable
                            Note how we assign the current record to the TABLE_NAME variable -->
                        <ForEachAdoLoop Name="FOREACH_RECORD_IN_TABLE_LIST" SourceVariableName="User.TABLES_IN_DB" ConstraintMode="Linear">
                            <Variables>
                                <Variable Name="TABLE_NAME" DataType="String">
                                </Variable>
                                <Variable Name="SQL" DataType="String">
                                </Variable>
                            </Variables>
                            <VariableMappings>
                                <VariableMapping Name="0" VariableName="User.TABLE_NAME" />
                            </VariableMappings>

                            <Tasks>

                                <!-- Generate the SQL Statement -->
                                <Expression Name="GENERATE SQL" Expression="@[User::SQL]=&quot;DROP TABLE foodmart.&quot; + @[User::TABLE_NAME]"/>
                                <!-- Execute the SQL Statement (based on the user variable SQL) -->

                                <ExecuteSQL Name="DELETE TABLE" ConnectionName="FM_MYSQL" DelayValidation="true" FailPackageOnFailure="false">
                                    <VariableInput VariableName="User.SQL" />
                                </ExecuteSQL>

                            </Tasks>

                        </ForEachAdoLoop>

                        <!-- Recount Tables and store in the variable (which is passed back to the parent container)  -->
                        <ExecuteSQL Name="GET_TABLES_COUNT" ResultSet="SingleRow" ConnectionName="FM_MYSQL">
                            <DirectInput>SELECT COUNT(*) FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA='foodmart';</DirectInput>
                            <Results>
                                <Result Name="0" VariableName="User.TABLE_COUNT" />
                            </Results>
                        </ExecuteSQL>

                    </Tasks>
                </ForLoop>
            </Tasks>
        </Package>
    </Packages>

</Biml>

You might notice that I am using a MySQL ODBC database connection.  Other features are commented in the code.

As stated, there is no reliance on BIML Script here, so you may ask ‘Why use BIML?’.  The reason is that the snippet includes samples of a few controls and shows how they are configured in different circumstances.

Barry Devlin & the Business un-Intelligence Course

I really enjoy Barry Devlin’s approach to Business Intelligence.  He is an under-appreciated name in the industry (he was one of the originators of ‘Data Warehouse’ way back in the mid-’80s), so when he has something to say, it’s worth a listen – especially when he challenges ‘modern’ perceptions of Data Warehousing and the BI Industry.

And this brings me to the purpose of this post.  For those in the Asia Pacific region, Barry is presenting a course on Big Data Analytics & Reinventing Business Intelligence in Singapore July 6-8th, 2015 (http://bit.ly/1E3I90e).

So, what’s to be expected from the course? You can get the full agenda here (http://bit.ly/1c7hHIv).  Briefly, the course covers the complete range of topics on how to design a modern information/data environment, ranging from the fundamental principles of how all types of data should be absorbed and used within the organisation to an overview/assessment of tool choices and the implications of new neurobiological thinking about how decisions are really made.

Having read Barry’s latest book “Business unIntelligence – Insight and Innovation Beyond Analytics and Big Data” (http://bit.ly/BunI_Book), I suggest that this is a perfect opportunity to investigate the strategic direction of BI (from an organizational or consulting perspective) and address the implementation of BI.

Oh, and by the way – I’m told that foreign attendees get special discounts, all you have to do is ask :)

SSIS – Creating an OLEDB Connection to an ODBC Source and SSIS parameters

This post was originally going to be about using the advanced features of an SSIS OLEDB Command to set parameters.  I’ve been using Oracle sources lately and the OLEDB Command did not recognise the parameters for my query.

For example, in my work environment, the statement below just did not recognise the key_field parameter, so I had to manually add it through the component’s advanced editor.

Update some_table 
set some_field = new_value 
where key_field = ?

We’ll look into the advanced features a little later.  However, when I tried to mimic the behaviour (using a MySQL connection), the only issue I had was creating an OLEDB connection in SSIS.  I have ODBC connections on my machine; however (naturally enough, given the component’s name), the OLEDB Command requires an OLEDB connection.

So, to create an OLEDB connection (and have it reference an ODBC source), simply use the Native OLE DB\MSDataShape provider and use the ODBC DSN name as the server (as in the image below).  The UserId, Password and catalogue/schema can all be set through the connection manager’s interface.
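
For reference, the resulting connection string looks much like the one used for the FM_MYSQL connection in the BIML snippet earlier.  A minimal sketch (the DSN name, user and password below are placeholders):

provider=MSDataShape;server=my_odbc_dsn;uid=my_user;pwd=my_password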

[image: connection manager using the Native OLE DB\MSDataShape provider with the ODBC DSN as the server name]

 

Parameter Management under OLEDB

I can now use that OLEDB connection to execute an OLEDB Command.  As discussed above, my original intent was to show how to add a parameter when SSIS did not recognise the parameters.  In my situation, this was against an Oracle source.  In this demo, I’ll use a simple command:

update adventure_works_2012.customer 
set territory_id = ? 
where account_number = ? 

and the Column Mappings tab of the OLEDB Command allows me to map fields in my source (input data flow) to my command (ie parameters in the destination).  We can see this in the mapping screenshot below.

[image: the OLEDB Command column mappings, mapping input columns to the command’s parameters]

You will notice that the names are generic (Param_0, Param_1, …) and refer to the placement of the ? symbols in the source query.  That’s not really user friendly, and if I want to give these meaningful names, you would think I could set them in the Input and Output Properties (using the Name property, as below).  However, this is not the way this UI works, and the name defaults to the placement of the ? in the query.  Long story short, you can’t control the names of the parameters.

[image: the Input and Output Properties tab, showing the Name property of an external column]

However, this part of the advanced UI does give us a little indication of how to add parameters.  If no parameters are recognised, you can add them through the ‘Add Column’ button on the External Columns node of the Inputs and Outputs tab.  This is what I had to do with my Oracle command.

Data Vault Training – Australia in 2015

The Genesee Academy’s Data Vault training has been announced for 2015 in the Australian region. 

Hans Hultgren will be running classes in Brisbane, Melbourne, Sydney and Auckland in July and August this year (the dates are below with links to the courses and registration).

It’s a great class (I’ve heard multiple recommendations for this one), so if you’re interested in Data Vault, why not attend and learn from someone who literally wrote the book.

 

Jul 13 – 15th 3 Day Certification – Brisbane Australia Registration
Jul 22 – 24th 3 Day Certification – Melbourne Australia Registration
Jul 27 – 28th 2 Day Certification – Sydney Registration
Aug 3 – 4th 3 Day Certification – Auckland Registration

Forcing Values into SSRS Matrix Columns and Rows

The trusty SSRS matrix control is great for dynamically expanding columns and rows based on the object’s source data.  Its behaviour is very similar to that of a pivot table, that is, expanding and contracting based on the underlying data.

However, when you want to force data to be shown in rows or columns, the expanding nature of the control can present a problem where the data does not exist. Consider a report where we allow the user to select multiple subcategories and then we display the number of products in each category based on their selection.

The hierarchy for this relationship is product –> product subcategory –> product category, so if we allow the user to select a subcategory, and then restrict our result set based on this, the matrix will not show all the available categories.

Consider the base (standard) query below, which uses a parameter for subcategory_id (it’s a multi-valued parameter):

select
pc.Name as CategoryName
, count(*) as ProductCount
from [Production].[Product] p
join [Production].[ProductSubcategory] psc
on p.ProductSubcategoryID = psc.ProductSubcategoryID
and p.ProductSubcategoryID IN (@subcategory_id)
join [Production].[ProductCategory] pc
on psc.ProductCategoryID = pc.ProductCategoryID

group by pc.Name

 

[image: matrix showing only the categories present in the filtered dataset]

If we have a matrix (as in the image), the Category values that are shown depend on what values are in the dataset.

If we want to extend the control to show all values, we have to append the missing categories to the dataset.  Therefore, our query would need to be changed to the following:

select
pc.Name as CategoryName
, count(*) as ProductCount
from [Production].[Product] p
join [Production].[ProductSubcategory] psc
on p.ProductSubcategoryID = psc.ProductSubcategoryID
and p.ProductSubcategoryID IN (@subcategory_id)
join [Production].[ProductCategory] pc
on psc.ProductCategoryID = pc.ProductCategoryID

group by pc.Name

union all select
Name as CategoryName
, null as ProductCount

from Production.ProductCategory

With such a small example this approach works fine. However, when the dataset includes additional fields, and the matrix groups on those fields, the grouping behaviour may create undesired results because the additional union adds another layer of data to group on.

Consider, for example, if we included country; our dataset would look like this:

[country name (as a value)],[category],[count]   -- for the 'real data'
[country name (as a null) ],[category],[count]   -- for the 'additional data'

If a report included a grouping on country name, the nulls would be in their own group as a value and this is not what we want to display.

To solve this, we can apply a filter on the matrix to remove the blank countries.  This does not remove the additional categories (that is, the category values being shown); it just affects which country values are shown.  A handy little trick :)
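
As a sketch of that filter (assuming the country field is called CountryName and that the filter sits on the country row group rather than on the whole tablix), the settings would be along these lines:

Expression:  =IsNothing(Fields!CountryName.Value)   (Boolean)
Operator:    =
Value:       False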

SSRS Report Parameters from Unfriendly Sources

We have recently developed some SSRS reports against non-SQL Server sources and have run into a few issues regarding the use of multi-value parameters.  Since we are sure to meet this problem again (and variations of it), this post is just as much a bookmark for the ways around the problems we’ve encountered.

So what’s the problem?  Let’s assume we’re using a MySQL source with the same data as AdventureWorks, and we want to pick a Subcategory and then have our report show a list of products.  This is pretty straightforward.  The products query uses a ? to identify a parameter, and that parameter is assigned a value in the dataset properties:

SELECT        *
FROM            adventure_works_2012.product
WHERE        (ProductSubCategoryID = ?)

You can see how this is set up in the image below.  The SubCategory dataset is the source for the SubCategory parameter, and that is used as a parameter for the Products query.

[image: dataset properties showing the SubCategory parameter mapped to the Products query]

Unfortunately, this only works if our SubCategory parameter does NOT allow multiple values.  If we enable that option for the parameter, we’ll get an error like the following when we try to view the report.

[image: the error shown when viewing the report with a multi-value parameter]

To make the products query run with multiple parameter values, we need to dynamically generate the SQL associated with the parameter.  Actually, there are a couple of variations of this; however, they all use the same theme.

So, first, we change our query to recognise the need for multiple values from the parameter;

SELECT     *
FROM            adventure_works_2012.product
WHERE        ProductSubCategoryID in (?)

Then, instead of supplying the parameter value to the dataset, we use an expression that delimits parameter values.

[image: the dataset parameter value defined as an expression]

Here, the source type is an integer, so we expect something like 1,2,3 to be passed as the parameter’s value, and the expression is a simple comma-delimited join:

=Join(Parameters!SubCategory.Value, ",")

If the source type were a string, we would also have to quote each value, for example:

="'" & Join(Parameters!SubCategory.Value, "','") & "'"

Variations

We can use a few variations of this approach to generate the parameter list.  For example, we could exclude the parameter definition from the Products dataset and specify the dataset query itself as an expression that returns the SQL string.  Using this approach, the parameter value is still referenced, but the parameter object is not defined in the dataset (see the sketch below).
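
A minimal sketch of that expression-only variation, reusing the Products query and the SubCategory parameter from the example above:

="select * from adventure_works_2012.product where ProductSubCategoryID in (" & Join(Parameters!SubCategory.Value, ",") & ")"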

We can also define some VB code that returns the SQL string (I find this a little more flexible).  Here, the joined string value of the parameter is passed to the VB function.  So, I’ve created a simple function and added it to the report:

Public Function SQL(ByVal Keys as String) as String
  return "select * from adventure_works_2012.product where ProductSubCategoryID in (" & Keys & ")"
End Function

Now, our dataset is defined by the expression

=Code.SQL( Join(Parameters!SubCategory.Value, ",") )

Oracle Batches & Variables

More of a prompt for me than an informative post :)

Oracle parameters/variables can be defined in a batch through the variable keyword and then assigned a value using the exec (execute) command.  For example, we can define three variables and give them values with the statements below (note the use of := for assignment):

variable v1 nvarchar2(20);
variable v2 nvarchar2(20);
variable v3 nvarchar2(20);

exec :v1 := 'Val 1';
exec :v2 := 'Val 2';
exec :v3 := 'Val 3';

select :v1 as v1, :v3 as v3
from dual;

select :v2 as v2 , :v1 as v1
from dual;

When executing the commands (I’ve tried this on TOAD and SQL Developer), the function key used will either:

  1. Execute the whole batch (if F5 is pressed), or
  2. Execute only the current command, that is, the statement surrounding the cursor position (if F9 is pressed).

When a single command is executed (F9), any variables in the command are prompted for entry (as shown below).

[image: oracle_batch, the prompt for bind variable values when running a single statement]