BIML I – Getting Started with BIML

Traditional SSIS development requires the ETL developer to create a package and then manually add and configure tasks. While this method of visual development is fine, it suffers from poor reproducibility and maintenance overhead. For example, imagine (a very contrived scenario) where you create a package with a single ‘Execute SQL Task‘ and configure it to fire a procedure. Simple enough, but now suppose you want to create a new package that fires a different proc – you have to manually go in and follow all the same steps again. Perhaps not so bad, but what if you create 10 packages and then decide that the Execute SQL Task should be prefixed with EXSQL? You would have to manually go into each package and rename the task. This is cumbersome and, by its nature, carries a maintenance burden.

BIML (Business Intelligence Markup Language) is an XML specification that (amongst other things) allows you to specify the creation of SSIS packages. One of the great things about BIML is that it can also include code snippets, which amplifies its ability to generate packages. This first post forms part of a series that shows how to use BIML to create packages and the types of outcomes we can achieve. But (of course) first things first, so let’s start with a simple goal: using BIML to generate a package that executes a SQL command.

Before we continue, let’s quickly revisit my comments above about maintainability.

An SSIS package is really just an XML file, right? So why can’t I just edit that?

If you’ve ever played around with an SSIS package, you’ll probably know that it is an XML file. You can see this by viewing the code of your packages as shown below (right click on the package and view code). So you may think that you can generate or alter this pretty easily.

However, I have found this XML to be very finicky. Suppose you wanted to redefine a connection by editing the script (something you’d think should be simple, and which I’ve discussed here). You would expect that you could just examine the code, find the connection string, alter it and save it. But I’ve never had any success with that approach. When you try to open the file again (using the visual designer) it is invalid, so you’re pretty much helpless with the changes you tried to make (sorry, but there’s no comeback). And that’s only changing an existing package – creating a package this way is (IMO of course) just not possible. You could use an API, but that just seems out of reach for most.

Of course, newer versions of SSIS have project-wide objects (like connections). Using these would naturally require less maintenance; however, we are still faced with task management (altering and creation) and package development. Further, if we want to define a group of packages by metadata (say, for example, a list of tables to extract), there is no current way to do it in SSIS. A much simpler approach is BIML, which gives us a generic specification for development.

What do I Need to get running with BIML?

BIML is available when BIDS Helper is installed. BIDS Helper is a community add-in for SQL Server Data Tools (SSDT and BIDS) and is available on CodePlex. If you haven’t heard of BIDS Helper before, take some time to look around its features. It really is a must-have for business intelligence development in the Microsoft space.

Once BIDS Helper is installed, you can add a new Biml file by right clicking on the Project (or SSIS Packages) node and selecting ‘Add New Biml File‘. The file is created in the Miscellaneous folder (as shown below).

 

A single Biml file can contain the code that creates many packages (as will become apparent), but for now we will just work with the basics and use it to create a package that fires an SQL statement.

Adding Code

Opening the BimlScript file shows the XML it contains (we should probably rename the file to give it a decent name, but we won’t worry about that this time). When you start, the file contains just the root node (as below).

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
</Biml>

What Biml now allows us to do is create objects within the root (and later within other nodes). For example, since we want to create a package, we naturally expect that we need to add a package child. Naming is pretty intuitive since it follows the SSIS object names. I’ve already hinted that we can use the same file to create multiple packages, so it seems natural that a package tag should be embedded within a packages tag. Something like the following.

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Packages>
    <Package></Package>
  </Packages>
</Biml>

Actually, I did cheat a little (and so can you). The IntelliSense functionality lets you know what tags are allowed at your current position. Remember that in XML a tag is just an identifier for an object, so since I want to do something within my Biml tag (where else would I put it?), I can just open a tag and see what options are available. I want to create a package, so a search for package shows that I can add Packages and, of course, within that I can add a Package.

Validation and Creation

Back to our package snippet. We’ve got the declaration (as above) to specify a package within a packages node. You’d expect there to be a way to validate the XML to be sure it has the required information to create packages, right? Well, just right click on the Biml file (in Solution Explorer) and select ‘Check Biml for Errors‘. A dialog opens with a list of errors.

 

If we think about it, it makes sense doesn’t it? Of course we need to give our packages a name! In contrast, the error for ConstraintMode is something that we take for granted when we generate packages visually. We define the order of tasks in a package through constraints (or connectors). However, Biml has no indication of task layout other than the order in which the tasks appear in the XML. How they relate to each other must be defined in the package declaration, and this is the purpose of the ConstraintMode property. There are two allowable values for this property (Parallel or Linear) which, as the names suggest, specify that tasks will either not be connected or will be connected sequentially.

So, in order to create a package, we update the Biml code as follows (note that since we will only have one task in this package, the constraint mode is irrelevant);

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
<Packages>
  <Package ConstraintMode="Linear" Name="package_01"></Package>
</Packages>
</Biml>

We now have everything that we need to generate packages from the script. Simply right click on the Biml file and select Generate SSIS Packages (as above). You’ll see a new package created under the SSIS Packages node in Solution Explorer. Of course, there are no tasks in our package yet.

Adding the Execute SQL Task to the Package

We have already seen that the XML node naming conventions are very similar to those of SSIS (both in structure and definition). If we think about adding an Execute SQL task to the package, there are a few points we could make (or at least assume);

  1. a task would be embedded within a Tasks node
  2. the task would probably be defined by name (or a task type)
  3. an Execute SQL task requires a connection
  4. an Execute SQL task requires a command
  5. the connection would be required before the command.

It turns out that the Biml to add this task to our package is pretty straightforward. Of course you can use the IntelliSense to suggest what nodes are available and, when you look at the code and think about what you’re trying to do, it seems to make sense (that is, create a connection, create a package, add a task to the package that uses the connection and has some SQL syntax).

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Connections>
    <Connection Name="SQL01" ConnectionString="Provider=SQLNCLI10.1;Data Source=.;Persist Security Info=False;Integrated Security=SSPI;Initial Catalog=AdventureWorksDW"/>
  </Connections>
  <Packages>
    <Package ConstraintMode="Linear" Name="package_01">
      <Tasks>
        <ExecuteSQL Name="fire_proc_01" ConnectionName="SQL01" >
          <DirectInput>Select 1</DirectInput>
        </ExecuteSQL>
      </Tasks>
    </Package>
  </Packages>
</Biml>

Perhaps the most interesting thing about this snippet is the connection object. We might expect it to be embedded within the package; however, locating the Connections collection at the root gives us the advantage of reuse – another task within the package, or additional packages and tasks, can reuse that connection. For example, to create 2 packages with 2 Execute SQL commands in each (and, in the first package, have the second command wait for the first to complete), we could write;

<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Connections>
    <Connection Name="SQL01" ConnectionString="Provider=SQLNCLI10.1;… (as above) "/>
  </Connections>
  <Packages>
    <Package ConstraintMode="Linear" Name="package_01">
      <Tasks>
        <ExecuteSQL Name="SQL_CMD_1" ConnectionName="SQL01" >
          <DirectInput>Select 1</DirectInput>
        </ExecuteSQL>
        <ExecuteSQL Name="SQL_CMD_2" ConnectionName="SQL01" >
          <DirectInput>Select 1</DirectInput>
        </ExecuteSQL>
      </Tasks>
    </Package>
    <Package ConstraintMode="Parallel" Name="package_02">
      <Tasks>
        <ExecuteSQL Name="SQL_CMD_1" ConnectionName="SQL01" >
          <DirectInput>Select 1</DirectInput>
        </ExecuteSQL>
        <ExecuteSQL Name="SQL_CMD_2" ConnectionName="SQL01" >
          <DirectInput>Select 1</DirectInput>
        </ExecuteSQL>
      </Tasks>
    </Package>
  </Packages>
</Biml>

This produces package_01 (shown below), and we can see how the specification of ConstraintMode has caused SQL_CMD_1 to be linked to SQL_CMD_2.

 

Note that the output from the second package definition (where ConstraintMode is Parallel) does not introduce precedence constraints between the 2 tasks.

Conclusion

This post has looked at using Biml to create packages in SSIS. It’s a pretty simple example, but we can see how the Biml XML can be used to automatically create (and recreate) packages. If we wanted to change a property (say, for example, the SQL syntax), we could just change it in the code and regenerate the packages.

In later posts, we’ll extend this functional knowledge for more complicated scenarios.

 

 

Changing Excel Table Connections in VBA

In previous posts, I’ve discussed using VBA to extract connection information from pivot tables and hacking a table’s command so that you can programmatically change the results of a query. But what if you have a table and you want to change its connection and its command?

This can be achieved pretty easily using the table’s .QueryTable object. In fact, we can change the entire table using this class.

So how does it work? Suppose I’ve inserted some data into a worksheet from SQL Server. The menu path is DATA → From Other Sources → From SQL Server. Next I get the familiar connection windows to connect to a server and then select a table (as below).

I’ll give the table a friendly name (adventure_works) and return the data to a table. I get a connected table which is sourced from a query and can be updated (either right click somewhere in the table and select Refresh from the popup menu, or select the refresh button from the TABLE TOOLS → DESIGN ribbon menu).

I can also use the DESIGN tab of the table tools to give my table a decent name by just replacing its name (as in the following image).

Now I want to change the database server (and potentially the query) that my table uses. There are 2 things to keep in mind here. Firstly, we need to change the query’s connection string (no brainer eh) and, secondly, the data was added to Excel as a table – that is, the query refers to a database table rather than a SQL command.

As in the previous post, I can refer to the table as a ListObject in VBA. This is a property of the sheet, so in order to reference the table I’ll have to use some code like this (note the table is on Sheet2).

Dim li As ListObject
Set li = Sheet2.ListObjects("aw_employees")

For the connection string, I can refer to it through the ListObject’s QueryTable class. So I could determine the current connection string, or set it, using the following code;

Debug.Print li.QueryTable.Connection   ' retrieve the connection string
li.QueryTable.Connection = "OLEDB;Provider=SQLOLEDB.1;… (this is my connection string) …"

If I then refresh the table (using li.Refresh), one may think that’s the end of the story and our work is complete – but that’s not quite true just yet.

Remember that we added the table as a SQL table? Because of this, the ‘query’ for the .QueryTable actually refers to a table (including the SQL database) and has its .CommandType set to a table. If we debug the code to determine the command (as below), we can see this.
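
A quick way to see this is to print the command details in the Immediate window – a minimal sketch, reusing the li variable from above (the exact text returned will depend on your own database and table):

Debug.Print li.QueryTable.CommandType   ' the numeric value of xlCmdTable when the source was added as a table
Debug.Print li.QueryTable.CommandText   ' the table reference, including the database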

After we change the connection, we could set the table to use DimEmployee (without the database prefix) with something simple like;

li.QueryTable.CommandText = "dbo.DimEmployee"
li.Refresh

Or, we could change the command type to a SQL command and provide its text. This would suffice for that;

li.QueryTable.CommandType = xlCmdSql
li.QueryTable.CommandText = "select * from dbo.DimEmployee"
li.Refresh

That’s about it – we can change the table’s connection and its query, and have those changes reflected in our table.
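
Putting it all together, here is a minimal end-to-end sketch based on the snippets above. Sheet2, the aw_employees table and the SQL statement are the examples used in this post; NEW_SERVER and the rest of the connection string are placeholders you would replace with your own.

Sub RepointTable()
    Dim li As ListObject
    Set li = Sheet2.ListObjects("aw_employees")

    ' point the query at the new server (placeholder connection string)
    li.QueryTable.Connection = "OLEDB;Provider=SQLOLEDB.1;Data Source=NEW_SERVER;Integrated Security=SSPI;Initial Catalog=AdventureWorksDW"

    ' switch from a table reference to an explicit SQL command
    li.QueryTable.CommandType = xlCmdSql
    li.QueryTable.CommandText = "select * from dbo.DimEmployee"

    li.Refresh
End Sub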

The Wisdom of the (Tableau) Crowd?

Ok … I’ll admit it … Tableau has some great features.

One of the standout features of the new release (8.2) is the new data access method – the ability to model data by dropping sheets from a file and manipulating it visually as a data source.  The benefit of this is that it removes the manual connection work that is typically associated with end-user reporting tools and combining data.  Prior versions could do this to some extent by linking data sources; however, the new version allows it to happen in what could effectively be called a model designer.  Another slightly different approach (and one which is way cool) was developed by Targit (they used a folder share as a drop directory for files, which could then be consumed by the application automatically). There’s an obvious benefit to the tight integration of data and its downstream analysis – especially for business users that are not so concerned with technical model design as with the analysis of data and using it in the tool. Removing layers of integration delivers on the promise of traditional pitches that use catch cries of

spend less time working on data and more time understanding it

To be fair though, there is an assumption (IMO) that the data being used is pretty clean to begin with … all we are really doing is defining aggregation paths and streamlining the import of data.  A common objection to this type of activity (where users manage and consume their own data) is that the production of data is not verified by IT or officially sanctioned (most noticeably it is raised from an IT-centric standpoint). There could be lots written about that discussion point, but for now we will just accept that it is going to happen anyway and leave it.

Another great feature is the idea of story boards – where you can trace (for want of a better word) the thought patterns used in an investigation and present a complete view of the analysis – I’ve heard it dubbed a PowerPoint killer (and there may be some truth in that where PowerPoint is used with a heavy reliance on data integration). Again, the concept is not new (Yellowfin has had it enabled for at least one version now).

But I digress and should move onto crowds … or more importantly the vendor conference. I was staggered to discover that the Tableau conference in Seattle sold out early with 5,200 attendees (yes, that’s right, 5,200). That’s a remarkable number when you consider that the PASS Summit on SQL Server (incorporating the full MS stack) gets (an advertised more than) 5,000 attendees (I am assuming it’s just over the 5,000, so somewhere in the range of 5,000 – 5,100). Since this includes both the database administration and the BI arm, those dedicated to BI would fall far short of the 5,000. In contrast, the PASS BA conference, which specifically targets business analytics, gets 750+ attendees (again I’ll assume we’re talking 750-760 attendees). Given the ubiquity of the MS platform as a tool, the similar number of attendees at the Tableau conference is astonishing.

 

Brisbane SQL Saturday

SQL Saturday in Brisbane is Back!!

I’m happy to announce that we are concreting arrangements for a Brisbane SQL Saturday .. it’s simply the best way to get SQL Server and BI training from some of Australia’s (and perhaps some international) best trainers and consultants.  Of course, the great thing is, IT’S FREE and run by SQL community members.  So tell your boss, family and friends.

We’ve scoped out a great venue at Queensland University of Technology and are in the process of finalising some paperwork now – so leave the 20th of September open for some great SQL and BI goodness.

Like to know more?  Keep an eye on the SQL Saturday site for the event, or ping me and I’ll add you to the distribution list.

Update : 21-07-2013

Ok …. we could not host the event on SQL Saturdays site – will that stop us?

NO – we’ve created our own website to manage the event.  Check out www.sqlgroup.com.au/home.aspx to find out more.

Same great content, same great day, same venue – Just another name

Flawless Excel in Business Intelligence

Somewhat prompted by some recent posts about the inherent error associated with Excel as a BI tool (and, more importantly, how some magical tool solves this problem), I thought I’d discuss Excel as a BI tool. In fact, it’s one of my favourite pondering points: “what role should Excel play in business intelligence?”

As I’ve stated above, it’s a discussion that’s often prompted by claims that a product reduces the inherent risk associated with Excel (and, to a lesser extent, Access). Vendors often suggest that their Excel support mitigates this risk because it adds corporate security and authorised data capabilities to the user’s trusted tool (often it is argued that Excel is the only tool business users want to use, or perhaps can use).

There are several flavours presented in the pitch (often several of them in the same discussion), so let’s look at a few of the key selling (or scaring) points;

  1. Excel has inherent risk and is error prone. For this, I’ll take an extreme case since it suits my discussion. The nature of these stories is that Excel caused some monumental loss because of an error in a formula or some other error in the spreadsheet. There’s nothing like a bit of fear to make you sit up and pay attention, is there? If you have not had this error yet, don’t worry, because you will … it’s only a matter of time (in fact we’ll even throw in some statistics to show just how common this can be). It’s no big deal if you haven’t found it yet, because you just haven’t looked hard enough. In fact, some vendors even go so far as to claim that users are arrogant (now that’s going to mend the bridge between IT and the business). But then again, there’s nothing like creating a bit of division, is there – ‘oh yeah, now I remember, divide and conquer’.
  2. The Excel (or Access) system is undocumented, and the removal of the key person that developed the tool would cause monumental loss to the company. How could any responsible manager allow a system or some reporting tool to operate if it is reliant on a key person … this should never happen, should it? If you want to be a good manager and ensure business continuity (and there’s the no-back-out argument), you need a system that’s IT sanctioned with proper documentation – so who wouldn’t want or need that?
  3. Oh yes – did I mention that IT hasn’t sanctioned your Excel (or Access) work? They don’t support it and you can’t go to them with problems. This really reinforces the risk associated with points 1 & 2. If there’s an issue then you are on your own. This is a very compelling argument that enforces a CYA mentality – perhaps we should rewrite the argument as ‘if you can’t transfer blame, you better not take responsibility for it’.
  4. End users should focus on data analysis and not data preparation. Here we are presented with the classic position that too much time is spent on data preparation and not analysis. Could this possibly be a hint that an unproductive employee could get so much more done if only they used the tool?
  5. IT takes too long or does not do it right. This is such a common selling point that it is almost taken for granted (especially in the realm of reporting). A pitch that focuses on the ‘corporate lifecycle of reporting’ may go something like this;
    1. The business user describes the report to the IT analyst
    2. Sometime later the IT guys get around to writing the spec for the report and pass it to the dev queue where the report writer creates it. This doesn’t happen straight away though, because the report writer is so busy, so say 6 months later the user gets their report.
    3. The report is inevitably wrong and the cycle begins again.
    4. By now the business has moved on and the business user needs a new report.
  6. Users need Excel interaction. If there’s been no justification presented above, we need one resounding point – the final nail in the coffin (as it were). The pitch here focuses on user needs. We often hear very compelling statements along the lines that users just want to use Excel because that’s what they know, or it’s just the way they work. Picture the preparation of a presentation where the analyst grabs some information from one report, creates a LOOKUP from data in another and then creates a chart to paste into PowerPoint. Based on this story it’s easy to see why users are always going to use Excel (well, so the pitch goes).

I have listed more than a few points here, but I think they can be summarised into a few categories.

  1. Relying on Excel for the manipulation of data is a risky proposition (so you need some involvement of IT to provide corporate data to manipulate) and
  2. Users need Excel.

Now the vendor presents their tool – something to solve all the problems listed above. But how appropriate are such tools in reality? Aside from the idea that they present a very narrow perspective of what BI can do, we will now focus on how the vendor’s tool satisfies the gaping hole in the end user’s tool set.

Now (the vendor) presents the age of enlightenment in their product … our (their) tool solves all these problems because it;

  1. Is managed by IT – hosted so that there will not be any issues associated with lost or deleted files.
  2. Has gone through a rigorous testing process.
  3. Can be used by anyone. The software is so intuitive that any user can pick it up and use it.
  4. Delivers data to users in the format they want, that is, it’s accessible in Excel.

As convincing as these arguments might appear on the surface, there’s a real problem with this line of thinking, and we can sum it up in two ways;

Firstly, to paraphrase Lyndsay Wise (W|T) – if users need Excel then the BI tool is not doing what it should. While I could go into a long-winded discussion, I think the statement is succinct. Where users have to bring data out into Excel to manipulate it (you don’t tend to hear the buzzword ‘mash up’ anymore, do you?), there’s a pretty convincing argument that the sanctioned solution just doesn’t cut it (sanctioned as in corporate structure). If we think about it for a minute longer and push the case for Excel, we’d realise that what we are really saying is that our business model is so inflexible it needs an additional layer of manipulation in order to provide any real value. This may be the case, but if so, shouldn’t the problem be attacked at the source and not downstream – that is, when it enters the model and not when it’s spat out?

Now for the second point – the one where the tool somehow removes error where some form of manipulation is required. I am very interested to know exactly how any tool achieves this magical proposition (and have never seen a convincing argument from a vendor). At best, you’ve got a cell location (or several) that links back to corporate (and supposedly agreed) data. However, if we look at the process of what Excel is being used for, we are faced with exactly the same issue that would be present had they not had a tool. That is, users are free to use the data how they wish. After all, that’s the reason why Excel is so popular – it allows the user (of any skill level) a programmatic interface for working with data. The tool adds no more comfort than would otherwise be available from Excel alone.

If we would like to suggest that the availability of corporate data in Excel is the cure for Excel’s woes as a BI tool (and those nasty errors that are so well publicised), then we are simply adding another layer for users to get wrong. After all, the placement of data relies on the end user finding the right information to begin with!

Not surprisingly, the argument for safety around Excel as a BI tool resolves to the user and not the tool. Once data is placed in Excel, it is at the direction of the user, and it is their ability that we seek mitigation against.

 

Packt’s 10 Year Celebration

Usually, the cost of IT & reference books is … well pricey to say the least.  Unfortunately, if you wait for a sale, you save some cash but end up with old tech books.

Well, luckily Packt Publishing is having a 10 year celebration.  No e-book over $10.  If you are in the market, it might be worthwhile checking it out.  But here’s the kicker: it’s only lasting 10 days.

 

Extracting Connection Information from Pivot Tables


It’s pretty easy to add pivot tables to worksheets – even easier to add them when the source is a cube. However there are a couple of ways to go about adding them which could make your life easier – that’s one of the things we look at in this post.

When you first create a connection to an OLAP source (and I am including tabular in that), a connection is created in the workbook. So, just for fun, let’s go through the process and connect to the cube. On the Data tab, choose From Analysis Services from the Other Sources dropdown.

We’re asked for a server (not shown), so let’s put in the correct details and then click Next. Then we get a list of cubes and perspectives to connect to (I’m using Adventure Works here, so the names appearing in the image below would be familiar to most).

We then have the option to specify a few additional details. Perhaps the most important of these are the filename that stores the connection and the friendly name. Most of the time, though, we probably don’t put any thought into this and just click Next. However, in the image below, I’ve specified the filename and the friendly name. Finally, we can click Finish and get to work on the pivot table.

The Next Pivot Table

Now when it comes to the next pivot table, we can run through the process again – if we choose the same names we’ll get a warning, but let’s not go into that. Alternatively, we can just insert a new pivot table and select an existing connection.

You can see this in the screen flow below. When we create a new pivot, we can choose the external source and then the connection.

There’s one big reason why you’d want to insert a new pivot like this – it’s a lot easier to maintain the workbook when you need to understand where data is coming from (or you need to change data sources).

The Detail – Getting Connection and Pivot Info

It would be a pretty ordinary post if that’s all I had to say (no code after all) – but now the fun starts. Workbooks never finish up as simple and elegant as we’d like – NEVER. It’s easy to add new data sources, new pivots and even pivots of data sourced from other pivots. Throw 20-30 of them into a workbook and you’re left scratching your head as to where the information is coming from and what needs to be updated (say, when the data source changes). If you only had one connection it would be easy, but life’s not like that and there are connections flying around everywhere. So I need to look at each pivot and determine its source and connection string.

Here’s the code:

Sub GetPivotData()

  Dim wks As Worksheet
  Dim pt As PivotTable

  ' walk every pivot table on every sheet and report its cache
  For Each wks In Worksheets
    For Each pt In wks.PivotTables
      Debug.Print wks.Name & ":" & pt.Name
      Debug.Print "Cache Index : " & pt.CacheIndex
      GetCache pt.CacheIndex
      Debug.Print "--- break ---"
    Next pt
  Next wks

End Sub

' print the connection (cache) information for a given cache index
Function GetCache(Index As Long)

  Dim wb As Workbook
  Set wb = ActiveWorkbook

  Dim pc As PivotCache
  Set pc = wb.PivotCaches(Index)

  If pc.OLAP Then
    Debug.Print "Is OLAP : " & pc.OLAP
    Debug.Print "Connection Name : " & pc.WorkbookConnection.Name
    Debug.Print "Connection String : " & pc.Connection
  Else
    Debug.Print "Is OLAP : " & pc.OLAP
    Debug.Print "Data Range : " & pc.SourceData
  End If

End Function

 

My Sub (GetPivotData) loops through each sheet in the workbook, getting each pivot on the sheet, and then calls a function (GetCache) to get the cache (or connection) information for that pivot. While I could get connection information by just adapting the function to iterate over the slicer caches (as below), I’d like to associate each pivot table with its connection.

Additionally, I want to identify any pivots that are not based on an external (OLAP) source. To do this, I can use the PivotCache.OLAP property and retrieve the appropriate connection info.
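
For reference, here is a minimal sketch of that simpler, connection-only alternative. Rather than the slicer caches mentioned above, it walks the workbook’s pivot caches directly, reusing only the PivotCache properties already used in GetCache – treat it as a sketch rather than the approach used in this post.

Sub ListCacheConnections()

  Dim wb As Workbook
  Dim pc As PivotCache
  Set wb = ActiveWorkbook

  ' one entry per cache, regardless of how many pivots share it
  For Each pc In wb.PivotCaches
    If pc.OLAP Then
      Debug.Print "OLAP cache : " & pc.Connection
    Else
      Debug.Print "Range cache : " & pc.SourceData
    End If
  Next pc

End Sub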