Great Courses Down Under

For all Australian readers, there are a couple of great courses coming to Australia. 

Firstly, Barry Devlin is running a 2-day course from 24th August (see here).  For those that don’t know Barry, he is the unofficial father of data warehousing.  He was there at the start but never got the media attention of others.  What I like about his approach (and specifically his later work) is that he challenges the status quo of data warehousing and information management.

Secondly, Hans Hultgren is running Data Vault certifications in July. You can find out more here and there is also a course being run in Brisbane (at our offices). If you want to get into Data Vault, I think this is a really great place to start.

If you want to find out any more information about either of these courses, please feel free to ping me directly.

Chernoff Faces, R and Power BI

One of my favorite visualizations is the Chernoff face because it allows you to incorporate many attributes into a single visualization.  Compared to a bar chart (which only displays one attribute), you can get a ‘full picture’ of the situation in a single iconic image (pardon the pun).  Further, faces are naturally recognizable (for us humans) and there is a fair bit of evidence to suggest that, as a result of our social upbringing, we notice things in faces a lot more easily than in bland representations.  This post is about implementing Chernoff faces in Power BI (through R).  Firstly (in this post), I will use the infamous crime statistics demonstration, and in the second post, I’ll show child faces against a different dataset (Northwind), which incorporates normalization.

If you are not familiar with Chernoff faces, you can read more about them at Wikipedia here, but basically they use facial characteristics to display attribute values.  For example, the height of the face may be related to income, the width of the face might be related to quantity, and the size of the eyes to margin.

Chernoff implementation in R

I will include only the snippet of R required to show Chernoff faces (and then the output).  More complete tutorials can be found here and here.  Basically, we need to install the library, get some data and then display the visual.  This can be done with a snippet along the following lines (the exact column and label selection will depend on the shape of your data);

install.packages("aplpack") # install the package
library(aplpack)            # load the package so the faces() function is available
crime <- read.csv("")       # get the data (the source path/URL is omitted here)
faces(crime[, -1]           # assumes the first column holds the state names and the rest are numeric
  ,labels = crime[, 1]      # label each face with its state
  ,main = 'Crime Stats USA'
)  # plot the chart

We then get the face visual (as below)


Chernoff implementation in Power BI

In Power BI, there are a few things we need to do.  These are;

  1. Setup Power BI for R
  2. Get our data (or build our model)
  3. Add the visualization, by
    1. telling the R visualization (a custom component) what data to use, and
    2. using R code to create/format the visualization

These steps are followed below;

1. Setting up Power BI for R

Before you can use R in Power BI, you need to ensure that R is installed on your local machine and that you have set R’s working directory and enabled R visuals in Power BI.  To set R options in Power BI use the File > Options and settings > Options menu path.  You can see that the directory of R needs to be set and that you need to allow R to plot a visual in Power BI (this is a preview feature).  I’ve included images showing both the R directory setting and the option to use R to plot visuals.



Once these features are enabled, you should see the R icon appear in your Visualisations pane (note that you’ll need to restart Power BI if you enabled R via the above checkbox).


2. Get Data

Usually we expect to have a data model in Power BI (that is, we have imported some data, built a model and now we want to visualize it).  For this demo though, I want to import the same data set that I used in the above snippet, so I can use the R Script data source in Power BI.  Here, all I have to do is choose the R source (accepting that this is a preview connector) and paste the R (read) script into the R dialog that Power BI shows.  This is shown in the 2 images below;



After this occurs, Power BI opens up the Data Navigator and you can either Load or Edit the Import query.  I’ll just Load it (as shown below)


3. Add the Visualization

Once we have data in our Power BI workbook (you’ll notice that we have a crime table), all we need to do is;

  1. add the R visualization to the page canvas
  2. enable the script component (after all, it is a custom component)
  3. add the data (that the visualization uses) and
  4. code the visualization (in R).  Note that the data available in the R code (ie the data frame) is defined by the data that was added in the prior step.

I have tried to show all these steps in the image below.  One thing to note about adding data to the visualization is that the order in which you add fields (from the model) defines the order of the columns in the data frame.  This is important (for the Chernoff faces in R) because the first field of the data frame defines the axis (grouping) of the faces.  Note that you can also see this order in the R script editor in Power BI (the code is in the image below and is un-editable in Power BI, but looks like # dataset <- data.frame( columns )).


You may recall (from the R script above) that we referred to crime as the data frame when we created the code in R.  In Power BI however, the data frame is referred to as dataset, so in order to use our original code, we need to either create a new data frame in R (called crime) or change our existing R code to use a data frame called dataset.  I’ll go with the first option and paste the following code into the R script editor (where you see ‘Paste or type your R-script code here’), again adjusting the column and label selection to suit the fields that were added.

library(aplpack)  # the aplpack package must also be installed for Power BI's R instance
crime <- dataset  # Power BI exposes the fields added to the visual as a data frame called dataset
faces(crime[, -1] # assumes the first field added (state) is the first column
  ,labels = crime[, 1]
  ,main = 'Crime Stats USA'
)  # plot the chart

In order to change the visual (while still editing code), just press the run button (in the R Script editor).  My Power BI screen now looks like the following;


Of course, the great thing about Power BI is that there is automatic filtering among visuals on the page.  The next thing to do is add a slicer for state so the user can select which states to show; the R visual updates automatically.  Doing this, we can easily build a page that lets the user compare the faces of the states they select (as below).


Temporal Tables != Data Warehouse

With the release of SQL Server 2016 just around the corner, there are an increasing number of posts appearing about new features that (apparently) mean I don’t have to work.  One post (which motivated me to write this) claimed that this breakthrough would mean we no longer needed to store historical values – an interesting interpretation of a Data Warehouse to say the least.  Microsoft’s own content (here) suggests that there are huge productivity benefits for Slowly Changing Dimensions because you can compare the same key at two points in time.

The idea of temporal tables (recording every change to a table in a table_History table) is a cool feature – don’t get me wrong, I think it’s great.  However, there is (sadly) a lack of discussion about how the feature can be incorporated into the Data Warehouse.  Those that blindly follow the sales pitch “I don’t need a data warehouse because I’ve got history tracking”, or perhaps “yeah, I’ve got a data warehouse – it’s a history-tracked OLTP”, will ensure their users can’t get the data they need.
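
To make the discussion concrete, here is a minimal sketch of what a system-versioned (temporal) table looks like in SQL Server 2016 – the table and column names are illustrative only;

CREATE TABLE dbo.Product
(
    ProductId   int           NOT NULL PRIMARY KEY CLUSTERED,
    ProductName nvarchar(100) NOT NULL,
    ListPrice   money         NOT NULL,
    ValidFrom   datetime2 GENERATED ALWAYS AS ROW START NOT NULL,
    ValidTo     datetime2 GENERATED ALWAYS AS ROW END NOT NULL,
    PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
)
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.Product_History));

-- every update/delete is recorded in dbo.Product_History, and the table can be
-- queried as it was at a point in time
SELECT ProductId, ProductName, ListPrice
FROM dbo.Product FOR SYSTEM_TIME AS OF '2016-01-01';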

So let’s call out some issues with reliance on temporal tables as a data warehouse replacement (and bring some data warehouse assumptions to the surface).  I will focus on the star schema design since most references explicitly refer to changing dimensions (however we can apply these ideas to other methodologies).

A fundamental construct of the star schema is the surrogate key (almost as fundamental as the concepts of dimension and fact).  The surrogate uniquely identifies an instance of a dimension at a point in time and therefore the state of the dimension can be precisely identified for the fact record.  For example, if I sold a product on a date, I need to look up the product dimension and determine which version of the product was applicable on that date.  The product’s surrogate key (not the Product Id or Code) is used in the fact.  This is the fundamental design of the star schema.  A temporal table does not provide you the capacity to do this – all it can do is provide the data to construct the star.
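
As an illustration (the table and column names are hypothetical), the star schema resolves the point-in-time question once, during the load, so the analytic query is a plain join on the surrogate key;

-- the fact row already stores the surrogate key of the product version that
-- applied on the sale date, so no date logic is needed at query time
SELECT d.ProductName, d.Category, SUM(f.SalesAmount) AS SalesAmount
FROM FactSales f
INNER JOIN DimProduct d ON d.ProductSK = f.ProductSK  -- surrogate key, not the Product Id
GROUP BY d.ProductName, d.Category;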

How could you solve this with temporal tables?  Well, there may be the thought that you could concatenate the table’s primary key and the record’s start date for uniqueness and then determine which (dimension) record is applicable to a fact record via a query.  Interesting idea, but not something I’d take on lightly.  I suspect that performance would degenerate so quickly that both the BI users and the Ops users (remember that this is occurring on the OLTP) would walk away in droves.  Remember that this has to occur for every record in the fact – and yes, those are the LONG, narrow tables!
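
For contrast, here is a sketch (again with hypothetical names) of the date-range lookup that the temporal-table approach would force onto every fact record at query time;

-- resolve the applicable version of the product for each fact row by date range,
-- running against the OLTP table and its history
SELECT p.ProductName, SUM(f.SalesAmount) AS SalesAmount
FROM FactSales f
INNER JOIN dbo.Product FOR SYSTEM_TIME ALL AS p
    ON p.ProductId = f.ProductId
   AND f.SaleDate >= p.ValidFrom
   AND f.SaleDate < p.ValidTo
GROUP BY p.ProductName;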

So let’s leave it to the presentation tool – pick one: Power BI, SSAS, Tableau, Qlik, Jedox, …  All of these products rely on uniqueness between related tables, so we still require the surrogate to enforce and deliver that uniqueness.  The star (or at least the principle) is still required.

The real power of the dimension (and to a lesser extent the fact) is that it adds business context that does not exist in the source (or cannot be easily calculated). Of course this is my opinion, but think about it for a moment. Forget about historic values for a moment – raw information is in the source; if the user wanted that, you could give it to them no problem. What the star gives is a modeled perspective of a particular part of the business. Consider a customer dimension – what adds value in analysis? It is often the supplementary data (age group, segment profile, status classification, targeted customer …) and all of these things are defined and stored in the dimension. So, if we are going to add value (as data warehousing professionals), we still need the dimension to provide this.

All business applications offer some form of reporting – if you’ve ever seen an information maturity chart, it is the first stage of maturity (see below thanks to an EMC2 slide deck).


Riddle me this then: if the source application (OLTP) provides reporting, why do we need a data warehouse? To show reports at a particular point in time? Maybe (although a lot of users struggle with this and tend to think in current terms). There are a lot of tools that provide ad hoc query (OLAP) capabilities, so the performance of analysis isn’t a real consideration (after all, they could just use an OLAP engine over the OLTP, right?).

I think one of the primary reasons is integration. We want to either integrate data from different systems (creating an enterprise view) or supplement current data with other, richer information (which is really just integration anyway, isn’t it). We could even say that business rules and derived information fall into this category.

Here also, temporal tables do not negate the need for the data warehouse. The data warehouse is responsible for delivering consistent, conformed, business-verified data that incorporates information from various sources. Nothing has changed there (and the need for a data warehouse remains).

Finally, let’s consider the dimension – that subject-orientated view of an object. It’s the Product table that tells me everything I need to know about a Product: its category, groupings, margin positions and the like. The dimension is notorious for redundancy and de-normalisation, but that’s the price we are prepared to pay for delivering a single, concise view to the user, because it breaks down the complexity of the model for them (they don’t have to combine products with product categories in a query). The fact that we de-normalise breaks the basic OLTP conventions, which enforce normalisation (after all, we expect 3rd normal form).
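
As a rough sketch (with hypothetical tables again), the de-normalisation carried by the dimension is exactly the join work we take away from the user;

-- the warehouse load flattens the normalised product and category tables into
-- one row per product in DimProduct, so the user never has to write this join
SELECT p.ProductId,
       p.ProductName,
       c.CategoryName,   -- repeated on every product row in the dimension
       c.CategoryGroup
FROM Product p
INNER JOIN Category c ON c.CategoryId = p.CategoryId;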

The data warehouse is designed to do this work for us and present the view to the user. Essentially, it’s just another integration problem, but one that’s handled (and hidden) by the data warehouse. Our BI tools (again with their presentation layers) may be able to create these consolidations for us; however, we are still presented with the issue of uniqueness in related table records (that is, we have to identify which category a product related to at a point in time, and the BI tools will not do that for us).

So, are temporal tables a replacement for a data warehouse? I think not. Sure, they may be able to assist with record change tracking (and we haven’t even discussed the shift in OLTP resource management). Temporal tables are only a tool, and I’d be very careful of anyone that claims they could replace a data warehouse.

MySql Import Issues with Tabular and OldGuids

In my last post, I looked at a MySql connection property that causes loads to fail in Power BI and how you can work around that issue (see here).  In this post, I’ll look at those implications in SSAS Tabular.  In this example, I’ve downloaded the MySql ODBC drivers (both x32 and x64) from here and created ODBC connections (x32 and x64 with the same name) to my MySQL database.

Data is initially imported through the Table Wizard, where I have to choose the relational source Others (OLEDB/ODBC).  This brings me to the Table Import Wizard window (below) where I can choose my ODBC connection.  Testing the connection proves successful.  So you click Next a few times until you can choose to import from a list of tables or write a query.


If you choose to select tables (and any of the tables have Guids in them), the entire load will fail and you are presented with the following errors.

Here the error is raised because the generated SQL is not compliant with MySQL (that is, the SQL generated is “Select [calls].* from [calls]” and MySQL does not like square brackets).
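
For reference, the equivalent query that MySQL will accept uses backtick quoting (or no quoting at all) instead of square brackets;

-- the generated "Select [calls].* from [calls]" fails in MySQL, but this works
SELECT `calls`.* FROM `calls`;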

Fair enough, so you continue.  Now that the connection has been created in the model, all you need to do to import data is open the existing connection (existing connections in the model can be viewed by clicking the Connections button in the ribbon or by selecting “Existing Connections…” from the Model menu group).  You can then open that connection (either double click it or highlight it and press the Open button) and you are again presented with the choice to select from tables or write a query.  This time you specify the query (Select * From Calls).  The query validates; however, a new error is presented when you try to load the data.  Although not specified, it does relate to the OldGuids property.


Now we have to manually edit the connection to specify its properties.  When we do so, we are presented with the following window.  I want to build the connection string, so I specify the connection string option and select the Build option.  Then I can choose the machine data source (btw, I set up my ODBC connections as machine sources).


Now I am taken to the driver’s setup window (as below).  I don’t need to change anything here (since the connection works anyway), so I will accept the defaults and hit OK.  (Note that the OldGuids property is not an option in the details section of the driver’s configuration.)


My Connection String has now been specified (note the additional properties in the window and snippet below) and I can add other properties as I need.






Then I simply specify my impersonation and I am right to go (noting that I must write a query).

MySql Import Issues with Power BI and OldGuids

One of the great things about self-service BI is the ability to model data from different sources.  Recently, I have been using MySql sources against different versions of Tabular Models and have run into some issues.  This post shows those issues and how to get around them in the Power BI Desktop designer (well, at least how I did).

The import into Power BI Desktop appears straightforward, with MySQL listed as a source.  There are some dependencies on drivers (which are not installed by default); however, installing those is quite simple.  So let’s move directly to the Import Data screen (as below).


After specifying the server, database and connection information, you come to a familiar list of tables applicable to the database (like the screenshot below).  Selecting some tables gave the usual preview; however, at one selection an error arose relating to an OldGuids option in the connection, and then no tables were available in preview mode.


So I chose to Edit the M query and am straight away presented with the same error.  Note that the query errors on the Navigation step (under Applied Steps) and, had I navigated to the Source step, I could see a list of tables within the database schema.



The Source step of the table specifies the database (both server and catalog).  In fact, you can see this in the generated M as shown below.

= MySQL.Database("myservername", "catalogname", [ReturnSingleDatabase=true])

To enable OldGuids in the connection string, simply specify it as a property in the source.  Therefore, the ‘Source’ syntax now becomes;

= MySQL.Database("myservername", "catalogname", [ReturnSingleDatabase=true, OldGuids=true])

Depending on the schema of the database, this would have to be done for any table that you want to import which throws this error.

Now, I can’t help feeling that the simple import wizard creates an unnecessary step in the M script.  After all, why specify the source database and then restrict the output to a single table in it?  The natural way around this would appear to be importing just a table (via a SQL statement).  The code for this is more elegant (IMO); however, setting the OldGuids property is still required.  This time, when I add data, I’ll specify the query (as a SQL statement).  My import and its error look like the following;


Unfortunately, when I Edit the error here, I am returned to the basic import screen (on the left) and cannot do any ‘advanced editing’ (like setting the OldGuids property).  The only option I have here is to specify my query as a custom M script using a blank query, and then use the Advanced Editor to provide the M syntax for the query.


= MySQL.Database("myservername", "catalogname", [Query="select * from calls", ReturnSingleDatabase=true, OldGuids=true])

There is some high-level documentation on the MS site about importing from MySQL here and here (including prerequisites).


One alternative method to creating a blank query was to force a valid SQL statement and then edit the query.  Here, I specify that I will get data from a MySql database, and choose to enter the query.  Note my query (below) is simply “Select 1 from dual”.


I can then choose to edit that query, which brings me directly to the Power Query editor and I can hack the M from there.  That is, the window below opens.  I can then choose the Advanced Editor option to edit the M (changing the query and adding the OldGuids option) or type those changes directly into the formula bar.