Creating Tables in Excel with VBA and External Data – Part I

This post looks at how we can add a table to an Excel sheet which uses a MDX query as its source. This is a very handy feature to use for a couple reasons;

    1. The table retains the connection OLAP source (hence can be updated by a user at will)
    2. We can use it to extract data from MOLAP or tabular sources (i.e. run MDX or DAX)
    3. We can define complex queries to return a result set that cannot be obtained with a pivot table

Note that most workarounds for creating a table from OLAP sources rely on the creation of the pivot table, its formatting is a tabular source and a copy and paste the values. Hardly an attractive option!

  1. We can use the table!! – (This is really important for certain activities like data mining table analysis)

How to Do It

We’ll look at a simple query from adventure works;

select [Measures].[Reseller Sales Amount] on 0,
[Product].[Category].[Category] on 1
from [Adventure Works]
where [Geography].[Geography].[Country].&[Australia]

and an OLEDB connection string (note the OLEDB specification at the start of the string)

OLEDB;Provider=MSOLAP;Data Source=@server;Initial Catalog=Adventure Works DW 2008R2;

I have incorporated those to strings into 2 functions (MyQuery and MyConnectionString) – this just removes some of the clutter from the code.

Now we just need to use the ListObjects.Add method. The code (now in with all Sub’s and Functions) is pretty much the bare bones you need to add the table. In other posts, I’ll look into higher level of control for the output.

The CODE

The complete code is shown below. Ive included everything so it can simply be pasted into a new VB module

Sub CreateTable()

  With Sheet1.ListObjects.Add(SourceType:=0 _
, Source:=MyConnectionString() _
, Destination:=Range(“$A$1”) _
                            ).QueryTable
.CommandType = xlCmdDefault
.CommandText = MyQuery()
.ListObject.DisplayName = “MyMDXQueryTable”
.Refresh BackgroundQuery:=False
.PreserveColumnInfo = False

  End With

End Sub

Private Function MyQuery() As String

     MyQuery = “select [Measures].[Reseller Sales Amount] on 0, ” & _
“[Product].[Category].[Category] on 1 ” & _
“from [Adventure Works] ” & _
“where [Geography].[Geography].[Country].&[Australia]”

End Function

Private Function MyConnectionString() As String

     MyConnectionString = “OLEDB;Provider=MSOLAP;Data Source=@server;Initial Catalog=Adventure Works DW 2008R2;”

End Function

Walk through

This is pretty much the bare bones approach. As code walk through (see Sub CreateTable), we add the list object specifying its connection string and destination, set the command and refresh info. The only statement that is not entirely necessary is naming the table (see .ListObject.DisplayName) but I tend to think is a good idea because we will want to refer to it by name at a later stage.

Out Come

The code will add a table like the one in the following image. The field names are fully qualified which is not that nice and we will look at how this can be changed in another post. For now, our purpose is to get a table is in the workbook (the purpose of this post) so that it can be used as a table and refreshed.


PS – the code above adds the listobject by reference to the sheet within VBA (see Sheet1.ListObjects). Its probably worthwhile to point out that this is the sheet reference (ie the number of the sheet in the book) and not the name of the sheet.

One more thing – when the query uses refers to a attributes in a hierarchy the OLEDB result set (table) will include parent attributes of the hierarchy as a column. This is nothing to worry about for the moment!

Next – changing the tables query.

Advertisements

Understanding the LIFT CHART

The lift chart is synonymous with evaluating data mining model performance and the predictive power of one model against another. Often, in presentations and training sessions it is suggested that the chart is indicative of the models ability to accurately predict within a training population. For example, the following explanation is provided;

“the lift chart shows that this model is good because it only needs to evaluate 30% of data in order to correctly predict all the target outcomes”

This type of statement is simply not true – it is INCORRECT, WRONG, MISLEADING and shows a lack of understanding about what the chart represents. This post looks at explaining the chart by examining how it is created –  seeking to remove some of the misconceptions about the use of the chart.

Consider the following example. In this chart it would be argued that an ideal model (red line) would only need ~ 55% of the population in order to predict all the target states. Without a model, we would need to use the entire population and so our model (being somewhat useful) lies between the best model and a random guess. These values can be determined by the intercepts each model with the X axis (note that at 55% of the population, the best model achieves 100% accuracy).

Another common question arising from the interpretation of this chart occurs when we know that the target (predicted) state is found in only 55% of the population. The question is “why do we show 100% accuracy when only 55% of the population can exist in the predicted state and therefore the Y axis should have a maximum of 55%”.

For my own analysis, I shall ask a question of the reader so that the construction of the chart can better be understood. The question is simple.

If my model does not predict the correct result 100% of the time how could my accuracy ever achieve 100%? Let’s be realistic, it would have to be a pretty impressive model to never be wrong – and this is what the chart always shows à 100% accuracy!

Now let’s look at construction

In order to create a lift chart (also referred to as an accumulative gain) the data mining model needs to be able to predict the probability of its prediction. For example, we predict a state and the probability of that state. Within SSAS, this is achieved with the PredictProbability
function.

Now, since we can include the probability of our predictions, we would naturally order training data by the predicted probability in suggesting the likelihood of a test case being the predicted target state. Or perhaps put another way, if I only had 10 choices for choosing which test case (that is a an observation from the testing data) would be the predicted value, I would choose the top 10 testing cases based on their predicted probability – after all the model is suggesting that these cases have the highest probability of being the predicted state.

As we move through the testing data (and the predicted probability decreases), it is natural to expect the model to become less accurate – will make more false predictions. So let’s summarise this (training) data. For convenience, I have group my training data into 10 bins and each bin has ~ 320 cases (the red line below). Working with the assumption that the predictive power of my model decreases with probability, the number of predictions also decreases as we move through more of the training data. This is clearly visible in the chart and data below – the first bin has a high predictive power (275 correct predictions) while the last bin has only 96 correct predictions.

If I focus on the models ability to correctly predict values, I will notice that it can predict 1,764 correct results – but now let’s turn our look to the accumulative power of the model. If, from the set of my sample data I could only choose 322 cases (coincidently this is the number of cases in bin 1), I would choose all cases from Bin 1 and get 275 correct (or 16% of the possible correct values). If I had to choose 645 cases, I would choose the cases from bin 1 & 2 and get 531 correct (30% of correct possibilities). This continues with the more predictions that I make and is summarised in the following table.

This data is transposed onto the lift chart – the Bin on the X axis (representing the % of population) and the Percent Running Correct on the Y axis (representing the number of correct predictions). As we can see, the data is indicative of the models ability to quickly make accurate predictions rather than its overall predictive ability.

Best and Random Cases

The chart also includes best and random cases as guidelines for prediction – let’s focus on these. These lines are theoretical – really ‘what if’ type of scenarios.

Suppose that I had an ideal model. If this was the case my model would predict 322 in bin 1, 323 in bin 2 and so on – it must because we have ordered the data by PredictProbability and in a perfect world we would get them all correct! However, the model can only predict 1,764 correct values -we know this from the actual results. Because of this we would only need up to bin 6 to get all our correct values (see column ‘Running Correct Best Case’ in the following table. Just as we did for the model prediction we can convert this to a percent of total correct (the population) and chart it with the model.

Now for the random guess – again this is theoretical. I know that I can only predict 1,764 correct values so, if these were evenly distributed amongst my bins, I would have ~176 correct predictions in each bin. This is then added to the chart.

What is the Problem?

Now we can see that the chart is essentially just a view of how quickly the model makes accurate predictions.  Perhaps there is nothing wrong with that but what happens when we compare models?  Well, in this case, the comparison is relative.  Those steps are reproduced for each chart and what you essentially see is relative comparative performance.  Thus, the comparison of two models in the charts gives NO indication of performance accuracy – after all how could they since they each plot relative percent accuracy for their own states.

For this reason, relying on this chart as a sole measure of accuracy is just dangerous and really shows very little about the total accuracy of the model.

Conclusion

Understanding how the lift chart has been constructed can help in understanding how to interpret it. We can see that it indicates the accumulative power of the model to give predictions – or perhaps more correctly the accumulative power of the model to give its correct prediction.

Presenting Sessions on Data Mining

I am currently preparing a few presentations on using Data Mining in business intelligence.  These will be at the Brisbane SQL Server user group (this month) and SQL Rally Amsterdam (in Nov).  I am especially looking forward to Amsterdam because this will be my first trip to the Netherlands.

 The application of data mining within an organisation is an interesting topic for me which I liken to a milestone in organisational maturity.  When most people discuss business intelligence (and analytics), they are talking about the production of a system so that the and end user can either get canned reports quickly or (more importantly) interrogate data in a real time manner.  After all, this was the intent of OLAP! – the idea that the end user can drive data in real time so that once they think of a problem and a potential solution, they can verify their thoughts against data. 

 However, as good as this type of system is (compared to the starting base), this can be one of the short comings of stoping the BI journey here.  That is, the user needs to think of a solution to a problem, and then verify against it.  But what do we verify against? – and when do we stop investigating?  This is one of the major benefits of data mining.  It allows a scientific analysis at a targeted problem without the limitations of our dimensional thought (after all we can only think in a relatively small number of dimensions & attributes at a time).

DMX 102 – SSAS Time Series with Multiple Attributes

This post looks at extending the time series prediction so that multiple values can be predicted. In the last post (DMX 101) only a single value was predicted and while this can be useful, the approach has obvious draw backs because there is often an expectation that different classes have different patterns. Consider seasonal fashion. We expect that seasonal products will have a distinct life-cycle and to group products (from multiple seasons) would distort a specific seasons sales.

In this post, we predict two sequences within the same model. For brevity, both series follow the function y=mx+b;

    y1 = ½ X + 20 (and)

y2 = 2x

There are two ways of dealing with multiple predictors. Firstly, we can treat each as a function value of the same sequence. In this method, we identify a value for y1 and a corresponding value for y2. In this approach, the series key is simply the x (time) value. The second method considers that only one value is being predicted (y) which is a value derived from the series key and an additional identifier which specifies the function (as either y1 or y2).

Method 1 – Multiple Y values

In this situation, our predicting data is represented as functions of (the time) series key (see right). This is represented by the SQL below.
with xs(x) as
( select convert(money,1) union  all
select x+1 from xs where x<20
)
select x, x/2+20 as y1, 2*x as y2
from xs

The model is created and trained with the following commands.

— MODEL CREATION : create single sequence key with mulitple prediction values
create mining model ts_method1
(   x long key time
, y1 double continuous predict
, y2 double continuous predict
) USING MICROSOFT_TIME_SERIES

— MODEL TRAINING : train the model
INSERT INTO ts_method1(x, y1, y2)
OPENROWSET(‘SQLOLEDB.1’
, ‘Provider=SQLOLEDB.1;Integrated Security=SSPI;Initial Catalog=master;Data Source=.\SQL2008R2’
 , ‘        with xs(x) as (select convert(money,1) union all select x+1 from xs where x<20)
select x, x/2+20 as y1, 2*x as y2 from xs’
)

Note with this method, the predicted values (function) must be explicitly identified. That is, we can predict y1 or y2 but not y1+y2. For example, we can predict y1 with;

— select flattned results for y1
select flattened predict([y1],5)
from [ts_method1]

Method 2 – Single Y Value and a Composite Key

The second approach incorporates the function (or class type) into the model key. We can see that the data for the model shows a y_value that is dependent on both the x and y_function columns.

with xs(x) as (select convert(money,1)
union all select x+1 from xs where x<20)
select x ‘y1’ as y_function , x/2+20 as y_value
from xs
union all
select x
, ‘y2’ as y_function, 2*x as y2
from xs

The model can be created and trained with the following commands;

— MODEL CREATION : create complex sequence key with single prediction values
create mining model ts_method2
 x long key time
, y_function text key
, y_value double continuous predict
) USING MICROSOFT_TIME_SERIES

— MODEL TRAINING : Train the model
insert into ts_method2(x, y_function, y_value)
OPENROWSET (‘SQLOLEDB.1’
, ‘Provider=SQLOLEDB.1;Integrated Security=SSPI;Initial Catalog=master;Data Source=.\SQL2008R2’
 , ‘        with xs(x) as ( select convert(money,1) union all select x+1 from xs where x<20)
select x, ”y1” as y_function, x/2+20 as y_value from xs
union all select x, ”y2” as y_function, 2*x as y2 from xs
order by 1
)

Additionally, it is important to note that training Time Series model requires ordered data.

Prediction from a model which includes a composite key requires that the function (class) be specified in the query. The first query (as the results on the right show) does not show what class relates to what predictions. Note that two series of data are returned

— Q1 : Prediction without class specification
select flattened Predict([y_value], 5)
from [ts_method2]


This can be overcome by showing the class in the result set (in which case all results are still returned

— Q2 : Prediction showing class specification
select flattened [y_function], Predict([y_value], 5)
from [ts_method2]
or, by specifying the class that will be predicted.

— Q2 : Prediction specifying class
select flattened Predict([y_value], 5)
from [ts_method2]
where [y_function]=’y1′

The addition of class tags into time series models improves their useability by disseminating an aggregate sequence into components. This post has looked at the two methods for including class tags and the prediction implicatios of each.

DMX 101 – Time Series Prediction

Life is ironic: After the short rant about company willingness to use data mining, I was asked to mentor a client on it! They wanted to investigate the predictive capability of time series modelling.

Regardless of whether mining is applied in a descriptive or predictive manner, Analysis Services makes it easy to include some facet of data mining into operations (or at least investigate it). The use of DMX (Data Mining Extensions) makes it simple to create the model and obtain values from it and is equivalent to SQL for data mining. Some are surprised to find that BIDS is not required to create, train and query the model.

In this example, we will predict a sinusoidal curve using the time series algorithm. The curve takes the form as seen in the picture below;

There are three basic steps to obtaining the predictions. Firstly, the model has to be created. For all intensive purposes, this is the implementation of one mining algorithm with a distinct set of parameters. The model relies on a structure which defines the meta-data for input but, just as in BIDS, the creation of a model will automatically create the structure for model use. Secondly, the model needs to be trained, and finally, the model can be queried.

Model Creation

A model can be created by simply using the CREATE MIING MODEL SYNATX (as below). Model columns must be defined by name, data type and content type. Content types define behaviour (eg role as key and time) and data behaviour (DISCRETE, CONTINUOUS, DISCRETIZED). Additionally, prediction columns (ones being predicted) require a prediction request (PREDICT OR PREDICT_ONLY).

CREATE  MINING  MODEL [SinPredict]
(
    Degree    LONG KEY TIME
    , SinValue DOUBLE CONTINUOUS PREDICT
) USING
MICROSOFT_TIME_SERIES
WITH DRILLTHROUGH

As stated, the creation of the model will automatically create a structure. By default, the structure name will be the models name with a ‘_Structure’ suffix (eg SinPredict_Structure). Had we wished to specify the structure first (also perfectly valid) and append the model to it, we could use the create / alter-append syntax below.

— Create structure
CREATE  MINING  STRUCTURE [SinTimeSeries]
(
   Degree LONG KEY TIME
   , SinValue DOUBLE CONTINUOUS
)

— Append the model
ALTER  MINING STRUCTURE [SinTimeSeries]
ADD MINING MODEL [SinPredict]
(
   Degree
   , SinValue PREDICT
) USING MICROSOFT_TIME_SERIES
WITH  DRILLTHROUGH

Model Training

Once the model has been created, it can be trained (or populated). Queries run against an unprocessed model will err as will trying to re-train the model without first deleting prior contents. The generic form of training is an insert statement. For example,

INSERT INTO Model(Columns)
<< The RowSet >>

Since the generation of sinusoidal data for training can be easily accomplished with a CTE, an OPENROWSET statement to an SQL server can be used to populate the model (note the cte provided the chart above). The DMX query to populate the model is;

INSERT INTO SinPredict(Degree, SinValue)
OPENROWSET(‘SQLOLEDB.1’
, ‘Provider=SQLOLEDB.1;Integrated Security=SSPI;Initial Catalog=master;Data Source=.\SQL2008R2’
, ‘    WITH Deg AS
       (
           SELECT CAST(1 AS MONEY) AS Deg
           UNION ALL SELECT Deg + 1
           FROM Deg
           WHERE Deg < 180
       )
       SELECT *, CAST(SIN(Deg/9) AS MONEY) AS SinValue
       FROM Deg OPTION (MAXRECURSION 180)’
)

Note that, by default, rowset queries are not enabled in SSAS and shouls be configured in server properties. These are basic (open rowset queries) and advanced properties (allowed providers).

Model Prediction

Once the model has been trained, it can be queried. For time series, we can use the PREDICT or PREDICTTIMESERIES
functions (both achieve the same result) and forecasts the next n values. By default, the model returns results in the nested form which can be flattened using the keyword (as below).

SELECT     PREDICTTIMESERIES(SinValue, 20)
FROM    [SinPredict]

SELECT    FLATTENED PREDICTTIMESERIES(SinValue, 20)
FROM    [SinPredict]

One of the interesting observations about time series predictions is that the sequence identifier values are incremented from the last value in the training record set. These values are not predicted as part of the prediction function which becomes apparent when using smart keys. For example a date smart key in the form of YYYYMMDD with the last training value of 20110520 (20-May-2011) will predict incrementally from 21-May. When 30 is reached, the next sequence number is 31, then 32 and so on.

Conclusion

The use of DMX offers a simple way of implementing SSAS data mining. This simple example demonstrates how to create a time series model, train it and then predict from it.

Data Mining in the Cloud

It has always amazed me, how un-eager people have been to adopt or at least entertain some aspect of data mining as opposed to the more traditional report and investigate style of analysis. This is especially so when it’s essentially freely available (as most SSAS instances already exist) and easy to use thanks to the data mining plugin for excel.

For an overview of the excel plugin, you can check out Add-ins Launch or download it from Add-ins Download and start playing.

The excel add-in requires a connection to a SSAS instance and if that’s not available you may want to check out Microsoft’s (very simplified) cloud version of the add-in at http://clouddm.msftlabs.com/ which allows for csv upload (and maintains some of the bike buyers sample data). I am sure that this will increase functionality as time goes on.

Finally, I would be remiss if I didn’t mention Predixion Software (https://www.predixionsoftware.com/predixion/) which offers both an excel add-in and cloud solution. Predixion’s principle architect was a driving force behind the MS excel add-in.