OLAP Rigidity wins against Tabular?

There are many reasons why you might choose a Tabular Model over Multidimensional one. Marco Russo discusses some of the pros of tabular in his blog here which general relate to the flexibility that the tabular model gives you. However, one reason surprising reason for choosing multidimensional may be its rigid dimensional structure and the certainty that this structure gives in changing data environments.

A Multidimensional dimension provides many ways to record information against an attribute. Most importantly these are the attributes KEY, NAME and VALUE which are set under the dimension source properties. For example we can see the [State-Province] attribute from the Geography dimension in Adventure Works as below. The Key is the only property that must be set for an attribute so if no name is specified, the key will automatically be applied as the name.

Note that in the tabular engine, there is no concept of the KEY or Name. These are not specified as native DAX refers to column value(s).

Referring to Attribute Members

When an MDX query refers to members of the attribute it can do so by using the member name or the members unique name. Thus, the two MDX statements are equivalent;

– query using members unique name

select

[Measures].[Internet Sales Amount] on 0,

[Product].[Product Categories].[Category].&[4] on 1

from [Direct Sales]

 

– query using members name

select

[Measures].[Internet Sales Amount] on 0,

[Product].[Product Categories].[Category].[Accessories] on 1

from [Direct Sales]

 

There are many reasons why you would choose the first query over the second, namely, the second can give unexpected values when there is more than one member with the name being sought. That is, if there were two product categories (with different keys) and the same name (as Accessories), then the query would return only the first member. One might expect that the query would aggregate the results but this does not occur.

Once the query has been written, the members unique name is confusing and many people complain that it lacks reference to the underlying data (after all its impossible to know what does .&[4] means anyway). However, it is still the best way to refer to a member and this is the way most client tools generate MDX (using the unique_member_name).

Note that since the tabular engine has no concept of a key for an attribute (you don’t uniquely specify the key or name for an attribute) the MDX equivalent passed to the tabular engine uses what we would otherwise consider the name as the key. Thus, as in the example above, the tabular equivalent for this member is [Product].[Category].[Category].&[Accessories]. To an OLAP (multidimensional) developer, this type of key (ie the name as the key) is generally considered a real NO NO and contravenes best practice.

 

Client Queries and the interesting stuff.

For tabular models, the unique name for a member is generated by the name. Furthermore, most client tools refer to the key in their script. I have even seen some tools hard code the published member for parameters. So, what happens when the underlying attribute name changes? The change could break your report.

We can easily demonstrate how this error would occur using Excel as a client. Suppose I have a pivot based on a multidimensional cube and I convert the simple pivot to formulas (as below). The Accessories member in Cell A2 is hardcoded (the formula is in D2 and uses the key [4] as above). If the name of this member changes, the spread sheet still works fine, and the new name is displayed.

However, for the tabular example (as below), when the hardcoded name is changed, the MDX returns a #N/A error for the CUBEMEMBER function.

In the second picture, the name of the member has changed to [Accessories & Other] however the ‘report’ still relies on the hard coded value .&[Accessories] and the error occurs.

Conclusion

The use of a ‘key’ for each member would offer a way to solve the problem as addressed above and this is only available in multidimensional dimensions. To be fair though, the problem is as much a function of design as it is of structure. A multidimensional dimension would still raise the issue if the attributes name had been used as a key (but no-one does that do they?).

The Last Non-Empty Dilemma TABULAR Style

In a previous post, I addressed the function of the last non empty aggregation in SSAS (OLAP). Specifically, I looked at what happens when there are ‘missing’ chucks of data in the fact table and how the OLAP engine natively carries forward the balance to an aggregated time member without regard for the what is the ‘true’ last date. That is, we expect the last date of stock to be applied for all aggregated date values regardless of whether data exists or not. This post looks at the LNE calculation in the tabular model.

Generally, we expect the stock schema to take the form of the fact (balances) surrounded by the dimension table. The fact table (Stock) holds the [Quantity on Hand] (on_hand) for a given date with the Dates table showing a natural hierarchy of Year, Month and Date.

Summing the column on_hand will allow the calculation of daily balances (we will be able to show correct balances when a date is selected) however, if we use any aggregated Date field, the [Stock on Hand] value will be incorrect because the formula adds the amount for all days filtered by the date restriction.

For example, if we consider a very small subset of data (shown as ‘Source Data’) and [Stock on Hand] defined as

Stock On Hand:=Sum([on_hand])

We can see that Jan 2011 for Product A (20)=15+4+1 and for product B (2011)=10+8

Source Data [Stock On Hand] using Sum

Clearly, the addition for aggregate time members does not give us what we want.

The next thought is to change the stock filter context to the max of time. That is, return the sum of stock for the last date in the selected time period. For example;

Stock On Hand:=CALCULATE(sum([on_hand]), FILTER(ALL(Dates[date]), Dates[date]=max(Dates[date])))

The problem with this approach is that the aggregate values will only show for the last date in the month (or year). This is not a problem when we have full year worth of data in the fact and the maximum of the date filter is the last date in the period. But in this case we don’t and usually the most recent data will not be complete for the year. For Product A in the picture below, the total for Jan-2011 is correct (1) because the maximum date for Jan is 31-Jan and has data (see source). However, in the year context (ie 2011) the maximum date is 31-Dec-2011 and there is no stock for this and therefore the sum is empty.

Source Data [Stock On Hand] using max(date.date)

To solve this problem, it is important to remember that the filter context applies a filter to the fact. So, for the filter context of Jan-2011 (say when Jan-2011 is on a row), the data in the fact is filtered for all the dates in Jan. What we are really interested in therefore is the last date in the fact for the current filter context.

Stock On Hand:=CALCULATE(sum([on_hand]), filter(Stock, Stock[date_key]=Max(Stock[date_key])))

In this formula, we recognise that the filter context has been applied to the fact table and use this to derive the maximum date for the current filter period and then use this to further filter the fact.

Source Data [Stock On Hand] using max(Stock.date)

Alternate Approach using Transactions

In addition to using the quantity of stock as a field in the fact, we may also consider using the transactional amount. In this case the use of summing against the Date table filter context works because we are interested in summing all data less than the max date of the filter context. In this situation, our formula would be;

Stock On Hand:=CALCULATE(sum([qty_mvt]), FILTER(Stock, Stock[date_key]<=max(Dates[date])))

Tabular verse Dimensional Design Concepts

The multidimensional model appears to provide a richer environment for model design. However, for the case of snapshot fact tables, the tabular design may offer a much faster design and refreshing options. This post looks at the snapshot design and how business requirements are modelled in both multidimensional and tabular platforms.

The Snapshot Fact

Snapshot data (or accumulating snapshot fact tables) are usually used to record process driven data where records change over time. Consider, for example, the tables FactResellerSales and FactInternetSales tables in AdventureWorks. Here, the record shows Order Date, Due Date and a Ship Date. When the sale is made, the Order Date is known and the Due date is probably known, however, the ship date can change and is only truly known after the order has shipped. For this type of fact table, data should be added when the sale occurs and updated as information about the sale comes to light (that is, when it is shipped).

Classic Star Modelling

The standard way to model this relationship is through the reuse of the date table. That is, each fact field relates to the same Date.DateKey field in the dates table. This is shown for the tabular and multidimensional design models below

Tabular

Multidimensional

 

Multidimensional Interpretation

When a cube is built from this design, the Dates dimension becomes a role playing dimension and is joined to the fact for each relationship identified in the data source view. Thus, the Dates dimension is reused by the cube and with the dimension names appearing as the name of the fact field (after camel case conversion). Notice that there is a single dimension in the solution however, there appears to be three date dimensions (Order Date, Due Date and Ship Date).

Solution View

Cube View

 

While this approach may provide answers to simple business questions (eg, What is that value / quantity of product shipped on date xyz), the modelling technique fails when the query becomes complicated across restrictive dates. For example, it is not straight forward to determine the quantity ordered, shipped and due on date xyz.

Tabular Interpretation

In contrast to the multidimensional model, the tabular model employees an active relationship as the default join between fact and dimensions and each table appears only once in the ‘dimensional representation’ of the model. For example, there is only one date table in the pivot view.

By default, aggregation functions will use this relationship. In this situation, the active relationship (solid line) is between the OrderDate and the Date table. A sum measure ( sum([OrderQuantity])
) defined without context will therefore show the quantity ordered on each date.

The tabular model also allows functions to specify which relationship will be used as joins between tables. Therefore, the quantity of products shipped on a date can be determined by specifying the relationship between ResellerSales.ShipDateKey and Dates.DateKey. For example,

Ship Quantity:=CALCULATE(sum([OrderQuantity]),USERELATIONSHIP(‘ResellerSales’[ShipDateKey],’Dates’[DateKey]))

This allows the determination of measures that relate to more generic dimensions. For example, we can easily define [Order Quantity], [Ship Quantity], [Due Quantity] which specifies these values by date. This is in direct contrast to the default multidimensional behaviour and allows for more native browsing. For example, the date x value pivot below quickly identifies the sparse nature and trend of adventure works data.

I had never looked at adventure works data like this before. Here we can easily see that products are ordered on the 1st of the month, shipped on the 8th and due on the 13th. There are very few exceptions to this in the fact data.

 

Compromise?

The UDM can be designed to produce this outcome; however, it is not part of the ‘default’ behaviour. One way to achieve this would be to conform all dates into a single field (for example through a union) and specify separate measures for each union join (ie; add the Order Data, then the Ship Data and finally the Due Date data). However, this would require longer load times (since we are effectively staking facts) and increase the measure group size. The tabular approach (in my opinion) is a much nicer compromise.

NB It is also easy to mimic the multidimensional behaviour in tabular. Mulitple date tables are added to the model (one for each fact date) and labelled ‘Ship Date’ , ‘Due Date’, …