Rank Partitioning in ETL using SSIS
A common use of partitioning functions within an SQL query is to rank data with in subsets of results. For example, using Adventure Works we may want to ask “what’s the sales rank of each product within each subcategory?” or, “what’s the best selling product with in each subcategory” ? Partitioning functions in SQL server allow us to answer this type of question easily. There are also many uses for partitioning in an ETL process when a query cannot be executed against prepared data within SQL Server. Consider the requirement to determine the likely value of codes based on ordered logic. In this situation, we have to determine what descriptions are given to codes where there are many code, name combinations.
This post looks at how to include partitioning with a SSIS Data Flow.
Why use SSIS?
With the availability of partitioning within SQL Server, you may question why would this be required within SSIS? The main reason for doing so is simply because the required data may not be available from a single SQL Server or we wish to combine multiple data sources. Without an integrated data, an SQL query is simply not a viable option partitioning.
One way to create an integrated data set would be to create a ‘temporary table’ in SQL Server and then query it, however, this is a cumbersome process because it requires multiple staging and extract layers. A more succinct approach is to simply EXTRACT, TRANSFORM and LOAD data in one step without touching the relational engine multiple times. Besides that, performance is better since there are no duplicate loads!
The SQL Function
In this example we are trying to mimic the SQL row_number() function (output as below). That is, we are seeking to ranks the products based on order_quantity within each sub-category.
from dbo.DimProduct p
Implementing a Data Flow Approach
When implementing this solution within a data flow task we note that the data is ranked by sub-category and order quantity (in at order). Then, the rank simply follows the following criteria;
- Use a number counter to determine the row_order. This increments by one with each new row within the same sub_category.
- When a new sub-category is found, the row_order defaults to 1 (since this must be the first new product within a new sub_category group).
This can be seen by looking at the subcategories below. Note that the data is ordered by sub_category and order_quantity (descending). Each time we change sub_category, we simply reset a counter to 1.
The dataflow looks very simple. Note that I have included the order_quantity in the products_src datasource. Practically (to satisfy the non-sql criteria) this would be included by reference to a lookup task.
// Data source query
group by ps.EnglishProductSubcategoryName
Secondly, the input data is sorted by the required partitioning scheme. This includes the sorting field for the rank qualifier. As below, this is defined as the order_quantity within sub_category.
Adding the Script Task
When the scripting task is added to the dataflow, specify the transform option and add an additional column for the row_rank (see below). We also need to access the sub_cateogory field within the transform so this must be specified as an input (below).
Image 1 – Add a Colum to the output
By default, the script task as has three methods (PreExecute, PostExecute & ProcessRow). As the name implies these are processed before, after and during the consumption of the buffer. Additionally, we can utilise a global variable within class. The use of class variables allows us to keep track of the ‘current’ sub_category and row_rank (that is, the ‘current’ sub_category and row_number). If the row sub_category is not the same as the global sub_category, we know we have a new sub_category.
The code for this is shown below.
ScriptMain : UserComponent
string _sub_category = “”;
int _row_rank = 1;
public override void Input0_ProcessInputRow(Input0Buffer Row)
if (Row.subcategory != _sub_category)
_row_rank = 1;
Row.rowrank = _row_rank;
_sub_category = Row.subcategory;
Row.rowrank = _row_rank;
The Output of the buffer can be seen below;