Home > SSIS > Rank Partitioning in ETL using SSIS

Rank Partitioning in ETL using SSIS


A common use of partitioning functions within an SQL query is to rank data with in subsets of results. For example, using Adventure Works we may want to ask “what’s the sales rank of each product within each subcategory?” or, “what’s the best selling product with in each subcategory” ? Partitioning functions in SQL server allow us to answer this type of question easily. There are also many uses for partitioning in an ETL process when a query cannot be executed against prepared data within SQL Server. Consider the requirement to determine the likely value of codes based on ordered logic. In this situation, we have to determine what descriptions are given to codes where there are many code, name combinations.

This post looks at how to include partitioning with a SSIS Data Flow.

Why use SSIS?

With the availability of partitioning within SQL Server, you may question why would this be required within SSIS? The main reason for doing so is simply because the required data may not be available from a single SQL Server or we wish to combine multiple data sources. Without an integrated data, an SQL query is simply not a viable option partitioning.

One way to create an integrated data set would be to create a ‘temporary table’ in SQL Server and then query it, however, this is a cumbersome process because it requires multiple staging and extract layers. A more succinct approach is to simply EXTRACT, TRANSFORM and LOAD data in one step without touching the relational engine multiple times. Besides that, performance is better since there are no duplicate loads!

The SQL Function

In this example we are trying to mimic the SQL row_number() function (output as below). That is, we are seeking to ranks the products based on order_quantity within each sub-category.

select
    psc.EnglishProductSubcategoryName as sub_category
  ,  p.ProductAlternateKey  as product_id
,  
sum(fis.OrderQuantityas order_quantity
 ,  ROW_NUMBER() over (    partition  by  psc.EnglishProductSubcategoryName
                            order by sum(fis.OrderQuantitydesc
                       ) as row_rank

from  dbo.DimProduct  p
join  dbo.FactInternetSales  fis
    on  p.ProductKey = fis.ProductKey
join  dbo.DimProductSubcategory  psc
    on  p.ProductSubcategoryKey = psc.ProductSubcategoryKey
group  by
psc.EnglishProductSubcategoryName, p.ProductAlternateKey
order  by
sub_category, row_rank

Implementing a Data Flow Approach

When implementing this solution within a data flow task we note that the data is ranked by sub-category and order quantity (in at order). Then, the rank simply follows the following criteria;

  1. Use a number counter to determine the row_order. This increments by one with each new row within the same sub_category.
  2. When a new sub-category is found, the row_order defaults to 1 (since this must be the first new product within a new sub_category group).

This can be seen by looking at the subcategories below. Note that the data is ordered by sub_category and order_quantity (descending). Each time we change sub_category, we simply reset a counter to 1.


The dataflow looks very simple. Note that I have included the order_quantity in the products_src datasource. Practically (to satisfy the non-sql criteria) this would be included by reference to a lookup task.

// Data source query
select
ps.EnglishProductSubcategoryName as  sub_category
p.ProductAlternateKey  as  product_id
sum(fis.OrderQuantity) as order_quantity
from  dbo.FactInternetSales fis
join  dbo.DimProduct  p
    on  fis.ProductKey = p.ProductKey
join  dbo.DimProductSubcategory  ps
on  p.ProductSubcategoryKey = ps.ProductSubcategoryKey

group by ps.EnglishProductSubcategoryName
p.ProductAlternateKey

Secondly, the input data is sorted by the required partitioning scheme. This includes the sorting field for the rank qualifier. As below, this is defined as the order_quantity within sub_category.

Adding the Script Task

When the scripting task is added to the dataflow, specify the transform option and add an additional column for the row_rank (see below). We also need to access the sub_cateogory field within the transform so this must be specified as an input (below).

Image 1 – Add a Colum to the output


Scripting Code

By default, the script task as has three methods (PreExecute, PostExecute & ProcessRow). As the name implies these are processed before, after and during the consumption of the buffer. Additionally, we can utilise a global variable within class. The use of class variables allows us to keep track of the ‘current’ sub_category and row_rank (that is, the ‘current’ sub_category and row_number). If the row sub_category is not the same as the global sub_category, we know we have a new sub_category.

The code for this is shown below.

public
class
ScriptMain : UserComponent

{
 string _sub_category = “”;
int _row_rank = 1;

public  override  void Input0_ProcessInputRow(Input0Buffer Row)
{
if (Row.subcategory != _sub_category)
{
_row_rank = 1;
Row.rowrank = _row_rank;

_sub_category = Row.subcategory;
}
else
{
   _row_rank++;
   Row.rowrank = _row_rank;

}

}
}

The Output of the buffer can be seen below;



About these ads
  1. Salvador Segura
    January 9, 2014 at 7:54 pm | #1

    Hello Paul:
    I try the scripting code but need another line to function properly

    public
    class
    ScriptMain : UserComponent

    {
    string _sub_category = “”;
    int _row_rank = 1;

    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
    if (Row.subcategory != _sub_category)
    {
    _row_rank = 1;
    Row.rowrank = _row_rank;

    ********** _sub_category = Row.subcategory; **********

    }
    else
    {
    _row_rank++;
    Row.rowrank = _row_rank;
    }
    }
    }

    I do this in BIDS 2008 R2

  2. January 9, 2014 at 8:22 pm | #2

    Hi Salvador … You had to add the _sub_category = Row.subcategory; ??
    Sorry i missed it – I’ve appended it to the blog
    Thanks for the heads up

  1. No trackbacks yet.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 159 other followers

%d bloggers like this: