Archive

Archive for the ‘SSIS’ Category

SSIS 2012 Execution Reporting

SSIS 2012 has (arguably) two main improvements over prior versions. The first is a project deployment model where an entire project can be deployed and managed with a common structure. The second is a more favourable logging mechanism. This post looks at the latter and examines reporting on package and job execution in SSIS 2012.

There are standard inbuilt reports that show the execution of the project (and individual packages within the project). These can be seen in SQL Server Management Studio (SSMS) by right clicking on the SSIS catalogue (or node) and selecting the “All Executions” report from the reports menu (as below). Note that although you can get to the report through the path Reports > All Executions or Reports > Standard Reports > All Executions, the reports are the same (and shown below).

Accessing the Execution Reports through SSMS
Sample Report Output

One of the really great features of this report is that you can click on the Overview hyperlink and see what packages have executed as part of the (node) execution. So, for example, the packages that have executed under my ‘Catalog Testing’ (project) execution are listed since this is the node I selected (when I selected overview from the ‘All Executions’ report above. An example of this output is seen below. You may note that the overview includes the tasks that have executed within the package.

Overview Report (accessed by the Overview link in the ‘All Executions’ report)

But Wait – There’s More!

While these reports are good, they require SSMS to view the reports, so you might want to interrogate the SSIS database directly. Using this you can create some custom reports and expose these as normal SSRS reports.

Two views that are invaluable for examining package executions (and task executions) are catalog.executions
and catalog.executable_statistics. From MSDN the executions view shows “instances package execution in the Integration Services Catalogue”, therefore, we can use this to show (and identify the execution_id) of packages when execute. Consider the “All Executions” report above which shows the execution_id of 10275. This can be easily identified in the executions view with the query;

select  *
from catalog.executions
where execution_id = 10275

More importantly we can easily identify executions of packages (or groups of packages) by referring to the package_name field in this table.

Once we have the execution_id, we can then use this in the statistics view to see the detailed execution of the package. This is the same data that is shown in the ‘Overview’ report above.

select  *
from catalog.executable_statistics
where execution_id = 10275

Most of my projects utilise a parent/child architecture, so when I consider the natural drill path for package executions as the master_package à child_package à tasks. If I want to only show the packages that have executed as part of the Master package (note Master_1.dtsx package name in the executions above) I can simply use the following query;

select e.*
, CONVERT(datetimees.start_timeAS start_time
CONVERT(datetimees.end_timeAS end_time
es.execution_duration es.statistics_id
es.execution_result
case es.execution_result
when 0 then ‘Success’
when 1 then ‘Failure’
     when 2 then ‘Completion’
    when 3 then ‘Cancelled’
  end  as execution_result_description
from catalog.executables e
join catalog.executable_statistics es
on  e.executable_id = es.executable_id
 and e.execution_id = es.execution_id
where package_path  ‘\Package’
and  e.execution_id = 10275

A Word of Caution

while they SSISDB database can be used to provide some nice reporting, the executions view contain some inbuilt security which implements row security. If we examine the definition of the view we can see that the where clause includes the following conditions.

WHERE  opers.[operation_id] in  (SELECT id  FROM  [internal].current_user_readable_operations])
OR  (IS_MEMBER(‘ssis_admin’ = 1)
OR  (IS_SRVROLEMEMBER(‘sysadmin’= 1)

The credential for accessing the database should be a member of ssis_amdin or sysadmin role. If not, no rows are returned.

I will leave it to your own devices to come up with alternate methods for bypassing this requirement.


Rank Partitioning in ETL using SSIS

February 25, 2013 Leave a comment

A common use of partitioning functions within an SQL query is to rank data with in subsets of results. For example, using Adventure Works we may want to ask “what’s the sales rank of each product within each subcategory?” or, “what’s the best selling product with in each subcategory” ? Partitioning functions in SQL server allow us to answer this type of question easily. There are also many uses for partitioning in an ETL process when a query cannot be executed against prepared data within SQL Server. Consider the requirement to determine the likely value of codes based on ordered logic. In this situation, we have to determine what descriptions are given to codes where there are many code, name combinations.

This post looks at how to include partitioning with a SSIS Data Flow.

Why use SSIS?

With the availability of partitioning within SQL Server, you may question why would this be required within SSIS? The main reason for doing so is simply because the required data may not be available from a single SQL Server or we wish to combine multiple data sources. Without an integrated data, an SQL query is simply not a viable option partitioning.

One way to create an integrated data set would be to create a ‘temporary table’ in SQL Server and then query it, however, this is a cumbersome process because it requires multiple staging and extract layers. A more succinct approach is to simply EXTRACT, TRANSFORM and LOAD data in one step without touching the relational engine multiple times. Besides that, performance is better since there are no duplicate loads!

The SQL Function

In this example we are trying to mimic the SQL row_number() function (output as below). That is, we are seeking to ranks the products based on order_quantity within each sub-category.

select
    psc.EnglishProductSubcategoryName as sub_category
  ,  p.ProductAlternateKey  as product_id
,  
sum(fis.OrderQuantityas order_quantity
 ,  ROW_NUMBER() over (    partition  by  psc.EnglishProductSubcategoryName
                            order by sum(fis.OrderQuantitydesc
                       ) as row_rank

from  dbo.DimProduct  p
join  dbo.FactInternetSales  fis
    on  p.ProductKey = fis.ProductKey
join  dbo.DimProductSubcategory  psc
    on  p.ProductSubcategoryKey = psc.ProductSubcategoryKey
group  by
psc.EnglishProductSubcategoryName, p.ProductAlternateKey
order  by
sub_category, row_rank

Implementing a Data Flow Approach

When implementing this solution within a data flow task we note that the data is ranked by sub-category and order quantity (in at order). Then, the rank simply follows the following criteria;

  1. Use a number counter to determine the row_order. This increments by one with each new row within the same sub_category.
  2. When a new sub-category is found, the row_order defaults to 1 (since this must be the first new product within a new sub_category group).

This can be seen by looking at the subcategories below. Note that the data is ordered by sub_category and order_quantity (descending). Each time we change sub_category, we simply reset a counter to 1.


The dataflow looks very simple. Note that I have included the order_quantity in the products_src datasource. Practically (to satisfy the non-sql criteria) this would be included by reference to a lookup task.

// Data source query
select
ps.EnglishProductSubcategoryName as  sub_category
p.ProductAlternateKey  as  product_id
sum(fis.OrderQuantity) as order_quantity
from  dbo.FactInternetSales fis
join  dbo.DimProduct  p
    on  fis.ProductKey = p.ProductKey
join  dbo.DimProductSubcategory  ps
on  p.ProductSubcategoryKey = ps.ProductSubcategoryKey

group by ps.EnglishProductSubcategoryName
p.ProductAlternateKey

Secondly, the input data is sorted by the required partitioning scheme. This includes the sorting field for the rank qualifier. As below, this is defined as the order_quantity within sub_category.

Adding the Script Task

When the scripting task is added to the dataflow, specify the transform option and add an additional column for the row_rank (see below). We also need to access the sub_cateogory field within the transform so this must be specified as an input (below).

Image 1 – Add a Colum to the output


Scripting Code

By default, the script task as has three methods (PreExecute, PostExecute & ProcessRow). As the name implies these are processed before, after and during the consumption of the buffer. Additionally, we can utilise a global variable within class. The use of class variables allows us to keep track of the ‘current’ sub_category and row_rank (that is, the ‘current’ sub_category and row_number). If the row sub_category is not the same as the global sub_category, we know we have a new sub_category.

The code for this is shown below.

public
class
ScriptMain : UserComponent

{
 string _sub_category = “”;
int _row_rank = 1;

public  override  void Input0_ProcessInputRow(Input0Buffer Row)
{
if (Row.subcategory != _sub_category)
{
_row_rank = 1;
Row.rowrank = _row_rank;
}
else
{
_row_rank++;
Row.rowrank = _row_rank;
}
}
}

The Output of the buffer can be seen below;



SSIS – Dynamically Adding Package Configurations

September 8, 2012 Leave a comment

For my ETL design, most activity (or real work) is conducted through child packages that are controlled through master packages. This gives us a good amount of flexibility to pass variables (like audit keys and configuration file names) to the child from the master.

However, creating and maintaining the configurations in the child can be laborious. The UI is ok for a small number of changes but for large numbers, a programmatic approach is better. This post looks at managing configurations through code.

Assemblies and Constructs

The configuration class belongs to the package. This is automatically included in SSIS scripting tasks. If the code is simply a C# program the assembly the Dts.Runtime assembly is Microsoft.SQLServer.ManagedDTS.dll

And the package can be referenced in the following manner

Application _a = new
Application();

Package _p = _a.LoadPackage(@”Your Package File Name”, null);

Configuration Class

The package must be enabled for configurations. Simply set the EnableConfigurations property to true

_p.EnableConfigurations = true;

Adding a Configuration

A new configuration can be added with the .Add method. The new configuration is added at the end (becomes the last) and therefore we can refer to it by its ordinal. All new properties can be maintained in this manner.

To add a new configuration (config_file) to the package which assigns the package variable (config_file) the value of the parent packages config_file variable, we would use the following code;

_p.Configurations.Add();
_p.Configurations[_p.Configurations.Count - 1].Name = “config_file”;
_p.Configurations[_p.Configurations.Count - 1].ConfigurationType = DTSConfigurationType.ParentVariable;
_p.Configurations[_p.Configurations.Count - 1].ConfigurationString = “config_file”;
_p.Configurations[_p.Configurations.Count - 1].PackagePath = “\\Package.Variables[User::config_file].Properties[Value]“;
_p.Configurations[_p.Configurations.Count - 1].Description = “Inherit config_file from master”;

Iteration and Deletion

The configuration object is innumerable. Therefore we can iterate over existing package configurations using an iterator (as below). Similarly, we could iterate by ordinal (using its index position). To remove the configuration, simply use the remove method.

foreach (Configuration _c in _p.Configurations)
{
_p.Configurations.Remove(_c);
}

Save the Package

The package must be saved before any changes made through code are reflected in bids.

_a.SaveToXml(@”Your package file name”, _p, null);

Bringing it Together

This code can be used in a package to append a variable to a package and add its configuration. I simply used a foreach container with a script task and the following code (note the variable file_name is the fully qualified file name passed from the container);

string _file_name = Dts.Variables["file_name"].Value.ToString();
string _file = System.IO.Path.GetFileName(_file_name);
_file = _file.Substring(0, _file.Length – 5);

Microsoft.SqlServer.Dts.Runtime.Application _a = new Microsoft.SqlServer.Dts.Runtime.Application();

Package _p = _a.LoadPackage(_file_name, null);
_p.Variables.Add(“config_file”, false, _file, “”);
_p.EnableConfigurations = true;
_p.Configurations.Add();
_p.Configurations[_p.Configurations.Count - 1].Name = “config_file”;
_p.Configurations[_p.Configurations.Count - 1].ConfigurationType = DTSConfigurationType.ParentVariable;
_p.Configurations[_p.Configurations.Count - 1].ConfigurationString = “config_file”;
_p.Configurations[_p.Configurations.Count - 1].PackagePath = \\Package.Variables[User::config_file].Properties[Value];
_p.Configurations[_p.Configurations.Count - 1].Description = “Inherit config_file from master”;
_a.SaveToXml(_file_name, _p, null);
Dts.TaskResult = (int)ScriptResults.Success;


Categories: SSIS Tags: ,

Real Time ETL with SSIS and WMI

Often when we design ETL packages we assume that there is a hard start point, or a specific part of the day when the ETL Process can begin. However, in a distributed environment, remote machines may exchange data with a host at random intervals and it may be this exchange that should dictate the start of an ETL process. For example we may want to start our ETL when files are copied to an FTP directory. This posts looks at how to monitor a directory location using WMI (Window Management Instrumentation) and start ETL for this type of scenario.

Overview

Remote systems are often disconnected from their host. That is, the remote site retains its own data independently of a continuous connection to the host. In order to sync data between the two sites, the remote systems extract master data and transfer files back to the host. These transfers occur when the remote site needs to operate independently of the host.

Data transferred back to the host may be triggered by some event at the remote site (for example closing the daily register at the remote). The implication of this is that data arrives at the host at different and unknown times. If we wish to provide a low latency ETL at the host, we need to load this data as soon as it becomes available on the host (that is, as soon as the files are created).

A Brute Force Approach

One method of managing this is to cycle our ETL between our expected load times. For example, we can schedule a job to run at 5 minute intervals while we expect the files to be deposited. This will achieve low latency but is restricted to a latency determined by the execution interval. For example, if we run the job every five minutes, we can expect a latency of at least 5+ minutes.

One consideration of using this approach lies in the execution history for the job. Because each job executes in its own right, the history for the job shows each execution (that is, one for every 5 minutes as below).

Reviewing the log for the job may be troublesome where we expect or desire to see a single job for each day.

A WMI Approach

Another approach is to execute the package and have it listen for changes in the target directory (for want of a better word), then the package commences the lifting component of the ETL after those changes have occurred (that is the files are created). In this situation, the package is running but effectively dormant until a file is created. Furthermore, we can get a much finer polling interval than would be achieved with the Brute Force approach. An example of this type of approach can be seen in the control flow below.

The key consideration for this package is that it executes and continuously and polls the source directory until the WMI query is satisfied (and files have been created). Once this occurs, the package can then iterate over the created files and conduct the expected ETL (which in this case is just moves the file and deletes it).

The WMI Task

The msdn reference for the WMI Task can be found here. For our purposes, we are really only concerned with the ActionAtEvent and WqlQuery components. A zero time out dictates that the task will not time out however we could use this to control how long we listen for and choose an appropriate time out action and event (log the event or log and fire the package).

The Connection Manger simply specifies the server and namespace;

With the query defined as;

SELECT * FROM __InstanceCreationEvent WITHIN 10

WHERE TargetInstance ISA “CIM_DirectoryContainsFile”

and TargetInstance.GroupComponent= “Win32_Directory.Name=\”E:\\\\Blog\\\\Process Folder\”"

 

Also note that the escape characters (ie the ‘\’ preceding a special character) are required in the directory name. The WMI reference can be found here.

Query polling is defined by the WITHIN clause and indicates the number of seconds to wait before the query is run again (10 seconds in this case).

 

Categories: SSIS Tags: ,

Extracting & Managing SSIS Connection Information

I have recently been involved in a large project that created a large number of SSIS packages which in turn lead to an extremely large number of connection managers. In BIDS this is unmanageable! True, SSIS does allows the creation of a shared data sources which can in turn be added to a package (essentially creating a project data source), however, the use of these data sources does have some drawbacks. Firstly, when the package is saved; it saves the connection string associated with the data source. When the data source is changed, the package retains the old connection until the package is reopened in BIDS. Secondly, the connection is lost when the package is published (to the file system). Furthermore, there is no convenient method of obtaining information about the connections amongst the packages. To manage the connections, I created a small app that has the following features.

The APP

The lists all packages I in a directory (listed in the right pane). In doing so it builds a library of all connections in the packages. So (for example), if we chose a connection type, only the connections that have that type (and the packages that have those connections) are retained in the connection and package list box.

If we then chose a connection, only the packages that use that connection are highlighted. Because I was looking for consistency in the connection strings, I show the first package connection string and identify if there are any different connection strings amongst the other packages (by highlighted by a salmon colour).

Since the goal is to manage the connections I’ve also included a ‘New Connection String’ change mechanism, a check to evaluate expressions (as connection strings) and the ability to output a csv report for all connections and packages in the library.

Default Load
Identifying packages using connection by type and name

Code Basics

The principle part of the code revolves around the ManagedDTS library which can be found in the following location / dll ‘C:\Program Files (x86)\Microsoft SQL Server\100\SDK\Assemblies\Microsoft.SQLServer.ManagedDTS.dll’

This assembly allows all SSIS (or dts) management including the creation, alteration and control flow (see http://msdn.microsoft.com/en-us/library/microsoft.sqlserver.dts.runtime.aspx ).

In order to alter an existing package we need to create an Application instance and use it to load the package. The following code achieves this;

Microsoft.SqlServer.Dts.Runtime.Application _app = new Microsoft.SqlServer.Dts.Runtime.Application();
Package _p = _app.LoadPackage(this_is_my_file_name, null);

The connection managers within the package can then be managed with the connectionmanager class (for example iterating over them by)’

foreach (ConnectionManager _c in _p.Connections)
{}

or by referring to a specific connection through the connectionmanager class and specifying its index, identity or name;

ConnectionManager _c = _p.Connections["other_oledb_1"];
ConnectionManager _c = _p.Connections[0];

Finally, we can use the properties / methods to determine the connections type, connection string, expression (connection string expression);

_c.ConnectionString; // connection string
_c.CreationName; // type
_c.GetExpression(“ConnectionString”);     //expression

Setting properties is relatively easy (just by setting the property value). The only exception to this is setting the connection string expression (which required a call of the .SetExpression method);

_c.ConnectionString = connection_string;
_c.SetExpression(“ConnectionString”, expression_value);

The interesting thing I found for this was that when the connection manager has a connection string expression attached to it, the connection string will not be changed unless the expression is changed. Then when the expresssion is changed, the connection string also changes. I thought this was a little weired.

Configuration Files

There may have been scope in the project to use configuration files however, the auditing requirement and the use of child package execution causes a failure within the child package before the configuration was applied. That is, there are some audit events that are fired before the connection can be changed and where these would otherwise err, the child package fails.

Code

I am happy to share this project with anyone that wants it. Just drop me a line and I’ll send it through (subject to the standard disclaimer!)

Categories: SSIS Tags: ,

Dynamic Package Configurations (Part 2)

This article continue discussion about a method of dynamically assigning variable values at runtime Apply the Record Set to Applicable Variables

Initial State

The initial states of the string variables are set to C:\ with variables being used for the record set container.

The code in the Show Values script tasks is simple and only there to demonstrate a variable values. This is the same code for all three ‘Show Value’ scripts. Note that the variables are passed to the script via the default ReadOnlyVaraibles inputs for the script task.

Assign the Record Sets to the variables (Object Container)

The Execute SQL tasks (‘Get Params (1)’ and ‘Get Params (2)’) are used to pass the query to a container. The queries are not complicated (below) and the standard methods for assigning a record set to a variable apply. These are to ensure that the result set is set to ‘Full Result Set’

Query for Each Result Set

Assign Result Set to Variable

There are two requirements to assign the result set to a variable. Firstly ensure that the result set is set to ‘Full Result Set’ and secondly, that the variable is mapped to the result set.

The two script tasks (‘Assign Params (1)’ and ‘Assign Params (2)’) both contain the same code and only expose the record set variable. One of the main benefits of this technique is that the assignment script does not need to know ahead of time what variables are going to be assigned. This is deemed by the field ‘variable_name’ in the record set result.

There are two main pieces of code in this script task. Firstly, we assign the record set variable (variables) to a data table (in order to iterate over it) and secondly, we iterate over the table and assign variable a value.

Assign the Record Set to a Data Table

The code to assign the record set to the data table is below.

Assigning the Dataset To Variables

The code to assign the data table to variables is below. The iteration of data table (rows) shouldn’t be too much of an issue here or should the checking for package variable name but the key takeaways are that we

  1. Check to see that the variable exists (using VariableDespenser.Contains)
  2. Lock the variable (VariableDespenser.LockForOneWrite)
  3. and then write the value.

Benefits

There are other methods of variable assignment that can be used to change the value of variables at run time, for example, we could assign values in a execute SQL task based on a single row. However, I think that limitation with an approach like this is that task has to be configured ahead of time and the variable names must be known. This method allows variable assignment based solely on the passed record set and as such is very reusable.

I should also point out that this example assumes a sting data type. The code can be extended to permits user defined data types so the record set specifies the data type however this was not included for brevity.

Categories: SSIS

Dynamic Package Configurations (Part 1)

Introduction

In the following two posts I show a method of dynamically assigning variable values at runtime with minimal user intervention. I consider this method of assignment dynamic because the package (and tasks) do not require specific configuration for to the variable that is being altered. Variable values are changed based on the variable names in a record set, and this offers an extremely high degree of flexibility in packages.

The use of SSIS configurations and variable assignment are convenient way to change the value of package variables at runtime.
Configuration files can be created through the menu path <SSIS><Package Configurations> or by selecting the ‘Package Configuration’ option after a right click in the control flow and variables can be assigned a runtime value by using the /set switch in dtexec (either command line or job execution text (as an agent job)). Coincidently, the package configurations can also be set a runtime using the /Conf switch.

However, a possible downside of these utilities is that variable values are set at runtime for the entire package and cannot be changed throughout package execution. In this blog, we will show a method assigning indeterminate variables at runtime without user intervention. This is achieved by

  1. Obtaining a record set for the variables (and values) that should be changed (and)
  2. Assigning the package variables values based on the record set parameter names.

This is a two set process that retrieves a list of variables to change (and their values) and then iterates the list, changing the variable identified on each iteration. Also, although the code presented here uses a query to retrieve these values, I find a stored procedure (with param) is a much cleaner way of retrieving the list.

Overview

To demonstrate this works, we will create a simple package that will dynamically change value of variables ouput_path and input_path at run time. We show the start values, first assignment values then final assignment values. This is shown in the mark up below;

Categories: SSIS
Follow

Get every new post delivered to your Inbox.

Join 114 other followers