Often when we design ETL packages we assume that there is a hard start point, or a specific part of the day when the ETL Process can begin. However, in a distributed environment, remote machines may exchange data with a host at random intervals and it may be this exchange that should dictate the start of an ETL process. For example we may want to start our ETL when files are copied to an FTP directory. This posts looks at how to monitor a directory location using WMI (Window Management Instrumentation) and start ETL for this type of scenario.
Remote systems are often disconnected from their host. That is, the remote site retains its own data independently of a continuous connection to the host. In order to sync data between the two sites, the remote systems extract master data and transfer files back to the host. These transfers occur when the remote site needs to operate independently of the host.
Data transferred back to the host may be triggered by some event at the remote site (for example closing the daily register at the remote). The implication of this is that data arrives at the host at different and unknown times. If we wish to provide a low latency ETL at the host, we need to load this data as soon as it becomes available on the host (that is, as soon as the files are created).
A Brute Force Approach
One method of managing this is to cycle our ETL between our expected load times. For example, we can schedule a job to run at 5 minute intervals while we expect the files to be deposited. This will achieve low latency but is restricted to a latency determined by the execution interval. For example, if we run the job every five minutes, we can expect a latency of at least 5+ minutes.
One consideration of using this approach lies in the execution history for the job. Because each job executes in its own right, the history for the job shows each execution (that is, one for every 5 minutes as below).
Reviewing the log for the job may be troublesome where we expect or desire to see a single job for each day.
A WMI Approach
Another approach is to execute the package and have it listen for changes in the target directory (for want of a better word), then the package commences the lifting component of the ETL after those changes have occurred (that is the files are created). In this situation, the package is running but effectively dormant until a file is created. Furthermore, we can get a much finer polling interval than would be achieved with the Brute Force approach. An example of this type of approach can be seen in the control flow below.
The key consideration for this package is that it executes and continuously and polls the source directory until the WMI query is satisfied (and files have been created). Once this occurs, the package can then iterate over the created files and conduct the expected ETL (which in this case is just moves the file and deletes it).
The WMI Task
The msdn reference for the WMI Task can be found here. For our purposes, we are really only concerned with the ActionAtEvent and WqlQuery components. A zero time out dictates that the task will not time out however we could use this to control how long we listen for and choose an appropriate time out action and event (log the event or log and fire the package).
The Connection Manger simply specifies the server and namespace;
With the query defined as;
SELECT * FROM __InstanceCreationEvent WITHIN 10
WHERE TargetInstance ISA “CIM_DirectoryContainsFile”
and TargetInstance.GroupComponent= “Win32_Directory.Name=\”E:\\\\Blog\\\\Process Folder\””
Also note that the escape characters (ie the ‘\’ preceding a special character) are required in the directory name. The WMI reference can be found here.
Query polling is defined by the WITHIN clause and indicates the number of seconds to wait before the query is run again (10 seconds in this case).