Thursday, August 26, 2010

Processing a group of files, executing a job once per file

How many of us have ever needed to process a set of files through Kettle, executing jobs or transformations on a per-file basis? Let me give an example to illustrate a particular use case.

Suppose we have two groups of files that we call filegroup1 and filegroup2, and two groups of transformations that we call transf_group1 and transf_group2. The requirement is: we want to execute the transformations in transf_group1 once for each file in filegroup1 and, as soon as the processing of that group finishes as a whole, we want to start executing the transformations in transf_group2 once for each file in filegroup2. Let me analyse how we can do that.


A little about some main Kettle topics

Kettle processes information that flows along a path made of steps. The user takes the needed steps from a palette, drags them into the client area and builds a workflow. That flow is made up of different types of steps: input/output steps, transformation steps, script steps and so on. Every step has an input and one or more outputs. It gets the information flow in input from the immediately preceding step, processes it and outputs, as a result, a new set of information that flows into the immediately following step. The output flow produced by a step can have a layout of fields, in terms of number and data types, that differs from the flow in input.

A set of steps chained together to carry out a specific task is called a transformation. So transformations are elementary tasks, a sort of small reusable component that performs actions. A process is built by coordinating a set of orchestrated tasks that can be executed in sequence or in parallel. In Kettle this role of orchestrator is filled by the job. The job orchestrates the execution of a set of transformations to build our complete ETL process. A job is made of a set of steps too, but their intended scope is to help orchestrate the execution of the tasks (transformations) in our process. As you can see, we have a job steps palette, but it contains only steps to check conditions or prepare the execution environment. The real work is done by the steps contained in the transformations.
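
To make the idea of a transformation as an elementary, runnable task a bit more concrete, here is a minimal sketch that runs a single transformation through Kettle's Java API (the org.pentaho.di packages shipped with PDI 4.x, as I recall them); the file name my_task.ktr is just a placeholder.

```java
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;

public class RunSingleTransformation {
    public static void main(String[] args) throws Exception {
        // Initialise the Kettle engine (loads the step plugins).
        KettleEnvironment.init();

        // Load the transformation definition from its .ktr file.
        TransMeta transMeta = new TransMeta("my_task.ktr");

        // Create and run the transformation, then wait for all steps to finish.
        Trans trans = new Trans(transMeta);
        trans.execute(null);            // no extra command-line arguments
        trans.waitUntilFinished();

        if (trans.getErrors() > 0) {
            System.err.println("The transformation finished with errors.");
        }
    }
}
```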

In our ETL processes built with Kettle we always have a main job, also called the root job, that we start in order to orchestrate the execution of nested jobs or transformations. We can nest as many levels of jobs and transformations as we want below that main job.

How Kettle starts nested Jobs or Transformations

The simplest way of starting jobs or transformations in Kettle is to chain them together using the Start transformation or Start job steps: when the nested transformation or job steps are reached in the owner job's flow, they are started in sequence or in parallel, depending on how they are connected. But sometimes we would like to execute a transformation or a job once for each row in the input flow. Doing that is really simple. Go to the step configuration dialog, select the Advanced tab and check Execute for every single row. We can see an example of that below in the Start transformation configuration dialog. You'll find the same setting in the Start job configuration dialog.
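
Conceptually, what the Execute for every single row flag does is loop over the rows of the incoming row result and launch the nested job (or transformation) once per row. The sketch below is not the actual Kettle engine code, only an illustration of that loop built on the Result and RowMetaAndData classes; runNestedJobForRow is a hypothetical helper.

```java
import java.util.Collections;

import org.pentaho.di.core.Result;
import org.pentaho.di.core.RowMetaAndData;

public class ExecutePerRowSketch {

    // Conceptual illustration only: loop over the parent's row result and
    // start the nested job once per row, handing it a Result that contains
    // just that single row.
    static void executeForEverySingleRow(Result parentResult) {
        for (RowMetaAndData row : parentResult.getRows()) {
            Result singleRowResult = new Result();
            singleRowResult.setRows(Collections.singletonList(row));
            runNestedJobForRow(singleRowResult);   // hypothetical helper
        }
    }

    static void runNestedJobForRow(Result oneRowResult) {
        // In the real engine this is where the nested job or transformation
        // would be started with oneRowResult as its incoming result.
    }
}
```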


The job step Add filenames to result and why it isn't good for us

So far so good. Now let's go back to our requirements. Because we said that we have two groups of transformations, transf_group1 and transf_group2, it is clear that we will have two jobs: one that chains all the transformations of transf_group1 and a second that chains all the transformations of transf_group2. We call them job1 and job2 respectively. So we will have:

a) A root job that chains together the two jobs, job1 and job2.
b) Each job chains all the transformations of its respective group.
c) Because the two jobs enclose the groups of transformations, we are sure that the second group of transformations will be executed only after the first group has been executed as a whole.

Following what was explained above about the way a job starts a transformation, to start the two jobs once per file we need a step that reads the list of files from a specified directory and fills the result with the set of complete filenames, so that this result can be used to start our job once per file. Because we are talking about two different filegroups, we need two steps like this, each chained before its respective job. We look into the job steps palette and we find a step that looks right for us: Add filenames to result. The picture below depicts a possible flow for our root job.

We're starting to smile, but unfortunately this solution is not applicable because it doesn't work. To understand why, we need to understand the difference between a row result and a file result in Kettle. Typically a file result is a set of filenames that can be used only by steps able to manage attachments; the Mail step is one such step. A row result, instead, is made of real data, typically the output of a transformation. If you look at the Kettle internals you can see that a job step manages these two datasets as two completely separate collections. The important thing to note here is that whenever you check Execute for every single row in the job/transformation configuration, you are saying that you will start your job/transformation once for each row of the row result. So why isn't our solution good for us? Because the Add filenames to result steps fill the file result, so our jobs will never start.
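
To see that the two collections really are separate, it helps to look at Kettle's org.pentaho.di.core.Result class: the row result and the file result live in different fields and are read through different getters. A minimal sketch, with method names as I recall them from PDI 4.x:

```java
import java.util.List;
import java.util.Map;

import org.pentaho.di.core.Result;
import org.pentaho.di.core.ResultFile;
import org.pentaho.di.core.RowMetaAndData;

public class RowResultVsFileResult {

    static void inspect(Result result) {
        // The row result: real data rows, the only collection that
        // "Execute for every single row" iterates over.
        List<RowMetaAndData> rows = result.getRows();

        // The file result: filenames kept for steps that handle attachments
        // (for example the Mail step). Filling only this collection, as
        // Add filenames to result does, leaves the row result empty.
        Map<String, ResultFile> files = result.getResultFiles();

        System.out.println("row result size:  " + rows.size());
        System.out.println("file result size: " + files.size());
    }
}
```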

So what can we do?

The solution is to make a transformation whose only goal is to get the file list and use that list to populate the row result, as shown in the picture below.
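
In the sample this transformation is built around the Get file names step, typically followed by a Copy rows to result step (the standard step for filling the row result). As a rough conceptual equivalent, the sketch below lists the files of a directory and turns each filename into one row of the row result; the directory, the extension filter and the field name "filename" are placeholders of my own, not names taken from the sample.

```java
import java.io.File;
import java.io.FilenameFilter;
import java.util.ArrayList;
import java.util.List;

import org.pentaho.di.core.Result;
import org.pentaho.di.core.RowMetaAndData;
import org.pentaho.di.core.row.RowMeta;
import org.pentaho.di.core.row.ValueMeta;
import org.pentaho.di.core.row.ValueMetaInterface;

public class GetFileListSketch {

    // Conceptual equivalent of the "get file list" transformation:
    // list the files matching a pattern and put one row per filename
    // into the row result.
    static Result buildFileListResult(String directory, final String extension) {
        RowMeta rowMeta = new RowMeta();
        rowMeta.addValueMeta(new ValueMeta("filename", ValueMetaInterface.TYPE_STRING));

        List<RowMetaAndData> rows = new ArrayList<RowMetaAndData>();
        File[] files = new File(directory).listFiles(new FilenameFilter() {
            public boolean accept(File dir, String name) {
                return name.endsWith(extension);
            }
        });
        if (files != null) {
            for (File file : files) {
                rows.add(new RowMetaAndData(rowMeta, new Object[] { file.getAbsolutePath() }));
            }
        }

        Result result = new Result();
        result.setRows(rows);
        return result;
    }
}
```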

You need to call that transformation through a Transformation step in our root job, chaining it before job1 and job2 to get filegroup1 and filegroup2 respectively. Here is the complete layout of our definitive root job.


You can see the two transformations that get the file lists placed before the two jobs. Using this approach, the file list that comes out of each transformation fills the row result, and the job can be executed once per file present in our directory. Remember to check the magic flag Execute for every single row in the Start job step configuration, as detailed in the paragraph above, to correctly activate the jobs once per file.

How to execute the provided sample

To execute my sample, unzip the file into any directory. Edit the get file list transformations and change the Get file names step configuration to point to a directory and file pattern that exist on your PC. Now, if you start the root job, you can go through the log and clearly see the messages indicating that the job is behaving as expected.
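
If you prefer not to start the root job from the Kettle GUI, it can also be launched from code. Here is a minimal sketch using the Kettle Java API; the file name root_job.kjb stands in for whatever the root job file in the sample is actually called.

```java
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.core.Result;
import org.pentaho.di.job.Job;
import org.pentaho.di.job.JobMeta;

public class RunRootJob {
    public static void main(String[] args) throws Exception {
        KettleEnvironment.init();

        // Load the root job definition (no repository, no prompter).
        JobMeta jobMeta = new JobMeta("root_job.kjb", null, null);

        // Start the job and wait for the nested jobs and transformations to finish.
        Job job = new Job(null, jobMeta);
        job.start();
        job.waitUntilFinished();

        Result result = job.getResult();
        System.out.println("Errors: " + result.getNrErrors());
    }
}
```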

Download the sample from here.

2 comments:

  1. Many thanks for this. For whatever reason, this task simply was not intuitive for me. Filenames, row names, bang head against desk, etc. I appreciate the handholding on this one.
