How many of us have ever needed to process a set of files through Kettle, executing jobs or transformations on a per-file basis? Let me give an example to illustrate a particular use case.
Suppose we have two groups of files, which we call filegroup1 and filegroup2, and two groups of transformations, which we call transf_group1 and transf_group2. The requirement is this: we want to execute the transformations in transf_group1 once for each file in filegroup1 and, as soon as the processing of this group has finished as a whole, start executing the transformations in transf_group2 once for each file in filegroup2. Let me analyse how we can do that.
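Before looking at Kettle itself, the requirement can be restated as a small Python sketch. This is only a model of the desired ordering, not Kettle code: the helper names (make_transformation, run_group, run_process) and the file names are hypothetical, and a "transformation" is reduced to a function that reports which file it processed.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in: a "transformation" is a function that takes a
# filename and returns a log message.
Transformation = Callable[[str], str]


def make_transformation(name: str) -> Transformation:
    """Build a dummy transformation that only reports what it processed."""
    def run(filename: str) -> str:
        return f"{name} processed {filename}"
    return run


def run_group(transformations: List[Transformation], files: List[str]) -> List[str]:
    """Run every transformation of a group once for each file."""
    log = []
    for filename in files:
        for transformation in transformations:
            log.append(transformation(filename))
    return log


def run_process(groups: List[Tuple[List[Transformation], List[str]]]) -> List[str]:
    """Run the groups strictly one after another: a group starts only
    when the previous group has finished as a whole."""
    log = []
    for transformations, files in groups:
        log.extend(run_group(transformations, files))
    return log


# Made-up example data
filegroup1 = ["a1.csv", "a2.csv"]
filegroup2 = ["b1.csv"]
transf_group1 = [make_transformation("t1a"), make_transformation("t1b")]
transf_group2 = [make_transformation("t2a")]

log = run_process([(transf_group1, filegroup1), (transf_group2, filegroup2)])
for line in log:
    print(line)
```

Every message from transf_group1 appears in the log before the first message from transf_group2, which is exactly the behaviour we want from the Kettle jobs.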
A little about some main Kettle topics
Kettle processes information that flows along a path made of steps. The user picks the steps needed from a palette, drags them into the client area and builds a workflow. That flow is made up of different types of steps: input/output steps, transformation steps, script steps and so on. Every step has an input and one or more outputs. It receives the information flow from the immediately preceding step, processes it and outputs a new set of information that flows into the immediately following step. The output flow produced by a step can have a field layout, in terms of number of fields and data types, that differs from the input flow. A set of steps chained together to perform a specific task is called a transformation. So transformations are elementary tasks, a sort of small reusable component that performs actions. A process is built by coordinating a set of tasks that can be executed in sequence or in parallel. In Kettle the role of orchestrator is filled by the job. The job orchestrates the execution of a set of transformations to build our complete ETL process. A job is made of steps too, but their purpose is to help orchestrate the execution of the tasks (transformations) in our process. As you can see, there is a job steps palette, but it contains only steps that check conditions or prepare the execution environment. The real work is done by the steps contained in the transformations.
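To make the idea of steps and transformations more concrete, here is a minimal Python sketch. It is an analogy, not Kettle's implementation: each step is modelled as a function that consumes rows and produces rows, and a transformation is simply a chain of such functions. All names and the sample rows are made up for illustration.

```python
def read_input(rows):
    """Input step: passes the source rows on unchanged."""
    for row in rows:
        yield dict(row)


def add_total(rows):
    """Transformation step: adds a field, so the output layout
    differs from the input layout."""
    for row in rows:
        out = dict(row)
        out["total"] = out["price"] * out["quantity"]
        yield out


def keep_fields(names):
    """Build a step that keeps only the listed fields."""
    def step(rows):
        for row in rows:
            yield {name: row[name] for name in names}
    return step


def run_transformation(steps, source):
    """Chain the steps: the output of each one feeds the next one."""
    flow = source
    for step in steps:
        flow = step(flow)
    return list(flow)


source_rows = [
    {"item": "pen", "price": 2, "quantity": 3},
    {"item": "book", "price": 10, "quantity": 1},
]
result = run_transformation(
    [read_input, add_total, keep_fields(["item", "total"])],
    source_rows,
)
print(result)
```

The rows that come out of the last step have a different layout (item, total) from the rows that went into the first one (item, price, quantity), which mirrors what the paragraph above says about input and output flows.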
In ETL processes built with Kettle we always have a main job, also called the root job, that we start in order to orchestrate the execution of nested jobs or transformations. We can nest as many levels of jobs and transformations as we want below that main job.
How Kettle starts nested jobs or transformations
The first way of starting jobs or transformations in Kettle is the straightforward way: chain them together using the Start transformation or Start job steps. When the nested transformation or job steps are reached in the flow of the owning job, they are started in sequence or in parallel, depending on how they are connected. But sometimes we want to execute a transformation or a job once for each row in the input flow. That is really simple to do. Go to the step configuration dialog, select the Advanced tab and check Execute for every single row. We see an example of that below in the Start transformation configuration dialog. You'll find the same setting in the Start job configuration dialog.
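The effect of that checkbox can be pictured with a small Python sketch. This is a simplified model under my own assumptions, not Kettle code: run_entry and its execute_for_every_row parameter are hypothetical names, and the nested job is reduced to a function that receives a list of rows.

```python
def run_entry(entry, result_rows, execute_for_every_row=False):
    """Model the launch of a nested job or transformation.

    Flag off: the entry is started once and sees all the rows.
    Flag on:  the entry is started once per row and sees one row each time.
    """
    if execute_for_every_row:
        return [entry([row]) for row in result_rows]
    return [entry(list(result_rows))]


def nested_job(rows):
    """Dummy nested job: just report how many rows it was started with."""
    return f"started with {len(rows)} row(s)"


rows = [{"filename": "a.csv"}, {"filename": "b.csv"}, {"filename": "c.csv"}]

print(run_entry(nested_job, rows))                              # one launch
print(run_entry(nested_job, rows, execute_for_every_row=True))  # three launches
print(run_entry(nested_job, [], execute_for_every_row=True))    # no launch at all
```

Note the last call: in this model, with the flag on and no rows available, the nested job is never started. That detail matters in the next section.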
The job step Add filenames to result and why it isn't good for us
So far so good. Now let's go back to our requirements. Because we said that we have two groups of transformations, transf_group1 and transf_group2, it is clear that we will have two jobs: one that chains all the transformations of transf_group1 and a second that chains all the transformations of transf_group2. We call them job1 and job2 respectively. So we will have:
a) A root job that chains together the two jobs job1 and job2.
b) Each job chains all the transformations of its respective group.
c) Because each job encloses its group of transformations, we are sure that the second group of transformations is executed only after the first group, as a whole, has been executed.
Given what was explained above about the way to start a transformation in a job, to start the two jobs once per file we need a step that reads the list of files from a specified directory and fills the result with the set of complete filenames, so that it can be used to start our job once for each file in the result. Because we talked about two different filegroups, we need two such steps, each chained before its respective job. Looking through the job steps palette we find a step that seems right for us: Add filenames to result. The picture below depicts a possible flow for our root job.
We're starting to smile, but unfortunately this solution doesn't work. To understand why, we need to understand the difference between a row result and a file result in Kettle. Typically a file result is a set of filenames that can be used only by steps that are able to manage attachments. The Mail step is one step that can manage such a result. Row results, instead, are made of real data, typically the output of a transformation. If you look at the Kettle internals you will notice that a job step manages these two datasets as two completely separate collections. The important thing to note here is that whenever you check Execute for every single row in your job/transformation configuration, you're saying that you'll start your job/transformation once for each row of your row result. So why isn't our solution good for us? Because our Add filenames to result steps fill a file result, so our jobs will never start.
So what to do?
The solution is to make a transformation whose only goal is to get the file list and use that list to populate a result list, as shown in the picture below.
You need to call that transformation through a Transformation step in the root job, chaining one before job1 and one before job2 to get filegroup1 and filegroup2 respectively. Here is the complete layout of our definitive root job.
You can see the two transformations that get the file lists before the two jobs. With this approach the file list that comes from the transformation fills the row result, and the job can be executed once per file present in our directory. Remember to check the magic flag Execute for every single row in the Start job step configuration, as detailed above, so that the jobs are correctly started once per file.
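The difference between the failing approach and the working one can be summarised in a short Python sketch. It is a deliberately simplified model of the two separate collections described earlier, not the real Kettle classes: Result, add_filenames_to_result, file_list_transformation and run_for_every_row are hypothetical names chosen for this illustration, and the "filename" field name is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Result:
    """Model of a job step result with its two separate collections."""
    rows: List[Dict[str, str]] = field(default_factory=list)  # row result
    files: List[str] = field(default_factory=list)            # file result


def add_filenames_to_result(result: Result, filenames: List[str]) -> None:
    """Failing approach: the filenames end up in the file result."""
    result.files.extend(filenames)


def file_list_transformation(result: Result, filenames: List[str]) -> None:
    """Working approach: the filenames end up in the row result."""
    result.rows.extend({"filename": name} for name in filenames)


def run_for_every_row(result: Result, job) -> List[str]:
    """Start the job once per row of the row result; the file result is ignored."""
    return [job(row) for row in result.rows]


def job1(row: Dict[str, str]) -> str:
    return f"job1 started for {row['filename']}"


filenames = ["in/a.csv", "in/b.csv"]  # made-up example files

failing = Result()
add_filenames_to_result(failing, filenames)
print(run_for_every_row(failing, job1))   # empty list: the job never starts

working = Result()
file_list_transformation(working, filenames)
print(run_for_every_row(working, job1))   # one launch per file
```

In the first case the list of launches is empty even though two filenames were collected, because they sit in the collection that the per-row execution never reads.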
How to execute the provided sample
To execute my sample, unzip the file into any directory. Edit the get file list transformations and change the Get file names step configuration to a directory and file pattern that exist on your PC. Now, if you start the root job, you can go through the log and clearly see the messages that indicate the job is behaving as expected.
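If you want to check which files a directory and pattern would select before running the job, a quick Python sketch like the one below can help. It assumes that the file pattern is a regular expression that must match the whole file name; Python's regex dialect is not identical to Java's, so treat this only as a rough preview. The matching_files helper, the temporary directory and the file names are made up for the demonstration.

```python
import os
import re
import tempfile


def matching_files(directory, pattern):
    """Return the sorted full paths of the files in `directory` whose
    name matches `pattern` entirely (assumed regex semantics)."""
    regex = re.compile(pattern)
    return sorted(
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name)) and regex.fullmatch(name)
    )


# Demonstration on a throwaway directory with made-up file names
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("sales_01.csv", "sales_02.csv", "notes.txt"):
        open(os.path.join(demo_dir, name), "w").close()

    found = [os.path.basename(path)
             for path in matching_files(demo_dir, r"sales_.*\.csv")]
    print(found)
```

To use it on your own files, replace the temporary directory and the pattern with the values you plan to enter in the step configuration.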
Download the sample from here