How many of us ever had the need to process a set of files through Kettle executing jobs or transactions on a per file base. Let me make an example to illustrate a particular use case.
Suppose we've two groups of files that we call filegroup1 and filegroup2 and suppose we have 2 groups of transformations that we call transf_group1 and transf_group2. The requirement is: we want to execute the transformations in transf_group1 once for each file in filegroup1 and as soon as the processing of this group finishes as a whole we want to start the execution of the transformations in transf_group2 once for each file in filegroup2. Let me analyse how we can do that.
A little about some main Kettle topics
Kettle processes informations that flows through a path made by steps. The user takes the needed steps from a palette, drags them into the client area and builds a workflow. That flow is made up by different types of steps: we've input/output steps, transformation steps, script steps and so on. Any step has an input and an one or more outputs. It gets the information flow in input from the immediate preceding step, processes it and outputs, as a result, a new set of informations that will flow into the immediate next step. The output flow produced by the step can have a layout of fields, in terms of number and data types, that may differ from the flow in input. A set of steps chained together to built a specific task is called a transformation. So transformations = elementary tasks, sort of little reusable components that makes actions. A process is built coordinating a set of orchestrated tasks that can be executed in sequence or in parallel. This role of orchestrator in Kettle is filled by the job. The job orchestrates the execution of a set of transformations to build our complete ETL process. A job is made by a set of steps too but their intended scope is to help in orcherstrating the executions of the tasks (transformations) in our process. As you can see, we have a job steps palette but it contains only steps to check conditions or prepare the execution environment. The real work is made by steps contained in the transformations.
In an our ETL processes made with kettle we always have a main job, also called root job, that we start to orchestrate the execution of nested jobs or transactions. We can nest as many levels of josb and transformations we want below that main job.
How Kettle starts nested Jobs or Transactions
The first way of starting jobs or transformations in Kettle is the stupid way. Chain them together using the Start transaction or Start job steps and, when the nested transaction or job steps will be reached in the owner job flow they will be started in sequence or in parallel, it depends on how they are connected. But sometimes we would like to execute a transformation or a jobs once for each line in the input flow. To do that is really simple. Go to the step configuration dialog , select the Advanced tab and check Execute for every single row. We see an example of that below in the Start transaction configuration dialog. You'll find the same setting in the Start job configuration dialog.
The job step Add filenames to result and why it isn't good for us
So far so good. Well, go back to our requirements now. Because we said that we have 2 groups of transformations, transf_group1 and transf_group2, it is clear that we will have two jobs one that chains all the transformations of transf_group1 and the second all the transformations for transf_group2. We call them respectively jobs1 and jobs2. So we will have:
a) A root job chains together 2 jobs job1 and job2.
b) Each job chains all the transformations of the respective group.
c) Because the two job encloses the group of transformations we are sure that the second group of transformations will be executed after the first group, as a whole, will be executed.
Looking at what explained above regarding the way to start a transformation in a job, to start the two jobs once per file we need step that reads the list of files from a specified directory, fills the result with the set of complete filenames so that it can be used to start our job once for file in the result. Because we talked about two different filegroups we need two of steps like this chained before the respective job. We look into the job steps palette and we found a step that could be fine for us the Add filenames to result. The picture below depict a possible flow for our root job.
So what to do??
The solution is make a transformation whose only goal is to get the file list and use that list to populate a result list as shown in the picture below.
You can see the two transformations that gets the file lists before the two jobs. Using this approach the result file list that comes from the transformation fills the row result and the job can be executed once per file that is present in our directory. Remember check the magic flag Execute for every single row in the Start job step configuration as detailed above to correctly activate the jobs once per file as detailed in the paragraph above.
How to execute the provided sample
To execute my sample unzip the file in whatever directory. Edit the get file list transformations and change the Get file names step configuration according to a directory and files pattern that exists on your pc. Now, if you start the root job, you can go through the log and clearly see the messages that indicates the job is behaving as expected.
Download from here the sample
Recently I had the need to connect Pentaho to MS Active Directory for user authentication/authorization. Immediately I asked myself how to c...
How many of us ever had the need to process a set of files through Kettle executing jobs or transactions on a per file base. Let me make an ...
Today I had the opportunity to design some complex reports using the cubes published in my customer's Pentaho BI Server as datasources f...
- ▼ August (3)