
Monday, June 10, 2013

Print Pentaho Reports from PDI efficiently and without pain!

These days I'm involved in a project where we need to print reports directly from our PDI processes, so the Pentaho Reporting Output step looked like a good option to do that. The idea was to find a way to avoid writing the JDBC connection information into the definition of our Pentaho report connections. To do this, we decided to define the report connections using JNDI datasources.

Pentaho Reporting and JNDI database connections

PRD, like any tool in the Pentaho family, supports creating database connections through JNDI datasources. This is a cool feature! Creating Pentaho report database connections using JNDI datasources is a good approach for the following reasons:

  • We avoid writing connection information into the report, making it more easily adaptable to environment changes.
  • We can use a common server-side datasource, which greatly simplifies the deployment of our reports to the Pentaho BI server.
  • A datasource is usually backed by a database connection pool, which helps us use and streamline server-side resources efficiently.

Typically a JNDI datasource is created by an application server or a servlet engine using an appropriate set of connection parameters that differ depending on the particular database we are going to connect to. As soon as the server-side middleware creates the datasource, it is registered under a name chosen at design time by the developer and made available under that name to any application that needs it. It is always a good rule of thumb, as soon as we have a datasource, to use it to let our application connect to our database. The Pentaho BI server has a section in the Pentaho Administration Console (PAC) that lets you create all the datasource connections you need, and Pentaho Reporting, like every other tool of the Pentaho suite, makes extensive use of the JNDI datasources defined under PAC to connect to the SQL databases. But now, as developers using PRD (Pentaho Report Designer) to develop our reports, the problem is really simple: how can we develop our reports using Pentaho Reporting JDBC connections based on JNDI datasources if we don't have any datasource created on our client (remember that the datasource lives on the server)?

A first approach to using JNDI datasources in PRD

The first idea, unfortunately not a good option but the most obvious one, could be as follows:

  • Develop and test the report by defining Pentaho Reporting JDBC connections based on native JDBC connections. For this particular connection type, the connection information is saved into the Pentaho report metadata definition, making it very difficult to maintain and change later.
  • As soon as the report is ready to be deployed to the Pentaho BI Server, edit the Pentaho Reporting JDBC connection definition and change it to use a JNDI datasource. Basically, this kind of JDBC connection requires that you type the name of a datasource that is made available on the server you are going to deploy to. Using JNDI JDBC datasources has the major plus of keeping the connection information in a single place: the datasource definition in the application server or servlet engine configuration. So if something changes in our connection information, the impact on our big set of reports is really low. That's a good thing.

At first this approach seems like a good idea, but what's wrong with it? For sure it's a bit elaborate, but the worst thing is that you can forget to make the suggested change right before deploying to your Pentaho BI Server. I would forget for sure... So what can we do to make things as easy as possible?

Use SimpleJNDI to emulate datasources on the client

Pentaho Reporting, like any other Pentaho tool, integrates a very cool library called SimpleJNDI. This library, as stated on the library website, "is intended to solve two problems. The first is that of finding a container independent way of opening a database connection, the second is to find a good way of specifying application configurations". It is entirely library based, so no server instances are started, and it sits upon Java .properties files, XML files or Windows-style .ini files, so it is easy to use and simple to understand. The files may be either on the file system or in the classpath.

In the case of PRD we have a file called default.properties, typically located in the directory <user_home>/.pentaho/simple-jndi, which contains the datasource definitions. Each datasource is defined by a set of 5 entries, and each entry is a key-value pair whose key follows this syntax:

<datasource name>/<attribute>

That said, let's have a look at the example below:

SampleData/type=javax.sql.DataSource
SampleData/driver=org.hsqldb.jdbcDriver
SampleData/user=pentaho_user
SampleData/password=password
SampleData/url=jdbc:hsqldb:mem:SampleData

This example represents the definition of a datasource named SampleData; it has 5 attributes as detailed below:

  • type: the type of JNDI resource we're going to define. Basically it is set to the Java interface that represents a datasource.
  • driver: the class name of the database JDBC driver; it changes depending on the database we're going to connect to.
  • user: the username used for the connection.
  • password: the user's password used for the connection.
  • url: the JDBC connection URL used to open the connection to the database. Again, this value depends on the database we're going to connect to.
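
To give an idea of how the driver and url attributes change with the target database, here is a sketch of what an equivalent entry could look like for a hypothetical PostgreSQL datasource (the datasource name and the connection values are just examples):

WarehouseDWH/type=javax.sql.DataSource
WarehouseDWH/driver=org.postgresql.Driver
WarehouseDWH/user=dwh_user
WarehouseDWH/password=dwh_password
WarehouseDWH/url=jdbc:postgresql://localhost:5432/dwh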

As soon as the Reporting Engine needs to fill the report template with data and therefore requires a database connection, the library creates the datasource and the resulting connection is made available to the report.
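
Under the hood, the lookup performed through simple-jndi is plain JNDI code. The following standalone sketch shows roughly what happens, assuming the SampleData entry shown above and the default simple-jndi location (the class name is mine, and the JDBC driver of the target database must be on the classpath):

import java.sql.Connection;
import javax.naming.Context;
import javax.naming.InitialContext;
import javax.sql.DataSource;

public class SimpleJndiLookupSketch {
  public static void main(String[] args) throws Exception {
    // These are the same simple-jndi system properties discussed later in this post;
    // the Pentaho tools normally set them for you, here we set them by hand just for the sketch.
    System.setProperty("java.naming.factory.initial", "org.osjava.sj.SimpleContextFactory");
    System.setProperty("org.osjava.sj.root", System.getProperty("user.home") + "/.pentaho/simple-jndi");
    System.setProperty("org.osjava.sj.delimiter", "/");

    // Look up the datasource by the name used as the key prefix in default.properties.
    Context ctx = new InitialContext();
    DataSource ds = (DataSource) ctx.lookup("SampleData");

    // Open a connection exactly as the reporting engine would.
    Connection con = ds.getConnection();
    System.out.println("Connected to " + con.getMetaData().getURL());
    con.close();
  }
}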

The way to go to develop Pentaho reports using JNDI datasources 

That said, the way to go to properly define a Pentaho JDBC connection that uses a JNDI datasource is detailed below:

  1. Open the default.properties file located under <user_home>/.pentaho/simple-jndi.
  2. Copy and paste the definition of an already defined datasource to simplify the new datasource definition.
  3. Change the datasource name in the copied rows to the name of the new datasource we're going to define. Remember to give the datasource the same name as the datasource defined on the Pentaho BI server that the report will use once deployed.
  4. Change the values of the attributes and update them with the connection parameters of the new datasource.
  5. Save and close the file.
  6. Open Pentaho Report Designer and start a new report.
  7. Define a new Pentaho JDBC datasource connection and give it a proper name.
  8. Add a new JDBC connection in the Pentaho JDBC datasource configuration.
  9. Give the connection a name, select the database connection type and JNDI as the access type.
  10. As soon as the JNDI access type is selected, set the value of the JNDI name field to the name of the datasource you just configured in the default.properties file (as detailed in points 1 to 5 above).
  11. Press the Test button to verify that the connection through the newly defined datasource works.
  12. Press OK to close the database connection definition dialog.


At this point your JNDI connection is defined and uses a fake datasource that has the same name as the datasource on the server. So you're not required to remember to make any change before deploying the report because everything is already done. Cool!

Print Pentaho reports from a PDI transformation

PDI (Pentaho Data Integration) has an output step called Pentaho Reporting Output that you can use to print a Pentaho report from inside a transformation. Basically it prints your report once you configure the complete template filename, the path to the output directory, the report output format (PDF, Excel, HTML, etc.) and, optionally, the report parameters.


That said, it seems very simple for a PDI developer to print a Pentaho report, and obviously it is: take data from your sources, manipulate it in your transformation and print your report using this cool output step. But there are some things anyone must know to make it work properly in every case. Because PDI doesn't inject connection information into Pentaho Reporting, a good way to keep the Pentaho Reporting JDBC connection details out of the report metadata is to use JNDI datasources, so that all that stuff is externally defined and easily maintainable. Again, the simple-jndi library comes to our aid in dealing with this.

PDI doesn't inject a database connection into our report

PDI doesn't inject a database connection into the report, so the report uses the connection information defined in its own Pentaho Reporting datasource connections. At this point, the best option we have to externalize the report database connection information is to use the JNDI capability of Pentaho Reporting through a locally defined datasource connection (as detailed above). In this case the default.properties file containing the Pentaho Reporting datasource connection information works as an additional external configuration file of our PDI process, and you can distribute it with your process configuration files. As soon as your PDI transformation starts the print output, Pentaho Reporting uses the simple-jndi library to create the JNDI JDBC datasource connection and makes that connection available to your report. When we talked about using simple-jndi to support the development of reports with JNDI datasources in Pentaho Report Designer, I said that PRD looks for the file in a specific location on your filesystem. My assumption at this point was that, even when we use this mechanism to print a report from inside PDI, the reporting subsystem would know where the default.properties file is located. Unfortunately I was wrong: Pentaho Reporting is unable to locate our default.properties file, so the JNDI datasource definitions referenced in our report cannot be resolved. Let's analyze the standard use cases: Pentaho Report Designer (the tool) gets the location of that file from its own configuration, and when we deploy the report to the Pentaho BI server the JNDI name of the datasource is resolved by the server. But in our new use case, PDI printing a Pentaho report that uses JNDI definitions, things are totally different. So how can we deal with this and have our report printed from the PDI step without pain?

Report with JNDI JDBC datasource connections printed by PDI: what happens?

As soon as the print is started by PDI, the Pentaho Reporting subsystem resolves the path to default.properties relative to the directory from which the script that starts the PDI process (Spoon, Kitchen or Pan) is launched. Let me try to explain. If your ETL process files are located in the /tmp/foo/etlprocess directory and you start your ETL process with Kitchen from that directory, using a relative path to the job file as in this example:

$ cd /tmp/foo/etlprocess
$ kitchen.sh -file:./my_print_job.kjb

PDI looks for a simple-jndi directory relative to /tmp/foo/etlprocess, so it looks for it in /tmp/foo/etlprocess/simple-jndi. But what happens if you start Kitchen from a totally different directory, say /tmp/foo1, and launch the job my_print_job.kjb located in /tmp/foo/etlprocess?
$ cd /tmp/foo1
$ kitchen.sh -file:/tmp/foo/etlprocess/my_print_job.kjb

In this case, PDI looks for a simple-jndi directory relative to /tmp/foo1, so it looks for it in /tmp/foo1/simple-jndi. Because you don't know how your final user will start your job, this is a complete mess! But don't be afraid, there's a solution. The best idea to solve this is:

  • Have a configuration directory local to your ETL process files, which already contains other configuration items for your process and which you distribute as part of your package.
  • Put your default.properties file inside that directory.
  • Have a way to tell PDI where that simple-jndi configuration file is located.

This elegantly solves your problem.
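
Just to fix ideas, the distributed package could be laid out like this (all file names except default.properties are purely illustrative):

/tmp/foo/etlprocess/my_print_job.kjb                    the job that prints the report
/tmp/foo/etlprocess/my_report.prpt                      the report template used by the Pentaho Reporting Output step
/tmp/foo/etlprocess/config/default.properties           the simple-jndi datasource definitions shipped with the process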

How to specify simple-jndi information to have your report fully working

There is a way to give the simple-jndi library all the information needed to elegantly solve our issue: a set of Java system properties that we pass to PDI through an environment variable. The source of information for this is the documentation you find in the library's binary download, which you can get from here.

Basically, to fix our issue we need to redefine the standard value of the PENTAHO_DI_JAVA_OPTIONS environment variable as follows:

PENTAHO_DI_JAVA_OPTIONS="-Xmx512m -XX:MaxPermSize=256m -XX:-UseGCOverheadLimit -Djava.naming.factory.initial=org.osjava.sj.SimpleContextFactory -Dorg.osjava.sj.root=/tmp/foo/etlprocess/config -Dorg.osjava.sj.delimiter=/"

As you can see we have the standard memory settings plus three new parameters:

  • -Djava.naming.factory.initial: this first parameter sets the fully qualified name of the initial context factory for SimpleContexts; this is an internal simple-jndi object.
  • -Dorg.osjava.sj.root: this parameter sets the complete path to the directory containing the default.properties file with our datasource definitions. Following our previous example, we're telling simple-jndi to look for that file in /tmp/foo/etlprocess/config, where config is our process configuration directory.
  • -Dorg.osjava.sj.delimiter: this third parameter sets the delimiter used to separate elements in a lookup value. This allows the code to get closer to pretending to be another JNDI implementation, such as DNS or LDAP. In our case we need to use the / character.


Feel free to choose the mechanism you prefer to set this environment variable: you can either put the variable definition in your user profile file or write a small script that sets it and then calls the PDI scripts, as in the sketch below.
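
As an example, a minimal wrapper script could look like the sketch below; the PDI installation path is an assumption, while the rest reuses the values from the example above:

#!/bin/sh
# Point simple-jndi at the configuration directory shipped with the ETL process,
# then delegate to Kitchen, which picks up PENTAHO_DI_JAVA_OPTIONS from the environment.
export PENTAHO_DI_JAVA_OPTIONS="-Xmx512m -XX:MaxPermSize=256m -XX:-UseGCOverheadLimit -Djava.naming.factory.initial=org.osjava.sj.SimpleContextFactory -Dorg.osjava.sj.root=/tmp/foo/etlprocess/config -Dorg.osjava.sj.delimiter=/"

/opt/pdi/kitchen.sh -file:/tmp/foo/etlprocess/my_print_job.kjb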

That's all!

So for now that's all! You have all the information required to have your JNDI based reports fully working even when launched through PDI. This is a good way to fully separate connection information from the report and keep it together with your ETL configuration files. This will give you a clean distribution for your customer and, of course, the easier and cleaner things are organized, the happier they will be!

Stay tuned for the next post that will come later on. I have interesting tips to share with you about using a Java profiler to investigate memory usage problems in your ETL processes so that you get to the right solution quickly. Have fun and see you later.


Monday, April 22, 2013

Pretty print XML and JSON files with PDI (2/2)

Today I'm back with the second part of my recipe about how to pretty print XML and JSON files with PDI. You can find the first part here.

To pretty print any JSON stream, a good starting point is the GSON library. It is a nice component that lets you:

  1. serialize JSON streams starting from a set of Java objects or 
  2. convert a JSON stream into an equivalent set of Java objects

The setup

To prepare PDI to run this example you must:

  1. Download the GSON library from the following link. In my case I've downloaded version 2.2.3, but the same steps apply to other versions of the library.
  2. Unzip the file into a temporary directory
  3. Copy the gson-2.2.3.jar file to the <PDI_HOME>/libext directory
  4. Restart PDI 

The how-to

First of all, I needed an ugly JSON sample stream to format. To do this I built a new transformation reusing the input files of the sample multilayer XML file transformation to obtain a simple JSON stream. The interesting part is at the very end of this transformation: again, you have a User Defined Java Class step that contains all the code that does the dirty job for you.

1:  import com.google.gson.Gson;  
2:  import com.google.gson.GsonBuilder;  
3:  import com.google.gson.JsonParser;  
4:  import com.google.gson.JsonElement;  
5:  String jsonOutputField;  
6:  String jsonPPField;  
7:  public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException  
8:  {  
9:    // First, get a row from the default input hop  
10:       //  
11:       Object[] r = getRow();  
12:    // If the row object is null, we are done processing.  
13:       //  
14:       if (r == null) {  
15:            setOutputDone();  
16:            return false;  
17:       }  
18:       // Let's look up parameters only once for performance reason.  
19:       //  
20:       if (first) {  
21:            jsonOutputField = getParameter("JSONOUTPUT_FIELD");  
22:            jsonPPField = getParameter("JSONPP_FIELD");  
23:         first=false;  
24:       }  
25:    // It is always safest to call createOutputRow() to ensure that your output row's Object[] is large  
26:    // enough to handle any new fields you are creating in this step.  
27:       //  
28:    Object[] outputRow = createOutputRow(r, data.outputRowMeta.size());  
29:       logBasic("Input row size: " + r.length);  
30:       logBasic("Output row size: " + data.outputRowMeta.size());  
31:       String jsonOutput = get(Fields.In, jsonOutputField).getString(r);  
32:       Gson gson = new GsonBuilder().setPrettyPrinting().create();  
33:       JsonParser jp = new JsonParser();  
34:       JsonElement je = jp.parse(jsonOutput);  
35:       String jsonpp = gson.toJson(je);  
36:       // Set the value in the output field  
37:       //    
38:       get(Fields.Out, jsonPPField).setValue(outputRow, jsonpp);  
39:    // putRow will send the row on to the default output hop.  
40:       //  
41:    putRow(data.outputRowMeta, outputRow);  
42:       return true;  
43:  }  

This time the interesting code is between lines 32 and 35:

  1. The Gson object is created enabling the pretty printing.
  2. The incoming JSON stream is parsed into a JsonElement (lines 33-34)
  3. A new, pretty printed JSON stream is built and used to fill the jsonpp output field (line 35)

Next, the pretty printed JSON stream is saved to a .js file and that's all.

You can download the sample transformation from this link. I hope you enjoyed this two-part article and that it will be useful to you.

Friday, April 19, 2013

Pretty print XML and JSON files with PDI (1/2)

These days I got into some silly trouble with PDI. A customer asked me to build a process that produces an XML file as a result; the result I got was horrible from a presentation standpoint: the file wasn't properly pretty printed and that made it very difficult to read. You will encounter this problem anytime you produce XML files as output, and the same thing happens anytime you want JSON output. So this time I decided to take care of it by looking for a quick and simple solution to beautify my output; this is what I'm going to talk about in this post.

Pretty Printing XML files How-To

So what to do when I need to pretty print XML files? Well, that's very simple. PDI ships with dom4j, a very well known library that lets you easily work with XML, XPath and XSLT. That library has a class in its API, OutputFormat, which represents the format configuration used to write the XML output. There you find a method called createPrettyPrint(); this is all we need to reach our goal very effectively. But how can we apply all of this to the XML we're going to produce in our transformation? It's very easy to do.

Let's start from one of the samples shipped with PDI, the multilayer XML files sample. You can find it in the samples/transformations directory in your PDI home directory. Take the original file and make a copy so that we can start working on it.

Now, between the Text Output step and the 5th XML Join step (named "XML Join Step 5") I added two new steps, a Select values step and a User Defined Java Class step, as you can see in the detail shown in the figure below.

While the Select values step selects only the xmlOutput field, which contains the raw XML that needs to be beautified, the User Defined Java Class step is the place where the dirty job is done. To properly configure the User Defined Java Class step, which I renamed "Pretty Print Out XML", we need to:


  1. Under the Fields tab below, define a new field named xmlpp of type String that will contain the pretty printed XML stream
  2. Select the tab Parameters 
  3. Define a new parameter XMLOUTPUT_FIELD whose value is set to xmlOutput 
  4. Define a new parameter XMLPP_FIELD whose value is set to xmlpp
  5. Copy and paste the code below into the code area of the step properties dialog

1:  import java.io.StringWriter;  
2:  import org.dom4j.io.OutputFormat;  
3:  import org.dom4j.Document;  
4:  import org.dom4j.DocumentHelper;  
5:  import org.dom4j.io.XMLWriter;  
6:  String xmlOutputField;  
7:  String xmlPPField;  
8:  public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException  
9:  {  
10:    // First, get a row from the default input hop  
11:       //  
12:       Object[] r = getRow();  
13:    // If the row object is null, we are done processing.  
14:       //  
15:       if (r == null) {  
16:            setOutputDone();  
17:            return false;  
18:       }  
19:       // Let's look up parameters only once for performance reason.  
20:       //  
21:       if (first) {  
22:            xmlOutputField = getParameter("XMLOUTPUT_FIELD");  
23:            xmlPPField = getParameter("XMLPP_FIELD");  
24:         first=false;  
25:       }  
26:    // It is always safest to call createOutputRow() to ensure that your output row's Object[] is large  
27:    // enough to handle any new fields you are creating in this step.  
28:       //  
29:    Object[] outputRow = createOutputRow(r, data.outputRowMeta.size());  
30:       logBasic("Input row size: " + r.length);  
31:       logBasic("Output row size: " + data.outputRowMeta.size());  
32:       String xmlOutput = get(Fields.In, xmlOutputField).getString(r);  
33:    StringWriter sw;  
34:    try {  
35:      OutputFormat format = OutputFormat.createPrettyPrint();  
36:      Document document = DocumentHelper.parseText(xmlOutput);  
37:      sw = new StringWriter();  
38:      XMLWriter writer = new XMLWriter(sw, format);  
39:      writer.write(document);  
40:    }  
41:    catch (Exception e) {  
42:      throw new RuntimeException("Error pretty printing xml:\n" + xmlOutputField, e);  
43:    }  
44:    String xmlpp = sw.toString();  
45:       // Set the value in the output field  
46:       //  
47:       get(Fields.Out, xmlPPField).setValue(outputRow, xmlpp);  
48:    // putRow will send the row on to the default output hop.  
49:       //  
50:    putRow(data.outputRowMeta, outputRow);  
51:       return true;  
52:  }  

The code we're interested in is between lines 32 and 44.


  1. The incoming XML is read from the xmlOutput row field and put into the xmlOutput variable
  2. We create an OutputFormat object with pretty printing enabled by using the OutputFormat.createPrettyPrint() function (see line 35).
  3. The XML document is parsed and then written to a StringWriter (see lines 36 - 39)
  4. The pretty printed XML obtained this way is put into the xmlpp variable and then into the new field created in the output row (see lines 44 - 47)

The resulting XML stream is then saved to disk by means of the Text Output step.
We tested this by pretty printing XML files of quite significant size with good performance, so I strongly suggest you use this method. We're also planning to build a custom step that does this in a smoother way, which will be released to the SeraSoft public GitHub repository soon.

You can download the sample by clicking here. So for now we're done; next time I'll show you how to beautify a JSON file. Stay tuned!

Thursday, August 26, 2010

Processing group of files executing a job once per file

How many of us have ever needed to process a set of files through Kettle, executing jobs or transformations on a per-file basis? Let me make an example to illustrate a particular use case.

Suppose we have two groups of files that we call filegroup1 and filegroup2, and suppose we have 2 groups of transformations that we call transf_group1 and transf_group2. The requirement is: we want to execute the transformations in transf_group1 once for each file in filegroup1 and, as soon as the processing of this group finishes as a whole, we want to start the execution of the transformations in transf_group2 once for each file in filegroup2. Let me analyse how we can do that.


A little about some main Kettle topics

Kettle processes information that flows through a path made of steps. The user takes the needed steps from a palette, drags them into the client area and builds a workflow. That flow is made up of different types of steps: we have input/output steps, transformation steps, script steps and so on. Each step has an input and one or more outputs. It gets the information flow in input from the immediately preceding step, processes it and outputs, as a result, a new set of information that flows into the immediately following step. The output flow produced by a step can have a layout of fields, in terms of number and data types, that differs from the input flow. A set of steps chained together to build a specific task is called a transformation. So transformations = elementary tasks, a sort of little reusable components that perform actions. A process is built by coordinating a set of orchestrated tasks that can be executed in sequence or in parallel. This role of orchestrator in Kettle is filled by the job. The job orchestrates the execution of a set of transformations to build our complete ETL process. A job is made of a set of steps too, but their intended scope is to help in orchestrating the execution of the tasks (transformations) in our process. As you can see, we have a job steps palette but it contains only steps to check conditions or prepare the execution environment. The real work is done by the steps contained in the transformations.

In our ETL processes made with Kettle we always have a main job, also called the root job, that we start to orchestrate the execution of nested jobs or transformations. We can nest as many levels of jobs and transformations as we want below that main job.

How Kettle starts nested jobs or transformations

The first way of starting jobs or transformations in Kettle is the naive one: chain them together using the Transformation or Job steps and, when the nested transformation or job step is reached in the owner job flow, it is started in sequence or in parallel, depending on how the steps are connected. But sometimes we would like to execute a transformation or a job once for each row in the input flow. Doing that is really simple: go to the step configuration dialog, select the Advanced tab and check Execute for every single row. We see an example of that below in the transformation step configuration dialog; you'll find the same setting in the job step configuration dialog.


The job step Add filenames to result and why it isn't good for us

So far so good. Now let's go back to our requirements. Because we said that we have 2 groups of transformations, transf_group1 and transf_group2, it is clear that we will have two jobs: one that chains all the transformations of transf_group1 and a second one that chains all the transformations of transf_group2. We call them job1 and job2 respectively. So we will have:

a) A root job that chains together the 2 jobs, job1 and job2.
b) Each job chains all the transformations of the respective group.
c) Because the two jobs enclose the groups of transformations, we are sure that the second group of transformations will be executed only after the first group has been executed as a whole.

Looking at what was explained above regarding the way to start a transformation in a job, to start the two jobs once per file we need a step that reads the list of files from a specified directory and fills the result with the set of complete filenames, so that it can be used to start our job once per file in the result. Because we talked about two different filegroups, we need two steps like this, each chained before the respective job. We look into the job steps palette and find a step that could work for us: Add filenames to result. The picture below depicts a possible flow for our root job.

We're starting to smile, but unfortunately this solution is not applicable because it doesn't work. To understand why, we need to understand the difference between the row result and the file result in Kettle. Typically a file result is a set of filenames that can be used only by steps that are able to manage attachments; the Mail step is the one step that can manage such a result. Row results instead are made of real data, typically the output of a transformation. If you look at the Kettle internals you can notice that a job step manages these two datasets as two completely separate collections. The important thing to note here is that whenever you check Execute for every single row in your job/transformation configuration you're saying that you'll start your job/transformation for each row of your row result. So why isn't our solution good for us? Because our Add filenames to result steps fill a file result, so our jobs will never start.

So what to do??

The solution is to make a transformation whose only goal is to get the file list and use that list to populate the row result, as shown in the picture below.

You need to call that transformation through a Transformation step in our root job, chaining it before job1 and job2 to get filegroup1 and filegroup2 respectively. Here is the complete layout of our definitive root job.


You can see the two transformations that get the file lists before the two jobs. Using this approach, the file list that comes from the transformation fills the row result and the job can be executed once per file present in our directory. Remember to check the magic flag Execute for every single row in the job step configuration, as detailed above, to correctly activate the jobs once per file.

How to execute the provided sample

To execute my sample, unzip the file into any directory. Edit the get file list transformations and change the Get file names step configuration to a directory and file pattern that exist on your PC. Now, if you start the root job, you can go through the log and clearly see the messages indicating that the job is behaving as expected.

Download the sample from here.

Wednesday, August 25, 2010

Dealing with Kettle DBCache

I'm back from 4 weeks of summer holidays... a first in my life, but it was beautiful to spend a lot of time with my family. I need to take some time for them more often.

Anyway, today I was doing my usual work with Kettle when something strange happened that drove me crazy. I added a new field to my table and then went to Kettle to update the field layout in my Database lookup step. When I tried to find my new field in the fields list I was surprised... no new field appeared in the fields list for my table. I despaired because I thought it had magically disappeared from my table, but after a quick check with my database client I saw my field was there in my table and I was happy again. So what happened?

Cache database metadata with DBCache

DBCache is a cache where recurring query results (database metadata information, for example) are saved the first time you access them. That information is persisted in a file called db.cache located in the .kettle directory below your home directory. The information is saved in row format, so you can have a look inside the file or edit its content.

Every time, for example, you go through the Kettle database explorer looking at a table layout, the first time you access the table metadata the results are cached through DBCache, so that the second time you go through the cache, saving an access to the database. But if you forget that and update your table DDL in any way (like me, after a veeeery long period of holidays) you may be surprised to see that your updates don't seem to be picked up by Kettle.

How can we clear the cache and have our table metadata updated the next time we need to access it? You can go to the Tools menu and choose Database > Clear cache, and a fresh set of table metadata will be read from the database.

Wednesday, July 21, 2010

Setup Kettle debugging in 2 minutes

Sometimes it can be very interesting to be able to debug something to better understand how it works from an internal standpoint. My experience says that this is the case with the majority of open source projects: sooner or later it will happen.

Today it happened with Kettle, while trying to understand whether my assumptions were correct.

So how can we easily debug Kettle? The answer is very simple: use remote debugging. I'll explain everything in a minute. The development tool I'm using is IntelliJ IDEA, but it is fairly simple to set everything up with Eclipse or NetBeans.

Setup Kettle to enable remote debugging

1) If you're on Windows, open Spoon.bat and add the following line

 set JAVA_OPTS=-Xdebug -Xrunjdwp:transport=dt_socket,address=8000,server=y,suspend=n  

The equivalent on Linux is to open Spoon.sh and add the following line

 export JAVA_OPTS="-Xdebug -Xrunjdwp:transport=dt_socket,address=8000,server=y,suspend=n"  

I've used port 8000 as the socket port the debugger attaches to, but feel free to use the one you prefer.

2) Go to the last line of Spoon.bat or Spoon.sh and update it as follows

Linux (Spoon.sh):

 $JAVA_BIN $JAVA_OPTS $OPT $STARTUP -lib $LIBPATH "${1+$@}"  

Windows (Spoon.bat):

 start javaw %JAVA_OPTS% %OPT% -jar launcher\launcher.jar -lib %LIBSPATH% %_cmdline%  

Configure IntelliJ IDE for remote debugging

1) Configure a project for Kettle source tree
2) Choose Build > Edit configurations from the menu
3) The Run/Debug Configurations dialog opens. Choose Add New Configuration (the + sign in the upper left corner) and select Remote from the Add New Configuration list


4) The Remote Configuration dialog opens


Configure the Transport and Debugger mode choices as displayed in the image above. Then set the Host address (localhost in this case, or whatever you need) and the port (8000 in my case) and press OK.

And now let's dance...

So now start Kettle. As soon as Kettle has started, load your job or transformation and, before running it, launch the debugger from your IDE. As soon as the debugger has attached you're ready to launch your job/transformation and begin your investigation. The advantage of this approach is the ability to also debug remotely running Kettle instances, which is sometimes useful to help investigate problems.

Have fun playing with your Kettle and see you in the next article.