Monday, June 10, 2013

Print Pentaho Reports from PDI efficiently and without pain!

2 comments
These days I'm involved in a project where we need to print reports directly from our PDI processes. So what a good option to use the PRD output step to do that? The idea was to find out a way for not writing the JDBC connection informations in the definition of our Pentaho reports connections. To do this, we decided to go for report connections defined by using JNDI datasources.

Pentaho Reporting and JNDI database connections

PRD, as any of the tools in the Pentaho family, supports the creation of a database connection using JNDI datasources. This is a cool feature! Creating Pentaho reports database connections using JNDI datasource is a good approach for the following reasons

  • We are able to not write connection information in the report making the it more easily adaptable to the system environment changes.
  • We are able to use a server side common datasource, greatly simplify the deployment of our reports to the Pentaho BI server.
  • Usually a datasource is directly related to the concept of database connection pooling that helps us to efficiently use and streamline server side resources.

Tipically a JNDI datasource is created by an application server or a servlet engine using an appropriate set of connection parameters that differs depending on the particular database we are going to connect to. As soon as the server side middleware creates the datasource, it is flagged with a name decided at design time by the developer and under that name is made available to any application that requires to use it. It is always a good rule of thumb to use a datasource, as soon as we have one, to let our application connect to our database. Pentaho BI server has a section in the Pentaho Administration Console (PAC) to let you create all the datasource connections you need. Pentaho Reporting and any other tool of the Pentaho suite makes extensive use of the JNDI datasources defined under PAC to connect to all the SQL databases. But now, as developers using PRD (Pentaho Report Designer) to develop our reports, the problem is really simple: how can we develop our reports using Pentaho Reporting JDBC Connections based on JNDI JDBC datasources if we don't have any datasource created on our client (remember that the datasource lives on the server).

First approach about using JNDI datasources in PRD

The first idea, not a good option unfortunately but the first we can find out, could be as follow:

  • Develop and test the report by defining Pentaho Reporting JDBC Connections based on Native JDBC connections. For this particular Pentaho Reporting JDBC Connection type, the connection information are saved into the Pentaho Report metadata definition making it very difficult to be maintained and eventually changed.
  • As soon as the report is ready to be deployed to the Pentaho BI Server, edit the Pentaho Reporting JDBC connection definition and change it to use a JNDI JDBC datasource. Basically, this kind of JDBC connection requires that you type the name of a datasource that is made available on the server you are going to deploy. Using JNDI JDBC datasource has the major plus in having a single place were the connections information are kept: in the datasource definition on the application server or servlet engine configuration. So if something changes in our connection information the impact on our big set of reports is really low. That's a good thing.

This approach, at first, seems a good idea but what's wrong with this? For sure it's a bit elaborated but the worst thing is that you can forget to make the change suggested immediately before to deploy to your Pentaho BI Server. I forget for sure....  So what can we do to make the things the easiest the possible?

Use SimpleJNDI to emulate datasources on the client

Pentaho Reporting, and the same is for any other Pentaho tool, integrates a very cool library called SimpleJNDI. This library, as stated in the library website "is intended to solve two problems. The first is that of finding a container independent way of opening a database connection, the second is to find a good way of specifying application configurations". It is entirely library based, so no server instances are started, and it sits upon Java .properties files, XML files or Windows-style .ini files, so it is easy to use and simple to understand. The files may be either on the file system or in the classpath.

In case of PRD we have a file called default.properties typically located in the directory <user_home>/.pentaho/simple-jndi. That file contains the datasources definitions. Any datasource is defined by using a set of by 5 entries and any entry is made up by a key-value pair where the key is defined with the following syntax

<datasource name>/<attribute>

That said let's have a look at the example below:

SampleData/type=javax.sql.DataSource
SampleData/driver=org.hsqldb.jdbcDriver
SampleData/user=pentaho_user
SampleData/password=password
SampleData/url=jdbc:hsqldb:mem:SampleData

This example represents the definition of a datasource named SampleData; it has 5 attributes as detailed below:

  • type: it represents the type of JNDI resource we're going to define. Basically it's valued to the Java interface that represents a datasource.
  • driver: it is the name of the database JDBC driver and changes depending on the db we're going to connect to
  • user: the username used on the connection
  • password: the user password used for the connection
  • url: the jdbc connection url used to open the connection to the database. Again, this value depends on the database we're going to connect to.

As soon as the Reporting Engine is going to fill the report template with the data and so a database connection is require the library crease the datasource and the connection is shared with Pentaho Report designer that

The way to go to develop Pentaho reports using JNDI datasources 

That said to properly define a Pentaho JDBC Connection that uses JDBC JNDI datasources, definitely, the way to go is to do it as detailed below

  1. Open the default.properties file located under <user_home>/.pentaho/simple-jndi.
  2. Copy and paste the definition of an already defined datasource to simplify the new datasource definition. 
  3. Change the name of the datasource in the copied rows to the name of the new datasource we're going ìto define. Remember to call the datasource with the same name of the datasource defined on your Pentaho BI server that the report will use once deployed.
  4. Change the values of the attributes and update them with the connection parameters for the new datasource.
  5. Save and close the file
  6. Open the Pentaho Report Designer and start a new report. 
  7. Define a new Pentaho JDBC Datasource connection. Give it a proper name
  8. Add a new JDBC connection in the Pentaho JDBC datasource configuration.
  9. Give the connection a name, select the database connection type and JNDI as access type.
  10. As soon as the JNDI access type is selected set the value of the JNDI name field to the name of the datasource you just configured in the default.properties file (as detailed from point 1 to 5 above).
  11. Press the Test button to verify that the connection through the newly defined datasource works. 
  12. Press OK to close the database connection defininition dialog.


At this point your JNDI connection is defined and uses a fake datasource that has the same name as the datasource on the server. So you're not required to remember to make any change before deploying the report because everything is already done. Cool!

Print Pentaho reports from a PDI transformation

PDI (Pentaho Data Integration) has an output task called Pentaho Reporting Output that you can use to print a Pentaho Report from inside a transformation. Basically it prints your report by configuring the complete template filename, the path to the output directory, the report output format (PDF, Excel, HTML ecc.) and eventually the parameters.


So that said seems very simple for a PDI developer to print a Pentaho Report and obviously it is. Take data from your sources, manipulate them in your transformation and start printing your report using this cool output step. But there are some things that anyone must know to let the things works properly in any case. Because PDI doesn't inject connection information in Pentaho Reporting, a good way to have Pentaho Reporting JDBC connections details not written into the report metadata is to use JNDI JDBC datasources to have all that stuff externally defined and easily maintainable. Again, the simple-jndi library comes to our attention to help us in dealing with this.

PDI doesn't inject into our report a database connection

PDI doesn't inject a database connection into the report. So the report uses the connection information defined in Pentaho Reporting datasources connections. At this point, the better option we have to externalize the report database connection information  is to use the JNDI capability of our Pentaho Reporting by using a locally defined datasource connection (as detailed above). In this case the default.properties file containing the Pentaho Reporting datasources connection information works as a new external configuration file of our PDI process ad you can distribute it with your process configuration files.  As soon as your PDI transformation will start the print output, Pentaho Reporting uses simple-jndi library to create the JNDI JDBC datasource connection and will make that connection available to your report. When we talked about using simple-jndi to support the development process in Pentaho Report Designer of reports using JNDI datasources I said that PRD looks for the file in a specific location of your filesystem. My assumption at this point was that, as usual, even if we use this mechanism to print a report from inside PDI, the reporting subsystem knows where the default.properties file is located. But, unfortunately, I was wrong because Pentaho Reporting is unable to locate the correct location of our default.properties file so the JNDI datasources definitions made in our report are completely. Let analyze the standard use-case: Pentaho Report Designer (the tool) gets the location information for that file through a configuration read by the tool and when we deploy the report to the Pentaho BI server the JNDI name for the datasource is get from the server. But in our new use-case PDI that prints a Pentaho report that uses JNDI definitions, the things are totally different. So how can we deal with this to have our report printed from the PDI step without pain?

Report with JNDI JDBC datasource connections printed by PDI: what happens?

As soon as the print is started by PDI the the Pentaho reporting subsystem the path to default.properties is resolved locally to the script that starts the PDI process (either spoon, or kitchen or pan). Let me try to explain. It your ETL process files are located in the /tmp/foo/etlprocess directory and your start your ETL process with kitchen locally to that directory with a relative path to the job file as in this example

$ cd /tmp/foo/etlprocess
$ kitchen.sh -file:./my_print_job.kjb

PDI looks for the for a simple-jndi directory that is in the directory /tmp/foo/etlprocess so it looks for it in /tmp/foo/etlprocess/simple-jndi. But what happens if you are starting kitchen from inside totally different directory, let me say for example /tmp/foo1 and you're going to start your job my_print_job.kjb that is locate in /tmp/foo/etlprocess
$ cd /tmp/foo1
$ kitchen.sh -file:/tmp/foo/etlprocess/my_print_job.kjb

In this case, PDI looks for the for a simple-jndi directory that is in the directory /tmp/foo1 so it looks for it in /tmp/foo1/simple-jndi. Because you are unaware about how your final user will start your job this is completely a mess! But don't be afraid there's a solution. The best idea to solve this is:

  • Have a configuration directory local to your ETL process file, that already contains other configuration items for your process and that you distribute from within your package.
  • Put your default.properties file inside that directory
  • Have a way to specify to PDI where that simple-jndi configuration file is located.

This elegantly solves your problem.

How to specify simple-jndi information to have your report fully working

There is a way to specify to the simple-jndi library all the information needed to elegantly solve our issue by specifying a set of environment variables. The source of information for this is in the documentation you find in the binary download that you can get from here.

Basically to fix our issue we need redefine the standard value of the PENTAHO_DI_JAVA_OPTIONS to this value:

PENTAHO_DI_JAVA_OPTIONS="-Xmx512m -XX:MaxPermSize=256m -XX:-UseGCOverheadLimit -Djava.naming.factory.initial=org.osjava.sj.SimpleContextFactory -Dorg.osjava.sj.root=/tmp/foo/etlprocess/config -Dorg.osjava.sj.delimiter=/"

As you can see we have the standard memory settings plus three new parameters:

  • -Djava.naming.factory.initial, this first parameter sets the complete name of the initial Context Factory for SimpleContexts. This is an internal simple-jndi object
  • -Dorg.osjava.sj.root, this parameter sets the complete path to the directory that will contain the default.properties file containing our datasources definitions. Starting from our previous example we're saying to simple.jndi to locate that file in /tmp/foo/etlprocess/config where config would be nic that will be our process configuration directory.
  • -Dorg.osjava.sj.delimiter, this third parameter sets the the delimiter used to separate elements in a lookup value. This allows code to get closer to pretending to be another JNDI implementation, such as DNS or LDAP. In our case we need to use the / character.


Fill free to choose which mechanism to use to set these environment variables; you can either put the variable definition in the user profile file or you can make a script file and call the PDI scripts from there after having set these environment variables.

That's all!

So for now that's all! You have all the information required to have your JNDI based reports fully working even when launched through PDI. This is a good way to fully separate connections information from the report and have them saved together to your ETL configuration files. This will give you a clean distribution for your customer and of course, the more easy and clean the things are organized the more happy they will be!

Stay tuned for the next post that will come later on. I've interesting tips to share with you about using a java profiler to support you in investigating memory usage problems in your ETL process so that you get quickly to the solution right solution. Have fun and see you later.


Read More...

Monday, April 22, 2013

Pretty print XML and JSON files with PDI (2/2)

Leave a Comment
Today I'm back with the second part of my recipe about how two pretty print XML and JSON files with PDI. You can find the first part here.

To pretty print whatever JSON stream a good starting point is to use the GSON library. It is a nice component that lets you

  1. serialize JSON streams starting from a set of Java objects or 
  2. convert a JSON stream into an equivalent set of JAVA objects
The setup

To prepare PDI to run this example you must:

  1. Download the GSON library from the following link. In my case I've downloaded the version 2.2.3 but consider the same steps with other versions of the library.
  2. Unzip the file on a temporary directory
  3. Copy the gson-2.2.3.jar file to the <PDI_HOME>/libext directory
  4. Restart PDI 

The how-to

First of all I started by making an example to obtain an ugly JSON sample stream to format. To do this I built a new transformation by reusing the input files of the sample multilayer xml file transformation to obtain a simple JSON stream. The interesting part is at the very end of this transformation. Again you have a User Defined Java Class step that contains all the code that does the dirty job for you.

1:  import com.google.gson.Gson;  
2:  import com.google.gson.GsonBuilder;  
3:  import com.google.gson.JsonParser;  
4:  import com.google.gson.JsonElement;  
5:  String jsonOutputField;  
6:  String jsonPPField;  
7:  public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException  
8:  {  
9:    // First, get a row from the default input hop  
10:       //  
11:       Object[] r = getRow();  
12:    // If the row object is null, we are done processing.  
13:       //  
14:       if (r == null) {  
15:            setOutputDone();  
16:            return false;  
17:       }  
18:       // Let's look up parameters only once for performance reason.  
19:       //  
20:       if (first) {  
21:            jsonOutputField = getParameter("JSONOUTPUT_FIELD");  
22:            jsonPPField = getParameter("JSONPP_FIELD");  
23:         first=false;  
24:       }  
25:    // It is always safest to call createOutputRow() to ensure that your output row's Object[] is large  
26:    // enough to handle any new fields you are creating in this step.  
27:       //  
28:    Object[] outputRow = createOutputRow(r, data.outputRowMeta.size());  
29:       logBasic("Input row size: " + r.length);  
30:       logBasic("Output row size: " + data.outputRowMeta.size());  
31:       String jsonOutput = get(Fields.In, jsonOutputField).getString(r);  
32:       Gson gson = new GsonBuilder().setPrettyPrinting().create();  
33:       JsonParser jp = new JsonParser();  
34:       JsonElement je = jp.parse(jsonOutput);  
35:       String jsonpp = gson.toJson(je);  
36:       // Set the value in the output field  
37:       //    
38:       get(Fields.Out, jsonPPField).setValue(outputRow, jsonpp);  
39:    // putRow will send the row on to the default output hop.  
40:       //  
41:    putRow(data.outputRowMeta, outputRow);  
42:       return true;  
43:  }  

This time the interesting code is between lines 32 and 35:

  1. The Gson object is created enabling the pretty printing.
  2. The JSON stream that is coming in is read and parsed appropriately (lines 33-34)
  3. A new JSON stream pretty printed is built and used to fill the jsonpp rowset field (line 35)

Next the output pretty printed JSON stream is saved to a .js file and that's all.

You can download the sample transformation from this link. I hope you enjoyed this two part article and that it can be useful for you.

Read More...

Friday, April 19, 2013

Pretty print XML and JSON files with PDI (1/2)

Leave a Comment
These days I've got into a stupid trouble with PDI. A customer asked me to build a process that produces an XML file as a result; the result I get was horrible from a presentation standpoint: the file wasn't rightly pretty printed and that makes it very difficult to be read. You will encounter this problem anytime you are going to obtain XML files as output, but the same thing happens also anytime you want a JSON output. So this time I decided to take care of this by trying to get into a quick and simple solution to beautify my output; this is what I'm going to talk about in this post.

Pretty Printing XML files How-To

So what to do in case I need to pretty print XML files? Well that's very simple. PDI ships with dom4j, a very well known library that let you easily work with XML, XPATH and XSLT. That library has an object in its API model, OutputFormat, which represents the format configuration used to format the XML output. There you find a method called createPrettyPrint(); this is all the needed to reach our goal very effectively. But now how can we apply all of this to the XML that we're going to produce in our transformation? Very easy to do.

Let's start considering one of the samples shipped with PDI the multilayer XML files sample. You can find it in the samples/transformations directory in you PDI home directory. Take the original file and make a copy so that we can start working on it.

Now, in between the Text Output step and the 5th XML Join step (named "XML Join Step 5") I added two new steps a Select Value step and a User Defined Java Step as you can see in the detail shown in the figure below

While the Select values steps selects only the xmlOutput field that contains the raw xml that needs to be beautified, the User Defined Java class step is the place where the dirty job is made. To properly configure the User Defined Java Class step, that I renamed "Pretty Print Out XML", we need to


  1. Under the tab Fields below, define a new field named xpp whose type is String that will contain the produced pretty printed xml stream
  2. Select the tab Parameters 
  3. Define a new parameter XMLOUTPUT_FIELD whose value is set to xmlOutput 
  4. Define a new parameter XMLPP_FIELD whose value is set to xmlpp
  5. Copy and paste the code below in the related area in the task properties dialog

1:  import java.io.StringWriter;  
2:  import org.dom4j.io.OutputFormat;  
3:  import org.dom4j.Document;  
4:  import org.dom4j.DocumentHelper;  
5:  import org.dom4j.io.XMLWriter;  
6:  String xmlOutputField;  
7:  String xmlPPField;  
8:  public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException  
9:  {  
10:    // First, get a row from the default input hop  
11:       //  
12:       Object[] r = getRow();  
13:    // If the row object is null, we are done processing.  
14:       //  
15:       if (r == null) {  
16:            setOutputDone();  
17:            return false;  
18:       }  
19:       // Let's look up parameters only once for performance reason.  
20:       //  
21:       if (first) {  
22:            xmlOutputField = getParameter("XMLOUTPUT_FIELD");  
23:            xmlPPField = getParameter("XMLPP_FIELD");  
24:         first=false;  
25:       }  
26:    // It is always safest to call createOutputRow() to ensure that your output row's Object[] is large  
27:    // enough to handle any new fields you are creating in this step.  
28:       //  
29:    Object[] outputRow = createOutputRow(r, data.outputRowMeta.size());  
30:       logBasic("Input row size: " + r.length);  
31:       logBasic("Output row size: " + data.outputRowMeta.size());  
32:       String xmlOutput = get(Fields.In, xmlOutputField).getString(r);  
33:    StringWriter sw;  
34:    try {  
35:      OutputFormat format = OutputFormat.createPrettyPrint();  
36:      Document document = DocumentHelper.parseText(xmlOutput);  
37:      sw = new StringWriter();  
38:      XMLWriter writer = new XMLWriter(sw, format);  
39:      writer.write(document);  
40:    }  
41:    catch (Exception e) {  
42:      throw new RuntimeException("Error pretty printing xml:\n" + xmlOutputField, e);  
43:    }  
44:    String xmlpp = sw.toString();  
45:       // Set the value in the output field  
46:       //  
47:       get(Fields.Out, xmlPPField).setValue(outputRow, xmlpp);  
48:    // putRow will send the row on to the default output hop.  
49:       //  
50:    putRow(data.outputRowMeta, outputRow);  
51:       return true;  
52:  }  

The code we're interested in is between lines 32 and 44.


  1. The xml that is caming in is read from the xmlOutput row field and is put into the xmlOutput variable
  2. We create an OutputFormat object with the pretty printing enabled by using  OutputFormat.createPrettyPrint() fuction (see line 35).
  3. Parse the XML document and then write it to a StringWriter (see lines 36 - 39)
  4. Put the obtained pretty printed XML in the xmlpp variable and the into the new field created in the output row (see lines 44 - 47)
The resulting XML stream is then saved to disk by mean of the Text Output step.
We made tests by pretty printing quite significant XML files in terms of size with good performances so I strongly suggest you to use this method. We're also planning to build a custom task to do this in a smoother way that will be released to SeraSoft github public repository soon. 

You can download the following sample by clicking here. So for now we've done; next time I'll show you how to beautify a JSON file. Stay tuned!
Read More...

Saturday, June 23, 2012

Saiku UI Enhancements and new Excel export

Leave a Comment
Hi everybody. As promised we've some chestnuts that are ready to be eaten and here we are! We decided to start working on one of the major pieces of the Pentaho CE suite: the newest OLAP client Saiku. We have some enhancement in our pipeline to help the Trout, Paul and the other guys of the crew; this is the first of the list. We decided to start with two enhancements that makes all more beautiful to our customers eyes:

1) Improved columns header management
2) New and improved Excel export functionality

Improved columns header management

A thing that I totally don't like in Saiku was the way it managed the columns headers any time we had repeated values on it. You can see a sample of this old behavior in the figure below.



Now the things are totally different. With a little UI improvement column headers with repeated values are grouped together as displayed in the picture below to have better visualization of your data.



New Excel export

Here we have completely rewritten the export module using Apache POI instead of JExcelAPI. This gives us more power and more feature that could be useful in the future to enhance the functionality. The look and feel now is more elegant and exactly equal to what the user is seeing in the user interface. You can see a sample of the new output below




This is a first release beta release of these features so be careful if you decide to use it. Any feedback about how they work and what you would like to have next is fully appreciated.

Where you can find it

All the code is committed to the GitHub Serasoft repository. You can find it here. We're planning the installation of an Hudson server on our servers in the cloud so that you can pickup anything already compiled so be patient. At the moment you can only grab the sources and compile it. Remember that our code is always aligned with all the new functionalities of the standard Saiku distribution. So what else? Stay tuned for the next things that are going to be released...
Read More...