Monday, June 10, 2013

Print Pentaho Reports from PDI efficiently and without pain!

These days I'm involved in a project where we need to print reports directly from our PDI processes, and the Pentaho Reporting Output step is a good option for that. The idea was to find a way to avoid writing the JDBC connection information into the definition of our Pentaho report connections. To do this, we decided to define the report connections using JNDI datasources.

Pentaho Reporting and JNDI database connections

PRD, like any of the tools in the Pentaho family, supports the creation of database connections using JNDI datasources. This is a cool feature! Creating Pentaho report database connections using JNDI datasources is a good approach for the following reasons:

  • We avoid writing connection information into the report, making it more easily adaptable to changes in the system environment.
  • We are able to use a common server-side datasource, which greatly simplifies the deployment of our reports to the Pentaho BI server.
  • Usually a datasource goes hand in hand with database connection pooling, which helps us use server-side resources efficiently.

Typically a JNDI datasource is created by an application server or a servlet engine using an appropriate set of connection parameters that differ depending on the particular database we are going to connect to. As soon as the server-side middleware creates the datasource, it is registered under a name chosen at design time by the developer, and under that name it is made available to any application that requires it. It is always a good rule of thumb, as soon as we have a datasource, to use it to let our application connect to our database.

The Pentaho BI server has a section in the Pentaho Administration Console (PAC) that lets you create all the datasource connections you need. Pentaho Reporting and every other tool of the Pentaho suite make extensive use of the JNDI datasources defined under PAC to connect to SQL databases. But now, as developers using PRD (Pentaho Report Designer) to develop our reports, the problem is really simple: how can we develop our reports using Pentaho Reporting JDBC connections based on JNDI datasources if we don't have any datasource created on our client (remember that the datasource lives on the server)?
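As a concrete illustration, this is how such a datasource might be declared on Tomcat (a hypothetical context.xml fragment; the resource name, credentials and pool settings are just examples):

<Resource name="jdbc/SampleData" auth="Container"
          type="javax.sql.DataSource"
          driverClassName="org.hsqldb.jdbcDriver"
          url="jdbc:hsqldb:mem:SampleData"
          username="pentaho_user" password="password"
          maxActive="20" maxIdle="5"/>

An application deployed on that server can then look the datasource up by its JNDI name without knowing any of the connection details.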

A first approach to using JNDI datasources in PRD

The first idea, unfortunately not a good one but the most obvious, could be as follows:

  • Develop and test the report by defining Pentaho Reporting JDBC connections based on native JDBC connections. For this particular connection type, the connection information is saved into the Pentaho report metadata definition, making it very difficult to maintain and change later.
  • As soon as the report is ready to be deployed to the Pentaho BI server, edit the Pentaho Reporting JDBC connection definition and change it to use a JNDI datasource. Basically, this kind of JDBC connection requires that you type the name of a datasource that is available on the server you are going to deploy to. Using a JNDI datasource has the major plus of keeping the connection information in a single place: the datasource definition in the application server or servlet engine configuration. So if something changes in our connection information, the impact on our big set of reports is really low. That's a good thing.

At first this approach seems a good idea, but what's wrong with it? For sure it's a bit convoluted, but the worst thing is that you can forget to make the suggested change right before deploying to your Pentaho BI server. I would forget for sure... So what can we do to make things as easy as possible?

Use SimpleJNDI to emulate datasources on the client

Pentaho Reporting, like any other Pentaho tool, integrates a very cool library called SimpleJNDI. This library, as stated on its website, "is intended to solve two problems. The first is that of finding a container independent way of opening a database connection, the second is to find a good way of specifying application configurations". It is entirely library based, so no server instances are started, and it sits upon Java .properties files, XML files or Windows-style .ini files, so it is easy to use and simple to understand. The files may be either on the file system or in the classpath.
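To give an idea of what happens under the hood, here is a minimal Java sketch of a SimpleJNDI lookup, assuming the default.properties file described below is in place (PRD performs the equivalent of this setup for you):

import java.sql.Connection;
import javax.naming.InitialContext;
import javax.sql.DataSource;

public class SimpleJndiLookup {
    public static void main(String[] args) throws Exception {
        // Point JNDI at SimpleJNDI's context factory and tell it where
        // the .properties files live.
        System.setProperty("java.naming.factory.initial",
                "org.osjava.sj.SimpleContextFactory");
        System.setProperty("org.osjava.sj.root",
                System.getProperty("user.home") + "/.pentaho/simple-jndi");
        System.setProperty("org.osjava.sj.delimiter", "/");

        // Look up the datasource defined in default.properties as "SampleData"
        // and open a connection through it.
        InitialContext ctx = new InitialContext();
        DataSource ds = (DataSource) ctx.lookup("SampleData");
        try (Connection con = ds.getConnection()) {
            System.out.println("Connected to " + con.getMetaData().getURL());
        }
    }
}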

In the case of PRD we have a file called default.properties, typically located in the directory <user_home>/.pentaho/simple-jndi. That file contains the datasource definitions. Each datasource is defined by a set of 5 entries, and each entry is a key-value pair whose key follows this syntax:

<datasource name>/<attribute>

That said, let's have a look at the example below:

SampleData/type=javax.sql.DataSource
SampleData/driver=org.hsqldb.jdbcDriver
SampleData/user=pentaho_user
SampleData/password=password
SampleData/url=jdbc:hsqldb:mem:SampleData

This example represents the definition of a datasource named SampleData; it has 5 attributes as detailed below:

  • type: the type of JNDI resource we're going to define. Basically it is set to the name of the Java interface that represents a datasource.
  • driver: the class name of the database's JDBC driver; it changes depending on the database we're going to connect to.
  • user: the username used for the connection.
  • password: the user's password for the connection.
  • url: the JDBC URL used to open the connection to the database. Again, this value depends on the database we're going to connect to.
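For instance, a hypothetical definition for a MySQL database (names and credentials are made up) shows how driver and url change together with the target database:

WarehouseDS/type=javax.sql.DataSource
WarehouseDS/driver=com.mysql.jdbc.Driver
WarehouseDS/user=etl_user
WarehouseDS/password=secret
WarehouseDS/url=jdbc:mysql://localhost:3306/warehouse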

As soon as the reporting engine fills the report template with data and therefore a database connection is required, the library creates the datasource, and the resulting connection is made available to Pentaho Report Designer, which uses it to run the report queries.

The right way to develop Pentaho reports using JNDI datasources

That said, to properly define a Pentaho JDBC connection that uses a JNDI datasource, the way to go is as detailed below:

  1. Open the default.properties file located under <user_home>/.pentaho/simple-jndi.
  2. Copy and paste the definition of an already defined datasource to simplify the new datasource definition.
  3. Change the datasource name in the copied rows to the name of the new datasource we're going to define. Remember to give the datasource the same name as the datasource defined on your Pentaho BI server that the report will use once deployed.
  4. Change the values of the attributes and update them with the connection parameters for the new datasource.
  5. Save and close the file.
  6. Open Pentaho Report Designer and start a new report.
  7. Define a new Pentaho JDBC datasource. Give it a proper name.
  8. Add a new JDBC connection in the Pentaho JDBC datasource configuration.
  9. Give the connection a name, select the database connection type and JNDI as the access type.
  10. As soon as the JNDI access type is selected, set the value of the JNDI name field to the name of the datasource you just configured in the default.properties file (as detailed in points 1 to 5 above).
  11. Press the Test button to verify that the connection through the newly defined datasource works.
  12. Press OK to close the database connection definition dialog.


At this point your JNDI connection is defined and uses a fake datasource that has the same name as the datasource on the server. So you're not required to remember to make any change before deploying the report because everything is already done. Cool!

Print Pentaho reports from a PDI transformation

PDI (Pentaho Data Integration) has an output step called Pentaho Reporting Output that you can use to print a Pentaho report from inside a transformation. Basically, it prints your report once you configure the complete template filename, the path to the output directory, the report output format (PDF, Excel, HTML, etc.) and, if needed, the report parameters.


That said, it seems very simple for a PDI developer to print a Pentaho report, and it obviously is: take data from your sources, manipulate it in your transformation and print your report using this cool output step. But there are some things anyone must know to make it work properly in every case. Because PDI doesn't inject connection information into Pentaho Reporting, a good way to keep the JDBC connection details out of the report metadata is to use JNDI datasources, so that all that stuff is externally defined and easily maintainable. Again, the simple-jndi library comes to our aid in dealing with this.

PDI doesn't inject a database connection into our report

PDI doesn't inject a database connection into the report, so the report uses the connection information defined in its Pentaho Reporting datasource connections. At this point, the best option we have to externalize the report's database connection information is to use the JNDI capability of Pentaho Reporting through a locally defined datasource connection (as detailed above). In this case the default.properties file containing the Pentaho Reporting datasource connection information works as a new external configuration file of our PDI process, and you can distribute it with your process configuration files. As soon as your PDI transformation starts the print output, Pentaho Reporting uses the simple-jndi library to create the JNDI datasource connection and makes that connection available to your report.

When I talked about using simple-jndi to support the development of reports with JNDI datasources in Pentaho Report Designer, I said that PRD looks for the default.properties file in a specific location of your filesystem. My assumption at this point was that, as usual, even when we use this mechanism to print a report from inside PDI, the reporting subsystem knows where the default.properties file is located. Unfortunately, I was wrong: Pentaho Reporting is unable to locate our default.properties file, so the JNDI datasource definitions referenced by our report cannot be resolved. Let's analyze the standard use cases: Pentaho Report Designer (the tool) gets the location of that file from a configuration read by the tool, and when we deploy the report to the Pentaho BI server the JNDI name of the datasource is resolved by the server itself. But in our new use case, PDI printing a Pentaho report that uses JNDI definitions, things are totally different. So how can we deal with this and have our report printed from the PDI step without pain?

Report with JNDI JDBC datasource connections printed by PDI: what happens?

As soon as the print is started by PDI, the Pentaho reporting subsystem resolves the path to the simple-jndi directory (and therefore to default.properties) relative to the current working directory of the script that starts the PDI process (spoon, kitchen or pan). Let me try to explain. If your ETL process files are located in the /tmp/foo/etlprocess directory and you start your ETL process with kitchen from inside that directory, passing a relative path to the job file as in this example:

$ cd /tmp/foo/etlprocess
$ kitchen.sh -file:./my_print_job.kjb

PDI looks for a simple-jndi directory inside /tmp/foo/etlprocess, so it looks for it in /tmp/foo/etlprocess/simple-jndi. But what happens if you start kitchen from a totally different directory, let's say /tmp/foo1, and you launch your job my_print_job.kjb located in /tmp/foo/etlprocess?

$ cd /tmp/foo1
$ kitchen.sh -file:/tmp/foo/etlprocess/my_print_job.kjb

In this case, PDI looks for a simple-jndi directory inside /tmp/foo1, so it looks for it in /tmp/foo1/simple-jndi. Because you can't know how your final user will start your job, this is a complete mess! But don't be afraid, there's a solution. The best idea to solve this is:

  • Have a configuration directory local to your ETL process files that already contains other configuration items for your process and that you distribute within your package.
  • Put your default.properties file inside that directory.
  • Have a way to tell PDI where that simple-jndi configuration directory is located.

This elegantly solves your problem.
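To make this concrete, a possible layout for the example used in this post (the directory names are of course up to you) could be:

/tmp/foo/etlprocess/
    my_print_job.kjb
    config/
        default.properties    <- the simple-jndi datasource definitions
        ...other process configuration files...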

How to specify simple-jndi information to have your report fully working

There is a way to give the simple-jndi library all the information needed to elegantly solve our issue: a set of Java system properties, passed to PDI through an environment variable. The source of information for this is the documentation you find in the simple-jndi binary download.

Basically, to fix our issue we need to redefine the standard value of PENTAHO_DI_JAVA_OPTIONS to this:

PENTAHO_DI_JAVA_OPTIONS="-Xmx512m -XX:MaxPermSize=256m -XX:-UseGCOverheadLimit -Djava.naming.factory.initial=org.osjava.sj.SimpleContextFactory -Dorg.osjava.sj.root=/tmp/foo/etlprocess/config -Dorg.osjava.sj.delimiter=/"

As you can see we have the standard memory settings plus three new parameters:

  • -Djava.naming.factory.initial: this first parameter sets the fully qualified class name of the initial context factory for SimpleContexts. This is an internal simple-jndi object.
  • -Dorg.osjava.sj.root: this parameter sets the complete path to the directory containing the default.properties file with our datasource definitions. Following our previous example, we're telling simple-jndi to look for that file in /tmp/foo/etlprocess/config, where config is our process configuration directory.
  • -Dorg.osjava.sj.delimiter: this third parameter sets the delimiter used to separate the elements of a lookup key. This allows the code to get closer to pretending to be another JNDI implementation, such as DNS or LDAP. In our case we need to use the / character.


Feel free to choose the mechanism you prefer to set this environment variable: you can either put the variable definition in your user profile file or write a small wrapper script that sets the variable and then calls the PDI scripts, as sketched below.
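For example, a minimal wrapper script could look like this (the PDI installation path is a placeholder; the rest follows the example used throughout this post):

#!/bin/sh
# Set the simple-jndi options so that the datasource definitions are
# always read from the process configuration directory, no matter
# which directory the job is launched from.
export PENTAHO_DI_JAVA_OPTIONS="-Xmx512m -XX:MaxPermSize=256m -XX:-UseGCOverheadLimit -Djava.naming.factory.initial=org.osjava.sj.SimpleContextFactory -Dorg.osjava.sj.root=/tmp/foo/etlprocess/config -Dorg.osjava.sj.delimiter=/"

# Launch the job with kitchen.
/path/to/data-integration/kitchen.sh -file:/tmp/foo/etlprocess/my_print_job.kjb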

That's all!

So for now that's all! You have all the information required to have your JNDI-based reports fully working even when launched through PDI. This is a good way to fully separate connection information from the report and keep it together with your ETL configuration files. This will give you a clean distribution for your customer and, of course, the cleaner and easier things are organized, the happier they will be!

Stay tuned for the next post, coming later on. I have interesting tips to share with you about using a Java profiler to investigate memory usage problems in your ETL processes so that you get to the right solution quickly. Have fun and see you later.