Rama's Free Thoughts

Saturday, June 7, 2014

Building a Data Mart with Pentaho Data Integration - Video Course Review

Some days ago I had the opportunity to look at this video course, Building a Data Mart with Pentaho Data Integration from Packt, same editor I'm working for in other Pentaho books.

I was really interested in evaluating it, firstly because the author, Diethard Steiner, is a valid community member and author of a great number of useful articles on his own blog. Secondly because it seemed me a good introduction about how to approach the design and development of a complete ETL process to load a Datamart with Pentaho Data Integration.

The course is structured through 9 chapters, clearly explained with a good preamble on top of every chapter to give a clear view of the topic the author is covering also to beginners users. The explanation is clearly undertandable because author speaks a good english at right speed (in my opinion) giving also people that are not perfectly familiar with english language the ability to clearly understand the concepts.

First chapter talks about how to setup PDI (Pentaho Data Integration) and illustrates source and target schema and basics of their design. The thing I liked a lot here was that author used a columnar database (MonetDB) as target platform for its datamart. This way, user can have the opportunity to know about another interesting database technology to use during his/her work. Because the vast majority of the users could not know anything about this database platform, I would have appreciated a few more details about it just to give some necessary background on the database platform and tools used.

Second chapter shows Table Input and Table Output tasks and their use to massively input and output data from a PDI transformation. Good to see also the illustration of the MonetDB bulk loader step, as a more optimized alternative to Table Output step.

Third chapter is all about Agile BI made with PDI and Agile BI Plugin. A good introduction to the benefits of early prototyping and how to do this with Pentaho technology.

Forth chapter is the place where author talks about Slowly Changing Dimensions and how to use PDI to read and write data in Slowly Changing Dimensions of Type 1 and 2. Here, I would have appreciated just one more slide to illustrate beginners users a few more theory about Slowly Changing Dimensions to clearly let user understand which are the problems this approach is going to address in the design of a Data Mart.

Fith chapter talks about time dimension, how to simply structure it and how to effectively generate data to put in it.

Sixth chapter is a guide through techniques used to populate the Fact Table. This same topic will then continue in eighth chapter where the author talks about Changing Data Capture and how to structure our jobs and transformation to load our fact table. Here, in chapter eight, in my opinion a thing is missing and the CDC topic is partially addressed. I would have appreciated to hear something about using the Merge step to get the table's delta from the previous data load and then synchronizing fact table through the Synchronize After Merge step. In my opinion this is a more generic and interesting way to address this problem.

Finally, chapter seven is all about how to better organize jobs and transformation efficently and chapter nine adresses the logging and scheduling of transformation.

Therefore, to give a summary about this video course, it is a good introduction to PDI and how to use it to load a Data Mart. I find it very useful for beginners and nice to see for intermediate users that want to go deeper on some concepts. Samples are well done and really help the user in understanding better the concepts explained. Just a side note: during the videos some of the samples are shown by using an early version of PDI 5 and some other by using the previous version PDI 4.4. Starting from the valid assumption that author was trying to show users the new version of PDI, I found this a bit confusing mostly for beginner users.

Thursday, September 26, 2013

A first look to the new Pentaho BA Server 5.0 CE

It's a lot of time I'm playing with the new Pentaho BA Server 5.0 to see the what's next and now, I think, it is the right time to start talking about it. The first thing that appears is that Pentaho made a big refactoring for this release. A lot of changes from a brand new GUI to a completely new and clean source tree with less and more organized projects than the previous release.

What's changed from a developer point of view

The first big set of changes, visible only to developers like me interested in looking into the code for getting knowledge or customize something, is summarized below:

Project moved to github at this url. This is a good choice to improve collaboration between people and collaboration is one of the key points of the OSS philosophy. Almost any project around Pentaho is housed on this platform (Kettle not yet but it seems the only one).
The project organization in the source code radically changed trying to simplify the structure of the codebase (and in my opinion they get it). You can find a description about the the new project structure and role of any of the sub-projects in this article in the Pentaho wiki.
Then, as a Java architect, another interesting change. It seems they moved to IntelliJ as development platform of choice because of the presence of the IntelliJ metadata files in the source tree. I'm a user of IntelliJ (after having use Netbeans and Eclipse) so this is a switch that I liked a lot. So here at this page a detailed description about to how pull the platform project directly into IntelliJ to start working on it

Get the platform and start playing with it

What's the best way to have the new release in your hands to start trying it out? Two ways: a) get it from the pentaho ci build system or b) build it by yourself. As a developer, the second choice is the funniest and the more interesting if you, like me, want to understand how the things changes from time to time and want to be sure that everything goes correctly. Moreover the ability to prepare your own system package gives you a high degree of personalization; to do this you can find useful using CBF from Webdetails.

In this case, if you want to build it by yourself firstly you've to pull the project locally on your computer from the pentaho-platform repository on GitHub. Then the quick and dirty way to build Pentaho BA Server is to use the development ant build buildfile (dev-build.xml) given in the source tree and make a new complete development build. If intersted, you can get all of the possible targets for this ant command file by giving the following command ant -f dev_build.xml help

but I decided to follow the easy way by using the easiest command available to reach my goal. To do this go to the platform sources' root directory (we indicate that directory as <pentaho-platform-root>) and give the following command

 ant -f dev_build.xml dev-rebuild -Dsolution.dir.server=bin/stage/biserver-ce/pentaho-solutions -Dproject.revision=CE-SNAPSHOT package

The system will remain busy building your new release; depending on your hardware it will take 10 to 15 minutes to complete and you will have it cooked rightly. The result is

a packaged version of the Pentaho BA server in the <pentaho-platform-root>/assembly/dist
an unpackaged version of the same in the <pentaho-platform-root>/assembly/bin/stage/biserver-ce directory that you can use immediately for development and testing purposes.

That said, you can use the unpackaged version to easily and quickly look at the new features of the BA server.

What's changed from a user point of view

As soon as the platform opens in the browser you can see the new look. The new login screen looks really clean and minimal. I really like it! Now, remember that our old and friendly user joe is dead; if you want to login as admin you need to log in as the user Admin with password.... ah yes of course password (too difficult to remember!). In any case you can always get help by clicking with the mouse on the Login as Evaluator link to get the demo users and passwords.

As soon as you get logged in you discover a totally new user environment that I really like for its. The new environment now is based on a concept called Perspective.

A Pentaho perspective is an area that refers to a specific user context and that collect all the objects and actions referred to that particular context. You can change your perspective by clicking on the cascade menu located on the upper left side of the screen immediately below the main menu

We have 4 perspective in our Pentaho environment:

Home: It's the Perspective you get into as soon as you log into Pentaho (see screenshot two pictures above). It contains portlets for Recent files and Favorites files to help you in getting easier to reach the more often used reports or visualizations. Then you have a big blue button called Manage Datasources that is the new way you need to use to define datasources. From this release, this is the only place you can go to deal with datasources: you will not have anymore this function in the administration part of the system as in the previous releases. By clicking on that button you open the Manage Datasources dialog box that let you define new datasources and either edit or delete existing ones.

Browse files: it gives you the ability to access the solution through the solution explorer. As you can see here another interesting new feature. Now we've a differentiation between a public and a private part of the solution. As soon as a user gets created a new Home directory is created. That directory and its content is visible only to that particular user. Only a user with the Admin role has access to all of the users' home directories. All of the homes directories are under a parent folder called Home. There's also a public part of the solution identified by a root folder called Public. That part is the part of the solution that is shared, depending on the share level decided by the administrator or by the content owner, by every user in the system. There's no more contextual menu referred to any item of the solution (either folders or files) as in the previous release but you can have the contextual actions using the menu on the right of the File explorer.

Open files: As soon as you start an analysis view, a report, a visualization or anything else time the content you get is opened in this perspective. You always have multiple contents opened at the same time through a tabbed interface and you can easily switch back to the Browse files perspective or any other using the drop down menu we illustrated a while ago.

Schedule: The old My Workspace view is definitely dead. With this new release we have this perspective where the user can check the status of any the scheduled content and open the content for any terminated execution.

Administration: Another important change is about the administration of the Pentaho BA Server. The old administration console is definitely dead (I like this sooooo much!) and we've a new perspective accessible only to the Admin role and called Administration that let you do your administrative tasks.

Conclusion

So what's next? I think they made a big work reorganizing almost anything in a new and clean way. I really like it thanks a lot guys. I really appreciated they choose to implement a responsive design for this new GUI so that it could be easily used without any pain on any device but it isn't. So this is a good point in my opinion where they could invest some time and there's space to make things much better. In any case we must admit a good change versus a more intuitive and clean web interface so thank you Pentaho for this good job.

Tuesday, July 16, 2013

My first book about Pentaho is out!

Today is a great day for me and I'm very proud about this. It is the realization of a dream I had from a long time: have the possibility to write a technical book.

In February 2013, a guy from Packt Publishing wrote me an email saying that he read at my blog, he liked it and so he decided to ask me about writing a book about Pentaho Data Integration. The book is about kitchen and how to use the PDI's command line tools efficiently to help people in their day to day operations with Pentaho Data Integration. It is a practical book and it seemed relatively easy to write. At first I was surprised and excited (a lot I must admit). It seemed me impossible that someone somewhere in the world noticed my blog and decided to trust me. I was really proud and honored about this so I decided to test me on this and I accepted.

Now, after 4 months of work, that book is finally out and you can order it!

I just wanted to thanks Packt Publishing for this opportunity and especially all the people from Packt that helped me in this adventure by reviewing my work and giving me useful suggestions about to improve my work of novice writer the best I can. I also want to thanks my technical reviewer Joel Latino for the precious help he gave in reviewing the book and suggesting things to improve it.

Monday, June 10, 2013

Print Pentaho Reports from PDI efficiently and without pain!

These days I'm involved in a project where we need to print reports directly from our PDI processes. So what a good option to use the PRD output step to do that? The idea was to find out a way for not writing the JDBC connection informations in the definition of our Pentaho reports connections. To do this, we decided to go for report connections defined by using JNDI datasources.

Pentaho Reporting and JNDI database connections

PRD, as any of the tools in the Pentaho family, supports the creation of a database connection using JNDI datasources. This is a cool feature! Creating Pentaho reports database connections using JNDI datasource is a good approach for the following reasons

We are able to not write connection information in the report making the it more easily adaptable to the system environment changes.
We are able to use a server side common datasource, greatly simplify the deployment of our reports to the Pentaho BI server.
Usually a datasource is directly related to the concept of database connection pooling that helps us to efficiently use and streamline server side resources.

Tipically a JNDI datasource is created by an application server or a servlet engine using an appropriate set of connection parameters that differs depending on the particular database we are going to connect to. As soon as the server side middleware creates the datasource, it is flagged with a name decided at design time by the developer and under that name is made available to any application that requires to use it. It is always a good rule of thumb to use a datasource, as soon as we have one, to let our application connect to our database. Pentaho BI server has a section in the Pentaho Administration Console (PAC) to let you create all the datasource connections you need. Pentaho Reporting and any other tool of the Pentaho suite makes extensive use of the JNDI datasources defined under PAC to connect to all the SQL databases. But now, as developers using PRD (Pentaho Report Designer) to develop our reports, the problem is really simple: how can we develop our reports using Pentaho Reporting JDBC Connections based on JNDI JDBC datasources if we don't have any datasource created on our client (remember that the datasource lives on the server).

First approach about using JNDI datasources in PRD

The first idea, not a good option unfortunately but the first we can find out, could be as follow:

Develop and test the report by defining Pentaho Reporting JDBC Connections based on Native JDBC connections. For this particular Pentaho Reporting JDBC Connection type, the connection information are saved into the Pentaho Report metadata definition making it very difficult to be maintained and eventually changed.
As soon as the report is ready to be deployed to the Pentaho BI Server, edit the Pentaho Reporting JDBC connection definition and change it to use a JNDI JDBC datasource. Basically, this kind of JDBC connection requires that you type the name of a datasource that is made available on the server you are going to deploy. Using JNDI JDBC datasource has the major plus in having a single place were the connections information are kept: in the datasource definition on the application server or servlet engine configuration. So if something changes in our connection information the impact on our big set of reports is really low. That's a good thing.

This approach, at first, seems a good idea but what's wrong with this? For sure it's a bit elaborated but the worst thing is that you can forget to make the change suggested immediately before to deploy to your Pentaho BI Server. I forget for sure.... So what can we do to make the things the easiest the possible?

Use SimpleJNDI to emulate datasources on the client

Pentaho Reporting, and the same is for any other Pentaho tool, integrates a very cool library called SimpleJNDI. This library, as stated in the library website "is intended to solve two problems. The first is that of finding a container independent way of opening a database connection, the second is to find a good way of specifying application configurations". It is entirely library based, so no server instances are started, and it sits upon Java .properties files, XML files or Windows-style .ini files, so it is easy to use and simple to understand. The files may be either on the file system or in the classpath.

In case of PRD we have a file called default.properties typically located in the directory <user_home>/.pentaho/simple-jndi. That file contains the datasources definitions. Any datasource is defined by using a set of by 5 entries and any entry is made up by a key-value pair where the key is defined with the following syntax

<datasource name>/<attribute>

That said let's have a look at the example below:

SampleData/type=javax.sql.DataSource
SampleData/driver=org.hsqldb.jdbcDriver
SampleData/user=pentaho_user
SampleData/password=password
SampleData/url=jdbc:hsqldb:mem:SampleData

This example represents the definition of a datasource named SampleData; it has 5 attributes as detailed below:

type: it represents the type of JNDI resource we're going to define. Basically it's valued to the Java interface that represents a datasource.
driver: it is the name of the database JDBC driver and changes depending on the db we're going to connect to
user: the username used on the connection
password: the user password used for the connection
url: the jdbc connection url used to open the connection to the database. Again, this value depends on the database we're going to connect to.

As soon as the Reporting Engine is going to fill the report template with the data and so a database connection is require the library crease the datasource and the connection is shared with Pentaho Report designer that

The way to go to develop Pentaho reports using JNDI datasources

That said to properly define a Pentaho JDBC Connection that uses JDBC JNDI datasources, definitely, the way to go is to do it as detailed below

Open the default.properties file located under <user_home>/.pentaho/simple-jndi.
Copy and paste the definition of an already defined datasource to simplify the new datasource definition.
Change the name of the datasource in the copied rows to the name of the new datasource we're going ìto define. Remember to call the datasource with the same name of the datasource defined on your Pentaho BI server that the report will use once deployed.
Change the values of the attributes and update them with the connection parameters for the new datasource.
Save and close the file
Open the Pentaho Report Designer and start a new report.
Define a new Pentaho JDBC Datasource connection. Give it a proper name
Add a new JDBC connection in the Pentaho JDBC datasource configuration.
Give the connection a name, select the database connection type and JNDI as access type.
As soon as the JNDI access type is selected set the value of the JNDI name field to the name of the datasource you just configured in the default.properties file (as detailed from point 1 to 5 above).
Press the Test button to verify that the connection through the newly defined datasource works.
Press OK to close the database connection defininition dialog.

At this point your JNDI connection is defined and uses a fake datasource that has the same name as the datasource on the server. So you're not required to remember to make any change before deploying the report because everything is already done. Cool!

Print Pentaho reports from a PDI transformation

PDI (Pentaho Data Integration) has an output task called Pentaho Reporting Output that you can use to print a Pentaho Report from inside a transformation. Basically it prints your report by configuring the complete template filename, the path to the output directory, the report output format (PDF, Excel, HTML ecc.) and eventually the parameters.

So that said seems very simple for a PDI developer to print a Pentaho Report and obviously it is. Take data from your sources, manipulate them in your transformation and start printing your report using this cool output step. But there are some things that anyone must know to let the things works properly in any case. Because PDI doesn't inject connection information in Pentaho Reporting, a good way to have Pentaho Reporting JDBC connections details not written into the report metadata is to use JNDI JDBC datasources to have all that stuff externally defined and easily maintainable. Again, the simple-jndi library comes to our attention to help us in dealing with this.

PDI doesn't inject into our report a database connection

PDI doesn't inject a database connection into the report. So the report uses the connection information defined in Pentaho Reporting datasources connections. At this point, the better option we have to externalize the report database connection information is to use the JNDI capability of our Pentaho Reporting by using a locally defined datasource connection (as detailed above). In this case the default.properties file containing the Pentaho Reporting datasources connection information works as a new external configuration file of our PDI process ad you can distribute it with your process configuration files. As soon as your PDI transformation will start the print output, Pentaho Reporting uses simple-jndi library to create the JNDI JDBC datasource connection and will make that connection available to your report. When we talked about using simple-jndi to support the development process in Pentaho Report Designer of reports using JNDI datasources I said that PRD looks for the file in a specific location of your filesystem. My assumption at this point was that, as usual, even if we use this mechanism to print a report from inside PDI, the reporting subsystem knows where the default.properties file is located. But, unfortunately, I was wrong because Pentaho Reporting is unable to locate the correct location of our default.properties file so the JNDI datasources definitions made in our report are completely. Let analyze the standard use-case: Pentaho Report Designer (the tool) gets the location information for that file through a configuration read by the tool and when we deploy the report to the Pentaho BI server the JNDI name for the datasource is get from the server. But in our new use-case PDI that prints a Pentaho report that uses JNDI definitions, the things are totally different. So how can we deal with this to have our report printed from the PDI step without pain?

Report with JNDI JDBC datasource connections printed by PDI: what happens?

As soon as the print is started by PDI the the Pentaho reporting subsystem the path to default.properties is resolved locally to the script that starts the PDI process (either spoon, or kitchen or pan). Let me try to explain. It your ETL process files are located in the /tmp/foo/etlprocess directory and your start your ETL process with kitchen locally to that directory with a relative path to the job file as in this example

$ cd /tmp/foo/etlprocess
$ kitchen.sh -file:./my_print_job.kjb

PDI looks for the for a simple-jndi directory that is in the directory /tmp/foo/etlprocess so it looks for it in /tmp/foo/etlprocess/simple-jndi. But what happens if you are starting kitchen from inside totally different directory, let me say for example /tmp/foo1 and you're going to start your job my_print_job.kjb that is locate in /tmp/foo/etlprocess
$ cd /tmp/foo1
$ kitchen.sh -file:/tmp/foo/etlprocess/my_print_job.kjb

In this case, PDI looks for the for a simple-jndi directory that is in the directory /tmp/foo1 so it looks for it in /tmp/foo1/simple-jndi. Because you are unaware about how your final user will start your job this is completely a mess! But don't be afraid there's a solution. The best idea to solve this is:

Have a configuration directory local to your ETL process file, that already contains other configuration items for your process and that you distribute from within your package.
Put your default.properties file inside that directory
Have a way to specify to PDI where that simple-jndi configuration file is located.

This elegantly solves your problem.

How to specify simple-jndi information to have your report fully working

There is a way to specify to the simple-jndi library all the information needed to elegantly solve our issue by specifying a set of environment variables. The source of information for this is in the documentation you find in the binary download that you can get from here.

Basically to fix our issue we need redefine the standard value of the PENTAHO_DI_JAVA_OPTIONS to this value:

PENTAHO_DI_JAVA_OPTIONS="-Xmx512m -XX:MaxPermSize=256m -XX:-UseGCOverheadLimit -Djava.naming.factory.initial=org.osjava.sj.SimpleContextFactory -Dorg.osjava.sj.root=/tmp/foo/etlprocess/config -Dorg.osjava.sj.delimiter=/"

As you can see we have the standard memory settings plus three new parameters:

-Djava.naming.factory.initial, this first parameter sets the complete name of the initial Context Factory for SimpleContexts. This is an internal simple-jndi object
-Dorg.osjava.sj.root, this parameter sets the complete path to the directory that will contain the default.properties file containing our datasources definitions. Starting from our previous example we're saying to simple.jndi to locate that file in /tmp/foo/etlprocess/config where config would be nic that will be our process configuration directory.
-Dorg.osjava.sj.delimiter, this third parameter sets the the delimiter used to separate elements in a lookup value. This allows code to get closer to pretending to be another JNDI implementation, such as DNS or LDAP. In our case we need to use the / character.

Fill free to choose which mechanism to use to set these environment variables; you can either put the variable definition in the user profile file or you can make a script file and call the PDI scripts from there after having set these environment variables.

That's all!

So for now that's all! You have all the information required to have your JNDI based reports fully working even when launched through PDI. This is a good way to fully separate connections information from the report and have them saved together to your ETL configuration files. This will give you a clean distribution for your customer and of course, the more easy and clean the things are organized the more happy they will be!

Stay tuned for the next post that will come later on. I've interesting tips to share with you about using a java profiler to support you in investigating memory usage problems in your ETL process so that you get quickly to the solution right solution. Have fun and see you later.