tag:blogger.com,1999:blog-45478364982573520742024-03-13T22:23:42.094+01:00Rama's Free ThoughtsSergio Ramazzinahttp://www.blogger.com/profile/17602567469759786999noreply@blogger.comBlogger25125tag:blogger.com,1999:blog-4547836498257352074.post-82690089401728115732014-06-07T00:35:00.000+02:002014-06-07T00:35:07.096+02:00Building a Data Mart with Pentaho Data Integration - Video Course Review<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgRW_BqCNrCxN6HQMDnhJjPAZfgUJZa6dk1LS2WCg6egiP3VwkK5X5NZT13U-7ddJe_oHltpiYpA9mp9vxjQ_CnCaw4dBsYv-Ygt2rqP-reLRfcUlgD251a-87m1Qb-J-61VtO_xAkSbMY/s1600/building_dm_with_pdi.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgRW_BqCNrCxN6HQMDnhJjPAZfgUJZa6dk1LS2WCg6egiP3VwkK5X5NZT13U-7ddJe_oHltpiYpA9mp9vxjQ_CnCaw4dBsYv-Ygt2rqP-reLRfcUlgD251a-87m1Qb-J-61VtO_xAkSbMY/s1600/building_dm_with_pdi.png" height="178" width="320" /></a></div>
A few days ago I had the opportunity to look at this video course, <a href="http://bit.ly/1tFrGX2">Building a Data Mart with Pentaho Data Integration</a> from Packt, the same publisher I'm working with on other Pentaho books.<br />
<br />
I was really interested in evaluating it, firstly because the author, Diethard Steiner, is a valued community member and the author of a great number of useful articles on his own <a href="http://diethardsteiner.blogspot.it/">blog</a>. Secondly, because it seemed to me a good introduction to approaching the design and development of a complete ETL process to load a data mart with Pentaho Data Integration.<br />
<br />
The course is structured in nine chapters, each opening with a good preamble that gives even beginner users a clear view of the topic the author is covering. The explanation is easy to follow because the author speaks good English at the right speed (in my opinion), giving even people who are not perfectly familiar with the English language the ability to clearly understand the concepts.<br />
<br />
The first chapter talks about how to set up PDI (Pentaho Data Integration) and illustrates the source and target schemas and the basics of their design. The thing I liked a lot here was that the author used a columnar database (MonetDB) as the target platform for his data mart. This way, viewers get the opportunity to learn about another interesting database technology to use in their work. Because the vast majority of users may not know anything about this database platform, I would have appreciated a few more details, just to give the necessary background on the database platform and the tools used.<br />
<br />
The second chapter shows the Table Input and Table Output steps and how to use them to read and write large amounts of data from a PDI transformation. It was also good to see the MonetDB Bulk Loader step illustrated as a more optimized alternative to the Table Output step.<br />
<br />
The third chapter is all about Agile BI with PDI and the Agile BI plugin: a good introduction to the benefits of early prototyping and how to do it with Pentaho technology.<br />
<br />
The fourth chapter is where the author talks about <a href="http://en.wikipedia.org/wiki/Slowly_changing_dimension">Slowly Changing Dimensions</a> and how to use PDI to read and write data in Slowly Changing Dimensions of Type 1 and Type 2. Here I would have appreciated just one more slide with a bit more theory about Slowly Changing Dimensions, to let beginner users clearly understand which problems this approach addresses in the design of a data mart.<br />
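For readers new to the topic, the Type 2 idea the chapter covers can be reduced to a tiny sketch: instead of overwriting a changed attribute (that would be Type 1), the dimension closes the current row and inserts a new version, so history is preserved. The Java class below is only an illustration of that logic, not PDI's actual step implementation, and all names (surrogateKey, naturalKey, and so on) are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of Slowly Changing Dimension Type 2 versioning.
public class Scd2Sketch {

    static class DimRow {
        final int surrogateKey;   // technical key referenced by the fact table
        final String naturalKey;  // business key, e.g. a customer code
        final String city;        // the tracked attribute
        final int version;
        boolean current;          // flags the active version of the member

        DimRow(int surrogateKey, String naturalKey, String city, int version) {
            this.surrogateKey = surrogateKey;
            this.naturalKey = naturalKey;
            this.city = city;
            this.version = version;
            this.current = true;
        }
    }

    final List<DimRow> dimension = new ArrayList<DimRow>();
    private int nextKey = 1;

    // Apply an incoming source record to the dimension table.
    void apply(String naturalKey, String city) {
        for (int i = 0; i < dimension.size(); i++) {
            DimRow row = dimension.get(i);
            if (row.current && row.naturalKey.equals(naturalKey)) {
                if (row.city.equals(city)) {
                    return;          // nothing changed: no new version
                }
                row.current = false; // close the old version...
                dimension.add(new DimRow(nextKey++, naturalKey, city, row.version + 1));
                return;              // ...and keep it as history
            }
        }
        dimension.add(new DimRow(nextKey++, naturalKey, city, 1)); // brand new member
    }

    public static void main(String[] args) {
        Scd2Sketch dim = new Scd2Sketch();
        dim.apply("C001", "Milan");
        dim.apply("C001", "Rome"); // the customer moved: history is kept
        for (DimRow r : dim.dimension) {
            System.out.println(r.surrogateKey + " " + r.naturalKey + " "
                    + r.city + " v" + r.version + (r.current ? " (current)" : ""));
        }
    }
}
```

Running the sketch leaves two rows for the same customer: the closed Milan row (version 1) and the current Rome row (version 2), which is exactly the behavior a Type 2 dimension step automates.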
<br />
The fifth chapter talks about the time dimension, how to structure it simply, and how to effectively generate the data to put in it.<br />
<br />
The sixth chapter is a guide through the techniques used to populate the fact table. The same topic continues in the eighth chapter, where the author talks about Change Data Capture and how to structure our jobs and transformations to load the fact table. Here, in chapter eight, in my opinion something is missing and the CDC topic is only partially addressed. I would have appreciated hearing something about using the Merge step to get the table's delta from the previous data load and then synchronizing the fact table through the Synchronize After Merge step. In my opinion this is a more generic and interesting way to address the problem.<br />
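To make the point concrete, the delta idea behind that Merge-based approach can be sketched as: compare the previous load with the current snapshot and flag each record as new, changed, deleted, or identical, then apply the flagged rows to the fact table. The Java below is only an illustration of that comparison, not the PDI steps themselves; the keys and values are invented for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch of delta detection for Change Data Capture:
// compare the previous load against the current snapshot, keyed by id.
public class DeltaSketch {

    static Map<String, String> diff(Map<String, String> previous,
                                    Map<String, String> current) {
        Map<String, String> flags = new LinkedHashMap<String, String>();
        for (Map.Entry<String, String> e : current.entrySet()) {
            String old = previous.get(e.getKey());
            if (old == null) {
                flags.put(e.getKey(), "new");       // insert into the fact table
            } else if (old.equals(e.getValue())) {
                flags.put(e.getKey(), "identical"); // nothing to do
            } else {
                flags.put(e.getKey(), "changed");   // update the row
            }
        }
        for (String key : previous.keySet()) {
            if (!current.containsKey(key)) {
                flags.put(key, "deleted");          // delete (or soft-delete)
            }
        }
        return flags;
    }

    public static void main(String[] args) {
        Map<String, String> prev = new LinkedHashMap<String, String>();
        prev.put("1", "qty=10");
        prev.put("2", "qty=5");
        Map<String, String> curr = new LinkedHashMap<String, String>();
        curr.put("1", "qty=12");
        curr.put("3", "qty=7");
        System.out.println(diff(prev, curr)); // prints {1=changed, 3=new, 2=deleted}
    }
}
```

In PDI the flag column produced this way is exactly what a synchronize step consumes to decide, row by row, whether to insert, update, or delete in the target table.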
<br />
Finally, chapter seven is all about how to better organize jobs and transformations efficiently, and chapter nine addresses the logging and scheduling of transformations.<br />
<br />
Therefore, to sum up, this video course is a good introduction to PDI and to using it to load a data mart. I find it very useful for beginners and worth watching for intermediate users who want to go deeper into some concepts. The samples are well done and really help the user better understand the concepts explained. Just a side note: during the videos some of the samples are shown using an early version of PDI 5 and others using the previous version, PDI 4.4. Even granting the valid assumption that the author was trying to show users the new version of PDI, I found this a bit confusing, especially for beginner users.Anonymoushttp://www.blogger.com/profile/14390316813577140633noreply@blogger.com0tag:blogger.com,1999:blog-4547836498257352074.post-13717860186679339812013-09-26T11:44:00.001+02:002013-10-22T15:20:47.712+02:00A first look to the new Pentaho BA Server 5.0 CEI have been playing with the new Pentaho BA Server 5.0 for quite some time to see what's next, and now, I think, it is the right time to start talking about it. The first thing you notice is that Pentaho did a big refactoring for this release: a lot of changes, from a brand new GUI to a completely new and clean source tree with fewer and better organized projects than in the previous release.<br />
<br />
<b>What's changed from a developer point of view</b><br />
<br />
The first big set of changes, visible only to developers like me who are interested in looking into the code to learn or to customize something, is summarized below:<br />
<br />
<ol>
<li>The project moved to <a href="http://github.com/" target="_blank">github</a> at <a href="https://github.com/pentaho/pentaho-platform" target="_blank">this url</a>. This is a good choice to improve collaboration between people, and collaboration is one of the key points of the OSS philosophy. Almost every project around Pentaho is hosted on this platform (Kettle not yet, but it seems to be the only one).</li>
<li>The project organization in the source code changed radically, trying to simplify the structure of the codebase (and in my opinion they got it right). You can find a description of the new project structure and the role of each of the sub-projects <a href="http://wiki.pentaho.com/display/ServerDoc2x/02.+Exploring+the+Pentaho+Repository+in+5.0" target="_blank">in this article</a> in the Pentaho wiki. </li>
<li>Then, as a Java architect, another interesting change. It seems they moved to <a href="http://www.jetbrains.com/idea/" target="_blank">IntelliJ</a> as the development platform of choice, given the presence of IntelliJ metadata files in the source tree. I'm an IntelliJ user myself (after having used NetBeans and Eclipse), so this is a switch I liked a lot. <a href="http://wiki.pentaho.com/display/ServerDoc2x/IntelliJ+Checkout+and+Build+Process" target="_blank">This page</a> gives a detailed description of how to pull the platform project directly into IntelliJ and start working on it.</li>
</ol>
<div>
<b>Get the platform and start playing with it</b><br />
<br />
What's the best way to get the new release into your hands and start trying it out? There are two ways: a) get it from the <a href="http://ci.pentaho.com/job/BISERVER-CE/" target="_blank">pentaho ci build system</a>, or b) build it yourself. As a developer, the second choice is the more fun and interesting one if, like me, you want to understand how things change from time to time and want to be sure that everything goes correctly. Moreover, the ability to prepare your own system package gives you a high degree of customization; for this you may find it useful to use <a href="http://www.webdetails.pt/ctools/cbf.html" target="_blank">CBF</a> from <a href="http://www.webdetails.pt/">Webdetails</a>.<br />
<br />
In this case, if you want to build it yourself, you first have to pull the project locally onto your computer from the pentaho-platform repository on GitHub. Then the quick and dirty way to build the Pentaho BA Server is to use the development <a href="http://ant.apache.org/" target="_blank">ant</a> buildfile (dev_build.xml) provided in the source tree and make a new, complete development build. If interested, you can list all of the available targets of this ant buildfile by giving the following command: <i>ant -f dev_build.xml help</i><br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhM8v60ZIwqxX6EHq-hFXbA1JsJ_OTaxKrbqA7atVq1MnTfQvBTYMvaZ4825K7-7I1-Q8_Ch49oER8_CBJYsOUNhakxOy1IBWX-8vb9KRWyAHYEbLlK6oa0WDGI-yU_KB7RqmRcv7M-prt5/s1600/ba50_1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="221" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhM8v60ZIwqxX6EHq-hFXbA1JsJ_OTaxKrbqA7atVq1MnTfQvBTYMvaZ4825K7-7I1-Q8_Ch49oER8_CBJYsOUNhakxOy1IBWX-8vb9KRWyAHYEbLlK6oa0WDGI-yU_KB7RqmRcv7M-prt5/s400/ba50_1.png" width="400" /></a></div>
<br />
<br />
but I decided to follow the easy way and use the simplest command available to reach my goal. To do this, go to the platform sources' root directory (we will refer to it as <i><pentaho-platform-root></i>) and give the following command
<br />
<br />
<pre style="background-image: URL(https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgDHAMB4J8OAK-i25iG2QJN8oAwVuuG2J6uQ6uD8s6aPJYRfZD3hCJ2hj3prmjRkMBCwy8c1FE8PDauCtMNxpT6PqF37xdcbG0gp7nho236p7DhYj3hsoadvtFXmQB-_54Wk2ywwZmZnBZb/s320/codebg.gif); background: #f0f0f0; border: 1px dashed #CCCCCC; color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> ant -f dev_build.xml dev-rebuild -Dsolution.dir.server=bin/stage/biserver-ce/pentaho-solutions -Dproject.revision=CE-SNAPSHOT package
</code></pre>
<br />
The system will remain busy building your new release; depending on your hardware it will take 10 to 15 minutes to complete, and then you will have it properly cooked. The result is<br />
<br />
<ol>
<li>a packaged version of the Pentaho BA server in the <i><pentaho-platform-root>/assembly/dist</i> directory</li>
<li>an unpackaged version of the same in the <i><pentaho-platform-root>/assembly/bin/stage/biserver-ce</i> directory that you can use immediately for development and testing purposes.</li>
</ol>
<br />
That said, you can use the unpackaged version to easily and quickly look at the new features of the BA server.<br />
<br />
<b>What's changed from a user point of view</b><br />
<div>
<b><br /></b></div>
<div>
As soon as the platform opens in the browser you can see the new look. The new login screen is really clean and minimal. I really like it! Now, remember that our old, friendly user <i>joe</i> is dead; if you want to log in as an administrator you need to log in as the user <i>Admin</i> with password... ah yes, of course, <i>password</i> (too difficult to remember!). In any case you can always get help by clicking on the <i>Login as Evaluator</i> link to get the demo users and passwords.</div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg0xZBcnoSshGLP5q2CFpFjTDHGPoAjzpvUptGDPUIN8SuCsGNRU-rmagOrVe-9bE4zcLa1pS2pb8HsMxG6-_oWJ_cvw3G585bYxKJ35ZahefkEZrV2BXXBoUuI2GUXerleFUe_5FYTtUwg/s1600/ba50_2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="212" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg0xZBcnoSshGLP5q2CFpFjTDHGPoAjzpvUptGDPUIN8SuCsGNRU-rmagOrVe-9bE4zcLa1pS2pb8HsMxG6-_oWJ_cvw3G585bYxKJ35ZahefkEZrV2BXXBoUuI2GUXerleFUe_5FYTtUwg/s400/ba50_2.png" width="400" /></a></div>
<br />
<br />
As soon as you get logged in, you discover a totally new user environment that I really like. The new environment is based on a concept called the Perspective.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjcvlIIK3i6juEqYZDUleJtIMSIrq74HwXDbx_xI_NC-J9nAYlT7KGXJP-NuPn3lPr8HTxs89qptqFDsR9E-7BQr-lmZhyphenhypheneTg2gkrlCnyByFhPBaCCA-eBMlvIMrr0fAOCP7IiTfgWoo59y/s1600/ba50_3.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="211" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjcvlIIK3i6juEqYZDUleJtIMSIrq74HwXDbx_xI_NC-J9nAYlT7KGXJP-NuPn3lPr8HTxs89qptqFDsR9E-7BQr-lmZhyphenhypheneTg2gkrlCnyByFhPBaCCA-eBMlvIMrr0fAOCP7IiTfgWoo59y/s400/ba50_3.png" width="400" /></a></div>
<br />
A Pentaho perspective is an area that refers to a specific user context and collects all the objects and actions related to that particular context. You can change perspective by clicking on the drop-down menu located on the upper left side of the screen, immediately below the main menu<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEioycJ3Q6jZFMOqEsEB1WYXE0h_ntWjaB-N3YBY7QeJQ8q68JUQhLYj9gdwHSzZ1SlJls7TWyXWPagH7ojYNbw6Iqx9jH7KqyXSNUE69PKSlHeE3lYkV_R6yhkzcsu1kYv8Z_vz3ic6kVv0/s1600/ba50_2A.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="32" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEioycJ3Q6jZFMOqEsEB1WYXE0h_ntWjaB-N3YBY7QeJQ8q68JUQhLYj9gdwHSzZ1SlJls7TWyXWPagH7ojYNbw6Iqx9jH7KqyXSNUE69PKSlHeE3lYkV_R6yhkzcsu1kYv8Z_vz3ic6kVv0/s400/ba50_2A.png" width="400" /></a></div>
<br />
We have five perspectives in our Pentaho environment:<br />
<br />
<ul>
<li><i>Home</i>: it's the perspective you land in as soon as you log into Pentaho (see the screenshot two pictures above). It contains portlets for <i>Recent files</i> and <i>Favorite files</i>, to help you reach the reports or visualizations you use most often. Then you have a big blue button called <i>Manage Datasources</i>, which is the new way to define datasources. From this release on, this is the only place to deal with datasources: the function is no longer in the administration part of the system as it was in previous releases. Clicking that button opens the <i>Manage Datasources</i> dialog box, which lets you define new datasources and edit or delete existing ones.</li>
</ul>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi07duw72bgh3XteuhyphenhyphenYqc29qub0UaF0KlAVTmQf60sDZOA9sbgxGcDHvd4jHIlOnzdNyIyCs8suvON1o-zIGeccxBHl099jfdPUDwgwpTApJd-yO6lgWJYhdrGqL6kzyjOrrR4Why_S4xH/s1600/ba50_8.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="211" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi07duw72bgh3XteuhyphenhyphenYqc29qub0UaF0KlAVTmQf60sDZOA9sbgxGcDHvd4jHIlOnzdNyIyCs8suvON1o-zIGeccxBHl099jfdPUDwgwpTApJd-yO6lgWJYhdrGqL6kzyjOrrR4Why_S4xH/s400/ba50_8.png" width="400" /></a></div>
<div>
<br /></div>
<ul>
<li><i>Browse files</i>: it gives you the ability to access the solution through the solution explorer. Here is another interesting new feature: we now have a differentiation between a public and a private part of the solution. As soon as a user is created, a new home directory is created for them. That directory and its content are visible only to that particular user; only a user with the Admin role has access to all of the users' home directories. All of the home directories sit under a parent folder called <i>Home</i>. There is also a public part of the solution, identified by a root folder called <i>Public</i>: the part of the solution that is shared with every user in the system, depending on the share level decided by the administrator or by the content owner. There is no longer a contextual menu on the items of the solution (folders or files) as in the previous release, but you can reach the contextual actions using the menu on the right of the <i>File explorer</i>.</li>
</ul>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjs-x27WN9QVyPftPHrrrHG_xW1V06NCoevLkqFwpBu0VAGEMa_8rQejmyEAd0hiInYLY_0qNmjshrueMjq7r4SkE_Zk-MumkEHJelKvV1MMw2NNe-P6Bcy9NzWkDM-4ODGGB5sTA7TaVsB/s1600/ba50_4.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="212" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjs-x27WN9QVyPftPHrrrHG_xW1V06NCoevLkqFwpBu0VAGEMa_8rQejmyEAd0hiInYLY_0qNmjshrueMjq7r4SkE_Zk-MumkEHJelKvV1MMw2NNe-P6Bcy9NzWkDM-4ODGGB5sTA7TaVsB/s400/ba50_4.png" width="400" /></a></div>
<div>
<br /></div>
<ul>
<li><i>Open files</i>: every time you start an analysis view, a report, a visualization or anything else, the content you get is opened in this perspective. You can have multiple contents open at the same time through a tabbed interface, and you can easily switch back to the <i>Browse files</i> perspective, or any other, using the drop-down menu illustrated a while ago.</li>
</ul>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhFyjBWaNpm9jkiILwkSJuabKKRxwIIgo7chSzj5FRYdoh8RYxgUDW8xNtJ22zYXDk-gEaImBCU0UwKuiZAuMx86a_EYSQMqxG2FK3gEuo1NVrnFi40c1wiT3aoOxcVDvS7FFi_U50yN8kw/s1600/ba50_5.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="211" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhFyjBWaNpm9jkiILwkSJuabKKRxwIIgo7chSzj5FRYdoh8RYxgUDW8xNtJ22zYXDk-gEaImBCU0UwKuiZAuMx86a_EYSQMqxG2FK3gEuo1NVrnFi40c1wiT3aoOxcVDvS7FFi_U50yN8kw/s400/ba50_5.png" width="400" /></a></div>
<div>
<br /></div>
<ul>
<li><i>Schedule</i>: the old <i>My Workspace</i> view is definitely dead. With this new release we have this perspective, where the user can check the status of any scheduled content and open the output of any completed execution.</li>
</ul>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjnW0lx5qXyohTWKDKw3LMO_CkX0_bVGwFi_2SbeFXCaphlnKNS8_OyY6TbJX0MeAJpE3SwI7H7Uts_sX8lJwljA_FbEuu1xNpcJfvjbLzbZ-O37gvLhaggvwvl6cBrvIlVdPNV8J6PIcXQ/s1600/ba50_6.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="210" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjnW0lx5qXyohTWKDKw3LMO_CkX0_bVGwFi_2SbeFXCaphlnKNS8_OyY6TbJX0MeAJpE3SwI7H7Uts_sX8lJwljA_FbEuu1xNpcJfvjbLzbZ-O37gvLhaggvwvl6cBrvIlVdPNV8J6PIcXQ/s400/ba50_6.png" width="400" /></a></div>
<div>
<br /></div>
<ul>
<li><i>Administration</i>: another important change concerns the administration of the Pentaho BA Server. The old administration console is definitely dead (I like this sooooo much!) and we have a new perspective, accessible only to the <i>Admin</i> role and called <i>Administration</i>, that lets you do your administrative tasks. </li>
</ul>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjgJ5cdUtYOb5y5ETP3xvHDPEOGevwJo2OCqMmyVEJoDxfV1uYBurS62EojuDdeBp6JoEICJpQkbTtNcrAP1NqeBnjKKWKSuO893YYvT4oYjmHH66dT2WffpeLHIUR5rqx6cyscz9qzBpgr/s1600/ba50_7.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="211" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjgJ5cdUtYOb5y5ETP3xvHDPEOGevwJo2OCqMmyVEJoDxfV1uYBurS62EojuDdeBp6JoEICJpQkbTtNcrAP1NqeBnjKKWKSuO893YYvT4oYjmHH66dT2WffpeLHIUR5rqx6cyscz9qzBpgr/s400/ba50_7.png" width="400" /></a></div>
<br />
<b>Conclusion</b><br />
<div>
<br /></div>
<div>
So what's next? I think they did a big job reorganizing almost everything in a new and clean way. I really like it: thanks a lot, guys. I would have really appreciated a responsive design for this new GUI, so that it could be used painlessly on any device, but it isn't responsive. This is a point where, in my opinion, they could invest some time, and there is room to make things much better. In any case we must acknowledge a good step towards a more intuitive and clean web interface, so thank you Pentaho for this good job.</div>
<div>
<b><br /></b></div>
<div>
<b><br /></b></div>
<div>
<b><br /></b></div>
<br /></div>
Sergio Ramazzinahttp://www.blogger.com/profile/17602567469759786999noreply@blogger.com11tag:blogger.com,1999:blog-4547836498257352074.post-46360240939021377642013-07-16T21:42:00.000+02:002013-09-26T11:49:21.365+02:00My first book about Pentaho is out!Today is a great day for me and I'm very proud of it. It is the realization of a dream I have had for a long time: the chance to write a technical book.<br />
<br />
In February 2013, a guy from <a href="http://www.packtpub.com/">Packt Publishing</a> wrote me an email saying that he had read my blog and liked it, and so he decided to ask me about writing a book about Pentaho Data Integration. <a href="http://www.packtpub.com/pentaho-data-integration-kitchen/book">The book</a> is about Kitchen and how to use PDI's command line tools efficiently to help people in their day-to-day operations with Pentaho Data Integration. It is a practical book and it seemed relatively easy to write. At first I was surprised and excited (a lot, I must admit). It seemed impossible to me that someone somewhere in the world had noticed my blog and decided to trust me. I was really proud and honored, so I decided to test myself and accepted.<br />
<br />
Now, after 4 months of work, <a href="http://www.packtpub.com/pentaho-data-integration-kitchen/book">that book is finally out</a> and you can order it!<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiNw7eRaRhWr6nFd7ilcmHIiK_YzeIgx2imfRGnbZsP8zL9nKM7fJ5M97UcmVxI-Oa1mqe7OOLdw_-uZ2w1OA__cbzGSc_E4zesrP0eCSz3BApAi7A5jJ-RR5uUQO8CdQI3r9IJ-SdpOdu2/s1600/51WS11Tif5L._SS500_.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiNw7eRaRhWr6nFd7ilcmHIiK_YzeIgx2imfRGnbZsP8zL9nKM7fJ5M97UcmVxI-Oa1mqe7OOLdw_-uZ2w1OA__cbzGSc_E4zesrP0eCSz3BApAi7A5jJ-RR5uUQO8CdQI3r9IJ-SdpOdu2/s320/51WS11Tif5L._SS500_.jpg" width="320" /></a></div>
<br />
<br />
I just want to thank <a href="http://www.packtpub.com/">Packt Publishing</a> for this opportunity, and especially all the people from Packt who helped me in this adventure by reviewing my work and giving me useful suggestions on how to improve, as best I can, my work as a novice writer. I also want to thank my technical reviewer <a href="http://pt.linkedin.com/in/latinojoel">Joel Latino</a> for the precious help he gave in reviewing the book and suggesting things to improve it.<br />
<br />Sergio Ramazzinahttp://www.blogger.com/profile/17602567469759786999noreply@blogger.com13tag:blogger.com,1999:blog-4547836498257352074.post-7929137114633696242013-06-10T10:59:00.002+02:002013-06-10T10:59:25.040+02:00Print Pentaho Reports from PDI efficiently and without pain!These days I'm involved in a project where we need to print reports directly from our PDI processes; what better option than using the PRD output step to do that? The idea was to find a way to avoid writing the JDBC connection information in the connection definitions of our Pentaho reports. To do this, we decided to go with report connections defined using JNDI datasources.<br />
<br />
<b>Pentaho Reporting and JNDI database connections</b><br />
<br />
PRD, like any of the tools in the Pentaho family, supports the creation of a database connection using JNDI datasources. This is a cool feature! Creating Pentaho report database connections using JNDI datasources is a good approach for the following reasons:<br />
<br />
<ul>
<li>We avoid writing connection information in the report, making it more easily adaptable to changes in the system environment.</li>
<li>We can use a common server-side datasource, greatly simplifying the deployment of our reports to the Pentaho BI server.</li>
<li>Usually a datasource is directly related to database connection pooling, which helps us use and streamline server-side resources efficiently.</li>
</ul>
<br />
Typically a JNDI datasource is created by an application server or a servlet engine using an appropriate set of connection parameters that differ depending on the particular database we are going to connect to. As soon as the server-side middleware creates the datasource, it is registered under a name decided at design time by the developer, and under that name it is made available to any application that needs to use it. It is always a good rule of thumb to use a datasource, when we have one, to let our application connect to the database. The Pentaho BI server has a section in the Pentaho Administration Console (PAC) that lets you create all the datasource connections you need, and Pentaho Reporting and every other tool of the Pentaho suite make extensive use of the JNDI datasources defined under PAC to connect to SQL databases. But now, as developers using PRD (Pentaho Report Designer) to develop our reports, the problem is really simple: how can we develop reports using Pentaho Reporting JDBC connections based on JNDI datasources if we don't have any datasource created on our client (remember that the datasource lives on the server)?<br />
<br />
<b>A first approach to using JNDI datasources in PRD</b><br />
<br />
The first idea, unfortunately not a good option but the first we can come up with, could be as follows:<br />
<br />
<ul>
<li>Develop and test the report by defining Pentaho Reporting JDBC connections based on native JDBC connections. For this particular connection type, the connection information is saved into the Pentaho report metadata definition, making it very difficult to maintain and eventually change.</li>
<li>As soon as the report is ready to be deployed to the Pentaho BI Server, edit the Pentaho Reporting JDBC connection definition and change it to use a JNDI JDBC datasource. Basically, this kind of JDBC connection requires that you type the name of a datasource that is available on the server you are deploying to. Using a JNDI JDBC datasource has the major plus of keeping the connection information in a single place: the datasource definition in the application server or servlet engine configuration. So if something changes in our connection information, the impact on our big set of reports is really low. That's a good thing.</li>
</ul>
<br />
At first this approach seems a good idea, but what's wrong with it? For sure it's a bit elaborate, but the worst thing is that you can forget to make the suggested change right before deploying to your Pentaho BI Server. I would forget it for sure... So what can we do to make things as easy as possible?<br />
<br />
<b>Use SimpleJNDI to emulate datasources on the client</b><br />
<br />
Pentaho Reporting, like any other Pentaho tool, integrates a very cool library called <a href="https://code.google.com/p/osjava/wiki/SimpleJNDI">SimpleJNDI</a>. This library, as stated on its website, <i>"is intended to solve two problems. The first is that of finding a container independent way of opening a database connection, the second is to find a good way of specifying application configurations"</i>. It is entirely library based, so no server instances are started, and it sits upon Java .properties files, XML files or Windows-style .ini files, so it is easy to use and simple to understand. The files may be either on the file system or in the classpath.<br />
<br />
In the case of PRD we have a file called default.properties, typically located in the directory <i><user_home>/.pentaho/simple-jndi</i>. That file contains the datasource definitions. Each datasource is defined by a set of 5 entries, and each entry is a key-value pair where the key is defined with the following syntax<br />
<br />
<span style="font-family: Arial, Helvetica, sans-serif; font-size: x-small;"><i><datasource name>/<attribute></i></span><br />
<br />
That said let's have a look at the example below:<br />
<br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">SampleData/type=javax.sql.DataSource</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">SampleData/driver=org.hsqldb.jdbcDriver</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">SampleData/user=pentaho_user</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">SampleData/password=password</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">SampleData/url=jdbc:hsqldb:mem:SampleData</span><br />
<br />
This example represents the definition of a datasource named <i>SampleData</i>; it has 5 attributes, detailed below:<br />
<br />
<ul>
<li><i>type</i>: the type of JNDI resource we're going to define. Basically it is set to the Java interface that represents a datasource.</li>
<li><i>driver</i>: the class name of the database JDBC driver; it changes depending on the database we're going to connect to.</li>
<li><i>user</i>: the username used for the connection.</li>
<li><i>password</i>: the user's password used for the connection.</li>
<li><i>url</i>: the JDBC connection URL used to open the connection to the database. Again, this value depends on the database we're going to connect to.</li>
</ul>
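To make the grouping rule explicit, here is a small, illustrative Java sketch of how lines in the <i><datasource name>/<attribute></i> format fold into per-datasource settings. SimpleJNDI does this parsing for you; the snippet below only mimics the file format described above and is not part of the library's API.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative parser for the "<datasource name>/<attribute>=<value>"
// lines of a simple-jndi default.properties file.
public class SimpleJndiSketch {

    static Map<String, Map<String, String>> parse(List<String> lines) {
        Map<String, Map<String, String>> datasources =
                new LinkedHashMap<String, Map<String, String>>();
        for (String line : lines) {
            int slash = line.indexOf('/');
            int eq = line.indexOf('=');
            if (slash < 0 || eq < 0 || slash > eq) {
                continue; // skip blank lines and anything that is not a datasource entry
            }
            String name = line.substring(0, slash);         // e.g. "SampleData"
            String attribute = line.substring(slash + 1, eq); // e.g. "url"
            String value = line.substring(eq + 1);
            Map<String, String> attrs = datasources.get(name);
            if (attrs == null) {
                attrs = new LinkedHashMap<String, String>();
                datasources.put(name, attrs);
            }
            attrs.put(attribute, value);
        }
        return datasources;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "SampleData/type=javax.sql.DataSource",
                "SampleData/driver=org.hsqldb.jdbcDriver",
                "SampleData/user=pentaho_user",
                "SampleData/password=password",
                "SampleData/url=jdbc:hsqldb:mem:SampleData");
        // prints jdbc:hsqldb:mem:SampleData
        System.out.println(parse(sample).get("SampleData").get("url"));
    }
}
```

The datasource name on the left of the slash is the same name you will later type in the JNDI name field of the report connection, which is why it must match the datasource defined on the server.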
<br />
As soon as the Reporting Engine fills the report template with data, and a database connection is therefore required, the library creates the datasource on the fly and the resulting connection is used by Pentaho Report Designer to run the report queries.<br />
<div>
<br /></div>
<b>The way to go to develop Pentaho reports using JNDI datasources </b><br />
<br />
That said, the way to go to properly define a Pentaho JDBC connection that uses a JNDI JDBC datasource is detailed below:<br />
<br />
<ol>
<li>Open the default.properties file located under <i><user_home>/.pentaho/simple-jndi</i>.</li>
<li>Copy and paste the definition of an already defined datasource to simplify the new datasource definition. </li>
<li>Change the name of the datasource in the copied rows to the name of the new datasource we're going to define. Remember to give the datasource the same name as the datasource defined on your Pentaho BI server that the report will use once deployed.</li>
<li>Change the values of the attributes and update them with the connection parameters for the new datasource.</li>
<li>Save and close the file</li>
<li>Open the Pentaho Report Designer and start a new report. </li>
<li>Define a new Pentaho JDBC datasource connection and give it a proper name.</li>
<li>Add a new JDBC connection in the Pentaho JDBC datasource configuration.</li>
<li>Give the connection a name, select the database connection type and JNDI as access type.</li>
<li>As soon as the JNDI access type is selected, set the value of the JNDI name field to the name of the datasource you just configured in the default.properties file (as detailed in points 1 to 5 above).</li>
<li>Press the Test button to verify that the connection through the newly defined datasource works. </li>
<li>Press OK to close the database connection definition dialog.</li>
</ol>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhdA5-PQAZtRtbICnZoFauF5UVF6s2ad7FgpbpQ7Emqnigzb7lytS19ne7wEuaMZr8MTv92r98t4DUTbkXHWjVNrBR5Oe06HHiI45XyGqp31Pf5mO04kG1OwKp66sg8nIgsCad64Q1D_IK2/s1600/db_connection_dialog.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="297" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhdA5-PQAZtRtbICnZoFauF5UVF6s2ad7FgpbpQ7Emqnigzb7lytS19ne7wEuaMZr8MTv92r98t4DUTbkXHWjVNrBR5Oe06HHiI45XyGqp31Pf5mO04kG1OwKp66sg8nIgsCad64Q1D_IK2/s320/db_connection_dialog.png" width="320" /></a></div>
<br />
At this point your JNDI connection is defined and uses a local stand-in datasource that has the same name as the datasource on the server. So you don't have to remember to change anything before deploying the report, because everything is already done. Cool!<br />
<br />
<b>Print Pentaho reports from a PDI transformation</b><br />
<br />
PDI (Pentaho Data Integration) has an output step called Pentaho Reporting Output that you can use to print a Pentaho report from inside a transformation. Basically, it prints your report given the complete template filename, the path to the output directory, the report output format (PDF, Excel, HTML, etc.) and, optionally, the report parameters.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi21rb9PrDXUXp052dOW0qicpas0FsRL9rywuWsibtX_zNxqi5YO1MLnf9pfumb0cSTT4rM_FT5yc-HcMBzF8OlFUX5mY211HseycCA5lunUSQMrk-Xkl-CHiAhWTsnqAiRQ9dXcsMYNQUw/s1600/prd_output_kettle.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi21rb9PrDXUXp052dOW0qicpas0FsRL9rywuWsibtX_zNxqi5YO1MLnf9pfumb0cSTT4rM_FT5yc-HcMBzF8OlFUX5mY211HseycCA5lunUSQMrk-Xkl-CHiAhWTsnqAiRQ9dXcsMYNQUw/s320/prd_output_kettle.png" width="287" /></a></div>
<br />
So, that said, it seems very simple for a PDI developer to print a Pentaho report, and obviously it is: take data from your sources, manipulate it in your transformation, and print your report using this cool output step. But there are some things anyone must know to make it work properly in every case. Because PDI doesn't inject connection information into Pentaho Reporting, a good way to keep Pentaho Reporting JDBC connection details out of the report metadata is to use JNDI JDBC datasources, so that all that configuration is externally defined and easily maintainable. Again, the simple-jndi library comes to our attention to help us deal with this. <br />
<br />
<b>PDI doesn't inject a database connection into our report</b><br />
<br />
PDI doesn't inject a database connection into the report, so the report uses the connection information defined in its own datasource connections. The best option we have to externalize the report database connection information is the JNDI capability of Pentaho Reporting, using a locally defined datasource connection (as detailed above). In this case the <i>default.properties</i> file containing the datasource definitions works as one more external configuration file of our PDI process, and you can distribute it together with your process configuration files. As soon as your PDI transformation starts the print output, Pentaho Reporting uses the simple-jndi library to create the JNDI JDBC datasource connection and makes that connection available to your report.<br /><br />When we talked about using simple-jndi to support the development of reports with JNDI datasources in Pentaho Report Designer, I said that PRD looks for the file in a specific location on your filesystem. My assumption at this point was that, even when we print a report from inside PDI, the reporting subsystem knows where the default.properties file is located. But, unfortunately, I was wrong: Pentaho Reporting is unable to locate our default.properties file, so the JNDI datasource definitions made in our report cannot be resolved.<br /><br />Let's analyze the standard use case: Pentaho Report Designer (the tool) gets the location of that file from its own configuration, and when we deploy the report to the Pentaho BI Server the JNDI name for the datasource is resolved by the server. But in our new use case, PDI printing a Pentaho report that uses JNDI definitions, things are totally different. So how can we deal with this and have our report printed from the PDI step without pain?<br />
<br />
<b>Report with JNDI JDBC datasource </b><b>connections </b><b>printed by PDI: what happens?</b><br />
<br />
As soon as the print is started by PDI, the Pentaho Reporting subsystem resolves the path to default.properties relative to the directory from which the script that starts the PDI process (spoon, kitchen or pan) was launched. Let me try to explain. If your ETL process files are located in the /tmp/foo/etlprocess directory and you start your ETL process with kitchen from that directory, using a relative path to the job file as in this example<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">$ cd /tmp/foo/etlprocess</span><br />
<span style="font-family: Courier New, Courier, monospace;">$ kitchen.sh -file:./my_print_job.kjb</span><br />
<br />
PDI looks for a simple-jndi directory inside the current directory /tmp/foo/etlprocess, so it looks for it in /tmp/foo/etlprocess/simple-jndi. But what happens if you start kitchen from a totally different directory, let's say /tmp/foo1, to run the job my_print_job.kjb that is located in /tmp/foo/etlprocess?
<span style="font-family: Courier New, Courier, monospace;">$ cd /tmp/foo1</span><br />
<span style="font-family: Courier New, Courier, monospace;">$ kitchen.sh -file:/tmp/foo/etlprocess/my_print_job.kjb</span><br />
<br />
In this case, PDI looks for a simple-jndi directory inside the directory /tmp/foo1, so it looks for it in /tmp/foo1/simple-jndi. Because you cannot know in advance from which directory your final user will start your job, this is a complete mess! But don't be afraid, there's a solution. The best idea to solve this is:<br />
<br />
<ul>
<li>Have a configuration directory local to your ETL process file, that already contains other configuration items for your process and that you distribute from within your package.</li>
<li>Put your default.properties file inside that directory</li>
<li>Have a way to specify to PDI where that simple-jndi configuration file is located.</li>
</ul>
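Following the directories used in the examples above (the file names are just hypothetical), the resulting layout would look like this:

```
/tmp/foo/etlprocess/
├── my_print_job.kjb        (the job that runs the Pentaho Reporting Output step)
├── my_print_transf.ktr     (the transformations of the process)
└── config/                 (configuration directory distributed with the package)
    └── default.properties  (simple-jndi datasource definitions)
```

The key point is that the config directory travels with the process files, independently of where the user launches kitchen from.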
<br />
This elegantly solves your problem.<br />
<br />
<b>How to specify simple-jndi information to have your report fully working</b><br />
<b><br /></b>
There is a way to give the simple-jndi library all the information needed to elegantly solve our issue by specifying a set of Java system properties. The source of information for this is the documentation included in the binary download that you can get from <a href="https://code.google.com/p/osjava/downloads/detail?name=simple-jndi-0.11.4.1.zip&can=2&q=">here</a>.<br />
<br />
Basically, to fix our issue we need to redefine the standard value of the PENTAHO_DI_JAVA_OPTIONS environment variable to this value:<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">PENTAHO_DI_JAVA_OPTIONS="-Xmx512m -XX:MaxPermSize=256m -XX:-UseGCOverheadLimit -Djava.naming.factory.initial=org.osjava.sj.SimpleContextFactory -Dorg.osjava.sj.root=</span><span style="font-family: 'Courier New', Courier, monospace;">/tmp/foo/etlprocess/</span><span style="font-family: Courier New, Courier, monospace;">config -Dorg.osjava.sj.delimiter=/"</span><br />
<br />
As you can see we have the standard memory settings plus three new parameters:<br />
<br />
<ul>
<li><b style="font-family: 'Courier New', Courier, monospace;">-Djava.naming.factory.initial,</b><span style="font-family: inherit;"> this first parameter sets the complete class name of the initial context factory for SimpleContexts, an internal simple-jndi object.</span></li>
<li><b style="font-family: 'Courier New', Courier, monospace;">-Dorg.osjava.sj.root,</b><span style="font-family: inherit;"> this parameter sets the complete path to the directory that contains the default.properties file with our datasource definitions. Following our previous example, we're telling simple-jndi to locate that file in /tmp/foo/etlprocess/config, where config is our process configuration directory.</span></li>
<li><b style="font-family: 'Courier New', Courier, monospace;">-Dorg.osjava.sj.delimiter,</b><span style="font-family: inherit;"> this third parameter sets the delimiter used to separate elements in a lookup value. This allows the code to get closer to pretending to be another JNDI implementation, such as DNS or LDAP. In our case we need to use the / character.</span></li>
</ul>
<br />
<div>
<br /></div>
<div>
Feel free to choose which mechanism to use to set this environment variable: you can either put the variable definition in the user profile file, or you can make a script file that sets the variable and then calls the PDI scripts.</div>
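For the second option, a minimal wrapper script could look like the following sketch; the configuration path and the commented-out kitchen invocation are assumptions to adapt to your installation:

```shell
#!/bin/sh
# Hypothetical wrapper: point simple-jndi at the process config directory,
# then launch the job with kitchen.
CONFIG_DIR="/tmp/foo/etlprocess/config"   # directory containing default.properties

# Standard memory settings plus the three simple-jndi system properties.
PENTAHO_DI_JAVA_OPTIONS="-Xmx512m -XX:MaxPermSize=256m -XX:-UseGCOverheadLimit \
-Djava.naming.factory.initial=org.osjava.sj.SimpleContextFactory \
-Dorg.osjava.sj.root=${CONFIG_DIR} \
-Dorg.osjava.sj.delimiter=/"
export PENTAHO_DI_JAVA_OPTIONS

# Adjust the path to your PDI installation before enabling this line:
# /opt/pdi/kitchen.sh -file:/tmp/foo/etlprocess/my_print_job.kjb
```

Distributing such a script with the package means the user can launch it from any directory and the report's JNDI lookup still resolves correctly.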
<div>
<br /></div>
<div>
<b>That's all!</b></div>
<div>
<br /></div>
<div>
So, for now, that's all! You have all the information required to have your JNDI based reports fully working even when launched through PDI. This is a good way to fully separate connection information from the report and keep it saved together with your ETL configuration files. This gives you a clean distribution for your customer and, of course, the more easily and cleanly things are organized, the happier they will be!</div>
<div>
<br /></div>
<div>
Stay tuned for the next post that will come later on. I have interesting tips to share with you about using a Java profiler to investigate memory usage problems in your ETL processes, so that you get quickly to the right solution. Have fun and see you later.</div>
<div>
<br /></div>
<br />Sergio Ramazzinahttp://www.blogger.com/profile/17602567469759786999noreply@blogger.com3tag:blogger.com,1999:blog-4547836498257352074.post-85660177547835638082013-04-22T07:24:00.002+02:002013-04-22T07:24:43.818+02:00Pretty print XML and JSON files with PDI (2/2)Today I'm back with the second part of my recipe about how to pretty print XML and JSON files with PDI. You can find the <a href="http://ramathoughts.blogspot.it/2013/04/pretty-printing-xml-and-json-files-with.html">first part here</a>.<br />
<br />
To pretty print any JSON stream, a good starting point is the <a href="http://code.google.com/p/google-gson/">GSON library</a>. It is a nice component that lets you<br />
<br />
<ol>
<li>serialize JSON streams starting from a set of Java objects or </li>
<li>convert a JSON stream into an equivalent set of Java objects</li>
</ol>
<b>The setup</b><br />
<br />
To prepare PDI to run this example you must:<br />
<br />
<ol>
<li>Download the GSON library from the <a href="http://code.google.com/p/google-gson/downloads/list">following link</a>. In my case I downloaded version 2.2.3, but the same steps apply to other versions of the library.</li>
<li>Unzip the file on a temporary directory</li>
<li>Copy the gson-2.2.3.jar file to the <PDI_HOME>/libext directory</li>
<li>Restart PDI </li>
</ol>
<br />
<b>The how-to</b><br />
<br />
First of all, I made an example to obtain an ugly JSON sample stream to format. To do this I built a new transformation, reusing the input files of the multilayer XML file sample transformation, to obtain a simple JSON stream. The interesting part is at the very end of this transformation: again, a User Defined Java Class step contains all the code that does the dirty job for you.<br />
<br />
<pre style="background-image: URL(https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgDHAMB4J8OAK-i25iG2QJN8oAwVuuG2J6uQ6uD8s6aPJYRfZD3hCJ2hj3prmjRkMBCwy8c1FE8PDauCtMNxpT6PqF37xdcbG0gp7nho236p7DhYj3hsoadvtFXmQB-_54Wk2ywwZmZnBZb/s320/codebg.gif); background: #f0f0f0; border: 1px dashed #CCCCCC; color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;">1: import com.google.gson.Gson;
2: import com.google.gson.GsonBuilder;
3: import com.google.gson.JsonParser;
4: import com.google.gson.JsonElement;
5: String jsonOutputField;
6: String jsonPPField;
7: public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException
8: {
9: // First, get a row from the default input hop
10: //
11: Object[] r = getRow();
12: // If the row object is null, we are done processing.
13: //
14: if (r == null) {
15: setOutputDone();
16: return false;
17: }
18: // Let's look up parameters only once for performance reason.
19: //
20: if (first) {
21: jsonOutputField = getParameter("JSONOUTPUT_FIELD");
22: jsonPPField = getParameter("JSONPP_FIELD");
23: first=false;
24: }
25: // It is always safest to call createOutputRow() to ensure that your output row's Object[] is large
26: // enough to handle any new fields you are creating in this step.
27: //
28: Object[] outputRow = createOutputRow(r, data.outputRowMeta.size());
29: logBasic("Input row size: " + r.length);
30: logBasic("Output row size: " + data.outputRowMeta.size());
31: String jsonOutput = get(Fields.In, jsonOutputField).getString(r);
32: Gson gson = new GsonBuilder().setPrettyPrinting().create();
33: JsonParser jp = new JsonParser();
34: JsonElement je = jp.parse(jsonOutput);
35: String jsonpp = gson.toJson(je);
36: // Set the value in the output field
37: //
38: get(Fields.Out, jsonPPField).setValue(outputRow, jsonpp);
39: // putRow will send the row on to the default output hop.
40: //
41: putRow(data.outputRowMeta, outputRow);
42: return true;
43: }
</code></pre>
<br />
This time the interesting code is between lines 32 and 35:<br />
<br />
<ol>
<li>The Gson object is created with pretty printing enabled (line 32).</li>
<li>The incoming JSON stream is read and parsed (lines 33-34)</li>
<li>A new, pretty-printed JSON stream is built and used to fill the jsonpp row field (line 35)</li>
</ol>
<br />
Next the output pretty printed JSON stream is saved to a .js file and that's all.<br />
<br />
You can download the sample transformation from <a href="http://download.serasoft.it/public/json_beautify.zip">this link</a>. I hope you enjoyed this two-part article and that it can be useful for you.<br />
<br />Sergio Ramazzinahttp://www.blogger.com/profile/17602567469759786999noreply@blogger.com0tag:blogger.com,1999:blog-4547836498257352074.post-10180438796109309392013-04-19T22:43:00.000+02:002013-04-22T07:26:59.416+02:00Pretty print XML and JSON files with PDI (1/2)These days I got into some stupid trouble with PDI. A customer asked me to build a process that produces an XML file as a result; the result I got was horrible from a presentation standpoint: the file wasn't properly pretty printed, which made it very difficult to read. You will encounter this problem any time you produce XML files as output, and the same thing also happens any time you want JSON output. So this time I decided to take care of it with a quick and simple solution to beautify my output; this is what I'm going to talk about in this post.<br />
<br />
<b>Pretty Printing XML files How-To</b><br />
<br />
So what to do in case I need to pretty print XML files? Well, that's very simple. PDI ships with <a href="http://dom4j.sourceforge.net/">dom4j</a>, a very well known library that lets you easily work with XML, XPath and XSLT. That library has an object in its API model, <i>OutputFormat</i>, which represents the format configuration used for the XML output. There you find a method called <i>createPrettyPrint()</i>; this is all we need to reach our goal very effectively. But how can we apply all of this to the XML that we're going to produce in our transformation? It's very easy.<br />
<br />
Let's start from one of the samples shipped with PDI: the multilayer XML files sample. You can find it in the samples/transformations directory in your PDI home directory. Take the original file and make a copy so that we can start working on it.<br />
<br />
Now, in between the Text Output step and the 5th XML Join step (named "<i>XML Join Step 5</i>") I added two new steps, a Select values step and a User Defined Java Class step, as you can see in the detail shown in the figure below<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgH06BAhN0cxxKZWDc2jS1aEalQ78mqNy8UKiIGmvO1v6-mKfW6RM8JcUXgLW93W9jop5eli23qCA5Kntgjef3ehHbD4to-8ns-xrKz9htAVDYPWYIl3vkHRGE3P5sAxt-SAlFhFyxkPo-V/s1600/Screenshot+from+2013-04-19+19:04:36.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="209" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgH06BAhN0cxxKZWDc2jS1aEalQ78mqNy8UKiIGmvO1v6-mKfW6RM8JcUXgLW93W9jop5eli23qCA5Kntgjef3ehHbD4to-8ns-xrKz9htAVDYPWYIl3vkHRGE3P5sAxt-SAlFhFyxkPo-V/s320/Screenshot+from+2013-04-19+19:04:36.png" width="320" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
While the Select values step selects only the xmlOutput field that contains the raw XML that needs to be beautified, the User Defined Java Class step is the place where the dirty job is done. To properly configure the User Defined Java Class step, which I renamed "Pretty Print Out XML", we need to<br />
<br />
<br />
<ol>
<li>Under the Fields tab below, define a new field named <i>xmlpp</i> whose type is String; it will contain the produced pretty printed XML stream</li>
<li>Select the tab Parameters </li>
<li>Define a new parameter XMLOUTPUT_FIELD whose value is set to xmlOutput </li>
<li>Define a new parameter XMLPP_FIELD whose value is set to xmlpp</li>
<li>Copy and paste the code below in the related area in the task properties dialog</li>
</ol>
<div>
<br /></div>
<pre style="background-image: URL(https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgDHAMB4J8OAK-i25iG2QJN8oAwVuuG2J6uQ6uD8s6aPJYRfZD3hCJ2hj3prmjRkMBCwy8c1FE8PDauCtMNxpT6PqF37xdcbG0gp7nho236p7DhYj3hsoadvtFXmQB-_54Wk2ywwZmZnBZb/s320/codebg.gif); background: #f0f0f0; border: 1px dashed #CCCCCC; color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;">1: import java.io.StringWriter;
2: import org.dom4j.io.OutputFormat;
3: import org.dom4j.Document;
4: import org.dom4j.DocumentHelper;
5: import org.dom4j.io.XMLWriter;
6: String xmlOutputField;
7: String xmlPPField;
8: public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException
9: {
10: // First, get a row from the default input hop
11: //
12: Object[] r = getRow();
13: // If the row object is null, we are done processing.
14: //
15: if (r == null) {
16: setOutputDone();
17: return false;
18: }
19: // Let's look up parameters only once for performance reason.
20: //
21: if (first) {
22: xmlOutputField = getParameter("XMLOUTPUT_FIELD");
23: xmlPPField = getParameter("XMLPP_FIELD");
24: first=false;
25: }
26: // It is always safest to call createOutputRow() to ensure that your output row's Object[] is large
27: // enough to handle any new fields you are creating in this step.
28: //
29: Object[] outputRow = createOutputRow(r, data.outputRowMeta.size());
30: logBasic("Input row size: " + r.length);
31: logBasic("Output row size: " + data.outputRowMeta.size());
32: String xmlOutput = get(Fields.In, xmlOutputField).getString(r);
33: StringWriter sw;
34: try {
35: OutputFormat format = OutputFormat.createPrettyPrint();
36: Document document = DocumentHelper.parseText(xmlOutput);
37: sw = new StringWriter();
38: XMLWriter writer = new XMLWriter(sw, format);
39: writer.write(document);
40: }
41: catch (Exception e) {
42: throw new RuntimeException("Error pretty printing xml:\n" + xmlOutputField, e);
43: }
44: String xmlpp = sw.toString();
45: // Set the value in the output field
46: //
47: get(Fields.Out, xmlPPField).setValue(outputRow, xmlpp);
48: // putRow will send the row on to the default output hop.
49: //
50: putRow(data.outputRowMeta, outputRow);
51: return true;
52: }
</code></pre>
<br />
The code we're interested in is between lines 32 and 44.<br />
<br />
<br />
<ol>
<li>The XML that is coming in is read from the xmlOutput row field and put into the xmlOutput variable (line 32)</li>
<li>We create an OutputFormat object with the pretty printing enabled by using OutputFormat.createPrettyPrint() fuction (see line 35).</li>
<li>Parse the XML document and then write it to a StringWriter (see lines 36 - 39)</li>
<li>Put the obtained pretty printed XML into the xmlpp variable and then into the new field created in the output row (see lines 44 - 47)</li>
</ol>
<div>
The resulting XML stream is then saved to disk by means of the Text Output step.</div>
<div>
We made tests pretty printing XML files quite significant in size, with good performance, so I strongly suggest you use this method. We're also planning to build a custom step to do this in a smoother way; it will be released to the SeraSoft public GitHub repository soon. </div>
<div>
<br /></div>
<div>
You can download the following sample by <a href="http://download.serasoft.it/public/xml_beautify.zip">clicking here</a>. That's all for now; next time I'll show you how to beautify a JSON file. Stay tuned!</div>
Sergio Ramazzinahttp://www.blogger.com/profile/17602567469759786999noreply@blogger.com0tag:blogger.com,1999:blog-4547836498257352074.post-44788121641224569992012-06-23T00:25:00.003+02:002013-06-05T19:23:23.066+02:00Saiku UI Enhancements and new Excel exportHi everybody. As promised, we have some chestnuts that are ready to be eaten, and here we are! We decided to start working on one of the major pieces of the <a href="http://community.pentaho.com/">Pentaho CE</a> suite: the newest OLAP client <a href="http://analytical-labs.com/">Saiku</a>. We have some enhancements in our pipeline to help the Trout, Paul and the other guys of the crew; this is the first on the list. We decided to start with two enhancements that make everything more beautiful to our customers' eyes:<br />
<br />
1) Improved columns header management<br />
2) New and improved Excel export functionality<br />
<br />
<b>Improved columns header management</b><br />
<br />
A thing I totally didn't like in Saiku was the way it managed the column headers any time there were repeated values in them. You can see a sample of this old behavior in the figure below.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhPFcLTEX4yz1A9I-AeWFbI7bP_YMQEhMDYF6PcG6Ia8sTia6WV8PKaK26l2HnzU6_JcPNUmrDiqmKu8ydj8izCDnmTvRdKwnDViMty0CuIkgwklvEONVYPrIHQdh-kvKRBTrO96qxPpRAC/s1600/saoku_headers_old.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="197" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhPFcLTEX4yz1A9I-AeWFbI7bP_YMQEhMDYF6PcG6Ia8sTia6WV8PKaK26l2HnzU6_JcPNUmrDiqmKu8ydj8izCDnmTvRdKwnDViMty0CuIkgwklvEONVYPrIHQdh-kvKRBTrO96qxPpRAC/s400/saoku_headers_old.PNG" width="400" /></a></div>
<br />
<br />
Now things are totally different. With a little UI improvement, column headers with repeated values are grouped together, as displayed in the picture below, to give a better visualization of your data.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg3nig5aPdSuPoKeDkEcKsQrWH-zwaBkvPuXvKZ8QQykDgK7UW938JfmeSdwXY8fIMDKgLsjAX1LUWmwL4F0jl1xGqYDYpQathclDttNHFkbTDeuZtxTcTe0yVlgvKDuT3KT-EsmPlkEFqL/s1600/saoku_headers_new.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="198" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg3nig5aPdSuPoKeDkEcKsQrWH-zwaBkvPuXvKZ8QQykDgK7UW938JfmeSdwXY8fIMDKgLsjAX1LUWmwL4F0jl1xGqYDYpQathclDttNHFkbTDeuZtxTcTe0yVlgvKDuT3KT-EsmPlkEFqL/s400/saoku_headers_new.PNG" width="400" /></a></div>
<br />
<b><br /></b>
<b>New Excel export</b><br />
<br />
Here we have completely rewritten the export module using <a href="http://poi.apache.org/">Apache POI</a> instead of <a href="http://jexcelapi.sourceforge.net/">JExcelAPI</a>. This gives us more power and more features that could be useful in the future to enhance this functionality. The look and feel is now more elegant and exactly matches what the user sees in the user interface. You can see a sample of the new output below<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiZPtgV1zUR9OjKAFRLEVNsNW6VFoDv6F95-NpJ6FMZpK6n4kqYR1Paau9xf4HuKidsX0bHx2BlJnQ3kvgqnNf4ohR_S1tR0N8PHeSyuk_I17d8aeW4domrcLko15_Din40LkeWg2EamYWr/s1600/excel_new_export.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="252" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiZPtgV1zUR9OjKAFRLEVNsNW6VFoDv6F95-NpJ6FMZpK6n4kqYR1Paau9xf4HuKidsX0bHx2BlJnQ3kvgqnNf4ohR_S1tR0N8PHeSyuk_I17d8aeW4domrcLko15_Din40LkeWg2EamYWr/s400/excel_new_export.PNG" width="400" /></a></div>
<br />
<br />
<br />
This is a first beta release of these features, so be careful if you decide to use it. Any feedback about how they work and what you would like to have next is fully appreciated.<br />
<br />
<b>Where you can find it</b><br />
<br />
All the code is committed to the <a href="https://github.com/serasoft">GitHub Serasoft</a> repository. You can find it here. We're planning the installation of a Hudson server on our servers in the cloud so that you can pick up precompiled builds, so be patient: at the moment you can only grab the sources and compile them. Remember that our code is always aligned with all the new functionalities of the standard Saiku distribution. So what else? Stay tuned for the next things that are going to be released...Sergio Ramazzinahttp://www.blogger.com/profile/17602567469759786999noreply@blogger.com1tag:blogger.com,1999:blog-4547836498257352074.post-77831663764800542502012-06-22T11:41:00.001+02:002012-06-22T12:11:29.704+02:00I'm back after a long time missing....Hi everybody, and sorry for not writing anything for so long. Writing a post is a nice thing and I love it, but it takes time, and during this last period I haven't had any time to write anything here or to cooperate tightly with my favorite community: the <a href="http://community.pentaho.com/">Pentaho community</a>. During that time I was busy starting my own little company here in Italy (previously I spent 7 years as a freelance consultant) called <a href="http://www.serasoft.it/">SeraSoft</a>. I decided to do that because I thought it was the moment to change something in my professional life. This is a new job for me and it is really hard: managing a new business in an environment as competitive as Information Technology is much harder than designing and developing good software. We have a little office in <a href="http://www.boffaloranet.it/">Boffalora Sopra Ticino</a>, a little town located 20 km from Milan near the river <a href="http://it.wikipedia.org/wiki/Ticino_%28fiume%29">Ticino</a>. It is a good place to stay away from the traffic of the city, but at the same time we're close by and it takes us very little time to get into the city. <br />
<br />
<iframe width="425" height="350" frameborder="0" scrolling="no" marginheight="0" marginwidth="0" src="https://maps.google.it/maps?f=q&source=s_q&hl=it&geocode=&q=SeraSoft+S.r.l.,+Via+Galileo+Galilei,+14,+Boffalora+Sopra+Ticino+Milano&aq=0&oq=Serasoft&sll=41.442726,12.392578&sspn=9.631836,26.784668&ie=UTF8&hq=serasoft+srl&hnear=Via+Galileo+Galilei,+14,+20010+Boffalora+Sopra+Ticino+Milano,+Lombardia&t=m&z=16&iwloc=A&output=embed"></iframe><br /><small><a href="https://maps.google.it/maps?f=q&source=embed&hl=it&geocode=&q=SeraSoft+S.r.l.,+Via+Galileo+Galilei,+14,+Boffalora+Sopra+Ticino+Milano&aq=0&oq=Serasoft&sll=41.442726,12.392578&sspn=9.631836,26.784668&ie=UTF8&hq=serasoft+srl&hnear=Via+Galileo+Galilei,+14,+20010+Boffalora+Sopra+Ticino+Milano,+Lombardia&t=m&z=16&iwloc=A" style="color:#0000FF;text-align:left">Visualizzazione ingrandita della mappa</a></small>
<br />
<br />
During these last months we did a lot of interesting things on real-life projects with Pentaho and other interesting Open Source technologies like <a href="http://www.liferay.com/community/welcome/dashboard">Liferay</a>, <a href="http://www.sugarforge.org/content/open-source/">SugarCRM</a> and <a href="http://www.openbravo.com/">Openbravo</a>, working with companies of every size in sectors like Banking, Telecommunications, Services, Industry and Automotive. Lastly, I decided it was time to come back to working heavily in the community, and I restarted doing things I hope are interesting, giving my two cents to help the Open Source ecosystem grow. I'm also trying to bring this idea of cooperation to my colleagues, so we hope to be able to quickly give our contribution as a company.<br />
<br />
That said, I thought it was time to come back and start sharing with you the experiences of myself and my team with Pentaho and the other products we're using; I hope they will be of interest to you all. So stay tuned for the next news from this channel, and also follow us on <a href="https://github.com/serasoft">GitHub</a> to look directly at what we're doing. See you soon! There are some chestnuts cooking on the fire that are almost ready to be eaten.<br />
<br />
<br />Sergio Ramazzinahttp://www.blogger.com/profile/17602567469759786999noreply@blogger.com1tag:blogger.com,1999:blog-4547836498257352074.post-20691509774398506712010-11-21T16:43:00.023+01:002010-11-22T08:55:43.469+01:00Adjust detail rows height in WAQR templates dynamicallyToday I'm back writing something about WAQR. Yesterday one of my customers wrote me about a problem they were having while executing WAQR reports that have long text in their columns. If you have columns containing very long text and you export the WAQR report to PDF, the column text gets truncated. A stupid problem, but apparently not so trivial to solve within WAQR templates. <br />
<br />
<b>About WAQR</b><br />
<br />
WAQR (Web AdHoc Query Reporting) is an interesting module of the Pentaho suite. It sits on top of the report engine and the metadata layer and lets users easily build tabular reports for their daily activities, or simply export some complex data in an easy way from the Pentaho system. The report definition is based on a wizard that takes the user through these easy steps:<br />
<br />
<ol><li>select a template, from the set of available templates, and a business model. </li>
<li>decide where to put which fields in the report layout</li>
<li>adjust some visualization attributes and/or apply filter conditions and define sort fields and orders</li>
<li>manage some page layout attributes</li>
<li>.... and here we go! We get the report.</li>
</ol><br />
WAQR templates are basically the old JFreeReport report designer's template files, so they are simple XML files we can view and modify. THEY ARE NOT compatible with the newer Report Designer template files. The wizard uses text-manipulation routines to create a report definition out of the template. We all know that sooner or later WAQR will be replaced by something more interactive and attractive that uses the latest version of the report engine, but for the moment this is what we have, and with this we have to battle.<br />
<br />
<b>How to expand the height of the detail rows dynamically when export type is PDF</b> <br />
<br />
Try building a new report with WAQR, putting in the report definition a field with very long text in it. If you export the report in PDF format you will get truncated text. Fixing this is only a matter of minutes: you need to modify the report template by<br />
<br />
<ul><li>adding a new configuration attribute for the report engine, and </li>
<li>adding a new attribute to the details band to make the row height dynamic. </li>
</ul><br />
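In template terms, the two modifications amount to something like the fragment below. This is only a sketch based on the old JFreeReport template format: the exact element nesting in your own template may differ, so treat the surrounding `report` skeleton as illustrative.

```xml
<!-- Sketch of the relevant parts of jfreereport-template.xml.
     Only the dynamic="true" attribute and the AssumeOverflowY
     property are new; everything else stands for whatever your
     template already contains. -->
<report>
  <!-- the details band: dynamic="true" lets the row height grow
       with the cell content instead of truncating it -->
  <items dynamic="true">
    <!-- field/label elements generated by the WAQR wizard -->
  </items>

  <!-- report engine configuration, near the end of the file -->
  <configuration>
    <property name="org.pentaho.reporting.engine.classic.core.modules.output.pageable.pdf.AssumeOverflowY">true</property>
  </configuration>
</report>
```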
Below I summarize these steps by modifying the Basic template shipped with the Pentaho demo solution.<br />
<ol><li>Go to <i><biserver_home>/pentaho-solutions/system/waqr/templates/Basic</i> and open <i>jfreereport-template.xml</i></li>
<li>Locate the configuration xml element near the end of the file. Add the following line as a child of the configuration element<br />
<pre style="background: none repeat scroll 0% 0% rgb(240, 240, 240); border: 1px dashed rgb(204, 204, 204); color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> <property name="org.pentaho.reporting.engine.classic.core.modules.output.pageable.pdf.AssumeOverflowY">true</property>
</code></pre></li>
<li>Add the attribute <i>dynamic = "true"</i> to the items element </li>
<li>Save the template and if the BI Server is running refresh the cache using the<i> Tools -> Refresh -> Repository Cache</i> menu entry</li>
</ol>Sergio Ramazzinahttp://www.blogger.com/profile/17602567469759786999noreply@blogger.com0tag:blogger.com,1999:blog-4547836498257352074.post-37900859507555954572010-08-26T01:05:00.002+02:002010-08-26T01:14:30.532+02:00Processing group of files executing a job once per fileHow many of us have ever needed to process a set of files with Kettle, executing jobs or transformations on a per-file basis? Let me give an example to illustrate a particular use case.<br />
<br />
Suppose we have two groups of files that we call <i>filegroup1 </i>and <i>filegroup2</i>, and suppose we have two groups of transformations that we call <i>transf_group1 </i>and <i>transf_group2</i>. The requirement is: we want to execute the transformations in <i>transf_group1 </i>once for each file in <i>filegroup1</i>, and as soon as the processing of this group finishes as a whole, we want to start executing the transformations in <i>transf_group2 </i>once for each file in <i>filegroup2</i>. Let me analyse how we can do that.<br />
<i><br />
</i><br />
<b>A little about some main Kettle topics</b><br />
<br />
Kettle processes information that flows through a path made of <i>steps</i>. The user takes the needed <i>steps </i>from a palette, drags them into the client area and builds a workflow. That flow is made up of different types of <i>steps</i>: we have input/output steps, transformation steps, scripting steps and so on. Each <i>step </i>has an input and one or more outputs. It gets the information flow as input from the immediately preceding step, processes it and outputs, as a result, a new set of information that flows into the immediately following step. The output flow produced by a step can have a layout of fields, in terms of number and data types, that differs from the input flow. A set of steps chained together to build a specific task is called a <i>transformation</i>. So <b>transformations = elementary tasks</b>, a sort of little reusable component that performs actions. A process is built by coordinating a set of orchestrated tasks that can be executed in sequence or in parallel. In Kettle this orchestrator role is filled by the <i>job</i>. The <i>job</i> orchestrates the execution of a set of <i>transformations </i>to build our complete ETL process. A job is made of a set of steps too, but their intended scope is to help orchestrate the execution of the tasks (transformations) in our process. As you can see, we have a <i>job steps palette</i>, but it contains only steps to check conditions or prepare the execution environment. The real work is done by the steps contained in the transformations.<br />
<br />
In our ETL processes made with Kettle we always have a main job, also called the root job, that we start to orchestrate the execution of nested jobs or transformations. We can nest as many levels of jobs and transformations as we want below that main job.<br />
<br />
<b>How Kettle starts nested jobs or transformations</b><br />
<br />
The first way of starting jobs or transformations in Kettle is the naive way: chain them together using the <i>Start transformation</i> or <i>Start job</i> steps and, when the nested transformation or job steps are reached in the owning job's flow, they are started in sequence or in parallel, depending on how they are connected. But sometimes we would like to execute a transformation or a job once for each row in the input flow. Doing that is really simple: go to the <i>step configuration </i>dialog, select the <i>Advanced</i> tab and check <i>Execute for every single row</i>. We see an example of that below in the <i>Start transformation</i> configuration dialog; you'll find the same setting in the <i>Start job</i> configuration dialog.<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjhyphenhyphenz6oRJliE-b60v9ZIHoRmk4Ud24x5uRDzhYXzJMBmLpzaEPA1Fg3V9TmX6ezXVdqcGYEIztXjHSeWD5insOAmqGcEbjkFqXau2f5nYuffYGxeJsvDxtU0ldGUr2oy4_7QoZd-PxotCFp/s1600/transf_config1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="184" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjhyphenhyphenz6oRJliE-b60v9ZIHoRmk4Ud24x5uRDzhYXzJMBmLpzaEPA1Fg3V9TmX6ezXVdqcGYEIztXjHSeWD5insOAmqGcEbjkFqXau2f5nYuffYGxeJsvDxtU0ldGUr2oy4_7QoZd-PxotCFp/s320/transf_config1.png" width="320" /></a></div><br />
<b>The job step <i>Add filenames to result</i> and why it isn't good for us</b><br />
<br />
So far so good. Now let's go back to our requirements. Because we said that we have two groups of transformations, <i>transf_group1 </i>and <i>transf_group2</i>, it is clear that we will have two jobs: one that chains all the transformations of <i>transf_group1</i>, and a second that chains all the transformations of <i>transf_group2</i>. We call them <i>job1 </i>and <i>job2</i> respectively. So we will have:<br />
<br />
a) A root job that chains together the two jobs <i>job1</i> and <i>job2</i>.<br />
b) Each job chains all the transformations of its respective group.<br />
c) Because the two jobs enclose the groups of transformations, we are sure that the second group of transformations will be executed only after the first group has been executed as a whole.<br />
<br />
Given what was explained above about starting a transformation from a job, to start the two jobs once per file we need a step that reads the list of files from a specified directory and fills the result with the set of complete filenames, so that the result can be used to start our job once per file. Because we are talking about two different file groups, we need two such steps, each chained before its respective job. Looking in the job steps palette we find a step that seems right for us: <i>Add filenames to result</i>. The picture below depicts a possible flow for our root job.<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhyeqTZlRCBN0wxgKMI7uyBF_hkGN5J_WOlyHhUW43UZp1s-Z0Pk137vGw5bd_EcgM2IVkMruRapvcnwn9ifkWfSkZC1_hNO_xkg9DH2Np_Uxu3BDIvErmER6FL88c-1Qh8c1sqkNcAYh-q/s1600/transf2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="85" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhyeqTZlRCBN0wxgKMI7uyBF_hkGN5J_WOlyHhUW43UZp1s-Z0Pk137vGw5bd_EcgM2IVkMruRapvcnwn9ifkWfSkZC1_hNO_xkg9DH2Np_Uxu3BDIvErmER6FL88c-1Qh8c1sqkNcAYh-q/s400/transf2.png" width="400" /></a></div>We're starting to smile but unfortunately this solutions is not applicable because it doesn't work. To understand the why we need to understand the difference between <i>row result</i> and <i>file result</i> for Kettle. Typically a file result is a set of filenames that can be used only by steps that are able to manage attachments. The Mail step is the one step that can manage such a result. Row results instead are made by real data typically as output of a transaction. If you look at the Kettle internals you can notice that a job step manages these two datasets as two completely separated collections. The important thing to note here is that whenever you check <i>Execute for every single row </i>in our job/transformation configuration your're saying that you'll start your job/transformation for each row of your row result. So way our solution isn't good for us? Because our <i>Add filenames to result</i> steps fill a file result so our jobs will never starts.<br />
<br />
<b>So what to do??</b><br />
<br />
The solution is to make a transformation whose only goal is to get the file list and use that list to populate the row result, as shown in the picture below.<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiY18cXv-VgIoHGkzSl0dv2tbZ8YDDuQXoaARxjIrZdlRc87w86RA2k7zyU-jni0IjTvu5nt8DgNPWBX6_yVdV60tHBdIoXkqaNqFGnK9Ac04NubgXLoq7-wLsEPqzSR8Jurieg7btflHWW/s1600/transf3.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="90" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiY18cXv-VgIoHGkzSl0dv2tbZ8YDDuQXoaARxjIrZdlRc87w86RA2k7zyU-jni0IjTvu5nt8DgNPWBX6_yVdV60tHBdIoXkqaNqFGnK9Ac04NubgXLoq7-wLsEPqzSR8Jurieg7btflHWW/s320/transf3.png" width="320" /></a></div>You need to call that transformation through a Transformation step in our root job chaining it before job1 and job2 to get respectively <i>filegroup1</i> and <i>filegroup2</i>. Here it is the complete layout of our definitive root job<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhegeiNNxY2-xViOFzZq8ImsVld4P-4lcAwRCAy8mwy5MqE5GklnyCmmP4GZtQ7T4KxDgAi_84MgNcOvTpeRBD__zQI2kJQpY0T15l2rDUKnadHc5_m2mmnzeJYm7YqD_ufrdZCUedzpg_I/s1600/transf4.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="45" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhegeiNNxY2-xViOFzZq8ImsVld4P-4lcAwRCAy8mwy5MqE5GklnyCmmP4GZtQ7T4KxDgAi_84MgNcOvTpeRBD__zQI2kJQpY0T15l2rDUKnadHc5_m2mmnzeJYm7YqD_ufrdZCUedzpg_I/s400/transf4.png" width="400" /></a></div><br />
You can see the two transformations that get the file lists before the two jobs. Using this approach, the result file list coming from each transformation fills the <i>row result</i>, and the job can be executed once per file present in our directory. Remember to check the magic flag <i>Execute for every single row </i>in the<i> Start job step </i>configuration, as detailed in the paragraph above, to correctly activate the jobs once per file.<br />
<br />
<b>How to execute the provided sample</b><br />
<br />
To execute my <a href="http://download.serasoft.it/public/filegroup_sample.zip">sample</a>, unzip the file into any directory. Edit the <i>get file list transformations</i> and change the <i>Get file names</i> step configuration to a directory and file pattern that exist on your PC. Now, if you start the root job, you can go through the log and clearly see the messages indicating that the job behaves as expected.<br />
<br />
<a href="http://download.serasoft.it/public/filegroup_sample.zip">Download from here the sample</a>Sergio Ramazzinahttp://www.blogger.com/profile/17602567469759786999noreply@blogger.com2tag:blogger.com,1999:blog-4547836498257352074.post-68626437114043784072010-08-25T17:52:00.002+02:002010-08-25T17:54:46.948+02:00Dealing with Kettle DBCacheI came from 4 weeks of summer holidays... First time in my life but it was beautiful to spend a lot of time with my family. I need to take some time for them more often.<br />
<br />
Anyway, today I was doing my usual work with <a href="http://wiki.pentaho.com/display/EAI/Latest+Pentaho+Data+Integration+(aka+Kettle)+Documentation">Kettle</a> when something strange happened that drove me crazy. I added a new field to my table and then went to Kettle to update the fields layout in my <a href="http://wiki.pentaho.com/display/EAI/Database+lookup">Database lookup step</a>. When I tried to find my new field in the fields list I was surprised... no new field appeared in the fields list for my table. I almost cried because I thought it had magically disappeared from my table, but after a quick check with my database client I saw my field was there, and I was happy. So what happened?<br />
<br />
<b>Cache database metadata with DBCache<br />
</b><br />
<i>DBCache </i>is a cache where recurring query results (about database metadata, for example) are saved the first time you access them. That information is persisted in a file called <i>db.cache</i> located in the <i>.kettle</i> directory below your home directory. The information is saved in row format, so you can have a look inside the file or edit its content.<br />
<br />
For example, every time you go through the Kettle database explorer looking at a table layout, the first time you access the table metadata the results are cached through <i>DBCache</i>, so that the second time you go through the cache, saving an access to the database. But if you forget that and you update your table DDL in any way (like me, after a veeeery long period of holidays), you could be surprised to see that your updates don't seem to be picked up by Kettle.<br />
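If you prefer the command line, you can also clear the cache by removing the file on disk while Spoon is closed; Kettle recreates it on the next metadata access. A minimal sketch, assuming the default <i>.kettle</i> location under your home directory:

```shell
# Locate and remove the Kettle metadata cache by hand.
# Assumes the default .kettle directory under $HOME (adjust the
# path if you have configured KETTLE_HOME to point elsewhere).
DB_CACHE="${KETTLE_HOME:-$HOME}/.kettle/db.cache"

ls -l "$DB_CACHE" 2>/dev/null || true   # inspect the cache file, if present
rm -f "$DB_CACHE"                       # Spoon rebuilds it on the next lookup
```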
<br />
How can we clear the cache and have our table metadata updated the next time we need to access it? Go to the <i>Tools </i>menu and choose <i>Database > Clear</i> cache, and a fresh set of table metadata will be fetched from the database.Sergio Ramazzinahttp://www.blogger.com/profile/17602567469759786999noreply@blogger.com0tag:blogger.com,1999:blog-4547836498257352074.post-33933641601971549112010-08-14T00:25:00.005+02:002010-08-25T17:53:50.793+02:00Blogging from my summer holidaysToday I was reading the latest news through my Twitter client and my attention was caught by all the tweets around the latest <a href="http://forums.pentaho.com/showthread.php?77828-Hey-Community-how-s-it-going">Doug post</a> on the Pentaho forum. So I immediately connected through my phone browser to read that post and think about it.<br />
<span class="Apple-style-span" style="font-family: inherit;"><br />
</span><br />
<span class="Apple-style-span" style="color: #333333;"><span class="Apple-style-span" style="font-family: inherit;">Doug's question was asking how the Pentaho community is going. Following what my Pentaho friends said in their answers to this thread I also agree that comparing downloads isn't a good metric to judge how a community is going. The spirit of a community is made by all the people that every day gives a precious help in coding, helps people understanding the product or coming out from problems (that they think they are hard but they aren't) and talks about the product.</span></span><span class="Apple-style-span" style="color: #333333;"><span class="Apple-style-span" style="font-family: inherit;"> </span></span><span class="Apple-style-span" style="color: #333333;"><span class="Apple-style-span" style="font-family: inherit;"><br />
</span></span><span class="Apple-style-span" style="color: #333333;"><span class="Apple-style-span" style="font-family: inherit;"><br />
</span></span><span class="Apple-style-span" style="color: #333333;"><span class="Apple-style-span" style="font-family: inherit;">I personally come from another experience with another important opensource community (not a BI product). From the outside, I remember, everything seemed wonderful. The product stayed for a very long time on top as the best project in one of the biggest open source forges. But from the inside everything was totally different and the approach was really cold.</span></span><span class="Apple-style-span" style="color: #333333;"><span class="Apple-style-span" style="font-family: inherit;"> </span></span><span class="Apple-style-span" style="color: #333333;"><span class="Apple-style-span" style="font-family: inherit;"><br />
</span></span><span class="Apple-style-span" style="color: #333333;"><span class="Apple-style-span" style="font-family: inherit;"><br />
</span></span><span class="Apple-style-span" style="color: #333333;"><span class="Apple-style-span" style="font-family: inherit;">Here may experience is totally different, I'm breathing a completely different air. I found real people that cooperates everyday helping each other to solve their everyday problems with the product or other related technologies. They support people through irc, the forums and the wiki. I'm a software architect so I decided to help mainly with code contributions, but also in the forums and in writing articles about some pentaho topics in my blog .</span></span><span class="Apple-style-span" style="color: #333333;"><span class="Apple-style-span" style="font-family: inherit;">I studied and I'm continuously studying the code and I'm working with wonderful guys: <a href="http://pedroalves-bi.blogspot.com/">Pedro</a>, <a href="http://pentahomusings.blogspot.com/">Tom</a>, P</span></span><span class="Apple-style-span" style="color: #333333;"><span class="Apple-style-span" style="font-family: inherit;">aul and many other.</span></span><span class="Apple-style-span" style="color: #333333;"><span class="Apple-style-span" style="font-family: inherit;"><br />
</span></span><span class="Apple-style-span" style="color: #333333;"><span class="Apple-style-span" style="font-family: inherit;"><br />
</span></span><span class="Apple-style-span" style="color: #333333;"><span class="Apple-style-span" style="font-family: inherit;">In our Pentaho community, the irc channel is the something I really enjoy but unfortunately I can participate only a few times because very often the various proxies broke my ability to connect. It's a sort of noisy room where anyone expose problems and quickly gets solutions. You always find someone available to support you. But it is also a place to talk with friends about everything. A sort of meeting place. That is wonderful.</span></span><span class="Apple-style-span" style="color: #333333;"><span class="Apple-style-span" style="font-family: inherit;"> </span></span><span class="Apple-style-span" style="color: #333333;"><span class="Apple-style-span" style="font-family: inherit;"><br />
</span></span><span class="Apple-style-span" style="color: #333333;"><span class="Apple-style-span" style="font-family: inherit;"><br />
</span></span><span class="Apple-style-span" style="color: #333333;"><span class="Apple-style-span" style="font-family: inherit;">I said friends not colleagues and this is a very important distinction. These are the things that makes me thinking that IT IS really a community that works and not the number of downloads or any other stupid indicator.</span></span><span class="Apple-style-span" style="color: #333333;"><span class="Apple-style-span" style="font-family: inherit;"> </span></span><span class="Apple-style-span" style="color: #333333;"><span class="Apple-style-span" style="font-family: inherit;"><br />
</span></span><span class="Apple-style-span" style="color: #333333;"><span class="Apple-style-span" style="font-family: inherit;"></span></span><span class="Apple-style-span" style="color: #333333; font-family: Verdana, Tahoma, Arial, Calibri, Geneva, sans-serif; font-size: 13px;"><br />
</span><span class="Apple-style-span" style="color: #333333; font-family: Verdana, Tahoma, Arial, Calibri, Geneva, sans-serif; font-size: small;"><span class="Apple-style-span" style="font-size: 13px;"><br />
</span></span>Sergio Ramazzinahttp://www.blogger.com/profile/17602567469759786999noreply@blogger.com0tag:blogger.com,1999:blog-4547836498257352074.post-62253896924690164582010-07-21T17:56:00.006+02:002010-07-21T18:01:33.001+02:00Setup Kettle debugging in 2 minutesSometimes it could be very interesting to be able to debug something to understand better how it works from an internal standpoint. My experience says that this is the case with the majority the opensource projects: sooner or later this is will happen.<br />
<br />
Today that happened with Kettle, while trying to check whether my assumptions were correct.<br />
<br />
So how can we easily debug Kettle? The answer is simple: use remote debugging. I'll explain everything in a minute. The development tool I'm using is <a href="http://www.jetbrains.com/index.html">IntelliJ IDEA</a>, but it is fairly simple to set everything up with <a href="http://www.eclipse.org/">Eclipse </a>or <a href="http://netbeans.org/">NetBeans</a>.<br />
<br />
<b>Setup Kettle to enable remote debugging</b><br />
<br />
1) If you're on <i>Windows</i>, open <i>Spoon.bat</i> and add the following line<br />
<br />
<pre style="background: url("https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgDHAMB4J8OAK-i25iG2QJN8oAwVuuG2J6uQ6uD8s6aPJYRfZD3hCJ2hj3prmjRkMBCwy8c1FE8PDauCtMNxpT6PqF37xdcbG0gp7nho236p7DhYj3hsoadvtFXmQB-_54Wk2ywwZmZnBZb/s320/codebg.gif") repeat scroll 0% 0% rgb(240, 240, 240); border: 1px dashed rgb(204, 204, 204); color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> set JAVA_OPTS=-Xdebug -Xrunjdwp:transport=dt_socket,address=8000,server=y,suspend=n
</code></pre><br />
On <i>Linux</i>, open <i>Spoon.sh</i> and add the following line<br />
<br />
<pre style="background: url("https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgDHAMB4J8OAK-i25iG2QJN8oAwVuuG2J6uQ6uD8s6aPJYRfZD3hCJ2hj3prmjRkMBCwy8c1FE8PDauCtMNxpT6PqF37xdcbG0gp7nho236p7DhYj3hsoadvtFXmQB-_54Wk2ywwZmZnBZb/s320/codebg.gif") repeat scroll 0% 0% rgb(240, 240, 240); border: 1px dashed rgb(204, 204, 204); color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> export JAVA_OPTS="-Xdebug -Xrunjdwp:transport=dt_socket,address=8000,server=y,suspend=n"
</code></pre><br />
I've used port 8000 as the socket port the debugger attaches to, but feel free to use the one you prefer.<br />
<br />
2) Go to the last line of <i>Spoon.bat</i> or <i>Spoon.sh</i> and update it as follows<br />
<br />
<i>Linux (Spoon.sh)</i>:<br />
<br />
<pre style="background: url("https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgDHAMB4J8OAK-i25iG2QJN8oAwVuuG2J6uQ6uD8s6aPJYRfZD3hCJ2hj3prmjRkMBCwy8c1FE8PDauCtMNxpT6PqF37xdcbG0gp7nho236p7DhYj3hsoadvtFXmQB-_54Wk2ywwZmZnBZb/s320/codebg.gif") repeat scroll 0% 0% rgb(240, 240, 240); border: 1px dashed rgb(204, 204, 204); color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> $JAVA_BIN $JAVA_OPTS $OPT $STARTUP -lib $LIBPATH "${1+$@}"
</code></pre><br />
<i>Windows (Spoon.bat)</i>:<br />
<br />
<pre style="background: url("https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgDHAMB4J8OAK-i25iG2QJN8oAwVuuG2J6uQ6uD8s6aPJYRfZD3hCJ2hj3prmjRkMBCwy8c1FE8PDauCtMNxpT6PqF37xdcbG0gp7nho236p7DhYj3hsoadvtFXmQB-_54Wk2ywwZmZnBZb/s320/codebg.gif") repeat scroll 0% 0% rgb(240, 240, 240); border: 1px dashed rgb(204, 204, 204); color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> start javaw %JAVA_OPTS% %OPT% -jar launcher\launcher.jar -lib %LIBSPATH% %_cmdline%
</code></pre><br />
<b>Configure IntelliJ IDE for remote debugging</b><br />
<br />
1) Configure a project for Kettle source tree<br />
2) Choose <i>Build > Edit</i> configurations from the menu<br />
3) The <i>Run/Debug Configurations </i>dialog opens. Choose <i>Add New Configuration</i> (the + sign in the upper left corner) and select <i>Remote </i>from the <i>Add New Configuration</i> list<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgqgy8HIckp2ALIa5_v0qqCz-bDOzriehBBeAPRMJT_8opyTQux_SZDfp2xKvebaDkz6gVHP3Xl29ungFfzgxFi31oAs2A0BZKnWomCdar5Vc57M6JfyLoyTborv90ALxTqjP0OmTNILJyV/s1600/debug_1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="406" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgqgy8HIckp2ALIa5_v0qqCz-bDOzriehBBeAPRMJT_8opyTQux_SZDfp2xKvebaDkz6gVHP3Xl29ungFfzgxFi31oAs2A0BZKnWomCdar5Vc57M6JfyLoyTborv90ALxTqjP0OmTNILJyV/s640/debug_1.png" width="640" /></a></div><br />
4) The <i>Remote Configuration</i> dialog opens<br />
<br />
<div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiffVqrg0uoHXICpPh1H43HEHptCtXPxil6NljLcWtwBlEDL33GCL2w3bJO4R8GhOwRpGlSKBxl1rssM5ohBiZh4r9Fb2elUDZy7hYC6z1mmYvEvE_kcnjFjVQqgwhyphenhyphenSqHlmgnT-YoMU_-t/s1600/debug_2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="406" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiffVqrg0uoHXICpPh1H43HEHptCtXPxil6NljLcWtwBlEDL33GCL2w3bJO4R8GhOwRpGlSKBxl1rssM5ohBiZh4r9Fb2elUDZy7hYC6z1mmYvEvE_kcnjFjVQqgwhyphenhyphenSqHlmgnT-YoMU_-t/s640/debug_2.png" width="640" /></a></div><br />
Configure the <i>Transport</i> and <i>Debugger mode</i> options as displayed in the image above. Then set the <i>Host address</i> (localhost in this case, or wherever you need) and the <i>port </i>(8000 in my case) and press OK<br />
<br />
<b>And now let's dance...</b><br />
<br />
So now start Kettle. As soon as Kettle has started, load your job or transformation and, before running it, launch the debugger from your IDE. As soon as the debugger has attached, you're ready to launch your job/transformation and begin your investigation. The advantage of this approach is the ability to also debug remotely running Kettle instances, which is sometimes useful when investigating problems.<br />
<br />
Have fun playing with your Kettle and see you in the next article.Sergio Ramazzinahttp://www.blogger.com/profile/17602567469759786999noreply@blogger.com0tag:blogger.com,1999:blog-4547836498257352074.post-78523701614803348692010-07-19T17:16:00.007+02:002010-07-21T17:01:02.277+02:00CDF internationalization early access release<div class="separator" style="clear: both; text-align: center;"></div><br />
Today I finally completed the first implementation of CDF internationalization support. It is my first big contribution to CDF and I'm proud I had the opportunity to work on such an important missing feature.<br />
This is a first release, a sort of early access release, and it is available after building the framework sources from the repository trunk. I hope everyone will share ideas or suggestions on this feature to help make it better.<br />
<br />
<b>How does it work</b><br />
CDF internationalization uses a jQuery i18n plugin to handle internationalized messages. The plugin has been wrapped in the CDF framework so that the user never has to bother about jQuery specifics and can simply use the support CDF provides. Everything revolves around resource message files, as usual on the Java platform.<br />
<br />
To create resource files we need to create a file for each locale our dashboard is going to support. These files are named <i><name>.properties</i>, <i><name>_<language>.properties</i> or<i> <name>_<language>_<country>.properties</i>. For instance, a resource bundle for UK English will be <i>MessagesBundle_en_GB.properties</i>. If a message file is not present, the framework will display the default message string and you'll get a warning in the BI server message console (nothing to worry about, but it would be better to provide the missing resource file).<br />
The <i><language></i> argument is a valid ISO Language Code. These codes are the lower-case, two-letter codes as defined by ISO-639. You can find a full list of these codes at a number of sites, such as: <a href="http://www.loc.gov/standards/iso639-2/englangn.html">http://www.loc.gov/standards/iso639-2/englangn.html</a>.<br />
The <i><country></i> argument is a valid ISO Country Code. These codes are the upper-case, two-letter codes as defined by ISO-3166. You can find a full list of these codes at a number of sites, such as: <a href="http://www.iso.ch/iso/en/prods-services/iso3166ma/02iso-3166-code-lists/list-en1.html">http://www.iso.ch/iso/en/prods-services/iso3166ma/02iso-3166-code-lists/list-en1.html</a>.<br />
<br />
<b>How resource bundles are managed in CDF</b><br />
We have two levels of message files in CDF:<br />
<ul><li><b>Global message files</b>: located in <i><biserver_home>/pentaho_solutions/system/pentaho_cdf/resources/languages</i> and useful when used in dashboard global templates.</li>
<li><b>Dashboard specific message files</b>: located in <i><dashboard_home></i>, they can be distributed together with a single dashboard as a sort of "package". To enable dashboard-specific message file support, we have to add a <messages> element to the xcdf configuration file whose value is the base name of the dashboard-specific message file.</li>
</ul><br />
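As an illustrative sketch (the base name and keys below are invented, not taken from a real dashboard): with <messages>messages</messages> in the xcdf file, an Italian dashboard-specific bundle would be a plain Java properties file like this, sitting in the dashboard's folder:

```properties
# messages_it.properties -- Italian bundle for a dashboard whose xcdf
# declares <messages>messages</messages> (base name "messages").
# Key names are invented for illustration.
chart.sales.title=Andamento vendite
label.region=Regione
```

A default <i>messages.properties</i> with the same keys in English would then act as the fallback when no locale-specific file matches.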
<b>The CDF.i18n tag: internationalization support in HTML dashboard templates</b><br />
To use the internationalization support in dashboard HTML templates, we implemented a tag that gets a message string from the resource bundle so it can be used wherever required in the HTML template file. This is useful for descriptions, annotations, generic labels and the like.<br />
<br />
We can use the <i>CDF.i18n</i> tag using the following syntax:<br />
<br />
<i>CDF.i18n(<message_key>)</i><br />
<i><br />
</i><br />
where <i><message_key></i> is the message key located in the message resource bundle files.<br />
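For instance, a label in the HTML template could be written as below; the framework substitutes the tag with the localized string for the user's locale. The key name here is hypothetical, chosen only for illustration:

```html
<!-- fragment of a dashboard HTML template; "dashboard.title" is a
     hypothetical key defined in the message resource bundles -->
<h2>CDF.i18n(dashboard.title)</h2>
```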
<br />
<b>Internationalization support in CDF components</b><br />
Because this first release is at an early preview stage, for the moment only the JFreeChart component supports internationalization. In this case the support applies to the chart's title attribute.<br />
To support an internationalized <i>title </i>in a JFreeChart component, we added a new <i>titleKey</i> metadata attribute whose value is the message key for the chart title, looked up in the resource bundle. Complete internationalization support for all CDF components (where needed) will come shortly, so be patient.<br />
<br />
<b>And here we go...</b><br />
For anyone who wants a first look at this, the bi-developer solution in the repository trunk has a fully working example you can inspect and experiment with. Below are two screenshots of the sample dashboard I internationalized in Italian. I hope you enjoy playing with this first release, and stay tuned here for the next updates from the field.<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi1dqlCxh8XAmJf7yR81hgoNMAd8kV1HFIyp-OraQyG3GJ-QejsPMIez7I7TgUyt9WpCngUCRd9DOHVZrr-b_kOicNlTKMJDQqLohtS6-LJ2UMNAh_Efog1Wh9bpaHFhfc7tR113PY-McWy/s1600/cdf_intern_1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi1dqlCxh8XAmJf7yR81hgoNMAd8kV1HFIyp-OraQyG3GJ-QejsPMIez7I7TgUyt9WpCngUCRd9DOHVZrr-b_kOicNlTKMJDQqLohtS6-LJ2UMNAh_Efog1Wh9bpaHFhfc7tR113PY-McWy/s640/cdf_intern_1.png" width="640" /></a></div><div><br />
</div><br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEga0iis4NeGWud5TgcdE7pD4SJrY0N6Otcxs_ZW8AfxxY8WpHqakk8XaW4qjCvMc4vQAhqtqkooy3QYQ7AhORjAESHUenqopL3Ss6xm2axNoqcJEDyI7TzTSufkDNepKY1wPmP08VkWOEo0/s1600/cdf_intern_2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEga0iis4NeGWud5TgcdE7pD4SJrY0N6Otcxs_ZW8AfxxY8WpHqakk8XaW4qjCvMc4vQAhqtqkooy3QYQ7AhORjAESHUenqopL3Ss6xm2axNoqcJEDyI7TzTSufkDNepKY1wPmP08VkWOEo0/s640/cdf_intern_2.png" width="640" /></a></div><div><br />
</div>Sergio Ramazzinahttp://www.blogger.com/profile/17602567469759786999noreply@blogger.com4tag:blogger.com,1999:blog-4547836498257352074.post-15832590815104133502010-07-12T12:26:00.004+02:002010-07-12T12:39:11.578+02:00Joining and appending output datasets with CompoundDataAccessAn interesting topic about CDA is the ability to apply join and union constructs to our incoming datasets using a <i>CompoundDataAccess</i> element. This definition is a sort of extension of the <i>DataAccess </i>element and, through a <i>type </i>attribute, gives us the ability to<i> join</i> or <i>append </i>two given <i>DataAccess </i>dataset outputs, forming a new one.<br />
<br />
<b>How does <i>CompoundDataAccess </i>work internally</b><br />
<br />
It is really interesting to have a look at the sources and see how things really work: <br />
<ul><li>when you want to join the two <i>DataAccess </i>outputs, CDA builds a little Kettle transformation and uses the Merge Join step to do the required join operation</li>
<li>when you want to append the two <i>DataAccess </i>outputs, CDA simply appends the two datasets by itself, following the order you specified in the configuration file.</li>
</ul><br />
Remember that in either case you can apply these operations to only two datasets at a time for each <i>CompoundDataAccess </i>element definition.<br />
<br />
<b>CompoundDataAccess basics</b><br />
<br />
To define a new <i>CompoundDataAccess </i>element you must specify<br />
<ul><li>an <i>id </i>attribute, to externally identify this <i>DataAccess </i>element</li>
<li>a <i>type</i> attribute that accepts two values, <i>union</i> or <i>join</i>, depending on the type of composition you're looking for.</li>
</ul>Because <i>CompoundDataAccess </i>is only an extension of the <i>DataAccess </i>element, we can apply the <i>Parameter</i>, <i>Columns</i>, <i>CalculatedColumns </i>and <i>Output </i>elements following the same rules specified for <i>DataAccess </i>in my previous post <a href="http://ramathoughts.blogspot.com/2010/07/cda-configuration-files-basics-part-2.html">here</a>.
<br />
Moreover, because the <i>CompoundDataAccess </i>element works on resulting <i>DataAccess </i>datasets, we have to define the two needed <i>DataAccess </i>elements first. It is not possible to use <i>DataAccess </i>elements defined in external CDA files (though that would be a nice feature to have, and I may think about implementing it).<br />
<br />
<b>Joining dataset outputs</b><br />
To join two datasets we have to define a <i>CompoundDataAccess </i>element with the <i>type </i>attribute set to <i>join</i>. The join type executes a <b>FULL OUTER JOIN</b> on the two input datasets. Required elements are:<br />
<br />
<ul><li><i>Left element definition</i>, to specify a dataset for the left part of the join,</li>
<li><i>Right element definition</i>, to specify a dataset for the right part of the join.</li>
</ul>Each of the two elements accepts the following attributes:<br />
<br />
<ol><li><i>id</i>, set to the id of the <i>DataAccess </i>element definition you are going to use as the left/right dataset</li>
<li><i>keys</i>, a comma-separated list of columns to be used as the join key. Remember that the key columns are specified <i><b>by their position in the DataAccess query, not by name</b></i>.</li>
</ol><br />
Below is an example to help you fully understand this concept. Suppose we have defined the following two <i>DataAccess</i> definitions in our CDA file<br />
<br />
<pre style="background: url("https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgDHAMB4J8OAK-i25iG2QJN8oAwVuuG2J6uQ6uD8s6aPJYRfZD3hCJ2hj3prmjRkMBCwy8c1FE8PDauCtMNxpT6PqF37xdcbG0gp7nho236p7DhYj3hsoadvtFXmQB-_54Wk2ywwZmZnBZb/s320/codebg.gif") repeat scroll 0% 0% rgb(240, 240, 240); border: 1px dashed rgb(204, 204, 204); color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> <DataAccess id="1" connection="1" type="sql" access="private" cache="true" cacheDuration="300">
<Name>Sql Query on SampleData - Jdbc</Name>
<Query>
select o.YEAR_ID, o.STATUS as status, sum(o.TOTALPRICE) as totalprice from orderfact o
group by o.YEAR_ID, o.STATUS
</Query>
.
.
.
.
</DataAccess>
<DataAccess id="2" connection="1" type="sql" access="public" cache="true" cacheDuration="5">
<Name>Sql Query on SampleData</Name>
<Query>
select o.YEAR_ID, o.status, sum(o.TOTALPRICE * 3) as tripleprice from orderfact o
where o.STATUS = ${status} and o.ORDERDATE &gt; ${orderDate}
group by o.YEAR_ID, o.STATUS
order by o.YEAR_ID DESC, o.STATUS
</Query>
.
.
.
</DataAccess>
</code></pre><br />
and suppose in the same cda file we define this CompoundDataAccess element.<br />
<br />
<pre style="background: url("https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgDHAMB4J8OAK-i25iG2QJN8oAwVuuG2J6uQ6uD8s6aPJYRfZD3hCJ2hj3prmjRkMBCwy8c1FE8PDauCtMNxpT6PqF37xdcbG0gp7nho236p7DhYj3hsoadvtFXmQB-_54Wk2ywwZmZnBZb/s320/codebg.gif") repeat scroll 0% 0% rgb(240, 240, 240); border: 1px dashed rgb(204, 204, 204); color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> <CompoundDataAccess id="3" type="join">
<Left id="1" keys="0,1"/>
<Right id="2" keys="0,1"/>
<Columns>
<CalculatedColumn>
<Name>PriceDiff</Name>
<Formula>=[TRIPLEPRICE]-[TOTALPRICE]</Formula>
</CalculatedColumn>
</Columns>
<Parameters>
<Parameter name="status" type="String" default="Shipped"/>
<Parameter name="orderDate" type="Date" pattern="yyyy-MM-dd" default="2003-03-01"/>
</Parameters>
<Output indexes="0,1,2,5,6"/>
</CompoundDataAccess>
</code></pre><br />
As you can see, the <i>CompoundDataAccess </i>element has two child elements, <i>Left </i>and <i>Right</i>, to indicate the two sides of the join. The left one uses the <i>DataAccess </i>with id 1 (the first one above) and the right one the <i>DataAccess </i>with id 2. Then you see the comma-separated list that defines the positions of the key columns in the related DataAccess query: looking at the two queries, both elements use the <i>YEAR_ID</i> and <i>STATUS </i>columns as keys. The joined dataset presumably lays out the left columns first, then the right ones, followed by any calculated columns, which is how the <i>Output </i>element can pick indexes up to 6.<br />
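The position-based keys can trip people up, so here is a small illustrative sketch (plain Python, not CDA code) of how <i>keys="0,1"</i> maps to columns in each query above:

```python
# Column lists as produced by the two DataAccess queries above.
left_cols = ["YEAR_ID", "status", "totalprice"]    # DataAccess id="1"
right_cols = ["YEAR_ID", "status", "tripleprice"]  # DataAccess id="2"

# keys="0,1" means: join on the columns at positions 0 and 1 of each query.
key_positions = [0, 1]

left_keys = [left_cols[i] for i in key_positions]
right_keys = [right_cols[i] for i in key_positions]
print(left_keys, right_keys)
```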
<br />
<b>Appending dataset outputs</b><br />
To append two datasets we have to define a <i>CompoundDataAccess </i>element with the <i>type </i>attribute set to <i>union</i>. The union type appends two <i>DataAccess </i>output datasets, whose order you give using two configuration elements:<br />
<ul><li><i>Top</i>, to specify the first dataset,</li>
<li><i>Bottom</i>, to specify the second dataset.</li>
</ul><br />
Each of the two elements accepts the following attribute:<br />
<ol><li><i>id</i>, set to the id of the <i>DataAccess </i>element definition you are going to use as the top/bottom dataset</li>
</ol>Below is a sample configuration using a compound union. The <i>DataAccess </i>elements used for this sample are the same ones defined in the previous sample for the join type.<br />
<br />
<pre style="background: url("https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgDHAMB4J8OAK-i25iG2QJN8oAwVuuG2J6uQ6uD8s6aPJYRfZD3hCJ2hj3prmjRkMBCwy8c1FE8PDauCtMNxpT6PqF37xdcbG0gp7nho236p7DhYj3hsoadvtFXmQB-_54Wk2ywwZmZnBZb/s320/codebg.gif") repeat scroll 0% 0% rgb(240, 240, 240); border: 1px dashed rgb(204, 204, 204); color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> <CompoundDataAccess id="3" type="union">
<Top id="2"/>
<Bottom id="1"/>
<Parameters>
<Parameter name="year" type="Numeric" default="2004"/>
</Parameters>
</CompoundDataAccess>
</code></pre>Sergio Ramazzinahttp://www.blogger.com/profile/17602567469759786999noreply@blogger.com0tag:blogger.com,1999:blog-4547836498257352074.post-90552976946496573412010-07-09T23:29:00.001+02:002010-07-12T12:32:25.246+02:00CDA configuration files basics (part 2)Last time we started talking about <a href="http://ramathoughts.blogspot.com/2010/04/cda-configuration-files-basics.html">the basics of CDA configuration</a>. Today we'll continue our series of articles on CDA, looking more closely at the configuration of the DataAccess element.<br />
<br />
<b>A little recap</b><br />
<br />
CDA is a BI Platform plugin that allows you to fetch data in various formats. It works by defining a configuration file where we declare connections to our available datasources and a number of data access objects. The data access objects define the strategy used to access the data through a specified connection. In our previous post we talked briefly about how to configure Connections and gave a brief overview of the <i>DataAccess</i> element. Today we will complete our introduction to the basics of CDA <i>DataAccess </i>configuration.<br />
<br />
<b>Adding parameters to our DataAccess element</b><br />
<br />
The <i>Parameters </i>element, nested inside the <i>DataAccess </i>element, lets you define the set of parameters required to get the data from your connection. Every parameter is associated with a <i>Parameter </i>element which accepts at least the following attributes:<br />
<br />
<ul><li><i>name</i>, to specify the parameter name.</li>
<li><i>type</i>, lets you specify the parameter type. Allowed types are String, Date, Integer and Numeric.</li>
<li><i>pattern</i>, lets you specify a pattern that helps CDA interpret the parameter's value as entered by the user. Typical uses of this attribute are date or numeric parameters.</li>
<li><i>default</i>, lets you specify a default value for the parameter, so that it is automatically applied whenever we don't supply a value.</li>
</ul><br />
Below is a sample <i>DataAccess </i>configuration showing some of these concepts<br />
<br />
<pre style="background: url("https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgDHAMB4J8OAK-i25iG2QJN8oAwVuuG2J6uQ6uD8s6aPJYRfZD3hCJ2hj3prmjRkMBCwy8c1FE8PDauCtMNxpT6PqF37xdcbG0gp7nho236p7DhYj3hsoadvtFXmQB-_54Wk2ywwZmZnBZb/s320/codebg.gif") repeat scroll 0% 0% rgb(240, 240, 240); border: 1px dashed rgb(204, 204, 204); color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> <DataAccess id="1" connection="1" type="sql" access="public" cache="true" cacheDuration="300">
.
.
<Query>
select o.YEAR_ID, o.STATUS, sum(o.TOTALPRICE) as price from orderfact o
where o.STATUS = ${status} and o.ORDERDATE &gt; ${orderDate}
group by o.YEAR_ID, o.STATUS
</Query>
.
.
<Columns>
<Column idx="0">
<Name>Year</Name>
</Column>
<CalculatedColumn>
<Name>PriceInK</Name>
<Formula>=[PRICE]/1000000</Formula>
</CalculatedColumn>
</Columns>
.
.
<!-- Output controls what is outputed and by what order. Defaults to everything -->
<Output indexes="1,0,2,3"/>
.
.
</DataAccess>
</code></pre><br />
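Note that the elided part of the sample above would also contain the <i>Parameters</i> declaration; since the query uses <i>${status}</i> and <i>${orderDate}</i>, it would look something like this (default values are illustrative):

```xml
<Parameters>
    <Parameter name="status" type="String" default="Shipped"/>
    <Parameter name="orderDate" type="Date" pattern="yyyy-MM-dd" default="2003-03-01"/>
</Parameters>
```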
For your reference I've also shown the query definition. Look at the <i>Columns </i>element declaration: you can see that we're renaming the column at ordinal 0 from <i>"YEAR_ID"</i> to<i> "Year"</i>. To do that we used a Column element where we set the new name.<br />
Then, using the <i>CalculatedColumn </i>element, we added a new calculated column named <i>PriceInK</i>.<br />
Consider that:<br />
<ul><li>New calculated columns are always added at the end of the column set</li>
<li>CDA uses the Pentaho Reporting libraries to apply formulas, which means you identify stream columns in your formula using square-bracket notation, and you can use all the constructs found in those libraries.</li>
</ul>Last but not least, we changed the positional layout of the output columns with an Output element. In the <i>indexes </i>attribute we set the column positions using column indexes. You have to consider two things here<br />
<br />
<ul><li>The<b> total column count</b> is 4, because 3 columns came from the query and one is the calculated column appended at the end of the stream</li>
<li>The index of the first, <b>leftmost column is 0 </b>and NOT 1</li>
</ul><br />
<br />
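To make the zero-based reordering concrete, here is a tiny illustrative sketch (plain Python, not CDA code) of what <i>indexes="1,0,2,3"</i> does to the four columns above:

```python
# The four columns after renaming YEAR_ID and adding the calculated column.
columns = ["Year", "STATUS", "price", "PriceInK"]

# Output indexes="1,0,2,3" -- zero-based positions into the column list.
indexes = [1, 0, 2, 3]

reordered = [columns[i] for i in indexes]
print(reordered)
```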
Sergio Ramazzinahttp://www.blogger.com/profile/17602567469759786999noreply@blogger.com0tag:blogger.com,1999:blog-4547836498257352074.post-66810823437480760642010-04-28T16:24:00.011+02:002010-09-08T11:58:00.369+02:00Easily build BI Server from behind an NTLM proxyPentaho projects rely on <a href="http://ant.apache.org/ivy/">Apache Ivy</a> to manage project dependencies. This is good because it gives you a consistent and uniform way to manage project dependencies, but it gives you some trouble if you need to build one of the many Pentaho projects from behind an NTLM proxy.<br />
<br />
So here is a brief and concise survival guide on how to do that quickly and efficiently.<br />
<br />
<b>Install a local proxy on your development computer</b><br />
<br />
First of all, download <a href="http://cntlm.sourceforge.net/">cntlm</a>, a local proxy that lets you easily authenticate through your company's or customer's NTLM proxy. There are versions for Windows and Linux. I learned about it two years ago when I was on Linux and had to manage some Maven projects. Now I'm on Windows, and the good news for me is that it has a version for this platform too (I also tried using <a href="http://sourceforge.net/projects/ntlmaps/files/">ntlmaps</a> but I wasn't able to get it working properly with Ivy).<br />
<br />
Installation is really simple but, on Windows, requires a tweak we will talk about later. Download the .zip from <a href="http://sourceforge.net/projects/cntlm/files">sourceforge</a> and unzip it wherever you like. Execute <i>setup.bat</i> and it will rapidly install the local proxy as a Windows service. <br />
<br />
Go to the installation directory (typically <i>C:\Program Files\cntlm</i>) and edit <i>cntlm.ini</i>. Replace the information for<i> username, password, domain</i> and your <i>proxy hosts and ports</i> as detailed in the file excerpt given below <br />
<br />
<pre style="background-color: #eeeeee; border: 1px dashed rgb(153, 153, 153); color: black; font-family: Andale Mono,Lucida Console,Monaco,fixed,monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code>#
# Cntlm Authentication Proxy Configuration
#
# NOTE: all values are parsed literally, do NOT escape spaces,
# do not quote. Use 0600 perms if you use plaintext password.
#
Username <your_username>
Domain <your_domain>
Password <your_password> # Use hashes instead (-H)
#Workstation netbios_hostname # Should be auto-guessed
Proxy <1st_proxy_host>:<1st_proxy_port>
Proxy <2nd_proxy_host>:<2nd_proxy_port>
#
# This is the port number where Cntlm will listen
#
Listen 3128
</code></pre><br />
Make the changes and save the file. Now, before starting the local proxy service, here is the tweak. In my case, the <i>setup.bat</i> script forgot to set a service startup parameter that tells cntlm where its configuration file is located. To work around that, open your registry and go to the following key: <i>HKEY_LOCAL_MACHINE\SYSTEM\ControlSet001\services\cntlm\Parameters</i>. Look for the <i>AppArgs</i> attribute and modify it, adding the parameter <i>-i "C:\Program Files\Cntlm\cntlm.ini"</i> as shown below<br />
<br />
<div class="separator" style="clear: both; text-align: center;"></div><br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiAQ0ULiUAhB_BACx5dd-kJ3WZGmkk_yotnZJM9O4Su-VMwagla6XxkNueD3k6vUipy3iazjy_e6V5MRyqA_o7w6xSNVO0IdvcRcp3qtRNGXJ2Xj3AlYMBTVkMEHTlyHEReVUoCH5pOWqLL/s1600/cntlm_svc.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="48" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiAQ0ULiUAhB_BACx5dd-kJ3WZGmkk_yotnZJM9O4Su-VMwagla6XxkNueD3k6vUipy3iazjy_e6V5MRyqA_o7w6xSNVO0IdvcRcp3qtRNGXJ2Xj3AlYMBTVkMEHTlyHEReVUoCH5pOWqLL/s320/cntlm_svc.png" width="320" /></a></div><br />
After that you're ready to start your service successfully; without this tweak you'll get an error.<br />
<br />
<b>Back to our Pentaho's project build</b><br />
<br />
In my case I was building the newest BI Server 3.6, fetched from the related branch of the Pentaho repository (<i><a href="http://source.pentaho.org/svnroot/bi-platform-v2/branches/3.6">http://source.pentaho.org/svnroot/bi-platform-v2/branches/3.6</a></i>). To let Ivy use your locally configured authentication proxy, you have to set the following environment variable before starting to compile the project<br />
<br />
<pre style="background-color: #eeeeee; border: 1px dashed rgb(153, 153, 153); color: black; font-family: Andale Mono,Lucida Console,Monaco,fixed,monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 100%;"><code>ANT_OPTS=-Dhttp.proxyHost=localhost -Dhttp.proxyPort=3128</code></pre><br />
It is up to you to decide where to set that variable, either as a Windows environment variable or from within a command-line session (if you're starting the build manually).<br />
<br />
<b>And what about Linux users...</b><br />
<br />
If you're on Linux you can install cntlm using the package manager of your favorite distro. I used it on Ubuntu and Fedora without any problem. The configuration file is the same as before but is located in <i>/etc/cntlm.conf</i>. You have to change the same parameters detailed above for <i>cntlm.ini</i> (the file is exactly the same). Start the daemon, set the <i>ANT_OPTS</i> environment variable and have fun with your build.Sergio Ramazzinahttp://www.blogger.com/profile/17602567469759786999noreply@blogger.com0tag:blogger.com,1999:blog-4547836498257352074.post-70921364231190396512010-04-21T11:33:00.003+02:002010-07-09T23:35:26.258+02:00CDA configuration files basics (part 1)Last time I gave a<a href="http://ramathoughts.blogspot.com/2010/03/first-look-into-pentaho-cda.html"> brief explanation of the basics of CDA</a> compilation and deployment, and showed you how to execute a simple query on our datamart through CDA, getting back a result set formatted as expected. I decided to take you toward a fully functional sample by going through a set of explanatory steps summarized below:<br />
<ul><li>Explain the structure of CDA files and learn how to built a new one</li>
<li>Show the basic verbs you can use to interact with CDA</li>
<li>Put everything together with a working example</li>
</ul>Before you start playing with CDA, remember to get the <a href="http://code.google.com/p/pentaho-cda/">latest CDA sources</a>, compile them and reinstall the plugin as explained in my first post on this blog.<br />
<br />
CDA files have a very basic structure. They are easy to learn, even for a novice user with a basic knowledge of XML. Basically, a CDA file contains definitions for a few important elements<br />
<ol><li>datasources</li>
<li>data access</li>
<li>compound data access </li>
</ol><br />
<b>The DataSource element</b><br />
<br />
Datasources in CDA have the usual meaning of a named connection to a database server. The name (in our case the connection id) is commonly used when creating a query against the database.<br />
<br />
Each DataSource is made up of a set of connections defined as follows:<br />
<ul><li>each has an id attribute to uniquely identify the connection in this CDA datasource definition. Of course we cannot have two connections with the same id.</li>
<li>each has a type attribute whose value depends on the connection type. We have different connection types, one for each datasource type: <i>sql, mondrian, Pentaho Metadata (MQL), Olap4J</i></li>
<li>a set of other elements that differ depending on the connection type</li>
</ul>We can have one or more connections in the DataSource element, according to our needs. <br />
<ul></ul><i>Configuring connections to relational sources </i><br />
<br />
<br />
If we look at the samples that come with the latest CDA distribution, there is a sample for each of the possible connection types. Open the PUC console and go to the folder<i> bi-developer/cda/cdafiles</i>; you'll see the list of CDA definitions shown below in the files window. As you can see, every CDA file is identified by a specific icon.<br />
<br />
First interesting point: to open a file we have a <i>CDA file editor</i> integrated into the Pentaho console. Cool! So to open a CDA file, select it, right-click on the file name and select Edit.<br />
<br />
<div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhjF-CWxi9IJBT_rnNqmOnVdBF-o40NpuwtBueV2eNGezzBjavhVkAI-Fy0LY-s0mAXkcn6M8RzaCvpFAUjz4mwz4bQWHpkoc1DwmJzOIJRy70SkUwlYcr9tre4ecHbCCdQgjE13OjnWOut/s1600/cda1.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhjF-CWxi9IJBT_rnNqmOnVdBF-o40NpuwtBueV2eNGezzBjavhVkAI-Fy0LY-s0mAXkcn6M8RzaCvpFAUjz4mwz4bQWHpkoc1DwmJzOIJRy70SkUwlYcr9tre4ecHbCCdQgjE13OjnWOut/s320/cda1.png" /></a></div><br />
<div class="separator" style="clear: both; text-align: center;"></div><br />
Let's have a look at how to connect to a relational datasource. We have two samples for that: one connecting through JDBC and another through a JNDI datasource. Open <i>sql-jdbc.cda</i> for example (the JDBC one). As you can see, the editor opens in the right area of the PUC console and shows you the file content. The CDA editor is the standard tool to modify CDA files in a quick and easy way. You can modify your file, save it and also test it using the preview functionality (we'll talk about it later). Looking at the file, immediately after the <i>CDADescriptor</i> element you'll find the <i>DataSource </i>element and, as its children, the connection definitions.<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhc1q0f3bFewRPgECB03_PhzGwoZ8MKUUHvDMroWg7kIFqiVXpFmlbjphrrnGAxfmV_DrBQxVORa5-F8y5GNhacfaF4PRVjB7Guc80XR5_utoL2N3aqT3O3BlewpqwRzJWfcNrhqv3QA8ek/s1600/cda2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="336" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhc1q0f3bFewRPgECB03_PhzGwoZ8MKUUHvDMroWg7kIFqiVXpFmlbjphrrnGAxfmV_DrBQxVORa5-F8y5GNhacfaF4PRVjB7Guc80XR5_utoL2N3aqT3O3BlewpqwRzJWfcNrhqv3QA8ek/s640/cda2.png" width="640" /></a></div><br />
As you can see, there is a connection with <i>id="1"</i> and <i>type="sql.jdbc"</i> and then a set of elements describing properties directly related to the selected connection type (in this case the definition of a JDBC connection, so we have elements for <i>driver, url, username </i>and <i>password</i>). Take your time to go through all the other samples and see the differences between connection types.<br />
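Since the screenshot may be hard to read, here is a hand-typed sketch of such a connection definition (element names and values from memory, so double-check them against your own <i>sql-jdbc.cda</i>):

```xml
<CDADescriptor>
    <DataSources>
        <Connection id="1" type="sql.jdbc">
            <Driver>org.hsqldb.jdbcDriver</Driver>
            <Url>jdbc:hsqldb:hsql://localhost:9001/sampledata</Url>
            <User>pentaho_user</User>
            <Pass>password</Pass>
        </Connection>
    </DataSources>
    <!-- DataAccess definitions follow here -->
</CDADescriptor>
```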
<br />
<b>The DataAccess element</b><br />
<br />
The DataAccess element contains the definition of the query that needs to be executed, the parameters the query accepts and some other side elements that we will explore in the next steps.<br />
<br />
A DataAccess element:<br />
<ul><li> has an<i> id</i>, </li>
<li>has a <i>descriptive name</i> (given through a child element), </li>
<li><b><i>links a specific connection id defined in the previous DataSource section</i></b>, </li>
<li>contains a<i> query element</i>,</li>
<li>contains a set of given <i>parameters </i>(not mandatory),</li>
<li>contains a set of other non-mandatory definitions we will go through in the next steps. </li>
</ul>We can have one or more DataAccess elements in our CDA file. <br />
So let's go back to our example (sql-jdbc.cda) and continue our exploration.<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhlAfe8XAX-33paaXLMAzJkwhZ1DnSyTkEMJIstio23Nu_Zx3qXJg3nICtV2Dk8W0GzZpveB4FVs99VfGwoWjNGZKevmHfw0iInkyTxg3trE5Wq63H8WO3JFpoptg_tklFQQtsaTVU-FFmk/s1600/cda3.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="364" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhlAfe8XAX-33paaXLMAzJkwhZ1DnSyTkEMJIstio23Nu_Zx3qXJg3nICtV2Dk8W0GzZpveB4FVs99VfGwoWjNGZKevmHfw0iInkyTxg3trE5Wq63H8WO3JFpoptg_tklFQQtsaTVU-FFmk/s640/cda3.png" width="640" /></a></div><br />
As you can see in the picture above we've some things to be noted:<br />
<ul><li>Remember to give an <i>id</i> and to link the connection defined in the <i>DataSources </i>section </li>
<li>Because we're writing text as the value of an XML element, remember that the comparison operators<i> < ></i> have to be written as <i>&lt;</i> and <i>&gt;</i> respectively, otherwise you'll get an error.</li>
<li>Parameters are given a name. In the query you can reference a parameter with the usual syntax <i>${<parameter_name>}</i>, where<i> parameter_name</i> is the exact name of the parameter we're going to apply in the where clause.</li>
<li>Every parameter has a type that, together with the name, must be given. Supported types are: String, Date, Numeric and Integer. Any parameter can have a default value specified.</li>
<li>If the parameter's type is Date, <i>you can specify the pattern</i> you followed for your date format.</li>
</ul><b>Test the query</b><br />
<br />
To test our DataAccess definition before moving forward with our implementation, click on the <i>Preview button</i>. When the Preview form appears you can select the DataAccess definition you want to test and see the results below in a nice-looking table.<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhXZQo8YeE5nz88znDxiKfCz2ys8lepHwaPX8JQAh5jf1f0PBan8qC7DiV4svRnG3XmsL3AsI2zX7OGmKqa0bdKEdW33OqA-v4_c9JEKcSApoUkqB4iKaSOCcbXoBt8Pri1TRhAK8QYP8ux/s1600/cda4.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="248" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhXZQo8YeE5nz88znDxiKfCz2ys8lepHwaPX8JQAh5jf1f0PBan8qC7DiV4svRnG3XmsL3AsI2zX7OGmKqa0bdKEdW33OqA-v4_c9JEKcSApoUkqB4iKaSOCcbXoBt8Pri1TRhAK8QYP8ux/s640/cda4.png" width="640" /></a></div>Some things to note:<br />
<ul><li>the previewer sets the <i>default values</i> for the query (if any), but you can change them according to your needs, </li>
<li>the<i> date format</i> for the orderDate parameter follows the pattern specified in the CDA file definition.</li>
</ul><br />
Now I'll give you time to play with the samples and experiment a bit. Next time we'll talk about the second part of the DataAccess elements and CompoundDataAccess definitions.Sergio Ramazzinahttp://www.blogger.com/profile/17602567469759786999noreply@blogger.com0tag:blogger.com,1999:blog-4547836498257352074.post-19012392813511604632010-03-19T16:02:00.005+01:002010-07-09T23:35:44.218+02:00A first look into Pentaho CDAThese days I decided to have a look into Pentaho CDA, the newest idea from <a href="http://pedroalves-bi.blogspot.com/">Pedro Alves</a>. The project is available <a href="http://code.google.com/p/pentaho-cda/">here</a> on Google Code. He announced it two months ago on <a href="http://pedroalves-bi.blogspot.com/2010/01/cda-community-data-access.html">his blog</a> and it immediately looked interesting.<br />
<br />
<b>What is CDA?</b><br />
<br />
CDA is a BI Platform plugin that allows you to fetch data in various formats. The good thing is that it allows you to access data using all of the possible data sources you can use in BI Server (SQL, MDX, Metadata, Kettle, etc.) and it can be used by any client (internal or external to BI Server), because you can call it through a simple URL and it gives you back a data stream.<br />
<br />
<b>Let's go and compile the code</b><br />
<br />
So, after some very busy days of hard work where I only had the opportunity to keep my local copy of the code updated with the repository and see what was going on, yesterday I had some time available and was able to compile it and give it a try. The framework compiles easily with the ant script you find in the source tree. The only thing you have to consider is to create a file named <i>override.properties</i> containing a set of definitions related to your working environment <br />
<br />
<pre style="-moz-background-inline-policy: continuous; background: url("https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgDHAMB4J8OAK-i25iG2QJN8oAwVuuG2J6uQ6uD8s6aPJYRfZD3hCJ2hj3prmjRkMBCwy8c1FE8PDauCtMNxpT6PqF37xdcbG0gp7nho236p7DhYj3hsoadvtFXmQB-_54Wk2ywwZmZnBZb/s320/codebg.gif") repeat scroll 0% 0% rgb(240, 240, 240); border: 1px dashed rgb(204, 204, 204); color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> plugin.local.install.solutions.dir = <substitute_with_your_biserver_home>/pentaho-solutions
plugin.local.install.user = joe
plugin.local.install.pass = password
</code></pre><br />
After having compiled the code successfully, you can install it under your BI Server rel. 3.5.2 with the command<br />
<br />
<pre style="-moz-background-inline-policy: continuous; background: none repeat scroll 0% 0% rgb(240, 240, 240); border: 1px dashed rgb(204, 204, 204); color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> ant install-plugin
</code></pre><br />
<b>The basics</b><br />
<br />
Immediately after our successful compilation and installation, we can start BI Server and log into the system. Under the bi-developer solution you have a folder named CDA where you can find an extended set of examples to start playing with.<br />
<br />
CDA configuration is based on XML files. Every XML file contains a set of data access definitions. Each one is identified by an <i>id</i> and has a specified data access method. I won't copy and paste a sample here because you can have a look at the samples by yourself. They are very simple and the format is immediately understandable, so you can easily edit them by hand. <br />
<br />
Basically, you can have your output data stream in multiple formats. At the time of writing this article you can use <i>JSON</i> (the default), <i>XML</i>, Excel and <i>CSV</i>. This gives you maximum flexibility in what you can do with that data.<br />
<br />
If you want to execute a query using a data access method defined in your <i>.cda</i> file, you can use the following syntax<br />
<br />
<i>http://<biserver_host>:<port>/pentaho/content/cda/doQuery?solution=bi-developers&path=cda/cdafiles&file=sql-jndi.cda&outputType=xml&dataAccessId=1</i><br />
<br />
This calls the method <i>doQuery</i> to execute the query using the <i>dataAccess</i> with <i>id</i> <i>1</i> defined in <i>sql-jndi.cda</i>, located under the solution <i>bi-developers</i> in folder <i>cda/cdafiles</i>, returning the output as <i>XML</i>. Try executing it with different output types and see what happens. Remember that if you don't specify an output type, JSON is used by default.<br />
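The request is just an HTTP GET with a handful of query-string parameters, so it is easy to build programmatically. Here is a small Python sketch; the host, port and parameter values are taken from the example above, so adjust them for your installation:

```python
from urllib.parse import urlencode

# Parameters from the example doQuery call above; adjust for your setup.
params = {
    "solution": "bi-developers",
    "path": "cda/cdafiles",
    "file": "sql-jndi.cda",
    "outputType": "xml",   # omit this entry to get the JSON default
    "dataAccessId": "1",
}
url = "http://localhost:8080/pentaho/content/cda/doQuery?" + urlencode(params)
print(url)
```

Feeding the resulting URL to any HTTP client (a browser, curl, or urllib.request) returns the data stream in the requested format.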
<br />
Supposing we want XML output, the answer you'll see in your browser window is something like this<br />
<br />
<i>Shipped20044114929.96Shipped20051513074.46</i><br />
<br />
almost meaningless, but if you view the page source you can clearly see the XML stream sent back by our CDA plugin<br />
<br />
<pre style="-moz-background-inline-policy: continuous; background: url("https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgDHAMB4J8OAK-i25iG2QJN8oAwVuuG2J6uQ6uD8s6aPJYRfZD3hCJ2hj3prmjRkMBCwy8c1FE8PDauCtMNxpT6PqF37xdcbG0gp7nho236p7DhYj3hsoadvtFXmQB-_54Wk2ywwZmZnBZb/s320/codebg.gif") repeat scroll 0% 0% rgb(240, 240, 240); border: 1px dashed rgb(204, 204, 204); color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> <CdaExport>
<MetaData>
<ColumnMetaData index="0" type="String" name="STATUS"/>
<ColumnMetaData index="1" type="Numeric" name="Year"/>
<ColumnMetaData index="2" type="Numeric" name="price"/>
<ColumnMetaData index="3" type="String" name="PriceInK"/>
</MetaData>
<ResultSet>
<Row>
<Col>Shipped</Col>
<Col>2004</Col>
<Col>4114929.96</Col>
<Col isNull="true"/>
</Row>
<Row>
<Col>Shipped</Col>
<Col>2005</Col>
<Col>1513074.46</Col>
<Col isNull="true"/>
</Row>
</ResultSet>
</CdaExport>
</code></pre><br />
Really cool! If you don't want to get your hands dirty changing JDBC definitions, I suggest you try all the files whose name contains the word <i>jndi</i>, which means they use a datasource rather than a direct JDBC connection defined in the file (the latter are the files whose name contains the word <i>jdbc</i>).<br />
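A nice consequence of the XML output shown earlier is that any language with a standard XML parser can consume it. A small Python sketch that turns a CdaExport answer into a list of dictionaries; the response body is inlined and trimmed here rather than fetched from a live server:

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the CdaExport response shown above.
response = """
<CdaExport>
 <MetaData>
  <ColumnMetaData index="0" type="String" name="STATUS"/>
  <ColumnMetaData index="1" type="Numeric" name="Year"/>
  <ColumnMetaData index="2" type="Numeric" name="price"/>
 </MetaData>
 <ResultSet>
  <Row><Col>Shipped</Col><Col>2004</Col><Col>4114929.96</Col></Row>
  <Row><Col>Shipped</Col><Col>2005</Col><Col>1513074.46</Col></Row>
 </ResultSet>
</CdaExport>
"""

root = ET.fromstring(response)
# Column names come from the MetaData block, cell values from each Row.
columns = [c.get("name") for c in root.find("MetaData")]
rows = [
    dict(zip(columns, (col.text for col in row)))
    for row in root.find("ResultSet")
]
print(rows[0])  # {'STATUS': 'Shipped', 'Year': '2004', 'price': '4114929.96'}
```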
<br />
Read the next part of this series and learn about <a href="http://ramathoughts.blogspot.com/2010/04/cda-configuration-files-basics.html">CDA configuration files basics</a>Sergio Ramazzinahttp://www.blogger.com/profile/17602567469759786999noreply@blogger.com0tag:blogger.com,1999:blog-4547836498257352074.post-77492942906309410522010-02-25T17:43:00.003+01:002010-02-25T17:49:56.784+01:00Customize CDF jfreechart componentI was working with CDF Dashboards recently and had to set the <i>line-width</i> and <i>line-style</i> attributes in my <i>jFreechartComponent's LineChart</i> widget.<br />
<br />
Looking at the <i>jFreechartComponent</i> documentation you can clearly see that in the standard implementation some attributes, specific to certain chart types, aren't implemented. This was the case for the two attributes <i>line-width</i> and <i>line-style</i> related to the LineChart widget. But this is not a problem, because we can add those missing attributes very easily, and I'll show you how. The interesting thing is that this way you can manage any attribute that is available for a specific <i>JFreeChart </i>chart type.<br />
<br />
If you need to add support for missing JFreeChart chart attributes, you can follow the steps summarized below. Suppose we want to add the line-width and line-style attributes and to access them from the CDF jFreechartComponent as two properties respectively named <i>lineWidth</i> and <i>lineStyle</i>.<br />
<br />
1) Go to <i><biserver-home>/pentaho-solutions/cdf/components</i> and open <i>jfreechart.xaction</i> with your favourite editor.<br />
<br />
<br />
2) Add two new elements to the <i>inputs</i> section of the .xaction file. Below is an example of the file with the two new inputs added. Remember that the value of the request element (see lines 5 and 11 below) must match, in name and case, the name of the property we want to put in our <i>CDF jFreechartComponent</i> definition. Also be careful to assign a proper default value for the case where the property isn't provided by your CDF component call (see lines 7 and 13). This is particularly important when the attribute is a number: if you don't assign it properly you'll get a <i>NumberFormatException</i>.<br />
<br />
<pre style="-moz-background-clip: border; -moz-background-inline-policy: continuous; -moz-background-origin: padding; background: rgb(240, 240, 240) url(https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgDHAMB4J8OAK-i25iG2QJN8oAwVuuG2J6uQ6uD8s6aPJYRfZD3hCJ2hj3prmjRkMBCwy8c1FE8PDauCtMNxpT6PqF37xdcbG0gp7nho236p7DhYj3hsoadvtFXmQB-_54Wk2ywwZmZnBZb/s320/codebg.gif) repeat scroll 0% 0%; border: 1px dashed rgb(204, 204, 204); color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;">1: <inputs>
2: <!-- Omitted for brevity -->
3: <LINEWIDTH type="string">
4: <sources>
5: <request>lineWidth</request>
6: </sources>
7: <default-value>1.5</default-value>
8: </LINEWIDTH>
9: <LINESTYLE type="string">
10: <sources>
11: <request>lineStyle</request>
12: </sources>
13: <default-value/>
14: </LINESTYLE>
15: <!-- Omitted for brevity -->
16: </inputs>
</code></pre><br />
3) Add the new inputs previously defined in the <i>inputs</i> section (see above) to the <i>action-inputs</i> section of our ChartComponent <i>action-definition</i> (as shown below at lines 7 and 8). Then use these new fields to populate two new property elements added to the <i>chart-attributes</i> section, as shown at lines 25-26. Be very careful here, because the name of the element you give to every new chart attribute must equal the name of the related attribute in the JFreeChart library. For example: if in the JFreeChart library the line width attribute is named <i>line-width</i>, that is the name to use as a child element of the <i>chart-attributes</i> element. The value to assign to that new element is the name of the related action input element (for <i>line-width</i> we will assign the <i>LINEWIDTH</i> element as the value; see line 25 below). <br />
<br />
<br />
<pre style="-moz-background-clip: border; -moz-background-inline-policy: continuous; -moz-background-origin: padding; background: rgb(240, 240, 240) url(https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgDHAMB4J8OAK-i25iG2QJN8oAwVuuG2J6uQ6uD8s6aPJYRfZD3hCJ2hj3prmjRkMBCwy8c1FE8PDauCtMNxpT6PqF37xdcbG0gp7nho236p7DhYj3hsoadvtFXmQB-_54Wk2ywwZmZnBZb/s320/codebg.gif) repeat scroll 0% 0%; border: 1px dashed rgb(204, 204, 204); color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;">1: <action-definition>
2: <component-name>ChartComponent</component-name>
3: <action-type>Chart</action-type>
4: <action-inputs>
5: <chart-data type="result-set" mapping="newResults"/>
6: <!-- Omitted for brevity -->
7: <LINEWIDTH type="string"/>
8: <LINESTYLE type="string"/>
9: <!-- Omitted for brevity -->
10: </action-inputs>
11: <action-resources/>
12: <action-outputs>
13: <chart-filename type="string"/>
14: <base-url type="string"/>
15: <chart-mapping type="string"/>
16: <image-tag type="string"/>
17: <chart-output type="content"/>
18: </action-outputs>
19: <component-definition>
20: <width>{WIDTH}</width>
21: <height>{HEIGHT}</height>
22: <by-row>{BYROW}</by-row>
23: <chart-attributes>
24: <!-- Omitted for brevity -->
25: <line-width>{LINEWIDTH}</line-width>
26: <line-style>{LINESTYLE}</line-style>
27: <!-- Omitted for brevity -->
28: </chart-attributes>
29: </component-definition>
30: </action-definition>
</code></pre>Sergio Ramazzinahttp://www.blogger.com/profile/17602567469759786999noreply@blogger.com0tag:blogger.com,1999:blog-4547836498257352074.post-46977878522975102442010-02-18T12:29:00.005+01:002010-07-13T11:35:59.518+02:00BI Server & MS Active Directory in 10 minutesRecently I needed to connect Pentaho to MS Active Directory for user authentication/authorization. Immediately I asked myself how to connect Pentaho BI Server to Microsoft Active Directory, and I answered "Oh don't worry... it will take no more than 10 minutes!". Then the "look for a how-to document..." discovery process started.<br />
<br />
I found a lot of documentation about this issue (wiki articles, forum threads) but there isn't a well-done survival guide on this topic (that's my opinion).<br />
<br />
So I'll try to summarize in a few lines the steps I followed and the problems I encountered, to build a sort of survival guide for anyone with the same issue to solve.<br />
<br />
<b>Have a look at spring configuration files</b><br />
<br />
BI Server's security architecture is based on <a href="http://static.springsource.org/spring-security/site/index.html">Spring Security</a>, so the first guide to read is the Spring documentation on <a href="http://static.springsource.org/spring-security/site/docs/2.0.x/reference/ldap.html">LDAP configuration</a>. Better yet, in case you don't know anything about it, take a step back and read about the general architecture of Spring Security first.<br />
<br />
Spring Security beans are wired together through the Spring application context, and in Pentaho all the needed Spring application context files are located in <i><biserver_home>/pentaho-solutions/system</i>. You'll find a lot of them there, but the important things to know are:<br />
<br />
<i>pentaho-spring-beans.xml</i> contains the list of imported Spring bean files that will be loaded when BI Server starts.<br />
We have two important file groups there, named <i>applicationContext-spring*</i> and <i>applicationContext-pentaho*</i>. In each group you have one file for every available authentication method defined in Pentaho. The beans located in files belonging to the <i>applicationContext-spring</i> group contain definitions for the Spring beans needed to configure the given authentication/authorization method. The beans located in files belonging to the <i>applicationContext-pentaho</i> group contain definitions of Pentaho's beans involved in authentication/authorization for the specific method (LDAP, Hibernate, JDBC).<br />
<br />
<b>So how to configure Pentaho to work with MS Active Directory? </b><br />
<br />
The setup to have Pentaho working with MS Active Directory is really simple if you know exactly what to do and how. I'll try to summarize everything in the following paragraphs. As noted above, all the files we will mention are located in <i><biserver_home>/pentaho-solutions/system</i>.<br />
<br />
<i>1. Set up an MS Active Directory user for BI Server to connect with</i>. You need to define a user in MS Active Directory that BI Server can connect as, in order to check whether the Pentaho user about to authenticate exists and is valid. The user you're going to define in MS Active Directory doesn't need any special rights, so a normal user is the most appropriate. Remember to check that the <i>"password never expire"</i> flag is not set for this user.<br />
<br />
2. Set up the Spring Security files needed to enable LDAP authentication/authorization. This is the key step. First of all, read the guidelines provided <a href="http://wiki.pentaho.com/display/ServerDoc2x/LDAP+Troubleshooting">here</a> about some rules to follow when editing Spring configuration files, particularly regarding white space and special characters. Then follow the points detailed here.<br />
<br />
2.a) Open <i>applicationContext-security-ldap.properties</i> and change the properties according to your needs. The useful thing about this file is that it contains all the properties needed to configure the Spring beans, so we don't need to look for them in each xml file; they're all in one single place. Below you'll find an example:<br />
<br />
<pre style="-moz-background-inline-policy: continuous; background: url("https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgDHAMB4J8OAK-i25iG2QJN8oAwVuuG2J6uQ6uD8s6aPJYRfZD3hCJ2hj3prmjRkMBCwy8c1FE8PDauCtMNxpT6PqF37xdcbG0gp7nho236p7DhYj3hsoadvtFXmQB-_54Wk2ywwZmZnBZb/s320/codebg.gif") repeat scroll 0% 0% rgb(240, 240, 240); border: 1px dashed rgb(204, 204, 204); color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> contextSource.providerUrl=ldap\://ldaphost\:389
contextSource.userDn=cn=ldapuser,OU=my_org_unit,dc=my_dc
contextSource.password=password
userSearch.searchBase=OU=my_org_unit,dc=my_dc
userSearch.searchFilter=(sAMAccountName=\{0\})
populator.convertToUpperCase=false
populator.groupRoleAttribute=cn
populator.groupSearchBase=OU=my_org_unit,dc=my_dc
populator.groupSearchFilter=(member=\{0\})
populator.rolePrefix=
populator.searchSubtree=true
allAuthoritiesSearch.roleAttribute=cn
allAuthoritiesSearch.searchBase=OU=my_org_unit,dc=my_dc
allAuthoritiesSearch.searchFilter=(objectClass=group)
</code></pre><br />
Important things to note here are<br />
<ul><li><i>contextSource.providerUrl </i>- LDAP server url</li>
<li><i>contextSource.userDn </i>- LDAP username. This is the user we've talked about in 1) above</li>
<li><i>contextSource.password</i> - LDAP user password</li>
<li>the <i>populator</i> properties are needed by Spring's <i>DefaultLdapAuthoritiesPopulator</i>. That object loads the set of authorities the user was granted.</li>
<li>the <i>userSearch</i> properties configure the attributes needed to fill up the "users" box when assigning permissions to reports etc.</li>
<li>the <i>allAuthorities</i> properties configure the attributes needed to fill up the "roles" portion of the permissions box when setting them for a report etc.</li>
</ul>The excerpt I gave above is fully working, so you can copy and paste it into your properties file, changing only the definitions specific to your installation. At this point you have completed about 70% of the required configuration to have everything working on your system. Be careful to declare the full DN (distinguished name) when you work with MS Active Directory because, if you don't, it's almost certain you'll get an error like this (the first time I tried, I got such an error for this very reason)<br />
<br />
<pre style="font-family:arial;font-size:12px;border:1px dashed #CCCCCC;width:99%;height:auto;overflow:auto;background:#f0f0f0;padding:0px;color:#000000;text-align:left;line-height:20px;"><code style="color:#000000;word-wrap:normal;"> Microsoft Active Directory Error: <br />
javax.naming.AuthenticationException: <br />
[LDAP: error code 49 - 80090308: LdapErr: DSID-0C09030B, comment: <br />
AcceptSecurityContext error, data 525, v893 ] <br />
at com.sun.jndi.ldap.LdapCtx.mapErrorCode(Unknown Source) <br />
at com.sun.jndi.ldap.LdapCtx.processReturnCode(Unknown Source) <br />
at com.sun.jndi.ldap.LdapCtx.processReturnCode(Unknown Source) <br />
at com.sun.jndi.ldap.LdapCtx.connect(Unknown Source) <br />
at com.sun.jndi.ldap.LdapCtx.(Unknown Source) <br />
at com.sun.jndi.ldap.LdapCtxFactory.getUsingURL(Unknown Source) <br />
at com.sun.jndi.ldap.LdapCtxFactory.getUsingURLs(Unknown Source) <br />
at com.sun.jndi.ldap.LdapCtxFactory.getLdapCtxInstance(Unknown Source) <br />
at com.sun.jndi.ldap.LdapCtxFactory.getInitialContext(Unknown Source) <br />
at javax.naming.spi.NamingManager.getInitialContext(Unknown Source) <br />
at javax.naming.InitialContext.getDefaultInitCtx(Unknown Source) <br />
at javax.naming.InitialContext.init(Unknown Source) <br />
at javax.naming.InitialContext.(Unknown Source) <br />
at javax.naming.directory.InitialDirContext.(Unknown Source) <br />
</code></pre><br />
<br />
You can find short but useful information about this issue <a href="http://www.websina.com/bugzero/faq/ldap-error-code-49.html">here</a><br />
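One more pitfall with the properties file shown above: it is a Java properties file, so colons and braces are backslash-escaped, and the <i>{0}</i> placeholders in the search filters are filled in by Spring at runtime. A quick Python sketch, not part of Pentaho, to check what Spring will actually read after unescaping:

```python
import re

# Minimal Java-style .properties reader, just to inspect what the escaped
# values look like once the backslashes are removed.
raw = r"""
contextSource.providerUrl=ldap\://ldaphost\:389
userSearch.searchFilter=(sAMAccountName=\{0\})
populator.groupSearchFilter=(member=\{0\})
"""

props = {}
for line in raw.strip().splitlines():
    key, value = line.split("=", 1)
    # In a properties file a backslash simply escapes the next character.
    props[key] = re.sub(r"\\(.)", r"\1", value)

print(props["contextSource.providerUrl"])               # ldap://ldaphost:389
print(props["userSearch.searchFilter"].format("jdoe"))  # (sAMAccountName=jdoe)
```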
<br />
2.b) Two words about the default role. Any time a user logs into Pentaho, the system assigns the user a default role of "Authenticated". This is a really important point to have everything working: if that role isn't assigned you're able to connect, but Mantle will never show anything to you. My problem was that in my Spring LDAP context file the definitions to have that role assigned by default were missing, so I had to add them manually. Be <b>absolutely</b> sure to check that in <i>applicationContext-spring-security-ldap.xml</i> you have the <i>defaultRole</i> property defined with a value of <i>Authenticated</i> in the <i>populator</i> bean, <b>and if it's missing, add it</b>. Below is an excerpt of that bean definition with the <i>defaultRole</i> property added, so that you can copy and paste it if it's missing in your file.<br />
<br />
<pre style="-moz-background-inline-policy: continuous; background: url("https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgDHAMB4J8OAK-i25iG2QJN8oAwVuuG2J6uQ6uD8s6aPJYRfZD3hCJ2hj3prmjRkMBCwy8c1FE8PDauCtMNxpT6PqF37xdcbG0gp7nho236p7DhYj3hsoadvtFXmQB-_54Wk2ywwZmZnBZb/s320/codebg.gif") repeat scroll 0% 0% rgb(240, 240, 240); border: 1px dashed rgb(204, 204, 204); color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> <bean id="populator" class="org.springframework.security.ldap.populator.DefaultLdapAuthoritiesPopulator">
<!-- omitted -->
<property name="defaultRole" value="Authenticated" />
<!-- omitted -->
</bean>
</code></pre><br />
3) Rework the imports in <i>pentaho-spring-beans.xml</i> to enable loading of the LDAP security beans at startup. Below, at lines 8-9 we disable the DAO/Hibernate security and at lines 10-11 we import the new definitions enabling LDAP security<br />
<br />
<pre style="-moz-background-inline-policy: continuous; background: url("https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgDHAMB4J8OAK-i25iG2QJN8oAwVuuG2J6uQ6uD8s6aPJYRfZD3hCJ2hj3prmjRkMBCwy8c1FE8PDauCtMNxpT6PqF37xdcbG0gp7nho236p7DhYj3hsoadvtFXmQB-_54Wk2ywwZmZnBZb/s320/codebg.gif") repeat scroll 0% 0% rgb(240, 240, 240); border: 1px dashed rgb(204, 204, 204); color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;">1: <beans>
2: <import resource="pentahoSystemConfig.xml" />
3: <import resource="adminPlugins.xml" />
4: <import resource="systemListeners.xml" />
5: <import resource="sessionStartupActions.xml" />
6: <import resource="applicationContext-spring-security.xml" />
7: <import resource="applicationContext-common-authorization.xml" />
8: <!-- import resource="applicationContext-spring-security-hibernate.xml" />
9: <import resource="applicationContext-pentaho-security-hibernate.xml" / -->
10: <import resource="applicationContext-spring-security-ldap.xml" />
11: <import resource="applicationContext-pentaho-security-ldap.xml" />
12: <import resource="pentahoObjects.spring.xml" />
13: </beans>
</code></pre><br />
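Since the Hibernate imports are disabled simply by wrapping them in an XML comment, you can double-check which bean files will actually be loaded: comments are ignored by any XML parser. A small Python sketch (the XML below is a trimmed copy of the listing above):

```python
import xml.etree.ElementTree as ET

# Trimmed copy of pentaho-spring-beans.xml as edited above.
beans = """
<beans>
  <import resource="applicationContext-spring-security.xml" />
  <import resource="applicationContext-common-authorization.xml" />
  <!-- import resource="applicationContext-spring-security-hibernate.xml" />
  <import resource="applicationContext-pentaho-security-hibernate.xml" / -->
  <import resource="applicationContext-spring-security-ldap.xml" />
  <import resource="applicationContext-pentaho-security-ldap.xml" />
</beans>
"""

# The commented-out hibernate imports simply don't show up.
active = [i.get("resource") for i in ET.fromstring(beans).findall("import")]
print(active)
```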
4) Now it's time to define a set of groups in your MS Active Directory for your Pentaho users (if you don't already have such groups), depending on your authorization needs. You need at least one group to contain the Pentaho admins. In my system I called that group <i>PentahoAdmin</i>.<br />
<br />
5) Declare the new admin group in the Pentaho configuration to grant that group admin rights. To do that, rework the acl-voter element in <i>pentaho.xml</i> as shown below. <br />
<br />
<pre style="-moz-background-inline-policy: continuous; background: url("https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgDHAMB4J8OAK-i25iG2QJN8oAwVuuG2J6uQ6uD8s6aPJYRfZD3hCJ2hj3prmjRkMBCwy8c1FE8PDauCtMNxpT6PqF37xdcbG0gp7nho236p7DhYj3hsoadvtFXmQB-_54Wk2ywwZmZnBZb/s320/codebg.gif") repeat scroll 0% 0% rgb(240, 240, 240); border: 1px dashed rgb(204, 204, 204); color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> <acl-voter>
<!-- What role must someone be in to be an ADMIN of Pentaho -->
<admin-role>PentahoAdmin</admin-role>
</acl-voter>
</code></pre><br />
6) Rework the acl-publisher definitions in <i>pentaho.xml</i> for all the Pentaho groups defined in the LDAP server. In my system I defined two roles, <i>PentahoAdmin</i> and <i>PentahoUser</i>, so my configuration looks like this<br />
<br />
<pre style="-moz-background-inline-policy: continuous; background: url("https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgDHAMB4J8OAK-i25iG2QJN8oAwVuuG2J6uQ6uD8s6aPJYRfZD3hCJ2hj3prmjRkMBCwy8c1FE8PDauCtMNxpT6PqF37xdcbG0gp7nho236p7DhYj3hsoadvtFXmQB-_54Wk2ywwZmZnBZb/s320/codebg.gif") repeat scroll 0% 0% rgb(240, 240, 240); border: 1px dashed rgb(204, 204, 204); color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> <acl-publisher>
<!--
These acls are used when publishing from the file system. Every folder
gets these ACLS. Authenticated is a "default" role that everyone
gets when they're authenticated (be sure to setup your bean xml properly
for this to work).
-->
<default-acls>
<acl-entry role="PentahoAdmin" acl="FULL_CONTROL" /> <!-- Admin users get all authorities -->
<!-- acl-entry role="cto" acl="FULL_CONTROL" / --> <!-- CTO gets everything -->
<acl-entry role="PentahoUser" acl="EXECUTE_SUBSCRIBE" /> <!-- PentahoUser gets execute/subscribe -->
<acl-entry role="Authenticated" acl="EXECUTE" /> <!-- Authenticated users get execute only -->
</default-acls>
<!--
These acls are overrides to specific file/folders. The above default-acls will
be applied and then these overrides. This allows for specific access controls to
be loaded when the repository if first populated. Futher changes to acls can be
made in the platform GUI tool. Uncomment these and change add or delete to your hearts desire -->
<overrides>
<file path="/pentaho-solutions/admin">
<acl-entry role="PentahoAdmin" acl="FULL_CONTROL" />
</file>
</overrides>
</acl-publisher>
</code></pre><br />
7) Stop and restart your Pentaho server, and everything is ready for a try.<br />
<br />
I hope I haven't missed anything and that everything is clear enough for everyone who reads these few instructions. Let me know if you have any problems, so that I can keep this very brief guide updated.Sergio Ramazzinahttp://www.blogger.com/profile/17602567469759786999noreply@blogger.com15tag:blogger.com,1999:blog-4547836498257352074.post-14383932054523900552010-02-02T00:14:00.001+01:002010-02-02T00:15:55.854+01:00Pentaho BI Server language files (1/2)It was a month and a half of hard work, but now I can see a little light at the end of the tunnel and I have had the time to come back. This time I have something interesting for all the Italian friends. I completed:<br />
<br />
<ol><li> the Italian language files for the Mantle GWT client. The files have already been submitted to Pentaho's JIRA, hoping they will be incorporated in the next release of the product.</li>
<li>the Italian language file for the <a href="http://jpivot.sourceforge.net/">JPivot</a> toolbar. This was a minor effort because it is a very small file, so it took only a few minutes to complete. I have already committed this language file to <a href="http://sourceforge.net/projects/jpivot/develop">JPivot's CVS trunk</a>.</li>
</ol><br />
Below I detail the steps to follow to install the new language files on existing BI Server installations. The files are compatible with Pentaho 3.x. A special thanks to my colleague and dear friend Andrea Pasotti, who helped (and is helping) me in this work. <br />
<br />
<b>How to activate Italian language support for Mantle GWT client</b><br />
<ul><li>Download the files <i>messages_it.properties</i> and <i>MantleLoginMessages_it.properties</i> from the following <a href="http://jira.pentaho.com/browse/BISERVER-4040">link to JIRA</a></li>
<li>Stop BI Server </li>
<li>Copy the file <i>messages_it.properties</i> to <i><biserver_home>/webapps/pentaho/mantle/messages</i></li>
<li>Open the file <i><biserver_home>/webapps/pentaho/mantle/messages/supported_languages.properties</i></li>
<li>Add the following line</li>
<pre style="background-color: #eeeeee; border: 1px dashed rgb(153, 153, 153); color: black; font-family: Andale Mono,Lucida Console,Monaco,fixed,monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 70%;"><code>it=Italiano
</code></pre>
<li>Save and close the file</li>
<li>Copy the file <i>MantleLoginMessages_it.properties</i> to <i><biserver_home>/webapps/pentaho/mantleLogin/messages</i></li>
<li>Open the file <i><biserver_home>/webapps/pentaho/mantleLogin/messages/supported_languages.properties</i></li>
<li>Add the following line</li>
<pre style="background-color: #eeeeee; border: 1px dashed rgb(153, 153, 153); color: black; font-family: Andale Mono,Lucida Console,Monaco,fixed,monospace; font-size: 12px; line-height: 14px; overflow: auto; padding: 5px; width: 70%;"><code>it=Italiano
</code></pre>
<li>Save and close the file </li>
<li>Start BI Server </li>
</ul><b>How to activate Italian language support for JPivot toolbar</b><br />
<ul><li>Download the <i>resources_it.properties</i> language file from the following <a href="http://download.serasoft.it/public/resources_it.properties">link</a></li>
<li>Go to <i><biserver_home>/webapps/pentaho/WEB-INF/classes </i>and create the following directory path<i> com/tonbeller/jpivot/toolbar</i></li>
<li>Copy<i> resources_it.properties </i>to<i> </i><i><biserver_home>/webapps/pentaho/WEB-INF/classes/</i><i>com/tonbeller/jpivot/toolbar</i></li>
<li>Start BI Server</li>
</ul>I'm going to finalize the complete translation of the GUI, so in the next few weeks the Italian language files for PAC (Pentaho Administration Console) and Ad-Hoc Query Reporting will follow. So what else... stay tuned!Sergio Ramazzinahttp://www.blogger.com/profile/17602567469759786999noreply@blogger.com0tag:blogger.com,1999:blog-4547836498257352074.post-36813631792824285632009-12-10T01:16:00.004+01:002009-12-15T11:08:17.621+01:00Patching JPivot to filter by multiple members belonging to the same dimensionA very annoying problem with the release of JPivot shipped with the current release of <a href="http://www.pentaho.com/">Pentaho</a> is that you can only select one member of a cube dimension at a time. But frequently we would like to select more than one member at a time. So what a good opportunity to build a patched release of JPivot and make that happen!<br />
<br />
The first thing I did was test this by writing a sample MDX query on the SteelWheels cube that had, as a filter, a tuple made up of a set of members belonging to the same dimension. It worked. In fact, starting from <a href="http://mondrian.pentaho.org/">Mondrian</a> 3.1.2.13008, the OLAP engine supports compound slicers, which make this possible. After this very brief verification I was convinced to continue my adventure, and the next step was to find the latest JPivot sources. As stated by <a href="http://www.willgorman.com/">Will Gorman</a> in post #7 of <a href="http://forums.pentaho.org/showthread.php?t=65972&highlight=jpivot+sources">this thread</a>, the latest and best sources are in the obvious place... the CVS repository of the <a href="http://sourceforge.net/projects/jpivot">JPivot project on SourceForge</a>. So I got them and started my work.<br />
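For reference, a compound slicer is nothing more than a set, instead of a single tuple, in the WHERE clause of the MDX query. A hedged sketch against the SteelWheels sample (the cube, measure and member names are illustrative and may differ in your schema):

```
SELECT {[Measures].[Sales]} ON COLUMNS,
       [Product].[Line].Members ON ROWS
FROM   [SteelWheelsSales]
-- two members of the same dimension in the slicer:
WHERE  {[Time].[2004], [Time].[2005]}
```

Mondrian aggregates each cell over the members of the slicer set, which is the behaviour the patch makes reachable from the JPivot UI.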
<br />
The patch wasn't so difficult to make, and I reached my goal easily. It was a good opportunity to dig into the internals of JPivot and learn about them. The new PAT is not around the corner, so my opinion is that JPivot will remain the production OLAP viewer of choice for Pentaho Community Edition for a few more months. So, I think, this exercise is not a waste of time.<br />
<br />
If you want to try it, you can find the <a href="http://download.serasoft.it/public/jpivot-1.8.0.091203.zip">patched JPivot jar file here</a>. The setup takes just one minute. Move your original JPivot library from <i><biserver_home>/tomcat/webapps/WEB-INF/lib</i> to a different directory (just to be safe). Unzip the archive you just downloaded and copy the extracted jar file to the location specified before. Restart Pentaho BI Server and then... here we go!!Sergio Ramazzinahttp://www.blogger.com/profile/17602567469759786999noreply@blogger.com10tag:blogger.com,1999:blog-4547836498257352074.post-46076583764338164502009-11-27T14:36:00.000+01:002009-11-27T14:36:51.330+01:00Accessing Mondrian cubes through Pentaho Report DesignerToday I had the opportunity to design some complex reports using the cubes published in my customer's Pentaho BI Server as datasources for my report. Using a cube as a datasource to produce reports is good, in my opinion, because it gives you a perfect way to make reporting easy whenever you have, for example, to produce reports that compare data across different periods.<br />
<br />
A good sample for what we're going to discuss here is the Top N Analysis report you can find in Pentaho's Steel Wheels samples. For brevity I'll refer to this report simply as <i>"the sample"</i>. If you open it and have a look at the defined datasources, you can clearly see that it takes the data it needs from the <i>steelwheels.mondrian.xml</i> cube schema. So that is good for us.<br />
<br />
<b>Publish the report to Pentaho BI Server </b><br />
<br />
Before thinking about publishing your report to your running bi-server instance, you have to think about the way Pentaho will access the schema it needs for your report. The strategy used by the reporting plugin goes through two possible paths:<br />
<br />
<ul><li>Firstly, it tries to access the Analysis Schema file using the path you've specified in the datasource definition of the <a href="http://wiki.pentaho.com/display/Reporting/Pentaho+Reporting+Community+Documentation">Pentaho Report Designer</a>. Whenever you're in the Pentaho BI Server execution environment, every path is calculated with respect to <i><BISERVER_HOME>/tomcat/bin</i>. That means that if you set a reference to your Analysis Schema cube as a relative path in your report datasource definition (as is the case for the sample I mentioned in my opening), Pentaho will look for your schema file by calculating the absolute file path with respect to <i><BISERVER_HOME>/tomcat/bin</i>. So you need to be sure that your file is in the right place before the system tries to access it. I think this way is not as good, because it depends on your BI Server filesystem layout.</li>
<li>Secondly, it tries to access the schema as an XMLA datasource. That is, in my opinion, the more elegant way to make the schema available to the reporting engine.</li>
</ul><br />
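To see why the first strategy is fragile, note that a relative schema reference is resolved against the server's working directory at runtime. A Python sketch of the resolution (the install path and the relative reference are hypothetical, chosen only to illustrate the mechanism):

```python
import posixpath

# Hypothetical BI Server layout; at runtime the working directory is tomcat/bin.
cwd = "/opt/biserver-ce/tomcat/bin"

# A relative reference of the kind a report datasource definition might contain.
schema = "../../pentaho-solutions/steel-wheels/analysis/steelwheels.mondrian.xml"

resolved = posixpath.normpath(posixpath.join(cwd, schema))
print(resolved)
# /opt/biserver-ce/pentaho-solutions/steel-wheels/analysis/steelwheels.mondrian.xml
```

Move the server to a different directory layout and the same relative reference resolves somewhere else, which is exactly the fragility described above.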
<b>How to add a new XMLA datasource to Pentaho BI Server</b><br />
<br />
To define a new XMLA datasource in our Pentaho BI Server environment, we have to update the <i>datasources.xml</i> file in <i><BISERVER_HOME>/pentaho-solutions/system/olap</i>.<br />
<br />
This file contains the definitions of all the XMLA datasources available in the system. We can add a new datasource definition in one of two ways:<br />
<ol><li>Manually add a new <i>Catalog</i> element to configure a new <a href="http://mondrian.pentaho.org/">Mondrian</a> catalog</li>
<li>While publishing the Analysis Schema from Schema Workbench, check the <i>Enable JNDI datasource</i> flag and set the <i>JNDI Data Source</i> field appropriately. You can find the procedure to publish the schema clearly explained in <a href="http://wiki.pentaho.com/display/ServerDoc1x/Publishing+an+Analysis+Schema+Using+Schema+Workbench">Pentaho's wiki</a>.<br />
</li>
</ol><br />
<b>Mondrian cubes debugging: how to display SQL queries</b> (2009-10-29)<br />
<br />
These days I had the interesting need to look at the queries that <a href="http://mondrian.pentaho.org/">Mondrian</a> generates while the user is navigating an OLAP cube. The idea came to me when I decided to check whether the indexes applied to my tables give me the best performance possible. To decide which indexes are eligible to be applied to my tables, my strategy is to<br />
<ul><li>collect some queries and then </li>
<li>get the query plan of each query and check if the indexes are properly applied.</li>
</ul><br />
It's really useful to look at the Mondrian log files because they give us a lot of useful information about how our system is behaving. We can<br />
<ul><li>look at SQL statements and MDX queries,</li>
<li>get some profiling information on the queries that are executed,</li>
<li>get other useful debugging information.</li>
</ul>The following paragraphs illustrate how to enable Mondrian debug logging by adding some properties to the Mondrian configuration file.<br />
After that, we'll configure two new log4j appenders to have the desired log files properly written on our filesystem.<br />
<br />
<b>Enable Mondrian debug log</b><br />
Mondrian has a large set of configuration settings that can be modified. In our case, to enable Mondrian debug information, follow the steps detailed below:<br />
<br />
Open the <i>mondrian.properties</i> file located in <i><bi-server_home>/pentaho-solutions/system/mondrian</i> and add the following line.<br />
<blockquote style="font-family: "Courier New",Courier,monospace;">mondrian.rolap.generate.formatted.sql=true<br />
</blockquote>You can find the complete set of configuration settings <a href="http://mondrian.pentaho.org/documentation/configuration.php">here</a>.<br />
<br />
<b>Update log4j configuration</b><br />
At this point we're going to modify the log4j configuration file, adding the required appenders and categories to have our logging information displayed properly.<br />
<br />
Open the log4j.xml file located in <i><bi-server_home>/tomcat/webapps/pentaho/WEB-INF/classes</i><br />
<br />
Based on what you want to log, add one or both of the following appender definitions to the file. They will create two new RollingFileAppenders; you're free to use the kind of appender you prefer. In case you need further information about log4j and its configuration parameters, you can have a look <a href="http://logging.apache.org/log4j">here</a>.<br />
<b>IMPORTANT:</b> The location of the produced files is relative to the <i><bi-server_home>/tomcat/bin</i> directory. You can put the generated log files wherever you want in the filesystem, but always keep this in mind.<br />
<br />
<blockquote><br />
<span style="font-family: "Courier New",Courier,monospace;"><!-- Add the following appender only if you're interested in logging SQL statements --></span><br />
<div style="font-family: "Courier New",Courier,monospace;"><br />
</div><div style="font-family: "Courier New",Courier,monospace;"> <appender name="SQLLOG" class="org.apache.log4j.RollingFileAppender"><br />
</div><div style="font-family: "Courier New",Courier,monospace;"> <param name="File" value="sql.log"/><br />
</div><div style="font-family: "Courier New",Courier,monospace;"> <param name="Append" value="false"/><br />
</div><div style="font-family: "Courier New",Courier,monospace;"> <param name="MaxFileSize" value="500KB"/><br />
</div><div style="font-family: "Courier New",Courier,monospace;"> <param name="MaxBackupIndex" value="1"/><br />
</div><div style="font-family: "Courier New",Courier,monospace;"><br />
</div><div style="font-family: "Courier New",Courier,monospace;"> <layout class="org.apache.log4j.PatternLayout"><br />
</div><div style="font-family: "Courier New",Courier,monospace;"> <param name="ConversionPattern" value="%d %-5p [%c] %m%n"/><br />
</div><div style="font-family: "Courier New",Courier,monospace;"> </layout><br />
</div><div style="font-family: "Courier New",Courier,monospace;"> </appender><br />
</div><div style="font-family: "Courier New",Courier,monospace;"><br />
</div><div style="font-family: "Courier New",Courier,monospace;"> <!-- Add the following appender only if you're interested in logging MDX statements --><br />
</div><div style="font-family: "Courier New",Courier,monospace;"><br />
</div><div style="font-family: "Courier New",Courier,monospace;"> <appender name="MONDRIAN" class="org.apache.log4j.RollingFileAppender"><br />
</div><div style="font-family: "Courier New",Courier,monospace;"> <param name="File" value="mondrian.log"/><br />
</div><div style="font-family: "Courier New",Courier,monospace;"> <param name="Append" value="false"/><br />
</div><div style="font-family: "Courier New",Courier,monospace;"> <param name="MaxFileSize" value="500KB"/><br />
</div><div style="font-family: "Courier New",Courier,monospace;"> <param name="MaxBackupIndex" value="1"/><br />
</div><div style="font-family: "Courier New",Courier,monospace;"><br />
</div><div style="font-family: "Courier New",Courier,monospace;"> <layout class="org.apache.log4j.PatternLayout"><br />
</div><div style="font-family: "Courier New",Courier,monospace;"> <param name="ConversionPattern" value="%d %-5p [%c] %m%n"/><br />
</div><div style="font-family: "Courier New",Courier,monospace;"> </layout><br />
</div><div style="font-family: "Courier New",Courier,monospace;"> </appender><br />
</div><br />
</blockquote><br />
Add the following new categories to the <i>log4j.xml</i> file according to your logging needs.<br />
<br />
<blockquote><div style="font-family: "Courier New",Courier,monospace;"><br />
</div><div style="font-family: "Courier New",Courier,monospace;"> <!-- log Mondrian SQL statements only to the SQLLOG appender --><br />
</div><div style="font-family: "Courier New",Courier,monospace;"> <category name="mondrian.sql" additivity="false"><br />
</div><div style="font-family: "Courier New",Courier,monospace;"> <priority value="DEBUG"/><br />
</div><div style="font-family: "Courier New",Courier,monospace;"> <appender-ref ref="SQLLOG"/><br />
</div><div style="font-family: "Courier New",Courier,monospace;"> </category><br />
</div><div style="font-family: "Courier New",Courier,monospace;"><br />
</div><div style="font-family: "Courier New",Courier,monospace;"> <!-- log all Mondrian debug output (including MDX) only to the MONDRIAN appender --><br />
</div><div style="font-family: "Courier New",Courier,monospace;"> <category name="mondrian" additivity="false"><br />
</div><div style="font-family: "Courier New",Courier,monospace;"> <priority value="DEBUG"/><br />
</div><div style="font-family: "Courier New",Courier,monospace;"> <appender-ref ref="MONDRIAN"/><br />
</div><div style="font-family: "Courier New",Courier,monospace;"> </category> <br />
</div></blockquote><br />
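With the appenders and categories in place, after restarting the server you can follow the SQL statements as Mondrian emits them. A minimal shell sketch for inspecting the log, assuming the default layout (the install path is an assumption):

```shell
# Assumption: default Pentaho CE layout; log files are written relative to tomcat/bin
BISERVER_HOME="${BISERVER_HOME:-/opt/pentaho/biserver-ce}"
LOG="$BISERVER_HOME/tomcat/bin/sql.log"
if [ -f "$LOG" ]; then
  tail -n 50 "$LOG"   # show the most recent SQL statements
else
  echo "No SQL log yet at $LOG"
fi
```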
<b>Enable the new log settings</b><br />
To have the new log settings enabled, restart the Pentaho BI Server instance. Remember to disable the tracing logs as soon as your debugging needs are satisfied, because they have a severe impact on system performance.