Adjust detail rows height in WAQR templates dynamically

Today I'm back writing something about WAQR. Yesterday one of my customers wrote me about a problem they were having while executing WAQR reports that has a long text in their columns. If you had columns with a very long text in it and you're exporting the WAQR report in PDF  the columns text gets truncated. Stupid problem but apparently not so trivial to be solved within WAQR templates.

About WAQR

WAQR (Web AdHoc Query Reporting) is an interesting module in the Pentaho suite. It sits on top of the report engine and the metadata layer and lets the users easily build tabular reports to be used for their daily activities or to just to export some complex data in an easy way from the Pentaho system. The report definition is based on a wizard that takes the users along these easy steps:

  1. select a template, from the set of available templates, and a business model. 
  2. decide were to put which fields in the report layout
  3. adjust some visualization attributes or/and can apply filter conditions and define sort fields and orders
  4. manage some page layout attributes
  5. .... and here we go! We get the report.

WAQR templates are basically the old JFreereport report designer's template files so they are simple xml files we can view and modify. THEY ARE NOT compatible with the newer report designer template filesThe wizard uses text manipulation routines to create a report-definition out of the template. We all know that sooner or later WAQR will be replaced by something more interactive and more attractive  that will use the latest report engine's version. But for the moment we have this and with this we have to battle.

How to expand the height of the detail rows dynamically when export type is PDF

Try to build a new report with WAQR putting in the report definition a field with a very log text in it. If you try to export the report using the PDF format you will get the text truncated. To fix this it's only a matter of minutes and you need to modify the report template

  • adding a new configuration attribute for the report engine and 
  • add a new attribute to the details band to set the row height dynamic. 

Below I'm going to summarize all these steps modifying the Basic template given with the Pentaho demo solution.
  1. Go to <biserver_home>/pentaho-solutions/system/waqr/templates/Basic and open jfreereport-template.xml
  2. Locate the configuration xml element near the end of the file. Add the following line as a child of the configuration element
     <property name="org.pentaho.reporting.engine.classic.core.modules.output.pageable.pdf.AssumeOverflowY">true</property>  
    
  3. Add the attribute dynamic = "true" to the items element
  4. Save the template and if the BI Server is running refresh the cache using the Tools -> Refresh -> Repository Cache menu entry

Processing group of files executing a job once per file

How many of us ever had the need to process a set of files through Kettle executing jobs or transactions on a per file base. Let me make an example to illustrate a particular use case.

Suppose we've two groups of files that we call filegroup1 and filegroup2 and suppose we have 2 groups of transformations that we call transf_group1 and transf_group2. The requirement is: we want to execute the transformations in transf_group1 once for each file in  filegroup1 and as soon as the processing of this group finishes as a whole we want to start the execution of  the transformations in transf_group2 once for each file in  filegroup2. Let me analyse how we can do that.


A little about some main Kettle topics

Kettle processes informations that flows through a path made by steps. The user takes the needed steps from a palette, drags them into the client area and builds a workflow. That flow is made up by different types of steps: we've input/output steps, transformation steps, script steps and so on. Any step has an input and an one or more outputs. It gets the information flow in input from the immediate preceding step, processes it and outputs, as a result, a new set of informations that will flow into the immediate next step. The output flow produced by the step can have a layout of fields, in terms of number and data types, that may differ from the flow in input. A set of steps chained together to built a specific task is called a transformation. So transformations = elementary tasks, sort of little reusable components that makes actions. A process is built coordinating a set of orchestrated tasks that can be executed in sequence or in parallel. This role of orchestrator in Kettle is filled by the job. The job orchestrates the execution of a set of transformations to build our complete ETL process. A job is made by a set of steps too but their intended scope is to help in orcherstrating the executions of the tasks (transformations) in our process. As you can see, we have a job steps palette but it contains only steps to check conditions or prepare the execution environment. The real work is made by steps contained in the transformations.

In an our ETL processes made with kettle we always have a main job, also called root job, that we start to orchestrate the execution of nested jobs or transactions. We can nest as many levels of josb and transformations we want below that main job.

How Kettle starts nested Jobs or Transactions

The first way of starting jobs or transformations in Kettle is the stupid way. Chain them together using the Start transaction or Start job steps and, when the nested transaction or job steps will be reached in the owner job flow they will be started in sequence or in parallel, it depends on how they are connected. But sometimes we would like to execute a transformation or a jobs once for each line in the input flow. To do that is really simple. Go to the step configuration dialog , select the Advanced tab and check Execute for every single row. We see an example of that below in the Start transaction configuration dialog. You'll find the same setting in the Start job configuration dialog.


The job step Add filenames to result and why it isn't good for us

So far so good. Well, go back to our requirements now. Because we said that we have 2 groups of transformations, transf_group1 and transf_group2, it is clear that we will have two jobs one that chains all the transformations of transf_group1 and the second all the transformations for transf_group2. We call them respectively jobs1 and jobs2. So we will have:

a) A root job chains together 2 jobs job1 and job2.
b) Each job chains all the transformations of the respective group.
c) Because the two job encloses the group of transformations we are sure that the second group of transformations will be executed after the first group, as a whole, will be executed.

Looking at what explained above regarding the way to start a transformation in a job, to start the two jobs once per file we need step that reads the list of files from a specified directory, fills the result with the set of complete filenames so that it can be used to start our job once for file in the result. Because we talked about two different filegroups we need two of steps like this chained before the respective job. We look into the job steps palette and we found a step that could be fine for us the Add filenames to result. The picture below depict a possible flow for our root job.

We're starting to smile but unfortunately this solutions is not applicable because it doesn't work. To understand the why we need to understand the difference between row result and file result for Kettle. Typically a file result is a set of filenames that can be used only by steps that are able to manage attachments. The Mail step is the one step that can manage such a result. Row results instead are made by real data typically as output of a transaction. If you  look at the Kettle internals you can notice that a job step manages these two datasets as two completely separated collections. The important thing to note here is that whenever you check Execute for every single row in our job/transformation configuration your're saying that you'll start your job/transformation for each row of your row result. So way our solution isn't good for us? Because our Add filenames to result steps fill a file result so our jobs will never starts.

So what to do??

The solution is make a transformation whose only goal is to get the file list and use that list to populate a result list as shown in the picture below.

You need to call that transformation through a Transformation step in our root job chaining it before job1 and job2 to get respectively filegroup1 and filegroup2. Here it is the complete layout of our definitive root job


You can see the two transformations that gets the file lists before the two jobs. Using this approach the result file list that comes from the transformation fills the row result and the job can be executed once per file that is present in our directory. Remember check the magic flag Execute for every single row in the Start job step configuration as detailed above to correctly activate the jobs once per file as detailed in the paragraph above.

How to execute the provided sample

To execute my sample unzip the file in whatever directory. Edit the get file list transformations and change the Get file names step configuration according to a directory and files pattern that exists on your pc. Now, if you start the root job, you can go through the log and clearly see the messages that indicates the job is behaving as expected.

Download from here the sample

Dealing with Kettle DBCache

I came from 4 weeks of summer holidays... First time in my life but it was beautiful to spend a lot of time with my family. I need to take some time for them more often.

Anyway, today I was doing my usual work with Kettle but something strange happened that makes me crazy. I added a new field to my table and then I came to Kettle to update the fields layout in my Database lookup step. When I tried to find out my new field from the fields list I remained surprised... No new field appeared in the fields list for my table. I cried because I thought that it magically disappeared by my table but after a rapid check with my database client I saw my field was there in my table and I was happy. So what happened?

Cache database metadata with DBCache

DBCache is a cache where recurring query results (about database metadata informations for example) are saved the first time you access them. That informations are persisted in a file called db.cache located in the .kettle directory below your home directory. Informations are saved in row format so you have a look inside the file or edit its content.

Every time, for example, you go through Kettle database explorer looking for a table layout, the first time you access table metadata that results are cached through DBCache so that second time you you go through the cache saving an access to the db. But if you forget that and you update your table DDL in any way  (like me after a veeeery long period of holidays) you could be surprised seeing that your updates seems not to be caught by Kettle.

How can we clear the cache and have our table metadata updated next time I need to access them? You can go through the Tool menu and choose Database > Clear cache and your fresh set of table metadata will be get from the database.

Blogging from my summer holidays

Today I was reading through my twitter client the latest news and my attention was catched by all the tweets around the latest Doug post on the Pentaho forum. So I immediately connected through my phone browser to read that post and think about it.


Doug's question was asking how the Pentaho community is going. Following what my Pentaho friends said in their answers to this thread I also agree that comparing downloads isn't a good metric to judge how a community is going. The spirit of a community is made by all the people that every day gives a precious help in coding, helps people understanding the product or coming out from problems (that they think they are hard but they aren't) and talks about the product. 

I personally come from another experience with another important opensource community (not a BI product). From the outside, I remember, everything seemed wonderful. The product stayed for a very long time on top as the best project in one of the biggest open source forges. But from the inside everything was totally different and the approach was really cold. 

Here may experience is totally different, I'm breathing a completely different air. I found real people that cooperates everyday helping each other to solve their everyday problems with the product or other related technologies. They support people through irc, the forums and the wiki. I'm a software architect so I decided to help mainly with code contributions, but also in the forums and in writing articles about some pentaho topics in my blog .I studied and I'm continuously studying the code and I'm working with wonderful guys: Pedro, Tom, Paul and many other.

In our Pentaho community, the irc channel is the something I really enjoy but unfortunately I can participate only a few times because very often the various proxies broke my ability to connect. It's a sort of noisy room where anyone expose problems and quickly gets solutions. You always find someone available to support you. But it is also a place to talk with friends about everything. A sort of meeting place. That is wonderful. 

I said friends not colleagues and this is a very important distinction. These are the things that makes me thinking that IT IS really a community that works and not the number of downloads or any other stupid indicator. 


Setup Kettle debugging in 2 minutes

Sometimes it could be very interesting to be able to debug something to understand better how it works from an internal standpoint. My experience says that this is the case with the majority the opensource projects: sooner or later this is will happen.

Today that happened with Kettle trying to better understand if my assumptions were correct.

So how can we easily debug Kettle? The answer is very easy is: use remote debugging. I'll explain everything in a minute. The development tool I'm using is IntelliJ IDEA but it is fairly simple to set up everything with Eclipse and Netbeans.

Setup Kettle to enable remote debugging

1) If you're on Windows, open Spoon.bat and add the following line

 set JAVA_OPTS=-Xdebug -Xrunjdwp:transport=dt_socket,address=8000,server=y,suspend=n  

The same on Linux will be to open Spoon.sh and add the following line

 export JAVA_OPTS="-Xdebug -Xrunjdwp:transport=dt_socket,address=8000,server=y,suspend=n"  

I've used the port 8000 as the socket port number at which the debugger attaches but feel free to use the one you prefer.

2) Go to the last line of Spoon .bat or Spoon.sh and update that last line as follow

Linux (Spoon.sh):

 $JAVA_BIN $JAVA_OPTS $OPT $STARTUP -lib $LIBPATH "${1+$@}"  

Windows (Spoon.bat):

 start javaw %JAVA_OPTS% %OPT% -jar launcher\launcher.jar -lib %LIBSPATH% %_cmdline%  

Configure IntelliJ IDE for remote debugging

1) Configure a project for Kettle source tree
2) Choose Build > Edit configurations from the menu
3) The Run/Debug Configurations dialog opens. Choose Add New Configuration (the + sign in the upper left corner) and select Remote from the Add New Configuration list


3) The Remote Configuration dialog opens


Configure Transport and Debugger mode choices as displayed in the image above. The set the Host address (localhost in this case or wherever you need) and the port (8000 in my case) and press Ok

And now let's dance...

So now start Kettle. As soon as Kettle started load you job or transformation and, before running it, launch the debugger from your IDE. As soon as the debugger started you're ready to launch your job/transformation and begin your investigation. The advantage of this approach is the ability to debug also remotely running kettle instances and that is useful sometimes in helping to investigate problems.

Have fun playing with your Kettle and see you the next article.

CDF internationalization early access release


Today I finally completed the first implementation of the CDF Internationalization support. It is my first big effort to help in CDF and I'm proud I had the opportunity to work on a so important missing feature.
This is a first release, a sort of early access release, and it is available after building the framework sources from the repository trunk. I hope anyone can share ideas or suggestions on this feature helping making it better.

How does it works
CDF internationalization uses a jQuery i18n plugin to handle internationalized messages. The plugin support has been wrapped in the CDF framework so that the user never had to bother about jQuery specifics but will use the support CDF gives for that. Everything rounds about resource message files as usual in the java platform.

To create resources files we need to create  a file for each locale our dashboard is going to support. These files are named <name>.properties, or <name>_<language>.properties or <name>_<language>_<country>.properties. For instance a resource bundle for UK will be MessagesBundle_en_GB.properties. In case the message file will not be present the framework support will display the default message string and you'll get a warning in the bi server message console (nothing to be worried about but it would be better to have the required missing resource file).
The <language> argument is a valid ISO Language Code. These codes are the lower-case, two-letter codes as defined by ISO-639. You can find a full list of these codes at a number of sites, such as: http://www.loc.gov/standards/iso639-2/englangn.html.
The <country> argument is a valid ISO Country Code. These codes are the upper-case, two-letter codes as defined by ISO-3166. You can find a full list of these codes at a number of sites, such as: http://www.iso.ch/iso/en/prods-services/iso3166ma/02iso-3166-code-lists/list-en1.html.

How resource bundles are managed in CDF
We have two level of messages files in CDF
  • Global message files: located in <biserver_home>/pentaho_solutions/system/pentaho_cdf/resources/languages and useful  when used in dashboard global templates.
  • Dashboard specific message files: located in <dashboard_home> they can be distributed together with single dashboard in a sort of "package".  To enable Dashboard specific message files support we've to add a <messages> element to the xcdf file configuration whose value is the base name of the dashboard specific message file.

The CDF.i18n tag: internationalization support in HTML dashboard templates
To use the internationalization support in dashboards HTML templates we implemented a tag to use to get the message string from the resource bundle and use it where required in the HTML template file. This is useful for descriptions, annotations, generic labels and similar.

We can use the CDF.i18n tag using the following syntax:

CDF.i18n(<message_key>)


where <message_key> is the message key located in the message resource bundle files.

Internationalization support in CDF components
Because of the early preview stage of this first release, for the moment, only the jFreechart component supports internationalization. In this case the support applies to the chart's title attribute.
To support an internationalized title in a JFreechart component we added a new titleKey metadata attribute whose value is the message key for the chart title we can found in the resource bundle. A complete internationalization support for all the CDF components (where needed) will come shortly so be patient.

And here we go...
For anyone who want to have a first look on this, in the bi-developer solution you have in the repository trunk has a fully working example you can have a look and experiment with. Below two screenshots of the sample dashboard I made internationalized in the Italian language. I hope you enjoy playing with this first release and stay tuned here for the next updates from the field.





Joining and appending output datasets with CompoundDataAccess

An interesting topic about CDA is the ability apply join and union constructs to our incoming dataset using a  CompoundDataAccess element. This definition is a sort of extension of the DataAccess element and through a type attribute gives us the ability to join or append two given DataAccess datasets outputs forming a new one.

How does CompoundDataAccess works internally

It is really interesting to have a look at the sources and see how the things really works:
  • when you want to join the two DataAccess outputs CDA builds a little Kettle transformations and uses the Merge Join step two do the required join operation
  • if you want to append the two DataAccess outputs CDA very easily appends by itself the two datasets following the order you specified in configuration file.

Remember that in either cases you can apply this operations considering only two datasets at the time for each CompoundDataAccess element definition.

CompoundDataAccess basics

To define a new CompoundDataAccess element you have to mandatory specify
  • an id attribute, to externally identify this DataAccess elemement
  • a type attribute that accepts two values union or join following the type of composition you're looking for.
Because CompoundDataAccess is only an extension of the DataAccess element, we can apply the Parameter, Columns, CalculatedColumns and Output elements following the same rules specified for DataAccess in my previous post here

Moreover, because the CompundDataAccess element works on resulting DataAccess datasets we have to preliminarily define the two needed DataAccess. It is not possible (but would be a nice to have and I can think about implementing it) use DataAccess elements defined in external CDA files.

Joining datasets output
To join two datasets we have to define a CompoundDataAccess element with the type attribute valued to the join value. The join type lets you execute a FULL OUTER JOIN upon the two input datasets. Required elements are:

  • Left element definition, to specify a dataset for the left part of the join,
  • Right element definition,  to specify a dataset for the right part of the join.
Anyone of the two elements accepts the following attributes:

  1. id, is valued to the id of the DataAccess element definition you are going to consider as the left/right dataset
  2. key, is a comma separated list of elements to be considered as the join key. Remember that the columns keys are specified giving their position in the DataAccess query and not the name.

Below I give an example to fully understand this concept. Suppose we defined the following two DataAccess definitions in our CDA file

   <DataAccess id="1" connection="1" type="sql" access="private" cache="true" cacheDuration="300">  
                <Name>Sql Query on SampleData - Jdbc</Name>  
         <Query>  
                     select o.YEAR_ID, o.STATUS as status, sum(o.TOTALPRICE) as totalprice from orderfact o  
                     group by o.YEAR_ID, o.STATUS  
                </Query>  
                .  
                .  
                .  
                .  
   </DataAccess>  
   <DataAccess id="2" connection="1" type="sql" access="public" cache="true" cacheDuration="5">  
                <Name>Sql Query on SampleData</Name>  
        <Query>  
                     select o.YEAR_ID, o.status, sum(o.TOTALPRICE * 3) as tripleprice from orderfact o  
                     where o.STATUS = ${status} and o.ORDERDATE &gt; ${orderDate}  
                     group by o.YEAR_ID, o.STATUS  
                     order by o.YEAR_ID DESC, o.STATUS  
                </Query>  
                .  
                .  
                .  
   </DataAccess>  

and suppose in the same cda file we define this CompoundDataAccess element.

      <CompoundDataAccess id="3" type="join">  
           <Left id="1" keys="0,1"/>  
           <Right id="2" keys="0,1"/>  
           <Columns>  
                <CalculatedColumn>  
                     <Name>PriceDiff</Name>  
                     <Formula>=[TRIPLEPRICE]-[TOTALPRICE]</Formula>  
                </CalculatedColumn>  
           </Columns>  
               <Parameters>  
       <Parameter name="status" type="String" default="Shipped"/>  
       <Parameter name="orderDate" type="Date" pattern="yyyy-MM-dd" default="2003-03-01"/>  
     </Parameters>  
           <Output indexes="0,1,2,5,6"/>  
      </CompoundDataAccess>  

As you can see the CompoundDataAccess element has the two keys, Left and Right, two indicate the two sides of the join. The left one uses the DataAccess id 1 (the first one above) and the second the DataAcccess id 2. Then you see the comma separated list that defines the positions of the keys columns in the related DataAccess query. In this case you can see, looking at the related DataAcces queries, that either the elements uses YEAR_ID and the STATUS columns

Appending datasets ouputs
To append two datasets we have to define a CompoundDataAccess element with the type attribute valued to the union value.The union type lets you execute append two DataAcces output datasets giving their order using two configuration elements:
  • Top, to specify the first dataset,
  • Bottom,  to specify the second dataset.

Anyone of the two elements accepts the following attributes:
  1. id, is valued to the id of the DataAccess element definition you are going to consider as the Top/Bottom dataset
Below a sample of configuration using a compound union. The DataAccess elements considered for this sample are the same defined in the previous sample for the join type.

   <CompoundDataAccess id="3" type="union">  
     <Top id="2"/>  
     <Bottom id="1"/>  
     <Parameters>  
       <Parameter name="year" type="Numeric" default="2004"/>  
     </Parameters>  
   </CompoundDataAccess>