Sunday, November 21, 2010

Adjust detail rows height in WAQR templates dynamically

Today I'm back writing something about WAQR. Yesterday one of my customers wrote me about a problem they were having while executing WAQR reports that have long text in their columns: if a column contains very long text and you export the WAQR report to PDF, the column text gets truncated. A stupid problem, but apparently not so trivial to solve within WAQR templates.

About WAQR

WAQR (Web AdHoc Query Reporting) is an interesting module in the Pentaho suite. It sits on top of the report engine and the metadata layer and lets users easily build tabular reports for their daily activities, or simply export some complex data from the Pentaho system in an easy way. The report definition is based on a wizard that takes the user through these easy steps:

  1. select a template, from the set of available templates, and a business model. 
  2. decide where to put which fields in the report layout
  3. adjust some visualization attributes and/or apply filter conditions and define sort fields and orders
  4. manage some page layout attributes
  5. ... and here we go! We get the report.

WAQR templates are basically the old JFreeReport report designer's template files, so they are simple xml files we can view and modify. THEY ARE NOT compatible with the newer report designer template files. The wizard uses text manipulation routines to create a report definition out of the template. We all know that sooner or later WAQR will be replaced by something more interactive and more attractive that will use the latest version of the report engine. But for the moment this is what we have, and this is what we have to battle with.

How to expand the height of the detail rows dynamically when export type is PDF

Try to build a new report with WAQR putting a field with very long text in the report definition. If you export the report using the PDF format you will get the text truncated. Fixing this is only a matter of minutes: you need to modify the report template by

  • adding a new configuration property for the report engine, and 
  • adding a new attribute to the details band to make the row height dynamic. 

Below I summarize all these steps, modifying the Basic template that comes with the Pentaho demo solution.
  1. Go to <biserver_home>/pentaho-solutions/system/waqr/templates/Basic and open jfreereport-template.xml
  2. Locate the configuration xml element near the end of the file. Add the following line as a child of the configuration element
     <property name="org.pentaho.reporting.engine.classic.core.modules.output.pageable.pdf.AssumeOverflowY">true</property>  
    
  3. Add the attribute dynamic="true" to the items element (see the sketch after these steps)
  4. Save the template and if the BI Server is running refresh the cache using the Tools -> Refresh -> Repository Cache menu entry
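
A minimal sketch of what step 3 looks like in the template, assuming your details band contains an items element like the one below (keep whatever other attributes your template already defines on it):

     <items dynamic="true">
          <!-- the wizard fills this band with the detail field elements -->
     </items>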

Thursday, August 26, 2010

Processing group of files executing a job once per file

How many of us have ever had the need to process a set of files through Kettle, executing jobs or transformations on a per-file basis? Let me make an example to illustrate a particular use case.

Suppose we have two groups of files that we call filegroup1 and filegroup2, and suppose we have two groups of transformations that we call transf_group1 and transf_group2. The requirement is: we want to execute the transformations in transf_group1 once for each file in filegroup1 and, as soon as the processing of this group finishes as a whole, we want to start the execution of the transformations in transf_group2 once for each file in filegroup2. Let me analyse how we can do that.


A little about some main Kettle topics

Kettle processes information that flows through a path made of steps. The user takes the needed steps from a palette, drags them into the client area and builds a flow. That flow is made up of different types of steps: we have input/output steps, transformation steps, script steps and so on. Any step has one input and one or more outputs. It gets the information flow in input from the immediately preceding step, processes it and outputs, as a result, a new set of rows that flows into the immediately following step. The output flow produced by a step can have a layout of fields, in terms of number and data types, that differs from the flow in input. A set of steps chained together to build a specific task is called a transformation. So transformations = elementary tasks, a sort of little reusable component that performs actions. A process is built by coordinating a set of orchestrated tasks that can be executed in sequence or in parallel. This role of orchestrator in Kettle is filled by the job. The job orchestrates the execution of a set of transformations to build our complete ETL process. A job is made of a set of steps too, but their intended scope is to help in orchestrating the execution of the tasks (transformations) in our process. As you can see, we have a job steps palette but it contains only steps to check conditions or prepare the execution environment. The real work is done by the steps contained in the transformations.

In our ETL processes made with Kettle we always have a main job, also called the root job, that we start to orchestrate the execution of nested jobs or transformations. We can nest as many levels of jobs and transformations as we want below that main job.

How Kettle starts nested jobs or transformations

The first way of starting jobs or transformations in Kettle is the stupid way. Chain them together using the Start transformation or Start job steps and, when the nested transformation or job steps are reached in the owner job's flow, they will be started in sequence or in parallel, depending on how they are connected. But sometimes we would like to execute a transformation or a job once for each row in the input flow. Doing that is really simple. Go to the step's configuration dialog, select the Advanced tab and check Execute for every single row. We see an example of that below in the Start transformation configuration dialog. You'll find the same setting in the Start job configuration dialog.


The job step Add filenames to result and why it isn't good for us

So far so good. Well, let's go back to our requirements now. Because we said that we have two groups of transformations, transf_group1 and transf_group2, it is clear that we will have two jobs: one that chains all the transformations of transf_group1 and a second that chains all the transformations of transf_group2. We call them respectively job1 and job2. So we will have:

a) A root job that chains together the two jobs job1 and job2.
b) Each job chains all the transformations of the respective group.
c) Because each job encloses a group of transformations, we are sure that the second group of transformations will be executed only after the first group has been executed as a whole.

Given what was explained above about the way to start a transformation in a job, to start the two jobs once per file we need a step that reads the list of files from a specified directory and fills the result with the set of complete filenames, so that it can be used to start our job once per file in the result. Because we talked about two different filegroups we need two such steps, each chained before the respective job. We look into the job steps palette and we find a step that could work for us: Add filenames to result. The picture below depicts a possible flow for our root job.

We're starting to smile but unfortunately this solution is not applicable because it doesn't work. To understand why, we need to understand the difference between a row result and a file result in Kettle. Typically a file result is a set of filenames that can be used only by steps that are able to manage attachments; the Mail step is one step that can manage such a result. Row results instead are made of real data, typically the output of a transformation. If you look at the Kettle internals you can notice that a job manages these two datasets as two completely separate collections. The important thing to note here is that whenever you check Execute for every single row in a job/transformation configuration you're saying that you'll start your job/transformation for each row of your row result. So why isn't our solution good for us? Because our Add filenames to result steps fill the file result, so our jobs will never start.

So what to do??

The solution is to make a transformation whose only goal is to get the file list and use that list to populate the row result, as shown in the picture below.

You need to call that transformation through a Transformation step in our root job, chaining it before job1 and job2 to get respectively filegroup1 and filegroup2. Here is the complete layout of our definitive root job.


You can see the two transformations that get the file lists before the two jobs. Using this approach the file list that comes from each transformation fills the row result, and the job can be executed once per file present in our directory. Remember to check the magic flag Execute for every single row in the Start job step configuration, as detailed in the paragraph above, to correctly activate the jobs once per file.

How to execute the provided sample

To execute my sample, unzip the file in any directory. Edit the get file list transformations and change the Get file names step configuration to a directory and file pattern that exists on your pc. Now, if you start the root job, you can go through the log and clearly see the messages indicating that the job is behaving as expected.

Download the sample from here

Wednesday, August 25, 2010

Dealing with Kettle DBCache

I'm back from 4 weeks of summer holidays... the first time in my life, but it was beautiful to spend a lot of time with my family. I need to take time for them more often.

Anyway, today I was doing my usual work with Kettle when something strange happened that drove me crazy. I added a new field to my table and then I went back to Kettle to update the fields layout in my Database lookup step. When I tried to find my new field in the fields list I was surprised... the new field didn't appear in the fields list for my table. I cried because I thought it had magically disappeared from my table, but after a quick check with my database client I saw my field was there in my table and I was happy. So what happened?

Cache database metadata with DBCache

DBCache is a cache where recurring query results (about database metadata, for example) are saved the first time you access them. That information is persisted in a file called db.cache located in the .kettle directory below your home directory. The information is saved in row format, so you can have a look inside the file or edit its content.

Every time, for example, you go through the Kettle database explorer looking at a table layout, the first time you access the table metadata the results are cached through DBCache, so that the second time you go through the cache, saving an access to the db. But if you forget that and you update your table DDL in any way (like me after a veeeery long period of holidays) you could be surprised to see that your updates don't seem to be picked up by Kettle.

How can we clear the cache and have our table metadata updated the next time we need to access it? Go through the Tools menu and choose Database > Clear cache, and a fresh set of table metadata will be fetched from the database.

Saturday, August 14, 2010

Blogging from my summer holidays

Today I was reading the latest news through my twitter client and my attention was caught by all the tweets around Doug's latest post on the Pentaho forum. So I immediately connected through my phone browser to read that post and think about it.


Doug's question was about how the Pentaho community is doing. Following what my Pentaho friends said in their answers to this thread, I also agree that comparing downloads isn't a good metric to judge how a community is doing. The spirit of a community is made by all the people that every day give precious help in coding, help people understand the product or get out of problems (that they think are hard but aren't) and talk about the product. 

I personally come from another experience with another important opensource community (not a BI product). From the outside, I remember, everything seemed wonderful. The product stayed for a very long time on top as the best project in one of the biggest open source forges. But from the inside everything was totally different and the approach was really cold. 

Here my experience is totally different; I'm breathing completely different air. I found real people that cooperate every day, helping each other to solve their everyday problems with the product or other related technologies. They support people through irc, the forums and the wiki. I'm a software architect so I decided to help mainly with code contributions, but also in the forums and by writing articles about some Pentaho topics in my blog. I studied and am continuously studying the code and I'm working with wonderful guys: Pedro, Tom, Paul and many others.

In our Pentaho community, the irc channel is something I really enjoy, but unfortunately I can participate only occasionally because very often the various proxies break my ability to connect. It's a sort of noisy room where anyone can expose a problem and quickly get a solution. You always find someone available to support you. But it is also a place to talk with friends about everything. A sort of meeting place. That is wonderful. 

I said friends, not colleagues, and this is a very important distinction. These are the things that make me think that IT IS really a community that works, and not the number of downloads or any other stupid indicator. 


Wednesday, July 21, 2010

Setup Kettle debugging in 2 minutes

Sometimes it can be very interesting to be able to debug something to better understand how it works from an internal standpoint. My experience says that this is the case with the majority of opensource projects: sooner or later it will happen.

Today it happened with Kettle, while trying to better understand if my assumptions were correct.

So how can we easily debug Kettle? The answer is very easy: use remote debugging. I'll explain everything in a minute. The development tool I'm using is IntelliJ IDEA but it is fairly simple to set up everything with Eclipse or NetBeans.

Setup Kettle to enable remote debugging

1) If you're on Windows, open Spoon.bat and add the following line

 set JAVA_OPTS=-Xdebug -Xrunjdwp:transport=dt_socket,address=8000,server=y,suspend=n  

On Linux, open Spoon.sh instead and add the following line

 export JAVA_OPTS="-Xdebug -Xrunjdwp:transport=dt_socket,address=8000,server=y,suspend=n"  

I've used port 8000 as the socket port the debugger attaches to, but feel free to use the one you prefer.

2) Go to the last line of Spoon.bat or Spoon.sh and update that last line as follows

Linux (Spoon.sh):

 $JAVA_BIN $JAVA_OPTS $OPT $STARTUP -lib $LIBPATH "${1+$@}"  

Windows (Spoon.bat):

 start javaw %JAVA_OPTS% %OPT% -jar launcher\launcher.jar -lib %LIBSPATH% %_cmdline%  

Configure IntelliJ IDE for remote debugging

1) Configure a project for Kettle source tree
2) Choose Build > Edit configurations from the menu
3) The Run/Debug Configurations dialog opens. Choose Add New Configuration (the + sign in the upper left corner) and select Remote from the Add New Configuration list


4) The Remote Configuration dialog opens


Configure the Transport and Debugger mode choices as displayed in the image above. Then set the Host address (localhost in this case, or wherever you need) and the port (8000 in my case) and press Ok.

And now let's dance...

So now start Kettle. As soon as Kettle has started, load your job or transformation and, before running it, launch the debugger from your IDE. As soon as the debugger has attached you're ready to launch your job/transformation and begin your investigation. The advantage of this approach is the ability to also debug remotely running Kettle instances, and that is sometimes useful when investigating problems.

Have fun playing with your Kettle and see you in the next article.

Monday, July 19, 2010

CDF internationalization early access release


Today I finally completed the first implementation of CDF internationalization support. It is my first big effort to help on CDF and I'm proud I had the opportunity to work on such an important missing feature.
This is a first release, a sort of early access release, and it is available after building the framework sources from the repository trunk. I hope anyone can share ideas or suggestions on this feature, helping to make it better.

How does it work
CDF internationalization uses a jQuery i18n plugin to handle internationalized messages. The plugin has been wrapped in the CDF framework so that the user never has to bother about jQuery specifics, but will use the support CDF gives for that. Everything revolves around resource message files, as usual in the Java platform.

To create resource files we need to create a file for each locale our dashboard is going to support. These files are named <name>.properties, <name>_<language>.properties or <name>_<language>_<country>.properties. For instance, a resource bundle for UK English would be MessagesBundle_en_GB.properties. In case a message file is not present, the framework will display the default message string and you'll get a warning in the bi server console (nothing to be worried about, but it would be better to provide the missing resource file).
The <language> argument is a valid ISO Language Code. These codes are the lower-case, two-letter codes as defined by ISO-639. You can find a full list of these codes at a number of sites, such as: http://www.loc.gov/standards/iso639-2/englangn.html.
The <country> argument is a valid ISO Country Code. These codes are the upper-case, two-letter codes as defined by ISO-3166. You can find a full list of these codes at a number of sites, such as: http://www.iso.ch/iso/en/prods-services/iso3166ma/02iso-3166-code-lists/list-en1.html.
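
As a minimal sketch, assuming a default bundle plus an Italian translation (the base name MessagesBundle and the key dashboard.title are just illustrative), the files could look like this:

 # MessagesBundle.properties (default)
 dashboard.title=Sales overview

 # MessagesBundle_it.properties (Italian)
 dashboard.title=Panoramica vendite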

How resource bundles are managed in CDF
We have two levels of message files in CDF
  • Global message files: located in <biserver_home>/pentaho_solutions/system/pentaho_cdf/resources/languages and useful when used in dashboard global templates.
  • Dashboard specific message files: located in <dashboard_home>, they can be distributed together with a single dashboard in a sort of "package". To enable dashboard specific message file support we have to add a <messages> element to the xcdf configuration file whose value is the base name of the dashboard specific message file, as sketched below.
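
A hedged sketch of the relevant part of an .xcdf file, reusing the bundle base name from the example above (the surrounding elements are the usual xcdf entries and may differ in your file):

 <cdf>
      <title>Sample dashboard</title>
      <template>sample.html</template>
      <messages>MessagesBundle</messages>
 </cdf>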

The CDF.i18n tag: internationalization support in HTML dashboard templates
To use the internationalization support in dashboard HTML templates we implemented a tag that gets the message string from the resource bundle so you can use it where required in the HTML template file. This is useful for descriptions, annotations, generic labels and the like.

We can use the CDF.i18n tag using the following syntax:

CDF.i18n(<message_key>)


where <message_key> is the message key located in the message resource bundle files.
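
For example, using the hypothetical key from the bundle sketch above, the tag placed where the label is needed in the template would be

CDF.i18n(dashboard.title)

and it should resolve to the string defined for the current locale, falling back to the default bundle when no localized file matches.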

Internationalization support in CDF components
Because of the early preview stage of this first release, for the moment only the jFreechart component supports internationalization. In this case the support applies to the chart's title attribute.
To support an internationalized title in a jFreechart component we added a new titleKey metadata attribute whose value is the message key for the chart title, as found in the resource bundle. Complete internationalization support for all the CDF components (where needed) will come shortly, so be patient.

And here we go...
For anyone who wants to have a first look at this, the bi-developer solution in the repository trunk has a fully working example you can look at and experiment with. Below are two screenshots of the sample dashboard I internationalized into Italian. I hope you enjoy playing with this first release and stay tuned here for the next updates from the field.





Monday, July 12, 2010

Joining and appending output datasets with CompoundDataAccess

An interesting topic about CDA is the ability to apply join and union constructs to our incoming datasets using a CompoundDataAccess element. This definition is a sort of extension of the DataAccess element and, through a type attribute, gives us the ability to join or append the outputs of two given DataAccess elements, forming a new one.

How CompoundDataAccess works internally

It is really interesting to have a look at the sources and see how things really work:
  • when you want to join the two DataAccess outputs, CDA builds a little Kettle transformation and uses the Merge Join step to do the required join operation
  • if you want to append the two DataAccess outputs, CDA simply appends the two datasets by itself, following the order you specified in the configuration file.

Remember that in either case these operations consider only two datasets at a time for each CompoundDataAccess element definition.

CompoundDataAccess basics

To define a new CompoundDataAccess element you must specify
  • an id attribute, to externally identify this DataAccess element
  • a type attribute that accepts two values, union or join, depending on the type of composition you're looking for.
Because CompoundDataAccess is only an extension of the DataAccess element, we can apply the Parameters, Columns, CalculatedColumns and Output elements following the same rules specified for DataAccess in my previous post here.

Moreover, because the CompoundDataAccess element works on the resulting DataAccess datasets, we have to define the two needed DataAccess elements beforehand. It is not possible (but it would be a nice to have, and I can think about implementing it) to use DataAccess elements defined in external CDA files.

Joining dataset outputs
To join two datasets we have to define a CompoundDataAccess element with the type attribute set to the join value. The join type lets you execute a FULL OUTER JOIN on the two input datasets. Required elements are:

  • Left element definition, to specify a dataset for the left part of the join,
  • Right element definition,  to specify a dataset for the right part of the join.
Each of the two elements accepts the following attributes:

  1. id, set to the id of the DataAccess element definition you are going to consider as the left/right dataset
  2. keys, a comma separated list of the columns to be considered as the join key. Remember that the key columns are specified by their position in the DataAccess query, not by their name.

Below is an example to fully understand this concept. Suppose we defined the following two DataAccess definitions in our CDA file

   <DataAccess id="1" connection="1" type="sql" access="private" cache="true" cacheDuration="300">  
                <Name>Sql Query on SampleData - Jdbc</Name>  
         <Query>  
                     select o.YEAR_ID, o.STATUS as status, sum(o.TOTALPRICE) as totalprice from orderfact o  
                     group by o.YEAR_ID, o.STATUS  
                </Query>  
                .  
                .  
                .  
                .  
   </DataAccess>  
   <DataAccess id="2" connection="1" type="sql" access="public" cache="true" cacheDuration="5">  
                <Name>Sql Query on SampleData</Name>  
        <Query>  
                     select o.YEAR_ID, o.status, sum(o.TOTALPRICE * 3) as tripleprice from orderfact o  
                     where o.STATUS = ${status} and o.ORDERDATE &gt; ${orderDate}  
                     group by o.YEAR_ID, o.STATUS  
                     order by o.YEAR_ID DESC, o.STATUS  
                </Query>  
                .  
                .  
                .  
   </DataAccess>  

and suppose in the same cda file we define this CompoundDataAccess element.

      <CompoundDataAccess id="3" type="join">  
           <Left id="1" keys="0,1"/>  
           <Right id="2" keys="0,1"/>  
           <Columns>  
                <CalculatedColumn>  
                     <Name>PriceDiff</Name>  
                     <Formula>=[TRIPLEPRICE]-[TOTALPRICE]</Formula>  
                </CalculatedColumn>  
           </Columns>  
               <Parameters>  
       <Parameter name="status" type="String" default="Shipped"/>  
       <Parameter name="orderDate" type="Date" pattern="yyyy-MM-dd" default="2003-03-01"/>  
     </Parameters>  
           <Output indexes="0,1,2,5,6"/>  
      </CompoundDataAccess>  

As you can see, the CompoundDataAccess element has the two keys, Left and Right, to indicate the two sides of the join. The left one uses the DataAccess with id 1 (the first one above) and the right one the DataAccess with id 2. Then you see the comma separated list that defines the positions of the key columns in the related DataAccess query. In this case you can see, looking at the related DataAccess queries, that both elements use the YEAR_ID and STATUS columns.

Appending dataset outputs
To append two datasets we have to define a CompoundDataAccess element with the type attribute set to the union value. The union type lets you append two DataAccess output datasets, giving their order using two configuration elements:
  • Top, to specify the first dataset,
  • Bottom, to specify the second dataset.

Each of the two elements accepts the following attribute:
  1. id, set to the id of the DataAccess element definition you are going to consider as the Top/Bottom dataset
Below is a sample configuration using a compound union. The DataAccess elements considered for this sample are the same as those defined in the previous sample for the join type.

   <CompoundDataAccess id="3" type="union">  
     <Top id="2"/>  
     <Bottom id="1"/>  
     <Parameters>  
       <Parameter name="year" type="Numeric" default="2004"/>  
     </Parameters>  
   </CompoundDataAccess>  

Friday, July 9, 2010

CDA configuration files basics (part 2)

Last time we started talking about the basics of CDA configuration. Today we'll continue our series of articles on CDA, looking more closely at the DataAccess element configuration.

A little recap

CDA is a BI Platform plugin that allows you to fetch data in various formats. It works by defining a configuration file where we define connections to our available datasources and a number of data access objects. The data access objects define the strategy used to access the data using a specified connection. In our previous post we talked briefly about how to configure Connections and we gave a brief overview of the DataAccess element. Today we will complete our introduction to the basics of CDA DataAccess configuration.

Adding parameters to our DataAccess element

The Parameters element, nested into the DataAccess element, lets you define the set of parameters required to get the data from your connection. Every parameter is associated with a Parameter element which accepts at least the following attributes:

  • name, to specify the Parameter name.
  • type, lets you specify the Parameter type. Allowed types are String, Date, Integer, Numeric.
  • pattern, lets you specify a display pattern required to let CDA better understand the parameter's value as entered by the user. Typical usages of this attribute are date or numeric parameters.
  • default, lets you specify a default value for the parameter so that it will be assigned automatically whenever we don't supply a value

Below is a sample DataAccess element using parameters (the Parameters block itself is elided by the dots; a sketch of it follows the excerpt)

   <DataAccess id="1" connection="1" type="sql" access="public" cache="true" cacheDuration="300">  
                .  
                .  
                <Query>  
                     select o.YEAR_ID, o.STATUS, sum(o.TOTALPRICE) as price from orderfact o  
                     where o.STATUS = ${status} and o.ORDERDATE &gt; ${orderDate}  
                     group by o.YEAR_ID, o.STATUS  
                </Query>  
                .  
                .  
                <Columns>  
                     <Column idx="0">  
                          <Name>Year</Name>  
                     </Column>  
                     <CalculatedColumn>  
                          <Name>PriceInK</Name>  
                          <Formula>=[PRICE]/1000000</Formula>  
                     </CalculatedColumn>  
                </Columns>  
                .  
                .  
    <!-- Output controls what is outputed and by what order. Defaults to everything -->  
    <Output indexes="1,0,2,3"/>  
    .  
    .  
   </DataAccess>  
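
For reference, the Parameters block elided by the dots above would look like the sketch below, mirroring the two parameters referenced in the query (the default values are just examples):

    <Parameters>
      <Parameter name="status" type="String" default="Shipped"/>
      <Parameter name="orderDate" type="Date" pattern="yyyy-MM-dd" default="2003-03-01"/>
    </Parameters>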

For your reference I've also shown you the query definition. Look at the Columns element declaration: you can see that we're going to rename the column with ordinal 0 from "YEAR_ID" to "Year". To do that we used a Column element where we set the new name.
Then, using the CalculatedColumn element, we added a new calculated column named PriceInK.
Consider that:
  • New calculated columns are always added at the end of the column set
  • CDA uses the Pentaho reporting libraries to apply formulas, which means you need to identify stream columns in your formula using the square bracket notation, and you can use all the constructs found in those libraries.

Last but not least, we changed the positional layout of the output columns with an Output element. In the indexes attribute we set the column positions using column indexes. You have to consider two things here:

    • The total number of columns is 4 because 3 came from the query and one is the calculated column appended at the end of the stream
    • The index of the first (leftmost) column is 0, and NOT 1


      Wednesday, April 28, 2010

      Easily build BI Server from behind an NTLM proxy

      Pentaho projects are based on Apache Ivy to manage project dependencies. This is good because it gives you a consistent and uniform way to manage project dependencies, but it gives you some trouble if you need to build one of the many Pentaho projects from behind an NTLM proxy.

      So here is a brief and concise survival guide about how to do that quickly and efficiently.

      Install a local proxy on your development computer

      First of all download cntlm, a local proxy that lets you easily authenticate through your company's or customer's NTLM authentication proxy. There are versions for Windows and Linux. I learned about it two years ago when I was on Linux and I had to manage some Maven projects. Now I'm on Windows and the good news for me is that it also has a version for this platform (I also tried ntlmaps but I wasn't able to get it working properly with Ivy).

      Installation is really simple but, on Windows, there is a tweak we will talk about later. Download the .zip from sourceforge and unzip it wherever you like. Execute setup.bat and it will quickly install the local proxy as a Windows service.

      Go to the installation directory (typically C:\Program Files\cntlm) and edit cntlm.ini. Replace the values for username, password, domain and your proxy hosts and ports as detailed in the file excerpt given below

      #
      # Cntlm Authentication Proxy Configuration
      #
      # NOTE: all values are parsed literally, do NOT escape spaces,
      # do not quote. Use 0600 perms if you use plaintext password.
      #
      
      Username        <your_username>
      Domain          <your_domain>
      Password        <your_password>   # Use hashes instead (-H)
      #Workstation    netbios_hostname        # Should be auto-guessed
      
      Proxy           <1st_proxy_host>:<1st_proxy_port>
      Proxy           <2nd_proxy_host>:<2nd_proxy_port>
      
      #
      # This is the port number where Cntlm will listen
      #
      Listen                3128
      

      Make the changes and save the file. Now, before starting the local proxy service, here is the tweak. In my case, the setup.bat script forgot to set a service startup parameter that points to the cntlm configuration file location. To work around that, open your registry and go to the following key: HKEY_LOCAL_MACHINE\SYSTEM\ControlSet001\services\cntlm\Parameters. Look for the AppArgs attribute and modify it, adding the parameter -i "C:\Program Files\Cntlm\cntlm.ini" as shown below



      After that you're ready to start your service successfully; without this tweak you'll get an error.

      Back to our Pentaho project build

      In my case I was building the newest bi-server 3.6, taken from the related branch of the Pentaho repository (http://source.pentaho.org/svnroot/bi-platform-v2/branches/3.6). To enable Ivy to use your locally configured authentication proxy you have to set the following environment variable before starting to compile the project

      ANT_OPTS=-Dhttp.proxyHost=localhost -Dhttp.proxyPort=3128

      It is up to you to decide where to set that variable, either as a Windows environment variable or from within a command line session (if you're starting the build manually).
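
      For instance, from a command line session it boils down to something like this (then run your usual ant build target; the target name depends on the project you're building):

       REM Windows command prompt
       set ANT_OPTS=-Dhttp.proxyHost=localhost -Dhttp.proxyPort=3128

       # Linux shell
       export ANT_OPTS="-Dhttp.proxyHost=localhost -Dhttp.proxyPort=3128"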

      And what about the Linux users?

      If you're on Linux you can install cntlm using the package manager of your favorite distro. I used it on Ubuntu and Fedora without any problem. The configuration file is the same as before but is located in /etc/cntlm.conf. You have to change the same parameters detailed above for cntlm.ini (the file is exactly the same). Start the daemon, set the ANT_OPTS environment variable and have fun with your build.

      Wednesday, April 21, 2010

      CDA configuration files basics (part 1)

       Last time I gave a brief explanation of the basics of CDA compilation and deployment and I showed you how to start executing a simple query on our datamart using CDA, getting back a resultset formatted as expected. I decided to take you to a fully functional sample, going through a set of explanatory steps summarized below:
       • Explain the structure of CDA files and learn how to build a new one
       • Show the basic verbs you can use to interact with CDA
       • Put everything together with a working example
       Before starting to play with CDA, remember to get the latest CDA sources, compile them and reinstall the plugin as explained in my first post on this blog.

       CDA files have a very basic structure. They are very easy to learn, even for a novice user with a basic knowledge of XML. Basically a CDA file contains definitions for a few important elements
       1. datasources
       2. data access
       3. compound data access

      The DataSource element

      Datasources in CDA have the usual meaning of a named connection to a database server. The name (in our case the connection id) is commonly used when creating a query to the database.

       Each DataSource is made up of a set of connections defined as follows:
      • has an id attribute to uniquely identify the connection in this CDA datasource definition. Of course we cannot have two connections with the same id.
       • has a type attribute whose value depends on the connection type. We have a different connection type for each datasource type: sql, mondrian, Pentaho Metadata (MQL), Olap4J
       • a set of other elements that differ depending on the connection type
       We can have one or more connections in the DataSource element according to our needs.
        Configuring connections to relational sources 


        If we look at the samples that come together with the latest CDA distribution, you have a sample for each of the possible connection types. Open the PUC console and go to the folder bi-developer/cda/cdafiles and you'll see the list of cda definitions in the files window. As you can see, every CDA file is identified by a specific icon.

        First interesting point: to open a file we have a CDA file editor integrated in the Pentaho console. Cool! So to open a CDA file select it, right click on the file name and select Edit.



        Let's have a look at how to connect to a relational datasource. We have two samples for that: one connecting through jdbc and another connecting through a JNDI datasource. Open sql-jdbc.cda for example (the jdbc one). As you can see, the editor opens in the right area of the PUC console and shows you the file content. The CDA editor is the standard tool to modify CDA files in a quick and easy way. You can modify your file, save it and also test it using the preview functionality (we'll talk about it later). Looking at the file, immediately after the CDADescriptor element you'll find the DataSource element and, as its child, the connection definitions.


        As you can see, you can find a connection with id="1" and type="sql.jdbc" and then a set of elements describing properties that are directly related to the selected connection type (in this case the definition of a jdbc connection, so we have elements for driver, url, username and password). Take your time to go through all the other samples and see the differences in connection types.
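
        A minimal sketch of such a connection definition, with placeholder driver, url and credentials pointing at the demo SampleData database (check the sql-jdbc.cda sample for the exact element names and values shipped with your distribution):

         <CDADescriptor>
              <DataSources>
                   <Connection id="1" type="sql.jdbc">
                        <Driver>org.hsqldb.jdbcDriver</Driver>
                        <Url>jdbc:hsqldb:hsql://localhost/sampledata</Url>
                        <User>pentaho_user</User>
                        <Pass>password</Pass>
                   </Connection>
              </DataSources>
              <!-- DataAccess definitions follow here -->
         </CDADescriptor>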

        The DataAccess element

        The DataAccess contains the definitions for the query that needs to be executed, the parameters the query accepts and some other side elements that we will explore in the next steps.

        A DataAccess element:
        • has an id
        • has a descriptive name (given through a child element), 
        • links a specific connection id defined in the previous DataSource section
        • contains a query element,
        • contains a set of given parameters (not mandatory),
        • contains a set of other non-mandatory definitions we will go through in the next steps. 
        We can have one or more DataAccess elements in our CDA file.
        So let's go back to our example (sql-jdbc.cda) and continue our exploration.
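
        In case the screenshot isn't visible here, below is a rough, condensed sketch of what the DataAccess section of that sample contains (the query and parameters mirror the examples shown elsewhere in this series; the actual sample file may differ in its details):

         <DataAccess id="1" connection="1" type="sql" access="public">
              <Name>Sql Query on SampleData - Jdbc</Name>
              <Query>
                   select o.YEAR_ID, o.STATUS, sum(o.TOTALPRICE) as price from orderfact o
                   where o.STATUS = ${status} and o.ORDERDATE &gt; ${orderDate}
                   group by o.YEAR_ID, o.STATUS
              </Query>
              <Parameters>
                   <Parameter name="status" type="String" default="Shipped"/>
                   <Parameter name="orderDate" type="Date" pattern="yyyy-MM-dd" default="2003-03-01"/>
              </Parameters>
         </DataAccess>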


        As you can see in the picture above, there are some things to be noted:
        • Remember to give an id and to link the connection defined in the DataSources section 
        • Because we're writing text as the value of an xml element, remember that the comparison operators < and > have to be written respectively as &lt; and &gt;, otherwise you'll get an error.
        • Parameters are given a name. In the query you can reference a parameter with the usual syntax ${<parameter_name>}, where parameter_name is the exact name of the parameter we're going to apply in the where clause.
        • Every parameter has a type that, together with the name, is mandatory. Supported types are: String, Date, Numeric and Integer. Any parameter can have a default value specified.
        • If the parameter's type is Date you can specify the pattern you followed for your date format.
        Test the query

        To test our DataAccess definition before moving forward with our implementation, click on the Preview button. When the Preview form appears you can select the DataAccess definition you want to test and see the results below in a nice looking table.

        Some things to note:
        • the previewer sets the default values for the query parameters (if any) but you can change them according to your needs, 
        • the date format for the orderDate parameter follows the pattern specified in the CDA file definition.

        Now I'll give you the time to play with the samples and experiment a bit. Next time we'll talk about the second part of the DataAccess elements and CompoundDataAccess definitions.

        Friday, March 19, 2010

        A first look into Pentaho CDA

        These days I decided to have a look into Pentaho CDA, the newest idea from Pedro Alves. The project is available here on Google Code. He announced it two months ago on his blog and it immediately looked interesting.

        What is CDA?

        CDA is a BI Platform plugin that allows you to fetch data in various formats. The good things are that it lets you access data using all of the possible data sources you can use in BI Server (SQL, MDX, Metadata, Kettle etc.) and that it can be used by any client (internal or external to BI Server), because you can call it through a simple url and it gives you back a data stream.

        Let's go and compile the code

        So, after some very busy days of hard work where I only had the opportunity to keep my local copy of the code updated with the repository and see what was going on, yesterday I had some time available and I was able to compile it and give it a try. The framework compiles easily with the ant script you find in the source tree. The only thing you have to consider is to create a file named override.properties containing a set of definitions related to your working environment

          plugin.local.install.solutions.dir = <substitute_with_your_biserver_home>/pentaho-solutions   
          plugin.local.install.user = joe   
          plugin.local.install.pass = password   
        

        After having compiled the code successfully you can install it under your BI Server rel. 3.5.2 with the command

         ant install-plugin  
        

        The basics

        Immediately after our successful compilation and installation we can start the bi server and log into the system. Under the bi-developer solution you have a folder named CDA where you can find an extended set of examples to start playing with.

        CDA configuration is based on xml files. Every xml file contains a set of data access definitions. Each one is identified by an id and has a specified data access method. I won't copy and paste a sample here because you can have a look at the samples by yourself. They are very simple and the format is immediately understandable, so you can easily edit them by hand.

        Basically you can have your output data stream in multiple formats. At the time of writing this article you can use JSON (the default), XML, Excel and CSV. This gives you the maximum flexibility on what you can do with that data.

        If you want to execute a query using a data access method defined in your .cda file you can use the following syntax

        http://<biserver_host>:<port>/pentaho/content/cda/doQuery?solution=bi-developers&path=cda/cdafiles&file=sql-jndi.cda&outputType=xml&dataAccessId=1

        This calls the doQuery method to execute the query using the dataAccess with id 1 defined in sql-jndi.cda, located under the bi-developers solution in folder cda/cdafiles, giving you the output as XML. Try to execute it changing the output type and see what happens. Remember that if you don't specify an output type it will be JSON by default.
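
        For instance, to get the same data back as CSV instead, just change the outputType parameter (any of the output formats listed above can be used here):

        http://<biserver_host>:<port>/pentaho/content/cda/doQuery?solution=bi-developers&path=cda/cdafiles&file=sql-jndi.cda&outputType=csv&dataAccessId=1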

        Supposing we want an output type like XML, the answer you'll see in your browser window is something like this

        Shipped20044114929.96Shipped20051513074.46

        Almost meaningless, but if you view the page source you can clearly see the xml stream sent to you by our CDA plugin

         <CdaExport>  
              <MetaData>  
                   <ColumnMetaData index="0" type="String" name="STATUS"/>  
                   <ColumnMetaData index="1" type="Numeric" name="Year"/>  
                   <ColumnMetaData index="2" type="Numeric" name="price"/>  
                   <ColumnMetaData index="3" type="String" name="PriceInK"/>  
              </MetaData>  
              <ResultSet>  
                   <Row>  
                        <Col>Shipped</Col>  
                        <Col>2004</Col>  
                        <Col>4114929.96</Col>  
                        <Col isNull="true"/>  
                   </Row>  
                   <Row>  
                        <Col>Shipped</Col>  
                        <Col>2005</Col>  
                        <Col>1513074.46</Col>  
                        <Col isNull="true"/>  
                   </Row>  
              </ResultSet>  
         </CdaExport>  
        

        Really cool! If you don't want to get your hands dirty changing jdbc definitions, I suggest you try all the files whose name contains the word jndi, which means they use a datasource instead of a direct jdbc connection defined in the file (the latter are the files that contain the word jdbc).

        Read the next part of this series and learn about CDA configuration file basics

        Thursday, February 25, 2010

        Customize CDF jfreechart component

        I was dealing with CDF dashboards these days and I had to set the line-width and line-style attributes in my jFreechartComponent's LineChart widget.

        Looking at the jFreechartComponent documentation you clearly see that in the standard implementation some attributes, specific only to some chart types, aren't implemented. This was the case for the two attributes line-width and line-style related to the LineChart widget. But this is not a problem, because we can add those missing attributes very easily and I'll show you how. The interesting thing is that this way we can manage all the attributes that are available for a specific JFreeChart chart type.

        If you need to add support for missing JFreeChart chart attributes you can follow the steps summarized below. Suppose we want to add the line-width and line-style attributes and access them from the CDF jFreechartComponent as two properties named respectively lineWidth and lineStyle.

        1) Go to <biserver-home>/pentaho-solutions/cdf/components and open jfreechart.xaction with your favourite editor.


        2) Add two new elements to the inputs section of the .xaction file. Below is an example of the file with the two new inputs added. Remember here that the value of the request element (see lines 5 and 11 below) has to match, in name and in case, the name of the property we want to put in our CDF jFreechartComponent definition. Then be careful to correctly assign the default value needed in case the property isn't provided by your CDF component call (see lines 7 and 13). This is particularly important when our attribute is a number: if you don't assign it properly you'll get a NumberFormatException.

        1:  <inputs>   
        2:    <!-- Omitted for brevity -->       
        3:       <LINEWIDTH type="string">  
        4:            <sources>  
        5:                 <request>lineWidth</request>  
        6:            </sources>  
        7:            <default-value>1.5</default-value>  
        8:       </LINEWIDTH>  
        9:       <LINESTYLE type="string">  
        10:            <sources>  
        11:                 <request>lineStyle</request>  
        12:            </sources>  
        13:            <default-value/>  
        14:       </LINESTYLE>  
        15:    <!-- Omitted for brevity -->       
        16:  </inputs>  
        

        3) Add the new inputs previously defined in the inputs section (see above) to the action-inputs section of our ChartComponent action-definition (as shown below at lines 7 and 8). Then, use these new fields to populate two new property elements added to the chart-attributes section, as shown at lines 25-26. Be very careful here because the name of the element you give to every new chart attribute has to be equal to the name of the related attribute in the JFreeChart library. For example: if in the JFreeChart library the line width attribute is named line-width, that is the name to be used as the child element of the chart-attributes element. The value to be assigned to that new element is the name of the related action input element (for line-width we assign the LINEWIDTH element as value; see line 25 below).


        1:       <action-definition>   
        2:            <component-name>ChartComponent</component-name>  
        3:            <action-type>Chart</action-type>  
        4:            <action-inputs>   
        5:                 <chart-data type="result-set" mapping="newResults"/>   
        6:                  <!-- Omitted for brevity -->       
        7:                 <LINEWIDTH   type="string"/>  
        8:                 <LINESTYLE   type="string"/>  
        9:                 <!-- Omitted for brevity -->       
        10:            </action-inputs>  
        11:            <action-resources/>  
        12:            <action-outputs>   
        13:                 <chart-filename type="string"/>   
        14:                 <base-url type="string"/>   
        15:                 <chart-mapping type="string"/>   
        16:                 <image-tag type="string"/>   
        17:                 <chart-output type="content"/>   
        18:            </action-outputs>  
        19:            <component-definition>   
        20:                 <width>{WIDTH}</width>   
        21:                 <height>{HEIGHT}</height>   
        22:                 <by-row>{BYROW}</by-row>   
        23:                 <chart-attributes>   
        24:                 <!-- Omitted for brevity -->       
        25:                      <line-width>{LINEWIDTH}</line-width>  
        26:                      <line-style>{LINESTYLE}</line-style>  
        27:                 <!-- Omitted for brevity -->       
        28:                 </chart-attributes>   
        29:            </component-definition>   
        30:       </action-definition>  
        

        Thursday, February 18, 2010

        BI Server & MS Active Directory in 10 minutes

        Recently I had the need to connect Pentaho to MS Active Directory for user authentication/authorization. Immediately I asked myself how to connect Pentaho BI Server to Microsoft Active Directory and I answered "Oh don't worry... it will take no more than 10 minutes!". Then the "look for a how-to document..." discovery process started.

        I found a lot of documentation about this issue (wiki articles, forum threads) but there isn't a well-done survival guide on this problem (that's my opinion).

        So I'll try to summarize in a few lines the steps I followed and the problems I encountered, to build a sort of survival guide for anyone with the same issue to solve.

        Have a look at spring configuration files

        BI Server security architecture is based on Spring Security, so the first guide to read is the Spring documentation about LDAP configuration. Even better, in case you don't know anything about that, take a step back and read about the general architecture of Spring Security.

        Spring Security beans are wired together through the Spring application context and in Pentaho all the needed Spring application context files are located in <biserver_home>/pentaho-solutions/system. You'll find a lot of them there but the important things to know are:

        pentaho-spring-beans.xml contains the list of imported Spring bean files that will be loaded when BI Server starts.
        We have two important file groups there, named applicationContext-spring* and applicationContext-pentaho*. In each group, you have one file for every available authentication method defined in Pentaho. The beans located in files belonging to the applicationContext-spring group contain definitions for the Spring-related beans needed to configure the specified authentication/authorization method. The beans located in files belonging to the applicationContext-pentaho group contain definitions of the Pentaho beans involved in authorization/authentication for the specific authentication method (LDAP, Hibernate, jdbc).

        So how to configure Pentaho to work with MS Active Directory?

        The setup to have Pentaho working with MS Active Directory is really simple if you know exactly what to do and how. I'll try to summarize everything in the following paragraphs. As detailed above, all the files we will mention are located in <biserver_home>/pentaho-solutions/system.

        1. Set up an MS Active Directory user that BI Server can connect with. You need to define a user in MS Active Directory so that BI Server can connect and check whether the Pentaho user that is going to authenticate exists and is valid. The user you're going to define in MS Active Directory doesn't need any special rights, so a normal user is the most appropriate. Remember to check that the "password never expire" flag is not set for this user.

        2. Set up the Spring Security files needed to enable LDAP server authentication/authorization. This is a key point, I think the major one. First of all read the guidelines provided here about some rules to be followed when editing Spring configuration files, particularly regarding white spaces and special characters. Then follow the points detailed here.

        2.a) Open applicationContext-security-ldap.properties and change the properties according to your needs. The useful thing about this file is that it contains all the properties needed to configure the Spring beans, so that we don't need to look for them in each xml file. They're all in one single place. Following you'll find an example:

         contextSource.providerUrl=ldap\://ldaphost\:389  
         contextSource.userDn=cn=ldapuser,OU=my_org_unit,dc=my_dc  
         contextSource.password=password  
         userSearch.searchBase=OU=my_org_unit,dc=my_dc  
         userSearch.searchFilter=(sAMAccountName=\{0\})  
         populator.convertToUpperCase=false  
         populator.groupRoleAttribute=cn  
         populator.groupSearchBase=OU=my_org_unit,dc=my_dc  
         populator.groupSearchFilter=(member=\{0\})  
         populator.rolePrefix=  
         populator.searchSubtree=true  
         allAuthoritiesSearch.roleAttribute=cn  
         allAuthoritiesSearch.searchBase=OU=my_org_unit,dc=my_dc  
         allAuthoritiesSearch.searchFilter=(objectClass=group)  
        

        Important things to note here are
        • contextSource.providerUrl - LDAP server url
        • contextSource.userDn - LDAP username. This is the user we've talked about in 1) above
        • contextSource.password - LDAP user password
        • the populator properties are needed by Spring DefaultLdapAuthoritiesPopulator. That object is needed to load the set of authorities the user was granted.
        • the userSearch properties configure the attributes needed to fill up the "users" box when assigning permissions to reports etc.
        • the allAuthoritiesSearch properties configure the attributes needed to fill up the "roles" portion of the permissions box when setting them for a report etc.
        The excerpt I gave above is fully working, so you can copy and paste it into your properties file, changing only the definitions specific to your installation. At this point you have completed almost 70% of the configuration required to have everything working on your system. Be careful to declare the full DN (distinguished name) when you work with MS Active Directory because, if not, it's almost sure you'll get an error like this (the first time I tried I had such an error for this very reason)

         Microsoft Active Directory Error:  
           javax.naming.AuthenticationException:  
           [LDAP: error code 49 - 80090308: LdapErr: DSID-0C09030B, comment:  
                                AcceptSecurityContext error, data 525, v893 ] 
             at com.sun.jndi.ldap.LdapCtx.mapErrorCode(Unknown Source) 
             at com.sun.jndi.ldap.LdapCtx.processReturnCode(Unknown Source) 
             at com.sun.jndi.ldap.LdapCtx.processReturnCode(Unknown Source) 
             at com.sun.jndi.ldap.LdapCtx.connect(Unknown Source) 
             at com.sun.jndi.ldap.LdapCtx.(Unknown Source) 
             at com.sun.jndi.ldap.LdapCtxFactory.getUsingURL(Unknown Source) 
             at com.sun.jndi.ldap.LdapCtxFactory.getUsingURLs(Unknown Source) 
             at com.sun.jndi.ldap.LdapCtxFactory.getLdapCtxInstance(Unknown Source) 
             at com.sun.jndi.ldap.LdapCtxFactory.getInitialContext(Unknown Source) 
             at javax.naming.spi.NamingManager.getInitialContext(Unknown Source) 
             at javax.naming.InitialContext.getDefaultInitCtx(Unknown Source) 
             at javax.naming.InitialContext.init(Unknown Source) 
             at javax.naming.InitialContext.(Unknown Source) 
             at javax.naming.directory.InitialDirContext.(Unknown Source) 

        You can find short but useful information about this issue here

        2.b) Two words about the default role. Anytime a user logs into Pentaho, the system assigns the user a default role of "Authenticated". This is a really important point to have everything working. If that role isn't assigned you're able to connect, but the Mantle will never show anything to you. My problem was that in my Spring LDAP context file the definitions to have that role assigned by default were missing, so I had to add them manually. So be absolutely sure to check that in applicationContext-spring-security-ldap.xml you have the defaultRole property defined with a value of Authenticated in the populator bean, and if it's missing add it. Below is an excerpt of that bean definition with the defaultRole property added, so that you can copy and paste it if it's missing from your file.

         <bean id="populator" class="org.springframework.security.ldap.populator.DefaultLdapAuthoritiesPopulator">  
          <!-- omitted -->  
          <property name="defaultRole" value="Authenticated" />  
          <!-- omitted -->  
         </bean>  
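
        For reference, here is a rough sketch of how the whole populator bean may look in that file. The constructor arguments and the ${populator.*} placeholders below are what I'd expect from a stock Pentaho 3.x installation, so treat it as a guide and double-check the names against your own applicationContext-spring-security-ldap.xml rather than copying it blindly.

         <!-- Sketch only: verify the placeholder names against your own file -->  
         <bean id="populator" class="org.springframework.security.ldap.populator.DefaultLdapAuthoritiesPopulator">  
          <constructor-arg index="0"><ref local="contextSource" /></constructor-arg>  
          <constructor-arg index="1" value="${populator.groupSearchBase}" />  
          <property name="groupRoleAttribute" value="${populator.groupRoleAttribute}" />  
          <property name="groupSearchFilter" value="${populator.groupSearchFilter}" />  
          <property name="rolePrefix" value="${populator.rolePrefix}" />  
          <property name="convertToUpperCase" value="${populator.convertToUpperCase}" />  
          <property name="defaultRole" value="Authenticated" />  
         </bean>  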
        

        3) Rework the imports in pentaho-spring-beans.xml so that the LDAP security beans are loaded at startup. In the listing below, lines 8-9 disable the default DAO/Hibernate security and lines 10-11 import the new definitions that enable LDAP security.

        1:  <beans>  
        2:   <import resource="pentahoSystemConfig.xml" />  
        3:   <import resource="adminPlugins.xml" />  
        4:   <import resource="systemListeners.xml" />  
        5:   <import resource="sessionStartupActions.xml" />  
        6:   <import resource="applicationContext-spring-security.xml" />  
        7:   <import resource="applicationContext-common-authorization.xml" />  
        8:   <!-- import resource="applicationContext-spring-security-hibernate.xml" />  
        9:   <import resource="applicationContext-pentaho-security-hibernate.xml" / -->  
        10:   <import resource="applicationContext-spring-security-ldap.xml" />  
        11:   <import resource="applicationContext-pentaho-security-ldap.xml" />  
        12:   <import resource="pentahoObjects.spring.xml" />  
        13:  </beans>  
        

        4) Now it's time to define a set of groups in your MS Active Directory for your Pentaho users (if you don't already have such groups), depending on your authorization needs. You need at least one group to contain the Pentaho administrators. In my system I called that group PentahoAdmin.

        5) Declare the new admin group in the Pentaho configuration to grant it administrative rights. To do that, rework the acl-voter element in pentaho.xml as shown below.

              <acl-voter>  
                   <!-- What role must someone be in to be an ADMIN of Pentaho -->  
                   <admin-role>PentahoAdmin</admin-role>  
              </acl-voter>  
        

        6) Rework the acl-publisher definitions in pentaho.xml for all the Pentaho groups defined in the LDAP server. In my system I defined two roles, PentahoAdmin and PentahoUser, so my configuration looks like this:

          <acl-publisher>  
                   <!--   
                        These acls are used when publishing from the file system. Every folder  
                        gets these ACLS. Authenticated is a "default" role that everyone  
                        gets when they're authenticated (be sure to setup your bean xml properly  
                        for this to work).  
                   -->  
                   <default-acls>  
                        <acl-entry role="PentahoAdmin" acl="FULL_CONTROL" />                    <!-- Admin users get all authorities -->  
                        <!-- acl-entry role="cto" acl="FULL_CONTROL" / -->                    <!-- CTO gets everything -->  
                        <acl-entry role="PentahoUser" acl="EXECUTE_SUBSCRIBE" />          <!-- PentahoUser gets execute/subscribe -->  
                        <acl-entry role="Authenticated" acl="EXECUTE" />          <!-- Authenticated users get execute only -->  
                   </default-acls>  
                   <!--  
                        These acls are overrides to specific file/folders. The above default-acls will  
                        be applied and then these overrides. This allows for specific access controls to  
                        be loaded when the repository if first populated. Futher changes to acls can be  
                        made in the platform GUI tool. Uncomment these and change add or delete to your hearts desire -->                           
                   <overrides>  
                        <file path="/pentaho-solutions/admin">  
                             <acl-entry role="PentahoAdmin" acl="FULL_CONTROL" />  
                        </file>  
                   </overrides>  
              </acl-publisher>  
        

        7) Stop and restart your Pentaho server and everything is ready for a try.

        I hope I haven't missed anything and that everything is clear enough for everyone who reads these few instructions. Let me know if you have any problems so that I can keep this brief guide updated.

        Tuesday, February 2, 2010

        Pentaho BI Server language files (1/2)

        The last month and a half has been a long stretch of hard work, but now I can see a little light at the end of the tunnel and I have found the time to come back. This time I have something interesting for all the Italian friends. I completed:

        1. the Italian language files for the Mantle GWT client. The files have already been submitted to the Pentaho JIRA, hoping they will be incorporated in the next release of the product.
        2. the Italian language file for the JPivot toolbar. This was a minor effort because it is a very small file, so it took only a few minutes to complete. I have already committed this language file to JPivot's CVS trunk.

        Below I detail the steps to follow to install the new language on existing BI Server installations. The files are compatible with Pentaho 3.x. A special thanks to my colleague and dear friend Andrea Pasotti, who helped (and is still helping) me with this work.

        How to activate Italian language support for the Mantle GWT client
        • Download the files messages_it.properties and MantleLoginMessages_it.properties from the following JIRA link
        • Stop BI Server
        • Copy the file messages_it.properties to <biserver_home>/webapps/pentaho/mantle/messages
        • Open the file <biserver_home>/webapps/pentaho/mantle/messages/supported_languages.properties
        • Add the following line:
          it=Italiano
        • Save and close the file
        • Copy the file MantleLoginMessages_it.properties to <biserver_home>/webapps/pentaho/mantleLogin/messages
        • Open the file <biserver_home>/webapps/pentaho/mantleLogin/messages/supported_languages.properties
        • Add the following line (the resulting file is sketched after these steps):
          it=Italiano
        • Save and close the file
        • Start BI Server
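
        After this edit, each supported_languages.properties will look something like the minimal sketch below; I'm assuming English is already listed, and your file may of course contain more languages.

         en=English  
         it=Italiano  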
        How to activate Italian language support for JPivot toolbar
        • Download the resources_it.properties language file from the following link
        • Go to <biserver_home>/webapps/pentaho/WEB-INF/classes and create the following directory path com/tonbeller/jpivot/toolbar
        • Copy resources_it.properties to <biserver_home>/webapps/pentaho/WEB-INF/classes/com/tonbeller/jpivot/toolbar
        • Start BI Server
        I'm going to finish the complete translation of the GUI, so in the next few weeks the Italian language files for the PAC (Pentaho Administration Console) and Ad-Hoc Query Reporting will follow. So what else... stay tuned!