Friday, April 19, 2013

Pretty print XML and JSON files with PDI (1/2)

These days I've got into a stupid trouble with PDI. A customer asked me to build a process that produces an XML file as a result; the result I get was horrible from a presentation standpoint: the file wasn't rightly pretty printed and that makes it very difficult to be read. You will encounter this problem anytime you are going to obtain XML files as output, but the same thing happens also anytime you want a JSON output. So this time I decided to take care of this by trying to get into a quick and simple solution to beautify my output; this is what I'm going to talk about in this post.

Pretty Printing XML files How-To

So what to do in case I need to pretty print XML files? Well that's very simple. PDI ships with dom4j, a very well known library that let you easily work with XML, XPATH and XSLT. That library has an object in its API model, OutputFormat, which represents the format configuration used to format the XML output. There you find a method called createPrettyPrint(); this is all the needed to reach our goal very effectively. But now how can we apply all of this to the XML that we're going to produce in our transformation? Very easy to do.

Let's start considering one of the samples shipped with PDI the multilayer XML files sample. You can find it in the samples/transformations directory in you PDI home directory. Take the original file and make a copy so that we can start working on it.

Now, in between the Text Output step and the 5th XML Join step (named "XML Join Step 5") I added two new steps a Select Value step and a User Defined Java Step as you can see in the detail shown in the figure below

While the Select values steps selects only the xmlOutput field that contains the raw xml that needs to be beautified, the User Defined Java class step is the place where the dirty job is made. To properly configure the User Defined Java Class step, that I renamed "Pretty Print Out XML", we need to


  1. Under the tab Fields below, define a new field named xpp whose type is String that will contain the produced pretty printed xml stream
  2. Select the tab Parameters 
  3. Define a new parameter XMLOUTPUT_FIELD whose value is set to xmlOutput 
  4. Define a new parameter XMLPP_FIELD whose value is set to xmlpp
  5. Copy and paste the code below in the related area in the task properties dialog

1:  import java.io.StringWriter;  
2:  import org.dom4j.io.OutputFormat;  
3:  import org.dom4j.Document;  
4:  import org.dom4j.DocumentHelper;  
5:  import org.dom4j.io.XMLWriter;  
6:  String xmlOutputField;  
7:  String xmlPPField;  
8:  public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException  
9:  {  
10:    // First, get a row from the default input hop  
11:       //  
12:       Object[] r = getRow();  
13:    // If the row object is null, we are done processing.  
14:       //  
15:       if (r == null) {  
16:            setOutputDone();  
17:            return false;  
18:       }  
19:       // Let's look up parameters only once for performance reason.  
20:       //  
21:       if (first) {  
22:            xmlOutputField = getParameter("XMLOUTPUT_FIELD");  
23:            xmlPPField = getParameter("XMLPP_FIELD");  
24:         first=false;  
25:       }  
26:    // It is always safest to call createOutputRow() to ensure that your output row's Object[] is large  
27:    // enough to handle any new fields you are creating in this step.  
28:       //  
29:    Object[] outputRow = createOutputRow(r, data.outputRowMeta.size());  
30:       logBasic("Input row size: " + r.length);  
31:       logBasic("Output row size: " + data.outputRowMeta.size());  
32:       String xmlOutput = get(Fields.In, xmlOutputField).getString(r);  
33:    StringWriter sw;  
34:    try {  
35:      OutputFormat format = OutputFormat.createPrettyPrint();  
36:      Document document = DocumentHelper.parseText(xmlOutput);  
37:      sw = new StringWriter();  
38:      XMLWriter writer = new XMLWriter(sw, format);  
39:      writer.write(document);  
40:    }  
41:    catch (Exception e) {  
42:      throw new RuntimeException("Error pretty printing xml:\n" + xmlOutputField, e);  
43:    }  
44:    String xmlpp = sw.toString();  
45:       // Set the value in the output field  
46:       //  
47:       get(Fields.Out, xmlPPField).setValue(outputRow, xmlpp);  
48:    // putRow will send the row on to the default output hop.  
49:       //  
50:    putRow(data.outputRowMeta, outputRow);  
51:       return true;  
52:  }  

The code we're interested in is between lines 32 and 44.


  1. The xml that is caming in is read from the xmlOutput row field and is put into the xmlOutput variable
  2. We create an OutputFormat object with the pretty printing enabled by using  OutputFormat.createPrettyPrint() fuction (see line 35).
  3. Parse the XML document and then write it to a StringWriter (see lines 36 - 39)
  4. Put the obtained pretty printed XML in the xmlpp variable and the into the new field created in the output row (see lines 44 - 47)
The resulting XML stream is then saved to disk by mean of the Text Output step.
We made tests by pretty printing quite significant XML files in terms of size with good performances so I strongly suggest you to use this method. We're also planning to build a custom task to do this in a smoother way that will be released to SeraSoft github public repository soon. 

You can download the following sample by clicking here. So for now we've done; next time I'll show you how to beautify a JSON file. Stay tuned!

Saturday, June 23, 2012

Saiku UI Enhancements and new Excel export

Hi everybody. As promised we've some chestnuts that are ready to be eaten and here we are! We decided to start working on one of the major pieces of the Pentaho CE suite: the newest OLAP client Saiku. We have some enhancement in our pipeline to help the Trout, Paul and the other guys of the crew; this is the first of the list. We decided to start with two enhancements that makes all more beautiful to our customers eyes:

1) Improved columns header management
2) New and improved Excel export functionality

Improved columns header management

A thing that I totally don't like in Saiku was the way it managed the columns headers any time we had repeated values on it. You can see a sample of this old behavior in the figure below.



Now the things are totally different. With a little UI improvement column headers with repeated values are grouped together as displayed in the picture below to have better visualization of your data.



New Excel export

Here we have completely rewritten the export module using Apache POI instead of JExcelAPI. This gives us more power and more feature that could be useful in the future to enhance the functionality. The look and feel now is more elegant and exactly equal to what the user is seeing in the user interface. You can see a sample of the new output below




This is a first release beta release of these features so be careful if you decide to use it. Any feedback about how they work and what you would like to have next is fully appreciated.

Where you can find it

All the code is committed to the GitHub Serasoft repository. You can find it here. We're planning the installation of an Hudson server on our servers in the cloud so that you can pickup anything already compiled so be patient. At the moment you can only grab the sources and compile it. Remember that our code is always aligned with all the new functionalities of the standard Saiku distribution. So what else? Stay tuned for the next things that are going to be released...

Friday, June 22, 2012

I'm back after a long time missing....

Hi everybody and sorry for not writing anything for so much time. Write a post is a nice thing and I love it but it takes time and during this last period I haven't had any time to write anything here or to cooperate tightly with my favorite community: the Pentaho community. During that time I was taken starting my own little company here in Italy (initially I spent 7 years as a freelance consultant) called SeraSoft. I decided to do that because I though it was the moment to change something in my professional life. This is a new job for me and it is really hard: manage a new business in a  so competitive environment as for the Information Technology is much harder than developing and designing good software. We have a little office in Boffalora Sopra Ticino, a little town located 20 km far from Milan near to the river Ticino. It is a good place to stay away from the traffic of the city but at the same time we're near and it takes us really a few to be in the city.


Visualizzazione ingrandita della mappa

During these last months we made a lot of interesting things on real life projects with Pentaho and also other interesting Open Source technologies like Liferay, SugarCRM and Openbravo working with companies of any size in sectors like Banking, Telecommunication, Services, Industry and Automotive. Lastly I decided that it was the time to came back heavily working in the community and I re-started doing something I hope interesting for anyone giving my two cents to help the Open Source ecosystem to grow. I'm trying to take this idea of cooperating also to my colleagues so we hope to be able to quickly give our contribution as a company.

That said I thought it's time to came back and start sharing with you all the experiences of myself and my team about Pentaho and the other products we're using and I hope they could be of interest for you all. So stay tuned and for the next news from this channel and also follow us on GitHub to look directly at what we're doing. So see you soon! There are some chestnuts that are cooking on the fire and that are almost ready to be eaten.


Sunday, November 21, 2010

Adjust detail rows height in WAQR templates dynamically

Today I'm back writing something about WAQR. Yesterday one of my customers wrote me about a problem they were having while executing WAQR reports that has a long text in their columns. If you had columns with a very long text in it and you're exporting the WAQR report in PDF  the columns text gets truncated. Stupid problem but apparently not so trivial to be solved within WAQR templates.

About WAQR

WAQR (Web AdHoc Query Reporting) is an interesting module in the Pentaho suite. It sits on top of the report engine and the metadata layer and lets the users easily build tabular reports to be used for their daily activities or to just to export some complex data in an easy way from the Pentaho system. The report definition is based on a wizard that takes the users along these easy steps:

  1. select a template, from the set of available templates, and a business model. 
  2. decide were to put which fields in the report layout
  3. adjust some visualization attributes or/and can apply filter conditions and define sort fields and orders
  4. manage some page layout attributes
  5. .... and here we go! We get the report.

WAQR templates are basically the old JFreereport report designer's template files so they are simple xml files we can view and modify. THEY ARE NOT compatible with the newer report designer template filesThe wizard uses text manipulation routines to create a report-definition out of the template. We all know that sooner or later WAQR will be replaced by something more interactive and more attractive  that will use the latest report engine's version. But for the moment we have this and with this we have to battle.

How to expand the height of the detail rows dynamically when export type is PDF

Try to build a new report with WAQR putting in the report definition a field with a very log text in it. If you try to export the report using the PDF format you will get the text truncated. To fix this it's only a matter of minutes and you need to modify the report template

  • adding a new configuration attribute for the report engine and 
  • add a new attribute to the details band to set the row height dynamic. 

Below I'm going to summarize all these steps modifying the Basic template given with the Pentaho demo solution.
  1. Go to <biserver_home>/pentaho-solutions/system/waqr/templates/Basic and open jfreereport-template.xml
  2. Locate the configuration xml element near the end of the file. Add the following line as a child of the configuration element
     <property name="org.pentaho.reporting.engine.classic.core.modules.output.pageable.pdf.AssumeOverflowY">true</property>  
    
  3. Add the attribute dynamic = "true" to the items element
  4. Save the template and if the BI Server is running refresh the cache using the Tools -> Refresh -> Repository Cache menu entry