Friday, April 19, 2013

Pretty print XML and JSON files with PDI (1/2)

These days I've got into a stupid trouble with PDI. A customer asked me to build a process that produces an XML file as a result; the result I get was horrible from a presentation standpoint: the file wasn't rightly pretty printed and that makes it very difficult to be read. You will encounter this problem anytime you are going to obtain XML files as output, but the same thing happens also anytime you want a JSON output. So this time I decided to take care of this by trying to get into a quick and simple solution to beautify my output; this is what I'm going to talk about in this post.

Pretty Printing XML files How-To

So what to do in case I need to pretty print XML files? Well that's very simple. PDI ships with dom4j, a very well known library that let you easily work with XML, XPATH and XSLT. That library has an object in its API model, OutputFormat, which represents the format configuration used to format the XML output. There you find a method called createPrettyPrint(); this is all the needed to reach our goal very effectively. But now how can we apply all of this to the XML that we're going to produce in our transformation? Very easy to do.

Let's start considering one of the samples shipped with PDI the multilayer XML files sample. You can find it in the samples/transformations directory in you PDI home directory. Take the original file and make a copy so that we can start working on it.

Now, in between the Text Output step and the 5th XML Join step (named "XML Join Step 5") I added two new steps a Select Value step and a User Defined Java Step as you can see in the detail shown in the figure below

While the Select values steps selects only the xmlOutput field that contains the raw xml that needs to be beautified, the User Defined Java class step is the place where the dirty job is made. To properly configure the User Defined Java Class step, that I renamed "Pretty Print Out XML", we need to

  1. Under the tab Fields below, define a new field named xpp whose type is String that will contain the produced pretty printed xml stream
  2. Select the tab Parameters 
  3. Define a new parameter XMLOUTPUT_FIELD whose value is set to xmlOutput 
  4. Define a new parameter XMLPP_FIELD whose value is set to xmlpp
  5. Copy and paste the code below in the related area in the task properties dialog

1:  import;  
2:  import;  
3:  import org.dom4j.Document;  
4:  import org.dom4j.DocumentHelper;  
5:  import;  
6:  String xmlOutputField;  
7:  String xmlPPField;  
8:  public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException  
9:  {  
10:    // First, get a row from the default input hop  
11:       //  
12:       Object[] r = getRow();  
13:    // If the row object is null, we are done processing.  
14:       //  
15:       if (r == null) {  
16:            setOutputDone();  
17:            return false;  
18:       }  
19:       // Let's look up parameters only once for performance reason.  
20:       //  
21:       if (first) {  
22:            xmlOutputField = getParameter("XMLOUTPUT_FIELD");  
23:            xmlPPField = getParameter("XMLPP_FIELD");  
24:         first=false;  
25:       }  
26:    // It is always safest to call createOutputRow() to ensure that your output row's Object[] is large  
27:    // enough to handle any new fields you are creating in this step.  
28:       //  
29:    Object[] outputRow = createOutputRow(r, data.outputRowMeta.size());  
30:       logBasic("Input row size: " + r.length);  
31:       logBasic("Output row size: " + data.outputRowMeta.size());  
32:       String xmlOutput = get(Fields.In, xmlOutputField).getString(r);  
33:    StringWriter sw;  
34:    try {  
35:      OutputFormat format = OutputFormat.createPrettyPrint();  
36:      Document document = DocumentHelper.parseText(xmlOutput);  
37:      sw = new StringWriter();  
38:      XMLWriter writer = new XMLWriter(sw, format);  
39:      writer.write(document);  
40:    }  
41:    catch (Exception e) {  
42:      throw new RuntimeException("Error pretty printing xml:\n" + xmlOutputField, e);  
43:    }  
44:    String xmlpp = sw.toString();  
45:       // Set the value in the output field  
46:       //  
47:       get(Fields.Out, xmlPPField).setValue(outputRow, xmlpp);  
48:    // putRow will send the row on to the default output hop.  
49:       //  
50:    putRow(data.outputRowMeta, outputRow);  
51:       return true;  
52:  }  

The code we're interested in is between lines 32 and 44.

  1. The xml that is caming in is read from the xmlOutput row field and is put into the xmlOutput variable
  2. We create an OutputFormat object with the pretty printing enabled by using  OutputFormat.createPrettyPrint() fuction (see line 35).
  3. Parse the XML document and then write it to a StringWriter (see lines 36 - 39)
  4. Put the obtained pretty printed XML in the xmlpp variable and the into the new field created in the output row (see lines 44 - 47)
The resulting XML stream is then saved to disk by mean of the Text Output step.
We made tests by pretty printing quite significant XML files in terms of size with good performances so I strongly suggest you to use this method. We're also planning to build a custom task to do this in a smoother way that will be released to SeraSoft github public repository soon. 

You can download the following sample by clicking here. So for now we've done; next time I'll show you how to beautify a JSON file. Stay tuned!

No comments:

Post a Comment