Saturday, June 7, 2014

Building a Data Mart with Pentaho Data Integration - Video Course Review

A few days ago I had the opportunity to look at this video course, Building a Data Mart with Pentaho Data Integration from Packt, the same publisher I have been working with on other Pentaho books.

I was really interested in evaluating it, first because the author, Diethard Steiner, is a well-regarded community member and the author of a great number of useful articles on his own blog, and second because it seemed to me a good introduction to approaching the design and development of a complete ETL process to load a data mart with Pentaho Data Integration.

The course is structured in nine chapters, each clearly explained and opened by a good preamble that gives even beginner users a clear view of the topic the author is covering. The explanations are easy to understand because the author speaks good English at the right speed (in my opinion), giving even people who are not perfectly familiar with the language the ability to clearly grasp the concepts.

The first chapter covers how to set up PDI (Pentaho Data Integration) and illustrates the source and target schemas and the basics of their design. The thing I liked a lot here was that the author used a columnar database (MonetDB) as the target platform for the data mart: this way, the viewer gets the opportunity to learn about another interesting database technology to use in his/her work. Since the vast majority of users may not know anything about this database platform, I would have appreciated a few more details about it, just to give the necessary background on the database and the tools used.
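For what it's worth, MonetDB speaks plain SQL and ships an official Python DB-API driver, pymonetdb, so a first contact with it is quick. Here is a minimal sketch, assuming a local server with the default credentials and a hypothetical "demo" database:

    import pymonetdb  # MonetDB's Python DB-API driver

    # Assumptions: a local MonetDB server with the default monetdb/monetdb
    # credentials and a database named "demo".
    conn = pymonetdb.connect(username="monetdb", password="monetdb",
                             hostname="localhost", database="demo")
    cur = conn.cursor()

    # MonetDB stores each column separately, so a query touching only a
    # few columns of a wide fact table reads far less data than a row store.
    cur.execute("SELECT COUNT(*) FROM sys.tables")
    print(cur.fetchone())
    conn.close()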

The second chapter shows the Table Input and Table Output steps and their use for reading and writing large amounts of data in a PDI transformation. It was also good to see an illustration of the MonetDB Bulk Loader step as a more optimized alternative to Table Output.
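For context, the bulk loader builds on MonetDB's COPY INTO statement, which loads a whole file in one shot instead of issuing one INSERT per row. A minimal sketch of the difference, where the sales table, the file path and the delimiters are illustrative assumptions:

    import pymonetdb

    conn = pymonetdb.connect(username="monetdb", password="monetdb",
                             hostname="localhost", database="demo")
    cur = conn.cursor()

    # Row-by-row loading, roughly what the plain Table Output step does:
    rows = [(1, "2014-06-07", 99.90), (2, "2014-06-07", 12.50)]
    cur.executemany("INSERT INTO sales VALUES (%s, %s, %s)", rows)

    # Bulk loading: a single COPY INTO reads the whole file on the server
    # side. The file must be readable by the MonetDB server process.
    cur.execute("COPY INTO sales FROM '/tmp/sales.csv' "
                "USING DELIMITERS ',', '\\n', '\"'")
    conn.commit()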

The third chapter is all about Agile BI with PDI and the Agile BI plugin: a good introduction to the benefits of early prototyping and how to achieve it with Pentaho technology.

The fourth chapter is where the author covers Slowly Changing Dimensions and how to use PDI to read and write data in Slowly Changing Dimensions of Type 1 and Type 2. Here I would have appreciated just one more slide giving beginner users a bit more theory about Slowly Changing Dimensions, to let them clearly understand which problems this approach addresses in the design of a data mart.
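For beginners landing here: a Type 1 dimension simply overwrites the changed attribute and loses history, while a Type 2 dimension closes the current record and inserts a new version of it, which is what PDI's Dimension Lookup/Update step does with its version and date-range columns. A minimal Python sketch of the Type 2 logic, with field names that are my assumption rather than the course's:

    from datetime import date

    OPEN_END = date(9999, 12, 31)  # "still current" marker for valid_to

    def apply_scd2(dim_rows, natural_key, incoming, today):
        """Type 2 update: expire the current version, append a new one."""
        current = next((r for r in dim_rows
                        if r[natural_key] == incoming[natural_key]
                        and r["valid_to"] == OPEN_END), None)
        if current is None:
            version = 1                      # first time we see this key
        elif all(current.get(k) == v for k, v in incoming.items()):
            return dim_rows                  # nothing changed, keep as is
        else:
            current["valid_to"] = today      # close the old version
            version = current["version"] + 1
        dim_rows.append({**incoming, "version": version,
                         "valid_from": today, "valid_to": OPEN_END})
        return dim_rows

    dim = apply_scd2([], "customer_id",
                     {"customer_id": 1, "city": "Rome"}, date(2014, 1, 1))
    dim = apply_scd2(dim, "customer_id",
                     {"customer_id": 1, "city": "Milan"}, date(2014, 6, 1))
    # dim now holds two versions: Rome (closed on 2014-06-01) and Milan (open)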

The fifth chapter covers the time dimension: how to structure it simply and how to effectively generate the data to populate it.
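The generation itself is mechanical: one row per calendar day, with the attributes users will slice by pre-computed. A minimal sketch (the attribute list is my choice, not necessarily the one shown in the course):

    from datetime import date, timedelta

    def generate_date_dimension(start, end):
        """Yield one row per calendar day between start and end, inclusive."""
        d = start
        while d <= end:
            yield {
                "date_key": int(d.strftime("%Y%m%d")),  # e.g. 20140607
                "full_date": d,
                "year": d.year,
                "quarter": (d.month - 1) // 3 + 1,
                "month": d.month,
                "month_name": d.strftime("%B"),
                "day_of_week": d.isoweekday(),          # 1 = Monday
                "is_weekend": d.isoweekday() >= 6,
            }
            d += timedelta(days=1)

    rows = list(generate_date_dimension(date(2014, 1, 1), date(2014, 12, 31)))
    print(len(rows))  # 365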

The sixth chapter is a guide through the techniques used to populate the fact table. The same topic continues in the eighth chapter, where the author talks about Change Data Capture (CDC) and how to structure jobs and transformations to load the fact table. Here, in chapter eight, I think something is missing and the CDC topic is only partially addressed: I would have liked to hear something about using the Merge Rows (diff) step to get the table's delta from the previous data load and then synchronizing the fact table through the Synchronize After Merge step. In my opinion this is a more generic and interesting way to address the problem, as sketched below.
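For readers who have not seen that pattern: Merge Rows (diff) compares the previous load with the current extract on a key and flags every row as new, changed, deleted, or identical; Synchronize After Merge then turns those flags into inserts, updates, and deletes. A minimal Python sketch of the flagging logic (the flag names match the PDI step; everything else is illustrative):

    def merge_diff(previous, current, key):
        """Flag rows new/changed/deleted/identical, like Merge Rows (diff)."""
        prev = {r[key]: r for r in previous}
        curr = {r[key]: r for r in current}
        for k in curr.keys() - prev.keys():
            yield {**curr[k], "flag": "new"}        # insert into the fact table
        for k in prev.keys() - curr.keys():
            yield {**prev[k], "flag": "deleted"}    # delete (or soft-delete)
        for k in curr.keys() & prev.keys():
            yield {**curr[k],
                   "flag": "changed" if curr[k] != prev[k] else "identical"}

    old = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]
    new = [{"id": 2, "amount": 25}, {"id": 3, "amount": 30}]
    for row in merge_diff(old, new, "id"):
        print(row)  # id 1 deleted, id 2 changed, id 3 new

Synchronize After Merge consumes exactly those flags, so the whole delta load reduces to these two steps.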

Finally, chapter seven is all about how to better organize jobs and transformations efficiently, and chapter nine addresses the logging and scheduling of transformations.

To sum up, this video course is a good introduction to PDI and to using it to load a data mart. I find it very useful for beginners and worthwhile for intermediate users who want to go deeper into some concepts. The samples are well done and really help the user better understand the concepts explained. Just a side note: in the videos some of the samples are shown using an early version of PDI 5 and others using the previous version, PDI 4.4. Granting that the author was trying to show users the new version of PDI, I found this a bit confusing, mostly for beginner users.