Monday, July 12, 2010

Joining and appending output datasets with CompoundDataAccess

An interesting topic about CDA is the ability apply join and union constructs to our incoming dataset using a  CompoundDataAccess element. This definition is a sort of extension of the DataAccess element and through a type attribute gives us the ability to join or append two given DataAccess datasets outputs forming a new one.

How does CompoundDataAccess works internally

It is really interesting to have a look at the sources and see how the things really works:
  • when you want to join the two DataAccess outputs CDA builds a little Kettle transformations and uses the Merge Join step two do the required join operation
  • if you want to append the two DataAccess outputs CDA very easily appends by itself the two datasets following the order you specified in configuration file.

Remember that in either cases you can apply this operations considering only two datasets at the time for each CompoundDataAccess element definition.

CompoundDataAccess basics

To define a new CompoundDataAccess element you have to mandatory specify
  • an id attribute, to externally identify this DataAccess elemement
  • a type attribute that accepts two values union or join following the type of composition you're looking for.
Because CompoundDataAccess is only an extension of the DataAccess element, we can apply the Parameter, Columns, CalculatedColumns and Output elements following the same rules specified for DataAccess in my previous post here

Moreover, because the CompundDataAccess element works on resulting DataAccess datasets we have to preliminarily define the two needed DataAccess. It is not possible (but would be a nice to have and I can think about implementing it) use DataAccess elements defined in external CDA files.

Joining datasets output
To join two datasets we have to define a CompoundDataAccess element with the type attribute valued to the join value. The join type lets you execute a FULL OUTER JOIN upon the two input datasets. Required elements are:

  • Left element definition, to specify a dataset for the left part of the join,
  • Right element definition,  to specify a dataset for the right part of the join.
Anyone of the two elements accepts the following attributes:

  1. id, is valued to the id of the DataAccess element definition you are going to consider as the left/right dataset
  2. key, is a comma separated list of elements to be considered as the join key. Remember that the columns keys are specified giving their position in the DataAccess query and not the name.

Below I give an example to fully understand this concept. Suppose we defined the following two DataAccess definitions in our CDA file

   <DataAccess id="1" connection="1" type="sql" access="private" cache="true" cacheDuration="300">  
                <Name>Sql Query on SampleData - Jdbc</Name>  
         <Query>  
                     select o.YEAR_ID, o.STATUS as status, sum(o.TOTALPRICE) as totalprice from orderfact o  
                     group by o.YEAR_ID, o.STATUS  
                </Query>  
                .  
                .  
                .  
                .  
   </DataAccess>  
   <DataAccess id="2" connection="1" type="sql" access="public" cache="true" cacheDuration="5">  
                <Name>Sql Query on SampleData</Name>  
        <Query>  
                     select o.YEAR_ID, o.status, sum(o.TOTALPRICE * 3) as tripleprice from orderfact o  
                     where o.STATUS = ${status} and o.ORDERDATE &gt; ${orderDate}  
                     group by o.YEAR_ID, o.STATUS  
                     order by o.YEAR_ID DESC, o.STATUS  
                </Query>  
                .  
                .  
                .  
   </DataAccess>  

and suppose in the same cda file we define this CompoundDataAccess element.

      <CompoundDataAccess id="3" type="join">  
           <Left id="1" keys="0,1"/>  
           <Right id="2" keys="0,1"/>  
           <Columns>  
                <CalculatedColumn>  
                     <Name>PriceDiff</Name>  
                     <Formula>=[TRIPLEPRICE]-[TOTALPRICE]</Formula>  
                </CalculatedColumn>  
           </Columns>  
               <Parameters>  
       <Parameter name="status" type="String" default="Shipped"/>  
       <Parameter name="orderDate" type="Date" pattern="yyyy-MM-dd" default="2003-03-01"/>  
     </Parameters>  
           <Output indexes="0,1,2,5,6"/>  
      </CompoundDataAccess>  

As you can see the CompoundDataAccess element has the two keys, Left and Right, two indicate the two sides of the join. The left one uses the DataAccess id 1 (the first one above) and the second the DataAcccess id 2. Then you see the comma separated list that defines the positions of the keys columns in the related DataAccess query. In this case you can see, looking at the related DataAcces queries, that either the elements uses YEAR_ID and the STATUS columns

Appending datasets ouputs
To append two datasets we have to define a CompoundDataAccess element with the type attribute valued to the union value.The union type lets you execute append two DataAcces output datasets giving their order using two configuration elements:
  • Top, to specify the first dataset,
  • Bottom,  to specify the second dataset.

Anyone of the two elements accepts the following attributes:
  1. id, is valued to the id of the DataAccess element definition you are going to consider as the Top/Bottom dataset
Below a sample of configuration using a compound union. The DataAccess elements considered for this sample are the same defined in the previous sample for the join type.

   <CompoundDataAccess id="3" type="union">  
     <Top id="2"/>  
     <Bottom id="1"/>  
     <Parameters>  
       <Parameter name="year" type="Numeric" default="2004"/>  
     </Parameters>  
   </CompoundDataAccess>  

No comments:

Post a Comment