Tuesday, 16 September 2014

The Trials of Smooks

I'm hard to please, which explains why I rarely show appreciation for a tool. I easily get frustrated when a tool fails to meet the challenges it's meant to solve. Smooks is one of the few tools I appreciate. It's an invaluable transformation framework in the integrator's arsenal. On a project I was on, I threw all manner of challenges at Smooks [1], and one after another, Smooks overcame them without giving up a key requirement: maintaining a low memory overhead during transformation. A shoutout to Tom Fennelly and his team for bringing us such a fantastic tool.

Trial I

The initial challenge I brought to Smooks was to take a tilde-delimited CSV file and map its records to POJOs:

You can see the file has an unorthodox header in addition to a footer. Using Smooks's built-in CSV reader, I wrote a concise Smooks config that maps the records to POJOs:

What's happening under the covers, and in general, is that the reader pulls data from a source (e.g., java.io.InputStream) and produces a stream of SAX events. The reader I'm using above expects the source data to be structured as CSV and to consist of four columns. Let's make things more concrete. Reading from the products.csv file, the reader produces the following XML stream [2]:

Listening to the stream of SAX events is the visitor. A visitor listens to specific events from the stream to fire some kind of behaviour, typically transformation. With the singleBinding element in the csv-to-pojos.xml config, the CSV reader pre-configures a JavaBean visitor to listen for csv-record elements. On intercepting such an element, the JavaBean visitor instantiates an org.ossandme.Product object and binds its properties to the content of csv-record's child elements. You'll notice that I left Product's target properties unspecified in the config. The CSV reader assumes Product follows JavaBean conventions and that its properties are named the same as the defined CSV columns. Records disobeying the column definition are ignored. Consequently, I do not need to worry about the file's header and footer.
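To make the reader/visitor split more tangible, here is a minimal JDK-only sketch. It is not Smooks code and not the project's config. A plain SAX handler plays the role of the JavaBean visitor: it creates an object when a csv-record element starts and binds the child element content by name. The csv-set root, the code and name fields, and the use of a Map in place of org.ossandme.Product are assumptions made purely for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class RecordBindingSketch {

    /** Stand-in for a visitor: reacts to csv-record events and binds child content. */
    static class RecordHandler extends DefaultHandler {
        final List<Map<String, String>> records = new ArrayList<>();
        private Map<String, String> current;            // null while outside a csv-record
        private final StringBuilder text = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("csv-record".equals(qName)) {
                current = new LinkedHashMap<>();        // "instantiate" on the selected element
            }
            text.setLength(0);                          // start collecting this element's text
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("csv-record".equals(qName)) {
                records.add(current);                   // record fully bound
                current = null;
            } else if (current != null) {
                current.put(qName, text.toString().trim());  // bind child content by name
            }
        }
    }

    /** Parses the XML with the JDK SAX parser and returns the bound records. */
    static List<Map<String, String>> bind(String xml) {
        try {
            RecordHandler handler = new RecordHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.records;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical stream content; the field names are made up.
        String xml = "<csv-set>"
                + "<csv-record><code>A1</code><name>Widget</name></csv-record>"
                + "<csv-record><code>B2</code><name>Gadget</name></csv-record>"
                + "</csv-set>";
        System.out.println(bind(xml));
    }
}
```

The handler only ever holds the record currently being bound, which is the property that keeps memory flat in a streaming design. The sketch collects the results in a list purely so they can be printed.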

With the transformation configuration out of the way, I turned my attention to running the transformation on the CSV file from my Java code and processing the Product objects as they are instantiated and bound by Smooks:

Trial II

A more complex transformation task I gave to Smooks was to load file records, holding a variable number of columns, into a database. As in the previous task, this file had a header as well as a footer:

You'll observe in the sample CSV file that records could be one of three types as denoted by the first column: TH, TB or TF. The CSV reader, as it transforms and pushes records to the XML stream, can be customised such that it renames the csv-record holder to the record's primary column:

As we'll see later, the above config permits Smooks to distinguish between the different record types. Given the sample file transactions.csv, the reader I've configured produces the following stream:

UNMATCHED elements represent the file's header and footer. A CSV record having TH in the first field will trigger the reader to create a TH element holding the other record fields. The same logic goes for TB and TF.
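The naming rule can be sketched in a few lines of plain Java. This is only an illustration of the rule as described here, keyed solely on the first column. It is not the CSV reader's implementation, and the comma separator and sample lines are assumptions.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

class RecordTypeSketch {
    // Record types named in the text.
    private static final Set<String> TYPES = new HashSet<>(Arrays.asList("TH", "TB", "TF"));

    /** Returns the element name a line would be wrapped in, based on its first column. */
    static String elementNameFor(String line, String separator) {
        String first = line.split(Pattern.quote(separator), 2)[0].trim();
        return TYPES.contains(first) ? first : "UNMATCHED";
    }

    public static void main(String[] args) {
        // Hypothetical lines; the real file layout is not reproduced here.
        System.out.println(elementNameFor("TB,123,45.00", ","));
        System.out.println(elementNameFor("Transactions file", ","));
    }
}
```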

Database visitors load the records. However, since these visitors are limited to binding data from POJOs, I must first turn the XML-mapped records from the stream into said POJOs. The CSV reader doesn't know how to bind variable-field records to POJOs, so I configure the mapping myself:

Given what we've learnt about Smooks, we can deduce what's happening here. The JavaBean visitor on lines 10 to 17 has a selector (i.e., createOnElement) for the element TH. A selector is a quasi-XPath expression applied to XML elements as they come through the stream. On seeing TH, the visitor will:
  1. Instantiate a HashMap.

  2. Iterate through the TH fragment. If an element inside the fragment matches the selector set in a data attribute, then (a) a map entry is created, (b) bound to the element content, and (c) put in the map.

  3. Add the map to the Smooks bean context under the name set in beanId. The map overwrites any previous map in the context with the same ID. This makes sense since we want to prevent objects from accumulating in memory.
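
The three steps can be sketched with a plain map standing in for the bean context. This is an illustration of the overwrite behaviour only, not how Smooks implements its context. The bean ID and field names below are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class BeanContextSketch {
    // Stand-in for the bean context: bean ID -> the bean currently held under that ID.
    private final Map<String, Object> beans = new HashMap<>();

    /** Steps 1 to 3 for one fragment: new map, bind the fields, replace the previous bean. */
    Map<String, String> onFragment(String beanId, String[] fieldNames, String[] values) {
        Map<String, String> bean = new HashMap<>();               // step 1
        for (int i = 0; i < fieldNames.length && i < values.length; i++) {
            bean.put(fieldNames[i], values[i]);                   // step 2
        }
        beans.put(beanId, bean);                                  // step 3: overwrites same ID
        return bean;
    }

    Object get(String beanId) { return beans.get(beanId); }

    int size() { return beans.size(); }

    public static void main(String[] args) {
        BeanContextSketch context = new BeanContextSketch();
        String[] fields = {"account", "amount"};                  // hypothetical field names
        context.onFragment("body", fields, new String[] {"111", "10.00"});
        context.onFragment("body", fields, new String[] {"222", "20.00"});
        // Only the latest record is retained under the ID.
        System.out.println(context.size() + " " + context.get("body"));
    }
}
```

However many records stream through, the context holds at most one map per bean ID, so memory use does not grow with the size of the file.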

The database visitors reference the maps in the bean context:

The insert statements are bound to the map entry values and are executed after the element that the executeOnElement selector points to has been processed. The next step is to configure a datasource for the database visitors (lines 47-49):

Last but not least, the Java code to kick off the data load:

Trial III

The next challenge for Smooks makes the previous ones look like child's play. The goal: transform an XML stream to a CSV file that is eventually uploaded to an FTP server. The input:

The desired output:

Considering the CSV could be large in size, my requirement was for Smooks to write the transformed content to a PipedOutputStream. An FTP library would read from the PipedOutputStream's connected PipedInputStream, and write the streamed content to a file. To this end, I wrote the class running the transformation as follows:
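Independently of Smooks, the piped-stream arrangement itself can be sketched with the JDK alone. This is a minimal illustration and not the class from the project: the producer thread stands in for the transformation and the reading loop stands in for the FTP upload. All names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class PipedTransferSketch {

    /** Producer side: stands in for the transformation writing CSV lines. */
    static void produce(OutputStream out, int lines) throws IOException {
        for (int i = 1; i <= lines; i++) {
            out.write(("record-" + i + "\n").getBytes(StandardCharsets.UTF_8));
        }
    }

    /** Consumer side: stands in for the upload; here it just collects what it reads. */
    static String transfer(int lines) {
        try (PipedInputStream in = new PipedInputStream()) {
            PipedOutputStream out = new PipedOutputStream(in);
            // The writer must run on its own thread, otherwise the pipe's small
            // buffer fills up and the single thread blocks forever.
            Thread producer = new Thread(() -> {
                try (OutputStream o = out) {     // closing signals end-of-stream to the reader
                    produce(o, lines);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
            producer.start();

            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            byte[] buffer = new byte[1024];
            int n;
            while ((n = in.read(buffer)) != -1) {
                sink.write(buffer, 0, n);
            }
            producer.join();
            return new String(sink.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(transfer(3));
    }
}
```

Two details matter in this kind of setup: the writer and the reader need separate threads, and the writer has to close its end so the reader sees the end of the stream.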

My focus then turned to the XML-to-CSV mapping configuration. After deliberation, I reluctantly settled on the FreeMarker visitor for writing the CSV. As an alternative, I considered developing a visitor specialised for this type of transformation, but time constraints made this unfeasible. The FreeMarker visitor, like the database one, cannot read directly off the XML stream. It can, however, read from DOMs and POJOs. So I decided to use the DOM visitor to create DOMs from the record elements found within the input stream:

I then configured the FreeMarker visitor to apply the CSV template on seeing the element record in the stream:

Below is a simplified version of what I had in real life in account.ftl (note the last line of the template must be a newline):

An additional complexity I had to consider was the CSV's header and footer. Apart from being structured differently from the rest of the records, the header had to contain the current date, whereas the footer had to contain the total record count. What I did for the header was to bind the current date from my Java code to Smooks's bean context (lines 27-30 and 38):

The date is then referenced from the Smooks config (lines 9-12):

With respect to the above config, at the start of the XML stream, FreeMarker writes the header to the output stream (i.e., PipedOutputStream):

000000Card Extract   [current date]

<?TEMPLATE-SPLIT-PI?> is an embedded Smooks instruction that applies account.ftl to record elements after the header.

Adding the record count to the footer is just a matter of configuring the Calculator visitor to maintain a counter in the bean context and referencing that counter from the template:
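Stripped of Smooks and FreeMarker, the header/record/footer flow boils down to the plain-Java sketch below. It only illustrates the idea of counting records while streaming and emitting the count at the end. The yyyyMMdd date format and the footer layout are assumptions, and the record lines are made up.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Arrays;
import java.util.Iterator;

class HeaderFooterSketch {

    /**
     * Renders header, records and footer. A real run would write each line to the
     * output stream as it goes; a StringBuilder keeps the sketch short.
     */
    static String render(Iterator<String> records, LocalDate date) {
        StringBuilder out = new StringBuilder();
        // Header text as shown in the post; the date format is an assumption.
        out.append("000000Card Extract   ")
           .append(date.format(DateTimeFormatter.BASIC_ISO_DATE))
           .append('\n');

        int count = 0;                           // the counter kept while streaming
        while (records.hasNext()) {
            out.append(records.next()).append('\n');
            count++;
        }

        // Hypothetical footer layout: only the presence of the count comes from the post.
        out.append("999999Record Count   ").append(count).append('\n');
        return out.toString();
    }

    public static void main(String[] args) {
        Iterator<String> records = Arrays.asList("a,b", "c,d").iterator();
        System.out.print(render(records, LocalDate.of(2014, 9, 16)));
    }
}
```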

Trial IV

The final challenge Smooks faced was to read from a java.util.Iterator of maps and, like the previous task, write the transformed output to a stream in CSV format. Unlike the InputStream sources of the other tasks, an iterator of maps has no built-in Smooks reader capable of producing a properly structured XML stream from it. So I was left with writing my own reader:
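The core of such a reader is a loop that walks the iterator and fires SAX events at a content handler. The sketch below shows that loop using only the JDK SAX interfaces. It is not the reader from the project and it implements none of the Smooks reader interfaces. The records and record element names are assumptions, and the map keys are assumed to be valid XML element names.

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

class IteratorEventsSketch {

    /** Walks the iterator and emits one record element per map, one child per entry. */
    static void emit(Iterator<Map<String, String>> rows, ContentHandler handler) throws SAXException {
        Attributes none = new AttributesImpl();
        handler.startDocument();
        handler.startElement("", "records", "records", none);
        while (rows.hasNext()) {
            handler.startElement("", "record", "record", none);
            for (Map.Entry<String, String> entry : rows.next().entrySet()) {
                String name = entry.getKey();          // assumed to be a valid element name
                char[] value = entry.getValue().toCharArray();
                handler.startElement("", name, name, none);
                handler.characters(value, 0, value.length);
                handler.endElement("", name, name);
            }
            handler.endElement("", "record", "record");
        }
        handler.endElement("", "records", "records");
        handler.endDocument();
    }

    /** Debug helper: renders the events as text. It does not escape special characters. */
    static String toXml(Iterator<Map<String, String>> rows) {
        StringBuilder xml = new StringBuilder();
        DefaultHandler printer = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                xml.append('<').append(qName).append('>');
            }
            @Override
            public void characters(char[] ch, int start, int length) {
                xml.append(ch, start, length);
            }
            @Override
            public void endElement(String uri, String local, String qName) {
                xml.append("</").append(qName).append('>');
            }
        };
        try {
            emit(rows, printer);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return xml.toString();
    }

    public static void main(String[] args) {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("id", "1");                            // hypothetical keys and values
        row.put("name", "Widget");
        System.out.println(toXml(Collections.singletonList(row).iterator()));
    }
}
```

Because the loop pulls one map at a time from the iterator, nothing beyond the current row needs to be held in memory.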

The custom reader is hooked into Smooks as follows (line 5):

Finally, passing the iterator to Smooks for transformation consists of setting a JavaSource parameter, holding the iterator, on filterSource(...) (line 27):

1: The Smooks version I used was 1.5.2.
2: You might be wondering how I know for certain the XML document shown is the one actually produced by Smooks. I know because of Smooks's HtmlReportGenerator class.

1 comment:

  1. Hi - Thanks for the great post. I learnt some new Smooks tricks reading through the article. Do you know if/how it's possible to get Smooks to execute validation on a CSV-to-POJO transformation such that only the valid records lead to beans being created? I know that you can fail on error. However, I just want Smooks to skip bean creation for any CSV record that is invalid. Currently Smooks creates incomplete POJOs in the final list for any failing record.