Cloudera Kite Morphlines Getting Started Example

Kite Morphlines development was initiated as part of the Cloudera Search project. It was later moved to the Kite SDK to make it available to a wider range of users and to invite contributions from the active CDK community. The idea behind Kite Morphlines is to streamline ETL processing, so that the time and effort involved in the extraction, transformation and loading of huge volumes of data into Apache Solr, HBase, HDFS and enterprise data warehouses can be reduced.

Most of the time you will observe that data engineers prefer conventional ETL tools like Kettle, Ab Initio, Talend, etc. to perform these ETL operations. If a Java developer is asked to work with these tools, it may take some time for them to become familiar with them.

This is where Kite Morphlines comes to the rescue: it is built with Java at its core, and new commands are written in Java.

A morphline is a sequence of commands, defined in a configuration file, that performs an ETL-related task; at runtime, all these commands are held in an in-memory container. It can also be considered an abstraction over the Kite Morphlines Java API: instead of interacting with the Java API directly, you do it through a configuration file. This makes the development of complex ETL processes on HDFS much faster and less cumbersome.

The thing I like best about Kite Morphlines is re-usability: once you have written a command, it can be used any number of times and in any number of applications. By now you might be wondering what a “command” actually is, since I have used the word so many times.

Commands are the building blocks of Kite Morphlines, and a morphline can have any number of commands chained together. A command is basically a simple Java class that contains the business logic for the activity you want to perform in that step. For example, this logic can be reading data line by line from a text file, reading JSON data, validating a record against some input rules, or loading data into a Hive table; the possibilities are countless.
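To make the idea of a command concrete, here is a minimal, self-contained sketch in plain Java. It deliberately does not use the Kite API: `ToyCommand`, `TrimCommand`, `RejectEmptyCommand` and `CollectCommand` are hypothetical names invented for illustration, and a record is modeled as a plain `Map`. It only shows the pattern of small steps that each do one job and then hand the record to the next step.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: none of these types belong to the Kite Morphlines API.
final class ToyMorphline {

    // One step of the pipeline. Returns true if this step and all later steps accepted the record.
    interface ToyCommand {
        boolean process(Map<String, Object> record);
    }

    // Trims the "message" field (if it is a String) and forwards the record to the next command.
    static final class TrimCommand implements ToyCommand {
        private final ToyCommand child;

        TrimCommand(ToyCommand child) {
            this.child = child;
        }

        @Override
        public boolean process(Map<String, Object> record) {
            Object message = record.get("message");
            if (message instanceof String) {
                record.put("message", ((String) message).trim());
            }
            return child.process(record);
        }
    }

    // Rejects records whose "message" field is missing, not a String, or empty.
    static final class RejectEmptyCommand implements ToyCommand {
        private final ToyCommand child;

        RejectEmptyCommand(ToyCommand child) {
            this.child = child;
        }

        @Override
        public boolean process(Map<String, Object> record) {
            Object message = record.get("message");
            if (!(message instanceof String) || ((String) message).isEmpty()) {
                return false; // stop here; later commands never see this record
            }
            return child.process(record);
        }
    }

    // Last step of the chain: keeps every accepted record in memory.
    static final class CollectCommand implements ToyCommand {
        final List<Map<String, Object>> accepted = new ArrayList<>();

        @Override
        public boolean process(Map<String, Object> record) {
            accepted.add(record);
            return true;
        }
    }

    // Helper that builds a record with a single "message" field.
    static Map<String, Object> recordOf(String message) {
        Map<String, Object> record = new HashMap<>();
        record.put("message", message);
        return record;
    }

    public static void main(String[] args) {
        CollectCommand sink = new CollectCommand();
        // Commands are wired from the last step to the first one.
        ToyCommand chain = new TrimCommand(new RejectEmptyCommand(sink));

        System.out.println(chain.process(recordOf("  hello  "))); // record reaches the sink
        System.out.println(chain.process(recordOf("   ")));       // record is rejected after trimming
        System.out.println(sink.accepted);
    }
}
```

Because each step only knows about the next one, the same step can be reused in different chains, which is the re-usability described above. In a real morphline the wiring is expressed in the configuration file instead of in Java code.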

As the Kite Morphlines API evolves, you will find commands for most ETL use cases in the Hadoop ecosystem. Moreover, if some of the commands required for your morphline are missing, you can always develop one yourself.

Kite Morphlines also provides a robust framework that makes the development of commands faster and hassle-free. Once you are familiar with the API, you can write any complex command and use it in your morphline. And if you feel that a command you have developed can help others, you can always contribute it back to Cloudera.

Below is a simple example of a morphline in which we read data from a text file, validate the records (using Drools) and load all the accepted records into a Hive table (using the Kite Data Module).

[Figure: morphline]

And here is the corresponding configuration file.

morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**", "com.techidiocy.custom.commands.**"]
    commands : [
      {
        readLine {
          commentPrefix : "#"
          charset : ASCII
        }
      }
      {
        dataValidator {
          inputSchemaLocation : "path/to/the/input/schema/file"
          rulesSheetLocation : "/path/to/the/business/rules/sheet"
        }
      }
      {
        hiveDataLoader {
          tableName : "Person"
          format : "Parquet"
        }
      }
    ]
  }
]

The example above is one simple use case, in which three commands are invoked in the morphline. A morphline can contain any number of commands, which lets it express complex ETL processing in a simple, declarative way.

Note: The dataValidator and hiveDataLoader commands don’t exist in the Kite Morphlines API (readLine is a built-in Kite command); they are custom commands that I have written, and in the next post we will see how to write a custom command. Here, for validation of records, I have integrated Drools with Kite Morphlines, and I have used the Kite Data Module for loading the records into Hive in Parquet format.
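As an illustration of what the readLine step in the configuration above asks for, here is a small plain-Java sketch. It is not Kite's implementation. It only mirrors my reading of the two options used: `charset : ASCII` decodes the input as ASCII, and `commentPrefix : "#"` skips every line that starts with `#`. The class name `ReadLineSketch` and the sample data are made up for this example.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Kite: approximates the effect of the readLine options above.
final class ReadLineSketch {

    // Returns one entry per input line, skipping lines that start with commentPrefix.
    static List<String> readLines(InputStream in, String commentPrefix) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.US_ASCII))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // An empty prefix means "no comment handling".
                if (!commentPrefix.isEmpty() && line.startsWith(commentPrefix)) {
                    continue;
                }
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Made-up sample input: one comment line followed by two data lines.
        byte[] data = "# name,age\nJohn,30\nJane,25\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(readLines(new ByteArrayInputStream(data), "#"));
    }
}
```

In the morphline, each line that survives this step would become one record that is passed on to the next command, here the validator.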

Suggestions, corrections and questions are most welcome.

Disclaimer : All the logos and images used above belong to their respective owners.

Saurabh Jain

A developer working on enterprise applications, distributed systems, Hadoop and big data. This blog is about my experience working mostly on Java technologies, NoSQL, Git, Maven and the Hadoop ecosystem.