Anatomy of a Configuration File with Example - Cloudera Kite Morphlines

At the heart of Cloudera Kite Morphlines is the configuration file, which contains all the commands that you want to execute as part of your ETL process. In the last post we saw the structure of a configuration file and how commands are specified in it.

In Cloudera Kite Morphlines every configuration file ends with the extension .conf, which may be new to you and is fairly specific to Morphlines. In this post we are going to dissect the configuration file that we saw in the last post. We will look at the flow of execution between commands: how control is transferred from one command to the next, how input is passed from one command to the next, and finally how the output is generated by the last command in the morphline.


Our configuration file, or morphline, contains 3 commands. Below is a brief description of each command.

  1. readLine : This command ships with the Cloudera Kite Morphlines bundle. It reads a text file line by line and passes each line to the next command as its input.
  2. validateRecord : This is a custom command that I wrote to validate the record passed by the previous command. It performs some data validation checks and some small transformations. For validation I used JBoss Drools, a template-based rule engine. The output of this command is a Java object containing a map that stores every field present in the record as a key-value pair.
  3. loadDataIntoHive : This is a custom command that I wrote to load the data into a Hive table in Parquet format, using the Kite Data Hive module. As this is the last command in the chain, it loads the data into Hive and returns its result (true / false) to the previous command, which in turn returns it to the command before it, and so on.
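
To make the "map of fields" output of validateRecord more concrete, here is a minimal sketch in plain Java. It is not the actual validateRecord code: the class name, the comma delimiter, and the list of field names standing in for the input schema are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; not part of Kite and not the author's validateRecord command.
class FieldMapSketch {

    // Splits one comma-delimited line and pairs each value with its field name, in schema order.
    static Map<String, String> toFieldMap(String line, List<String> fieldNames) {
        // The -1 limit keeps trailing empty values instead of silently dropping them.
        String[] values = line.split(",", -1);
        if (values.length != fieldNames.size()) {
            throw new IllegalArgumentException(
                "Expected " + fieldNames.size() + " fields but found " + values.length);
        }
        Map<String, String> fields = new LinkedHashMap<>();
        for (int i = 0; i < values.length; i++) {
            fields.put(fieldNames.get(i), values[i].trim());
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, String> fields =
            toFieldMap("42, Alice, NY", Arrays.asList("id", "name", "state"));
        System.out.println(fields); // prints {id=42, name=Alice, state=NY}
    }
}
```

A LinkedHashMap is used here so that the fields keep the order of the schema, which makes the record easier to inspect in a debugger.
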

morphlines : [{
    id : morphline1
    # Packages that are scanned for command implementations
    importCommands : ["org.kitesdk.**", "com.techidiocy.custom.commands.**"]
    # Commands are executed in the order in which they are listed
    commands : [
        {
            readDataFromTextFile {
                commentPrefix : "#"
                charset : "ASCII"
            }
        }
        {
            validateRecords {
                inputSchemaLocation : "path/to/the/input/schema/file"
                rulesSheetLocation : "/path/to/the/business/rules/sheet"
            }
        }
        {
            loadDataInHive {
                tableName : "SomeTableName"
                format : "Parquet"
            }
        }
    ]
}]
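
The commentPrefix and charset settings of the first command can be illustrated with a small stand-alone sketch. This is a simplified stand-in written in plain Java, not Kite's own implementation; the class and method names are made up for this example.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for illustration; this is not Kite's readLine implementation.
class ReadLineSketch {

    // Decodes the stream with the given charset and returns every line that is not a comment.
    static List<String> readLines(InputStream in, String charset, String commentPrefix) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, Charset.forName(charset)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!commentPrefix.isEmpty() && line.startsWith(commentPrefix)) {
                    continue; // comment line: skip it
                }
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "# header comment\nfirst record\nsecond record\n"
            .getBytes(StandardCharsets.US_ASCII);
        System.out.println(readLines(new ByteArrayInputStream(data), "ASCII", "#"));
        // prints [first record, second record]
    }
}
```

In the real morphline each line is handed to the next command as it is read, rather than being collected into a list first.
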

Below is the main class that triggers this configuration file. It needs two inputs: your configuration file (in our case loadHiveData.conf) and an input text file. You will also notice that we set the input stream as the input for our first command, i.e. readLine.

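import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.kitesdk.morphline.api.Command;
import org.kitesdk.morphline.api.MorphlineContext;
import org.kitesdk.morphline.api.Record;
import org.kitesdk.morphline.base.Compiler;
import org.kitesdk.morphline.base.Fields;

public class HiveDataLoaderTest {
    public static void main(String... args) throws IOException {
        // args[0] contains the location of your conf file
        File configFile = new File(args[0]);
        MorphlineContext context = new MorphlineContext.Builder().build();
        // Compile the loadHiveData.conf file on the fly
        Command morphline = new Compiler().compile(configFile, null, context, null);

        // args[1] contains the location of the input text file;
        // try-with-resources closes the stream even if processing fails
        try (InputStream in = new FileInputStream(new File(args[1]))) {
            Record record = new Record();
            // Set the stream as the input for the readLine command
            record.put(Fields.ATTACHMENT_BODY, in);
            // Invokes the Pipe command, which passes control to the readLine command
            morphline.process(record);
        }
    }
}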

I put a breakpoint on the line where we invoke process() on the command and took a snapshot of that moment. In the snapshot below you can see that the realChild field holds the actual implementation class of the next command to be executed, and the name property (green rectangle) holds the command name as given in the configuration file.

[Snapshot: Cloudera Kite Morphlines]

When we step into the process() call, you will see that it invokes the process() method of ReadLineBuilder.java, which is the actual implementation of the readLine command. This process() method reads the first line of the text file, sets it as the input for validateRecord, and passes control to it by invoking process() on its realChild, which in this case is ValidateRecord.java (the implementation of validateRecord). The important thing to pay attention to here is the Connector command, which acts as a bridge between two commands (readLine -> Connector -> validateRecord -> Connector -> loadData).

In the snapshot below you can see the complete flow of control for our morphline (i.e. loadHiveData.conf). The red rectangles mark the 3 commands that are specified in the configuration file, and the green rectangles mark the Connector parts. You can also see that everything starts with the Pipe command, marked by the orange rectangle.

[Snapshot: Cloudera Kite Morphlines]
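
The flow described above (Pipe first, then each command with a Connector in between) can be mimicked with a toy model in plain Java. The types below are illustrative only and are not Kite's classes; the sketch simply records the order in which each link of the chain is visited.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of the call chain; these types are illustrative and are not Kite's classes.
class FlowSketch {

    interface Command {
        boolean process(String record);
    }

    // Adds its own name to the trace, then hands the record to the next command.
    static final class Step implements Command {
        private final String name;
        private final Command child;
        private final List<String> trace;

        Step(String name, Command child, List<String> trace) {
            this.name = name;
            this.child = child;
            this.trace = trace;
        }

        @Override
        public boolean process(String record) {
            trace.add(name);
            // The last step has no child, so it reports success itself.
            return child == null || child.process(record);
        }
    }

    // Wires Pipe -> command -> Connector -> command ... in the order given.
    static Command wire(List<String> commandNames, List<String> trace) {
        Command next = null;
        // Build back to front so that each step already knows its child.
        for (int i = commandNames.size() - 1; i >= 0; i--) {
            next = new Step(commandNames.get(i), next, trace);
            if (i > 0) {
                next = new Step("Connector", next, trace);
            }
        }
        return new Step("Pipe", next, trace);
    }

    public static void main(String[] args) {
        List<String> trace = new ArrayList<>();
        Command morphline =
            wire(Arrays.asList("readLine", "validateRecord", "loadData"), trace);
        boolean success = morphline.process("one line of input");
        System.out.println(String.join(" -> ", trace) + " : " + success);
        // prints Pipe -> readLine -> Connector -> validateRecord -> Connector -> loadData : true
    }
}
```

The boolean returned by the last step travels back through every caller, which is how the result of the final command reaches the code that invoked the morphline.
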

Two important methods are common to all commands:

boolean process(Record record);

which comes from the org.kitesdk.morphline.api.Command interface, and

boolean doProcess(Record record);

which comes from the org.kitesdk.morphline.base.AbstractCommand class.

The important thing to note here is that whenever you write a custom command, you will always override the doProcess(Record record) method in your implementation. This is where all of your actual business logic goes. You also have to make sure that, once your business logic has run, this method passes control to the next command.

In the next post we are going to see how to write a custom command and the important things to keep in mind while writing one.
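
As a rough illustration of the doProcess() contract, here is a simplified model in plain Java. BaseCommand, TrimName, and getChild() in this sketch are stand-ins written for this example and should not be read as Kite's real classes; the point is only that the custom logic lives in doProcess() and ends by handing the record to the next command.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified model of the process()/doProcess() split; not Kite's actual classes.
class DoProcessSketch {

    interface Command {
        boolean process(Map<String, Object> record);
    }

    // Base class: process() is the entry point, doProcess() holds the command's own logic.
    abstract static class BaseCommand implements Command {
        private final Command child;

        BaseCommand(Command child) {
            this.child = child;
        }

        @Override
        public final boolean process(Map<String, Object> record) {
            return doProcess(record);
        }

        protected abstract boolean doProcess(Map<String, Object> record);

        protected Command getChild() {
            return child;
        }
    }

    // Hypothetical custom command: trims a field, then hands the record to the next command.
    static final class TrimName extends BaseCommand {
        TrimName(Command child) {
            super(child);
        }

        @Override
        protected boolean doProcess(Map<String, Object> record) {
            Object name = record.get("name");
            if (name == null) {
                return false; // validation failed: stop here and report failure
            }
            record.put("name", name.toString().trim());
            // Pass control on; without this call the chain would end at this command.
            return getChild().process(record);
        }
    }

    public static void main(String[] args) {
        List<Map<String, Object>> received = new ArrayList<>();
        Command last = record -> received.add(record); // final command just collects records
        Command first = new TrimName(last);

        Map<String, Object> record = new HashMap<>();
        record.put("name", "  Alice ");
        System.out.println(first.process(record) + " " + received);
        // prints true [{name=Alice}]
    }
}
```

Returning the child's result, rather than a hard-coded true, is what lets a failure in a later command be seen by the earlier ones.
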

Suggestions, corrections, and questions are most welcome.

Disclaimer : All the logos and images used above belong to their respective owners.

Saurabh Jain

A developer working on enterprise applications, distributed systems, Hadoop, and Big Data. This blog is about my experience working mostly on Java technologies, NoSQL, git, Maven, and the Hadoop ecosystem.