While evaluating Cloudera Kite Morphlines, I came across this exception while reading data from a Hive table:
java.io.IOException: can not read class parquet.format.PageHeader: null
Before going further, let me give some background on what I am trying to do here.
I am building an application where an external client uploads input XML files and their corresponding XSDs. Once these files are uploaded, a job runs that unmarshals the XML files into Java objects. These Java objects are then passed to the Drools framework, where validation and minor transformations are performed on the data. During this process the records are flattened into two files: one for all accepted records and one for rejected records.
The Java objects corresponding to the accepted records are then sent to the next morphline command (generateAvroSchemaFile). This command generates the Avro schema file on the fly by reading the Java object and places the schema file on HDFS (why HDFS? There is a reason for that, which I will explain in the next post). Control then passes to the last command (loadDataInHive), which creates a Hive table using the Avro schema generated by the previous command and loads the data using the Cloudera Kite Data Module.
Everything went fine here. The table was created successfully and the data was loaded without any issue.
The real problem started when I opened the Hive CLI and executed a simple Hive query, e.g.
select * from open_trans;
As soon as I hit enter, I got the following exception:
java.io.IOException: can not read class parquet.format.PageHeader: null
After a little research on the internet, I learned that this was a bug in the parquet-avro API that was fixed in parquet-avro 1.2.9. When I checked the version of parquet-avro in my pom file, it was 1.2.5, pulled in as a transitive dependency of kite-morphlines-all, for which I was using version 0.13.0.
Before fix:
<dependency>
  <groupId>org.kitesdk</groupId>
  <artifactId>kite-morphlines-all</artifactId>
  <version>0.13.0</version>
  <type>pom</type>
</dependency>
After fix:
<dependency>
  <groupId>org.kitesdk</groupId>
  <artifactId>kite-morphlines-all</artifactId>
  <version>0.14.0</version>
  <type>pom</type>
</dependency>
It was my mistake not to use the latest stable release of Cloudera Kite Morphlines, i.e. 0.14.0. As soon as I updated the version to 0.14.0, Maven pulled in parquet-avro 1.4.1, which includes the fix for the exception mentioned above.
Again, please make sure that your application uses parquet-avro 1.2.9 or later.
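If upgrading kite-morphlines-all is not possible right away, one possible workaround is to pin the transitive parquet-avro version directly in the pom. The fragment below is an untested sketch, not part of the original fix. It assumes the artifact is published under the com.twitter group ID (as the pre-Apache Parquet releases were) and that the other Parquet artifacts pulled in by Kite remain compatible with the pinned version, which may mean pinning them to the same version as well.

```xml
<!-- Sketch only: forces a fixed parquet-avro version while staying on an older Kite release -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <!-- Assumption: pre-Apache Parquet artifacts use the com.twitter group ID -->
      <groupId>com.twitter</groupId>
      <artifactId>parquet-avro</artifactId>
      <!-- 1.2.9 is the first version said to contain the fix; 1.4.1 is what Kite 0.14.0 resolved above -->
      <version>1.4.1</version>
    </dependency>
  </dependencies>
</dependencyManagement>
```

Running mvn dependency:tree -Dincludes=com.twitter:parquet-avro afterwards should show which parquet-avro version Maven actually resolves.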
Suggestions, corrections, and questions are most welcome.
Saurabh Jain