java.io.IOException: can not read class parquet.format.PageHeader: null – Hive

While evaluating Cloudera Kite Morphlines, I came across the following exception while reading data from a Hive table:

java.io.IOException: can not read class parquet.format.PageHeader: null

Before going further, let me give some background on what I am trying to do here.
I am building an application in which external clients upload input XML files and their corresponding XSDs. Once these files are uploaded, a job unmarshals the XML files into Java objects, which are then passed to the Drools framework for validation and minor transformations. During this step the records are flattened out into two files: one for all the accepted records and one for the rejected records.

The Java objects for the accepted records are then passed on to the next morphline command (generateAvroSchemaFile). This command generates the Avro schema file on the fly by reading the Java object and places it on HDFS (why HDFS? There is a reason for that, which I will explain in the next post). Control then passes to the last command (loadDataInHive), which creates a Hive table from the Avro schema generated by the previous command and loads the data using the Cloudera Kite Data Module.

Everything went fine up to this point. The table was created successfully and the data was loaded without any issue.
The real problem started when I opened the Hive CLI and executed a simple Hive query, for example:

select * from open_trans;

As soon as I hit Enter, I got the following exception:
java.io.IOException: can not read class parquet.format.PageHeader: null

After a little research online, I learned that this was a bug in the parquet-avro API that was fixed in parquet-avro 1.2.9. When I checked my pom file, the parquet-avro version was 1.2.5. It was being pulled in as a transitive dependency of kite-morphlines-all, for which I was using version 0.13.0.

Before fix:

   
<dependency>
    <groupId>org.kitesdk</groupId>
    <artifactId>kite-morphlines-all</artifactId>
    <version>0.13.0</version> 
    <type>pom</type>
</dependency>

After fix:

<dependency>
    <groupId>org.kitesdk</groupId>
    <artifactId>kite-morphlines-all</artifactId>
    <version>0.14.0</version> 
    <type>pom</type>
</dependency>

This was an oversight on my part: I was not using the latest stable release of Cloudera Kite Morphlines, i.e. 0.14.0. As soon as I updated the version to 0.14.0, Maven downloaded parquet-avro 1.4.1, which includes the fix for the exception above.

To repeat: please make sure that you are using parquet-avro 1.2.9 or later in your application.
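
If upgrading kite-morphlines-all is not possible right away, one alternative is to pin the transitive parquet-avro version yourself through Maven's dependencyManagement section. The fragment below is only a sketch and I have not tested it against Kite 0.13.0. The groupId com.twitter is an assumption, so check the output of mvn dependency:tree for the coordinates your build actually resolves. Version 1.2.9 is simply the minimum version mentioned above. The other Parquet modules in your build may need to be aligned to the same version.

```xml
<!-- Sketch only: forces the transitive parquet-avro dependency to a fixed version.
     groupId com.twitter is an assumption; confirm it with mvn dependency:tree. -->
<dependencyManagement>
    <dependencies>
        <dependency>
            <groupId>com.twitter</groupId>
            <artifactId>parquet-avro</artifactId>
            <!-- 1.2.9 is the first version reported to contain the fix -->
            <version>1.2.9</version>
        </dependency>
    </dependencies>
</dependencyManagement>
```

After adding this, running mvn dependency:tree again should show the pinned version instead of 1.2.5.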

Suggestions, corrections, and questions are most welcome.

Saurabh Jain

A developer working on enterprise applications, distributed systems, Hadoop, and big data. This blog is about my experience working mostly with Java technologies, NoSQL, Git, Maven, and the Hadoop ecosystem.