Often during development and in production data need to be loaded into HBase tables. This can be for testing application code or migrating data from existing database among many other scenarios. One obvious option is to read data from a source and use HBase put client API to write data into tables. This works fine for small amount of data for unit testing or PoC. In order to load data of large size running into GBs or TBs, using put to write data to HBase tables will be time consuming if the source data is already available. In order to mitigate this, HBase provides an option to create hfiles which are HBase specific file formats used to store table data in the underlying filesystem and load them into HBase tables. For HDFS, these files can be created using a map reduce job and the following are the high level steps.
- Copy the source data in HDFS using tools like distcp
- Define the target table in HBase using HBase shell or programatically using HBase client admin APIs
- Create and run a map-reduce job to create HFiles for the source data on HDFS
- Load the HFiles into HBase using org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles program shipped with HBase The following code example shows how to go about with the creation of the map reduce job to generate the HFiles.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
The driver program takes in three parameters table name, HDFS directory where the source data is stored, the HDFS output directory where HFiles need to be created for loading into HBase It sets the out format to HBase org.apache.hadoop.hbase.client.Put which represents a single row in a HBase table The input format is set Text to read source data from a text file In the configuration object, the only parameter which need to be set iis the ZooKeeper (ZK) quorum and the value should be set to the ZK quorum corresponding to the HBase cluster on which the target table is defined No reducers are required to be set to create HFiles using map reduce The following is the code snippet for the HFileMapper class used by the Driver program
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Note that this is a dummy mapper in which the key and values are generated dynamically in the code. Based on what the source data source file stores and what need to be stored in the HBase table, the mapper need to be modified. The key aspect to note is how the Put object is created. Once the driver and mapper code is compiled, packaged in a Java jar file (e.g. happy-hbase-sample.jar) and made available on all the nodes in the HBase/HDFS cluster, the HFiles can be generated by running the map-reduce job on the cluster. run mapred job
When the map reduce job completes, it creates number of files in the output directory on HDFS and it can be used to load data into target HBase table and in this case healthyTable