HBase : Data Load

During development and in production, data often needs to be loaded into HBase tables, for example to test application code or to migrate data from an existing database. One obvious option is to read the data from the source and write it into the tables with the HBase put client API. This works fine for small amounts of data, such as unit tests or a PoC. For data sets running into GBs or TBs, writing every row with put is time consuming, especially when the source data is already available as files. To mitigate this, HBase provides an option to create HFiles, the HBase-specific file format used to store table data in the underlying filesystem, and load them directly into HBase tables. For HDFS, these files can be created using a map-reduce job. The following are the high-level steps.

Copy the source data into HDFS using tools like distcp
Define the target table in HBase using the HBase shell, or programmatically using the HBase client admin APIs
Create and run a map-reduce job to create HFiles for the source data on HDFS
Load the HFiles into HBase using the org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles program shipped with HBase

The following code example shows how to create the map-reduce job that generates the HFiles.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Driver {
  // Arguments: <table name> <HDFS input directory> <HDFS output directory for HFiles>
  public static void main(String[] args) throws Exception {
      // Start from the Hadoop/HBase defaults instead of an emptied Configuration
      Configuration conf = HBaseConfiguration.create();
      // ZooKeeper quorum of the HBase cluster that hosts the target table
      conf.set("hbase.zookeeper.quorum", "zk1:2181,zk2:2181,zk3:2181");
      Job job = new Job(conf, "HBase Bulk Import Example");
      job.setJarByClass(HFileMapper.class);
      job.setMapperClass(HFileMapper.class);
      job.setMapOutputKeyClass(ImmutableBytesWritable.class);
      job.setMapOutputValueClass(Put.class);
      job.setInputFormatClass(TextInputFormat.class);
      // Configures the reducer, partitioner and output format from the table's regions
      HTable hTable = new HTable(conf, args[0]);
      HFileOutputFormat.configureIncrementalLoad(job, hTable);
      FileInputFormat.addInputPath(job, new Path(args[1]));
      FileOutputFormat.setOutputPath(job, new Path(args[2]));
      // Report job failure to the caller through the exit code
      System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The driver program takes three parameters: the table name, the HDFS directory where the source data is stored, and the HDFS output directory where the HFiles are to be created for loading into HBase. It sets the map output value class to org.apache.hadoop.hbase.client.Put, which represents the changes to a single row in an HBase table. The input format is set to TextInputFormat to read the source data from text files. In the configuration object, the only parameter that needs to be set is the ZooKeeper (ZK) quorum, and its value should be the ZK quorum of the HBase cluster on which the target table is defined. No reducer needs to be set explicitly to create HFiles using map-reduce: HFileOutputFormat.configureIncrementalLoad configures the reducer, the partitioner and the number of reduce tasks based on the regions of the target table. The following is the code for the HFileMapper class used by the Driver program.

import java.io.IOException;
import java.util.Random;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class HFileMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
  @Override
  protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
      // Dummy row key; a real mapper would derive it from the input line in value
      String rowkey = new Random().nextInt() + "";
      Put put = new Put(rowkey.getBytes());
      // Constants is not shown here; COL_FAMILY is expected to hold the column family name as a byte[]
      put.add(Constants.COL_FAMILY, "col".getBytes(), "v".getBytes());
      // The output key is the row key; the output value is the Put for that row
      ImmutableBytesWritable hkey = new ImmutableBytesWritable(rowkey.getBytes());
      context.write(hkey, put);
  }
}
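In a real job, the row key and cell value would come from the input line rather than from a random number. The following is a minimal sketch of that parsing step in plain Java with no HBase dependency. The class names LineParser and Cell, and the tab-separated "rowkey, value" layout, are assumptions made for illustration only; they are not part of HBase or of the code above.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: splits one tab-separated input line into the bytes a Put needs. */
public final class LineParser {

    /** Row key and cell value as raw bytes. */
    public static final class Cell {
        public final byte[] rowKey;
        public final byte[] value;

        Cell(byte[] rowKey, byte[] value) {
            this.rowKey = rowKey;
            this.value = value;
        }
    }

    private LineParser() {
    }

    /**
     * Parses "rowkey<TAB>value" (an assumed layout).
     * Returns null for malformed lines so that the caller can skip them.
     */
    public static Cell parse(String line) {
        if (line == null) {
            return null;
        }
        // The -1 limit keeps trailing empty fields, so "key<TAB>" yields an empty value
        String[] fields = line.split("\t", -1);
        if (fields.length != 2 || fields[0].isEmpty()) {
            return null;
        }
        // An explicit charset avoids depending on the platform default encoding
        return new Cell(fields[0].getBytes(StandardCharsets.UTF_8),
                        fields[1].getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        Cell cell = parse("row-001\thello");
        System.out.println(new String(cell.rowKey, StandardCharsets.UTF_8)
                + " -> " + new String(cell.value, StandardCharsets.UTF_8));
        // A line without a tab is rejected
        System.out.println(parse("no-tab-here") == null);
    }
}
```

Inside map, the result of LineParser.parse(value.toString()) could then supply cell.rowKey to new Put(...) and to ImmutableBytesWritable, and cell.value to put.add(...), with malformed lines skipped instead of written.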

Note that HFileMapper is a dummy mapper in which the keys and values are generated in the code. The mapper needs to be modified based on what the source file stores and what needs to be stored in the HBase table. The key aspect to note is how the Put object is created. Once the driver and mapper code is compiled, packaged in a Java jar file (e.g. happy-hbase-sample.jar) and made available on all the nodes in the HBase/HDFS cluster, the HFiles can be generated by running the map-reduce job on the cluster as follows.

hadoop jar happy-hbase-sample.jar com.happy.hbase.sample.Driver healthyTable /user/happy/data/input /user/hbase/healthytable/output

When the map-reduce job completes, it creates a number of files in the output directory on HDFS. These files can then be loaded into the target HBase table, in this case healthyTable, as follows.

hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /user/hbase/healthytable/output healthyTable

Quick Notes

Things that came on the way
