Quick Notes

Things that came up along the way

Oozie Job to Schedule HBase Major Compaction

HBase performs time-based major compactions, and to prevent this resource-intensive process from interfering with performance-sensitive applications, it can be disabled. Once disabled, the application team needs to schedule regular compactions to keep the HBase store files in optimal condition. The following details the steps to schedule a daily compaction through Oozie.
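As a minimal sketch, the compaction itself can be issued from a script that pipes a command into the hbase shell; the Oozie piece is then a coordinator with a daily frequency (frequency="${coord:days(1)}") that triggers a workflow containing a shell action running this script. The table name below is a placeholder.

    #!/bin/bash
    # run_major_compact.sh - issue a major compaction for one table.
    # 'usertable' is a placeholder; substitute the real table name.
    hbase shell <<'EOF'
    major_compact 'usertable'
    EOF

Note that major_compact only queues the compaction, so the shell returns quickly; the actual work happens on the region servers afterwards.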

HBase Replication

HBase supports inter-cluster data replication, which can be used to propagate data to a secondary cluster/data center that can be accessed when the primary cluster/data center is not available. The following are the high-level steps to enable HBase inter-cluster replication. Note that HBase also supports region replication within a cluster for read HA, which is different from inter-cluster data replication.
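As a rough sketch in the hbase shell (the peer id, ZooKeeper quorum, table, and column family names are placeholders, and the exact syntax varies by HBase version; older releases also require hbase.replication=true in hbase-site.xml):

    # On the source cluster: register the destination cluster as a peer
    add_peer '1', 'zkA.example.com,zkB.example.com,zkC.example.com:2181:/hbase'
    # Mark the column families to replicate (scope 1 = replicate, 0 = do not)
    alter 'usertable', {NAME => 'cf', REPLICATION_SCOPE => '1'}
    # Confirm the configuration
    list_peers

The destination cluster must already have a table with the same name and column families, and data written before replication was enabled has to be copied over separately (for example with the CopyTable or ExportSnapshot tools).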

HBase Back-up

Often during development, and sometimes in production, a backup of an HBase table needs to be made: for example, to run test code against a table during development and be able to restore the original data if something goes wrong, to create a clone of an existing table, or to move table data to a new development cluster. HBase provides the option of taking snapshots of tables, which can be used in such scenarios. The following hbase shell commands accomplish some of the common requirements.
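For instance (table and snapshot names are placeholders):

    # Take a snapshot of a live table
    snapshot 'usertable', 'usertable_snap'
    # List and delete snapshots
    list_snapshots
    delete_snapshot 'usertable_snap'
    # Create an independent clone of the table from the snapshot
    clone_snapshot 'usertable_snap', 'usertable_clone'
    # Roll the original table back to the snapshot (table must be disabled)
    disable 'usertable'
    restore_snapshot 'usertable_snap'
    enable 'usertable'

To move a table to another cluster, the snapshot can be copied with the ExportSnapshot MapReduce job and then cloned on the destination, along the lines of:

    hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
      -snapshot usertable_snap -copy-to hdfs://dest-nn:8020/hbase -mappers 8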

HBase : Best Practices

Application Development

Connection Object Reuse

Creating connections to a server component from an application is a heavyweight operation, and this is especially pronounced when connecting to a database server; that is why database connection pooling is used to reuse connection objects, and HBase is no exception. In HBase, data from the meta table, which stores details about the region servers that can serve data for specific key ranges, is cached at the individual connection level, making HBase connections even heavier. If regions move during balancing or a region server fails, this metadata needs to be refreshed for each connection object, which is a performance overhead. For these reasons, applications should reuse the connection objects they create.
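A minimal sketch of the pattern using the standard HBase 1.0+ client API (the wrapper class itself is illustrative, not an HBase API): keep one long-lived Connection for the whole application, and create the lightweight Table instances per unit of work.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseClient {
        // Heavy: holds the meta cache, ZooKeeper link, and thread pools;
        // create once and share across the application
        private static Connection connection;

        public static synchronized Connection getConnection() throws IOException {
            if (connection == null) {
                Configuration conf = HBaseConfiguration.create();
                connection = ConnectionFactory.createConnection(conf);
            }
            return connection;
        }

        public static Result readRow(String tableName, String rowKey) throws IOException {
            // Table is cheap; obtain it per request from the shared connection
            try (Table table = getConnection().getTable(TableName.valueOf(tableName))) {
                return table.get(new Get(Bytes.toBytes(rowKey)));
            }
        }
    }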

HBase : Data Load

Often during development and in production, data needs to be loaded into HBase tables, whether for testing application code or migrating data from an existing database, among many other scenarios. One obvious option is to read data from a source and use the HBase put client API to write it into tables. This works fine for small amounts of data, as in unit testing or a PoC. For large data sets running into GBs or TBs, writing with put will be time-consuming even when the source data is already available. To mitigate this, HBase provides an option to create HFiles, the HBase-specific file format used to store table data in the underlying filesystem, and load them directly into tables. When the source data is on HDFS, these files can be created using a MapReduce job, and the following are the high-level steps.
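A sketch of the driver for such a job using the standard HBase MapReduce integration; the table name, column family, and the CSV-style input format assumed by ParseMapper are all illustrative.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.RegionLocator;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class BulkLoadDriver {

        // Hypothetical parser: assumes "rowkey,value" text lines and emits
        // (rowkey, KeyValue) pairs for a single column
        public static class ParseMapper
                extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {
            @Override
            protected void map(LongWritable key, Text line, Context ctx)
                    throws IOException, InterruptedException {
                String[] parts = line.toString().split(",", 2);
                byte[] row = Bytes.toBytes(parts[0]);
                KeyValue kv = new KeyValue(row, Bytes.toBytes("cf"),
                        Bytes.toBytes("col"), Bytes.toBytes(parts[1]));
                ctx.write(new ImmutableBytesWritable(row), kv);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            Job job = Job.getInstance(conf, "hbase-hfile-generator");
            job.setJarByClass(BulkLoadDriver.class);
            job.setMapperClass(ParseMapper.class);
            job.setMapOutputKeyClass(ImmutableBytesWritable.class);
            job.setMapOutputValueClass(KeyValue.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // source data on HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // HFile output directory

            TableName tableName = TableName.valueOf("usertable");   // placeholder table
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(tableName);
                 RegionLocator locator = conn.getRegionLocator(tableName)) {
                // Configures the partitioner and reducer so the generated HFiles
                // line up with the table's current region boundaries
                HFileOutputFormat2.configureIncrementalLoad(job, table, locator);
                System.exit(job.waitForCompletion(true) ? 0 : 1);
            }
        }
    }

Once the job finishes, the HFiles are handed off to the region servers with the bulk load tool, which on HBase 1.x is invoked as below (the class moved to the org.apache.hadoop.hbase.tool package in 2.x):

    hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /path/to/hfiles usertable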