Quick Notes

Things that came up along the way

Garbage First (G1) GC and HBase

G1 is a Java garbage collector that, like the older CMS collector, is generational. The key differences between the two are:

  • In CMS, the young and old generations occupy physically separate parts of the heap. In G1 the separation is logical: the total heap is divided into a number of regions (roughly 2,000), and each region is tagged as belonging to either the young or the old generation.
  • G1 compacts the heap as part of the collection process, which was not the case in CMS.
  • G1 tries to meet a user-specified target for application pause time during collection.
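The region idea can be made concrete with a little arithmetic. The sketch below is an approximation of how a region size might be derived from the heap size, not the JVM's actual code. It assumes a target of 2048 regions, a power-of-two region size, and bounds of 1 MiB and 32 MiB; these values match the HotSpot defaults as far as I know, but treat them as assumptions and check your JDK (the size can also be set explicitly with -XX:G1HeapRegionSize).

```python
MIN_REGION = 1 * 1024 * 1024    # assumed lower bound: 1 MiB
MAX_REGION = 32 * 1024 * 1024   # assumed upper bound: 32 MiB
TARGET_REGIONS = 2048           # assumed target region count


def estimate_region_size(heap_bytes: int) -> int:
    """Estimate a G1 region size in bytes for a given heap size."""
    if heap_bytes <= 0:
        raise ValueError("heap_bytes must be positive")
    raw = max(heap_bytes // TARGET_REGIONS, MIN_REGION)
    # Round down to the nearest power of two.
    size = 1 << (raw.bit_length() - 1)
    # Clamp to the assumed bounds.
    return min(max(size, MIN_REGION), MAX_REGION)


def estimate_region_count(heap_bytes: int) -> int:
    """Number of whole regions that fit in the heap at the estimated size."""
    return heap_bytes // estimate_region_size(heap_bytes)


GIB = 1024 ** 3
for heap_gib in (8, 16, 100):
    size_mib = estimate_region_size(heap_gib * GIB) // (1024 * 1024)
    count = estimate_region_count(heap_gib * GIB)
    print(f"{heap_gib} GiB heap: {size_mib} MiB regions, {count} regions")
```

Under these assumptions, heaps up to 64 GiB land at or below 2048 regions, while larger heaps hit the size cap and simply get more regions.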

Copy Distributed Log Files Into HDFS

In a distributed cluster like HBase, logs are created and written on the individual nodes participating in the cluster. If users want to look at the log entries, say for debugging, they need to log on to the individual nodes. Apart from the complexity of going through the logs node by node, this means users must be given access to every node. One way to mitigate these concerns is to copy the log files into a centralized location.
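As a rough illustration of the centralized-copy idea, the sketch below builds the `hdfs dfs` commands a node could run to push its own log files into a per-day, per-host directory in HDFS. The directory layout, host name, and file paths are hypothetical; only `hdfs dfs -mkdir -p` and `hdfs dfs -put -f` are standard commands. The function only builds the commands, so it can be inspected without a cluster.

```python
import posixpath
from datetime import date


def build_copy_commands(host, log_files, hdfs_base, day):
    """Return the hdfs commands that copy one node's logs into HDFS.

    The destination layout <hdfs_base>/<YYYY-MM-DD>/<host> is an
    assumption made for this sketch, not an HBase convention.
    """
    dest = posixpath.join(hdfs_base, day.isoformat(), host)
    commands = [["hdfs", "dfs", "-mkdir", "-p", dest]]
    for path in log_files:
        # -f overwrites a previous copy of the same file name.
        commands.append(["hdfs", "dfs", "-put", "-f", path, dest])
    return commands


# Hypothetical host name and paths, for illustration only.
for cmd in build_copy_commands(
    host="node1.example.com",
    log_files=["/var/log/hbase/regionserver.log"],
    hdfs_base="/logs/hbase",
    day=date(2015, 1, 31),
):
    print(" ".join(cmd))
```

On a real node, each command list could be passed to `subprocess.run(cmd, check=True)` from a scheduled job.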

JVM GC Settings and HBase Performance

Because HBase is Java based and uses the JVM heap differently from a typical Java process, Java garbage collection affects HBase performance. The impact is noticeable when the JVM GC parameters are not set properly. To arrive at good GC settings, it helps to understand the GC process, which provides the required context. The following is a quick summary of the GC process and the parameters that influence its behavior. For anyone who would like to venture into the details of GC, plenty of literature is available on the net.
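To show where such parameters end up, here is a small sketch that assembles a handful of standard JVM flags into a line for hbase-env.sh. The flags themselves (-Xms, -Xmx, -XX:+UseG1GC, -XX:MaxGCPauseMillis) are real HotSpot options, but the helper names and the numbers used are placeholders for illustration, not tuning recommendations.

```python
def build_gc_opts(heap_gb, pause_target_ms):
    """Return a list of JVM flags for a fixed-size heap collected by G1."""
    if heap_gb <= 0 or pause_target_ms <= 0:
        raise ValueError("heap_gb and pause_target_ms must be positive")
    return [
        f"-Xms{heap_gb}g",                          # initial heap size
        f"-Xmx{heap_gb}g",                          # maximum heap size
        "-XX:+UseG1GC",                             # select the G1 collector
        f"-XX:MaxGCPauseMillis={pause_target_ms}",  # pause-time goal
    ]


def as_hbase_env_line(opts):
    """Format the flags as an export line appending to HBASE_OPTS."""
    return 'export HBASE_OPTS="$HBASE_OPTS ' + " ".join(opts) + '"'


# Placeholder values: an 8 GB heap and a 100 ms pause target.
print(as_hbase_env_line(build_gc_opts(heap_gb=8, pause_target_ms=100)))
```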

Leverage Large Physical Memory to Improve HBase Read Performance

HBase uses the block cache to keep data read from disk in memory so that data referenced repeatedly is served without disk reads. The block cache stores its data on the HBase JVM heap, which means any factor that adversely affects the JVM GC process also affects the cache and hence query performance. It is commonly known that large heap sizes, say more than 16 GB, make “stop the world” GC events run very slowly, and the time they take to complete can even cause the HBase region server to be marked as dead.
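The effect of a block cache on disk reads can be seen with a toy model. The sketch below is a plain least-recently-used cache keyed by block id; it is not HBase's implementation, the class and callback names are made up for this example, and capacity is counted in blocks rather than bytes to keep it short.

```python
from collections import OrderedDict


class ToyBlockCache:
    """A tiny LRU cache that falls back to a 'disk read' on a miss."""

    def __init__(self, capacity, read_block_from_disk):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._read = read_block_from_disk   # hypothetical disk reader
        self._blocks = OrderedDict()        # oldest entry first
        self.hits = 0
        self.misses = 0

    def get(self, block_id):
        if block_id in self._blocks:
            # Cache hit: mark the block as most recently used.
            self._blocks.move_to_end(block_id)
            self.hits += 1
            return self._blocks[block_id]
        # Cache miss: read from disk and remember the block.
        self.misses += 1
        data = self._read(block_id)
        self._blocks[block_id] = data
        if len(self._blocks) > self.capacity:
            # Evict the least recently used block.
            self._blocks.popitem(last=False)
        return data


disk_reads = []


def fake_disk_read(block_id):
    disk_reads.append(block_id)
    return f"data-for-{block_id}"


cache = ToyBlockCache(capacity=2, read_block_from_disk=fake_disk_read)
for block in ["a", "b", "a", "c", "b"]:
    cache.get(block)
print(cache.hits, cache.misses, disk_reads)
```

In this run the second read of "a" is served from memory, while "b" has to be read from disk again because it was evicted to make room for "c".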

Leverage HBase Cache and Improve Read Performance

As with any database management system, proper utilization of caches improves query performance in HBase. If someone is looking to optimize caching, it is good to understand the HBase data structures, since the optimization will vary with each use case. The following is a simplified view of the HBase read/write path and the participating components.

HBase has two in-memory data structures (memstore, blockcache) and two on-disk data structures (WAL, HFile). During a write, HBase writes the data to the WAL (Write Ahead Log) on disk and also to the memstore in memory. When a memstore utilization threshold is reached, the data is flushed into HFiles on disk.
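The write path described above can be sketched as a toy model. This is a deliberate simplification and not HBase code: the WAL and HFiles are plain Python lists standing in for files on disk, the flush threshold counts rows instead of bytes, the block cache is left out, and the read method (memstore first, then HFiles from newest to oldest) is an assumption added so the example can be exercised.

```python
class ToyRegion:
    """Toy model of the WAL -> memstore -> HFile write path."""

    def __init__(self, flush_threshold):
        self.flush_threshold = flush_threshold  # rows, not bytes (assumption)
        self.wal = []       # stands in for the Write Ahead Log on disk
        self.memstore = {}  # in-memory store of recent writes
        self.hfiles = []    # each flush produces one sorted, immutable file

    def put(self, row, value):
        # Every write goes to the WAL and to the memstore.
        self.wal.append((row, value))
        self.memstore[row] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Write the memstore contents out as a new sorted HFile.
        if not self.memstore:
            return
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore = {}

    def get(self, row):
        # Newest data wins: check the memstore, then HFiles newest first.
        if row in self.memstore:
            return self.memstore[row]
        for hfile in reversed(self.hfiles):
            for key, value in hfile:
                if key == row:
                    return value
        return None


region = ToyRegion(flush_threshold=2)
region.put("row-a", "1")
region.put("row-b", "2")   # reaches the threshold and triggers a flush
region.put("row-a", "3")   # newer value stays in the memstore for now
print(len(region.wal), len(region.hfiles), region.get("row-a"), region.get("row-b"))
```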