Quick Notes

Things that came on the way

Build Your Hadoop Cluster Unattended (Almost!!)

If you are a developer who would like to have a Hadoop cluster, or a dev lead who would like everyone on your team to have a cluster of their own, without going through the hassle of creating machines, networking them, and installing the required software components, chef-bach is an option worth trying.

Garbage First (G1) GC and HBase

G1, the latest Java garbage collector, is a generational collector like CMS. The key differences between the G1 and CMS collectors are:

  • In CMS, the young and old generation heap spaces are physically separate. In G1 the separation is logical: the total heap is divided into a number of equal-sized regions (roughly 2000), and each region is tagged as belonging to either the young or the old generation.
  • G1 compacts the heap as part of the collection process, which was not the case in CMS.
  • G1 tries to meet a user-specified target for application pause time during collection.
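To make the region idea concrete, here is a minimal sketch of how a region size could be derived from the heap size. It assumes a simplified version of the HotSpot heuristic: aim for about 2048 regions, round the region size down to a power of two, and clamp it between 1 MiB and 32 MiB. The real JVM logic differs in details across JDK versions, and the function name `g1_region_size` is made up for this illustration.

```python
# Simplified sketch of G1 region sizing (assumption: ~2048 target regions,
# region size is a power of two clamped to [1 MiB, 32 MiB]).
MIB = 1024 * 1024
MIN_REGION = 1 * MIB
MAX_REGION = 32 * MIB
TARGET_REGIONS = 2048


def g1_region_size(heap_bytes: int) -> int:
    """Return an approximate G1 region size in bytes for a given heap size."""
    raw = max(heap_bytes // TARGET_REGIONS, 1)
    # Round down to the nearest power of two.
    power_of_two = 1 << (raw.bit_length() - 1)
    # Clamp to the assumed lower and upper bounds.
    return min(max(power_of_two, MIN_REGION), MAX_REGION)


if __name__ == "__main__":
    for heap_gib in (1, 8, 16, 100):
        heap = heap_gib * 1024 * MIB
        size = g1_region_size(heap)
        print(f"{heap_gib} GiB heap: {size // MIB} MiB regions, {heap // size} regions")
```

Under these assumptions an 8 GiB heap ends up with 4 MiB regions, which gives exactly 2048 of them. The region size can also be set explicitly with the `-XX:G1HeapRegionSize` JVM flag.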

Copy Distributed Log Files Into HDFS

In a distributed cluster like HBase, logs are created and written on the individual nodes participating in the cluster. If users want to look at the log entries, say for debugging purposes, they need to log on to the individual nodes. Apart from the complexity of going through the logs node by node, users need to be given access to all the nodes. One way to mitigate these concerns is to copy the log files into a centralized location.
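As a rough illustration of the idea, the sketch below builds `hdfs dfs` commands that copy every `*.log` file from a local directory into a per-host directory in HDFS. The base path `/logs/hbase`, the per-host layout, and the function names are assumptions made for this example, not a description of any particular tool. By default it only returns the commands (dry run) so it can be tried on a machine without Hadoop installed.

```python
import socket
import subprocess
from pathlib import Path, PurePosixPath
from typing import List, Optional


def hdfs_destination(base_dir: str, host: str, log_file: Path) -> str:
    """Build the HDFS target path: <base_dir>/<host>/<file name>."""
    return str(PurePosixPath(base_dir) / host / log_file.name)


def copy_logs(local_dir: str,
              base_dir: str = "/logs/hbase",   # hypothetical HDFS location
              host: Optional[str] = None,
              dry_run: bool = True) -> List[List[str]]:
    """Return (and optionally run) the hdfs commands that copy local logs."""
    host = host or socket.gethostname()
    target_dir = str(PurePosixPath(base_dir) / host)
    # Make sure the per-host directory exists before copying into it.
    commands = [["hdfs", "dfs", "-mkdir", "-p", target_dir]]
    for log_file in sorted(Path(local_dir).glob("*.log")):
        dest = hdfs_destination(base_dir, host, log_file)
        # -f overwrites an existing copy of the same file in HDFS.
        commands.append(["hdfs", "dfs", "-put", "-f", str(log_file), dest])
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, check=True)
    return commands
```

Running something like `copy_logs("/var/log/hbase", dry_run=False)` periodically on each node (for example from cron) would gather the logs under one HDFS directory per host, so users need access only to HDFS rather than to every node.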

JVM GC Settings and HBase Performance

Because HBase is Java based and uses the JVM heap differently than a typical Java process, Java garbage collection affects HBase performance. The impact becomes noticeable when the JVM GC parameters are not set properly. In order to come up with optimal GC settings, it helps to understand the GC process, which provides the required context. The following is a quick summary of the GC process and the relevant parameters that influence its behavior. For anyone who would like to venture into the details of GC, there is plenty of literature available on the net.

Leverage Large Physical Memory to Improve HBase Read Performance

HBase uses the block cache to keep data read from disk in memory, so that data referenced repeatedly is served without disk reads. The block cache stores its data on the HBase JVM heap, which means any factor that adversely affects the JVM GC process will affect the cache and hence query performance. It is commonly known that large heap sizes, say more than 16 GB, make “stop the world” GC pauses very slow, and the time taken to complete can even cause the HBase region server to be marked as dead.