HBase performs time-based major compaction, and because this resource-intensive process can interfere with performance-sensitive applications, it can be disabled. Once disabled, the application team needs to schedule regular compaction in order to keep the HBase store files in optimal condition. The following details the steps to schedule a daily compaction through Oozie.
Non-Kerberized Cluster
Create the following files, with the content shown, in a local directory. In this example the files are created under hbasecompact in the user's local home directory. Please note that the files are placed in different subdirectories.
A shell script that starts HBase compaction on a table and copies the output of the command to an HDFS directory. The command output can then be checked for any issues.
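The script body is not reproduced above; a minimal sketch of what hbaseCompact.sh could look like is shown below. The function name, local log path, and argument order are illustrative assumptions, not taken from the original file.

```shell
#!/bin/bash
# hbaseCompact.sh -- minimal sketch; table name, HDFS output directory,
# and local log location are illustrative assumptions.

compact_table() {
  local table="$1"
  local outdir="$2"
  local log="/tmp/majorcompact.log"

  # Run major_compact through the HBase shell and capture all output
  echo "major_compact '${table}'" | hbase shell > "${log}" 2>&1

  # Copy the command output to HDFS so the last run can be inspected
  hdfs dfs -put -f "${log}" "${outdir}/majorcompact.log"
}
```

In the real script the last line would invoke the function with the action's arguments, e.g. `compact_table "$1" "$2"`.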
An Oozie workflow definition to perform the HBase compaction. The first step, major_compact, runs the script hbaseCompact.sh. The next step, logOozieId, runs the script logOozieId.sh to copy the Oozie workflow ID onto HDFS.
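The workflow.xml content is not reproduced above. A minimal sketch of such a workflow, using the Oozie shell action, is shown below; the parameter names (${scriptDir}, ${outputDir}, ${tableName}, etc.) and action wiring are illustrative assumptions that would need to match the properties file.

```xml
<workflow-app name="hbase-compact-wf" xmlns="uri:oozie:workflow:0.4">
    <start to="major_compact"/>

    <!-- Step 1: run hbaseCompact.sh through the Oozie shell action -->
    <action name="major_compact">
        <shell xmlns="uri:oozie:shell-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapreduce.job.queuename</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <exec>hbaseCompact.sh</exec>
            <argument>${tableName}</argument>
            <argument>${outputDir}</argument>
            <file>${scriptDir}/hbaseCompact.sh#hbaseCompact.sh</file>
        </shell>
        <ok to="logOozieId"/>
        <error to="fail"/>
    </action>

    <!-- Step 2: run logOozieId.sh to record the workflow ID on HDFS -->
    <action name="logOozieId">
        <shell xmlns="uri:oozie:shell-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>logOozieId.sh</exec>
            <argument>${wf:id()}</argument>
            <argument>${outputDir}</argument>
            <file>${scriptDir}/logOozieId.sh#logOozieId.sh</file>
        </shell>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>Compaction failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```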
Note: Do not use 1440 minutes as the frequency in coordinator.xml if the expectation is to run compaction every day at a certain time, since the job run time will shift when the system time changes for daylight saving. The start time and end time should be specified in UTC/GMT. The timezone attribute is required for Oozie to invoke the logic that handles the time changes due to daylight saving.
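The coordinator.xml content is not reproduced above. A minimal sketch is shown below; per the note, it uses ${coord:days(1)} rather than a 1440-minute frequency so that Oozie's daylight-saving logic applies. The parameter names (${startTime}, ${endTime}, ${timezone}, ${workflowPath}) are illustrative assumptions.

```xml
<coordinator-app name="hbase-compact-coord"
                 frequency="${coord:days(1)}"
                 start="${startTime}" end="${endTime}"
                 timezone="${timezone}"
                 xmlns="uri:oozie:coordinator:0.2">
    <action>
        <workflow>
            <!-- HDFS path of the directory containing workflow.xml -->
            <app-path>${workflowPath}</app-path>
        </workflow>
    </action>
</coordinator-app>
```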
Properties that need to be substituted in place of the parameters defined in the workflow and coordinator XML files. If this example is used as-is, this is the only file that needs to be changed. Inline comments will help in making the changes.
~$ cat hbasecompact/coordinator/coordinator.properties
# HDFS NameNode URL: you can find it in hdfs-site.xml
# If HDFS HA is enabled, use the value of dfs.nameservices
# URL of the JobTracker for MR1
# If MR2/YARN is used, use the YARN RM URL: YARN-RM:8032
# If YARN HA is enabled, use the YARN cluster-id specified in the
# yarn.resourcemanager.cluster-id property of yarn-site.xml
# YARN queue to which the workflow MR jobs need to be submitted
# HDFS directory where the Oozie application is stored
# HDFS directory where workflow.xml is stored
# HDFS directory where the scripts in the workflow are located
# HDFS directory where the script output needs to be stored
# Name of the table that needs to be compacted
# Date and time to start and stop the workflow
# HDFS directory where coordinator.xml is stored
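The property values themselves were elided above; a sketch of what the complete file could look like is shown below. Every value and property name here is an illustrative assumption to be replaced per the comments above, except oozie.coord.application.path, which is the standard Oozie property naming the coordinator's HDFS directory.

```properties
# All values below are illustrative; replace them for your cluster.
nameNode=hdfs://namenode-host:8020
jobTracker=yarn-rm-host:8032
queueName=default
appRoot=${nameNode}/user/userid/hbasecompact
workflowPath=${appRoot}/workflow
scriptDir=${appRoot}/scripts
outputDir=/user/userid/compact
tableName=mytable
startTime=2015-10-23T08:00Z
endTime=2099-12-31T08:00Z
timezone=America/New_York
oozie.coord.application.path=${appRoot}/coordinator
```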
If you list the local directory where these files are stored, it will look like this:
~$ ls -ls -R hbasecompact/
4 drwxr-xr-x 2 userid users 4096 Oct 22 14:46 coordinator
4 drwxr-xr-x 2 userid users 4096 Oct 22 14:15 scripts
4 drwxr-xr-x 2 userid users 4096 Oct 22 14:05 workflow
4 -rw-r--r-- 1 userid users 1197 Oct 22 14:46 coordinator.properties
4 -rw-r--r-- 1 userid users 640 Oct 22 12:35 coordinator.xml
4 -rwxr-xr-x 1 userid users 243 Oct 22 12:02 hbaseCompact.sh
4 -rwxr-xr-x 1 userid users 65 Oct 22 12:23 logOozieId.sh
4 -rw-r--r-- 1 userid users 1853 Oct 22 14:05 workflow.xml
Schedule the Oozie job using the following command. Note that the properties file, in this example /home/userid/hbasecompact/coordinator/coordinator.properties, should be on the local disk of the host from which the command is executed.
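The submission command itself was elided; assuming the default Oozie server endpoint shown later in this document, it would typically take this form, using the Oozie CLI's -config and -run options:

```
~$ oozie job -oozie http://oozie-host:11000/oozie \
      -config /home/userid/hbasecompact/coordinator/coordinator.properties -run
```

On success the CLI prints the coordinator job ID, which can be used to look the job up in the Oozie UI.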
The status of the submitted job can be verified through the Oozie UI, which is normally accessible at *http://oozie-host:11000/oozie/*. The status of the MapReduce jobs can also be viewed through the YARN ResourceManager/NodeManager UIs.
The ID of the last Oozie workflow that was executed, and the output of the HBase major_compact command from the last run, can also be found in the HDFS files created by the job.
~$ hdfs dfs -cat /user/userid/compact/majorcompact.log
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.98.4.2.2.8.0-2928-hadoop2, r87e9f77a121be2dae41c9ef8964d254fdc4c23a3, Fri Aug 21 13:29:40 PDT 2015
0 row(s) in 5.0410 seconds
~$ hdfs dfs -cat /user/userid/compact/oozieId.log