The storage system exposes an API which a database instance uses to persist redo log entries, passing in the LSN for each change. Since this is network I/O, the database instance can issue the request to persist log entries as soon as changes are made. This is in contrast to traditional databases, where log entries are buffered into groups before being persisted, since persistence involves disk I/O. The storage system persists the log entry to a "hot log" and acknowledges the database instance. This is the only write request made from the database instance to the storage system, which reduces the number of I/Os per transaction by a factor of ~7.

Data is stored in 10 GB chunks called segments and replicated six ways, with two copies in each of three availability zones (AZs), for availability and recovery. The logical group of six segments is called a protection group (PG). Storage segments are distributed across storage nodes, which are EC2 instances in AWS. A chain of PGs constitutes a database volume, which has a one-to-one relationship with a database instance; the volume can grow with the data by adding new PGs. The database instance maintains metadata about the segments in DynamoDB, the AWS key-value store: the protection group to which each segment belongs, the storage nodes responsible for each segment, and the data pages and log offsets stored in each segment.
The database instance sends the log write request to all six segments, and the write is considered successful once four of the six storage nodes acknowledge it. The storage system can identify issues with any storage node or segment; when one is found, it adds a new segment to the members of the existing protection group to replace the problematic one. The database instance can then decide which segment to drop, based on how quickly the issue with the old segment is resolved. Since the write is coordinated by the database instance, there is no need for consensus protocols in the storage layer, which reduces complexity.
In the background, the storage system coalesces log entries, creates and updates data pages, garbage-collects unwanted data pages, performs consistency checks on pages, and backs up data pages to external storage like S3 for recovery, relieving the database instance of these operations. This also allows multiple database instances to be attached to the same storage volume. When there are multiple read instances attached to the same storage volume, the write instance, along with sending log write requests to the storage system, also sends the log entries to the read instances so that they can keep their buffer caches current. Multiple write instances are enabled by pre-allocating ranges of LSNs to each instance so that there are no conflicts between updates made through the various instances. If transactions from two write instances update the same rows in a table, the transaction whose four-of-six quorum write completes first gets committed, and the other transaction fails.
Aurora database instances maintain various consistency points, enabled by the monotonically increasing LSN, which makes distributed commits and recovery simpler. The segment complete LSN (SCL) is the low watermark below which all log records for a segment have been received. When there are holes in the log records, storage nodes gossip with the other nodes in the segment's PG to fill them. The protection group complete LSN (PGCL) tracks the point to which four of the six segment SCLs have advanced in a PG. The volume complete LSN (VCL) tracks the LSN to which the PGCLs have advanced across all PGs. During recovery, the storage system truncates all log records with an LSN larger than the VCL.
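A minimal sketch of how these watermarks could be computed, assuming (as described above) that the PGCL is the highest LSN that at least four of a PG's six segments have fully received, and that the VCL is the minimum PGCL across all PGs; the helpers are mine, not Aurora code:

```java
import java.util.Arrays;

// Hypothetical helpers illustrating the watermark math, not Aurora code.
final class Watermarks {
    // PGCL: the 3rd-smallest SCL is covered by 4 of the 6 segments.
    static long pgcl(long[] segmentScls) {   // expects 6 entries
        long[] s = segmentScls.clone();
        Arrays.sort(s);                      // ascending
        return s[2];                         // 4 segments have SCL >= s[2]
    }

    // VCL: every PG must be complete up to this point, hence the minimum.
    static long vcl(long[] pgcls) {
        return Arrays.stream(pgcls).min().orElse(0L);
    }
}
```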
The database can also establish recovery points based on transactions. Each database-level transaction is broken into multiple mini-transactions (MTRs) that need to be performed in order and atomically. The final record in a mini-transaction is marked as a consistency point LSN (CPL). The volume durable LSN (VDL) tracks the highest CPL that is smaller than or equal to the VCL. During recovery, the database establishes the durable point to be the VDL, and the storage system truncates all data above that consistency point.
Cloning a database results in the new database pointing to the same logs and data pages as the original, i.e. the clone is created much more quickly and can serve reads on the data as of that point in time. New log segments and data pages are created only as writes are made through the new database instance. This satisfies the need in a cloud offering for database clones that are cheap and fast to create.
Aurora can be configured to track changes so that the state of the database can be moved back to a previous point in time quickly, for reasons such as data corruption caused by incorrect code. Once backtracking is configured, data pages are not garbage-collected by the storage system and are tracked so that users can go back to a desired point in time.
Fundamentally, in an active database, users can create a rule which defines the conditions that need to be met and the action the database should take. When a DML statement is executed on data in the DBMS, the DBMS checks whether the DML caused any of the conditions to be satisfied and, if so, takes the action. Anyone familiar with triggers in a DBMS, which were just getting supported in commercial DBMSs when these papers were published, can relate to this event-condition-action (ECA) paradigm.
At a high level, the following functional components are required for an active DBMS, in addition to core data management and transaction management functionality.
Alert follows an evolutionary approach of extending a passive DBMS into an active DBMS, and the components implemented for the extension provide insights into the basic components used in current data streaming and processing technologies like Apache Kafka.
It introduces the notions of active tables and active queries. Active tables are append-only tables in which tuples are never updated in place and new tuples are added at the end; active queries are queries that range over such tables. When a cursor is opened for an active query involving one or more active tables, tuples added to an active table after the cursor was opened also contribute to the answer. Thus active queries are defined over past, present and future data, whereas the domain of passive queries is limited to past and present data. In order to support active queries, Alert introduced a new SQL primitive, fetch-wait, to iterate over active queries; it blocks when the current answer set is exhausted and resumes returning data when a new tuple becomes available. Active tables are defined by users like any passive table, and data is stored in active tables akin to the journals created by many applications, such as banking transactions.
Users can create rules using standard SQL, including the condition which needs to be satisfied and the action the database needs to take when the condition is satisfied by a database event.
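A hypothetical sketch in the spirit of Alert's SQL-based rules; the active table, columns and email action are invented for illustration:

```sql
CREATE RULE large_withdrawal AS
SELECT send_email('ops@example.com', t.account_id, t.amount)
FROM withdrawals t        -- an active (append-only) table
WHERE t.amount > 10000;
```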
This creates a rule to send an email when the conditions are met. Like database views, rules can be referenced in any other query. After a rule is created, it can be activated and deactivated explicitly. When activating a rule, users can specify its transaction and time coupling along with its assertion mode. Transaction coupling specifies whether the triggered action needs to be executed in the transaction the triggering event is part of, or separately. Time coupling specifies whether the triggered action needs to be executed synchronously with the triggering event or asynchronously, in parallel with it. The assertion mode gives users the option to specify whether the rule is triggered as soon as the rule condition is satisfied or deferred till the end of the transaction. The following diagram shows the message flow when an event affects an active table, which in turn satisfies the conditions of multiple rules.
The Alert rule system, a new component added to extend a passive DBMS into an active one, performs any conflict resolution between multiple rules and determines their order of execution. The rules are then executed in that order to fetch the tuples, which are passed to the actions associated with the rules. To reduce the number of locks taken on data pages while data is read and rules are evaluated, latches are taken on the pages and locks are deferred to the end.
Access methods enable us to read or write base data, potentially using auxiliary data such as indexes to improve performance.
RO is the read amplification: the ratio between the total amount of data read, including auxiliary and base data, and the amount of data retrieved.
UO is the write amplification: the ratio between the size of the physical updates performed for one logical update and the size of the logical update. A logical update can involve multiple physical updates, e.g. updates to base data and to auxiliary data like indexes.
MO is the space amplification: the ratio between the space utilized for auxiliary and base data together and the space utilized for base data alone.
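Written as ratios, matching the descriptions above:

```latex
RO = \frac{\text{data read (base + auxiliary)}}{\text{data retrieved}}
\qquad
UO = \frac{\text{size of physical updates for one logical update}}{\text{size of the logical update}}
\qquad
MO = \frac{\text{space for base + auxiliary data}}{\text{space for base data}}
```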
The theoretical minimum for each overhead is a ratio of 1.0, which implies that the base data is always read and updated directly and not a single extra bit of memory is used. Achieving these bounds for all three overheads simultaneously is not possible, as there is always a price to pay for every optimization. When designing an access method, all three overheads should be kept low, but depending on the application workload and the available technology they need to be prioritized.
designing access methods that set an upper bound for two of the RUM overheads, leads to a hard lower bound for the third overhead which cannot be further reduced
In other words, we can choose which two overheads to prioritize and optimize for, and pay the price of having the third overhead exceed a hard lower bound. The following figure shows some popular access methods mapped into the three-dimensional RUM space, projected onto a two-dimensional plane.
Using the RUM space, we can understand current access methods in terms of which overheads they prioritize and choose the one that suits our needs. It also helps in designing new access methods, or combinations of access methods, which can be tuned for the available technology and dynamically adapt to new workloads, thereby covering the RUM space in aggregate.
Dynamo uses consistent hashing, where the output range of a hash function is treated as a fixed circular space or ring. Each node in Dynamo is assigned a random value within this space, which represents its position on the ring. Each data item, identified by its key, is assigned to a node by hashing the key to find its position on the ring and walking clockwise to the first node with a position larger than the item's. This node is the coordinator node for writes of that key. Because items are assigned to nodes this way, adding or removing a node impacts only its immediate neighbors, and other nodes remain unaffected. To reduce the non-uniform data and load distribution caused by the random position assignment of nodes on the hash ring, Dynamo uses virtual nodes: each virtual node is assigned a position on the ring, and each physical node is assigned multiple virtual nodes. By varying the number of virtual nodes based on physical node capacity, heterogeneity of the environment can be taken into account. If a node becomes unavailable, virtual nodes also help disperse its load evenly across the remaining nodes.
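A minimal sketch of the ring-with-virtual-nodes idea (not Dynamo's actual implementation; the hash function is a stand-in):

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.UUID;

final class HashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int vnodesPerNode;

    HashRing(int vnodesPerNode) { this.vnodesPerNode = vnodesPerNode; }

    // Each physical node takes several positions on the ring.
    void addNode(String node) {
        for (int i = 0; i < vnodesPerNode; i++) {
            ring.put(hash(node + "#" + i), node);
        }
    }

    // The coordinator is the first node clockwise from the key's position.
    String coordinatorFor(String key) {
        Map.Entry<Long, String> e = ring.ceilingEntry(hash(key));
        return (e != null ? e : ring.firstEntry()).getValue();
    }

    private static long hash(String s) {
        // MD5-based stand-in for the ring's hash function
        return UUID.nameUUIDFromBytes(s.getBytes()).getMostSignificantBits();
    }
}
```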
Dynamo's evolution compared three partitioning schemes and the efficiency of load distribution under each.
The node to which a data item is assigned is called the coordinator node; it not only stores the data locally but also coordinates replication of the data to N-1 other nodes, where N is configurable. The N-1 successive nodes on the ring after the coordinator are selected for replication and are called the preference list. Since multiple virtual nodes can be assigned to a single physical node, nodes are skipped while building the preference list to avoid storing replicas of the same data on one physical node. If any node in the preference list is unavailable, another node is selected to store the replica, along with metadata hinting at the node to which the data belongs. When the target node becomes available again, the data is delivered to it. A node storing such hinted-handoff data can itself fail before the data is replicated to the target node, leaving the replicas inconsistent. To detect inconsistencies and recover quickly, a Merkle tree over the data stored on each node is maintained and compared regularly; when a difference is found, data is replicated in the background to bring the replicas back in sync.
Dynamo uses vector clocks, which are lists of (node, counter) pairs, to capture the causality between versions of the same object. When multiple versions of data are retrieved during a read, if the versions are causally ordered, i.e. the clock of the last version contains all the nodes of the others with counters at least as large, then all the older versions can be forgotten. But if there are versions with no causal dependency, i.e. each contains a (node, counter) entry the other lacks, they need to be reconciled. Reconciliation can be done at the client using business logic, at the storage level with last-write-wins using physical timestamps, or by setting the read quorum to 1 and the write quorum to N, which ensures that all replicas have the same version.
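A minimal sketch of the causality check (not Dynamo's implementation):

```java
import java.util.Map;

final class VectorClocks {
    // True if version b causally descends from version a: b has seen every
    // (node, counter) event recorded in a.
    static boolean descends(Map<String, Long> b, Map<String, Long> a) {
        for (Map.Entry<String, Long> e : a.entrySet()) {
            if (b.getOrDefault(e.getKey(), 0L) < e.getValue()) return false;
        }
        return true;
    }
    // If neither descends(a, b) nor descends(b, a) holds, the versions
    // conflict and must be reconciled (e.g. by the client).
}
```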
To connect to a non-secure HBase cluster, the Phoenix JDBC connection string is of the form jdbc:phoenix:<ZK-QUORUM>:<ZK-PORT>:<ZK-HBASE-NODE>. The following is a code snippet to get a Phoenix JDBC connection object for a non-secure HBase cluster.
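A sketch; the ZooKeeper quorum, port and znode are placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class PhoenixConnect {
    public static Connection connect() throws SQLException {
        // quorum, port and znode below are placeholders
        return DriverManager.getConnection("jdbc:phoenix:zk1,zk2,zk3:2181:/hbase");
    }
}
```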
To connect to a secure HBase cluster using a Kerberos user principal and keytab, the Phoenix JDBC connection string should be of the form jdbc:phoenix:<ZK-QUORUM>:<ZK-PORT>:<ZK-HBASE-NODE>:principal_name@REALM:/path/to/keytab. The following is a code snippet to get a Phoenix JDBC connection object for a secure HBase cluster.
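A sketch mirroring the previous snippet; the principal and keytab path are placeholders:

```java
// inside the same kind of helper as above
Connection conn = DriverManager.getConnection(
    "jdbc:phoenix:zk1,zk2,zk3:2181:/hbase-secure"
    + ":hbaseuser@EXAMPLE.COM:/etc/security/keytabs/hbaseuser.keytab");
```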
If a Kerberos principal and keytab are not used to connect to a secured HBase cluster, then the user running the code should be defined in the Kerberos KDC and should have a valid TGT. The user can verify whether they are in the correct KDC and have a valid TGT by running the klist command. One key item to note is that, to access a secure HBase cluster, the hbase-site.xml and core-site.xml of the target HBase cluster should be available in the classpath of the application.
- Set the HBASE_OFFHEAPSIZE environment variable to the total off-heap memory size. One way to do this is to set it in hbase-env.sh:
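A sketch; the size is illustrative (a 4 GB bucket cache plus x GB of headroom):

```sh
# in hbase-env.sh: bucket cache size plus x GB for the HDFS client
export HBASE_OFFHEAPSIZE=5G
```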
Note: the "G" is for gigabytes. The value of x should be 1 to 2 GB, at the higher end for clusters handling a high volume of transactions.
- The other option to configure the total off-heap memory size is to set the -XX:MaxDirectMemorySize JVM property. Again, this value can be set in hbase-env.sh:
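A sketch, sized the same way as above; the variable name assumes the region server JVM:

```sh
# in hbase-env.sh
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -XX:MaxDirectMemorySize=5G"
```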
If you are wondering what the extra x GB is for, it is the Java direct memory used by the HDFS client, which HBase uses to interact with the underlying HDFS filesystem.
- Set the HBase property hbase.bucketcache.combinedcache.enabled to true so that the on-heap cache is used for index and Bloom filter blocks while the bucket cache holds the data blocks.
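A sketch of the hbase-site.xml entry:

```xml
<property>
  <name>hbase.bucketcache.combinedcache.enabled</name>
  <value>true</value>
</property>
```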
- Set the HBase property hbase.bucketcache.ioengine to offheap.
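A sketch of the hbase-site.xml entry:

```xml
<property>
  <name>hbase.bucketcache.ioengine</name>
  <value>offheap</value>
</property>
```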
- Set the HBase property hbase.bucketcache.size to the amount of memory allocated for the bucket cache, in MB.
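A sketch of the hbase-site.xml entry; 4096 MB matches the sizing example above:

```xml
<property>
  <name>hbase.bucketcache.size</name>
  <value>4096</value>
</property>
```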
While this view is a natural extension of object-oriented design, it doesn't take into account the important issues in distributed systems, namely latency, memory access, partial failure, and concurrency.
Historically there has been a desire to merge the programming and computational models of local and remote computing; communications protocol development, for example, has followed a similar trajectory.
These two approaches can be reconciled in distributed object-oriented systems by accepting the fact that there are irreconcilable differences between local and distributed computing.
Reliable Data Transmission
Lower layers can implement functions which improve the performance of the applications using them, as in the case of reliable data transmission.
Decisions to include functions in lower layers
In order to make these decisions, i.e. whether to include a function in the lower layers or let the application handle it end to end, the application's requirements of what needs to be accomplished must be well understood.
Scaling of all the components can be improved by replication, distribution and caching.
Key points to remember while building scalable systems
Evaluating distributed systems
Non-Kerberized cluster: create the following files, with the content shown, in a local directory. In this example the files are created under hbasecompact in the user's local home directory; please note that the files live under different subdirectories. First, a shell script to start HBase compaction on a table and copy the output of the command to an HDFS directory, where it can be checked for any issues.
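A hypothetical sketch of such a script; table name and HDFS paths are illustrative:

```sh
#!/bin/bash
# Run a major compaction via the HBase shell and keep the output in HDFS.
TABLE=$1
OUT=/tmp/hbaseCompact_$$.out
echo "major_compact '$TABLE'" | hbase shell > "$OUT" 2>&1
hdfs dfs -put -f "$OUT" /user/userid/hbasecompact/logs/hbaseCompact.out
```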
Next, a shell script to copy the Oozie job id into an HDFS directory; the job id can be used to check for any issues.
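A hypothetical sketch; the workflow passes its own id as the first argument:

```sh
#!/bin/bash
# Persist the Oozie workflow id (passed as $1) to HDFS.
echo "$1" > /tmp/oozieJobId_$$.out
hdfs dfs -put -f /tmp/oozieJobId_$$.out /user/userid/hbasecompact/logs/oozieJobId.out
```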
Next, the Oozie workflow definition to perform HBase compaction. The first step, major_compact, runs the script hbaseCompact.sh; the next step, logOozieId, runs the script logOozieId.sh to copy the Oozie workflow id onto HDFS.
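A condensed, hypothetical sketch of such a workflow using Oozie shell actions; the parameter names are illustrative and are substituted from the properties file shown later:

```xml
<workflow-app name="major_compact_wf" xmlns="uri:oozie:workflow:0.4">
  <start to="major_compact"/>
  <action name="major_compact">
    <shell xmlns="uri:oozie:shell-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <exec>hbaseCompact.sh</exec>
      <argument>${tableName}</argument>
      <file>${scriptDir}/hbaseCompact.sh#hbaseCompact.sh</file>
    </shell>
    <ok to="logOozieId"/>
    <error to="fail"/>
  </action>
  <action name="logOozieId">
    <shell xmlns="uri:oozie:shell-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <exec>logOozieId.sh</exec>
      <argument>${wf:id()}</argument>
      <file>${scriptDir}/logOozieId.sh#logOozieId.sh</file>
    </shell>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail"><message>HBase compaction workflow failed</message></kill>
  <end name="end"/>
</workflow-app>
```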
Oozie coordinator.xml definition to run the major_compact_wf workflow defined in the previous step.
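A hypothetical sketch; the frequency, start/end times and path parameters are illustrative:

```xml
<coordinator-app name="major_compact_coord" frequency="${coord:days(1)}"
                 start="${startTime}" end="${endTime}" timezone="America/New_York"
                 xmlns="uri:oozie:coordinator:0.4">
  <action>
    <workflow>
      <app-path>${workflowAppPath}</app-path>
    </workflow>
  </action>
</coordinator-app>
```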
Note: do not use 1440 minutes as the frequency in coordinator.xml if the expectation is to run compaction every day at a certain time, since the job run time will shift when the system time changes for daylight saving. The start and end times should be specified in UTC/GMT, and the timezone attribute is required for Oozie to invoke the logic that handles the time changes due to daylight saving.

Finally, the properties which need to be substituted for the parameters defined in the workflow and coordinator XML files. If this example is used, this is the only file which needs to be changed; inline comments will help in making the changes.
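A hypothetical sketch of coordinator.properties; hosts, paths and times are placeholders consistent with the workflow and coordinator sketches above:

```
nameNode=hdfs://namenode-host:8020
jobTracker=resourcemanager-host:8032
oozie.coord.application.path=${nameNode}/user/userid/hbasecompact/coordinator
workflowAppPath=${nameNode}/user/userid/hbasecompact/workflow
scriptDir=${nameNode}/user/userid/hbasecompact
tableName=healthy
startTime=2015-03-01T06:00Z
endTime=2099-12-31T06:00Z
```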
If you list the local directory where these files are stored, you should see the two shell scripts along with the workflow, coordinator and properties files.
Copy them into an HDFS directory:
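A sketch; source and target paths are placeholders:

```sh
hdfs dfs -put ~/hbasecompact /user/userid/hbasecompact
```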
If you list the HDFS directory used as the target location in the previous step, the same set of files should be present.
If the execute permissions on the two shell scripts are not set, do a chmod to set them:
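A sketch, assuming the scripts live at the HDFS paths used above:

```sh
hdfs dfs -chmod 755 /user/userid/hbasecompact/hbaseCompact.sh
hdfs dfs -chmod 755 /user/userid/hbasecompact/logOozieId.sh
```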
Schedule the Oozie job using the following command. Note that the properties file, in this example /home/userid/hbasecompact/coordinator/coordinator.properties, should be on the local disk of the machine from which this command is executed.
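A sketch; the Oozie host is a placeholder:

```sh
oozie job -oozie http://oozie-host:11000/oozie \
    -config /home/userid/hbasecompact/coordinator/coordinator.properties -run
```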
The status of the submitted job can be verified through the Oozie UI, normally accessible at http://oozie-host:11000/oozie/. The status of the MapReduce jobs can be viewed through the YARN RM/NM UIs. The id of the last executed Oozie workflow, and the output of the HBase major_compact command from the last run, can be found in the HDFS files created by the job.
Set the hbase.replication property to true in hbase-site.xml of the HBase cluster from which data needs to be replicated. This cluster is referred to as the master going forward. By default the value of this property is "true".
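A sketch of the hbase-site.xml entry:

```xml
<property>
  <name>hbase.replication</name>
  <value>true</value>
</property>
```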
Create an HBase replication peer in the master HBase cluster using the ZooKeeper quorum of the cluster to which data needs to be replicated. That cluster is referred to as the slave going forward.
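A sketch from the HBase shell on the master; the slave's ZooKeeper quorum is a placeholder:

```
add_peer '1', "slave-zk1,slave-zk2,slave-zk3:2181:/hbase"
list_peers
```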
Once the replication peer is created and enabled, replication needs to be enabled on the HBase tables whose data is to be replicated from the master cluster, by setting the table's "REPLICATION_SCOPE" attribute to a non-zero value (the default is "0"). For an existing table, setting "REPLICATION_SCOPE" requires disabling and re-enabling the table; the following is an example of the steps for an existing table named "healthy". Note that a table with the same definition as the table being replicated (in this case "healthy") should be created in the slave cluster before replication is enabled on the master.
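A sketch; the column family name is hypothetical:

```
disable 'healthy'
alter 'healthy', {NAME => 'cf1', REPLICATION_SCOPE => '1'}
enable 'healthy'
```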
Once replication is enabled, replication-related JMX statistics become available on all the region servers in the master cluster hosting regions for which data replication is enabled.
Note that these statistics are from HBase version 0.98; later versions may have additional statistics which can help with replication monitoring.
Create a snapshot of an HBase table:
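A sketch; table and snapshot names are hypothetical:

```
snapshot 'healthy', 'healthy-20150131-1430'
```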
For easier identification, it is a good practice to include the table name, creation date and creation time in the snapshot name.
Restore data from snapshot
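A sketch, continuing the names above:

```
disable 'healthy'
restore_snapshot 'healthy-20150131-1430'
enable 'healthy'
```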
Note that the table needs to be disabled to restore its data from a snapshot. Also note that any updates made to the table after the snapshot will be lost once the restoration is complete.
Clone table from snapshot
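A sketch; the new table name is hypothetical:

```
clone_snapshot 'healthy-20150131-1430', 'healthy_clone'
```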
A new table is created with the attributes of the original table from which the snapshot was taken, and the data as of the point in time of the snapshot is restored into it.
List all the available snapshots for a table. If multiple snapshots were taken, the list of available snapshots can be viewed as follows:
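A sketch; the table-name regex is hypothetical:

```
list_snapshots 'healthy.*'
```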
Creating connections to a server component from an application is a heavyweight operation, and this is even more pronounced when connecting to a database server. That is the reason database connection pooling is used to reuse connection objects, and HBase is no exception. In HBase, data from the meta table, which stores details about the region servers that can serve data for specific key ranges, gets cached at the individual connection level, making HBase connections even heavier. So if regions move for balancing, or if a region server fails, the metadata needs to be refreshed for each connection object, which is a performance overhead. For these reasons, applications should try to reuse the connection objects they create.
The following code snippet shows how to create an HBase connection object in a Java application.
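A sketch using the HBase 0.98-era client API; the table name is hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HConnection;
import org.apache.hadoop.hbase.client.HConnectionManager;
import org.apache.hadoop.hbase.client.HTableInterface;

public class HBaseConnectionExample {
    public static void main(String[] args) throws Exception {
        // One shared, heavyweight connection for the whole application
        Configuration conf = HBaseConfiguration.create();
        HConnection connection = HConnectionManager.createConnection(conf);

        // A lightweight table handle, typically created per thread
        HTableInterface table = connection.getTable(TableName.valueOf("healthy"));
        try {
            // puts / gets / scans here
        } finally {
            table.close();   // close the table handle, not the shared connection
        }
    }
}
```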
If the application is multi-threaded, it should reuse the connection object for any data manipulation operations on tables. This can be achieved by individual threads creating HTable objects using the getTable(TableName) method of the HConnection object. Once the data manipulation operations are complete, each thread should close its HTable, but not the HConnection object, so that the connection can be reused by other threads.
In order to prevent skew in query processing and to distribute the query processing workload across all the nodes in the cluster, it is a good practice to create tables pre-split. The key is to identify split points such that the data will be distributed across all the nodes in the cluster. Once the split points are identified, the table can be created pre-split using the HBase shell; the following is an example of a table with 3 split points.
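A sketch; table name, column family and split points are hypothetical:

```
create 'healthy', 'cf1', SPLITS => ['g', 'n', 'u']
```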
At the start of development, when the split points in the data are not yet clear but you still want to pre-split the table, HBase provides a utility program which can split the table and uniformly distribute the data. The following is an example which creates a table with 10 splits and column family 'cf1'.
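A sketch; the table name is hypothetical:

```sh
hbase org.apache.hadoop.hbase.util.RegionSplitter healthyTable HexStringSplit -c 10 -f cf1
```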
If you are creating tables programmatically using the Java APIs, the following code snippet shows how to pre-split the table during creation.
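A sketch using the HBase 0.98-era admin API; table, family and split points are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("healthy"));
        desc.addFamily(new HColumnDescriptor("cf1"));

        // the same three split points as the shell example
        byte[][] splits = {
            Bytes.toBytes("g"), Bytes.toBytes("n"), Bytes.toBytes("u")
        };
        admin.createTable(desc, splits);
        admin.close();
    }
}
```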
For further reading on the details of HBase table splitting and merging, refer to this blog post.
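The HFiles for bulk loading are generated by a MapReduce driver; the following is a hedged sketch of such a driver, matching the description below. The class name and ZooKeeper quorum are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HFileDriver {
    public static void main(String[] args) throws Exception {
        String tableName = args[0];       // target table
        Path input = new Path(args[1]);   // source data on HDFS
        Path output = new Path(args[2]);  // where HFiles are written

        Configuration conf = HBaseConfiguration.create();
        // ZK quorum of the cluster hosting the target table (placeholder)
        conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3");

        Job job = Job.getInstance(conf, "hfile-generator");
        job.setJarByClass(HFileDriver.class);
        job.setMapperClass(HFileMapper.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);

        FileInputFormat.addInputPath(job, input);
        FileOutputFormat.setOutputPath(job, output);

        // Wires in HFileOutputFormat plus the partitioner and reducer,
        // so no reducer needs to be set explicitly.
        HTable table = new HTable(conf, tableName);
        HFileOutputFormat.configureIncrementalLoad(job, table);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```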
A few things to note about the driver:

- It takes three parameters: the table name, the HDFS directory where the source data is stored, and the HDFS output directory where HFiles need to be created for loading into HBase.
- It sets the map output value class to org.apache.hadoop.hbase.client.Put, which represents a single row in an HBase table.
- The input format is set to Text to read the source data from a text file.
- In the configuration object, the only parameter which needs to be set is the ZooKeeper (ZK) quorum, and the value should be the ZK quorum of the HBase cluster on which the target table is defined.
- No reducers need to be set to create HFiles using MapReduce.

The following is the code snippet for the HFileMapper class used by the driver program.
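A hedged sketch of such a mapper; the row-key scheme, column family and qualifier are invented for illustration:

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class HFileMapper
        extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Dummy logic: use the line offset as the row key and store the
        // whole line in one column; adapt to the real source data layout.
        byte[] rowKey = Bytes.toBytes(String.format("row-%012d", offset.get()));
        Put put = new Put(rowKey);
        put.add(Bytes.toBytes("cf1"), Bytes.toBytes("col1"),
                Bytes.toBytes(line.toString()));
        context.write(new ImmutableBytesWritable(rowKey), put);
    }
}
```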
Note that this is a dummy mapper in which the keys and values are generated in the code. Based on what the source data file stores and what needs to be stored in the HBase table, the mapper needs to be modified; the key aspect to note is how the Put object is created. Once the driver and mapper code is compiled, packaged in a Java jar file (e.g. happy-hbase-sample.jar) and made available on all the nodes in the HBase/HDFS cluster, the HFiles can be generated by running the MapReduce job on the cluster:
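A sketch of the job submission; the driver class name and paths are placeholders, while the jar name comes from the text above:

```sh
hadoop jar happy-hbase-sample.jar HFileDriver healthyTable /data/source /data/hfiles
```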
When the MapReduce job completes, it creates a set of HFiles in the output directory on HDFS, which can then be used to load data into the target HBase table, in this case healthyTable:
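A sketch using the standard bulk-load tool, with the same paths as the run above:

```sh
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /data/hfiles healthyTable
```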
Further details on a particular command can be found using help 'command-of-interest', for example help 'create'.
Note that by default a table is created in the default namespace if no namespace is specified during table creation. If a namespace is specified, it should already exist; it can be created using the create_namespace command. To view the properties of a table, the describe command can be used.
The describe output has two columns, DESCRIPTION and ENABLED; the value true displayed on the console is for the ENABLED column, which tells the user whether the table is enabled. To see whether a table is defined in the cluster, use the list command, which lists all the tables in the HBase cluster.
To drop a table, it needs to be disabled first using the disable command before dropping it with the drop command. If drop is attempted before the disable, the shell prompts with a message that the table is enabled.
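A sketch; the table name is hypothetical:

```
disable 'healthy'
drop 'healthy'      # drop fails with a prompt if the table is still enabled
```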
Note that changes to some of the attributes require further actions. For example, changing BLOCKSIZE requires a major compaction of the table for the change to take effect immediately, while modifying the REGION_REPLICATION property requires the table to be disabled before altering and then re-enabled. For basic data manipulation, the put, get, scan and delete commands can be used. The following puts two rows into a table, then does a get, a scan and a delete.
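A sketch; table, column family and values are hypothetical:

```
put 'healthy', 'row1', 'cf1:col1', 'value1'
put 'healthy', 'row2', 'cf1:col1', 'value2'
get 'healthy', 'row1'
scan 'healthy'
delete 'healthy', 'row2', 'cf1:col1'
```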
During development, if minor compaction needs to be performed explicitly on table regions, the compact command can be used. Note: since the compaction process has a performance impact, use caution about when you invoke compaction in a production environment.
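For example (table name hypothetical):

```
compact 'healthy'
```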
Also, to improve data locality and performance, a major compaction of a table may have to be run; the following is an example. Again, major compaction is a resource-intensive (CPU, IO, memory and network) process and should not be run in production without proper scheduling to minimize business impact.
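For example (table name hypothetical):

```
major_compact 'healthy'
```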
Balancing the region distribution of a table may help improve performance, and it is a best practice to balance the regions before running major compaction explicitly. Be cautious about when you run the balancer in a production environment, since it will impact performance.
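For example:

```
balancer
```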
During development there may be a need to disable the HBase balancer so that regions can be moved manually. Enabling and disabling the balancer can be accomplished using the balance_switch command, which takes true|false as input.
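For example:

```
balance_switch false   # returns the previous setting, e.g. true
```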
Note that the balance_switch command returns the previous value of whether the balancer was enabled. In the previous example the balancer was enabled, hence the return value of true.
To check the status of the HBase cluster, the status command can be used. The command generates detailed, simple or summary status based on whether the detailed|simple|summary parameter is passed.
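For example:

```
status 'summary'
status 'simple'
status 'detailed'
```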
The GC process typically gets invoked when the amount of free memory in the JVM falls below a certain threshold. At a very high level, the GC process involves identifying objects which are no longer used, i.e. no longer referenced, releasing their memory, and compacting memory to reduce fragmentation. Readers interested in the details of the GC process can find them here. As one can imagine, the time it takes to complete the GC process increases with the size of the Java heap, since it takes more time to identify the objects which can be released and to perform compaction.
If a Java application requires large memory (in GBs), the time it takes to complete the GC process can be detrimental to its performance. If the application is performance-sensitive, a large heap size can adversely impact it. To mitigate this, one can use memory outside the Java heap and hence reduce the Java heap's use and size. This can be done using the Java ByteBuffer class, which provides the option to allocate buffers outside the JVM heap using the allocateDirect() method.
The allocateDirect method allocates memory of the requested size (in bytes) outside the JVM heap (off-heap) and returns an object reference to the application with a starting offset of 0. The application can then use the reference to store and retrieve data in the off-heap memory. When garbage collection runs, it doesn't have to scan the off-heap allocations to identify unused memory or perform compaction, which in turn reduces the time to complete GC.
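A minimal sketch of the pattern:

```java
import java.nio.ByteBuffer;

public class OffHeapDemo {
    public static void main(String[] args) {
        // Allocate 1 MB outside the JVM heap; GC does not scan this region.
        ByteBuffer buf = ByteBuffer.allocateDirect(1024 * 1024);

        buf.putLong(0, 42L);          // absolute write at offset 0
        long value = buf.getLong(0);  // read it back

        System.out.println("read back: " + value);
    }
}
```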
While the time to complete GC can be reduced for a Java process using large memory by moving data off-heap, there are other overheads to take into consideration. Allocating off-heap memory takes more time than on-heap allocation, since the JVM needs to make native calls to get the memory allocated. Also, when the off-heap memory is no longer used by the application, the JVM must make native calls during GC to free it, in addition to releasing the on-heap memory used by the object reference. And, as per the API documentation, the JVM only makes a best effort to avoid using on-heap memory as an intermediate step when storing and retrieving data to/from off-heap memory. To compensate for these additional overheads while still taking advantage of large memory without the penalty of increased GC time, it is best to use off-heap memory for large objects which don't get released often.
When a JVM is brought up to run a Java process, the total memory which can be used for off-heap allocations can be specified using the -XX:MaxDirectMemorySize JVM parameter. If the parameter is not set explicitly, the value is set to the memory available in the system at the start of the process, using the VM.maxDirectMemory() method call. As off-heap allocations are made, the JVM keeps track of the total memory used so far. When a new off-heap allocation request arrives, the JVM checks whether the sum of the requested size and the total allocated so far exceeds the direct memory limit set at the start of the process. If it does, the memory allocator makes an explicit GC call, and the requesting thread then sleeps for 100 ms to let the GC complete. After 100 ms, the allocator checks again whether there is enough space to satisfy the new allocation request before raising an out-of-memory exception.
A few things are worth noting about this allocation process.
First, for performance-sensitive applications, the explicit GC and the non-tunable sleep time in the allocation logic when memory runs short can be a large overhead.
The second item to note is that a GC call only means the JVM will make a best effort to schedule one; it doesn't guarantee that one runs immediately. So there can be situations where the Java process fails with an OOM error even though enough memory could have been freed to accommodate the new allocation request, simply because a GC was not run in time.
Third, the 100 ms sleep may not be sufficient in certain situations for the GC to complete and release unused memory to satisfy the new allocation request. If anyone assumes 100 ms is always more than sufficient for a GC to complete, we came across a situation where allocating 1 GB chunks of off-heap space in a simple for loop failed with OOM on Ubuntu 12.04 LTS, while the same code ran fine on a Red Hat Linux machine with relatively less powerful hardware. With the current API this sleep time can't be adjusted, so the application may have to perform additional sleeps of its own to make sure memory is actually available.
The last item of interest is the total memory available for off-heap allocation. This value is set at the start of the process, either manually or by the VM. When set manually, the JVM doesn't verify whether that much free memory is actually available on the system. Even when the value is set automatically by the JVM, the memory available on the system can shrink during execution as other processes' usage changes, which can cause the Java process to fail with an unexpected memory-allocation exception. So it is important to make sure that memory of the size set in -XX:MaxDirectMemorySize remains available to the Java process, so that such failures don't happen.
There are a few things the JVM could do to prevent allocation-related exceptions, all of which would require changes to the JVM code.
Verify that the system's free memory is greater than or equal to the configured limit when the JVM is brought up, and also during process execution. This requires native system calls and may have a pronounced impact on the performance of the Java process; one way to mitigate this is to provide a JVM option for users who want this strict checking.
Instead of invoking a GC call only when all the direct memory is used, provide a configurable threshold of direct memory usage at which the GC call is made. This should be a fairly simple change to the JVM code.
Calculate the sleep time after the GC call taking into account all the factors which affect GC time. This would be complex, and would matter less if the previous suggestion were implemented in the JDK code.
To de-risk scenarios like these, the solution doesn't have to be complex. It can be a matter of following a simple process like the following across the enterprise.
Some may say that this is too simplistic, and they may be correct. But here is something to think about: have you wondered about the usefulness of a lock when roadside assistance can come and open your car once you lose your keys? Locks are not for the small percentage of the population who will always find a way to break them; they are there to act as a barrier to the temptations of the vast majority of us. So even if this approach is simplistic, it still acts as a barrier in a situation where there is nothing in place. If there are other solutions you are comfortable with, even better, but please secure all applications. It can save someone's life savings, identity, medical records or other critical data, and that someone can be your friend, family, neighbor or even yourself! With so much at stake on data, any vulnerability to a breach is not an option anymore.
Note: a simple utility to create such a barrier for Java-based applications is available here.
Chef provides two options for users to create their own resources: LWRP and HWRP. Similar to an LWRP, an HWRP requires a resource definition and a corresponding provider. The key difference is that there is no DSL in an HWRP as there is in an LWRP; everything is coded in Ruby. So, taking the same example of an HDFS directory resource used in the notes on LWRP, the following is the skeleton of the resource definition.
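A hypothetical skeleton matching the description below; attributes other than path are invented:

```ruby
class Chef
  class Resource
    class HdfsDir < Chef::Resource
      provides :hdfsdir

      def initialize(name, run_context = nil)
        super
        @resource_name = :hdfsdir             # name used in recipes
        @allowed_actions = [:create, :delete]
        @action = :create                     # default action
      end

      # path falls back to the resource's name
      def path(arg = nil)
        set_or_return(:path, arg, :kind_of => String, :name_attribute => true)
      end

      def user(arg = nil)
        set_or_return(:user, arg, :kind_of => String)
      end

      def mode(arg = nil)
        set_or_return(:mode, arg, :kind_of => String, :default => '0755')
      end
    end
  end
end
```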
The HWRP is a Ruby class, in this case HdfsDir, which is a subclass of the Chef::Resource class. The provides method specifies the resource provider for this resource, in this case hdfsdir.
As in any Ruby class, the initialize method is used to perform initializations, like setting the initial values of variables. In this case, the pre-defined instance variable resource_name is set to a name which can be used to create a resource block in recipes using this HWRP. An array of symbols specifying the actions supported by this HWRP is assigned to the instance variable allowed_actions. A default action, taken when an action is not set while creating a resource using this HWRP (in this case create), is assigned to the instance variable action.
The remaining section in the skeleton defines the characteristics of all the attributes of this resource, similar to the attribute definitions in an LWRP. The key difference is that they are all defined as Ruby methods, and set_or_return is similar to Ruby's attr_accessor method, which creates the getters and setters for the attributes.
Unlike an LWRP, the HWRP resource and provider code is stored in files under the libraries directory of the cookbook. Also, there are no strict rules about file naming conventions, since these are Ruby classes and they get loaded first during the Chef client run.
Now let's turn to the corresponding provider definition; the following is the skeleton. It is more or less similar to the LWRP provider code we saw earlier, with some differences.
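A hypothetical skeleton matching the description below; the dir_exists?, create_dir and delete_dir helpers are invented:

```ruby
class Chef
  class Provider
    class HdfsDir < Chef::Provider
      provides :hdfsdir

      def whyrun_supported?
        true
      end

      def load_current_resource
        # check the current state of the resource (hypothetical helper)
        @dir_exists = dir_exists?(new_resource.path)
      end

      def action_create
        if @dir_exists
          Chef::Log.info("#{new_resource.path} already exists - nothing to do")
        else
          converge_by("create HDFS directory #{new_resource.path}") do
            create_dir(new_resource.path)   # hypothetical helper
            new_resource.updated_by_last_action(true)
          end
        end
      end

      def action_delete
        if @dir_exists
          converge_by("delete HDFS directory #{new_resource.path}") do
            delete_dir(new_resource.path)   # hypothetical helper
            new_resource.updated_by_last_action(true)
          end
        end
      end
    end
  end
end
```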
As with the resource definition, the provider is also a Ruby class, a subclass of the Chef::Provider class. The method whyrun_supported? specifies whether the resource supports chef client runs with the why-run option. If this method is set to return true, then the strings provided in the converge_by statement of the action requested in the recipe are logged instead of performing the actual convergence of the resource.
The load_current_resource method needs to be overwritten in an HWRP, while it is optional in an LWRP. As discussed in the LWRP note, this method can be used to check the current state of the resource.
The methods for the supported actions are defined using the naming convention action_name; for example, for the create action the method name is action_create. Supporting methods can be defined as in any Ruby class, for example a private validation helper.
The method new_resource.updated_by_last_action is called with a value of true so that Chef is notified that the resource got updated by that particular action.
More notes in this category can be found here.
Chef provides a large set of resources to work with. But there are situations where the resources provided by Chef may not be sufficient. For example, distributed file systems can't be handled by the file-system related resources (file, directory etc.) which come out of the box with Chef. Being flexible and customizable, Chef provides two options (LWRP, HWRP) for users to create their own resources.
Light Weight Resource Providers (LWRP) use a DSL to simplify the creation of resources and are used when existing Chef resources can be leveraged with minimal Ruby code. In contrast, Heavy Weight Resource Providers (HWRP) are used when existing resources can't be leveraged and Ruby code needs to be used to implement the resource provider.
Let's quickly look at an LWRP using an HDFS (a distributed file system) directory resource as an example. An LWRP is created and stored in a cookbook, and there are two parts to it. First is the resource definition, which defines the actions supported and attributes accepted by the LWRP; it resides in the resources directory of the cookbook. The second part is the provider, the code which implements the actions supported by the LWRP; it is stored in the providers directory of the cookbook in which the LWRP is being created.
For chef to be able to identify the new resource, the resource definition file and the provider file need to have the same name. For example, assuming the HDFS directory resource is created in the hdfs cookbook and the resource is named hdfsdir, the following is the cookbook directory structure (showing only the required directories).
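A sketch of the layout:

```
hdfs
├── providers
│   └── hdfsdir.rb
└── resources
    └── hdfsdir.rb
```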
When the resource needs to be used in a recipe, it must be prefixed with the cookbook name separated by an "_" (underscore). For example:
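A sketch; the path and attributes are hypothetical:

```ruby
hdfs_hdfsdir '/data/incoming' do
  user 'hdfs'
  action :create
end
```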
Let's look at the LWRP resource definition for the HDFS directory resource.
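A hypothetical skeleton matching the description below; attributes other than path are invented:

```ruby
# resources/hdfsdir.rb
actions :create, :delete
default_action :create

# path falls back to the resource's name
attribute :path, :kind_of => String, :name_attribute => true
attribute :user, :kind_of => String
attribute :mode, :kind_of => String, :default => '0755'
```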
- actions defines the actions supported by the new resource.
- default_action defines the action taken when the resource is used in a recipe and no action clause is specified.
- attribute defines each attribute which can be set when using the resource, including whether the attribute is required and its type.
- One attribute can take the name value of the resource if it is not set explicitly in the recipe; in this case the attribute path takes the name value, which is specified using :name_attribute => true. The complete definition can be found here.
With the resource definition out of the way, let's look at the provider for the hdfs directory resource. The following is the code skeleton for the provider; the full code can be found here.
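A hypothetical skeleton matching the description below, assuming the webhdfs gem and a WebHDFS client; the namenode host/port are placeholders:

```ruby
# providers/hdfsdir.rb
require 'webhdfs'

use_inline_resources

def whyrun_supported?
  true
end

def load_current_resource
  # record whether the directory already exists so actions stay idempotent
  @exists = begin
    client.stat(new_resource.path)
    true
  rescue WebHDFS::FileNotFoundError
    false
  end
end

action :create do
  if @exists
    Chef::Log.info("#{new_resource.path} already exists - nothing to do")
  else
    converge_by("create HDFS directory #{new_resource.path}") do
      client.mkdir(new_resource.path, :permission => new_resource.mode)
      new_resource.updated_by_last_action(true)
    end
  end
end

def client
  @client ||= WebHDFS::Client.new('namenode.example.com', 50070) # placeholders
end
```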
The require method is used to include external files, as in any Ruby code. Note that the webhdfs gem needs to be installed for this code to work.
use_inline_resources is a must for an LWRP. The reason is to make sure that any notifications raised from any of the resources within the LWRP (remember an LWRP can leverage other Chef resources to implement its functionality) are treated as being raised by the LWRP resource as a whole, and not by the individual resource within the LWRP which raises the notification.
The new_resource instance object is automatically created when the resource is used, and the attribute values in the new object are set to the values passed from the recipe using the resource.
When creating resources, one of the key requirements is to make sure the resource is idempotent. This requires the current state of the resource to be known. For this, chef provides an empty method, load_current_resource, which can be overwritten by the resource provider. Since this method is the first to be called when the resource is used in a recipe, it can be used to check the current state of the resource. For example, if the directory already exists and the resource action is create, the provider can skip the requested action since the directory is up to date. For anyone interested in more details, look into the code for the LWRPBase class, which is the parent of the LWRP provider.
The remaining sections in the code skeleton implement the actions supported by the resource. They can use existing chef resources and/or Ruby code. If you had a chance to look at the complete code for the hdfsdir provider, then contrary to how an LWRP is meant to be implemented, the code doesn't use any existing chef resource, since there is none for a distributed file system like HDFS, and all the actions had to be implemented in Ruby. But it is still helpful for understanding the various aspects of writing an LWRP.
The whyrun_supported? method is used to enable or disable support for the --why-run option of chef-client by setting the return value to true or false.
When whyrun_supported? is set to true and a chef-client run uses the --why-run option, the string passed to the converge_by clause is logged instead of performing the actual convergence.
When an action is taken on a resource, new_resource.updated_by_last_action(true) is used to notify chef that the resource was updated by the requested action.
Finally, note that you can use the hdfs LWRP code from this example if you are dealing with HDFS, by renaming the files and copying them into the resources and providers directories of your cookbook.
More notes on this category can be found here.