The storage system exposes an API which a database instance uses to persist redo log entries, passing in the LSN for each change. Since this is network I/O, the database instance can issue the request to persist log entries as soon as changes are made. This is in contrast to traditional databases, where log entries are buffered into groups before being persisted, since persistence involves disk I/O. The storage system persists the log entry to a "hot log" and acknowledges the database instance. This is the only write request made from the database instance to the storage system, which reduces the number of I/Os per transaction by a factor of ~7.

Data is stored in 10 GB chunks called segments and replicated six ways, with two copies in each of three availability zones (AZs), for availability and recovery. The logical group of six segments is called a protection group (PG). Storage segments are distributed across storage nodes, which are EC2 instances in AWS. A chain of PGs constitutes a database volume, which has a one-to-one relationship with a database instance; the volume can grow with the data by adding new PGs. The database instance maintains metadata about the segments in DynamoDB, the AWS key-value store: the protection group to which each segment belongs, the storage nodes responsible for each segment, and the data pages and log offsets stored in each segment.
The database instance sends the log write request to all six segments, and the write is considered successful once four of the six storage nodes acknowledge it. The storage system can identify issues with any storage node or segment; when one is found, it adds a new segment to the members of the existing protection group to replace the problematic one. The database instance can then decide which segment to drop, based on how quickly the issue with the old segment is resolved. Since the write is coordinated by the database instance, there is no need for consensus protocols in the storage layer, which reduces complexity.
In the background, the storage system coalesces log entries, creates and updates data pages, garbage-collects unwanted data pages, performs consistency checks on pages, and backs up data pages to external storage like S3 for recovery, relieving the database instance of these operations. This also allows multiple database instances to be attached to the same storage volume. When there are multiple read instances attached to the same storage volume, the write instance, along with sending log write requests to the storage system, also sends the log entries to the read instances so that they can keep their buffer caches current. Multiple write instances are enabled by pre-allocating ranges of LSNs to each instance so that there are no conflicts between updates made through the various instances. If transactions from two write instances update the same rows in a table, the transaction whose four-of-six quorum write completes first gets committed, and the other transaction fails.
Aurora database instances maintain various consistency points, enabled by the monotonically increasing LSN, which makes distributed commits and recovery simpler. The segment complete LSN (SCL) is the low watermark below which all log records for a segment have been received. When there are holes in the log records, storage nodes gossip with the other nodes in the segment's PG to fill them. The protection group complete LSN (PGCL) tracks the point to which four of the six segment SCLs have advanced in a PG. The volume complete LSN (VCL) tracks the LSN to which the PGCLs have advanced across all PGs. During recovery, the storage system truncates all log records with an LSN larger than the VCL.
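A minimal sketch of how these watermarks could be computed, assuming (as described above) that the PGCL is the highest LSN that at least four of a PG's six segments have fully received, and that the VCL is the minimum PGCL across all PGs; the helpers are mine, not Aurora code:

```java
import java.util.Arrays;

// Hypothetical helpers illustrating the watermark math, not Aurora code.
final class Watermarks {
    // PGCL: the 3rd-smallest SCL is covered by 4 of the 6 segments.
    static long pgcl(long[] segmentScls) {   // expects 6 entries
        long[] s = segmentScls.clone();
        Arrays.sort(s);                      // ascending
        return s[2];                         // 4 segments have SCL >= s[2]
    }

    // VCL: every PG must be complete up to this point, hence the minimum.
    static long vcl(long[] pgcls) {
        return Arrays.stream(pgcls).min().orElse(0L);
    }
}
```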
The database can also establish recovery points based on transactions. Each database-level transaction is broken into multiple mini-transactions (MTRs) that need to be performed in order and atomically. The final record in a mini-transaction is marked as a consistency point LSN (CPL). The volume durable LSN (VDL) tracks the highest CPL that is smaller than or equal to the VCL. During recovery, the database establishes the durable point to be the VDL, and the storage system truncates all data above that consistency point.
Cloning a database results in the new database pointing to the same logs and data pages as the original, i.e. the clone is created much more quickly and can serve reads on the data as of that point in time. New log segments and data pages are created only as writes are made through the new database instance. This satisfies the need in a cloud offering for database clones that are cheap and fast to create.
Aurora can be configured to track changes so that the state of the database can be moved back to a previous point in time quickly, for reasons such as data corruption caused by incorrect code. Once backtracking is configured, data pages are not garbage-collected by the storage system and are tracked so that users can go back to a desired point in time.
Fundamentally, in an active database, users can create a rule which defines the conditions that need to be met and the action the database should take. When a DML statement is executed on data in the DBMS, the DBMS checks whether the DML caused any of the conditions to be satisfied and, if so, takes the action. Anyone familiar with triggers in a DBMS, which were just getting supported in commercial DBMSs when these papers were published, can relate to this event-condition-action (ECA) paradigm.
At a high level, the following functional components are required for an active DBMS, in addition to core data management and transaction management functionality.
Alert follows an evolutionary approach of extending a passive DBMS into an active DBMS, and the components implemented for the extension provide insights into the basic components used in current data streaming and processing technologies like Apache Kafka.
It introduces the notions of active tables and active queries. Active tables are append-only tables in which tuples are never updated in place and new tuples are added at the end; active queries are queries that range over such tables. When a cursor is opened for an active query involving one or more active tables, tuples added to an active table after the cursor was opened also contribute to the answer. Thus active queries are defined over past, present and future data, whereas the domain of passive queries is limited to past and present data. In order to support active queries, Alert introduced a new SQL primitive, fetch-wait, to iterate over active queries; it blocks when the current answer set is exhausted and resumes returning data when a new tuple becomes available. Active tables are defined by users like any passive table, and data is stored in active tables akin to the journals created by many applications, such as banking transactions.
Users can create rules using standard SQL, including the condition which needs to be satisfied and the action the database needs to take when the condition is satisfied by a database event.
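A hypothetical sketch in the spirit of Alert's SQL-based rules; the active table, columns and email action are invented for illustration:

```sql
CREATE RULE large_withdrawal AS
SELECT send_email('ops@example.com', t.account_id, t.amount)
FROM withdrawals t        -- an active (append-only) table
WHERE t.amount > 10000;
```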
This creates a rule to send an email when the conditions are met. Like database views, rules can be referenced in any other query. After a rule is created, it can be activated and deactivated explicitly. When activating a rule, users can specify its transaction and time coupling along with its assertion mode. Transaction coupling specifies whether the triggered action needs to be executed in the transaction the triggering event is part of, or separately. Time coupling specifies whether the triggered action needs to be executed synchronously with the triggering event or asynchronously, in parallel with it. The assertion mode gives users the option to specify whether the rule is triggered as soon as the rule condition is satisfied or deferred till the end of the transaction. The following diagram shows the message flow when an event affects an active table, which in turn satisfies the conditions of multiple rules.
The Alert rule system, a new component added to extend a passive DBMS into an active one, performs any conflict resolution between multiple rules and determines their order of execution. The rules are then executed in that order to fetch the tuples, which are passed to the actions associated with the rules. To reduce the number of locks taken on data pages while data is read and rules are evaluated, latches are taken on the pages and locks are deferred to the end.
Access methods enable us to read or write base data, potentially using auxiliary data such as indexes to improve performance.
RO is the read amplification: the ratio between the total amount of data read, including auxiliary and base data, and the amount of data retrieved.
UO is the write amplification: the ratio between the size of the physical updates performed for one logical update and the size of the logical update. A logical update can involve multiple physical updates, e.g. updates to base data and to auxiliary data like indexes.
MO is the space amplification: the ratio between the space utilized for auxiliary and base data together and the space utilized for base data alone.
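Written as ratios, matching the descriptions above:

```latex
RO = \frac{\text{data read (base + auxiliary)}}{\text{data retrieved}}
\qquad
UO = \frac{\text{size of physical updates for one logical update}}{\text{size of the logical update}}
\qquad
MO = \frac{\text{space for base + auxiliary data}}{\text{space for base data}}
```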
The theoretical minimum for each overhead is a ratio of 1.0, which implies that the base data is always read and updated directly and not a single extra bit of memory is used. Achieving these bounds for all three overheads simultaneously is not possible, as there is always a price to pay for every optimization. When designing an access method, all three overheads should be kept low, but depending on the application workload and the available technology they need to be prioritized.
designing access methods that set an upper bound for two of the RUM overheads, leads to a hard lower bound for the third overhead which cannot be further reduced
In other words, we can choose which two overheads to prioritize and optimize for, and pay the price of having the third overhead exceed a hard lower bound. The following figure shows some popular access methods mapped into the three-dimensional RUM space, projected onto a two-dimensional plane.
Using the RUM space, we can understand current access methods in terms of which overheads they prioritize and choose the one that suits our needs. It also helps in designing new access methods, or combinations of access methods, which can be tuned for the available technology and dynamically adapt to new workloads, thereby covering the RUM space in aggregate.
Dynamo uses consistent hashing, where the output range of a hash function is treated as a fixed circular space or ring. Each node in Dynamo is assigned a random value within this space, which represents its position on the ring. Each data item, identified by its key, is assigned to a node by hashing the key to find its position on the ring and walking clockwise to the first node with a position larger than the item's. This node is the coordinator node for writes of that key. Because items are assigned to nodes this way, adding or removing a node impacts only its immediate neighbors, and other nodes remain unaffected. To reduce the non-uniform data and load distribution caused by the random position assignment of nodes on the hash ring, Dynamo uses virtual nodes: each virtual node is assigned a position on the ring, and each physical node is assigned multiple virtual nodes. By varying the number of virtual nodes based on physical node capacity, heterogeneity of the environment can be taken into account. If a node becomes unavailable, virtual nodes also help disperse its load evenly across the remaining nodes.
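A minimal sketch of the ring-with-virtual-nodes idea (not Dynamo's actual implementation; the hash function is a stand-in):

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.UUID;

final class HashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int vnodesPerNode;

    HashRing(int vnodesPerNode) { this.vnodesPerNode = vnodesPerNode; }

    // Each physical node takes several positions on the ring.
    void addNode(String node) {
        for (int i = 0; i < vnodesPerNode; i++) {
            ring.put(hash(node + "#" + i), node);
        }
    }

    // The coordinator is the first node clockwise from the key's position.
    String coordinatorFor(String key) {
        Map.Entry<Long, String> e = ring.ceilingEntry(hash(key));
        return (e != null ? e : ring.firstEntry()).getValue();
    }

    private static long hash(String s) {
        // MD5-based stand-in for the ring's hash function
        return UUID.nameUUIDFromBytes(s.getBytes()).getMostSignificantBits();
    }
}
```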
Dynamo's evolution compared three partitioning schemes and the efficiency of load distribution under each.
The node to which a data item is assigned is called the coordinator node; it not only stores the data locally but also coordinates replication of the data to N-1 other nodes, where N is configurable. The N-1 successive nodes on the ring after the coordinator are selected for replication and are called the preference list. Since multiple virtual nodes can be assigned to a single physical node, nodes are skipped while building the preference list to avoid storing replicas of the same data on one physical node. If any node in the preference list is unavailable, another node is selected to store the replica, along with metadata hinting at the node to which the data belongs. When the target node becomes available again, the data is delivered to it. A node storing such hinted-handoff data can itself fail before the data is replicated to the target node, leaving the replicas inconsistent. To detect inconsistencies and recover quickly, a Merkle tree over the data stored on each node is maintained and compared regularly; when a difference is found, data is replicated in the background to bring the replicas back in sync.
Dynamo uses vector clocks, which are lists of (node, counter) pairs, to capture the causality between versions of the same object. When multiple versions of data are retrieved during a read, if the versions are causally ordered, i.e. the clock of the last version contains all the nodes of the others with counters at least as large, then all the older versions can be forgotten. But if there are versions with no causal dependency, i.e. each contains a (node, counter) entry the other lacks, they need to be reconciled. Reconciliation can be done at the client using business logic, at the storage level with last-write-wins using physical timestamps, or by setting the read quorum to 1 and the write quorum to N, which ensures that all replicas have the same version.
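A minimal sketch of the causality check (not Dynamo's implementation):

```java
import java.util.Map;

final class VectorClocks {
    // True if version b causally descends from version a: b has seen every
    // (node, counter) event recorded in a.
    static boolean descends(Map<String, Long> b, Map<String, Long> a) {
        for (Map.Entry<String, Long> e : a.entrySet()) {
            if (b.getOrDefault(e.getKey(), 0L) < e.getValue()) return false;
        }
        return true;
    }
    // If neither descends(a, b) nor descends(b, a) holds, the versions
    // conflict and must be reconciled (e.g. by the client).
}
```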
To connect to a non-secure HBase cluster, the Phoenix JDBC connection string is of the form jdbc:phoenix:<ZK-QUORUM>:<ZK-PORT>:<ZK-HBASE-NODE>. The following is a code snippet to get a Phoenix JDBC connection object for a non-secure HBase cluster.
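A sketch; the ZooKeeper quorum, port and znode are placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class PhoenixConnect {
    public static Connection connect() throws SQLException {
        // quorum, port and znode below are placeholders
        return DriverManager.getConnection("jdbc:phoenix:zk1,zk2,zk3:2181:/hbase");
    }
}
```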
To connect to a secure HBase cluster using a Kerberos user principal and keytab, the Phoenix JDBC connection string should be of the form jdbc:phoenix:<ZK-QUORUM>:<ZK-PORT>:<ZK-HBASE-NODE>:principal_name@REALM:/path/to/keytab. The following is a code snippet to get a Phoenix JDBC connection object for a secure HBase cluster.
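A sketch mirroring the previous snippet; the principal and keytab path are placeholders:

```java
// inside the same kind of helper as above
Connection conn = DriverManager.getConnection(
    "jdbc:phoenix:zk1,zk2,zk3:2181:/hbase-secure"
    + ":hbaseuser@EXAMPLE.COM:/etc/security/keytabs/hbaseuser.keytab");
```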
If a Kerberos principal and keytab are not used to connect to a secured HBase cluster, then the user running the code should be defined in the Kerberos KDC and should have a valid TGT. The user can verify whether they are in the correct KDC and have a valid TGT by running the klist command. One key item to note is that, to access a secure HBase cluster, the hbase-site.xml and core-site.xml of the target HBase cluster should be available in the classpath of the application.
- Set the HBASE_OFFHEAPSIZE environment variable to the total off-heap memory size. One way to do this is to set it in hbase-env.sh:
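A sketch; the size is illustrative (a 4 GB bucket cache plus x GB of headroom):

```sh
# in hbase-env.sh: bucket cache size plus x GB for the HDFS client
export HBASE_OFFHEAPSIZE=5G
```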
Note: the "G" is for gigabytes. The value of x should be 1 to 2 GB, at the higher end for clusters handling a high volume of transactions.
- The other option to configure the total off-heap memory size is to set the -XX:MaxDirectMemorySize JVM property. Again, this value can be set in hbase-env.sh:
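A sketch, sized the same way as above; the variable name assumes the region server JVM:

```sh
# in hbase-env.sh
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -XX:MaxDirectMemorySize=5G"
```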
If you are wondering what the extra x GB is for, it is the Java direct memory used by the HDFS client, which HBase uses to interact with the underlying HDFS filesystem.
- Set the HBase property hbase.bucketcache.combinedcache.enabled to true so that the on-heap cache is used for index and Bloom filter blocks while the bucket cache holds the data blocks.
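A sketch of the hbase-site.xml entry:

```xml
<property>
  <name>hbase.bucketcache.combinedcache.enabled</name>
  <value>true</value>
</property>
```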
- Set the HBase property hbase.bucketcache.ioengine to offheap.
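A sketch of the hbase-site.xml entry:

```xml
<property>
  <name>hbase.bucketcache.ioengine</name>
  <value>offheap</value>
</property>
```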
- Set the HBase property hbase.bucketcache.size to the amount of memory allocated for the bucket cache, in MB.
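A sketch of the hbase-site.xml entry; 4096 MB matches the sizing example above:

```xml
<property>
  <name>hbase.bucketcache.size</name>
  <value>4096</value>
</property>
```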
While this view is a natural extension of object-oriented design, it doesn't take into account the important issues in distributed systems, namely latency, memory access, partial failure, and concurrency.
Historically there has been a desire to merge the programming and computational models of local and remote computing; communications protocol development, for example, has followed a similar trajectory.
These two approaches can be reconciled in distributed object-oriented systems by accepting the fact that there are irreconcilable differences between local and distributed computing.
Reliable Data Transmission
Lower layers can implement functions which improve the performance of the applications using them, as in the case of reliable data transmission.
Decisions to include functions in lower layers
In order to make these decisions, i.e. whether to include a function in the lower layers or let the application handle it end to end, the application's requirements of what needs to be accomplished must be well understood.
Scaling of all the components can be improved by replication, distribution and caching.
Key points to remember while building scalable systems
Evaluating distributed systems
Non-Kerberized cluster: create the following files, with the content shown, in a local directory. In this example the files are created under hbasecompact in the user's local home directory; please note that the files live under different subdirectories. First, a shell script to start HBase compaction on a table and copy the output of the command to an HDFS directory, where it can be checked for any issues.
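A hypothetical sketch of such a script; table name and HDFS paths are illustrative:

```sh
#!/bin/bash
# Run a major compaction via the HBase shell and keep the output in HDFS.
TABLE=$1
OUT=/tmp/hbaseCompact_$$.out
echo "major_compact '$TABLE'" | hbase shell > "$OUT" 2>&1
hdfs dfs -put -f "$OUT" /user/userid/hbasecompact/logs/hbaseCompact.out
```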
Next, a shell script to copy the Oozie job id into an HDFS directory; the job id can be used to check for any issues.
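A hypothetical sketch; the workflow passes its own id as the first argument:

```sh
#!/bin/bash
# Persist the Oozie workflow id (passed as $1) to HDFS.
echo "$1" > /tmp/oozieJobId_$$.out
hdfs dfs -put -f /tmp/oozieJobId_$$.out /user/userid/hbasecompact/logs/oozieJobId.out
```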
Next, the Oozie workflow definition to perform HBase compaction. The first step, major_compact, runs the script hbaseCompact.sh; the next step, logOozieId, runs the script logOozieId.sh to copy the Oozie workflow id onto HDFS.
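A condensed, hypothetical sketch of such a workflow using Oozie shell actions; the parameter names are illustrative and are substituted from the properties file shown later:

```xml
<workflow-app name="major_compact_wf" xmlns="uri:oozie:workflow:0.4">
  <start to="major_compact"/>
  <action name="major_compact">
    <shell xmlns="uri:oozie:shell-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <exec>hbaseCompact.sh</exec>
      <argument>${tableName}</argument>
      <file>${scriptDir}/hbaseCompact.sh#hbaseCompact.sh</file>
    </shell>
    <ok to="logOozieId"/>
    <error to="fail"/>
  </action>
  <action name="logOozieId">
    <shell xmlns="uri:oozie:shell-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <exec>logOozieId.sh</exec>
      <argument>${wf:id()}</argument>
      <file>${scriptDir}/logOozieId.sh#logOozieId.sh</file>
    </shell>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail"><message>HBase compaction workflow failed</message></kill>
  <end name="end"/>
</workflow-app>
```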
Oozie coordinator.xml definition to run the major_compact_wf workflow defined in the previous step.
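A hypothetical sketch; the frequency, start/end times and path parameters are illustrative:

```xml
<coordinator-app name="major_compact_coord" frequency="${coord:days(1)}"
                 start="${startTime}" end="${endTime}" timezone="America/New_York"
                 xmlns="uri:oozie:coordinator:0.4">
  <action>
    <workflow>
      <app-path>${workflowAppPath}</app-path>
    </workflow>
  </action>
</coordinator-app>
```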
Note: do not use 1440 minutes as the frequency in coordinator.xml if the expectation is to run compaction every day at a certain time, since the job run time will shift when the system time changes for daylight saving. The start and end times should be specified in UTC/GMT, and the timezone attribute is required for Oozie to invoke the logic that handles the time changes due to daylight saving.

Finally, the properties which need to be substituted for the parameters defined in the workflow and coordinator XML files. If this example is used, this is the only file which needs to be changed; inline comments will help in making the changes.
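A hypothetical sketch of coordinator.properties; hosts, paths and times are placeholders consistent with the workflow and coordinator sketches above:

```
nameNode=hdfs://namenode-host:8020
jobTracker=resourcemanager-host:8032
oozie.coord.application.path=${nameNode}/user/userid/hbasecompact/coordinator
workflowAppPath=${nameNode}/user/userid/hbasecompact/workflow
scriptDir=${nameNode}/user/userid/hbasecompact
tableName=healthy
startTime=2015-03-01T06:00Z
endTime=2099-12-31T06:00Z
```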
If you list the local directory where these files are stored, you should see the two shell scripts along with the workflow, coordinator and properties files.
Copy them into an HDFS directory:
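A sketch; source and target paths are placeholders:

```sh
hdfs dfs -put ~/hbasecompact /user/userid/hbasecompact
```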
If you list the HDFS directory used as the target location in the previous step, the same set of files should be present.
If the execute permissions on the two shell scripts are not set, do a chmod to set them:
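A sketch, assuming the scripts live at the HDFS paths used above:

```sh
hdfs dfs -chmod 755 /user/userid/hbasecompact/hbaseCompact.sh
hdfs dfs -chmod 755 /user/userid/hbasecompact/logOozieId.sh
```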
Schedule the Oozie job using the following command. Note that the properties file, in this example /home/userid/hbasecompact/coordinator/coordinator.properties, should be on the local disk of the machine from which this command is executed.
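A sketch; the Oozie host is a placeholder:

```sh
oozie job -oozie http://oozie-host:11000/oozie \
    -config /home/userid/hbasecompact/coordinator/coordinator.properties -run
```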
The status of the submitted job can be verified through the Oozie UI, normally accessible at http://oozie-host:11000/oozie/. The status of the MapReduce jobs can be viewed through the YARN RM/NM UIs. The id of the last executed Oozie workflow, and the output of the HBase major_compact command from the last run, can be found in the HDFS files created by the job.
Set the hbase.replication property to true in hbase-site.xml of the HBase cluster from which data needs to be replicated. This cluster is referred to as the master going forward. By default the value of this property is "true".
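A sketch of the hbase-site.xml entry:

```xml
<property>
  <name>hbase.replication</name>
  <value>true</value>
</property>
```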
Create an HBase replication peer in the master HBase cluster using the ZooKeeper quorum of the cluster to which data needs to be replicated. That cluster is referred to as the slave going forward.
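A sketch from the HBase shell on the master; the slave's ZooKeeper quorum is a placeholder:

```
add_peer '1', "slave-zk1,slave-zk2,slave-zk3:2181:/hbase"
list_peers
```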
Once the replication peer is created and enabled, replication needs to be enabled on the HBase tables whose data is to be replicated from the master cluster, by setting the table's "REPLICATION_SCOPE" attribute to a non-zero value (the default is "0"). For an existing table, setting "REPLICATION_SCOPE" requires disabling and re-enabling the table; the following is an example of the steps for an existing table named "healthy". Note that a table with the same definition as the table being replicated (in this case "healthy") should be created in the slave cluster before replication is enabled on the master.
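A sketch; the column family name is hypothetical:

```
disable 'healthy'
alter 'healthy', {NAME => 'cf1', REPLICATION_SCOPE => '1'}
enable 'healthy'
```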
Once replication is enabled, replication-related JMX statistics become available on all the region servers in the master cluster hosting regions for which data replication is enabled.
Note that these statistics are from HBase version 0.98; later versions may have additional statistics which can help with replication monitoring.
Create a snapshot of an HBase table:
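A sketch; table and snapshot names are hypothetical:

```
snapshot 'healthy', 'healthy-20150131-1430'
```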
For easier identification, it is a good practice to include the table name, creation date and creation time in the snapshot name.
Restore data from snapshot
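A sketch, continuing the names above:

```
disable 'healthy'
restore_snapshot 'healthy-20150131-1430'
enable 'healthy'
```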
Note that the table needs to be disabled to restore its data from a snapshot. Also note that any updates made to the table after the snapshot will be lost once the restoration is complete.
Clone table from snapshot
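A sketch; the new table name is hypothetical:

```
clone_snapshot 'healthy-20150131-1430', 'healthy_clone'
```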
A new table is created with the attributes of the original table from which the snapshot was taken, and the data as of the point in time of the snapshot is restored into it.
List all the available snapshots for a table. If multiple snapshots were taken, the list of available snapshots can be viewed as follows:
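A sketch; the table-name regex is hypothetical:

```
list_snapshots 'healthy.*'
```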
Creating connections to a server component from an application is a heavyweight operation, and this is even more pronounced when connecting to a database server. That is the reason database connection pooling is used to reuse connection objects, and HBase is no exception. In HBase, data from the meta table, which stores details about the region servers that can serve data for specific key ranges, gets cached at the individual connection level, making HBase connections even heavier. So if regions move for balancing, or if a region server fails, the metadata needs to be refreshed for each connection object, which is a performance overhead. For these reasons, applications should try to reuse the connection objects they create.
The following code snippet shows how to create an HBase connection object in a Java application.
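A sketch using the HBase 0.98-era client API; the table name is hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HConnection;
import org.apache.hadoop.hbase.client.HConnectionManager;
import org.apache.hadoop.hbase.client.HTableInterface;

public class HBaseConnectionExample {
    public static void main(String[] args) throws Exception {
        // One shared, heavyweight connection for the whole application
        Configuration conf = HBaseConfiguration.create();
        HConnection connection = HConnectionManager.createConnection(conf);

        // A lightweight table handle, typically created per thread
        HTableInterface table = connection.getTable(TableName.valueOf("healthy"));
        try {
            // puts / gets / scans here
        } finally {
            table.close();   // close the table handle, not the shared connection
        }
    }
}
```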
If the application is multi-threaded, it should reuse the connection object for any data manipulation operations on tables. This can be achieved by individual threads creating HTable objects using the getTable(TableName) method of the HConnection object. Once the data manipulation operations are complete, each thread should close its HTable, but not the HConnection object, so that the connection can be reused by other threads.
In order to prevent skew in query processing and to distribute the query processing workload across all the nodes in the cluster, it is a good practice to create tables pre-split. The key is to identify split points such that the data will be distributed across all the nodes in the cluster. Once the split points are identified, the table can be created pre-split using the HBase shell; the following is an example of a table with 3 split points.
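A sketch; table name, column family and split points are hypothetical:

```
create 'healthy', 'cf1', SPLITS => ['g', 'n', 'u']
```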
At the start of development, when the split points in the data are not yet clear but you still want to pre-split the table, HBase provides a utility program which can split the table and uniformly distribute the data. The following is an example which creates a table with 10 splits and column family 'cf1'.
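A sketch; the table name is hypothetical:

```sh
hbase org.apache.hadoop.hbase.util.RegionSplitter healthyTable HexStringSplit -c 10 -f cf1
```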
If you are creating tables programmatically using the Java APIs, the following code snippet shows how to pre-split the table during creation.
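A sketch using the HBase 0.98-era admin API; table, family and split points are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("healthy"));
        desc.addFamily(new HColumnDescriptor("cf1"));

        // the same three split points as the shell example
        byte[][] splits = {
            Bytes.toBytes("g"), Bytes.toBytes("n"), Bytes.toBytes("u")
        };
        admin.createTable(desc, splits);
        admin.close();
    }
}
```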
For further reading on the details of HBase table splitting and merging, refer to this blog post.
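The HFiles for bulk loading are generated by a MapReduce driver; the following is a hedged sketch of such a driver, matching the description below. The class name and ZooKeeper quorum are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HFileDriver {
    public static void main(String[] args) throws Exception {
        String tableName = args[0];       // target table
        Path input = new Path(args[1]);   // source data on HDFS
        Path output = new Path(args[2]);  // where HFiles are written

        Configuration conf = HBaseConfiguration.create();
        // ZK quorum of the cluster hosting the target table (placeholder)
        conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3");

        Job job = Job.getInstance(conf, "hfile-generator");
        job.setJarByClass(HFileDriver.class);
        job.setMapperClass(HFileMapper.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);

        FileInputFormat.addInputPath(job, input);
        FileOutputFormat.setOutputPath(job, output);

        // Wires in HFileOutputFormat plus the partitioner and reducer,
        // so no reducer needs to be set explicitly.
        HTable table = new HTable(conf, tableName);
        HFileOutputFormat.configureIncrementalLoad(job, table);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```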
A few things to note about the driver:

- It takes three parameters: the table name, the HDFS directory where the source data is stored, and the HDFS output directory where HFiles need to be created for loading into HBase.
- It sets the map output value class to org.apache.hadoop.hbase.client.Put, which represents a single row in an HBase table.
- The input format is set to Text to read the source data from a text file.
- In the configuration object, the only parameter which needs to be set is the ZooKeeper (ZK) quorum, and the value should be the ZK quorum of the HBase cluster on which the target table is defined.
- No reducers need to be set to create HFiles using MapReduce.

The following is the code snippet for the HFileMapper class used by the driver program.
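A hedged sketch of such a mapper; the row-key scheme, column family and qualifier are invented for illustration:

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class HFileMapper
        extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Dummy logic: use the line offset as the row key and store the
        // whole line in one column; adapt to the real source data layout.
        byte[] rowKey = Bytes.toBytes(String.format("row-%012d", offset.get()));
        Put put = new Put(rowKey);
        put.add(Bytes.toBytes("cf1"), Bytes.toBytes("col1"),
                Bytes.toBytes(line.toString()));
        context.write(new ImmutableBytesWritable(rowKey), put);
    }
}
```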
Note that this is a dummy mapper in which the keys and values are generated in the code. Based on what the source data file stores and what needs to be stored in the HBase table, the mapper needs to be modified; the key aspect to note is how the Put object is created. Once the driver and mapper code is compiled, packaged in a Java jar file (e.g. happy-hbase-sample.jar) and made available on all the nodes in the HBase/HDFS cluster, the HFiles can be generated by running the MapReduce job on the cluster:
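A sketch of the job submission; the driver class name and paths are placeholders, while the jar name comes from the text above:

```sh
hadoop jar happy-hbase-sample.jar HFileDriver healthyTable /data/source /data/hfiles
```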
When the MapReduce job completes, it creates a set of HFiles in the output directory on HDFS, which can then be used to load data into the target HBase table, in this case healthyTable:
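A sketch using the standard bulk-load tool, with the same paths as the run above:

```sh
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /data/hfiles healthyTable
```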
Further details on a particular command can be found using help 'command-of-interest', for example help 'create'.
Note that by default a table is created in the default namespace if no namespace is specified during table creation. If a namespace is specified, it should already exist; it can be created using the create_namespace command. To view the properties of a table, the describe command can be used.
The describe output has two columns, DESCRIPTION and ENABLED; the value true displayed on the console is for the ENABLED column, which tells the user whether the table is enabled. To see whether a table is defined in the cluster, use the list command, which lists all the tables in the HBase cluster.
To drop a table, it needs to be disabled first using the disable command before dropping it with the drop command. If drop is attempted before the disable, the shell prompts with a message that the table is enabled.
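A sketch; the table name is hypothetical:

```
disable 'healthy'
drop 'healthy'      # drop fails with a prompt if the table is still enabled
```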
Note that changes to some of the attributes require further actions. For example, changing BLOCKSIZE requires a major compaction of the table for the change to take effect immediately, while modifying the REGION_REPLICATION property requires the table to be disabled before altering and then re-enabled. For basic data manipulation, the put, get, scan and delete commands can be used. The following puts two rows into a table, then does a get, a scan and a delete.
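A sketch; table, column family and values are hypothetical:

```
put 'healthy', 'row1', 'cf1:col1', 'value1'
put 'healthy', 'row2', 'cf1:col1', 'value2'
get 'healthy', 'row1'
scan 'healthy'
delete 'healthy', 'row2', 'cf1:col1'
```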
During development, if minor compaction needs to be performed explicitly on table regions, the compact command can be used. Note: since the compaction process has a performance impact, use caution about when you invoke compaction in a production environment.
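For example (table name hypothetical):

```
compact 'healthy'
```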
Also, to improve data locality and performance, a major compaction of a table may have to be run; the following is an example. Again, major compaction is a resource-intensive (CPU, IO, memory and network) process and should not be run in production without proper scheduling to minimize business impact.
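For example (table name hypothetical):

```
major_compact 'healthy'
```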
Balancing the region distribution of a table may help improve performance, and it is a best practice to balance the regions before running major compaction explicitly. Be cautious about when you run the balancer in a production environment, since it will impact performance.
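For example:

```
balancer
```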
During development there may be a need to disable the HBase balancer so that regions can be moved manually. Enabling and disabling the balancer can be accomplished using the balance_switch command, which takes true|false as input.
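For example:

```
balance_switch false   # returns the previous setting, e.g. true
```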
Note that the balance_switch command returns the previous value of whether the balancer was enabled. In the previous example the balancer was enabled, hence the return value of true.
To check the status of the HBase cluster, the status command can be used. The command generates detailed, simple or summary status based on whether the detailed|simple|summary parameter is passed.
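For example:

```
status 'summary'
status 'simple'
status 'detailed'
```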
The GC process typically gets invoked when the amount of free memory in the JVM falls below a certain threshold. At a very high level, the GC process involves identifying objects which are no longer used, i.e. no longer referenced, releasing their memory, and compacting memory to reduce fragmentation. Readers interested in the details of the GC process can find them here. As one can imagine, the time it takes to complete the GC process increases with the size of the Java heap, since it takes more time to identify the objects which can be released and to perform compaction.
If a Java application requires large memory (in GBs), the time it takes to complete the GC process can be detrimental to its performance. If the application is performance-sensitive, a large heap size can adversely impact it. To mitigate this, one can use memory outside the Java heap and hence reduce the Java heap's use and size. This can be done using the Java ByteBuffer class, which provides the option to allocate buffers outside the JVM heap using the allocateDirect() method.
The allocateDirect method allocates memory of the requested size (in bytes) outside the JVM heap (off-heap) and returns an object reference to the application with a starting offset of 0. The application can then use the reference to store and retrieve data in the off-heap memory. When garbage collection runs, it doesn't have to scan the off-heap allocations to identify unused memory or perform compaction, which in turn reduces the time to complete GC.
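A minimal sketch of the pattern:

```java
import java.nio.ByteBuffer;

public class OffHeapDemo {
    public static void main(String[] args) {
        // Allocate 1 MB outside the JVM heap; GC does not scan this region.
        ByteBuffer buf = ByteBuffer.allocateDirect(1024 * 1024);

        buf.putLong(0, 42L);          // absolute write at offset 0
        long value = buf.getLong(0);  // read it back

        System.out.println("read back: " + value);
    }
}
```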
While the time to complete GC can be reduced for a Java process using large memory by moving data off-heap, there are other overheads to take into consideration. Allocating off-heap memory takes more time than on-heap allocation, since the JVM needs to make native calls to get the memory allocated. Also, when the off-heap memory is no longer used by the application, the JVM must make native calls during GC to free it, in addition to releasing the on-heap memory used by the object reference. And, as per the API documentation, the JVM only makes a best effort to avoid using on-heap memory as an intermediate step when storing and retrieving data to/from off-heap memory. To compensate for these additional overheads while still taking advantage of large memory without the penalty of increased GC time, it is best to use off-heap memory for large objects which don't get released often.
When a JVM is brought up to run a Java process, the total memory which can be used for off-heap allocations can be specified using the -XX:MaxDirectMemorySize JVM parameter. If the parameter is not set explicitly, the value is set to the memory available in the system at the start of the process, using the VM.maxDirectMemory() method call. As off-heap allocations are made, the JVM keeps track of the total memory used so far. When a new off-heap allocation request arrives, the JVM checks whether the sum of the requested size and the total allocated so far exceeds the direct memory limit set at the start of the process. If it does, the memory allocator makes an explicit GC call, and the requesting thread then sleeps for 100 ms to let the GC complete. After 100 ms, the allocator checks again whether there is enough space to satisfy the new allocation request before raising an out-of-memory exception.
A few things are worth noting about this allocation process.
First, for performance-sensitive applications, the explicit GC and the non-tunable sleep time in the allocation logic when memory runs short can be a large overhead.
The second item to note is that a GC call only means the JVM will make a best effort to schedule one; it doesn't guarantee that one runs immediately. So there can be situations where the Java process fails with an OOM error even though enough memory could have been freed to accommodate the new allocation request, simply because a GC was not run in time.
Third, the 100 ms sleep may not be sufficient in certain situations for the GC to complete and release unused memory to satisfy the new allocation request. If anyone assumes 100 ms is always more than sufficient for a GC to complete, we came across a situation where allocating 1 GB chunks of off-heap space in a simple for loop failed with OOM on Ubuntu 12.04 LTS, while the same code ran fine on a Red Hat Linux machine with relatively less powerful hardware. With the current API this sleep time can't be adjusted, so the application may have to perform additional sleeps of its own to make sure memory is actually available.
The last item of interest is the total memory available for off-heap allocation. This value is set at the start of the process, either manually or by the VM. When set manually, the JVM doesn't verify whether that much free memory is actually available on the system. Even when the value is set automatically by the JVM, the memory available on the system can shrink during execution as other processes' usage changes, which can cause the Java process to fail with an unexpected memory-allocation exception. So it is important to make sure that memory of the size set in -XX:MaxDirectMemorySize remains available to the Java process, so that such failures don't happen.
There are a few things the JVM could do to prevent allocation-related exceptions, all of which would require changes to the JVM code.
Verify that the system's free memory is greater than or equal to the configured limit when the JVM is brought up, and also during process execution. This requires native system calls and may have a pronounced impact on the performance of the Java process; one way to mitigate this is to provide a JVM option for users who want this strict checking.
Instead of invoking a GC call only when all the direct memory is used, provide a configurable threshold of direct memory usage at which the GC call is made. This should be a fairly simple change to the JVM code.
Calculate the sleep time after the GC call taking into account all the factors which affect GC time. This would be complex, and would matter less if the previous suggestion were implemented in the JDK code.
To de-risk scenarios like these, the solution doesn't have to be complex. It can be a matter of following a simple process like the following across the enterprise.
Some may say that this is too simplistic, and they may be correct. But here is something to think about: have you wondered about the usefulness of a lock when roadside assistance can come and open your car once you lose your keys? Locks are not for the small percentage of the population who will always find a way to break them; they are there to act as a barrier to the temptations of the vast majority of us. So even if this approach is simplistic, it still acts as a barrier in a situation where there is nothing in place. If there are other solutions you are comfortable with, even better, but please secure all applications. It can save someone's life savings, identity, medical records or other critical data, and that someone can be your friend, family, neighbor or even yourself! With so much at stake on data, any vulnerability to a breach is not an option anymore.
Note: a simple utility to create such a barrier for Java-based applications is available here.
Chef provides two options for users to create their own resources: LWRP and HWRP. Similar to an LWRP, an HWRP requires a resource definition and a corresponding provider. The key difference is that there is no DSL in an HWRP as there is in an LWRP; everything is coded in Ruby. So, taking the same example of an HDFS directory resource used in the notes on LWRP, the following is the skeleton of the resource definition.
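A hypothetical skeleton matching the description below; attributes other than path are invented:

```ruby
class Chef
  class Resource
    class HdfsDir < Chef::Resource
      provides :hdfsdir

      def initialize(name, run_context = nil)
        super
        @resource_name = :hdfsdir             # name used in recipes
        @allowed_actions = [:create, :delete]
        @action = :create                     # default action
      end

      # path falls back to the resource's name
      def path(arg = nil)
        set_or_return(:path, arg, :kind_of => String, :name_attribute => true)
      end

      def user(arg = nil)
        set_or_return(:user, arg, :kind_of => String)
      end

      def mode(arg = nil)
        set_or_return(:mode, arg, :kind_of => String, :default => '0755')
      end
    end
  end
end
```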
The HWRP is a Ruby class, in this case HdfsDir, which is a subclass of the Chef::Resource class. The provides method specifies the resource provider for this resource, in this case hdfsdir.
As in any Ruby class, the initialize method is used to perform initializations, like setting the initial values of variables. In this case, the pre-defined instance variable resource_name is set to a name which can be used to create a resource block in recipes using this HWRP. An array of symbols specifying the actions supported by this HWRP is assigned to the instance variable allowed_actions. A default action, taken when an action is not set while creating a resource using this HWRP (in this case create), is assigned to the instance variable action.
The remaining section in the skeleton defines the characteristics of all the attributes of this resource, similar to the attribute definitions in an LWRP. The key difference is that they are all defined as Ruby methods, and set_or_return is similar to Ruby's attr_accessor method, which creates the getters and setters for the attributes.
Unlike an LWRP, the HWRP resource and provider code is stored in files under the libraries directory of the cookbook. Also, there are no strict rules about file naming conventions, since these are Ruby classes and they get loaded first during the Chef client run.
Now let's turn to the corresponding provider definition; the following is the skeleton. It is more or less similar to the LWRP provider code we saw earlier, with some differences.
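A hypothetical skeleton matching the description below; the dir_exists?, create_dir and delete_dir helpers are invented:

```ruby
class Chef
  class Provider
    class HdfsDir < Chef::Provider
      provides :hdfsdir

      def whyrun_supported?
        true
      end

      def load_current_resource
        # check the current state of the resource (hypothetical helper)
        @dir_exists = dir_exists?(new_resource.path)
      end

      def action_create
        if @dir_exists
          Chef::Log.info("#{new_resource.path} already exists - nothing to do")
        else
          converge_by("create HDFS directory #{new_resource.path}") do
            create_dir(new_resource.path)   # hypothetical helper
            new_resource.updated_by_last_action(true)
          end
        end
      end

      def action_delete
        if @dir_exists
          converge_by("delete HDFS directory #{new_resource.path}") do
            delete_dir(new_resource.path)   # hypothetical helper
            new_resource.updated_by_last_action(true)
          end
        end
      end
    end
  end
end
```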
As with the resource definition, the provider is also a Ruby class, a subclass of the Chef::Provider class. The method whyrun_supported? specifies whether the resource supports chef client runs with the why-run option. If this method is set to return true, then the strings provided in the converge_by statement of the action requested in the recipe are logged instead of performing the actual convergence of the resource.
The load_current_resource method needs to be overwritten in an HWRP, while it is optional in an LWRP. As discussed in the LWRP note, this method can be used to check the current state of the resource.
The methods for the supported actions are defined using the naming convention action_name; for example, for the create action the method name is action_create. Supporting methods can be defined as in any Ruby class, for example a private validation helper.
The method new_resource.updated_by_last_action is called with a value of true so that Chef is notified that the resource got updated by that particular action.
More notes in this category can be found here.
Chef provides a large set of resources to work with. But there are situations where the resources provided by Chef may not be sufficient. For example, distributed file systems can't be handled by the file-system related resources (file, directory etc.) which come out of the box with Chef. Being flexible and customizable, Chef provides two options (LWRP, HWRP) for users to create their own resources.
Light Weight Resource Providers (LWRP) use a DSL to simplify the creation of resources and are used when existing Chef resources can be leveraged with minimal Ruby code. In contrast, Heavy Weight Resource Providers (HWRP) are used when existing resources can't be leveraged and Ruby code needs to be used to implement the resource provider.
Let's quickly look at an LWRP using an HDFS (a distributed file system) directory resource as an example. An LWRP is created and stored in a cookbook, and there are two parts to it. First is the resource definition, which defines the actions supported and attributes accepted by the LWRP; it resides in the resources directory of the cookbook. The second part is the provider, the code which implements the actions supported by the LWRP; it is stored in the providers directory of the cookbook in which the LWRP is being created.
For chef to be able to identify the new resource, the resource definition file and the provider file need to have the same name. For example, assuming the HDFS directory resource is created in the hdfs cookbook and the resource is named hdfsdir, the following is the cookbook directory structure (showing only the required directories).
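A sketch of the layout:

```
hdfs
├── providers
│   └── hdfsdir.rb
└── resources
    └── hdfsdir.rb
```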
When the resource needs to be used in a recipe, it must be prefixed with the cookbook name separated by an "_" (underscore). For example:
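A sketch; the path and attributes are hypothetical:

```ruby
hdfs_hdfsdir '/data/incoming' do
  user 'hdfs'
  action :create
end
```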
Let's look at the LWRP resource definition for the HDFS directory resource.
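A hypothetical skeleton matching the description below; attributes other than path are invented:

```ruby
# resources/hdfsdir.rb
actions :create, :delete
default_action :create

# path falls back to the resource's name
attribute :path, :kind_of => String, :name_attribute => true
attribute :user, :kind_of => String
attribute :mode, :kind_of => String, :default => '0755'
```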
- actions defines the actions supported by the new resource.
- default_action defines the action taken when the resource is used in a recipe and no action clause is specified.
- attribute defines each attribute which can be set when using the resource, including whether the attribute is required and its type.
- One attribute can take the name value of the resource if it is not set explicitly in the recipe; in this case the attribute path takes the name value, which is specified using :name_attribute => true. The complete definition can be found here.
With the resource definition out of the way, let's look at the provider for the hdfs directory resource. The following is the code skeleton for the provider; the full code can be found here.
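A hypothetical skeleton matching the description below, assuming the webhdfs gem and a WebHDFS client; the namenode host/port are placeholders:

```ruby
# providers/hdfsdir.rb
require 'webhdfs'

use_inline_resources

def whyrun_supported?
  true
end

def load_current_resource
  # record whether the directory already exists so actions stay idempotent
  @exists = begin
    client.stat(new_resource.path)
    true
  rescue WebHDFS::FileNotFoundError
    false
  end
end

action :create do
  if @exists
    Chef::Log.info("#{new_resource.path} already exists - nothing to do")
  else
    converge_by("create HDFS directory #{new_resource.path}") do
      client.mkdir(new_resource.path, :permission => new_resource.mode)
      new_resource.updated_by_last_action(true)
    end
  end
end

def client
  @client ||= WebHDFS::Client.new('namenode.example.com', 50070) # placeholders
end
```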
The require method is used to include external files, as in any Ruby code. Note that the webhdfs gem needs to be installed for this code to work.
use_inline_resources is a must for an LWRP. The reason is to make sure that any notifications raised from any of the resources within the LWRP (remember an LWRP can leverage other Chef resources to implement its functionality) are treated as being raised by the LWRP resource as a whole, and not by the individual resource within the LWRP which raises the notification.
The new_resource instance object is automatically created when the resource is used, and the attribute values in the new object are set to the values passed from the recipe using the resource.
When creating resources, one of the key requirements is to make sure the resource is idempotent. This requires the current state of the resource to be known. For this, chef provides an empty method, load_current_resource, which can be overwritten by the resource provider. Since this method is the first to be called when the resource is used in a recipe, it can be used to check the current state of the resource. For example, if the directory already exists and the resource action is create, the provider can skip the requested action since the directory is up to date. For anyone interested in more details, look into the code for the LWRPBase class, which is the parent of the LWRP provider.
The remaining sections in the code skeleton implement the actions supported by the resource. They can use existing chef resources and/or Ruby code. If you had a chance to look at the complete code for the hdfsdir provider, then contrary to how an LWRP is meant to be implemented, the code doesn't use any existing chef resource, since there is none for a distributed file system like HDFS, and all the actions had to be implemented in Ruby. But it is still helpful for understanding the various aspects of writing an LWRP.
The whyrun_supported? method is used to enable or disable support for the --why-run option of chef-client by setting the return value to true or false.
When whyrun_supported? is set to true and a chef-client run uses the --why-run option, the string passed to the converge_by clause is logged instead of performing the actual convergence.
When an action is taken on a resource, new_resource.updated_by_last_action(true) is used to notify chef that the resource was updated by the requested action.
Finally, note that you can use the hdfs LWRP code from this example if you are dealing with HDFS, by renaming the files and copying them into the resources and providers directories of your cookbook.
More notes on this category can be found here.