Gentle Introduction to HBase Part I – Data Structure

13 Nov

In this post (hopefully the first of several), I hope to provide a gentle introduction to HBase (since I never had one myself!). This post is about HBase’s data structure; in later posts I hope to introduce HBase programming using HBase on Amazon’s Elastic MapReduce (I like to call it Amazon’s HAAS, or HBase-as-a-Service) and Python with the HappyBase library, which offers a really easy interface to HBase’s Thrift gateway.

HBASE = TWO-DIMENSIONAL, HORIZONTALLY SCALABLE HASH MAP

HBase is defined in many different ways; most often, I think, it’s called a BigTable clone or a column-oriented datastore. I like to think of HBase as simply a two-dimensional associative array, or a two-dimensional hash map if you will.

If that’s all it is, what’s the use for HBase? As you might have realized, HBase is one of the most visible NoSQL databases out there today, and for good reason: it’s massively scalable. Facebook and other large companies use HBase for petabyte-scale data storage that can be queried and accessed in milliseconds (well, in some cases, maybe seconds).

HBase database tables have “rows”, each identified by a unique row key. Each row can have multiple column keys, with a value corresponding to each column key:

row-key => {
 column-key1: value1,
 column-key2: value2,
}
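To make the "two-dimensional hash map" picture concrete, here is a minimal sketch in plain Python. The `table` dict and the `put`/`get` helpers are names I made up for illustration; they only mimic the shape of the data and say nothing about how HBase actually stores or distributes it.

```python
# A toy, in-memory picture of an HBase table: row key -> {column key -> value}.
# This only illustrates the shape of the data, not HBase's real storage.
table = {}

def put(row_key, column_key, value):
    """Set one cell, creating the row on first write."""
    table.setdefault(row_key, {})[column_key] = value

def get(row_key, column_key=None):
    """Return a whole row as a dict (empty if absent), or one cell (None if absent)."""
    row = table.get(row_key, {})
    return dict(row) if column_key is None else row.get(column_key)

put("row-key", "column-key1", "value1")
put("row-key", "column-key2", "value2")
put("other-row", "column-key1", "value3")  # rows need not share the same columns
```

Note that the second row has only one column: nothing forces every row to carry the same set of column keys.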

COLUMN FAMILIES

But is that it? For most HBase users, that might suffice, but HBase has a feature that you need to know about: column families. Column families sit one layer above the column keys. Even novice developers need to know about them, because you MUST define the column families for each table you create in HBase (they are the entities that have to be pre-defined when creating an HBase table, whereas column keys within a family can be added freely later).

row-key => {
 cf1:column-key1: value1,
 cf1:column-key2: value2,
 cf2:column-key3: value3,
 cf2:column-key1: value4,
}

So are HBase rows actually three-dimensional hash maps? Not really: column families are always “fused” together with the column key. You can think of a column key in HBase as the combination of the column family and your user-defined column key. Thus, as in the above example, you can use the same user-defined column key across multiple column families.
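The "fused" key can be sketched in a few lines of Python. The helper names `split_column` and `group_by_family` are hypothetical, and the row is just a plain dict built from the example above; the point is only that the family is a prefix of the column key, separated by a colon.

```python
from collections import defaultdict

def split_column(column):
    """Split a fused 'family:qualifier' column key at the first colon."""
    family, _, qualifier = column.partition(":")
    return family, qualifier

def group_by_family(row):
    """Regroup a flat row dict as {family: {qualifier: value}}."""
    grouped = defaultdict(dict)
    for column, value in row.items():
        family, qualifier = split_column(column)
        grouped[family][qualifier] = value
    return dict(grouped)

# The example row from above, with family and column key fused together.
row = {
    "cf1:column-key1": "value1",
    "cf1:column-key2": "value2",
    "cf2:column-key3": "value3",
    "cf2:column-key1": "value4",
}
families = group_by_family(row)
```

Here `cf1:column-key1` and `cf2:column-key1` are two different columns holding two different values, even though the user-defined part of the key is the same.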

HBase does NOT recommend using more than two or three column families, so define them with caution. The main thing to remember about column families is that all the columns in a family are stored together on disk, and so are read and written together (this is related to “compactions”, which I hope to get to in this series). So if you have an I/O-intensive operation in HBase and several data structures with different access patterns, putting them in separate families may be useful. But given that HBase performance degrades when you have more than 2-3 column families, I’m personally not sure how useful they are (I have yet to design an application myself that used more than one column family, though I am sure there are use cases).

TIMESTAMPS AND VERSIONING

OK, now that we have that out of the way, is that it? Actually, there is one more small thing about HBase column keys and values: HBase versions the key-values written into columns! You can specify the number of versions to keep (the default was three in older releases; I believe it has been one since HBase 0.96). HBase internally keeps that many versions in its store and removes older ones eventually, during compaction (explicit deletes are marked with “tombstones” for eventual removal, another advanced topic).

row-key => {
 cf1:column-key1: 0 (at timestamp t1),
 cf1:column-key1: 1 (at timestamp t2),
 cf1:column-key1: 2 (at timestamp t3),
 ...
}

In the above example, the value of cf1:column-key1 for row-key has been updated three times (at timestamps t1, t2 and t3). When a default query for row-key and cf1:column-key1 is issued, HBase returns only the latest value (“2” in this case). But it is possible to query all stored versions of a given row/column key, and in fact to specify timestamp filters as well!
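The versioning behaviour can be modelled with a small toy class. `VersionedCell` and its parameters are my own invention for illustration, and the timestamps t1, t2 and t3 are stood in for by the integers 1, 2 and 3. One simplification to keep in mind: this sketch drops surplus versions immediately, while real HBase cleans them up later.

```python
class VersionedCell:
    """Toy model of one HBase cell that keeps a bounded number of versions."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self._versions = []  # list of (timestamp, value), newest first

    def put(self, timestamp, value):
        # A write at an existing timestamp replaces that version.
        self._versions = [(ts, v) for ts, v in self._versions if ts != timestamp]
        self._versions.append((timestamp, value))
        self._versions.sort(key=lambda tv: tv[0], reverse=True)
        # Simplification: drop surplus versions right away.
        del self._versions[self.max_versions:]

    def get(self, versions=1, min_ts=None, max_ts=None):
        """Return up to `versions` (timestamp, value) pairs, newest first,
        limited to min_ts <= timestamp < max_ts when bounds are given."""
        hits = [
            (ts, v) for ts, v in self._versions
            if (min_ts is None or ts >= min_ts) and (max_ts is None or ts < max_ts)
        ]
        return hits[:versions]

cell = VersionedCell(max_versions=3)
for ts, value in [(1, 0), (2, 1), (3, 2)]:  # t1, t2, t3 from the example
    cell.put(ts, value)

latest = cell.get()                     # default query: newest version only
history = cell.get(versions=3)          # all stored versions, newest first
before_t3 = cell.get(versions=3, max_ts=3)  # timestamp filter
```

With HappyBase, the closest equivalent I know of is `Table.cells(row, column, versions=..., include_timestamp=True)`, which returns several versions of a single cell; check the HappyBase documentation for the exact semantics of its timestamp argument.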

BYTES, STRINGS, INTEGERS

The last thing about the HBase data structure is that it stores everything as bytes internally: row keys, column keys and values alike. HBase makes no distinction between strings, integers or any other type, so it is up to your application to encode data consistently when writing it and to decode it the same way when reading it back.
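A short sketch of why this matters, using only Python's standard library. The helper names are hypothetical. The 8-byte big-endian layout is my choice here because, as far as I know, it matches what HBase's Java `Bytes.toBytes(long)` produces; any encoding works as long as writers and readers agree on it.

```python
import struct

def int_to_bytes(n):
    """Encode an integer as 8 big-endian bytes (a signed 64-bit value)."""
    return struct.pack(">q", n)

def bytes_to_int(raw):
    """Decode 8 big-endian bytes back into an integer."""
    return struct.unpack(">q", raw)[0]

as_text = str(42).encode("utf-8")  # the two characters '4' and '2'
as_int = int_to_bytes(42)          # eight bytes, mostly zeros

# HBase would store either one without complaint. The reader has to know
# which encoding was used, because nothing in the stored bytes says so.
decoded_text = int(as_text.decode("utf-8"))
decoded_int = bytes_to_int(as_int)
```

Both round trips give back 42, but the stored bytes are completely different, so decoding a value with the wrong scheme yields garbage or an error.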

FUTURE POSTS

That’s it for this post. In future posts, I hope to cover:
– Standing up an HBase cluster on Amazon EC2, and starting Thrift
– Writing Python code to write data to, and read from, an HBase cluster
– How HBase distributes its data

2 Responses to “Gentle Introduction to HBase Part I – Data Structure”

  1. Joel June 30, 2015 at 4:12 pm #

    I appreciate the gentle nature of the intro. I believe it would be helpful to show a simple concrete example of the column families using a specific entity like “Animal” or “Automobile”

  2. Paul Barbadew March 27, 2016 at 8:31 am #

    Nice post, so useful
    I’m starting to develop an application with happybase and I dont understand how to define the “primary key” (is it the same of the rowkey?)
    I’ll have two families and I know whats must the key columns, but I’m still havent found the way to define it
    Best regards
