Sense Codons: January 2012

One of the things that I love about scripting languages (such as PHP and bash) is the built-in concept of a "map." A map is a data structure that allows a mapping of a key to some arbitrary value. Other words for this structure are "dictionary," "associative array," and "hash." The key must be unique (hence the term "dictionary:" a given key can have only one entry in the dictionary). It is also usually implemented via hashing (hence the term "hash"). In many languages, the means of manipulating a map are the same as those for manipulating an array (hence the term "associative array").

Maps come in two varieties: unidirectional and bidirectional. A unidirectional map (arguably the most common kind) is a typical "key → value" mapping; the keys must be unique, but the values may overlap. A bidirectional map allows you to go the other way as well; you can get the key for a given value (a "key ↔ value").

For this article, I'm assuming that you have a basic knowledge of maps. To brush up, or for more information, you can visit one of these sites:

For a sweet comparison of mapping support in various languages, you should visit:

http://en.wikipedia.org/wiki/Comparison_of_programming_languages_(mapping)

Maps in C: GLib's GHashTable
Today, we'll be concerned with "C," since that's the language that my company uses for most of its work.

In C, there is no native map support. In order to use maps, you will need to either use a library that provides them or make your own. It is now the year 2012, and if you are planning on making your own mapping library, then you are either (1) crazy, (2) interested in learning about how maps work, or (3) really smart and have either a very unique take on maps or a very unique situation to optimize for. While there are a few map implementations available in C, I'll be focusing on GLib's "GHashTable," since GLib provides numerous data structures (among other things).

A GHashTable is a unidirectional map, implemented via a hash table. It does just about everything that you could ever want from a map in C; you can find the documentation for it here:

http://developer.gnome.org/glib/2.30/glib-Hash-Tables.html

Since this is C, so you have no template magic to handle types. Without creating multiple different implementations, a generic mapping implementation must use "void*" pointers to arbitrary values (such as integers, doubles, structures, etc.), and this is what a GHashTable does. Remember, you must police yourself in terms of data types for the map; GLib cannot do it for you.

To see GLib's GHashTable implementation, you can go here:

http://git.gnome.org/browse/glib/tree/glib/ghash.c

The key parts of the structure can be summed up by:

gpointer* keys;
guint* hashes;
gpointer* values;

A "gpointer" is essentially "void*," so both "keys" and "values" are arrays of "void*" pointers, exactly what you'd expect for a general mapping implementation in C. "hashes" is an array of unsigned integers ("guint"). These three arrays have the same number of elements, and an index in one is an index in the others.

(Note that the actual implementation is not as simple as I'm going to explain.) If you want to map the text "some key" to the text "some value," then "some key" would (naturally) be the key, and "some value" would likewise be the value. A GHashTable is a hash, which means that the key is first run through a hashing function that returns an integer (hopefully unique) for the key. That hash value is stored in "hashes" array at an index that has had some more math magic applied to it (conceptually, the hashes array would be a 1:1 mapping with itself (index 3 would be equal to 3), but GHashTable does some additional work to keep the table from clustering and such); the pointer to the string "some key" is stored in that index in "keys;" and the string "some value" is stored at that index in "values."

GHashTable does not copy the content of the data that you give it; it just sticks the pointer in its arrays. This means that you have to allocate and free the data for the GHashTable. So, before you add a value to a GHashTable, you have to make sure that it is allocated in such a way that it will not go out of scope and cause all kinds of dereferencing trouble later. The logical way to do this is to "malloc" everything that goes into the GHashTable and "free" it when you're done. There are some conveniences that GHashTable provides to make this easier, but we'll get into that later.

In order for GHashTable to know how to hash the key and compare it to other keys, it needs to know two things: (1) the hash function, and (2) the equality function. Since two keys can hash to the same value, a further (or "final") step in retrieving or updating values is to do a comparison on the key at the hashed location to determine if it is identical. If not, GHashTable has some built-in logic around looking in other places for that same key.

Trivial cases: g_int_hash, g_int64_hash, g_double_hash, and g_str_hash

GLib provides four "common" hash and equality functions:

g_int_hash (equality function: "g_int_equal"): This allows you to use a 32-bit integer as a key (remember, the value is 100% up to you).
g_int64_hash (equality function: "g_int64_equal"): This allows you to use a 64-bit integer as a key.
g_double_hash (equality function: "g_double_equal"): This allows you to use a double-precision number as a key.
g_str_hash (equality function: "g_str_equal"): This allows you to use a string as a key.

All four of these functions will dereference the key given to them, which means that you must pass a pointer to the data, and that pointer must remain valid for the entirety of its time in the hash.

For example, this is the implementation of "g_int_equal:"

gboolean g_int_equal( gconstpointer v1, gconstpointer v2 ) {
return *((const gint*) v1) == *((const gint*) v2);
}

The first thing that this function does is cast the "void*" pointer to an integer pointer, and then it dereferences it. For strings, g_str_equal calls "strcmp" (which will dereference the pointers).

When using a GHashTable in real life, it is usually a really good idea to use the "g_hash_table_new_full" function, which lets you specify a function to call (i.e., a "destructor") when a key should be removed (and also one for the value). This saves you the trouble of manually looping through the GHashTable to free up memory.

Here is a complete example of how to create a mapping of an integer to its textual, English name.

#include <stdio.h>

#include <stdlib.h>

#include <string.h>

#include <glib.h>

/** This prints out a key/value combination, for use in "g_hash_foreach".

**/

void printKeyValue( gpointer key, gpointer value, gpointer userData ) {

int realKey = *( (int*)key );

char* realValue = (char*)value;

printf( "%d => %s\n", realKey, realValue );

return;

}

/** The main function.

**/

int main( int argc, char** argv ) {

//! This is our map.

GHashTable* map = NULL;

//! This is the pointer to our key data.

int* key = NULL;

//! This is the pointer to our value data.

char* value = NULL;

// Create the GHashTable.

map = g_hash_table_new_full(

g_int_hash, g_int_equal, //< This is an integer hash.

free, //< Call "free" on the key (made with "malloc").

free //< Call "free" on the value (made with "strdup").

);

// Insert some values.

// 7 => seven

key = malloc( sizeof(*key) );

*key = 7;

value = strdup( "seven" );

g_hash_table_insert( map, key, value );

// 10 => ten

key = malloc( sizeof(*key) );

*key = 10;

value = strdup( "ten" );

g_hash_table_insert( map, key, value );

// 9001 => over nine-thousand!

key = malloc( sizeof(*key) );

*key = 9001;

value = strdup( "over nine-thousand!" );

g_hash_table_insert( map, key, value );

// Print out the contents of the map.

g_hash_table_foreach( map, printKeyValue, NULL );

// Destroy the map.

g_hash_table_destroy( map );

map = NULL;

return 0;

}

The output is naturally:

7 => seven

10 => ten

9001 => over nine-thousand!

The important thing to note is that no memory is leaked. A GHashTable created with "g_hash_table_new_full" will do a good job of cleaning up after itself when destroyed (or when removing or duplicating keys).

What would happen if we didn't pass "key" the pointer to "g_hash_table_insert," but instead just passed the integer that we wanted to use? We would immediately segfault because "g_int_hash" would dereference "7" (our first integer) and find that 0x0...07 is not anywhere in our process space and crash.

As you can see, GHashTable is pretty cool, but it gets ugly when it comes to allocating memory for the keys and values. It makes sense to allocate memory for the value (since I used a string in this case), but we're not used to allocating memory for integers. However, all four of the "trivial" hash types provided by GLib will work this way.

Down and dirty case: g_direct_hash

So, now that we have an example under our belts and we understand how GHashTable is using our data, we can look at an "advanced" mechanism for hashing, "g_direct_hash."

Going back to the implementation of GHashTable, we can see that "keys" and "values" are arrays of "gpointer"s. What this really means is that we can stick anything we want in "keys" and "values" as long as it can fit within "sizeof( gpointer )". On a 32-bit system, that is 4 bytes; on a 64-bit system, that is 8 bytes. So, in reality, we could stick an actual integer into that "gpointer" array as long as we can tell GHashTable not to dereference it.

GHashTable provides exactly this mechanism via "g_direct_hash" (and "g_direct_equal," its equality counterpart). By using the "g_direct_hash" function, the "pointer" given to GHashTable will be used as the input to the hash function, not the value that it "points" to (since it is not a pointer and doesn't point to anything).

Using a "g_direct_hash" has the following advantages over the other hash types:

No memory is allocated for the key. Memory allocation is expensive, whether or not you may realize it. In a high-performance application, this can be a huge saving.
Hashing and comparison are faster. Since we don't need to dereference the pointer to the key, we save an operation (potentially expensive, since the value may well be outside of the CPU cache).
There is no cleanup for the key. Since we didn't allocate anything going in, we don't have to free anything going out.

However, a "g_direct_hash" has some disadvantages:

It is architecture-dependent. On a 64-bit system, it can store 8 bytes of data, whereas on a 32-bit system, it can only store 4 bytes of data. It is up to you to make sure that you know what you are doing.
It can only store structures that are no larger than a "gpointer." Your key should essentially be a 64-bit value on a 64-bit system (or use 32-bit values if you want compatibility with 32-bit systems).

Since the "values" array is implemented exactly as the "keys" array, you can apply the same logic to storing the values. This makes for very efficient mappings between integers and doubles, but not so much for strings over 7 characters (7 characters + "\0" = 8 bytes = 64-bits).

To make a GHashTable accept a non-pointer value, you have to coerce it into doing so using one of GLib's type-conversion macros: GINT_TO_POINTER. To get your key (or value) back, you need to convert the other way: GPOINTER_TO_INT. Both of these will be used below in our example.

For example, to insert our ( 7, "seven" ), we would do:

value = strdup( "seven" );

g_hash_table_insert( map, GINT_TO_POINTER(7), value );

Here is the example from above using "g_direct_hash".

#include <stdio.h>

#include <stdlib.h>

#include <string.h>

#include <glib.h>

/** This prints out a key/value combination, for use in "g_hash_foreach".

**/

void printKeyValue( gpointer key, gpointer value, gpointer userData ) {

int realKey = GPOINTER_TO_INT( key );

char* realValue = (char*)value;

printf( "%d => %s\n", realKey, realValue );

return;

}

/** The main function.

**/

int main( int argc, char** argv ) {

//! This is our map.

GHashTable* map = NULL;

//! This is the pointer to our value data.

char* value = NULL;

// Create the GHashTable.

map = g_hash_table_new_full(

g_direct_hash, g_direct_equal, //< This is an integer hash.

NULL, //< There is nothing to free for our key.

free //< Call "free" on the value (made with "strdup").

);

// Insert some values.

// 7 => seven

value = strdup( "seven" );

g_hash_table_insert( map, GINT_TO_POINTER( 7 ), value );

// 10 => ten

value = strdup( "ten" );

g_hash_table_insert( map, GINT_TO_POINTER( 10 ), value );

// 9001 => over nine-thousand!

value = strdup( "over nine-thousand!" );

g_hash_table_insert( map, GINT_TO_POINTER( 9001 ), value );

// Print out the contents of the map.

g_hash_table_foreach( map, printKeyValue, NULL );

// Destroy the map.

g_hash_table_destroy( map );

map = NULL;

return 0;

}

This concludes our discussion of GHashTable. Remember, you have to allocate and free everything that goes into a GHashTable on your own; you may use "g_hash_table_new_full" to have GHashTable do the freeing for you, however. You may take advantage of the knowledge of your architecture to use non-pointer data for your keys, using "g_direct_hash" (and your values, using your own code).

Monday, January 2, 2012

GLib, GHashTable, and g_direct_hash