Tutorial : C - Data Structures

Data Structures

One of the key elements in software design is to determine which data-structures are most appropriate for the problem at hand. Data-structures determine how information is stored and exchanged, and have a significant effect on the overall cohesiveness, clarity, and efficiency of the program.

The C language provides pointers, arrays and structs as the basic facilities upon which powerful high-level data-structures may be created. (However, these data-structures are not specific to C, and may be implemented similarly in other programming languages.) In this chapter we examine several of the more common constructions, but we only scratch the surface of the great ensemble of existing data-structures. For more information, the interested reader is directed to a good algorithms and data-structures textbook.1 For simplicity, the implementations presented below are written for specific data-types; converting them to generic code is left as an exercise.

 

 

Efficiency and Time Complexity

Efficiency, in terms of execution speed, can be approached at two levels. One is to optimise the execution of a set of loops and expressions in a process of code tuning. These are small-scale optimisations, which attempt to replace expensive operations with faster ones. The following are common examples.

• Replace array indexing with pointer operations.

• Replace power-of-two integer multiplication and division with bitwise shift operations.

• Manually unroll loops to avoid the overhead of branching and conditional expression evaluation for each iteration.

• Move non-changing expressions outside of loops to avoid unnecessary re-evaluation.

• Replace functions with macros to avoid function-call overhead.

The gains for such adjustments are typically small, and often come at the price of considerable loss of code clarity. Also, many of these optimisations are performed automatically by modern compilers. In fact, manual code tuning may actually result in slower execution than the original (simpler) code as the more obscure operations might be less amenable to compiler optimisation. In general, small-scale code optimisation should be avoided until proven necessary. That is, the program should be written with a focus on correctness and clarity, and subsequently tested for bottlenecks using a profiler. A profiler is a software tool that measures where a program spends most of its time, and enables code tuning to be focused where it is actually needed. Finally, to ensure that tuning actually has the desired effect, it is essential to measure the code segment execution time both before and after applying optimisations.

Some programmers pay too much attention to efficiency; by worrying too soon about little “optimizations” they create ruthlessly clever programs that are insidiously difficult to maintain. Others pay too little attention; they end up with beautifully structured programs that are utterly inefficient and therefore useless. Good programmers keep efficiency in context: it is just one of many problems in software, but it is sometimes very important

The second approach to efficiency concerns code organisation on a larger scale; it concerns the choice of algorithms and data-structures. These decisions tend to have a much greater effect on program speed, often by orders-of-magnitude. The central concept in this approach is the property of time complexity, which is a measure of how long an algorithm will take to complete given a certain size input. Time complexity is usually expressed in terms of O-notation which, in essence, states an upper-bound on computation to within a constant factor. For example, to search for a particular item in an unordered array of n elements is a linear operation; it takes on average O(n) iterations to find the item (or to show that the item is missing). Informally, O(n) may be understood as meaning “to the order n” or “linear with respect to n”. Common time complexities encountered for collections of n items include constant time O(1), logarithmic time O(log n), linear time O(n), and quadratic time O(n2).

Aside: A third important factor with regard to efficiency is the effect of the memory hierarchy on computation speed. Registers, caches, RAM, and virtual memory all incur different access penalties, with each lower level in the hierarchy often costing several orders-of-magnitude greater runtime than its predecessor. Different algorithms and data-structures have different memory interaction behaviours, and an algorithm that involves less instructions on paper may actually run significantly slower than another. Algorithms that exploit space optimisation, locality of reference, and other memory-friendly techniques are usually much faster than those that don’t. A thorough discussion of memory hierarchy issues is well beyond the scope of this text.

 

 

Arrays

The simplest of all data-structures is the array. It is directly supported by the C language and defines a contiguous sequence of elements in memory. It groups together a set of variables of the same type, and permits iteration over the set. If an array is of fixed size, known at compile-time, it can be allocated on the stack, otherwise, its size may be determined at run-time and allocated on the heap via malloc(). The expandable vector is a good example of an array whose size may be grown dynamically at run-time.

For an array of size n, to access or change the value of an array element is an O(1) operation. However, to insert or remove an element from the middle of an array is O(n), as the elements following that point must be shuffled along by one. A linear search of an array is O(n); a binary search of a sorted array is O(log n).

 

 

Linked Lists

A linked-list is a set of nodes, where each node contains an item of data and a pointer to the next node in the list. Linked-lists come in two basic varieties: singly linked and doubly linked. The implementation of singly linked-lists is covered, and so here we will present code for a doubly linked-list.

For simplicity, let the contained item be an integer. A list node is defined as

typedef struct List List;
struct List {
	int item; /* Contained item. */
	List *next; /* Pointer to next node in list. */
	List *prev; /* Pointer to previous node in list. */
};

The following functions perform node insertion and deletion operations, which may be applied at any point along the list.

1 struct List *insert after(List *node, int item)
2 /* Add new item to the next position in the list after ’node’. Return pointer to new node. */
3 {
4 /* Allocate memory for new node. */
5 List *newnode = (List *) malloc(sizeof (List));
6 if (newnode == NULL)
7 return NULL; /* allocation failed */
8
9 /* If list is not empty, splice new node into list. */
10 if (node) {
11 newnode−>next = node−>next;
12 newnode−>prev = node;
13 node−>next = newnode;
14 newnode−>next−>prev = newnode;
15 }
16 else { /* empty list, make new head node. */
17 newnode−>next = NULL;
18 newnode−>prev = NULL;
19 }
20
21 newnode−>item = item;
22 return newnode;
23 }
24
25 void delete node(List *node)
26 /* Remove node and reconnect links on either side. */
27 {
28 if (node−>prev) node−>prev−>next = node−>next;
29 if (node−>next) node−>next−>prev = node−>prev;
30 free(node);
31 }

10–15 Given a pointer to a node already in the list, the new node is inserted in the position immediately following. Existing links are connected to the new node to effectively splice it into the list.

16–19 However, if the node pointer passed to this function is NULL, it simply creates a new node with links in both directions pointing to NULL. That is, it creates a new list of length one.

28–30 To remove a node from a list, the nodes on either side are first connected together, and then the selected node is deleted. The if-statements check that the adjacent links were not NULL.

Insertion and removal of nodes from anywhere along a list is O(1), provided a pointer to the specified location is already available. Thus, linked-lists are a good choice if insertion/removal operations are frequent into the middle of a sequence. However, search along a list of size n is O(n), and so other data-structures (such as a sorted array) are preferred if the sequence is very long and searching is a common operation.

 

 

Circular Buffers

A circular buffer, like an array or linked-list, contains a sequence of elements. The difference is that this data-structure has fixed size and its end loops back to its beginning. Circular buffers are usually implemented with either an array or a linked-list as the underlying representation. (In the latter case, the end-node points back to the head-node.) In this section, we describe an array-based implementation.

The basic form of a circular buffer is shown in Figure 15.1. The buffer is allocated a fixed amount of space, and may contain no more elements than this maximum. There are two indexes into the buffer marking the front and back, respectively. The front index points to the empty slot where the next element will be added. The back index points to the next element to be extracted. Thus, items are pushed onto the front of the buffer and removed from the back. After each addition or removal, the associated index is moved forward to the next position and, when an index reaches the end of the buffer, it is cycled around to the beginning, so that the buffer appears as a continuous loop. (This continuity is represented naturally by a linked-list, with the connection of the end-node back to the head-node being no different to any other link. However, for an array-based implementation, special wrap-around operations must be performed for the end of the array.)

Circular buffers are very common in real-time and embedded systems where there are various processes that communicate with each other. One process is a producer and puts data into the buffer, and another is a consumer and removes data from the buffer. The buffer acts as temporary storage to allow both processes to operate asynchronously, with neither process having to wait for the other to complete a transaction.

For our array-based implementation, we define the circular buffer type as follows.

typedef struct CircBuf_t {
	ValueType *array; /* Pointer to array of items. */
	int size; /* Maximum number of items in buffer. */
	int nelems; /* Current number of items in buffer. */
	int front; /* Index to front of buffer. */
	int back; /* Index to back of buffer. */
} CircBuf;

The public interface is exported in a header file, and provides operations to create and destroy a buffer, to add and extract items, and to get the current size and maximum size of the buffer. Notice that, once again, the structure CircBuf is an opaque type with its definition hidden within the private interface.

1 typedef double ValueType;
2 typedef struct CircBuf t CircBuf;
3
4 /* create-destroy buffer */
5 CircBuf *create buffer(int size);
6 void destroy buffer(CircBuf *cb);
7
8 /* add-remove elements */
9 int add item(CircBuf *cb, const ValueType *item);
10 int get item(CircBuf *cb, ValueType *item);
11
12 /* query state */
13 int get nitems(const CircBuf *cb);
14 int get size(const CircBuf *cb);

The two functions below form the crux of the buffer data-structure—insertion and extraction. Their implementation is straightforward, with the wrap-around code being the only subtlety.

1 int add item(CircBuf *cb, const ValueType *item)
2 /* Add a new element to front of buffer.
3 * Returns 0 for success, and -1 if buffer is full. */
4 {
5 if (cb−>nelems == cb−>size)
6 return −1;
7
8 cb−>array[cb−>front]=*item;
9 if (++cb−>front == cb−>size) /* wrap around */
10 cb−>front = 0;
11 ++cb−>nelems;
12 return 0;
13 }
14
15 int get item(CircBuf *cb, ValueType *item)
16 /* Remove element from back of buffer, and assign it to *item.
17 * Returns 0 for success, and -1 if buffer is empty. */
18 {
19 if (cb−>nelems == 0)
20 return −1;
21
22 −−cb−>nelems;
23 *item = cb−>array[cb−>back];
24 if (++cb−>back == cb−>size) /* wrap around */
25 cb−>back = 0;
26 return 0;
27 }

5–6 Check whether buffer is full. Circular buffers generally deal with being full in one of two ways. First, as in this example, they might return a flag to indicate full status.2 A second option is to over-write existing items so that, for a buffer of size n, only the most recent n elements are retained. That is, when the buffer is full, the front index meets the back index. If another element is inserted, both indexes are incremented and the back item is lost.

9–10 When the front index reaches the end of the array, it must be reset to zero to point to the beginning of the array again. A common alternative to this comparison code is the following idiom, which uses the modulus operator to bound the index.

cb->front = (cb->front + 1) % cb->size; /* wrap around */

Aside: A common difficulty with all the data-structure implementations presented in this chapter, and with algorithm design in general, is ensuring correct behaviour for boundary conditions. Boundary conditions occur at the extremities of normal algorithm operation. For example, the boundaries for a circular buffer occur when the buffer is full or empty, or when the indices wrap around. Care must be taken to avoid “off-by-one” indexing errors, and to make the appropriate choices of pre- or post-increment. Checking boundary conditions is an important technique for finding subtle bugs.

The idea is that most bugs occur at boundaries. If a piece of code is going to fail, it will likely fail at a boundary. Conversely, if it works at its boundaries, its likely to work elsewhere too.

 

 

Stacks

A stack, also known as a last-in-first-out (LIFO) buffer, is used to store items and then extract them in reverse order. The stack grows upwards as items are pushed onto the top, and shrinks back down as items are popped off the top. Stacks with fixed maximum size are usually implemented with a simple array. If a stack does not have a fixed maximum, it might be implemented using a dynamic representation such as our expandable vector.

As an example implementation, consider the following data type for a stack with maximum size MAXSIZE.

typedef struct Stack {
	double buffer[MAXSIZE]; /* Stack buffer. */
	int count; /* Number of elements in stack. */
} Stack;

The code for pushing items onto the stack and popping them off is trivial. (We neglect bounds checking, for brevity.)

void push(Stack *s, double item) { s->buffer[s->count++] = item; }
double pop(Stack *s) { return s->buffer[--s->count]; }

 

 

Queues

A queue, otherwise known as a first-in-first-out (FIFO) buffer, represents the other side of the coin to a stack. Where a stack pushes and pops items from the top (or front) of the buffer, a queue pushes items on at the front and pops them off at the back. Thus, a queue removes items in the same order as they were placed onto it.

The basic form of a queue. If a queue has fixed maximum size, it becomes
equivalent to a circular buffer. Variable length queues are a little more complicated. That is, they are difficult to implement efficiently with an expandable vector, but are well suited to implementation with a singly linked-list. Another alternative is to use a deque, which is a linked-list of short arrays and provides certain efficiency advantages of both arrays and lists.

The following implementation demonstrates a fixed size queue, which is a simple wrapper of the circular buffer described above. Once again, we neglect error checking for the sake of brevity, but the basic operations are clear.

typedef CircBuf Queue;
void push(Queue *q, ValueType item) { add_item(q, &item); }
ValueType pop(Queue *q) { ValueType v; get_item(q, &v); return v; }

 

 

Binary Trees

Binary trees are self-referential structures like linked-lists; they consist of nodes that point to other nodes. However, where list nodes point to adjacent nodes in a linear manner, tree nodes point to child nodes in a branching descent structure. Each node has two pointers, which point to a left and a right child node, respectively. If a particular child does not exist, the pointer is NULL, and if both pointers are NULL, the node is called a leaf node. The top node in the tree is called the root node.

Binary trees are useful for storing data in sorted order. If the data were to be stored first and sorted second, then an array or list would be the right data-structure for the job. But neither is particularly efficient for storing data in sorted order as it arrives; they both take O(n) time to insert a new item into a set that already contains n elements. For a balanced tree containing n elements, insertion of a new node in sorted order takes O(log n) time. (A balanced tree is a tree with the minimum possible depth; it has no nodes at a level lower than the highest NULL branch. ) Furthermore, a balanced tree permits O(log n) search for items, and the entire tree may be examined in sorted order in O(n) operations.

The following implementation presents a binary tree that contains two items of ancillary data: a string and an integer. This structure, and the associated functions, are used to implement a wordcounting algorithm similar to that presented. From this example, more generic use of binary trees should be apparent. The Tree structure contains the two data items and pointers to left and right Tree nodes as follows.

struct Tree {
	char *item; /* Contained string. */
	int count; /* Count of string appearances. */
	Tree *left; /* Pointer to left-child. */
	Tree *right; /* Pointer to right-child. */
};

The module’s public interface exports the name of the Tree data-type and four functions to manipulate the tree. These perform operations to add new strings, count the occurrences of a specified string, print all strings in lexicographic order, and delete the entire tree structure, respectively

1 typedef struct Tree Tree;
2
3 Tree *add item(Tree *root, const char *item);
4 int count item(Tree *root, const char *item);
5 void print inorder(Tree *root);
6 void destroy tree(Tree *root);

The function below is the implementation for adding a new string to the tree. Its basic operation is to first search for whether the word already exists in the tree. If so, it increments the count for that word. Otherwise, it adds the new word to the tree with a count of one. Trees (and linked-lists also), being self-referential structures, are well suited to recursive algorithms, and this is the case here: add_item() calls itself recursively until either the word is found, or an empty space is located in which to store a new word.

1 Tree *add item(Tree *node, const char *item)
2 /* Search for whether item already exists in tree. If not, add it to first empty
3 * node location (in lexicographical order), if found, increment word count. Perform
4 * recursive descent for search, return pointer to current node. */
5 {
6 int cmp;
7
8 if (node == NULL) /* found empty tree location, add item */
9 return make node(item);
10
11 /* Recursive comparison to put item in correct location. */
12 cmp = strcmp(item, node−>item);
13 if (cmp < 0)
14 node−>left = add item(node−>left, item);
15 else if (cmp > 0)
16 node−>right = add item(node−>right, item);
17 else
18 ++node−>count; /* item already in tree, increment count */
19
20 return node;
21 }

8–9 If the passed pointer is NULL, a new node is created by calling make_node() and a pointer to this node is returned. The make_node() function is part of the module’s private interface; it allocates memory for the new node, and initialises it with a copy of the passed string and a count of one.

12 The binary tree stores words in lexicographic order. This ordering is accomplished using strcmp() to determine whether a word is less than, greater than, or equal to the word contained by the current node.

13–14 If the word is less than the node word, we recurse using the node’s left-hand child. Notice how the return value of add_item() is used to connect lower-level nodes with their parent nodes.

15–16 If the word is greater than the node word, we recurse using the node’s right-hand child.

17–18 If the words are equal (i.e., a match has been found), the count for that node is incremented and the recursive search terminates.

There are three points to note from this function. The first is that the recursive search terminates when either a word match is found (lines 17 and 18) or we reach a NULL node (lines 8 and 9) indicating that we have a new word. Second, when a new child node is created, it is attached to its parent node via the return value (lines 9, 14 and 16). And third, the recursion, as it splits to the left and right, orders insertion and search, giving the tree its O(log n) properties.

The next two functions perform binary search and in-order visitation, respectively. The first, count_item() searches the tree for a word match, and returns the word count if a match is found, and zero if it is not. (The word count is the number of times a particular word was sent to add_item().) This function demonstrates an iterative (i.e., non-recursive) method for traversing the tree. The second function, print_inorder() visits every node in the tree and prints the stored word and word count. (The function print_node() is part of the module’s private interface.) The recursive implementation of print_inorder() causes the nodes to be printed in sorted order.

1 int count item(Tree *root, const char *item)
2 /* Search for item in tree and return its count. */
3 {
4 	while (root) {
5 		int cmp = strcmp(item, root−>item);
6 		if (cmp < 0)
7 			root = root−>left;
8 		else if (cmp > 0)
9 			root = root−>right;
10 		else
11 			return root−>count;
12 		}
13 		return 0;
14 	}
15
16 void print inorder(Tree *node)
17 /* Print tree in lexicographical order */
18 {
19 	if (node == NULL) return;
20 		print inorder(node−>left);
21 		print node(node);
22 		print inorder(node−>right);
23 }

The basic binary tree, as presented here, is sufficient for a great many situations. It is well suited to problems where the data arrives in random order, such as the words from a book. However, it behaves very inefficiently if the data does not arrive in random order and the tree becomes unbalanced. In the worst case, if the data is added to the tree in sorted order, the tree obtains the appearance and properties of a linked-list, with insert and search times being O(n).

Various solutions exist that resolve this problem. Advanced binary tree implementations, such as red-black trees, remain balanced for any input. Also, a data-structure called a skip list, while entirely different to a binary tree, possesses the same insertion and search properties as for a balanced binary tree.

Finally, even when the data is suitable for a simple binary tree implementation, it might not be the best data-structure for the job. Trees are best suited to tasks where the data is to be in sorted order during the insertion phase. However, if the data is to be stored in any order, and fast search is required subsequently, it is usually more efficient to store the data in an (expandable) array and then sort the array. With a good sorting algorithm, the time required to sort an array is less than the time to insert data into a tree. Similarly, the time to perform a binary search of a sorted array is generally less than the time to search a tree. Also, arrays consume less space than trees. The key advice here is to be aware of the tradeoffs between data-structures, and know when one is likely to be more suitable than another.

 

 

Hash Tables

A hash table is a data-structure that uses a hash function to compute an index into an array. The most common form of hash table, and the easiest to explain, is one that combines an array of pointers with a set of linked-lists. The basic form of this type of hash table. Each pointer in the array of pointers points to the head of a singly linked-list. A list may be empty, in which case the pointer is NULL. Each element in the array is called a “bucket”, and the list pointed to by a bucket is called a “chain”.

The operation of a hash table is as follows. Given an item of data to be stored in the table, the hash function computes an index based on the value of this item. The index is such that it falls within the bounds of the pointer array, and so specifies one of the buckets. The selected bucket points to a linked-list, which is searched to check whether the item is already stored, otherwise the item is added to the front of the list.

The key to an efficient hash table is a good hash function. Essentially it should distribute items evenly between the different buckets and not favour any particular bucket over another. The derivation of hash functions involves fairly advanced mathematics including aspects of probability theory and prime number theory. Also, the length of the array of pointers should not be arbitrary but itself a prime number. We will not discuss these issues further as they are beyond the scope of this text.

A hash table is useful for implementing very fast lookup tables. The hash function computes a bucket index in O(1) time and, assuming the chain is short, the linear link-list search is very quick. Provided the hash function distributes items evenly, the chains will be short enough so that the entire operation may be considered O(1). Thus, on average, hash tables permit O(1) lookup, although in the worst case, where the hash function places all items in a single bucket, lookup can be O(n).

In the following example we use a hash table to implement a dictionary. The dictionary is built up of words and their associated definitions. These word-definition pairs are stored in the hash table using the word as a search key. The hash table permits fast insertion, search and deletion, and the public interface for this module is shown below.

1 typedef struct Dictionary t Dictionary;
2
3 Dictionary *create table(void);
4 void destroy table(Dictionary *);
5
6 int add word(Dictionary *, const char *key, const char *defn);
7 char *find word(const Dictionary *, const char *key);
8 void delete word(Dictionary *, const char *key);

The next section of code shows part of the private interface. The #define (line 1) specifies the size of the array of pointers (i.e., the number of buckets). Notice this value is prime. Lines 3 to 7 define the link-list node type. Each node in a chain contains a word, its definition, and a pointer to the next node. The Dictionary type (lines 9 to 11) contains an array of pointers to the chain nodes; this is the array of buckets. The hash function (lines 13 to 22) is a complicated device, and we will not elaborate on its workings here. it “is not the best possible hash function, but it is short and effective.” Notice that it takes a string argument and converts it to a bucket index.

1 #define HASHSIZE 101
2
3 struct Nlist {
4 char *word; /* search word */
5 char *defn; /* word definition */
6 struct Nlist *next; /* pointer to next entry in chain */
7 };
8
9 struct Dictionary t {
10 struct Nlist *table[HASHSIZE]; /* table is an array of pointers to entries */
11 };
12
13 static unsigned hash function(const char *str)
14 /* Hashing function converts a string to an index within hash table. */
15 {
16 const int HashValue = 31;
17 unsigned h;
18
19 for (h = 0; *str != ’\0’; ++str)
20 h = *str + HashValue * h;
21 return h % HASHSIZE;
22 }

The two functions that follow demonstrate the main workings of the hash table algorithm. The first, and most instructive, is add_word(), which takes a word-definition pair and adds it to the table. If the word is already stored, the old definition is replaced with the new one, otherwise, if the word is not found, a new table entry is added. The second function, find_word(), uses the same search mechanism as add_word() to determine if a word is stored in the table and, if so, returns the associated definition.

1 int add word(Dictionary *dict, const char *key, const char *defn)
2 /* Add new word to table. Replaces old definition if word already exists.
3 * Return 0 if successful, and -1 is fails. */
4 {
5 unsigned i = hash function(key); /* get table index */
6 struct Nlist *pnode = dict−>table[i];
7
8 while (pnode && strcmp(pnode−>word, key) != 0) /* search chain */
9 pnode = pnode−>next;
10
11 if (pnode) { /* match found, replace definition */
12 char *str = allocate string(defn);
13 if (str == NULL) /* allocation fails, return fail and keep old defn */
14 return −1;
15
16 free(pnode−>defn);
17 pnode−>defn = str;
18 }
19 else { /* no match, add new entry to head of chain */
20 pnode = makenode(key, defn);
21 if (pnode == NULL)
22 return −1;
23
24 pnode−>next = dict−>table[i];
25 dict−>table[i] = pnode;
26 }
27 return 0;
28 }
29
30 char *find word(const Dictionary *dict, const char *key)
31 /* Find definition for keyword. Return NULL if key not found. */
32 {
33 unsigned i = hash function(key); /* get table index */
34 struct Nlist *pnode = dict−>table[i];
35
36 while (pnode && strcmp(pnode−>word, key) != 0) /* search index chain */
37 pnode = pnode−>next;
38
39 if (pnode) /* match found */
40 return pnode−>defn;
41 return NULL;
42 }

5–6 The word is passed to the hash function, which computes an array index. This bucket, in turn, points to the head of a node chain.

8–9 We search the length of the chain looking for a string match between the keyword and a node word.

11–18 If the node pointer is not NULL then a match was found before the end of the chain. Thus, we replace the old definition with the new one. (Note, the function allocate_string() is part of the module’s private interface and makes a duplicate of the passed string.)

19–26 If the end of the chain was reached, then the keyword is new and is added to the head of the chain. (Note, the function makenode() is part of the module’s private interface; it creates and initialises a Nlist node.)

33–37 This code is identical to lines 5 to 9 in add_word().

39–41 If the keyword is found, return a pointer to its definition, otherwise return NULL.

The above example demonstrates a specific hash table implementation for storing a specific type of data (i.e., strings). Writing a generic version of a hash table is not trivial because different data types typically require different hash functions. However, a generic implementation is possible by using a function pointer to permit user-supplied hash functions. (A default hash function might be called if the user passes NULL.)

As always, deciding whether a hash table is the right data-structure for a particular problem is a matter of considering tradeoffs. Hash tables provide very fast O(1) add, delete, and find operations on average, if supplied with an effective hash function. However, they can have bad O(n) worst-case behaviour, and in some circumstances the O(log n) worst-case complexity of a balanced tree (e.g., a red-black tree) might be preferred.