Sorting Algorithms in C - The Fastest Sort Ever

Started by dhilipkumar, Mar 13, 2009, 10:22 PM

Previous topic - Next topic

dhilipkumar

Sorting Algorithms  in C - The Fastest Sort Ever

In the first issue of this little series about sort algorithms we only scratched the surface of the problem: we analysed a couple of ways to do the job the dirty way. Sure it was quick, but it was way too complex for any project involving a database. Since in real world it's not unusual to deal with "lists" of some thousands elements, neither the bubble sort nor the insertion sort can ensure acceptable performance because their O(n^2) complexity. Luckily, there are way better solutions.

The most widely adopted solution is to use one of the recursive sort algorithms. These guys are based on the use of recursive functions to divide the problem into a number of less complex problems; this allows them to achieve a complexity in the order of nlogn. Recursive algorithms will be the argument of the next article. Today instead we'll talk about yet another iterative algorithm, the counting sort.

Now, maybe the title was a bit too much. I'm not at all sure that there is no faster algorithm than the counting sort, and I'll say that probably there is at least one. But the complexity of the counting sort is as low as O(n), which is the lowest among the most commonly used sort algorithms. Let's see how it works and what are its requirements.

The counting sort algorithm is designed to sort a list of integers v(i), belonging to an interval ranging from 1 to k. The integers can actually have any value; if we call m = min(v) and M = max(v), then k = M - m. The procedure can sort either in ascending or descending order; from now on, I'll take into consideration the case where the integers are to be sorted in ascending order (lower values on lower indexes). As before, the term "list" will refer equivalently to an array or a linked list, or any other kind of list.

The basic idea is to count how many elements of the list are lower than the element currently being processed. Then the list will be rebuilt from scratch, inserting each element in the right position. This is different than the insertion sort, in that the element is not shifted through the list until it reaches the right position, but it's positioned in the required place in just one instruction.

Let's see an example. Let's assume the vector v to be sorted is as follows:

i => v(i)
1 => 3
2 => 2
3 => 5
4 => 7
5 => 5
6 => 4

The algorithm computes a list, count, such that count(i) is the number of elements whose value is less than i (so i ranges from m to M). This is done with a two-step process: first the elements are counted, then the count list is "integrated". So, we count:

i => count(i)
2 => 1
3 => 1
4 => 1
5 => 2
6 => 0
7 => 1

and then we integrate, i.e. each element of count gets substituted with the sum of all the values in lower indexes:

i => count(i)
2 => 1
3 => 2
4 => 3
5 => 5
6 => 5
7 => 6

Now mind one thing: the last element of count is equal to the number of elements to sort. This happens because all the elements are less than the maximum (7 in this case). Now we only need to construct the array r to be returned, following this procedure:

for each element of v, we look at count(v(i));
this will tell us where v(i) should be placed;
then we decrement count(v(i)), so that next element with that value will not be inserted in the same place.


So we consider the first element, v(1), i.e. 3; then we use 3 as the index and look for count(3), which is 2; so we will give r(2) the value 3. A little circuitous, I know. Basically it's r(count(v(i))) = v(i). So the procedure goes like this:

i => v(i) => count(v(i))
1 => 3 => 2 (after this, count(v(i)) is decremented; c: 1 1 3 5 5 6) (r: - 3 - - - -)
2 => 2 => 1 (c: 0 1 3 5 5 6) (r: 2 3 - - - -)
3 => 5 => 5 (c: 0 1 3 4 5 6) (r: 2 3 - - 5 -)
4 => 7 => 6 (c: 0 1 3 4 5 5) (r: 2 3 - - 5 7)
5 => 5 => 4 (c: 0 1 3 3 5 5) (r: 2 3 - 5 5 7)
6 => 4 => 6 (c: 0 1 2 3 5 5) (r: 2 3 4 5 5 7)

As you see, things work! The algorithm is a bit difficult to understand; I hope that the example was detailed enough. The advantage of this procedure is its very low complexity: it includes scanning twice a list of n elements and scanning twice a list of k elements, so the total complexity is O(n+k).

The algorithm is particularly useful when k is comparable to n. If the values are too spread out, k will be dominant; but even if the maximum and the minimum are very close, the number of elements to be sorted will still dominate the complexity. Mind that the value of k also determines the amount of memory that the algorithm will need (since count has k elements).




dhilipkumar


A C implementation of the algorithm is given below.

1:  int *countingSort(int *v, int n)
2:  {
3:      int *ris, *count;
4:      int i, k, min;
5:   
6:      // Translation of v to an interval [0, k]
7:      // Search k = max(v) and min = min(v)
8:      for (i=1, k=v[0], min=v[0]; i<n; i++)
9:      {
10:          if (v > k)
11:              k = v;
12:          if (v < min)
13:              min = v;
14:      }
15:      // The values are actually translated
16:      if (min != 0)
17:      {
18:          k -= min;
19:          for (i=0; i<n; i++)
20:              v -= min;
21:      }
22:   
23:      if ((ris = (int *)malloc(n*sizeof(int))) == (int *)NULL)
24:          return NULL;
25:      for (i=0; i<n; ris=min-1, i++);
26:      if ((count = (int *)malloc((k+1)*sizeof(int))) == (int *)NULL)
27:          return NULL;
28:      for (i=0; i<=k; count=0, i++);
29:   
30:      // Count and integrate
31:      for (i=0; i<n; (count[v])++, i++);
32:      for (i=1; i<=k; count+=count[i-1], i++);
33:   
34:      // Construction of the sorted array
35:      for (i=0; i<n; i++)
36:      {
37:          ris[count[v]-1] = v + min;
38:          count[v]--;
39:      }
40:   
41:      free(count);
42:   
43:      return ris;
44:  }
45:   
46: 


Note that the algorithm can only be applied to integer values. This also includes characters, using their ASCII table representation. It also includes any data structure that has an integer as its key.

dhilipkumar

Sorting Algorithms - basic level


Hi everyone! This is not my first article/tutorial in general, but it's the first one on OSIX so I thought I'd say hello.

The problem I'm going to address here is simple and ubiquitous: given a list of elements, an algorithm should sort the list based on the value of a key field. The list is usually an array; it could be a linked list in case you don't know how many elements it includes (or in case it's subject to alteration from the user), but few changes are needed for list sort.

The elements can be any kind of data structure. What's important is that at least one field of the structure can be used as a key, i.e. a (not necessarily unique) index associated to each element that you can compare with other keys. The best example for a key would be an integer: you can easily understand that 3 is less than 5 and greater than 2. It's also good to use floating-point numbers and also characters or strings of characters, if you consider the ASCII values of every character, and any other type you can define a comparison function about.

When dealing with exposition and analysis of a sorting algorithm, it's common to consider arrays of integers, but the principle can be extended to any kind of data structure. To keep things general, a sorting algorithm would require two user-defined functions:

a compare(a, b) function, which returns 1 if a is "greater than" b (in the meaning which is relevant to the data type), -1 if it's the opposite and 0 if they're equal;
a swap(a, b) function, which swaps the contents of the two data structures (or at least copies the content of one structure into another one).


In the following, I'll make the assumption that we're sorting an array of integers in ascending order, in order to keep the code simple. You can easily adapt the code to suit your needs.



Bubble sort

The first algorithm that is usually presented in programming classes is the bubble sort algorithm. It's based on this idea: if a number x greater than y is before y (i.e. it has a lower index in the array), then we should swap them. So the algorithm checks every pair of elements and tests their values, and if necessary swaps them.

Let's see an example. For every element (or rather for every position in the array) we check the pair formed by that element and any other "later" element:
3 6 2 4
3 6 2 4
3 6 2 4 => 2 6 3 4
2 6 3 4
2 6 3 4 => 2 3 6 4
2 3 6 4
2 3 6 4 => 2 3 4 6


As you can see, things work! Basically, each iteration considers a position in the array, and makes sure that the element is the minimum available (assuming that previous elements have a lower key, which is granted by the previous iterations).

dhilipkumar

A C function which implements bubble sorting for an array of integers follows. Only for this time, I'll show the proper use of the compare and swap functions I talked about before; in this case, they're really just redundant.

1:  int compare(int a, int b)
2:  {
3:      if (a > b)
4:          return 1;
5:      else if (a < b)
6:          return -1;
7:      else
8:          return 0;
9:  }
10:   
11:  void swap(int *a, int *b)
12:  {
13:      int c = *a;
14:      *a = *b;
15:      *b = *c;
16:      return;
17:  }
18:   
19:  void bubbleSort(int *v, int n)
20:  {
21:      int i, j;
22:   
23:      for (i=0; i<n; i++)
24:          for (j=i+1; j<n; j++)
25:              if (compare(v, v[j]) == -1)
26:                  swap(&(v), &(v[j]));
27:   
28:      return;
29:  }
30: 


I know that the two auxiliary function might not be the best example of code efficiency, but this is not the purpose of this article (yet). This is a different problem, there are thousands of better solutions (like the slightly faster XOR swap). I also know that the auxiliary functions here provide only worthless complication; in this case, lines 25-26 are better substituted with
            if (v > v[j])
            {
                int tmp = v;
                v = v[j];
                v[j] = tmp;
            }


One could argue whether or not to include the "equal to" case. I'd say it's better not to include it, since this might reduce the number of swaps, but it doesn't affect the correctness of the algorithm.

The function bubbleSort needs the array of integers and the number of elements in the array, then scans the array "twice" to check for unsorted pairs. At the end of the procedure, the v array is sorted in ascending order.


dhilipkumar



Insertion sort

Another simple iterative algorithm is the insertion sort. It works like this: we consider each single element in the array, and we shift it to the correct position, i.e. the highest position where it is greater than the element at its left (if the array index grows from left to right). This is done by swapping the element with the one at its left as long as the element on the left is greater.

Let's see an example to clear things out:
2 7 5 9 3 4 6
2 7 5 9 3 4 6
2 7 5 9 3 4 6
2 7 5 9 3 4 6 => 2 5 7 9 3 4 6
2 5 7 9 3 4 6
2 5 7 9 3 4 6 => 2 5 7 3 9 4 6 => 2 5 3 7 9 4 6 => 2 3 5 7 9 4 6
2 3 5 7 9 4 6 => 2 3 5 7 4 9 6 => 2 3 5 4 7 9 6 => 2 3 4 5 7 9 6
2 3 4 5 7 9 6 => 2 3 4 5 7 6 9 => 2 3 4 5 6 7 9


Got it? The name derives from the fact that we virtually divide the array into two sections, the leftmost one that contains already sorted numbers, and the rightmost one with the numbers yet to sort, then we consider the first non-sorted element and we insert it in the right position. The algorithm can be implemented in C as follows:

int *insertionSort(int *v, int n)
{
    int i, j, key;

    for (j=1; j<n; j++)
    {
        key = v[j];
        i = j - 1 ;
        while (i >= 0 && v>key)
        {
            v[i+1] = v;
            i--;
        }
        v[i+1] = key;
    }

    return v;
}


I copied this function from an exercise for my programming course, so I'm sure it works. Reading it now I'm not sure there's any need to make the function return the int pointer, since changes should be done directly on the array to be sorted. What's certain, that doesn't hurt :)

The code does just what we just told. For each number, it keeps swaping it with the element at its left until the one on the left is greater (or until there's no element on the left, i.e. the number that's being considered is the minimum and has been shifted all the way through the array).

dhilipkumar

Why these guys are "The trivial ones"

These pieces of code have the wonderful property of being short and therefore easy to remember on top of your head. Let's say you're writing a program like an address book for example, and you are working on the hard things to do, like cross-referencing your entries or whatnot - but before that you need to sort your entries alphabetically. If you don't want to waste times recreating (or looking up) better algorithms, you can write a bubble sort from scratch in a minute and have a working sort technique. Unless your application has to order just a few elements, though, none of these two algorithms can provide the performance you might expect from your application.

The reason for this is that both the bubble sort and the insertion sort consider every pair of elements in the list - maybe not really every of them, but a lot nonetheless. Regardless of its input, bubble sort always performs the checks (and often the swaps), so its complexity is very close to O(n2); this means that if you need to sort 1000 elements, it'll take you about one million operations!

Insertion sort is slightly better in that it can "recognise" a vector which is already sorted. Every time an element is considered, the algorithm just checks the element at its left, sees that it's smaller and moves on to the next iteration. In such a condition, its complexity is O(n). Yet, with random or unpredictable input, its complexity becomes O(n2); the worst case is when the array is inversely sorted, in which case the number of swaps is clearly quadratic.

With randomly sorted arrays, the insertion sort completes the task in about half the time required by the bubble sort, which is quite an improvement. Yet there are much faster algorithms, whose complexity is O(nlogn) (recursive algorithms), or even O(n) for some very specialized cases. In the next article, I'll show you one of these special situations, where the keys are integers. After that, we'll talk about recursive algorithms, which are the fastest things available for general purpose sorting. Comments, critiques and corrections are always appreciated.

dhilipkumar



Here's a look at which sorting algo is the fastest, looks like the Shell method takes the gold by a pretty big amount.

void shellSort(int numbers[], int array_size)
{
int i, j, increment, temp;

increment = 3;
while (increment > 0)
{
   for (i=0; i < array_size; i++)
   {
     j = i;
     temp = numbers;
     while ((j >= increment) && (numbers[j-increment] > temp))
     {
       numbers[j] = numbers[j - increment];
       j = j - increment;
     }
     numbers[j] = temp;
   }
   if (increment/2 != 0)
     increment = increment/2;
   else if (increment == 1)
     increment = 0;
   else
     increment = 1;
}
}