Wednesday 19 February 2014

How does sort actually work?

I talked about sort in my last post, but I started to read more about it and got wondering, how does it actually work?

You can use sort to order lists numerically or lexically. To sort lexically you use the cmp operator within the curly braces:
1.   sort{$a cmp $b}("please", "sort", "this", "list");
and for sorting numerically, you use the <=> operator:
1.   sort{$a <=> $b}(13,2,8,1,3,1,5,21);
What I think this means is that $a and $b refer to the two strings or numbers that are being compared when the algorithm is run. The letter "a" comes before the letter "b" in the alphabet and $a is put in front of $b in the code which means that words beginning with a letter that comes before other words in the alphabet should be sorted in front of those other words. Hopefully this all makes sense and is actually accurate.

Perl 5.6 and earlier used the quicksort algorithm and perl 5.7 and onwards, uses the merge sort algorithm. On wikipedia (http://en.wikipedia.org/wiki/Merge_sort) there's a pretty good animation at the top of the page that shows how merge sort works.

Quicksort - perl 5.6 and earlier

I remember learning this at school and uni and thinking it's pretty cool but also wondering what it was used for. We had to write a program in Java that carried out a quicksort. I should have just submitted one line of perl code...

Quicksort is a divide and conquer algorithm and there are different variations of it based on which position you choose for the pivot (explanation very soon) but I couldn't find out which version was used for sort in perl. So for simplicity, and because it's how I was taught, I'm going to choose the pivot and the value in the middle of the list.

So we have a list of numbers (assuming we're using numbers and sorting them in ascending numerical order) and the value in the centre of the list is the pivot. All numbers lower than the pivot are put on the left of this pivot and all numbers higher are put on the right. This creates three different lists - numbers lower than the pivot, numbers higher than the pivot and the pivot itself. The pivot list is one element long and is considered sorted so now we need to sort the other two lists in the same way. For each list, a pivot is chosen and again, lower numbers go on the left and higher numbers go on the right. This now creates seven lists. This continues until you have lists containing only one element, putting these lists all together will give one sorted list. Here's an example (please feel free to skip the example if you already know how it works or if it doesn't interest you!):

(4,2,9,6,5,1,8,7,3)

The number in the centre of the list is the pivot.

Starting from the left, numbers lower than 5 are put into one list to the left and numbers higher than 5 are put into another list on the right. 4 is taken out first giving:

(4)  (2,9,6,5,1,8,7,3)

Then 2 is taken out and put in the list on the left:

(4,2) (9,6,5,1,8,7,3)

Then 9:

(4,2) (6,5,1,8,7,3) (9)

This is repeated, excluding the pivot until we have three list with the pivot on its own:

(4,2,1,3)  (5)  (9,6,8,7)

Pivots are now chosen from any lists with more than one element. I think this time, as there is no real middle element, I will use the (n+1)/2th element.

(4,2,1,3)  (5)  (9,6,8,7)

The same is applied starting with the first list - all element lower than one go on the left and all numbers higher than one go on the right:

(1) (4,2,3) (5) (9,6,8,7)

The same is applied to the other list:

(1) (4,2,3) (5) (6,7) (8) (9)

Again, pivots are picked for the lists with more than one element:

(1) (4,2,3) (5) (6,7) (8) (9)

And the same sorting is applied giving:

(1) (2) (3,4) (5) (6) (7) (8) (9)

Then the final list that contains more than one element is sorted:

(1) (2) (3) (4) (5) (6) (7) (8) (9)

Put this all together and you have a sorted list:

(1,2,3,4,5,6,7,8,9)

Merge Sort - perl 5.7 onwards

So all of that wasn't entirely relevant for those of you that have upgraded your perl version since 2002 because from perl 5.6, the sort function has used the merge sort algorithm.

Merge sort is also a divide and conquer algorithm. It works by splitting up the list into individual elements. Each element is compared to its neighbour and (again assuming we're using numbers and sorting them in ascending numerical order) the smaller number is put on the left and the larger on the right. Then the next set of pairs are compared and so on until the end of the list. Then these sorted pair are compared to each other and ordered to make groups of four. Then these sorted groups are compared to the other groups and this is repeated until the whole list is in one big, sorted group. Hopefully this example will make all this clearer (again, feel free to skip this bit if you don't really care!):

Say we have a list of:

(4,2,6,5,1,8,7,3)

We then split it into sublists which contain one individual element. This means we have eight sublists of single elements:

(4)  (2)  (6)  (5)  (1)  (8)  (7)  (3)

Then we compare the elements to their next door neighbour. First will be (4) and (2). 2 is smaller so goes in front of 4 and we have our first sorted pair of (2,4)

(2,4)  (6)  (5)  (1)  (8)  (7)  (3)

The smallest sublists are always compared first so we compare the next two single element sublists - (6) and (5) giving:

(2,4)  (5,6)  (1)  (8)  (7)  (3)

This is continued until all the sublists contain two sorted elements:

(2,4)  (5,6),  (1,8)  (7)  (3)

(2,4)  (5,6)  (1,8)  (3,7)

Then the pairs are compared to each other starting with (2,4) and (5,6) and always comparing the first element of each list. First 2 and 5 are compared, 2 is smaller so is taken out to make a new list:

(2,4)  (5,6)  (1,8)  (3,7)

(2

Then the first elements of the target sublists are compared so 4 and 5, 4 is smaller so is taken out making:

(4)  (5,6)  (1,8)  (3,7)

(2,4

Then there's only one target sublist left so elements are taken out one by one from the front:

(5,6)  (1,8)  (3,7)

(2,4,5

(6)  (1,8)  (3,7)

(2,4,5,6)

And finally:

(2,4,5,6) (1,8) (3,7)

Then the pairs (1,8) and (7,3) are compared in the same way with intermediate steps:

(2,4,5,6) (1,8) (3,7)

(1

(2,4,5,6) (8) (3,7)

(1,3

(2,4,5,6) (8) (7)

(1,3,7

(2,4,5,6) (8)

(1,3,7,8)

Giving us two sorted lists of:

(2,4,5,6) (1,3,7,8)

Finally these two sublists are compared in the same way to give one sorted list of

(1,2,3,4,5,6,7,8)

If that doesn't make sense, again, I highly recommend looking at the animation on the wikipedia page.


Why was quicksort ditched in favour of merge sort?

According to the perl docs, the quicksort algorithm used was unstable because it can become quadratic. What????

A stable sort preserves the order of list elements that compare equal so maybe the quicksort algorithm kept sorting list items that were exactly the same as each other, which made the run time and complexity higher.

The run time of both of these algorithms comes out as O(n log n) when averaged over all arrays of length n. However, because quicksort does not always preserve the order of equal elements, its run time can become O(n^2) - which is a quadratic because the n is squared. This behaviour doesn't happen with merge sort so quicksort was replaced.

The perl docs do say that for "some inputs" on "some platforms" the original quicksort was faster, but no other information is given - I'm not sure which inputs or which platforms, but I'm sure that these are a minority. Also as an extra note, I think that 99.99% of the time, any time or performance differences will not be large enough to be noticeable anyway.



Thursday 13 February 2014

Lists, Lists, Lists

List or Array?

I've written a post about arrays and kind of glossed over the array/list distinction. Some people got back to me about the fact that I hadn't really talked about lists at all so here I'm going to try to explore what a list is and how it differs from an array.

A list is just what it sounds like - a number of items separated by a comma.
An array is assigned a list, it contains the list but it isn't the list itself. I don't think that you can access elements in a list via index numbers unless you explicitly do something to make it be treated like an array.

So this is a list:

("apples", "bananas", "cherries", 45, 360)

We can assign it to an array:
1.   my @array = ("apples", "bananas", "cherries", 45, 360);
And this is an array:

@array

In my post about arrays, I showed a way to get the number of elements in an array using scalar(@arr) and I also talked about using scalar on a list but I didn't explain it very well.

So if we have:
1.   my @arr = ("this", "is", "a", "list);
The difference between
1.   print scalar(@arr);
and
1.   print scalar("this", "is", "a", "list");

is that the first one is giving the scalar value of an array, which is the number of elements in the array (4) and the second one is giving the scalar value of a list, which is the value of the last element ("list"). Two very similar statements, but giving you very different results - I think this is very strange behaviour.


Note To prevent you from having to type loads of quotes in your list, which I know for me definitely slows down my typing and results in a lot of pressing the wrong keys, you can do this:

my @arr = ("this", "is", "a", "list");

my @arr = qw(this is a list);

Much quicker and qw stands for "quote word". The only problem is that the space between the words means that the words are separate list items so if you want to include a space in one of your list items, you're going to have to use quotes.

You can do lots of things with lists and here are some of them.

The following only work on an array so you have to assign your list to an array first.

Pop

This takes the last element from the array and gives it to what you are assigning it to.

my @arr = (1,2,3,4,5);
my $val = pop(@arr);

$val has the value 5 and @arr is now (1,2,3,4)

Push

This one is kind of the opposite of pop. Instead of taking away from the end of the array, you add an element on to the end of it.

my @arr = (1,2,3,4,5);
push (@arr, "x");

@arr is now (1,2,3,4,5,"x")

Shift

This is a different opposite of pop. Instead of taking the last element of the list and assigning it to a value, you take the first element.

my @arr = (1,2,3,4,5);
my $val = shift(@arr);

$val has the value 1 and @arr is now (2,3,4,5)

Unshift

This is the opposite of shift and push. An element is added to the front of the array.

my @arr = (1,2,3,4,5);
unshift (@arr, "x");

@arr is now ("x",1,2,3,4,5)


I think this little table sums up everything above:

Beginning or end of array Add or remove element
Pop End Remove
Push End Add
Shift Beginning Remove
Unshift Beginning Add


Sort

You can also sort your lists using the sort function:

Lexically: 

If sort is used by itself with no parameters, it will sort the list in standard string comparison order, which basically means alphabetical order. You can either use sort directly on a list as I have below or you can assign the list to an array and use sort on the array. Just type "sort" before the list or before the array.
1.   my @arr = sort("hello", "my", "name", "is", "emma");
2.   print "@arr\n";
When this the above code is run, it will come out with:
emma hello is my name
This can also be written as
sort {$a cmp $b} ("hello", "my", "name", "is" "emma"); 
Note
If you have words beginning with capital letters, these will always be sorted in front of lower cased words, even if they come after the lower cased word alphabetically.

To sort the words into a backwards order just reverse the a and b:
sort {$b cmp $a} ("hello", "my", "name", "is" "emma");

Numerically:
1.   my @arr = sort{$a<=>$b}(7,9,4,2,8);
2.   print "@arr\n";
To sort numerically, you use the <=> operator rather than cmp and you still use $a and $b to represent the two numbers being sorted in the alogorithm. When the above code is run, you will get:
(2,4,7,8,9)
Again, to sort backwards, you reverse the a and b:
1.   my @arr = sort{$b<=>$a}(7,9,4,2,8);
2.   print "@arr\n";
 This will print:
(9,8,7,4,2) 
List mapping
 
The map function is used to transform lists element-wise. You can go through each element of a list and perform a function on it and a new list of the new values will be created.

To do the map function, you say that you want to put the result into a new array (@new_numbers in the code below), then you type an equals sing, then the word "map" and then what you want to do to each element in curly brackets. $_ refers to each individual element, kind of like x in a mathematical equation. In the code below what I've said is to take each element and times that element by two. Then you type the list that you want to perform the operation on - you can either write the list out or give an array like I've done.
1.   my @numbers = (1,2,3,4,5);
2.   my @new_numbers = map{$_*2}@numbers;
3.   print "@new_numbers\n";
Which will print:
2 4 6 8 10
You can also apply map to text:

1.   my @text = ("this", "is", "a", "list");
2.   my @new_text = map{$_.":"}@text;
3.   print "@new_text\n";
Which will print:
this: is: a: list:

Grep

Grep is similar to sort, although instead of applying a change to each element, it evaluates the result of the operation and if the result is true, the original value will be put into a new list, if the result is false, the original value will be filtered out.

Again you start with the array you want your new list to be put into (@multiples_of_two), then an equals sign and then the evaluation you want is put in curly braces. The evaluation must create an answer that is either true or false. Then you give the list, either written out as a list or as an array. The code below goes through the list of 1-10 and puts the multiples of two into a new list.

1.   my @numbers = (1,2,3,4,5,6,7,8,9,10);
2.   my @multiples_of_two = grep{$_%2==0}@numbers;
3.   print "@multiples_of_two\n";
This will print:
(2,4,6,8,10)

Thursday 6 February 2014

@arr = ("my", "first", "array");

Now, I know that arrays are a huge topic so I can't possibly fit everything into one post or you'll all get really bored. So here are some of what I think are the key points to understanding arrays, enjoy...

Arrays are just data structures where we can store a list of singular things (scalars) in a specific order. They are like lists, but each element is associated with an incremental number and you can access each of these elements using its associated number. In fact you can assign a list to an array variable name and it will turn into an array.

Arrays are really important, in my clearly expert opinion, because sometimes we need to store things in a specific order and sometimes it makes sense to store a lot of scalars together in one place. You can also do a lot of things with them like iterate through each entry or do the same thing to each entry and put it into a new array. That kind of thing. You can store numbers and strings and arrays and other scalars and mixtures of all of those things into the elements of the array. Here's how:

You can tell you are looking at an array if the variable name is prefixed by an @ sign so for example:
@myarray
I've recently found out that punctuation marks in front of variables in perl are actually called sigils so I may or may not start calling them that from now on.


Declaring and Assigning

As with many things (nearly everything) in perl, arrays can be declared and assigned in different ways:

These two ways show the whole array being created in one go - a list is created and then this list is assigned to the array name:

1.   my @arr;
2. @arr = ("this", "is", "array", 1);

Note that strings are in quotes and numbers are not

or

1.   my @arr = ("this", "is", "array", 2);
Arrays can also be created one element at a time:

1.   my @arr;
2. $arr[0] = "this";
3. $arr[1] = "is";
4. $arr[2] = "array";
5. $arr[3] = 3;

Note that the array index starts at 0 and also that when you assign something to the individual array positions, you use a $ sign (sigil!). This is because each array entry is a singular piece of data, and from my post about scalars, we know that singular pieces of data are stored in scalars and scalars are identified by a $ sign.

If you skip out element positions when you are creating an array like this:

1.   my @arr;
2. $arr[0] = "this array";
3. $arr[2] = "has";
4. $arr[5] = "some";
5. $arr[6] = "gaps";

When you print it out, you will get:

this array, undef, has, undef, undef, some, gaps


Accessing the elements

Once you've declared and assigned your array, you probably want to do things with it. You can access each element by referring to it's index number, remembering that it starts from 0. So for the array
1.   my @myarray = ("this", "array", "is", 6, "elements", "long");

To print the word "is", you need to use the $ sigil as it's a single piece of data, then the name of the array, then a 2 in square brackets because "is" is in the array with element index 2:

1.   print $myarray[2];

Tip

If you want to print out your whole array with a space in between each element, you need to write something like:

print "@array\n";

This interpolates all of the array elements into a string and by default, each element is separated by a space.

Length of Arrays

You can either use

scalar(@arr)

or

$#arr+1

The first one makes more sense and I think it's easier to understand. It just gives you the array as a scalar value which turns out to be the number of elements it contains. The second one gives you the index value of the last element in the array and because the index starts from 0, you have to +1 to get the number of elements.

There's something you need to be careful about within this topic and I think it's a bit of a red herring. If you write:
1.   print scalar(100, 300, 200);
The answer you will get is:
200
this is because it's giving us the value of the last item in the array. So watch out!

Iterating through arrays - this topic is a bit more complex and needs scary things like loops, which I'll cover later so I'm going to run away from it now...


Array of arrays

I think my brain is exploding at trying to visualise this. You can put as many arrays in arrays in arrays in arrays as you like and end up with a multi-dimensional mess. But as long as you can understand it and everyone reading your code can understand it, then everything is fine.

There are, again, many ways implement this. You can declare some arrays and then put all of these into one big array:
1. my @name1 = ("Emma", "Jane", "Howson", 23);
2. my @name2 = ("Willard", "Carroll", "Smith", 45);
3. my @name3 = ("Tyra", "Lynne", "Banks", 40);
4. my@name4 = ("Alecia", "Beth", "Moore", 34);
5. my@name5 = ("David", "Boreanaz", 44);
6.
7. my@perfectdinnerparty = (\@name1, \@name2, \@name3, \@name4, \@name5);
On line 7 you may have notice the backslashes before the @ sigils. This is to prevent one giant array being created where 23 is the 4th element in the array with an index of 3, "Willard" will then become the 5th element in the array with an index of 4 and "Carroll" will be the 6th element with an index of 5 etc. The backslashes actually turn the arrays into array refs, which I don't understand at the moment but I'll get to in a later post. All I know is they enable the structure of an array of arrays.

Or you can declare them all at once:
1.   my @perfectdinnerparty = (
2.     ["Emma", "Jane", "Howson", 23],
3.     ["Willard", "Carroll", "Smith", 45],
4.     ["Tyra", "Lynne", "Banks", 40],
5.     ["Alecia", "Beth", "Moore", 34],
6.     ["David", "Boreanaz", 44],
7. );

The outer brackets need to be parentheses () and the inner ones surrounding each array element need to be square brackets [] and these indicate that the contents are also an array ref.

Another note is that you need to put commas between each outer array element (so between each set of square brackets), but you don't need to put a comma after the last array entry (where it's highlighted). I think it's good practise to do this anyway because if you need to add another element (another square bracket set) to the array, you only need to alter one like of code.

Accessing elements in an array of arrays

The process is very similar to an array although there are two numbers in brackets to find the location of the element you need. So to get the string "Smith" from my perfectdinnerparty array, I would use:

1.   $perfectdinnerparty[1][2];

The first number is the index of the number of arrays in the super array and the second number is the index of the element in each array in square brackets.

Another note

$var and @var are two completely different variables and have nothing to do with each other. You can't access one by using the other one so it's probably best not to use the same variable name to avoid confusion. Another case of just because you can do it, doesn't mean you should.


As always, please correct me if I'm wrong with anything and any suggestions of how to find out more are welcome.