Operation Manual
Experiments in Python
107
Notes:
The permanent storage on a computer (where all the data is kept when the
machine is turned off) will usually be larger than the temporary storage (internal
computer memory, which is lost when power goes off). On the Raspberry PI, your
SD card (permanent storage) is likely to be much larger than the 256Mb of RAM
(temporary storage). Look back at the previous experiment that loaded a text file,
sorted the entries and then displayed the sorted list. If the file was larger than the
fraction of 256Mb that was allotted to Python, how could we sort that file?
The Raspberry Pi’s Linux operating system has a virtual memory solution but
what if you wanted to keep down the Python memory being used? The answer is
to sort part of the list, save that partially sorted list to the permanent storage, and
repeat. Once the entire large file has been partially sorted to smaller files, merge
all the entries in those files together one by one to make the final file. Very little is
kept in memory.
The clever part here is to use a Python generator and heap queue.
Understanding a “generator” is quite difficult, but imagine that it is like a function
that generates the elements of a sequence. After it “yields” an element of the
sequence, it does nothing until it is called again. So the generator here takes a file,
reads a line from it and yields that line to whatever called it. Once it cannot find
any more lines, it just stops returning items, which the caller will notice and do
something else. You’ve seen a generator already. It was called “range”.
The file that is passed to this generator is a subset of the larger file that we want
to sort. The first loop creates a lot of temporary files on your permanent storage,
each of which contains 10 lines that have been sorted. The first loop doesn’t make
use of the sorted partial files immediately. It appends the generator to a list. It is
making a list of the generator functions (linesFromFile) that will read lines one at a
time from the file.
The “heapq” merge function loops through the values given by the generators and
merges them in order. It assumes that the elements from the generator will come
in a sorted order. So, imagine it takes the initial entries from each partial file until
it has worked out which of these is the first. It gives that back, then it continues
doing this one by one, only holding a small amount in memory. All that’s left is to
do something with each value as it is given. In this case, print it on the display.