\( \newcommand{\NOT}{\neg} \newcommand{\AND}{\wedge} \newcommand{\OR}{\vee} \newcommand{\XOR}{\oplus} \newcommand{\IMP}{\Rightarrow} \newcommand{\IFF}{\Leftrightarrow} \newcommand{\TRUE}{\text{True}\xspace} \newcommand{\FALSE}{\text{False}\xspace} \newcommand{\IN}{\,{\in}\,} \newcommand{\NOTIN}{\,{\notin}\,} \newcommand{\TO}{\rightarrow} \newcommand{\DIV}{\mid} \newcommand{\NDIV}{\nmid} \newcommand{\MOD}[1]{\pmod{#1}} \newcommand{\MODS}[1]{\ (\text{mod}\ #1)} \newcommand{\N}{\mathbb N} \newcommand{\Z}{\mathbb Z} \newcommand{\Q}{\mathbb Q} \newcommand{\R}{\mathbb R} \newcommand{\C}{\mathbb C} \newcommand{\cA}{\mathcal A} \newcommand{\cB}{\mathcal B} \newcommand{\cC}{\mathcal C} \newcommand{\cD}{\mathcal D} \newcommand{\cE}{\mathcal E} \newcommand{\cF}{\mathcal F} \newcommand{\cG}{\mathcal G} \newcommand{\cH}{\mathcal H} \newcommand{\cI}{\mathcal I} \newcommand{\cJ}{\mathcal J} \newcommand{\cL}{\mathcal L} \newcommand{\cK}{\mathcal K} \newcommand{\cN}{\mathcal N} \newcommand{\cO}{\mathcal O} \newcommand{\cP}{\mathcal P} \newcommand{\cQ}{\mathcal Q} \newcommand{\cS}{\mathcal S} \newcommand{\cT}{\mathcal T} \newcommand{\cV}{\mathcal V} \newcommand{\cW}{\mathcal W} \newcommand{\cZ}{\mathcal Z} \newcommand{\emp}{\emptyset} \newcommand{\bs}{\backslash} \newcommand{\floor}[1]{\left \lfloor #1 \right \rfloor} \newcommand{\ceil}[1]{\left \lceil #1 \right \rceil} \newcommand{\abs}[1]{\left | #1 \right |} \newcommand{\xspace}{} \newcommand{\proofheader}[1]{\underline{\textbf{#1}}} \)

18.1 Sorted Lists and Binary Search

In this chapter, we’re going to explore a topic that is at once new and familiar: sorting a list. By this point in the course, you’ve used Python’s built-in functions sorted and list.sort to sort lists to help solve various programming problems. But you might wonder, how do these functions work, and how efficient are they?

Over the past few chapters, we’ve explored various ways of organizing data into data structures like linked lists, trees, and graphs, and learned how to implement various common operations on those data structures. This chapter is going to be a little different. We’re going to look at just one data structure, the Python array-based list, and one operation, sorting it. The complexity that we’ll learn about stems from the fact that there are many different algorithms for sorting a list, that differ not just in individual lines of code but in their conceptual approach.

In this chapter, we’ll learn about a diverse range of sorting algorithms, covering their ideas, their implementations, and their running times. By the end of this chapter, we hope that you appreciate just how inventive computer scientists can be when designing new algorithms, and how there can be vastly different ways of solving the same problem.

Before getting to our search algorithms, we’ll take a quick stop at one topic that both motivates why sorting a list can be useful, and previews some of the technical loop designs we’ll start using when we get to the sorting algorithms in the next section. This topic is binary search, which we touched on back in 15.5 Introduction to Binary Search Trees.

Recall the standard way of searching for an item in a list, using a for loop with an early return:

def search(lst: list, item: Any) -> int:
    """Return whether item appears in this list."""
    for element in lst:
        if element == item:
            return True

    return False

This function has a worst-case running time of \(\Theta(n)\), where \(n\) is the length of lst. Intuitively, there are cases where the for loop will need to iterate \(n\) times, check each element of the list against the item being searched for. After all, any of these elements might be the desired item!

We said back in [Section 13.5] that if we have a sorted list, then we don’t need to check all of its elements, if we’re smart about which elements we’re checking. To see why, suppose we have a sorted list lst containing one-million numbers, and we’re searching for the number 111. If we check the middle element (the one at index 500,000), there are three possibilities: Unsurprisingly, these match up exactly with various cases of BinarySearchTree.__contains__!

In the latter two cases, a single comparison allows us to ignore roughly half of the list’s items. And we can repeat this process on either the left or right half of the list until we’ve either found the item or ended with no more list elements to check. This algorithm is known as binary search, where the “binary” refers to the fact that at every comparison, the range of elements to check is halved.

Now, you might be expecting us to implement binary search recursively, and indeed this is possible, following the same approach as BinarySearchTree.__contains__. However, we’re going to implement it iteratively instead, to prepare us for the more complex iterative sorting algorithms you’ll study in the next two sections.

The key idea for implementing binary search iteratively is that we need to keep track of the current search range, which gets smaller every time a new item is checked. To do this, we’ll use two variables, b and e, to represent the endpoints of this range: at any point in time in our algorithm, the current search range is the list slice lst[b:e]. These variables will divide our list into three parts:

Diagram illustrating variables b and e in binary search.

Here is how we’ll start our implementation:

def binary_search(lst: list, item: Any) -> bool:
    """Return whether item is in lst using the binary search algorithm.

    Preconditions:
        - lst is sorted in non-decreasing order
    """
    b = 0
    e = len(lst)

So at the start of our search, the full list (lst[0:len(lst)]) is being searched. Now, the binary search algorithm uses a while loop to decrease the size of the range.

  1. First, we calculate the midpoint m of the current range.

  2. Then we compare lst[m] against item.

    1. If item == lst[m], we can return True right away.
    2. If item < lst[m], we know that all indexes >= m contain elements larger than item, and so update e to reflect this.
    3. If item > lst[m], we know that all indexes <= m contain elements less than the item, and so update b to reflect this.

Let’s write out our loop:

def binary_search(lst: list, item: Any) -> bool:
    """Return whether item is in lst using the binary search algorithm.

    Preconditions:
        - lst is sorted in non-decreasing order
    """
    b = 0
    e = len(lst)

    while ...:
        m = (b + e) // 2
        if item == lst[m]:
            return True
        elif item < lst[m]:
            e = m
        else:  # item > lst[m]
            b = m + 1

Finally, to complete the loop condition, we can take our usual indirect approach and ask, when should the loop stop? Intuitively, this is when there are no more items to check—when lst[b:e] is empty. This occurs when b >= e, and negating this gives us our loop condition.

def binary_search(lst: list, item: Any) -> bool:
    """Return whether item is in lst using the binary search algorithm.

    Preconditions:
        - lst is sorted in non-decreasing order
    """
    b = 0
    e = len(lst)

    while b < e:
        m = (b + e) // 2
        if item == lst[m]:
            return True
        elif item < lst[m]:
            e = m
        else:  # item > lst[m]
            b = m + 1

    # If the loop ends without finding the item, the item is not in the list.
    return False

Don’t forgot about loop invariants!

Because while loops have less explicit structure than for loops, it can be easy to get lost in what the code is doing. Adding good loop invariants is a great way of helping us track what while loops are doing, and we’ll be using this technique throughout this chapter.

In our final version of binary_search below, we’ve added three loop invariants to encode how b and e split up the list.

def binary_search(lst: list, item: Any) -> bool:
    """Return whether item is in lst using the binary search algorithm.

    Preconditions:
        - lst is sorted in non-decreasing order
    """
    b = 0
    e = len(lst)

    while b < e:
        # Loop invariants
        assert all(lst[i] < item for i in range(0, b))
        assert all(lst[i] > item for i in range(e, len(lst)))

        m = (b + e) // 2
        if item == lst[m]:
            return True
        elif item < lst[m]:
            e = m
        else:  # item > lst[m]
            b = m + 1

    # If the loop ends without finding the item, the item is not in the list.
    return False

Running-time analysis

Now that we’ve completed our implementation of binary_search, let’s talk about running time. We claimed at the start of this section that binary search is faster than searching every item in the list—is it?

We won’t do a formal analysis here, but this is the main idea. Our implementation has two loop variables, b and e that change over time in unpredictable ways, depending on the result of comparing item and lst[m] at each iteration. However, it we focus on the quantity e - b, we can make the following observations:

So putting these three observations together, we get that binary_search runs for at most \(1 + \log_2 n\) iterations, with each iteration taking constant time. Since the other statements outside of the loop are constant time, the worst-case running time is \(\cO(\log n)\). Much better indeed than the linear search’s \(\Theta(n)\) running time across unsorted lists.

So for large datasets where we plan to do many search operations, it is often more efficient to pre-sort the list of data so that every subsequent search runs quickly. As for how we actually do that sorting—stay tuned for the next section!

References