2.2 How is a table stored?

Contents

2.2 How is a table stored?#

In postgreSQL, we see a table like this:

csc343h-dianeh=> select * from student;
  sid  | firstname |  surname   | campus |   email   | cgpa 
-------+-----------+------------+--------+-----------+------
 99132 | Avery     | Marchmount | StG    | avery@cs  | 3.13
 98000 | William   | Fairgrieve | StG    | will@cs   | 4.00
 99999 | Afsaneh   | Ali        | UTSC   | aali@cs   | 2.98
   157 | Leilani   | Lakemeyer  | UTM    | lani@cs   | 3.42
 11111 | Homer     | Simpson    | StG    | doh@gmail | 0.40
(5 rows)

Now we know our tables are kept in secondary storage much of the time. But how exactly is a table stored – what bytes go where? Is it laid out in row order, or column order? If in row order, is it sorted, and if so according to what column(s)?

And it turns out that more is stored than the data. PostgreSQL reveals this when we define a table.

csc343h-dianeh=> create table Student(
        sID integer primary key,
        firstName varchar(15) not null,
        surName varchar(15) not null,
        campus Campus,
        email varchar(25),
        cgpa CGPA);
CREATE TABLE
csc343h-dianeh=> d student
                     Table "university.student"
  Column   |         Type          | Collation | Nullable | Default 
-----------+-----------------------+-----------+----------+---------
 sid       | integer               |           | not null | 
 firstname | character varying(15) |           | not null | 
 surname   | character varying(15) |           | not null | 
 campus    | campus                |           |          | 
 email     | character varying(25) |           |          | 
 cgpa      | cgpa                  |           |          | 
Indexes:
    "student_pkey" PRIMARY KEY, btree (sid)

This last part holds important infomation:

Indexes:
    "student_pkey" PRIMARY KEY, btree (sid)

It says that postgreSQL has created an auxilliary data structure called an index. We didn’t ask for this structure, but postgreSQL made it on our behalf because we declared sID to be a primary key. This is a huge hint to the DBMS that we are probably going to use sID a lot to look things up in this table, and we are also probably going to do a lot of joins based on matching sIDs. The index postgreSQL builds is designed to make those operations fast.

A database index#

An index is an extra structure that helps us find things quickly in our data. In the physical world, many books such as cookbooks and textbooks have an index that maps topics to page numbers. If you want to bake macarons, you find “macaron” in the index and it tells you which pages contain macaron recipes. The index is organized alphabetically, to speed up search. The index is separate from the data itself (for example, the recipes in a cookbook). Without the index, to find macaron recipes or anything else, we would have to flip through the book one page at a time. A book can also have multiple indices, organized based on different things. For example, textbooks often have an author index and a topic index.

A database index is exactly analogous.

  • It is a separate structure from the data (the rows in our table).

  • It is organized to make search according to some attribute(s) — in this case, sID — fast.

  • When we find a desired sID in the index, stored with it is the location, in secondary storage, of the row from the Student table for that sID. This location can be used as a pointer to go directly to that row.

  • Without the index, in order to find a desired sID, we would have to traverse through all the rows of the table.

When we created our Student table, postgreSQL told us:

Indexes:
    "student_pkey" PRIMARY KEY, btree (sid)

indicating that it made an index organized by the primary key we declared for the table, sID. We said that the index is organized to make search by sID fast. You learned in first year that a binary search tree is a good way to make search fast without compromising on fast insertion and deletion. We would want to upgrade to a balanced BST, in order to ensure O(log n) search time, where n is the number of values (in this case sIDs) in the tree. But even O(log n) is totally inadequate for our database index. If our data is so large that it can’t fit in memory, then our index, which must have one entry for each key value in the data, may not fit in memory either. And if a balanced BST index were stored in secondary storage, traversing it during a search would require bringing each node we visit into memory so that we could examine its value and decide whether to go left or right. Doing this very slow operation O(log n) times would simply take too much time.

Traversal would be faster if the search path were shorter, and we can make the search path shorter by making the tree wider. This is what a “B-tree” does — and notice that postgreSQL told us it had made a B-tree.