Data representation

Contents:

  Integer representation
  Fixed-point representation
  Floating-point representation
  IEEE 754 standard sizes
  Example floating-point conversions

Integer representation

We want to represent data in a computer: numbers, text, and all the other data types. But what we have is electrical signals in wires; we need to represent the data with voltage levels on those wires. If you like, we need to encode the data.

One possible basis for building up data representation is to insist that all of the wires will either have a standard amount of electricity flowing through them, or none at all. Two possibilities: On or off.

This is a very small amount of information. It is called a "bit". It's the theoretically smallest amount of information you can have.

We can store more information by using more bits. For example, if we use two bits, we can store four different possibilities:

  1. off, off
  2. off, on
  3. on, off
  4. on, on

These are the four possible states of two bits.

In general, n bits have 2^n possible configurations. This is assuming we know which order they're in; in other words, possibilities 2 and 3 above have to count as different configurations.

With current computers, you frequently find bits organized into groups of eight, which are somewhat humorously called "bytes". This means there are 2^8 possible configurations of the bits in a byte. 2^8 is 256.
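
If you want to see these configurations mechanically, here is a quick Python sketch (just an illustration, not something from the course):

  # Enumerate the possible configurations of n bits.
  from itertools import product

  for config in product(("off", "on"), repeat=2):
      print(config)                                    # the four two-bit states

  print(len(list(product(("off", "on"), repeat=8))))   # 256: a byte's configurations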

We want a representation of numbers with these bits, and we want a representation which is convenient for arithmetic. We want to be able to build arithmetic circuitry, and software, which doesn't have huge addition tables built in. We want the numbers to be in order, in some sense.

In fact there's a fairly straightforward way to represent numbers with bits. Each bit can be on or off; we can call this "0" or "1".

Then if we put a bunch of bits together, like this: 10011
we can treat that as a number in base 2 (binary).

You know that "926" in "base ten" means 9*10^2 + 2*10^1 + 6*10^0

so 10011 in base two is 1*2^4 + 0*2^3 + 0*2^2 + 1*2^1 + 1*2^0

Actually we can omit the zeroes, and the multiplication by one... binary is very easy. So we get
2^4 + 2^1 + 2^0
= 16 + 2 + 1
= 19
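
Here is the same positional arithmetic as a quick Python sketch; int(s, 2) is Python's built-in base-2 conversion, shown as a cross-check:

  # Evaluate a base-two numeral digit by digit.
  s = "10011"
  value = 0
  for digit in s:
      value = value * 2 + int(digit)   # shift the accumulated value left, add the new bit
  print(value)       # 19
  print(int(s, 2))   # 19, using the built-in conversion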

A byte is generally eight bits. What's the largest number we could represent in that byte?

11111111 (base two)
= 1*2^7 + 1*2^6 + 1*2^5 + 1*2^4 + 1*2^3 + 1*2^2 + 1*2^1 + 1*2^0
= 255

All bits zero will be zero, of course.

So an eight-bit number can be anywhere from zero to 255, inclusive. If we want larger numbers, we have to use more bits, i.e. more than one byte.

In general, we can use n bits to represent an integer from 0 to 2^n - 1 inclusive, e.g. 8 bits gives us 0 to 255.

How do we get negative numbers? Two ways:

  1. "sign-and-magnitude": one bit is the "sign bit", other bits are an "n-1"-bit number. Like how we write numbers on paper. You figure out what the number is, ignoring the sign bit; but actually that's not necessarily the number, it's just the number's magnitude (absolute value).
    If the sign bit is 1, the number is negative.
    If the sign bit is 0, the number is positive or zero ("non-negative").
    So we can represent an integer from -2**{n-1}-1 to 2**{n-1}-1 inclusive. (because we have n-1 bits, plugged into the previous formula; and then we can put '-' or not before it.)
  2. "two's-complement": almost always used these days (for integers). It uses n bits to represent an integer from -2n-1 to 2n-1-1 inclusive.
These two formats yield approximately the same number of numbers: half positive, half negative. For example, 16 bits can represent an integer from -32768 to 32767 inclusive, in the two's complement format.

We won't discuss the two's-complement representation further in this course. It's discussed in CSC 258, at least if you take it from me.

In either case, you can see that we have 2^n possibilities for the numbers, and we can choose all non-negative, which gives us 0 to 2^n - 1; or we can choose about half each, positive and negative.
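
For the curious, here is a quick Python sketch of the two ranges (just the arithmetic from the formulas above, for n = 16):

  # Ranges representable by n-bit signed integers.
  n = 16
  print(-(2**(n-1) - 1), "to", 2**(n-1) - 1)   # sign-and-magnitude: -32767 to 32767
  print(-2**(n-1), "to", 2**(n-1) - 1)         # two's complement:   -32768 to 32767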

Fixed-point representation

Suppose we want numbers other than integers, e.g. 2½?

A straightforward representation for non-integer numbers involves adding something like decimal points. But "decimal" means base 10, so we tend to call it a "binary point". There's also a generic term "radix point", which applies equally well to base 2 or to base 10 or to any base.

I claim that we can represent two and a half by 10.1 in binary. That is, 2.5 in base ten is 10.1 in base two.

10.1 (base two)
= 1*2^1 + 0*2^0 + 1*2^-1
= 1*2 + 0*1 + 1*½
= 2 + 0 + ½
= 2½

This is called "fixed point". We can use the same basic format as we did for integers, so long as we agree on how many places there are after the radix point.
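
Here is a minimal Python sketch of that agreement; "frac_bits" (the agreed number of places after the radix point) is just a name I picked:

  # Interpret a string of bits as a fixed-point number.
  def fixed_point_value(bits, frac_bits):
      # Read all the bits as an integer, then slide the binary point
      # left by frac_bits places, i.e. divide by 2**frac_bits.
      return int(bits, 2) / 2**frac_bits

  print(fixed_point_value("101", 1))   # 10.1 in base two -> 2.5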

Floating-point representation

Floating-point numbers involve an exponent of the base, like scientific notation. Example: Avogadro's number can be written as 6.02 x 10^23 in base 10.

But computer floating-point numbers are usually based on base two, of course.
6.02 x 10^23 is approximately (1 and 63/64) x 2^78
or 1.111111 (base two) x 2^1001110 (with the exponent also written in base two)

To represent this in a computer, in (say) 32 bits, we can use something like the IEEE floating-point standard (IEEE 754): one sign bit, then an 8-bit exponent field, then a 23-bit mantissa field.

If the exponent field is all ones (looks like 255), the number is a special value: infinity if the mantissa field is all zeroes, or "NaN" (not a number) otherwise.

If the exponent field is all zeroes, the meaning of the number is +/- 0.{mantissa} x 2^-126

For exponents other than all ones (255) and all zeroes (0), the meaning of the number is:
+/- 1.{mantissa} x 2^({exponent}-127)

Why subtract 127? 8 bits represents a number from 0 to 255 -- subtract 127 to give us the ability to express exponents from -127 to 128.

Note that we save a bit by not representing the '1.'. This means that all floating-point numbers must be "normalized" -- they begin with "1." (a one, then the radix point).

This restriction resembles that in scientific notation in base ten, except that in base ten there are nine possible digits for that single digit to the left of the decimal point, e.g. '6' in Avogadro's number, whereas in base two there is only the one possibility.

The sole digit to the left of the radix point can't be zero (in either base 2 or base 10). If it's zero, you just slide around the radix point as needed, adjusting the exponent value to compensate, which is called "normalizing" the number.

This is why the special rule for an all-zeroes exponent is essential; otherwise we couldn't represent the number zero itself. This rule also gives us a few smaller-magnitude numbers, e.g. 0.00000000000001 x 2^-126, but its chief importance is that it gives us a way to represent zero.

(Why -126 instead of -127 (which would seem more obvious, to me anyway)? Consider trying to represent the number 2^-127 itself. With the exponent being -126, we could represent this as (in effect) 0.1 x 2^-126. If the actual exponent for this special case of an all-zeroes exponent field were -127, this number would be unrepresentable. But this is getting rather far afield and you don't have to know this for this course.)
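
Putting the three cases together, here is a quick Python sketch of a decoder for the 32-bit format; the function name is just a name I picked:

  # Decode a 32-character string of '0'/'1' as IEEE single precision.
  def decode_single(bits):
      sign = -1 if bits[0] == "1" else 1
      exp_field = int(bits[1:9], 2)       # 8-bit exponent field
      mantissa_field = int(bits[9:], 2)   # 23-bit mantissa field
      if exp_field == 255:
          return "infinity or NaN"        # the special values
      if exp_field == 0:                  # +/- 0.{mantissa} x 2^-126
          return sign * (mantissa_field / 2**23) * 2**-126
      # +/- 1.{mantissa} x 2^({exponent}-127)
      return sign * (1 + mantissa_field / 2**23) * 2**(exp_field - 127)

  print(decode_single("0" * 32))   # 0.0, thanks to the all-zeroes-exponent rule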

IEEE 754 standard sizes

IEEE 754 has two standard sizes:

  single precision: 32 bits = 1 sign bit + 8 exponent bits + 23 mantissa bits (exponent offset 127)
  double precision: 64 bits = 1 sign bit + 11 exponent bits + 52 mantissa bits (exponent offset 1023)
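
Python's struct module packs floats in exactly these two sizes, which gives a quick way to confirm the widths:

  import struct
  print(struct.calcsize(">f") * 8)   # 32: single precision
  print(struct.calcsize(">d") * 8)   # 64: double precision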

Example floating-point conversions

Convert 12.4 to IEEE single-precision.

12 (base ten) is 1100 (base two)

.4 is 2/5

In base ten, we'd divide 5 into 2 to get 0.4. You do this with long division.

For base two, we instead divide 5 into 2 in base 2, or if you like divide the base two number 101 into 10 (these being the base-two representations of 5 and 2 respectively), using long division, in base two.

This turns out to be a repeating fraction, with "0110" repeating.
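
If you'd rather not do long division in base two, an equivalent method (easier to code, though not the method described above) is to double the fraction repeatedly and peel off the integer parts:

  # Each doubling shifts the binary point right one place, so the integer
  # part that falls out is the next bit. Fraction keeps the arithmetic exact.
  from fractions import Fraction

  x = Fraction(2, 5)   # 0.4, exactly
  digits = []
  for _ in range(20):
      x *= 2
      bit = int(x)              # 0 or 1
      digits.append(str(bit))
      x -= bit
  print("".join(digits))   # 01100110011001100110 -- "0110" repeating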

so 12.4 = 1100.0110011001100110...
= 1.1000110011001100110... x 2^3
so use 130 for the exponent field (because 130 - 127 == 3 -- see the conversion back the other way, below, if this isn't clear)

Convert 130 to base two by repeated integer division, and take the remainders. (We could have done this for 12 near the beginning of this example, except that it was so easy to see that 1100 is 8+4 is 12, but we do need to have an algorithm, so here it is.)

130 div 2 = 65 remainder 0
65 div 2 = 32 remainder 1
32 div 2 = 16 remainder 0
16 div 2 = 8 remainder 0
8 div 2 = 4 remainder 0
4 div 2 = 2 remainder 0
2 div 2 = 1 remainder 0
1 div 2 = 0 remainder 1
Read the remainders upwards to get the value.
so 130 (base ten) = 10000010 (base two)
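
Here is the repeated-division algorithm as a quick Python sketch (the function name is just a name I picked):

  # Convert a non-negative integer to its base-two representation.
  def to_binary(n):
      remainders = []
      while n > 0:
          remainders.append(str(n % 2))   # the remainder on division by 2
          n //= 2
      # "Read the remainders upwards": reverse the list.
      return "".join(reversed(remainders)) or "0"

  print(to_binary(130))   # 10000010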

so IEEE 754 single-precision representation is:
0 10000010 10001100110011001100110

Remember that we omit the initial '1' from the mantissa field, thus gaining an additional bit of precision.

There are no spaces there in any sense of course, I've just shown the grouping. The computer knows the grouping by counting bits.
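
We can check this result against Python's own single-precision packing; ">f" means big-endian single precision in the standard struct module:

  import struct

  bits = "0" + "10000010" + "10001100110011001100110"   # sign, exponent, mantissa
  print(int(bits, 2).to_bytes(4, "big") == struct.pack(">f", 12.4))   # True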

Now let's take that value, 01000001010001100110011001100110, and interpret it as an IEEE single-precision floating-point number and see what it is.

We separate the bits into the three fields: the first bit is the sign bit, the next 8 are the exponent field, the remaining 23 are the mantissa field. So the sign bit is 0, meaning that the number is not negative. The exponent field is 10000010, or 2^7 + 2^1, which is 128 + 2 or 130. And the mantissa field is 10001100110011001100110. Remember that there is an implied '1' at the beginning there, if you look at how the mantissa field substitutes into the meaning of the number, towards the beginning of this document.

Substituting into our formula, we get 1.10001100110011001100110 x 2^(130-127)

So you see why we used 130 for the exponent field when the actual exponent is supposed to be 3.

The mantissa, then, is
2^0 + 2^-1 + 2^-5 + 2^-6 + 2^-9 + 2^-10 + 2^-13 + 2^-14 + 2^-17 + 2^-18 + 2^-21 + 2^-22

Add these up with your calculator and you'll get 1.5499999523162841796875.

This is times 2^3, which is 8; multiply by 8 and you'll get 12.3999996185302734375.
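
Rather than a calculator, a couple of lines of Python will do the sum, and struct confirms that the hardware agrees (0x41466666 is the hex form of the 32-bit pattern above):

  import struct

  mantissa = sum(2**-k for k in (0, 1, 5, 6, 9, 10, 13, 14, 17, 18, 21, 22))
  print(mantissa * 2**3)                                     # 12.399999618530273
  print(struct.unpack(">f", bytes.fromhex("41466666"))[0])   # 12.399999618530273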

So 01000001010001100110011001100110 isn't actually a representation of 12.4 exactly, but it's as close as we can get, just as how 0.3333333333 in base ten is not equal to a third but it is as close as we can get with ten significant decimal digits.

