Wednesday, May 8, 2013

How to make random numbers and histograms in Excel (XP, 2007, 2010, 2011) and Google Docs: Part I

If you've been reading the blog for a while you know that I've complained a few times about both Google Docs and Microsoft Excel and their failure to easily convert data into histograms.  They can make bar charts out of data from categorized tables, but can't just take a raw data array and easily just convert it to a graphical representation of frequency (a histogram).

Now, there are ways to take raw data arrays and convert them into categorized tables, and we're going to talk about that in a bit (next week).  First, though, we should answer the easily sarcastic question: why is this so hard to do this?

Well, it is, and it isn't.  There are plenty of statistical programs that will let you create a histogram fairly easily, but in doing so it's very easy to forget about some of the underlying information used to create that histogram.

Let's make things concrete and start with some numbers, shall we?

I could just make up a string of numbers, but they we wouldn't be learning anything from it (except how bad I am at making up numbers).  Instead, let's use the very programs with which we're looking to make histograms to make some random numbers.

Both Excel and Google Docs have some pretty decent random number generation, depending on what you actually consider random.  The heart of this is the function:

= rand()

This command will return a random number between 0 and 1.  If you're looking for a uniform random number this will take care of it.

Oh, you wanted a random number between some other range?  Say 0 to 50?  Well, then take your 0 to 1 random number and multiply it by 50.  You wanted it between 1 and 50?  Multiply it by 49 and then add 1.

You wanted it between -37 and 224?  First off, why?  Second, multiply your random number by 261 and then subtract 37.  DONE.

You want it between .4 and .6?  Feel free to take a stab at that one in the comments - I've given you enough to figure it out.

Think of it this way.  You and a friend are in a large room.  The floor has a long line from one end of the room to the other with 0 at the center of the room.  Marks are painted out on the line at each foot to mark out the (relatively low) positive and negative numbers.

Laying on the ground, with the tips of his arms resting neatly at 0 and at 1, is a mint condition Stretch Armstrong.



Stretch, in this example, illustrates what the rand() command has given you - a random number pulled from the range of 0 to 1.

You're pretty confident that you and your friend could each grab an arm and pull Stretch to either end of the room, and that's exactly what you're doing when you multiply your rand() output by any given number.

The first example of wanting a number from 0 to 50 may oversimplify things a bit.  You're multiplying by 50 because you want a range of values that covers the numbers from 0 to 50.  Things are easier when you want to start with 0, as it's always going to be the bottom value when you finish this multiplication step.

If you want a range that doesn't have 0 as the lower bound, then you need to shift that range one way or another.  Only after you multiply - if necessary - though.  It's why you multiple by 49 if you want the range to be from 1 to 50 instead of 0 to 50 - you have to start with a range that extends to 0 due to the fact that 0 multiplied by any number is still 0.  After multiplication you can simply increment in either direction.

This is accomplished by taking your stretched out Stretch Armstrong and walking up and down the number line - the range that Stretch's arms cover is the range from which your random number will be pulled.  If you stretch Stretch to 20 feet long and then walk him 10 feet to the left you'll have a random number centered on 0 within the values of -10 to 10 (ish).   

I should note that both Excel and Google Docs have functions that allow you to specify a ceiling or floor for random numbers, but if you understand how rand() works there's really no need for it.  It's a completely redundant function, and you should feel angry that it's there.

We're looking to make a distribution to plot out on a histogram.  What rand() gives us is a uniform distribution, which makes for boring graphics.  How about something flashy, like a normal distribution?

Well, Excel and Google Docs don't have random normal commands (there are many programs that do), so we have to make use of some other functions to transform our rand() values into something a bit more...well, normal.

In Google Docs this function is:

=NORMINV(number, mean, standard deviation)
And in Excel it is:

=NORM.INV(number, mean, standard deviation)

Google actually sums it up pretty well in the description of the function:

"Returns the inverse of the normal distribution for the given Number in the distribution. Mean is the mean value in the normal distribution. STDEV is the standard deviation of the normal distribution."


Thus, if we use the form:

=NORMINV(rand(), 50, 15)

We'll end up with random numbers drawn from a normal distribution with a mean of 50 and a standard deviation of 15.  If we pulled 50 such numbers they might look something like this:

31
49
36
37
47
13
46
33
36
52
48
60
63
36
47
85
49
45
70
34
45
70
41
65
65
45
24
62
45
42
59
50
78
49
63
37
45
45
64
32
13
56
47
31
57
42
52
63
45
62

And that's where we'll pick it up next week!

No comments:

Post a Comment