Wednesday, May 15, 2013

How to make histograms in Excel (XP, 2007, 2010, 2011) and Google Docs without any stupid add-ons: Part II

So last week I left you with a number of random numbers and an idea of what we might be able to do with them.  That idea of what we might do was histograms, and those random numbers were these:

31
49
36
37
47
13
46
33
36
52
48
60
63
36
47
85
49
45
70
34
45
70
41
65
65
45
24
62
45
42
59
50
78
49
63
37
45
45
64
32
13
56
47
31
57
42
52
63
45
62

If you remember, these numbers should be drawn from a normal distribution with a mean of 50 and a standard deviation of 15.

Now, if you do a search of 'how to make histograms in excel' most of the responses will come up with a whole bunch of proprietary junk that builds you histograms if you buy and/or download it, with the remainder suggesting that you find your Excel CD to load a whole bunch of extra packages.  Many of these sites are trying to get some money out of you.  To be fair, I'm also trying to get some money out of you, I'm just a lot worse at it.  =)

Anyway, we're not here for that today, because we don't need that - you can make your own histograms perfectly well just with what excel gave you.  For that matter, with what Google Docs gave you (for free!).  

We're going to rely on two main concepts today.  The first is Excel's 'Frequency' function, and the second is the conceptual act of binning (not to be confused with Dr. John F. Binning).

You see, any program that will just make you a histogram all willy-nilly is making some choices for you, and those choices basically manifest in how many bars you get on the graph that is created.  For instance, this is a histogram of the above data:



The trick is that I've only created one bin - in this case for numbers from 0 to 100.  Every number was placed in the same bin because I made it far too large for the data.  Variance has been washed away completely.

This is actually a fairly important point - unless you create a bin for every number on your chart you are likely to display less variance than you actually have when you produce a standard histogram.  For the most part it's not something that anyone really worries too much (unless you create a histogram like the one I just did), but it's something to keep in the back of your mind.  If any bar contains more than one number, then those numbers are no longer being treated as distinct.

Let's start with the idea of binning.  You may have already picked up on it from the above talk about bins, but the goal here is to create a number of bins (or buckets) into which we'll sort our numbers.  You want to pick something that makes sense, covers the range of numbers, maintains equal distance!, and maintains as much variance as possible.  We'll go through each of those steps in turn, but let's start by just making a pretty straightforward set of bins: sets of 10 from 1 to 100.

Open up your spreadsheet program of choice - I'm going to start by running through Excel 2010 but a lot of things are similar no matter what we do.  The main difference turns out to be the keyboard shortcuts between Windows and Mac (not surprisingly).  That said, it turns out that the things that work in Excel 2010 apply to Excel 2007 and Excel XP, and the things that work on Excel 2011 presumably work on whatever the last iteration Macs had.

We need the random numbers in one column, so go ahead and copy paste them in there - or better yet create your own.  Some of you might also just have some data you want to use, so all you need to do is make sure it's in some sort of array (like a column).

In a different column we need to create bins, and for this first part we can set them as mentioned, ten sets from 1 to 100.  To do this we need a column that looks like this:

10
20
30
40
50
60
70
80
90
100

The bins in excel are defined by the distance between the prior number and that number, so the first bin contains all numbers 10 and below, and the second bin contains numbers 11 through 20.

Now for the tricky part.

You have a column of numbers, and you have a column defining your bins.  Now it's time to use the frequency function.

If you just go to any cell in your spreadsheet and type' =frequency(' you're going to get a little pop-up with some helpful notes on what you need to include in this formula.  In Excel it is going to prompt that you want a 'data_array' and a 'bins_array'.

An array is simply a systematic arrangement of objects, in the case of Excel a arrangement of objects in a column or row.  So, we know what we need - in my case I placed the random numbers in the first column starting at the top, so my data array is A1:A50.  If you placed your bins in the second column starting from the top your bins array would be B1:B10.  Your arrays may vary.

Don't go making your formula so fast, though.  If you're using Excel 2010 (or 2007 or XP) you need to do something else first.  You need to link a bunch of cells to this same formula.  This is done with CTRL+SHIFT+ENTER when you have all the destination cells selected.

Before you type out your frequency formula, select the 10 cells just next to the bins you created (or however many cells for however many bins - it's 10 for this example because there are 10 bins).  In my case this would be cells C1:C10.

Once those are selected, type out the frequency formula - for me this looks like '=frequency(A1:A50,B1:B10)'.  Instead of hitting just ENTER when you finish, though, hit CTRL+SHIFT+ENTER.

If you've done it right, it should have filled out each of the selected cells with counts that are in the bins next to those cells.  All you need to do now is make a chart in the normal fashion using the chart builder and you'll come up with something like this:



If you're using a Mac you might be using Excel 2011.  In this case the things we just did very likely did not work on your computer.  The steps are exactly the same, except for the whole CTRL+SHIFT+ENTER part.

You still need to have a data array and a bins array, and you can type out your frequency formula in the first cell of a new column just like on Windows (except you don't need to have all the destination cells selected when you do).  After you have that cell, however, press enter to get a value in it.  Then, select that cell and all others in the final array you're creating (those cells next to the bins).

Once you have the correct frequency formula in the first cell, and those cells selected, press CONTROL+U.   This should highlight a bunch of cells.  Then press COMMAND+SHIFT+ENTER, which should fill in the cells you're looking for.

I mentioned Google Docs, and the same technique should work there - perhaps depending on your operating system.  There's an interesting quirk in that Google Docs goes one more cell beyond what you select, and that cell count is everything above the final number.  It's hard to explain, but if you test it out with some data you should figure it out fairly quickly.

Google Docs also doesn't require any of this multi-key pressing either, as if you simply start a frequency in a cell based on your two arrays it will fill out cells below that until it runs out of things in your bin array.  It actually takes out an entire step of odd keyboard shortcuts that means it probably functions the same on both Windows and Mac (it seemed to work the same for me on both).

And best of all, you didn't need any fancy extra software.

So go make some histograms!

2 comments:

  1. Thank you! I have Excel 2011 and I was getting crazy!!! Thanks!

    ReplyDelete
  2. This was incredibly useful! Thank you

    ReplyDelete