The Skeptical Statistician

Wednesday, May 15, 2013

How to make histograms in Excel (XP, 2007, 2010, 2011) and Google Docs without any stupid add-ons: Part II

So last week I left you with a number of random numbers and an idea of what we might be able to do with them. That idea of what we might do was histograms, and those random numbers were these:

62

If you remember, these numbers should be drawn from a normal distribution with a mean of 50 and a standard deviation of 15.

Now, if you do a search of 'how to make histograms in excel' most of the responses will come up with a whole bunch of proprietary junk that builds you histograms if you buy and/or download it, with the remainder suggesting that you find your Excel CD to load a whole bunch of extra packages. Many of these sites are trying to get some money out of you. To be fair, I'm also trying to get some money out of you, I'm just a lot worse at it. =)

Anyway, we're not here for that today, because we don't need that - you can make your own histograms perfectly well just with what excel gave you. For that matter, with what Google Docs gave you (for free!).

We're going to rely on two main concepts today. The first is Excel's 'Frequency' function, and the second is the conceptual act of binning (not to be confused with Dr. John F. Binning).

You see, any program that will just make you a histogram all willy-nilly is making some choices for you, and those choices basically manifest in how many bars you get on the graph that is created. For instance, this is a histogram of the above data:

The trick is that I've only created one bin - in this case for numbers from 0 to 100. Every number was placed in the same bin because I made it far too large for the data. Variance has been washed away completely.

This is actually a fairly important point - unless you create a bin for every number on your chart you are likely to display less variance than you actually have when you produce a standard histogram. For the most part it's not something that anyone really worries too much (unless you create a histogram like the one I just did), but it's something to keep in the back of your mind. If any bar contains more than one number, then those numbers are no longer being treated as distinct.

Let's start with the idea of binning. You may have already picked up on it from the above talk about bins, but the goal here is to create a number of bins (or buckets) into which we'll sort our numbers. You want to pick something that makes sense, covers the range of numbers, maintains equal distance!, and maintains as much variance as possible. We'll go through each of those steps in turn, but let's start by just making a pretty straightforward set of bins: sets of 10 from 1 to 100.

Open up your spreadsheet program of choice - I'm going to start by running through Excel 2010 but a lot of things are similar no matter what we do. The main difference turns out to be the keyboard shortcuts between Windows and Mac (not surprisingly). That said, it turns out that the things that work in Excel 2010 apply to Excel 2007 and Excel XP, and the things that work on Excel 2011 presumably work on whatever the last iteration Macs had.

We need the random numbers in one column, so go ahead and copy paste them in there - or better yet create your own. Some of you might also just have some data you want to use, so all you need to do is make sure it's in some sort of array (like a column).

In a different column we need to create bins, and for this first part we can set them as mentioned, ten sets from 1 to 100. To do this we need a column that looks like this:

10
20
30
40
50
60
70
80
90
100

The bins in excel are defined by the distance between the prior number and that number, so the first bin contains all numbers 10 and below, and the second bin contains numbers 11 through 20.

Now for the tricky part.

You have a column of numbers, and you have a column defining your bins. Now it's time to use the frequency function.

If you just go to any cell in your spreadsheet and type' =frequency(' you're going to get a little pop-up with some helpful notes on what you need to include in this formula. In Excel it is going to prompt that you want a 'data_array' and a 'bins_array'.

An array is simply a systematic arrangement of objects, in the case of Excel a arrangement of objects in a column or row. So, we know what we need - in my case I placed the random numbers in the first column starting at the top, so my data array is A1:A50. If you placed your bins in the second column starting from the top your bins array would be B1:B10. Your arrays may vary.

Don't go making your formula so fast, though. If you're using Excel 2010 (or 2007 or XP) you need to do something else first. You need to link a bunch of cells to this same formula. This is done with CTRL+SHIFT+ENTER when you have all the destination cells selected.

Before you type out your frequency formula, select the 10 cells just next to the bins you created (or however many cells for however many bins - it's 10 for this example because there are 10 bins). In my case this would be cells C1:C10.

Once those are selected, type out the frequency formula - for me this looks like '=frequency(A1:A50,B1:B10)'. Instead of hitting just ENTER when you finish, though, hit CTRL+SHIFT+ENTER.

If you've done it right, it should have filled out each of the selected cells with counts that are in the bins next to those cells. All you need to do now is make a chart in the normal fashion using the chart builder and you'll come up with something like this:

If you're using a Mac you might be using Excel 2011. In this case the things we just did very likely did not work on your computer. The steps are exactly the same, except for the whole CTRL+SHIFT+ENTER part.

You still need to have a data array and a bins array, and you can type out your frequency formula in the first cell of a new column just like on Windows (except you don't need to have all the destination cells selected when you do). After you have that cell, however, press enter to get a value in it. Then, select that cell and all others in the final array you're creating (those cells next to the bins).

Once you have the correct frequency formula in the first cell, and those cells selected, press CONTROL+U. This should highlight a bunch of cells. Then press COMMAND+SHIFT+ENTER, which should fill in the cells you're looking for.

I mentioned Google Docs, and the same technique should work there - perhaps depending on your operating system. There's an interesting quirk in that Google Docs goes one more cell beyond what you select, and that cell count is everything above the final number. It's hard to explain, but if you test it out with some data you should figure it out fairly quickly.

Google Docs also doesn't require any of this multi-key pressing either, as if you simply start a frequency in a cell based on your two arrays it will fill out cells below that until it runs out of things in your bin array. It actually takes out an entire step of odd keyboard shortcuts that means it probably functions the same on both Windows and Mac (it seemed to work the same for me on both).

And best of all, you didn't need any fancy extra software.

So go make some histograms!

Wednesday, May 8, 2013

How to make random numbers and histograms in Excel (XP, 2007, 2010, 2011) and Google Docs: Part I

If you've been reading the blog for a while you know that I've complained a few times about both Google Docs and Microsoft Excel and their failure to easily convert data into histograms. They can make bar charts out of data from categorized tables, but can't just take a raw data array and easily just convert it to a graphical representation of frequency (a histogram).

Now, there are ways to take raw data arrays and convert them into categorized tables, and we're going to talk about that in a bit (next week). First, though, we should answer the easily sarcastic question: why is this so hard to do this?

Well, it is, and it isn't. There are plenty of statistical programs that will let you create a histogram fairly easily, but in doing so it's very easy to forget about some of the underlying information used to create that histogram.

Let's make things concrete and start with some numbers, shall we?

I could just make up a string of numbers, but they we wouldn't be learning anything from it (except how bad I am at making up numbers). Instead, let's use the very programs with which we're looking to make histograms to make some random numbers.

Both Excel and Google Docs have some pretty decent random number generation, depending on what you actually consider random. The heart of this is the function:

= rand()

This command will return a random number between 0 and 1. If you're looking for a uniform random number this will take care of it.

Oh, you wanted a random number between some other range? Say 0 to 50? Well, then take your 0 to 1 random number and multiply it by 50. You wanted it between 1 and 50? Multiply it by 49 and then add 1.

You wanted it between -37 and 224? First off, why? Second, multiply your random number by 261 and then subtract 37. DONE.

You want it between .4 and .6? Feel free to take a stab at that one in the comments - I've given you enough to figure it out.

Think of it this way. You and a friend are in a large room. The floor has a long line from one end of the room to the other with 0 at the center of the room. Marks are painted out on the line at each foot to mark out the (relatively low) positive and negative numbers.

Laying on the ground, with the tips of his arms resting neatly at 0 and at 1, is a mint condition Stretch Armstrong.

Stretch, in this example, illustrates what the rand() command has given you - a random number pulled from the range of 0 to 1.

You're pretty confident that you and your friend could each grab an arm and pull Stretch to either end of the room, and that's exactly what you're doing when you multiply your rand() output by any given number.

The first example of wanting a number from 0 to 50 may oversimplify things a bit. You're multiplying by 50 because you want a range of values that covers the numbers from 0 to 50. Things are easier when you want to start with 0, as it's always going to be the bottom value when you finish this multiplication step.

If you want a range that doesn't have 0 as the lower bound, then you need to shift that range one way or another. Only after you multiply - if necessary - though. It's why you multiple by 49 if you want the range to be from 1 to 50 instead of 0 to 50 - you have to start with a range that extends to 0 due to the fact that 0 multiplied by any number is still 0. After multiplication you can simply increment in either direction.

This is accomplished by taking your stretched out Stretch Armstrong and walking up and down the number line - the range that Stretch's arms cover is the range from which your random number will be pulled. If you stretch Stretch to 20 feet long and then walk him 10 feet to the left you'll have a random number centered on 0 within the values of -10 to 10 (ish).

I should note that both Excel and Google Docs have functions that allow you to specify a ceiling or floor for random numbers, but if you understand how rand() works there's really no need for it. It's a completely redundant function, and you should feel angry that it's there.

We're looking to make a distribution to plot out on a histogram. What rand() gives us is a uniform distribution, which makes for boring graphics. How about something flashy, like a normal distribution?

Well, Excel and Google Docs don't have random normal commands (there are many programs that do), so we have to make use of some other functions to transform our rand() values into something a bit more...well, normal.

In Google Docs this function is:

=NORMINV(number, mean, standard deviation)And in Excel it is:

=NORM.INV(number, mean, standard deviation)

Google actually sums it up pretty well in the description of the function:

"Returns the inverse of the normal distribution for the given Number in the distribution. Mean is the mean value in the normal distribution. STDEV is the standard deviation of the normal distribution."

Thus, if we use the form:

=NORMINV(rand(), 50, 15)

We'll end up with random numbers drawn from a normal distribution with a mean of 50 and a standard deviation of 15. If we pulled 50 such numbers they might look something like this:

And that's where we'll pick it up next week!

Wednesday, May 1, 2013

Pistachios OR One simple method to have a pretty weird grocery shopping experience

So for a long time I had been under the impression that I didn't really like pistachios. Once - years ago - I had a bunch and then didn't feel great later on that evening. The human mind has some great methods to create strong linkages between foods and feelin' bad, as back in a hunter gatherer stage such linkages would have been exceptionally useful.

In any case, I didn't really think much of it and just didn't really eat pistachios again. It doesn't seem like I really came across them that much in day to day life, so it wasn't something that I really had to avoid.

Recently, I was hanging out with some friends and they put out some pistachios, and I figured I'd give them a try. Long story short, they're delicious. Take that, hunter gatherer part of my brain.

This was a few months ago, and I finally saw some bulk pistachios at the store last week. I bought some, and realized that they're one of the few nuts that you can't buy without the shell (there are actually some good reasons for it, apparently). You can certainly buy other nuts with their shells, but few of them really require it (with the exception of some of the more novel nuts, like hazelnuts).

Examining what I got from the store led to the conclusion that in any given scoop of pistachios you're getting mostly whole pistachios. Aside from this, you're also getting some nuts that have come out of the shell and some shells that don't have nuts in them.

I was curious about this, initially, and a cursory pass does seem to suggest this is noise that basically cancels out - you get about the same number of empty shells as you get shell-free nuts in any given pull. In the aggregate, then, you're still basically getting whole nuts.

Continuing to think about this a little bit - while sitting around eating pistachios - a thought crossed my mind.

What if you were really careful at the bulk bin? Instead of taking indiscriminate scoops, why not try to be a bit more calculating? If you could aim for already shelled nuts and avoid nut-free shells, you might save a little change (which you could use to buy more pistachios, duh).

Now, the short answer to the longer question is that you don't do this because it would look really weird if you sat at the bulk bin picking out shells and shell-less nuts. People would probably start to stare. The extension would also be that at a certain point you should stop searching and just start shelling - take the good stuff to the scale and leave the shells behind.

But how much is this really going to impact your total? How much does a pistachio nut weigh in relation to a pistachio shell?

Easy enough - I have a kitchen scale for just these sorts of questions.

The weight of one whole pistachio is...0 ounces.

The weight of one pistachio shell is...presumably less than that.

You see, my kitchen scale doesn't have the resolution to specify things at the level of a single pistachio, let alone a single shell. This is not a problem only unique to my kitchen scale (or to pistachios).

How to fix it? Well, averages based on larger samples.

10 pistachios give a measure on my scale, though the resolution is still not there to pick up small differences. In fact, 10 pistachios generally fluctuates between 1/8 and 1/4 of an ounce, meaning that the actual average weight it likely somewhere between those numbers. My kitchen scale does not give values between those two numbers.

Where between does that actual number fall? Well, my scale can't tell me that, at least not exactly. But if I take enough samples I can get a proportion that shows what amount of time the scale comes up 1/8 vs 1/4.
If we treat these as the likely bounds of the weight for 10 pistachios, then the proportion of the time the higher number comes up is the percent distance we have to travel between those two numbers.

Put another way, we can take the average of those measures and find a point between them that is a best guess for the true weight of 10 pistachios.

We can do the same for some sets of 20, even 40, and see if those help to give us a better picture of the scale (they do).

At a certain point, it's simply time to eat some pistachios.

I didn't weight the nuts themselves, as with the weight of the whole pistachio all we really need is the weight of the shell - the average weight of the nut should be what's missing. It's also much harder to shell a pistachio and then just set the nut and shell aside, especially when you only need to set one aside to be able to figure out the weight of whatever you eat.

The same idea of needing to use multiple shells on each measure holds, as the shells are lighter than the whole nut before shelling (obviously).

All said and done, the weights come out as follows:

Whole nut: 0.042 ounces

Just shell: 0.019 ounces

Thus, just nut should be: 0.23 ounces

Interestingly, the weights of just the shell and just the nut are pretty close to each other. This is a good sign - for every five empty shells you leave in the bin you should be able to take four shelled nuts away at basically an even weight trade.

It really only makes sense to make the trade, as if you're just leaving shells to save money you'd have to do a lot to make a dent. At $7.99 a pound, a single shell (without nut) is worth a little less than...1 cent? What's a few cents going to buy you?

Since the nut and shell weigh about the same, it means that (at these prices) a full pistachio should run you just over 2 cents. A pistachio nut runs just a bit over a cent. So, the question of what a few cents will buy you is one or two whole pistachios (or roughly twice as many shelled pistachios).

A single penny might not seem like much, but given the fact that the whole thing is only twice the price of the nut means you're looking at somewhere in the ballpark of 40-50% savings by removing the shell.

For a single nut, this might not make much sense. But remember, we're doing this in aggregate! Who goes to the store and buys a single pistachio?

The logical conclusion is that you should - just as some people do with sweet corn - stand at the store and peel back those pennies from your pistachios.

Just don't blame me if you get kindly (or unkindly) escorted out of the store.

Wednesday, April 24, 2013

Games of the Price is Right: The Wheel (Part III)

The last few times I've discussed The Wheel on The Price is Right I've talked about some of the odds that you face as the first or second contestant. A lot of the discussion has been based simply on how The Wheel is designed and played, and not necessarily on actual data.

For today's post, I'd like to talk about what it actually takes to win at The Wheel.

I've watched and coded a bunch of episodes at this point, and have a good deal of information on what people spin at The Wheel. There's a lot we can look at that might take a bit more data, but for now I thought it would be the most interesting to just put together a chart like this:

What this chart is showing is the distribution of contestants' winning totals at The Wheel. What is quickly evident is that a lot of people who win at The Wheel win by walking away with a dollar total. That's not to say that they hit a dollar on their first spin, but just that they totaled a dollar by some combination of one or two spins. 95 cents is also a bit winner - almost accounting for as many winners as a dollar itself.

If any of the contestants hits a dollar total, the other contestants only have a 1 in 20 (5%) of forcing a tie and spin-off each time they spin. Those aren't great odds. 95 cents isn't much better - there's a 1 in 20 (5%) chance of forcing a tie and spin-off, and a 1 in 20 (5%) chance of simply winning.

The odds of straight out winning against any given score actually double from 95 down to 90. While the odds of forcing a tie and spin-off stay constant (5%), the chances of winning straight out go from 1 in 20 (5%) to 2 in 20 (10%). This might account for the larger gap between 90 and 95 cents on this graph.

What should also be evident is that not many contestants make it as a winner at The Wheel with scores much below 70 cents. Two contestants managed to eke out that win at 65 cents, but the singular winner with 40 cents is actually an interesting fluke.

It was a situation where the first and second contestants had gone over a dollar and lost - the third contestant had a single spin at the wheel to see if they could hit a dollar in one spin. Good trivia for The Price is Right - in such a situation the third contestant does not get another spin at the wheel no matter how they do on their first. It is the only realistic situation where someone could win at the wheel with a spin of 5 cents. The value they come up with is completely random, as it is based on just one spin.

This brings up an important point, though. How do things look if we break them down by contestant?

Of the 72 events at The Wheel (36 episodes with 2 events per episode), the breakdown of wins by placement is actually starting to look fairly interesting. I looked at it after the last post with a smaller sample, and things seemed to be a bit more biased toward later spins. The numbers as they break down now are:

Contestant 1: 23 wins

Contestant 2: 23 wins

Contestant 3: 26 wins

Those wins are distributed as follows:

I was initially expecting the wins for contestant 1 to come a bit higher up the scale - perhaps a disproportionate number of wins from the 95 to a dollar range. Interestingly enough, contestant 1 as a place in line seems to be taking home a lot of wins right around the 70-75 cent region.

Now, this is still a small sample (so I'm going to keep coding), but 70-75 does seem to be the area where contestant 1 generally starts to feel safe enough to stay. If a first spin is below 70 the odds are that they are either: going to stay and get beat, use a second spin to fail to get to 70 and get beat, use a second spin to get into the 70+ region, or spin big and go over a dollar. I'd like to see how this continues to play out with more examples, but if this effect sticks around that might be a start of what's going on.

In terms of dollar wins, everyone seems to be pretty close to equal footing. If you spin a dollar you pretty much have things wrapped up, no matter where you are in line. The best someone can do is also spin a dollar and take you to a spin-off.

Speaking of, we can also take a look and see how many of these wins resulted in a spin-off win. The way I have things coded is which contestant won (overall), what value they won with (enough to make the last chart), but also if there was a tie and what happened during the tie. The winner of the tie is who made it into these charts, so if contestant 2 and contestant 3 tied at 75, went to a spin-off, and contestant 2 won, then contestant 2 gets credited with the win and the winning value is logged as 75.

[To be fair I also have information on every spin that occurred, not just who won, so there's a lot of things I can look at in future posts.]

Basically, we can simply add this information to the counts from earlier:

Contestant 1: 23 wins (6 through spin-offs)

Contestant 2: 23 wins (2 through spin-offs)

Contestant 3: 26 wins (4 through spin-offs)

The first contestant is relying a bit more on winning in spin-offs, but hardly enough in comparison to really make any strong claims about it. If we simply negated wins from spin-offs the ordering would still stay similar, though contestant 1 would take a bit of a hit, with:

Contestant 1: 17 wins

Contestant 2: 21 wins

Contestant 3: 22 wins

In any case, the small sample still seems to be limiting things a bit, but we're also starting to get into the range where some potential trends could be emerging. If there's one takeaway, it seems that your chances of winning with a spin below 70 - for any contestant - are fairly low. Your odds at winning with a spin below 65 are virtually nonexistent.

Wednesday, April 17, 2013

The Monty Hall Problem OR the Omniscient Lawful Neutral Companion Problem

Today we're going to talk about a classic game show that isn't The Price is Right. That show is Let's Make a Deal.

Let's Make a Deal is famous in part for having within it a very curious game element. That element has come to be known as the Monty Hall Problem after the name of the host of the show.

The basic idea is that you're presented with three doors, Door A, Door B, and Door C. Behind one of those doors is something pretty cool. It could be some money, or a car, or whatever. Behind the other two doors there is something less cool, like a goat.

Unless you really like goats, in which case the goat is the prize and the other doors have something that you find less cool. Like less cool goats.

By the way, thanks as usual to wikipedia for having a pretty sweet totally public domain image to put things into perspective.

So the host brings you up to these doors, and tells you there is something great behind one of them. Without anything but dumb luck to guide you, he tells you that you're allowed to pick a door and receive the prize behind it.

Pretty straightforward, right? You have a 1 in 3 (1/3) chance of picking the good prize, and a 2 in 3 (2/3) chance of walking home with a goat (assuming from this point forward that you're one of the people who is trying to not win a goat).

This should make sense to you - this is the easy part. Don't forget this part, though, as the rest turns out to be about that easy.

You pick a door. So, pick a door. Take a deep breath, and wonder if you've won. You look at the host, and he's smiling. You start to get the feeling he knows something that you don't.

Well, it turns out that he does. In fact - for all intents and purposes - he knows EVERYTHING. Sure, he can be easily blinded to the situation by receiving information in an earpiece, but it's so much more fun if he's really just trying to have some fun with his all-knowingness.

The host informs you that you have two options. You can either stick with the door you just picked, or switch to one of the others. You think about it for a moment and realize that either of the other two doors has the same odds of a prize as the one you just picked. No reason to switch or not switch, as you may as well have flipped a coin in the first place.

Wait - the host says - that's not all. He points at a door that you haven't picked, and it opens. Hey, there's a goat behind it! He has just revealed one of the losing choices. The only rule is that he has to pick a door that you didn't pick.

For example, if you picked Door A, he may open and reveal a goat behind Door B or Door C. If you picked B he may open and reveal a goat behind Door A or Door C. If you picked Door C he may open and reveal a goat behind Door A or Door B.

It's not his choice which door gets opened in all situations - no matter what is behind the door you picked there is still at least one goat remaining behind one of the other doors. If you picked the winner right away he has a choice of two goats, but if there is a goat behind your door then there's still a goat remaining on the board that he can reveal.

This is where it gets interesting.

The host asks if you want to stay with the door you picked first, or switch to the last remaining door. You ponder it for a moment. What's the benefit of switching? You already decided that all of the doors are a coin flip anyway, right? Right? Right...?

Let's walk through a possible scenario.

Let's say the correct door is Door C. The host knows it, but you don't, and you can't learn it. Here's how things play out if you stay:

You initially pick door A. The host reveals door B, but you stay with A. You lose.
You initially pick door B. The host reveals door A, but you stay with B. You lose.
You initially pick door C. The host reveals door A, but you stay with C. You win.

In the case of staying, you have to correctly guess the right door out of three on your initial try. You have a 1 in 3 chance of winning, because you pass on the second step. You already figured this out as the easy part earlier in the post. Now what happens if you switch?

Here's how things play out if you switch:

You initially pick door A. The host reveals door B, and you switch to the door remaining, C. You win.
You initially pick door B. The host reveals door A, and you switch to the door remaining, C. You win.
You initially pick door C. The host reveals door A, and you switch to the door remaining, B. You lose.

You see, when switching, you're actually hoping that you chose poorly on your first guess.

If you picked wrong to start, you'll have the correct door when you switch. Do you see that part? That's the trick.

When you get to the second choice - when you have an opportunity to switch - you're playing a slightly different game. There's something cool behind one door, and a goat behind the other. One of the doors (the one opened) is out of play. The door that's out of play was part of the pair involving the door that you can switch to - in essence you're not switching to one door, you're switching to both of the doors you didn't pick initially. It's just that one of those doors has already been opened.

Think of it this way. Instead of opening a door and then asking you if you want the remaining, the host is actually asking you if you want to stay with your initial door choice or switch to both of the other doors.

One of those doors has a goat, and the host knows it. He actually likes goats, so he'll simply open that door and take the goat off your hands. He'll never steal the cool prize, because he's not like that. You get whatever is in the door that he didn't choose.

He may leave you a goat (if you picked right and have the cool prize behind your door), or leave you the cool prize (if you picked wrong and have a goat behind your door).

Since there are three doors you have a 2 out of 3 chance of picking incorrectly to start. If you stay, you need to have picked the right door out of three - and there you have your 1 in three chance.

On the surface people assume you always have a coin flip choice (or overweight their initial guess), while the best bet is to always switch. Over time, you'll win more prizes (and less goats).

Wednesday, April 10, 2013

Amazon, Tax Structure Misconceptions, and Giant Candy

How many of you have ever eaten a five pound Hershey Bar?

How many of you have ever eaten 3,131 five pound Hershey Bars?

Don't worry, we'll get there.

Today I mostly wanted to talk about Amazon's affiliate reward system, but it actually also gives us a good vehicle to talk about some misconceptions of how the tax structure works. We also get to talk about giant candy.

http://www.amazon.com/gp/product/B004AH0MLG/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=B004AH0MLG&linkCode=as2&tag=48808-2

http://www.amazon.com/gp/product/B004LJYDXG/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=B004LJYDXG&linkCode=as2&tag=48808-20

http://www.amazon.com/gp/product/B00315HJ8C/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=B00315HJ8C&linkCode=as2&tag=48808-20

Oddly, my thought after finding all of those is mostly disappointment that the Reese's Peanut Butter Cups aren't bigger. Seems like 'World's Largest' should actually mean something.

I want to use giant candy to talk about Amazon's affiliate programs due to the sheer absurdity of the numbers I'm going to be talking about. It should also make things nice and simple when we have a nice five pound block of chocolate to think about every time we need to talk about an item sold.

Some of you are probably asking what Amazon's affiliate program is. Well, it was discovered in the early days of the internets that one of the best ways to get people to buy things is to get other people who know those people to tell those people to buy things. Depending on how well those people know each other, this covers all behavior from word of mouth to banner ads.

Amazon has a pretty interesting setup, but a fairly complicated one.

The short rundown is this. If I send you a link (or put it on a page where you see it), and you click on it, go to Amazon, and buy any product, I would get some cut of the cost of that product. That's the easy part.

The more complex part is what cut I'm getting in any given situation.

[As a brief promotional aside, Amazon is willing to give me money at no cost to you any time you buy any product on Amazon, as long as you got there through one of my links. If you feel like helping out the blog you can simply bookmark Amazon as this link: ( http://www.amazon.com/?_encoding=UTF8&camp=1789&creative=390957&linkCode=ur2&tag=48808-20 ) . If you do that - and every time you're going to search for and buy something on Amazon do it through that link - then every time you buy something on Amazon I'll get a small share of it. Not going to force you, but any little bit helps keep this blog running. The biggest impact by far is cell phones with wireless plans, which is a flat rate $25 a shot. I'd buy you a drink for that sort of contribution.]

Amazon has some tables (accurate as of April 10, 2013):

Now, the first table is pretty boring, aside from the fact that any of you looking to buy 'magazine products' on Amazon should let me know. We're also going to look the other way and pretend that giant chocolate bars don't fall into the 'grocery' category. We're doing this because anything that doesn't fall into one of those first categories (or other categories they specify elsewhere) falls into 'general products' and the much more interesting second chart.

It's even more interesting if we start to use it to make some graphs.

Let's assume that one of my readers really likes Hershey bars. They read this post, then click on this link:

http://www.amazon.com/gp/product/B000IW68YC/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=B000IW68YC&linkCode=as2&tag=48808-20

which takes them to the Amazon page for five pound Hershey bars. If that same person were to buy one of those bars, Amazon tosses me a Jefferson and some change ($2.08) at the end of the month because I sent them there. That $2.08 is (from the above chart) 4% of the $52.00 current price of a giant Hershey bar.

Now, let's say that same person REALLY likes Hershey bars, and goes back after putting down the first one to pick up five more giant Hershey bars. How much do I get for each of those? Well, it's still under seven items, so I still just get $2.08 a bar.

A few days later, the same person decides that six giant Hershey bars are great, but they'd really be happier with just one more. They pick up one more - how much do I make on that one?

Well, more than $2.08, because I've now helped Amazon sell seven items in a given month. At that point I would get bumped up into the 6% range for that month, and start bringing home $3.12 for every bar.

But what about the six bars I had already helped sell previously?

Well, it's not the end of the month yet, and Amazon has yet to cut a check. They were going to give me $2.08 for every bar, but now I've found my way into a better part of the scale. Pretty soon Amazon will start comping me drinks, I would imagine. At the end of the month, I get $3.12 for every one of those seven bars.

When six bars had been sold, I was going to get $12.48. On the purchase of the seventh I didn't just make the $3.12 for that bar, but made up a bunch of money on bars that had already been bought. In fact, the money earned after the seventh bar is $21.84 - by convincing my chocolate-loving reader to pick up that seventh bar I actually earned $9.36, or a full 18% of the cost of a bar.

Let's say I was really persuasive, and convinced this (apparently pretty well off) reader to drop some cash on stockpiling for the winter by buying 125 giant Hershey bars. Here's what that looks like in a graph.

You can see that the cost is starting to become pretty decent, but with three things on the same graph it's hard to make sense of it (or even name the y-axis). If we remove cost we get this:

And you can start to see some of the jumps in payout that match up with changes in this scale. Like I said, when the seventh item gets bought I get paid more on all the items before. Amazon doesn't make their tallies until the end of the month, so the place you end up in terms of number sold is the rate you get on everything.

At least on everything that falls into table 2 - remember we're basically forgetting about the much more boring table 1.

We can even make a graph that illustrates this a bit better by plotting out how the percentage changes based on number of items sold.

We by no means have to stop at 125 bars sold. We can take it out to the full range of Amazon's structure, and assume our friend hit the lottery and decided to spend some of that on building a chocolate house somewhere in a cool enough climate.

That's right, I get one final bump way out there when this reader buys his 3,131st giant Hershey bar. That jump is from 8.25% to 8.5%. Not much, right? Well, it's the difference between $4.29 a bar and $4.42 on any given bar, so that is right...partly.

When the 3,130th bar is bought I'd be making 8.25% ($4.29) on it, and all others before it. When the 3,131st bar is bought I'd be making 8.5% ($4.42) on it, and all others before it.

We can graph the actual payout per number of Hershey bars, and that looks like this:

You'll notice the upticks in the graph at several points - these are the points where the payout rate changes for all products bought that month. At the first jump it nets me a few extra dollars, but as things continue to escalate these jumps get bigger and bigger.

Now, I'm not going to force you to do calculus here (we're really close, though), but we can take a look at the incremental gain of adding one more Hershey bar to the pile at any point.

We are again faced with a graph whose immensity of scale wipes out some of the (less important) effect. The main point of this graph are the peaks - while the rest of the line does have variability it is negligible in this case.

Let's walk through it, because there's a really important point here.

When the seventh bar was purchased, I mentioned that I'd be pulling down not just the payout for that bar, but the retroactively applied greater payout for all those earlier bars. The sixth bar makes me $2.08, and the eighth bar makes me $3.12, but the seventh bar makes me $9.36, because the seventh bar triggered a change in the underlying payment structure.

Now, $9.36 is enough to make me happy, but not enough for me to pressure someone who has just bought six bars into buying their seventh. That $9.36 is only one of the peaks on this graph, though. Let's cut to the chase.

When our poor reader buys their 3,130th giant Hershey bar (IN A MONTH), I mimic a cash register noise and earn 8.25% of it ($4.29). When the same reader buys their 3,131st bar, I break open the bubbly - Amazon just bumped my total by $411.32. The 3,132nd bar again simply gets a fairly weak cash register noise (for the comparatively paltry sum of $4.42).

In fact, there'd be no reason for me to wait for this chocolate magnate to buy the 3,131st bar - I could do it myself and still net $359.32. I'D ALSO HAVE A GIANT CHOCOLATE BAR.

The point is that it's actually quite foolish to finish the month at 3,130 sales, as I'm simply leaving money on the table. If I'm smart enough to trick someone into buying this much chocolate I should have the good numerical sense to never stop when so close to such a payout.

Take a quick pause for a moment - do you get that? Because it's about to stop being about fifteen thousand pounds of chocolate.

Some of you might think this whole example looks vaguely familiar, as if you interact with systems like this occasionally. I kind of gave away the spoilers in the title, but this second chart from Amazon does look vaguely like a series of tax brackets:

I don't want to talk too much about taxes, because it's really a secondary part of this post and I've already talked about taxes quite a bit before.

The main sticking point that I think most people have with taxes is that they assume them to play out one way, when they really play out quite differently.

You may think they play out the same way that Amazon hands out checks in exchange for chocolate (and I guess other stuff too). You make $5,000, you get taxed at 10%; you make $400,000, you get taxed at 35%. You can see that this introduces the same (but in the opposite direction) sort of argument as buying the 3,131st Hershey bar yourself. If you are making $178,000 it would seem stupid to make another thousand and have all your money taxed at a higher rate. Just stop making money, right?

Well, that thought process comes out of a complete misunderstanding of the process of taxation. Unlike Amazon - who waits until the end of the month when everything is all set - taxes are paid on every dollar as you earn them. You never have to go back and retroactively pay a higher rate on money that's already been earned. When you move to a higher tax bracket you only pay that higher rate on the money you make from that point onward.

If you're thinking that you kind of get it, I'd suggest reading this earlier post. I'd also suggest that post if you are simply lost but want to try to make sense of it.

If you're one of the people who feels like they now understand taxes better (or had already understood them pretty well before), you should feel proud of yourself. Proud enough that perhaps you've earned yourself the right to treat yourself to something nice?

http://www.amazon.com/gp/product/B000IW68YC/ref=as_li_ss_tl?ie=UTF8&camp=1789&creative=390957&creativeASIN=B000IW68YC&linkCode=as2&tag=48808-20

Wednesday, April 3, 2013

CNN, statistical-minded proofreading, and percentages of percentages of percentages (of percentages)

This week's post should be a quick one - it has to do the media. The poor, poor media.

Don't worry, business cat will make sense by the end of all of this.

Specifically my problem this week is with CNN, though they're by no means the only ones guilty of poor statistical reporting. They're simply the one that I have most recently noticed. Once you start looking, though, it's no real trick to catch any of the major news outlets in the same kind of gaffe.

The story in question is here:

http://schoolsofthought.blogs.cnn.com/2013/03/11/when-teachers-are-the-bullys-target/?hpt=hp_bn11

Please take note, this is not a discussion of content. Obviously I was at least a little interested in the content to be reading the article, but this particular discussion should be completely free of content. The main content of the article could be written in Latin - what's important today are simply the numbers and how they report them.

Now, to drill down into it as quickly as possible, the only paragraph we need concern ourselves with is a little past the halfway point of the article. For your sake, here it is, copied from the article as originally seen several weeks ago but accessed today, April 3rd, 2013:

"MetLife’s 2012 Survey of the American Teacher revealed that job satisfaction is the lowest in more than 20 years. The survey reported that 29% of teachers said they are likely to leave the profession. That’s 12% higher than the number of teachers who said they would leave in 2009."

I'm sure I'm not the only one that starts to get a fight or flight response when I see any media outlet reporting statistics or percentages, but this paragraph throws up some pretty obvious flags that make a little anxious.

First off, they say that in 2012, x% of teachers are likely to something. They tell you that this is a z% increase from 2009, but fail to provide you with y, or rather y%, the percent of teachers that were likely to do this same thing back in 2009.

Some of you are getting a flight or flight response (I can feel it, even over the internets) because I just used letters instead of numbers. I feel for you, I really do. But this is grade school algebra I'm dropping on you. If you have no idea how to do grade school algebra it shouldn't make you feel sad or angry or anxious, it should make you feel motivated to take a few hours and just learn grade school algebra. If you know me, and want me to teach you, I will. Honestly. Just ask. It will be quick and painless.

In any case, y is not given, but inferred. We have an equation:

(y + (z/100)*y) = x

This equation has three variables, but only one unknown. That means it's solvable for y.

To clarify - for those that are looking at that equation like it is Latin - all that's happening in it is that we're taking a 2009 number (y), and increasing it (+) by a percent (z). Percents are given as numbers from 0 to 100, but to do math in terms of incrementing we actually want a proportion, which ranges from 0 to 1. We can easily change a percentage to a proportion by dividing by 100 (/100).

This proportion is the part of the first number that increases. If there is some percentage growth (z), we take the original number (y), and add on to it the share of itself that it is growing by ((z/100)*y).

To make it concrete for you, if y is 10, and it grew by 50%, then the way we figure out what the new value (x) should be is to start with 10, and add on half (or .50*10 = 5). Thus, a 50% increase to the number 10 results in the number 15.

Are we all on board with that?

Some of you might be saying, 'hey, this is different because you just did it on 10 and not on 10%'

You, good readers, have just hit on the teachable moment.

I used the number 10, but it doesn't matter. For your sake I'm going to copy paste the same explanation but add in the % symbols.

To make it concrete for you, if y is 10%, and it grew by 50%, then the way we figure out what the new value (x) should be is to start with 10%, and add on half (or .50*10 = 5). Thus, a 50% increase to the number 10% results in the number 15%.

Still following? Because it's somewhere in there that CNN stopped following. Business cat has also moved on to chasing a laser pointer across the floor.

I said we can use the equation up above to figure out the number that CNN isn't reporting (y). I won't hold you in suspense much longer - or make you do the math - the value from the given x and z should be y = 25.892...

You see, if you start with just shy of 26%, and take 12% of that (it's around 3%) to add on, you end up at around 29%. If the percents are confusing you, take the % signs off the 26, 3, and 29.

If you start with just shy of 26, and take 12% of that (it's around 3) to add on, you end up at around 29.

The % signs don't matter on any of those except for eventual interpretation in context of the content, and I've already told you I don't care one bit about interpretation of the content here.

Where calling something a percent does matter is on the 12. You may also notice it's the only one I didn't remove the % sign from. I start to worry when I read something like this because an increase of 12% is a lot different than an increase of 12 percentage points.

Let's walk through this a little more. The equation we talked about above deals with an increase in percent:

(y + (z/100)*y) = x

But if we're talking about percentage point increases it's a bit simpler:

y + z = x

In that case, you would be saying that the 2012 number is 29%, and since 2009 it has not grown 12%, but rather moved up 12 points on a percentage scale. It's a lot easier to figure out the 2009 number, as it's simple subtraction. y = 17%

The fact that this is a lot cleaner and simpler (and doesn't give a solution with a non-simplifying decimal) makes me wonder if this is in fact what they might have been doing there.

OH WAIT WE CAN FIGURE THIS OUT.

You see, despite their poor understanding of statistics and percents, CNN does at least take the time to link you to things they are citing (so they are actually doing a little better than some of the news outlets in that regard). In this case, the link in that paragraph is actually a live link (at the moment) to the pdf research report from which they are drawing their numbers. For those that want it as a separate link, here you go:

https://www.metlife.com/assets/cao/contributions/foundation/american-teacher/MetLife-Teacher-Survey-2011.pdf

It's not a small document, but we're looking for a very particular piece of information. A quick search pulls it up, and reveals that CNN didn't even have to read the actual report - they're citing information from the executive summary. Think of the executive summary like http://simple.wikipedia.org

You've never been to simple.wikipedia.org? Stop wasting your time here, and start wasting (making use of?) your time here:

http://simple.wikipedia.org/wiki/Large_Hadron_Collider

or here:

http://simple.wikipedia.org/wiki/Special_relativity

or here:

http://simple.wikipedia.org/wiki/Love

or here:

http://simple.wikipedia.org/wiki/Candy

The last one containing what may be my favorite pair of sentences ever written in conjunction on the internets:

"Many people like candy and think it tastes good. Other people do not like it."

Anyway, back to the stats.

Finding the Executive Summary TL:DR, CNN appears to have conveniently found the 'Major Findings' bullet point list of the Executive Summary to be the place to go for numbers. I don't even have a good comparison for a Major Findings bullet point list in an Executive Summary - simple.wikipedia.org is about as simple as my comparisons get.

Maybe, uh, quickmeme?

http://www.quickmeme.com/make/

Well NOW I've killed your day. That's also where business cat came from. Quickly, in fact.

The place I'm trying to get us to is this bullet point in the report:

"The percentage of teachers who say they are very or fairly likely to leave the profession has increased by 12 points since 2009, from 17% to 29%."

Bam.

Hopefully at this point - if you've been following - you can see that the people who were paid to put together a statistical report actually put it together correctly. They used the correct terminology, and left a % symbol off of the number 12. They did this as it is not a percent. It is a growth in percent, not a percent growth. These two things are very, very, very different.

If you've been reading the blog for a while you might recognize that this is the same thing that a of companies use to trick you into thinking things are much larger or smaller than they appear. The way CNN reworded things actually translated into only about 3 percentage points growth - not very impressive. 12 percentage points is...well, larger.

The same way that Jimmy Fallon can change something on the order of half of a percentage point increase into a drastically different 50% increase (or as we noted, much smaller increases into much larger percent increases), so too can poor statistical reporting change any effect into something that it is obviously not (in either direction).

Look for sources, and don't just read through numbers without thinking. The person feeding you the information might be actively trying to deceive you to prey on your weaknesses (like Jimmy Fallon), or might simply be negligently ignorant about those statistics (like reporters at every major news outlet).