agaricus5 wrote:
Although the distributions are not normal, there is a theorem called the Central Limit Theorem that states, roughly, that for any distribution with a mean (μ) and finite standard deviation (σ), if you pick at random a certain number of people or things from it (in this case, you could pick a number of DRODders at random and ask them what they would vote) and average the numbers, then, if you did this over and over many times, the average of the means you'd get would be μ, and the means would be distributed approximately normally (i.e. the graph of means would be roughly bell-shaped), with a standard deviation of σ/√n, where n is the number of things you picked.
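As a quick illustration of what the quoted theorem says, here is a small simulation sketch. The setup is entirely my own choice for illustration: exponential variables (which are clearly not normal) with μ = σ = 1, sample size n = 50, repeated 2000 times.

```python
import math
import random
import statistics

random.seed(0)
n = 50          # size of each sample (illustrative choice)
trials = 2000   # how many times we repeat the sampling

# Each trial: draw n exponential(1) variables (mean 1, sd 1) and average them.
means = [statistics.fmean(random.expovariate(1.0) for _ in range(n))
         for _ in range(trials)]

# The CLT predicts: the means average out to mu = 1, with a standard
# deviation of about sigma / sqrt(n) = 1 / sqrt(50) ~ 0.141.
print(statistics.fmean(means))
print(statistics.stdev(means), 1 / math.sqrt(n))
```

The two printed standard deviations come out close to each other, even though each individual observation is exponentially distributed rather than bell-shaped.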
Seeing a reference to the Central Limit Theorem, I can't resist saying something. Let me first say that although the std deviation does give some useful information, I'd also prefer seeing the actual distribution, but it's not really important to me. The rest is slightly off-topic.
Let me also say in advance that if this turns into a normality rant, it's not aimed at you, agaricus5. I'm merely interested in the topic, and I know you did yourself indicate that there may be some shaky assumptions in your arguments. This is purely a discussion on a topic you mentioned, not a response to what you said.
I feel a bit uncomfortable around applications of limit theorems. A limit theorem goes something like: for every d > 0, if n is sufficiently large, blah blah blah... That "sufficiently large" is a highly theoretical idea. What is sufficiently large, in practice? It depends. What bugs me is how loosely that theorem is often interpreted as "a mean is more or less normally distributed" - see the wiki article. (Even worse: "most things are normally distributed, because they are in some way a sum of different factors". For instance, we should expect hold ratings to be normally distributed, since ratings tend to be a sum of various criteria. Rubbish. Btw, I'm not accusing you of this, or of the "means are normal" assumption, agaricus5.)
A classic counterexample is to consider Poisson variables. A sum of independent Poisson variables is again Poisson distributed, with the parameter of the sum equal to the sum of the original parameters. So, according to typical "practical" reasoning, if X1, X2, ..., X100 (100 is always a nice large number) are independent Poisson(1) variables, their mean is normally distributed. At this point it is admitted that any one of them is obviously not normally distributed, and here I agree. But next time, when they're all distributed Poisson(0.01), the same reasoning will be applied. In this case, however, the sum has a Poisson(1) distribution, which we know not to be normal, hence the mean is not normal either (if the mean X/n were normally distributed, then so would X = n(X/n) be, since multiplying by a constant preserves normality). The usual excuse is that anything "more or less bell-shaped with a very small variance" is normally distributed for practical purposes. I've never seen anything remotely resembling a valid proof of that.

Note also that the Poisson distribution is one of the most popular assumptions in the rare cases where normality is not assumed. Assumptions other than normality are usually more interesting and much better justified. It makes a lot of sense, for instance, that the number of typos on a page, or the number of glasses that break per month in a household, is Poisson distributed. It then goes downhill when means are calculated. Interesting side note: further analysis explains that weird phenomenon of one glass in a set always seeming to outlive the rest by a surprisingly long time.
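The Poisson(0.01) case is easy to see numerically. In this sketch (sample counts are my own choice; the sampler is the standard inversion method for small λ), the sum of 100 iid Poisson(0.01) variables takes only the values 0, 1, 2, ..., so the mean takes only the values 0, 0.01, 0.02, ... and is nowhere near bell-shaped:

```python
import math
import random
from collections import Counter

random.seed(1)

def poisson(lam):
    # Knuth's inversion sampler; fine for small lambda, illustrative only.
    L = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

trials = 20000
n = 100
# Each trial: sum of 100 iid Poisson(0.01) variables, which is Poisson(1).
sums = [sum(poisson(0.01) for _ in range(n)) for _ in range(trials)]

counts = Counter(sums)
frac_zero = counts[0] / trials
# A Poisson(1) variable is 0 with probability e^-1 ~ 0.368, so well over a
# third of all the "means" land exactly on 0 - no bell curve in sight.
print(frac_zero, math.exp(-1))
```

Over a third of the probability mass of the mean sits on a single point, which is about as non-normal as it gets, despite n = 100 being a "nice large number".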
(Someone actually wrote a whole book on counterexamples to typical loose reasoning in statistics. I don't remember who, or what the book's called.)
I find this convenient normality assumption funny. That's all it's about: convenience. The density, distribution, moment-generating and other functions associated with the normal distribution are all very nice to work with. It simplifies life. It enables you to actually calculate *something*, when you'd be nowhere if, as is often the case, you simply don't know what the real underlying distribution is. There are normality tests, but most people agree (I think) that they're not much use; the biggest problem is that they take the form of hypothesis tests where you want to *accept* the null hypothesis, which goes directly against the whole philosophy of hypothesis testing.
The best part to me is that if "most" things are normally distributed, then ratios of the same things will have the extremely awkward Cauchy distribution (its mean doesn't even exist, let alone its variance). But in practice, if you're not going to calculate X/Y, then X and Y are probably normally distributed, by the Central Limit Theorem.
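The Cauchy point is also easy to demonstrate. In this sketch (setup is my own, for illustration), the ratio of two independent standard normals is standard Cauchy: its median behaves perfectly well, but the heavy tails throw out enormous outliers, which is why the mean doesn't exist:

```python
import random
import statistics

random.seed(2)
n = 100_000

# Ratio of two independent standard normals: standard Cauchy distribution.
samples = []
for _ in range(n):
    x = random.gauss(0.0, 1.0)
    y = random.gauss(0.0, 1.0)
    while y == 0.0:           # guard against a (practically impossible) zero
        y = random.gauss(0.0, 1.0)
    samples.append(x / y)

# The median is a perfectly good location estimate, close to 0...
print(statistics.median(samples))
# ...but the tails are so heavy that huge outliers always appear, and the
# running sample mean never settles down to any value.
print(max(abs(s) for s in samples))
```

The second printed number is routinely in the hundreds or thousands even for modest sample sizes, which is exactly the behaviour that kills the mean.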
I see this did turn into a rant then, so please excuse me. Note: Part of my annoyance with normality assumptions, and statistics in general, dates back to my student years, when I had no problem giving proper mathematical proofs in exams, but was very irritated by the fact that the only way to reproduce statistical "proofs" was to memorise them, since they made no mathematical sense.
...if the 45 votes the hold has now are representative of what people think, then...
Rather a big if? I'm not sure. I suppose 45 probably is getting there, but the bigger question is whether those who do vote form a biased or representative sample. (Another reason for Schik to yell at us to vote.)
I can be probably 95% sure that the true mean (i.e. the mean I'd get if I made everyone vote, including those who will play DROD in the future) lies between 8.16 and 9.34.
There I disagree. The part about future voters, I mean. Remove that and I'm with you. That future voters will vote the same way as current voters seems like an unjustified assumption to me. If anything, BD and your next example, HIJK, show that there is a tendency for old holds to be rated lower over time, so you can't construct confidence intervals for BD's rating a year from now based on current votes. If that assumption were valid, you'd be right, of course, otherwise you're entering the world of time-series analysis.
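For reference, the interval quoted above looks like the usual normal-approximation interval, x̄ ± 1.96·s/√n. Backing the numbers out of the quoted bounds (so the sample mean and sd below are my reconstructions, not figures from the post):

```python
import math

n = 45                      # number of votes, from the quote
xbar = (8.16 + 9.34) / 2    # midpoint of the quoted interval
s = 2.02                    # sample sd implied by the quoted half-width (reconstructed)

# 95% normal-approximation confidence interval for the mean rating.
half = 1.96 * s / math.sqrt(n)
lo, hi = xbar - half, xbar + half
print(round(lo, 2), round(hi, 2))  # recovers roughly 8.16 9.34
```

This only licenses a statement about the population the 45 votes were drawn from, which is the point of the objection: extending it to future voters needs the extra assumption that they vote like current voters.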