Statistics

There are three kinds of lies: lies, damned lies, and statistics. Statistics is the field of study within mathematics that concerns itself with the collection, classification, analysis and interpretation of data on events or opinions, usually taken from representative samples of a particular cohort, to give a reasonable estimate of the whole population's outcome or opinion. Statistics is a form of inductive reasoning, but because it makes its arguments almost exclusively with mathematics, and because most statistical tests are now run automatically by computer software, it has somehow gained the status of deductive reasoning for many. When someone feeds their data into a statistics program and the result comes out significant, it is treated almost like a proof (in the mathematical sense) that the alternative hypothesis is correct. A p-value is the probability of obtaining a result equal to or more extreme than what was actually observed, when the null hypothesis (H0) is true. For example, p<0.01 means that, if the null hypothesis were true, there would be a less than 1% chance of obtaining data at least as extreme as those observed; it does not mean that there is a <1% chance that the null hypothesis is true.

Average
A measure of where the data points as a whole are, often summarizing (and sometimes oversimplifying) the data set. It does not describe the variance of the data set.

Standard deviation
A standard measure of how spread out the data is. For normally distributed data, standard deviation follows the empirical rule: one standard deviation away from the mean (in either direction) will encompass ~68% of the data, two deviations in either direction will encompass ~95% of the data, and three deviations away from the mean will encompass ~99.7% of the data in question.
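
The percentages in the empirical rule can be checked numerically. The following is a minimal sketch using only Python's standard library; the function name `fraction_within` is illustrative and not part of any library.

```python
from statistics import NormalDist

def fraction_within(k: float) -> float:
    """Fraction of a normal distribution lying within k standard deviations of the mean."""
    std_normal = NormalDist(mu=0.0, sigma=1.0)
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {fraction_within(k):.1%}")
```

The rule only holds to the extent that the data really are normally distributed; skewed or heavy-tailed data can deviate from these percentages considerably.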

Sample size
The number of observations/events that are being evaluated in a data set. Larger samples tend to be more representative of the population they are drawn from.

Sampling bias
Sampling bias occurs when the sample is not truly representative of the population as a whole to whom the study is relevant. In such a situation the results are skewed because some members of the larger population are more likely to be sampled than others. An example of this would be a school deciding to issue a survey to determine how funding for the semester should be allocated. On the day in question the physics students, normally forming the majority of the population, are on a field trip to the local cardboard box factory, leading to their views being under-represented in the data. They return to find that next semester the school shall shift funding away from them and into sports.

Sampling bias isn't necessarily dishonest, but it will draw into question the validity of the results. In some cases self-selection bias may occur, by which participants can choose to opt in or out of the survey. Not all participants may be equally motivated to complete a survey, and this can make the sample set unrepresentative. This can easily occur in online polls regarding divisive issues. For example, Fox News could post a poll asking the question "Is Obama ineligible for the presidency?" The average reader may encounter the poll and provide an answer, but those with little interest in this issue may simply ignore the poll. Elsewhere a real-estate agent, dentist and alleged lawyer notices the poll and immediately directs everyone on her mailing list to go and vote for the removal of the uppity president. With a sufficient following, a relatively small number of birthers may succeed in giving the impression that their position is shared by a greater percentage of the larger population than is actually the case.

A way to control for sampling bias is to accurately track the demographics of those participating in a survey in order to establish a sample set that, when scaled-up, represents the population the survey is purported to address.
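
One common way to do this scaling-up is to weight each group's answers by its share of the real population (often called post-stratification). The sketch below reuses the school example with entirely hypothetical numbers; the function name `poststratified_share` and the group labels are assumptions made for illustration.

```python
def poststratified_share(population_share, sample_counts, sample_yes):
    """Weight each group's observed 'yes' rate by its share of the real population."""
    estimate = 0.0
    for group, share in population_share.items():
        yes_rate = sample_yes[group] / sample_counts[group]
        estimate += share * yes_rate
    return estimate

# Hypothetical school: physics students are 60% of pupils but only 20% of respondents.
population_share = {"physics": 0.6, "other": 0.4}
sample_counts = {"physics": 20, "other": 80}   # who answered the survey
sample_yes = {"physics": 2, "other": 64}       # who voted to move funding to sports

raw = sum(sample_yes.values()) / sum(sample_counts.values())
weighted = poststratified_share(population_share, sample_counts, sample_yes)
print(f"raw share in favour:      {raw:.0%}")
print(f"weighted share in favour: {weighted:.0%}")
```

Weighting of this kind only helps when the tracked demographic is the one that drives the bias; it cannot repair a sample in which an entire group is missing.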

The misunderstanding of statistics
88.2% of statistics are made up on the spot. A common problem is a misunderstanding of what a statistic actually means.

For example, life expectancy is often confused with maximum life span, as seen in the In Search of... episode "The Man Who Would Not Die" (about the Count of St. Germain), where it is stated "Evidence recently discovered in the British Museum indicates that St. Germain may have well been the long lost third son of Rákóczi born in Transylvania in 1694. If he died in Germany in 1784, he lived 90 years. The average life expectancy in the 18th century was 35 years. Fifty was a ripe old age. Ninety... was forever." This ignores the fact that life expectancy is an average, and that high child mortality rates bring that number down. In fact, Socrates, Saint Anthony, Michelangelo, and Ben Franklin all lived way past the life expectancy of their times.
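
The effect of child mortality on the average can be seen with a small worked example. The cohort below is hypothetical and chosen only so that the arithmetic is easy to follow; it is not historical data.

```python
from statistics import mean

# Hypothetical cohort of 100 births (illustrative numbers, not historical data).
ages_at_death = [2] * 40 + [57] * 60   # 40 die in early childhood, 60 die at 57

at_birth = mean(ages_at_death)
adults = mean(age for age in ages_at_death if age >= 15)
print(f"life expectancy at birth: {at_birth} years")
print(f"mean age at death for those who reached 15: {adults} years")
```

In this made-up cohort nobody dies at the "average" age at all, which is why a life expectancy at birth says little about how old adults typically got.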

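Another example is provided in Carl Sagan's The Demon-Haunted World with "President Dwight Eisenhower expressing astonishment and alarm on discovering that fully half of all Americans have below average intelligence". For any roughly symmetric distribution, about half of the population falls below the average by definition.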

In many local areas of the 1990s, the difference between a TV rating that kept a show on the air and one that would cancel it was so small as to be statistically insignificant, and yet the show that just happened to get the higher rating would survive.

The misuse of statistics
Kent Brockman: Mr. Simpson, how do you respond to the charge that petty vandalism such as graffiti is down eighty percent, while heavy sack-beatings are up a shocking nine-hundred percent?
Homer Simpson: Oh, people can come up with statistics to prove anything, Kent. Forfty percent of all people know that.
Kent Brockman: I... see.

This problem has become much greater in modern times because more data is available. If you're looking for a statistic to support your argument you can often find one, even if it's partial, selective, out-of-date or invented. It's also the case that statistical software packages are available that do all the math for researchers. This leads to the phenomenon where data goes into the black box and statistical significance comes out magically. Researchers can run a wide range of statistical tests without having a clue what they are doing or what the underlying math is doing. Statistical tests are very sensitive to the structure of the data, and they rest on key assumptions that must be met or the results are meaningless.

It is important to always keep in mind that a statistic is simply an argument and, just like any other argument, it does not exist in a vacuum. The reliability of its assumptions, the accuracy of its propositions and the relationship of all of these to the conclusions being drawn are all subject to as many problems as any argument made in words. Because of the overwhelming focus on "significance" as the goal, a series of major endemic biases are built into most published literature (see statistical significance for this discussion), and as such single results in single experiments are worthless for creating an accepted body of knowledge. Results must be reliable and repeatable. Many cranks and quacks will take advantage of the exalted status of statistics, and of the likelihood of finding a few "significant" results by chance alone, to sell their pseudoscience and woo nonsense.

Along the same lines, it is important to remember that a statistic is simply a number. Without knowing the background information, such as sample size, alpha level, etc., it is difficult (if not impossible) to draw any real conclusions. Moreover, a statistic merely describes a relationship… it does not comment on "cause and effect" (see Causalation).

Partly in response to the replication crisis, the American Statistical Association reiterated a set of 6 principles for interpreting p-values:
 * 1) P-values can indicate how incompatible the data are with a specified statistical model.
 * 2) P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
 * 3) Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
 * 4) Proper inference requires full reporting and transparency.
 * 5) A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
 * 6) By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.
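
The likelihood of finding "significant" results by chance alone can be illustrated with a simulation. This is a minimal sketch, not a recommended analysis: it assumes normally distributed data with a known standard deviation of 1 and uses a simple two-sided z-test, and the function names are illustrative. Every simulated experiment is drawn from a population in which the null hypothesis really is true.

```python
import math
import random

def two_sided_p_value(sample, sigma=1.0):
    """z-test p-value for H0: mean = 0, assuming the standard deviation sigma is known."""
    n = len(sample)
    z = (sum(sample) / n) / (sigma / math.sqrt(n))
    return math.erfc(abs(z) / math.sqrt(2))

def false_positive_rate(experiments=2000, n=30, alpha=0.05, seed=1):
    """Share of experiments reaching p < alpha when the null hypothesis is true."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(experiments):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]  # H0 really holds here
        if two_sided_p_value(sample) < alpha:
            hits += 1
    return hits / experiments

print(f"'significant' results with no real effect: {false_positive_rate():.1%}")
```

With a threshold of 0.05, roughly one experiment in twenty comes out "significant" even though there is nothing to find, which is why a single result passing the threshold is weak evidence on its own.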

Those wishing to see examples of exactly how not to use statistics should see our article Conservapedia:Schlafly Statistics.

TV ratings as an example
TV rating statistics have been heavily criticized over the years. To be fair, the system was originally designed when there were three (and later four) major networks and only one TV per household. But even at the beginning there was the issue that the sample wasn't random in the statistical sense of the word, because only a small fraction of the population was selected and only those who actually accepted were used as the sample. Also, thanks to advertisers, there was a heavy push for demographic data, which fragmented what was already a small sample. Then there was the issue of the people chosen having response bias, sometimes reporting a show they hadn't watched because they wanted it to survive. Replacing the written diaries with an electronic log didn't help much, as people would put in an easy code (which was usually assigned to a child in the household) rather than their own, which totally hosed the demographic data.

Worse, this same demographic data could be used to "game" the ratings. A show that a network wanted to die would be put in a timeslot totally unsuited to its demographic (as happened with Star Trek in its third season), run in the same timeslot as an equally popular show (one of the ways the original Doctor Who show's ratings were manipulated), or moved around so much that it couldn't get a stable audience (as happened to the 1990 Flash series), resulting in poor ratings that could then be used to justify killing the show. Alternatively, a show could be so editorially manipulated that nothing could be done ratings-wise to save it (as also happened with the original Doctor Who).

The first major crack in that system came as people started getting more than one TV, since odds were the extra sets wouldn't be counted. Things only got worse with the advent of cable networks in the 1990s: as the number of channels increased, the margin of error became so high that in very many local areas there was no statistical difference between a rating that would keep a show on the air and one that wouldn't, yet the show that just happened to get the higher rating would survive. As if this wasn't enough, shows that had been time-shifted (recorded to be watched later) were not counted, nor were shows seen outside the home (college dormitories, transport terminals, bars, and so on), effectively killing the original purpose for which ratings had been created (to measure the popularity of a show and determine how much an ad there was worth). Then there was the issue of local networks not carrying the national feed and showing their own programming (though this was rare for prime time shows).
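
To see how a small sample can make two ratings indistinguishable, consider the usual normal-approximation margin of error for a proportion. The sample size and ratings below are hypothetical, the sample is assumed to be a simple random one (which, as noted above, real ratings samples were not), and the function names are illustrative.

```python
import math

def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for a proportion p from a simple random sample of n."""
    return z * math.sqrt(p * (1 - p) / n)

def interval(p, n):
    """Approximate 95% interval (low, high) around an observed proportion p."""
    moe = margin_of_error(p, n)
    return (p - moe, p + moe)

# Hypothetical local market: 400 sampled households, two shows.
n = 400
cancelled = interval(0.09, n)   # show A, rated 9%
renewed = interval(0.11, n)     # show B, rated 11%
overlap = cancelled[1] >= renewed[0]
print(f"show A: {cancelled[0]:.1%} to {cancelled[1]:.1%}")
print(f"show B: {renewed[0]:.1%} to {renewed[1]:.1%}")
print("intervals overlap:", overlap)
```

Overlapping intervals are only a rough heuristic rather than a formal test, but they show why a two-point gap in such a sample is not a reliable basis for cancelling one show and renewing the other.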

Statistically, TV ratings were a total joke, but the real punchline was to come further down the line.

That punchline was delivered by the Internet. If cable expanded the number of available channels, then the internet exploded it. As late as 2013 the sponsors seemed determined to act as if it were still 1950, ignoring in their ratings people who watched a show via iTunes, Hulu, YouTube, or even the network's own feeds (such as ABC.com and CBS.com), degenerating the whole practice into a feel-good statistic that effectively didn't tell anybody anything really useful.

Even when online viewing was used there was no mechanism in place to determine any overlap between TV and internet viewing. For example, NBC had no idea what overlap (if any) there was between the 111.3 million traditional television viewers and 2.1 million live stream viewers of Super Bowl XLVI.

Manipulation
While many people are familiar with "cooking" statistical data after it is gathered, there are other ways to manipulate the data.

In surveys, how you ask the question can change the outcome. Changing one word in a question can produce totally different results (using "allow" versus "forbid" for example). Even the order in which questions are asked can change the results.

Also, what a statistic measures may not be what you think it is. Take unemployment for example. There are actually six different ways to measure unemployment (U3 is the chosen baseline):

U1: Persons unemployed 15 weeks or longer, as a percent of the civilian labor force

U2: Job losers and persons who completed temporary jobs, as a percent of the civilian labor force

U3: Total unemployed, as a percent of the civilian labor force (official unemployment rate) (actively looked for work within the past four weeks).

U4: Total unemployed plus discouraged workers, as a percent of the civilian labor force plus discouraged workers

U5: Total unemployed, plus discouraged workers, plus all other persons marginally attached to the labor force, as a percent of the civilian labor force plus all persons marginally attached to the labor force

U6: Total unemployed, plus all persons marginally attached to the labor force, plus total employed part time for economic reasons, as a percent of the civilian labor force plus all persons marginally attached to the labor force.

When people talk about unemployment being higher than is recorded odds are they are referring to the U4 through U6 percentages in comparison to the U3 numbers.

Right away you can see a problem with the U3 statistic: as people get discouraged and stop looking for work, the U3 number goes down even if the U4 through U6 numbers remain the same (or, worse, go up). Compounding matters, before 1994 there was a U1-U7 spectrum, which doesn't measure the same thing as the later U1-U6 spectrum.
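
The U3 problem can be shown with simple arithmetic using the U3 and U4 definitions listed above. The head counts are hypothetical and the function names are illustrative; the civilian labor force is taken to be the employed plus the unemployed.

```python
def u3(employed, unemployed):
    """Total unemployed as a share of the civilian labor force."""
    labor_force = employed + unemployed
    return unemployed / labor_force

def u4(employed, unemployed, discouraged):
    """Unemployed plus discouraged workers, as a share of labor force plus discouraged workers."""
    labor_force = employed + unemployed
    return (unemployed + discouraged) / (labor_force + discouraged)

# Hypothetical town: 900 employed, 100 unemployed, nobody discouraged yet.
before = (u3(900, 100), u4(900, 100, 0))
# 20 of the unemployed give up looking for work and become "discouraged".
after = (u3(900, 80), u4(900, 80, 20))

print(f"before: U3 = {before[0]:.1%}, U4 = {before[1]:.1%}")
print(f"after:  U3 = {after[0]:.1%}, U4 = {after[1]:.1%}")
```

Nobody found a job in this example, yet the official rate fell while U4 stayed where it was.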

Another problem is that formal unemployment data wasn't gathered until 1940, meaning that the unemployment numbers before then are estimates.

A final issue is that not only are the unemployment numbers drawn from a sample, but certain groups (people considered "self employed" for example) are counted

How to evaluate a statistic (example)
One of the first things to do is find out just how a particular statistic works, i.e. what it actually measures and what its limits are.

For example, both radiocarbon and palaeographic dating produce results that follow "normal distribution" curves, i.e. they are bell-shaped and follow the 68–95–99.7 rule.

Ideally you should be given the date, the range, and whether the deviation range (σ) is 1, 2, or 3. Sadly, much of the time you will be given only the date and no other information. Thankfully, some research can give us a 1σ deviation range.

For C-14 dating it is known that the half-life of C14 is 5730 ± 40 years, but the earlier estimate of 5568 ± 30 years is generally used and the result corrected afterwards, as the levels of C14 relative to normal C12 are not constant. Then there are the limits in measuring. All these are factored into the laboratory error multiplier. C-14 dating only tells you how long ago an organism died, not the date the object was last used.

Palaeography has at best a 50-year range, or ± 25 years at 1σ, and some argue that a range of 70 to 80 years (± 35 to ± 40 years) is more realistic.
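
Given a date and its 1σ value, the wider ranges follow directly from the 68–95–99.7 rule. The sketch below assumes a normal distribution and uses a hypothetical manuscript dated to 125 CE with the ± 25 years at 1σ mentioned above; the function name `sigma_ranges` is illustrative.

```python
def sigma_ranges(date, one_sigma):
    """Return {k: (low, high, coverage)} for k = 1, 2, 3 sigma, assuming a normal distribution."""
    coverage = {1: "68%", 2: "95%", 3: "99.7%"}
    return {k: (date - k * one_sigma, date + k * one_sigma, coverage[k]) for k in coverage}

# Hypothetical manuscript dated by palaeography to 125 CE with ± 25 years at 1σ.
for k, (low, high, pct) in sigma_ranges(125, 25).items():
    print(f"{k}σ (~{pct}): {low} to {high} CE")
```

A bare "125 CE" therefore hides the fact that, at the 95% level, the same estimate spans a full century.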

Statistics and evil
One death is tragedy; one million is statistic. Statistics has a very rich history of being involved in evil. And by "being involved", try "invented specifically for".
 * Regression to the mean- Originally called "reversion to mediocrity" because the scientist involved, Francis Galton, didn't like the result, this was invented for the study of Eugenics. The original study was to find the height of the son given the height of the father, to determine the heredity of height. Because the mother's height has no influence, apparently. The result was that the son's height was expected to fall about halfway between the father's height and the population average height.
 * Surveys- Also invented by Francis Galton, in order to collect more data for Eugenics research
 * P-value (and so much more)- Invented by Karl Pearson, Galton's protege, who happened to be even more racist than his mentor, and when your mentor literally coined the word "eugenics", that's saying something.
 * Any Statistics Department in college- The first one was founded in 1911 in University College London by Karl Pearson... in order to help promote the study of Eugenics.
 * Monte Carlo Valuation- Rather than actually do the theoretical math, throw in random values and run thousands of scenarios, and then use that as your possible results. Invented for the purpose of building the atom bomb. The name has a bit of an evil history as well; Stanislaw Ulam’s uncle lost his fortune gambling in the Monte Carlo casino in Monaco, so Stan used the name as he was 'betting' on random results.
 * Student's T Distribution- Discovered by a statistician working with Guinness Brewers to try and figure out how to get more barley for the beer (the pseudonym "Student" was used so he could publish anonymously), this one may be either evil or good depending upon your view of beer, or of thick, aromatic brews as opposed to American Pisswater.
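
As an illustration of the Monte Carlo idea described in the list above, the following sketch estimates the chance of rolling a total of 7 with two dice by running many random trials instead of doing the theoretical math. The problem is deliberately simple so the answer can be compared with the exact value of 1/6; the function name and parameters are illustrative.

```python
import random

def estimate_probability_of_seven(trials=100_000, seed=42):
    """Estimate P(two dice sum to 7) by simulating many random rolls."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(trials) if rng.randint(1, 6) + rng.randint(1, 6) == 7)
    return hits / trials

estimate = estimate_probability_of_seven()
print(f"simulated: {estimate:.4f}   exact: {1/6:.4f}")
```

The estimate gets closer to the exact value as the number of trials grows, which is the trade the method makes: computing time in place of a closed-form derivation.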