Decisions can be made on gut feeling (instinct, a guess, an idea) or they can be data driven. A simple example is rolling 2 dice and predicting their sum: going on gut feeling you might come up with any number, but a data-driven decision would choose 7, which has the highest probability (6 of the 36 equally likely outcomes). There will be many instances where a decision based on gut feeling turns out to be right, but going with data-driven decisions reduces the chances of failure. And the core of data-driven decision making is Hypothesis testing.
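To see where the 7 comes from, here is a small sketch that simply lists every outcome of two dice and counts the sums. It assumes both dice are fair and six-sided; the variable names are my own.

```python
from collections import Counter
from itertools import product

# Enumerate all 36 equally likely outcomes of two fair six-sided dice
# and count how often each sum appears.
sum_counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
total_outcomes = sum(sum_counts.values())

# Convert the counts into probabilities, ordered by sum.
probabilities = {s: c / total_outcomes for s, c in sorted(sum_counts.items())}

# The data-driven choice is the sum with the highest probability.
most_likely_sum = max(probabilities, key=probabilities.get)

for s, p in probabilities.items():
    print(f"sum={s:2d}  probability={p:.3f}")
print("most likely sum:", most_likely_sum)
```

Picking the most likely sum does not guarantee a win on any single roll; it only gives the best odds over many rolls.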
What is Hypothesis testing?
“A statistical hypothesis test is a method of statistical inference used to decide whether the data at hand sufficiently support a particular hypothesis. Hypothesis testing allows us to make probabilistic statements about population parameters.” -Wikipedia
So basically, Hypothesis testing is a way of checking a gut feeling against the data we collect, and seeing whether the data proves it wrong :).
Before we proceed with how we can do Hypothesis testing, we need to understand some basic statistical terminology.
1) Population and Sample: The population is the complete set of items we care about (for example, all the users), while a sample is a subset drawn from it. Getting the data of the whole population is very difficult, hence most of the time data-driven decisions are made on sample data.
2) Mean: The simple average, and one of the primary measures of central tendency. The sample mean is an estimate of the population mean, so the two values generally differ.
3) Variance: The dispersion of the values around their mean, expressed in squared units. The sample variance divides by n - 1, while the population variance divides by n.
4) Standard Deviation: The square root of the variance. Since variance is represented in squared units, the standard deviation is used to express dispersion in the original units of the data.
5) Normal Distribution: Also called the bell curve. The mean, median and mode are equal, and the distribution is symmetric about the mean.
6) Standard normal distribution: A normal distribution where the mean is 0 and the standard deviation is 1.
7) Central limit theorem: Irrespective of the distribution of the underlying data (as long as its variance is finite), the sampling distribution of the sample mean approaches a normal distribution as the sample size grows.
8) Confidence interval: A range in which you expect your population parameter to fall. How often such a range captures the true value is set by the confidence level, which is 1 minus the significance level; a 95% confidence level (a 5% significance level) is the widely accepted standard. The calculation depends on what is known about the population. When the population variance is known, we use the z statistic with the Z table (https://www.ztable.net/). When the population variance is unknown, we use the t statistic from Student's t distribution with the t table (http://www.ttable.org/).
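A rough sketch of a few of these terms in code, using only Python's standard library. The sample prices and the "known" population standard deviation of 20 are made-up values for illustration, and the interval below is the z-based one (population variance assumed known).

```python
import math
import statistics
from statistics import NormalDist

# Hypothetical sample: price (in rs) of 1 kg of apples at 10 shops.
sample = [985, 1010, 995, 1020, 970, 1005, 990, 1015, 980, 1000]
n = len(sample)

sample_mean = statistics.mean(sample)
sample_variance = statistics.variance(sample)       # divides by n - 1
population_variance = statistics.pvariance(sample)  # divides by n
sample_std = statistics.stdev(sample)               # sqrt of sample variance

# Assumption for illustration: the population standard deviation is known.
population_std = 20.0
confidence_level = 0.95
significance_level = 1 - confidence_level

# Critical z value for a two-sided interval (about 1.96 at 95%).
z_critical = NormalDist().inv_cdf(1 - significance_level / 2)

# Margin of error = z * sigma / sqrt(n)
margin = z_critical * population_std / math.sqrt(n)
ci_low, ci_high = sample_mean - margin, sample_mean + margin

print(f"mean={sample_mean:.1f}  variance={sample_variance:.2f}  std={sample_std:.2f}")
print(f"{confidence_level:.0%} confidence interval: ({ci_low:.1f}, {ci_high:.1f})")
```

`NormalDist().inv_cdf` plays the role of the Z table lookup here. The standard library has no t distribution, so the unknown-variance case would still need a t table or an external library.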
So, with these basic statistics under our belt, we can proceed with understanding Hypothesis testing.
Basically, Hypothesis testing contains 2 parts: 1) Null Hypothesis: This is the statement which we try to prove wrong; it is the default assumption. 2) Alternate Hypothesis: The opposite of the null hypothesis statement, which we accept only if the data lets us reject the null.
A Hypothesis test can be a one-sided or a two-sided test.
1) one-sided test: Null Hypothesis states that 1kg of apples in Bangalore costs greater than or equal to 1000rs, and the alternate hypothesis is that it costs less than 1000rs.
2) two-sided test: Null Hypothesis states that 1kg of apples in Bangalore costs exactly 1000rs, and the alternate hypothesis is that it costs either more or less than 1000rs.
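Here is a minimal sketch of how the two kinds of test differ in practice, using a z-test on the apple example. The sample mean (990rs), population standard deviation (30rs) and sample size (36) are made-up numbers, and `z_test` is a helper written for this post, not a library function.

```python
import math
from statistics import NormalDist

def z_test(sample_mean, hypothesized_mean, population_std, n, alternative):
    """Return (z, p_value) for a one-sample z-test.

    alternative: "less", "greater" or "two-sided" (the alternate hypothesis).
    """
    standard_error = population_std / math.sqrt(n)
    z = (sample_mean - hypothesized_mean) / standard_error
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if alternative == "less":
        p_value = std_normal.cdf(z)
    elif alternative == "greater":
        p_value = 1 - std_normal.cdf(z)
    elif alternative == "two-sided":
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
    else:
        raise ValueError(f"unknown alternative: {alternative!r}")
    return z, p_value

# Hypothetical data: 36 shops, average price 990rs, known sigma of 30rs.
z_one, p_one = z_test(990, 1000, 30, 36, alternative="less")       # H1: price < 1000
z_two, p_two = z_test(990, 1000, 30, 36, alternative="two-sided")  # H1: price != 1000

alpha = 0.05  # significance level
print(f"one-sided: z={z_one:.2f}, p={p_one:.4f}, reject null: {p_one < alpha}")
print(f"two-sided: z={z_two:.2f}, p={p_two:.4f}, reject null: {p_two < alpha}")
```

For the same data, the two-sided p-value is twice the one-sided one, because it counts extreme results in both directions.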
And with any type of test, there are 2 errors that can occur: 1) Type 1 error: Rejecting a null hypothesis which is true, also called a false positive. 2) Type 2 error: Failing to reject a null hypothesis which is false, also called a false negative.
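Both errors can be seen in a small simulation. This is only a sketch under made-up assumptions: prices are normally distributed with a known standard deviation of 30rs, each experiment samples 30 shops, and we run a two-sided z-test at a 5% significance level. The helper names are my own.

```python
import math
import random
from statistics import NormalDist

def rejects_null(sample, hypothesized_mean, population_std, alpha):
    """Two-sided z-test: True if the null hypothesis is rejected."""
    sample_mean = sum(sample) / len(sample)
    z = (sample_mean - hypothesized_mean) / (population_std / math.sqrt(len(sample)))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_value < alpha

def rejection_rate(true_mean, hypothesized_mean, population_std, n, alpha, trials, rng):
    """Fraction of simulated experiments in which the null is rejected."""
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, population_std) for _ in range(n)]
        if rejects_null(sample, hypothesized_mean, population_std, alpha):
            rejections += 1
    return rejections / trials

rng = random.Random(42)  # fixed seed so the run is repeatable

# Null is TRUE (real mean is 1000): every rejection is a Type 1 error.
type1_rate = rejection_rate(1000, 1000, 30, 30, 0.05, 5000, rng)

# Null is FALSE (real mean is 1010): every non-rejection is a Type 2 error.
type2_rate = 1 - rejection_rate(1010, 1000, 30, 30, 0.05, 5000, rng)

print(f"Type 1 error rate: {type1_rate:.3f}")
print(f"Type 2 error rate: {type2_rate:.3f}")
```

The Type 1 error rate should land close to the significance level, since that is exactly what the significance level controls. The Type 2 error rate depends on how far the real mean is from the hypothesized one and on the sample size.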
That’s all folks! With this, I end this short post covering the statistics needed and a high-level brief on Hypothesis testing. In the next post we will see all of these using a practical example.