# How to Understand Statistics

*This is a study guide created from lecture videos to help you gain an understanding of statistics.*

## How to Understand Statistics: Part 1

Statistics- quantify uncertainty, discern biases

*Is my data set good?*

Data needs to be organized, should be good quality data

Secondary data- collected by others

Primary data- collected by you in a natural environment

How was the data collected and are there flaws in the data?

Organized Data- provides service and convenience, enables good decision making, persuades, saves time and money

*The middle of the data: means and medians*

A data set is a collection of values.

Knowing the center of the data (the average/mean) gives you a sense of where the data balances.

Mean (Average)- sum of all data points divided by the total number of observations.

Median- the midpoint of the data with equal number of data points above and below

Medians for data sets with even numbers of data points- take the average of the two middle points/numbers

Weighted mean- both the categories and the weight given to each category matter. Ask how the categories and weights were chosen; a weighted mean can be arbitrary.

The mode- the data point that appears most often in the data set. The mode represents the most likely outcome in a data set. There is no minimum frequency required for a value to be the mode. Use the mode together with the mean and the median.
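The three center measures above can be sketched with Python's built-in `statistics` module; the data set here is made up for illustration:

```python
import statistics

data = [2, 3, 3, 5, 7, 8, 14]  # hypothetical data set

mean = statistics.mean(data)      # sum of all points / number of points
median = statistics.median(data)  # midpoint; averages the two middle values when n is even
mode = statistics.mode(data)      # the most frequent value

print(mean, median, mode)  # → 6 5 3
```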

The range- the difference between the largest and smallest value in a data set. Use a histogram chart. The range is not always indicative, since a single extreme value can change it.

Standard deviation- roughly the average distance from the mean. It tells you whether data points are similar to one another. Technically, it is the square root of the average squared distance from the mean.

*How many standard deviations?*

Standard deviation can be used to evaluate individual data points.

*Outliers*

A data point that is an abnormal distance from other values in a data set. Tables and charts or standard deviation can determine outliers. Do not just throw outlier data sets away. Use as an opportunity. Questions to ask: Is this really an outlier? How did this happen? What can be learned? What needs to change?

*Z-score: Measuring by using standard deviations*

Used to find how many standard deviations a data point lies from the mean. You need the data point, the mean, and the standard deviation to get a z-score. A z-score of 2.11 means a data point is 2.11 standard deviations away from the mean. A z-score can also be negative, meaning the data point lies below the mean.
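As a rough sketch (the numbers are made up), the z-score calculation looks like this in Python:

```python
import statistics

data = [4, 8, 6, 5, 3, 2, 8, 9, 2, 5]  # hypothetical data set
mean = statistics.mean(data)
stdev = statistics.pstdev(data)  # population standard deviation

def z_score(x, mean, stdev):
    # how many standard deviations x lies from the mean
    return (x - mean) / stdev

print(round(z_score(9, mean, stdev), 2))  # → 1.58
print(round(z_score(2, mean, stdev), 2))  # → -1.33 (negative: below the mean)
```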

*Empirical Rule / Three-Sigma Rule*

Sigma means standard deviation. The empirical rule is useful for understanding the distribution of data points in a data set. It works for symmetrically distributed data, which follows a pattern whereby most data points fall within 3 standard deviations of the mean.

68% of data points fall within 1 StDev from the mean

95% of data points are within 2 StDev from the mean

99.7% of data points are within 3 StDev from the mean

It only works with a well-centered, symmetrical bell-shaped curve. If a data point has a z-score above 3.0, you can typically be confident that it is an outlier.
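You can verify the 68-95-99.7 percentages directly from the standard normal curve using Python's `statistics.NormalDist`:

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

for k in (1, 2, 3):
    # probability of landing within k standard deviations of the mean
    within = std_normal.cdf(k) - std_normal.cdf(-k)
    print(f"{k} StDev: {within:.1%}")  # → 68.3%, 95.4%, 99.7%
```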

*Calculating percentiles*

The 100th percentile is not possible; the 99th percentile (the top 1%) is the highest.

*Defining probability*

The likelihood that some event will occur. It is the desired outcome divided by all possible outcomes. Probability is easiest when you are able to quantify all of the possible outcomes.

*Examples of probability*

Total probability of a sample space- the sum of probabilities of all possible outcomes must add up to 100%. The highest probability for any scenario is 100% and the lowest probability for any scenario is 0%.

*Types of probability*

Classical probability- example is a coin flip. There are two outcomes that are equally likely. It works well when you know all possible outcomes and the outcomes are equally likely to occur.

Empirical probability- use this when not everything is fair and equal, when there are many variables and different situations. Use it when you have reliable data to assist you.

Subjective probability- when you do not have any reliable data. It is typically an opinion based probability.

*Multiple Event Probability*

Probability of two events- use the addition rule. You need to know the number of possible scenarios.

*Explanation of conditional probability: if X happens, then…*

Conditional probability- the probability of an event, given that another has already occurred. Probability trees can help visualize.
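A tiny worked example of P(A | B) = P(A and B) / P(B), using a standard 52-card deck:

```python
# What is the probability a drawn card is a king, given that it is a face card?
p_face = 12 / 52          # jacks, queens, kings
p_king_and_face = 4 / 52  # every king is also a face card

p_king_given_face = p_king_and_face / p_face
print(round(p_king_given_face, 3))  # → 0.333 (one in three)
```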

*Relationship between two events: independence vs. dependence*

Test events to see whether they are independent or dependent.

*Bayes theorem and false positives*

What is the probability that the test results are wrong? That question is the basis for Bayes' theorem.
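A sketch of why false positives matter, with made-up numbers (1% prevalence, 99% sensitivity, 5% false positive rate — all assumed for illustration):

```python
p_condition = 0.01             # assumed: 1% of people have the condition
p_pos_given_condition = 0.99   # assumed sensitivity
p_pos_given_healthy = 0.05     # assumed false positive rate

# Bayes' theorem: P(condition | positive test)
#   = P(positive | condition) * P(condition) / P(positive)
p_positive = (p_pos_given_condition * p_condition
              + p_pos_given_healthy * (1 - p_condition))
p_condition_given_positive = p_pos_given_condition * p_condition / p_positive

print(round(p_condition_given_positive, 3))  # → 0.167
```

Even with an accurate test, most positives here are false, because the condition itself is rare.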

*Permutations: The order of things*

The number of ways in which objects can be arranged. The letters A and B can be arranged in two permutations (AB and BA). The formula for the number of permutations of n objects is n! (n factorial).

*Combinations: Permutations without regard for order*

For combinations, the order of events does not matter. There is a combination formula. It is used to figure out how many different ways there are to arrange people into groups, or how many poker hands are possible, and to calculate probabilities.
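Python's `math` module has both formulas built in; the group sizes below are made up for illustration:

```python
import math

arrangements_ab = math.perm(2)   # permutations of the letters A and B: 2! = 2
lineups = math.perm(10, 3)       # ordered arrangements of 3 people out of 10
groups = math.comb(10, 3)        # unordered groups of 3 out of 10
poker_hands = math.comb(52, 5)   # possible 5-card poker hands

print(arrangements_ab, lineups, groups, poker_hands)  # → 2 720 120 2598960
```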

*Discrete vs. Continuous*

Random variable- value of the outcome is unknown.

Discrete random variables- no decimals; they go up by whole numbers, such as counts of drink purchases.

Continuous random variables- possibilities are endless.

*Discrete probability distribution*

Specific calculations for mean and standard deviation.

Discrete random variables- characterized by whole numbers, not decimals. An example is ordering drinks: you cannot order half a drink.

Expected monetary value/EMV- total of the weighted payoffs associated with a decision. This can be used towards a money related decision.
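A minimal EMV sketch, with hypothetical payoffs and probabilities:

```python
# Each outcome is (payoff, probability); the probabilities must sum to 1.
# Hypothetical decision: a launch that earns $50,000 with probability 0.3,
# breaks even with probability 0.5, and loses $20,000 with probability 0.2.
outcomes = [(50_000, 0.3), (0, 0.5), (-20_000, 0.2)]

# EMV: total of the payoffs weighted by their probabilities.
emv = sum(payoff * prob for payoff, prob in outcomes)
print(round(emv))  # → 11000
```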

*Binomial Random Variable*

An experiment that only has two possible outcomes.

Binomial probability tables can help with binomial problems. When n gets bigger, you have to bring calculus into the occasion. When p is not equal to 0.5 and n gets really big, the normal curve is introduced.
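For small n you can compute binomial probabilities directly rather than from a table; the coin-flip scenario is assumed for illustration:

```python
import math

def binomial_pmf(n, k, p):
    # probability of exactly k successes in n trials with success probability p
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

# Probability of exactly 7 heads in 10 fair coin flips:
print(round(binomial_pmf(10, 7, 0.5), 4))  # → 0.1172
```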

*Probability densities: Curves and continuous random variables*

Instead of using bar charts that we would use with discrete probabilities, continuous random variables use line charts and curves/probability density.

*Bell-Shaped Curve*

When data takes on a bell shape for its probability distribution, the mean is typically centered at the highest point of the curve. The area under the curve accounts for 100% of all possible outcomes. This is called the classic normal curve. The wider the curve, the larger the standard deviation; the narrower the bell shape, the smaller the standard deviation.

*Fuzzy Central Limit Theorem*

When data is influenced by dozens or even hundreds of small and often unrelated random effects, the results end up being normally distributed. Most things in life are like this, and that is why the normal distribution shows up everywhere as a bell-shaped curve.

*Z transformation to find probabilities*

The approach is to find the z-score, then look up the corresponding value in the standard normal distribution table.
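A sketch of that lookup, using `statistics.NormalDist` in place of the printed table (the distribution parameters are made up):

```python
from statistics import NormalDist

mean, stdev = 70, 10  # assumed population parameters
x = 85

z = (x - mean) / stdev   # the z transformation
p = NormalDist().cdf(z)  # table value: area to the left under the standard normal curve

print(z, round(p, 4))  # → 1.5 0.9332
```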


## How to Understand Statistics: Part 2

Statistics is about trying to understand a situation.

*Understanding data and distributions*

Mean is the average of the data points

Median is the middle data point/central point

Standard Deviation is the average distance between data point and the mean

Normally distributed means the data is symmetrically distributed

Distribution curves show how the data is distributed: where the data points fall, where the center is located, and what to expect.

Z-Scores are a measure of the number of standard deviations a particular data point is from the mean.

*Probability and random variables*

Probability is the ratio of a particular event or outcome versus all the possible outcomes.

Random experiments are opportunities to observe the outcome of a chance event.

Random variable is the numerical outcome of a random experiment.

*Inferential Statistics*

Used to find meaningful statistics that will inform us about a population.

Confidence intervals provide a level of confidence for a given interval.

Hypothesis testing is used as a process to test findings from inferential statistics.

*Sample considerations*

Sample – Small group or subset of a population used for testing and can act as a representative for the entire population

Sample size, Selection process, Bias, Measurement are all considerations for evaluating a sample. The best samples are typically chosen at random.

*Random Samples*

Simple Random Sample- Each individual has the same probability of being chosen at any stage. Each subset of k individuals has the same probability of being chosen as any other subset containing k individuals. A simple random sample must be an unbiased sample and have independent data points.

A simple random sample is one where every member of the population has an equal chance of being chosen. A simple random sample is the only way to get dependable statistical outcomes.

Independence within a simple random sample means that the selection of one member must not influence the selection of other units.

Alternatives to random samples are simple to organize, easy to carry out, and logical. A systematic sample takes every kth unit; an opportunity sample takes the first k units available; a stratified sample breaks the population into homogeneous groups; a cluster sample is similar to a stratified sample in that the population is broken into groups, but each group can contain a mix of characteristics and is heterogeneous compared to the strata.

*Sample size*

A sample is a group of units drawn from a population. Sample size is the number of units drawn and measured for that particular sample. The larger the sample, the greater accuracy and confidence.

The bigger the sample size, the smaller the standard deviation of the sample means (the standard error).

*The Central Limit Theorem*

You can use the sample mean to point you toward the population mean. The central limit theorem explains that the more samples we take, the closer the mean of our sample means gets to the population mean. As the sample size increases, the curve of the sample means moves closer to a normal distribution.
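A quick simulation illustrating the theorem; the population here is deliberately skewed (not bell shaped), yet the sample means still cluster tightly around the population mean:

```python
import random
import statistics

random.seed(42)  # reproducible

# A skewed, clearly non-normal population.
population = [random.expovariate(1.0) for _ in range(100_000)]
pop_mean = statistics.mean(population)

# Take many random samples and record each sample's mean.
sample_means = [statistics.mean(random.sample(population, 50)) for _ in range(2_000)]
mean_of_means = statistics.mean(sample_means)

# The mean of the sample means lands very close to the population mean,
# and the sample means are far less spread out than the raw population.
print(round(pop_mean, 2), round(mean_of_means, 2))
```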

*Standard error for proportions*

Standard error is the standard deviation of our proportion distribution. The standard error is related to the standard deviation. The standard error allows us to set up a range around the population proportion that extends the equivalent of one standard deviation in both positive and negative direction.

When the samples do not fall within the percentage range, it can be explained as beyond limits, and this can be due to a unique environment, flaw in data, flaw in the gathering of the data, a market change or bias in the sampling.

*Sampling distribution of the mean*

Due to the central limit theorem, you can typically trust your simple random samples. The central limit theorem helps us with sampling to approximate population means. It explains that as long as you have enough random samples then it will be an excellent approximation of population means.

*Standard Error for Means*

Average of sample means comes with a standard error. If we use larger sample sizes, the standard error gets smaller.

*Confidence Intervals*

An interval has a lower and upper limit and is centered from the sample proportion/p-hat.

95% confidence intervals for population proportions- a z-score tells you how many standard deviations away from the mean you need to be to capture a certain percentage of the total distribution. To find the limits of a 95% confidence interval, use the sample proportion plus or minus 1.96 (the z-score for 95%) times the standard error.
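A sketch with made-up poll numbers (520 of 1,000 respondents in favor):

```python
import math

n = 1_000
p_hat = 520 / n  # sample proportion

# Standard error of the sample proportion.
se = math.sqrt(p_hat * (1 - p_hat) / n)

# 95% confidence interval: p-hat plus or minus 1.96 standard errors.
lower = p_hat - 1.96 * se
upper = p_hat + 1.96 * se

print(round(lower, 3), round(upper, 3))  # → 0.489 0.551
```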

*Explaining unexpected outcomes*

Ways to explain unexpected outcomes- respondents can lie, change their mind after the initial sample, or be unsure in general; the sample may not be random; the organization may be biased. Uncontrollable events can also occur on the day itself, such as bad weather affecting an experiment measuring how many people will watch a sporting event.

Ensure you know how the study was conducted and how the data was collected to verify the validity of it.

*Hypothesis Testing*

Make an assumption -> collect random samples -> measure the samples -> make conclusions

How to test a hypothesis-

Develop hypotheses: state your null hypothesis -> state your alternative hypothesis -> choose a significance level, such as 5%

Identify the test statistic: for example, a binomial probability

Determine the p-value

Compare the p-value to the significance level
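The steps above can be sketched end to end with a binomial test statistic; the coin-flip scenario and counts are made up:

```python
import math

def binomial_pmf(n, k, p):
    # probability of exactly k successes in n trials with success probability p
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

# Null hypothesis: the coin is fair (p = 0.5). Alternative: it favors heads.
n, observed, p_null = 20, 15, 0.5
alpha = 0.05  # significance level

# One-tailed p-value: probability of seeing 15 or more heads if the null is true.
p_value = sum(binomial_pmf(n, k, p_null) for k in range(observed, n + 1))

print(round(p_value, 4), p_value < alpha)  # → 0.0207 True
```

Since the p-value falls below the 5% significance level, this test would reject the null hypothesis.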

One tailed vs two-tailed tests- a hypothesis test can be one tailed, which means only one area on the distribution can reject the null hypothesis. A two tailed test can have a null hypothesis that can be rejected in both the lower and upper areas.

Type I and Type II errors-

A Type I error is a false positive (rejecting a true null hypothesis); a Type II error is a false negative (failing to reject a false null hypothesis).

Remember, hypotheses tests can make errors.

## How to Understand Statistics: Part 3

A Z score of 1.85 tells us that a particular data point is 1.85 standard deviations away from the data set’s mean.

The larger our sample size, the more reliable our data.

Hypothesis testing is a process that helps us test an outcome.

*T-statistic vs z-statistic*

The z-score is used to determine how many standard deviations a data point may lie from the population mean. The z-score can only work if the data is normally distributed, the sample is larger than 30 and we know the standard deviation of the population.

When you are creating a confidence interval and the population variance is unknown, use the t-distribution. The t and z scores both require a normal distribution to work. The difference is that the z-score is used to compare the mean of a sample to a larger population. The sample comes from the population, so the means of the sample and population are intertwined.

The t-test can also compare two completely independent samples, and they do not have to come from the same population. The t-distribution isn't one curve but a series of curves. The smaller the sample size, the flatter the curve.

*T-scores*

| Sample size | T-score |
| --- | --- |
| 3 | 4.303 |
| 10 | 2.262 |
| 20 | 2.093 |
| 100 | 1.98 |

*T-score tables and degrees of freedom*

The t-distribution table has a different row for each sample size.

Degrees of freedom is simply the sample size minus 1 (n – 1)

*Calculating confidence intervals using t-scores*

To calculate confidence interval, use UCL and LCL formulas and find your standard error.

UCL- sample mean + (t-score) (approx standard error of the mean)

LCL- sample mean – (t-score) (approx standard error of the mean)
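A sketch of the UCL/LCL formulas: with a sample of 10 (9 degrees of freedom), the 95% t-score is 2.262, as in the table above. The measurements themselves are hypothetical:

```python
import math
import statistics

sample = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7, 12.1, 12.5]

n = len(sample)
mean = statistics.mean(sample)
se = statistics.stdev(sample) / math.sqrt(n)  # approx. standard error of the mean
t_score = 2.262  # 95% t-score for 9 degrees of freedom

lcl = mean - t_score * se
ucl = mean + t_score * se

print(round(lcl, 2), round(ucl, 2))  # → 11.92 12.28
```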

*Comparing Two Populations for Proportions*

Each situation can be analyzed by comparing two independent random samples.

How to set up a comparison-

True difference of population proportions = observed difference from samples +/- [critical value x standard error]

*Comparing Two Populations for Means*

Visualization (re-randomizing)- you can continuously randomize two groups in an experiment to visualize through a distribution chart

Setting up a confidence interval- the confidence interval will contain the true difference between the population mean score for both groups.

True difference of population means = observed difference from samples +/- [critical value x standard error]

Hypothesis testing for comparing two populations- you need to know the size of one standard deviation, but you only have sample data; with a large sample size, the standard error of the difference between the two sample means can serve as the standard deviation.

*Chi-Square*

The goodness of fit test is used to perform hypothesis tests that compare two or more populations. Goodness of fit/chi-square is better for evaluating data sets where the data is categorized. Goodness of fit helps you decide whether observed data (for a single year, say) follows the probability distribution that is provided.

*Curves and distribution*

The Y axis represents probability. Chi Square allows us to see how multiple independent variables interact.

For each degree of freedom, we have a different chi-square distribution curve. The greater the degrees of freedom, the closer a chi-square distribution gets to a normal distribution.

How to read a Chi square critical values chart- identify degrees of freedom on left side -> identify probability threshold along top of chart

*Goodness of Fit test*

A type of chi-square hypothesis test used to compare two or more populations.

Set up hypotheses -> calculate a chi-square value for each time period -> add all chi-square values to get the chi-square test statistic -> determine the degrees of freedom -> go to the chi-square table and find the row for your degrees of freedom -> obtain the critical value from the chart -> compare the test statistic with the critical value on the chi-square distribution

The smaller the chi square values, the better the goodness of fit.
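A sketch of the statistic itself, with made-up observed and expected counts:

```python
# Observed counts per category vs. counts expected under the
# hypothesized distribution (numbers are hypothetical).
observed = [48, 35, 17]
expected = [50, 30, 20]

# Chi-square statistic: sum of (observed - expected)^2 / expected.
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi_square, 3))  # → 1.363

# With 3 categories there are 2 degrees of freedom; the 5% critical value
# is 5.991, so this small statistic would not reject the null hypothesis.
```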

*What is analysis of variance / ANOVA*

ANOVA is a procedure used to determine if the variation in reported output is the result of some particular factor, or if the variation is simply the result of randomness.

ANOVA relies on assumptions: each population is normally distributed, the observations are independent from one another, the populations compared have an equivalent variance.

ANOVA allows you to have a different number of data points for each individual level.

One way ANOVA is a procedure that allows us to compare the means of different levels of one factor.

Another type of ANOVA is randomized block ANOVA, which allows you to see other factors that can be influencing outcomes.

Two way ANOVA allows you to look at the intersection between two factors.

*One way ANOVA and the total sum of squares (SST)*

Total sum of squares/SST is the total amount of variation between each data value and the grand mean.

*Variance within and variance between (SSW and SSB)*

Variance within- the variance between the data values for each company and that company's mean score. Instead of subtracting the grand mean, you subtract the individual mean for each company. Once you add up all the squares, you get the sum of squares within (SSW).

Variance between- The variance between the mean score for each company and the grand mean. Add up all the squares to receive the sum of squares between.

SSW + SSB = SST
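The identity SSW + SSB = SST can be checked numerically with made-up scores for three companies:

```python
import statistics

groups = {  # hypothetical scores per company
    "A": [82, 85, 88],
    "B": [75, 78, 81],
    "C": [90, 92, 94],
}

all_values = [v for scores in groups.values() for v in scores]
grand_mean = statistics.mean(all_values)

# SST: every value vs. the grand mean.
sst = sum((v - grand_mean) ** 2 for v in all_values)

# SSW: every value vs. its own company's mean.
ssw = sum((v - statistics.mean(s)) ** 2 for s in groups.values() for v in s)

# SSB: each company's mean vs. the grand mean, weighted by group size.
ssb = sum(len(s) * (statistics.mean(s) - grand_mean) ** 2 for s in groups.values())

print(sst, ssw, ssb)     # → 338 44 294
print(ssw + ssb == sst)  # → True
```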

*Hypothesis test and f-statistic for ANOVA*

With ANOVA, the null hypothesis is always the same. The null hypothesis will state that all populations are equal.

The F statistic formula is used to test an ANOVA hypothesis.

If the F-statistic is big, there is a big difference between the companies, so reject the null hypothesis. If the F-statistic is small, there is not a big difference between the companies.

After you obtain your F-statistic, then go to the F-distribution table. There is a different table for each level of significance.

*Regression*

A regression will help investigate the relationship between two variables.

Obtain the data -> create the data points on a graph -> regression analysis will find the formula for the line that fits the distribution

R squared helps to understand the relationship in variation between the X and Y variables for regression analysis.

Correlation coefficient helps understand the regression line on how it fits the data.

*Find the best fitting line*

Slope intercept form is y = ax + b

Get your data set and add the necessary columns: the fourth column is the product of the x and y values, the fifth column is the x value squared, the sixth column is the y value squared. -> plug the values into the formula for the slope -> plug the values in to find the y-intercept
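The column sums plug into the standard least-squares formulas like this; the data points are made up:

```python
points = [(1, 2), (2, 4), (3, 5), (4, 8)]  # hypothetical (x, y) pairs

n = len(points)
sum_x = sum(x for x, _ in points)
sum_y = sum(y for _, y in points)
sum_xy = sum(x * y for x, y in points)   # fourth column: x times y
sum_x2 = sum(x**2 for x, _ in points)    # fifth column: x squared

# Slope a and intercept b for the best-fitting line y = ax + b.
a = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x**2)
b = (sum_y - a * sum_x) / n

print(a, b)  # → 1.9 0.0
```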

*Coefficient of determination*

The coefficient of determination is used when you have larger data sets and need to be able to tell if the regression line is a good fit for the data.

R-squared is a number between 0 and 1. 0 will indicate the data is a poor fit and 1 will indicate a perfect fit.

The formula for R-squared is SSR / SST. SSR is the sum of squares regression and SST is the total sum of squares.

To calculate R-squared- take the initial data set and add a y-hat column, created from the regression formula found previously: for each row, y-hat is 883 times the *x* value, minus 479.3. Also add a column for the mean of y.

-> Determine the Individual Square Regression by subtracting y hat minus mean y for each row

-> Calculate the Sum of Squares Regression by adding up all the Individual Square Regressions

-> Calculate the Individual Squares by subtracting (observed) *y *minus mean *y *for each row

-> Calculate the Total Sum of Squares by adding up all the Individual Squares from above

-> Calculate R squared by dividing SSR / SST
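A sketch of the R-squared calculation on a tiny made-up data set; the line y = 1.9x is its least-squares fit:

```python
import statistics

x = [1, 2, 3, 4]
y = [2, 4, 5, 8]                # observed values (hypothetical)
y_hat = [1.9 * xi for xi in x]  # predictions from the fitted line y = 1.9x

mean_y = statistics.mean(y)

# SSR: predicted values vs. the mean of y.
ssr = sum((yh - mean_y) ** 2 for yh in y_hat)

# SST: observed values vs. the mean of y.
sst = sum((yo - mean_y) ** 2 for yo in y)

r_squared = ssr / sst
print(round(r_squared, 3))  # → 0.963, a good fit
```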

*Correlation Coefficient*

The lowercase r is the correlation coefficient which is the square root of R-squared. It is the same sign as the slope of the regression line.

Visit DataCamp for an interactive course to learn statistics.

## Your Mind Moves the Machine.