Random Sampling and Sample Bias

Data scientists must focus on the data at hand and on the sampling method used to collect it
If a sample comes from a physical process, it is good to know which process it is, because that makes modelling easier
- We can make some assumptions here, because physical processes obey physical laws and rules
- For example: a coin flip is a physical process, and its outcome (heads or tails) follows a binomial distribution
  - We can model other yes/no problems as coin flips, i.e. as binomial phenomena (fraud or not, buy or not)
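
To make the coin-flip analogy concrete, here is a minimal sketch that simulates yes/no trials with Python's standard library. The function name `simulate_binomial`, the seed, and the 2% fraud rate are illustrative assumptions, not values from these notes.

```python
import random


def simulate_binomial(n_trials, p, rng):
    """Count successes in n_trials independent yes/no trials with success probability p."""
    return sum(1 for _ in range(n_trials) if rng.random() < p)


rng = random.Random(42)  # fixed seed (assumption) so the run is repeatable

# Fair coin: 1,000 flips, success = heads
heads = simulate_binomial(1000, 0.5, rng)

# Hypothetical fraud rate of 2% over 10,000 transactions
frauds = simulate_binomial(10_000, 0.02, rng)

print(f"heads:  {heads} / 1000")
print(f"frauds: {frauds} / 10000")
```

The same function covers both cases; only the success probability `p` changes, which is the sense in which "fraud or not" can be treated like a biased coin.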
Try to draw samples that are representative of the population (a hard task)
Small datasets are easier to plot and manipulate. Focus on good-quality data.
Missing values and outliers deserve attention when we are dealing with big data
Massive data is valuable when the data is sparse, like the matrix of terms behind a search query algorithm
Sample mean: x̄ (x-bar), Population mean: μ (mu)
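
A small sketch of the difference between the two means, assuming a made-up population of 100,000 normally distributed values (mean 50, standard deviation 10) and a simple random sample of 500 drawn without replacement. All numbers and variable names are illustrative.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed (assumption) for repeatability

# Hypothetical population; in practice the full population is rarely available
population = [rng.gauss(50, 10) for _ in range(100_000)]
mu = statistics.fmean(population)  # population mean

# Simple random sample without replacement
sample = rng.sample(population, 500)
x_bar = statistics.fmean(sample)  # sample mean, our estimate of mu

print(f"population mean (mu):  {mu:.2f}")
print(f"sample mean (x-bar):   {x_bar:.2f}")
```

The sample mean will usually land close to the population mean but not exactly on it; that gap is random sampling error, not bias.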
Stratified sampling can be hard because we need to know which groups (strata) are important, so that we can divide the population by them and sample from each one
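
One way stratified sampling could look in code, as a sketch only: the helper `stratified_sample`, the customer records, and the "plan" field are hypothetical. It draws the same fraction from every stratum, so small groups are not left to chance.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(records, key, fraction, rng):
    """Draw the same fraction of records from each stratum defined by key(record)."""
    strata = defaultdict(list)
    for rec in records:
        strata[key(rec)].append(rec)

    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))  # keep at least one per stratum
        sample.extend(rng.sample(members, k))
    return sample


rng = random.Random(1)  # fixed seed (assumption) for repeatability

# Hypothetical customers: 1,000 on a "premium" plan and 9,000 on a "basic" plan
customers = [
    {"id": i, "plan": "premium" if i < 1000 else "basic"} for i in range(10_000)
]

sample = stratified_sample(customers, key=lambda r: r["plan"], fraction=0.01, rng=rng)
counts = Counter(rec["plan"] for rec in sample)
print(dict(counts))
```

With a 1% fraction the sample keeps the population's 1:9 ratio between plans. Choosing the `key` is the hard part the note refers to: the code cannot tell us which grouping actually matters.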
Selection Bias: a systematic error made when sampling from the population, which makes the measurements unreliable for our purpose. Unlike random error, it pushes the results in a specific direction!
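
The "specific direction" can be seen in a small simulation. This is a sketch under invented assumptions: a population of session lengths (exponential, mean about 10 minutes) and a biased collection method in which longer sessions are proportionally more likely to be picked.

```python
import random
import statistics

rng = random.Random(7)  # fixed seed (assumption) for repeatability

# Hypothetical population: session lengths in minutes
population = [rng.expovariate(1 / 10) for _ in range(10_000)]
mu = statistics.fmean(population)

random_means = []
biased_means = []
for _ in range(100):
    # Fair method: every record has the same chance of being selected
    random_means.append(statistics.fmean(rng.sample(population, 200)))
    # Biased method: chance of selection is proportional to the value itself
    picked = rng.choices(population, weights=population, k=200)
    biased_means.append(statistics.fmean(picked))

print(f"population mean:          {mu:.2f}")
print(f"avg of random-sample means: {statistics.fmean(random_means):.2f}")
print(f"avg of biased-sample means: {statistics.fmean(biased_means):.2f}")
```

The random-sample means scatter on both sides of the population mean, while the biased-sample means all err the same way (too high). Taking more biased samples does not fix this; only changing the sampling method does.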