#statistics #math #sampling #modelling
- A data scientist must focus on the data at hand and on the sampling method
-
If a sample comes from a physical process, it is good to know which process it is, because that facilitates modelling
- We can make some assumptions here, because physical processes obey physical laws and rules
-
For example: flipping a coin is a physical process, and its outcome (heads or tails) follows a binomial distribution (a single flip is the Bernoulli case, n = 1)
- We can model other problems as coin flips, i.e. as binomial phenomena (fraud or not, buy or not)
- Try to create samples that represent the population of the data (a hard task)
- Small datasets are easier to plot and manipulate. Focus on the good data.
- Missing values and outliers matter when we are dealing with Big Data
- Massive data matter when they are sparse, like the term matrix in a search query algorithm
- Sample mean: $\bar{x}$, population mean: $\mu$
- Stratified sampling can be hard because we need to know which groups are important, so that we can divide the population by them and take them into account
- Selection bias: a systematic error made when sampling from the population, which makes the measurement process unreliable for our purpose. It points in a specific direction!
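
The notes above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the population is synthetic, the `channel` group and the fraud probabilities are invented, and the function names are hypothetical. It models "fraud or not" as a coin flip, then compares the population mean with the sample mean from a simple random sample, a proportionally stratified sample, and a deliberately biased sample.

```python
import random
from statistics import mean


def make_population(n=10_000, p_fraud=0.1, seed=0):
    """Synthetic population: each record is a coin flip (1 = fraud, 0 = not)
    plus a made-up 'channel' group used later for stratification."""
    rng = random.Random(seed)
    population = []
    for _ in range(n):
        channel = "online" if rng.random() < 0.3 else "store"
        # Assumption for illustration only: fraud is more likely online.
        p = p_fraud * 2 if channel == "online" else p_fraud / 2
        population.append({"channel": channel, "fraud": 1 if rng.random() < p else 0})
    return population


def simple_random_sample(population, k, seed=1):
    """Every record has the same chance of being selected."""
    return random.Random(seed).sample(population, k)


def stratified_sample(population, k, key, seed=2):
    """Proportional allocation: each group contributes in proportion to its size."""
    rng = random.Random(seed)
    groups = {}
    for row in population:
        groups.setdefault(row[key], []).append(row)
    sample = []
    for rows in groups.values():
        n_group = round(k * len(rows) / len(population))
        sample.extend(rng.sample(rows, n_group))
    return sample


def biased_sample(population, k, seed=3):
    """Selection bias: only 'online' records can ever be selected."""
    online = [row for row in population if row["channel"] == "online"]
    return random.Random(seed).sample(online, k)


def fraud_rate(rows):
    """Mean of the 0/1 outcomes, i.e. the observed proportion of fraud."""
    return mean(row["fraud"] for row in rows)


population = make_population()
mu = fraud_rate(population)  # population mean
x_bar_random = fraud_rate(simple_random_sample(population, 1000))
x_bar_strat = fraud_rate(stratified_sample(population, 1000, "channel"))
x_bar_biased = fraud_rate(biased_sample(population, 1000))

print(f"population mean:        {mu:.3f}")
print(f"simple random sample:   {x_bar_random:.3f}")
print(f"stratified sample:      {x_bar_strat:.3f}")
print(f"biased (online only):   {x_bar_biased:.3f}")
```

In this setup the random and stratified sample means should land close to the population mean, while the online-only sample overshoots it. The overshoot is not noise: it comes from how the records were selected, so drawing a larger biased sample would not remove it.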