Random Sampling and Sample Bias

#statistics #math #sampling #modelling

  • A data scientist must focus on the data in hand and on how it was sampled
  • If a sample comes from a physical process, it helps to know which process it is, because that makes modelling easier
    • We can make some assumptions here, because physical processes obey physical laws and rules
    • For example: flipping a coin is a physical process with two outcomes (heads or tails), so a single flip is a Bernoulli trial and the number of heads in n flips follows a binomial distribution
      • We can model other two-outcome problems the same way, as coin flips or binomial phenomena (fraud or not, buy or not)
  • Try to create samples that represent the population of the data (a hard task)
  • Small datasets are easier to plot and manipulate. Focus on the good data.
  • Missing values and outliers are important when we are dealing with Big Data
  • Massive data matters when the data is sparse, like a matrix of terms in a search query algorithm
  • Sample mean: x̄, population mean: μ
  • Stratified sampling can be hard because we need to know which groups matter, so that we can divide the population by them and take each one into account
  • Selection bias: a systematic error made when sampling from a population, which makes the measurements unreliable for our purpose. The error points in a specific direction!
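
A minimal Python sketch of the ideas in these notes, using only the standard library. Everything in it is an assumption made for illustration: the "buy or not" population, the two groups ("new" and "returning"), their sizes and their buy rates are all made up. It compares the population mean with the sample mean from a simple random sample, a proportionally stratified sample, and a biased sample that can only reach one group.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: 1 = "buy", 0 = "not buy" (one Bernoulli outcome per person).
# Group names, sizes and buy rates are made-up illustration values.
GROUPS = {"new": (8_000, 0.10), "returning": (2_000, 0.50)}  # group: (size, P(buy))

population = [
    (group, 1 if random.random() < p_buy else 0)
    for group, (size, p_buy) in GROUPS.items()
    for _ in range(size)
]


def mean_outcome(rows):
    """Share of 1s in a list of (group, outcome) rows."""
    return statistics.mean(outcome for _, outcome in rows)


def simple_random_sample(rows, n):
    """Every row has the same chance of being picked."""
    return random.sample(rows, n)


def stratified_sample(rows, n):
    """Sample each group in proportion to its share of the population."""
    sample = []
    for group in GROUPS:
        members = [row for row in rows if row[0] == group]
        k = round(n * len(members) / len(rows))
        sample.extend(random.sample(members, k))
    return sample


def biased_sample(rows, n):
    """Selection bias: only 'returning' customers can be reached."""
    reachable = [row for row in rows if row[0] == "returning"]
    return random.sample(reachable, n)


n = 500
mu = mean_outcome(population)                                # population mean
srs_mean = mean_outcome(simple_random_sample(population, n))
strat_sample = stratified_sample(population, n)
strat_mean = mean_outcome(strat_sample)
biased_mean = mean_outcome(biased_sample(population, n))

print(f"population mean:        {mu:.3f}")
print(f"simple random sample:   {srs_mean:.3f}")
print(f"stratified sample:      {strat_mean:.3f}")
print(f"biased sample:          {biased_mean:.3f}")
```

With these made-up rates, the random and stratified sample means should land near the population mean, while the biased sample lands well above it. The biased estimate is not just noisy: it is pushed in one direction, because the group that can be reached buys more often than the population as a whole.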