#statistics #data #exploring #apply #plot
Boxplots
- Boxplots can be used here; they are a visual display of percentiles (Dispersion)
- Percentiles are good for summarizing data and spotting extreme values, e.g. with the 95th or 99th percentile
- In a boxplot, each whisker extends to the furthest data point that is no more than 1.5 times the IQR beyond the box (Dispersion). Points beyond the whiskers are drawn individually as outliers
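
A minimal sketch of the numbers behind a boxplot, assuming NumPy and a made-up sample (the values are not from the book). It computes the quartiles, the IQR, the 1.5 times IQR whisker limits, and the points that would be drawn as outliers:

```python
import numpy as np

# Hypothetical sample: mostly moderate values plus two extreme ones.
data = np.array([2, 3, 3, 4, 5, 5, 6, 7, 8, 9, 30, 45], dtype=float)

# The box spans the 25th to 75th percentile; the line inside is the median.
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Whiskers stop at the furthest data point within 1.5 * IQR of the box.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
inside = data[(data >= lower_limit) & (data <= upper_limit)]
lower_whisker, upper_whisker = inside.min(), inside.max()

# Everything beyond the limits is plotted as an individual outlier.
outliers = data[(data < lower_limit) | (data > upper_limit)]

# High percentiles summarize the extreme end of the data.
p95, p99 = np.percentile(data, [95, 99])

print(f"IQR={iqr}, whiskers=({lower_whisker}, {upper_whisker}), outliers={outliers}")
```

Note that NumPy's default percentile interpolation may give slightly different quartiles than other tools, so whisker positions can differ a little between libraries.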
Frequency Table (Histogram)
- value_counts()
- The range of the variable is divided into bins
- Histogram plots can be used
- Bins that are too large can hide features that are important for the models. Bins that are too small make it hard to see the big picture
- Bins have equal width
- The bars are contiguous (no gaps between them, unless a bin is empty)
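
A small sketch of a frequency table, assuming pandas and NumPy and a made-up series of values. `pd.cut` splits the range into equal-width bins and `value_counts()` counts the rows in each bin; the choice of 3 bins is only for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical numeric values, for illustration only.
values = pd.Series([1, 2, 2, 3, 5, 6, 6, 8, 9, 10])

# Cut the range into 3 equal-width bins, then count rows per bin.
binned = pd.cut(values, bins=3)
freq_table = binned.value_counts().sort_index()  # keep bins in numeric order
print(freq_table)

# np.histogram returns the counts together with the bin edges.
# Its bins are closed on the left, pd.cut's on the right, so values that
# fall exactly on an edge can be counted in different bins.
counts, edges = np.histogram(values, bins=3)
print(counts, edges)
```

Rerunning with a larger or smaller `bins` value is a quick way to see how bin width changes what the table reveals.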
Skewness and Kurtosis
- They are the third and fourth moments of the data. Central Tendency (mean) is the first moment, and Dispersion (variance) is the second moment
- They are usually judged visually from a plot rather than from a single number
- Kurtosis: the propensity of the data to have extreme values
- Skewness: whether the data leans toward larger or smaller values, seen as a long tail on one side of the plot
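
Although these are mostly read off a plot, they can also be computed. A minimal sketch, assuming NumPy, the population (ddof=0) definition of the standardized moments, and two made-up samples; the helper name `standardized_moment` is hypothetical:

```python
import numpy as np

def standardized_moment(x, k):
    """k-th moment of the standardized data (k=3 skewness, k=4 kurtosis)."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()  # population standard deviation (ddof=0)
    return float(np.mean(z ** k))

symmetric = [1, 2, 3, 4, 5]          # no tail on either side
right_tailed = [1, 1, 2, 2, 3, 20]   # one extreme value on the right

skew_sym = standardized_moment(symmetric, 3)
skew_right = standardized_moment(right_tailed, 3)
kurt_sym = standardized_moment(symmetric, 4)
kurt_right = standardized_moment(right_tailed, 4)

print(f"skewness: symmetric={skew_sym:.2f}, right-tailed={skew_right:.2f}")
print(f"kurtosis: symmetric={kurt_sym:.2f}, right-tailed={kurt_right:.2f}")
```

A positive skewness indicates a longer right tail, and a larger kurtosis indicates more weight in extreme values. Libraries often apply bias corrections or report excess kurtosis (kurtosis minus 3), so their numbers may differ from this sketch.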
Density plots and Estimates
- A density plot is a smoothed version of a histogram, drawn as a continuous line and usually computed with a kernel density estimate
- The y-axis scale differs from a histogram: it shows a proportion (density) rather than a count
- The total area under the curve sums to 1
- The area under the curve between two points gives the proportion of the distribution lying between them
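
A rough sketch of a kernel density estimate, assuming NumPy, a Gaussian kernel, a hand-picked bandwidth of 0.8, and a made-up sample; the helper name `kde_gaussian` is hypothetical. It checks that the area under the curve is about 1 and reads a proportion off the area between two points:

```python
import numpy as np

def kde_gaussian(sample, grid, bandwidth):
    """Average of Gaussian bumps centred on each observation."""
    sample = np.asarray(sample, dtype=float)
    u = (grid[:, None] - sample[None, :]) / bandwidth
    kernels = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

# Hypothetical sample, for illustration only.
sample = np.array([1.0, 1.5, 2.0, 2.2, 3.0, 3.1, 4.0, 6.0])

# Evaluate the density on a fine grid that is wide enough to cover both tails.
grid = np.linspace(-5.0, 12.0, 3401)
dx = grid[1] - grid[0]
density = kde_gaussian(sample, grid, bandwidth=0.8)

# Approximate areas with a simple Riemann sum (density * width).
total_area = density.sum() * dx
mask = (grid >= 2.0) & (grid <= 4.0)
proportion_2_to_4 = density[mask].sum() * dx

print(f"total area = {total_area:.4f}")
print(f"estimated proportion between 2 and 4 = {proportion_2_to_4:.4f}")
```

The bandwidth controls how smooth the curve is, much like bin width in a histogram, so the estimated proportion depends on that choice.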
Exploring Binary and Categorical
- Binary: proportion of 0s and 1s
- Categorical: proportion of each category
- Bar charts can be used. They differ from histograms in that the x-axis represents categories rather than a numeric scale; the y-axis is the count
- Central Tendency: the mode gives the most common value
- Categorical variables can be mapped to discrete numeric values
- Expected Value: the sum of each discrete value multiplied by its probability
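
A short sketch of the points above, assuming pandas and a made-up table; the column names (`churned`, `plan`) and the prices are hypothetical. It shows proportions for a binary and a categorical column, the mode, and an expected value after mapping each category to a number:

```python
import pandas as pd

# Hypothetical binary and categorical columns.
df = pd.DataFrame({
    "churned": [0, 1, 0, 0, 1, 0, 0, 0],
    "plan": ["free", "basic", "free", "premium", "free", "basic", "free", "free"],
})

# Proportion of each value (normalize=True divides counts by the total).
binary_props = df["churned"].value_counts(normalize=True)
plan_props = df["plan"].value_counts(normalize=True)

# Mode: the most common category.
most_common_plan = df["plan"].mode()[0]

# Map each category to a numeric value (made-up monthly prices), then
# weight each value by the observed proportion of its category.
price = {"free": 0, "basic": 50, "premium": 300}
expected_value = sum(price[plan] * p for plan, p in plan_props.items())

print(plan_props)
print(f"mode = {most_common_plan}, expected value = {expected_value}")
```

The same `plan_props` series is what a bar chart of the category would display, one bar per category.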
Exploring two or more Variables
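- Scatter plots are good for a relatively small number of data points
- Hexagonal binning for large datasets (up to millions of rows), where a scatter plot would be too dense to read
- Two categorical variables: Contingency Table
- Counts for each combination of categories, optionally shown as proportions
- Categorical and numerical
- Multiple Boxplots, one per category
- Multiple Violin plots
- Violin plots show the shape of the distribution (density) as well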
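
A minimal sketch of the tabular side of these comparisons, assuming pandas and a made-up table; the column names (`grade`, `status`, `amount`) are hypothetical. `pd.crosstab` builds the contingency table, and grouped quartiles give the numbers that side-by-side boxplots would display:

```python
import pandas as pd

# Hypothetical data: two categorical columns and one numeric column.
df = pd.DataFrame({
    "grade": ["A", "A", "B", "B", "B", "A"],
    "status": ["paid", "late", "paid", "late", "late", "paid"],
    "amount": [100.0, 120.0, 80.0, 60.0, 70.0, 110.0],
})

# Two categorical variables: counts for each combination of categories.
counts = pd.crosstab(df["grade"], df["status"])

# The same table as proportions within each row.
row_props = pd.crosstab(df["grade"], df["status"], normalize="index")

# Categorical and numerical: quartiles of the numeric column per category.
by_grade = df.groupby("grade")["amount"].quantile([0.25, 0.5, 0.75]).unstack()

print(counts)
print(row_props)
print(by_grade)
```

Plotting libraries can draw the boxplots or violin plots directly from the same grouped data; the table here is only meant to show what those plots summarize.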
References
- Bruce, 2017, p19-30
- Bruce, 2017, p36-45