Statistical Summaries

 

📊 Understanding Statistical Summaries with Examples

Statistical summaries are essential tools in data analysis, helping us understand the main characteristics of data at a glance. Whether you're preparing data for machine learning or doing exploratory data analysis (EDA), these summaries are your first step.

In this post, we’ll explore the key types of statistical summaries (measures of central tendency, dispersion, and shape) with clear examples and Python code.


🔍 What Are Statistical Summaries?

Statistical summaries condense a dataset into a handful of numbers. They help answer:

  • What’s the average value?

  • How spread out is the data?

  • Are there outliers or skewness?

We categorize summaries into:

  1. Measures of Central Tendency

  2. Measures of Dispersion

  3. Measures of Shape

Let’s go step by step.


1️⃣ Measures of Central Tendency

These describe the center or average of the data.

🧮 Mean (Average)

📌 Explanation:

$$\text{Mean} = \frac{\sum x_i}{n}$$

🧪 Example and Code:

Scores = [60, 70, 80, 90, 100]
Mean = (60 + 70 + 80 + 90 + 100) / 5 = 80

import numpy as np

data = [60, 70, 80, 90, 100]
mean_value = np.mean(data)
print("Mean:", mean_value)

Output:
Mean: 80.0


🔸 Median

📌 Explanation:

  • Middle value of sorted data.

  • For even n: average of two middle values.

🧪 Example and Code:

Examples:

  • Odd count: [10, 20, 30] → Median = 20

  • Even count: [10, 20, 30, 40] → Median = (20 + 30) / 2 = 25

import numpy as np

data = [60, 70, 80, 90, 100]
median_value = np.median(data)
print("Median:", median_value)

Output:
Median: 80.0
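
To see why the median is often preferred for skewed data, here is a minimal sketch that adds one extreme value to the scores above. The value 1000 is made up purely for illustration and is not part of the original example.

```python
import numpy as np

# Original scores, plus one hypothetical extreme value added for illustration
scores = [60, 70, 80, 90, 100]
scores_with_outlier = scores + [1000]

mean_before = np.mean(scores)
median_before = np.median(scores)
mean_after = np.mean(scores_with_outlier)
median_after = np.median(scores_with_outlier)

# The mean is pulled strongly toward the extreme value; the median barely moves
print("Mean:  ", mean_before, "->", mean_after)
print("Median:", median_before, "->", median_after)
```

The mean moves from 80 to well above 200, while the median only shifts from 80 to 85.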


🔸 Mode

📌 Explanation:

  • The most frequent value.

  • A dataset can have several modes, or no mode at all when every value occurs equally often.

🧪 Example and Code:

Data = [5, 6, 7, 7, 8] → Mode = 7

from scipy import stats

data = [60, 70, 80, 90, 100, 60]
mode_value = stats.mode(data, keepdims=False)
print("Mode:", mode_value.mode)

Output:
Mode: 60 
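
stats.mode reports only a single value, even when several values tie. As a rough sketch of the multiple-mode and no-mode cases, the standard library's statistics.multimode returns every tied value. The two extra lists below are made-up examples.

```python
from statistics import multimode

one_mode = [5, 6, 7, 7, 8]
two_modes = [1, 1, 2, 2, 3]            # hypothetical data: 1 and 2 both appear twice
all_unique = [60, 70, 80, 90, 100]     # every value appears exactly once

print(multimode(one_mode))     # a single most frequent value
print(multimode(two_modes))    # both tied values are returned
print(multimode(all_unique))   # all values tie, so there is no meaningful mode
```

When the returned list is as long as the number of distinct values, it is reasonable to treat the data as having no mode.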


2️⃣ Measures of Dispersion

These describe how spread out the values are.

🔹 Range

📌 Explanation:

$$\text{Range} = \text{Max} - \text{Min}$$

🧪 Example and Code:

Data = [20, 35, 50] → Range = 50 - 20 = 30

data = [20, 35, 50]
range_value = max(data) - min(data)
print("Range:", range_value)

Output:
Range: 30


🔹 Variance

📌 Explanation:

$$\text{Variance} = \frac{1}{n} \sum (x_i - \mu)^2$$

🧪 Code:

import numpy as np
data = [60, 70, 80, 90, 100]
variance = np.var(data)
print("Variance:", variance)

Output:
Variance: 200.0
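
To connect the formula to the code, the sketch below works through the population variance by hand and compares it with np.var. The variable names are illustrative only.

```python
import numpy as np

data = [60, 70, 80, 90, 100]

n = len(data)
mu = sum(data) / n                                   # mean of the data
squared_deviations = [(x - mu) ** 2 for x in data]   # (x_i - mu)^2 for each value
manual_variance = sum(squared_deviations) / n        # divide by n (population variance)

print("Squared deviations:", squared_deviations)
print("Manual variance:", manual_variance)
print("np.var:", np.var(data))
```

Both approaches divide by n, which is why np.var with its default settings matches the formula.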


🔹 Standard Deviation

📌 Explanation:

  • Square root of variance.

  • Indicates the typical distance of values from the mean, in the same units as the data.

🧪 Code:

import numpy as np
data = [60, 70, 80, 90, 100]
std_dev = np.std(data)
# Round to two decimals for display; the unrounded value has many more digits
print("Standard Deviation:", round(std_dev, 2))

Output:
Standard Deviation: 14.14
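
np.std divides by n by default, which gives the population standard deviation. If the data is a sample from a larger population, a common choice is to divide by n - 1 instead. A minimal sketch of the difference, using NumPy's ddof parameter:

```python
import numpy as np

data = [60, 70, 80, 90, 100]

population_std = np.std(data)        # ddof=0 (default): divide by n
sample_std = np.std(data, ddof=1)    # ddof=1: divide by n - 1

print("Population std:", round(population_std, 2))
print("Sample std:", round(sample_std, 2))
```

The sample version is always slightly larger, and the gap shrinks as the number of observations grows.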


🔹 Interquartile Range (IQR)

📌 Explanation:

$$\text{IQR} = Q_3 - Q_1$$

🧪 Code:

import numpy as np
data = [60, 70, 80, 90, 100]
q1 = np.percentile(data, 25)
q3 = np.percentile(data, 75)
iqr = q3 - q1
print("IQR:", iqr)

Output:
IQR: 20.0
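
The IQR is also commonly used to flag outliers. The sketch below applies the widely used 1.5 × IQR rule, which is a convention assumed here rather than something defined in this post. The value 200 is a made-up outlier added for illustration.

```python
import numpy as np

# Scores from above plus one hypothetical extreme value
data = [60, 70, 80, 90, 100, 200]

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Assumed convention: values beyond 1.5 * IQR from the quartiles count as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("Fences:", lower_fence, upper_fence)
print("Outliers:", outliers)
```

Because the quartiles barely react to a single extreme value, the fences stay close to the bulk of the data and the extreme value stands out.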


3️⃣ Measures of Shape

These help us understand the distribution of the data.

🔸 Skewness

Skewness measures the asymmetry of a distribution — whether the data leans to the left or right of the mean.

📌 Explanation:

  • Positive skew: longer tail on the right.

  • Negative skew: longer tail on the left.

📌 Example:

  • If most students scored high and a few scored very low → left-skewed

  • If most students scored low and a few scored very high → right-skewed

🧪 Code:

from scipy.stats import skew

data = [60, 70, 80, 90, 100]
skewness = skew(data)
print("Skewness:", skewness)
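
The scores above are perfectly symmetric, so their skewness is zero. To illustrate the sign convention, here is a small sketch with two made-up score lists, one with a long right tail and its mirror image with a long left tail.

```python
from scipy.stats import skew

symmetric = [60, 70, 80, 90, 100]
right_skewed = [20, 25, 30, 30, 35, 40, 95]   # hypothetical: one very high score
left_skewed = [5, 60, 65, 70, 70, 75, 80]     # hypothetical: one very low score

skew_symmetric = skew(symmetric)
skew_right = skew(right_skewed)
skew_left = skew(left_skewed)

print("Symmetric:", skew_symmetric)
print("Right-skewed:", skew_right)   # positive
print("Left-skewed:", skew_left)     # negative
```

Since the left-skewed list mirrors the right-skewed one, the two skewness values have the same magnitude and opposite signs.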

🔸 Kurtosis

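Kurtosis measures the "tailedness" of a distribution: how much of the data sits in the tails rather than near the center. It is often described as peakedness, but it is driven mainly by the tails.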

It answers:

  • Are the data values clustered tightly around the mean?

  • Are there heavy tails (more extreme values/outliers)?

📌 Explanation:

  • High kurtosis: heavy tails (outliers likely).

  • Low kurtosis: light tails (uniform-like).

🧪 Code:


from scipy.stats import kurtosis

data = [60, 70, 80, 90, 100]
kurt = kurtosis(data)
print("Kurtosis:", kurt)
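
One detail worth knowing when reading the result: by default, scipy.stats.kurtosis reports excess kurtosis (Fisher's definition), where a normal distribution scores 0. Passing fisher=False gives Pearson's definition, where a normal distribution scores 3. A short sketch of the two:

```python
from scipy.stats import kurtosis

data = [60, 70, 80, 90, 100]

excess_kurtosis = kurtosis(data)                  # Fisher's definition (default), normal = 0
pearson_kurtosis = kurtosis(data, fisher=False)   # Pearson's definition, normal = 3

print("Excess kurtosis:", excess_kurtosis)
print("Pearson kurtosis:", pearson_kurtosis)
```

The two values always differ by exactly 3, so a negative excess kurtosis like the one here points to lighter tails than a normal distribution.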

📋 Summary Table Example

Let’s say:


data = [10, 12, 15, 18, 20, 20, 21, 24, 30]

Statistic             Code                                                 Output
Mean                  np.mean(data)                                        ~18.89
Median                np.median(data)                                      20.0
Mode                  stats.mode(data, keepdims=False).mode                20
Range                 max(data) - min(data)                                20
Standard Deviation    np.std(data)                                         ~5.76
IQR                   np.percentile(data, 75) - np.percentile(data, 25)    6.0
Skewness              skew(data)                                           ~0.25
Kurtosis              kurtosis(data)                                       ~-0.51
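
All the statistics in the table can be computed in one pass. The helper below is a minimal sketch; the name summarize and the dictionary keys are illustrative choices, not part of any library.

```python
import numpy as np
from scipy import stats


def summarize(values):
    """Return the summary statistics listed in the table as a dict."""
    arr = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(arr, [25, 75])
    return {
        "mean": arr.mean(),
        "median": np.median(arr),
        "mode": stats.mode(arr, keepdims=False).mode,
        "range": arr.max() - arr.min(),
        "std": arr.std(),                   # population standard deviation (ddof=0)
        "iqr": q3 - q1,
        "skewness": stats.skew(arr),
        "kurtosis": stats.kurtosis(arr),    # excess kurtosis (normal = 0)
    }


data = [10, 12, 15, 18, 20, 20, 21, 24, 30]
summary = summarize(data)
for name, value in summary.items():
    print(f"{name:>9}: {value:.2f}")
```

Keeping the results in a dict makes it easy to compare several columns or datasets side by side.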

📊 Visualizing the Data

Boxplot


import matplotlib.pyplot as plt
import seaborn as sns

data = [10, 12, 15, 18, 20, 20, 21, 24, 30]
sns.boxplot(data=data)
plt.title("Boxplot")
plt.show()
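
A boxplot is essentially a drawing of the five-number summary. As a rough sketch, the numbers behind the box and its median line can be computed directly with np.percentile. The plotting library's own quartile calculation should correspond to these values, though that is an assumption here.

```python
import numpy as np

data = [10, 12, 15, 18, 20, 20, 21, 24, 30]

# Five-number summary: minimum, Q1, median, Q3, maximum
minimum, q1, median, q3, maximum = np.percentile(data, [0, 25, 50, 75, 100])

print("Min:", minimum)
print("Q1:", q1)
print("Median:", median)
print("Q3:", q3)
print("Max:", maximum)
```

The box spans Q1 to Q3, the line inside it marks the median, and the whiskers extend toward the minimum and maximum.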



Histogram

import matplotlib.pyplot as plt
data = [10, 12, 15, 18, 20, 20, 21, 24, 30]
plt.hist(data, bins=5, edgecolor='black')
plt.title("Histogram")
plt.show()
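
To inspect the numbers behind the bars without drawing anything, np.histogram returns the bin edges and counts for the same data and bin count. A minimal sketch:

```python
import numpy as np

data = [10, 12, 15, 18, 20, 20, 21, 24, 30]

# Five equal-width bins between the minimum (10) and maximum (30)
counts, bin_edges = np.histogram(data, bins=5)

for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"{left:.0f} to {right:.0f}: {count}")
```

Each bin includes its left edge and excludes its right edge, except the last bin, which includes both.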




🧠 When to Use What?

Use Case                      Statistic to Prefer
Normally distributed data     Mean, Std Dev
Skewed data                   Median, IQR
Detecting outliers            Boxplot, IQR
Understanding distribution    Skewness, Kurtosis
Quick overview                Summary table

🎓 Final Thoughts

Statistical summaries are your data's first story. Before modeling or machine learning, use these tools to:

  • Understand the shape and scale of your data

  • Identify problems like outliers or skewness

  • Choose the right preprocessing and modeling techniques


✅ Try This:

Take any dataset (e.g., Titanic, Iris, or your own project data) and compute:

  • Mean, Median, Mode

  • Standard Deviation, IQR

  • Skewness and Kurtosis

  • Plot a histogram and boxplot

Let the data tell its story! 
