Data Science Process in Detail

 

Step 1: Defining Research Goals and Creating a Project Charter

The very first step of any data science project is understanding the problem clearly and aligning it with the organization’s needs.

This step answers three key questions:

  • What → What exactly does the company want you to do?

  • Why → Why is this project valuable? Does it align with a bigger strategy or is it a one-off project?

  • How → How will the project be carried out? What resources, data, and methods will you use?


1. Spend Time Understanding the Goals and Context

  • The research goal should be clear, precise, and agreed upon by all stakeholders.

  • Ask questions until you fully understand the expectations.

  • Avoid misunderstandings — one of the biggest mistakes in data science is solving the wrong problem.

  • Example:

    • Business asks: “Why are customers leaving our service?”

    • If misunderstood, you might analyze sales trends instead of customer churn.

👉 Tip for students: This step is less about coding and more about people skills and business understanding.


2. Create a Project Charter

A project charter is a formal document (short or detailed depending on company size) that outlines the project. It ensures everyone agrees on the scope, approach, and deliverables.

Key Elements of a Project Charter

  • Clear research goal → What exactly you will investigate.

  • Mission and context → How the project fits in the business strategy.

  • Approach → Methods, tools, and analysis steps.

  • Resources required → Data, software, and human resources.

  • Feasibility proof → Show the project is achievable (maybe with a proof of concept).

  • Deliverables and success criteria → Reports, models, dashboards, etc.

  • Timeline → Schedule of tasks and deadlines.

Why important?

  • Helps management estimate cost and resources.

  • Prevents misunderstandings later in the project.

  • Acts as a contract between data scientists and the business.


Summary of Step 1

👉 Before writing a single line of code, a data scientist must:

  1. Define the research goal clearly (What, Why, How).

  2. Prepare a project charter with deliverables, resources, timeline, and success measures.

This ensures the project is aligned, feasible, and valuable for the organization.


Step 2: Retrieving Data

Once the research goal is set (Step 1), the next step is acquiring the right data. Without good data, even the most advanced models and algorithms will fail. This stage is about finding, accessing, and collecting relevant data that will serve as the raw material for your analysis.


2.1 Sources of Data

Data can come from many places. Broadly, there are two categories:

(a) Internal Data (Within the Organization)

  • Databases → Structured storage managed by IT teams (e.g., MySQL, Oracle).

  • Data warehouses → Designed for analysis & reporting with pre-processed data.

  • Data marts → Subsets of data warehouses for specific departments (e.g., sales, HR).

  • Data lakes → Raw, unprocessed data stored in bulk for future use.

  • Local files → Excel sheets, CSVs, PDFs stored on employees’ systems.

👉 Challenges: Finding scattered data, understanding metadata, and navigating access restrictions (“Chinese walls”).


(b) External Data (Outside the Organization)

  • Purchased data → Market research firms like Nielsen or GFK.

  • Third-party APIs → Social media (Twitter, LinkedIn, Facebook), Google Maps, weather APIs, etc.

  • Open data sources → Many governments and institutions release high-quality free data.

📌 Examples of open-data providers:

  • Data.gov → open data published by the US government.

  • data.gov.in → the Open Government Data Platform of India.

  • World Bank Open Data → global development and economic indicators.

  • EU Open Data Portal → data from European Union institutions.

  • UCI Machine Learning Repository and Kaggle Datasets → curated datasets widely used for practice and research.
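
To make data retrieval concrete, here is a minimal sketch of pulling data from the three most common kinds of sources with pandas and requests; the file names, database, and API URL are placeholders, not real endpoints.

# Sketch: retrieving data from common sources (file, database, and URL are placeholders)

import pandas as pd
import sqlite3
import requests

# 1. Local file (internal source)
sales_df = pd.read_csv("sales_2024.csv")            # hypothetical CSV on a shared drive

# 2. Relational database (internal source)
conn = sqlite3.connect("company.db")                # hypothetical SQLite database
customers_df = pd.read_sql_query(
    "SELECT customer_id, country, signup_date FROM customers", conn
)
conn.close()

# 3. Web API (external source)
response = requests.get("https://api.example.com/weather", params={"city": "Kochi"})  # placeholder URL
weather_df = pd.DataFrame(response.json())          # assumes the API returns a JSON list of records

# Quick sanity check: did we retrieve the rows and columns we expected?
print(sales_df.shape, customers_df.shape, weather_df.shape)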


2.2 Key Considerations When Retrieving Data

  1. Data Ownership and Access

    • Check who owns the data.

    • Ensure proper permissions and legal compliance (GDPR, HIPAA).

  2. Data Quality Checks (Early Stage)

    • Make sure the data matches the original source.

    • Verify data types (text, numbers, dates).

    • Ensure you retrieved all necessary fields.

    • Spot missing values, typos, or unusual entries.

⚠️ If ignored, poor-quality data here will cause huge problems later (during modeling).

  3. Data Completeness

    • Do you have enough data to answer the research goal?

    • If not, consider merging multiple sources.
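
A few lines of pandas usually cover these early quality and completeness checks. A minimal sketch on invented values:

# Sketch: early quality and completeness checks on a freshly retrieved dataset (invented values)

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "customer_id": [1, 2, 3, 3, 4],
    "signup_date": ["2024-01-05", "2024-02-10", "2024-02-10", "2024-02-10", "not available"],
    "age":         [34, np.nan, 28, 28, 230],
})

print(df.shape)                      # did we get all expected rows and columns?
print(df.dtypes)                     # signup_date arrived as text, not as a date
print(df.isnull().sum())             # missing values per column
print(df.duplicated().sum())         # exact duplicate rows
print(df.describe(include="all"))    # spot impossible values such as age 230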


2.3 The Role of the Data Scientist

  • Sometimes, design a data collection system (e.g., sensors, web scraping, surveys).

  • Often, collaborate with IT/data engineers who manage databases and warehouses.

  • Be prepared to play “data detective” to locate scattered or undocumented datasets.


2.4 Common Issues in Data Retrieval

  • Data silos across departments.

  • Outdated or incomplete records.

  • Access restrictions (privacy laws, company policies).

  • Different formats (structured vs. unstructured data).


Summary of Step 2:

Data retrieval is about finding, accessing, and gathering all relevant data from both internal and external sources. This step requires balancing technical skills (to query and import data) with business awareness (to know what data is actually useful). Importantly, early quality checks prevent bigger problems in later stages.

Step 3: Data Preparation

(Cleansing, Integration, Transformation)

Raw data, just like a rough diamond, is rarely ready for use. Before building models, the data must be cleaned, standardized, and transformed into a usable format. This step often takes 60–80% of the total project time.


3.1 Why Data Preparation Is Important

  • Data collection is error-prone (missing values, duplicates, wrong formats).

  • Models rely on quality inputs → “Garbage in, garbage out.”

  • Proper preparation improves accuracy, reliability, and efficiency of analysis.


3.2 Sub-Phases of Data Preparation

(a) Data Cleansing

  • Remove or fix errors in the data.

  • Tasks include:

    • Handling missing values (imputation, removal).

    • Removing duplicates.

    • Correcting typos or inconsistent entries (e.g., “USQ” → “USA”).

    • Identifying and handling outliers.

  • Example: Customer age recorded as “200” → clearly an error.
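
A minimal cleansing sketch with pandas, on a small invented customer table:

# Sketch: basic data cleansing with pandas (values are invented)

import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "country":     ["USA", "USQ", "USQ", "India", "India"],
    "age":         [34, 200, 200, None, 28],
})

df = df.drop_duplicates()                              # remove duplicate rows
df["country"] = df["country"].replace({"USQ": "USA"})  # correct a known typo
df.loc[df["age"] > 120, "age"] = None                  # treat impossible ages as missing
df["age"] = df["age"].fillna(df["age"].median())       # impute missing ages with the median

print(df)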


(b) Data Integration

  • Combine data from multiple sources into a unified dataset.

  • Resolve conflicts (e.g., different formats, naming conventions).

  • Tasks include:

    • Joining datasets from databases, files, or APIs.

    • Ensuring consistent identifiers (e.g., customer IDs across systems).

  • Example: Merge sales data (Excel) with customer demographics (SQL database).
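
A minimal integration sketch with pandas; the in-memory DataFrames below stand in for the Excel and SQL sources mentioned above.

# Sketch: integrating two sources on a common customer ID
# (in a real project these would come from pd.read_excel and pd.read_sql)

import pandas as pd

sales = pd.DataFrame({
    "CustID": [101, 102, 103],
    "Amount": [250, 400, 150],
})

demographics = pd.DataFrame({
    "customer_id": [101, 102, 104],
    "Age":         [25, 34, 41],
    "City":        ["Kochi", "Chennai", "Pune"],
})

# Align the differing key names, then join
merged = sales.merge(
    demographics.rename(columns={"customer_id": "CustID"}),
    on="CustID",
    how="left",        # keep every sale even if demographics are missing
)

print(merged)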


(c) Data Transformation

  • Convert data into a suitable format for modeling.

  • Tasks include:

    • Normalization or standardization of numerical values.

    • Encoding categorical variables (one-hot encoding, label encoding).

    • Aggregation (e.g., daily → weekly totals).

    • Feature scaling for machine learning algorithms.

  • Example: Converting “Male/Female” into binary variables (0/1).
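
A minimal transformation sketch with pandas and scikit-learn, using invented values:

# Sketch: common transformations before modeling (values are invented)

import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "Income": [30000, 52000, 75000, 41000],
    "Gender": ["Male", "Female", "Female", "Male"],
    "Date":   pd.to_datetime(["2024-01-03", "2024-01-05", "2024-01-10", "2024-01-12"]),
    "Sales":  [200, 340, 150, 410],
})

# Standardize a numeric column (mean 0, standard deviation 1)
df["Income_scaled"] = StandardScaler().fit_transform(df[["Income"]]).ravel()

# One-hot encode a categorical column
df = pd.get_dummies(df, columns=["Gender"])

# Aggregate daily sales into weekly totals
weekly_sales = df.set_index("Date")["Sales"].resample("W").sum()

print(df)
print(weekly_sales)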


3.3 Common Challenges

  • Inconsistent units (e.g., “kg” vs. “lbs”).

  • Different date formats (e.g., “MM/DD/YYYY” vs. “DD-MM-YYYY”).

  • Large datasets requiring efficient cleaning methods.

  • Hidden errors discovered only during later analysis.


3.4 Role of the Data Scientist

  • Apply statistical and programming techniques (Python: Pandas, NumPy, etc.).

  • Document cleaning and transformation decisions for reproducibility.

  • Collaborate with domain experts to resolve ambiguities in the data.


Summary of Step 3
Data preparation involves cleansing, integrating, and transforming raw data into a usable and consistent form. It’s often the most time-consuming phase but also the most crucial—well-prepared data makes later modeling much more accurate and efficient.

Step 4: Exploratory Data Analysis (EDA)

Once the data has been collected (Step 2) and prepared (Step 3), the next crucial step is exploring the data. This is where we start uncovering the story hidden in the numbers.

Purpose of EDA

  • To understand the structure and relationships in the dataset.

  • To detect patterns, trends, or anomalies.

  • To generate hypotheses for modeling later.

  • To decide which transformations or features might be useful.

EDA is not yet about building predictive models. Instead, it is about getting to know the data as if you were “interviewing” it.


Key Features of EDA

  1. Visualization is central

    • Humans understand pictures much faster than tables of numbers.

    • Graphs make it easy to identify trends, clusters, and outliers.

    Common techniques:

    • Line plots → trends over time.

    • Histograms → distribution of values.

    • Boxplots → spread, median, and outliers.

    • Pareto charts (80/20 rule) → show cumulative contribution (a sketch of one appears after the EDA code below).

    • Scatterplots → relationships between two variables.

    • Composite and interactive graphs → combine multiple views for deeper insights.

    For example:

    • A Pareto chart might show that 20% of customers generate 80% of sales.

    • A boxplot can show which group of users tends to give higher ratings.


  2. Interactive Exploration

    • With tools like brushing and linking, selecting points in one plot highlights related data points in others.

    • This makes hidden correlations more obvious.

    • Example: selecting countries with high average scores in one graph might highlight their performance across other questions.


  3. Not only Visuals
    While graphs are most common, EDA also includes:

    • Tabulations (frequency tables).

    • Summary statistics (mean, variance, quartiles).

    • Clustering (grouping similar data points).

    • Even simple models can be built just to understand the data better.


Common Outcomes of EDA

  • Detecting data quality issues (outliers, missing values, errors).

  • Understanding variable distributions and skewness.

  • Identifying correlations between variables.

  • Generating ideas for feature engineering before modeling.


Why EDA is Important

  • Prevents “blind modeling.” Without EDA, models might be biased or misleading.

  • Helps communicate findings visually to both technical and non-technical audiences.

  • Acts as a bridge between raw data and model building.

Python Code for EDA

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Example dataset
data = {
    "Age": [18, 19, 20, 21, 19, 22, 23, 24, 20, 21, 22, 20],
    "Marks": [55, 60, 65, 70, 58, 75, 80, 85, 68, 72, 77, 66]
}

df = pd.DataFrame(data)

# 1. Summary statistics
print(df.describe())

# 2. Histogram - distribution of ages
plt.hist(df["Age"], bins=5, edgecolor="black")
plt.title("Age Distribution of Students")
plt.xlabel("Age")
plt.ylabel("Count")
plt.show()

# 3. Boxplot - detect outliers in marks
sns.boxplot(x=df["Marks"])
plt.title("Boxplot of Marks")
plt.show()

# 4. Scatterplot - relationship between Age and Marks
plt.scatter(df["Age"], df["Marks"])
plt.title("Age vs Marks")
plt.xlabel("Age")
plt.ylabel("Marks")
plt.show()
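
The snippet above covers summary statistics, histograms, boxplots, and scatterplots. A Pareto chart, mentioned in the list of techniques, needs a little extra work because it combines sorted bars with a cumulative-percentage line. A minimal sketch (continuing the imports from the snippet above, with made-up revenue figures):

# 5. Pareto chart - cumulative contribution per customer (made-up revenue figures)
revenue = pd.Series(
    [520, 90, 430, 60, 45, 300, 30, 25, 20, 15],
    index=["C1", "C2", "C3", "C4", "C5", "C6", "C7", "C8", "C9", "C10"],
).sort_values(ascending=False)

cumulative_pct = revenue.cumsum() / revenue.sum() * 100

fig, ax1 = plt.subplots()
ax1.bar(revenue.index, revenue.values)                                    # individual contribution
ax1.set_ylabel("Revenue")

ax2 = ax1.twinx()
ax2.plot(revenue.index, cumulative_pct.values, color="red", marker="o")   # cumulative share
ax2.set_ylabel("Cumulative %")
ax2.set_ylim(0, 110)

plt.title("Pareto Chart: Revenue by Customer")
plt.show()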


Step 5: Data Modeling

After cleaning and exploring the data, the next step is to build models that help answer the research question or make predictions.

1. What is Data Modeling?

Data modeling is the process of applying statistical, machine learning, or deep learning algorithms to your dataset to:

  • Discover hidden patterns.

  • Test hypotheses.

  • Make predictions.

  • Classify or cluster data points.


2. Tasks in the Modeling Step

  1. Select the right model

    • Regression → If predicting a continuous value (e.g., predicting house prices).

    • Classification → If predicting categories (e.g., spam or not spam).

    • Clustering → If grouping data (e.g., customer segmentation).

    • Dimensionality Reduction → If simplifying large data (e.g., PCA for high-dimensional features).

  2. Split the dataset

    • Training set → Used to train the model.

    • Validation set → Used to tune hyperparameters.

    • Test set → Used to evaluate performance.

  3. Train the model

    • Fit the algorithm to the training data.

  4. Tune hyperparameters

    • Adjust settings (e.g., learning rate, number of layers, depth of tree).

  5. Evaluate the model

    • Use metrics like accuracy, precision, recall, F1-score, RMSE, AUC, etc.


3. Example Workflow

Let’s say we want to predict student exam pass/fail based on study hours and attendance:

  • Choose a classification algorithm (e.g., Logistic Regression).

  • Split data into 80% training, 20% testing.

  • Train the model on training data.

  • Test it on unseen data.

  • Evaluate with accuracy and confusion matrix.


4. Common Algorithms

  • Regression: Linear Regression, Logistic Regression.

  • Tree-based methods: Decision Trees, Random Forest, Gradient Boosting.

  • Clustering: K-Means, Hierarchical Clustering, DBSCAN.

  • Neural Networks: Deep Learning models for image, text, and speech.


5. Goal of this Step

The main goal is to build the best-performing model that balances:

  • Bias vs Variance (avoid underfitting/overfitting).

  • Accuracy vs Interpretability (sometimes a simpler model is better for explaining results).




Step 5 in Practice

In this step you shift from open-ended exploration to goal-directed modeling: predicting numbers, classifying labels, or uncovering structure. The work is iterative: you try models, learn from their errors, refine the features and model choices, and loop back.


1) Model & variable (feature) selection

Start with the target and constraints

  • Target definition: What exactly are you predicting? (numeric → regression; category → classification; sequence → time series; groups → clustering.)

  • Operational constraints:

    • Deployability/latency: Can this run fast enough in production (e.g., mobile, real-time)?

    • Maintainability: How often will features drift? How painful is retraining?

    • Interpretability: Do stakeholders need a simple, explainable model?

Choose candidate model families

  • Regression: Linear/regularized (OLS, Ridge/Lasso/Elastic Net), tree-based (Random Forest, Gradient Boosting), GLMs.

  • Classification: Logistic Regression, K-NN, SVM, tree-based ensembles, Naive Bayes.

  • Unsupervised: K-means, DBSCAN, PCA, autoencoders.

  • Time series: ARIMA/Prophet/exponential smoothing; tree-based/boosted models with lag features.

Select and engineer features

  • From EDA → features: Use the relationships you saw to propose inputs, interactions, and transformations (logs, ratios, bins).

  • Preprocessing: Handle missing values; scale if the model needs it (e.g., K-NN, SVM).

  • Leakage check: Exclude any variable that “peeks into the future” or encodes the label.

  • Parsimony: Prefer the smallest feature set that meets performance; it’s more stable.

Establish a baseline

Before any fancy modeling, fit a naive baseline (mean predictor for regression, majority class for classification). Your model must beat this baseline by a meaningful margin.
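
A baseline takes only a couple of lines with scikit-learn's dummy estimators. The sketch below reuses the invented pass/fail numbers from the worked example later in this step:

# Sketch: a naive baseline to beat (majority-class classifier, invented data)

import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

X = pd.DataFrame({"Study_Hours": [2, 4, 5, 7, 1, 3, 8, 6, 9, 10]})
y = pd.Series([0, 0, 1, 1, 0, 0, 1, 1, 1, 1])    # 0 = Fail, 1 = Pass

baseline = DummyClassifier(strategy="most_frequent")
scores = cross_val_score(baseline, X, y, cv=3, scoring="accuracy")
print("Baseline accuracy:", scores.mean())       # any real model must clearly beat this number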


2) Model execution (coding & training)

Reproducible setup

  • Fix random seeds, log data/version, and freeze library versions.

Data splitting strategy

  • Holdout or K-fold cross-validation (stratify for classification).

  • Time-aware splits for temporal data (train on past, validate on future).

Pipelines & hyperparameters

  • Use a single pipeline that includes preprocessing + model (so your test set never “sees” fitted transformers from train).

  • Tune hyperparameters (grid/random/Bayesian); for each setting, evaluate via CV on train only.
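
As a concrete illustration of this pattern, here is a hedged sketch that wraps scaling and a K-NN classifier in one scikit-learn Pipeline and tunes k with cross-validated grid search; it uses the built-in breast-cancer dataset purely for illustration.

# Sketch: preprocessing + model in one pipeline, tuned with cross-validated grid search

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)            # built-in dataset, used only as an example
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipe = Pipeline([
    ("scale", StandardScaler()),                       # fitted on the training folds only
    ("knn", KNeighborsClassifier()),
])

param_grid = {"knn__n_neighbors": [3, 5, 7, 9, 11]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)                           # tuning happens on the training set only

print("Best k:", search.best_params_)
print("Cross-validated accuracy:", search.best_score_)
print("Held-out test accuracy:", search.score(X_test, y_test))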

Two illustrative examples

  • Linear regression (StatsModels/Scikit-learn): Great for a first pass and for interpretability; inspect coefficients, confidence intervals, and p-values (StatsModels) when assumptions are plausible.

  • K-Nearest Neighbors: Simple, nonparametric; requires scaling; choose k via CV. Accuracy on the training data alone is not meaningful—evaluate on unseen data.
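
To make the first bullet concrete, here is a minimal StatsModels sketch on invented study-hours data (variable names and numbers are made up); the K-NN workflow follows the same pipeline-plus-cross-validation pattern sketched above.

# Sketch: interpretable linear regression with StatsModels (invented data)

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
hours = rng.uniform(1, 10, size=50)                   # hypothetical study hours
marks = 40 + 4 * hours + rng.normal(0, 5, size=50)    # hypothetical marks with noise

X = sm.add_constant(hours)                            # adds the intercept term
model = sm.OLS(marks, X).fit()

print(model.summary())    # coefficients, confidence intervals, p-values, R-squared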


3) Diagnostics & model comparison

Evaluate on unseen data

  • Keep a validation set for tuning and a final test set for honest performance. Never tune on test.

Pick metrics that match the goal

  • Regression: RMSE/MAE/R²; often MAE is easier to explain, RMSE penalizes big misses.

  • Classification: Accuracy (only if balanced), Precision/Recall/F1, ROC-AUC, PR-AUC, calibration error.

  • Ranking/decisions: Top-k precision, lift, cumulative gains, expected profit/cost.

Confusion matrix & thresholds

  • Go beyond accuracy: examine false positives/negatives and tune the decision threshold to the business trade-off (e.g., recall-heavy for fraud/medical screening).
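
A minimal sketch of threshold tuning (on scikit-learn's built-in breast-cancer data, purely for illustration) shows how lowering the threshold trades precision for recall:

# Sketch: moving the decision threshold instead of using the default 0.5

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]              # probability of the positive class

for threshold in [0.5, 0.3, 0.1]:                    # lower threshold -> more positives -> higher recall
    preds = (probs >= threshold).astype(int)
    print(f"threshold={threshold}: "
          f"precision={precision_score(y_test, preds):.2f}, "
          f"recall={recall_score(y_test, preds):.2f}")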

Diagnostics (assumptions & behavior)

  • Linear models: Check residual plots (nonlinearity, heteroscedasticity), influential points, multicollinearity (VIF).

  • K-NN/SVM/tree models: Sensitivity to scaling, class imbalance, overfitting (depth/complexity).

  • Stability: Performance variance across folds; model should be consistent, not lucky.
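
For the linear-model diagnostics above, a minimal sketch (on invented data) of a residual plot and variance inflation factors:

# Sketch: linear-model diagnostics - residual plot and variance inflation factors (invented data)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "hours":      rng.uniform(1, 10, 100),
    "attendance": rng.uniform(40, 100, 100),
})
y = 20 + 3 * X["hours"] + 0.5 * X["attendance"] + rng.normal(0, 5, 100)

X_const = sm.add_constant(X)
fit = sm.OLS(y, X_const).fit()

# Residuals vs. fitted values: look for curvature or a funnel shape
plt.scatter(fit.fittedvalues, fit.resid)
plt.axhline(0, color="red")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residual Plot")
plt.show()

# Variance inflation factors: values well above 5-10 suggest multicollinearity
for i, col in enumerate(X_const.columns):
    print(col, variance_inflation_factor(X_const.values, i))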

Compare models fairly

  • Use the same folds/splits and preprocessing.

  • Prefer simpler models if performance is similar (interpretability, robustness).

  • Consider statistical tests (e.g., McNemar for paired classifications; paired t-tests on fold scores) when differences are small.

Error analysis (the secret sauce)

  • Inspect the worst errors and systematically ask “why?”

    • Bad labels? Missing features? Specific segments (e.g., country, device) failing?

    • This guides targeted feature engineering and often yields the biggest gains.


Step 5: Data Modeling Example (Student Pass/Fail Prediction)

# Step 5: Data Modeling

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Sample dataset
data = {
    'Study_Hours': [2, 4, 5, 7, 1, 3, 8, 6, 9, 10],
    'Attendance': [50, 60, 65, 80, 40, 55, 90, 70, 95, 98],
    'Pass':       [0, 0, 1, 1, 0, 0, 1, 1, 1, 1]  # 0 = Fail, 1 = Pass
}

df = pd.DataFrame(data)

# Features (X) and Target (y)
X = df[['Study_Hours', 'Attendance']]
y = df['Pass']

# Split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train the model
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict on test set
y_pred = model.predict(X_test)

# Evaluate model
print("✅ Accuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Step 6: Presenting Findings and Building Applications

Once the data has been collected, prepared, explored, and modeled, the final step is to share the results and, if necessary, deploy them for repeated use.


1. Presenting Findings

  • At this stage, you communicate the insights from your analysis to the stakeholders (teachers, managers, customers, policymakers, etc.).

  • Your job is to explain the results clearly and simply:

    • What was discovered in the data?

    • What does the model predict or recommend?

    • Why does this matter for decision-making?

  • Tools used: visualizations (charts, dashboards, reports), storytelling with data, and presentations.

💡 Example: If you built a model to predict student performance, you could show that students with poor attendance are more likely to fail, using graphs and summaries.


2. Automating the Process

  • Sometimes stakeholders want these insights again and again (weekly sales predictions, daily customer churn alerts, monthly reports, etc.).

  • Instead of manually redoing everything, you automate parts of the process:

    • Automating model scoring (predicting on new data only).

    • Creating dashboards that update automatically when new data is added.

    • Building applications (web apps, Excel reports, APIs) that make the model’s predictions available in real time.

💡 Example: A retail store prediction model can be automated to refresh sales forecasts daily in a dashboard.
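
A minimal sketch of this kind of automation, assuming the pass/fail model from Step 5 and placeholder file names: train and save the model once with joblib, then let a scheduler rerun only the scoring part.

# Sketch: persist a trained model, then score fresh data on a schedule (file names are placeholders)

import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression

# --- Run once, after training (reusing the invented Step 5 data) ---
train = pd.DataFrame({
    "Study_Hours": [2, 4, 5, 7, 1, 3, 8, 6, 9, 10],
    "Attendance":  [50, 60, 65, 80, 40, 55, 90, 70, 95, 98],
    "Pass":        [0, 0, 1, 1, 0, 0, 1, 1, 1, 1],
})
model = LogisticRegression().fit(train[["Study_Hours", "Attendance"]], train["Pass"])
joblib.dump(model, "pass_fail_model.joblib")

# --- Run daily/weekly via a scheduler (cron, Task Scheduler, Airflow, ...) ---
loaded = joblib.load("pass_fail_model.joblib")
new_students = pd.DataFrame({"Study_Hours": [4, 9], "Attendance": [58, 92]})   # stand-in for fresh data
new_students["Predicted_Pass"] = loaded.predict(new_students[["Study_Hours", "Attendance"]])
new_students.to_csv("daily_predictions.csv", index=False)                      # could feed a dashboard or report
print(new_students)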


3. Role of Soft Skills

  • Technical analysis alone isn’t enough.

  • You need to convince and inspire stakeholders to use your results.

  • Communication, presentation skills, and storytelling with data are just as important as coding.

💡 Why? If people don’t understand or trust your findings, your entire project might be ignored.


Summary

Step 6 combines two important aspects:

  1. Presentation → Explaining your findings in a clear, engaging, and useful way.

  2. Automation → Making sure your model or analysis can be reused efficiently.

This step ensures your hard work has a real-world impact and that decision-makers can take action based on your results.
