Data Science Process

The Data Science Process (Six Steps)

Data science projects follow a structured process. The book highlights six main steps that help keep insights reliable, reproducible, and useful to the organization.

1. Setting the Research Goal

What it means: Clearly define the purpose of the project.
How it’s done: Prepare a project charter that specifies:
- What you’re going to research.
- Why it benefits the organization.
- What data and resources are needed.
- Timeline and deliverables.
Example: A company may want to know: “Can we predict customer churn to improve retention?”

2. Retrieving Data

What it means: Collect the data required for the project.
Sources: Databases, spreadsheets, APIs, third-party vendors, or logs.
Checks needed:
- Does the data exist?
- Is the quality sufficient?
- Do we have access rights?
Example: Gathering customer purchase records from a database or downloading open data from a government portal.
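The checks above can be sketched in code. The following is a minimal illustration, not a prescribed method: it uses an in-memory SQLite database as a stand-in for a real customer database, and the `purchases` table, its columns, and the sample rows are all hypothetical.

```python
import sqlite3

# In-memory database standing in for a real customer database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchases (customer_id INTEGER, amount REAL, purchased_on TEXT)"
)
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [(1, 19.99, "2024-01-05"), (2, None, "2024-01-06"), (1, 5.50, "2024-02-01")],
)

# Does the data exist? Look the table up in SQLite's schema catalog.
table_exists = conn.execute(
    "SELECT COUNT(*) FROM sqlite_master WHERE type = 'table' AND name = 'purchases'"
).fetchone()[0] == 1

# Is the quality sufficient? Count rows whose amount is missing.
total_rows, missing_amounts = conn.execute(
    "SELECT COUNT(*), SUM(amount IS NULL) FROM purchases"
).fetchone()

# Retrieve only the usable rows for the next step.
rows = conn.execute(
    "SELECT customer_id, amount, purchased_on FROM purchases "
    "WHERE amount IS NOT NULL ORDER BY purchased_on"
).fetchall()
conn.close()

print(table_exists, total_rows, missing_amounts, len(rows))
```

Access rights are not modeled here; against a real database, a failed connection or a permission error at query time would answer that question.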

3. Data Preparation

What it means: Make raw data ready for analysis.
Sub-steps:
- Data Cleansing: Correct or remove errors, duplicates, and implausible values.
- Data Integration: Combine data from multiple sources.
- Data Transformation: Convert data into a usable format (e.g., encoding categories into numbers).
Example: If customer ages are stored differently across branches (some as ages in years, some as birth dates), convert them all to a single representation, such as age in years.
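The age example could look roughly like this in code. This is a sketch under stated assumptions: the record layout, the `age_in_years` helper, and the fixed reference date are all hypothetical, chosen only to show one way of bringing mixed formats to a common unit.

```python
from datetime import date

def age_in_years(value, today):
    """Return age in whole years from either a numeric age or a birth date."""
    if isinstance(value, date):
        # Subtract one year if this year's birthday has not happened yet.
        had_birthday = (today.month, today.day) >= (value.month, value.day)
        return today.year - value.year - (0 if had_birthday else 1)
    return int(value)

# Hypothetical records: one branch stores years, another stores birth dates.
records = [
    {"customer_id": 1, "age": 34},
    {"customer_id": 2, "age": date(1990, 6, 15)},
    {"customer_id": 3, "age": date(2000, 12, 31)},
]

# A fixed reference date keeps the result reproducible.
reference_day = date(2024, 6, 14)

standardized = [
    {"customer_id": r["customer_id"], "age_years": age_in_years(r["age"], reference_day)}
    for r in records
]
print(standardized)
```

Passing the reference date explicitly, rather than calling `date.today()` inside the helper, means the same input always produces the same output, which fits the goal of reproducible results.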

4. Data Exploration (EDA – Exploratory Data Analysis)

What it means: Gain an initial understanding of the data.
Techniques:
- Descriptive statistics (mean, median, standard deviation).
- Visualizations (histograms, scatter plots, box plots).
- Checking distributions and relationships between variables.
Why: Helps detect patterns, correlations, and outliers before building models.
Example: Discovering that customers with low engagement (few logins) are more likely to churn.
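A small sketch of this kind of exploration, using only Python's standard library. The sample data is invented for illustration, and splitting customers at the median login count is just one simple way to compare groups.

```python
from statistics import mean, median, stdev

# Hypothetical sample: (logins in the last month, churned?)
customers = [
    (1, True), (2, True), (3, False), (2, True),
    (12, False), (15, False), (9, False), (20, True),
]

# Descriptive statistics for the engagement variable.
logins = [n for n, _ in customers]
summary = {
    "mean": mean(logins),
    "median": median(logins),
    "stdev": stdev(logins),  # sample standard deviation
}

def churn_rate(rows):
    """Share of customers in rows who churned."""
    return sum(1 for _, churned in rows if churned) / len(rows)

# Split at the median to compare low- and high-engagement customers.
low = [c for c in customers if c[0] <= summary["median"]]
high = [c for c in customers if c[0] > summary["median"]]

print(summary)
print("low engagement churn rate:", churn_rate(low))
print("high engagement churn rate:", churn_rate(high))
```

In this made-up sample the low-engagement group churns more often, but the customer with 20 logins who still churned is the kind of outlier that exploration is meant to surface before modeling.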

5. Data Modeling (Model Building)

What it means: Apply statistical and machine learning models to answer the research question.
Process:
- Select variables (features) for the model.
- Choose a modeling technique (regression, classification, clustering, etc.).
- Train and evaluate the model iteratively.
Example: Using logistic regression to predict whether a customer will churn (yes/no).
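To make the logistic regression example concrete, here is a minimal single-feature version trained with batch gradient descent. It is a teaching sketch, not a production approach: the data, the learning rate, and the number of epochs are assumptions, and a real project would normally use an established library and evaluate on held-out data.

```python
import math

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit weight and bias by batch gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(w * x + b) - y  # predicted probability minus label
            grad_w += error * x
            grad_b += error
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

# Hypothetical training data: weekly logins and whether the customer churned (1 = yes).
logins = [0, 1, 1, 2, 3, 4, 5, 6, 7, 8]
churned = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]

w, b = fit_logistic(logins, churned)

def predict_churn(x, threshold=0.5):
    """Classify as churn (True) when the predicted probability meets the threshold."""
    return sigmoid(w * x + b) >= threshold

print("weight:", w, "bias:", b)
print("churn at 0 logins:", predict_churn(0))
print("churn at 8 logins:", predict_churn(8))
```

The fitted weight comes out negative on this data, meaning more logins lower the predicted churn probability. The 0.5 threshold is a default choice; in practice it would be tuned to the cost of missing a churner versus contacting a loyal customer.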

6. Presentation and Automation

What it means: Share results and put them into use.
Forms: Reports, dashboards, visualizations, or presentations to stakeholders.
Automation: In some cases, the model must run automatically on a schedule rather than as a one-off analysis (e.g., daily fraud detection in banking).
Example: A churn prediction model may be integrated into a CRM system so sales teams can act in real time.
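An automated scoring step might be sketched as below. Everything here is hypothetical: the coefficients stand in for an already-fitted model, the customer ids are invented, and the hand-off to a CRM system is left out because it depends on the system in use.

```python
import math

def churn_probability(logins, weight=-0.8, bias=2.5):
    """Score one customer with hypothetical, already-fitted coefficients."""
    return 1.0 / (1.0 + math.exp(-(weight * logins + bias)))

def flag_at_risk(customers, threshold=0.5):
    """Return ids of customers whose churn probability meets the threshold."""
    return [
        customer_id
        for customer_id, logins in customers
        if churn_probability(logins) >= threshold
    ]

# Hypothetical daily batch: (customer id, logins in the last week).
todays_batch = [("c-101", 0), ("c-102", 10), ("c-103", 2)]

at_risk = flag_at_risk(todays_batch)
print(at_risk)
```

A scheduler (for example, a nightly job) would call `flag_at_risk` on each new batch and pass the resulting ids to whatever system the sales team uses.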
