Data Science Process
The Data Science Process (Six Steps)
Data science projects follow a structured process. The book highlights six main steps, which help ensure that insights are reliable, reproducible, and useful for organizations.
1. Setting the Research Goal
- What it means: Clearly define the purpose of the project.
- How it’s done: Prepare a project charter that specifies:
  - What you’re going to research.
  - Why it benefits the organization.
  - What data and resources are needed.
  - Timeline and deliverables.
- Example: A company may want to know: “Can we predict customer churn to improve retention?”
2. Retrieving Data
- What it means: Collect the data required for the project.
- Sources: Databases, spreadsheets, APIs, third-party vendors, or logs.
- Checks needed:
  - Does the data exist?
  - Is the quality sufficient?
  - Do we have access rights?
- Example: Gathering customer purchase records from a database or downloading open data from a government portal.
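The first two checks in this step can be turned into a small script. The sketch below is only an illustration: it assumes the purchase records live in a SQLite database, and the table name `purchases`, its columns, the helper `check_source`, and the 30% missing-value limit are all made up for the example. Access rights are not modeled here; on a server database a permission problem would typically surface as an exception from the driver when the first query runs.

```python
import sqlite3

def check_source(conn, table, required_columns, max_missing_share=0.1):
    """Report whether a table exists and whether its required columns are complete enough."""
    report = {"exists": False, "row_count": 0, "missing_share": {}, "ok": False}

    # Does the data exist? Look the table up in SQLite's catalog.
    found = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' AND name = ?", (table,)
    ).fetchone()
    if found is None:
        return report
    report["exists"] = True

    # Are all required columns present? (row[1] is the column name)
    columns = {row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')}
    if not set(required_columns) <= columns:
        return report

    report["row_count"] = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    if report["row_count"] == 0:
        return report

    # Is the quality sufficient? Here: the share of NULLs per required column.
    for column in required_columns:
        nulls = conn.execute(
            f'SELECT COUNT(*) FROM "{table}" WHERE "{column}" IS NULL'
        ).fetchone()[0]
        report["missing_share"][column] = nulls / report["row_count"]
    report["ok"] = all(
        share <= max_missing_share for share in report["missing_share"].values()
    )
    return report

# Made-up sample data so the sketch runs on its own.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [(1, 20.0), (2, None), (3, 35.5), (4, 12.0)],
)
report = check_source(conn, "purchases", ["customer_id", "amount"], max_missing_share=0.3)
print(report)
```

Missing values are only one possible measure of quality; a real project would likely add checks for value ranges, duplicates, and freshness.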
3. Data Preparation
- What it means: Make raw data ready for analysis.
- Sub-steps:
  - Data Cleansing: Remove errors, duplicates, or false values.
  - Data Integration: Combine data from multiple sources.
  - Data Transformation: Convert data into a usable format (e.g., encoding categories into numbers).
- Example: If customer ages are stored differently across branches (some in years, some in birth dates), standardize them.
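As a rough sketch of how the three sub-steps might look for the age example, the code below combines records from two branches, drops repeated customers, and converts every age to whole years. The record layout (`customer_id`, `age`, `birth_date`), the helper names, and the fixed reference date are assumptions made for illustration.

```python
from datetime import date

def age_on(birth_date, today):
    """Whole years between birth_date and today."""
    had_birthday = (today.month, today.day) >= (birth_date.month, birth_date.day)
    return today.year - birth_date.year - (0 if had_birthday else 1)

def prepare(branch_a, branch_b, today):
    # Integration: combine both sources into one list.
    combined = branch_a + branch_b

    prepared, seen_ids = [], set()
    for row in combined:
        # Cleansing: keep only the first record per customer_id.
        if row["customer_id"] in seen_ids:
            continue
        seen_ids.add(row["customer_id"])

        # Transformation: express every age in whole years.
        if "age" in row:
            age = int(row["age"])
        else:
            age = age_on(date.fromisoformat(row["birth_date"]), today)
        prepared.append({"customer_id": row["customer_id"], "age": age})
    return prepared

# Made-up records: branch A stores ages, branch B stores birth dates.
branch_a = [{"customer_id": 1, "age": 34}, {"customer_id": 1, "age": 34}]
branch_b = [{"customer_id": 2, "birth_date": "1990-06-15"}]

customers = prepare(branch_a, branch_b, today=date(2024, 1, 1))
print(customers)
```

Passing the reference date in explicitly, rather than calling `date.today()` inside the function, keeps the result reproducible when the preparation step is rerun later.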
4. Data Exploration (EDA – Exploratory Data Analysis)
- What it means: Gain an initial understanding of the data.
- Techniques:
  - Descriptive statistics (mean, median, standard deviation).
  - Visualizations (histograms, scatter plots, box plots).
  - Checking distributions and relationships between variables.
- Why: Helps detect patterns, correlations, and outliers before building models.
- Example: Discovering that customers with low engagement (few logins) are more likely to churn.
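To make the churn example concrete, here is a minimal sketch that computes the descriptive statistics listed above and then compares churn rates between low- and high-engagement customers. The eight customer records are invented for illustration, and splitting the groups at the median login count is just one simple choice.

```python
from statistics import mean, median, stdev

# Invented sample: login counts and whether the customer churned.
customers = [
    {"logins": 2, "churned": True},
    {"logins": 1, "churned": True},
    {"logins": 3, "churned": False},
    {"logins": 15, "churned": False},
    {"logins": 22, "churned": False},
    {"logins": 4, "churned": True},
    {"logins": 18, "churned": False},
    {"logins": 30, "churned": False},
]

# Descriptive statistics for one variable.
logins = [c["logins"] for c in customers]
summary = {"mean": mean(logins), "median": median(logins), "stdev": stdev(logins)}

def churn_rate(group):
    """Share of customers in the group who churned (0.0 for an empty group)."""
    return sum(c["churned"] for c in group) / len(group) if group else 0.0

# Relationship between two variables: engagement versus churn.
threshold = summary["median"]
low = [c for c in customers if c["logins"] < threshold]
high = [c for c in customers if c["logins"] >= threshold]

print(summary)
print("churn rate, low engagement: ", churn_rate(low))
print("churn rate, high engagement:", churn_rate(high))
```

A gap like this between the two groups is a hint worth following up, not proof of cause; with real data a histogram or box plot of logins per group would usually be the next step.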
5. Data Modeling (Model Building)
- What it means: Apply statistical and machine learning models to answer the research question.
- Process:
  - Select variables (features) for the model.
  - Choose a modeling technique (regression, classification, clustering, etc.).
  - Train and evaluate the model iteratively.
- Example: Using logistic regression to predict whether a customer will churn (yes/no).
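The sketch below shows the idea behind the logistic regression example with a single feature (login count), fitted from scratch by gradient descent so that it runs without extra packages. The data, learning rate, and number of epochs are illustrative assumptions; a real project would more likely use an established library and evaluate on held-out data rather than the training set.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, learning_rate=0.1, epochs=2000):
    """Fit p(churn) = sigmoid(w * x + b) by batch gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction error for every customer with the current parameters.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        # Move both parameters against the average gradient.
        w -= learning_rate * sum(e * x for e, x in zip(errors, xs)) / n
        b -= learning_rate * sum(errors) / n
    return w, b

# Invented training data: logins per customer and churn label (1 = churned).
logins = [1, 2, 3, 4, 15, 18, 22, 30]
churned = [1, 1, 0, 1, 0, 0, 0, 0]

# Feature selection and scaling: one feature, divided by 10 to keep values small.
xs = [count / 10 for count in logins]
w, b = fit_logistic(xs, churned)

def predict_churn_probability(login_count):
    return sigmoid(w * login_count / 10 + b)

# Evaluation on the training data only; optimistic by construction.
predictions = [1 if predict_churn_probability(c) >= 0.5 else 0 for c in logins]
accuracy = sum(p == y for p, y in zip(predictions, churned)) / len(churned)

print("weight:", w, "bias:", b)
print("P(churn | 1 login):  ", predict_churn_probability(1))
print("P(churn | 30 logins):", predict_churn_probability(30))
print("training accuracy:", accuracy)
```

The fitted weight should come out negative, which matches the pattern described in the exploration step: more logins, lower predicted churn probability.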
6. Presentation and Automation
- What it means: Share results and put them into use.
- Forms: Reports, dashboards, visualizations, or presentations to stakeholders.
- Automation: In some cases, the model needs to be automated (e.g., daily fraud detection in banking).
- Example: A churn prediction model may be integrated into a CRM system so sales teams can act in real time.
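One simple form of automation is a scheduled job that scores every customer and passes the high-risk ones on to the team. The following is a sketch under stated assumptions: `churn_probability` stands in for an already-trained model, its weights are placeholders rather than fitted values, and the hand-off to a CRM system is left out because it depends entirely on the system in use.

```python
import math

def churn_probability(logins, w=-3.0, b=2.0):
    """Stand-in for a trained model; w and b are placeholder values, not fitted ones."""
    return 1.0 / (1.0 + math.exp(-(w * logins / 10 + b)))

def daily_scoring_job(customers, threshold=0.5):
    """Score every customer and return those at or above the risk threshold."""
    flagged = []
    for customer in customers:
        p = churn_probability(customer["logins"])
        if p >= threshold:
            flagged.append(
                {"customer_id": customer["customer_id"], "churn_probability": round(p, 3)}
            )
    # Highest risk first, so the team can work from the top of the list.
    return sorted(flagged, key=lambda c: c["churn_probability"], reverse=True)

# Invented input; in practice this would come from the data retrieval step.
todays_customers = [
    {"customer_id": 1, "logins": 1},
    {"customer_id": 2, "logins": 25},
    {"customer_id": 3, "logins": 4},
]

at_risk = daily_scoring_job(todays_customers)
for entry in at_risk:
    print(entry)
```

Running such a function once a day from a scheduler covers the batch case; acting in real time, as in the CRM example, would instead call the model whenever a customer record is opened or updated.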