Data Science Process

 

The Data Science Process (Six Steps)

Data science projects follow a structured process. The book highlights six main steps, which help ensure that insights are reliable, reproducible, and useful for organizations.


1. Setting the Research Goal

  • What it means: Clearly define the purpose of the project.

  • How it’s done: Prepare a project charter that specifies:

    • What you’re going to research.

    • Why it benefits the organization.

    • What data and resources are needed.

    • Timeline and deliverables.

  • Example: A company may want to know: “Can we predict customer churn to improve retention?”


2. Retrieving Data

  • What it means: Collect the data required for the project.

  • Sources: Databases, spreadsheets, APIs, third-party vendors, or logs.

  • Checks needed:

    • Does the data exist?

    • Is the quality sufficient?

    • Do we have access rights?

  • Example: Gathering customer purchase records from a database or downloading open data from a government portal.
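
The first two checks above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the in-memory SQLite database, the `purchases` table, and the helper names `table_exists` and `missing_share` are all hypothetical stand-ins for a real customer database. The access-rights check is not shown, because it depends on the database system in use.

```python
import sqlite3

def table_exists(conn, table):
    """Check 1: does the data exist?"""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table,),
    ).fetchone()
    return row is not None

def missing_share(rows, column):
    """Check 2: a crude quality measure, the share of rows where `column` is NULL."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if row[column] is None) / len(rows)

# Hypothetical in-memory database standing in for a real customer database.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets us read columns by name
conn.execute("CREATE TABLE purchases (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [(1, 19.99), (2, None), (3, 5.50), (3, 42.00)],
)

# Retrieve the records only if the table is actually there.
if table_exists(conn, "purchases"):
    rows = conn.execute("SELECT customer_id, amount FROM purchases").fetchall()
else:
    rows = []

print(len(rows), "rows retrieved")
print("share of missing amounts:", missing_share(rows, "amount"))
```

A high share of missing values at this stage is a signal to go back to the data owner before investing time in preparation.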


3. Data Preparation

  • What it means: Make raw data ready for analysis.

  • Sub-steps:

    • Data Cleansing: Correct or remove errors, duplicates, and invalid values.

    • Data Integration: Combine data from multiple sources.

    • Data Transformation: Convert data into a usable format (e.g., encoding categories into numbers).

  • Example: If customer ages are stored differently across branches (some as ages in years, some as birth dates), convert them all to one format.
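
As a rough sketch of the three sub-steps, the snippet below combines records from two branches, drops duplicate customers, converts every age to whole years, and encodes a category as a number. The record layout, the field names (`customer_id`, `age`, `plan`), and the functions `age_in_years` and `prepare` are assumptions made for this example, not part of any particular library.

```python
from datetime import date

def age_in_years(value, today):
    """Accept an integer age or an ISO birth date string; return age in whole years."""
    if isinstance(value, int):
        return value
    born = date.fromisoformat(value)
    # Subtract one year if this year's birthday has not happened yet.
    before_birthday = (today.month, today.day) < (born.month, born.day)
    return today.year - born.year - int(before_birthday)

def prepare(records_by_branch, today):
    seen = set()
    cleaned = []
    for records in records_by_branch:          # integration: walk every source
        for rec in records:
            if rec["customer_id"] in seen:     # cleansing: skip duplicate customers
                continue
            seen.add(rec["customer_id"])
            cleaned.append({
                "customer_id": rec["customer_id"],
                "age": age_in_years(rec["age"], today),
                "plan": rec["plan"],
            })
    # Transformation: map each category to an integer code.
    codes = {name: i for i, name in enumerate(sorted({r["plan"] for r in cleaned}))}
    for rec in cleaned:
        rec["plan_code"] = codes[rec["plan"]]
    return cleaned

# Hypothetical data: branch A stores ages in years, branch B stores birth dates.
branch_a = [
    {"customer_id": 1, "age": 34, "plan": "basic"},
    {"customer_id": 2, "age": 51, "plan": "premium"},
]
branch_b = [
    {"customer_id": 2, "age": "1973-01-10", "plan": "premium"},  # duplicate customer
    {"customer_id": 3, "age": "1990-08-15", "plan": "basic"},
]

customers = prepare([branch_a, branch_b], today=date(2024, 6, 1))
for customer in customers:
    print(customer)
```

Keeping the first record for a duplicated customer is a simplification; a real project would need a rule for deciding which source to trust.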


4. Data Exploration (EDA – Exploratory Data Analysis)

  • What it means: Gain an initial understanding of the data.

  • Techniques:

    • Descriptive statistics (mean, median, standard deviation).

    • Visualizations (histograms, scatter plots, box plots).

    • Checking distributions and relationships between variables.

  • Why: Helps detect patterns, correlations, and outliers before building models.

  • Example: Discovering that customers with low engagement (few logins) are more likely to churn.
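
A small sketch of this kind of exploration, using only Python's `statistics` module, is shown below. The login counts, the churn labels, and the threshold of three logins are made-up values chosen for illustration; a real analysis would also plot the data.

```python
from statistics import mean, median, stdev

# Hypothetical data: logins per month and whether the customer churned (1 = yes).
logins = [1, 2, 2, 3, 10, 12, 15, 20]
churned = [1, 1, 1, 0, 0, 0, 1, 0]

# Descriptive statistics for the engagement variable.
summary = {
    "mean": mean(logins),
    "median": median(logins),
    "stdev": stdev(logins),  # sample standard deviation
}
print(summary)

def churn_rate_by_engagement(logins, churned, threshold):
    """Compare churn rates of low-engagement and high-engagement customers."""
    low = [c for n, c in zip(logins, churned) if n <= threshold]
    high = [c for n, c in zip(logins, churned) if n > threshold]
    return mean(low), mean(high)

low_rate, high_rate = churn_rate_by_engagement(logins, churned, threshold=3)
print("churn rate, low engagement:", low_rate)
print("churn rate, high engagement:", high_rate)
```

A gap between the two rates suggests that engagement is worth keeping as a feature in the modeling step, although with so few rows it proves nothing on its own.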


5. Data Modeling (Model Building)

  • What it means: Apply statistical and machine learning models to answer the research question.

  • Process:

    • Select variables (features) for the model.

    • Choose a modeling technique (regression, classification, clustering, etc.).

    • Train and evaluate the model iteratively.

  • Example: Using logistic regression to predict whether a customer will churn (yes/no).
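
To make the train-and-evaluate loop concrete, here is a minimal from-scratch sketch of logistic regression with one feature, fitted by batch gradient descent. It is a teaching illustration rather than the way this is usually done: in practice a library implementation would normally be used. The data, the learning rate, the number of epochs, and all function names are assumptions for this example.

```python
import math
from statistics import mean, pstdev

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, lr=0.5, epochs=2000):
    """Fit p(churn) = sigmoid(w * x + b) by batch gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= lr * sum(e * x for e, x in zip(errors, xs)) / n
        b -= lr * sum(errors) / n
    return w, b

# Hypothetical training data: the selected feature is logins per month.
train_logins = [1, 2, 2, 3, 10, 12, 15, 20]
train_churn = [1, 1, 1, 1, 0, 0, 0, 0]

# Standardize the feature so gradient descent behaves well.
mu, sigma = mean(train_logins), pstdev(train_logins)

def scale(logins):
    return (logins - mu) / sigma

w, b = train_logistic([scale(x) for x in train_logins], train_churn)

def churn_probability(logins):
    return sigmoid(w * scale(logins) + b)

# Evaluate on a small hypothetical held-out set.
test_logins = [2, 3, 14, 25]
test_churn = [1, 1, 0, 0]
predictions = [int(churn_probability(x) >= 0.5) for x in test_logins]
accuracy = mean([int(p == y) for p, y in zip(predictions, test_churn)])

print("weight:", w, "bias:", b)
print("held-out accuracy:", accuracy)
```

The negative weight learned here says that more logins lower the predicted churn probability. If the evaluation were poor, the next iteration would revisit the features or the technique.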


6. Presentation and Automation

  • What it means: Share results and put them into use.

  • Forms: Reports, dashboards, visualizations, or presentations to stakeholders.

  • Automation: In some cases, the model must run automatically on a recurring basis (e.g., daily fraud detection in banking).

  • Example: A churn prediction model may be integrated into a CRM system so sales teams can act in real time.
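
One possible shape for the automated part is a small batch job that scores every customer and writes the high-risk ones to a report. The sketch below assumes a scoring function is already available; the stand-in model, the threshold, the field names, and the functions `score_customers` and `write_report` are hypothetical, and a real CRM integration would look different.

```python
import csv
import io

def score_customers(customers, predict_proba, threshold=0.5):
    """Return customers whose churn probability meets the threshold, highest risk first."""
    flagged = []
    for customer in customers:
        p = predict_proba(customer)
        if p >= threshold:
            flagged.append({
                "customer_id": customer["customer_id"],
                "churn_probability": round(p, 2),
            })
    return sorted(flagged, key=lambda row: row["churn_probability"], reverse=True)

def write_report(rows, out):
    """Write the flagged customers as CSV to any file-like object."""
    writer = csv.DictWriter(out, fieldnames=["customer_id", "churn_probability"])
    writer.writeheader()
    writer.writerows(rows)

# Stand-in for a trained model: fewer logins means higher churn risk.
def toy_model(customer):
    return 1.0 / (1.0 + customer["logins"])

customers = [
    {"customer_id": 1, "logins": 1},
    {"customer_id": 2, "logins": 14},
    {"customer_id": 3, "logins": 2},
]

flagged = score_customers(customers, toy_model, threshold=0.3)
report = io.StringIO()  # a real job would write to a file or push to another system
write_report(flagged, report)
print(report.getvalue())
```

Running a script like this from a scheduler once a day is one simple way to automate the step.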


