Big Data and Data Science

What is Big Data?

Big data describes datasets whose size, rate of arrival, or complexity exceeds the capability of a single machine or simple tools to store, process, and analyze them efficiently.

  • Simple idea: Data that is too large, too fast, or too complex for a normal computer to handle easily.

  • Example:

    • Your laptop can analyze a few thousand rows in Excel.

    • But can it handle all WhatsApp messages sent in a day or all YouTube videos uploaded in a week? → That’s big data.

The 5 Vs of Big Data:

  1. Volume – huge amounts of data (terabytes, petabytes). The scale climbs GB → TB → PB → EB.

    • Ex: Facebook stores billions of photos.

  2. Velocity – data arrives very fast. This covers both the speed of generation and the need for low-latency processing.

    • Ex: Stock market prices update every second.

  3. Variety – data comes in heterogeneous formats: structured tables, semi-structured logs/JSON, text, images, audio, video, graphs.

    • Ex: Text, video, images, sensor signals.

  4. Veracity – data may not be clean or reliable.

    • Ex: Wrong entries, fake reviews, noise, missing values.

  5. Value – the useful information hidden inside.

    • Ex: Netflix learns what movies to recommend from your watch history.
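
To make the Volume ladder (GB → TB → PB → EB) concrete, here is a minimal back-of-the-envelope sketch in Python. The row count, bytes per row, and 16 GB of RAM are hypothetical numbers chosen for illustration, and the helper names are not from any library. It assumes decimal units, where each step is a factor of 1000.

```python
# Decimal (SI) units: each step up the ladder is a factor of 1000.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB"]


def human_size(num_bytes):
    """Format a byte count using the largest unit that keeps the value >= 1."""
    size = float(num_bytes)
    for unit in UNITS:
        if size < 1000 or unit == UNITS[-1]:
            return f"{size:.1f} {unit}"
        size /= 1000


def fits_in_memory(num_bytes, ram_bytes):
    """Rough check: does the raw data fit in the given amount of RAM?"""
    return num_bytes <= ram_bytes


# Hypothetical dataset: 10 billion rows at roughly 200 bytes per row.
num_rows = 10_000_000_000
bytes_per_row = 200
dataset_bytes = num_rows * bytes_per_row

# Hypothetical laptop with 16 GB of RAM.
laptop_ram = 16 * 1000**3

print(human_size(dataset_bytes))                  # 2.0 TB
print(fits_in_memory(dataset_bytes, laptop_ram))  # False
```

An estimate like this is only a first check; real in-memory representations usually take more space than the raw bytes.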

Important: “big” is relative; what counts as big changes with available hardware. The core idea is that when single-machine, in-memory tools (e.g., pandas on a laptop) fail, you need distributed storage, parallel compute, or algorithmic approximations.
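
One simple way around the in-memory limit is to stream the data in chunks and keep only running totals. The sketch below uses only the Python standard library; the function names and the generator standing in for a large file are assumptions made for illustration.

```python
from itertools import islice


def read_in_chunks(records, chunk_size):
    """Yield lists of at most chunk_size items from any iterable."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def streaming_mean(records, chunk_size=1000):
    """Compute the mean while holding only one chunk in memory at a time."""
    total = 0.0
    count = 0
    for chunk in read_in_chunks(records, chunk_size):
        total += sum(chunk)
        count += len(chunk)
    return total / count if count else float("nan")


# A generator stands in for a file or stream too large to load at once.
values = (x % 10 for x in range(1_000_000))
mean_value = streaming_mean(values, chunk_size=10_000)
print(mean_value)  # 4.5
```

The same pattern (process a piece, keep a small summary, move on) is what distributed tools apply across many machines instead of one.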

What is Data Science?

  • Data Science is the process of turning raw data into useful insights.

  • It combines:

    • Statistics (understanding data patterns)

    • Programming (Python, R, SQL, etc.)

    • Domain knowledge (knowing the field like healthcare, business, or social media)

Example:
From data about student attendance and marks, a data scientist can predict which students may need extra help.
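
As a rough illustration of the student example, the sketch below trains a tiny nearest-centroid classifier in plain Python. Both the records and the choice of nearest-centroid are assumptions made for this sketch; a real project would use more data, more features, and a properly evaluated model.

```python
from math import dist

# Hypothetical past records: (attendance %, average marks %), needed_help
history = [
    ((95, 82), False),
    ((88, 75), False),
    ((92, 68), False),
    ((60, 45), True),
    ((55, 52), True),
    ((70, 38), True),
]


def centroid(points):
    """Average each feature across a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def fit(records):
    """Return one centroid per class label."""
    by_label = {}
    for features, label in records:
        by_label.setdefault(label, []).append(features)
    return {label: centroid(points) for label, points in by_label.items()}


def predict(model, features):
    """Assign the label whose centroid is closest (Euclidean distance)."""
    return min(model, key=lambda label: dist(model[label], features))


model = fit(history)
print(predict(model, (58, 50)))  # True: likely needs extra help
print(predict(model, (90, 80)))  # False
```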

Data science is the interdisciplinary process of extracting insight (models, predictions, decisions) from data using statistics, machine learning, and domain knowledge.

Big data changes the scale, tools, and algorithms a data scientist must use; it is both an enabler and a constraint for data science workflows.

How is Big Data Related to Data Science?

  • Big Data is the raw material.

  • Data Science is the toolset to extract knowledge from it.

  • Relation:

    • Data Science asks questions (e.g., “Why are sales dropping?”).

    • Big Data provides huge amounts of evidence (e.g., customer transactions, online clicks).

    • Together → help make better decisions, predictions, and recommendations.

Why is Big Data Important in Data Science?

  • Without big data, data science would be limited to smaller, simpler problems.

  • Today’s real-world problems are big:

    • Google search engine → processes billions of queries daily.

    • Amazon → recommends products using millions of shopping records.

    • Healthcare → analyzes MRI images and patient history for disease prediction.


Tools and Technologies

  • For small data: Python (pandas, scikit-learn).

  • For big data: Spark, Hadoop, Dask (widely used in industry).
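
Tools such as Hadoop's MapReduce split the data into partitions, process each partition independently, and then merge the partial results. The sketch below imitates that pattern in a single Python process with a word count. It is not the Hadoop, Spark, or Dask API; the partitions and function names are hypothetical.

```python
from collections import Counter
from functools import reduce


def map_partition(lines):
    """Map step: count words within one partition independently."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce step: combine two partial results into one."""
    return left + right


# Hypothetical partitions; in a cluster each would live on a different machine.
partitions = [
    ["big data needs big tools", "data science asks questions"],
    ["big data provides evidence"],
]

# Each partition could be processed in parallel because no state is shared.
partial_counts = [map_partition(part) for part in partitions]
word_counts = reduce(merge_counts, partial_counts, Counter())

print(word_counts["big"], word_counts["data"])  # 3 3
```

Because the map step never looks outside its own partition, adding more machines lets the same code handle more data.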

Sources & types of big data

Common sources you can show students:

  • Web/server logs and click streams (user behavior analytics).

  • Sensor / IoT telemetry (time-series at high frequency).

  • Social media (text, images, metadata).

  • Transactional systems / financial trades (high volume, low-latency).

  • Scientific data (genomics, astronomy — very large files).

  • Multimedia (video streams).

  • Graphs (social networks, knowledge graphs).

Emphasize structure: whether data is tabular, nested, binary (images), or graph-shaped changes how you store and process it.
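
To show why shape matters, the sketch below holds the same hypothetical "who follows whom" data in three forms: flat rows, a nested JSON document, and a graph adjacency list. All names and values are made up for illustration. The last question (who is reachable from a user) is natural on the graph form and awkward on flat rows.

```python
import json
from collections import defaultdict

# Tabular: one flat (follower, followee) record per row.
rows = [
    ("asha", "ravi"),
    ("asha", "meera"),
    ("ravi", "meera"),
    ("meera", "john"),
]

# Nested: group each user's follows into one record, then serialize as JSON.
follows = defaultdict(list)
for follower, followee in rows:
    follows[follower].append(followee)
document = json.dumps(
    {"users": [{"name": name, "follows": out} for name, out in follows.items()]}
)

# Graph: the same grouping used as an adjacency list for traversal.
graph = dict(follows)


def reachable(graph, start):
    """Return every user reachable from start by following edges."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen


print(sorted(reachable(graph, "asha")))  # ['john', 'meera', 'ravi']
```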

