Local Spark + Iceberg Data Platform (Analytics Infra Case Study)

This repository demonstrates a production-style analytics infrastructure built locally with Spark, Iceberg, and S3-compatible object storage. It shows how a data scientist or analytics engineer can design, configure, and operate a reproducible data platform that supports large-scale ETL, exploratory analysis, and forecasting workloads. Key themes include distributed compute, lakehouse architecture, performance tuning, and security-aware configuration, all intentionally scoped to be transparent, reproducible, and suitable for public sharing.

Note on Bitnami: The current implementation uses Spark 3.4.1 via Bitnami’s legacy images due to recent changes in Bitnami’s public image availability. This was a deliberate choice to preserve stability and reproducibility. A migration to non-Bitnami Spark images is planned for v2 of this repository.

What This Repo Is (and Is Not)

This repo is:

  • A realistic, end-to-end analytics infrastructure stack
  • Designed the way a data platform would be built for analysts and data scientists
  • Focused on correctness, scalability, and operational clarity

This repo is not:

  • A cloud-managed setup (no EMR / Databricks / BigQuery)
  • A toy “hello world” Spark example
  • A full production deployment (intentionally scoped to local + reproducible)

Architecture Overview

The platform uses a lakehouse-style architecture backed by object storage and a REST catalog.

                   ┌────────────────────┐
                   │   Jupyter Client   │
                   │   (PySpark / SQL)  │
                   └─────────┬──────────┘
                             │
                             ▼
                   ┌────────────────────┐
                   │    Spark Master    │
                   │  + Spark Workers   │
                   └─────────┬──────────┘
                             │
               ┌─────────────┴─────────────┐
               │                           │
               ▼                           ▼
      ┌──────────────────┐       ┌──────────────────┐
      │  Iceberg REST    │       │      MinIO       │
      │     Catalog      │       │  (S3-compatible) │
      └────────┬─────────┘       └────────┬─────────┘
               │                          │
               └────────────┬─────────────┘
                            ▼
                 Iceberg Tables (Lakehouse)


Core Technologies

  1. Apache Spark: distributed compute for large-scale ETL and analytics
  2. Apache Iceberg: transactional lakehouse tables with schema evolution and time travel
  3. MinIO: S3-compatible object storage used as the data lake
  4. Docker Compose: reproducible, local orchestration of all services
  5. Jupyter (client-only): clean separation between compute and interactive analysis

Why This Architecture

This stack mirrors real analytics platform patterns:

  1. Object storage as the source of truth
  2. Compute decoupled from storage
  3. Centralized catalog for table metadata
  4. Clients treated as ephemeral, stateless entry points

These are the same principles used in production systems — implemented locally and transparently.

Repository Structure:

.
├── jupyter-compose
│   ├── docker-compose.jupyter.yml
│   ├── Dockerfile
│   ├── notebooks
│   ├── requirements.txt
│   └── spark-defaults.conf
├── minio-compose
│   └── docker-compose.yml
├── README.md
├── spark-compose
│   ├── docker-compose.yml
│   └── spark-defaults.conf
└── worker-setup
    └── docker-compose.worker.yml


Key Design Decisions:

  1. Spark as a Shared Service

    • Spark runs independently of Jupyter.
    • Jupyter acts purely as a client, avoiding JVM conflicts and tight coupling.
  2. Iceberg via REST Catalog

    • Centralized metadata
    • No dependency on Hive Metastore
    • Matches modern lakehouse deployments
  3. Object Storage First

    • All data reads/writes go through S3A-compatible storage (MinIO).
    • This enforces stateless compute and reproducibility.
  4. Explicit Performance Tuning

    spark-defaults.conf includes:

    • Adaptive Query Execution (AQE)
    • Shuffle sizing
    • Broadcast thresholds
    • S3A upload and threading optimizations
    

These settings reflect real tuning tradeoffs, not defaults.
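As a concrete illustration, a spark-defaults.conf for this kind of stack typically combines catalog wiring with the tuning knobs listed above. The property names below are real Spark, Iceberg, and S3A settings, but the values and service hostnames (rest-catalog, minio, the warehouse bucket) are illustrative assumptions, not necessarily what this repo ships:

```properties
# Iceberg REST catalog wiring (hostnames and bucket are placeholders)
spark.sql.catalog.iceberg                      org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.iceberg.catalog-impl         org.apache.iceberg.rest.RESTCatalog
spark.sql.catalog.iceberg.uri                  http://rest-catalog:8181
spark.sql.catalog.iceberg.io-impl              org.apache.iceberg.aws.s3.S3FileIO
spark.sql.catalog.iceberg.warehouse            s3://warehouse/

# Adaptive Query Execution and shuffle/broadcast tuning (values illustrative)
spark.sql.adaptive.enabled                     true
spark.sql.adaptive.coalescePartitions.enabled  true
spark.sql.shuffle.partitions                   200
spark.sql.autoBroadcastJoinThreshold           64m

# S3A upload and threading
spark.hadoop.fs.s3a.fast.upload                true
spark.hadoop.fs.s3a.threads.max                32
spark.hadoop.fs.s3a.connection.maximum         64
spark.hadoop.fs.s3a.path.style.access          true
```

Tune shuffle partitioning and broadcast thresholds against your actual data volumes; the numbers above are starting points, not recommendations.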

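The shuffle-sizing tradeoff mentioned above usually comes down to a simple capacity calculation. A minimal sketch (pure Python; the 128 MB target and 50 GB workload are illustrative assumptions) of the arithmetic behind choosing spark.sql.shuffle.partitions:

```python
# Back-of-envelope sizing for spark.sql.shuffle.partitions:
# aim for roughly 128 MB of shuffled data per partition.

def shuffle_partitions(dataset_bytes: int,
                       target_partition_bytes: int = 128 * 1024**2) -> int:
    """Ceiling-divide the shuffled data size by the target partition size."""
    return max(1, -(-dataset_bytes // target_partition_bytes))

# A 50 GB shuffle stage at ~128 MB per partition -> 400 partitions.
print(shuffle_partitions(50 * 1024**3))
```

Too few partitions risks spill and straggler tasks; too many adds scheduling overhead, which is why the value is set explicitly rather than left at the default.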
Security & Public-Repo Hygiene

This repo is intentionally safe for public release:

✅ No credentials committed
✅ All secrets sourced from environment variables
✅ .env.example provided for local use
✅ No internal IPs or machine-specific paths
✅ No proprietary data or schemas

This mirrors how infra code should be shared across teams or orgs.
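A sketch of what the committed .env.example might look like. The variable names here are hypothetical placeholders, not necessarily the ones this repo uses; match them to the compose files:

```ini
# .env.example (variable names are illustrative; align them with the compose files)
MINIO_ROOT_USER=changeme
MINIO_ROOT_PASSWORD=changeme
AWS_ACCESS_KEY_ID=changeme
AWS_SECRET_ACCESS_KEY=changeme
AWS_REGION=us-east-1
```

The real .env stays untracked; only the example template with dummy values is committed.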

How to Run (Local)

Prerequisites

  • Docker
  • Docker Compose
  • ~32 GB RAM recommended for meaningful Spark workloads

Start Everything

cp .env.example .env
chmod +x start_c.sh
./start_c.sh

Access Points

Service UIs (the Spark master UI, Jupyter, and the MinIO console) are exposed on the host ports defined in the respective Docker Compose files; check each compose file for the exact port mappings.

How to Validate the Platform

From a Jupyter notebook or spark-shell:

spark.sql("SHOW NAMESPACES IN iceberg").show()
spark.sql("""
CREATE TABLE iceberg.demo.test_table (
  id INT,
  value STRING
) USING iceberg
""")
spark.sql("SELECT * FROM iceberg.demo.test_table").show()

Successful execution confirms:

  • Spark ↔ Iceberg integration
  • Iceberg ↔ MinIO storage
  • End-to-end lakehouse functionality
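For a fuller smoke test that also exercises the S3A write path, insert rows and inspect Iceberg's metadata tables. This is a sketch assuming the catalog name iceberg and the demo table from the snippet above:

```sql
-- Write a couple of rows (exercises MinIO via S3A)
INSERT INTO iceberg.demo.test_table VALUES (1, 'a'), (2, 'b');

-- Read them back
SELECT COUNT(*) FROM iceberg.demo.test_table;

-- Iceberg metadata tables confirm that commits landed in object storage
SELECT snapshot_id, operation FROM iceberg.demo.test_table.snapshots;
```

A non-empty snapshots result shows that Iceberg committed manifest and data files through the catalog to MinIO, not just parsed the SQL.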

Intended Use

This repository is designed to support analytics and modeling projects, such as:

  • Demand forecasting
  • Feature generation pipelines
  • Large-scale exploratory analysis
  • Offline ML training

(Downstream modeling lives in separate repos by design.)

Cloud Architecture Mapping (Conceptual)

The local stack in this repository directly maps to managed cloud services. This table highlights how the same architectural concepts translate to AWS and GCP.

| Platform Component     | Local Implementation   | AWS Equivalent        | GCP Equivalent       |
| ---------------------- | ---------------------- | --------------------- | -------------------- |
| Distributed Compute    | Apache Spark (Docker)  | EMR / EMR on EKS      | Dataproc             |
| Object Storage         | MinIO (S3-compatible)  | Amazon S3             | Google Cloud Storage |
| Table Format           | Apache Iceberg         | Iceberg on S3         | Iceberg on GCS       |
| Metadata Catalog       | Iceberg REST Catalog   | Glue Catalog / REST   | Dataplex / REST      |
| SQL / Analytics Client | Jupyter (PySpark)      | EMR Studio / SageMaker | Vertex AI Workbench |
| Orchestration          | Docker Compose         | EKS / ECS             | GKE                  |
| Local Development      | Docker                 | LocalStack / EC2      | Cloud Workstations   |

Key takeaway: The architecture remains constant; only the managed service implementations change.
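In configuration terms, moving from MinIO to a managed object store is often little more than an endpoint change. A hedged sketch (the property names are real S3A settings; the hostname and values are placeholders):

```properties
# Local development: point S3A at MinIO
spark.hadoop.fs.s3a.endpoint             http://minio:9000
spark.hadoop.fs.s3a.path.style.access    true

# On AWS: remove the custom endpoint and path-style access, and let
# credentials come from the standard provider chain (IAM role, env vars).
```

Because compute only ever talks to an S3-compatible API, the Spark jobs themselves do not change between the local and cloud variants.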

V2 and V3 Roadmap (Planned Enhancements)

This repository is intentionally scoped to a local, reproducible analytics platform. The v2 and v3 iterations planned below focus on improving portability, scalability, and operational maturity while preserving the same architectural principles.

Planned areas of evolution:

  1. Spark Distribution Modernization (v2)
    • Migrate from Bitnami legacy images to official Apache Spark images, or to community-maintained Spark images with clearer upgrade paths
    • Standardize the Spark image build to decouple it from vendor-specific assumptions
  2. Cloud-Native Object Storage Support (v3)
    • Validate identical Iceberg workloads on AWS S3 and GCS
    • Maintain MinIO as a local development and testing backend
  3. Expanded Iceberg Capabilities (v3)
    • Partition evolution and hidden partitioning
    • Snapshot-based data quality checks
    • Time-travel-based reproducibility for analytics and ML experiments
  4. Platform Observability (v3)
    • Structured Spark event logging
    • Query-level metrics for cost and performance analysis
    • Lightweight monitoring suitable for analytics platforms (not ops-heavy)
  5. Multi-User & Governance Readiness
    • Namespace-level table isolation
    • Read/write role separation at the catalog layer
    • Dataset lifecycle conventions (raw → curated → consumption)

These enhancements reflect how a local analytics platform naturally evolves toward production, without changing its core design principles.
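As a taste of the time-travel work on the roadmap, Iceberg tables already support snapshot queries in Spark SQL (Spark 3.3+ syntax). The snapshot id and timestamp below are placeholders; real ids come from the table's snapshots metadata table:

```sql
-- Query the table as of a specific snapshot id (placeholder value)
SELECT * FROM iceberg.demo.test_table VERSION AS OF 1234567890123456789;

-- Or as of a point in time
SELECT * FROM iceberg.demo.test_table TIMESTAMP AS OF '2024-01-01 00:00:00';
```

Pinning analyses and ML experiments to a snapshot id is what makes them reproducible even as the underlying table keeps changing.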

Summary

This repo represents a serious, realistic data infrastructure setup — intentionally small enough to run locally, but architected the same way modern analytics platforms are built at scale.

It is meant to complement analytics and modeling work by showing how the data platform itself is designed and operated.

License

This project is licensed under the MIT License.

About

Containerized Spark cluster with Iceberg table format and MinIO S3 storage, designed for local experimentation and analytics workloads. Includes Docker Compose orchestration, version-pinned environments, spill configuration, and separation of compute and storage layers.
