Local Spark + Iceberg Data Platform (Analytics Infra Case Study)

This repository demonstrates a production-style analytics infrastructure built locally with Spark, Iceberg, and S3-compatible object storage. It shows how a data scientist or analytics engineer can design, configure, and operate a reproducible data platform that supports large-scale ETL, exploratory analysis, and forecasting workloads. Key themes include distributed compute, lakehouse architecture, performance tuning, and security-aware configuration, all intentionally scoped to be transparent, reproducible, and suitable for public sharing.

Note on Bitnami: The current implementation uses Spark 3.4.1 via Bitnami’s legacy images due to recent changes in Bitnami’s public image availability. This was a deliberate choice to preserve stability and reproducibility. A migration to non-Bitnami Spark images is planned for v2 of this repository.

What This Repo Is (and Is Not)

This repo is:

  • A realistic, end-to-end analytics infrastructure stack
  • Designed the way a data platform would be built for analysts and data scientists
  • Focused on correctness, scalability, and operational clarity

This repo is not:

  • A cloud-managed setup (no EMR / Databricks / BigQuery)
  • A toy “hello world” Spark example
  • A full production deployment (intentionally scoped to local + reproducible)

Architecture Overview

The platform uses a lakehouse-style architecture backed by object storage and a REST catalog.

                   ┌────────────────────┐
                   │   Jupyter Client   │
                   │   (PySpark / SQL)  │
                   └─────────┬──────────┘
                             │
                             ▼
                   ┌────────────────────┐
                   │    Spark Master    │
                   │  + Spark Workers   │
                   └─────────┬──────────┘
                             │
               ┌─────────────┴─────────────┐
               │                           │
               ▼                           ▼
      ┌──────────────────┐       ┌──────────────────┐
      │  Iceberg REST    │       │      MinIO       │
      │     Catalog      │       │  (S3-compatible) │
      └────────┬─────────┘       └────────┬─────────┘
               │                          │
               └────────────┬─────────────┘
                            ▼
                 Iceberg Tables (Lakehouse)


Core Technologies

  1. Apache Spark: distributed compute for large-scale ETL and analytics
  2. Apache Iceberg: transactional lakehouse tables with schema evolution and time travel
  3. MinIO: S3-compatible object storage used as the data lake
  4. Docker Compose: reproducible, local orchestration of all services
  5. Jupyter (client-only): clean separation between compute and interactive analysis

Why This Architecture

This stack mirrors real analytics platform patterns:

  1. Object storage as the source of truth
  2. Compute decoupled from storage
  3. Centralized catalog for table metadata
  4. Clients treated as ephemeral, stateless entry points

These are the same principles used in production systems — implemented locally and transparently.

Repository Structure:

.
├── jupyter-compose
│   ├── docker-compose.jupyter.yml
│   ├── Dockerfile
│   ├── notebooks
│   ├── requirements.txt
│   └── spark-defaults.conf
├── minio-compose
│   └── docker-compose.yml
├── README.md
├── spark-compose
│   ├── docker-compose.yml
│   └── spark-defaults.conf
└── worker-setup
    └── docker-compose.worker.yml


Key Design Decisions:

  1. Spark as a Shared Service

    • Spark runs independently of Jupyter.
    • Jupyter acts purely as a client, avoiding JVM conflicts and tight coupling.
  2. Iceberg via REST Catalog

    • Centralized metadata
    • No dependency on Hive Metastore
    • Matches modern lakehouse deployments
  3. Object Storage First

    • All data reads/writes go through S3A-compatible storage (MinIO).
    • This enforces stateless compute and reproducibility.
  4. Explicit Performance Tuning

    spark-defaults.conf includes:

    • Adaptive Query Execution (AQE)
    • Shuffle sizing
    • Broadcast thresholds
    • S3A upload and threading optimizations
    

These settings reflect real tuning tradeoffs, not defaults.
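As a concrete illustration, a spark-defaults.conf for this kind of stack typically combines catalog wiring with the tuning knobs listed above. The property names below are real Spark, Iceberg, and S3A settings, but the values and service hostnames (rest-catalog, minio, the warehouse bucket) are illustrative assumptions, not necessarily what this repo ships:

```properties
# Iceberg REST catalog wiring (hostnames and bucket are placeholders)
spark.sql.catalog.iceberg                      org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.iceberg.catalog-impl         org.apache.iceberg.rest.RESTCatalog
spark.sql.catalog.iceberg.uri                  http://rest-catalog:8181
spark.sql.catalog.iceberg.io-impl              org.apache.iceberg.aws.s3.S3FileIO
spark.sql.catalog.iceberg.warehouse            s3://warehouse/

# Adaptive Query Execution and shuffle/broadcast tuning (values illustrative)
spark.sql.adaptive.enabled                     true
spark.sql.adaptive.coalescePartitions.enabled  true
spark.sql.shuffle.partitions                   200
spark.sql.autoBroadcastJoinThreshold           64m

# S3A upload and threading
spark.hadoop.fs.s3a.fast.upload                true
spark.hadoop.fs.s3a.threads.max                32
spark.hadoop.fs.s3a.connection.maximum         64
spark.hadoop.fs.s3a.path.style.access          true
```

Tune shuffle partitioning and broadcast thresholds against your actual data volumes; the numbers above are starting points, not recommendations.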

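The shuffle-sizing tradeoff mentioned above usually comes down to a simple capacity calculation. A minimal sketch (pure Python; the 128 MB target and 50 GB workload are illustrative assumptions) of the arithmetic behind choosing spark.sql.shuffle.partitions:

```python
# Back-of-envelope sizing for spark.sql.shuffle.partitions:
# aim for roughly 128 MB of shuffled data per partition.

def shuffle_partitions(dataset_bytes: int,
                       target_partition_bytes: int = 128 * 1024**2) -> int:
    """Ceiling-divide the shuffled data size by the target partition size."""
    return max(1, -(-dataset_bytes // target_partition_bytes))

# A 50 GB shuffle stage at ~128 MB per partition -> 400 partitions.
print(shuffle_partitions(50 * 1024**3))
```

Too few partitions risks spill and straggler tasks; too many adds scheduling overhead, which is why the value is set explicitly rather than left at the default.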
Security & Public-Repo Hygiene

This repo is intentionally safe for public release:

✅ No credentials committed
✅ All secrets sourced from environment variables
✅ .env.example provided for local use
✅ No internal IPs or machine-specific paths
✅ No proprietary data or schemas

This mirrors how infra code should be shared across teams or orgs.
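A sketch of what the committed .env.example might look like. The variable names here are hypothetical placeholders, not necessarily the ones this repo uses; match them to the compose files:

```ini
# .env.example (variable names are illustrative; align them with the compose files)
MINIO_ROOT_USER=changeme
MINIO_ROOT_PASSWORD=changeme
AWS_ACCESS_KEY_ID=changeme
AWS_SECRET_ACCESS_KEY=changeme
AWS_REGION=us-east-1
```

The real .env stays untracked; only the example template with dummy values is committed.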

How to Run (Local)

Prerequisites

  • Docker
  • Docker Compose
  • ~32 GB RAM recommended for meaningful Spark workloads

Start Everything

cp .env.example .env
chmod +x start_c.sh
./start_c.sh

Access Points

Service UIs (the Spark master UI, Jupyter, and the MinIO console) are exposed on the host ports defined in the respective Docker Compose files; check each compose file for the exact port mappings.

How to Validate the Platform

From a Jupyter notebook or spark-shell:

spark.sql("SHOW NAMESPACES IN iceberg").show()
spark.sql("""
CREATE TABLE iceberg.demo.test_table (
  id INT,
  value STRING
) USING iceberg
""")
spark.sql("SELECT * FROM iceberg.demo.test_table").show()

Successful execution confirms:

  • Spark ↔ Iceberg integration
  • Iceberg ↔ MinIO storage
  • End-to-end lakehouse functionality
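For a fuller smoke test that also exercises the S3A write path, insert rows and inspect Iceberg's metadata tables. This is a sketch assuming the catalog name iceberg and the demo table from the snippet above:

```sql
-- Write a couple of rows (exercises MinIO via S3A)
INSERT INTO iceberg.demo.test_table VALUES (1, 'a'), (2, 'b');

-- Read them back
SELECT COUNT(*) FROM iceberg.demo.test_table;

-- Iceberg metadata tables confirm that commits landed in object storage
SELECT snapshot_id, operation FROM iceberg.demo.test_table.snapshots;
```

A non-empty snapshots result shows that Iceberg committed manifest and data files through the catalog to MinIO, not just parsed the SQL.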

Intended Use

This repository is designed to support analytics and modeling projects, such as:

  • Demand forecasting
  • Feature generation pipelines
  • Large-scale exploratory analysis
  • Offline ML training

(Downstream modeling lives in separate repos by design.)

Cloud Architecture Mapping (Conceptual)

The local stack in this repository directly maps to managed cloud services. This table highlights how the same architectural concepts translate to AWS and GCP.

| Platform Component     | Local Implementation   | AWS Equivalent        | GCP Equivalent       |
| ---------------------- | ---------------------- | --------------------- | -------------------- |
| Distributed Compute    | Apache Spark (Docker)  | EMR / EMR on EKS      | Dataproc             |
| Object Storage         | MinIO (S3-compatible)  | Amazon S3             | Google Cloud Storage |
| Table Format           | Apache Iceberg         | Iceberg on S3         | Iceberg on GCS       |
| Metadata Catalog       | Iceberg REST Catalog   | Glue Catalog / REST   | Dataplex / REST      |
| SQL / Analytics Client | Jupyter (PySpark)      | EMR Studio / SageMaker | Vertex AI Workbench |
| Orchestration          | Docker Compose         | EKS / ECS             | GKE                  |
| Local Development      | Docker                 | LocalStack / EC2      | Cloud Workstations   |

Key takeaway: The architecture remains constant; only the managed service implementations change.
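In configuration terms, moving from MinIO to a managed object store is often little more than an endpoint change. A hedged sketch (the property names are real S3A settings; the hostname and values are placeholders):

```properties
# Local development: point S3A at MinIO
spark.hadoop.fs.s3a.endpoint             http://minio:9000
spark.hadoop.fs.s3a.path.style.access    true

# On AWS: remove the custom endpoint and path-style access, and let
# credentials come from the standard provider chain (IAM role, env vars).
```

Because compute only ever talks to an S3-compatible API, the Spark jobs themselves do not change between the local and cloud variants.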

V2 and V3 Roadmap (Planned Enhancements)

This repository is intentionally scoped to a local, reproducible analytics platform. The v2 and v3 iterations planned below focus on improving portability, scalability, and operational maturity while preserving the same architectural principles.

Planned areas of evolution:

  1. Spark Distribution Modernization (v2)
    • Migrate from Bitnami legacy images to official Apache Spark images, or to community-maintained Spark images with clearer upgrade paths
    • Standardize the Spark image build to decouple it from vendor-specific assumptions
  2. Cloud-Native Object Storage Support (v3)
    • Validate identical Iceberg workloads on AWS S3 and GCS
    • Maintain MinIO as a local development and testing backend
  3. Expanded Iceberg Capabilities (v3)
    • Partition evolution and hidden partitioning
    • Snapshot-based data quality checks
    • Time-travel-based reproducibility for analytics and ML experiments
  4. Platform Observability (v3)
    • Structured Spark event logging
    • Query-level metrics for cost and performance analysis
    • Lightweight monitoring suitable for analytics platforms (not ops-heavy)
  5. Multi-User & Governance Readiness
    • Namespace-level table isolation
    • Read/write role separation at the catalog layer
    • Dataset lifecycle conventions (raw → curated → consumption)

These enhancements reflect how a local analytics platform naturally evolves toward production, without changing its core design principles.
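As a taste of the time-travel work on the roadmap, Iceberg tables already support snapshot queries in Spark SQL (Spark 3.3+ syntax). The snapshot id and timestamp below are placeholders; real ids come from the table's snapshots metadata table:

```sql
-- Query the table as of a specific snapshot id (placeholder value)
SELECT * FROM iceberg.demo.test_table VERSION AS OF 1234567890123456789;

-- Or as of a point in time
SELECT * FROM iceberg.demo.test_table TIMESTAMP AS OF '2024-01-01 00:00:00';
```

Pinning analyses and ML experiments to a snapshot id is what makes them reproducible even as the underlying table keeps changing.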

Summary

This repo represents a serious, realistic data infrastructure setup — intentionally small enough to run locally, but architected the same way modern analytics platforms are built at scale.

It is meant to complement analytics and modeling work by showing how the data platform itself is designed and operated.

License

This project is licensed under the MIT License.

About

Containerized Spark cluster with Iceberg table format and MinIO S3 storage, designed for local experimentation and analytics workloads. Includes Docker Compose orchestration, version-pinned environments, spill configuration, and separation of compute and storage layers.
