# Local Spark + Iceberg Analytics Platform
- This repository demonstrates a production-style analytics infrastructure built locally using Spark, Iceberg, and S3-compatible object storage.
- The project showcases how a data scientist or analytics engineer can design, configure, and operate a reproducible data platform that supports large-scale ETL, exploratory analysis, and forecasting workloads.
- Key themes include distributed compute, lakehouse architecture, performance tuning, and security-aware configuration — intentionally scoped to be transparent, reproducible, and suitable for public sharing.
Note on Bitnami: The current implementation uses Spark 3.4.1 via Bitnami’s legacy images due to recent changes in Bitnami’s public image availability. This was a deliberate choice to preserve stability and reproducibility. A migration to non-Bitnami Spark images is planned for v2 of this repository.
This repo is:
- A realistic, end-to-end analytics infrastructure stack
- Designed the way a data platform would be built for analysts and data scientists
- Focused on correctness, scalability, and operational clarity
This repo is not:
- A cloud-managed setup (no EMR / Databricks / BigQuery)
- A toy “hello world” Spark example
- A full production deployment (intentionally scoped to local + reproducible)
The platform uses a lakehouse-style architecture backed by object storage and a REST catalog.
```
        ┌────────────────────┐
        │   Jupyter Client   │
        │  (PySpark / SQL)   │
        └─────────┬──────────┘
                  │
                  ▼
        ┌────────────────────┐
        │    Spark Master    │
        │  + Spark Workers   │
        └─────────┬──────────┘
                  │
     ┌────────────┴────────────┐
     │                         │
     ▼                         ▼
┌──────────────────┐   ┌──────────────────┐
│   Iceberg REST   │   │      MinIO       │
│     Catalog      │   │ (S3-compatible)  │
└──────────────────┘   └──────────────────┘
                               │
                               ▼
                 Iceberg Tables (Lakehouse)
```
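As a sketch of how these pieces are wired together, a client-side `spark-defaults.conf` could look like the following. The hostnames, port numbers, and warehouse bucket are illustrative assumptions, not necessarily this repo's actual values; the property keys themselves are standard Spark, Iceberg, and S3A configuration.

```
# Point the client at the shared Spark cluster (hostname is illustrative)
spark.master                              spark://spark-master:7077

# Register an Iceberg catalog named "iceberg" backed by the REST catalog service
spark.sql.extensions                      org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.iceberg                 org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.iceberg.type            rest
spark.sql.catalog.iceberg.uri             http://iceberg-rest:8181
spark.sql.catalog.iceberg.warehouse       s3a://warehouse/

# Route S3A traffic to MinIO instead of AWS S3
spark.hadoop.fs.s3a.endpoint              http://minio:9000
spark.hadoop.fs.s3a.path.style.access     true

# Credentials: in this repo's pattern they come from environment variables,
# never from this file; the relevant keys are named here for reference only.
# spark.hadoop.fs.s3a.access.key   (set via environment)
# spark.hadoop.fs.s3a.secret.key   (set via environment)
```

Because every client reads the same catalog URI and storage endpoint, any notebook or shell session sees the same tables without local state.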
| Component | Role |
|---|---|
| Apache Spark | Distributed compute for large-scale ETL and analytics |
| Apache Iceberg | Transactional lakehouse tables with schema evolution and time travel |
| MinIO | S3-compatible object storage used as the data lake |
| Docker Compose | Reproducible, local orchestration of all services |
| Jupyter (client-only) | Clean separation between compute and interactive analysis |
This stack mirrors real analytics platform patterns:
- Object storage as the source of truth
- Compute decoupled from storage
- Centralized catalog for table metadata
- Clients treated as ephemeral, stateless entry points
These are the same principles used in production systems — implemented locally and transparently.
```
.
├── jupyter-compose
│   ├── docker-compose.jupyter.yml
│   ├── Dockerfile
│   ├── notebooks
│   ├── requirements.txt
│   └── spark-defaults.conf
├── minio-compose
│   └── docker-compose.yml
├── README.md
├── spark-compose
│   ├── docker-compose.yml
│   └── spark-defaults.conf
└── worker-setup
    └── docker-compose.worker.yml
```
### Spark as a Shared Service
- Spark runs independently of Jupyter.
- Jupyter acts purely as a client, avoiding JVM conflicts and tight coupling.

### Iceberg via REST Catalog
- Centralized metadata
- No dependency on Hive Metastore
- Matches modern lakehouse deployments

### Object Storage First
- All data reads and writes go through S3A-compatible storage (MinIO).
- This enforces stateless compute and reproducibility.

### Explicit Performance Tuning
`spark-defaults.conf` includes:
- Adaptive Query Execution (AQE)
- Shuffle sizing
- Broadcast thresholds
- S3A upload and threading optimizations

These settings reflect real tuning tradeoffs, not defaults.
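For illustration, the kinds of settings involved look like this. The values shown are example starting points, not the repo's actual tuning; the keys are standard Spark and Hadoop S3A properties.

```
# Adaptive Query Execution: re-optimize plans at runtime using shuffle statistics
spark.sql.adaptive.enabled                       true
spark.sql.adaptive.coalescePartitions.enabled    true

# Shuffle sizing: baseline partition count before AQE coalesces small partitions
spark.sql.shuffle.partitions                     200

# Broadcast threshold: tables below this size are broadcast to all executors
spark.sql.autoBroadcastJoinThreshold             64m

# S3A upload and threading: larger multipart chunks, more parallel connections
spark.hadoop.fs.s3a.fast.upload                  true
spark.hadoop.fs.s3a.multipart.size               64M
spark.hadoop.fs.s3a.threads.max                  32
spark.hadoop.fs.s3a.connection.maximum           64
```

The tradeoff to watch: a higher broadcast threshold avoids shuffles on small joins but raises driver and executor memory pressure, and more S3A threads increase throughput at the cost of connection overhead against MinIO.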
This repo is intentionally safe for public release:
- ✅ No credentials committed
- ✅ All secrets sourced from environment variables
- ✅ `.env.example` provided for local use
- ✅ No internal IPs or machine-specific paths
- ✅ No proprietary data or schemas
This mirrors how infra code should be shared across teams or orgs.
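A minimal `.env.example` following this pattern might look like the sketch below. `MINIO_ROOT_USER` and `MINIO_ROOT_PASSWORD` are MinIO's standard credential variables, and `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` are the standard variables S3A's environment credentials provider reads; the exact set this repo uses may differ.

```
# MinIO root credentials (standard MinIO environment variables)
MINIO_ROOT_USER=changeme
MINIO_ROOT_PASSWORD=changeme

# Keys picked up by Spark's S3A client via the environment credentials provider
AWS_ACCESS_KEY_ID=changeme
AWS_SECRET_ACCESS_KEY=changeme
```

The committed file ships placeholders only; each user copies it to `.env` and fills in real values locally.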
## Prerequisites

- Docker
- Docker Compose
- ~32 GB RAM recommended for meaningful Spark workloads
## Start Everything

```bash
cp .env.example .env
chmod +x start_c.sh
./start_c.sh
```
## Access Points
- Spark UI: http://localhost:8080
- Jupyter: http://localhost:8888
- MinIO Console: http://localhost:9001
- Iceberg REST: http://localhost:8181
From a Jupyter notebook or spark-shell:

```python
spark.sql("SHOW NAMESPACES IN iceberg").show()

spark.sql("""
    CREATE TABLE iceberg.demo.test_table (
        id INT,
        value STRING
    ) USING iceberg
""")

spark.sql("SELECT * FROM iceberg.demo.test_table").show()
```
Successful execution confirms:
- Spark ↔ Iceberg integration
- Iceberg ↔ MinIO storage
- End-to-end lakehouse functionality
This repository is designed to support analytics and modeling projects, such as:
- Demand forecasting
- Feature generation pipelines
- Large-scale exploratory analysis
- Offline ML training
(Downstream modeling lives in separate repos by design.)
The local stack in this repository directly maps to managed cloud services. This table highlights how the same architectural concepts translate to AWS and GCP.
| Platform Component | Local Implementation | AWS Equivalent | GCP Equivalent |
|---|---|---|---|
| Distributed Compute | Apache Spark (Docker) | EMR / EMR on EKS | Dataproc |
| Object Storage | MinIO (S3-compatible) | Amazon S3 | Google Cloud Storage |
| Table Format | Apache Iceberg | Iceberg on S3 | Iceberg on GCS |
| Metadata Catalog | Iceberg REST Catalog | Glue Catalog / REST | Dataplex / REST |
| SQL / Analytics Client | Jupyter (PySpark) | EMR Studio / SageMaker | Vertex AI Workbench |
| Orchestration | Docker Compose | EKS / ECS | GKE |
| Local Development | Docker | LocalStack / EC2 | Cloud Workstations |
Key takeaway: The architecture remains constant; only the managed service implementations change.
This repository is intentionally scoped to a local, reproducible analytics platform. A future v2 iteration would focus on improving portability, scalability, and operational maturity while preserving the same architectural principles.
Planned areas of evolution:

- Spark distribution modernization (v2)
  - Migrate from Bitnami legacy images to:
    - Official Apache Spark images, or
    - Community-maintained Spark images with clearer upgrade paths
  - Standardize the Spark image build to decouple it from vendor-specific assumptions
- Cloud-native object storage support (v3)
  - Validate identical Iceberg workloads on AWS S3 and GCS
  - Maintain MinIO as a local development and testing backend
- Expanded Iceberg capabilities (v3)
  - Partition evolution and hidden partitioning
  - Snapshot-based data quality checks
  - Time-travel-based reproducibility for analytics and ML experiments
- Platform observability (v3)
  - Structured Spark event logging
  - Query-level metrics for cost and performance analysis
  - Lightweight monitoring suited to analytics platforms (not ops-heavy)
- Multi-user & governance readiness
  - Namespace-level table isolation
  - Read/write role separation at the catalog layer
  - Dataset lifecycle conventions (raw → curated → consumption)
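Of the observability items above, structured Spark event logging is mostly configuration. A sketch, using standard Spark properties (the log bucket path is illustrative):

```
# Persist structured Spark event logs to object storage for later analysis
spark.eventLog.enabled            true
spark.eventLog.dir                s3a://spark-logs/events

# A Spark History Server pointed at the same path can replay finished jobs
spark.history.fs.logDirectory     s3a://spark-logs/events
```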
These enhancements reflect how a local analytics platform naturally evolves toward production, without changing its core design principles.
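The raw → curated → consumption lifecycle convention can be made concrete with a small naming helper. This is a sketch only; the catalog, layer, and dataset names are hypothetical and not part of this repository.

```python
# Sketch: dataset lifecycle naming convention (raw -> curated -> consumption).
# All identifiers here are hypothetical examples.
LAYERS = ("raw", "curated", "consumption")

def table_identifier(catalog: str, layer: str, dataset: str) -> str:
    """Build a fully qualified Iceberg table identifier, e.g. iceberg.raw.orders."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer {layer!r}; expected one of {LAYERS}")
    return f"{catalog}.{layer}.{dataset}"

print(table_identifier("iceberg", "raw", "orders"))      # iceberg.raw.orders
print(table_identifier("iceberg", "curated", "orders"))  # iceberg.curated.orders
```

Encoding the lifecycle as Iceberg namespaces keeps promotion between layers an explicit, queryable step rather than an ad hoc file move.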
This repo represents a serious, realistic data infrastructure setup — intentionally small enough to run locally, but architected the same way modern analytics platforms are built at scale.
It is meant to complement analytics and modeling work by showing how the data platform itself is designed and operated.
This project is licensed under the MIT License.