{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreicczm4vfvtgsanmn2zwdqdtl7mzoq2hrv2c3sljvtadoad5qdhcvi",
"uri": "at://did:plc:25rdn5elo5izoxrmtis34zuk/app.bsky.feed.post/3mpfex4ser4l2"
},
"coverImage": {
"$type": "blob",
"ref": {
"$link": "bafkreid25czzfooxdr56smepx7d6cz6ppr7otzcb63prjvnjvnwwt62uky"
},
"mimeType": "image/webp",
"size": 473162
},
"path": "/cybermeshsec/how-to-build-a-modern-on-premise-data-lakehouse-without-vendor-lock-in-5440",
"publishedAt": "2026-06-29T01:20:06.000Z",
"site": "https://dev.to",
"tags": [
"dataengineering",
"architecture",
"opensource",
"devops"
],
"textContent": "Building a modern analytical platform usually rhymes with one thing: migrating to the cloud.\n\nBut what happens when operational realities, budget constraints, or strict compliance requirements demand a 100% on-premise solution?\n\nUntil recently, maintaining a local Data Lake meant choosing between two painful paths:\n\nWrestling with heavy, outdated Hadoop clusters.\n\nPaying exorbitant licensing fees for proprietary software.\n\nToday, the game has completely changed.\n\nThe open-source ecosystem has evolved to a point where you can build a robust, high-performance, and entirely decoupled Data Lakehouse running on bare metal servers.\n\nNo proprietary tech. No exorbitant fees. And absolutely no vendor lock-in.\n\nIn this article, I will break down the architecture, the technology stack, and the governance model we used to build a cutting-edge, in-house Data Lakehouse.\n\nThe Stack: Building an \"Open-Source Databricks\"\nTo ensure modularity and scalability, we applied the fundamental principle of modern data architecture: the separation of compute and storage.\n\nOur stack was assembled using the absolute best the current open-source ecosystem has to offer:\n\nStorage: MinIO\nWe replaced HDFS and AWS S3 with MinIO. It offers full compatibility with the S3 API but runs locally. Installed on NVMe drives with an XFS file system, it guarantees the incredibly high I/O throughput required for heavy analytical workloads.\n\nTable Format: Apache Iceberg\nIceberg is the beating heart of our Lakehouse. It brings relational database features directly to the Data Lake, such as ACID transactions, painless schema evolution, and Time Travel (allowing us to query the exact state of the data at a specific point in the past).\n\nMetadata Catalog: Project Nessie\nNessie acts as our Iceberg catalog. Its killer feature is providing a Git-like experience for data—allowing for branches, commits, and tags right at the data level.\n\nSQL Engine: Trino\nFor the compute layer, we utilize Trino (formerly PrestoSQL). It is a lightning-fast, distributed query engine. Beyond querying our Lakehouse, Trino features query federation connectors. This allows us to join Lakehouse data directly with operational databases (SQL Server, MySQL) in a single query, avoiding unnecessary data duplication.\n\nIngestion & Transformation: dlt and dbt Core\nTo extract data from our sources, we use the Python library dlt (data load tool), which makes building resilient pipelines a breeze. For transformation and modeling, we rely on the industry-standard dbt Core, turning SQL code into a software engineering pipeline complete with version control, testing, and documentation.\n\nInfrastructure: Bare Metal vs. Containers\nTo extract maximum performance exactly where it matters, we divided our infrastructure across two distinct servers:\n\n 1. The Core Server (Bare Metal):\nThis is where the heavy lifting happens. Services like MinIO, Nessie, and Trino run natively on Ubuntu Server 24.04, managed via systemd. We deliberately chose not to use Docker in this layer to avoid any virtualized network overhead or disk I/O bottlenecks. We ensure the SQL engine and storage communicate at the absolute maximum speed the hardware allows.\n\n 2. The Support Server (Docker):\nOn a secondary server, we run our entire support ecosystem containerized via Docker. This is home to tools like:\n\n\n\n\nDagster: Our data pipeline orchestrator.\n\nGrafana & Prometheus: For monitoring server health metrics.\n\nCloudBeaver: Our web interface for SQL exploration.\n\nThe Medallion Architecture and Decentralized Governance\nThe biggest challenge of a Data Lakehouse isn't technological; it's organizational.\n\nTo prevent our Lakehouse from turning into a \"data swamp\"—and to stop the IT team from becoming a bottleneck for report requests—we adopted the Medallion Architecture with strictly divided responsibilities:\n\nBronze Layer (Raw): Exact replicas of the source tables, with no business rules applied. Managed by IT using incremental polling via dlt.\n\nSilver Layer (Cleaned): Also managed by IT. Here, data is cast to correct types, deduplicated, and standardized using dbt.\n\nGold Layer (Business): This is where the magic of decentralized governance happens. The modeling (Star Schema, aggregations, KPIs) is the exclusive responsibility of the QA and BI teams.\n\nHow does this workflow operate in practice?\nWhen the BI area needs a new metric, they don't open a ticket begging Data Engineering for help. They develop the SQL model themselves in dbt, test it locally against Trino, and open a Pull Request in Git.\n\nIT simply reviews the SQL code quality, approves the PR, and triggers the deploy. From that moment on, Dagster takes over and automatically orchestrates the schedule for that new table.\n\nThe Future: Evolving from Batches to Real-Time (CDC)\nCurrently, our architecture operates on V1, based on scheduled incremental polling. This delivers a latency that is perfectly acceptable for our current business needs.\n\nHowever, our architectural roadmap is already prepped for V2. We will replace scheduled extractions with CDC (Change Data Capture) using Debezium and Kafka to read transaction logs directly from our operational databases. Data will land in Iceberg in a matter of seconds.\n\nThe beauty of this modern stack is that by swapping the ingestion engine at the Bronze layer, all our dbt models in the Silver and Gold layers will continue to function seamlessly, without us having to rewrite a single line of transformation code.\n\nWhat’s Next?\nBuilding this data infrastructure solved our analytical scalability issues, but it also opened the doors to new initiatives—especially the intensive use of Artificial Intelligence to generate internal applications and mini-systems integrated with this data.\n\nWith systems being rapidly generated by AI, a new (and critical) challenge emerged: How do we guarantee that an AI-generated portal accesses the Lakehouse securely?\n\nIn future posts, I will dive deep into how we solved this exact problem, covering Security, IAM (Identity and Access Management), API Gateways, and Row-Level Security in Trino.\n\n🔔 Follow me here to stay updated on future articles! Drop a comment below: what are your thoughts on this architecture, and how do you handle the On-Premise vs. Cloud dilemma in your own projects?",
"title": "How to Build a Modern On-Premise Data Lakehouse (Without Vendor Lock-in)"
}