Datahive is the open data lakehouse for organizations that need to unify fragmented systems, govern every byte, and activate data for analytics and AI — without giving up sovereignty. Built on Apache Iceberg, StarRocks, and Project Nessie. Deployed on your infrastructure. Supported by a team on the ground in Jakarta.
Modern organizations run on dozens of systems — CRM, ERP, SIS, LMS, APIs, spreadsheets, sensors. Without a unified foundation, fragmentation compounds silently into operational debt: reconciliation work, duplicated pipelines, delayed decisions, and stalled AI initiatives. We've seen the same patterns in every engagement.
Datahive is not a black box. Every component is open-source, battle-tested at hyperscale, and replaceable. Below are the four foundational technologies — what they do, why they matter, and how they fit together to give you a lakehouse that performs like a warehouse and stays open like a data lake.
Source systems push events through Kafka and Debezium CDC. Airbyte connectors handle batch extracts. Everything is written as Parquet files into MinIO, your sovereign S3-compatible storage layer.
Apache Iceberg organizes those Parquet files into proper tables with ACID semantics, schema evolution, and time-travel. Project Nessie versions the catalog metadata — branches, tags, commits over your entire data estate.
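To make the branch-and-commit idea concrete, here is a minimal Python sketch of a versioned catalog. It is a toy model, not Nessie's API: the `ToyCatalog` class, its methods, and the table and snapshot names are hypothetical. It only illustrates how a branch can point at its own table snapshots, isolated from `main`, until it is merged.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    """One immutable catalog state: table name -> snapshot id."""
    tables: dict
    message: str
    parent: "Commit | None" = None


class ToyCatalog:
    """Hypothetical in-memory catalog with Git-like branches (not the Nessie API)."""

    def __init__(self):
        self.branches = {"main": Commit(tables={}, message="init")}

    def create_branch(self, name, from_branch="main"):
        # A new branch is just another pointer to an existing commit.
        self.branches[name] = self.branches[from_branch]

    def commit(self, branch, table, snapshot_id, message):
        head = self.branches[branch]
        tables = {**head.tables, table: snapshot_id}
        self.branches[branch] = Commit(tables, message, parent=head)

    def merge(self, source, target="main"):
        # Fast-forward only: the target head must be an ancestor of the source.
        node = self.branches[source]
        while node is not None and node is not self.branches[target]:
            node = node.parent
        if node is None:
            raise ValueError("target has diverged; this toy model only fast-forwards")
        self.branches[target] = self.branches[source]

    def snapshot(self, branch, table):
        return self.branches[branch].tables.get(table)


catalog = ToyCatalog()
catalog.commit("main", "students", "snap-1", "initial load")
catalog.create_branch("etl-test")
catalog.commit("etl-test", "students", "snap-2", "backfill on a branch")

isolated = catalog.snapshot("main", "students")     # main is untouched by the branch
catalog.merge("etl-test")
after_merge = catalog.snapshot("main", "students")  # main now points at the branch head
```

The design point is that a branch costs only a pointer, so experimental loads can run against production-sized tables without copying data.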
StarRocks reads Iceberg tables directly from object storage with sub-second latency. BI tools, AI agents, and operational APIs all hit the same governed substrate through MySQL-compatible SQL or REST.
Six essential capabilities that turn fragmented data into a unified, governed, analytics-ready foundation. Each is productionized, each is open, each integrates with what you already run.
Ingest structured, semi-structured, and unstructured data from any source — CRM, ERP, databases, SaaS APIs, flat files, telemetry — into a single governed substrate. Schema-on-read where it makes sense, schema-on-write where it matters.
Ingest, enrich, and serve events in sub-second timeframes via Kafka, Apache Flink, and RisingWave. Exactly-once semantics; backpressure-aware; compatible with batch dbt models for unified streaming-batch pipelines.
Standardize, enrich, and structure data via dbt with full model lineage; SQLMesh for blue-green semantic versioning; a Cube-based semantic layer that defines metrics once and serves them consistently to every downstream tool.
End-to-end policy control, column-level lineage via OpenMetadata, full audit trail, and row-level security tied to your identity provider via SSO/OIDC and SCIM. UU PDP and ISO 27001 aligned out of the box.
Sub-second interactive analytics through StarRocks and Trino; semantic modeling via Cube; native integration with Metabase, Superset, and your existing BI tooling. Reverse-ETL pushes governed data back into operational systems.
Vector search via pgvector and Milvus; feature store via Feast (offline + online); LLM orchestration via LangGraph and Dify. Your governed data becomes retrieval-ready, fine-tuning-ready, and agent-ready — without a second platform.
Datahive's architecture is intentionally layered: each layer talks to its neighbors through stable, open interfaces. You can replace any single layer (a different query engine, a different orchestrator, a different BI tool) without re-architecting the rest. This is what "no lock-in" actually means in practice.
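One way to picture a stable interface between layers is the Python sketch below. The `QueryEngine` protocol and the two stand-in engine classes are hypothetical and do not talk to StarRocks or Trino; they only show that a caller written against the interface keeps working when the engine behind it is swapped.

```python
from typing import Protocol


class QueryEngine(Protocol):
    """Hypothetical stable interface between the serving layer and any engine."""

    def run(self, sql: str) -> list: ...


class FakeStarRocks:
    # Stand-in for a real engine client; returns a tagged echo of the query.
    def run(self, sql: str) -> list:
        return [("starrocks", sql)]


class FakeTrino:
    def run(self, sql: str) -> list:
        return [("trino", sql)]


def dashboard_rows(engine: QueryEngine) -> list:
    # The caller depends only on the interface, never on a concrete engine.
    return engine.run("SELECT count(*) FROM students")


rows_a = dashboard_rows(FakeStarRocks())
rows_b = dashboard_rows(FakeTrino())
```

Both calls issue the same SQL; only the object passed in changes.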
Each engine is a standalone, production-grade asset — self-hostable, auditable, and independently scalable. Together, they compose into a single institutional intelligence stack.
Every component is swappable. No proprietary formats, no vendor lock-in. The stack below is the foundation we deploy — but each layer can be replaced independently as your needs evolve.
Purpose-built reference deployments across Indonesian higher education, government, and enterprise. Each built and operated under UU PDP, on infrastructure the customer owns.
Consolidated 150K+ student records across SIS, LMS, admissions, and payment systems into a single governed foundation. Executive dashboards now refresh in real time; OCR-powered document workflows eliminated 18 hours per week of manual reconciliation.
Non-IoT operations platform for Jakarta's environmental agency. Seven actor roles, six integrated modules, and direct integration with SILIKA, Bank Sampah Portal, and BGMIOTA — all under a single governed data plane.
Five-layer AI-native CRM platform for Indonesian SMBs. TwentyCRM, Dify, and Evolution API unified through Datahive — WhatsApp-first channel, IDR-native billing via Xendit, fully UU PDP compliant from day one.
Datahive is designed for the compliance environment Indonesian institutions operate in. On-premise deployment keeps sensitive data within your jurisdiction; governance is built in, not bolted on.
Single-binary CLI, declarative configuration, GitOps-native. Datahive runs on your infrastructure, inside your firewall, under your keys — with the same deployment ergonomics you expect from modern developer tools.
Every pipeline, every model, every access policy is defined in version-controlled YAML. Deploy the same configuration across dev, staging, and production with a single command — and roll it back with another.
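The deploy-and-rollback model can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Datahive CLI: the `Environment` class, the config keys, and the pipeline name are hypothetical, and plain dictionaries stand in for the YAML files.

```python
import copy


class Environment:
    """Hypothetical deploy target that remembers every applied revision."""

    def __init__(self, name):
        self.name = name
        self.history = []  # applied configs, oldest first

    def deploy(self, config):
        # Store a deep copy so later edits to the caller's dict cannot alter history.
        self.history.append(copy.deepcopy(config))

    def rollback(self):
        if len(self.history) < 2:
            raise RuntimeError("no earlier revision to roll back to")
        self.history.pop()

    @property
    def active(self):
        return self.history[-1] if self.history else None


v1 = {"pipeline": "students_daily", "schedule": "0 2 * * *"}
v2 = {"pipeline": "students_daily", "schedule": "*/15 * * * *"}

envs = [Environment(n) for n in ("dev", "staging", "production")]
for env in envs:
    env.deploy(v1)
    env.deploy(v2)

envs[2].rollback()  # production returns to v1; dev and staging keep v2
```

Because every revision is an immutable record, a rollback is just a pointer move to the previous one rather than a hand-written undo script.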
One datahive command covers init, deploy, migrate, query, and audit. No external dependencies, no runtime installation.
Datahive is sold as an outcome, not a SKU. Every engagement is scoped to your data estate, regulatory profile, and operational scale, and priced in IDR with predictable annual commitments. Below are the three tiers our customers typically map into; the right fit takes a 30-minute conversation.
A production-ready open lakehouse for organizations getting serious about unified data. Designed for teams who want enterprise architecture without enterprise overhead.
The full platform with all five intelligence engines, dedicated infrastructure, and a named solutions team. The reference deployment for higher-ed, government agencies, and mid-market enterprises.
Fully air-gapped deployment on hardware you own, in datacenters you operate. Designed for the regulatory environment of financial services, healthcare, defense, and central government.
Everything we're asked in the first sales call — compressed. If you need depth on any of these, our solutions team is one email away.
Yes, fully on-premise. Once deployed, the runtime does not require any outbound connectivity to our infrastructure to operate. Telemetry is opt-in and aggregated; source access means you can audit every outbound call yourself with tcpdump or your own SIEM. Air-gapped installation is a first-class deployment mode on Sovereign plans.
Datahive is built on the same open standards Databricks embraces (Iceberg, Parquet, Spark), but deploys to your infrastructure with no proprietary runtime. Snowflake is a managed SaaS warehouse; Datahive is an on-premise open lakehouse designed for data residency, regulatory compliance, and cost predictability in the Indonesian market. Datahive and cloud warehouses are not mutually exclusive: several customers use Datahive as the sovereign tier alongside cloud warehouses for non-sensitive workloads, with the same Iceberg tables federated across both.
Iceberg is the only major table format with a truly open catalog specification (REST + multi-engine), no historical Spark coupling, and a metadata architecture that scales to billions of files via manifest pruning. Delta Lake originated at Databricks and has the deepest Spark integration, but its catalog ecosystem is narrower. Hudi optimizes for streaming upsert workloads but sacrifices schema flexibility. For an open lakehouse where you'll mix StarRocks, Trino, Flink, and Spark, Iceberg is the lowest-risk choice — and it's converging with the others on features.
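Manifest pruning is easier to see in code. Iceberg manifests record per-file lower and upper bounds for columns, so a planner can skip files that cannot match a filter. The Python sketch below is a simplified model of that idea, not Iceberg's implementation: the `DataFileEntry` class, the file names, and the `enrollment_year` column are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DataFileEntry:
    """Simplified manifest entry: a file path plus min/max stats for one column."""
    path: str
    lower: int
    upper: int


def prune(entries, lo, hi):
    """Keep only files whose [lower, upper] range can overlap the query range [lo, hi]."""
    return [e for e in entries if e.upper >= lo and e.lower <= hi]


manifest = [
    DataFileEntry("part-0.parquet", 2019, 2020),
    DataFileEntry("part-1.parquet", 2021, 2022),
    DataFileEntry("part-2.parquet", 2023, 2024),
]

# Query: WHERE enrollment_year BETWEEN 2022 AND 2023
to_scan = [e.path for e in prune(manifest, 2022, 2023)]
```

Here `part-0.parquet` is never opened, because its recorded range ends before the filter begins. The decision is made from metadata alone, which is what keeps planning cheap as file counts grow.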
Personal data is tagged, classified, and column-level access-controlled. Every read and write is logged to an append-only audit trail with user, timestamp, and query context. Data residency is guaranteed by on-premise deployment. Data subject requests for access, correction, and deletion (akses, koreksi, penghapusan) are supported as first-class CLI operations. We provide a full UU PDP control mapping document on request, mapped to specific platform features.
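As a rough illustration of column-level access control paired with an audit trail, consider the Python sketch below. It is a minimal model under stated assumptions, not Datahive's implementation: the `COLUMN_POLICY` mapping, the role names, the `read_row` function, and the sample row are all hypothetical.

```python
from datetime import datetime, timezone

# Hypothetical policy: which roles may read each tagged column.
COLUMN_POLICY = {
    "name": {"registrar", "analyst"},
    "national_id": {"registrar"},  # tagged as personal data
}

AUDIT_LOG = []  # append-only by convention in this sketch


def read_row(user, role, row, query):
    """Return only the columns the role may see, and log the access."""
    visible = {c: v for c, v in row.items() if role in COLUMN_POLICY.get(c, set())}
    AUDIT_LOG.append({
        "user": user,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "query": query,
        "columns": sorted(visible),
    })
    return visible


row = {"name": "Sari", "national_id": "EXAMPLE-ONLY"}

analyst_view = read_row("budi", "analyst", row, "SELECT * FROM students")
registrar_view = read_row("ani", "registrar", row, "SELECT * FROM students")
```

The analyst receives the row without `national_id`, the registrar receives both columns, and each access leaves a log entry with user, timestamp, and query context. Columns with no policy entry are hidden by default, which is the safer failure mode for untagged data.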
A Standard deployment can be operational in 2–3 weeks. Enterprise deployments typically run 6–10 weeks including source system integration, data modeling, and semantic layer definition. Sovereign engagements are scoped individually. We include our delivery methodology and a reference Gantt chart with every proposal.
Both. Datahive runs on any Kubernetes cluster — your existing one, or a Proxmox/VMware setup we help architect. For Sovereign deployments, we can source, configure, and ship pre-imaged hardware to your datacenter with full burn-in and commissioning.
Your data stays in open formats (Iceberg + Parquet) on your object storage. Your pipelines are declarative .dh files in your Git repo. Your semantic layer is portable YAML. There is no proprietary lock-in to remove — you own the substrate already. We help you transition, if that day comes.
Deploy on-premise. Own every byte. Ship faster. Built in Indonesia — open everywhere.