All systems operational · Version 2.4.1 · Region ID-JKT-01
Uptime 99.982% · Latency p95 142 ms · Events/s 142,891
Open Lakehouse · On-premise · UU PDP compliant

The unified data foundation Indonesian institutions actually run on.

Datahive is the open data lakehouse for organizations that need to unify fragmented systems, govern every byte, and activate data for analytics and AI — without giving up sovereignty. Built on Apache Iceberg, StarRocks, and Project Nessie. Deployed on your infrastructure. Supported by a team on the ground in Jakarta.

Trusted by
Higher education · Government agencies · Financial services · Healthcare networks
datahive · live-view.dh (streaming)
Ingestion: 8.4 TB/h (↑ 12.3% WoW)
Query p95: 142 ms (↓ 8.1%, faster)
Pipelines: 1,284 (+24 today)
Data managed: 128 TB (Iceberg / Nessie)
Data flow (diagram):
Sources: CRM · ERP; SIS · LMS; Postgres · MySQL; Gov APIs; Files · OCR; IoT / sensors
Process: Ingest (Airbyte · Kafka CDC); Transform (dbt · Flink · Spark); Govern (OpenMetadata · RBAC)
Lakehouse core: Iceberg (128 TB); StarRocks (p95 142 ms); MinIO (99.98%); Nessie (v2.8); Trino (active); DuckDB (embed); Feast (online); K8s (24 nodes)
Consumers: BI · Dashboards; AI · LangGraph; APIs
Powering data for institutions across Indonesia
Ø01 · The fragmentation problem

Your data is everywhere. Your decisions shouldn't be.

Modern organizations run on dozens of systems — CRM, ERP, SIS, LMS, APIs, spreadsheets, sensors. Without a unified foundation, fragmentation compounds silently into operational debt: reconciliation work, duplicated pipelines, delayed decisions, and stalled AI initiatives. We've seen the same patterns in every engagement.

fragmentation-diagnostic · sample output
3 Critical · 7 Warning · 142 Resolved
14:23:08 FRAG-001
Inconsistent executive reporting
Finance and operations teams report divergent Q3 revenue figures. Delta of 4.2% traced to two conflicting definitions of "active account".
source: crm.v1, erp.legacy · −18 hrs/wk reconciliation overhead
14:19:44 FRAG-002
Duplicated data pipelines across teams
Three departments independently built and maintain overlapping ETL jobs against the same source systems. Infrastructure cost and maintenance burden duplicated 3×.
source: audit.catalog · −IDR 240M annual run-cost
14:12:31 FRAG-003
Decision-lag in executive dashboards
Leadership dashboard refresh cycle exceeds 48 hours. Critical KPIs are materially stale at the moment of board-level decision-making.
source: bi.legacy · −3.2 days avg decision lag
14:08:15 FRAG-004
Ungoverned shadow data marts
Forty-seven spreadsheet-based "mini warehouses" discovered across business units — untracked lineage, untracked access, no audit trail. UU PDP exposure.
source: scanner.file · Compliance risk: UU PDP, ISO 27001
14:02:47 FRAG-005
AI initiatives blocked at the data layer
Four planned ML/LLM use-cases indefinitely stalled. No unified feature store, no clean training set, no lineage from model to source.
source: platform.ml · 0 models in production
Ø02 · The technology underneath

Built on the open standards the world's largest data platforms run on.

Datahive is not a black box. Every component is open-source, battle-tested at hyperscale, and replaceable. Below are the four foundational technologies — what they do, why they matter, and how they fit together to give you a lakehouse that performs like a warehouse and stays open like a data lake.

Apache Iceberg v1.6 · ASF
Table Format
The open table format that turns your object storage into a database. Originated at Netflix, now the de-facto standard for open lakehouses.
Traditional data lakes store files in object storage with no metadata management, which makes updates, deletes, time-travel, and schema evolution painful. Iceberg solves this by adding a multi-layered metadata tree on top of your Parquet files: a catalog points to a metadata file, which references a manifest list, which references individual manifests tracking the actual data files. Each manifest carries column-level statistics (lower and upper bounds, null counts, value counts), so the query engine can prune files at planning time, before scanning a single byte. For tables with billions of files, this avoids the full-scan planning bottleneck that plagues simpler approaches.
ACID: Snapshot isolation, optimistic concurrency, atomic commits
Schema: Add, drop, rename, reorder columns; fully backward-compatible
Time travel: Query any snapshot by ID or timestamp; audit history without copies
Partition: Hidden partitioning with evolution; change scheme without rewriting data
Engines: Spark, Trino, Flink, StarRocks, Dremio, Snowflake, BigQuery
Apache Foundation · Open REST catalog · Parquet native
iceberg.apache.org ↗
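To make the planning-time pruning idea concrete, here is a minimal Python sketch. It is not Iceberg's API: `DataFileStats`, `prune_files`, and the file paths are hypothetical names, and the manifest is reduced to a list of per-file lower and upper bounds for one column.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFileStats:
    """Per-file column bounds, as a manifest entry might record them (simplified)."""
    path: str
    lower: dict  # column name -> minimum value in the file
    upper: dict  # column name -> maximum value in the file


def prune_files(files, column, lo, hi):
    """Keep only files whose [lower, upper] range for `column` overlaps [lo, hi]."""
    kept = []
    for f in files:
        if f.upper[column] < lo or f.lower[column] > hi:
            continue  # bounds prove no row can match, so the file is never opened
        kept.append(f.path)
    return kept


# Hypothetical manifest contents; ISO dates compare correctly as strings.
manifest = [
    DataFileStats("data/a.parquet", {"day": "2024-01-01"}, {"day": "2024-01-31"}),
    DataFileStats("data/b.parquet", {"day": "2024-02-01"}, {"day": "2024-02-29"}),
    DataFileStats("data/c.parquet", {"day": "2024-03-01"}, {"day": "2024-03-31"}),
]

to_scan = prune_files(manifest, "day", "2024-02-10", "2024-02-20")
print(to_scan)  # ['data/b.parquet']
```

Only metadata is consulted here; two of the three files are ruled out before any data is read, which is the effect the manifest statistics are meant to deliver.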
StarRocks v3.3 · Linux Foundation
Query Engine
A vectorized MPP database designed to deliver sub-second analytics on petabyte-scale data, including data sitting natively in Iceberg.
Most analytical databases use a scatter-gather pattern: fan a query out to worker nodes, then aggregate results on a single coordinator. That coordinator becomes a bottleneck on complex queries. StarRocks instead uses true MPP (massively parallel processing): data is shuffled across many nodes, joins and aggregations happen in parallel everywhere, and the final assembly is itself distributed. For high-cardinality group-bys and large fact-table joins, this is materially faster. The execution engine is fully vectorized — written in C++ with SIMD instructions to process multiple values per CPU cycle. It uses Operation on Encoded Data: joins, aggregations, and expressions run directly on dictionary-encoded strings without decoding. The result is documented at 5–10× faster than previous-generation systems for multi-dimensional analytics.
Architecture: MPP with shared-nothing FE/BE topology; scales horizontally
Engine: C++ vectorized executor with SIMD; columnar in-memory layout
Optimizer: Cost-based optimizer with statistics-driven plan selection
Materialized views: Auto-refresh and auto-rewrite; query pre-aggregated data transparently
Lake queries: Query Iceberg/Hive/Hudi tables without migration; zero-copy analytics
Compatibility: MySQL wire protocol; every BI tool already speaks it
Linux Foundation · Sub-second p95 · Storage-compute separation
starrocks.io ↗
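The idea of operating on encoded data can be illustrated with a small Python sketch. This is a toy model, not StarRocks internals: `dictionary_encode` and `sum_by_code` are hypothetical helpers, and the sample values are made up. The point is that the grouping loop touches only integer codes, and strings are decoded once per group at the end.

```python
from collections import defaultdict


def dictionary_encode(values):
    """Return (dictionary, codes): each distinct string gets a small integer code."""
    dictionary, index, codes = [], {}, []
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes


def sum_by_code(codes, amounts):
    """GROUP BY on integer codes only; no string is touched in the hot loop."""
    totals = defaultdict(int)
    for code, amount in zip(codes, amounts):
        totals[code] += amount
    return totals


# Illustrative data only.
faculty = ["Teknik", "Hukum", "Teknik", "Kedokteran", "Hukum", "Teknik"]
fees = [10, 7, 12, 30, 8, 11]

dictionary, codes = dictionary_encode(faculty)
totals = sum_by_code(codes, fees)

# Decode once per group at the very end, not once per row.
result = {dictionary[code]: total for code, total in totals.items()}
print(result)  # {'Teknik': 33, 'Hukum': 15, 'Kedokteran': 30}
```

A real vectorized engine does this over columnar batches with SIMD; the sketch only shows why skipping the decode step saves work.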
Project Nessie v0.95 · Dremio
Versioned Catalog
Git for your data. Branches, tags, and commits over your entire lakehouse — without copying a single byte.
Nessie applies the mental model of Git to your data catalog. Instead of versioning data files (which would mean copying terabytes), Nessie versions the catalog metadata: the registry of which tables exist, what schemas they have, and which underlying Parquet files they reference. This separation enables zero-copy branching — branches share file pointers, so creating a branch is instant regardless of table size. The implications are significant. Data engineers can spin up an isolated branch to test a destructive migration, validate the result, then atomically merge back to main — exactly the workflow developers expect from Git. Multi-table transactions become possible: commit changes to ten tables in one atomic operation, with rollback if anything fails. Production isolation becomes trivial: dev, staging, and prod branches all see the same physical data, with their own logical histories.
Versioning unit: Catalog metadata pointers, not data files (zero-copy)
Operations: commit, branch, tag, merge, cherry-pick, revert
Atomicity: Multi-table transactions with optimistic concurrency control
Use cases: ETL isolation, A/B data testing, regulatory point-in-time queries
Engines: Iceberg-native; works with Spark, Flink, Trino, Dremio
Open source · REST API · Iceberg + Delta Lake
projectnessie.org ↗
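The branching model described above can be sketched in a few lines of Python. This is a toy, not the Nessie API: the `Catalog` class, its methods, and the `s3://lake/...` pointers are hypothetical, and the merge is restricted to a fast-forward for brevity. It shows why creating a branch costs nothing and how one commit can change several tables atomically.

```python
class Catalog:
    """Toy versioned catalog: each commit maps table name -> metadata pointer."""

    def __init__(self):
        self.commits = [(None, {})]   # list of (parent commit id, tables)
        self.branches = {"main": 0}   # branch name -> head commit id

    def tables(self, branch):
        return self.commits[self.branches[branch]][1]

    def commit(self, branch, changes):
        """Apply changes to one or more tables as a single atomic commit."""
        head = self.branches[branch]
        tables = {**self.commits[head][1], **changes}  # copies pointers, not data
        self.commits.append((head, tables))
        self.branches[branch] = len(self.commits) - 1

    def create_branch(self, name, source="main"):
        """A new branch is only a new name for an existing commit."""
        self.branches[name] = self.branches[source]

    def _is_ancestor(self, ancestor, commit_id):
        while commit_id is not None:
            if commit_id == ancestor:
                return True
            commit_id = self.commits[commit_id][0]
        return False

    def merge(self, source, target="main"):
        """Fast-forward only (a simplification): refuse if target has diverged."""
        if not self._is_ancestor(self.branches[target], self.branches[source]):
            raise RuntimeError("target has diverged; a real merge would be needed")
        self.branches[target] = self.branches[source]


catalog = Catalog()
catalog.commit("main", {"students": "s3://lake/students/v1.metadata.json"})
catalog.create_branch("migration")
catalog.commit("migration", {
    "students": "s3://lake/students/v2.metadata.json",
    "payments": "s3://lake/payments/v1.metadata.json",
})
main_before = dict(catalog.tables("main"))  # main is untouched by branch work
catalog.merge("migration")
print(main_before)
print(catalog.tables("main"))
```

Until the merge, readers of `main` keep seeing the `v1` pointer; after it, both table changes become visible together.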
MinIO v RELEASE.2026 · MinIO Inc.
Object Storage
High-performance, S3-compatible object storage you run on your own hardware. The substrate every other component sits on.
A lakehouse needs cheap, effectively unlimited, durable storage; historically that meant cloud object stores like S3. MinIO gives you the same S3 API, the same scaling characteristics, and the same durability guarantees, but running on bare metal in your datacenter. Your data never leaves your jurisdiction. MinIO uses erasure coding (Reed-Solomon) to shard objects across drives; at the highest parity setting, an erasure set can lose up to half of its drives without data loss. Throughput is measured in GB/s per node on commodity hardware; benchmarks routinely exceed cloud object storage on read-heavy workloads. For a lakehouse on UU PDP-regulated data, this is the only architecture that works without compromise.
API: Drop-in S3 compatibility; every Iceberg/Spark/Trino client works
Durability: Erasure coding with configurable parity (e.g. 8+4 EC scheme)
Encryption: SSE-S3, SSE-KMS, SSE-C; per-bucket key management
Replication: Active-active site replication for DR; bucket-level policies
Identity: IAM-compatible RBAC, OIDC SSO, audit logs to SIEM
S3-compatible · Bare-metal native · Sovereign-ready
min.io ↗
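The trade-off behind a parity setting such as 8+4 is simple arithmetic, sketched below in Python. The function name `erasure_profile` and the 192 TB raw capacity are assumptions for illustration; the only fact relied on is that a Reed-Solomon layout with k data shards and m parity shards can rebuild an object after losing any m shards.

```python
def erasure_profile(data_shards, parity_shards, raw_tb):
    """Summarize a k+m erasure-coding layout for a given raw capacity."""
    total = data_shards + parity_shards
    return {
        "drives_per_stripe": total,
        # Any `parity_shards` shards may be lost and the object is still readable.
        "tolerated_drive_failures": parity_shards,
        "storage_efficiency": data_shards / total,
        "usable_tb": raw_tb * data_shards / total,
    }


# 8 data + 4 parity shards over a hypothetical 192 TB of raw disk.
profile = erasure_profile(8, 4, raw_tb=192)
print(profile["tolerated_drive_failures"])  # 4
print(profile["usable_tb"])                 # 128.0

# Maximum parity (half the stripe) trades capacity for fault tolerance.
max_parity = erasure_profile(6, 6, raw_tb=192)
print(max_parity["tolerated_drive_failures"], max_parity["usable_tb"])  # 6 96.0
```

Higher parity buys tolerance for more simultaneous drive failures at the cost of usable capacity, which is why the parity level is a deployment decision rather than a fixed property.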
Step 1

Data lands in object storage

Source systems push events through Kafka and Debezium CDC. Airbyte connectors handle batch extracts. Everything is written as Parquet files into MinIO, your sovereign S3-compatible storage layer.

Airbyte · Kafka · Debezium · MinIO
Step 2

Iceberg catalogs the files

Apache Iceberg organizes those Parquet files into proper tables with ACID semantics, schema evolution, and time-travel. Project Nessie versions the catalog metadata — branches, tags, commits over your entire data estate.

Apache Iceberg · Project Nessie · Parquet
Step 3

StarRocks queries everything

StarRocks reads Iceberg tables directly from object storage with sub-second latency. BI tools, AI agents, and operational APIs all hit the same governed substrate through MySQL-compatible SQL or REST.

StarRocks MPP · Trino · DuckDB
Ø03 · Platform capabilities

What Datahive powers.

Six essential capabilities that turn fragmented data into a unified, governed, analytics-ready foundation. Each is productionized, each is open, each integrates with what you already run.

01

Data unification

Ingest structured, semi-structured, and unstructured data from any source — CRM, ERP, databases, SaaS APIs, flat files, telemetry — into a single governed substrate. Schema-on-read where it makes sense, schema-on-write where it matters.

300+ connectors · CDC native
02

Real-time processing

Ingest, enrich, and serve events in sub-second timeframes via Kafka, Apache Flink, and RisingWave. Exactly-once semantics; backpressure-aware; compatible with batch dbt models for unified streaming-batch pipelines.

Sub-second latency · Exactly-once
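One common way to get an exactly-once effect on top of at-least-once delivery is to make the sink idempotent. The Python sketch below illustrates that general technique only; it is not how Flink or RisingWave implement their guarantees (Flink, for instance, relies on checkpoints and transactional sinks). `IdempotentSink` and the event tuples are hypothetical.

```python
class IdempotentSink:
    """Applies each event at most once by tracking the last offset per partition."""

    def __init__(self):
        self.totals = {}       # key -> running sum (the sink's state)
        self.last_offset = {}  # partition -> highest offset already applied

    def apply(self, partition, offset, key, amount):
        if offset <= self.last_offset.get(partition, -1):
            return False  # duplicate from a retry or replay; ignore it
        # In a real system the state change and the offset update would be
        # committed together in one transaction.
        self.totals[key] = self.totals.get(key, 0) + amount
        self.last_offset[partition] = offset
        return True


# (partition, offset, key, amount); offset 1 is delivered twice.
events = [
    (0, 0, "Teknik", 5),
    (0, 1, "Hukum", 3),
    (0, 1, "Hukum", 3),  # redelivered after a simulated consumer restart
    (0, 2, "Teknik", 2),
]

sink = IdempotentSink()
applied = [sink.apply(*event) for event in events]
print(sink.totals)  # {'Teknik': 7, 'Hukum': 3}
print(applied)      # [True, True, False, True]
```

The redelivered event is seen twice but counted once, so downstream totals match what a single clean delivery would have produced.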
03

Transformation & modeling

Standardize, enrich, and structure data via dbt with full model lineage; SQLMesh for blue-green semantic versioning; a Cube-based semantic layer that defines metrics once and serves them consistently to every downstream tool.

dbt Core · Semantic metrics
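"Define metrics once" can be pictured as a small registry that every consumer compiles from. The Python sketch below is illustrative only and is not Cube's or MetricFlow's API; `Metric`, `METRICS`, `compile_metric`, and the `accounts` table with its columns are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    name: str
    expression: str  # SQL aggregate expression
    table: str
    where: str = ""  # the one shared rule for which rows count


# A single definition of "active account", shared by every dashboard and API.
METRICS = {
    "active_accounts": Metric(
        name="active_accounts",
        expression="COUNT(DISTINCT account_id)",
        table="accounts",
        where="status = 'active'",
    ),
}


def compile_metric(metric_name, group_by=None):
    """Render a registered metric to SQL, optionally grouped by one column."""
    m = METRICS[metric_name]
    select = [f"{m.expression} AS {m.name}"]
    if group_by:
        select.insert(0, group_by)
    sql = f"SELECT {', '.join(select)} FROM {m.table}"
    if m.where:
        sql += f" WHERE {m.where}"
    if group_by:
        sql += f" GROUP BY {group_by}"
    return sql


print(compile_metric("active_accounts"))
print(compile_metric("active_accounts", group_by="region"))
```

Because finance and operations would both compile from the same entry, the kind of divergence described in FRAG-001 (two definitions of "active account") has nowhere to originate.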
04

Governance & lineage

End-to-end policy control, column-level lineage via OpenMetadata, full audit trail, and row-level security tied to your identity provider via SSO/OIDC and SCIM. UU PDP and ISO 27001 aligned out of the box.

UU PDP · Column lineage · SSO / OIDC
05

Analytics & activation

Sub-second interactive analytics through StarRocks and Trino; semantic modeling via Cube; native integration with Metabase, Superset, and your existing BI tooling. Reverse-ETL pushes governed data back into operational systems.

p95 < 200ms · Reverse-ETL
06

AI-ready foundation

Vector search via pgvector and Milvus; feature store via Feast (offline + online); LLM orchestration via LangGraph and Dify. Your governed data becomes retrieval-ready, fine-tuning-ready, and agent-ready — without a second platform.

Vector + BM25 · LangGraph
0.4TB
Data under management
Across active deployments
99.98%
Platform uptime SLA
12-month rolling average
142ms
Query p95 latency
Measured on StarRocks MPP
0×
Faster than legacy DW
Mid-size customer benchmark
Ø04 · Reference architecture

One governed foundation. Five layers, fully open.

Datahive's architecture is intentionally layered: each layer talks to its neighbors through stable, open interfaces. You can replace any single layer (a different query engine, a different orchestrator, a different BI tool) without re-architecting the rest. This is what "no lock-in" actually means in practice.

system.architecture · v2.4
Legend: Streaming · Batch · Federated · Control
L5 · Activation: BI Dashboards (Metabase · Superset); AI Agents (LangGraph · Dify); Reverse ETL (Hightouch · Census); APIs (GraphQL · REST); Operational apps (respon.app · custom); Executive intelligence (Real-time KPI command)
L4 · Semantic & governance: Semantic layer (Cube · MetricFlow); Catalog & lineage (OpenMetadata · DataHub); Feature store (Feast · offline + online); AI trust & governance (UU PDP · audit · RBAC · lineage)
L3 · Lakehouse core (open table format): Apache Iceberg; StarRocks; Trino; MinIO S3; Nessie; RisingWave; DuckDB; pgvector / Milvus
L2 · Processing: Stream processing (Flink · Spark Stream); Transform (dbt · SQLMesh); Orchestration (Dagster · Airflow); Quality (Great Expectations)
Compute (Kubernetes): 24 nodes · 192 vCPU · 768 GB
L1 · Ingestion: CDC (Debezium); Event streaming (Kafka · Redpanda); Connectors (Airbyte · 300+); OCR & docs (Document AI); IoT · MQTT (Sensors · telemetry); Government APIs (Coretax · Dukcapil · OJK)
Ø05 · Intelligence engines

Five engines. One governed foundation.

Each engine is a standalone, production-grade asset — self-hostable, auditable, and independently scalable. Together, they compose into a single institutional intelligence stack.

01 · ENG
Knowledge Intelligence Engine
Institutions sit on millions of pages — policies, archives, reports, regulations. This engine turns latent institutional memory into a queryable, grounded, retrieval-augmented corpus with full citation and provenance. Indonesian and English native.
Search: Hybrid vector + BM25
Languages: ID · EN native
Grounding: Citation-first
GA · v2.4 Read spec →
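Hybrid search has to combine a keyword ranking and a vector ranking into one list. The page does not say how the engine fuses them; the Python sketch below uses reciprocal rank fusion (RRF) purely as one well-known option. The function name, the document IDs, and the constant `k=60` are assumptions for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists: each hit contributes 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document id for stability.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical top-3 results from a BM25 index and a vector index.
bm25_ranking = ["policy-12", "report-03", "memo-44"]
vector_ranking = ["report-03", "archive-07", "policy-12"]

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused)  # ['report-03', 'policy-12', 'archive-07', 'memo-44']
```

Documents that both retrievers agree on rise to the top, while a hit found by only one retriever still survives lower in the list.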
02 · ENG
Document Intelligence Engine
Every institution ships thousands of documents (contracts, invoices, faktur pajak (tax invoices), academic records), most still processed by hand. This engine extracts, classifies, and routes structured data at industrial throughput with layout-aware OCR and custom schemas.
Extraction: 99.2% accuracy
Templates: Faktur Pajak ready
Throughput: 10k docs/hour
GA · v2.1 Read spec →
03 · ENG
Predictive Intelligence Engine
Good strategic decisions need two things: complete data, and the ability to look ahead. Forecasting, anomaly detection, and causal inference — productionized, monitored, and versioned alongside your data pipelines.
Methods: Forecasting, causal ML
Retraining: Automated, scheduled
Monitoring: Drift & accuracy
Beta · v1.8 Join early access →
04 · ENG
Enterprise Data Engine
A unified data foundation plus an institutional base model — trained on your own dataset so it speaks your internal terminology, policy hierarchy, and operational procedures. The data layer and the model layer, governed together.
Storage: StarRocks + Iceberg
Custom LLM: On your corpus
Deploy: Fully on-prem
GA · v3.0 Read spec →
05 · ENG
AI Trust & Governance Layer
Running AI in an institutional setting without governance isn't just technical risk — it's legal and reputational. This layer provides UU PDP-compliant guardrails, end-to-end audit trails, role-based access control, and column-level lineage for every AI surface you build.
Compliance: UU PDP · ISO 27001
Audit: End-to-end trail
Access: RBAC · SSO · SCIM
GA · v2.0 Read spec →
Ø06 · Modular stack

Built on open standards.

Every component is swappable. No proprietary formats, no vendor lock-in. The stack below is the foundation we deploy — but each layer can be replaced independently as your needs evolve.

L5 · Activation | BI & Dashboards | Metabase · Superset · Evidence
L5 · Activation | AI Agents | LangGraph · Dify · CrewAI
L4 · Semantic | Semantic Layer | Cube · MetricFlow
L4 · Governance | Catalog & Lineage | OpenMetadata · DataHub
L3 · Core | Query Engine | StarRocks · Trino · DuckDB
L3 · Core | Table Format | Apache Iceberg · Nessie
L3 · Core | Object Storage | MinIO · S3 compatible
L3 · Core | Vector & Search | pgvector · Milvus · Elastic
L2 · Process | Streaming | Apache Flink · RisingWave
L2 · Process | Transform | dbt · SQLMesh
L2 · Process | Orchestration | Dagster · Airflow
L2 · Process | Quality | Great Expectations · Soda
L1 · Ingestion | CDC | Debezium · Kafka Connect
L1 · Ingestion | Connectors | Airbyte · 300+ sources
L1 · Ingestion | Event Bus | Kafka · Redpanda
L0 · Compute | Orchestration | Kubernetes · Proxmox
Ø07 · Customer stories

Where Datahive delivers impact.

Purpose-built reference deployments across Indonesian higher education, government, and enterprise. Each built and operated under UU PDP, on infrastructure the customer owns.

Higher Education

Unified student & academic intelligence

Consolidated 150K+ student records across SIS, LMS, admissions, and payment systems into a single governed foundation. Executive dashboards now refresh in real time; OCR-powered document workflows eliminated 18 hrs/week of manual reconciliation.

"We moved from static monthly reports to a live institutional dashboard. The IT team didn't have to rip out a single existing system."
— CTO, private university (42k active students)
150K+
Students unified
24×
Query speedup
4 days → live
Report refresh
Read the case study →
Government · Environment

Environmental operations platform

Non-IoT operations platform for Jakarta's environmental agency. Seven actor roles, six integrated modules, and direct integration with SILIKA, Bank Sampah Portal, and BGMIOTA — all under a single governed data plane.

7
Actor roles
3
Gov integrations
Real-time
Field ops
Read the case study →
Enterprise · CRM

AI-native WhatsApp CRM (respon.app)

Five-layer AI-native CRM platform for Indonesian SMBs. TwentyCRM, Dify, and Evolution API unified through Datahive — WhatsApp-first channel, IDR-native billing via Xendit, fully UU PDP compliant from day one.

5
Product layers
WA-first
Channel
IDR
Native billing
Read the case study →
Ø08 · Security & compliance

Built for the standards you answer to.

Datahive is designed for the compliance environment Indonesian institutions operate in. On-premise deployment keeps sensitive data within your jurisdiction; governance is built in, not bolted on.

UU PDP
Indonesia PDP law
ISO 27001
Information security
SOC 2 Type II
Audit framework
GDPR ready
EU cross-border
OJK guidelines
Financial services
Ø09 · Developer experience

Deploy in minutes. Own it forever.

Single-binary CLI, declarative configuration, GitOps-native. Datahive runs on your infrastructure, inside your firewall, under your keys — with the same deployment ergonomics you expect from modern developer tools.

Infrastructure as code.
Data as code.

Every pipeline, every model, every access policy is defined in version-controlled YAML. Deploy the same configuration across dev, staging, and production with a single command — and roll it back with another.

Single-binary CLI
One datahive command across init, deploy, migrate, query, audit. No external dependencies, no runtime installation.
Declarative pipelines (.dh format)
Type-checked at plan time. Sources, transforms, sinks, schedules — all composable primitives with explicit dependencies.
Semantic queries, not raw SQL
Query metrics defined once in the semantic layer — consistent numbers across every downstream tool, every dashboard, every API.
GitOps-native
Changes go through PR review. Every deploy is signed, audited, and reversible. Branch your data with Nessie like you branch your code with Git.
Zero-downtime schema migrations
Iceberg + Nessie give you branching and time-travel. Test a migration on a branch, validate, merge atomically when safe.
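"Composable primitives with explicit dependencies" implies that a plan step can reject a broken pipeline before anything runs. The Python sketch below shows that kind of plan-time check with the standard-library `graphlib` module. It does not parse or model the real .dh format; the `plan` function and the step names in `enrollment` are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def plan(pipeline):
    """Validate step references and return a safe execution order."""
    for step, deps in pipeline.items():
        missing = [d for d in deps if d not in pipeline]
        if missing:
            raise ValueError(f"{step} depends on undefined step(s): {missing}")
    try:
        # Each key maps to the steps that must finish before it can start.
        return list(TopologicalSorter(pipeline).static_order())
    except CycleError as exc:
        raise ValueError(f"dependency cycle: {exc.args[1]}") from exc


# Hypothetical pipeline: two sources, two transforms, one sink.
enrollment = {
    "sis_students": [],
    "lms_activity": [],
    "joined": ["sis_students", "lms_activity"],
    "by_faculty": ["joined"],
    "dashboard_sink": ["by_faculty"],
}

order = plan(enrollment)
print(order)
```

A typo in a dependency name or an accidental cycle raises `ValueError` at plan time, so the failure surfaces in review rather than in production.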
~ datahive-cli · zsh
user@datahive:~/infra/prod$ datahive init --profile enterprise
→ Initializing Datahive control plane...
Kubernetes context dh-prod-jkt-01 (24 nodes)
Provisioning Iceberg catalog on MinIO
StarRocks topology deployed (3 FE, 12 BE)
Nessie branch main initialized
UU PDP governance layer activated
 
user@datahive:~/infra/prod$ datahive pipeline deploy ./enrollment.dh
▸ parsing pipeline definition...
Found 4 sources, 12 transforms, 3 sinks
Schema validation passed · zero breaking changes
Deployed to dh-runtime-prod-01
 
user@datahive:~/infra/prod$ datahive query --semantic "enrollment by faculty Q3"
→ resolving semantic layer ...
→ compiled to StarRocks SQL
┌─────────────────────────┬────────┬──────────┐
│ faculty                 │ count  │ yoy      │
├─────────────────────────┼────────┼──────────┤
│ Teknik                  │ 12,847 │ +8.2%    │
│ Bisnis Ekonomi          │ 18,230 │ +12.4%   │
│ Kedokteran              │ 4,291  │ +3.1%    │
│ Hukum                   │ 6,104  │ −1.8%    │
└─────────────────────────┴────────┴──────────┘
4 rows · executed in 127ms
Ø10 · Engagement tiers

Built around the scale of your ambition.

Datahive is sold as an outcome, not a SKU. Every engagement is scoped to your data estate, regulatory profile, and operational scale — and priced in IDR with predictable annual commitments. Below are the three tiers our customers typically map into; the right fit takes a 30-minute conversation.

Foundation
For growing data teams
Investment level: Entry tier

A production-ready open lakehouse for organizations getting serious about unified data. Designed for teams who want enterprise architecture without enterprise overhead.

  • Mid-scale data volumes, room to grow
  • StarRocks + Iceberg + MinIO core stack
  • Curated connector library · Airbyte OSS
  • Standard production SLA
  • Business-hours support · named contact
  • UU PDP baseline compliance package
  • Quarterly architecture review
Request scoping call
Sovereign
For regulated & sovereign workloads
Investment level: Custom engagement

Fully air-gapped deployment on hardware you own, in datacenters you operate. Designed for the regulatory environment of financial services, healthcare, defense, and central government.

  • Air-gapped on-premise installation
  • Full source access & reproducible builds
  • On-site engineering residency option
  • Sovereign-grade SLA & runbooks
  • OJK / Bank Indonesia guideline alignment
  • Custom compliance & audit attestations
  • Hardware sourcing & commissioning available
Speak with our team
Pricing model
Annual commitment, billed in IDR. Predictable line-items: platform license, infrastructure footprint, support tier, professional services. No per-query, per-user, or per-seat charges.
What shapes the investment
Data volume under management, number of source systems, expected concurrency, deployment topology (single-site, HA, multi-DC), and the depth of UU PDP / sectoral compliance attestations you need.
How we quote
A 30-minute discovery call, followed by a written proposal within five business days. Every proposal includes scope, milestones, success criteria, and a fixed price for the first phase.
Ø11 · Common questions

Frequently asked.

Everything we're asked in the first sales call — compressed. If you need depth on any of these, our solutions team is one email away.

Is Datahive really fully on-premise? Does anything phone home?

Yes, fully on-premise. Once deployed, the runtime does not require any outbound connectivity to our infrastructure to operate. Telemetry is opt-in and aggregated; source access means you can audit every outbound call yourself with tcpdump or your own SIEM. Air-gapped installation is a first-class deployment mode on Sovereign plans.

How is Datahive different from Databricks or Snowflake?

Datahive is built on the same open standards Databricks embraces (Iceberg, Parquet, Spark), but deploys to your infrastructure with no proprietary runtime. Snowflake is a managed SaaS warehouse; Datahive is an on-premise open lakehouse designed for data residency, regulatory compliance, and cost predictability in the Indonesian market. The two are not mutually exclusive — several customers use Datahive as the sovereign tier alongside cloud warehouses for non-sensitive workloads, with the same Iceberg tables federated across both.

Why Apache Iceberg specifically? Why not Delta Lake or Hudi?

Iceberg is the only major table format with a truly open catalog specification (REST + multi-engine), no historical Spark coupling, and a metadata architecture that scales to billions of files via manifest pruning. Delta Lake originated at Databricks and has the deepest Spark integration, but its catalog ecosystem is narrower. Hudi optimizes for streaming upsert workloads but sacrifices schema flexibility. For an open lakehouse where you'll mix StarRocks, Trino, Flink, and Spark, Iceberg is the lowest-risk choice — and it's converging with the others on features.

What does UU PDP compliance actually mean in the product?

Personal data is tagged, classified, and access-controlled at column level. Every read and write is logged to an append-only audit trail with user, timestamp, and query context. Data residency is guaranteed by on-premise deployment. Data subject requests for access, correction, and erasure (akses, koreksi, penghapusan) are supported as first-class CLI operations. We provide a full UU PDP control mapping document on request, mapped to specific platform features.
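The two controls named above, column-level access and an append-only audit trail, can be sketched together in Python. This is not the product's implementation: the column tags, roles, and function names are hypothetical, and the hash chain used to make tampering detectable is an added assumption rather than something the page describes.

```python
import hashlib
import json
from datetime import datetime, timezone

PII_COLUMNS = {"nik", "phone"}                      # hypothetical classification tags
ROLE_CAN_SEE_PII = {"dpo": True, "analyst": False}  # hypothetical roles

audit_log = []  # append-only: entries are added, never edited in place


def _digest(body):
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


def append_audit(user, query):
    """Record who ran what and when, chained to the previous entry's hash."""
    entry = {
        "user": user,
        "query": query,
        "at": datetime.now(timezone.utc).isoformat(),
        "prev": audit_log[-1]["hash"] if audit_log else "",
    }
    entry["hash"] = _digest(entry)  # computed before the hash key is added
    audit_log.append(entry)


def read_row(user, role, row, query):
    """Log the access, then mask tagged columns unless the role may see them."""
    append_audit(user, query)
    if ROLE_CAN_SEE_PII.get(role, False):
        return dict(row)
    return {k: ("***" if k in PII_COLUMNS else v) for k, v in row.items()}


def verify_chain():
    """Return True only if no logged entry was altered or removed."""
    prev = ""
    for entry in audit_log:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if entry["prev"] != prev or entry["hash"] != _digest(body):
            return False
        prev = entry["hash"]
    return True


row = {"name": "Budi", "nik": "3174xxxxxxxxxxxx", "phone": "0812xxxxxxx", "faculty": "Teknik"}
masked = read_row("analyst01", "analyst", row, "SELECT * FROM students")
full = read_row("dpo01", "dpo", row, "SELECT * FROM students")
print(masked)
print(full == row, verify_chain())
```

Every read is logged before any data is returned, so a denied or masked access leaves the same trace as a permitted one.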

How long does a typical deployment take?

A Standard deployment can be operational in 2–3 weeks. Enterprise deployments typically run 6–10 weeks including source system integration, data modeling, and semantic layer definition. Sovereign engagements are scoped individually. We publish our delivery methodology and reference Gantt with every proposal.

Can we bring our own cluster, or do you ship hardware?

Both. Datahive runs on any Kubernetes cluster — your existing one, or a Proxmox/VMware setup we help architect. For Sovereign deployments, we can source, configure, and ship pre-imaged hardware to your datacenter with full burn-in and commissioning.

What happens if we want to leave?

Your data stays in open formats (Iceberg + Parquet) on your object storage. Your pipelines are declarative .dh files in your Git repo. Your semantic layer is portable YAML. There is no proprietary lock-in to remove — you own the substrate already. We help you transition, if that day comes.

Ready for production

Your data foundation,
engineered for the AI era.

Deploy on-premise. Own every byte. Ship faster. Built in Indonesia — open everywhere.

30-minute demo · No obligation · NDA available on request