All systems operational Version 2.4.1 Region ID-JKT-01

Uptime 99.982% Latency p95 142 ms Events/s 142,891 Status page ↗

Open Lakehouse · On-premise · UU PDP compliant

The unified data foundation Indonesian institutions actually run on.

Datahive is the open data lakehouse for organizations that need to unify fragmented systems, govern every byte, and activate data for analytics and AI — without giving up sovereignty. Built on Apache Iceberg, StarRocks, and Project Nessie. Deployed on your infrastructure. Supported by a team on the ground in Jakarta.

Book a demo → Read the technical brief ↗

Trusted by

Higher education · Government agencies · Financial services · Healthcare networks

datahive · live-view.dh

Streaming

Ingestion: 8.4 TB/h (↑ 12.3% WoW)
Query p95: 142 ms (↓ 8.1%, faster)
Pipelines: 1,284 (+24 today)
Data managed: 128 TB (Iceberg / Nessie)

Powering data for institutions across Indonesia

Pelita Harapan · Sapta Tunas · DLH DKI · respon.app · Virtue Digital · VAL.ID

Ø01 · The fragmentation problem

Your data is everywhere. Your decisions shouldn't be.

Modern organizations run on dozens of systems — CRM, ERP, SIS, LMS, APIs, spreadsheets, sensors. Without a unified foundation, fragmentation compounds silently into operational debt: reconciliation work, duplicated pipelines, delayed decisions, and stalled AI initiatives. We've seen the same patterns in every engagement.

fragmentation-diagnostic · sample output

3 Critical · 7 Warning · 142 Resolved

14:23:08 FRAG-001

Inconsistent executive reporting

Finance and operations teams report divergent Q3 revenue figures. Delta of 4.2% traced to two conflicting definitions of "active account".

source: crm.v1, erp.legacy · −18 hrs/wk reconciliation overhead

14:19:44 FRAG-002

Duplicated data pipelines across teams

Three departments independently built and maintain overlapping ETL jobs against the same source systems. Infrastructure cost and maintenance burden duplicated 3×.

source: audit.catalog · −IDR 240M annual run-cost

14:12:31 FRAG-003

Decision-lag in executive dashboards

Leadership dashboard refresh cycle exceeds 48 hours. Critical KPIs are materially stale at the moment of board-level decision-making.

source: bi.legacy · −3.2 days avg decision lag

14:08:15 FRAG-004

Ungoverned shadow data marts

Forty-seven spreadsheet-based "mini warehouses" discovered across business units — untracked lineage, untracked access, no audit trail. UU PDP exposure.

source: scanner.file · Compliance risk: UU PDP, ISO 27001

14:02:47 FRAG-005

AI initiatives blocked at the data layer

Four planned ML/LLM use-cases indefinitely stalled. No unified feature store, no clean training set, no lineage from model to source.

source: platform.ml · 0 models in production

Ø02 · The technology underneath

Built on the open standards the world's largest data platforms run on.

Datahive is not a black box. Every component is open-source, battle-tested at hyperscale, and replaceable. Below are the four foundational technologies — what they do, why they matter, and how they fit together to give you a lakehouse that performs like a warehouse and stays open like a data lake.

Apache Iceberg v1.6 · ASF

Table Format

The open table format that turns your object storage into a database. Originated at Netflix, now the de-facto standard for open lakehouses.

Traditional data lakes store files in object storage with no metadata management — making updates, deletes, time-travel, and schema evolution painful. Iceberg solves this by adding a multi-layered metadata tree on top of your Parquet files: a catalog points to a metadata file, which references a manifest list, which references individual manifests tracking the actual data files. Each manifest carries column-level statistics (min/max values, null counts, distinct counts) — enabling the query engine to prune files at planning time, before scanning a single byte. For tables with billions of files, this avoids the full-scan planning bottleneck that plagues simpler approaches.
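To make the planning-time pruning concrete, here is a minimal sketch of the idea, not Iceberg's actual implementation. The `DataFile` class, the `prune` helper, the file paths, and the `day` column are all hypothetical; real manifests store per-column lower and upper bounds in a binary format defined by the Iceberg spec.

```python
from dataclasses import dataclass


@dataclass
class DataFile:
    """One Parquet file as a manifest entry might describe it (simplified)."""
    path: str
    lower: dict  # column name -> minimum value stored in the file
    upper: dict  # column name -> maximum value stored in the file


def prune(files, column, lo, hi):
    """Keep only files whose [min, max] range for `column` overlaps [lo, hi]."""
    kept = []
    for f in files:
        if f.upper[column] < lo or f.lower[column] > hi:
            continue  # ranges cannot overlap, so this file is never opened
        kept.append(f)
    return kept


# Hypothetical manifest with three files partitioned loosely by day of month.
manifest = [
    DataFile("s3://lake/events/a.parquet", {"day": 1}, {"day": 10}),
    DataFile("s3://lake/events/b.parquet", {"day": 11}, {"day": 20}),
    DataFile("s3://lake/events/c.parquet", {"day": 21}, {"day": 31}),
]

# Query predicate: day BETWEEN 12 AND 15
to_scan = prune(manifest, "day", 12, 15)
print([f.path for f in to_scan])  # ['s3://lake/events/b.parquet']
```

Only metadata is consulted here; two of the three files are excluded before any data is read, which is the effect the statistics are meant to deliver at much larger scale.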

ACID: Snapshot isolation, optimistic concurrency, atomic commits
Schema: Add, drop, rename, reorder columns; fully backward-compatible
Time travel: Query any snapshot by ID or timestamp; audit history without copies
Partition: Hidden partitioning with evolution; change scheme without rewriting data
Engines: Spark, Trino, Flink, StarRocks, Dremio, Snowflake, BigQuery

Apache Foundation · Open REST catalog · Parquet native

iceberg.apache.org ↗

StarRocks v3.3 · Linux Foundation

Query Engine

A vectorized MPP database designed to deliver sub-second analytics on petabyte-scale data, including data sitting natively in Iceberg.

Most analytical databases use a scatter-gather pattern: fan a query out to worker nodes, then aggregate results on a single coordinator. That coordinator becomes a bottleneck on complex queries. StarRocks instead uses true MPP (massively parallel processing): data is shuffled across many nodes, joins and aggregations happen in parallel everywhere, and the final assembly is itself distributed. For high-cardinality group-bys and large fact-table joins, this is materially faster.

The execution engine is fully vectorized: it is written in C++ and uses SIMD instructions to process multiple values per CPU instruction. It also operates on encoded data, so joins, aggregations, and expressions run directly on dictionary-encoded strings without decoding them first. The documented result is a 5–10× speedup over previous-generation systems for multi-dimensional analytics.
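The idea of operating on dictionary-encoded data can be sketched in a few lines. This is an illustration of the technique only, not StarRocks code: the helper names and the sample values are made up, and a real engine does this in vectorized C++ over columnar batches.

```python
def dictionary_encode(values):
    """Return (dictionary, codes): each distinct string gets a small integer code."""
    dictionary, index, codes = [], {}, []
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes


def sum_by_code(codes, amounts, n_groups):
    """GROUP BY on integer codes: no string hashing or comparison per row."""
    totals = [0] * n_groups
    for code, amount in zip(codes, amounts):
        totals[code] += amount
    return totals


# Hypothetical sample column and measure.
faculty = ["Teknik", "Hukum", "Teknik", "Kedokteran", "Hukum", "Teknik"]
amount = [10, 5, 7, 3, 2, 1]

dictionary, codes = dictionary_encode(faculty)
totals = sum_by_code(codes, amount, len(dictionary))

# Decode once per group at the very end, not once per row.
result = {dictionary[code]: total for code, total in enumerate(totals)}
print(result)  # {'Teknik': 18, 'Hukum': 7, 'Kedokteran': 3}
```

The aggregation loop touches only integers; strings reappear once per group when results are returned.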

Architecture: MPP with shared-nothing FE/BE topology; scales horizontally
Engine: C++ vectorized executor with SIMD; columnar in-memory layout
Optimizer: Cost-based optimizer with statistics-driven plan selection
Materialized views: Auto-refresh and auto-rewrite; query pre-aggregated data transparently
Lake queries: Query Iceberg/Hive/Hudi tables without migration; zero-copy analytics
Compatibility: MySQL wire protocol; every BI tool already speaks it

Linux Foundation · Sub-second p95 · Storage-compute separation

starrocks.io ↗

Project Nessie v0.95 · Dremio

Versioned Catalog

Git for your data. Branches, tags, and commits over your entire lakehouse — without copying a single byte.

Nessie applies the mental model of Git to your data catalog. Instead of versioning data files (which would mean copying terabytes), Nessie versions the catalog metadata: the registry of which tables exist, what schemas they have, and which underlying Parquet files they reference. This separation enables zero-copy branching: branches share file pointers, so creating a branch is instant regardless of table size.

The implications are significant. Data engineers can spin up an isolated branch to test a destructive migration, validate the result, then atomically merge back to main, exactly the workflow developers expect from Git. Multi-table transactions become possible: commit changes to ten tables in one atomic operation, with rollback if anything fails. Production isolation becomes trivial: dev, staging, and prod branches all see the same physical data, with their own logical histories.
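A toy model helps show why branching a catalog is cheap and why a multi-table commit can be atomic. The `ToyCatalog` class below is a hypothetical illustration and not the Nessie API: it uses an integer counter where Nessie uses commit hashes, and its merge is a simple fast-forward.

```python
class ConflictError(Exception):
    """Raised when a branch moved since the writer last read it."""


class ToyCatalog:
    def __init__(self):
        # Each branch maps table name -> pointer to a metadata file.
        self.branches = {"main": {}}
        # A per-branch commit counter stands in for the commit hash.
        self.heads = {"main": 0}

    def create_branch(self, name, source="main"):
        # Copies only the small pointer map; no data file is read or written.
        self.branches[name] = dict(self.branches[source])
        self.heads[name] = self.heads[source]

    def commit(self, branch, expected_head, changes):
        """Atomically repoint several tables, or change nothing at all."""
        if self.heads[branch] != expected_head:
            raise ConflictError(f"{branch} moved; re-read and retry")
        updated = dict(self.branches[branch])
        updated.update(changes)
        self.branches[branch] = updated  # single swap: all tables or none
        self.heads[branch] += 1
        return self.heads[branch]

    def merge(self, source, target="main"):
        # Toy fast-forward: the target adopts the source's pointers.
        return self.commit(target, self.heads[target], self.branches[source])


catalog = ToyCatalog()
catalog.commit("main", 0, {"students": "meta/students-v1.json",
                           "payments": "meta/payments-v1.json"})

# Test a migration in isolation, touching two tables in one commit.
catalog.create_branch("migration")
catalog.commit("migration", catalog.heads["migration"],
               {"students": "meta/students-v2.json",
                "payments": "meta/payments-v2.json"})

print(catalog.branches["main"]["students"])  # meta/students-v1.json
catalog.merge("migration")
print(catalog.branches["main"]["students"])  # meta/students-v2.json
```

The `expected_head` check is the optimistic-concurrency step: a writer holding a stale view of the branch is rejected rather than silently overwriting newer work.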

Versioning unit: Catalog metadata pointers, not data files (zero-copy)
Operations: commit, branch, tag, merge, cherry-pick, revert
Atomicity: Multi-table transactions with optimistic concurrency control
Use cases: ETL isolation, A/B data testing, regulatory point-in-time queries
Engines: Iceberg-native; works with Spark, Flink, Trino, Dremio

Open source · REST API · Iceberg + Delta Lake

projectnessie.org ↗

MinIO v RELEASE.2026 · MinIO Inc.

Object Storage

High-performance, S3-compatible object storage you run on your own hardware. The substrate every other component sits on.

A lakehouse needs cheap, effectively unlimited, durable storage; historically that meant cloud object stores like S3. MinIO gives you the same S3 API, the same scaling characteristics, and the same durability guarantees, but running on bare metal in your datacenter. Your data never leaves your jurisdiction.

MinIO uses erasure coding (Reed-Solomon) to shard objects across drives. The number of drive failures it tolerates without data loss equals the parity count: four of twelve drives in an 8+4 scheme, and up to half of the drives at maximum parity. Throughput is measured in GB/s per node on commodity hardware; benchmarks routinely exceed cloud object storage on read-heavy workloads. For a lakehouse on UU PDP-regulated data, this is the only architecture that works without compromise.
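The trade-off behind an erasure-coding scheme is simple arithmetic, sketched below for the 8+4 example. The `erasure_profile` helper and the 120 TB raw figure are hypothetical and chosen only for illustration; the sketch covers capacity and read tolerance, not MinIO's write-quorum rules.

```python
def erasure_profile(data_shards, parity_shards):
    """Capacity and fault-tolerance figures for a data+parity erasure set."""
    total = data_shards + parity_shards
    return {
        "drives_per_set": total,
        "usable_fraction": data_shards / total,
        "storage_overhead": total / data_shards,
        # Reed-Solomon can rebuild an object from any `data_shards` shards,
        # so up to `parity_shards` drives may be lost.
        "tolerated_drive_losses": parity_shards,
    }


profile = erasure_profile(8, 4)
print(profile["tolerated_drive_losses"])  # 4
print(profile["storage_overhead"])        # 1.5

raw_tb = 120  # hypothetical raw capacity
usable_tb = raw_tb * profile["usable_fraction"]
print(f"{raw_tb} TB raw -> {usable_tb:.0f} TB usable")  # 120 TB raw -> 80 TB usable
```

Raising parity buys more tolerated failures at the cost of usable capacity, which is why the parity level is a configuration choice rather than a fixed property.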

API: Drop-in S3 compatibility; every Iceberg/Spark/Trino client works
Durability: Erasure coding with configurable parity (e.g. 8+4 EC scheme)
Encryption: SSE-S3, SSE-KMS, SSE-C; per-bucket key management
Replication: Active-active site replication for DR; bucket-level policies
Identity: IAM-compatible RBAC, OIDC SSO, audit logs to SIEM

S3-compatible · Bare-metal native · Sovereign-ready

min.io ↗

Step 1

Data lands in object storage

Source systems push events through Kafka and Debezium CDC. Airbyte connectors handle batch extracts. Everything is written as Parquet files into MinIO, your sovereign S3-compatible storage layer.

Airbyte · Kafka · Debezium · MinIO

Step 2

Iceberg catalogs the files

Apache Iceberg organizes those Parquet files into proper tables with ACID semantics, schema evolution, and time-travel. Project Nessie versions the catalog metadata — branches, tags, commits over your entire data estate.

Apache Iceberg · Project Nessie · Parquet

Step 3

StarRocks queries everything

StarRocks reads Iceberg tables directly from object storage with sub-second latency. BI tools, AI agents, and operational APIs all hit the same governed substrate through MySQL-compatible SQL or REST.

StarRocks MPP · Trino · DuckDB

Ø03 · Platform capabilities

What Datahive powers.

Six essential capabilities that turn fragmented data into a unified, governed, analytics-ready foundation. Each is productionized, each is open, each integrates with what you already run.

Data unification

Ingest structured, semi-structured, and unstructured data from any source — CRM, ERP, databases, SaaS APIs, flat files, telemetry — into a single governed substrate. Schema-on-read where it makes sense, schema-on-write where it matters.

300+ connectors · CDC native

Real-time processing

Ingest, enrich, and serve events in sub-second timeframes via Kafka, Apache Flink, and RisingWave. Exactly-once semantics; backpressure-aware; compatible with batch dbt models for unified streaming-batch pipelines.

Sub-second latency · Exactly-once

Transformation & modeling

Standardize, enrich, and structure data via dbt with full model lineage; SQLMesh for blue-green semantic versioning; a Cube-based semantic layer that defines metrics once and serves them consistently to every downstream tool.

dbt Core · Semantic metrics

Governance & lineage

End-to-end policy control, column-level lineage via OpenMetadata, full audit trail, and row-level security tied to your identity provider via SSO/OIDC and SCIM. UU PDP and ISO 27001 aligned out of the box.

UU PDP · Column lineage · SSO / OIDC

Analytics & activation

Sub-second interactive analytics through StarRocks and Trino; semantic modeling via Cube; native integration with Metabase, Superset, and your existing BI tooling. Reverse-ETL pushes governed data back into operational systems.

p95 < 200ms · Reverse-ETL

AI-ready foundation

Vector search via pgvector and Milvus; feature store via Feast (offline + online); LLM orchestration via LangGraph and Dify. Your governed data becomes retrieval-ready, fine-tuning-ready, and agent-ready — without a second platform.

Vector + BM25 · LangGraph

0.4TB

Data under management

Across active deployments

99.98%

Platform uptime SLA

12-month rolling average

142ms

Query p95 latency

Measured on StarRocks MPP

0×

Faster than legacy DW

Mid-size customer benchmark

Ø04 · Reference architecture

One governed foundation. Five layers, fully open.

Datahive's architecture is intentionally layered: each layer talks to its neighbors through stable, open interfaces. You can replace any single layer (a different query engine, a different orchestrator, a different BI tool) without re-architecting the rest. This is what "no lock-in" actually means in practice.

system.architecture · v2.4

Streaming Batch Federated Control

Ø05 · Intelligence engines

Five engines. One governed foundation.

Each engine is a standalone, production-grade asset — self-hostable, auditable, and independently scalable. Together, they compose into a single institutional intelligence stack.

ENG 01

Knowledge Intelligence Engine

Institutions sit on millions of pages — policies, archives, reports, regulations. This engine turns latent institutional memory into a queryable, grounded, retrieval-augmented corpus with full citation and provenance. Indonesian and English native.

Search: Hybrid vector + BM25
Languages: ID · EN native
Grounding: Citation-first

GA · v2.4 Read spec →

ENG 02

Document Intelligence Engine

Every institution ships thousands of documents — contracts, invoices, faktur pajak, academic records — most still processed by hand. This engine extracts, classifies, and routes structured data at industrial throughput with layout-aware OCR and custom schemas.

Extraction: 99.2% accuracy
Templates: Faktur Pajak ready
Throughput: 10k docs/hour

GA · v2.1 Read spec →

ENG 03

Predictive Intelligence Engine

Good strategic decisions need two things: complete data, and the ability to look ahead. Forecasting, anomaly detection, and causal inference — productionized, monitored, and versioned alongside your data pipelines.

Methods: Forecasting, causal ML
Retraining: Automated, scheduled
Monitoring: Drift & accuracy

Beta · v1.8 Join early access →

ENG 04

Enterprise Data Engine

A unified data foundation plus an institutional base model — trained on your own dataset so it speaks your internal terminology, policy hierarchy, and operational procedures. The data layer and the model layer, governed together.

Storage: StarRocks + Iceberg
Custom LLM: On your corpus
Deploy: Fully on-prem

GA · v3.0 Read spec →

ENG 05

AI Trust & Governance Layer

Running AI in an institutional setting without governance isn't just technical risk — it's legal and reputational. This layer provides UU PDP-compliant guardrails, end-to-end audit trails, role-based access control, and column-level lineage for every AI surface you build.

Compliance: UU PDP · ISO 27001
Audit: End-to-end trail
Access: RBAC · SSO · SCIM

GA · v2.0 Read spec →

Ø06 · Modular stack

Built on open standards.

Every component is swappable. No proprietary formats, no vendor lock-in. The stack below is the foundation we deploy — but each layer can be replaced independently as your needs evolve.

Layer | Component | Tools
L5 · Activation | BI & Dashboards | Metabase · Superset · Evidence
L5 · Activation | AI Agents | LangGraph · Dify · CrewAI
L4 · Semantic | Semantic Layer | Cube · MetricFlow
L4 · Governance | Catalog & Lineage | OpenMetadata · DataHub
L3 · Core | Query Engine | StarRocks · Trino · DuckDB
L3 · Core | Table Format | Apache Iceberg · Nessie
L3 · Core | Object Storage | MinIO · S3 compatible
L3 · Core | Vector & Search | pgvector · Milvus · Elastic
L2 · Process | Streaming | Apache Flink · RisingWave
L2 · Process | Transform | dbt · SQLMesh
L2 · Process | Orchestration | Dagster · Airflow
L2 · Process | Quality | Great Expectations · Soda
L1 · Ingestion | CDC | Debezium · Kafka Connect
L1 · Ingestion | Connectors | Airbyte · 300+ sources
L1 · Ingestion | Event Bus | Kafka · Redpanda
L0 · Compute | Orchestration | Kubernetes · Proxmox

Ø07 · Customer stories

Where Datahive delivers impact.

Purpose-built reference deployments across Indonesian higher education, government, and enterprise. Each built and operated under UU PDP, on infrastructure the customer owns.

Higher Education

Unified student & academic intelligence

Consolidated 150K+ student records across SIS, LMS, admissions, and payment systems into a single governed foundation. Executive dashboards now refresh in real time; OCR-powered document workflows eliminated 18 hrs/week of manual reconciliation.

"We moved from static monthly reports to a live institutional dashboard. The IT team didn't have to rip out a single existing system."

— CTO, private university (42k active students)

150K+

Students unified

24×

Query speedup

4 days → live

Report refresh

Read the case study →

Government · Environment

Environmental operations platform

Non-IoT operations platform for Jakarta's environmental agency. Seven actor roles, six integrated modules, and direct integration with SILIKA, Bank Sampah Portal, and BGMIOTA — all under a single governed data plane.

7

Actor roles

3

Gov integrations

Real-time

Field ops

Read the case study →

Enterprise · CRM

AI-native WhatsApp CRM (respon.app)

Five-layer AI-native CRM platform for Indonesian SMBs. TwentyCRM, Dify, and Evolution API unified through Datahive — WhatsApp-first channel, IDR-native billing via Xendit, fully UU PDP compliant from day one.

5

Product layers

WA-first

Channel

IDR

Native billing

Read the case study →

Ø08 · Security & compliance

Built for the standards you answer to.

Datahive is designed for the compliance environment Indonesian institutions operate in. On-premise deployment keeps sensitive data within your jurisdiction; governance is built in, not bolted on.

UU PDP: Indonesia PDP law
ISO 27001: Information security
SOC 2 Type II: Audit framework
GDPR ready: EU cross-border
OJK guidelines: Financial services

Ø09 · Developer experience

Deploy in minutes. Own it forever.

Single-binary CLI, declarative configuration, GitOps-native. Datahive runs on your infrastructure, inside your firewall, under your keys — with the same deployment ergonomics you expect from modern developer tools.

Infrastructure as code.
Data as code.

Every pipeline, every model, every access policy is defined in version-controlled YAML. Deploy the same configuration across dev, staging, and production with a single command — and roll it back with another.

✓ Single-binary CLI
One datahive command across init, deploy, migrate, query, audit. No external dependencies, no runtime installation.

✓ Declarative pipelines (.dh format)
Type-checked at plan time. Sources, transforms, sinks, schedules: all composable primitives with explicit dependencies.

✓ Semantic queries, not raw SQL
Query metrics defined once in the semantic layer, with consistent numbers across every downstream tool, every dashboard, every API.

✓ GitOps-native
Changes go through PR review. Every deploy is signed, audited, and reversible. Branch your data with Nessie like you branch your code with Git.

✓ Zero-downtime schema migrations
Iceberg + Nessie give you branching and time-travel. Test a migration on a branch, validate, merge atomically when safe.
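Declarative pipelines with explicit dependencies lend themselves to validation before anything runs. The sketch below shows one way a plan-time check could work, using Python's standard `graphlib`. The step names and the dictionary layout are hypothetical; the actual `.dh` format and the checks the CLI performs are not described here, so treat this as an illustration of the idea only.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical pipeline: step name -> names of the steps it depends on.
pipeline = {
    "sis_students": [],
    "lms_activity": [],
    "clean_students": ["sis_students"],
    "enrollment_by_faculty": ["clean_students", "lms_activity"],
    "starrocks_sink": ["enrollment_by_faculty"],
}


def plan(steps):
    """Validate dependencies and return a safe execution order."""
    for name, deps in steps.items():
        missing = [d for d in deps if d not in steps]
        if missing:
            raise ValueError(f"{name} depends on undefined step(s): {missing}")
    try:
        # static_order() yields every step after all of its dependencies.
        return list(TopologicalSorter(steps).static_order())
    except CycleError as err:
        raise ValueError(f"dependency cycle: {err.args[1]}") from err


order = plan(pipeline)
print(order)
```

A missing source or a circular dependency is reported at plan time, before any job is scheduled; the relative order of independent steps is not fixed.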

~ datahive-cli · zsh

deploy query audit

user@datahive:~/infra/prod$ datahive init --profile enterprise
→ Initializing Datahive control plane...
✓ Kubernetes context dh-prod-jkt-01 (24 nodes)
✓ Provisioning Iceberg catalog on MinIO
✓ StarRocks topology deployed (3 FE, 12 BE)
✓ Nessie branch main initialized
✓ UU PDP governance layer activated

user@datahive:~/infra/prod$ datahive pipeline deploy ./enrollment.dh
▸ parsing pipeline definition...
✓ Found 4 sources, 12 transforms, 3 sinks
✓ Schema validation passed · zero breaking changes
✓ Deployed to dh-runtime-prod-01

user@datahive:~/infra/prod$ datahive query --semantic "enrollment by faculty Q3"
→ resolving semantic layer ...
→ compiled to StarRocks SQL
┌─────────────────────────┬────────┬──────────┐
│ faculty                 │ count  │ yoy      │
├─────────────────────────┼────────┼──────────┤
│ Teknik                  │ 12,847 │ +8.2%    │
│ Bisnis Ekonomi          │ 18,230 │ +12.4%   │
│ Kedokteran              │  4,291 │ +3.1%    │
│ Hukum                   │  6,104 │ −1.8%    │
└─────────────────────────┴────────┴──────────┘
✓ 4 rows · executed in 127ms

user@datahive:~/infra/prod$
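The semantic query in the transcript above is resolved against a metric definition and compiled to SQL. A minimal sketch of that pattern follows. It is not the Datahive compiler or the Cube API: the `Metric` class, the `academic.enrollments` table, the `student_id` and `term` columns, and the generated SQL are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    name: str
    table: str
    expression: str    # SQL aggregate, defined exactly once
    dimensions: tuple  # columns callers are allowed to group by


# Hypothetical registry; a real semantic layer would load this from YAML.
METRICS = {
    "enrollment": Metric(
        name="enrollment",
        table="academic.enrollments",
        expression="COUNT(DISTINCT student_id)",
        dimensions=("faculty", "term"),
    ),
}


def compile_metric(metric_name, by, where=None):
    """Turn a metric request into a SQL string; reject unknown dimensions."""
    metric = METRICS[metric_name]
    if by not in metric.dimensions:
        raise ValueError(f"{by!r} is not a dimension of {metric_name!r}")
    sql = f"SELECT {by}, {metric.expression} AS {metric.name} FROM {metric.table}"
    if where:
        # Illustration only: real code should bind parameters, not splice text.
        sql += f" WHERE {where}"
    return sql + f" GROUP BY {by}"


sql = compile_metric("enrollment", by="faculty", where="term = 'Q3'")
print(sql)
```

Because every caller goes through the same definition, a dashboard and an API that both ask for "enrollment by faculty" get the same aggregate expression.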

Ø10 · Engagement tiers

Built around the scale of your ambition.

Datahive is sold as an outcome, not a SKU. Every engagement is scoped to your data estate, regulatory profile, and operational scale — and priced in IDR with predictable annual commitments. Below are the three tiers our customers typically map into; the right fit takes a 30-minute conversation.

Foundation

For growing data teams

Investment level Entry tier

A production-ready open lakehouse for organizations getting serious about unified data. Designed for teams who want enterprise architecture without enterprise overhead.

Mid-scale data volumes, room to grow
StarRocks + Iceberg + MinIO core stack
Curated connector library · Airbyte OSS
Standard production SLA
Business-hours support · named contact
UU PDP baseline compliance package
Quarterly architecture review

Request scoping call →

Institutional

For enterprises & institutions

Investment level Tailored to scale

The full platform with all five intelligence engines, dedicated infrastructure, and a named solutions team. The reference deployment for higher-ed, government agencies, and mid-market enterprises.

Unbounded data & concurrent users
All five intelligence engines, GA
Premium connector tier · 300+ sources
Mission-critical SLA · 24×7 incident response
Dedicated solutions engineer
UU PDP + ISO 27001 + SOC 2 attestations
Custom base-model training included
Joint quarterly business review

Book a strategy session →

Sovereign

For regulated & sovereign workloads

Investment level Custom engagement

Fully air-gapped deployment on hardware you own, in datacenters you operate. Designed for the regulatory environment of financial services, healthcare, defense, and central government.

Air-gapped on-premise installation
Full source access & reproducible builds
On-site engineering residency option
Sovereign-grade SLA & runbooks
OJK / Bank Indonesia guideline alignment
Custom compliance & audit attestations
Hardware sourcing & commissioning available

Speak with our team →

Pricing model

Annual commitment, billed in IDR. Predictable line-items: platform license, infrastructure footprint, support tier, professional services. No per-query, per-user, or per-seat charges.

What shapes the investment

Data volume under management, number of source systems, expected concurrency, deployment topology (single-site, HA, multi-DC), and the depth of UU PDP / sectoral compliance attestations you need.

How we quote

A 30-minute discovery call, followed by a written proposal within five business days. Every proposal includes scope, milestones, success criteria, and a fixed price for the first phase.

Ø11 · Common questions

Frequently asked.

Everything we're asked in the first sales call — compressed. If you need depth on any of these, our solutions team is one email away.

Is Datahive really fully on-premise? Is anything phoned home?

Yes, fully on-premise. Once deployed, the runtime does not require any outbound connectivity to our infrastructure to operate. Telemetry is opt-in and aggregated; source access means you can audit every outbound call yourself with tcpdump or your own SIEM. Air-gapped installation is a first-class deployment mode on Sovereign plans.

How is Datahive different from Databricks or Snowflake?

Datahive is built on the same open standards Databricks embraces (Iceberg, Parquet, Spark), but deploys to your infrastructure with no proprietary runtime. Snowflake is a managed SaaS warehouse; Datahive is an on-premise open lakehouse designed for data residency, regulatory compliance, and cost predictability in the Indonesian market. The two are not mutually exclusive — several customers use Datahive as the sovereign tier alongside cloud warehouses for non-sensitive workloads, with the same Iceberg tables federated across both.

Why Apache Iceberg specifically? Why not Delta Lake or Hudi?

Iceberg is the only major table format with a truly open catalog specification (REST + multi-engine), no historical Spark coupling, and a metadata architecture that scales to billions of files via manifest pruning. Delta Lake originated at Databricks and has the deepest Spark integration, but its catalog ecosystem is narrower. Hudi optimizes for streaming upsert workloads but sacrifices schema flexibility. For an open lakehouse where you'll mix StarRocks, Trino, Flink, and Spark, Iceberg is the lowest-risk choice — and it's converging with the others on features.

What does UU PDP compliance actually mean in the product?

Personal data is tagged, classified, and column-level access-controlled. Every read and write is logged to an append-only audit trail with user, timestamp, and query context. Data residency is guaranteed by on-premise deployment. Data subject requests (akses, koreksi, penghapusan) are supported as first-class CLI operations. We provide a full UU PDP control mapping document on request, mapped to specific platform features.
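As a rough illustration of how column-level control and an append-only audit trail can fit together, consider the sketch below. It is not the product's implementation: the column tags, the `read_row` helper, and the sample row are hypothetical, and hash-chaining the log entries is one possible way to make an append-only log tamper-evident, assumed here rather than taken from the platform.

```python
import hashlib
import json
from datetime import datetime, timezone

# Hypothetical classification: which columns hold personal data.
PERSONAL_COLUMNS = {"nik", "phone"}


def _digest(body):
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


class AuditLog:
    """Append-only log; each entry carries the hash of the previous one."""

    def __init__(self):
        self._entries = []

    def append(self, user, query):
        prev = self._entries[-1]["hash"] if self._entries else "0" * 64
        body = {
            "user": user,
            "query": query,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "prev": prev,
        }
        self._entries.append({**body, "hash": _digest(body)})

    def verify(self):
        """Recompute the chain; an edited entry, or one removed from the middle, breaks it."""
        prev = "0" * 64
        for entry in self._entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if entry["prev"] != prev or entry["hash"] != _digest(body):
                return False
            prev = entry["hash"]
        return True


def read_row(row, user, can_see_personal, log):
    """Log the read, then mask personal columns the caller may not see."""
    log.append(user, f"read columns {sorted(row)}")
    return {
        col: ("***" if col in PERSONAL_COLUMNS and not can_see_personal else val)
        for col, val in row.items()
    }


log = AuditLog()
row = {"student_id": 101, "nik": "EXAMPLE-NIK", "phone": "EXAMPLE-PHONE", "faculty": "Hukum"}
masked = read_row(row, "analyst", False, log)
print(masked)        # {'student_id': 101, 'nik': '***', 'phone': '***', 'faculty': 'Hukum'}
print(log.verify())  # True
```

The read is recorded before any data is returned, so even a masked or failed access leaves a trace with user, timestamp, and query context.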

How long does a typical deployment take?

A Foundation deployment can be operational in 2–3 weeks. Institutional deployments typically run 6–10 weeks including source system integration, data modeling, and semantic layer definition. Sovereign engagements are scoped individually. We publish our delivery methodology and reference Gantt with every proposal.

Can we bring our own cluster, or do you ship hardware?

Both. Datahive runs on any Kubernetes cluster — your existing one, or a Proxmox/VMware setup we help architect. For Sovereign deployments, we can source, configure, and ship pre-imaged hardware to your datacenter with full burn-in and commissioning.

What happens if we want to leave?

Your data stays in open formats (Iceberg + Parquet) on your object storage. Your pipelines are declarative .dh files in your Git repo. Your semantic layer is portable YAML. There is no proprietary lock-in to remove — you own the substrate already. We help you transition, if that day comes.

Ready for production

Your data foundation,
engineered for the AI era.

Deploy on-premise. Own every byte. Ship faster. Built in Indonesia — open everywhere.

Book a demo → Read the technical brief ↗ Talk to an engineer ↗

30-minute demo No obligation NDA available on request
