Data 360 Ingestion: Style Guide

This is the page where Cleon stops describing what Data 360 (formerly Data Cloud) ingestion is and starts saying what we do with it. Salesforce defines the sources and the modes. The reference pages in this subcategory document each one — the Data Stream, the connectors, the Ingestion API, full refresh versus upsert — and the gotchas document where a clean-looking ingest quietly lands the wrong data. This Style Guide is the discipline that keeps a Data Stream trustworthy, because everything downstream — identity, query, segmentation, the agents you ground on it — inherits whatever you ingested.

Use it as a checklist before any new Data Stream ships. The rules are short on purpose — when a rule needs an explanation, the explanation is in the page it links to. One question sits at the center, and it decomposes into three you answer in order.

The central decision: how should this source land?

Every new source forces the same decision, and getting it wrong is a rebuild, not a setting you flip. A stream landed on the wrong cadence is a latency bug or a wasted bill; a stream landed on the wrong mode quietly keeps records that should be gone. The decision has three parts, and they have an order — freshness, then mode, then cost. Answer them out loud before you connect the stream.

1. Streaming or scheduled batch?

Decide by the freshest downstream decision the data feeds — not by "fresher is better" (principle 6). If a downstream consumer is genuinely wrong on stale data — a real-time activation, an agent answering about a customer mid-session — the source needs to arrive continuously, and that means a streaming path: the Ingestion API's Streaming pattern or the Web/Mobile SDK. If the freshest decision the data feeds is fine on a schedule, it's scheduled batch — a connector, an S3 or cloud-storage drop, or the Ingestion API's Bulk pattern. Streaming a source a nightly batch would serve is wasted cost; refreshing daily behind a real-time decision is a latency bug. Name the freshest decision first, then pick the path that meets it.
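The rule above can be sketched as a small decision helper. This is a minimal illustration, not a Data 360 feature: the function names and the 15-minute threshold are assumptions made up for the example, and the real cut-off depends on what each consumer can tolerate.

```python
from datetime import timedelta

# Hypothetical threshold: a consumer that cannot tolerate data older than this
# is treated as needing a streaming path. The value is an assumption.
STREAMING_THRESHOLD = timedelta(minutes=15)


def choose_path(freshest_decision_tolerance: timedelta) -> str:
    """Pick a path from the staleness the freshest downstream decision tolerates."""
    if freshest_decision_tolerance <= timedelta(0):
        raise ValueError("tolerance must be positive")
    if freshest_decision_tolerance < STREAMING_THRESHOLD:
        return "streaming"
    return "scheduled batch"


def path_for_source(consumer_tolerances: dict[str, timedelta]) -> str:
    """A source follows its most demanding consumer, not the average one."""
    return choose_path(min(consumer_tolerances.values()))


consumers = {
    "weekly report": timedelta(days=7),
    "in-session agent": timedelta(seconds=5),
}
print(path_for_source(consumers))  # streaming
```

The point of the shape is the `min`: one genuinely real-time consumer is enough to move the source to streaming, and without one the answer is scheduled batch.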

2. Full refresh or upsert?

This is the correctness fork, and it has a prerequisite most teams skip: you cannot choose upsert until you can name the primary key and the delete story (refresh modes). Upsert inserts or updates on a stable, unique key. It is lighter and incremental, but a wrong or non-unique key silently duplicates or overwrites, and on its own upsert does not remove deleted source records. A record gone from the source simply stops arriving, and "stops arriving" is the one case upsert leaves untouched, so the deleted record lingers, indistinguishable from a live one, until you send an explicit delete. Full refresh re-lands the whole set each run: it is simpler and it captures deletes by absence, but it is heavier on a large source (principle 11). So before you choose: can you name a key that is genuinely unique and stable, and if the source can delete records, how does that delete reach Data 360? If you can't answer both, you are not ready for upsert.
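To make the delete trap concrete, here is a toy simulation of the two modes over in-memory dictionaries. It assumes nothing about how Data 360 implements either mode; `upsert` and `full_refresh` are hypothetical helpers that only mirror the behavior described above.

```python
def upsert(landed: dict, batch: list[dict], key: str) -> dict:
    """Insert-or-update by key. Records absent from the batch are left alone."""
    result = dict(landed)
    for record in batch:
        result[record[key]] = record
    return result


def full_refresh(batch: list[dict], key: str) -> dict:
    """Replace the landed set with exactly what the source sent this run."""
    return {record[key]: record for record in batch}


day1 = [
    {"id": "A", "email": "a@example.com"},
    {"id": "B", "email": "b@example.com"},
]
# On day 2 the source updated A and deleted B, so B simply stops arriving.
day2 = [{"id": "A", "email": "a2@example.com"}]

landed_upsert = upsert(upsert({}, day1, "id"), day2, "id")
landed_full = full_refresh(day2, "id")

print(sorted(landed_upsert))  # ['A', 'B']  B lingers as a ghost record
print(sorted(landed_full))    # ['A']       the delete is captured by absence
```

Both runs would report success. Only the contents differ, which is why the delete story has to be decided before the mode is.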

3. Does the cadence match the decision — and is the processing cost justified?

The first two answers set a cadence; this question makes you defend it (principles 6 + 11). The cadence has to match the freshest decision the data feeds — and the processing it implies has to be worth it. A daily full refresh of a million-row source behind a weekly report is a cost decision disguised as a schedule setting; an hourly refresh feeding a decision that's reviewed once a day pays for freshness nobody consumes. Set the cadence to the decision, confirm the mode and volume make that cadence affordable, and write the cadence down next to the stream so the next person doesn't assume the DLO is current when it's on a 24-hour clock.
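The million-row example is easy to put numbers on. The sketch below uses rows moved per week as a rough stand-in for processing cost, which is an assumption for illustration only. The 20,000 changed rows per day used for the upsert case is also made up.

```python
def rows_processed(rows_per_run: int, runs_in_window: int) -> int:
    """Rows moved over a window: a rough proxy for processing, not a price."""
    return rows_per_run * runs_in_window


SOURCE_ROWS = 1_000_000
ASSUMED_DAILY_DELTA = 20_000  # hypothetical number of changed rows per day

daily_full_refresh = rows_processed(SOURCE_ROWS, runs_in_window=7)
weekly_full_refresh = rows_processed(SOURCE_ROWS, runs_in_window=1)
daily_upsert = rows_processed(ASSUMED_DAILY_DELTA, runs_in_window=7)

print(daily_full_refresh)   # 7000000 rows per week
print(weekly_full_refresh)  # 1000000 rows per week
print(daily_upsert)         # 140000 rows per week
```

Behind a weekly report, the daily full refresh moves seven times the rows of a weekly one for no gain in the decision it feeds.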

This is the same freshness discipline the Segmentation Style Guide applies one layer out — set the cadence to the freshest decision the data feeds, never to "as fresh as possible" — applied at the front of the lifecycle instead of the back. Ingestion is where it starts: a stale or wrongly-keyed ingest surfaces as a downstream bug three layers away (ingestion and the lifecycle).

Mode discipline

Name the primary key before you choose upsert — it's the contract, not a detail

The key is what tells the stream whether an incoming record is one it already has or a new one (refresh modes). A non-unique key (two real records share a value) means one upsert silently overwrites the other and you lose a record with no error. A wrong key (a field that isn't actually stable) means the same record arrives looking new and silently duplicates. Both look like a healthy green run. Verify the key is genuinely unique and stable against real data before the stream goes live — this is principle 1, model the keys, applied at ingestion (relationships & keys).
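A key check against real data can be as small as the sketch below. It assumes you can pull two snapshots of the source as lists of dictionaries, and that some independent identifier exists to recognize the same record across snapshots (here a hypothetical `row` field). The helper names are illustrative, not part of any Data 360 tooling.

```python
from collections import Counter


def duplicate_keys(records: list[dict], key: str) -> list:
    """Key values shared by more than one record: upsert would overwrite these."""
    counts = Counter(record[key] for record in records)
    return sorted(value for value, n in counts.items() if n > 1)


def unstable_keys(old: list[dict], new: list[dict], key: str, identity: str) -> list:
    """Records whose candidate key changed between snapshots: upsert would duplicate these."""
    old_key_by_identity = {record[identity]: record[key] for record in old}
    return sorted(
        record[identity]
        for record in new
        if record[identity] in old_key_by_identity
        and old_key_by_identity[record[identity]] != record[key]
    )


snapshot_1 = [
    {"row": 1, "email": "a@example.com"},
    {"row": 2, "email": "b@example.com"},
    {"row": 3, "email": "b@example.com"},  # two real records share an email
]
snapshot_2 = [
    {"row": 1, "email": "a.new@example.com"},  # same record, key value changed
    {"row": 2, "email": "b@example.com"},
]

print(duplicate_keys(snapshot_1, "email"))                     # ['b@example.com']
print(unstable_keys(snapshot_1, snapshot_2, "email", "row"))   # [1]
```

In this sample `email` fails both tests, so it is not a usable upsert key even though a stream keyed on it would run green.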

Settle the delete story as half of the mode decision, not a later patch

A full refresh captures deletes by absence; an upsert does not handle them at all on its own. If a stream is on upsert and the source can delete records, you need a deliberate delete strategy before it goes live — otherwise the stale record flows downstream into identity resolution, segments, and activations, and someone gets contacted who should have been removed, with no error the whole way. If the source can delete and there's no way to signal those deletes, full refresh is the safer mode despite its cost.
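One possible delete strategy, sketched under the assumption that the source can export its full list of live keys each run: diff the key sets and turn the missing keys into explicit deletes. How those deletes are then sent to Data 360 is deliberately left out. The helpers below are hypothetical and only show the diff.

```python
def keys_to_delete(previous_source_keys: set, current_source_keys: set) -> set:
    """Keys present last run and absent now: these need an explicit delete."""
    return set(previous_source_keys) - set(current_source_keys)


def apply_deletes(landed: dict, keys: set) -> dict:
    """Return the landed set without the deleted keys (stand-in for the real delete call)."""
    return {k: v for k, v in landed.items() if k not in keys}


landed = {
    "A": {"id": "A", "email": "a@example.com"},
    "B": {"id": "B", "email": "b@example.com"},
}
gone = keys_to_delete(previous_source_keys={"A", "B"}, current_source_keys={"A"})
landed = apply_deletes(landed, gone)

print(sorted(gone))    # ['B']
print(sorted(landed))  # ['A']
```

If the source cannot produce a key list or a delete feed of some kind, this diff has nothing to work from, which is the case where full refresh is the safer mode.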

Land the source raw — don't reshape it inside the stream

A Data Stream lands a DLO (__dll); it does not model anything. The instinct to "fix" a messy source by reshaping it inside the stream is the wrong layer — land it raw, then reshape on the way to the DMO where the transformation is visible and documented. That modeling step belongs to Data Architecture, not ingestion (mapping); a stream that quietly cleans data is a stream whose logic nobody downstream can see (principle 2).

Category and cost discipline

Set the stream category to what the data is, not what's convenient

A stream's category — Profile, Engagement, or Other — constrains how the data behaves downstream, and it's the first modeling decision you make even though it happens at ingestion (data streams). Event data lands as Engagement (time-series, and it requires an event-time field), not as Profile. The classic mistake is landing events as a Profile stream: it ingests, it looks fine, and then time-windowed segmentation and engagement metrics have nothing to stand on. Pick the category by the data's real shape, before the stream connects (principle 1).
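A pre-connect sanity check on category might look like the sketch below. The heuristic (several timestamped rows per subject suggests event data) and the function names are assumptions for illustration. The category names are the three listed above.

```python
from collections import Counter


def looks_like_events(records: list[dict], subject_key: str, time_field: str) -> bool:
    """Heuristic only: several timestamped rows per subject suggests event data."""
    if not records or any(r.get(time_field) is None for r in records):
        return False
    counts = Counter(r[subject_key] for r in records)
    return max(counts.values()) > 1


def check_category(category: str, records: list[dict], subject_key: str, time_field: str) -> list[str]:
    """Return a list of problems. An empty list means no objection from this check."""
    problems = []
    if category == "Profile" and looks_like_events(records, subject_key, time_field):
        problems.append("event-shaped data landed as Profile")
    if category == "Engagement" and any(r.get(time_field) is None for r in records):
        problems.append("Engagement stream has no event-time field")
    return problems


clicks = [
    {"customer": "A", "ts": "2024-01-01T10:00:00Z"},
    {"customer": "A", "ts": "2024-01-01T10:05:00Z"},
]

print(check_category("Profile", clicks, "customer", "ts"))
# ['event-shaped data landed as Profile']
print(check_category("Engagement", clicks, "customer", "ts"))
# []
```

A sample of real rows is enough to run this. It will not catch every case, but it flags the classic mistake before the stream connects.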

Match the cadence to the decision, and revisit the expensive streams

Cost in Data 360 scales with what you process, not what you store (principle 11), and a stream's processing is whatever each refresh moves. A large source on a frequent full refresh is a recurring bill; an over-frequent stream feeding a decision nobody reviews that often is freshness paid for and not consumed. Set the cadence to the freshest decision the data feeds, prefer upsert's lighter delta when you have a trustworthy key and the volume justifies it, and revisit the streams that move the most — the cheapest refresh is the one you didn't need to run.

Prefer a connector before the Ingestion API; verify the org's connector list

If a packaged connector covers the source, use it — it's configured, not coded, and it handles auth and schema discovery for you. Reach for the Ingestion API only when no connector fits: a custom app, an internal service, a homegrown event stream. And because the connector catalog moves release over release, confirm what the org actually has before you design around any single connector; a design that names a connector the org can't enable is a rebuild you discover late.

Patterns to prefer

The freshest downstream decision named first, then the streaming-or-batch path chosen to meet it — not "fresher is better."
Scheduled batch on full refresh as the default, moved off only when you can justify streaming or name a key for upsert.
The primary key named and verified unique-and-stable before upsert is chosen, against real data.
The delete story settled as half the mode decision — how a source delete reaches Data 360, or full refresh instead.
The stream category set to the data's real shape — event data as Engagement with its event-time field, not Profile.
A connector preferred over the Ingestion API when one exists, with the org's available list verified first.
The refresh cadence written next to the stream, not held in someone's memory.

Patterns to refuse

A daily-refreshed stream behind a real-time decision — a latency bug that surfaces three layers downstream.
A streaming source where a nightly batch would serve — cost paid for freshness nobody consumes.
Upsert chosen without a named key — a green run that silently duplicates or overwrites and loses records with no error.
Upsert on a source that can delete, with no delete signal — ghost records that linger and get someone contacted who should have been removed.
Event data landed as a Profile stream — it ingests fine, then time-windowed segmentation has nothing to stand on.
A source reshaped inside the stream instead of landed raw and modeled on the way to the DMO, where the logic is visible.
A build hard-wired to a connector nobody confirmed the org has enabled.

The pre-ship checklist

The freshest downstream decision the data feeds is named, and the streaming-or-batch path was chosen to meet it — not "as fresh as possible."
The refresh mode was chosen on purpose — upsert only if a unique, stable primary key is named; full refresh otherwise.
If upsert: the primary key is verified genuinely unique and stable against real data, not assumed.
The delete story is settled — full refresh captures deletes by absence, or an explicit delete strategy reaches Data 360 for an upsert source that can delete.
The stream category matches the data's real shape — event data as Engagement with an event-time field, not Profile.
The cadence matches the freshest decision the data feeds, the processing cost is justified, and the cadence is written down next to the stream.
A packaged connector is preferred over the Ingestion API where one exists, and the org's available connector list was verified — not assumed from a page.
The source lands raw as a DLO; any reshaping is deferred to the DLO→DMO mapping in Data Architecture, where it's visible.

When every item holds, the Data Stream is ready to ship.
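If it helps to keep the checklist next to the stream rather than in someone's memory, it can be recorded as a small structure. This is a sketch with made-up field names that paraphrase the eight items above. It is not a Data 360 artifact.

```python
from dataclasses import dataclass, fields


@dataclass
class PreShipChecklist:
    # Field names are hypothetical paraphrases of the checklist items.
    freshest_decision_named: bool = False
    refresh_mode_chosen_on_purpose: bool = False
    primary_key_verified_if_upsert: bool = False
    delete_story_settled: bool = False
    category_matches_shape: bool = False
    cadence_matched_costed_and_written_down: bool = False
    connector_list_verified: bool = False
    lands_raw_as_dlo: bool = False

    def open_items(self) -> list[str]:
        """Names of the items that are still unchecked."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def ready_to_ship(self) -> bool:
        return not self.open_items()


checklist = PreShipChecklist(freshest_decision_named=True, delete_story_settled=True)
print(checklist.ready_to_ship())    # False
print(len(checklist.open_items()))  # 6
```

Defaulting every item to `False` means a stream is not ready until each item has been answered on purpose.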

Ingestion gotchas — the silent failures these rules prevent, the production version
Data Streams — the unit of ingestion every rule here is configured on: source, category, schedule
Connectors — the sources a stream ingests from, and why you verify the org's list before designing
The Ingestion API — Streaming versus Bulk, the programmatic path when no connector fits
Refresh modes — full refresh versus upsert, the primary key, and the deletes trap in depth
Ingestion and the lifecycle — why everything downstream inherits what you ingest here
Debugging ingestion — when a stream lands wrong: missing or duplicate records, wrong counts, a stream that won't refresh
Data 360 principles from production — the meta-rules above these specifics (1, 6, 11)

If you spot a rule missing — or one of these rules being violated in our public work — write to hello@wearecleon.com. We add it, or we fix it and we say so.