Ingestion gotchas: the silent failures at the front of the pipeline

Ingestion is the easy demo and the hard production system. In a sandbox with one clean source and a few thousand rows, a Data Stream looks like a checkbox: point it at a source, pick a schedule, watch the rows land, move on. The cost shows up later, at scale, on real data that changes and deletes and arrives late — and almost never as an error. Ingestion fails quietly. The stream runs green, the row count looks plausible, the last-run timestamp is recent, and the failure is a deleted customer still being mailed, a count that double-counts, or a "missing data" incident that is really an expired credential three weeks old.

This page collects seven ingestion choices that bit Cleon's Data 360 (formerly Data Cloud) builds, synthesized with Salesforce's official guidance and the corrections the practitioner community learned the hard way. Each is paired with the instinct that leads you in, what actually happens in production, and the fix. The throughline is the one this subcategory keeps returning to: ingestion is the front of the lifecycle, and everything downstream inherits whatever you landed. A quiet ingestion mistake therefore surfaces as an identity, segment, or activation bug three layers away, long after the stream that caused it stopped looking suspect.

The gotchas

1. Choosing upsert for efficiency — it keeps deleted source records forever, because nothing told it they left

The instinct is correct about cost and wrong about completeness: the source is large, a full refresh moves the whole set every run, so you pick upsert to land only the delta. It works exactly as advertised for inserts and updates — each run carries what changed, the DLO stays current, the processing bill drops. What it does not do is notice a record that left the source, because upsert only ever touches a record that is present in the run, and a deleted record is precisely the one that stops arriving.

What actually happens in production is a DLO that accumulates ghosts. A customer deleted from the CRM, a product pulled from the catalog, a contact who exercised a deletion request — none of them disappear from an upsert-fed DLO on their own. They linger, indistinguishable from live records, and flow downstream into identity resolution, segments, and activations. The first signal is usually a human one: someone gets contacted who should have been removed, or a count includes people the source no longer has. Nothing errored; the run was green; the data is wrong.
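
The mechanism can be sketched in a few lines. This is a toy model, not Data 360 code: the dict standing in for the DLO, the `upsert` and `ghost_keys` helpers, and the customer records are all hypothetical, chosen only to show why a deleted record survives an upsert.

```python
def upsert(dlo, run_records, key):
    """Insert or update every record in the run; anything absent from the run is left untouched."""
    for record in run_records:
        dlo[record[key]] = record
    return dlo


def ghost_keys(dlo, source_keys):
    """Keys still landed in the toy DLO that the source no longer has."""
    return sorted(set(dlo) - set(source_keys))


dlo = {}
# Run 1: both customers exist at the source.
upsert(dlo, [{"id": "C-001", "tier": "gold"}, {"id": "C-002", "tier": "basic"}], key="id")
# Run 2: C-002 was deleted at the source and C-001 changed, so only C-001 arrives.
upsert(dlo, [{"id": "C-001", "tier": "platinum"}], key="id")

ghosts = ghost_keys(dlo, source_keys=["C-001"])
print(ghosts)  # ['C-002']
```

Both runs "succeed", and the update to C-001 lands correctly. The ghost only shows up when the landed keys are compared against the source's current keys, which is a check someone has to decide to run.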

The fix is to settle the deletes question before you choose upsert, not after the first ghost surfaces. Full refresh handles deletes by absence; upsert does not, so an upsert-fed stream needs an explicit, written answer to how a record that left the source also leaves the DLO. If you cannot name that answer, the stream is not ready for upsert (refresh modes covers the deletes behavior).

2. Putting a huge source on a frequent full refresh — it's a cost decision wearing a schedule setting's clothes

The instinct is to reach for full refresh because it's the simpler mode to reason about — no key to get right, deletes handled by absence — and then to set the cadence high because "fresher is better." Both halves feel like the safe, thorough choice. On a small source they are. On a large one, the combination quietly becomes the most expensive line item in the build, and nobody decided it on purpose.

What actually happens is that every run re-lands and re-processes the entire dataset, even when one row changed, and Data 360 bills on the work it processes, not the data at rest (principle 11). A multi-million-row source on an hourly full refresh moves all those rows every hour, all day, whether or not anything changed. The schedule field looks innocent — it's a dropdown next to the stream — but it is a throughput decision, and an over-frequent full refresh of a large source is a cost decision disguised as a schedule setting. The bill is the kind of thing discovered in a usage review, not in a green run.

The fix is to size the cadence to the freshest decision the data actually feeds, then weigh the mode against the volume. If the downstream decision is a nightly batch, an hourly refresh buys nothing and costs every hour. If the source is large and mostly append/update with a stable key, upsert moves only the delta and is the cheaper mechanism — provided you've settled the deletes question (gotcha 1). Write the cadence down next to the stream with the decision it serves, so the next person doesn't inherit an expensive default nobody chose (principles 6 + 11; the Style Guide frames this as its third question).
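
A back-of-the-envelope comparison makes the trade-off concrete. The sketch below counts rows moved per day as a rough proxy for processing work; it is not a billing formula, and the source size, delta size, and function name are illustrative assumptions.

```python
def rows_processed_per_day(total_rows, changed_rows_per_run, runs_per_day, mode):
    """Rows moved per day under a refresh mode (a proxy for processing work, not a price)."""
    if mode == "full_refresh":
        return total_rows * runs_per_day
    if mode == "upsert":
        return changed_rows_per_run * runs_per_day
    raise ValueError(f"unknown mode: {mode}")


TOTAL_ROWS = 5_000_000   # hypothetical source size
CHANGED_PER_RUN = 2_000  # hypothetical delta per run

hourly_full = rows_processed_per_day(TOTAL_ROWS, CHANGED_PER_RUN, 24, "full_refresh")
nightly_full = rows_processed_per_day(TOTAL_ROWS, CHANGED_PER_RUN, 1, "full_refresh")
hourly_upsert = rows_processed_per_day(TOTAL_ROWS, CHANGED_PER_RUN, 24, "upsert")

print(hourly_full)    # 120000000
print(nightly_full)   # 5000000
print(hourly_upsert)  # 48000
```

Under these assumed numbers, moving from hourly to nightly cuts the work by a factor of 24, and moving from full refresh to upsert cuts it by orders of magnitude. Which lever is available depends on the decision the data feeds and on whether a trustworthy key exists.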

3. Keying an upsert on a field that isn't truly unique — it silently overwrites or duplicates, and the run still goes green

The instinct is to pick a primary key that looks unique enough — an email, a name-plus-zip, an external ID you assume is stable — and trust the upsert to reconcile on it. The editor accepts it, the first run lands, the count looks right. The key is the entire contract upsert runs on, and a key that is almost unique fails in the worst way: not loudly, but by silently corrupting the set it was supposed to keep clean.

What actually happens splits two ways, both invisible. A non-unique key — two genuinely different records share the value — means the stream cannot tell them apart, so one upsert overwrites the other and a real record vanishes with no error. A wrong key — you keyed on a field that isn't actually stable, so the same record arrives looking new each run — means the record duplicates on every run, and the DLO inflates with copies the platform believes are distinct. Both present as a healthy green run. The damage surfaces downstream as a count that won't reconcile or a profile missing data it should have, and the grain, not the data, is the cause.
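
Both failure modes can be reproduced with a toy upsert. As before, the dict stands in for a DLO and the `upsert` helper, field names, and records are hypothetical; the point is only that neither case raises anything.

```python
def upsert(dlo, run_records, key):
    """Insert or update each record under its key value; no error on collisions."""
    for record in run_records:
        dlo[record[key]] = record
    return dlo


# Case 1: non-unique key. Two different people share one email address.
shared = {}
upsert(shared, [
    {"source_id": 1, "name": "Ana Ruiz", "email": "family@example.com"},
    {"source_id": 2, "name": "Luis Ruiz", "email": "family@example.com"},
], key="email")
print(len(shared))  # 1: the first record was overwritten by the second

# Case 2: unstable key. The same source record gets a fresh export ID each run.
unstable = {}
upsert(unstable, [{"export_id": "run1-0001", "source_id": 7}], key="export_id")
upsert(unstable, [{"export_id": "run2-0001", "source_id": 7}], key="export_id")
print(len(unstable))  # 2: one real record, landed twice
```

In the first case a real person disappears; in the second the count inflates by one per run. Every call returns normally, which is the toy equivalent of a green run.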

The fix is to treat the key as a prerequisite you verify, not a field you fill in. Name the key before you choose upsert — principle 1, model the keys, applied at ingestion — and confirm against real data that it is genuinely unique (no two distinct records share it) and genuinely stable (the same record carries the same value run to run). If you cannot name such a key, you are not ready for upsert; use full refresh until you can (refresh modes, relationships & keys).
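
One way to run that verification, assuming you can pull two consecutive extracts of the source into memory, is sketched below. The helper names are hypothetical, and the stability check assumes the source exposes some identifier you already trust (`trusted_id`) to compare the candidate key against.

```python
from collections import Counter


def duplicate_key_values(records, key):
    """Candidate-key values that appear on more than one record (uniqueness check)."""
    counts = Counter(record[key] for record in records)
    return sorted(value for value, n in counts.items() if n > 1)


def unstable_records(previous_run, current_run, trusted_id, key):
    """Trusted IDs whose candidate-key value changed between runs (stability check)."""
    previous = {record[trusted_id]: record[key] for record in previous_run}
    return sorted(
        record[trusted_id]
        for record in current_run
        if record[trusted_id] in previous and previous[record[trusted_id]] != record[key]
    )


run_1 = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "a@example.com"},
    {"id": 3, "email": "b@example.com"},
]
run_2 = [
    {"id": 1, "email": "a@example.com"},
    {"id": 3, "email": "b.new@example.com"},
]

dupes = duplicate_key_values(run_1, "email")
changed = unstable_records(run_1, run_2, trusted_id="id", key="email")
print(dupes)    # ['a@example.com']
print(changed)  # [3]
```

Either list being non-empty means email fails as an upsert key for this sample. Two clean runs do not prove a key is safe, but a failing sample is enough to rule it out.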

4. Setting the cadence by "fresher is better" — a daily refresh behind a real-time decision is a latency bug

The instinct treats freshness as a single dial where higher is always better, so the cadence gets set by habit or by the source's convenience rather than by what consumes the data. Sometimes that means too frequent (gotcha 2's cost problem); just as often it means too slow for what the data feeds — a daily or twice-daily refresh sitting under a decision that needs near-real-time signal. It looks fine because the stream is green and the data is real; it's just old.

What actually happens is a latency bug that no error reports. A stream that refreshes daily behind a real-time activation means the activation fires on yesterday's state — a customer who converted this morning is still in the "hasn't converted" segment tonight, and gets the nudge they no longer need. An abandoned-cart signal that lands on a nightly batch arrives a day after the cart is cold. The freshness gap is invisible in every dashboard, because the dashboard shows the data that is there, not the hours of staleness in front of it. Freshness is a feature (principle 6), and a missing feature here is silent by construction.

The fix is to set cadence from the downstream decision, not from the source. Find the freshest decision the data feeds and make ingestion meet it — streaming (the Ingestion API streaming pattern or the Web/Mobile SDK) for genuine real-time, scheduled batch for everything a schedule serves. Then write the cadence and the decision it serves next to the stream, so nobody downstream assumes real-time where there's a 24-hour lag, and nobody upstream streams what a nightly batch would have served (principles 6 + 11; connectors covers which sources are batch and which stream).
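
The comparison between cadence and decision is simple arithmetic, sketched here under assumptions: the function names, the `landing_lag` parameter, and the one-hour tolerance for an abandoned-cart nudge are illustrative, not platform settings.

```python
from datetime import timedelta


def worst_case_staleness(refresh_interval, landing_lag=timedelta(0)):
    """Oldest the data can be just before the next run lands."""
    return refresh_interval + landing_lag


def cadence_meets_decision(refresh_interval, decision_tolerance, landing_lag=timedelta(0)):
    """True when the worst-case staleness fits inside what the decision can tolerate."""
    return worst_case_staleness(refresh_interval, landing_lag) <= decision_tolerance


daily = timedelta(hours=24)

# Hypothetical tolerances: a cart nudge needs data under an hour old,
# a nightly report is fine with a day.
cart_ok = cadence_meets_decision(daily, decision_tolerance=timedelta(hours=1))
report_ok = cadence_meets_decision(daily, decision_tolerance=timedelta(hours=24))
print(cart_ok, report_ok)  # False True
```

The same daily stream is correct for one consumer and a latency bug for the other, which is why the tolerance has to come from the decision and be written down beside the stream.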

5. Landing event data as a Profile stream — it ingests cleanly, and the time series never exists

The instinct is that data about a customer's behavior is still data about the customer, so a stream of purchases or web events gets created as a Profile stream — the same category as the customer record it relates to. It ingests without complaint. The rows land in a DLO, the count is right, nothing flags the choice. The category is not cosmetic, though: it tells Data 360 how the data behaves and constrains what you can do with it downstream, and Profile and Engagement are not interchangeable.

What actually happens is that the time series you needed never comes into being. Engagement is the category for time-series event data, and it requires an event-time field — the timestamp that places each event on a timeline. Land those events as Profile and you've told the platform they describe a subject's current state, not a sequence of moments. Time-windowed segmentation ("purchased in the last 30 days"), engagement metrics, and recency logic then have nothing to stand on, because the data was never modeled as events in time. The failure surfaces a layer downstream, when someone tries to build the segment and finds the timeline isn't there — and the category, set once at stream creation, is the cause.

The fix is to choose the category from what the data is, not from what it relates to. One record per subject, updated over time → Profile. A thing that happened at a moment, with a timestamp → Engagement, and confirm the event-time field is present and populated before the stream goes live. Reference or lookup data that's neither → Other. Getting this right is the first modeling decision you make even though it happens at ingestion (principle 1), and it's far cheaper to set correctly now than to re-ingest later (data streams covers the three categories).
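
The decision rule and the pre-launch check can be written down as a small sketch. The function names, the boolean inputs, and the sample purchase rows are hypothetical; the category names are the three the text describes.

```python
def choose_category(happened_at_a_moment, one_record_per_subject):
    """Pick the stream category from what the data is, not what it relates to."""
    if happened_at_a_moment:
        return "Engagement"
    if one_record_per_subject:
        return "Profile"
    return "Other"


def rows_missing_event_time(rows, event_time_field):
    """Indexes of rows where the event-time field is absent or empty."""
    return [i for i, row in enumerate(rows) if not row.get(event_time_field)]


purchases = [
    {"order_id": "O-1", "purchased_at": "2024-05-01T10:00:00Z"},
    {"order_id": "O-2", "purchased_at": None},
    {"order_id": "O-3"},
]

category = choose_category(happened_at_a_moment=True, one_record_per_subject=False)
missing = rows_missing_event_time(purchases, "purchased_at")
print(category)  # Engagement
print(missing)   # [1, 2]
```

A purchase is an event even though it relates to a customer, so it resolves to Engagement. The two rows without a timestamp are the ones to fix at the source before the stream goes live.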

6. Never reconciling what you ingest against what you use — the DLO bloats, and cost climbs for data nobody reads

The instinct is to ingest generously: bring in the whole source, every field, every table, because storage feels cheap and you might need it later. Each individual stream is defensible. The trouble is that nobody ever runs the other side of the ledger — what of all this ingested data is actually used by a DMO, a segment, an insight, an activation — so the gap between ingested and used widens silently, one reasonable stream at a time.

What actually happens is DLO bloat: streams refreshing data that nothing downstream consumes, each run paying the processing cost of keeping current something no segment ever reads (principle 11). It rarely shows up as a single bad decision; it accumulates. A field ingested for a use case that never shipped still refreshes every run. A whole source connected "to have it" lands on a cadence and bills on every run, unread. Because each stream is individually small and green, the bloat is invisible until a cost review asks why ingestion processing is what it is, and the answer is a dozen streams nobody mapped to a consumer.

The fix is to make ingest-versus-use a deliberate, recurring reconciliation rather than a thing you assume. Before a stream goes live, name what consumes it — which DMO, which segment, which activation; if nothing does yet, it isn't ready to ingest on a cadence. Periodically walk the live streams and ask the same question of each: a stream whose data nothing reads is a refresh you can stop. Ingest for a consumer you can name, not for a "might need it" that pays rent forever (principle 11; the model doc that principle 12 asks for is where this trace lives).
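
The reconciliation itself is a lookup, provided the stream-to-consumer mapping has been written down somewhere. The sketch below assumes that mapping is maintained by hand; the stream and consumer names are invented for illustration.

```python
def unconsumed_streams(streams, consumers_by_stream):
    """Streams with no named downstream consumer (missing or empty entry)."""
    return sorted(name for name in streams if not consumers_by_stream.get(name))


# Hypothetical inventory: live streams and what reads each one.
streams = ["crm_contacts", "web_events", "legacy_loyalty", "partner_feed"]
consumers_by_stream = {
    "crm_contacts": ["Individual DMO", "Churn-risk segment"],
    "web_events": ["Abandoned-cart activation"],
    "legacy_loyalty": [],
    # partner_feed was never mapped to anything
}

idle = unconsumed_streams(streams, consumers_by_stream)
print(idle)  # ['legacy_loyalty', 'partner_feed']
```

A stream with an empty list and a stream missing from the mapping are treated the same way, since in both cases nobody can name the consumer.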

7. Reading a green console as "no data arrived" — an expired connector credential looks identical to an empty source

The instinct, when a downstream team says "the data's missing," is to look at the segment or the DMO and conclude the source had nothing to send — the records simply aren't there, so the source must be empty or behind. The console doesn't obviously contradict you; the absence of new rows looks the same whether the source sent nothing or the stream couldn't reach it. So the investigation starts at the wrong end of the pipeline.

What actually happens, often enough to check first, is that the ingestion failed rather than the source being empty. A connector's credential expired or was rotated, an OAuth token lapsed, a storage bucket's permission changed, a source-side API limit throttled the pull — and the stream stopped landing data while everything downstream kept reporting on the last good load. "No new data" and "the connection silently broke" present identically from the consumer's seat. File-storage connectors have their own version: a source that's supposed to drop a CSV into a bucket on a schedule silently stops, and an empty bucket looks downstream exactly like a source with no new records (connectors).

The fix is to make the stream's own health the first hypothesis, not the source's. Check the stream's last successful run and its run status before you theorize about empty data: a stale last-run timestamp or a failed/auth-error status says the connection broke, not that the source went quiet. Confirm the credential or token is still valid, that permissions and any source-side limits haven't changed, and that the expected file actually landed in the bucket. "Missing data" is a connector-health question before it is a source question — debugging ingestion walks the full diagnostic order.
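
That ordering can be expressed as a small triage function. Everything here is an assumption made for illustration: the `StreamHealth` fields, the status strings, and the rule that a last success older than twice the expected interval counts as stale. Real run metadata would have to be read from the stream's run history by whatever means your org uses.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class StreamHealth:
    status: str                        # hypothetical values: "success", "failed", "auth_error"
    last_success: Optional[datetime]   # None if the stream has never succeeded
    expected_interval: timedelta       # how often the stream is supposed to land data


def first_hypothesis(health, now):
    """Where to look first when someone reports missing data."""
    if health.status != "success":
        return "connector"   # connection broke: check credential, permissions, limits
    if health.last_success is None or now - health.last_success > 2 * health.expected_interval:
        return "stale"       # reports success but has not landed data on schedule
    return "source"          # stream is healthy, so now ask whether the source sent anything


now = datetime(2024, 6, 22, 9, 0)
expired = StreamHealth("auth_error", datetime(2024, 6, 1, 2, 0), timedelta(days=1))
stalled = StreamHealth("success", datetime(2024, 6, 1, 2, 0), timedelta(days=1))
healthy = StreamHealth("success", datetime(2024, 6, 22, 2, 0), timedelta(days=1))

print(first_hypothesis(expired, now))  # connector
print(first_hypothesis(stalled, now))  # stale
print(first_hypothesis(healthy, now))  # source
```

Only the third case justifies investigating the source. The first two are the three-week-old credential scenario, and both are answered from the stream's own run record.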

The throughline across all seven: ingestion does not tell you when it's wrong. Choose upsert and deletes linger; over-refresh a large source and the bill climbs under a green run; key on a not-quite-unique field and records overwrite or duplicate; set cadence by habit and a real-time decision runs on yesterday; mis-categorize events and the time series never exists; ingest without reconciling against use and the lake bloats; read a broken connector as an empty source and debug the wrong layer. Every one is silent, and the leverage is the same place every time: the refresh mode and its key, the cadence matched to the decision, the category set from what the data is, and the stream's own health checked before anything downstream is blamed. Decide each deliberately and write it down — because the platform will never raise its hand.

Closing

These seven are the ingestion failures Cleon has watched bite hardest in Data 360 builds. The shared theme echoes the rest of this catalog: the platform makes the easy ingest easy and the correct one deliberate. A deleted record that won't leave, a cost nobody chose, a key that overwrites, a cadence behind the decision, an event stream with no timeline, a lake that bloats, a credential that lapsed in silence — none is loud in the moment, and each is a DLO that lies to everything downstream until someone notices a number that can't be right, or a customer who shouldn't have been contacted.

If an ingestion gotcha bit your team and isn't here, write to hello@wearecleon.com — we add it, with credit.

Data streams — the unit of ingestion these gotchas configure: source to DLO, the category, and the schedule
Connectors — the sources, their batch-versus-streaming nature, and the auth/limit failures that look like "no data"
Refresh modes — full refresh vs upsert, the primary key, and the deletes behavior behind gotchas 1 and 3
Ingestion and the lifecycle — why a quiet ingestion mistake surfaces as a downstream bug three layers away
Debugging ingestion — the diagnostic order: stream not refreshing, wrong counts, missing or duplicate records, connector health
Ingestion Style Guide — "how should this source land?" — streaming vs batch, full vs upsert, and matching cadence to the decision
