Refresh modes: full refresh vs upsert

Every Data Stream runs on a refresh mode, and the choice is the highest-stakes correctness decision in ingestion. The mode decides one thing: how a new run reconciles against the records the stream already landed in its Data Lake Object. There are two, and they behave differently in exactly the place that hurts — what happens to a record that left the source.

This page is the reference for that decision in Data 360 (formerly Data Cloud): what full refresh and upsert each do, why upsert needs a primary key, and the one behavior that turns a clean-looking ingest into a profile that lies — the way deletes are, and are not, handled.

Full refresh: replace the whole dataset every run

A full refresh re-lands the entire dataset on every run. The previous contents of the DLO are replaced by what the source returns this time; the run is the new truth in full.

That makes it the simpler mode to reason about. You don't track what changed — you don't have to, because nothing carries over. Whatever the source holds at run time is what the DLO holds after. The cost is the obvious one: every run moves and processes the full dataset, even if one row changed, so a large source on a full refresh is heavier than the change it captured (principle 11 — cost scales with what you process).

The property that matters most is how full refresh handles a delete: by absence. A record that existed in the last run but is gone from the source this run is simply not in the new load — so after the run, it is gone from the DLO too. You never send a delete signal; the record's disappearance from the source is the delete. This is the quiet strength of full refresh, and it is exactly what upsert does not give you for free.
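As a minimal sketch of delete-by-absence, the following models a DLO as an in-memory Python dict keyed by record ID. The `full_refresh` function and the sample records are hypothetical illustrations, not a Data 360 API; the only point is that nothing from the previous run carries over.

```python
# Hypothetical in-memory model of a DLO; this is not a Data 360 API.
def full_refresh(dlo, source_rows, key="id"):
    """Return the new DLO contents: exactly what the source returned this run."""
    # The previous contents (dlo) are deliberately ignored: nothing carries over.
    return {row[key]: row for row in source_rows}


run_1 = [
    {"id": "A", "email": "a@example.com"},
    {"id": "B", "email": "b@example.com"},
]
run_2 = [{"id": "A", "email": "a@example.com"}]  # B was deleted at the source

dlo = full_refresh({}, run_1)
dlo = full_refresh(dlo, run_2)

print(sorted(dlo))  # ['A']
```

No delete was ever sent for B. It is gone from the modeled DLO only because run 2 did not contain it.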

Upsert: insert-or-update on a primary key

An upsert reconciles incrementally. Each run carries a set of records, and for every one the stream asks a single question against the primary key: does a record with this key already exist? If yes, update it; if no, insert it. Records the run doesn't mention are left untouched. Upsert is sometimes called incremental ingestion for exactly this reason — you land the delta, not the whole set.

That is lighter and faster than a full refresh on a large source, because you move only what changed. But the entire mechanism rests on one thing being correct: the primary key. The key is what tells the stream whether an incoming record is the same record it already has or a new one. Get the key right and upsert is precise. Get it wrong and the failures are silent: an incoming record is matched to the wrong existing record, or treated as new when it is not, and nothing flags the mistake.
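The insert-or-update question can be sketched in a few lines, again with a dict standing in for the DLO. The `upsert` function is a hypothetical model rather than a Data 360 API, and the choice of `email` as a bad key is an assumption made only to show what an unstable key does.

```python
# Hypothetical in-memory model of a DLO; this is not a Data 360 API.
def upsert(dlo, delta_rows, key):
    """Insert-or-update each incoming row by primary key; leave the rest untouched."""
    for row in delta_rows:
        dlo[row[key]] = row  # existing key: update; unseen key: insert
    return dlo


# Stable key ("id"): the changed record updates in place.
dlo = {}
upsert(dlo, [{"id": "A", "email": "a@example.com"},
             {"id": "B", "email": "b@example.com"}], key="id")
upsert(dlo, [{"id": "A", "email": "a.new@example.com"}], key="id")
print(len(dlo))  # 2

# Unstable key (assumed: email used as the key, and the email changed).
by_email = {}
upsert(by_email, [{"id": "A", "email": "a@example.com"}], key="email")
upsert(by_email, [{"id": "A", "email": "a.new@example.com"}], key="email")
print(len(by_email))  # 2: one person, two records, and no error raised
```

In the second case the model behaves exactly as designed, which is why the failure is silent: the key said "new record", so a new record was inserted.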

The trap: upsert does not delete

This is the single behavior to internalize, because it is where a refresh-mode mistake surfaces three layers downstream as data that's just wrong.

A full refresh captures deletes by absence. An upsert does not. An upsert only ever inserts or updates — by definition it touches a record only when that record is in the run. A record deleted from the source simply stops arriving, and "stops arriving" is precisely the case upsert leaves untouched. So the deleted record stays in the DLO indefinitely, looking exactly as valid as a current one.
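A minimal sketch of the ghost record, using the same hypothetical dict-based model (not a Data 360 API): record B is deleted at the source between runs, so the second run never mentions it.

```python
# Hypothetical in-memory model of a DLO; this is not a Data 360 API.
def upsert(dlo, delta_rows, key="id"):
    """Insert-or-update each incoming row; rows not in the run are untouched."""
    for row in delta_rows:
        dlo[row[key]] = row
    return dlo


dlo = {}
upsert(dlo, [{"id": "A", "status": "active"},
             {"id": "B", "status": "active"}])

# Between runs, B is deleted at the source and A changes.
# The next delta carries only A; B simply stops arriving.
upsert(dlo, [{"id": "A", "status": "lapsed"}])

print(sorted(dlo))         # ['A', 'B']
print(dlo["B"]["status"])  # active
```

B still reads as an active record, with nothing in the modeled DLO to distinguish it from a current one.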

So the deletes question is not a detail you settle later — it is half of the refresh-mode decision. Before you choose upsert, answer it out loud: can this source delete records, and if so, how does the delete reach Data 360? If the answer is "it can delete and there's no delete signal," upsert will silently retain ghosts and a full refresh is the safer mode despite its cost.
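One way to check whether ghosts have already accumulated is a set difference between the keys the DLO holds and the keys the source holds. This is only a sketch: `find_ghost_keys` is a hypothetical helper, and it assumes you can export the current primary keys from both sides, which depends on your source and connector.

```python
# Hypothetical helper; how the two key sets are exported is an assumption.
def find_ghost_keys(dlo_keys, source_keys):
    """Return keys still present in the DLO that no longer exist at the source."""
    return sorted(set(dlo_keys) - set(source_keys))


source_keys = ["A", "C"]     # keys the source holds right now
dlo_keys = ["A", "B", "C"]   # keys the upsert stream has landed so far

ghosts = find_ghost_keys(dlo_keys, source_keys)
print(ghosts)  # ['B']
```

A non-empty result means the source deletes records and those deletes are not reaching the DLO.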

How to choose

The trade is legible once you state it as the two things that differ:

Weight and frequency. Full refresh moves the whole dataset every run; upsert moves only the delta. On a large, frequently-refreshed source, the difference is real processing cost (principle 11) — over-frequent full refreshes are a cost decision disguised as a schedule setting.
Deletes and the key. Full refresh captures deletes by absence and needs no key for that. Upsert needs a stable, unique primary key and an explicit delete strategy if the source can remove records. Name both before you commit to upsert.

The honest default: if you cannot name a stable unique key, you are not ready for upsert, so use full refresh. If the source can delete records and you have no way to signal those deletes, a full refresh captures them for free where upsert would silently keep ghosts. Reach for upsert only when you have a trustworthy key, the volume makes full refresh wasteful, and you have decided how deletes are handled. The Ingestion Style Guide treats this choice as the second of its three ordered questions.
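The default above can be written down as a small decision function. This is a sketch of the reasoning on this page, not a Data 360 feature; the function name and its four boolean parameters are hypothetical.

```python
# Hypothetical encoding of the decision rules described above.
def choose_refresh_mode(has_stable_unique_key, source_can_delete,
                        has_delete_strategy, full_refresh_is_wasteful):
    """Return 'full refresh' or 'upsert' following the page's default."""
    if not has_stable_unique_key:
        return "full refresh"  # not ready for upsert without a key
    if source_can_delete and not has_delete_strategy:
        return "full refresh"  # deletes are captured by absence
    if full_refresh_is_wasteful:
        return "upsert"        # key, delete handling, and volume all line up
    return "full refresh"      # simpler mode when volume does not force the issue


# Large source, trustworthy key, deletes handled explicitly.
print(choose_refresh_mode(True, True, True, True))    # upsert

# Same source, but no way to signal deletes.
print(choose_refresh_mode(True, True, False, True))   # full refresh

# No stable unique key, regardless of volume.
print(choose_refresh_mode(False, False, False, True)) # full refresh
```

The order of the checks matters: the key and the delete question are settled before volume is allowed to argue for upsert.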

Related

Data Streams: the unit of ingestion the refresh mode is configured on, and where the mode is set
Ingestion gotchas: the silent failures, including upsert keeping deleted records and a non-unique key duplicating
Relationships & keys: what makes a primary key unique and stable, the prerequisite for upsert
Ingestion Style Guide: "how should this source land?", with full refresh vs upsert as the second ordered question
