Skip to main content

Data Extension architecture: the schema choices you live with

Every other Marketing Cloud system reads or writes Data Extensions. Get the schema right at bootstrap and the tenant operates cleanly for years; get it wrong and every SQL Activity, every AMPscript Lookup, every Journey decision-split inherits the mess. Ten production-note items on naming, primary keys, column types, retention, and the structural choices that compound.

Production note·Last updated 2026-05-19·Drafted by Lira · Edited by German Medina

Data Extensions are the foundation of every other system in Marketing Cloud. SQL Activities query them. AMPscript reads them at render time. SSJS scripts read and write them. Journey Builder routes recipients through them. Reporting aggregates from them. Get the DE architecture right at bootstrap and the tenant operates cleanly for years; get it wrong and every other system inherits the mess — every SQL query carries the same LTRIM(RTRIM(CAST(... AS NVARCHAR(255)))) because the SubscriberKey type wasn't pinned, every AMPscript reads a DE column that's NULL half the time because the audience-build SQL Activity wasn't disciplined.

Ten architecture choices Cleon has seen compound across multi-tenant rollouts. Each is paired with the question to answer at bootstrap and the cost of getting it wrong. Same framing as the Config gotchas, narrower in scope to the DE layer specifically.


The architecture choices

1. Four DE roles, four naming prefixes — picked once

Every Data Extension serves one of four roles. The SQL Style Guide lists the prefix convention; this article is where the convention is justified.

| Role | Prefix | What it holds | Lifecycle | |---|---|---|---| | Master | DE_ | Owned subscriber data; the canonical source for an entity | Long-lived; updates via controlled SQL Activities | | Staging | de_stg_ | Computed intermediate result; safe to truncate | Rebuilt on every run; never read by downstream sends directly | | Log | de_log_ | Append-only run records; instrumentation | Long-lived but high-volume; needs retention policy | | Lookup | de_lookup_ | Reference / config data; marketing edits without SQL | Long-lived; low-volume; manually maintained |

Picking the prefix matters at bootstrap because retrofitting is brutal. A tenant with 200 DEs named inconsistently can't be batch-renamed (Send Definitions, Automations, AMPscript, and Journeys all hold DE references by name). The Cleon convention: prefix at creation, document the convention in the team runbook, every code review verifies the prefix matches the role.

2. Sendable vs non-sendable — declared, not implicit

A DE is "sendable" when it has a column mapped to SubscriberKey and a checkbox flipped in the DE properties. Sendable DEs become the source of Send Activities; non-sendable DEs are everything else (staging, logs, lookups, reference data that doesn't drive a send).

The hand-off failure: someone needs to send to "all subscribers in
de_stg_active_recent". The DE wasn't created as sendable. The "fix"
is to recreate it (sendable can't be toggled after creation), which
breaks every downstream Automation referencing the old DE name. The
team makes a parallel sendable DE with a slight name variation, both
DEs exist forever, every future code review wonders which is current.

The defense: when creating any DE, decide sendable-vs-not at the moment of creation. Document the role in the DE description. Sendable DEs should only ever be the ones that drive Send Activities directly — de_send_audience_* style names that make the role visible from a list.

3. SubscriberKey type pinned at NVARCHAR(254) (or matched to source)

SubscriberKey is the column every DE joins on, and its declared type determines whether joins work without coercion. The MC default is NVARCHAR(254) (effectively unlimited). Some tenants pin it to a narrower type (NVARCHAR(50)) hoping for performance gains — the gains don't materialize, the joins start failing when a downstream source supplies a longer key, and casting becomes mandatory in every query.

The Cleon convention: NVARCHAR(254) on every SubscriberKey column, declared at DE creation. Same for EmailAddress. The standard LTRIM(RTRIM(CAST(... AS NVARCHAR(255)))) join pattern from the SQL gotchas becomes unnecessary when both sides of the join are already the right type.

4. Primary key configured at creation — not after the first row

Primary keys on DEs are how MC determines which rows are "the same" for upsert / update / dedup operations. Set the wrong PK (or no PK) and UpsertData silently inserts duplicates — the failure mode that the AMPscript debugging snippet and SSJS debugging snippet both diagnose.

PK rules:

- SubscriberKey is the PK on subscriber-keyed DEs.
- Log DEs are PK-less append-only (or use a synthetic GUID).
- Composite PKs (SubscriberKey + Date) for snapshot DEs that
  capture state at a moment in time.
- Lookup DEs PK the dimension being looked up (e.g. TierCode,
  RegionCode).

The MC UI allows creating a DE without specifying a PK and adding one later — but adding a PK to a populated DE that has duplicate rows for the chosen key fails. Set the PK at creation, while the DE is empty. If you discover the PK is wrong after the DE has data, you're rebuilding (truncate + reinsert deduplicated rows + set the PK).

5. Column types declared specifically — no NVARCHAR(4000) defaults

The MC UI defaults new text columns to NVARCHAR(4000). This is convenient at creation and disastrous at scale:

  • Storage cost is proportional to declared width even when actual values are short.
  • The 4000-char limit silently truncates pasted JSON / HTML / serialized values that exceed it.
  • Index efficiency degrades when the join keys are over-wide.
The Cleon convention for text column widths:

- SubscriberKey       NVARCHAR(254)   # MC standard
- EmailAddress        NVARCHAR(254)   # RFC 5321 max + headroom
- FirstName / LastName NVARCHAR(50)    # plenty for any real name
- Status / Tier        NVARCHAR(20)    # short, controlled vocabulary
- Country code         NVARCHAR(2) or 3 # ISO-3166
- URLs                 NVARCHAR(2048)  # max practical URL length
- Free-text comments   NVARCHAR(1000)  # cap to avoid pasted-payload abuse

Pin types at creation. The hand-off failure is a DE with everything NVARCHAR(4000) getting joined to a master DE with proper types — MC issues a Conversion error on the implicit cast, or worse silently truncates.

6. Retention policy at DE creation, not at "we're running out of space"

DEs grow. Log DEs that capture every Send + Open + Click can reach hundreds of millions of rows. SDV snapshots stored in de_log_sdv_* accumulate one row per send-per-subscriber. Without a retention policy, the tenant accumulates years of historical data that nobody queries but everyone pays for.

The Cleon convention for retention per role:

- de_log_ssjs_runs       90 days (rolling)
- de_log_ssjs_errors     180 days (errors are diagnostic gold)
- de_log_sf_writes       180 days (audit trail)
- de_log_email_events    1 year minimum (compliance + reporting)
- de_log_sdv_sent        2 years minimum (CAN-SPAM consent history)
- de_stg_*               truncate on every run
- Master DEs             no rolling; archive deletes manually

Set the retention via Data Retention Policy at DE creation. Adding retention to a populated DE that's already several years old causes a mass-delete event that can lock the DE for hours; configure retention on day 1.

7. Sub-DE relationships are by-convention, not enforced

Marketing Cloud has no foreign-key enforcement between DEs. A row in de_log_ssjs_runs references a SubscriberKey that may or may not exist in _Subscribers. A row in a child DE references a parent that may or may not exist. There's no integrity check; nothing alerts when a reference dangles.

The Cleon convention: document parent-child relationships in the DE description and have a scheduled SQL Activity that audits for orphans. A weekly de_log_orphans row count by DE-pair makes the data integrity visible without needing FK enforcement at the storage layer.

8. Shared DEs (Enterprise 2.0) — sparingly, documented, owned

Cross-BU data sharing happens via Shared DEs. The pattern is legitimate but high-risk: which BU writes? which BUs read? what happens when the writer is decommissioned? See BU architecture — patterns to prefer.

Shared DE creation checklist (Cleon convention):

- [ ] Documented owner BU (the one that writes)
- [ ] Documented reader BUs (the ones that read)
- [ ] Schema review at creation (no random-shape Shared DEs)
- [ ] Retention policy explicit (Shared DEs often outlive their writers)
- [ ] Migration plan if the owner BU is ever decommissioned
- [ ] Naming convention includes "Shared" prefix (e.g. shared_DE_*)

A Shared DE without documented ownership is an orphan-in-waiting. The team that wrote it leaves; the dependent BUs keep reading; nobody knows who can change the schema. The naming convention makes the Shared-ness visible at a glance.

9. Column nullability is a decision, not a default

Every column has a Nullable checkbox in the DE creation UI. Defaulting everything to Nullable is convenient but loses information — when reading a row, you can't tell whether a NULL means "no value" or "we never collected it". The hand-off failure is downstream AMPscript with IF Empty(@x) defaulting NULLs and partial-completes to the same fallback, hiding the difference.

The Cleon convention: declare nullability explicitly per column at DE creation. Required identity columns (SubscriberKey, EmailAddress) NOT NULL. Optional attributes (MiddleName, SecondaryEmail) Nullable. Status columns with a meaningful "unknown" value have a NOT NULL default ('Unknown') rather than allowing NULL.

10. Schema evolution: additive only, never silent renames

Once a DE has downstream consumers (SQL Activities, AMPscript, Journeys), schema changes are visible across the tenant. Adding columns is safe; renaming a column breaks every consumer that references the old name; removing a column breaks every consumer that references it at all.

The Cleon migration discipline:

1. NEVER rename a column in production. Add a new column with the
   new name, migrate downstream consumers to read the new column,
   then deprecate the old column (mark as "deprecated" in the
   description but leave it populated for an audit window).

2. NEVER remove a column without a deprecation period. A removal
   that breaks one consumer is found in dev; a removal that breaks
   five consumers is found in production at 11pm.

3. NEW columns added at the end of the column list (some legacy
   API consumers care about column order).

4. Every schema change documented in the team runbook with a date,
   the consumers that needed update, and the deprecation window.

The runbook entry is the receipt. Six months after a column was deprecated, the runbook tells you whether it's safe to delete; without the runbook, the answer is "I don't know, let's wait another six months" forever.


Closing

Data Extension architecture decisions don't have a satisfying single answer — the "right" choice depends on the tenant's volume profile, the operational team's discipline, and the downstream systems that depend on it. The shared theme across these ten items: at bootstrap, every choice is reversible at low cost; six months in, every choice is reversible at high cost; two years in, most choices are effectively permanent.

The discipline that prevents the worst outcomes is a DE creation checklist enforced in code review — prefix, sendable-or-not, PK, column types, retention policy, nullability, owner-BU (for Shared DEs). Cleon's checklist lives in the team runbook alongside the BU architecture and SAP procedures.

If you spot a DE architecture pattern that bit your team and isn't here — write to hello@wearecleon.com. We add it, with credit.

Reference: