Data Manipulation Language—the INSERT, UPDATE, DELETE, and MERGE commands that most developers learn in their first SQL tutorial—is rarely treated as a strategic tool. We learn the syntax, we pass the interview questions, and then we write scripts that move rows around without much thought about the story those rows tell. But in a world where data quality directly impacts decision-making, the way we manipulate data matters as much as the data itself. This playbook repositions DML as a narrative craft: every update is a revision of the record, every delete is an editorial cut, every insert is a new sentence in the dataset's ongoing story. We'll walk through where DML shows up in real work, what foundations trip people up, which patterns hold up under pressure, and—just as important—when it's best to leave the data alone.
Where DML Shapes Real Work: From Pipelines to Ad-Hoc Queries
DML operations aren't confined to a single stage of the data lifecycle. They appear in ingestion pipelines, transformation steps, correction scripts, and even in the final presentation layer. Understanding the context of each operation helps us choose the right approach and anticipate side effects.
Batch Ingestion and Upsert Logic
In a typical ETL pipeline, new records arrive nightly. The classic pattern is to stage them in a landing table, then use MERGE (or separate INSERT/UPDATE) to sync the target. The strategic question here isn't "which syntax works" but "how do we handle conflicts?" For instance, if the source system sends a row with a corrected value, do we overwrite blindly, or do we keep the old value as a historical version? Teams often default to overwrite because it's simpler, but that choice erases the audit trail. A better narrative pattern is to use MERGE with a flag column that tracks the last update timestamp and the source system—so the story of each row includes its revision history.
Ad-Hoc Corrections and the Danger of Direct Updates
An analyst spots a data entry error: a customer's name is misspelled. The quick fix is an UPDATE statement. But what if that same table is being queried by a nightly report that already captured the wrong name? The correction changes the past, and downstream aggregations become inconsistent unless the report is rerun. This is where narrative thinking helps: instead of a direct update, consider inserting a correction record that the reporting layer can apply retroactively. The original row stays as-is, and the correction is a separate edit in the dataset's story.
Delete Operations in Regulatory Contexts
GDPR and similar privacy regulations require the ability to delete personal data on request. A hard DELETE removes the row entirely, but that may break referential integrity or audit logs. A strategic alternative is a soft delete (setting an `is_deleted` flag) combined with a scheduled purge that runs only after verification. This preserves the narrative for compliance while satisfying the deletion request. The choice between hard and soft delete is not just technical; it's a narrative decision about whether the record ever existed.
Foundations Readers Confuse: Transactional vs. Analytical DML
One of the most persistent confusions is treating analytical databases like transactional ones. In OLTP systems, DML is frequent, small-scale, and often wrapped in transactions. In analytical or data warehouse contexts, DML is typically bulk, scheduled, and rarely transactional in the ACID sense. Mixing the two mindsets leads to performance issues and data inconsistencies.
ACID vs. Eventually Consistent
Transactional systems guarantee atomicity, consistency, isolation, and durability. If you UPDATE a row in a bank account table, you expect the balance to reflect immediately and for concurrent transactions to see a consistent state. Analytical systems, especially those built on columnar storage or distributed platforms, often relax isolation for performance. An UPDATE in such a system might not be visible to all readers until the next compaction. Professionals who expect transactional behavior in an analytical store are in for a surprise. The strategic approach is to know your system's guarantees and design your DML accordingly—for example, batching updates into a staging table and swapping partitions rather than issuing individual UPDATE commands.
Idempotency and Replayability
Another foundation that gets confused is idempotency. In a transactional system, running the same INSERT twice creates two rows. In an analytical pipeline, the same batch might be replayed due to a failure. Without idempotency checks (like using a unique key or a deduplication step), the data narrative gets duplicated sentences. A common mistake is to rely on the source system to guarantee uniqueness; the safer pattern is to make the DML itself idempotent by using MERGE with a hash key or by inserting into a staging table and applying a window function to pick only the latest version.
Null Semantics and the Silent Update
NULL values are another foundation that often trips teams. An UPDATE that sets a column to NULL may be intended to "clear" the value, but downstream joins and aggregations treat NULL as unknown, not as empty. In narrative terms, a NULL is an ambiguous sentence—it could mean "not applicable", "not yet collected", or "removed". Many data dictionaries fail to document these semantics, and the DML script becomes the only record of intent. The fix is to either use a sentinel value (like -1 for numeric columns or 'N/A' for strings) or to add a separate column that explains why the value is NULL (e.g., `null_reason`).
Patterns That Usually Work: Reliable DML Strategies
Over years of observing teams across industries, certain DML patterns consistently produce clean, auditable, and performant data. These patterns are not about fancy syntax but about structuring operations with intent and safety nets.
Pattern 1: Staging + MERGE with Audit Columns
The most reliable pattern for keeping a target table in sync is to load new data into a staging table, then run a MERGE that adds audit columns: `created_at`, `updated_at`, `source_system`, and `operation_type` (insert, update, delete). This pattern works for daily batches and streaming micro-batches alike. The audit columns turn every row into a self-documenting story. If something goes wrong, you can trace which operation caused the change and when.
Pattern 2: Soft Delete with a Purge Window
Rather than deleting rows immediately, set a `deleted_at` timestamp and a `purge_after` date. Queries that need current data filter on `deleted_at IS NULL`. A scheduled job runs weekly to physically delete rows where `purge_after` has passed. This gives analysts a grace period to catch accidental deletes and preserves referential integrity for historical reports.
Pattern 3: Append-Only for Critical History
Some tables should never be updated or deleted—only inserted. Examples include event logs, financial transactions, and audit trails. The strategic choice is to make the table append-only at the database level (using triggers or permissions) and model changes as new rows with a version number. This pattern simplifies debugging and satisfies regulatory requirements for immutable records. The trade-off is storage growth, but in most modern systems, the cost of storage is far lower than the cost of data loss or misinterpretation.
Pattern 4: Parameterized UPDATEs with WHERE Clauses
A common anti-pattern is writing UPDATE statements without a WHERE clause (or with a condition that's too broad). The reliable pattern is to always test the WHERE clause in a SELECT first, then convert to UPDATE. Better yet, use a CTE to select the rows to update, then apply the UPDATE. This prevents unintentional mass updates and makes the operation reviewable.
Anti-Patterns and Why Teams Revert
Even experienced teams fall into habits that undermine data quality. Understanding why these anti-patterns persist helps us avoid them.
Anti-Pattern 1: The "Just This Once" Direct Update
An urgent request comes in: fix a batch of incorrect values. The easiest path is a direct UPDATE on the production table. But "just this once" becomes a pattern, and soon there's no record of what was changed or why. Teams revert to this because it's fast and requires no coordination. The fix is to enforce a policy that all DML operations go through a version-controlled script, even if it adds five minutes to the task.
Anti-Pattern 2: Over-Indexing on Performance
Some teams avoid MERGE because it's slower than separate INSERT and UPDATE statements. They revert to the two-step approach to shave milliseconds off a batch job. But the two-step approach is not atomic: if the job fails between the INSERT and UPDATE, the data is left in an inconsistent state. The performance gain is rarely worth the risk. A better approach is to optimize the MERGE with proper indexing and batch sizes.
Anti-Pattern 3: DELETE Without Backup
Deleting rows is permanent. Yet many teams delete without first backing up the affected rows. When a stakeholder asks "what was in that row?" there's no answer. The revert happens because the team didn't anticipate the question. The pattern to adopt is to always SELECT the rows into a backup table before DELETE, or use a soft delete with a retention period.
Anti-Pattern 4: Ignoring Transaction Boundaries
In transactional databases, wrapping multiple DML statements in a BEGIN/COMMIT block ensures all-or-nothing execution. But teams sometimes skip transactions for "simple" updates, assuming they won't fail. When a failure occurs mid-operation, the data is left partially modified. The revert is to run a manual fix, which may introduce further errors. Always use transactions when multiple tables or rows need to stay consistent.
Maintenance, Drift, and Long-Term Costs
DML scripts, like any code, require maintenance. Over time, the assumptions they were built on drift: source schemas change, business rules evolve, and data volumes grow. Ignoring this drift leads to silent failures and costly rewrites.
Schema Drift and the Brittle MERGE
A MERGE statement that references specific column names will break when the source table adds a new column. Many teams handle this by listing all columns explicitly, which means every schema change requires a script update. A more resilient pattern is to use dynamic SQL that reads column names from the information schema, or to use a tool that auto-generates the MERGE. The cost of not addressing drift is that the pipeline fails at 3 AM, and someone has to debug it manually.
Volume Growth and Performance Degradation
A DELETE operation that took one second on a million-row table may take ten minutes on a hundred-million-row table. The DML pattern that worked for years suddenly becomes a bottleneck. Teams often react by throwing hardware at the problem, but the smarter move is to archive old data or partition the table. The long-term cost of ignoring volume growth is that scheduled jobs start overlapping, causing contention and missed SLAs.
Audit Trail Erosion
Even with audit columns, the trail erodes if the retention policy is too short or if the audit table itself is pruned without consideration. A common scenario: the audit table is set to retain 90 days of history, but a compliance audit requires data from six months ago. The cost is a manual reconstruction effort or a failed audit. The fix is to align retention policies with regulatory requirements and to store critical audit data in an append-only archive.
Technical Debt in DML Scripts
Scripts that are written quickly, without comments or error handling, accumulate technical debt. A year later, no one knows why a particular UPDATE uses a strange WHERE clause. The cost is that every change becomes risky, and the team becomes reluctant to touch the code. Investing in documentation, logging, and test cases for DML scripts pays off many times over.
When Not to Use This Approach
Not every data problem needs a DML solution. Sometimes the best narrative is the one that doesn't change the record at all. Here are scenarios where DML is the wrong tool.
Immutable Event Stores
Systems designed as event logs (e.g., Apache Kafka, event sourcing databases) are built on the principle that records are never modified. Inserting a correction event is the right pattern; updating or deleting the original event violates the system's semantics. Using DML in such systems can cause data loss or inconsistency. The strategic choice is to append a correction event and let the read model handle the reconciliation.
Analytical Views vs. Base Tables
When a business user asks for a "fix" to a number in a report, the instinct is to update the underlying table. But if the base table is a source of truth for multiple reports, changing it may break other views. A better approach is to create a derived view or a separate table that applies the correction, leaving the base table intact. The DML approach would be appropriate only if the error is in the source data itself.
High-Frequency Real-Time Systems
In systems that process millions of events per second (e.g., ad bidding platforms), DML operations can create bottlenecks. The overhead of locking, logging, and transaction management makes UPDATE and DELETE impractical. Instead, these systems use append-only logs and in-memory state stores. The DML playbook is designed for systems where consistency and auditability matter more than raw throughput.
One-Time Data Cleanup
If you're cleaning up a dataset as a one-time exercise, writing a full-blown DML script with audit columns and error handling may be overkill. A quick UPDATE with a backup is sufficient—as long as you accept the risk. The playbook is for ongoing, repeatable processes. For one-off tasks, the cost of building a robust pattern may exceed the benefit.
Open Questions and FAQ
Even with a solid playbook, questions remain. Here are answers to the most common ones we hear from teams adopting a narrative approach to DML.
How do we handle DML in a microservices architecture?
Each microservice typically owns its database. DML operations should be scoped to the service's domain. Cross-service updates are better handled through events or API calls, not direct DML. The narrative then lives in the event log, not in a shared table.
Should we use ORMs for DML?
ORMs can generate DML, but they often hide the underlying operations. For strategic data work, we recommend writing DML explicitly so you can review and optimize it. ORMs are fine for simple CRUD in application code, but for batch data operations, raw SQL gives more control.
What's the best way to test DML scripts?
Test in a staging environment with a copy of production data (anonymized if necessary). Use a transaction so you can roll back after testing. Verify row counts, column values, and audit columns. Automate these tests in a CI/CD pipeline if possible.
How do we convince stakeholders to invest in DML best practices?
Frame it as risk management. Show examples of data inconsistencies that led to wrong decisions. Estimate the cost of manual fixes vs. automated, auditable DML. Often, a single incident (e.g., a failed audit) is enough to get buy-in.
Is there a future where DML becomes obsolete?
Some modern data platforms (like Apache Iceberg or Delta Lake) support time travel and schema evolution, reducing the need for manual DML. But DML will remain relevant for corrections, compliance deletions, and operational systems. The narrative approach we've described is about mindset, not just syntax—and that mindset will outlast any particular tool.
The next time you write an UPDATE or DELETE statement, pause and ask: what story does this operation tell? Who will read it later? By treating DML as a strategic narrative tool, you turn a routine technical task into a practice that builds trust, clarity, and long-term data value.
Comments (0)
Please sign in to post a comment.
Don't have an account? Create one
No comments yet. Be the first to comment!