Leases, Ownership, and Fencing

The dangerous worker is not always the dead one. It is often the worker that was presumed dead, replaced, and then wakes up still holding enough local state to write.

That pattern shows up in job systems, leadership election, file processing, schedulers, deployment controllers, and agent runtimes. The system thinks ownership moved. The old owner did not get the message in time.

The thesis

A lease is only a time-bounded claim. Fencing is the mechanism that makes stale ownership harmless.

If a design stops at "the worker holds a lock," it has not answered the production question. What happens when the worker pauses past its lease, another worker takes over, and the old worker resumes? If the resource accepts writes from both, the lock was advisory theater.

The point is not to eliminate pauses. The point is to make old owners unable to mutate state after ownership has advanced.

The production pattern

A worker picks up a job and writes "owned by worker A until this time." It starts processing. Then the process stalls during a long garbage collection pause, a host hiccup, a network partition, or a slow external call. The lease expires. Another worker sees the job as available, claims it, and continues.

So far, the system is doing what the lease design asked it to do. The failure begins when worker A resumes. It still has file handles, local variables, a downstream request in flight, or credentials that let it write a final result. Worker B may have already written a different result. Now two owners exist in the only place that matters: the resource being changed.

The database row says one thing. The side effect boundary accepts another.

The model

I separate four concepts:

Work item: the thing that needs progress.
Lease: the temporary right to attempt progress.
Epoch or fencing token: a monotonic ownership version issued when the lease is acquired.
Guarded resource: the system that rejects writes from stale epochs.

The lease helps coordinate who should work. The fencing token proves which owner is newer. The guarded resource enforces that proof.

Without the guarded resource, the token is only decoration. It has to be checked where the mutation occurs: when updating the job row, writing the output, publishing the result, completing the workflow, or calling the downstream system if the downstream system can support it. If the downstream system cannot support fencing directly, the local system needs an intermediate record that prevents stale completion from becoming authoritative.
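
As a minimal sketch of what "checked where the mutation occurs" can look like, here is an in-memory guarded resource. The names GuardedResource and StaleEpochError are hypothetical, and the dictionary stands in for whatever store actually holds the data. In a real system the comparison and the write would need to be a single atomic operation in that store.

```python
from dataclasses import dataclass, field


class StaleEpochError(Exception):
    """Raised when a write carries an epoch older than one already seen."""


@dataclass
class GuardedResource:
    # Hypothetical in-memory stand-in for a real store.
    highest_epoch: dict = field(default_factory=dict)  # key -> newest epoch seen
    values: dict = field(default_factory=dict)         # key -> last accepted value

    def write(self, key: str, value: str, epoch: int) -> None:
        seen = self.highest_epoch.get(key, 0)
        if epoch < seen:
            # A newer owner has already written; the stale owner is refused.
            raise StaleEpochError(f"epoch {epoch} < {seen} for {key!r}")
        # Equal epochs are allowed so the current owner can write repeatedly.
        self.highest_epoch[key] = epoch
        self.values[key] = value
```

The resource does not know or care why the old owner was slow. It only compares versions, which is what makes the pause harmless.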

This model also clarifies ownership transfer. Taking over a job is not just changing a worker id. It is advancing the epoch. Every later write must show it belongs to the current epoch.

Where this goes wrong

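The first mistake is relying on clocks alone. Time decides when a lease may be considered expired. It should not be the only thing that decides whether an old write is accepted, because a paused worker cannot observe its own lease expiring and clocks on different hosts can disagree.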

The second mistake is renewing leases from a thread whose liveness is decoupled from the health of the work itself. A worker may keep renewing while the part that matters is wedged, or it may stop renewing while an external operation continues. Renewal says the process is alive; it does not prove the operation is still safe.

The third mistake is checking ownership only when work starts. The stale owner problem surfaces later, when the worker writes and completes. Every meaningful write after acquisition needs to carry the ownership version.

The fourth mistake is letting manual tools bypass fences. Incident scripts that mark jobs complete, reassign owners, or replay work need the same ownership rules. Otherwise the emergency path becomes a source of drift.

There is a practical counterpoint. A single-threaded process with one local resource may not need leases at all. A database transaction or a simple status column can be enough. Leases enter the design when ownership can move while work is in progress. Once that is true, fencing becomes the part that protects correctness.

What I do now

I ask where stale owners are rejected. If the answer points to the scheduler, I keep asking. The scheduler can stop assigning new work, but it cannot stop an old owner from writing unless the write path checks the token.

I prefer ownership records that include an epoch, owner identity, lease deadline, and reason for transfer. The reason is useful during incidents. It tells operators whether the takeover was a timeout, manual action, cancellation, or recovery process.
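
One possible shape for such a record, sketched with hypothetical names (OwnershipRecord, TransferReason, take_over). The INITIAL reason is my own addition for the first acquisition, and lease_deadline is assumed to be seconds on whatever clock the lease store uses.

```python
from dataclasses import dataclass, replace
from enum import Enum


class TransferReason(Enum):
    INITIAL = "initial"            # assumption: first acquisition, not a transfer
    TIMEOUT = "timeout"
    MANUAL = "manual"
    CANCELLATION = "cancellation"
    RECOVERY = "recovery"


@dataclass(frozen=True)
class OwnershipRecord:
    job_id: str
    epoch: int
    owner: str
    lease_deadline: float  # seconds on the lease store's clock
    reason: TransferReason


def take_over(current: OwnershipRecord, new_owner: str, now: float,
              ttl: float, reason: TransferReason) -> OwnershipRecord:
    """Return the next ownership record; the epoch always advances."""
    return replace(
        current,
        epoch=current.epoch + 1,
        owner=new_owner,
        lease_deadline=now + ttl,
        reason=reason,
    )
```

Keeping the record immutable and producing a new one on each transfer makes it easy to retain a history of who owned the job and why it moved.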

I design completion as a conditional write: complete this job only if the epoch still matches. If completion triggers more side effects, those side effects should be tied to the same operation identity so a stale completion cannot fan out new work.

For long-running work, I also separate heartbeat from proof of progress. A worker can be alive and still stuck. Progress markers, checkpoints, and reconciler observations tell a more useful story than a heartbeat alone.
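
To illustrate the separation, here is a small sketch that treats heartbeat and progress as two independent signals. WorkerObservation, classify, the timeout parameters, and the three labels are all hypothetical. Note that "unresponsive" only means the worker should be presumed gone; it says nothing about whether the worker can still write.

```python
from dataclasses import dataclass


@dataclass
class WorkerObservation:
    last_heartbeat: float  # when the process last reported it was alive
    last_progress: float   # when a checkpoint or progress marker last advanced


def classify(obs: WorkerObservation, now: float,
             heartbeat_timeout: float, progress_timeout: float) -> str:
    """Return 'unresponsive', 'stuck', or 'healthy' from two separate signals."""
    if now - obs.last_heartbeat > heartbeat_timeout:
        return "unresponsive"
    if now - obs.last_progress > progress_timeout:
        # Alive by heartbeat, but the work itself has not moved.
        return "stuck"
    return "healthy"
```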

In agent systems, this matters when multiple planners or runners can act on the same task. The fact that one agent started drafting, filing, or editing does not mean it still owns the action after a supervisor reassigns the work. The tool boundary needs to know which owner is current.
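
A rough sketch of a tool boundary that knows the current owner. ToolGateway, NotCurrentOwner, and the assign and call methods are invented for this example and do not correspond to any particular agent framework.

```python
class NotCurrentOwner(Exception):
    """Raised when an agent acts on a task it no longer owns."""


class ToolGateway:
    """Hypothetical tool boundary that checks task ownership on every call."""

    def __init__(self) -> None:
        self._current = {}  # task_id -> (agent, epoch)
        self.log = []       # accepted actions as (task_id, agent, action)

    def assign(self, task_id: str, agent: str) -> int:
        """Supervisor assigns or reassigns a task; the epoch advances each time."""
        _, epoch = self._current.get(task_id, ("", 0))
        self._current[task_id] = (agent, epoch + 1)
        return epoch + 1

    def call(self, task_id: str, agent: str, epoch: int, action: str) -> None:
        # Reject any caller whose (agent, epoch) pair is not the current one.
        if self._current.get(task_id) != (agent, epoch):
            raise NotCurrentOwner(f"{agent}@{epoch} no longer owns {task_id}")
        self.log.append((task_id, agent, action))
```

An agent that started drafting under epoch 1 will be refused when it tries to file after the supervisor has reassigned the task under epoch 2.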

Closing takeaway

Use leases to decide who may try. Use fencing to decide whose writes still count.