Background Jobs: Common Pitfalls and How to Avoid Them

Most developers learn background jobs through a deceptively simple mental model:

Put a task into a queue.
A worker picks it up.
The task gets executed.

That’s it.

At least, that’s what we tell ourselves when we build our first async system.

Need to send an email? Throw it into a queue.
Generate a PDF? Queue.
Process images? Queue.

Queues feel like magic. They make slow operations disappear from your request-response cycle.

And for a while, everything works perfectly.

Then production happens.

A customer receives the same email four times.
A payment gets processed twice.
A report generation task disappears without a trace.
Workers suddenly spike to 100% CPU because thousands of jobs retry at once.

The system that was supposed to increase reliability becomes your biggest source of incidents.

And here’s the uncomfortable truth:

Background jobs are one of the most misunderstood parts of modern backend systems.

In this article, we’ll unpack what’s really going on behind queues — and how to design systems that don’t fall apart under real-world conditions.

The Happy Path Is a Lie

Let’s start with a familiar flow:

User Registration
        ↓
Create User
        ↓
Push Job to Queue
        ↓
Worker Picks Job
        ↓
Send Email

Looks clean. Predictable. Safe.

Most developers stop thinking here.

But distributed systems don’t live on the happy path.

Here’s what can actually go wrong:

Redis crashes right after pushing the job.
A worker crashes after sending the email.
The email provider responds too slowly.
The network drops mid-processing.
The worker finishes the task but fails to ACK.

Each of these creates a different failure mode.

And production systems hit these scenarios every day.

The key insight:

A queue does not guarantee execution. It only coordinates work.

Once you internalize this, your architecture starts to change.

At-Least-Once Delivery

Most queue systems operate with at-least-once delivery:

RabbitMQ
BullMQ
Sidekiq
AWS SQS
Kafka consumers
Celery

What this actually means:

Your job can run more than once.

Not exactly once. Not maybe once.
Multiple times.

Example:

Worker receives job
        ↓
Worker sends email
        ↓
Worker crashes
        ↓
No ACK sent
        ↓
Queue retries job

From the queue’s perspective, the job was never completed.

So it retries.

But the email was already sent.

Result? Duplicate emails.

This isn’t a bug.
This is expected behavior.

Idempotency Is Not Optional

Once you accept duplicate execution, one requirement becomes unavoidable:

Your jobs must be idempotent.

Bad example:

account.balance += 100;

Run twice → user gets $200.

Good example:

if (!processedTransactions.has(transactionId)) {
  await creditAccount();
  await markTransactionProcessed(transactionId);
}

Now duplicates are harmless.

This pattern applies everywhere:

Emails — store notification IDs.
Payments — use transaction IDs.
Orders — enforce unique references.
Webhooks — track processed events.

A more production-friendly version usually combines multiple protections:

async function processPaymentJob(job: { transactionId: string; userId: string; amount: number }) {
  const alreadyProcessed = await db.paymentEvents.findUnique({
    where: { transactionId: job.transactionId },
  });

  if (alreadyProcessed) return;

  await db.$transaction(async (tx) => {
    await tx.paymentEvents.create({
      data: {
        transactionId: job.transactionId,
        userId: job.userId,
        amount: job.amount,
      },
    });

    await tx.accounts.update({
      where: { userId: job.userId },
      data: { balance: { increment: job.amount } },
    });
  });
}

The goal is not to prevent duplicate delivery.

The goal is to make duplicate delivery harmless.

The Myth of Exactly-Once

Sooner or later, every backend engineer asks:

Can I guarantee exactly-once processing?

Short answer: not really.

Failures can happen between any two steps:

Charge customer
        ↓
Success
        ↓
Save result in DB

If the service crashes in between:

Payment succeeded.
Database says it didn’t.

Now you have an inconsistent system.

“Exactly-once” in real systems usually means a mix of:

Deduplication
Idempotency
Checkpointing
Transaction logs
Atomic writes

Practical takeaway:

Design for duplicates. Don’t fight them.

Retries Can Kill Your System

Retries sound harmless.

Until they aren’t.

Imagine:

You have 50,000 jobs.
A third-party API goes down.
Every job retries immediately.

Now you’ve created a retry storm.

Instead of helping recovery, you’re amplifying the outage.

Better approach: exponential backoff.

Example:

Retry #1 → 5 seconds
Retry #2 → 30 seconds
Retry #3 → 2 minutes
Retry #4 → 10 minutes
Retry #5 → 1 hour

Even better: add jitter.

Instead of synchronized retries, jobs spread out over time.

A practical pattern:

function getRetryDelay(attempt: number) {
  const base = Math.min(60_000 * 2 ** attempt, 3_600_000);
  const jitter = Math.floor(base * (0.5 + Math.random() * 0.5));
  return jitter;
}

This is standard in large-scale systems for a reason.

Dead Letter Queues Save You

Some jobs are just broken:

Invalid payload
Missing data
Deleted resources
Validation errors

Retrying them forever is pointless.

Yet many systems do exactly that.

Solution: Dead Letter Queues (DLQ)

Main Queue
      ↓
Fails multiple times
      ↓
Dead Letter Queue

Benefits:

Cleaner monitoring
Lower infrastructure cost
Faster debugging
Predictable behavior

If your queue supports DLQ, enable it early.

Not later. Not “when needed.”
From day one.

Long Jobs Break Easily

A common anti-pattern:

One massive job doing everything:

Generate report
Download assets
Resize images
Upload files
Send email
Update DB

Looks efficient.

It’s not.

If step 5 fails, what do you do?

Re-run everything?
Risk duplicates?
Re-upload files?

Instead, break it down:

Generate Report
        ↓
Process Assets
        ↓
Upload Results
        ↓
Notify User

Benefits:

Easier retries
Better observability
Smaller failure scope
Independent scaling

Smaller jobs fail less catastrophically.

Monitoring Matters More Than Processing

Most teams:

Spend weeks building queues.
Spend zero time monitoring them.

That’s a mistake.

Track at least:

Queue length — are workers keeping up?
Processing latency — are jobs waiting too long?
Job duration — any unexpected spikes?
Failure rate — early signal of issues.
Retry volume — warning before incidents.

Without visibility, you’re debugging blind.

Backpressure Is the Silent Killer

If your system receives:

100 jobs/sec

But can process:

50 jobs/sec

You’re in trouble.

The backlog will grow forever.

Eventually:

Memory explodes.
Latency becomes unacceptable.
The system degrades.

This is backpressure.

Handle it with:

Rate limiting
Worker scaling
Priority queues
Load shedding

Every system has limits. Ignoring them is not a strategy.

What Production Systems Actually Look Like

What we imagine:

Queue → Worker → Done

Reality:

App
 ↓
Queue
 ↓
Workers
 ↓
Retry logic
 ↓
Dead Letter Queue
 ↓
Monitoring
 ↓
Alerting

Notice something?

Actual processing is the simplest part.

Most complexity is in handling failure.

A Quick Production Checklist

Before shipping any queue system:

Is the job idempotent?
Can it safely run multiple times?
Are retries using exponential backoff?
Is jitter enabled?
Is DLQ configured?
Are metrics in place?
Have failure scenarios been tested?
Can workers scale horizontally?
Is backpressure handled?
Are jobs small and composable?

If not, you're not production-ready yet.

Final Thought

Queues look simple because tutorials only show success paths.

Real systems deal with:

Crashes
Network failures
Duplicate messages
Out-of-order execution
Retry storms

And still need to work.

The key shift:

Background jobs aren’t about executing work. They’re about executing work reliably under constant failure.

Once you start thinking this way, your architecture changes.

And that’s when your queue system becomes something you can actually trust.

Why Your Background Jobs Are Lying to You

The Happy Path Is a Lie

At-Least-Once Delivery

Idempotency Is Not Optional

The Myth of Exactly-Once

Retries Can Kill Your System

Dead Letter Queues Save You

Long Jobs Break Easily

Monitoring Matters More Than Processing

Backpressure Is the Silent Killer

What Production Systems Actually Look Like

A Quick Production Checklist

Final Thought

Comments

More from this blog

Designing a Real-Time Chat System at Scale

CI/CD for Modern Applications: from manual deployments to reliable delivery pipelines

Real-Time Notification Systems Are Harder Than Most Teams Expect (And How to Build Them Right)

Why Most AI Products Fail to Build Real Technical Moats

Command Palette

The Happy Path Is a Lie

At-Least-Once Delivery

Idempotency Is Not Optional

The Myth of Exactly-Once

Retries Can Kill Your System

Dead Letter Queues Save You

Long Jobs Break Easily

Monitoring Matters More Than Processing

Backpressure Is the Silent Killer

What Production Systems Actually Look Like

A Quick Production Checklist

Final Thought

Comments

More from this blog