Fail at Startup, Not at Breach
A production system that starts with default signing keys has no JWT security. A system with concurrent token refreshes breaks user integrations silently. The fix: crash at startup if secrets are insecure, lock concurrent refreshes atomically, and make the dev/prod boundary absolute.
The Most Expensive Line of Code You Will Never Write
Somewhere in your deployment, there is a signing key. It signs every JWT the platform issues. Every authentication token. Every session. Every WebSocket credential. If that key is the default value from the development configuration, every token is predictable. Every session is forgeable. Every authentication check is theater.
This happens more than you think. A team deploys to production. The environment variables are mostly configured. The application starts. It serves requests. Everything looks normal. Nobody checks whether the signing key was changed from the default because the application did not complain.
The application did not complain because the application does not check. It takes whatever key it is given and uses it. A 64-character cryptographic random key and the string "change-me-in-production" look identical to the signing function. Both produce valid JWTs. Both verify correctly. The difference is that one is secure and the other can be forged by anyone who reads the development configuration.
The cost of detecting this at startup: one string comparison. The cost of detecting this at breach: incalculable.
A system that starts with insecure defaults is an insecure system that has not been exploited yet.
Startup Validation: Crash Before Serving
HeartBeatAgents validates critical security configuration at startup, before the application accepts any requests. If the validation fails, the application does not start. Not "starts with a warning." Not "starts in degraded mode." Does not start. The process terminates with an error that tells the operator exactly what is wrong and exactly how to fix it.
Two secrets are validated:
The signing key. Used to sign every JWT the platform issues. If this key is the default development value, every token is predictable. An attacker who knows the default key (from the source code, from the documentation, from a leaked configuration file) can forge any JWT. Any user identity. Any organization. Any permission level. The startup check detects the default value and crashes with an error message that includes the exact command to generate a secure replacement.
The encryption key. Used to encrypt OAuth tokens at rest in the database. If this key is missing, OAuth tokens are stored in plaintext. A database breach exposes every integration credential for every user. The startup check detects the missing key and crashes with an error message that includes the exact command to generate one.
The error messages are actionable. They do not say "invalid configuration." They say what is wrong, why it matters, and the exact terminal command to generate a secure value. An operator who sees the error can fix it in under thirty seconds. There is no ambiguity. There is no "consult the documentation." The fix is in the error message.
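A minimal sketch of this startup check, assuming hypothetical variable names (`JWT_SIGNING_KEY`, `TOKEN_ENCRYPTION_KEY`) and a hypothetical default value; the platform's real configuration keys are not shown here:

```python
import os
import sys

# Assumed default; the real platform's default value may differ.
DEFAULT_SIGNING_KEY = "change-me-in-production"

def validate_security_config() -> None:
    """Crash at startup, before any request is served, if secrets are insecure."""
    errors = []

    # One string comparison detects the forgeable-JWT configuration.
    if os.environ.get("JWT_SIGNING_KEY", DEFAULT_SIGNING_KEY) == DEFAULT_SIGNING_KEY:
        errors.append(
            "JWT_SIGNING_KEY is the default development value; every JWT is forgeable.\n"
            '  Fix: export JWT_SIGNING_KEY="$(openssl rand -hex 32)"'
        )

    # A missing encryption key means OAuth tokens sit in plaintext at rest.
    if not os.environ.get("TOKEN_ENCRYPTION_KEY"):
        errors.append(
            "TOKEN_ENCRYPTION_KEY is missing; OAuth tokens would be stored in plaintext.\n"
            '  Fix: export TOKEN_ENCRYPTION_KEY="$(openssl rand -hex 32)"'
        )

    if errors:
        sys.stderr.write("FATAL: insecure security configuration\n" + "\n".join(errors) + "\n")
        raise SystemExit(1)  # the process terminates; no requests are served
```

Note that each error string carries the exact fix, so the operator never has to leave the terminal.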
This is not a warning. A warning is a message that scrolls past in a log file. Warnings are ignored. Warnings accumulate. Nobody reads a startup log that says "everything is fine except these three warnings." They read a startup log that says "the application did not start." A crash is impossible to ignore. A crash blocks deployment. A crash forces the fix before the first request is served.
The Distributed Token Refresh Problem
OAuth providers issue single-use refresh tokens. When a refresh token is used to obtain a new access token, the old refresh token is invalidated. A new refresh token is issued alongside the new access token. This is standard OAuth security: it ensures that a stolen refresh token can only be used once.
Now consider what happens when two concurrent agent runs need to refresh the same integration's token at the same time:
Run A sends the refresh token first. The provider issues new tokens and invalidates the old refresh token. Run B sends the same (now invalidated) refresh token. The provider rejects it. Run B's integration is broken. The user's connection to that service is severed until they manually re-authenticate.
This is not a hypothetical. It happens in any multi-agent platform where agents share integration credentials and refresh tokens concurrently. The more agents you run, the more likely the race condition becomes. With a single agent, you might never see it. With ten agents sharing the same calendar integration, it is a matter of days.
Atomic Distributed Locking
The fix is a distributed lock: before refreshing a token, acquire a lock on that specific integration account. If the lock is already held, wait briefly and then read the token again (another process has refreshed it for you). If the lock is not held, acquire it, refresh the token, and release it.
The lock uses an atomic set-if-not-exists operation with an automatic expiration. This is one operation, not two. There is no gap between "check if the lock exists" and "create the lock." The check and the creation are the same operation. This eliminates the race condition at the lock level: two processes cannot both acquire the lock because acquisition is atomic.
The automatic expiration provides crash safety. If the process that holds the lock crashes mid-refresh, the lock expires on its own after a timeout. Other processes do not wait indefinitely. The timeout is generous enough to cover slow networks and provider API latency, and short enough that a crash does not block other processes for long.
The result: the token is refreshed exactly once. Both agent runs get valid credentials. The user's integration is never broken. The race condition is eliminated, not mitigated. There is no window where two processes can refresh simultaneously.
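The pattern above can be sketched as follows. The store, callables, lock-key format, and timeout values are assumptions: `client` is any backend with Redis-style atomic SET NX EX semantics (in production, for example, a `redis.Redis()` client), and `refresh_fn`/`read_token_fn` stand in for the platform's real refresh and read functions:

```python
import time

LOCK_TTL_SECONDS = 30      # long enough for slow provider APIs, short enough for crash recovery
RETRY_DELAY_SECONDS = 0.5  # brief wait before re-reading the token another process refreshed

def refresh_with_lock(client, account_id, refresh_fn, read_token_fn):
    """Refresh one integration account's token exactly once across concurrent runs."""
    lock_key = f"token-refresh-lock:{account_id}"

    # Atomic set-if-not-exists with expiration: the existence check and the
    # lock creation are one operation, so two processes can never both win.
    acquired = client.set(lock_key, "1", nx=True, ex=LOCK_TTL_SECONDS)

    if not acquired:
        # Another process holds the lock: wait briefly, then read the token
        # it stored instead of refreshing (and invalidating) the old one again.
        time.sleep(RETRY_DELAY_SECONDS)
        return read_token_fn(account_id)

    try:
        return refresh_fn(account_id)
    finally:
        client.delete(lock_key)  # the TTL covers a crash before this line
```

The `finally` release plus the TTL is what makes the lock crash-safe: a clean run releases immediately, and a crashed run blocks others only until the expiration fires.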
Transport Security
Every HTTP response from the platform includes security headers that instruct browsers how to handle the content:
Strict Transport Security (HSTS). Browsers that have visited the platform once are instructed to use HTTPS for all subsequent requests for one year, including all subdomains. This prevents downgrade attacks where an attacker forces a browser to use HTTP instead of HTTPS. After the first visit, the browser refuses to connect over plaintext. Period.
Clickjacking prevention. The platform cannot be embedded in an iframe on another site. This prevents an attacker from overlaying the platform's interface with invisible elements and tricking users into clicking buttons they cannot see. The response header instructs browsers to refuse rendering if the page is inside a frame on a different origin.
MIME type enforcement. Browsers are instructed not to guess the content type of responses. Without this header, a browser might interpret a JSON response as HTML if it contains HTML-like content, potentially executing injected scripts. With the header, the browser trusts the server's declared content type and does not attempt to sniff an alternative.
Referrer control. When a user navigates from the platform to an external site, the referrer header is limited to the origin (domain) only. The full URL path (which might contain conversation IDs, agent IDs, or other sensitive identifiers) is not sent. This limits information leakage through referrer headers.
These headers are sent on every response, including error responses. A 404 page, a 500 error, a redirect: all carry the same security headers. Error responses are often overlooked in header configuration. An error page without HSTS is a downgrade attack surface.
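The four policies above map to standard HTTP response headers. The exact values below are illustrative (the platform's real values are not given in this text), but they implement the behaviors described:

```python
# Illustrative header set mirroring the policies described above.
SECURITY_HEADERS = {
    # HSTS: one year, including subdomains. After the first visit,
    # the browser refuses plaintext HTTP entirely.
    "Strict-Transport-Security": "max-age=31536000; includeSubDomains",
    # Clickjacking: refuse to render inside a frame on a different origin.
    "X-Frame-Options": "SAMEORIGIN",
    # MIME enforcement: trust the declared Content-Type, never sniff.
    "X-Content-Type-Options": "nosniff",
    # Referrer control: send only the origin cross-site, never the full path.
    "Referrer-Policy": "strict-origin-when-cross-origin",
}

def apply_security_headers(headers: dict) -> dict:
    """Attach the security headers to every response, including errors and redirects."""
    headers.update(SECURITY_HEADERS)
    return headers
```

In practice this runs as middleware wrapped around the entire application, so a 404, a 500, and a redirect all pass through the same function.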
TLS is enforced at version 1.2 and above. TLS 1.0 and 1.1 are disabled entirely. Both have known vulnerabilities that allow traffic interception under specific conditions. There is no backward compatibility for broken protocols.
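When the application terminates TLS itself (rather than at a load balancer, which is equally common), a minimum-version constraint expresses this policy in one line. This is a sketch, not the platform's actual deployment:

```python
import ssl

# Enforce TLS 1.2+: handshakes offering only TLS 1.0 or 1.1 are refused.
context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
context.minimum_version = ssl.TLSVersion.TLSv1_2
```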
Path Traversal Prevention
When an agent uploads a file as part of an API integration call, the filename must be sanitized. An unsanitized filename like ../../etc/passwd could be used in a path traversal attack, where the directory traversal characters escape the intended upload directory and access files elsewhere on the filesystem.
HeartBeatAgents strips all directory components from uploaded filenames, leaving only the base filename. The directory traversal characters are removed before the filename is used in any file operation. Additionally, file paths are validated to be within the shared folder boundary. A file operation that references a path outside the explicitly shared folders is rejected.
This is a belt-and-suspenders defense. The filename sanitization prevents traversal through the filename itself. The path validation prevents traversal through the path argument. Neither alone is sufficient (a creative attacker might find a way around one), but together they close the path traversal surface.
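Both layers of the defense are short. A sketch of each (the shared-folder layout and function names are assumptions for illustration):

```python
import os.path

def sanitize_filename(filename: str) -> str:
    """Strip all directory components, leaving only the base filename."""
    # Normalizing backslashes first also defuses Windows-style traversal.
    return os.path.basename(filename.replace("\\", "/"))

def is_within_shared_folder(path: str, shared_root: str) -> bool:
    """Reject any path that resolves outside the shared folder boundary."""
    resolved = os.path.realpath(path)        # collapses ../ segments and symlinks
    root = os.path.realpath(shared_root)
    return os.path.commonpath([resolved, root]) == root
```

Resolving the path before comparing is the important detail: a prefix check on the raw string would accept `/srv/shared/../secrets.env`, while the resolved form exposes the escape.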
The Single-Flag Philosophy
The boundary between development and production is controlled by a single environment variable. Not a set of variables. Not a configuration file. One variable.
This variable controls every security bypass in the platform:
Startup validation. Default secrets crash in production, warn in development.
WebSocket authentication. Required in production, optional in development.
Credential broker. Required in production, falls back to raw tokens in development.
Egress policy. Enforced in production, skippable in development.
Network isolation. Enforced in production, relaxed in development.
One flag. Checked at every bypass point. Logged loudly at every bypass point.
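The whole pattern reduces to one small function called at every bypass point. The environment variable name and log prefix below are hypothetical; the source does not name them:

```python
import logging
import os

logger = logging.getLogger("security")

# Hypothetical searchable prefix, identical at every bypass point.
BYPASS_PREFIX = "[DEV-MODE-BYPASS]"

def security_bypass_allowed(check_name: str) -> bool:
    """Gate every security bypass on the single flag, and never skip silently."""
    if os.environ.get("APP_ENV", "production") == "production":
        return False  # in production, no bypass path exists
    # The same prefix and level everywhere: one log search audits them all.
    logger.warning("%s %s skipped: development mode", BYPASS_PREFIX, check_name)
    return True
```

Each bypass point reads the same way, for example `if not security_bypass_allowed("websocket-auth"): enforce_auth()`, and a search of production logs for the prefix should return zero lines.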
Loud, Searchable, Never Silent
Every security bypass in development mode is logged with a consistent, searchable prefix. The prefix is the same across every component: startup validation, WebSocket authentication, credential broker, egress policy, network isolation. Every bypass. Same prefix. Same log level.
This design serves one purpose: a single search across production logs for that prefix should return zero results. Zero. If it returns anything, the system is running in development mode in production. The search takes seconds. The audit is binary: zero results means production security is enforced everywhere. Any result means it is not.
The bypasses are never silent. There is no code path where a security check is skipped without a log entry. A developer running locally sees the warnings in their terminal. They know which security layers are relaxed. They know their local environment does not match production. There are no surprises when deploying to production because the warnings made the differences visible throughout development.
This is the critical distinction between a single flag that bypasses silently and a single flag that bypasses loudly. Silent bypasses create a false sense of security. Loud bypasses create awareness. The developer knows. The operator knows. The log search confirms.
Why Crashing Is Safer Than Warning
Warnings are ignored. This is not a failure of discipline. It is human nature. A system that starts successfully with three warnings in the log is a system that is running. It is serving requests. Users are using it. The operator sees "running" and moves on. The warnings sit in a log file that nobody reads until something breaks.
A crash is different. A crash means the system is not running. Users cannot use it. The deployment pipeline reports failure. The operator cannot move on. The crash demands attention. It demands a fix. It demands the fix now, not "when we get to it."
For security configuration, crashing is the correct behavior because the alternative is worse. A system running with a default signing key is a system that appears to work correctly. Authentication succeeds. JWTs are issued. Sessions are maintained. Everything functions. But every token is forgeable. Every session is compromisable. The system is functionally insecure while appearing functionally correct.
A warning does not prevent this. A crash does. The system either starts with secure configuration or it does not start. There is no middle state. There is no "running insecurely." In production, insecure configuration is not a degraded state. It is a failed state. The application treats it as such.
What to Check in Your Platform
"What happens if production starts with default signing keys?" If it starts normally, every JWT is forgeable. If it crashes with an actionable error, the misconfiguration is caught before the first request.
"What happens when two agent runs refresh the same OAuth token concurrently?" If both refreshes proceed, one will fail and the user's integration breaks silently. If a distributed lock serializes refreshes, the token is refreshed exactly once and both runs succeed.
"How many environment variables control the dev/prod security boundary?" If the answer is more than one, there is a combinatorial explosion of configurations, some of which are partially secure (the most dangerous state). If the answer is one, the boundary is binary: fully secure or fully relaxed.
"Search your production logs for the dev mode bypass prefix. How many results?" The correct answer is zero. Any other answer means production is running with development security bypasses active.
Startup validation, distributed locking, security headers, and the single-flag philosophy are not glamorous. They are operational hygiene. But operational hygiene is what separates a system that is secure in design from a system that is secure in practice. The design can be perfect. If the deployment runs with default keys, the design is irrelevant.
Fail at startup. Not at breach.