Debugging In The Wild: My Cockpit Reading JSON File Nightmare


Picture this: it’s 2:47 AM, your coffee is cold, and you’re SSH’d into a production server that absolutely cannot go down. Somewhere between Frankfurt and Ashburn, a Kubernetes pod decided it couldn’t read a config file. Not a database. Not a secret. A JSON file. A plain, boring JSON file that worked fine yesterday. Welcome to Debugging in the Wild: My Cockpit Reading JSON File Nightmare.

If you’ve ever had an app crash because it refused to read a file that your own eyes can see right there on disk, you know the special kind of rage this brings. This post walks you through exactly how I diagnosed, fixed, and bulletproofed a JSON loading issue inside a Cockpit-based dashboard, plus the lessons that will save you hours the next time your code lies to your face. We’ll keep it simple, practical, and real. No fluff. Just the stuff I wish someone had told me at 3 AM.


What Was Actually Happening: The Symptom

Let me set the stage. We run a small internal tool built on Cockpit CMS to serve dashboards for our edge devices. Each device drops a status.json into /var/cockpit/devices/{device_id}/status.json. The dashboard reads that file, parses it, and renders charts.
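
For context, a status.json is small and flat. Here’s a made-up example (the real schema doesn’t matter for this story):

Code

{ "device_id": "9b1e", "online": true, "cpu_pct": 12.4, "updated_at": "2025-06-01T02:47:00Z" }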

Suddenly, 14% of devices started showing “No data available.” The logs said: ENOENT: no such file or directory, open '/var/cockpit/devices/9b1e.../status.json'.

Except the file was there. I could cat it. Permissions looked fine. ls -la showed -rw-r--r-- cockpit cockpit. But Node.js kept throwing ENOENT. That’s when I knew this wasn’t a simple typo. This was a Debugging in the Wild moment.

The Environment Breakdown

Before we dive deeper, here’s what the stack looked like:

  • Runtime: Node.js 20 running in a Docker container
  • Orchestration: Kubernetes 1.29, 3 nodes, hostPath volume for device data
  • File source: IoT agents writing JSON via atomic write -> rename pattern
  • Reader: Cockpit plugin using fs.readFile with a 2s cache
  • Red herring: Only failed on specific nodes, and only after ∼36h uptime

If any of that sounds familiar, you’re probably nodding already. If not, stick around. The root cause applies to Python, Go, Rust, or any language that touches the filesystem.

The First 5 Things I Checked That Went Nowhere

When you’re debugging in the wild, you start with the obvious. I did. All of these were dead ends, but they’re worth listing so you don’t repeat them.

File Permissions and Ownership

ls -la said cockpit:cockpit 644. The Node process ran as user cockpit. I even ran sudo -u cockpit cat /path/to/status.json from inside the container. It printed fine. Not a permissions issue.

File Encoding and BOM Characters

JSON files written by different agents can sneak in a UTF-8 BOM. That breaks JSON.parse, not fs.readFile. I opened the file with hexdump -C and confirmed: no BOM, no weird Windows line endings. The file was clean.
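
Still, a defensive parse is cheap insurance. A minimal sketch, assuming you’ve already read the file into a string (the helper name is mine, not from the plugin):

JavaScript

// Hypothetical helper: strip a UTF-8 BOM before parsing.
// After decoding to a string, a BOM surfaces as the single character '\uFEFF'.
function parseJsonTolerant(text) {
  const clean = text.charCodeAt(0) === 0xfeff ? text.slice(1) : text;
  return JSON.parse(clean);
}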

Path Typos and Case Sensitivity

Linux is case-sensitive. “Status.json” is not “status.json”. I copied the exact path from the error log and pasted it into test -f. It returned true. The path was correct.

Symlinks and Mount Points

Sometimes /var/cockpit is a symlink to a network volume that flakes out. I ran df -h and mount | grep cockpit. It was a local hostPath, not NFS. No symlink weirdness.

Disk Full or Inode Exhaustion

df -i showed 12% inode usage. df -h showed 41% disk usage. Plenty of room.

At this point I was 45 minutes in and had nothing. That’s when you have to go deeper than the app logs.

The Real Debugging Begins: Tracing The Filesystem

Debugging in the wild means you stop trusting your code and start trusting the kernel.

Strace Is Your Best Friend at 3 AM

I attached to the running Node process with strace -p <pid> -e trace=file -f and watched what happened when the dashboard refreshed. Here’s the smoking gun:

Code

[pid 1284] openat(AT_FDCWD, "/var/cockpit/devices/9b1e.../status.json", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)

But immediately after, I ran ls in the same container and the file was there. So how can openat fail with ENOENT if the file exists?

Two possibilities: the file was deleted between ls and open, or the process wasn’t seeing the same filesystem view.

The Rename Race Condition

I checked how the IoT agents wrote files. They did this:

  • Write to status.json.tmp.18273
  • fsync the file
  • rename('status.json.tmp.18273', 'status.json')
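
In Node, a minimal sketch of that pattern looks like this (the real agents may not be JavaScript; names are illustrative):

JavaScript

// Sketch of the atomic write -> rename pattern (illustrative, not the agent code).
const fs = require('fs');
const path = require('path');

function writeStatusAtomic(dir, status) {
  const tmp = path.join(dir, `status.json.tmp.${process.pid}`);
  const fd = fs.openSync(tmp, 'w');
  try {
    fs.writeSync(fd, JSON.stringify(status));
    fs.fsyncSync(fd); // flush contents to disk before the rename
  } finally {
    fs.closeSync(fd);
  }
  // Atomic on POSIX: readers see the old file or the new one, never a partial write.
  fs.renameSync(tmp, path.join(dir, 'status.json'));
}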

That’s the correct atomic pattern on POSIX. Rename is supposed to be atomic. If the reader opens status.json while the rename happens, it should see either the old file or the new file, never “missing”.

Unless… the reader’s directory listing is cached.

Node’s fs.readdir Cache and Docker OverlayFS

Our Cockpit plugin listed the device directory every 2 seconds and cached results for 30 seconds to avoid hammering disk. The logic was: if status.json isn’t in the cached list, don’t even try to open it. Throw “No data” early.
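
Reconstructed from memory, the cache looked roughly like this (a sketch, not the exact plugin code):

JavaScript

// The "optimization" in question: a 30-second TTL cache over readdir (reconstruction).
const fs = require('fs');

const listingCache = new Map(); // dir -> { files, expires }

async function cachedReaddir(dir) {
  const hit = listingCache.get(dir);
  if (hit && hit.expires > Date.now()) return hit.files; // can be up to 30s stale
  const files = await fs.promises.readdir(dir);
  listingCache.set(dir, { files, expires: Date.now() + 30_000 });
  return files;
}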

On Docker OverlayFS, when a file is replaced via rename, the upper layer gets a new inode. If another process on the same node has an open directory handle, its dentry cache can go stale until it re-opens the directory. Our 30s cache held the old directory state. So the code thought the file didn’t exist, skipped the readFile, and logged ENOENT preemptively. The log lied. The open call never happened.

That was the nightmare: the error message blamed the filesystem, but the bug was in my own “optimization”.

Why It Only Happened On Some Nodes

This is the part that made it a true Debugging in the Wild story. The failure only occurred on nodes where two things were true:

  • The node had been up for >24h, so the OverlayFS cache had lots of churn
  • The node was running both the writer agent and the reader pod for the same device

Kubernetes likes to co-locate pods for data locality. So the same node was doing the atomic rename and the cached readdir. On other nodes, the reader saw the file via a fresh mount, so the cache miss rate was lower.

The Fix: Three Lines That Ended The Nightmare

Once I knew the cache was lying, the fix was simple. Stop trusting the cache for existence checks.

Remove Existence Check Before Read

Bad code:

JavaScript

const files = await cachedReaddir(dir);
if (!files.includes('status.json')) {
  throw new Error('ENOENT: no such file or directory'); // We lied here
}
return fs.readFile(path);

Good code:

JavaScript

try {
  return await fs.readFile(path); // Let the kernel tell you if it exists
} catch (e) {
  if (e.code === 'ENOENT') return null;
  throw e;
}


Let open be the source of truth. It’s one syscall whether the file exists or not. The premature readdir was a false optimization.

Invalidate Cache on Inotify Events

If you must cache directory listings, hook into inotify to wipe the cache when the directory changes. For Node, chokidar does this well. We set it to watch /var/cockpit/devices and drop the cache key on add, unlink, or change events.
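
The wiring is a few lines. A sketch, assuming the listingCache map from the reconstruction above:

JavaScript

// Sketch: drop the cached listing for a directory whenever anything in it changes.
const chokidar = require('chokidar');
const path = require('path');

const watcher = chokidar.watch('/var/cockpit/devices', { ignoreInitial: true });
for (const event of ['add', 'unlink', 'change']) {
  watcher.on(event, (filePath) => listingCache.delete(path.dirname(filePath)));
}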

Add Telemetry Around the Failure

We added counters: json_read_attempts_total, json_read_success_total, and json_read_enoent_total. Within an hour we saw the ENOENT rate drop from 14% to 0.02%. The remaining ones were real: devices that were actually offline.
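
The names are Prometheus-style; here’s a sketch with prom-client, assuming that’s your metrics stack:

JavaScript

// Sketch: count every read attempt and outcome (prom-client assumed).
const fs = require('fs');
const client = require('prom-client');

const attempts = new client.Counter({ name: 'json_read_attempts_total', help: 'JSON read attempts' });
const successes = new client.Counter({ name: 'json_read_success_total', help: 'Successful JSON reads' });
const enoents = new client.Counter({ name: 'json_read_enoent_total', help: 'Reads that hit ENOENT' });

async function readStatusJson(filePath) {
  attempts.inc();
  try {
    const text = await fs.promises.readFile(filePath, 'utf8');
    successes.inc();
    return JSON.parse(text);
  } catch (e) {
    if (e.code === 'ENOENT') {
      enoents.inc(); // now a real miss: device offline or not yet written
      return null;
    }
    throw e;
  }
}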

Lessons You Can Steal For Your Own Debugging In The Wild

I lost sleep so you don’t have to. Here are the big takeaways from this Cockpit reading JSON file nightmare.

Error Messages Can Lie

ENOENT doesn’t always mean the file is missing. It means the code path you took concluded the file was missing. Check whether your code ever actually called open. Strace doesn’t lie.

Caching Filesystem State Is Dangerous

Directories change. Caching readdir is like caching DNS for a week. You save nanoseconds and pay with hours. If you need speed, use stat on the exact file, not a directory listing.
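
If you really need the short-circuit, something like this stays honest (a sketch; statusFileInfo is my name for it):

JavaScript

// Sketch: one stat on the exact path beats a cached directory listing.
const fs = require('fs');

async function statusFileInfo(filePath) {
  try {
    const st = await fs.promises.stat(filePath);
    return { exists: true, size: st.size, mtimeMs: st.mtimeMs };
  } catch (e) {
    if (e.code === 'ENOENT') return { exists: false };
    throw e;
  }
}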

Atomic Rename Isn’t Atomic to Everyone

rename is atomic for a single viewer. If you have one process listing the directory and another opening files, there’s a window where the listing is stale. Design for that.

Co-located Writers and Readers Expose Race Conditions

This bug would never show up in dev because my laptop never had the writer and reader on the same OverlayFS mount. Production topology matters. Test with chaos.

Observability Beats Hope

We added three metrics and a structured log with path, inode_before, inode_after, and cache_hit. The next time something weird happens, we’ll know in 2 minutes instead of 2 hours.

A Better Pattern For Cockpit Reading JSON File Safely

After this nightmare, here’s the pattern I now use for any service that reads JSON files dropped by another process:

  • Never check existence separately. Go straight to readFile and handle ENOENT.
  • Use fs.readFile with encoding: 'utf8' and wrap JSON.parse in try/catch. Log the first 200 characters on parse failure so you can see BOM or truncation issues.
  • If you need high throughput, put a memory cache in front of the parsed object, not the file list. Key it by path + mtime + size. On ENOENT, delete the key (see the sketch after this list).
  • For critical files, add a health check: if status.json is older than 2x the expected interval, alert. Don’t let a silent writer failure look like a reader bug.
  • When running in containers, remember that hostPath + OverlayFS + caching = pain. If you can, use a shared POSIX filesystem or an object store and avoid local file races entirely.
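
And here’s that parsed-object cache as a sketch, under the same assumptions (names are mine):

JavaScript

// Sketch: cache the parsed object keyed by mtime + size, never the directory listing.
const fs = require('fs');

const parsedCache = new Map(); // path -> { mtimeMs, size, obj }

async function readJsonCached(filePath) {
  try {
    const st = await fs.promises.stat(filePath);
    const hit = parsedCache.get(filePath);
    if (hit && hit.mtimeMs === st.mtimeMs && hit.size === st.size) return hit.obj;
    const obj = JSON.parse(await fs.promises.readFile(filePath, 'utf8'));
    parsedCache.set(filePath, { mtimeMs: st.mtimeMs, size: st.size, obj });
    return obj;
  } catch (e) {
    if (e.code === 'ENOENT') {
      parsedCache.delete(filePath); // really gone: drop the key
      return null;
    }
    throw e;
  }
}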

Conclusion

Debugging in the Wild: My Cockpit Reading JSON File Nightmare started as a simple “file not found” and ended as a lesson in filesystem semantics, container overlays, and the cost of premature optimization. The root cause wasn’t Cockpit, Node, Docker, or Kubernetes. It was a stale cache that made my own code report the wrong error.

If you take one thing from this 3 AM story, let it be this: when the filesystem and your logs disagree, believe the filesystem. Fire up strace, trace the syscalls, and question your assumptions. The bug is usually in the code you’re most proud of.

Now my dashboards are stable, my ENOENT rate is basically zero, and I sleep better. Until the next nightmare, at least.

FAQs

What is Debugging in the Wild: My Cockpit Reading JSON File Nightmare about?

It’s a real-world breakdown of how a cached directory listing in a Cockpit CMS plugin caused false ENOENT errors when reading status.json files on Docker OverlayFS, and how strace, cache invalidation, and removing existence checks fixed it.

Can Docker OverlayFS cause a JSON file to appear missing?

Yes. On OverlayFS, an atomic rename creates a new inode. If your app caches readdir results, it can hold a stale view of the directory and think the file is gone even though open would succeed.

Should I check if a file exists before reading it in Node.js?

No. Go straight to fs.readFile and catch the error. Checking with fs.existsSync or readdir first adds a race condition and can give you wrong answers under load.

How do I safely read JSON config files written by another process?

Use atomic write-and-rename on the writer side. On the reader side, call readFile directly, wrap JSON.parse in try/catch, and avoid caching directory listings. Use inotify if you must cache.

What tools helped solve the Cockpit JSON reading issue fastest?

strace -e trace=file to see actual syscalls, inotifywait to watch directory changes, and custom metrics for read attempts vs. ENOENT. Logs alone weren’t enough because the code logged the wrong error.

