When Your Database Goes Down at 3 AM: A MongoDB War Story
3:17 AM. Phone buzzes. Monitoring alert: “Project UI is down. MongoDB connection failed.”
I’m up, laptop open, VPN connecting. This is production. Customers are affected.
The Initial Investigation
SSH into the server. Check MongoDB status:
systemctl status mongod
● mongod.service - MongoDB Database Server
Active: inactive (dead)
...
Dec 28 03:14:23 server systemd[1]: mongod.service: Failed
MongoDB is dead. Check the logs:
tail -100 /var/log/mongodb/mongod.log
# Nothing. File exists but tail shows nothing?
ls -lh /var/log/mongodb/mongod.log
-rw-r--r-- 1 mongodb mongodb 0 Dec 28 03:14 mongod.log
# File is empty. That's weird.
Try to start MongoDB:
systemctl start mongod
# Fails immediately
journalctl -u mongod -n 50
Dec 28 03:17:45 mongod[12453]: Error opening log file: No space left on device
There it is. No space left on device.
Finding the Problem
Check disk space:
df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/rootvg-rootlv 50G 35G 15G 70% /
/dev/mapper/rootvg-var_loglv 2.0G 2.0G 20K 100% /var/log ← Here!
/dev/mapper/rootvg-homelv 20G 12G 8.0G 60% /home
/var/log is 100% full. MongoDB can’t write logs, so it won’t start.
What’s taking up space?
du -sh /var/log/* | sort -rh | head -5
1.8G /var/log/mongodb
150M /var/log/nginx
89M /var/log/syslog
45M /var/log/kern.log
32M /var/log/audit
MongoDB logs: 1.8GB out of 2GB partition. That’s the culprit.
ls -lh /var/log/mongodb/
-rw-r--r-- 1 mongodb mongodb 1.8G Dec 28 03:14 mongod.log.2024-12-27
-rw-r--r-- 1 mongodb mongodb 0 Dec 28 03:14 mongod.log
Yesterday’s log file: 1.8GB. Today’s: 0 bytes (the file exists, but nothing could be written - no space).
First Attempt: rm (Wrong!)
My first instinct:
rm /var/log/mongodb/mongod.log.2024-12-27
# Check disk space
df -h /var/log
/dev/mapper/rootvg-var_loglv 2.0G 2.0G 20K 100% /var/log
# Still 100% full?!
The file is gone from the directory listing, but the space isn’t freed. Why?
Checked for processes holding deleted files:
lsof | grep deleted | grep mongodb
# Nothing - but this filter only matches lines containing "mongodb",
# so any other process holding the file wouldn't show up
Then I remembered how rm actually works: it only removes the directory entry. The data blocks aren't freed until the last file descriptor open on that inode is closed. Something I hadn't spotted - a mongod shutdown still in flight, a log shipper, a stray tail -f - was evidently still pinning the file.
Lesson learned: rm removes the name, not the data. The space comes back only when every process that had the file open closes it.
The Quick Fix: truncate
Instead of deleting, empty the file:
truncate -s 0 /var/log/mongodb/mongod.log.2024-12-27
df -h /var/log
/dev/mapper/rootvg-var_loglv 2.0G 200M 1.8G 10% /var/log
# Space freed immediately!
Why truncate works (a quick demo follows this list):
- Empties the file's content while keeping the name and inode in place
- Open file descriptors stay valid, so the space is freed even if something still holds the file
- Disk space is released instantly
- A surviving process can keep writing to the same file (append-mode writers continue at the new end)
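If you want to see the difference yourself, here's a minimal demo you can run in a scratch directory (the filenames are made up for illustration):
sleep 300 > held.log &                         # background process keeps held.log open
dd if=/dev/zero of=held.log bs=1M count=100    # fill it with 100MB
df -h .                                        # usage up by ~100MB
truncate -s 0 held.log                         # space back instantly; sleep's descriptor stays valid
df -h .
# With rm held.log instead, df would not budge until sleep exits -
# the open descriptor keeps the unlinked inode and its blocks alive.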
Start MongoDB:
systemctl start mongod
systemctl status mongod
● mongod.service - MongoDB Database Server
Active: active (running)
# Success!
3:42 AM. MongoDB is back. UI is responsive. Crisis averted. For now.
Root Cause Analysis
Why did the log grow so large? Checked MongoDB configuration:
cat /etc/mongod.conf
systemLog:
  destination: file
  path: /var/log/mongodb/mongod.log
  logAppend: true
# No log rotation configured!
No log rotation. MongoDB just kept appending to the same file forever. Eventually it filled the partition.
Combined with:
- /var/log partition: only 2GB
- Production load: ~500MB of logs per day
- No monitoring: Nobody noticed until it crashed
Perfect recipe for disaster.
The Permanent Fix
Two options:
Option 1: Set up log rotation (stops the unbounded growth)
Option 2: Expand the /var/log partition (buys headroom)
I chose both.
Part 1: Expand the Partition
Check volume group:
vgs
VG #PV #LV #SN Attr VSize VFree
rootvg 1 10 0 wz--n- 100.00g 12.00g
# 12GB free in volume group
Extend the logical volume:
# Add 4GB to /var/log (double its size)
lvextend -L +4G /dev/mapper/rootvg-var_loglv
Extending logical volume rootvg/var_loglv to 6.00 GiB
Logical volume rootvg/var_loglv successfully resized.
Grow the filesystem (XFS in this case):
# XFS can be grown while mounted - no downtime!
xfs_growfs /var/log
meta-data=/dev/mapper/rootvg-var_loglv
= sectsz=512 attr=2, projid32bit=1
data = bsize=4096 blocks=524288, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=1
log =internal bsize=4096 blocks=2560, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
data blocks changed from 524288 to 1572864
Verify:
df -h /var/log
/dev/mapper/rootvg-var_loglv 6.0G 200M 5.8G 4% /var/log
# Now 6GB total!
Key insight: xfs_growfs works on mounted filesystems. No downtime needed. MongoDB kept running during the resize.
Note: For ext4 filesystems, use resize2fs instead of xfs_growfs.
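Worth knowing: lvextend can handle the filesystem step in the same command. The -r (--resizefs) flag hands off to fsadm, which picks xfs_growfs or resize2fs as appropriate:
# Extend the LV and grow the filesystem in one step
lvextend -r -L +4G /dev/mapper/rootvg-var_loglv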
Part 2: Set Up Log Rotation
Created /etc/logrotate.d/mongodb:
/var/log/mongodb/*.log {
    daily
    rotate 7
    compress
    delaycompress
    notifempty
    create 0640 mongodb mongodb
    sharedscripts
    postrotate
        /bin/kill -SIGUSR1 `cat /var/run/mongodb/mongod.pid 2>/dev/null` 2>/dev/null || true
    endscript
}
This config:
- Rotates logs daily
- Keeps 7 days of logs
- Compresses old logs
- Signals MongoDB to reopen log file after rotation
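One gotcha to check with this setup: by default mongod reacts to SIGUSR1 by renaming the log file itself (systemLog.logRotate: rename), which collides with logrotate's own rename-and-create cycle. Setting it to reopen makes mongod simply reopen the path logrotate just recreated:
# /etc/mongod.conf
systemLog:
  destination: file
  path: /var/log/mongodb/mongod.log
  logAppend: true     # required when logRotate is set to reopen
  logRotate: reopen   # on SIGUSR1, reopen the file instead of renaming it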
Test it:
logrotate -f /etc/logrotate.d/mongodb
ls -lh /var/log/mongodb/
-rw-r----- 1 mongodb mongodb 89M Dec 28 04:15 mongod.log
-rw-r----- 1 mongodb mongodb 156M Dec 28 03:14 mongod.log.1
-rw-r----- 1 mongodb mongodb 234M Dec 27 23:59 mongod.log.2.gz
-rw-r----- 1 mongodb mongodb 189M Dec 26 23:59 mongod.log.3.gz
Perfect. Logs rotate daily, old ones compressed.
Part 3: Monitoring
Added disk space monitoring:
# root's crontab (crontab -e) - run the check every 5 minutes
*/5 * * * * /usr/local/bin/check_disk_space.sh
check_disk_space.sh:
#!/bin/bash
# Alert when /var/log usage crosses the threshold.
THRESHOLD=80
# df -P (POSIX format) keeps each filesystem on one line, so awk reads the right column
USAGE=$(df -P /var/log | awk 'NR==2 {sub(/%/,""); print $5}')
if [ "$USAGE" -gt "$THRESHOLD" ]; then
    echo "/var/log is ${USAGE}% full" | mail -s "Disk Alert" ops@company.com
fi
Now we get alerts at 80% before it becomes critical.
What I Learned
1. truncate -s 0 vs rm for Active Log Files
Use truncate when:
- Process is still running and writing to file
- You can't be sure nothing still holds the file open
- You need space back immediately
- Can’t restart the process
Use rm when:
- File is not open by any process
- You want the file gone completely
- Normal log cleanup
Check open files:
lsof | grep deleted
If files show up as “deleted” but still using space, the process still has them open.
2. XFS vs ext4 Online Resize
XFS:
xfs_growfs /mount/point # Can only grow (not shrink)
ext4:
resize2fs /dev/mapper/device # Can grow online; shrinking requires unmounting first
Both can grow while mounted. XFS is simpler for growing filesystems (it can't shrink at all).
3. LVM is Your Friend
Without LVM, expanding a partition requires:
- Backing up data
- Repartitioning
- Restoring data
- Hours of downtime
With LVM:
lvextend -L +4G /dev/mapper/volume
xfs_growfs /mount/point
# Done in seconds, no downtime
4. Always Have Log Rotation
MongoDB doesn’t rotate logs by default. Neither do many other applications. Always set up logrotate:
ls /etc/logrotate.d/
apache2 mongodb nginx postgresql syslog ...
Every service that writes logs should have a logrotate config.
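Before trusting a new config, run logrotate in debug mode - it prints what it would do without rotating anything:
logrotate -d /etc/logrotate.d/mongodb
# -d implies a dry run: actions are printed, nothing is touched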
5. Monitor Disk Space Proactively
Don’t wait for failures. Monitor at multiple thresholds (a tiered script sketch follows this list):
- 70% - Warning (nice to know)
- 80% - Alert (take action soon)
- 90% - Critical (immediate action required)
- 95% - Emergency (might fail any moment)
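A tiered version of the earlier check script might look like this - the thresholds, labels, and mail address are illustrative, not our exact production values:
#!/bin/bash
# Send a single alert at the highest severity the current usage crosses.
USAGE=$(df -P /var/log | awk 'NR==2 {sub(/%/,""); print $5}')
for LEVEL in "95 EMERGENCY" "90 CRITICAL" "80 ALERT" "70 WARNING"; do
  set -- $LEVEL    # split into $1 (threshold) and $2 (label)
  if [ "$USAGE" -ge "$1" ]; then
    echo "/var/log is ${USAGE}% full" | mail -s "$2: /var/log disk usage" ops@company.com
    break    # stop after the most severe match
  fi
done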
6. Understand df vs du
- df (disk free): shows filesystem usage, including space still pinned by deleted-but-open files
- du (disk usage): sums visible file sizes, so deleted files don't count
If df shows more usage than du, you have deleted-but-open files.
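The quick confirmation (using this incident's mount point as the example):
df -h /var/log && du -sh /var/log    # a large gap between the two is the tell
lsof +L1 /var/log    # +L1 lists open files with link count 0, i.e. deleted but still held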
7. Plan Partition Sizes Carefully
Our original 2GB /var/log was too small:
# Current daily log generation
MongoDB: 500MB/day
Nginx: 50MB/day
System: 100MB/day
Total: ~650MB/day
# With 7-day retention = 4.5GB needed
# Gave 6GB = comfortable headroom
General rule: (daily log size) × (retention days) × 1.5 for safety margin.
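Plugging in our numbers (shell arithmetic, sizes in MB):
echo "$(( 650 * 7 * 15 / 10 )) MB"    # 6825 MB ≈ 6.8GB raw; compression of the
                                      # older rotations is why 6GB holds up in practice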
Post-Incident Review
Timeline:
- 03:14 AM: Partition fills, MongoDB crashes
- 03:17 AM: Alert received
- 03:25 AM: Problem identified (disk full)
- 03:35 AM: Quick fix applied (truncate)
- 03:42 AM: MongoDB restored
- 04:30 AM: Permanent fix implemented (LVM extend + logrotate)
- 05:00 AM: Monitoring added
Total downtime: 28 minutes
Impact:
- UI unavailable for 28 minutes
- No data loss (MongoDB clean shutdown)
- ~50 users potentially affected (low traffic time)
Preventive Measures Added:
- ✓ Expanded /var/log from 2GB to 6GB
- ✓ Configured log rotation (7-day retention)
- ✓ Added disk space monitoring
- ✓ Documented the truncate procedure
- ✓ Created runbook for disk-full scenarios
The Runbook
Created for next time:
# MongoDB Down - Disk Full
## Symptoms
- MongoDB won't start
- Error: "No space left on device"
- `df -h` shows 100% on /var/log
## Quick Fix (5 min)
1. Identify full partition: `df -h`
2. Find large files: `du -sh /var/log/* | sort -rh`
3. Truncate old logs: `truncate -s 0 /var/log/mongodb/old.log`
4. Restart MongoDB: `systemctl restart mongod`
## Permanent Fix (30 min)
1. Check VG space: `vgs`
2. Extend LV: `lvextend -L +4G /dev/mapper/rootvg-var_loglv`
3. Grow filesystem: `xfs_growfs /var/log` (or `resize2fs` for ext4)
4. Set up logrotate: `/etc/logrotate.d/mongodb`
5. Test rotation: `logrotate -f /etc/logrotate.d/mongodb`
## Verify
- MongoDB running: `systemctl status mongod`
- Logs rotating: Check `/var/log/mongodb/` after 24h
- Disk usage: `df -h /var/log` < 50%
Three Months Later
The fix has held up:
df -h /var/log
/dev/mapper/rootvg-var_loglv 6.0G 2.1G 3.9G 35% /var/log
ls -lh /var/log/mongodb/
-rw-r----- 1 mongodb mongodb 89M Mar 28 08:30 mongod.log
-rw-r----- 1 mongodb mongodb 91M Mar 27 23:59 mongod.log.1
-rw-r----- 1 mongodb mongodb 87M Mar 26 23:59 mongod.log.2.gz
-rw-r----- 1 mongodb mongodb 94M Mar 25 23:59 mongod.log.3.gz
# ... 7 days of logs, perfectly rotated
Disk usage: 35%. Logs rotating cleanly. No alerts. No incidents.
Key Takeaways
Before production:
- ✓ Plan partition sizes based on actual usage
- ✓ Set up log rotation for all services
- ✓ Configure monitoring with alerts
- ✓ Have LVM for flexibility
During incidents:
- ✓ truncate -s 0 frees space instantly
- ✓ df vs du helps diagnose
- ✓ lsof | grep deleted finds held files
- ✓ Check process status before rm
After resolution:
- ✓ Implement permanent fix, not just quick fix
- ✓ Document the incident
- ✓ Create runbooks
- ✓ Add monitoring to prevent recurrence
That 3 AM wake-up call taught me more about Linux filesystems, LVM, and production incident response than any tutorial could. MongoDB crashes are stressful, but they’re also the best learning opportunities.
Now our monitoring catches disk space issues at 80% - long before they become 3 AM emergencies.
Tools used:
- df -h - Check filesystem usage
- du -sh - Check directory/file sizes
- truncate -s 0 - Empty a file without deleting it
- lvextend - Extend a logical volume
- xfs_growfs - Grow an XFS filesystem online
- logrotate - Automated log rotation