Diskmon/README.md
2026-04-02 11:59:03 +00:00

## Overview
This project provides lightweight disk monitoring for a VPS with:
- Daily storage reports
- Rapid disk usage change detection
- Alerts sent to a Matrix room
- Minimal dependencies (Python + SQLite only)
It is designed to be:
- Simple
- Transparent
- Easy to debug
- Low overhead
---
## Architecture
```
systemd timers
      ↓
Python script (disk_monitor.py)
      ↓
SQLite (local state/history)
      ↓
Matrix API (alerts + reports)
```
---
## Components
### 1. Python Script
**Location:**
```
/opt/diskmon/disk_monitor.py
```
**Responsibilities:**
- Collect disk usage stats
- Store historical samples
- Detect rapid changes
- Format messages
- Send messages to Matrix
---
### 2. SQLite Database
**Location:**
```
/var/lib/diskmon/diskmon.sqlite3
```
**Purpose:**
- Store disk usage history
- Track alert cooldowns
---
### 3. Environment Config
**Location:**
```
/etc/diskmon.env
```
**Contents:**
```
MATRIX_HOMESERVER=https://matrix.yourdomain.com
MATRIX_ROOM_ID=!roomid:yourdomain.com
MATRIX_ACCESS_TOKEN=your_token
DISKMON_DB=/var/lib/diskmon/diskmon.sqlite3
DISKMON_MOUNT=/
```
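
As an illustration of how the script might consume these settings, here is a minimal sketch that reads them from the process environment and fails with a readable message instead of a bare `KeyError`. The function name `load_config`, the returned dict keys, and the fallback defaults for `DISKMON_DB` and `DISKMON_MOUNT` are assumptions for this example, not necessarily what `disk_monitor.py` does.

```python
import os

# Settings without which no Matrix message can be sent.
REQUIRED_KEYS = ("MATRIX_HOMESERVER", "MATRIX_ROOM_ID", "MATRIX_ACCESS_TOKEN")


def load_config(env=None):
    """Return a settings dict; raise a readable error if required keys are missing."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise RuntimeError("missing settings in /etc/diskmon.env: " + ", ".join(missing))
    return {
        "homeserver": env["MATRIX_HOMESERVER"].rstrip("/"),
        "room_id": env["MATRIX_ROOM_ID"],
        "access_token": env["MATRIX_ACCESS_TOKEN"],
        # Assumed defaults, mirroring the example file above.
        "db_path": env.get("DISKMON_DB", "/var/lib/diskmon/diskmon.sqlite3"),
        "mount": env.get("DISKMON_MOUNT", "/"),
    }
```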
---
### 4. systemd Timers
#### Sample Timer (every 5 min)
```
diskmon-sample.timer
```
#### Report Timer (daily)
```
diskmon-report.timer
```
---
## Data Flow
### Sampling Loop (every 5 minutes)
1. Read disk usage (`shutil.disk_usage`)
2. Insert sample into SQLite
3. Compare against:
   - 10-minute-old sample
   - 60-minute-old sample
4. Trigger alerts if thresholds exceeded
5. Apply cooldown logic
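
Steps 1 to 3 could look roughly like the sketch below. It assumes a `samples` table with the columns listed under Database Schema already exists; the helper names `record_sample` and `used_bytes_at_or_before` are hypothetical. A "10-minute-old sample" is interpreted here as the newest row whose timestamp is at or before `now - 600`, which is one reasonable reading rather than the confirmed behavior of the script.

```python
import shutil
import sqlite3
import time


def record_sample(conn: sqlite3.Connection, mount: str, now=None) -> int:
    """Read current usage for `mount`, store it, and return the used bytes."""
    now = int(time.time()) if now is None else now
    usage = shutil.disk_usage(mount)  # named tuple with total, used, free
    conn.execute(
        "INSERT INTO samples (ts, mount, used_bytes, avail_bytes, total_bytes) "
        "VALUES (?, ?, ?, ?, ?)",
        (now, mount, usage.used, usage.free, usage.total),
    )
    conn.commit()
    return usage.used


def used_bytes_at_or_before(conn: sqlite3.Connection, mount: str, cutoff_ts: int):
    """Return used_bytes of the newest sample taken at or before cutoff_ts, or None."""
    row = conn.execute(
        "SELECT used_bytes FROM samples WHERE mount = ? AND ts <= ? "
        "ORDER BY ts DESC LIMIT 1",
        (mount, cutoff_ts),
    ).fetchone()
    return None if row is None else row[0]
```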
---
### Daily Report
1. Read current disk usage
2. Format summary
3. Send to Matrix
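
For the "Send to Matrix" step, one dependency-free option is the Matrix client-server `PUT /_matrix/client/v3/rooms/{roomId}/send/m.room.message/{txnId}` endpoint called through `urllib`. The sketch below is an assumption about how this could be done with the standard library only; the function names are hypothetical and the real script may use a different API version or message type.

```python
import json
import urllib.parse
import urllib.request
import uuid


def build_message_request(homeserver, room_id, access_token, body, txn_id=None):
    """Build (but do not send) a PUT request carrying a plain-text room message."""
    txn_id = txn_id or uuid.uuid4().hex  # must be unique per message for this token
    url = "{}/_matrix/client/v3/rooms/{}/send/m.room.message/{}".format(
        homeserver.rstrip("/"),
        urllib.parse.quote(room_id, safe=""),  # "!" and ":" must be percent-encoded
        urllib.parse.quote(txn_id, safe=""),
    )
    payload = json.dumps({"msgtype": "m.text", "body": body}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        method="PUT",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
    )


def send_message(homeserver, room_id, access_token, body, timeout=10):
    """Send the message and return the event ID reported by the homeserver."""
    request = build_message_request(homeserver, room_id, access_token, body)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["event_id"]
```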
---
## Database Schema
### samples
|column|type|description|
|---|---|---|
|id|int|primary key|
|ts|int|unix timestamp|
|mount|text|mount path|
|used_bytes|int|used disk space|
|avail_bytes|int|free space|
|total_bytes|int|total capacity|
---
### alerts
|column|type|description|
|---|---|---|
|key|text|alert identifier|
|last_sent_ts|int|last time alert was triggered|
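
The two tables above could be created with DDL along these lines. This is a sketch: the `NOT NULL` constraints, the primary key on `alerts.key`, the `(mount, ts)` index, and the `init_db` helper are assumptions added for illustration and are not confirmed by the script.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS samples (
    id          INTEGER PRIMARY KEY,
    ts          INTEGER NOT NULL,   -- unix timestamp
    mount       TEXT    NOT NULL,   -- mount path
    used_bytes  INTEGER NOT NULL,
    avail_bytes INTEGER NOT NULL,
    total_bytes INTEGER NOT NULL
);
-- Assumed index: speeds up "newest sample at or before time T" lookups.
CREATE INDEX IF NOT EXISTS idx_samples_mount_ts ON samples (mount, ts);
CREATE TABLE IF NOT EXISTS alerts (
    key          TEXT PRIMARY KEY,  -- alert identifier
    last_sent_ts INTEGER NOT NULL   -- last time the alert was sent
);
"""


def init_db(path):
    """Open the database at `path` and create any missing tables."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```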
---
## Alert Logic
### Thresholds
|Condition|Trigger|
|---|---|
|Warning|≥ 1 GiB increase in 10 minutes|
|Critical|≥ 10 GiB increase in 60 minutes|
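
Expressed as code, the threshold table might look like the sketch below. The `RULES` layout and the `evaluate` function are hypothetical; only the window lengths and GiB limits come from the table above. A window with no sufficiently old sample (for example, right after first install) is skipped rather than treated as a spike.

```python
GIB = 1024 ** 3

# (severity, window in seconds, minimum increase in bytes)
RULES = (
    ("warning", 10 * 60, 1 * GIB),
    ("critical", 60 * 60, 10 * GIB),
)


def evaluate(current_used, baselines):
    """Return the rules that fire.

    `baselines` maps a window length in seconds to the used bytes recorded
    that long ago, or None if no such sample exists yet.
    """
    triggered = []
    for severity, window, min_increase in RULES:
        previous = baselines.get(window)
        if previous is None:
            continue  # not enough history for this window
        delta = current_used - previous
        if delta >= min_increase:
            triggered.append((severity, window, delta))
    return triggered
```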
---
### Cooldowns
|Alert Type|Cooldown|
|---|---|
|Warning|30 minutes|
|Critical|60 minutes|
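
A cooldown check against the `alerts` table could be sketched as follows. It assumes the alert `key` is simply the severity name (`warning` or `critical`), which the source does not confirm; `should_send` and `mark_sent` are hypothetical helper names. The update-then-insert pattern is used so the sketch works whether or not `key` carries a unique constraint.

```python
import sqlite3

COOLDOWN_SECONDS = {"warning": 30 * 60, "critical": 60 * 60}


def should_send(conn: sqlite3.Connection, key: str, now: int) -> bool:
    """True if the alert was never sent or its cooldown has elapsed."""
    row = conn.execute(
        "SELECT last_sent_ts FROM alerts WHERE key = ?", (key,)
    ).fetchone()
    return row is None or now - row[0] >= COOLDOWN_SECONDS[key]


def mark_sent(conn: sqlite3.Connection, key: str, now: int) -> None:
    """Record that the alert identified by `key` was sent at `now`."""
    cur = conn.execute(
        "UPDATE alerts SET last_sent_ts = ? WHERE key = ?", (now, key)
    )
    if cur.rowcount == 0:  # first time this alert is sent
        conn.execute(
            "INSERT INTO alerts (key, last_sent_ts) VALUES (?, ?)", (key, now)
        )
    conn.commit()
```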
---
### Why cooldowns exist
Cooldowns prevent:
- Alert spam
- Repeated messages for same event
- Noise during sustained writes
---
## Message Formats
### Daily Report
```
[VPS Storage Report]
Mount: /
Used: 48.2 GiB
Available: 131.7 GiB
Total: 180.0 GiB
Usage: 26.8%
Timestamp: 2026-04-01 09:00:00 EDT
```
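
A formatter producing this layout might be sketched as below. The name `format_report` and its parameters are hypothetical; the timestamp is passed in as a preformatted string so that time zone handling stays outside the example.

```python
GIB = 1024 ** 3


def format_report(mount, used_bytes, avail_bytes, total_bytes, timestamp):
    """Render the daily report text from raw byte counts."""
    # Guard against a zero-sized filesystem to avoid dividing by zero.
    percent = 100.0 * used_bytes / total_bytes if total_bytes else 0.0
    return "\n".join([
        "[VPS Storage Report]",
        f"Mount: {mount}",
        f"Used: {used_bytes / GIB:.1f} GiB",
        f"Available: {avail_bytes / GIB:.1f} GiB",
        f"Total: {total_bytes / GIB:.1f} GiB",
        f"Usage: {percent:.1f}%",
        f"Timestamp: {timestamp}",
    ])
```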
---
### Alert
```
[Storage Alert]
Mount: /
Used space increased by 1.4 GiB in 10 minutes
Previous used: 48.2 GiB
Current used: 49.6 GiB
Timestamp: 2026-04-01 09:40:00 EDT
```
---
## Monitoring the System
### Check timers
```
systemctl list-timers | grep diskmon
```
---
### Check logs
#### Sample job
```
journalctl -u diskmon-sample.service -f
```
#### Report job
```
journalctl -u diskmon-report.service -f
```
---
### Run manually
```
systemctl start diskmon-sample.service
systemctl start diskmon-report.service
```
---
### Check service status
```
systemctl status diskmon-sample.service
systemctl status diskmon-report.service
```
---
## Debugging
### 1. Environment variables not found
**Symptom:**
```
KeyError: 'MATRIX_HOMESERVER'
```
**Fix** (for manual runs, export the variables from the env file into the current shell first):
```
set -a
source /etc/diskmon.env
set +a
```
---
### 2. SQLite errors
**Symptom:**
```
sqlite3.OperationalError
```
**Fix:**
- Check SQL syntax
- Delete the DB and recreate it if needed (this discards the sample history and alert cooldown state):
```
rm /var/lib/diskmon/diskmon.sqlite3
```
---
### 3. No Matrix messages
Check:
- correct homeserver URL
- valid access token
- correct room ID
- HTTPS used
---
### 4. Script not running
Check that the timer is loaded and active:
```
systemctl status diskmon-sample.timer
```
---
## Testing Alerts
### Trigger disk usage spike
```
fallocate -l 2G /tmp/testfile
```
Wait ~5–10 minutes.
Cleanup:
```
rm /tmp/testfile
```
---
## Maintenance
### View database
```
sqlite3 /var/lib/diskmon/diskmon.sqlite3
```
---
### Clean old data
Handled automatically by the script:
- keeps ~2 days of samples; older rows are pruned
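
The pruning step could be as small as the sketch below. The exact retention constant and the `prune_samples` name are assumptions; the source only says that roughly two days of samples are kept.

```python
import sqlite3
import time

RETENTION_SECONDS = 2 * 24 * 60 * 60  # assumed: exactly 2 days


def prune_samples(conn: sqlite3.Connection, now=None) -> int:
    """Delete samples older than the retention window; return the rows removed."""
    now = int(time.time()) if now is None else now
    cur = conn.execute(
        "DELETE FROM samples WHERE ts < ?", (now - RETENTION_SECONDS,)
    )
    conn.commit()
    return cur.rowcount
```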
---
## Extending the System
### Possible improvements
- Monitor multiple mounts
- Add low disk space alerts (e.g. < 20 GiB available)
- Send HTML-formatted Matrix messages
- Integrate with Uptime Kuma push monitor
- Add inode monitoring
- Add disk I/O rate tracking
---
## Design Philosophy
This system intentionally avoids:
- Prometheus
- external monitoring stacks
- heavy dependencies
Instead it focuses on:
- clarity
- reliability
- minimalism
- full control over alert logic
---
## Summary
This setup provides:
- Continuous disk monitoring
- Time-window-based change detection
- Daily reporting
- Matrix integration
- Minimal operational overhead
All in ~1 script + systemd.
---