## Overview

This project provides lightweight disk monitoring for a VPS with:

- Daily storage reports
- Rapid disk usage change detection
- Alerts sent to a Matrix room
- Minimal dependencies (Python + SQLite only)

It is designed to be:

- Simple
- Transparent
- Easy to debug
- Low overhead

---

## Architecture

```
systemd timers
      ↓
Python script (disk_monitor.py)
      ↓
SQLite (local state/history)
      ↓
Matrix API (alerts + reports)
```

---

## Components

### 1. Python Script

**Location:**

```
/opt/diskmon/disk_monitor.py
```

**Responsibilities:**

- Collect disk usage stats
- Store historical samples
- Detect rapid changes
- Format messages
- Send messages to Matrix

---

### 2. SQLite Database

**Location:**

```
/var/lib/diskmon/diskmon.sqlite3
```

**Purpose:**

- Store disk usage history
- Track alert cooldowns

---

### 3. Environment Config

**Location:**

```
/etc/diskmon.env
```

**Contents:**

```
MATRIX_HOMESERVER=https://matrix.yourdomain.com
MATRIX_ROOM_ID=!roomid:yourdomain.com
MATRIX_ACCESS_TOKEN=your_token
DISKMON_DB=/var/lib/diskmon/diskmon.sqlite3
DISKMON_MOUNT=/
```

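
A minimal sketch of how the script might read these variables. The `Config` class and `load_config` function are hypothetical names, and treating `DISKMON_DB` and `DISKMON_MOUNT` as optional with the defaults shown above is an assumption, not something the project states. Collecting every missing name into one error tends to be easier to debug than a bare `KeyError`.

```python
import os
from dataclasses import dataclass

# Variables that have no sensible default (assumption: the Matrix settings).
REQUIRED_VARS = ("MATRIX_HOMESERVER", "MATRIX_ROOM_ID", "MATRIX_ACCESS_TOKEN")


@dataclass(frozen=True)
class Config:
    homeserver: str
    room_id: str
    access_token: str
    db_path: str
    mount: str


def load_config(env=None):
    """Build a Config from environment variables, failing with one clear error."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return Config(
        homeserver=env["MATRIX_HOMESERVER"].rstrip("/"),
        room_id=env["MATRIX_ROOM_ID"],
        access_token=env["MATRIX_ACCESS_TOKEN"],
        # Assumed defaults, mirroring the example file contents.
        db_path=env.get("DISKMON_DB", "/var/lib/diskmon/diskmon.sqlite3"),
        mount=env.get("DISKMON_MOUNT", "/"),
    )
```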
---

### 4. systemd Timers

#### Sample Timer (every 5 min)

```
diskmon-sample.timer
```

#### Report Timer (daily)

```
diskmon-report.timer
```

---

## Data Flow

### Sampling Loop (every 5 minutes)

1. Read disk usage (`shutil.disk_usage`)
2. Insert sample into SQLite
3. Compare against:
   - 10-minute-old sample
   - 60-minute-old sample
4. Trigger alerts if thresholds exceeded
5. Apply cooldown logic

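
Steps 1 to 3 can be sketched roughly as follows. This assumes the `samples` table described under Database Schema already exists; the function names are hypothetical. "N-minute-old sample" is interpreted here as the newest sample taken at or before `now - window`, which is one reasonable reading rather than the script's confirmed behaviour. A stricter version might also ignore samples far older than the window, for example after a long outage.

```python
import shutil
import sqlite3
import time


def record_sample(conn, mount, now=None):
    """Read current usage for `mount` and append it to the samples table."""
    now = int(time.time()) if now is None else now
    usage = shutil.disk_usage(mount)  # named tuple: total, used, free
    conn.execute(
        "INSERT INTO samples (ts, mount, used_bytes, avail_bytes, total_bytes) "
        "VALUES (?, ?, ?, ?, ?)",
        (now, mount, usage.used, usage.free, usage.total),
    )
    conn.commit()
    return usage.used


def used_bytes_at_or_before(conn, mount, cutoff_ts):
    """Return used_bytes from the newest sample at or before cutoff_ts, or None."""
    row = conn.execute(
        "SELECT used_bytes FROM samples WHERE mount = ? AND ts <= ? "
        "ORDER BY ts DESC LIMIT 1",
        (mount, cutoff_ts),
    ).fetchone()
    return row[0] if row else None


def growth_over(conn, mount, current_used, window_seconds, now):
    """Bytes gained over the window, or None if no sample that old exists."""
    previous = used_bytes_at_or_before(conn, mount, now - window_seconds)
    return None if previous is None else current_used - previous
```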
---

### Daily Report

1. Read current disk usage
2. Format summary
3. Send to Matrix

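
For step 3, a sketch using only the standard library is shown here. It assumes the homeserver exposes the v3 client-server endpoint for sending room events; `build_matrix_request` and `send_matrix_message` are hypothetical helper names, and the project's script may well do this differently. Building the request separately from sending it keeps the URL and payload easy to inspect when debugging.

```python
import json
import urllib.parse
import urllib.request
import uuid


def build_matrix_request(homeserver, room_id, access_token, text, txn_id=None):
    """Build (but do not send) a PUT request that posts a plain-text message."""
    txn_id = txn_id or uuid.uuid4().hex  # unique per message, so retries are safe
    url = "{}/_matrix/client/v3/rooms/{}/send/m.room.message/{}".format(
        homeserver.rstrip("/"),
        urllib.parse.quote(room_id, safe=""),  # room IDs contain '!' and ':'
        urllib.parse.quote(txn_id, safe=""),
    )
    body = json.dumps({"msgtype": "m.text", "body": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={
            "Authorization": "Bearer " + access_token,
            "Content-Type": "application/json",
        },
    )


def send_matrix_message(homeserver, room_id, access_token, text, timeout=15):
    """Send the message and return the event ID reported by the homeserver."""
    request = build_matrix_request(homeserver, room_id, access_token, text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["event_id"]
```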
---

## Database Schema

### samples

|column|type|description|
|---|---|---|
|id|int|primary key|
|ts|int|unix timestamp|
|mount|text|mount path|
|used_bytes|int|used disk space|
|avail_bytes|int|free space|
|total_bytes|int|total capacity|

---

### alerts

|column|type|description|
|---|---|---|
|key|text|alert identifier|
|last_sent_ts|int|last time alert was triggered|

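
The two tables could be created with something like the following. The column names and types come from the tables above; the `NOT NULL` constraints, the primary key on `alerts.key`, and the `(mount, ts)` index are assumptions added for illustration, and `open_db` is a hypothetical helper name.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS samples (
    id          INTEGER PRIMARY KEY,
    ts          INTEGER NOT NULL,
    mount       TEXT    NOT NULL,
    used_bytes  INTEGER NOT NULL,
    avail_bytes INTEGER NOT NULL,
    total_bytes INTEGER NOT NULL
);
-- Assumed index: speeds up "newest sample before time T" lookups per mount.
CREATE INDEX IF NOT EXISTS idx_samples_mount_ts ON samples (mount, ts);
CREATE TABLE IF NOT EXISTS alerts (
    key          TEXT PRIMARY KEY,
    last_sent_ts INTEGER NOT NULL
);
"""


def open_db(path):
    """Open (or create) the database and make sure both tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```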
---

## Alert Logic

### Thresholds

|Condition|Trigger|
|---|---|
|Warning|≥ 1 GiB increase in 10 minutes|
|Critical|≥ 10 GiB increase in 60 minutes|

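
Expressed as data, the threshold table might look like this. The `RULES` tuple and `triggered_alerts` function are hypothetical; the numbers are taken from the table, with GiB interpreted as 1024³ bytes. Whether a critical alert should suppress the warning is not stated, so this sketch simply reports every rule that matches.

```python
GIB = 1024 ** 3

# (alert key, window in seconds, minimum growth in bytes), most severe first
RULES = (
    ("critical", 60 * 60, 10 * GIB),
    ("warning", 10 * 60, 1 * GIB),
)


def triggered_alerts(growth_by_window):
    """Return the keys of all rules whose growth threshold is met.

    growth_by_window maps a window length in seconds to the bytes gained over
    that window, or None when no sample that old exists yet.
    """
    hits = []
    for key, window, threshold in RULES:
        growth = growth_by_window.get(window)
        if growth is not None and growth >= threshold:
            hits.append(key)
    return hits
```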
---

### Cooldowns

|Alert Type|Cooldown|
|---|---|
|Warning|30 minutes|
|Critical|60 minutes|

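
A sketch of the cooldown check against the `alerts` table, assuming the alert keys are literally `warning` and `critical` (the actual key format is not documented). `should_send` is a hypothetical name. Note that this version records the timestamp when the decision is made; a stricter variant would record it only after the Matrix send succeeds.

```python
import sqlite3

COOLDOWN_SECONDS = {"warning": 30 * 60, "critical": 60 * 60}


def should_send(conn, key, now):
    """Return True and record `now` if `key` is outside its cooldown window."""
    row = conn.execute(
        "SELECT last_sent_ts FROM alerts WHERE key = ?", (key,)
    ).fetchone()
    if row is not None and now - row[0] < COOLDOWN_SECONDS[key]:
        return False  # still cooling down, suppress the duplicate
    if row is None:
        conn.execute(
            "INSERT INTO alerts (key, last_sent_ts) VALUES (?, ?)", (key, now)
        )
    else:
        conn.execute(
            "UPDATE alerts SET last_sent_ts = ? WHERE key = ?", (now, key)
        )
    conn.commit()
    return True
```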
---

### Why cooldowns exist

Cooldowns prevent:

- Alert spam
- Repeated messages for the same event
- Noise during sustained writes

---

## Message Formats

### Daily Report

```
[VPS Storage Report]
Mount: /
Used: 48.2 GiB
Available: 131.7 GiB
Total: 180.0 GiB
Usage: 26.8%
Timestamp: 2026-04-01 09:00:00 EDT
```

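
A report in this shape could be produced by a small formatter like the one below. `format_report` is a hypothetical name; it takes raw byte counts plus a timezone-aware `datetime`, and rounding to one decimal place is inferred from the example rather than documented.

```python
GIB = 1024 ** 3


def format_report(mount, used, avail, total, when):
    """Render the daily report from byte counts and a timezone-aware datetime."""
    percent = 100.0 * used / total if total else 0.0
    stamp = when.strftime("%Y-%m-%d %H:%M:%S %Z")
    return "\n".join([
        "[VPS Storage Report]",
        f"Mount: {mount}",
        f"Used: {used / GIB:.1f} GiB",
        f"Available: {avail / GIB:.1f} GiB",
        f"Total: {total / GIB:.1f} GiB",
        f"Usage: {percent:.1f}%",
        f"Timestamp: {stamp}",
    ])
```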
---

### Alert

```
[Storage Alert]
Mount: /
Used space increased by 1.4 GiB in 10 minutes
Previous used: 48.2 GiB
Current used: 49.6 GiB
Timestamp: 2026-04-01 09:40:00 EDT
```

---

## Monitoring the System

### Check timers

```
systemctl list-timers | grep diskmon
```

---

### Check logs

#### Sample job

```
journalctl -u diskmon-sample.service -f
```

#### Report job

```
journalctl -u diskmon-report.service -f
```

---

### Run manually

```
systemctl start diskmon-sample.service
systemctl start diskmon-report.service
```

---

### Check service status

```
systemctl status diskmon-sample.service
systemctl status diskmon-report.service
```

---

## Debugging

### 1. Environment variables not found

**Symptom:**

```
KeyError: 'MATRIX_HOMESERVER'
```

**Fix:**

When running the script by hand, export the variables from the config file into the shell first:

```
set -a
source /etc/diskmon.env
set +a
```

---

### 2. SQLite errors

**Symptom:**

```
sqlite3.OperationalError
```

**Fix:**

- Check SQL syntax
- Delete DB and recreate if needed (this discards the stored history and cooldown state):

```
rm /var/lib/diskmon/diskmon.sqlite3
```

---

### 3. No Matrix messages

Check:

- correct homeserver URL
- valid access token
- correct room ID
- HTTPS used

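
Two of these checks, the HTTPS scheme and the token, can be probed without posting to a room. The sketch below builds a request for the client-server `whoami` endpoint, assuming the homeserver supports the v3 path; `build_whoami_request` is a hypothetical helper and not part of the project. Sending it with `urllib.request.urlopen` should return a JSON body containing `user_id` when the token is valid, and raise an HTTP error otherwise.

```python
import urllib.parse
import urllib.request


def build_whoami_request(homeserver, access_token):
    """Build a GET request for the whoami endpoint, refusing non-HTTPS URLs."""
    if urllib.parse.urlsplit(homeserver).scheme != "https":
        raise ValueError("homeserver URL should start with https://")
    return urllib.request.Request(
        homeserver.rstrip("/") + "/_matrix/client/v3/account/whoami",
        headers={"Authorization": "Bearer " + access_token},
    )


# Example (performs a network call, so it is left commented out):
# request = build_whoami_request("https://matrix.yourdomain.com", "your_token")
# with urllib.request.urlopen(request, timeout=15) as response:
#     print(response.status, response.read().decode("utf-8"))
```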
---

### 4. Script not running

```
systemctl status diskmon-sample.timer
```

---
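
## Testing Alerts

### Trigger disk usage spike

```
fallocate -l 2G /tmp/testfile
```

Wait ~5–10 minutes. A 2 GiB jump exceeds the 1 GiB warning threshold, so a warning should arrive after one of the next sampling runs. This only works if `/tmp` is on the monitored mount; if it is a tmpfs, create the file elsewhere on that mount.

Cleanup:

```
rm /tmp/testfile
```
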
---

## Maintenance

### View database

```
sqlite3 /var/lib/diskmon/diskmon.sqlite3
```

---

### Clean old data

Handled automatically:

- keeps ~2 days of samples

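
The pruning step probably amounts to a single `DELETE`. A sketch, assuming a retention window of exactly 48 hours (the text only says "~2 days") and a hypothetical function name `prune_old_samples`:

```python
import sqlite3
import time

RETENTION_SECONDS = 2 * 24 * 60 * 60  # assumed: exactly two days


def prune_old_samples(conn, now=None):
    """Delete samples older than the retention window; return the number removed."""
    now = int(time.time()) if now is None else now
    cursor = conn.execute(
        "DELETE FROM samples WHERE ts < ?", (now - RETENTION_SECONDS,)
    )
    conn.commit()
    return cursor.rowcount
```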
---

## Extending the System

### Possible improvements

- Monitor multiple mounts
- Add low disk space alerts (e.g. < 20 GB)
- Send HTML-formatted Matrix messages
- Integrate with Uptime Kuma push monitor
- Add inode monitoring
- Add disk I/O rate tracking

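
Two of these ideas, low-space alerts and inode monitoring, need little more than `os.statvfs` on a Unix host. The following is only a sketch of one possible direction: both thresholds and both function names are hypothetical, and the 20 GB floor is taken from the example in the list above.

```python
import os

LOW_SPACE_BYTES = 20 * 1000 ** 3  # hypothetical floor, from the "< 20 GB" example
LOW_INODE_FRACTION = 0.10         # hypothetical: warn under 10% free inodes


def space_and_inodes(mount):
    """Return (available_bytes, free_inode_fraction or None) for a mount point."""
    st = os.statvfs(mount)
    avail_bytes = st.f_bavail * st.f_frsize  # space usable by unprivileged users
    # Some filesystems report zero total inodes; treat that as "not applicable".
    inode_fraction = st.f_favail / st.f_files if st.f_files else None
    return avail_bytes, inode_fraction


def low_resource_warnings(avail_bytes, inode_fraction):
    """Return a list of warning keys for the given measurements."""
    warnings = []
    if avail_bytes < LOW_SPACE_BYTES:
        warnings.append("low_space")
    if inode_fraction is not None and inode_fraction < LOW_INODE_FRACTION:
        warnings.append("low_inodes")
    return warnings
```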
---

## Design Philosophy

This system intentionally avoids:

- Prometheus
- external monitoring stacks
- heavy dependencies

Instead it focuses on:

- clarity
- reliability
- minimalism
- full control over alert logic

---

## Summary

This setup provides:

- Continuous disk monitoring
- Time-window-based change detection
- Daily reporting
- Matrix integration
- Minimal operational overhead

All in ~1 script + systemd.

---