dispatcher: rm worker_src after extract + fstrim to avoid filling the thin pool

The src_*.MP4 cache on the workers piles up: 12 files totalling 82 GB in the
worst case. The LVM thin pool on the Proxmox host is too small (810 GB backing
1144 GB thick-provisioned) and fills to 100% within a few hours of pipeline
runs -> I/O errors -> VMs auto-paused -> everything breaks.
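
To make the overcommit concrete, a quick back-of-the-envelope check with the numbers above (plain Python, nothing from the dispatcher itself):

```python
# Numbers from the commit message above.
pool_gb = 810          # LVM thin pool size on the Proxmox host
provisioned_gb = 1144  # total thick-provisioned virtual disk capacity
cache_worst_gb = 82    # worst-case src_*.MP4 cache on a worker (12 files)

overcommit = provisioned_gb / pool_gb
print(f"overcommit: {overcommit:.2f}x")  # -> overcommit: 1.41x

# Without DISCARD, a thin pool only ever grows: every block a guest has
# written once stays allocated, even after the guest deletes the file.
# Repeated 82 GB cache churn therefore consumes pool space permanently
# until the freed blocks are trimmed.
```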

Fix: delete src_*.MP4 immediately after count_frames (the frames are already
extracted), then run fstrim at the end of the job so the thin pool reclaims
the freed blocks right away via DISCARD/UNMAP.
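
The `sudo fstrim / 2>/dev/null || fstrim /` idiom used in the diff can be read as "try the privileged command, fall back to the unprivileged one, tolerate failure". A minimal sketch of that pattern, run locally for illustration; `run_with_fallback` is a hypothetical name, not a dispatcher function:

```python
import subprocess

def run_with_fallback(primary: list[str], fallback: list[str],
                      timeout: int = 60) -> bool:
    """Try the privileged command first, then the plain one.

    Mirrors the `sudo fstrim / || fstrim /` idiom: on workers without
    passwordless sudo the plain invocation may still succeed, and a
    failed trim is tolerated rather than failing the job.
    """
    for cmd in (primary, fallback):
        try:
            result = subprocess.run(cmd, timeout=timeout,
                                    stdout=subprocess.DEVNULL,
                                    stderr=subprocess.DEVNULL)
            if result.returncode == 0:
                return True
        except (OSError, subprocess.TimeoutExpired):
            continue  # command missing or hung; try the fallback
    return False
```

A failed trim only delays reclaim until the next successful one, which is why the dispatcher swallows the error instead of raising.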
Flag
2026-04-22 15:39:56 +00:00
parent 2599a376af
commit 3eb568f14e


@@ -225,7 +225,12 @@ def do_extract(job: sqlite3.Row, worker: dict) -> str:
         _, err, _ = ssh(worker["ssh_alias"], f"cat /tmp/cosma-ffmpeg-{job['id']}.log 2>/dev/null | tail -5 || echo ''")
         raise RuntimeError(f"ffmpeg failed on {v}: {err[:200]}")
     idx = count_frames(worker, frames_dir)
+    # Free MP4 cache immediately: thin pool on Proxmox host is tight and src_*.MP4
+    # are 1-11 GB each. Frames are already extracted so worker_src is no longer needed.
+    ssh(worker["ssh_alias"], f"rm -f {shlex.quote(worker_src)}")
     set_status(job["id"], frame_count=idx, progress=min(99, idx * 100 // total_frames_est))
+    # Trim once per job so LVM thin pool on the host actually reclaims the freed blocks.
+    ssh(worker["ssh_alias"], "sudo fstrim / 2>/dev/null || fstrim / 2>/dev/null", timeout=60)
     return frames_dir