dispatcher: rm worker_src after extract + fstrim to avoid a full thin pool
The src_*.MP4 cache on the workers piles up: 12 files for 82 GB in the worst case. The LVM thin pool on the Proxmox host is too small (810 GB for 1144 GB thick-provisioned) and fills to 100% within a few hours of pipeline -> I/O errors -> VMs auto-paused -> everything breaks. Fix: delete src_*.MP4 immediately after count_frames (the frames are already extracted), then fstrim at the end of the job so the thin pool reclaims the blocks right away via DISCARD/UNMAP.
@@ -225,7 +225,12 @@ def do_extract(job: sqlite3.Row, worker: dict) -> str:
 _, err, _ = ssh(worker["ssh_alias"], f"cat /tmp/cosma-ffmpeg-{job['id']}.log 2>/dev/null | tail -5 || echo ''")
 raise RuntimeError(f"ffmpeg failed on {v}: {err[:200]}")
 idx = count_frames(worker, frames_dir)
+# Free MP4 cache immediately: thin pool on Proxmox host is tight and src_*.MP4
+# are 1-11 GB each. Frames are already extracted so worker_src is no longer needed.
+ssh(worker["ssh_alias"], f"rm -f {shlex.quote(worker_src)}")
 set_status(job["id"], frame_count=idx, progress=min(99, idx * 100 // total_frames_est))
+# Trim once per job so LVM thin pool on the host actually reclaims the freed blocks.
+ssh(worker["ssh_alias"], "sudo fstrim / 2>/dev/null || fstrim / 2>/dev/null", timeout=60)
 return frames_dir
 
 
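For review purposes, the cleanup order in this change can be sketched in isolation. This is a minimal sketch, not the dispatcher's actual code: `free_worker_space`, the `run` callback, and the example paths are hypothetical names introduced here, and the sketch assumes one `rm -f` per cached source followed by a single `fstrim` per job, as the diff comments describe.

```python
import shlex
from typing import Callable, Iterable

# Same fallback as in the diff: try sudo first, then a plain fstrim.
FSTRIM_CMD = "sudo fstrim / 2>/dev/null || fstrim / 2>/dev/null"


def free_worker_space(run: Callable[[str], None], worker_srcs: Iterable[str]) -> None:
    """Delete each cached source file, then trim the filesystem once.

    `run` is a hypothetical stand-in for whatever executes a shell command
    on the worker (for example a wrapper around the dispatcher's ssh helper).
    """
    for src in worker_srcs:
        # Quote the path so spaces or shell metacharacters cannot split the command.
        run(f"rm -f {shlex.quote(src)}")
    # One trim at the end, so the freed blocks are discarded in a single pass.
    run(FSTRIM_CMD)
```

Passing `list.append` as `run` records the commands instead of executing them, which makes the ordering (all deletions first, one trim last) easy to check without touching a worker.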