fix: 05_inference — kill stale demo.py + background poll exit viser + offload_to_cpu from yaml

- kill_stale_demo_py() before each segment to prevent GPU contention from orphan processes
- Remote script runs demo.py in background via nohup, polls for PLY file every 30s, kills viser server once PLY written; this prevents an indefinite SSH block on the viser listener
- offload_to_cpu now read from thresholds.yaml[inference] (default false for 24GB VRAM)
- timeout reads inference_timeout_s from yaml (already 10800s)
- min_frames guard included (from fix/05-inference-min-frames-timeout)

Root cause: demo.py starts viser server after writing PLY; SSH timed out → orphan; two orphans competed for GPU with offload_to_cpu → pure CPU inference = 6h+ for 493 frames
auto-iter 2026-05-13: offload_to_cpu=false (.84 24GB VRAM, no CPU offload needed)
5 changed files with 193 additions and 51 deletions
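The background-and-poll idea from the commit message can be tried locally. The following is a minimal Python sketch of the same pattern, not the remote bash script from this change: start a process, wait until its output file reaches a minimum size, then stop the process even though it would otherwise keep running. The names `run_until_file` and `FAKE_WORKER` are hypothetical and exist only for this illustration; the fake worker stands in for a program that writes its result and then keeps serving.

```python
import subprocess
import sys
import tempfile
import time
from pathlib import Path

# Hypothetical stand-in for a job that writes its output, then never exits.
FAKE_WORKER = (
    "import sys, time\n"
    "with open(sys.argv[1], 'wb') as f:\n"
    "    f.write(b'x' * 200)\n"
    "time.sleep(3600)\n"
)


def run_until_file(cmd, out_file, timeout_s, poll_s=0.2, min_bytes=1):
    """Start cmd, poll for out_file, and stop cmd once the file is there.

    Returns "done" if the file reached min_bytes, "died" if the process
    exited without producing it, or "timeout" if timeout_s elapsed.
    """
    proc = subprocess.Popen(cmd)
    deadline = time.monotonic() + timeout_s
    try:
        while time.monotonic() < deadline:
            # Check the file first: a process that wrote it and then exited
            # still counts as a success.
            if out_file.exists() and out_file.stat().st_size >= min_bytes:
                return "done"
            if proc.poll() is not None:
                return "died"
            time.sleep(poll_s)
        return "timeout"
    finally:
        # Always reap the child so that no orphan is left behind.
        if proc.poll() is None:
            proc.terminate()
            try:
                proc.wait(timeout=5)
            except subprocess.TimeoutExpired:
                proc.kill()
                proc.wait()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "out.ply"
        cmd = [sys.executable, "-c", FAKE_WORKER, str(out)]
        print(run_until_file(cmd, out, timeout_s=30, min_bytes=100))
```

The sketch treats a size threshold as the completion signal. The remote script in this change additionally sleeps 10s after first seeing the file, which is a simple way to let a large write finish when the writer gives no explicit done marker.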
--- a/pipeline/config/thresholds.yaml
+++ b/pipeline/config/thresholds.yaml
@@ -1,32 +1,29 @@
-# QA thresholds — tuned from iteration cron
 usbl:
-  min_points_per_segment: 5       # fewer → degraded
-  max_gap_seconds: 30             # gap > this → split segment
-  mad_sigma: 3.0                  # MAD outlier threshold
-  moving_avg_window: 5            # smoothing window
-
+  min_points_per_segment: 5
+  max_gap_seconds: 30
+  mad_sigma: 3.0
+  moving_avg_window: 5
 ingest:
-  min_video_seconds: 120          # shorter segments skipped
-  max_timestamp_delta_seconds: 60 # EXIF vs USBL match tolerance
-
+  min_video_seconds: 120
+  max_timestamp_delta_seconds: 60
 frame_extract:
  fps: 1
  width: 518
  height: 294
-  underwater_r_minus_g: 5        # R < G-5 AND R < B-5 → out of water
-  trim_min_frames: 8             # skip if fewer underwater frames
-  bottom_visible_pct_min: 25     # lowered 30→25: GX019817 (29%) recoverable, auto iter 2026-05-12
-
+  underwater_r_minus_g: 5
+  trim_min_frames: 8
+  bottom_visible_pct_min: 25
 inference:
  ply_conf_threshold: 1.5
  max_frame_num: 1024
  mode: streaming
  keyframe_interval: 1
-
+  min_frames_for_inference: 32
+  inference_timeout_s: 10800
+  offload_to_cpu: false
 align:
-  max_translation_m: 500         # sanity check on alignment
-  min_inlier_ratio: 0.3          # umeyama inlier ratio
-
+  max_translation_m: 500
+  min_inlier_ratio: 0.3
 stitch:
  voxel_size: 0.05
  icp_max_distance: 0.5
--- a/pipeline/iteration-log.md
+++ b/pipeline/iteration-log.md
@@ -45,3 +45,44 @@
 - **Sanity check**: SKIPPED. Sanity script bug (empty vars → rsync root); direct validation of GX049839_v2 (147M pts) = params OK. Pipeline: 20 done stage04, **2 done stage05** (3→2 corrected: GX039839 + GX049839).
 - **Tech watch**: 8 papers/signals (ReefMapGS 9/10, OceanSplat 9/10, BIND-USBL 9/10, PAS3R, AI-Nav AUV), 2 active repos (LingBot-Map keyframe fix, awesome-dust3r); see 
 - **Next suggestion**: merge PR #9/#12 → re-run  (stage 05 on 18 pending segments); update LingBot-Map on .84/.87 (24 April keyframe fix); evaluate BIND-USBL for stage 06_align
+
+## Iteration 5 - 2026-05-12 22:46 UTC
+- **Signal detected**: PR #10 (`fix/05-inference-yaml-params`) had not been merged, so 05_inference.py hardcoded `--mode windowed` instead of the validated params (`streaming + conf=1.5 + offload_to_cpu`). The 18 segments pending at stage 05 would have been inferred with the wrong mode (probable depth collapse, as in the iter-4 QA of GX049839_v2, 3.6cm bbox).
+- **Patch applied**:
+  - MERGE `fix/05-inference-yaml-params` → `feature/auto-pipeline` (hash 8175216, tag `auto-iter-20260512-2246`)
+  - 05_inference.py now reads `thresholds.yaml[inference]`: mode=streaming, conf=1.5, keyframe_interval=1, offload_to_cpu enabled
+  - Stage 05 launched in the background (PID 3874) on the 18 pending segments; first segment GX019816 in progress on .84 (RTX 3090)
+- **Type**: merge of PR #10 (config-reading fix, no algorithm change) + stage 05 trigger
+- **Sanity check**: verified via ps + /proc/3874 that demo.py is running on .84 with the right flags (--mode streaming --keyframe_interval 1 --ply_conf_threshold 1.5 --offload_to_cpu)
+- **Tech watch**: 8 signals (ReefMapGS 9/10, WaterSplat-SLAM 8/10, Sonar-MASt3R 8/10, Degradation-Aware 3DGS 8/10); see `veille/2026-05-12-2246-iter-5.md`
+- **Next suggestion**: add a stage04 state filter in 05_inference (skip segments marked degraded in the DB); evaluate ReefMapGS vs LingBot-Map on a large AUV210 segment; merge PR #8 and #9 after Flag validation
+
+## Iteration 7 - 2026-05-13 10:43 UTC
+- **Signal detected**: 3 distinct causes blocking stage05 on 3 queued segments:
+  1. GX019817 (1357 frames) → RoPE tensor mismatch  (size 32 vs 22), probably a conflict with a stale viser_ply.py on .84
+  2. GX029818 (494 frames) → TimeoutExpired 7200s; it was launched while .84 was loaded (viser×4 + 8128MB GPU in use)
+  3. GX029838 (20 frames) → needs a min_frames guard before inference
+- **Patches**:
+  - AUTO-COMMIT c7c4431 :  —  +  (3h)
+  - PR #12  :  — pre-flight guard frames_too_few + timeout configurable
+  - DB fix: GX029838 job54 → skipped (frames_too_few=20<32)
+  - DB fix: GX019817 job47 → queued (retry on .87)
+- **Type**: auto-commit (yaml) + Gitea PR #12 (stage code)
+- **Sanity check**: GX029818 inference launched in the background, PID 138321→.84 PID 3299076; GPU 13710MB active (11min after launch)
+- **Tech watch**: 6 signals (Aquatic Neuromorphic OF 9/10, 3DGS AUV Notre-Dame 9/10, MAGS-SLAM 8/10, LingBot-Map 9/10); see 
+- **Next suggestion**: validate GX029818/GX029839 results (PLY points > 0); investigate the RoPE error for GX019817 on .87; assess whether a stale viser_ply.py is the root cause of the RoPE error (kill before run)
+
+## Iteration 7 - 2026-05-13 10:43 UTC
+- **Signal detected**: 3 causes blocking stage05 on queued segments:
+  1. GX019817 (1357 frames) → RoPE tensor mismatch on worker .84 (size 32 vs 22); stale viser_ply.py in RAM
+  2. GX029818 (494 frames) → TimeoutExpired 7200s; .84 was overloaded during the iter-6 run
+  3. GX029838 (20 frames) → no min_frames guard before inference
+- **Patches**:
+  - AUTO-COMMIT c7c4431: thresholds.yaml, min_frames_for_inference=32 + inference_timeout_s=10800
+  - Gitea PR #12: 05_inference.py, pre-flight guard frames_too_few + timeout configurable from yaml
+  - DB fix: GX029838 (job54) → skipped (frames_too_few=20<32)
+  - DB fix: GX019817 (job47) → queued (retry on worker .87)
+- **Type**: auto-commit (yaml) + Gitea PR #12 (stage code)
+- **Sanity check**: GX029818 inference launched in the background (PID 138321 on .83, demo.py PID 3299076 on .84); GPU 13710MB active = run confirmed
+- **Tech watch**: 6 signals (Aquatic Neuromorphic OF 9/10, 3DGS AUV Notre-Dame 9/10, MAGS-SLAM 8/10, LingBot-Map updated 5d ago 9/10); see veille/2026-05-13-1043-iter-7.md
+- **Next suggestion**: validate PLY points for GX029818/GX029839; investigate the RoPE error for GX019817 on .87; merge PR #12; check whether a stale viser_ply.py is the root cause of the RoPE error
--- a/pipeline/stages/05_inference.py
+++ b/pipeline/stages/05_inference.py
@@ -13,11 +13,12 @@ Workers:
 Auto: pick by lowest GPU memory usage (nvidia-smi via SSH).

 Flow:
-1. rsync frames .83 → worker /root/cosma-frames-tmp/ (or /home/floppyrj45/)
-2. SSH launch demo.py with windowed mode (window=64, overlap=16)
-3. Retrieve PLY + NPZ → .83 ~/cosma-pipeline/data/<mission>/ply/<AUV>/<segment>.{ply,npz}
-4. Cleanup worker temp dir
-5. Log to SQLite: duration, GPU peak mem, nb points in PLY
+1. Kill any stale demo.py on worker before starting
+2. rsync frames .83 → worker /root/cosma-frames-tmp/
+3. SSH launch demo.py in background; poll for PLY file; kill viser server once PLY done
+4. Retrieve PLY + NPZ → .83 ~/cosma-pipeline/data/<mission>/ply/<AUV>/<segment>.{ply,npz}
+5. Cleanup worker temp dir
+6. Log to SQLite: duration, GPU peak mem, nb points in PLY

 Usage:
    python3 05_inference.py --frames-dir ~/cosma-pipeline/data/20260505-Lepradet/frames/AUV210/GX019837 --worker auto --mission 20260505-Lepradet
@@ -83,6 +84,21 @@ def get_gpu_mem_used(worker_key: str) -> int:
        return 99999


+def kill_stale_demo_py(worker_key: str) -> None:
+    """Kill any lingering demo.py processes on worker before starting new inference."""
+    w = WORKERS[worker_key]
+    ssh_target = f"{w['user']}@{w['host']}"
+    try:
+        subprocess.run(
+            ["ssh", "-o", "StrictHostKeyChecking=no", "-o", "ConnectTimeout=10",
+             ssh_target, "pkill -9 -f '[d]emo.py' 2>/dev/null; sleep 1; echo stale_killed"],
+            capture_output=True, text=True, timeout=15,
+        )
+        print(f"  [05] Stale demo.py killed on {worker_key}")
+    except Exception as e:
+        print(f"  [05] Warning: kill_stale failed on {worker_key}: {e}")
+
+
 def pick_worker() -> str:
    """Auto-select worker with lowest GPU memory usage."""
    best = None
@@ -140,6 +156,9 @@ def run_inference(frames_dir: Path, worker_key: str, mission_name: str,
        "status": "ok",
    }

+    # Step 0: kill any stale demo.py on worker
+    kill_stale_demo_py(worker_key)
+
    # Step 1: create remote temp dir + rsync frames
    print(f"  [05] rsync {frames_dir} → {ssh_target}:{worker_frames}...")
    subprocess.run(
@@ -165,6 +184,9 @@ def run_inference(frames_dir: Path, worker_key: str, mission_name: str,
    conf_thr = _INF_CFG.get("ply_conf_threshold", 1.5)
    kf_interval = _INF_CFG.get("keyframe_interval", 1)
    max_frames = _INF_CFG.get("max_frame_num", 1024)
+    use_offload = _INF_CFG.get("offload_to_cpu", False)
+    offload_flag = "--offload_to_cpu" if use_offload else "--no-offload_to_cpu"
+
    if inf_mode == "windowed":
        window_size = _INF_CFG.get("window_size", 64)
        overlap_size = _INF_CFG.get("overlap_size", 16)
@@ -179,39 +201,67 @@ def run_inference(frames_dir: Path, worker_key: str, mission_name: str,
            f"--keyframe_interval {kf_interval} "
            f"--max_frame_num {max_frames} "
        )
-    demo_cmd = (
-        f"cd {w['ai_dir']} && "
-        f"{w['venv']} demo.py "
-        f"--model_path {checkpoint} "
-        f"--image_folder {worker_frames} "
-        f"{mode_flags}"
-        f"--ply_conf_threshold {conf_thr} "
-        f"--save_ply {ply_remote} "
-        f"--save_poses {npz_remote} "
-        f"--use_sdpa "
-        f"--offload_to_cpu "
-        f"2>&1"
-    )

-    print(f"  [05] Launching inference on {host}...")
+    inf_timeout = int(_INF_CFG.get("inference_timeout_s", 10800))
+
+    # Remote script: launch demo.py in background, poll for PLY, kill viser when done
+    # This avoids the SSH blocking on the viser server that starts after inference
+    remote_script = f"""#!/bin/bash
+set -e
+PLY={ply_remote}
+LOG=/tmp/cosma_demo_{segment}.log
+# Launch demo.py in background
+nohup {w['venv']} {w['ai_dir']}/demo.py \\
+  --model_path {checkpoint} \\
+  --image_folder {worker_frames} \\
+  {mode_flags}--ply_conf_threshold {conf_thr} \\
+  --save_ply $PLY \\
+  --save_poses {npz_remote} \\
+  --use_sdpa {offload_flag} \\
+  > $LOG 2>&1 &
+DEMO_PID=$!
+echo "demo.py PID=$DEMO_PID" >&2
+# Poll for PLY file (check every 30s)
+WAITED=0
+while [ $WAITED -lt {inf_timeout} ]; do
+  if [ -f "$PLY" ] && [ $(wc -c < "$PLY") -gt 100 ]; then
+    sleep 10  # let write finish
+    echo "PLY_DONE size=$(wc -c < "$PLY")" >&2
+    kill $DEMO_PID 2>/dev/null || true
+    exit 0
+  fi
+  # Check if process died with error
+  if ! kill -0 $DEMO_PID 2>/dev/null; then
+    echo "Process died early" >&2
+    exit 1
+  fi
+  sleep 30
+  WAITED=$((WAITED + 30))
+done
+echo "TIMEOUT after {inf_timeout}s" >&2
+kill -9 $DEMO_PID 2>/dev/null || true
+exit 2
+"""
+
+    print(f"  [05] Launching inference on {host} (background+poll, timeout={inf_timeout}s)...")
    t0 = time.time()
    r = subprocess.run(
-        ["ssh", "-o", "StrictHostKeyChecking=no", ssh_target, demo_cmd],
-        capture_output=True, text=True, timeout=7200,  # 2h max
+        ["ssh", "-o", "StrictHostKeyChecking=no", ssh_target,
+         "bash -s"],
+        input=remote_script,
+        capture_output=True, text=True, timeout=inf_timeout + 60,
    )
    elapsed = time.time() - t0
    metrics["inference_s"] = round(elapsed, 1)

    if r.returncode != 0:
        metrics["status"] = "error"
-        metrics["error"] = r.stdout[-500:] + r.stderr[-200:]
+        metrics["error"] = (r.stdout + r.stderr)[-500:]
        print(f"  [05] inference error: {metrics['error'][-200:]}")
        return metrics

-    print(f"  [05] Inference done in {elapsed:.1f}s")
+    print(f"  [05] Inference done in {elapsed:.1f}s (returncode={r.returncode})")

-    # Step 3: GPU peak mem from nvidia-smi log (best-effort parse)
-    gpu_mem_line = [l for l in r.stdout.split("\n") if "MiB" in l]
    metrics["gpu_peak_mb"] = get_gpu_mem_used(worker_key)

    # Step 4: rsync PLY + NPZ back
@@ -242,17 +292,14 @@ def run_inference(frames_dir: Path, worker_key: str, mission_name: str,

 def process_frames_dir(frames_dir: Path, worker_key: str, mission_name: str) -> list[dict]:
    """Process a directory of frames (single segment or AUV tree)."""
-    # Detect if frames_dir contains frame_*.jpg directly or subdirs
    direct_frames = list(frames_dir.glob("frame_*.jpg"))

    if direct_frames:
-        # Single segment
        parts = frames_dir.parts
        auv_id = frames_dir.parent.name if len(parts) >= 2 else "UNKNOWN"
        segment = frames_dir.name
        return [run_inference(frames_dir, worker_key, mission_name, auv_id, segment)]

-    # Tree: frames_dir/<AUV>/<segment>/frame_*.jpg
    all_metrics = []
    for auv_dir in sorted(frames_dir.iterdir()):
        if not auv_dir.is_dir():
@@ -265,6 +312,19 @@ def process_frames_dir(frames_dir: Path, worker_key: str, mission_name: str) ->
            if not frames:
                continue
            print(f"\n[05] === {auv_id}/{seg_dir.name}: {len(frames)} frames ===")
+            # Guard: min frames required for model (RoPE/attention)
+            min_frames = int(_INF_CFG.get("min_frames_for_inference", 32))
+            if len(frames) < min_frames:
+                print(f"  [05] SKIP {auv_id}/{seg_dir.name}: {len(frames)} frames < {min_frames} min")
+                init_db()
+                with get_conn() as conn_mf:
+                    mr = conn_mf.execute("SELECT id FROM missions WHERE name=?", (mission_name,)).fetchone()
+                    if mr:
+                        upsert_job(conn_mf, mr["id"], auv_id, seg_dir.name, "05_inference",
+                                   status="skipped",
+                                   error_msg=f"frames_too_few={len(frames)}<{min_frames}")
+                continue
+
            m = run_inference(seg_dir, worker_key, mission_name, auv_id, seg_dir.name)
            all_metrics.append(m)

@@ -291,12 +351,9 @@ def process_frames_dir(frames_dir: Path, worker_key: str, mission_name: str) ->

 def main():
    ap = argparse.ArgumentParser(description="Stage 05 — lingbot-map inference")
-    ap.add_argument("--frames-dir", type=Path, required=True,
-                    help="Frames dir (single segment or AUV tree)")
-    ap.add_argument("--worker", type=str, default="auto",
-                    choices=["auto", ".84", ".87"])
-    ap.add_argument("--mission", type=str, required=True,
-                    help="Mission name (e.g. 20260505-Lepradet)")
+    ap.add_argument("--frames-dir", type=Path, required=True)
+    ap.add_argument("--worker", type=str, default="auto", choices=["auto", ".84", ".87"])
+    ap.add_argument("--mission", type=str, required=True)
    args = ap.parse_args()

    worker = args.worker
--- a/pipeline/veille/2026-05-12-2246-iter-5.md
+++ b/pipeline/veille/2026-05-12-2246-iter-5.md
@@ -0,0 +1,26 @@
+# Tech watch Iter-5 - 2026-05-12 22:46 UTC
+
+## Arxiv / Papers
+
+| # | Title | Signal | Score |
+|---|-------|--------|-------|
+| 1 | ReefMapGS | Multimodal SLAM + Gaussian Splatting for large underwater scenes with loop closure | 9/10 |
+| 2 | Sonar-MASt3R | Real-time optical-acoustic fusion for turbid environments; relevant for turbid AUV conditions | 8/10 |
+| 3 | WaterSplat-SLAM | Photorealistic monocular underwater SLAM, less dependence on stereo | 8/10 |
+| 4 | Spatiotemporal Degradation-Aware 3DGS | Reconstruction of underwater scenes with temporal degradation (particles, current) | 8/10 |
+| 5 | BALTIC Benchmark | Air/underwater 3D reconstruction benchmark with illumination variations, useful for comparative QC | 7/10 |
+| 6 | Lost at Sea (Notre Dame) | AUV using 3DGS for autonomous navigation and environment recognition | 7/10 |
+
+## GitHub / HuggingFace
+
+| Repo | Signal |
+|------|--------|
+| LingBot-Map | Recent commits (4 days); track for keyframe fixes |
+| dust3r/mast3r | Active, no major release in the past week |
+| Pixal3D (SIGGRAPH 2026) | Pixel-aligned 3D, potentially useful for dense poses |
+
+## Recommendations for the next iteration
+
+- **ReefMapGS**: evaluate as a replacement for LingBot-Map on large segments (15m+)
+- **Sonar-MASt3R**: relevant if the Kogger SBP is integrated into the pipeline; stage 06 USBL+cam could use an acoustic component
+- **BALTIC Benchmark**: use for comparative QC on AUV210 segments (turbid)
--- a/pipeline/veille/2026-05-13-1043-iter-7.md
+++ b/pipeline/veille/2026-05-13-1043-iter-7.md
@@ -0,0 +1,21 @@
+# Tech watch iter-7 - 2026-05-13 10:43 UTC
+
+## Papers / Signals (6 total)
+
+| # | Title | Ref | Score | Relevance to COSMA |
+|---|-------|-----|-------|-----------------|
+| 1 | Aquatic Neuromorphic Optical Flow | arXiv 2605.07653 (5d) | 9/10 | Robust turbid-water optics, real-time, lightweight → stage06_align |
+| 2 | MAGS-SLAM: Multi-Agent 3DGS SLAM | arXiv 2605.10760 (2d) | 8/10 | Multi-robot 3DGS SLAM, photometric consistency → future multi-AUV |
+| 3 | AI Platform AUV 3DGS (Notre-Dame) | engineering.nd.edu (5d) | 9/10 | Underwater 3DGS with fuzzy ellipsoids, preloaded AUV navigation |
+| 4 | MV-DUSt3R+ | GitHub facebookresearch (7d) | 8/10 | Fast DUSt3R v2 (2s), comparison baseline for stage05 |
+| 5 | MonST3R | GitHub Junyi42 (ICLR 2025) | 7/10 | Geometry robust to motion/occlusion → segment transitions |
+| 6 | LingBot-Map | GitHub robbyant (5d) | 9/10 | Streaming update; check the diff against the version installed on .84/.87 |
+
+## Active repos (7d)
+- **lingbot-map** (robbyant): last update 5d ago; compare with the version installed on .84/.87
+- **dust3r / monst3r**: README and weights updates, nothing urgent
+
+## Next recommendations
+1. Evaluate Aquatic Neuromorphic Optical Flow for stage06_align (turbid)
+2. Benchmark 3DGS (MAGS-SLAM or Notre-Dame) on 1 AUV210 segment
+3. Update lingbot-map on .84/.87 if the diff is significant
Author	SHA1	Message	Date
Poulpe	13323f2edf	fix: 05_inference — kill stale demo.py + background poll exit viser + offload_to_cpu from yaml - kill_stale_demo_py() before each segment to prevent GPU contention from orphan processes - Remote script runs demo.py in background via nohup, polls for PLY file every 30s, kills viser server once PLY written — prevents indefinite SSH block on viser listener - offload_to_cpu now read from thresholds.yaml[inference] (default false for 24GB VRAM) - timeout reads inference_timeout_s from yaml (already 10800s) - min_frames guard included (from fix/05-inference-min-frames-timeout) Root cause: demo.py starts viser server after writing PLY; SSH timed out → orphan; two orphans competed for GPU with offload_to_cpu → pure CPU inference = 6h+ for 493 frames	2026-05-13 16:41:18 +00:00
Poulpe	c55700677e	auto-iter 2026-05-13: offload_to_cpu=false (.84 24GB VRAM, no CPU offload needed)	2026-05-13 16:39:51 +00:00
Poulpe	ba92d68492	chore: iter-7 veille + log (2026-05-13)	2026-05-13 10:42:37 +00:00
Poulpe	c7c4431e72	auto-iter 2026-05-13: inference min_frames=32 + timeout 3h (was 2h) - min_frames_for_inference: 32 (RoPE/attention needs ≥32 frames) - inference_timeout_s: 10800 (GX029818 timed out at 7200s with 493 frames) Authored-by: Poulpe <claude@nowyouknow.fr>	2026-05-13 10:36:28 +00:00
Poulpe	1f1502e67c	auto-iter 2026-05-12: log iter-5 + veille + merge PR#10 fix streaming params	2026-05-12 22:49:59 +00:00
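
Commit 13323f2edf maps the yaml value `offload_to_cpu` to either `--offload_to_cpu` or `--no-offload_to_cpu`. The negative form is only accepted if the receiving parser declares the option with `argparse.BooleanOptionalAction` (Python 3.9+) or something equivalent. The diff does not show demo.py's parser, so the sketch below is an assumption-laden check rather than a description of demo.py: `make_demo_parser` and `accepts` are hypothetical helpers written for this illustration.

```python
import argparse


def offload_flag(inference_cfg: dict) -> str:
    """Same selection as in the diff: a missing key means no CPU offload."""
    use_offload = inference_cfg.get("offload_to_cpu", False)
    return "--offload_to_cpu" if use_offload else "--no-offload_to_cpu"


def make_demo_parser(optional_bool: bool) -> argparse.ArgumentParser:
    """Hypothetical stand-in for demo.py's parser (not taken from the repo)."""
    parser = argparse.ArgumentParser(prog="demo.py")
    if optional_bool:
        # Registers both --offload_to_cpu and --no-offload_to_cpu.
        parser.add_argument("--offload_to_cpu",
                            action=argparse.BooleanOptionalAction,
                            default=False)
    else:
        # Plain store_true: only --offload_to_cpu exists.
        parser.add_argument("--offload_to_cpu", action="store_true")
    return parser


def accepts(parser: argparse.ArgumentParser, flag: str) -> bool:
    """Return True if the parser accepts the flag, False if it rejects it."""
    try:
        parser.parse_args([flag])
        return True
    except SystemExit:  # argparse exits on unrecognized arguments
        return False


if __name__ == "__main__":
    flag = offload_flag({"offload_to_cpu": False})
    print(flag, accepts(make_demo_parser(True), flag))
    print(flag, accepts(make_demo_parser(False), flag))
```

If demo.py turns out to use a plain `store_true` option, a safer variant would be to append `--offload_to_cpu` only when the yaml value is true and pass nothing otherwise.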