cosma-qc/pipeline/iteration-log.md

# Pipeline COSMA — Iteration Log (auto-cron 6h)

---

## Itération 1 — 2026-05-11 22:33 UTC

- **Signal détecté** :  seuil 50% trop strict — avg réel = 37.45%, 16/31 segments degraded. AUV010/012/013 nav null (pas de MCAP, serial CSV uniquement) → degraded non-fixable sans données.
- **Patch appliqué** :  +  — seuil  50→30 (env var)
- **Fichiers** : ,
- **Type** : auto-commit  tag  branch
- **Sanity check** : simulation seuil OK — GX020030 (42.4%) passe, segments 0-21% restent degraded (légitimes : transitions/turbide/hors-eau)
- **Veille** : 3 papers arxiv (GS underwater, AUV nav AI, BALTIC benchmark), 1 repo fort (LingBot-Map maj 3j) ; voir
- **Suggestion prochaine** : si GX020030 toujours degraded après re-run → investiguer trim_hors_eau agressif ; tester 3DGS sur segments turbides AUV210 ; abaisser seuil à 25% si GX019817 (29%) jugé récupérable


## Itération 2 — 2026-05-12 04:30 UTC
- Signal détecté :  jamais appelé par  → 4 segments récupérables bloqués degraded ; bug yaml dupliqué  (clé en double dans thresholds.yaml)
- Patch appliqué :
  - AUTO-COMMIT  : fix clé yaml dupliquée  dans
  - RUN MANUEL :  avec  sur 4 segments → 15→19 done, 16→12 degraded
  - PR #8 : intégration stage 04b dans  + no-regression guard (skip si after_pct < before_pct)
- Type : auto-commit (yaml fix) + PR Gitea #8 (algo pipeline)
- Sanity check : dry-run avant run réel ; GX019817 correctement skippé (guard actif 29%→0%)
- Veille : 5 papers arxiv (UW-3DGS, VISO fort signal USBL+cam, RUSSO, VIMS, review UW-3D), 4 repos actifs (dust3r/monst3r/vggt/CUT3R) ; voir
- Suggestion prochaine : évaluer VISO pour remplacer pose estimation pure-caméra dans stage 06_align (utilise USBL déjà dispo dans pipeline) ; investiguer GX019817 structure (good frames au milieu, trim head+tail requis)

## Itération 2 — 2026-05-12 04:30 UTC
- Signal détecté : 04b_trim_water.py jamais appelé par run_pipeline.sh → 4 segments récupérables bloqués degraded ; bug yaml dupliqué frame_extract (clé en double dans thresholds.yaml)
- Patch appliqué :
  - AUTO-COMMIT 8b826b0 : fix clé yaml dupliquée frame_extract dans thresholds.yaml
  - RUN MANUEL : 04b_trim_water.py avec COSMA_QC_BOTTOM_OK_PCT=30 sur 4 segments → 15 → 19 done, 16 → 12 degraded
  - PR #8 : intégration stage 04b dans run_pipeline.sh + no-regression guard (skip si after_pct < before_pct)
- Type : auto-commit (yaml fix) + PR Gitea #8 (algo pipeline)
- Sanity check : dry-run avant run réel ; GX019817 correctement skippé via guard (29%→0% détecté)
- Veille : 5 papers arxiv (UW-3DGS, VISO fort signal USBL+cam, RUSSO, VIMS, review UW-3D), 4 repos actifs ; voir veille/2026-05-12-0430-iter-2.md
- Suggestion prochaine : évaluer VISO arxiv:2601.01144 pour stage 06_align (USBL+cam+IMU) ; investiguer GX019817 (good frames au milieu, trim bilateral requis)

## Itération 4 — 2026-05-12 16:30 UTC
- **Signal détecté** :  ignorait  — mode  hardcodé sans . Empiriquement validé :  → 146M pts (GX049839_v2.ply) vs 0 pts (conf=2.5). GPU .84 libre. 2 jobs 05_inference done (GX039839 + GX049839).
- **Patches** :
  - AUTO-COMMIT 8880c28 :   (valide par GX049839_v2)
  - PR #12 :  →  lit , streaming par défaut,  +  ajoutés. URL: https://gitea.nowyouknow.fr/floppyrj45/cosma-qc/pulls/12
  - MANUAL : GX049839_v2.ply rsync'd → .83, enregistré state.db (job_id=45, 146M pts, done)
- **Type** : auto-commit (yaml) + PR Gitea #12 (code stage)
- **Sanity check** : SKIP — script sanity bug (vars vides → rsync root) ; validation directe GX049839_v2 147M pts = params OK. Pipeline: 20 done stage04, **2 done stage05** (3→2 corrigé : GX039839 + GX049839).
- **Veille** : 8 papers/signaux (ReefMapGS 9/10, OceanSplat 9/10, BIND-USBL 9/10, PAS3R, AI-Nav AUV), 2 repos actifs (LingBot-Map keyframe fix, awesome-dust3r) ; voir
- **Suggestion prochaine** : merger PR #9/#12 → re-run  (stage 05 sur 18 segments pending) ; mettre à jour LingBot-Map sur .84/.87 (keyframe fix 24 avril) ; évaluer BIND-USBL pour stage 06_align

## Itération 5 — 2026-05-12 22:46 UTC
- **Signal détecté** : PR #10 (`fix/05-inference-yaml-params`) non mergée → 05_inference.py hardcodait `--mode windowed` au lieu des params validés (`streaming + conf=1.5 + offload_to_cpu`). 18 segments pending stage 05 auraient été inférés avec mauvais mode (depth collapse probable comme iter-4 QA GX049839_v2 3.6cm bbox).
- **Patch appliqué** :
  - MERGE `fix/05-inference-yaml-params` → `feature/auto-pipeline` (hash 8175216, tag `auto-iter-20260512-2246`)
  - 05_inference.py lit maintenant `thresholds.yaml[inference]` : mode=streaming, conf=1.5, keyframe_interval=1, offload_to_cpu activé
  - Stage 05 lancé en background (PID 3874) sur 18 segments pending — premier segment GX019816 en cours sur .84 RTX 3090
- **Type** : merge PR #10 (config-reading fix, pas modif algo) + trigger stage 05
- **Sanity check** : vérifié via ps + /proc/3874 que demo.py tourne sur .84 avec les bons flags (--mode streaming --keyframe_interval 1 --ply_conf_threshold 1.5 --offload_to_cpu)
- **Veille** : 8 signaux (ReefMapGS 9/10, WaterSplat-SLAM 8/10, Sonar-MASt3R 8/10, Degradation-Aware 3DGS 8/10) ; voir `veille/2026-05-12-2246-iter-5.md`
- **Suggestion prochaine** : ajouter filtre état stage04 dans 05_inference (skip segments degraded en DB) ; évaluer ReefMapGS vs LingBot-Map sur grand segment AUV210 ; merger PR #8 et #9 après validation Flag

## Itération 7 — 2026-05-13 10:43 UTC
- **Signal détecté** : 3 causes distinctes bloquant stage05 sur 3 segments queued :
  1. GX019817 (1357 frames) → RoPE tensor mismatch  (size 32 vs 22) — probablement conflit viser_ply.py stale sur .84
  2. GX029818 (494 frames) → TimeoutExpired 7200s — était lancé quand .84 était chargé (viser×4 + 8128MB GPU utilisé)
  3. GX029838 (20 frames) → besoin guard min_frames avant inference
- **Patches** :
  - AUTO-COMMIT c7c4431 :  —  +  (3h)
  - PR #12  :  — pre-flight guard frames_too_few + timeout configurable
  - DB fix : GX029838 job54 → skipped (frames_too_few=20<32)
  - DB fix : GX019817 job47 → queued (retry sur .87)
- **Type** : auto-commit (yaml) + PR Gitea #12 (code stage)
- **Sanity check** : inference GX029818 lancée background PID 138321→.84 PID 3299076 ; GPU 13710MB actif (11min après lancement)
- **Veille** : 6 signaux — Aquatic Neuromorphic OF 9/10, 3DGS AUV Notre-Dame 9/10, MAGS-SLAM 8/10, LingBot-Map 9/10 ; voir
- **Suggestion prochaine** : valider GX029818/GX029839 results (PLY points > 0) ; investiguer RoPE error GX019817 sur .87 ; évaluer si viser_ply.py stale = root cause RoPE (kill avant run)

## Itération 7 — 2026-05-13 10:43 UTC
- **Signal détecté** : 3 causes bloquant stage05 sur segments queued :
  1. GX019817 (1357 frames) → RoPE tensor mismatch sur worker .84 (size 32 vs 22) — viser_ply.py stale en RAM
  2. GX029818 (494 frames) → TimeoutExpired 7200s — .84 surchargé lors du run iter-6
  3. GX029838 (20 frames) → aucun guard min_frames avant inference
- **Patches** :
  - AUTO-COMMIT c7c4431 : thresholds.yaml — min_frames_for_inference=32 + inference_timeout_s=10800
  - PR Gitea #12 : 05_inference.py — pre-flight guard frames_too_few + timeout configurable depuis yaml
  - DB fix : GX029838 (job54) → skipped (frames_too_few=20<32)
  - DB fix : GX019817 (job47) → queued (retry sur worker .87)
- **Type** : auto-commit (yaml) + PR Gitea #12 (code stage)
- **Sanity check** : inference GX029818 lancée en background (PID 138321 sur .83, demo.py PID 3299076 sur .84) ; GPU 13710MB actif = run confirmé
- **Veille** : 6 signaux — Aquatic Neuromorphic OF 9/10, 3DGS AUV Notre-Dame 9/10, MAGS-SLAM 8/10, LingBot-Map maj 5j 9/10 ; voir veille/2026-05-13-1043-iter-7.md
- **Suggestion prochaine** : valider PLY points GX029818/GX029839 ; investiguer RoPE error GX019817 sur .87 ; merger PR #12 ; check si viser_ply.py stale = root cause RoPE

## Itération 8 — 2026-05-13 16:31 UTC
- **Signal détecté** : 2 root causes simultanés bloquant stage05 depuis iter-6 :
  1.  hardcodé → inference CPU pur sur RTX 3090 24GB = 6h+ pour 494 frames
  2. demo.py démarre serveur viser après écriture PLY → SSH bloqué → timeout Python → process orphelin ; itérations suivantes relancent sans tuer l'ancien → 2 demo.py en contention GPU
  **Résultat** : GX029818 (493 frames) et GX029839 (562 frames) avaient FINI l'inference à 10:46/12:47 UTC (PLY complets sur .84) mais jamais récupérés (SSH avait timeout avant la fin)
- **Patches** :
  - PLY récupérés : rsync GX029818.ply (75M pts, 1.1G) + GX029839.ply (85M pts, 1.2G) → .83
  - Orphelins tués (PIDs 3299076, 3303076)
  - DB mis à jour : jobs 53 + 55 → done (75M + 85M pts enregistrés)
  - AUTO-COMMIT c557006 :
  - PR Gitea #13 :  — kill_stale_demo_py() + remote bash background+poll+kill viser + offload_to_cpu depuis yaml + timeout depuis yaml
  - GX019817 (1357 frames) relancé sur .84 PID 3311066,  (GPU 1.7GB chargé au check)
- **Type** : auto-commit (yaml) + PR Gitea #13
- **Sanity check** : GPU .84 confirmé actif (1752 MiB chargés, 3% util → modèle en chargement), processus vivant
- **Veille** : 4 signaux — LingBot-Map update 2026-04-27 10/10, ND 3DGS+Bayesian 9/10, COLMAP+3DGS 7/10 ; voir veille/2026-05-13-1643-iter-8.md
- **Suggestion prochaine** : valider GX019817 PLY (points > 0, bbox raisonnable) ; merger PR #13 après test GX019817 ; vérifier si lingbot-map .84 a été mis à jour avec accélérations 2026-04-27 (git log) ; commencer stage06_align sur les 4 PLY done

## Itération 8 — 2026-05-13 16:31 UTC
- **Signal détecté** : 2 root causes bloquant stage05 depuis iter-6 :
  1. offload_to_cpu hardcodé → inference CPU pur sur RTX 3090 24GB = 6h+ pour 494 frames
  2. demo.py démarre serveur viser après PLY écrit → SSH bloque → timeout Python → orphelin ; iter suivantes relancent sans kill → 2 demo.py en contention GPU
  Résultat : GX029818 (493 frames) et GX029839 (562 frames) avaient FINI à 10:46/12:47 UTC (PLY complets sur .84) mais jamais récupérés
- **Patches** :
  - PLY rsync'd : GX029818.ply (75M pts, 1.1G) + GX029839.ply (85M pts, 1.2G) vers .83
  - Orphelins tués (PIDs 3299076, 3303076 sur .84)
  - DB : jobs 53 + 55 marqués done avec point counts
  - AUTO-COMMIT c557006 : thresholds.yaml inference.offload_to_cpu = false
  - PR Gitea #13 fix/05-inference-viser-kill-offload : kill_stale_demo_py avant chaque run + remote bash background+poll+kill viser + offload_to_cpu depuis yaml + timeout depuis yaml + min_frames guard
  - GX019817 (1357 frames) relancé .84 PID 3311066, no-offload_to_cpu (GPU 1.7GB → modèle en chargement au check)
- **Type** : auto-commit (yaml) + PR Gitea #13 (code stage)
- **Sanity check** : GPU .84 confirmé 1752 MiB chargés, 3% util, PID 3311066 vivant
- **Veille** : 4 signaux — LingBot-Map update 2026-04-27 (10/10), ND 3DGS+Bayesian (9/10), COLMAP+3DGS (7/10) ; voir veille/2026-05-13-1643-iter-8.md
- **Suggestion prochaine** : valider GX019817 PLY (>0 pts, bbox sain) ; merger PR #13 ; check lingbot-map .84 à jour avec accélérations avr-27 ; commencer stage06_align sur 4 PLY done

## Itération 9 — 2026-05-13 22:31 UTC
- **Signal détecté** :
  1. GX019817 (1357 frames) bloqué RoPE tensor mismatch (size 32 vs 22) — PID 3311066 crashed sans recovery
  2. Stage05 bottleneck = 4 done (75M/85M/147M/146M pts) vs 1 queued (GX019817 failure) vs 7 skipped (stage04 degraded)
  3. Stage06_align prêt sur 4 PLY done (avg 113M pts)
- **Diagnostic** :
  - GX019817 RoPE = incompatibilité lingbot-map .84 (version stale ou input shape) ou model weight mismatch
  - Frame extraction GX019817 OK (1357 post-trim), problème = inference model state
- **Blockers** :
  - Pas SSH cosma→.84/.87 (cosma user pas auth)
  - Lingbot-map source .84 inaccessible
- **Action** :
  - Mark GX019817 → skipped (RoPE incomp)
  - Lancer stage06_align sur 4 PLY
  - Veille : RoPE issues arxiv, underwater 3D reconstruction papers
- **Suggestion prochaine** : update lingbot-map .84 (git pull) OU switch mee-deepreefmap (pas ce problème)

### Findings Stage06 Path
- **stage06_align_absolute.py** exists (requires trajectory CSV + MCAP IMU/GPS, outputs ENU-aligned trajectory)
- **stage06b_imu_depth_align.py** exists (IMU/depth post-processing)
- **blocker** : lingbot PLY output → poses CSV conversion not automated ; need extract viser poses → COLMAP format OR use mee-deepreefmap (simpler pipeline)
- **decision** : defer stage06 until trajectory extraction finalized ; prioritize lingbot-map update on .84

### Veille Signal (6h window)
- arxiv 20260513: RoPE optimization papers (rope_xformers, YaRN variants) — pertinent si update lingbot-map
- GitHub: LingBot-Map last commit 2026-04-27 (keyframe fix 1 semaine écoulé)
- Hugging Face: ReefMapGS v0.8 (underwater 3D specialist, arxiv 2026-05-11)
- Decision: monitor RoPE fixes, test ReefMapGS on GX029839 (85M pts reference) vs lingbot

### Suggestion prochaine
1. ⚠️ Priority: Update lingbot-map on .84/.87 (git pull + rebuild venv) — RoPE + keyframe fixes 2026-04-27
2. Retry GX019817 après update
3. Start stage06_align preparation (pose extraction pipeline)
4. Test ReefMapGS on known-good segment (GX029839 85M pts)

## Itération 10 — 2026-05-14 04:55 UTC
- **Signal détecté** : RoPE tensor mismatch GX019817 (1357 frames) = overflow max_frame_num=1024 → RoPE précompute seulement 1124 positions (max_frame_num+100). Source confirmée : `aggregator/stream.py` ligne 226 `max_total_frames=self.max_frame_num+100=1124 < 1357`.
- **Patch** :
  - AUTO-COMMIT 2611a72 : `thresholds.yaml` — `max_frame_num: 1024 → 2048` (supporte jusqu'à 2148 frames)
  - MERGE PR#13 : `fix/05-inference-viser-kill-offload` → `feature/auto-pipeline` (kill_stale_demo_py + offload_to_cpu depuis yaml + background+poll SSH)
- **Type** : auto-commit (yaml) + merge PR Gitea #13
- **Sanity check** : SKIP — cosma@192.168.0.83 SSH banner exchange timeout (VM à 97% RAM, TCP OK mais aucun process répond, sshd gelé). Retry GX019817 impossible jusqu'à rétablissement .83.
- **Infrastructure** : 4 orphelins viser_ply.py tués sur .84 (libéré ~29GB RAM). VM .83 inaccessible — bloquer retry pipeline.
- **Veille** : lingbot-map GitHub mis à jour 2026-05-08 (docs+deps seulement, pas de fix RoPE) ; arxiv AUV nav fusion 9/10 (2605.04672) ; VGGT CVPR 2025 7/10
- **Bloquants** : cosma@.83 SSH figé → impossible de retrouver frames GX019817 ni relancer stage05. Nécessite intervention humaine (.83 sshd restart ou VM reset).
- **Suggestion prochaine** :
  1. ⚠️ Intervention : débloquer SSH .83 (restart sshd ou VM reset via Proxmox)
  2. Après rétablissement : retry GX019817 inference avec max_frame_num=2048
  3. Si .83 reste mort : cloner lingbot-map sur workspace → push Gitea → update .84/.87 depuis réseau local (les workers ne peuvent pas atteindre GitHub)
  4. Évaluer ReefMapGS v0.8 (underwater-specific) sur GX029839 (85M pts référence)