{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreige2cayu4k76773y42cxcxfdqlouxxb6itmtlkfgrm3b2oq2s5r5u",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mnrmjcktf4i2"
},
"path": "/t/reinforement-structure-analysis/176541#post_4",
"publishedAt": "2026-06-08T11:00:48.000Z",
"site": "https://discuss.huggingface.co",
"tags": [
"Ultralytics YOLO tracking mode",
"OpenCV Optical Flow tutorial",
"SAM 2 GitHub",
"SAM 2 paper",
"COLMAP",
"COLMAP issue: wrong poses due to duplicate/symmetric features",
"COLMAP issue: camera pose error due to similar structure",
"DUSt3R",
"MASt3R",
"VGGT",
"DUSt3R paper",
"VGGT paper",
"Automatic Quality Inspection of Rebar Spacing Using Vision-Based Deep Learning with RGBD Camera",
"PDF",
"Artificial intelligence quality inspection of steel bars installation by integrating Mask R-CNN and stereo vision",
"Video Depth Anything paper"
],
"textContent": "Oh! So you can use multiple images for a single target/face? Then the set of useful models and algorithms changes quite a lot, in a good way:\n\n* * *\n\nThe extra images and your capture options change the recommendation quite a bit.\n\nIf you can take **multiple images or a video of the same target face** , I would not keep the system framed as a strict single-image problem. I would treat each target face as an **inspection set** :\n\n\n one target face\n → one straight-on reference image\n → several slightly shifted images, or a short slow video\n → candidate detection / segmentation per frame\n → cross-frame consistency / motion / geometry checks\n → front-layer selection\n → horizontal-level clustering\n → count + overlay + confidence\n\n\nThat change matters because some methods that are weak or ambiguous in a single RGB image become much more useful once the input becomes a same-target image/video set.\n\n## 1. Why multiple images change the problem\n\nIn a single image, you mostly have:\n\n * appearance,\n * line geometry,\n * local occlusion cues,\n * apparent thickness,\n * monocular depth as a soft cue,\n * learned segmentation/detection.\n\n\n\nThat can help, but the front/rear separation is still underconstrained.\n\nWith multiple images or video of the same target face, you also get:\n\n * temporal consistency,\n * parallax,\n * cross-frame voting,\n * possible optical flow,\n * possible object/mask tracking,\n * possible multi-view geometry,\n * more chances to see around partial occlusions,\n * a better confidence signal.\n\n\n\nSo I would think of the input not as:\n\n\n single image → count\n\n\nbut as:\n\n\n same-target capture set → count\n\n\nThis is a major design change.\n\n## 2. The lowest-friction experiment\n\nIf you already have a single-image pipeline, or if you are already testing Depth Anything / MiDaS / classical CV, I would not throw that work away.\n\nThe smallest useful extension is:\n\n\n 1. 
Choose one target face.\n 2. Record a slow 5–10 second video while moving slightly left/right,\n or take 5–15 slightly shifted still images.\n 3. Extract frames.\n 4. Run your existing single-image processing on each frame.\n 5. Extract candidate horizontal bars or horizontal bar levels per frame.\n 6. Fuse the results across frames.\n 7. Keep candidates that are stable and plausible across the same-target set.\n 8. Cluster the remaining front-face candidates by vertical position.\n 9. Output count + visual overlay + confidence.\n\n\nIn pseudo-code:\n\n\n frames = extract_frames(video_or_image_set)\n\n per_frame_results = []\n for frame in frames:\n result = single_image_pipeline(frame)\n per_frame_results.append(result)\n\n fused = fuse_same_target_results(per_frame_results)\n count = count_front_horizontal_levels(fused)\n\n\nThis is useful because it does not require a completely new system. It changes the unit of analysis from **one image** to **one capture set**.\n\n## 3. Methods that become more useful with multiple images/video\n\nSome of the methods you already mentioned are weak as single-image final answers, but become more interesting when repeated over a same-target capture set.\n\nMethod | In a single image | With multiple images / video\n---|---|---\nDepth Anything / MiDaS | Useful relative-depth cue, but not reliable enough as final authority | Can be checked for temporal consistency and combined with motion/parallax cues\nClassical CV | Hough lines / edges may over-detect rear bars | Optical flow, feature tracking, line stability, and cross-frame voting become possible\nRebar segmentation / detection | Gives visible rebar candidates | Candidates can be fused and validated across frames\nSAM / SAM-like segmentation | Helpful for masks, but fragile on dense repeated bars | SAM 2-style video mask propagation or interactive correction becomes more useful\nCOLMAP / SfM / modern 3D models | Not applicable | Can be tested as diagnostic geometry cues\nRGB-D 
/ stereo | Not relevant if only RGB | Becomes a strong option if specialized cameras are acceptable\n\nSo I would not discard the original ideas. I would change their role.\n\nFor example:\n\n\n Depth Anything as a one-frame decision maker: risky\n Depth Anything as a repeated soft cue across frames: more useful\n\n\nand:\n\n\n Hough lines on one image: many false positives\n line candidates stable across a same-target video: more meaningful\n\n\n## 4. A practical branch tree\n\nI would choose the pipeline depending on what capture is possible.\n\n\n Can you capture multiple images or video of the same target face?\n\n ├─ No, single image only\n │ └─ Endpoint:\n │ rebar detection/segmentation\n │ + geometric filtering\n │ + optional monocular depth cue\n │ + confidence / human review\n │\n └─ Yes\n ├─ Short video is available\n │ └─ Endpoint:\n │ per-frame candidates\n │ + tracking / optical flow / SAM 2 mask propagation\n │ + temporal consistency\n │ + row clustering\n │\n ├─ Multiple still images are available\n │ └─ Endpoint:\n │ same-target inspection set\n │ + multi-view consistency\n │ + optional SfM / DUSt3R / MASt3R / VGGT diagnostic\n │ + fused row candidates\n │\n └─ Specialized camera is possible\n └─ Endpoint:\n stereo or RGB-D\n + point cloud / plane fitting\n + target-layer extraction\n + spacing/count validation\n\n\nI would start with the lowest-cost branch and only move to heavier hardware or heavier 3D reconstruction if the simpler route fails.\n\n## 5. 
Suggested priority order\n\nMy practical priority order would be:\n\nPriority | Option | Why\n---|---|---\n1 | Controlled same-target video | No special hardware; adds temporal consistency and parallax\n2 | Multiple same-target still images | Easy to collect; supports cross-view checking\n3 | Rebar-specific detection/segmentation | Gives candidate bars before layer selection\n4 | Optical flow / tracking / temporal voting | Low-cost way to use video\n5 | SAM 2 video propagation | Useful for interactive mask propagation / annotation\n6 | COLMAP / DUSt3R / MASt3R / VGGT | Useful diagnostic geometry, but not guaranteed on repetitive rebar\n7 | Stereo / RGB-D | Stronger geometry if special cameras are acceptable\n8 | Drone | Useful for access/safety/repeatability, but not automatically a better CV solution\n\nI would not start with the drone unless access or safety requires it. A drone changes the camera position and may help collect images from safer or more repeatable viewpoints, but it does not automatically solve front/rear bar separation. A controlled handheld same-target video may be more valuable for algorithm development.\n\n## 6. Path A: single-image fallback\n\nIf only one RGB image is available, I would use the earlier kind of pipeline:\n\n\n image\n → crop/select target face\n → detect or segment rebar candidates\n → keep near-horizontal elongated candidates\n → score front-face likelihood\n → cluster by vertical position\n → count\n\n\nThe front-face score could combine:\n\n\n apparent thickness\n + edge sharpness\n + continuity across width\n + occlusion order\n + regular spacing\n + optional monocular depth\n\n\nBut I would still treat this as the least robust path. The output should probably include a visual overlay and a confidence score, because there will be ambiguous cases.\n\n## 7. 
Path B: same-target video\n\nIf video is available, I would try this first.\n\n\n video of same target face\n → sample frames\n → run candidate detection/segmentation per frame\n → associate candidates across frames\n → keep temporally stable row candidates\n → use motion/parallax to suppress rear/interior candidates\n → cluster rows\n\n\nThis does not require full 3D reconstruction.\n\nIt can be implemented with relatively ordinary tools:\n\n * per-frame detection/segmentation,\n * optical flow,\n * tracker association,\n * temporal voting,\n * row-level clustering.\n\n\n\nUltralytics YOLO has a tracking mode using trackers such as BoT-SORT and ByteTrack:\n\n * Ultralytics YOLO tracking mode\n\n\n\nHowever, I would be careful with the tracking unit. Tracking each individual thin bar may be fragile. For dense, repeated rebar, I would probably track or stabilize **row candidates** or **regions**, rather than depend too heavily on perfect per-bar IDs.\n\nOpenCV optical flow can also be useful:\n\n * OpenCV Optical Flow tutorial\n\n\n\nBut again, I would not use optical flow as a magic answer. I would use it as another cue:\n\n\n Does this horizontal candidate move like the front layer?\n Is it stable across nearby frames?\n Does it remain in the same row-level cluster?\n\n\n## 8. Path C: SAM 2 as a video/annotation helper\n\nSAM 2 is relevant here because it is designed for promptable segmentation in both images and videos:\n\n * SAM 2 GitHub\n * SAM 2 paper\n\n\n\nI would not assume SAM 2 will automatically separate all dense rebars correctly. 
The structure is thin, repetitive, and heavily occluded.\n\nBut SAM 2 may be useful in this workflow:\n\n\n first frame:\n prompt target face / front bars / cage region\n\n video:\n propagate mask or region through frames\n\n post-processing:\n count horizontal levels inside the propagated target face\n\n\nI would especially consider it for:\n\n * annotation acceleration,\n * interactive correction,\n * propagating a manually selected target face through video,\n * building a training dataset faster.\n\n\n\nSo the role is not necessarily:\n\n\n SAM 2 → final count\n\n\nbut rather:\n\n\n SAM 2 → useful masks / annotations / target region propagation\n\n\n## 9. Path D: multiple still images and multi-view geometry\n\nIf multiple still images of the same target face are available, I would treat them as an inspection set.\n\n\n same-target still images\n → run candidate detection per image\n → match/fuse row candidates across images\n → optionally run a geometry diagnostic\n → count stable front-face levels\n\n\nThis opens up tools that are impossible with a single image.\n\nFor classic multi-view reconstruction, COLMAP is a standard SfM/MVS tool. COLMAP can be useful to test whether there is enough camera motion and texture to recover a meaningful geometry signal.\n\nHowever, I would not make COLMAP the first production assumption. Rebar cages are difficult for SfM because they contain:\n\n * repeated patterns,\n * thin lines,\n * many similar intersections,\n * occlusion,\n * background construction clutter.\n\n\n\nRepeating structure can cause wrong correspondences or wrong camera poses in SfM systems; this is a known type of failure, not something specific to this task:\n\n * COLMAP issue: wrong poses due to duplicate/symmetric features\n * COLMAP issue: camera pose error due to similar structure\n\n\n\nSo I would treat SfM as:\n\n\n good diagnostic if it works\n not a guaranteed core pipeline\n\n\n## 10. 
Modern 3D foundation models may be worth testing\n\nBecause you can collect multiple images, newer 3D models may also be worth testing as diagnostics.\n\nExamples:\n\n * DUSt3R\n * MASt3R\n * VGGT\n\n\n\nDUSt3R is designed for dense 3D reconstruction from arbitrary image collections without known camera calibration or poses:\n\n * DUSt3R paper\n\n\n\nVGGT predicts key 3D scene attributes such as camera parameters, point maps, depth maps, and 3D point tracks from one, a few, or hundreds of views:\n\n * VGGT paper\n\n\n\nThese are not rebar-specific models. I would not assume they solve the task directly. But they may be useful to answer a practical question:\n\n\n Does the same-target image set contain enough geometric signal\n to separate the front layer from the background/rear layer?\n\n\nIf the answer is yes, then geometry can become part of the pipeline. If not, it is better to focus on detection/segmentation and capture control.\n\n## 11. Path E: RGB-D or stereo, if specialized cameras are acceptable\n\nIf specialized cameras are acceptable, I would consider stereo or RGB-D before thinking of a drone as the main algorithmic solution.\n\nThe reason is simple:\n\n\n The hard part is layer separation, not only image access.\n\n\nRGB-D or stereo can directly help with front/rear separation.\n\nThere is relevant work on rebar spacing inspection using RGB-D and point-cloud processing:\n\n * Automatic Quality Inspection of Rebar Spacing Using Vision-Based Deep Learning with RGBD Camera\n * PDF\n\n\n\nThat work is interesting because it uses depth/point-cloud processing to filter background rebar layers before measuring the target layer. 
That is conceptually close to your front/rear separation problem.\n\nThere is also related work combining instance segmentation and stereo vision for steel-bar installation inspection:\n\n * Artificial intelligence quality inspection of steel bars installation by integrating Mask R-CNN and stereo vision\n\n\n\nSo if special cameras are realistic, I would think in this order:\n\n\n normal video / multi-image capture first\n → if front/rear separation is still unreliable:\n stereo or RGB-D\n → drone only if access/safety/repeatability requires it\n\n\n## 12. How I would use Depth Anything / MiDaS in this new setting\n\nYour original monocular-depth idea becomes more useful once there are multiple frames.\n\nSingle-frame usage:\n\n\n Depth Anything / MiDaS\n → relative depth map\n → maybe front/rear cue\n\n\nThis is weak as a final decision.\n\nSame-target video usage:\n\n\n Depth Anything / MiDaS per frame\n → check whether front-layer candidates remain consistently closer\n → combine with candidate continuity and parallax\n → use depth as a soft vote\n\n\nThis is a better role.\n\nThere are also video-focused depth models such as Video Depth Anything:\n\n * Video Depth Anything paper\n\n\n\nThat does not mean it is automatically necessary, but it supports the general point: video depth consistency is a different problem from single-image depth.\n\nSo I would phrase it like this:\n\n\n Monocular depth is not sufficient as a single-frame authority,\n but it may become useful as a repeated soft cue across a same-target capture set.\n\n\n## 13. How I would use classical CV in this new setting\n\nClassical CV also becomes more useful with video.\n\nSingle-image classical CV:\n\n\n edges / Hough lines / morphology\n → many false positives from rear/interior bars\n\n\nVideo classical CV:\n\n\n line candidates\n + optical flow\n + frame-to-frame consistency\n + row-level voting\n\n\nThis is much more useful.\n\nFor example:\n\n\n 1. 
Extract near-horizontal candidates in each frame.\n 2. Cluster them into row candidates.\n 3. Track row candidates across frames.\n 4. Keep rows that remain stable and plausible.\n 5. Downweight rows that appear only in a few frames or move inconsistently.\n\n\nThis keeps classical CV in a realistic role: not the whole solution, but a useful stabilizer.\n\n## 14. A possible minimal prototype\n\nIf I were testing this with ordinary camera/video first, I would implement:\n\n\n Input:\n same-target short video or 5–15 same-target images\n\n Step 1:\n manually or automatically crop/select the target face\n\n Step 2:\n run existing per-image processing:\n - rebar candidate detection/segmentation\n - optional depth\n - near-horizontal candidate extraction\n\n Step 3:\n aggregate across frames:\n - group candidates by row\n - check temporal consistency\n - check depth consistency if available\n - check motion/parallax behavior if available\n\n Step 4:\n produce:\n - counted horizontal levels\n - overlay on reference image\n - confidence score\n - low-confidence review flag\n\n\nA very simple scoring idea:\n\n\n row_score =\n number_of_frames_detected\n + horizontal_continuity_score\n + row_spacing_plausibility_score\n + front_depth_consistency_score\n + motion_consistency_score\n\n\nThen count rows above a threshold, and always show the overlay.\n\n## 15. Questions that would decide the branch\n\nThe next useful questions are probably:\n\nQuestion | Why it matters\n---|---\nDo you need only count, or also spacing/compliance? | Count can be simpler; spacing needs scale/calibration\nIs the target always one face of the cage? | Same-target capture set assumes this\nCan the target face be manually cropped/selected? | This greatly reduces difficulty\nCan you capture a short slow video during inspection? | Enables temporal consistency and parallax\nCan you place a known-size marker or use design dimensions? 
| Helps scale and validation\nAre RGB-D/stereo cameras acceptable in the field, or only for R&D? | Decides whether depth/point cloud routes are realistic\nIs a drone needed for access/safety, or mainly for better vision? | These are different reasons\n\n## 16. My revised recommendation\n\nGiven your new constraints, I would revise the earlier recommendation to:\n\n\n Do not treat this as only a single-image depth problem.\n Treat each target face as a same-target inspection set.\n\n Start with ordinary camera video or multiple still images.\n Run your current per-image model/CV pipeline on frames.\n Fuse the evidence across frames.\n Use temporal consistency, parallax, optional depth, and row clustering\n to select the front horizontal levels.\n\n Only move to stereo/RGB-D if normal same-target capture is not reliable enough.\n Use drones mainly for access/safety/repeatability, not as the core CV solution.\n\n\nThis does not give a guaranteed final answer, but it should make the search space much better constrained. The key shift is:\n\n\n single image:\n semantic + appearance problem\n\n multiple images/video:\n semantic + appearance + temporal + geometric problem\n\n\nThat second formulation gives you many more practical options.",
"title": "Reinforement Structure Analysis"
}