Reinforcement Structure Analysis
Oh! So you can use multiple images for a single target face? Then the set of useful models and algorithms changes quite a lot, in a good way, and so does the recommendation.
If you can take multiple images or a video of the same target face, I would not keep the system framed as a strict single-image problem. I would treat each target face as an inspection set:
one target face
→ one straight-on reference image
→ several slightly shifted images, or a short slow video
→ candidate detection / segmentation per frame
→ cross-frame consistency / motion / geometry checks
→ front-layer selection
→ horizontal-level clustering
→ count + overlay + confidence
That change matters because some methods that are weak or ambiguous in a single RGB image become much more useful once the input becomes a same-target image/video set.
1. Why multiple images change the problem
In a single image, you mostly have:
- appearance,
- line geometry,
- local occlusion cues,
- apparent thickness,
- monocular depth as a soft cue,
- learned segmentation/detection.
That can help, but the front/rear separation is still underconstrained.
With multiple images or video of the same target face, you also get:
- temporal consistency,
- parallax,
- cross-frame voting,
- possible optical flow,
- possible object/mask tracking,
- possible multi-view geometry,
- more chances to see around partial occlusions,
- a better confidence signal.
So I would think of the input not as:
single image → count
but as:
same-target capture set → count
This is a major design change.
2. The lowest-friction experiment
If you already have a single-image pipeline, or if you are already testing Depth Anything / MiDaS / classical CV, I would not throw that work away.
The smallest useful extension is:
1. Choose one target face.
2. Record a slow 5–10 second video while moving slightly left/right,
or take 5–15 slightly shifted still images.
3. Extract frames.
4. Run your existing single-image processing on each frame.
5. Extract candidate horizontal bars or horizontal bar levels per frame.
6. Fuse the results across frames.
7. Keep candidates that are stable and plausible across the same-target set.
8. Cluster the remaining front-face candidates by vertical position.
9. Output count + visual overlay + confidence.
In pseudo-code:
frames = extract_frames(video_or_image_set)
per_frame_results = []
for frame in frames:
    result = single_image_pipeline(frame)
    per_frame_results.append(result)
fused = fuse_same_target_results(per_frame_results)
count = count_front_horizontal_levels(fused)
This is useful because it does not require a completely new system. It changes the unit of analysis from one image to one capture set.
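To make the fusion step a bit more concrete, here is a minimal sketch of what `fuse_same_target_results` could look like if it is reduced to a vote over row positions. It rests on several assumptions that are not part of the original pipeline: each per-frame result is simply a list of candidate row heights normalized to [0, 1] inside the cropped target face, the frames are already roughly aligned, and the name `fuse_row_positions` and the parameters `tol` and `min_support` are hypothetical values to tune.

```python
from statistics import mean


def fuse_row_positions(per_frame_rows, tol=0.02, min_support=0.6):
    """Fuse per-frame row heights into stable rows by cross-frame voting.

    per_frame_rows: list of lists; each inner list holds normalized vertical
        positions (0 = top, 1 = bottom) of candidate rows in one frame.
    tol: hypothetical maximum gap between neighbours inside one cluster.
    min_support: hypothetical fraction of frames that must contain the row.
    """
    n_frames = len(per_frame_rows)
    # Flatten to (position, frame_index) pairs and sort by position.
    tagged = sorted((y, i) for i, rows in enumerate(per_frame_rows) for y in rows)

    clusters, current = [], []
    for y, i in tagged:
        # Start a new cluster when the gap to the previous position is large.
        if current and y - current[-1][0] > tol:
            clusters.append(current)
            current = []
        current.append((y, i))
    if current:
        clusters.append(current)

    fused = []
    for cluster in clusters:
        frames_seen = {i for _, i in cluster}  # count each frame only once
        support = len(frames_seen) / n_frames
        if support >= min_support:
            fused.append({"y": mean(y for y, _ in cluster), "support": support})
    return fused


per_frame = [
    [0.10, 0.30, 0.50],
    [0.11, 0.31, 0.42, 0.50],  # 0.42 appears in one frame only
    [0.10, 0.29, 0.51],
]
fused = fuse_row_positions(per_frame)
print(len(fused))  # 3: the isolated 0.42 candidate is voted out
```

Gap-based clustering like this can chain neighbouring rows together when `tol` is close to the real row spacing, so `tol` should stay well below the expected spacing.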
3. Methods that become more useful with multiple images/video
Some of the methods you already mentioned are weak as single-image final answers, but become more interesting when repeated over a same-target capture set.
| Method | In a single image | With multiple images / video |
|---|---|---|
| Depth Anything / MiDaS | Useful relative-depth cue, but not reliable enough as final authority | Can be checked for temporal consistency and combined with motion/parallax cues |
| Classical CV | Hough lines / edges may over-detect rear bars | Optical flow, feature tracking, line stability, and cross-frame voting become possible |
| Rebar segmentation / detection | Gives visible rebar candidates | Candidates can be fused and validated across frames |
| SAM / SAM-like segmentation | Helpful for masks, but fragile on dense repeated bars | SAM 2-style video mask propagation or interactive correction becomes more useful |
| COLMAP / SfM / modern 3D models | No multi-view geometry to exploit | Can be tested as diagnostic geometry cues |
| RGB-D / stereo | Not relevant if only RGB | Becomes a strong option if specialized cameras are acceptable |
So I would not discard the original ideas. I would change their role.
For example:
Depth Anything as a one-frame decision maker: risky
Depth Anything as a repeated soft cue across frames: more useful
and:
Hough lines on one image: many false positives
line candidates stable across a same-target video: more meaningful
4. A practical branch tree
I would choose the pipeline depending on what capture is possible.
Can you capture multiple images or video of the same target face?
├─ No, single image only
│ └─ Endpoint:
│ rebar detection/segmentation
│ + geometric filtering
│ + optional monocular depth cue
│ + confidence / human review
│
└─ Yes
├─ Short video is available
│ └─ Endpoint:
│ per-frame candidates
│ + tracking / optical flow / SAM 2 mask propagation
│ + temporal consistency
│ + row clustering
│
├─ Multiple still images are available
│ └─ Endpoint:
│ same-target inspection set
│ + multi-view consistency
│ + optional SfM / DUSt3R / MASt3R / VGGT diagnostic
│ + fused row candidates
│
└─ Specialized camera is possible
└─ Endpoint:
stereo or RGB-D
+ point cloud / plane fitting
+ target-layer extraction
+ spacing/count validation
I would start with the lowest-cost branch and only move to heavier hardware or heavier 3D reconstruction if the simpler route fails.
5. Suggested priority order
My practical priority order would be:
| Priority | Option | Why |
|---|---|---|
| 1 | Controlled same-target video | No special hardware; adds temporal consistency and parallax |
| 2 | Multiple same-target still images | Easy to collect; supports cross-view checking |
| 3 | Rebar-specific detection/segmentation | Gives candidate bars before layer selection |
| 4 | Optical flow / tracking / temporal voting | Low-cost way to use video |
| 5 | SAM 2 video propagation | Useful for interactive mask propagation / annotation |
| 6 | COLMAP / DUSt3R / MASt3R / VGGT | Useful diagnostic geometry, but not guaranteed on repetitive rebar |
| 7 | Stereo / RGB-D | Stronger geometry if special cameras are acceptable |
| 8 | Drone | Useful for access/safety/repeatability, but not automatically a better CV solution |
I would not start with the drone unless access or safety requires it. A drone changes the camera position and may help collect images from safer or more repeatable viewpoints, but it does not automatically solve front/rear bar separation. A controlled handheld same-target video may be more valuable for algorithm development.
6. Path A: single-image fallback
If only one RGB image is available, I would use the earlier kind of pipeline:
image
→ crop/select target face
→ detect or segment rebar candidates
→ keep near-horizontal elongated candidates
→ score front-face likelihood
→ cluster by vertical position
→ count
The front-face score could combine:
apparent thickness
+ edge sharpness
+ continuity across width
+ occlusion order
+ regular spacing
+ optional monocular depth
But I would still treat this as the least robust path. The output should probably include a visual overlay and a confidence score, because there will be ambiguous cases.
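As an illustration of how those cues might be combined, the following sketch uses a weighted average. Everything in it is an assumption for demonstration: the cue names, the requirement that each cue is already normalized to [0, 1], and the default weights (with monocular depth deliberately weighted lower as a soft cue). A missing cue, such as depth when no depth model is run, is skipped and the remaining weights are renormalized.

```python
def front_face_score(cues, weights=None):
    """Weighted average of cue scores, each assumed to lie in [0, 1].

    cues: dict mapping cue name -> score, or None if the cue is unavailable.
    weights: optional dict of cue weights; the defaults are illustrative only.
    """
    default_weights = {
        "thickness": 1.0,
        "edge_sharpness": 1.0,
        "continuity": 1.0,
        "occlusion_order": 1.0,
        "regular_spacing": 1.0,
        "mono_depth": 0.5,  # soft cue, so it gets a smaller weight
    }
    weights = weights or default_weights

    total, weight_sum = 0.0, 0.0
    for name, value in cues.items():
        if value is None or name not in weights:
            continue  # skip unavailable or unknown cues
        total += weights[name] * value
        weight_sum += weights[name]
    return total / weight_sum if weight_sum > 0 else 0.0


score = front_face_score(
    {"thickness": 0.8, "continuity": 0.9, "regular_spacing": 0.7, "mono_depth": None}
)
```

The weights would need to be tuned, or replaced by a small learned classifier, once labeled examples exist.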
7. Path B: same-target video
If video is available, I would try this first.
video of same target face
→ sample frames
→ run candidate detection/segmentation per frame
→ associate candidates across frames
→ keep temporally stable row candidates
→ use motion/parallax to suppress rear/interior candidates
→ cluster rows
This does not require full 3D reconstruction.
It can be implemented with relatively ordinary tools:
- per-frame detection/segmentation,
- optical flow,
- tracker association,
- temporal voting,
- row-level clustering.
Ultralytics YOLO has a tracking mode using trackers such as BoT-SORT and ByteTrack:
- Ultralytics YOLO tracking mode
However, I would be careful with the tracking unit. Tracking each individual thin bar may be fragile. For dense, repeated rebar, I would probably track or stabilize row candidates or regions, not depend too heavily on perfect per-bar IDs.
OpenCV optical flow can also be useful:
- OpenCV Optical Flow tutorial
But again, I would not use optical flow as a magic answer. I would use it as another cue:
Does this horizontal candidate move like the front layer?
Is it stable across nearby frames?
Does it remain in the same row-level cluster?
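The first of these questions can be turned into a rough test. Under an approximately pure camera translation, nearer points shift more in the image than farther ones, so front-layer candidates should show larger displacements than rear ones. The sketch below assumes you already have one displacement magnitude per candidate (for example, averaged optical flow at trackable points on that candidate) and splits the candidates at the largest gap. The function name and the `min_ratio` parameter are hypothetical, and camera rotation or an unclear two-layer structure will break the assumption, in which case the function returns `None` rather than guessing.

```python
def split_by_parallax(displacements, min_ratio=1.3):
    """Label candidates as front (True) or rear (False) via a largest-gap split.

    displacements: per-candidate image displacement magnitude in pixels
        between two frames (non-negative numbers).
    min_ratio: hypothetical minimum ratio between the values on either side
        of the gap; below it the layers are treated as not separable.
    Returns a list of booleans, or None if no clear separation is found.
    """
    if len(displacements) < 2:
        return None
    order = sorted(displacements)
    # Largest gap between consecutive sorted displacements.
    gap, k = max((order[j + 1] - order[j], j) for j in range(len(order) - 1))
    low, high = order[k], order[k + 1]
    if gap <= 0 or (low > 0 and high / low < min_ratio):
        return None
    threshold = (low + high) / 2.0
    return [d > threshold for d in displacements]


labels = split_by_parallax([12.0, 11.5, 6.0, 5.8, 12.2])
# labels -> [True, True, False, False, True]
```

I would feed the result into the overall score as one vote, not use it as a hard filter.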
8. Path C: SAM 2 as a video/annotation helper
SAM 2 is relevant here because it is designed for promptable segmentation in both images and videos:
- SAM 2 GitHub
- SAM 2 paper
I would not assume SAM 2 will automatically separate all dense rebars correctly. The structure is thin, repetitive, and heavily occluded.
But SAM 2 may be useful in this workflow:
first frame:
prompt target face / front bars / cage region
video:
propagate mask or region through frames
post-processing:
count horizontal levels inside the propagated target face
I would especially consider it for:
- annotation acceleration,
- interactive correction,
- propagating a manually selected target face through video,
- building a training dataset faster.
So the role is not necessarily:
SAM 2 → final count
but rather:
SAM 2 → useful masks / annotations / target region propagation
9. Path D: multiple still images and multi-view geometry
If multiple still images of the same target face are available, I would treat them as an inspection set.
same-target still images
→ run candidate detection per image
→ match/fuse row candidates across images
→ optionally run a geometry diagnostic
→ count stable front-face levels
This opens up tools that are impossible with a single image.
For classic multi-view reconstruction, COLMAP is a standard SfM/MVS tool. COLMAP can be useful to test whether there is enough camera motion and texture to recover a meaningful geometry signal.
However, I would not make COLMAP the first production assumption. Rebar cages are difficult for SfM because they contain:
- repeated patterns,
- thin lines,
- many similar intersections,
- occlusion,
- background construction clutter.
Repeating structure can cause wrong correspondences or wrong camera poses in SfM systems; this is a known type of failure, not something specific to this task:
- COLMAP issue: wrong poses due to duplicate/symmetric features
- COLMAP issue: camera pose error due to similar structure
So I would treat SfM as:
good diagnostic if it works
not a guaranteed core pipeline
10. Modern 3D foundation models may be worth testing
Because you can collect multiple images, newer 3D models may also be worth testing as diagnostics.
Examples:
- DUSt3R
- MASt3R
- VGGT
DUSt3R is designed for dense 3D reconstruction from arbitrary image collections without known camera calibration or poses:
- DUSt3R paper
VGGT predicts key 3D scene attributes such as camera parameters, point maps, depth maps, and 3D point tracks from one, a few, or hundreds of views:
- VGGT paper
These are not rebar-specific models. I would not assume they solve the task directly. But they may be useful to answer a practical question:
Does the same-target image set contain enough geometric signal
to separate the front layer from the background/rear layer?
If the answer is yes, then geometry can become part of the pipeline. If not, it is better to focus on detection/segmentation and capture control.
11. Path E: RGB-D or stereo, if specialized cameras are acceptable
If specialized cameras are acceptable, I would consider stereo or RGB-D before thinking of a drone as the main algorithmic solution.
The reason is simple:
The hard part is layer separation, not only image access.
RGB-D or stereo can directly help with front/rear separation.
There is relevant work on rebar spacing inspection using RGB-D and point-cloud processing:
- Automatic Quality Inspection of Rebar Spacing Using Vision-Based Deep Learning with RGBD Camera
That work is interesting because it uses depth/point-cloud processing to filter background rebar layers before measuring the target layer. That is conceptually close to your front/rear separation problem.
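To show the general idea of depth-based layer filtering (this is a simplification for illustration, not the method of the cited paper), the sketch below keeps only samples whose measured depth lies within a tolerance of the nearest layer. It assumes the camera is roughly fronto-parallel to the target face, that depths are metric values in meters sampled on rebar candidates, and that invalid readings are reported as `None` or `0`. The names `front_layer_mask`, `layer_tol_m`, and `percentile` are hypothetical. With a tilted camera, plane fitting would be needed instead of a single depth threshold.

```python
def front_layer_mask(depths_m, layer_tol_m=0.05, percentile=5):
    """Mark samples that belong to the nearest depth layer.

    depths_m: per-sample depth in meters; None or non-positive means invalid.
    layer_tol_m: hypothetical thickness tolerance of the front layer.
    percentile: low percentile used as a robust estimate of the nearest depth.
    """
    valid = sorted(d for d in depths_m if d is not None and d > 0)
    if not valid:
        return [False] * len(depths_m)
    # Robust "nearest" depth: a low percentile instead of the raw minimum.
    idx = min(len(valid) - 1, int(len(valid) * percentile / 100))
    near = valid[idx]
    return [d is not None and 0 < d <= near + layer_tol_m for d in depths_m]


mask = front_layer_mask([1.00, 1.01, 1.02, 1.30, 1.31, 0.0, None])
# mask -> [True, True, True, False, False, False, False]
```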
There is also related work combining instance segmentation and stereo vision for steel-bar installation inspection:
- Artificial intelligence quality inspection of steel bars installation by integrating Mask R-CNN and stereo vision
So if special cameras are realistic, I would think in this order:
normal video / multi-image capture first
→ if front/rear separation is still unreliable:
stereo or RGB-D
→ drone only if access/safety/repeatability requires it
12. How I would use Depth Anything / MiDaS in this new setting
Your original monocular-depth idea becomes more useful once there are multiple frames.
Single-frame usage:
Depth Anything / MiDaS
→ relative depth map
→ maybe front/rear cue
This is weak as a final decision.
Same-target video usage:
Depth Anything / MiDaS per frame
→ check whether front-layer candidates remain consistently closer
→ combine with candidate continuity and parallax
→ use depth as a soft vote
This is a better role.
There are also video-focused depth models such as Video Depth Anything:
- Video Depth Anything paper
That does not mean it is automatically necessary, but it supports the general point: video depth consistency is a different problem from single-image depth.
So I would phrase it like this:
Monocular depth is not sufficient as a single-frame authority,
but it may become useful as a repeated soft cue across a same-target capture set.
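A minimal sketch of such a soft vote follows. Relative depth from monocular models is generally not comparable between frames, so the comparison is done inside each frame, against the median of all candidates in that frame. The sketch assumes that larger values mean nearer, which depends on the model's output convention and should be checked. The function name and the input format (one dict of candidate ID to value per frame) are hypothetical.

```python
from statistics import median


def depth_front_vote(nearness_per_frame):
    """Fraction of frames in which each candidate looks nearer than the
    per-frame median of all candidates.

    nearness_per_frame: list of dicts {candidate_id: relative nearness},
        assuming larger means nearer (verify for your depth model).
    """
    votes, seen = {}, {}
    for frame in nearness_per_frame:
        if not frame:
            continue  # no candidates detected in this frame
        ref = median(frame.values())
        for cid, nearness in frame.items():
            seen[cid] = seen.get(cid, 0) + 1
            if nearness > ref:
                votes[cid] = votes.get(cid, 0) + 1
    return {cid: votes.get(cid, 0) / seen[cid] for cid in seen}


vote = depth_front_vote([
    {"a": 0.9, "b": 0.2, "c": 0.5},
    {"a": 0.8, "b": 0.3, "c": 0.85},
    {"a": 0.7, "b": 0.1},
])
```

A candidate that is "near" in most frames earns a high vote, while a single noisy frame changes the result only slightly.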
13. How I would use classical CV in this new setting
Classical CV also becomes more useful with video.
Single-image classical CV:
edges / Hough lines / morphology
→ many false positives from rear/interior bars
Video classical CV:
line candidates
+ optical flow
+ frame-to-frame consistency
+ row-level voting
This is much more useful.
For example:
1. Extract near-horizontal candidates in each frame.
2. Cluster them into row candidates.
3. Track row candidates across frames.
4. Keep rows that remain stable and plausible.
5. Downweight rows that appear only in a few frames or move inconsistently.
This keeps classical CV in a realistic role: not the whole solution, but a useful stabilizer.
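Steps 1 and 2 can be sketched without any particular library, assuming line segments are already available as `(x1, y1, x2, y2)` tuples in pixels (for example from a Hough-style detector). The function names and the thresholds `max_angle_deg`, `min_length`, and `row_gap` are hypothetical and would need tuning to the image resolution and bar spacing.

```python
import math


def near_horizontal(segments, max_angle_deg=10.0, min_length=50.0):
    """Keep segments that are long enough and close to horizontal."""
    kept = []
    for x1, y1, x2, y2 in segments:
        if math.hypot(x2 - x1, y2 - y1) < min_length:
            continue
        angle = abs(math.degrees(math.atan2(y2 - y1, x2 - x1)))
        angle = min(angle, 180.0 - angle)  # angle to the x-axis, ignoring direction
        if angle <= max_angle_deg:
            kept.append((x1, y1, x2, y2))
    return kept


def cluster_rows(segments, row_gap=8.0):
    """Group segments into rows by midpoint height (1D gap clustering).

    Returns the mean y of each row, sorted from top to bottom.
    """
    mids = sorted((y1 + y2) / 2.0 for _, y1, _, y2 in segments)
    rows, current = [], []
    for y in mids:
        if current and y - current[-1] > row_gap:
            rows.append(sum(current) / len(current))
            current = []
        current.append(y)
    if current:
        rows.append(sum(current) / len(current))
    return rows


segments = [
    (0, 100, 200, 102),   # row near y = 100
    (210, 101, 400, 99),  # same row, right-hand part
    (0, 150, 300, 152),   # row near y = 151
    (50, 0, 52, 300),     # vertical bar, rejected by angle
    (0, 200, 20, 200),    # too short, rejected by length
]
rows = cluster_rows(near_horizontal(segments))
# rows -> [100.5, 151.0]
```

The per-frame row heights produced this way are the natural input for the tracking and voting in steps 3 to 5.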
14. A possible minimal prototype
If I were testing this with ordinary camera/video first, I would implement:
Input:
same-target short video or 5–15 same-target images
Step 1:
manually or automatically crop/select the target face
Step 2:
run existing per-image processing:
- rebar candidate detection/segmentation
- optional depth
- near-horizontal candidate extraction
Step 3:
aggregate across frames:
- group candidates by row
- check temporal consistency
- check depth consistency if available
- check motion/parallax behavior if available
Step 4:
produce:
- counted horizontal levels
- overlay on reference image
- confidence score
- low-confidence review flag
A very simple scoring idea:
row_score =
number_of_frames_detected
+ horizontal_continuity_score
+ row_spacing_plausibility_score
+ front_depth_consistency_score
+ motion_consistency_score
Then count rows above a threshold, and always show the overlay.
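One possible variation of this scoring, shown as a sketch only, normalizes the frame count to a fraction so that all terms share the [0, 1] range, averages the available cues with equal weights, and flags the result for review when any row sits close to the threshold. The equal weighting, the `accept` threshold, and the `review_margin` are hypothetical choices.

```python
def row_score(frames_detected, total_frames, continuity, spacing,
              depth=None, motion=None):
    """Average of cues in [0, 1]; depth and motion are optional (None = absent)."""
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    cues = [frames_detected / total_frames, continuity, spacing]
    cues += [c for c in (depth, motion) if c is not None]
    return sum(cues) / len(cues)


def count_rows(scores, accept=0.7, review_margin=0.1):
    """Count rows at or above the threshold and flag borderline cases.

    Returns (count, needs_review); needs_review is True when any row score
    lies within review_margin of the accept threshold.
    """
    count = sum(1 for s in scores if s >= accept)
    needs_review = any(abs(s - accept) < review_margin for s in scores)
    return count, needs_review


scores = [
    row_score(9, 10, continuity=0.8, spacing=0.9),
    row_score(10, 10, continuity=0.9, spacing=0.8, depth=0.9, motion=0.9),
    row_score(2, 10, continuity=0.3, spacing=0.4),
]
count, needs_review = count_rows(scores)
```

Here the first two rows are counted and the third is rejected with a clear margin, so no review flag is raised.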
15. Questions that would decide the branch
The next useful questions are probably:
| Question | Why it matters |
|---|---|
| Do you need only count, or also spacing/compliance? | Count can be simpler; spacing needs scale/calibration |
| Is the target always one face of the cage? | Same-target capture set assumes this |
| Can the target face be manually cropped/selected? | This greatly reduces difficulty |
| Can you capture a short slow video during inspection? | Enables temporal consistency and parallax |
| Can you place a known-size marker or use design dimensions? | Helps scale and validation |
| Are RGB-D/stereo cameras acceptable in the field, or only for R&D? | Decides whether depth/point cloud routes are realistic |
| Is a drone needed for access/safety, or mainly for better vision? | These are different reasons |
16. My revised recommendation
Given your new constraints, I would revise the earlier recommendation to:
Do not treat this as only a single-image depth problem.
Treat each target face as a same-target inspection set.
Start with ordinary camera video or multiple still images.
Run your current per-image model/CV pipeline on frames.
Fuse the evidence across frames.
Use temporal consistency, parallax, optional depth, and row clustering
to select the front horizontal levels.
Only move to stereo/RGB-D if normal same-target capture is not reliable enough.
Use drones mainly for access/safety/repeatability, not as the core CV solution.
This does not give a guaranteed final answer, but it should make the search space much better constrained. The key shift is:
single image:
semantic + appearance problem
multiple images/video:
semantic + appearance + temporal + geometric problem
That second formulation gives you many more practical options.