External Publication

Reinforcement Structure Analysis

Hugging Face Forums [Unofficial] June 8, 2026

Oh! So you can use multiple images for a single target face? Then the set of useful models and algorithms changes quite a lot, in a good way.

The extra images and your capture options change the recommendation quite a bit.

If you can take multiple images or a video of the same target face, I would not keep the system framed as a strict single-image problem. I would treat each target face as an inspection set:

one target face
→ one straight-on reference image
→ several slightly shifted images, or a short slow video
→ candidate detection / segmentation per frame
→ cross-frame consistency / motion / geometry checks
→ front-layer selection
→ horizontal-level clustering
→ count + overlay + confidence

That change matters because some methods that are weak or ambiguous in a single RGB image become much more useful once the input becomes a same-target image/video set.

1. Why multiple images change the problem

In a single image, you mostly have:

appearance,
line geometry,
local occlusion cues,
apparent thickness,
monocular depth as a soft cue,
learned segmentation/detection.

That can help, but the front/rear separation is still underconstrained.

With multiple images or video of the same target face, you also get:

temporal consistency,
parallax,
cross-frame voting,
possible optical flow,
possible object/mask tracking,
possible multi-view geometry,
more chances to see around partial occlusions,
a better confidence signal.

So I would think of the input not as:

single image → count

but as:

same-target capture set → count

This is a major design change.

2. The lowest-friction experiment

If you already have a single-image pipeline, or if you are already testing Depth Anything / MiDaS / classical CV, I would not throw that work away.

The smallest useful extension is:

1. Choose one target face.
2. Record a slow 5–10 second video while moving slightly left/right,
   or take 5–15 slightly shifted still images.
3. Extract frames.
4. Run your existing single-image processing on each frame.
5. Extract candidate horizontal bars or horizontal bar levels per frame.
6. Fuse the results across frames.
7. Keep candidates that are stable and plausible across the same-target set.
8. Cluster the remaining front-face candidates by vertical position.
9. Output count + visual overlay + confidence.

In pseudo-code:

frames = extract_frames(video_or_image_set)

per_frame_results = []
for frame in frames:
    result = single_image_pipeline(frame)
    per_frame_results.append(result)

fused = fuse_same_target_results(per_frame_results)
count = count_front_horizontal_levels(fused)

This is useful because it does not require a completely new system. It changes the unit of analysis from one image to one capture set.
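
As a minimal sketch of what the fusion step could look like, assume each per-frame result is just a list of candidate row positions, given as y-coordinates normalized to the cropped target face (0 = top, 1 = bottom). That representation, the function bodies, and the `tol` and `min_support` parameters are my assumptions for illustration, not part of any existing library:

```python
from statistics import median


def fuse_same_target_results(per_frame_rows, tol=0.03, min_support=0.6):
    """Fuse per-frame row candidates by cross-frame voting.

    per_frame_rows: one list of normalized y positions per frame.
    tol: largest gap between neighbouring y values inside one fused row (assumed).
    min_support: fraction of frames that must contain the row (assumed).
    """
    n_frames = len(per_frame_rows)
    # Pool every detection, remembering which frame it came from.
    pooled = sorted((y, i) for i, rows in enumerate(per_frame_rows) for y in rows)

    clusters, current = [], []
    for y, i in pooled:
        # Start a new cluster when the gap to the previous detection exceeds tol.
        if current and y - current[-1][0] > tol:
            clusters.append(current)
            current = []
        current.append((y, i))
    if current:
        clusters.append(current)

    fused = []
    for cluster in clusters:
        support = len({i for _, i in cluster}) / n_frames
        if support >= min_support:
            fused.append({"y": median(y for y, _ in cluster), "support": support})
    return fused


def count_front_horizontal_levels(fused):
    # Only stability is modeled here; front-layer selection would be an extra filter.
    return len(fused)
```

A row that shows up in only one frame out of three is dropped, while rows seen in every frame survive with `support` equal to 1.0. The support value can double as a rough confidence signal for the overlay.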

3. Methods that become more useful with multiple images/video

Some of the methods you already mentioned are weak as single-image final answers, but become more interesting when repeated over a same-target capture set.

| Method | In a single image | With multiple images / video |
| --- | --- | --- |
| Depth Anything / MiDaS | Useful relative-depth cue, but not reliable enough as final authority | Can be checked for temporal consistency and combined with motion/parallax cues |
| Classical CV | Hough lines / edges may over-detect rear bars | Optical flow, feature tracking, line stability, and cross-frame voting become possible |
| Rebar segmentation / detection | Gives visible rebar candidates | Candidates can be fused and validated across frames |
| SAM / SAM-like segmentation | Helpful for masks, but fragile on dense repeated bars | SAM 2-style video mask propagation or interactive correction becomes more useful |
| COLMAP / SfM / modern 3D models | Not applicable | Can be tested as diagnostic geometry cues |
| RGB-D / stereo | Not relevant if only RGB | Becomes a strong option if specialized cameras are acceptable |

So I would not discard the original ideas. I would change their role.

For example:

Depth Anything as a one-frame decision maker: risky
Depth Anything as a repeated soft cue across frames: more useful

and:

Hough lines on one image: many false positives
line candidates stable across a same-target video: more meaningful

4. A practical branch tree

I would choose the pipeline depending on what capture is possible.

Can you capture multiple images or video of the same target face?

├─ No, single image only
│  └─ Endpoint:
│     rebar detection/segmentation
│     + geometric filtering
│     + optional monocular depth cue
│     + confidence / human review
│
└─ Yes
   ├─ Short video is available
   │  └─ Endpoint:
   │     per-frame candidates
   │     + tracking / optical flow / SAM 2 mask propagation
   │     + temporal consistency
   │     + row clustering
   │
   ├─ Multiple still images are available
   │  └─ Endpoint:
   │     same-target inspection set
   │     + multi-view consistency
   │     + optional SfM / DUSt3R / MASt3R / VGGT diagnostic
   │     + fused row candidates
   │
   └─ Specialized camera is possible
      └─ Endpoint:
         stereo or RGB-D
         + point cloud / plane fitting
         + target-layer extraction
         + spacing/count validation

I would start with the lowest-cost branch and only move to heavier hardware or heavier 3D reconstruction if the simpler route fails.
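
The same tree can be written as a small helper that returns the applicable branches, cheapest first. This is only a sketch; the function name, the flags, and the branch labels are made up for illustration:

```python
def candidate_branches(single_image_only=False, video=False,
                       multiple_stills=False, depth_camera=False):
    """Return the applicable branches in lowest-cost-first order."""
    if single_image_only or not (video or multiple_stills or depth_camera):
        return ["single-image fallback"]
    branches = []
    if video:
        branches.append("same-target video")
    if multiple_stills:
        branches.append("multi-view still images")
    if depth_camera:
        branches.append("stereo / RGB-D")
    return branches
```

Trying `candidate_branches(...)[0]` first and moving down the list only when it fails matches the "cheapest route first" idea.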

5. Suggested priority order

My practical priority order would be:

| Priority | Option | Why |
| --- | --- | --- |
| 1 | Controlled same-target video | No special hardware; adds temporal consistency and parallax |
| 2 | Multiple same-target still images | Easy to collect; supports cross-view checking |
| 3 | Rebar-specific detection/segmentation | Gives candidate bars before layer selection |
| 4 | Optical flow / tracking / temporal voting | Low-cost way to use video |
| 5 | SAM 2 video propagation | Useful for interactive mask propagation / annotation |
| 6 | COLMAP / DUSt3R / MASt3R / VGGT | Useful diagnostic geometry, but not guaranteed on repetitive rebar |
| 7 | Stereo / RGB-D | Stronger geometry if special cameras are acceptable |
| 8 | Drone | Useful for access/safety/repeatability, but not automatically a better CV solution |

I would not start with the drone unless access or safety requires it. A drone changes the camera position and may help collect images from safer or more repeatable viewpoints, but it does not automatically solve front/rear bar separation. A controlled handheld same-target video may be more valuable for algorithm development.

6. Path A: single-image fallback

If only one RGB image is available, I would use the earlier kind of pipeline:

image
→ crop/select target face
→ detect or segment rebar candidates
→ keep near-horizontal elongated candidates
→ score front-face likelihood
→ cluster by vertical position
→ count
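
The "keep near-horizontal elongated candidates" step could look roughly like this, assuming the candidates are already available as line segments `(x1, y1, x2, y2)` in pixels (for example from a Hough transform or from lines fitted to segmentation masks). The thresholds are placeholder values that would need tuning on real images:

```python
import math


def keep_near_horizontal(segments, face_width, max_angle_deg=10.0, min_length_frac=0.3):
    """Keep segments that are close to horizontal and long relative to the face width."""
    kept = []
    for x1, y1, x2, y2 in segments:
        dx, dy = x2 - x1, y2 - y1
        length = math.hypot(dx, dy)
        if length == 0:
            continue  # degenerate segment
        # Angle to the horizontal axis, folded into [0, 90] degrees.
        angle = math.degrees(math.atan2(abs(dy), abs(dx)))
        if angle <= max_angle_deg and length >= min_length_frac * face_width:
            kept.append((x1, y1, x2, y2))
    return kept
```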

The front-face score could combine:

apparent thickness
+ edge sharpness
+ continuity across width
+ occlusion order
+ regular spacing
+ optional monocular depth

But I would still treat this as the least robust path. The output should probably include a visual overlay and a confidence score, because there will be ambiguous cases.
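
One simple way to combine those cues is a weighted average over whichever cues are available. In this sketch every cue is assumed to be pre-normalized to [0, 1]; the cue names and the weights are arbitrary starting points, not tuned values:

```python
DEFAULT_WEIGHTS = {
    "thickness": 0.2, "sharpness": 0.2, "continuity": 0.2,
    "occlusion_order": 0.2, "spacing": 0.1, "mono_depth": 0.1,
}


def front_face_score(cues, weights=DEFAULT_WEIGHTS):
    """Weighted average of the cues that are present (missing cues are skipped)."""
    used = {k: w for k, w in weights.items() if cues.get(k) is not None}
    total = sum(used.values())
    if total == 0:
        return 0.0
    return sum(w * cues[k] for k, w in used.items()) / total
```

Skipping missing cues instead of treating them as zero keeps the optional monocular depth truly optional: a candidate is not penalized just because no depth model was run.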

7. Path B: same-target video

If video is available, I would try this first.

video of same target face
→ sample frames
→ run candidate detection/segmentation per frame
→ associate candidates across frames
→ keep temporally stable row candidates
→ use motion/parallax to suppress rear/interior candidates
→ cluster rows

This does not require full 3D reconstruction.

It can be implemented with relatively ordinary tools:

per-frame detection/segmentation,
optical flow,
tracker association,
temporal voting,
row-level clustering.

Ultralytics YOLO has a tracking mode using trackers such as BoT-SORT and ByteTrack:

Ultralytics YOLO tracking mode

However, I would be careful with the tracking unit. Tracking each individual thin bar may be fragile. For dense, repeated rebar, I would probably track or stabilize row candidates or regions, not depend too heavily on perfect per-bar IDs.

OpenCV optical flow can also be useful:

OpenCV Optical Flow tutorial

But again, I would not use optical flow as a magic answer. I would use it as another cue:

Does this horizontal candidate move like the front layer?
Is it stable across nearby frames?
Does it remain in the same row-level cluster?
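
For the first question, a rough parallax check is possible if the camera mostly translates sideways with little rotation: nearer structure then shifts more in the image than farther structure. The sketch below assumes you already have one mean displacement per candidate (in pixels, between two frames), measured at trackable points such as bar intersections or tie wires, because motion along a uniform horizontal bar is hard to observe. The function name and `min_gap_px` are illustrative assumptions:

```python
def split_by_parallax(displacements, min_gap_px=2.0):
    """Split candidate ids into (front, rear) at the largest gap in displacement.

    displacements: dict of candidate id -> mean image displacement in pixels.
    """
    ordered = sorted(displacements.items(), key=lambda kv: kv[1], reverse=True)
    if len(ordered) < 2:
        return [k for k, _ in ordered], []
    gaps = [ordered[i][1] - ordered[i + 1][1] for i in range(len(ordered) - 1)]
    if max(gaps) < min_gap_px:
        # No clear gap: treat everything as one layer instead of forcing a split.
        return [k for k, _ in ordered], []
    cut = gaps.index(max(gaps)) + 1
    return [k for k, _ in ordered[:cut]], [k for k, _ in ordered[cut:]]
```

I would use the result as one vote among several, not as a hard filter, since camera rotation or a very small baseline can wash out the difference.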

8. Path C: SAM 2 as a video/annotation helper

SAM 2 is relevant here because it is designed for promptable segmentation in both images and videos:

SAM 2 GitHub
SAM 2 paper

I would not assume SAM 2 will automatically separate all dense rebars correctly. The structure is thin, repetitive, and heavily occluded.

But SAM 2 may be useful in this workflow:

first frame:
  prompt target face / front bars / cage region

video:
  propagate mask or region through frames

post-processing:
  count horizontal levels inside the propagated target face

I would especially consider it for:

annotation acceleration,
interactive correction,
propagating a manually selected target face through video,
building a training dataset faster.

So the role is not necessarily:

SAM 2 → final count

but rather:

SAM 2 → useful masks / annotations / target region propagation
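
The post-processing step in that workflow does not depend on SAM 2 itself. As a sketch, assume the propagated target-face mask for a frame is available as a 2D array of 0/1 values and the row candidates as `(y, x_start, x_end)` pixel tuples; both formats and the `min_inside` threshold are my assumptions:

```python
def rows_inside_mask(row_candidates, mask, min_inside=0.8):
    """Keep row candidates that lie mostly inside a binary target-face mask.

    row_candidates: list of (y, x_start, x_end) in pixel coordinates.
    mask: 2D list of 0/1 values indexed as mask[y][x].
    """
    kept = []
    for y, x0, x1 in row_candidates:
        if not (0 <= y < len(mask)) or x0 < 0 or x1 <= x0:
            continue  # outside the image or empty span
        span = mask[y][x0:x1]
        if span and sum(span) / len(span) >= min_inside:
            kept.append((y, x0, x1))
    return kept
```

Counting the surviving candidates per frame, and then fusing across frames, gives the "count horizontal levels inside the propagated target face" step.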

9. Path D: multiple still images and multi-view geometry

If multiple still images of the same target face are available, I would treat them as an inspection set.

same-target still images
→ run candidate detection per image
→ match/fuse row candidates across images
→ optionally run a geometry diagnostic
→ count stable front-face levels

This opens up tools that are impossible with a single image.

For classic multi-view reconstruction, COLMAP is a standard SfM/MVS tool. COLMAP can be useful to test whether there is enough camera motion and texture to recover a meaningful geometry signal.

However, I would not make COLMAP the first production assumption. Rebar cages are difficult for SfM because they contain:

repeated patterns,
thin lines,
many similar intersections,
occlusion,
background construction clutter.

Repeating structure can cause wrong correspondences or wrong camera poses in SfM systems; this is a known type of failure, not something specific to this task:

COLMAP issue: wrong poses due to duplicate/symmetric features
COLMAP issue: camera pose error due to similar structure

So I would treat SfM as:

good diagnostic if it works
not a guaranteed core pipeline

10. Modern 3D foundation models may be worth testing

Because you can collect multiple images, newer 3D models may also be worth testing as diagnostics.

Examples:

DUSt3R
MASt3R
VGGT

DUSt3R is designed for dense 3D reconstruction from arbitrary image collections without known camera calibration or poses:

DUSt3R paper

VGGT predicts key 3D scene attributes such as camera parameters, point maps, depth maps, and 3D point tracks from one, a few, or hundreds of views:

VGGT paper

These are not rebar-specific models. I would not assume they solve the task directly. But they may be useful to answer a practical question:

Does the same-target image set contain enough geometric signal
to separate the front layer from the background/rear layer?

If the answer is yes, then geometry can become part of the pipeline. If not, it is better to focus on detection/segmentation and capture control.
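
A crude way to answer that question numerically: take one depth value per candidate bar from whatever geometry tool you tested, split the values at their largest gap, and compare that gap with the spread inside each group. The function and the idea of using this ratio are my own illustrative assumptions, not something these models provide:

```python
from statistics import pstdev


def layer_separation(depths):
    """Return (ratio, near_layer, far_layer) for a list of candidate depths.

    The list is split at its largest gap; ratio compares that gap with the
    larger of the two within-group spreads. A larger ratio suggests two layers.
    """
    ordered = sorted(depths)
    if len(ordered) < 2:
        return 0.0, ordered, []
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    cut = gaps.index(max(gaps)) + 1
    near, far = ordered[:cut], ordered[cut:]
    spread = max(pstdev(near), pstdev(far), 1e-9)
    return max(gaps) / spread, near, far
```

Because the ratio is scale-free, it still makes sense when the reconstruction has no metric scale. A ratio near 1 would tell me the geometry is not separating layers on that capture set.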

11. Path E: RGB-D or stereo, if specialized cameras are acceptable

If specialized cameras are acceptable, I would consider stereo or RGB-D before thinking of a drone as the main algorithmic solution.

The reason is simple:

The hard part is layer separation, not only image access.

RGB-D or stereo can directly help with front/rear separation.

There is relevant work on rebar spacing inspection using RGB-D and point-cloud processing:

Automatic Quality Inspection of Rebar Spacing Using Vision-Based Deep Learning with RGBD Camera
PDF

That work is interesting because it uses depth/point-cloud processing to filter background rebar layers before measuring the target layer. That is conceptually close to your front/rear separation problem.
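
To make the idea concrete (this is my own simplified sketch, not the method from that paper): with metric depth available, each candidate bar can be reduced to a median depth, and only candidates close to the nearest one are kept. The data format and `layer_tol_m` are assumptions:

```python
from statistics import median


def keep_target_layer(candidate_depths_m, layer_tol_m=0.05):
    """Keep candidates whose median depth is within layer_tol_m of the nearest one.

    candidate_depths_m: dict of candidate id -> list of valid depth samples (metres).
    """
    medians = {k: median(v) for k, v in candidate_depths_m.items() if v}
    if not medians:
        return []
    nearest = min(medians.values())
    return sorted(k for k, d in medians.items() if d - nearest <= layer_tol_m)
```

This assumes nothing else sits in front of the target layer. In practice a foreground object or depth noise could become the "nearest" candidate, so cropping to the target face first matters.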

There is also related work combining instance segmentation and stereo vision for steel-bar installation inspection:

Artificial intelligence quality inspection of steel bars installation by integrating Mask R-CNN and stereo vision

So if special cameras are realistic, I would think in this order:

normal video / multi-image capture first
→ if front/rear separation is still unreliable:
   stereo or RGB-D
→ drone only if access/safety/repeatability requires it

12. How I would use Depth Anything / MiDaS in this new setting

Your original monocular-depth idea becomes more useful once there are multiple frames.

Single-frame usage:

Depth Anything / MiDaS
→ relative depth map
→ maybe front/rear cue

This is weak as a final decision.

Same-target video usage:

Depth Anything / MiDaS per frame
→ check whether front-layer candidates remain consistently closer
→ combine with candidate continuity and parallax
→ use depth as a soft vote

This is a better role.
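
A sketch of the soft vote: since relative depth from a monocular model is not comparable in absolute terms between frames, this version only uses the rank of each candidate within its own frame. It assumes smaller values mean closer; some monocular models output inverse or disparity-like values where larger means closer, in which case the values should be negated first. The function name and input format are illustrative:

```python
def depth_soft_vote(per_frame_depths, candidate):
    """Fraction of frames in which `candidate` ranks in the nearer half.

    per_frame_depths: list of dicts, candidate id -> relative depth for one frame
    (smaller = closer is assumed).
    """
    votes, seen = 0, 0
    for depths in per_frame_depths:
        if candidate not in depths or len(depths) < 2:
            continue  # no information about this candidate in this frame
        seen += 1
        ordered = sorted(depths, key=depths.get)
        if ordered.index(candidate) < len(ordered) / 2:
            votes += 1
    return votes / seen if seen else 0.0
```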

There are also video-focused depth models such as Video Depth Anything:

Video Depth Anything paper

That does not mean it is automatically necessary, but it supports the general point: video depth consistency is a different problem from single-image depth.

So I would phrase it like this:

Monocular depth is not sufficient as a single-frame authority,
but it may become useful as a repeated soft cue across a same-target capture set.

13. How I would use classical CV in this new setting

Classical CV also becomes more useful with video.

Single-image classical CV:

edges / Hough lines / morphology
→ many false positives from rear/interior bars

Video classical CV:

line candidates
+ optical flow
+ frame-to-frame consistency
+ row-level voting

This is much more useful.

For example:

1. Extract near-horizontal candidates in each frame.
2. Cluster them into row candidates.
3. Track row candidates across frames.
4. Keep rows that remain stable and plausible.
5. Downweight rows that appear only in a few frames or move inconsistently.

This keeps classical CV in a realistic role: not the whole solution, but a useful stabilizer.
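
Steps 3 to 5 can be sketched with a greedy frame-to-frame association. Here each frame is assumed to provide row positions as y-coordinates normalized to the target face, and `max_shift` and `min_frames` are placeholder thresholds:

```python
def track_rows(per_frame_rows, max_shift=0.03, min_frames=3):
    """Greedily link row y positions between consecutive frames; keep long tracks."""
    tracks = []  # each track is a list of (frame_index, y)
    for i, rows in enumerate(per_frame_rows):
        unmatched = sorted(rows)
        for track in tracks:
            last_frame, last_y = track[-1]
            if last_frame != i - 1 or not unmatched:
                continue  # track already ended, or nothing left to match
            nearest = min(unmatched, key=lambda y: abs(y - last_y))
            if abs(nearest - last_y) <= max_shift:
                track.append((i, nearest))
                unmatched.remove(nearest)
        # Every detection that matched no live track starts a new track.
        tracks.extend([(i, y)] for y in unmatched)
    return [t for t in tracks if len(t) >= min_frames]
```

This version drops a track as soon as it misses one frame, which is strict. Allowing a short gap would be a natural relaxation if detections flicker.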

14. A possible minimal prototype

If I were testing this with ordinary camera/video first, I would implement:

Input:
  same-target short video or 5–15 same-target images

Step 1:
  manually or automatically crop/select the target face

Step 2:
  run existing per-image processing:
    - rebar candidate detection/segmentation
    - optional depth
    - near-horizontal candidate extraction

Step 3:
  aggregate across frames:
    - group candidates by row
    - check temporal consistency
    - check depth consistency if available
    - check motion/parallax behavior if available

Step 4:
  produce:
    - counted horizontal levels
    - overlay on reference image
    - confidence score
    - low-confidence review flag

A very simple scoring idea:

row_score =
    number_of_frames_detected
  + horizontal_continuity_score
  + row_spacing_plausibility_score
  + front_depth_consistency_score
  + motion_consistency_score

Then count rows above a threshold, and always show the overlay.
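
In code, I would normalize the frame count so that all terms live on the same [0, 1] scale before adding them. The weights, the `accept` threshold, and the `review_margin` below are placeholder assumptions, and each cue is assumed to be pre-normalized:

```python
def score_row(row, n_frames, weights=None):
    """Combine per-row cues into a score in [0, 1]."""
    weights = weights or {"frames": 0.4, "continuity": 0.2, "spacing": 0.15,
                          "depth": 0.1, "motion": 0.15}
    cues = {
        "frames": row["frames_detected"] / n_frames,
        "continuity": row.get("continuity", 0.0),
        "spacing": row.get("spacing", 0.0),
        "depth": row.get("depth", 0.0),
        "motion": row.get("motion", 0.0),
    }
    return sum(weights[k] * cues[k] for k in weights) / sum(weights.values())


def count_rows(rows, n_frames, accept=0.6, review_margin=0.1):
    """Count rows scoring at least `accept`; flag the set if any score is borderline."""
    scores = [score_row(r, n_frames) for r in rows]
    count = sum(s >= accept for s in scores)
    needs_review = any(abs(s - accept) < review_margin for s in scores)
    return count, needs_review, scores
```

The `needs_review` flag is one possible form of the low-confidence review flag from Step 4: a row that lands close to the threshold is exactly the case a human should look at in the overlay.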

15. Questions that would decide the branch

The next useful questions are probably:

| Question | Why it matters |
| --- | --- |
| Do you need only count, or also spacing/compliance? | Count can be simpler; spacing needs scale/calibration |
| Is the target always one face of the cage? | Same-target capture set assumes this |
| Can the target face be manually cropped/selected? | This greatly reduces difficulty |
| Can you capture a short slow video during inspection? | Enables temporal consistency and parallax |
| Can you place a known-size marker or use design dimensions? | Helps scale and validation |
| Are RGB-D/stereo cameras acceptable in the field, or only for R&D? | Decides whether depth/point cloud routes are realistic |
| Is a drone needed for access/safety, or mainly for better vision? | These are different reasons |

16. My revised recommendation

Given your new constraints, I would revise the earlier recommendation to:

Do not treat this as only a single-image depth problem.
Treat each target face as a same-target inspection set.

Start with ordinary camera video or multiple still images.
Run your current per-image model/CV pipeline on frames.
Fuse the evidence across frames.
Use temporal consistency, parallax, optional depth, and row clustering
to select the front horizontal levels.

Only move to stereo/RGB-D if normal same-target capture is not reliable enough.
Use drones mainly for access/safety/repeatability, not as the core CV solution.

This does not give a guaranteed final answer, but it should make the search space much better constrained. The key shift is:

single image:
  semantic + appearance problem

multiple images/video:
  semantic + appearance + temporal + geometric problem

That second formulation gives you many more practical options.