
Real-time exercise form analysis with MediaPipe, looking for advice

Hugging Face Forums [Unofficial] May 2, 2026

For now, I’ve gathered some existing resources that might be useful:

Your project idea is feasible, but only if you keep the scope tight.

You are not really building one system; you are building several systems stacked together:

camera video
→ pose estimation
→ landmark cleanup
→ feature extraction
→ exercise recognition
→ repetition segmentation
→ form feedback
→ real-time UI

For a one-month course project with three people and no previous full CV-pipeline experience, I would not try to build a general-purpose “AI personal trainer” for many exercises. I would build a narrow, explainable, real-time prototype that works well for 2 exercises.

My recommended version of FormAI:

Phone/webcam camera
→ MediaPipe Pose Landmarker
→ landmark quality checks
→ normalized joint-angle features
→ manual exercise mode or lightweight exercise classifier
→ state-machine repetition counting
→ exercise-specific form rules
→ real-time overlay + after-rep feedback

The most important design choice:

Use ML for exercise recognition or phase recognition, but use rule-based / phase-aware logic for form feedback first.

That will be easier to build, easier to debug, easier to demo, and easier to explain in your report.

1. Why MediaPipe is a good backbone

MediaPipe Pose Landmarker is a good fit because it detects body landmarks in images/video and outputs both image-coordinate landmarks and 3D world-coordinate landmarks. It is designed for tasks like posture analysis and movement categorization.

Useful official links:

MediaPipe Pose Landmarker overview
MediaPipe Pose Landmarker Python guide
MediaPipe Pose Landmarker Web guide
MediaPipe Pose Landmarker Android guide
Legacy MediaPipe Pose docs
MediaPipe GitHub repo

For a student project, MediaPipe should be treated as the pose-estimation backend, not as the full exercise coach. MediaPipe gives you landmarks. Your project still has to decide:

which landmarks are reliable,
which angles matter,
where a rep starts and ends,
whether the current exercise phase is valid,
what feedback should be shown,
whether the camera view is acceptable.

So the core of your project is not “running MediaPipe.” The core is everything you build on top of MediaPipe.

2. The main warning: do not start with a big correct/incorrect classifier

Your original plan is:

MediaPipe keypoints
→ joint angles
→ classifier for exercise type
→ classifier for correct/incorrect form

The exercise-type classifier is reasonable.

The correct/incorrect classifier is risky.

Why? Because “incorrect form” is not one thing.

A squat can be wrong because of:

not enough depth,
excessive torso lean,
knees caving inward,
left/right asymmetry,
heels lifting,
bad camera angle,
missing landmarks.

A bicep curl can be wrong because of:

partial range of motion,
not extending fully,
elbow drifting,
shoulder swinging,
torso momentum,
occluded wrist/elbow.

A binary classifier may output:

incorrect

But the user needs something like:

"Your elbow is drifting forward. Keep your upper arm more stable."

So instead of training a vague correct/incorrect classifier first, define specific feedback rules.

Better structure:

exercise classifier:
    squat / bicep_curl / push_up / unknown

form analyzers:
    squat_depth
    squat_torso_lean
    squat_knee_symmetry
    curl_range_of_motion
    curl_elbow_drift
    curl_shoulder_swing

This is more explainable and easier to evaluate.

3. Best scope for one month

I would build only:

1. Squat
2. Bicep curl

Optional third exercise:

3. Push-up

But only add push-up if squat and curl already work.

Why squat + bicep curl?

Exercise	Why it is good	What you can detect	Main risk
Squat	Common, visual, uses lower-body landmarks	depth, torso lean, knee symmetry	camera view matters
Bicep curl	Simpler upper-body motion	range of motion, elbow drift, shoulder swing	wrist/elbow occlusion
Push-up	Good demo exercise	depth, hip sag, elbow angle	floor/prone pose is harder

A robust 2-exercise system is much better than a weak 8-exercise system.

Do not try to support every exercise. For a course project, “we do two exercises well and discuss how to extend it” is a strong result.

4. Use Fit3D carefully

Fit3D is a strong dataset choice. The dataset includes exercise videos, multiple camera views, 3D skeletons, meshes, exercise labels, and repetition information. The Fit3D homepage describes the broader AIFit system and dataset context.

Important Fit3D advice

Train and test your deployed pipeline using MediaPipe-extracted landmarks from Fit3D RGB videos, not only Fit3D’s clean 3D ground-truth skeletons.

Use:

Fit3D RGB video
→ MediaPipe Pose
→ MediaPipe landmarks
→ joint-angle features
→ classifier/rules

Do not only use:

Fit3D ground-truth skeleton
→ classifier

Why?

Because your live app will receive noisy MediaPipe predictions, not perfect motion-capture skeletons. MediaPipe can have jitter, missing landmarks, occlusion errors, left/right confusion, and depth instability. Your training/evaluation features should resemble your real deployment features.

5. Good final architecture

A clean architecture:

1. Camera input
2. MediaPipe Pose Landmarker
3. Landmark quality checker
4. Pose normalization
5. Feature extraction
6. Temporal smoothing
7. Exercise module
8. Rep phase detector
9. Form feedback rules
10. UI overlay + logging

Each part should be testable independently.

Example directory structure:

formai/
  app/
    webcam_demo.py
    overlay.py
  pose/
    mediapipe_runner.py
    landmark_utils.py
    quality_checks.py
  features/
    angles.py
    normalization.py
    window_features.py
  exercises/
    squat.py
    bicep_curl.py
    pushup.py
  ml/
    train_exercise_classifier.py
    evaluate.py
  data/
    process_fit3d.py
  configs/
    exercises.yaml

6. Pose normalization

Do not feed raw pixel coordinates directly into your classifier.

Raw coordinates depend on:

camera distance,
user height,
frame resolution,
where the person stands,
crop/zoom,
phone orientation.

Normalize landmarks.

Basic normalization:

hip_center = midpoint(left_hip, right_hip)
shoulder_center = midpoint(left_shoulder, right_shoulder)
scale = distance(hip_center, shoulder_center)
normalized_landmark = (landmark - hip_center) / scale

This makes the pose representation less sensitive to body size and camera distance.
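Here is a minimal sketch of that normalization in plain Python. It assumes landmarks arrive as a dict mapping a name to an (x, y) pair; the dict layout and the function names are my own for illustration, not part of the MediaPipe API.

```python
import math


def midpoint(a, b):
    """Midpoint of two points given as coordinate tuples."""
    return tuple((x + y) / 2.0 for x, y in zip(a, b))


def normalize_pose(landmarks, eps=1e-6):
    """Center landmarks on the hip midpoint and scale by torso length.

    landmarks: dict of name -> (x, y) (assumed layout, for illustration).
    Returns a new dict, or None if the torso length is too small to trust.
    """
    hip_center = midpoint(landmarks["left_hip"], landmarks["right_hip"])
    shoulder_center = midpoint(landmarks["left_shoulder"], landmarks["right_shoulder"])
    scale = math.dist(hip_center, shoulder_center)
    if scale < eps:
        return None  # degenerate pose: skip the frame instead of dividing by ~0
    return {
        name: tuple((c - h) / scale for c, h in zip(point, hip_center))
        for name, point in landmarks.items()
    }
```

The same person standing closer to the camera or off to one side should give roughly the same normalized coordinates, which is what you want before training a classifier.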

7. Joint-angle features

Joint angles are a very good starting point because they are:

interpretable,
fast,
simple,
easy to debug,
easy to explain in a report.

For a joint angle:

A — B — C

The angle is at point B.

Example:

hip → knee → ankle = knee angle
shoulder → elbow → wrist = elbow angle
shoulder → hip → knee = hip angle
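A small sketch of the angle-at-B computation, assuming 2D points as (x, y) tuples. The function name joint_angle is mine; the math is just the angle between the vectors B→A and B→C.

```python
import math


def joint_angle(a, b, c):
    """Angle at point b (in degrees) formed by the segments b->a and b->c.

    Points are (x, y) tuples. Returns None when a or c coincides with b,
    because the angle is undefined in that case.
    """
    bax, bay = a[0] - b[0], a[1] - b[1]
    bcx, bcy = c[0] - b[0], c[1] - b[1]
    norm_ba = math.hypot(bax, bay)
    norm_bc = math.hypot(bcx, bcy)
    if norm_ba == 0.0 or norm_bc == 0.0:
        return None
    cos_angle = (bax * bcx + bay * bcy) / (norm_ba * norm_bc)
    # Clamp to [-1, 1] so floating-point rounding cannot break acos.
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return math.degrees(math.acos(cos_angle))


# Knee angle for a straight leg: hip, knee, and ankle on one vertical line.
knee_angle = joint_angle((0.5, 0.5), (0.5, 0.7), (0.5, 0.9))
```

A straight limb gives about 180°, and a deep squat or the top of a curl gives a much smaller angle, which is why these values work well as rep-phase signals.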

Squat features

Feature	Landmarks	Why useful
Knee angle	hip-knee-ankle	squat depth and phase
Hip angle	shoulder-hip-knee	hip hinge / lower-body pattern
Torso angle	shoulder center to hip center	forward lean
Hip vertical movement	hip center y	depth
Left/right knee difference	left knee angle vs right knee angle	asymmetry
Knee tracking	knee vs ankle/hip x-position	knee cave, front view only

Bicep curl features

Feature	Landmarks	Why useful
Elbow angle	shoulder-elbow-wrist	rep phase and range
Elbow drift	elbow relative to shoulder/torso	upper-arm stability
Shoulder motion	shoulder/upper-arm movement	swinging/cheating
Wrist path	wrist relative to elbow	curl motion
Left/right elbow difference	both elbows	symmetry if two-arm curl

8. Use temporal windows, not single frames

Exercise is motion. A single frame is often ambiguous.

A squat, lunge, and deadlift can look similar in one frame. A curl midpoint can look like many other arm motions. Use time.

A useful paper here is Real-Time Fitness Exercise Classification and Counting Using a Bidirectional LSTM, which uses temporal pose features over frame sequences. You do not need to implement BiLSTM first, but the principle is important:

use sequences/windows, not isolated frames

Simple temporal-window features:

window = last 30 frames

for each angle:
    mean
    min
    max
    range
    standard deviation
    velocity

For squat:

knee_angle_min
knee_angle_max
knee_angle_range
hip_y_range
torso_angle_max
left_right_knee_difference_mean

For curl:

elbow_angle_min
elbow_angle_max
elbow_angle_range
elbow_velocity_mean
elbow_position_drift
shoulder_angle_range

This gives you motion information without needing a deep temporal model.
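As a rough sketch, the per-angle window statistics could be computed like this with only the standard library. The function name, the feature-name suffixes, and the choice of mean absolute velocity (a signed mean tends toward zero over a full rep) are my assumptions, not a fixed recipe.

```python
import statistics
from collections import deque


def window_features(angles, fps=30.0, prefix="angle"):
    """Summary statistics for one angle over a window of frames.

    angles: sequence of angle values in degrees, oldest first.
    Returns a dict of features, or None if the window is too short.
    """
    values = list(angles)
    if len(values) < 2:
        return None
    # Frame-to-frame change converted to degrees per second.
    velocities = [(b - a) * fps for a, b in zip(values, values[1:])]
    return {
        f"{prefix}_mean": statistics.fmean(values),
        f"{prefix}_min": min(values),
        f"{prefix}_max": max(values),
        f"{prefix}_range": max(values) - min(values),
        f"{prefix}_std": statistics.pstdev(values),
        f"{prefix}_velocity_mean_abs": statistics.fmean(abs(v) for v in velocities),
    }


# In a live loop, a bounded deque keeps only the last 30 frames.
knee_window = deque(maxlen=30)
for angle in [170.0, 150.0, 130.0, 110.0, 90.0]:
    knee_window.append(angle)
features = window_features(knee_window, fps=30.0, prefix="knee_angle")
```

Concatenating these dicts for a handful of angles gives one feature row per window, which is enough input for a Random Forest or k-NN baseline.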

9. Repetition counting: use a state machine

Do not count reps like this:

if knee_angle < 100:
    count += 1

That will overcount.

Use a state machine.

Squat state machine

standing
→ descending
→ bottom
→ ascending
→ standing

Simple version:

if state == "standing" and knee_angle < down_threshold:
    state = "bottom"

if state == "bottom" and knee_angle > up_threshold:
    count += 1
    state = "standing"

Better version:

require threshold for several frames
require valid landmarks
require minimum time between reps
ignore low-quality frames
smooth angles before decisions
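A sketch of the "better version" might look like the class below. It covers the multi-frame threshold, the minimum time between reps, and skipping invalid frames; smoothing is left out and should be applied to the angle before it is passed in. The class name, the 160° up threshold, and the other default values are placeholders I picked, so tune them on your own videos.

```python
class SquatRepCounter:
    """Two-state squat counter with debouncing (illustrative sketch)."""

    def __init__(self, down_threshold=100.0, up_threshold=160.0,
                 hold_frames=3, min_rep_seconds=0.8):
        self.down_threshold = down_threshold
        self.up_threshold = up_threshold
        self.hold_frames = hold_frames          # frames a threshold must hold
        self.min_rep_seconds = min_rep_seconds  # minimum time between counted reps
        self.state = "standing"
        self.count = 0
        self._streak = 0
        self._last_rep_time = None

    def update(self, knee_angle, timestamp, valid=True):
        """Feed one frame; timestamp is in seconds. Returns the rep count."""
        if not valid or knee_angle is None:
            return self.count  # ignore low-quality frames entirely

        if self.state == "standing":
            crossed = knee_angle < self.down_threshold
        else:
            crossed = knee_angle > self.up_threshold
        self._streak = self._streak + 1 if crossed else 0
        if self._streak < self.hold_frames:
            return self.count

        # The threshold held for enough consecutive frames: switch state.
        self._streak = 0
        if self.state == "standing":
            self.state = "bottom"
        else:
            self.state = "standing"
            recent = (self._last_rep_time is not None and
                      timestamp - self._last_rep_time < self.min_rep_seconds)
            if not recent:
                self.count += 1
                self._last_rep_time = timestamp
        return self.count
```

A one-frame dip below the threshold no longer changes the state, and two "reps" closer together than min_rep_seconds are counted once.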

Bicep curl state machine

extended
→ curling_up
→ top
→ lowering
→ extended

Example thresholds:

extended: elbow_angle > 150°
top: elbow_angle < 60°
rep: extended → top → extended

Do not treat those numbers as universal. Use them as starting points and tune them from your videos.
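To show how those two thresholds work together with smoothing, here is a small sketch that applies an exponential moving average before the extended → top → extended check. The EMA is one simple smoothing choice among several, and alpha=0.4 is an arbitrary starting value, not a recommendation from any library.

```python
class EmaSmoother:
    """Exponential moving average; higher alpha follows the input faster."""

    def __init__(self, alpha=0.4):
        self.alpha = alpha
        self.value = None

    def update(self, x):
        if self.value is None:
            self.value = x
        else:
            self.value = self.alpha * x + (1.0 - self.alpha) * self.value
        return self.value


def count_curl_reps(elbow_angles, extended_threshold=150.0,
                    top_threshold=60.0, alpha=0.4):
    """Count extended -> top -> extended cycles on smoothed elbow angles."""
    smoother = EmaSmoother(alpha)
    state = "extended"
    count = 0
    for raw_angle in elbow_angles:
        angle = smoother.update(raw_angle)
        if state == "extended" and angle < top_threshold:
            state = "top"
        elif state == "top" and angle > extended_threshold:
            state = "extended"
            count += 1  # a rep is counted only after returning to extension
    return count
```

Because the two thresholds are far apart, small jitter around either one cannot toggle the state, and the smoothing keeps a single bad frame from looking like the top of a curl. The price is a few frames of lag.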

10. Form feedback should be rep-level, not frame-level

Frame-level feedback is noisy.

Bad:

Frame 101: torso bad
Frame 102: torso okay
Frame 103: torso bad
Frame 104: torso okay

Better:

Rep 3:
- depth: too shallow
- torso lean: acceptable
- symmetry: acceptable

Feedback:
"Rep 3: try to squat slightly deeper."

Recommended feedback behavior:

During the rep:
    show light live cues

After the rep:
    show one main correction

Use cooldowns:

Do not repeat the same feedback every frame.
Only show a warning if the issue persists for N frames or appears in a significant part of the rep.
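One way to sketch this: collect per-frame issue flags during the rep, then at the end of the rep keep only issues that appeared in a meaningful fraction of frames and return a single message. The class name, the 30% fraction, and the rep-based cooldown are my assumptions for illustration.

```python
class RepFeedback:
    """Turns noisy per-frame flags into one message per rep (sketch)."""

    def __init__(self, min_fraction=0.3, cooldown_reps=1):
        self.min_fraction = min_fraction    # share of frames an issue must cover
        self.cooldown_reps = cooldown_reps  # reps to wait before repeating a message
        self._flags = {}       # issue -> flagged frame count in the current rep
        self._frames = 0
        self._rep_index = 0
        self._last_shown = {}  # issue -> rep index when it was last shown

    def add_frame(self, issues):
        """Record the issues detected in one frame of the current rep."""
        self._frames += 1
        for issue in issues:
            self._flags[issue] = self._flags.get(issue, 0) + 1

    def end_rep(self, messages):
        """Close the rep and return one main correction, or None."""
        self._rep_index += 1
        persistent = []
        for issue, flagged in self._flags.items():
            fraction = flagged / self._frames
            if fraction >= self.min_fraction:
                persistent.append((fraction, issue))
        self._flags, self._frames = {}, 0
        # Most persistent issue first; skip anything still in cooldown.
        for _, issue in sorted(persistent, reverse=True):
            last = self._last_shown.get(issue)
            if last is None or self._rep_index - last > self.cooldown_reps:
                self._last_shown[issue] = self._rep_index
                return f"Rep {self._rep_index}: {messages[issue]}"
        return None
```

Call add_frame on every valid frame, and call end_rep when the rep counter's state machine returns to the start state.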

11. Suggested rules

Squat rules

Use side view first.

Required landmarks:

shoulders
hips
knees
ankles

Main rep signal:

average knee angle
or hip height relative to knee

Rules:

Rule	Signal	Feedback
Not deep enough	min knee angle or hip/knee height	“Try to squat deeper.”
Torso leaning too far	torso angle during descent/bottom	“Keep your chest more upright.”
Left/right asymmetry	difference between left and right knee angles	“Try to move both legs evenly.”
Knee cave	knee position relative to ankle/hip line	“Avoid letting your knees collapse inward.”

Important: knee-cave detection is mainly a front-view problem. Squat depth and torso lean are mainly side-view problems. Do not claim all squat errors can be detected from any camera angle.

Bicep curl rules

Required landmarks:

shoulder
elbow
wrist
hip/torso reference

Main rep signal:

elbow angle

Rules:

Rule	Signal	Feedback
Incomplete curl	min elbow angle too large	“Curl higher.”
Incomplete extension	max elbow angle too small	“Extend your arm more at the bottom.”
Elbow drift	elbow moves relative to shoulder/torso	“Keep your elbow stable.”
Shoulder swing	shoulder/upper arm moves too much	“Avoid swinging your shoulder.”
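The curl rules in this table could be sketched as one function over a per-rep summary. The summary fields, the drift and shoulder-motion units (fractions of torso length after normalization), and all threshold values are assumptions I made for illustration; only the feedback strings come from the table.

```python
from dataclasses import dataclass


@dataclass
class CurlRepSummary:
    """Measurements collected over one curl rep (hypothetical layout)."""
    min_elbow_angle: float  # degrees, smallest angle reached at the top
    max_elbow_angle: float  # degrees, largest angle reached at the bottom
    elbow_drift: float      # max elbow displacement, in torso lengths
    shoulder_motion: float  # max shoulder displacement, in torso lengths


def analyze_curl_rep(rep, top_angle=60.0, extended_angle=150.0,
                     max_drift=0.15, max_shoulder_motion=0.10):
    """Return a list of feedback strings; an empty list means no issue found."""
    feedback = []
    if rep.min_elbow_angle > top_angle:
        feedback.append("Curl higher.")
    if rep.max_elbow_angle < extended_angle:
        feedback.append("Extend your arm more at the bottom.")
    if rep.elbow_drift > max_drift:
        feedback.append("Keep your elbow stable.")
    if rep.shoulder_motion > max_shoulder_motion:
        feedback.append("Avoid swinging your shoulder.")
    return feedback


partial_rep = CurlRepSummary(min_elbow_angle=85.0, max_elbow_angle=165.0,
                             elbow_drift=0.05, shoulder_motion=0.02)
partial_feedback = analyze_curl_rep(partial_rep)
```

Each rule stays a single readable comparison, so you can list the rule, its threshold, and its message in the report.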

12. Camera-view constraints are not optional

A phone camera cannot reliably detect every form issue from every angle.

Issue	Best camera view
Squat depth	side view
Squat torso lean	side view
Knees caving inward	front view
Bicep curl elbow drift	side or front upper-body view
Bicep curl shoulder swing	side view
Push-up hip sag	side view
Push-up elbow flare	front/diagonal view

Your app should guide the user:

"For squat depth and torso analysis, place the camera to your side."
"For knee-cave analysis, use a front view."
"For bicep curls, keep shoulder, elbow, and wrist visible."

This makes the system more honest and more reliable.

13. Landmark quality checks

Before giving form feedback, check that the pose is usable.

Quality checks:

one person detected
required landmarks visible
full body inside frame
landmarks inside image bounds
limb lengths reasonable
angle changes not physically impossible
landmarks stable for several frames
camera view suitable for selected exercise

If the input is bad, do not say “bad form.” Say:

"Move farther from the camera."
"Make sure your full body is visible."
"Improve lighting."
"Use side view for squat analysis."
"Only one person should be in frame."

Useful MediaPipe issue links for real-world pitfalls:

Pose landmark jitter issue
Landmark visibility/presence discussion
Occlusion / hallucinated landmarks issue
Pose accuracy issues with non-standing / rotated poses
MediaPipe Web synchronous detect/detectForVideo performance issue

Takeaway: a robust app needs input-quality warnings, not only form warnings.
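A few of the quality checks above can be sketched as a gate that runs before any form rule. MediaPipe landmarks do carry normalized x/y coordinates and a visibility score, but the dict layout, the function name, and the 0.5 visibility cutoff here are my own assumptions; convert the library's output into whatever structure you prefer.

```python
def check_pose_quality(poses, required, min_visibility=0.5):
    """Return a user-facing warning, or None if the input looks usable.

    poses: list of detected people, each a dict of
           name -> (x, y, visibility) with x and y normalized to [0, 1]
           (assumed layout, for illustration).
    required: landmark names needed for the selected exercise.
    """
    if len(poses) == 0:
        return "Make sure your full body is visible."
    if len(poses) > 1:
        return "Only one person should be in frame."

    pose = poses[0]
    for name in required:
        if name not in pose:
            return "Make sure your full body is visible."
        x, y, visibility = pose[name]
        if visibility < min_visibility:
            return "Make sure your full body is visible."
        if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
            # The landmark was extrapolated outside the image.
            return "Move farther from the camera."
    return None


SQUAT_SIDE_VIEW = ["left_shoulder", "left_hip", "left_knee", "left_ankle"]
```

Run the form rules only when this returns None; otherwise show the warning and skip the frame.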

14. Evaluation: avoid fake accuracy

Do not randomly split frames.

Bad:

frame 1 from video A → train
frame 2 from video A → test
frame 3 from video A → train
frame 4 from video A → test

This leaks information because neighboring frames are nearly identical.

Better:

recording A → train
recording B → test

Best:

subject A/B/C → train
subject D → test

Read:

scikit-learn common pitfalls: data leakage
scikit-learn GroupShuffleSplit
scikit-learn getting started

Use GroupShuffleSplit or GroupKFold with:

group = subject_id

or:

group = recording_id
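To make the idea concrete, here is a dependency-free sketch of a group-wise split. In the actual project you would normally use scikit-learn's GroupShuffleSplit or GroupKFold instead; this hand-written helper (the name and signature are mine) only illustrates the property that matters: a group never appears on both sides.

```python
import random


def group_train_test_split(groups, test_fraction=0.25, seed=0):
    """Split sample indices so that each group is entirely train or test.

    groups: one group label per sample (e.g. subject_id or recording_id).
    Returns (train_indices, test_indices).
    """
    unique_groups = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique_groups)
    # Hold out whole groups, at least one.
    n_test = max(1, round(len(unique_groups) * test_fraction))
    test_groups = set(unique_groups[:n_test])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx


# Three windows from each of four subjects.
subject_ids = ["A"] * 3 + ["B"] * 3 + ["C"] * 3 + ["D"] * 3
train_idx, test_idx = group_train_test_split(subject_ids)
```

With four subjects and test_fraction=0.25, exactly one subject is held out and none of that subject's windows reach the training set.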

Report multiple metrics:

Component	Metric
Exercise classifier	accuracy, macro F1, confusion matrix
Rep counter	absolute count error
Form rules	manual agreement on selected clips
Runtime	FPS, average latency
Robustness	failure cases by lighting/camera/occlusion

Do not only report:

accuracy = 98%

Report:

exercise classifier macro F1
rep-counting error
FPS
failure cases

That will make your report much more credible.
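For reference, two of those metrics are simple enough to sketch by hand. In practice scikit-learn's f1_score with average="macro" covers the first one; the function names below are mine, and the example labels are made up purely to show the calculation.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def mean_abs_count_error(true_counts, predicted_counts):
    """Average absolute difference between true and predicted rep counts."""
    errors = [abs(t - p) for t, p in zip(true_counts, predicted_counts)]
    return sum(errors) / len(errors)


# Made-up example: one squat window is misclassified as a curl.
y_true = ["squat", "squat", "curl", "curl"]
y_pred = ["squat", "curl", "curl", "curl"]
f1 = macro_f1(y_true, y_pred)

# Made-up example: rep counts for three test videos.
count_error = mean_abs_count_error([10, 8, 12], [10, 9, 10])
```

Macro F1 weights every exercise class equally, so it drops when one class is handled poorly even if overall accuracy still looks fine.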

15. Similar projects worth studying

Official / high-value guides

MediaPipe pose classification and repetition counting guide: Very relevant. Shows pose classification and repetition counting with push-ups/squats using k-NN.
ML Kit pose classification guide: Useful mobile-oriented explanation of pose classification and rep counting.
Build an AI Fitness Trainer Using MediaPipe for Squat Analysis: Practical squat-focused MediaPipe example with feedback logic.

Research references

AIFit: Automatic 3D Human-Interpretable Feedback Models for Fitness Training: Closest research-grade version of your idea: 3D pose, rep segmentation, trainer/trainee comparison, interpretable feedback.
AIFit PDF
Real-Time Fitness Exercise Classification and Counting Using BiLSTM: Good for the idea that temporal windows matter.
BlazePose paper: Background on the pose-estimation model family behind MediaPipe Pose.
BlazePose GHUM Holistic: Useful for 3D landmark / on-device pose-estimation background.

GitHub projects

ExercisePoseCorrection: Push-ups, squats, and bicep curls with real-time form feedback.
AI Push-Up Trainer: Good example of using a state machine before counting reps.
Deadlift posture-correction system: Useful example of stage-based feedback: setup, lifting, lockout.
Pose Estimation for Fitness Exercise Analysis: Uses MediaPipe + scikit-learn for exercise phase classification, rep counting, and quality assessment.
Exercise-Correction: Useful for stage-dependent exercise error logic.
Workout-Trainer: Good example of exercise-specific metrics such as elbow flexion, shoulder drift, squat depth, and chest angle.

Use GitHub projects as engineering references, not as proof that the problem is solved. Many hobby/student projects have weak evaluation.

16. Recommended tech stack

Fastest prototype

Python
OpenCV
MediaPipe
NumPy
Pandas
scikit-learn
Matplotlib
Joblib

Use this for:

video reading
pose extraction
angle calculation
CSV generation
model training
evaluation plots
debugging

Optional web demo

JavaScript / TypeScript
@mediapipe/tasks-vision
HTML Canvas
Web Worker

Relevant links:

MediaPipe Web Pose Landmarker guide
MediaPipe samples web repo
Pose Landmarker worker sample

The web version is good if you want a phone-browser demo, but be careful with performance. Pose detection in the browser can block the main thread unless you throttle it or run it in a worker.

Optional Android demo

Kotlin or Java
MediaPipe Tasks Android
CameraX
Canvas overlay

Only do Android if someone on your team already knows Android.

17. Suggested one-month plan

Week 1: Pose pipeline

Goal:

camera/video
→ MediaPipe
→ landmarks
→ joint angles

Deliverables:

webcam or video input,
pose overlay,
angle calculation,
CSV export,
basic landmark quality checks.

Do not start with a complex model yet.

Week 2: One exercise end-to-end

Goal:

squat works from camera to feedback

Deliverables:

squat rep counter,
squat state machine,
2–3 squat feedback rules,
angle smoothing,
after-rep feedback.

At the end of Week 2, you should already have a demo.

Week 3: Second exercise + classifier

Goal:

bicep curl support + exercise classifier baseline

Deliverables:

curl rep counter,
curl feedback rules,
Fit3D subset processing,
exercise classifier baseline,
confusion matrix.

Start with:

Random Forest
SVM
k-NN
Logistic Regression

Do not start with LSTM unless the simple pipeline is already working.

Week 4: Integration and polish

Goal:

stable final demo + honest evaluation

Deliverables:

clean UI,
final demo video,
FPS measurement,
evaluation metrics,
failure cases,
report,
presentation.

18. Team division

With three teammates, divide the work like this.

Teammate 1: Real-time pipeline

Responsibilities:

camera input
MediaPipe setup
landmark drawing
FPS/latency measurement
UI overlay

Deliverables:

live skeleton demo
real-time angle display
recorded demo video

Teammate 2: Dataset and ML

Responsibilities:

Fit3D subset
landmark extraction
feature CSV
exercise classifier
train/test split
evaluation

Deliverables:

features.csv
trained classifier
classification report
confusion matrix

Teammate 3: Rep counting and feedback

Responsibilities:

angle logic
state machines
squat rules
curl rules
feedback messages
failure-case documentation

Deliverables:

rep counter
form analyzer
feedback engine
rule documentation

This gives everyone a clear subsystem.

19. What your final report should say

Avoid saying only:

We used MediaPipe and trained a classifier.

Say something like:

We built a modular real-time exercise-form analysis pipeline. MediaPipe Pose Landmarker was used to extract body landmarks from camera/video input. We normalized landmarks, computed interpretable joint-angle features, smoothed temporal signals, counted repetitions with exercise-specific state machines, and generated corrective feedback from phase-aware rules. We evaluated exercise classification with a subject/video-level split and measured rep-counting accuracy, runtime FPS, and common failure cases.

Suggested report structure:

1. Introduction
   - problem
   - motivation
   - goal

2. Background
   - human pose estimation
   - MediaPipe Pose
   - exercise classification
   - form feedback

3. Dataset
   - Fit3D overview
   - selected exercises
   - preprocessing
   - train/test split

4. Method
   - pose extraction
   - landmark normalization
   - joint-angle features
   - temporal smoothing
   - rep state machine
   - form-feedback rules
   - exercise classifier

5. Implementation
   - real-time pipeline
   - UI
   - performance considerations

6. Experiments
   - exercise classification results
   - rep-counting results
   - runtime FPS
   - failure cases

7. Discussion
   - limitations
   - camera-view constraints
   - dataset limitations
   - future work

8. Conclusion

20. Pitfalls to avoid

Pitfall 1: Too many exercises

Bad:

We support 12 exercises.

Better:

We support 2 exercises robustly and explainably.

Pitfall 2: Binary correct/incorrect form

Bad:

The model says correct or incorrect.

Better:

The system detects specific issues:
- shallow squat
- excessive torso lean
- elbow drift
- partial curl

Pitfall 3: Frame-level random split

Bad:

random train_test_split over all frames

Better:

split by subject_id or recording_id

Pitfall 4: No camera setup

Bad:

Analyze from any camera angle.

Better:

Use side view for squat depth and torso lean.
Use front/side upper-body view for bicep curls.

Pitfall 5: No smoothing

Bad:

one-frame warning

Better:

warning only after the condition persists across several frames or across a rep

Pitfall 6: Overusing z-depth

Bad:

precise 3D biomechanics from one phone camera

Better:

2D angle features with constrained camera view; optional 3D/world features for experiments

Pitfall 7: Overclaiming safety

Avoid:

prevents injuries
guarantees safe form
replaces a trainer

Say:

provides basic real-time feedback on visible form deviations

21. My final recommended FormAI MVP

Input:
- phone/webcam video

Pose:
- MediaPipe Pose Landmarker

Exercises:
- squat
- bicep curl

Features:
- normalized landmarks
- joint angles
- temporal window statistics

Exercise recognition:
- manual exercise selection for demo
- optional Random Forest/k-NN/SVM classifier for experiment

Rep counting:
- state-machine based

Feedback:
- squat depth
- squat torso lean
- squat asymmetry
- curl range of motion
- curl elbow drift
- curl shoulder swing

Evaluation:
- Fit3D subset
- subject/video-level split
- confusion matrix
- rep-counting error
- FPS
- failure cases

Short summary

The project is feasible if you narrow the scope.
Use MediaPipe for pose landmarks, not for the whole coaching logic.
Use Fit3D for exercise videos, rep intervals, and offline experiments.
Train on MediaPipe-extracted landmarks from Fit3D videos, not only clean ground-truth skeletons.
Use ML for exercise recognition or phase recognition.
Use rules for form feedback first.
Use state machines for rep counting.
Use temporal smoothing and rep-level feedback.
Split train/test by subject or recording, not by frame.
Build squat + bicep curl well before adding anything else.
Make the final claim modest: FormAI gives basic real-time feedback on visible form deviations; it does not replace a trainer or guarantee injury prevention.
