Real-time exercise form analysis with MediaPipe, looking for advice

Hugging Face Forums [Unofficial] May 2, 2026

For now, I’ve gathered some existing resources that might be useful:


Your project idea is feasible, but only if you keep the scope tight.

You are not really building one system; you are building several systems stacked together:

camera video
→ pose estimation
→ landmark cleanup
→ feature extraction
→ exercise recognition
→ repetition segmentation
→ form feedback
→ real-time UI

For a one-month course project with three people and no previous full CV-pipeline experience, I would not try to build a general-purpose “AI personal trainer” for many exercises. I would build a narrow, explainable, real-time prototype that works well for 2 exercises.

My recommended version of FormAI:

Phone/webcam camera
→ MediaPipe Pose Landmarker
→ landmark quality checks
→ normalized joint-angle features
→ manual exercise mode or lightweight exercise classifier
→ state-machine repetition counting
→ exercise-specific form rules
→ real-time overlay + after-rep feedback

The most important design choice:

Use ML for exercise recognition or phase recognition, but use rule-based / phase-aware logic for form feedback first.

That will be easier to build, easier to debug, easier to demo, and easier to explain in your report.


1. Why MediaPipe is a good backbone

MediaPipe Pose Landmarker is a good fit because it detects body landmarks in images/video and outputs both image-coordinate landmarks and 3D world-coordinate landmarks. It is designed for tasks like posture analysis and movement categorization.

Useful official links:

  • MediaPipe Pose Landmarker overview
  • MediaPipe Pose Landmarker Python guide
  • MediaPipe Pose Landmarker Web guide
  • MediaPipe Pose Landmarker Android guide
  • Legacy MediaPipe Pose docs
  • MediaPipe GitHub repo

For a student project, MediaPipe should be treated as the pose-estimation backend, not as the full exercise coach. MediaPipe gives you landmarks. Your project still has to decide:

  • which landmarks are reliable,
  • which angles matter,
  • where a rep starts and ends,
  • whether the current exercise phase is valid,
  • what feedback should be shown,
  • whether the camera view is acceptable.

So the core of your project is not “running MediaPipe.” The core is everything you build on top of MediaPipe.


2. The main warning: do not start with a big correct/incorrect classifier

Your original plan is:

MediaPipe keypoints
→ joint angles
→ classifier for exercise type
→ classifier for correct/incorrect form

The exercise-type classifier is reasonable.

The correct/incorrect classifier is risky.

Why? Because “incorrect form” is not one thing.

A squat can be wrong because of:

  • not enough depth,
  • excessive torso lean,
  • knees caving inward,
  • left/right asymmetry,
  • heels lifting,
  • bad camera angle,
  • missing landmarks.

A bicep curl can be wrong because of:

  • partial range of motion,
  • not extending fully,
  • elbow drifting,
  • shoulder swinging,
  • torso momentum,
  • occluded wrist/elbow.

A binary classifier may output:

incorrect

But the user needs something like:

"Your elbow is drifting forward. Keep your upper arm more stable."

So instead of training a vague correct/incorrect classifier first, define specific feedback rules.

Better structure:

exercise classifier:
    squat / bicep_curl / push_up / unknown

form analyzers:
    squat_depth
    squat_torso_lean
    squat_knee_symmetry
    curl_range_of_motion
    curl_elbow_drift
    curl_shoulder_swing

This is more explainable and easier to evaluate.


3. Best scope for one month

I would build only:

1. Squat
2. Bicep curl

Optional third exercise:

3. Push-up

But only add push-up if squat and curl already work.

Why squat + bicep curl?

| Exercise | Why it is good | What you can detect | Main risk |
|---|---|---|---|
| Squat | Common, visual, uses lower-body landmarks | depth, torso lean, knee symmetry | camera view matters |
| Bicep curl | Simpler upper-body motion | range of motion, elbow drift, shoulder swing | wrist/elbow occlusion |
| Push-up | Good demo exercise | depth, hip sag, elbow angle | floor/prone pose is harder |

A robust 2-exercise system is much better than a weak 8-exercise system.

Do not try to support every exercise. For a course project, “we do two exercises well and discuss how to extend it” is a strong result.


4. Use Fit3D carefully

Fit3D is a strong dataset choice. The dataset includes exercise videos, multiple camera views, 3D skeletons, meshes, exercise labels, and repetition information. The Fit3D homepage describes the broader AIFit system and dataset context.

Also read:

  • Fit3D dataset page
  • Fit3D homepage
  • Fit3D code page
  • Fit3D license/legal page
  • AIFit CVPR paper
  • AIFit PDF

Fit3D is useful for:

exercise labels
repetition intervals
multi-view exercise videos
3D skeleton reference data
offline experiments

But do not assume it directly gives you all the form-error labels you want, such as:

squat_not_deep_enough
squat_knees_caving
curl_elbow_drift
pushup_hip_sag

You may need to define those errors yourself with rules.

Important Fit3D advice

Train and test your deployed pipeline using MediaPipe-extracted landmarks from Fit3D RGB videos, not only Fit3D’s clean 3D ground-truth skeletons.

Use:

Fit3D RGB video
→ MediaPipe Pose
→ MediaPipe landmarks
→ joint-angle features
→ classifier/rules

Do not only use:

Fit3D ground-truth skeleton
→ classifier

Why?

Because your live app will receive noisy MediaPipe predictions, not perfect motion-capture skeletons. MediaPipe can have jitter, missing landmarks, occlusion errors, left/right confusion, and depth instability. Your training/evaluation features should resemble your real deployment features.


5. Good final architecture

A clean architecture:

1. Camera input
2. MediaPipe Pose Landmarker
3. Landmark quality checker
4. Pose normalization
5. Feature extraction
6. Temporal smoothing
7. Exercise module
8. Rep phase detector
9. Form feedback rules
10. UI overlay + logging

Each part should be testable independently.

Example directory structure:

formai/
  app/
    webcam_demo.py
    overlay.py
  pose/
    mediapipe_runner.py
    landmark_utils.py
    quality_checks.py
  features/
    angles.py
    normalization.py
    window_features.py
  exercises/
    squat.py
    bicep_curl.py
    pushup.py
  ml/
    train_exercise_classifier.py
    evaluate.py
  data/
    process_fit3d.py
  configs/
    exercises.yaml

6. Pose normalization

Do not feed raw pixel coordinates directly into your classifier.

Raw coordinates depend on:

  • camera distance,
  • user height,
  • frame resolution,
  • where the person stands,
  • crop/zoom,
  • phone orientation.

Normalize landmarks.

Basic normalization:

hip_center = midpoint(left_hip, right_hip)
shoulder_center = midpoint(left_shoulder, right_shoulder)
scale = distance(hip_center, shoulder_center)
normalized_landmark = (landmark - hip_center) / scale

This makes the pose representation less sensitive to body size and camera distance.
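
As a rough illustration, the normalization above could look like this in plain Python. The dict layout (landmark name mapped to an `(x, y)` tuple) and the function names are assumptions made for the sketch, not part of MediaPipe; it also assumes x and y are in the same units, for example pixels.

```python
import math


def midpoint(a, b):
    """Midpoint of two (x, y) points."""
    return ((a[0] + b[0]) / 2.0, (a[1] + b[1]) / 2.0)


def normalize_landmarks(landmarks, eps=1e-6):
    """Center landmarks on the hip midpoint and divide by torso length.

    `landmarks` is assumed to be a dict: name -> (x, y).
    Returns None when the torso is too short to give a stable scale.
    """
    hip_center = midpoint(landmarks["left_hip"], landmarks["right_hip"])
    shoulder_center = midpoint(landmarks["left_shoulder"], landmarks["right_shoulder"])
    scale = math.dist(hip_center, shoulder_center)
    if scale < eps:
        return None
    return {
        name: ((x - hip_center[0]) / scale, (y - hip_center[1]) / scale)
        for name, (x, y) in landmarks.items()
    }


# Hypothetical pose: the same body shifted and zoomed normalizes to the same values.
pose = {
    "left_hip": (0.4, 0.6), "right_hip": (0.6, 0.6),
    "left_shoulder": (0.4, 0.2), "right_shoulder": (0.6, 0.2),
    "left_knee": (0.4, 0.8),
}
zoomed = {name: (2 * x + 1, 2 * y + 3) for name, (x, y) in pose.items()}
normalized = normalize_landmarks(pose)
normalized_zoomed = normalize_landmarks(zoomed)
```

Returning `None` for a degenerate torso lets the caller treat the frame as low quality instead of dividing by a near-zero scale.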


7. Joint-angle features

Joint angles are a very good starting point because they are:

  • interpretable,
  • fast,
  • simple,
  • easy to debug,
  • easy to explain in a report.

For a joint angle:

A — B — C

The angle is at point B.

Example:

hip → knee → ankle = knee angle
shoulder → elbow → wrist = elbow angle
shoulder → hip → knee = hip angle
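
A minimal sketch of the angle at B, assuming 2D `(x, y)` points; `joint_angle` is a name chosen for this example. It uses the dot product of the two limb vectors and clamps the cosine so that rounding error cannot push it outside the valid range of `acos`.

```python
import math


def joint_angle(a, b, c):
    """Angle in degrees at point b, formed by the segments b->a and b->c.

    Points are (x, y) tuples. Returns None if either segment has zero length.
    """
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    n1 = math.hypot(v1[0], v1[1])
    n2 = math.hypot(v2[0], v2[1])
    if n1 == 0 or n2 == 0:
        return None
    cos_angle = (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)
    cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding error
    return math.degrees(math.acos(cos_angle))


# hip -> knee -> ankle with a right-angle bend at the knee (made-up points)
knee_angle = joint_angle((0.0, 1.0), (0.0, 0.0), (1.0, 0.0))
```

The result is always between 0 and 180 degrees, so a straight leg reads close to 180 and a deep squat reads much lower.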

Squat features

| Feature | Landmarks | Why useful |
|---|---|---|
| Knee angle | hip-knee-ankle | squat depth and phase |
| Hip angle | shoulder-hip-knee | hip hinge / lower-body pattern |
| Torso angle | shoulder center to hip center | forward lean |
| Hip vertical movement | hip center y | depth |
| Left/right knee difference | left knee angle vs right knee angle | asymmetry |
| Knee tracking | knee vs ankle/hip x-position | knee cave, front view only |

Bicep curl features

| Feature | Landmarks | Why useful |
|---|---|---|
| Elbow angle | shoulder-elbow-wrist | rep phase and range |
| Elbow drift | elbow relative to shoulder/torso | upper-arm stability |
| Shoulder motion | shoulder/upper-arm movement | swinging/cheating |
| Wrist path | wrist relative to elbow | curl motion |
| Left/right elbow difference | both elbows | symmetry if two-arm curl |

8. Use temporal windows, not single frames

Exercise is motion. A single frame is often ambiguous.

A squat, lunge, and deadlift can look similar in one frame. A curl midpoint can look like many other arm motions. Use time.

A useful paper here is Real-Time Fitness Exercise Classification and Counting Using a Bidirectional LSTM, which uses temporal pose features over frame sequences. You do not need to implement BiLSTM first, but the principle is important:

use sequences/windows, not isolated frames

Simple temporal-window features:

window = last 30 frames

for each angle:
    mean
    min
    max
    range
    standard deviation
    velocity

For squat:

knee_angle_min
knee_angle_max
knee_angle_range
hip_y_range
torso_angle_max
left_right_knee_difference_mean

For curl:

elbow_angle_min
elbow_angle_max
elbow_angle_range
elbow_velocity_mean
elbow_position_drift
shoulder_angle_range

This gives you motion information without needing a deep temporal model.
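
Here is one way the window statistics could be computed with only the standard library. The function name, the feature-name prefix, and the choice of mean absolute frame-to-frame change as "velocity" are assumptions for this sketch.

```python
import statistics
from collections import deque


def window_features(angles, fps=30.0, prefix="angle"):
    """Summary statistics for one angle over a window of frames.

    `angles` is a sequence of degrees, oldest first. Velocity is reported
    as mean absolute change in degrees per second. Returns None if the
    window has fewer than two samples.
    """
    values = list(angles)
    if len(values) < 2:
        return None
    speeds = [abs(b - a) * fps for a, b in zip(values, values[1:])]
    return {
        f"{prefix}_mean": statistics.fmean(values),
        f"{prefix}_min": min(values),
        f"{prefix}_max": max(values),
        f"{prefix}_range": max(values) - min(values),
        f"{prefix}_std": statistics.pstdev(values),
        f"{prefix}_velocity_mean": statistics.fmean(speeds),
    }


# Rolling window: a deque with maxlen drops the oldest frame automatically.
window = deque(maxlen=30)
for knee_angle in (170.0, 160.0, 150.0):  # made-up per-frame angles
    window.append(knee_angle)
features = window_features(window, fps=10.0, prefix="knee_angle")
```

One feature dict per window, concatenated across the angles you care about, gives a fixed-length row that a Random Forest or k-NN can consume directly.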


9. Repetition counting: use a state machine

Do not count reps like this:

if knee_angle < 100:
    count += 1

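That will overcount, because the condition is true on every frame where the knee angle stays below the threshold, so a single squat adds many counts.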

Use a state machine.

Squat state machine

standing
→ descending
→ bottom
→ ascending
→ standing

Simple version:

if state == "standing" and knee_angle < down_threshold:
    state = "bottom"

if state == "bottom" and knee_angle > up_threshold:
    count += 1
    state = "standing"

Better version:

require threshold for several frames
require valid landmarks
require minimum time between reps
ignore low-quality frames
smooth angles before decisions
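
A sketch of what the better version might look like. The class name, the default thresholds, `hold_frames`, and `min_rep_seconds` are placeholder assumptions to tune on your own videos, and the input angle is assumed to be smoothed already.

```python
class RepCounter:
    """Two-state squat rep counter with hysteresis and debouncing."""

    def __init__(self, down_threshold=100.0, up_threshold=160.0,
                 hold_frames=3, min_rep_seconds=0.5):
        self.down_threshold = down_threshold    # below this: candidate "bottom"
        self.up_threshold = up_threshold        # above this: candidate "standing"
        self.hold_frames = hold_frames          # frames the threshold must hold
        self.min_rep_seconds = min_rep_seconds  # minimum time between reps
        self.state = "standing"
        self.count = 0
        self._streak = 0
        self._last_rep_time = None

    def update(self, knee_angle, timestamp, valid=True):
        """Feed one frame; returns the current rep count."""
        if not valid or knee_angle is None:
            return self.count  # ignore low-quality frames entirely
        if self.state == "standing":
            crossed = knee_angle < self.down_threshold
        else:
            crossed = knee_angle > self.up_threshold
        self._streak = self._streak + 1 if crossed else 0
        if self._streak >= self.hold_frames:
            self._streak = 0
            if self.state == "standing":
                self.state = "bottom"
            else:
                self.state = "standing"
                long_enough = (
                    self._last_rep_time is None
                    or timestamp - self._last_rep_time >= self.min_rep_seconds
                )
                if long_enough:
                    self.count += 1
                    self._last_rep_time = timestamp
        return self.count


counter = RepCounter()
angles = [170.0] * 5 + [80.0] * 5 + [170.0] * 5  # one simulated squat at 30 FPS
for i, angle in enumerate(angles):
    counter.update(angle, timestamp=i / 30.0)
```

Using two different thresholds (hysteresis) means jitter around a single value cannot flip the state back and forth, and the frame streak filters out one-frame glitches.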

Bicep curl state machine

extended
→ curling_up
→ top
→ lowering
→ extended

Example thresholds:

extended: elbow_angle > 150°
top: elbow_angle < 60°
rep: extended → top → extended

Do not treat those numbers as universal. Use them as starting points and tune them from your videos.


10. Form feedback should be rep-level, not frame-level

Frame-level feedback is noisy.

Bad:

Frame 101: torso bad
Frame 102: torso okay
Frame 103: torso bad
Frame 104: torso okay

Better:

Rep 3:
- depth: too shallow
- torso lean: acceptable
- symmetry: acceptable

Feedback:
"Rep 3: try to squat slightly deeper."

Recommended feedback behavior:

During the rep:
    show light live cues

After the rep:
    show one main correction

Use cooldowns:

Do not repeat the same feedback every frame.
Only show a warning if the issue persists for N frames or appears in a significant part of the rep.
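
The persistence and cooldown behavior could be sketched as a small gate that sits between your rules and the UI. `FeedbackGate`, `persist_frames`, and `cooldown_seconds` are hypothetical names and defaults for this example.

```python
class FeedbackGate:
    """Show a cue only if an issue persists, and not more often than a cooldown."""

    def __init__(self, persist_frames=10, cooldown_seconds=3.0):
        self.persist_frames = persist_frames
        self.cooldown_seconds = cooldown_seconds
        self._streak = {}      # issue -> consecutive frames the issue was active
        self._last_shown = {}  # issue -> timestamp when the cue was last shown

    def update(self, issue, active, timestamp):
        """Return True if the cue for `issue` should be shown on this frame."""
        if not active:
            self._streak[issue] = 0
            return False
        self._streak[issue] = self._streak.get(issue, 0) + 1
        if self._streak[issue] < self.persist_frames:
            return False
        last = self._last_shown.get(issue)
        if last is not None and timestamp - last < self.cooldown_seconds:
            return False
        self._last_shown[issue] = timestamp
        return True


gate = FeedbackGate(persist_frames=3, cooldown_seconds=1.0)
# Issue active on four consecutive frames, then still active much later.
shown = [gate.update("torso_lean", True, t) for t in (0.0, 0.1, 0.2, 0.3, 1.3)]
```

With these settings the cue fires on the third consecutive frame, stays quiet during the cooldown, and fires again only once the cooldown has passed and the issue is still present.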

11. Suggested rules

Squat rules

Use side view first.

Required landmarks:

shoulders
hips
knees
ankles

Main rep signal:

average knee angle
or hip height relative to knee

Rules:

| Rule | Signal | Feedback |
|---|---|---|
| Not deep enough | min knee angle or hip/knee height | “Try to squat deeper.” |
| Torso leaning too far | torso angle during descent/bottom | “Keep your chest more upright.” |
| Left/right asymmetry | difference between left and right knee angles | “Try to move both legs evenly.” |
| Knee cave | knee position relative to ankle/hip line | “Avoid letting your knees collapse inward.” |

Important: knee-cave detection is mainly a front-view problem. Squat depth and torso lean are mainly side-view problems. Do not claim all squat errors can be detected from any camera angle.
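
To make the side-view rules concrete, here is a rough sketch that evaluates one finished rep from a few summary values. The threshold defaults, the summary keys, and the function names are assumptions for illustration only; knee cave is left out because it needs a front view.

```python
import math


def torso_lean_deg(shoulder_center, hip_center):
    """Torso lean from vertical in degrees (0 = upright).

    Assumes image coordinates where y grows downward.
    """
    dx = shoulder_center[0] - hip_center[0]
    dy = shoulder_center[1] - hip_center[1]
    return math.degrees(math.atan2(abs(dx), -dy))


def analyze_squat_rep(rep, max_bottom_knee_angle=100.0,
                      max_torso_lean=45.0, max_knee_difference=15.0):
    """Return a list of (issue, message) for one rep.

    `rep` is assumed to hold per-rep summaries in degrees:
    min_knee_angle, max_torso_lean, mean_knee_difference.
    """
    issues = []
    if rep["min_knee_angle"] > max_bottom_knee_angle:
        issues.append(("depth", "Try to squat deeper."))
    if rep["max_torso_lean"] > max_torso_lean:
        issues.append(("torso_lean", "Keep your chest more upright."))
    if rep["mean_knee_difference"] > max_knee_difference:
        issues.append(("symmetry", "Try to move both legs evenly."))
    return issues


# Made-up summary of a shallow but otherwise clean rep.
rep_3 = {"min_knee_angle": 120.0, "max_torso_lean": 30.0, "mean_knee_difference": 5.0}
feedback = analyze_squat_rep(rep_3)
```

Because the list is ordered, showing only the first entry after each rep gives you the "one main correction" behavior.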

Bicep curl rules

Required landmarks:

shoulder
elbow
wrist
hip/torso reference

Main rep signal:

elbow angle

Rules:

| Rule | Signal | Feedback |
|---|---|---|
| Incomplete curl | min elbow angle too large | “Curl higher.” |
| Incomplete extension | max elbow angle too small | “Extend your arm more at the bottom.” |
| Elbow drift | elbow moves relative to shoulder/torso | “Keep your elbow stable.” |
| Shoulder swing | shoulder/upper arm moves too much | “Avoid swinging your shoulder.” |
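
A possible sketch of the first three curl rules, applied to the frames of one rep. The frame layout, the drift measure (horizontal elbow offset from the shoulder, divided by torso length), and `max_drift` are assumptions for this example; the 60 and 150 degree defaults are the example thresholds from earlier and still need tuning.

```python
import math


def elbow_drift(frames):
    """Range of the elbow's horizontal offset from the shoulder, in torso lengths."""
    offsets = []
    for frame in frames:
        torso = math.dist(frame["shoulder"], frame["hip"])
        if torso == 0:
            continue  # skip degenerate frames
        offsets.append((frame["elbow"][0] - frame["shoulder"][0]) / torso)
    if not offsets:
        return None
    return max(offsets) - min(offsets)


def analyze_curl_rep(frames, top_angle=60.0, extended_angle=150.0, max_drift=0.25):
    """Return a list of (issue, message) for one rep.

    Each frame is assumed to be a dict with `elbow_angle` in degrees and
    (x, y) points for `shoulder`, `elbow`, and `hip`.
    """
    angles = [frame["elbow_angle"] for frame in frames]
    issues = []
    if min(angles) > top_angle:
        issues.append(("incomplete_curl", "Curl higher."))
    if max(angles) < extended_angle:
        issues.append(("incomplete_extension", "Extend your arm more at the bottom."))
    drift = elbow_drift(frames)
    if drift is not None and drift > max_drift:
        issues.append(("elbow_drift", "Keep your elbow stable."))
    return issues


def make_frame(angle, elbow_x=0.5):
    """Helper that builds a made-up side-view frame."""
    return {"elbow_angle": angle, "shoulder": (0.5, 0.2),
            "elbow": (elbow_x, 0.4), "hip": (0.5, 0.6)}


partial_rep = [make_frame(a) for a in (160.0, 120.0, 90.0, 120.0, 160.0)]
feedback = analyze_curl_rep(partial_rep)
```

Dividing the drift by torso length keeps the threshold roughly independent of how far the user stands from the camera.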

12. Camera-view constraints are not optional

A phone camera cannot reliably detect every form issue from every angle.

| Issue | Best camera view |
|---|---|
| Squat depth | side view |
| Squat torso lean | side view |
| Knees caving inward | front view |
| Bicep curl elbow drift | side or front upper-body view |
| Bicep curl shoulder swing | side view |
| Push-up hip sag | side view |
| Push-up elbow flare | front/diagonal view |

Your app should guide the user:

"For squat depth and torso analysis, place the camera to your side."
"For knee-cave analysis, use a front view."
"For bicep curls, keep shoulder, elbow, and wrist visible."

This makes the system more honest and more reliable.


13. Landmark quality checks

Before giving form feedback, check that the pose is usable.

Quality checks:

one person detected
required landmarks visible
full body inside frame
landmarks inside image bounds
limb lengths reasonable
angle changes not physically impossible
landmarks stable for several frames
camera view suitable for selected exercise

If the input is bad, do not say “bad form.” Say:

"Move farther from the camera."
"Make sure your full body is visible."
"Improve lighting."
"Use side view for squat analysis."
"Only one person should be in frame."

Useful MediaPipe issue links for real-world pitfalls:

  • Pose landmark jitter issue
  • Landmark visibility/presence discussion
  • Occlusion / hallucinated landmarks issue
  • Pose accuracy issues with non-standing / rotated poses
  • MediaPipe Web synchronous detect/detectForVideo performance issue

Takeaway: a robust app needs input-quality warnings, not only form warnings.
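
A couple of these checks can be sketched as a gate that runs before any form rule. The input layout (name mapped to `(x, y, visibility)` with x and y normalized to the 0 to 1 image range), the `min_visibility` default, and the function name are assumptions for this example; it covers only the visibility and in-frame checks.

```python
SQUAT_REQUIRED = (
    "left_shoulder", "right_shoulder", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
)


def check_pose_quality(landmarks, required=SQUAT_REQUIRED, min_visibility=0.5):
    """Return (ok, message). `message` is a user-facing hint when ok is False."""
    if not landmarks:
        return False, "Make sure your full body is visible."
    for name in required:
        landmark = landmarks.get(name)
        if landmark is None or landmark[2] < min_visibility:
            return False, "Make sure your full body is visible."
        x, y = landmark[0], landmark[1]
        if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
            # Landmark predicted outside the image: the user is likely too close.
            return False, "Move farther from the camera."
    return True, None


# Made-up frame where every required landmark is visible and inside the image.
good_frame = {name: (0.5, 0.5, 0.9) for name in SQUAT_REQUIRED}
ok, message = check_pose_quality(good_frame)
```

When `ok` is False, skip rep counting and form rules for that frame and show the hint instead.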


14. Evaluation: avoid fake accuracy

Do not randomly split frames.

Bad:

frame 1 from video A → train
frame 2 from video A → test
frame 3 from video A → train
frame 4 from video A → test

This leaks information because neighboring frames are nearly identical.

Better:

recording A → train
recording B → test

Best:

subject A/B/C → train
subject D → test

Read:

  • scikit-learn common pitfalls: data leakage
  • scikit-learn GroupShuffleSplit
  • scikit-learn getting started

Use GroupShuffleSplit or GroupKFold with:

group = subject_id

or:

group = recording_id
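
To show the idea without any dependencies, here is a standard-library sketch of a group-wise split. In practice you would normally reach for scikit-learn's GroupShuffleSplit; `group_train_test_split` is a hypothetical helper written only for this illustration.

```python
import random


def group_train_test_split(groups, test_fraction=0.25, seed=0):
    """Split sample indices so that no group appears in both train and test.

    `groups` has one entry per sample, for example the subject_id or
    recording_id each frame or window came from.
    """
    unique_groups = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique_groups)
    n_test = max(1, round(len(unique_groups) * test_fraction))
    test_groups = set(unique_groups[:n_test])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx


# Two windows per subject; whole subjects go to either train or test.
subject_ids = ["A", "A", "B", "B", "C", "C", "D", "D"]
train_idx, test_idx = group_train_test_split(subject_ids)
train_subjects = {subject_ids[i] for i in train_idx}
test_subjects = {subject_ids[i] for i in test_idx}
```

The key property is that the shuffle happens over groups, not over individual frames, so near-duplicate neighboring frames can never end up on both sides of the split.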

Report multiple metrics:

| Component | Metric |
|---|---|
| Exercise classifier | accuracy, macro F1, confusion matrix |
| Rep counter | absolute count error |
| Form rules | manual agreement on selected clips |
| Runtime | FPS, average latency |
| Robustness | failure cases by lighting/camera/occlusion |

Do not only report:

accuracy = 98%

Report:

exercise classifier macro F1
rep-counting error
FPS
failure cases

That will make your report much more credible.
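
For reference, two of these numbers are simple enough to compute by hand. The sketch below uses plain Python with made-up labels and counts; the function names are chosen for this example.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denominator = 2 * tp + fp + fn
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


def mean_abs_count_error(true_counts, predicted_counts):
    """Average absolute difference between true and predicted reps per clip."""
    errors = [abs(t - p) for t, p in zip(true_counts, predicted_counts)]
    return sum(errors) / len(errors)


# Made-up results for four test windows and two test clips.
y_true = ["squat", "squat", "curl", "curl"]
y_pred = ["squat", "curl", "curl", "curl"]
f1 = macro_f1(y_true, y_pred)
count_error = mean_abs_count_error([10, 8], [9, 8])
```

Macro F1 weights every exercise class equally, so a classifier that ignores a rare class is penalized even when plain accuracy still looks high.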


15. Similar projects worth studying

Official / high-value guides

  • MediaPipe pose classification and repetition counting guide: Very relevant. Shows pose classification and repetition counting with push-ups/squats using k-NN.

  • ML Kit pose classification guide: Useful mobile-oriented explanation of pose classification and rep counting.

  • Build an AI Fitness Trainer Using MediaPipe for Squat Analysis: Practical squat-focused MediaPipe example with feedback logic.

Research references

  • AIFit: Automatic 3D Human-Interpretable Feedback Models for Fitness Training. Closest research-grade version of your idea: 3D pose, rep segmentation, trainer/trainee comparison, interpretable feedback.

  • AIFit PDF

  • Real-Time Fitness Exercise Classification and Counting Using BiLSTM: Good for the idea that temporal windows matter.

  • BlazePose paper: Background on the pose-estimation model family behind MediaPipe Pose.

  • BlazePose GHUM Holistic: Useful for 3D landmark / on-device pose-estimation background.

GitHub projects

  • ExercisePoseCorrection: Push-ups, squats, and bicep curls with real-time form feedback.

  • AI Push-Up Trainer: Good example of using a state machine before counting reps.

  • Deadlift posture-correction system: Useful example of stage-based feedback: setup, lifting, lockout.

  • Pose Estimation for Fitness Exercise Analysis: Uses MediaPipe + scikit-learn for exercise phase classification, rep counting, and quality assessment.

  • Exercise-Correction: Useful for stage-dependent exercise error logic.

  • Workout-Trainer: Good example of exercise-specific metrics such as elbow flexion, shoulder drift, squat depth, and chest angle.

Use GitHub projects as engineering references, not as proof that the problem is solved. Many hobby/student projects have weak evaluation.


16. Recommended tech stack

Fastest prototype

Python
OpenCV
MediaPipe
NumPy
Pandas
scikit-learn
Matplotlib
Joblib

Use this for:

video reading
pose extraction
angle calculation
CSV generation
model training
evaluation plots
debugging

Optional web demo

JavaScript / TypeScript
@mediapipe/tasks-vision
HTML Canvas
Web Worker

Relevant links:

  • MediaPipe Web Pose Landmarker guide
  • MediaPipe samples web repo
  • Pose Landmarker worker sample

The web version is good if you want a phone-browser demo, but be careful with performance. Pose detection in the browser can block the main thread unless you throttle it or run it in a worker.

Optional Android demo

Kotlin or Java
MediaPipe Tasks Android
CameraX
Canvas overlay

Only do Android if someone on your team already knows Android.


17. Suggested one-month plan

Week 1: Pose pipeline

Goal:

camera/video
→ MediaPipe
→ landmarks
→ joint angles

Deliverables:

  • webcam or video input,
  • pose overlay,
  • angle calculation,
  • CSV export,
  • basic landmark quality checks.

Do not start with a complex model yet.

Week 2: One exercise end-to-end

Goal:

squat works from camera to feedback

Deliverables:

  • squat rep counter,
  • squat state machine,
  • 2–3 squat feedback rules,
  • angle smoothing,
  • after-rep feedback.

At the end of Week 2, you should already have a demo.

Week 3: Second exercise + classifier

Goal:

bicep curl support + exercise classifier baseline

Deliverables:

  • curl rep counter,
  • curl feedback rules,
  • Fit3D subset processing,
  • exercise classifier baseline,
  • confusion matrix.

Start with:

Random Forest
SVM
k-NN
Logistic Regression

Do not start with LSTM unless the simple pipeline is already working.

Week 4: Integration and polish

Goal:

stable final demo + honest evaluation

Deliverables:

  • clean UI,
  • final demo video,
  • FPS measurement,
  • evaluation metrics,
  • failure cases,
  • report,
  • presentation.

18. Team division

With three teammates, divide the work like this.

Teammate 1: Real-time pipeline

Responsibilities:

camera input
MediaPipe setup
landmark drawing
FPS/latency measurement
UI overlay

Deliverables:

live skeleton demo
real-time angle display
recorded demo video

Teammate 2: Dataset and ML

Responsibilities:

Fit3D subset
landmark extraction
feature CSV
exercise classifier
train/test split
evaluation

Deliverables:

features.csv
trained classifier
classification report
confusion matrix

Teammate 3: Rep counting and feedback

Responsibilities:

angle logic
state machines
squat rules
curl rules
feedback messages
failure-case documentation

Deliverables:

rep counter
form analyzer
feedback engine
rule documentation

This gives everyone a clear subsystem.


19. What your final report should say

Avoid saying only:

We used MediaPipe and trained a classifier.

Say something like:

We built a modular real-time exercise-form analysis pipeline. MediaPipe Pose Landmarker was used to extract body landmarks from camera/video input. We normalized landmarks, computed interpretable joint-angle features, smoothed temporal signals, counted repetitions with exercise-specific state machines, and generated corrective feedback from phase-aware rules. We evaluated exercise classification with a subject/video-level split and measured rep-counting accuracy, runtime FPS, and common failure cases.

Suggested report structure:

1. Introduction
   - problem
   - motivation
   - goal

2. Background
   - human pose estimation
   - MediaPipe Pose
   - exercise classification
   - form feedback

3. Dataset
   - Fit3D overview
   - selected exercises
   - preprocessing
   - train/test split

4. Method
   - pose extraction
   - landmark normalization
   - joint-angle features
   - temporal smoothing
   - rep state machine
   - form-feedback rules
   - exercise classifier

5. Implementation
   - real-time pipeline
   - UI
   - performance considerations

6. Experiments
   - exercise classification results
   - rep-counting results
   - runtime FPS
   - failure cases

7. Discussion
   - limitations
   - camera-view constraints
   - dataset limitations
   - future work

8. Conclusion

20. Pitfalls to avoid

Pitfall 1: Too many exercises

Bad:

We support 12 exercises.

Better:

We support 2 exercises robustly and explainably.

Pitfall 2: Binary correct/incorrect form

Bad:

The model says correct or incorrect.

Better:

The system detects specific issues:
- shallow squat
- excessive torso lean
- elbow drift
- partial curl

Pitfall 3: Frame-level random split

Bad:

random train_test_split over all frames

Better:

split by subject_id or recording_id

Pitfall 4: No camera setup

Bad:

Analyze from any camera angle.

Better:

Use side view for squat depth and torso lean.
Use front/side upper-body view for bicep curls.

Pitfall 5: No smoothing

Bad:

one-frame warning

Better:

warning only after the condition persists across several frames or across a rep

Pitfall 6: Overusing z-depth

Bad:

precise 3D biomechanics from one phone camera

Better:

2D angle features with constrained camera view; optional 3D/world features for experiments

Pitfall 7: Overclaiming safety

Avoid:

prevents injuries
guarantees safe form
replaces a trainer

Say:

provides basic real-time feedback on visible form deviations

21. My final recommended FormAI MVP

Input:
- phone/webcam video

Pose:
- MediaPipe Pose Landmarker

Exercises:
- squat
- bicep curl

Features:
- normalized landmarks
- joint angles
- temporal window statistics

Exercise recognition:
- manual exercise selection for demo
- optional Random Forest/k-NN/SVM classifier for experiment

Rep counting:
- state-machine based

Feedback:
- squat depth
- squat torso lean
- squat asymmetry
- curl range of motion
- curl elbow drift
- curl shoulder swing

Evaluation:
- Fit3D subset
- subject/video-level split
- confusion matrix
- rep-counting error
- FPS
- failure cases

Short summary

  • The project is feasible if you narrow the scope.
  • Use MediaPipe for pose landmarks, not for the whole coaching logic.
  • Use Fit3D for exercise videos, rep intervals, and offline experiments.
  • Train on MediaPipe-extracted landmarks from Fit3D videos, not only clean ground-truth skeletons.
  • Use ML for exercise recognition or phase recognition.
  • Use rules for form feedback first.
  • Use state machines for rep counting.
  • Use temporal smoothing and rep-level feedback.
  • Split train/test by subject or recording, not by frame.
  • Build squat + bicep curl well before adding anything else.
  • Make the final claim modest: FormAI gives basic real-time feedback on visible form deviations; it does not replace a trainer or guarantee injury prevention.
