Real-time exercise form analysis with MediaPipe, looking for advice
For now, I’ve gathered some existing resources that might be useful:
Your project idea is feasible, but only if you keep the scope tight.
You are not really building one system; you are building several systems stacked together:
camera video
→ pose estimation
→ landmark cleanup
→ feature extraction
→ exercise recognition
→ repetition segmentation
→ form feedback
→ real-time UI
For a one-month course project with three people and no previous full CV-pipeline experience, I would not try to build a general-purpose “AI personal trainer” for many exercises. I would build a narrow, explainable, real-time prototype that works well for 2 exercises.
My recommended version of FormAI:
Phone/webcam camera
→ MediaPipe Pose Landmarker
→ landmark quality checks
→ normalized joint-angle features
→ manual exercise mode or lightweight exercise classifier
→ state-machine repetition counting
→ exercise-specific form rules
→ real-time overlay + after-rep feedback
The most important design choice:
Use ML for exercise recognition or phase recognition, but use rule-based / phase-aware logic for form feedback first.
That will be easier to build, easier to debug, easier to demo, and easier to explain in your report.
1. Why MediaPipe is a good backbone
MediaPipe Pose Landmarker is a good fit because it detects body landmarks in images/video and outputs both image-coordinate landmarks and 3D world-coordinate landmarks. It is designed for tasks like posture analysis and movement categorization.
Useful official links:
- MediaPipe Pose Landmarker overview
- MediaPipe Pose Landmarker Python guide
- MediaPipe Pose Landmarker Web guide
- MediaPipe Pose Landmarker Android guide
- Legacy MediaPipe Pose docs
- MediaPipe GitHub repo
For a student project, MediaPipe should be treated as the pose-estimation backend, not as the full exercise coach. MediaPipe gives you landmarks. Your project still has to decide:
- which landmarks are reliable,
- which angles matter,
- where a rep starts and ends,
- whether the current exercise phase is valid,
- what feedback should be shown,
- whether the camera view is acceptable.
So the core of your project is not “running MediaPipe.” The core is everything you build on top of MediaPipe.
2. The main warning: do not start with a big correct/incorrect classifier
Your original plan is:
MediaPipe keypoints
→ joint angles
→ classifier for exercise type
→ classifier for correct/incorrect form
The exercise-type classifier is reasonable.
The correct/incorrect classifier is risky.
Why? Because “incorrect form” is not one thing.
A squat can be wrong because of:
- not enough depth,
- excessive torso lean,
- knees caving inward,
- left/right asymmetry,
- heels lifting,
- bad camera angle,
- missing landmarks.
A bicep curl can be wrong because of:
- partial range of motion,
- not extending fully,
- elbow drifting,
- shoulder swinging,
- torso momentum,
- occluded wrist/elbow.
A binary classifier may output:
incorrect
But the user needs something like:
"Your elbow is drifting forward. Keep your upper arm more stable."
So instead of training a vague correct/incorrect classifier first, define specific feedback rules.
Better structure:
exercise classifier:
squat / bicep_curl / push_up / unknown
form analyzers:
squat_depth
squat_torso_lean
squat_knee_symmetry
curl_range_of_motion
curl_elbow_drift
curl_shoulder_swing
This is more explainable and easier to evaluate.
3. Best scope for one month
I would build only:
1. Squat
2. Bicep curl
Optional third exercise:
3. Push-up
But only add push-up if squat and curl already work.
Why squat + bicep curl?
| Exercise | Why it is good | What you can detect | Main risk |
|---|---|---|---|
| Squat | Common, visual, uses lower-body landmarks | depth, torso lean, knee symmetry | camera view matters |
| Bicep curl | Simpler upper-body motion | range of motion, elbow drift, shoulder swing | wrist/elbow occlusion |
| Push-up | Good demo exercise | depth, hip sag, elbow angle | floor/prone pose is harder |
A robust 2-exercise system is much better than a weak 8-exercise system.
Do not try to support every exercise. For a course project, “we do two exercises well and discuss how to extend it” is a strong result.
4. Use Fit3D carefully
Fit3D is a strong dataset choice. The dataset includes exercise videos, multiple camera views, 3D skeletons, meshes, exercise labels, and repetition information. The Fit3D homepage describes the broader AIFit system and dataset context.
Also read:
- Fit3D dataset page
- Fit3D homepage
- Fit3D code page
- Fit3D license/legal page
- AIFit CVPR paper
- AIFit PDF
Fit3D is useful for:
exercise labels
repetition intervals
multi-view exercise videos
3D skeleton reference data
offline experiments
But do not assume it directly gives you all the form-error labels you want, such as:
squat_not_deep_enough
squat_knees_caving
curl_elbow_drift
pushup_hip_sag
You may need to define those errors yourself with rules.
Important Fit3D advice
Train and test your deployed pipeline using MediaPipe-extracted landmarks from Fit3D RGB videos, not only Fit3D’s clean 3D ground-truth skeletons.
Use:
Fit3D RGB video
→ MediaPipe Pose
→ MediaPipe landmarks
→ joint-angle features
→ classifier/rules
Do not only use:
Fit3D ground-truth skeleton
→ classifier
Why?
Because your live app will receive noisy MediaPipe predictions, not perfect motion-capture skeletons. MediaPipe can have jitter, missing landmarks, occlusion errors, left/right confusion, and depth instability. Your training/evaluation features should resemble your real deployment features.
5. Good final architecture
A clean architecture:
1. Camera input
2. MediaPipe Pose Landmarker
3. Landmark quality checker
4. Pose normalization
5. Feature extraction
6. Temporal smoothing
7. Exercise module
8. Rep phase detector
9. Form feedback rules
10. UI overlay + logging
Each part should be testable independently.
Example directory structure:
formai/
    app/
        webcam_demo.py
        overlay.py
    pose/
        mediapipe_runner.py
        landmark_utils.py
        quality_checks.py
    features/
        angles.py
        normalization.py
        window_features.py
    exercises/
        squat.py
        bicep_curl.py
        pushup.py
    ml/
        train_exercise_classifier.py
        evaluate.py
    data/
        process_fit3d.py
    configs/
        exercises.yaml
6. Pose normalization
Do not feed raw pixel coordinates directly into your classifier.
Raw coordinates depend on:
- camera distance,
- user height,
- frame resolution,
- where the person stands,
- crop/zoom,
- phone orientation.
Normalize landmarks.
Basic normalization:
hip_center = midpoint(left_hip, right_hip)
shoulder_center = midpoint(left_shoulder, right_shoulder)
scale = distance(hip_center, shoulder_center)
normalized_landmark = (landmark - hip_center) / scale
This makes the pose representation less sensitive to body size and camera distance.
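Here is a minimal NumPy sketch of that normalization. It assumes the landmarks arrive as a (33, 2) or (33, 3) array in MediaPipe's pose landmark order (11/12 are the shoulders, 23/24 are the hips). The function name and the `eps` guard are my own additions, not part of any library.

```python
import numpy as np

# Indices in MediaPipe's 33-landmark pose topology.
LEFT_SHOULDER, RIGHT_SHOULDER = 11, 12
LEFT_HIP, RIGHT_HIP = 23, 24


def normalize_landmarks(landmarks, eps: float = 1e-6) -> np.ndarray:
    """Center a pose on the hip midpoint and scale it by torso length.

    landmarks: array-like of shape (33, 2) or (33, 3).
    """
    pts = np.asarray(landmarks, dtype=float)
    hip_center = (pts[LEFT_HIP] + pts[RIGHT_HIP]) / 2.0
    shoulder_center = (pts[LEFT_SHOULDER] + pts[RIGHT_SHOULDER]) / 2.0
    scale = np.linalg.norm(shoulder_center - hip_center)
    if scale < eps:
        # A near-zero torso length usually means the detection is unusable.
        raise ValueError("torso length is close to zero; skip this frame")
    return (pts - hip_center) / scale
```

After this step, the same pose seen closer to the camera or shifted in the frame should give the same array. If you start from MediaPipe's normalized image coordinates, it is probably worth converting x and y to pixels first, because they are scaled by image width and height separately.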
7. Joint-angle features
Joint angles are a very good starting point because they are:
- interpretable,
- fast,
- simple,
- easy to debug,
- easy to explain in a report.
For a joint angle:
A — B — C
The angle is at point B.
Example:
hip → knee → ankle = knee angle
shoulder → elbow → wrist = elbow angle
shoulder → hip → knee = hip angle
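A small sketch of the angle calculation, assuming each point is a 2D or 3D coordinate. The name `joint_angle` is just illustrative; it uses the standard dot-product formula and returns NaN when two points coincide.

```python
import numpy as np


def joint_angle(a, b, c) -> float:
    """Return the angle at point b (in degrees) formed by the segments b-a and b-c."""
    a, b, c = (np.asarray(p, dtype=float) for p in (a, b, c))
    ba, bc = a - b, c - b
    denom = np.linalg.norm(ba) * np.linalg.norm(bc)
    if denom == 0.0:
        # Two landmarks coincide, so the angle is undefined for this frame.
        return float("nan")
    # Clip to guard against tiny floating-point overshoots outside [-1, 1].
    cosine = np.clip(np.dot(ba, bc) / denom, -1.0, 1.0)
    return float(np.degrees(np.arccos(cosine)))
```

For a knee angle you would call it as `joint_angle(hip, knee, ankle)`; a fully straight leg gives a value near 180 and a deep squat gives a much smaller value.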
Squat features
| Feature | Landmarks | Why useful |
|---|---|---|
| Knee angle | hip-knee-ankle | squat depth and phase |
| Hip angle | shoulder-hip-knee | hip hinge / lower-body pattern |
| Torso angle | shoulder center to hip center | forward lean |
| Hip vertical movement | hip center y | depth |
| Left/right knee difference | left knee angle vs right knee angle | asymmetry |
| Knee tracking | knee vs ankle/hip x-position | knee cave, front view only |
Bicep curl features
| Feature | Landmarks | Why useful |
|---|---|---|
| Elbow angle | shoulder-elbow-wrist | rep phase and range |
| Elbow drift | elbow relative to shoulder/torso | upper-arm stability |
| Shoulder motion | shoulder/upper-arm movement | swinging/cheating |
| Wrist path | wrist relative to elbow | curl motion |
| Left/right elbow difference | both elbows | symmetry if two-arm curl |
8. Use temporal windows, not single frames
Exercise is motion. A single frame is often ambiguous.
A squat, lunge, and deadlift can look similar in one frame. A curl midpoint can look like many other arm motions. Use time.
A useful paper here is Real-Time Fitness Exercise Classification and Counting Using a Bidirectional LSTM, which uses temporal pose features over frame sequences. You do not need to implement BiLSTM first, but the principle is important:
use sequences/windows, not isolated frames
Simple temporal-window features:
window = last 30 frames
for each angle:
    mean
    min
    max
    range
    standard deviation
    velocity
For squat:
knee_angle_min
knee_angle_max
knee_angle_range
hip_y_range
torso_angle_max
left_right_knee_difference_mean
For curl:
elbow_angle_min
elbow_angle_max
elbow_angle_range
elbow_velocity_mean
elbow_position_drift
shoulder_angle_range
This gives you motion information without needing a deep temporal model.
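The window statistics could be computed roughly like this. It is a sketch, assuming one angle value per frame and a known frame rate; `window_stats` and the `velocity_mean` definition (mean absolute change in degrees per second) are my own choices.

```python
from collections import deque

import numpy as np


def window_stats(values, fps: float = 30.0) -> dict:
    """Summarize one angle signal over a window of recent frames."""
    v = np.asarray(list(values), dtype=float)
    # Frame-to-frame change, converted to degrees per second.
    velocity = np.diff(v) * fps if v.size > 1 else np.zeros(1)
    return {
        "mean": float(v.mean()),
        "min": float(v.min()),
        "max": float(v.max()),
        "range": float(v.max() - v.min()),
        "std": float(v.std()),
        "velocity_mean": float(np.abs(velocity).mean()),
    }


# Example: keep the last 30 knee angles and prefix the feature names.
knee_window = deque(maxlen=30)
for angle in [170, 150, 130, 150, 170]:
    knee_window.append(angle)
features = {f"knee_angle_{k}": val for k, val in window_stats(knee_window, fps=10).items()}
```

Running the same function over each angle gives you one flat feature row per window, which is enough for Random Forest, SVM, or k-NN baselines.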
9. Repetition counting: use a state machine
Do not count reps like this:
if knee_angle < 100:
    count += 1
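That will overcount: the condition is true on every frame while the knee stays below the threshold, so a single squat can add dozens of counts.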
Use a state machine.
Squat state machine
standing
→ descending
→ bottom
→ ascending
→ standing
Simple version:
if state == "standing" and knee_angle < down_threshold:
    state = "bottom"
if state == "bottom" and knee_angle > up_threshold:
    count += 1
    state = "standing"
Better version:
require threshold for several frames
require valid landmarks
require minimum time between reps
ignore low-quality frames
smooth angles before decisions
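One way to combine those requirements is sketched below. It assumes one angle per frame plus a validity flag from your quality checks. The class name, the threshold values, and the smoothing factor are placeholders to tune, not recommended settings, and the smoothing here is a simple exponential moving average.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RepCounter:
    """Two-state rep counter with smoothing, hold frames, and a minimum rep gap."""
    down_threshold: float = 100.0  # placeholder: below this counts as "bottom"
    up_threshold: float = 160.0    # placeholder: above this counts as "standing"
    hold_frames: int = 3           # threshold must hold for this many frames
    min_rep_frames: int = 15       # minimum frames between two counted reps
    alpha: float = 0.4             # exponential smoothing factor
    state: str = "standing"
    count: int = 0
    smoothed: Optional[float] = None
    streak: int = 0
    frames_since_rep: int = 10**9

    def update(self, angle: float, valid: bool = True) -> int:
        self.frames_since_rep += 1
        if not valid:
            # Ignore low-quality frames and restart the hold streak.
            self.streak = 0
            return self.count
        if self.smoothed is None:
            self.smoothed = angle
        else:
            self.smoothed = self.alpha * angle + (1 - self.alpha) * self.smoothed
        if self.state == "standing":
            crossed = self.smoothed < self.down_threshold
        else:
            crossed = self.smoothed > self.up_threshold
        self.streak = self.streak + 1 if crossed else 0
        if self.streak < self.hold_frames:
            return self.count
        self.streak = 0
        if self.state == "standing":
            self.state = "bottom"
        elif self.frames_since_rep >= self.min_rep_frames:
            self.state = "standing"
            self.count += 1
            self.frames_since_rep = 0
        return self.count
```

Because the two thresholds are far apart, jitter around either one cannot toggle the state back and forth. The same pattern should work for any angle that dips and returns, so a second instance with different thresholds can serve another exercise.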
Bicep curl state machine
extended
→ curling_up
→ top
→ lowering
→ extended
Example thresholds:
extended: elbow_angle > 150°
top: elbow_angle < 60°
rep: extended → top → extended
Do not treat those numbers as universal. Use them as starting points and tune them from your videos.
10. Form feedback should be rep-level, not frame-level
Frame-level feedback is noisy.
Bad:
Frame 101: torso bad
Frame 102: torso okay
Frame 103: torso bad
Frame 104: torso okay
Better:
Rep 3:
- depth: too shallow
- torso lean: acceptable
- symmetry: acceptable
Feedback:
"Rep 3: try to squat slightly deeper."
Recommended feedback behavior:
During the rep:
show light live cues
After the rep:
show one main correction
Use cooldowns:
Do not repeat the same feedback every frame.
Only show a warning if the issue persists for N frames or appears in a significant part of the rep.
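A rough sketch of rep-level aggregation for the squat case. It assumes each frame of the rep has already been tagged with a set of issue names by your rules. The issue names, the priority order, and the 30% persistence fraction are all hypothetical choices for illustration.

```python
from collections import Counter

# Hypothetical issue names mapped to user-facing messages.
MESSAGES = {
    "too_shallow": "Try to squat deeper.",
    "torso_lean": "Keep your chest more upright.",
    "asymmetry": "Try to move both legs evenly.",
}


def summarize_rep(frame_issues, min_knee_angle, depth_threshold=100.0, min_fraction=0.3):
    """Return one main correction for a finished rep, or None if it looked fine.

    frame_issues: list with one set of issue names per frame of the rep.
    min_knee_angle: smallest knee angle reached during the rep, in degrees.
    """
    issues = []
    # Depth is a property of the whole rep, so it is checked once.
    if min_knee_angle > depth_threshold:
        issues.append("too_shallow")
    counts = Counter(name for frame in frame_issues for name in frame)
    n_frames = max(len(frame_issues), 1)
    # Per-frame issues only count if they persist for a share of the rep.
    for name in ("torso_lean", "asymmetry"):
        if counts[name] / n_frames >= min_fraction:
            issues.append(name)
    return MESSAGES[issues[0]] if issues else None
```

A torso-lean flag that flickers on for one frame out of ten is dropped, while one that holds for half the rep produces a single message after the rep ends.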
11. Suggested rules
Squat rules
Use side view first.
Required landmarks:
shoulders
hips
knees
ankles
Main rep signal:
average knee angle
or hip height relative to knee
Rules:
| Rule | Signal | Feedback |
|---|---|---|
| Not deep enough | min knee angle or hip/knee height | “Try to squat deeper.” |
| Torso leaning too far | torso angle during descent/bottom | “Keep your chest more upright.” |
| Left/right asymmetry | difference between left and right knee angles | “Try to move both legs evenly.” |
| Knee cave | knee position relative to ankle/hip line | “Avoid letting your knees collapse inward.” |
Important: knee-cave detection is mainly a front-view problem. Squat depth and torso lean are mainly side-view problems. Do not claim all squat errors can be detected from any camera angle.
Bicep curl rules
Required landmarks:
shoulder
elbow
wrist
hip/torso reference
Main rep signal:
elbow angle
Rules:
| Rule | Signal | Feedback |
|---|---|---|
| Incomplete curl | min elbow angle too large | “Curl higher.” |
| Incomplete extension | max elbow angle too small | “Extend your arm more at the bottom.” |
| Elbow drift | elbow moves relative to shoulder/torso | “Keep your elbow stable.” |
| Shoulder swing | shoulder/upper arm moves too much | “Avoid swinging your shoulder.” |
12. Camera-view constraints are not optional
A phone camera cannot reliably detect every form issue from every angle.
| Issue | Best camera view |
|---|---|
| Squat depth | side view |
| Squat torso lean | side view |
| Knees caving inward | front view |
| Bicep curl elbow drift | side or front upper-body view |
| Bicep curl shoulder swing | side view |
| Push-up hip sag | side view |
| Push-up elbow flare | front/diagonal view |
Your app should guide the user:
"For squat depth and torso analysis, place the camera to your side."
"For knee-cave analysis, use a front view."
"For bicep curls, keep shoulder, elbow, and wrist visible."
This makes the system more honest and more reliable.
13. Landmark quality checks
Before giving form feedback, check that the pose is usable.
Quality checks:
one person detected
required landmarks visible
full body inside frame
landmarks inside image bounds
limb lengths reasonable
angle changes not physically impossible
landmarks stable for several frames
camera view suitable for selected exercise
If the input is bad, do not say “bad form.” Say:
"Move farther from the camera."
"Make sure your full body is visible."
"Improve lighting."
"Use side view for squat analysis."
"Only one person should be in frame."
Useful MediaPipe issue links for real-world pitfalls:
- Pose landmark jitter issue
- Landmark visibility/presence discussion
- Occlusion / hallucinated landmarks issue
- Pose accuracy issues with non-standing / rotated poses
- MediaPipe Web synchronous detect/detectForVideo performance issue
Takeaway: a robust app needs input-quality warnings, not only form warnings.
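A minimal sketch of two of those checks (required landmarks visible, landmarks inside the image). It assumes you have already converted each landmark to an `(x, y, visibility)` tuple with x and y normalized to [0, 1]. The `REQUIRED` table, the 0.5 visibility cutoff, and the message chosen for each failure are assumptions to adjust.

```python
# MediaPipe pose indices: shoulders 11/12, elbows 13/14, wrists 15/16,
# hips 23/24, knees 25/26, ankles 27/28.
REQUIRED = {
    "squat": [11, 12, 23, 24, 25, 26, 27, 28],
    "bicep_curl": [11, 12, 13, 14, 15, 16, 23, 24],
}


def check_quality(landmarks, exercise, min_visibility=0.5):
    """Return (ok, message). The message is a user-facing hint when ok is False."""
    if landmarks is None or len(landmarks) < 33:
        return False, "Make sure your full body is visible."
    for idx in REQUIRED[exercise]:
        x, y, visibility = landmarks[idx]
        if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
            # A required joint falls outside the frame.
            return False, "Move farther from the camera."
        if visibility < min_visibility:
            return False, "Make sure your full body is visible."
    return True, ""
```

Run this before the form rules and skip feedback for any frame where `ok` is False, so a hidden ankle produces a camera hint instead of a form warning.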
14. Evaluation: avoid fake accuracy
Do not randomly split frames.
Bad:
frame 1 from video A → train
frame 2 from video A → test
frame 3 from video A → train
frame 4 from video A → test
This leaks information because neighboring frames are nearly identical.
Better:
recording A → train
recording B → test
Best:
subject A/B/C → train
subject D → test
Read:
- scikit-learn common pitfalls: data leakage
- scikit-learn GroupShuffleSplit
- scikit-learn getting started
Use GroupShuffleSplit or GroupKFold with:
group = subject_id
or:
group = recording_id
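A short sketch of a subject-level split with scikit-learn's `GroupShuffleSplit`. The feature matrix, labels, and subject IDs below are made-up toy data; in your project they would come from your features CSV.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy data: 12 feature windows from 4 subjects, 3 windows each.
X = np.random.default_rng(0).random((12, 4))
y = np.array([0, 1] * 6)
groups = np.repeat(["s01", "s02", "s03", "s04"], 3)

# Hold out 25% of the subjects (not 25% of the rows).
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

# No subject should appear on both sides of the split.
shared = set(groups[train_idx]) & set(groups[test_idx])
```

Checking that `shared` is empty is a cheap sanity test to keep in your evaluation script.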
Report multiple metrics:
| Component | Metric |
|---|---|
| Exercise classifier | accuracy, macro F1, confusion matrix |
| Rep counter | absolute count error |
| Form rules | manual agreement on selected clips |
| Runtime | FPS, average latency |
| Robustness | failure cases by lighting/camera/occlusion |
Do not only report:
accuracy = 98%
Report:
exercise classifier macro F1
rep-counting error
FPS
failure cases
That will make your report much more credible.
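For the rep-counting and runtime numbers, something as simple as the sketch below may be enough. Both function names are hypothetical. The first assumes one predicted and one hand-counted rep total per clip; the second assumes you logged a timestamp in seconds for each processed frame.

```python
def rep_count_report(predicted, actual):
    """Compare predicted and hand-counted rep totals, one pair per clip."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need one predicted and one actual count per clip")
    errors = [abs(p - a) for p, a in zip(predicted, actual)]
    return {
        "mean_abs_error": sum(errors) / len(errors),
        "exact_match_rate": sum(e == 0 for e in errors) / len(errors),
    }


def frame_timing_report(timestamps):
    """Summarize frame timing from per-frame timestamps given in seconds."""
    intervals = [b - a for a, b in zip(timestamps, timestamps[1:])]
    if not intervals:
        raise ValueError("need at least two timestamps")
    mean_interval = sum(intervals) / len(intervals)
    return {
        "fps": 1.0 / mean_interval,
        "mean_frame_ms": 1000.0 * mean_interval,
        "worst_frame_ms": 1000.0 * max(intervals),
    }
```

Reporting the worst frame time next to the average helps show stalls that an average FPS number hides.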
15. Similar projects worth studying
Official / high-value guides
- MediaPipe pose classification and repetition counting guide: Very relevant. Shows pose classification and repetition counting with push-ups/squats using k-NN.
- ML Kit pose classification guide: Useful mobile-oriented explanation of pose classification and rep counting.
- Build an AI Fitness Trainer Using MediaPipe for Squat Analysis: Practical squat-focused MediaPipe example with feedback logic.
Research references
- AIFit: Automatic 3D Human-Interpretable Feedback Models for Fitness Training. Closest research-grade version of your idea: 3D pose, rep segmentation, trainer/trainee comparison, interpretable feedback.
- AIFit PDF
- Real-Time Fitness Exercise Classification and Counting Using BiLSTM: Good for the idea that temporal windows matter.
- BlazePose paper: Background on the pose-estimation model family behind MediaPipe Pose.
- BlazePose GHUM Holistic: Useful for 3D landmark / on-device pose-estimation background.
GitHub projects
- ExercisePoseCorrection: Push-ups, squats, and bicep curls with real-time form feedback.
- AI Push-Up Trainer: Good example of using a state machine before counting reps.
- Deadlift posture-correction system: Useful example of stage-based feedback: setup, lifting, lockout.
- Pose Estimation for Fitness Exercise Analysis: Uses MediaPipe + scikit-learn for exercise phase classification, rep counting, and quality assessment.
- Exercise-Correction: Useful for stage-dependent exercise error logic.
- Workout-Trainer: Good example of exercise-specific metrics such as elbow flexion, shoulder drift, squat depth, and chest angle.
Use GitHub projects as engineering references, not as proof that the problem is solved. Many hobby/student projects have weak evaluation.
16. Recommended tech stack
Fastest prototype
Python
OpenCV
MediaPipe
NumPy
Pandas
scikit-learn
Matplotlib
Joblib
Use this for:
video reading
pose extraction
angle calculation
CSV generation
model training
evaluation plots
debugging
Optional web demo
JavaScript / TypeScript
@mediapipe/tasks-vision
HTML Canvas
Web Worker
Relevant links:
- MediaPipe Web Pose Landmarker guide
- MediaPipe samples web repo
- Pose Landmarker worker sample
The web version is good if you want a phone-browser demo, but be careful with performance. Pose detection in the browser can block the main thread unless you throttle it or run it in a worker.
Optional Android demo
Kotlin or Java
MediaPipe Tasks Android
CameraX
Canvas overlay
Only do Android if someone on your team already knows Android.
17. Suggested one-month plan
Week 1: Pose pipeline
Goal:
camera/video
→ MediaPipe
→ landmarks
→ joint angles
Deliverables:
- webcam or video input,
- pose overlay,
- angle calculation,
- CSV export,
- basic landmark quality checks.
Do not start with a complex model yet.
Week 2: One exercise end-to-end
Goal:
squat works from camera to feedback
Deliverables:
- squat rep counter,
- squat state machine,
- 2–3 squat feedback rules,
- angle smoothing,
- after-rep feedback.
At the end of Week 2, you should already have a demo.
Week 3: Second exercise + classifier
Goal:
bicep curl support + exercise classifier baseline
Deliverables:
- curl rep counter,
- curl feedback rules,
- Fit3D subset processing,
- exercise classifier baseline,
- confusion matrix.
Start with:
Random Forest
SVM
k-NN
Logistic Regression
Do not start with LSTM unless the simple pipeline is already working.
Week 4: Integration and polish
Goal:
stable final demo + honest evaluation
Deliverables:
- clean UI,
- final demo video,
- FPS measurement,
- evaluation metrics,
- failure cases,
- report,
- presentation.
18. Team division
With three teammates, divide the work like this.
Teammate 1: Real-time pipeline
Responsibilities:
camera input
MediaPipe setup
landmark drawing
FPS/latency measurement
UI overlay
Deliverables:
live skeleton demo
real-time angle display
recorded demo video
Teammate 2: Dataset and ML
Responsibilities:
Fit3D subset
landmark extraction
feature CSV
exercise classifier
train/test split
evaluation
Deliverables:
features.csv
trained classifier
classification report
confusion matrix
Teammate 3: Rep counting and feedback
Responsibilities:
angle logic
state machines
squat rules
curl rules
feedback messages
failure-case documentation
Deliverables:
rep counter
form analyzer
feedback engine
rule documentation
This gives everyone a clear subsystem.
19. What your final report should say
Avoid saying only:
We used MediaPipe and trained a classifier.
Say something like:
We built a modular real-time exercise-form analysis pipeline. MediaPipe Pose Landmarker was used to extract body landmarks from camera/video input. We normalized landmarks, computed interpretable joint-angle features, smoothed temporal signals, counted repetitions with exercise-specific state machines, and generated corrective feedback from phase-aware rules. We evaluated exercise classification with a subject/video-level split and measured rep-counting accuracy, runtime FPS, and common failure cases.
Suggested report structure:
1. Introduction
   - problem
   - motivation
   - goal
2. Background
   - human pose estimation
   - MediaPipe Pose
   - exercise classification
   - form feedback
3. Dataset
   - Fit3D overview
   - selected exercises
   - preprocessing
   - train/test split
4. Method
   - pose extraction
   - landmark normalization
   - joint-angle features
   - temporal smoothing
   - rep state machine
   - form-feedback rules
   - exercise classifier
5. Implementation
   - real-time pipeline
   - UI
   - performance considerations
6. Experiments
   - exercise classification results
   - rep-counting results
   - runtime FPS
   - failure cases
7. Discussion
   - limitations
   - camera-view constraints
   - dataset limitations
   - future work
8. Conclusion
20. Pitfalls to avoid
Pitfall 1: Too many exercises
Bad:
We support 12 exercises.
Better:
We support 2 exercises robustly and explainably.
Pitfall 2: Binary correct/incorrect form
Bad:
The model says correct or incorrect.
Better:
The system detects specific issues:
- shallow squat
- excessive torso lean
- elbow drift
- partial curl
Pitfall 3: Frame-level random split
Bad:
random train_test_split over all frames
Better:
split by subject_id or recording_id
Pitfall 4: No camera setup
Bad:
Analyze from any camera angle.
Better:
Use side view for squat depth and torso lean.
Use front/side upper-body view for bicep curls.
Pitfall 5: No smoothing
Bad:
one-frame warning
Better:
warning only after the condition persists across several frames or across a rep
Pitfall 6: Overusing z-depth
Bad:
precise 3D biomechanics from one phone camera
Better:
2D angle features with constrained camera view; optional 3D/world features for experiments
Pitfall 7: Overclaiming safety
Avoid:
prevents injuries
guarantees safe form
replaces a trainer
Say:
provides basic real-time feedback on visible form deviations
21. My final recommended FormAI MVP
Input:
- phone/webcam video
Pose:
- MediaPipe Pose Landmarker
Exercises:
- squat
- bicep curl
Features:
- normalized landmarks
- joint angles
- temporal window statistics
Exercise recognition:
- manual exercise selection for demo
- optional Random Forest/k-NN/SVM classifier for experiment
Rep counting:
- state-machine based
Feedback:
- squat depth
- squat torso lean
- squat asymmetry
- curl range of motion
- curl elbow drift
- curl shoulder swing
Evaluation:
- Fit3D subset
- subject/video-level split
- confusion matrix
- rep-counting error
- FPS
- failure cases
Short summary
- The project is feasible if you narrow the scope.
- Use MediaPipe for pose landmarks, not for the whole coaching logic.
- Use Fit3D for exercise videos, rep intervals, and offline experiments.
- Train on MediaPipe-extracted landmarks from Fit3D videos, not only clean ground-truth skeletons.
- Use ML for exercise recognition or phase recognition.
- Use rules for form feedback first.
- Use state machines for rep counting.
- Use temporal smoothing and rep-level feedback.
- Split train/test by subject or recording, not by frame.
- Build squat + bicep curl well before adding anything else.
- Make the final claim modest: FormAI gives basic real-time feedback on visible form deviations; it does not replace a trainer or guarantee injury prevention.