Is AI Really Necessary in Fitness App Development Today?
Same here. I use a cheap consumer wearable almost every day, mostly for sleep duration, rough activity trends, and a quick look at basic vitals. It is genuinely useful for that. But I almost never use the “AI coach” or “AI insights” part, because most of the time I do not need a philosophical interpretation of how many hours I slept. I just want to know: did I sleep enough, did I move enough, and is anything obviously different from my usual baseline?
I discussed this with ChatGPT for a while, and my current conclusion is something like this:
I do not think the interesting question is simply:
“Is AI necessary in fitness apps?”
A better question may be:
“What would an honest AI system for fitness look like, given the current limits of consumer wearables, sports science, behavior change science, and everyday life?”
Phrased that way, I become much less excited about the usual “AI personal trainer” pitch and more interested in a much more boring kind of AI.
Maybe the useful version is not an “AI fitness coach.”
Maybe it is an honest fitness data assistant.
1. Consumer wearable data is useful, but it is not clinical-grade truth
Cheap wearables are useful. I do not want to dismiss them.
A smartwatch or ring can be good enough for rough sleep duration, step count, resting heart rate trends, activity reminders, and long-term self-monitoring. That alone can be valuable.
But we should not confuse this with data collected under medical or research conditions.
Consumer devices are usually working with:
- cheap, low-power sensors
- imperfect skin contact
- motion artifacts
- missing data
- battery gaps
- non-wear periods
- proprietary algorithms
- device firmware/app updates
- uncertain ground truth
- user behavior that is not protocol-controlled
There is a large literature on this already. For example, Fuller et al. 2020 reviewed commercial wearables for step count, heart rate, and energy expenditure. The rough picture is: step count and heart rate can be useful in many settings, but accuracy varies, and energy expenditure is much weaker.
A newer broad review, “Keeping Pace with Wearables” (Doherty et al. 2024), makes the same general point at a larger scale: consumer wearables are promising, but accuracy depends on the metric, the device, the population, and the context.
Sleep is another good example. A 2024 systematic review comparing Fitbit Charge 4, Garmin Vivosmart 4, and WHOOP against polysomnography found that these devices can be useful, but sleep-stage estimation still has real limits: Schyvens et al. 2024.
So I think the first rule for fitness AI should be:
Do not treat consumer wearable data as if it were a clinical measurement.
For casual self-monitoring, it is often useful. For strong personalized coaching, it is much weaker.
2. “Garbage in, garbage out” is a serious problem here
The usual AI hype assumes that if we feed enough wearable data into an AI model, the model will become a good coach.
But in fitness, the input data may not contain the thing we actually want to know.
A wrist wearable can measure or estimate things like:
- steps
- heart rate
- heart rate variability
- sleep duration
- sleep stages
- activity type
- GPS pace/distance
- calories burned
- maybe SpO2
- maybe skin temperature
But for serious fitness decisions, we often want something closer to:
- actual muscle tension
- tendon load
- joint stress
- local tissue fatigue
- mechanical load per muscle group
- movement quality
- pain risk
- recovery capacity
- technique degradation
- whether the current stimulus is appropriate for this person
Those are much harder to measure.
A cheap wrist device cannot really tell how much load your quadriceps tendon took during squats, whether your lower back was compensating, whether a muscle group was close to failure, or whether a movement pattern increased injury risk.
There are research tools for parts of this: EMG, NIRS, force plates, motion capture, instrumented treadmills, bar velocity trackers, force sensors, lab-grade metabolic carts, and so on. But that is not the same as a cheap everyday wearable.
So the issue is not just that the AI is not smart enough. The issue is also physical:
The sensor layer is often too far away from the real fitness variable we care about.
That is why I think the “AI coach” claim often jumps too far ahead of the measurement stack.
3. Fitness is harder than some other wearable use cases
Wearables can be more useful in cases where the measured signal is closer to the target.
For example, trend monitoring of SpO2 in some respiratory contexts may be useful as a warning signal or as part of a care plan, though of course not as a replacement for medical judgment. In that kind of case, even if the absolute value is imperfect, a trend or drop may still be meaningful enough to investigate.
But fitness is different.
Fitness advice usually requires a chain like this:
sensor signal
-> inferred metric
-> physiological interpretation
-> training interpretation
-> user-specific context
-> safe recommendation
-> behavior change
-> long-term adaptation
Every arrow in that chain is hard.
For a cheap wearable, the system may know that my heart rate increased. But why?
- exercise intensity?
- heat?
- poor sleep?
- caffeine?
- stress?
- dehydration?
- illness?
- anxiety?
- sensor artifact?
- loose watch strap?
- different workout type?
- unusually high fatigue?
And even if the system correctly detects “more strain,” what should it recommend?
- rest?
- walk?
- lift lighter?
- do mobility work?
- sleep earlier?
- eat more?
- do nothing?
- ask a clinician?
- ignore the signal because the data quality is poor?
That is not simply an AI problem. It is a measurement problem, a physiology problem, a behavior problem, and a daily-life-context problem.
4. The missing data problem is not random
Another problem is that consumer wearable data is not reliably available around the clock, every day of the year.
People remove devices to charge them, to shower, to sleep, to exercise, or to travel, and also when they feel sick, are irritated by the strap, or are simply tired of tracking. And those missing periods may be exactly the important periods.
Missing data is not just an empty hole. It can be informative.
Maybe the user removed the device because:
- they slept badly
- they were sick
- they were stressed
- they were traveling
- the device was uncomfortable
- they did a sport where the watch was annoying
- they did not want to look at the data
- they forgot to charge it because life was chaotic
There is literature on this too. See, for example, Cho et al. 2021 on person-generated wearable data quality, and Van Der Donckt et al. 2024 on real-world wearable data problems such as non-wear, missing data, and artifacts.
An honest fitness AI should not just smooth over missing data and continue speaking confidently.
It should sometimes say:
“The data is too incomplete today, so I will not make a training recommendation.”
That is boring. But it is honest.
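A minimal sketch of that refusal, assuming the device exports one heart-rate sample per minute with `None` where nothing was recorded. The function names and the 70% coverage cutoff are hypothetical placeholders, not values taken from any vendor or paper.

```python
from typing import Optional, Sequence

# Hypothetical cutoff: below this fraction of present samples, give no advice.
MIN_COVERAGE = 0.7


def coverage(samples: Sequence[Optional[float]]) -> float:
    """Fraction of expected samples that are actually present."""
    if not samples:
        return 0.0
    present = sum(1 for s in samples if s is not None)
    return present / len(samples)


def longest_gap(samples: Sequence[Optional[float]]) -> int:
    """Length of the longest run of consecutive missing samples."""
    longest = current = 0
    for s in samples:
        current = current + 1 if s is None else 0
        longest = max(longest, current)
    return longest


def daily_message(samples: Sequence[Optional[float]]) -> str:
    """Refuse to advise when the day's data is too incomplete."""
    cov = coverage(samples)
    if cov < MIN_COVERAGE:
        return (
            f"Only {cov:.0%} of today's data is present "
            f"(longest gap: {longest_gap(samples)} min), "
            "so I will not make a training recommendation."
        )
    return "Data coverage is sufficient for a cautious summary."
```

The sketch reports the gap instead of imputing over it, so the missingness stays visible to the user rather than being smoothed away.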
5. The useful AI may be a pipeline, not one magic LLM
I think a serious version of this should not be:
wearable data -> LLM -> advice
That is too dangerous and too vague.
A more honest architecture would be something like:
wearable data
-> data quality check
-> non-wear / artifact / missingness detection
-> metric-specific reliability score
-> personal baseline comparison
-> context check
-> risk/scope check
-> intervention policy
-> natural language explanation
-> user correction / feedback
This is why I like the idea of a “fitness AI pipeline” more than a single “AI coach model.”
The DACIA framework for digital biomarkers is relevant here. It breaks the path from wearable sensor data to useful action into stages: Data, Aggregation, Contextualization, Interpretation, and Actions.
That seems much closer to what a serious fitness AI needs.
The LLM should probably be near the end of the pipeline, not at the beginning. Its job should be to explain checked information in a clear, cautious, human-readable way — not to magically infer the user’s body state from noisy sensor data.
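The staged idea in this section can be sketched roughly as follows. This is a reduced illustration under stated assumptions: only three of the stages are shown, every name and threshold is hypothetical, and a plain string template stands in for the LLM. The scope check is run first here so that a symptom report is never masked by a data-quality stop, which is a choice of this sketch rather than something the pipeline above prescribes.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class DayState:
    """What the pipeline knows about one day. All fields are illustrative."""
    sleep_hours: Optional[float]
    usual_sleep_hours: float
    worn_fraction: float
    reports_symptoms: bool = False
    findings: List[str] = field(default_factory=list)
    stop_reason: Optional[str] = None


Stage = Callable[[DayState], None]


def scope_check(s: DayState) -> None:
    if s.reports_symptoms:
        s.stop_reason = "possible medical issue; suggest seeking medical advice"


def quality_check(s: DayState) -> None:
    if s.sleep_hours is None or s.worn_fraction < 0.7:  # hypothetical cutoff
        s.stop_reason = "data too incomplete to interpret"


def baseline_comparison(s: DayState) -> None:
    diff = s.sleep_hours - s.usual_sleep_hours
    if abs(diff) >= 1.0:  # hypothetical threshold, in hours
        direction = "shorter" if diff < 0 else "longer"
        s.findings.append(f"sleep was {abs(diff):.1f} h {direction} than usual")


def explain(s: DayState) -> str:
    """Last step. An LLM here would only rephrase already-checked findings."""
    if s.stop_reason:
        return f"No recommendation today: {s.stop_reason}."
    if not s.findings:
        return "Nothing notable compared with your usual baseline."
    return "Observed: " + "; ".join(s.findings) + ". A lighter day is one option."


def run(s: DayState, stages: List[Stage]) -> str:
    for stage in stages:
        stage(s)
        if s.stop_reason:  # any stage can halt before advice is generated
            break
    return explain(s)


STAGES: List[Stage] = [scope_check, quality_check, baseline_comparison]
```

For example, `run(DayState(6.0, 7.5, 0.95), STAGES)` yields an observation plus one low-risk option, while `run(DayState(None, 7.5, 0.95), STAGES)` stops at the quality check and never reaches the explanation of findings.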
6. The AI should explain its reasoning in small, readable pieces
In everyday life, one habit of honest people is giving a simple reason that the other person can understand.
For example, not this:
“You are under-recovered. Take a rest day.”
But this:
“Your sleep duration was shorter than your usual baseline, and your resting heart rate is slightly higher than usual. However, the device cannot measure actual muscle or joint load. If you also feel tired, a lighter day or rest day is reasonable.”
That is much better.
A good output format might be:
Conclusion:
Light exercise or rest may be reasonable today.
Why:
Sleep was shorter than your usual baseline, and resting heart rate is slightly elevated.
Limits:
This device does not measure actual muscle load, joint stress, pain, or subjective fatigue.
Options:
Rest, take a short walk, or do the planned workout at lower intensity.
This gives the user something to inspect.
The user can say:
“Actually, I feel fine today.”
or:
“The heart rate was probably high because I had coffee.”
or:
“Yes, I am tired, I will take it easy.”
That is important. The AI should not override the user’s bodily experience. It should help the user reason.
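The Conclusion / Why / Limits / Options format above could be enforced in code rather than left to the wording of a prompt. A minimal sketch, assuming a hypothetical `Advice` type that simply refuses to exist without a reason and a stated limit:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Advice:
    conclusion: str      # what the app suggests
    why: List[str]       # observations the suggestion rests on
    limits: List[str]    # what the device cannot measure
    options: List[str]   # low-risk choices left to the user

    def __post_init__(self) -> None:
        # Refuse to build advice that hides its basis or its limits.
        if not self.why or not self.limits:
            raise ValueError("advice needs at least one reason and one stated limit")

    def render(self) -> str:
        """Plain-text layout matching the four headings used in this post."""
        return "\n".join([
            "Conclusion:", self.conclusion,
            "Why:", " ".join(self.why),
            "Limits:", " ".join(self.limits),
            "Options:", ", ".join(self.options) + ".",
        ])
```

Because each part is a separate field, the user (or a later correction step) can dispute one reason, such as the elevated heart rate, without throwing away the rest.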
This also connects to general guidance around transparency and explainability in health AI. The WHO guidance on ethics and governance of AI for health emphasizes transparency, explainability, human autonomy, safety, and accountability. The FDA / Health Canada / MHRA principles on transparency for machine-learning-enabled medical devices also emphasize communicating information that affects risk, outcomes, context of use, and user understanding.
Fitness apps are not always medical devices, but the same spirit matters when an app starts giving health-related advice.
7. Maybe the real value is not “do more,” but “worry less”
Many fitness apps implicitly push users to do more:
- walk more
- train more
- close the rings
- improve the score
- optimize sleep
- optimize recovery
- optimize calories
- optimize everything
But humans cannot optimize everything. Daily life is already full.
A useful AI might do the opposite:
- ignore this metric
- do not worry about sleep stages
- calories burned is too noisy to use as a precise target
- your weekly activity trend is enough
- today’s data quality is poor, so do not over-interpret it
- you already trained enough this week
- rest is a valid option
- if you are tired, do less
- do not add another habit right now
This is less glamorous than an AI coach, but probably more useful.
There is also a known concern around over-tracking. The term orthosomnia was proposed (Baron et al. 2017) for cases where people become overly concerned with sleep tracker data and “perfect” sleep scores. Even if that is a specific sleep-tracking concept, the broader warning applies to fitness data too:
More measurement does not automatically mean better self-regulation.
Sometimes the best AI feature is to reduce the number of things the user has to think about.
8. Fitness advice is behavior-change design, not just recommendation
Even if the AI gives scientifically reasonable advice, it may still be useless if the user cannot act on it.
“Go for a 30-minute run” may be good advice in the abstract. But maybe the user is:
- on a train
- at work
- exhausted
- wearing the wrong clothes
- caring for family
- in pain
- sleep deprived
- in bad weather
- socially unable to exercise in that context
- mentally overloaded
This is where behavior-change frameworks matter.
The COM-B model says behavior depends on Capability, Opportunity, and Motivation. Fitness apps often over-focus on Motivation, but Opportunity may be the real bottleneck: time, place, equipment, social context, and energy.
The Behavior Change Technique Taxonomy is also relevant because it reminds us that “advice” is not one thing. Goal-setting, self-monitoring, feedback, prompts, action planning, social support, and habit formation are different intervention components.
So a good fitness AI should not merely ask:
“What is the optimal workout?”
It should ask:
“Is this intervention actually possible for this person today?”
That is a much harder and more honest question.
9. JITAI is close to the real problem, but hard
The idea of Just-in-Time Adaptive Interventions is very relevant: give the right kind and amount of support, at the right time, based on the person’s changing state and context.
That sounds exactly like what people want from AI fitness coaching.
But it is hard.
A system has to know:
- when to intervene
- when not to intervene
- what signal is reliable
- what signal is not reliable
- what the user can actually do now
- whether the intervention will help
- whether the intervention will annoy the user
- whether the intervention might increase anxiety
- whether the system should recommend action, rest, or silence
So I think consumer fitness AI should start with something weaker than a full JITAI:
detect obvious deviations, give low-risk suggestions, and stay silent when uncertain.
That may sound modest, but it is already useful.
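As a rough sketch of "detect obvious deviations, give low-risk suggestions, and stay silent when uncertain": the trust weights below are placeholders ordered to match the earlier point that steps and heart rate are more usable than sleep stages or calories. They are not measured accuracies, and the thresholds are equally hypothetical.

```python
from enum import Enum
from typing import Dict


class Action(Enum):
    SILENT = "say nothing"
    LOW_RISK_SUGGESTION = "offer a low-risk option"


# Hypothetical trust weights in [0, 1]; only their ordering reflects the post.
METRIC_TRUST: Dict[str, float] = {
    "steps": 0.8,
    "resting_hr": 0.8,
    "sleep_duration": 0.7,
    "sleep_stages": 0.4,
    "calories": 0.3,
}


def decide(metric: str, deviation_z: float, worn_fraction: float,
           z_threshold: float = 2.0, min_reliability: float = 0.5) -> Action:
    """Speak only when the deviation is obvious AND the signal is trustworthy."""
    # Unknown metrics get zero trust, so the default is silence.
    reliability = METRIC_TRUST.get(metric, 0.0) * worn_fraction
    if reliability < min_reliability:
        return Action.SILENT   # uncertain data: stay silent
    if abs(deviation_z) < z_threshold:
        return Action.SILENT   # nothing obviously different
    return Action.LOW_RISK_SUGGESTION
```

Note that silence is the outcome of two of the three branches, and there is deliberately no branch that produces a strong workout prescription.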
10. I would define “honest fitness AI” like this
An honest fitness AI should:
- Check data quality first. If the device was not worn properly, if the data is missing, or if the metric is unreliable, weaken or suppress the advice.
- Prefer personal trends over universal claims. “Compared with your usual baseline” is often safer than “your sleep is bad” or “your recovery is low.”
- Separate observation, inference, and advice. “Sleep was shorter” is different from “you are under-recovered,” which is different from “you should rest.”
- Explain the basis briefly. Every recommendation should have a small “why” and a small “limits” section.
- Never pretend to measure what it cannot measure. A wrist wearable usually does not measure actual muscle load, tendon stress, joint load, or injury risk directly.
- Use low-risk suggestions by default. Rest, short walks, reducing intensity, checking subjective fatigue, or simply logging data are safer than strong workout prescriptions.
- Respect user context and agency. The user may know something the device does not: pain, mood, stress, schedule, caffeine, illness, family obligations, or simply “today is not the day.”
- Know when to shut up. Silence is a feature. Not every data point needs interpretation.
- Be transparent about scientific limits. The system should not exceed the current state of measurement science, exercise science, or behavior-change evidence.
- Escalate instead of pretending. If the issue sounds medical (chest pain, fainting, severe breathlessness, unusual symptoms), the app should not “coach.” It should advise seeking appropriate medical help.
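The “prefer personal trends” rule in the list above can be sketched with a robust baseline. This is one possible approach, not something the cited references prescribe: it uses the median and the median absolute deviation (the standard modified z-score, with its usual 0.6745 constant) so that a few artifact-laden days do not distort the baseline. The 14-day minimum and the cutoff of 2.0 are hypothetical.

```python
from statistics import median
from typing import Optional, Sequence


def robust_z(today: float, history: Sequence[float],
             min_days: int = 14) -> Optional[float]:
    """Modified z-score of today's value against the user's own history.

    Returns None when there is too little history or no spread to compare to.
    """
    if len(history) < min_days:
        return None
    med = median(history)
    mad = median(abs(x - med) for x in history)
    if mad == 0:
        return None
    return 0.6745 * (today - med) / mad


def describe(metric: str, today: float, history: Sequence[float]) -> str:
    """State an observation only; inference and advice belong to later steps."""
    z = robust_z(today, history)
    if z is None:
        return f"Not enough history to compare {metric} with your baseline."
    if abs(z) < 2.0:  # hypothetical cutoff
        return f"{metric} is within your usual range."
    direction = "higher" if z > 0 else "lower"
    return f"{metric} is {direction} than your usual baseline."
```

The output stays at the observation level ("higher than your usual baseline") and never says "bad" or "low recovery", which keeps it in line with the separation of observation, inference, and advice.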
11. My preferred product would be boring
Honestly, the first good version may not be a daily AI coach.
It might be a weekly review assistant.
Something like:
This week:
- Sleep duration was about 35 minutes shorter than your usual weekly average.
- Activity was roughly stable.
- Workout frequency dropped from 3 sessions to 2.
- Wednesday sleep data is incomplete, so sleep-stage analysis is not used.
Main thing to watch:
- Sleep duration, not sleep stages.
Low-risk options:
- Keep workouts the same, but avoid increasing intensity this week.
- Add one short walk if you have time.
- If tired, do nothing extra.
That is not sexy.
But it is probably more useful than:
“Your AI coach has generated an optimized personalized training plan.”
Because the weekly review assistant is not pretending to know more than it knows.
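The weekly review above needs very little machinery. A minimal sketch, assuming a hypothetical `Week` record with one sleep value per night (`None` for an incomplete night) and a 15-minute "worth mentioning" cutoff chosen only for illustration:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Week:
    sleep_hours: List[Optional[float]]  # one entry per night, None = incomplete
    workouts: int


def weekly_review(week: Week, usual_sleep_hours: float,
                  usual_workouts: int) -> List[str]:
    """Build the review as plain observations; no scores, no prescriptions."""
    lines: List[str] = []
    nights = [h for h in week.sleep_hours if h is not None]
    missing = len(week.sleep_hours) - len(nights)

    if nights:
        diff_min = round((sum(nights) / len(nights) - usual_sleep_hours) * 60)
        if abs(diff_min) >= 15:  # hypothetical cutoff, in minutes
            word = "shorter" if diff_min < 0 else "longer"
            lines.append(
                f"Sleep duration was about {abs(diff_min)} minutes {word} "
                "than your usual weekly average."
            )
        else:
            lines.append("Sleep duration was roughly stable.")

    if week.workouts != usual_workouts:
        verb = "dropped" if week.workouts < usual_workouts else "rose"
        lines.append(
            f"Workout frequency {verb} from {usual_workouts} sessions "
            f"to {week.workouts}."
        )

    if missing:
        lines.append(
            f"{missing} night(s) of sleep data incomplete, "
            "so sleep-stage analysis is not used."
        )
    return lines
```

Incomplete nights are excluded from the average and then reported explicitly, so the review says what it left out instead of quietly filling the gap.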
12. So my answer to the original question is: AI can help, but only if it becomes more modest
I do not think AI is “necessary” for every fitness app.
A simple, reliable app that tracks sleep duration, steps, workouts, and trends may be more valuable than an overconfident AI coach.
But I also do not think AI is useless.
AI could be useful if it is designed as:
an uncertainty-aware, evidence-limited, low-risk, user-respecting explanation layer on top of wearable data.
Not a magical trainer.
Not a doctor.
Not a personality that nags you.
More like:
“Here is what changed. Here is the weak evidence. Here is what I cannot know. Here are safe options. You decide.”
That is the kind of fitness AI I would actually trust.
Useful references / starting points
- Consumer wearable validity and reliability: Fuller et al. 2020, JMIR mHealth
- Broad consumer wearable accuracy review: Doherty et al. 2024, “Keeping Pace with Wearables”
- Sleep tracker validation against PSG: Schyvens et al. 2024
- Sleep-tracking anxiety / orthosomnia: Baron et al. 2017
- Wearable data quality / person-generated data: Cho et al. 2021
- Real-world wearable data issues: Van Der Donckt et al. 2024
- Digital biomarker pipeline thinking: DACIA framework, npj Digital Medicine
- Just-in-Time Adaptive Interventions: Nahum-Shani et al. 2017
- COM-B / Behavior Change Wheel: Michie et al. 2011
- Behavior Change Technique Taxonomy: Michie et al. 2013
- Prediction model reporting with AI/ML: TRIPOD+AI, BMJ 2024
- WHO health AI ethics/governance: WHO 2021 guidance
- FDA / Health Canada / MHRA transparency principles for ML-enabled medical devices: FDA transparency principles
- NIST AI risk management: NIST AI RMF 1.0
So my short version is:
AI in fitness is not automatically bad, but the honest version should probably be much more boring than the marketing version.
The useful near-term product is not “AI understands your body.”
It is:
“AI helps you avoid over-interpreting weak data.”