{
"$type": "site.standard.document",
"bskyPostRef": {
"cid": "bafyreiacdihnavu7ybmmr6l7ws6k3p6xxig4a3wxrofdr4fhdum6sgflli",
"uri": "at://did:plc:pgryn3ephfd2xgft23qokfzt/app.bsky.feed.post/3mkwsgwamids2"
},
"path": "/t/advice-needed-for-building-interviewai-a-real-time-ai-interview-feedback-project/175712#post_3",
"publishedAt": "2026-05-03T07:48:24.000Z",
"site": "https://discuss.huggingface.co",
"tags": [
"SHRM report",
"EPIC note",
"Wired coverage",
"MediaPipe Face Landmarker for Web",
"MediaPipe Pose Landmarker for Web",
"MediaPipe Hand Landmarker for Web",
"OpenCV",
"OpenFace",
"MMPose",
"RTMPose",
"openai/whisper-large-v3-turbo",
"CohereLabs/cohere-transcribe-03-2026",
"nvidia/canary-qwen-2.5b",
"nvidia/parakeet-tdt-0.6b-v3",
"Open ASR Leaderboard",
"hf-audio/asr-leaderboard-longform",
"qualcomm/MediaPipe-Face-Detection",
"qualcomm/Facial-Landmark-Detection",
"qualcomm/MediaPipe-Pose-Estimation",
"qualcomm/MediaPipe-Hand-Detection",
"qualcomm/RTMPose-Body2d",
"trpakov/vit-face-expression",
"mo-thecreator/vit-Facial-Expression-Recognition",
"Tanneru/Facial-Emotion-Detection-FER-RAFDB-AffectNet-BEIT-Large",
"Qwen/Qwen3.6-35B-A3B",
"Qwen3 collection",
"Qwen3.5 collection",
"meta-llama models",
"Qwen/Qwen3-Embedding-0.6B",
"Qwen/Qwen3-Embedding-8B",
"Qwen/Qwen3-Reranker-8B",
"MTEB leaderboard",
"meta-llama/Llama-Prompt-Guard-2-86M",
"protectai/deberta-v3-base-prompt-injection-v2",
"meta-llama/Llama-Guard-4-12B",
"MDN getUserMedia",
"AI-Mock-Interviewer/Train_data",
"K-areem/AI-Interview-Questions",
"ali-alkhars/interviews",
"edinburghcstr/ami",
"facebook/omnilingual-asr-corpus",
"nilc-nlp/CORAA-MUPE-ASR",
"ETHZurich/biwi_kinect_head_pose",
"vigil1917/GazeGene",
"fhswf/wholebody-pose-estimation-fingerspelling",
"deanngkl/ferplus-7cls",
"abhilash88/fer2013-enhanced",
"laion/emonet-face-hq",
"ak0255/Synthesis_SER",
"GDGiangi/SEIRDB",
"Hugging Face dataset loading scripts",
"Dataset scripts no longer supported discussion",
"egekaraca/ai-interview-coach",
"KentTDang/AI-Interview-Coach",
"yaotingchun/VoxLab",
"SergioSediq/interview-coach",
"Aditi-T27/InterviewAnalyser",
"Mohamed-samy2/Video-Interview-Analysis",
"SatyamPote/Ai-Video-Interviewer",
"AI mock interview Space",
"NYC Local Law 144 AEDT page",
"Illinois Artificial Intelligence Video Interview Act",
"EEOC AI and algorithmic fairness initiative",
"EEOC AI and ADA resources",
"NIST AI Risk Management Framework",
"Hugging Face Datasets",
"Hugging Face Evaluate",
"Hugging Face Spaces",
"Hugging Face model cards",
"ONNX Runtime Web"
],
"textContent": "(Taking above advice into account) this looks really hard…\n\n* * *\n\n# Advice for building **InterviewAI** : real-time AI interview feedback with webcam, speech, and careful feedback\n\nI would build **InterviewAI** as an **evidence-based interview practice coach** , not as a facial-emotion or “confidence” detector.\n\nThe safest and strongest version is:\n\n\n webcam + microphone\n → visible/speech measurements\n → reliability checks\n → transcript/rubric analysis\n → evidence-grounded coaching feedback\n\n\nThe risky version is:\n\n\n face/body/voice\n → emotion/confidence/honesty/personality/hireability score\n\n\nThat second version can look impressive in a demo, but it is hard to validate and easy to overclaim. A better product reports **observable behavior** :\n\n * speaking pace\n * long pauses\n * filler words\n * answer length\n * whether the answer addressed the question\n * STAR structure: situation, task, action, result\n * face visible percentage\n * screen-facing / looking-away estimate\n * head movement stability\n * shoulder/posture visibility\n * hand-to-face episodes\n * camera/audio quality\n\n\n\nIt should avoid unsupported psychological judgments such as:\n\n * “you lacked confidence”\n * “you looked nervous”\n * “you seemed dishonest”\n * “your personality is weak”\n * “you are not hireable”\n\n\n\nA good feedback sentence is:\n\n> “During answer 2, you had four pauses longer than 2.5 seconds and did not include a clear result. Try answering again with: situation, action, result, and one measurable outcome.”\n\nNot:\n\n> “You lacked confidence.”\n\nThis matters technically and ethically. HireVue, a major video-interview vendor, discontinued facial analysis in screening assessments after concerns about AI use in employment decisions: SHRM report, EPIC note, Wired coverage. If InterviewAI stays a **user-owned practice tool** , risk is much lower than if it becomes an employer-facing candidate-ranking system.\n\n* * *\n\n## 1. Recommended architecture\n\nUse a modular architecture. Do not build one giant “interview quality model” first.\n\n\n Client app\n ├── webcam capture\n │ ├── face detection / face landmarks\n │ ├── head pose or screen-facing estimate\n │ ├── pose landmarks\n │ ├── hand landmarks\n │ └── frame-quality checks\n │\n ├── microphone capture\n │ ├── voice activity detection\n │ ├── speech-to-text\n │ ├── pause detection\n │ ├── speaking pace\n │ └── filler-word detection\n │\n ├── interview engine\n │ ├── question generation\n │ ├── role-specific question bank\n │ ├── follow-up questions\n │ └── answer segmentation\n │\n ├── feature aggregator\n │ ├── per-frame visual features\n │ ├── per-answer speech features\n │ ├── transcript features\n │ ├── rolling averages\n │ ├── confidence/reliability flags\n │ └── event timeline\n │\n ├── feedback engine\n │ ├── deterministic metric summaries\n │ ├── rubric-based answer analysis\n │ ├── LLM-generated coaching\n │ ├── safety/claim checker\n │ └── practice recommendations\n │\n └── report UI\n ├── per-answer feedback\n ├── timeline\n ├── evidence for each suggestion\n ├── reliability caveats\n └── progress over sessions\n\n\nThe most important separation is:\n\n\n measurement layer ≠ interpretation layer\n\n\nThe measurement layer should output facts:\n\n\n {\n \"answer_id\": \"answer_002\",\n \"duration_seconds\": 74,\n \"wpm\": 142,\n \"long_pauses\": 4,\n \"filler_words\": 11,\n \"face_visible_ratio\": 0.93,\n \"screen_facing_ratio\": 0.68,\n \"hand_to_face_events\": 5,\n \"posture_feedback_valid\": false,\n \"posture_invalid_reason\": \"shoulders not visible enough\"\n }\n\n\nThe interpretation layer turns those facts into coaching:\n\n> “Your action was clear, but the result was missing. You also had four long pauses. Try answering again using situation, action, result, and one measurable outcome.”\n\n* * *\n\n## 2. Build in stages\n\n### Stage 1 — transcript-only MVP\n\nStart here before webcam analysis.\n\n\n question\n → user speaks answer\n → ASR transcript\n → WPM / pauses / fillers\n → rubric feedback\n → report\n\n\nFeatures:\n\n * generate interview questions\n * record audio\n * transcribe answer\n * compute speaking pace\n * detect long pauses\n * count filler words\n * check answer structure\n * generate feedback\n\n\n\nExample output:\n\n\n Answer length: 74 seconds\n Speaking pace: 142 WPM\n Long pauses: 4\n Filler words: 11\n STAR structure:\n situation: present\n task: unclear\n action: present\n result: missing\n\n Feedback:\n Your action was clear, but the result was missing. Add one sentence explaining what changed because of your action.\n\n\nThis is already useful and much easier to evaluate than facial emotion recognition.\n\n### Stage 2 — basic webcam observables\n\nAdd:\n\n * face visible percentage\n * face centeredness\n * screen-facing estimate\n * looking-away episodes\n * head movement variance\n\n\n\nUse these as **observations** , not psychological claims.\n\n### Stage 3 — pose and hands\n\nAdd:\n\n * shoulder visibility\n * upper-body stability\n * posture validity flag\n * hand-to-face episodes\n * large movement spikes\n\n\n\nGood feedback:\n\n> “Your hand moved near your face five times during this answer.”\n\nBad feedback:\n\n> “You were anxious.”\n\n### Stage 4 — job-aware coaching\n\nAdd:\n\n * resume parsing\n * job-description matching\n * role-specific rubrics\n * retrieval of coaching examples\n * answer-to-rubric comparison\n\n\n\nUse embeddings/rerankers for this.\n\n### Stage 5 — evaluation and safety\n\nAdd:\n\n * manually labeled benchmark sessions\n * detector precision/recall/F1\n * feedback faithfulness checks\n * unsupported-claim detector\n * accessibility modes\n * privacy controls\n\n\n\n* * *\n\n## 3. Computer vision direction\n\nFor the first version, use **landmark tracking** , not a custom visual model.\n\nRecommended first tools:\n\n * MediaPipe Face Landmarker for Web\n * MediaPipe Pose Landmarker for Web\n * MediaPipe Hand Landmarker for Web\n * OpenCV for image/video utilities\n * OpenFace for offline/research facial behavior analysis\n * MMPose / RTMPose later if MediaPipe is not enough\n\n\n\nMediaPipe is a good MVP choice because it is real-time, browser/mobile friendly, and gives the kind of coordinates you need: face, body, and hand landmarks.\n\nUse OpenFace mainly for research/offline validation. It supports facial landmarks, head pose, facial action units, and eye gaze. Use it to compare signals, not to claim internal emotion.\n\n* * *\n\n## 4. Hugging Face models/libraries that can help\n\nUse Hugging Face mainly for **ASR, LLMs, datasets, evaluation, embeddings, demos, and optional experiments**. For real-time camera landmarks, MediaPipe is usually the better first tool.\n\n### Speech-to-text\n\nGood candidates to test:\n\n * openai/whisper-large-v3-turbo\n * CohereLabs/cohere-transcribe-03-2026\n * nvidia/canary-qwen-2.5b\n * nvidia/parakeet-tdt-0.6b-v3\n\n\n\nASR matters because transcript quality affects everything else: filler words, answer structure, relevance, and feedback quality.\n\nEvaluate ASR on your own mock-interview audio. Generic WER is not enough. Measure:\n\n * word error rate\n * filler-word recall\n * timestamp quality\n * pause boundary accuracy\n * speed/latency\n * accent robustness\n * microphone robustness\n\n\n\nUseful benchmark:\n\n * Open ASR Leaderboard\n * hf-audio/asr-leaderboard-longform\n\n\n\n### Face/body/hand models on Hugging Face\n\nUseful HF-hosted MediaPipe-style models:\n\n * qualcomm/MediaPipe-Face-Detection\n * qualcomm/Facial-Landmark-Detection\n * qualcomm/MediaPipe-Pose-Estimation\n * qualcomm/MediaPipe-Hand-Detection\n * qualcomm/RTMPose-Body2d\n\n\n\nUse these for:\n\n * face visibility\n * face framing\n * head movement\n * shoulder visibility\n * upper-body stability\n * hand-to-face events\n\n\n\nDo not use them to infer confidence or nervousness.\n\n### Facial-expression models\n\nTreat these as optional experiments only:\n\n * trpakov/vit-face-expression\n * mo-thecreator/vit-Facial-Expression-Recognition\n * Tanneru/Facial-Emotion-Detection-FER-RAFDB-AffectNet-BEIT-Large\n\n\n\nThese classify expression labels like angry, happy, sad, surprise, neutral, etc. That is not the same as detecting interview confidence, nervousness, honesty, or hireability.\n\nSafe label:\n\n> “experimental facial-expression classifier output”\n\nUnsafe label:\n\n> “candidate confidence score”\n\n### Feedback LLMs\n\nThe LLM should receive structured evidence and write coaching feedback. It should not invent visual claims.\n\nCandidates to test:\n\n * Qwen/Qwen3.6-35B-A3B\n * Qwen3 collection\n * Qwen3.5 collection\n * meta-llama models\n\n\n\nExample LLM input:\n\n\n {\n \"question\": \"Tell me about a time you handled conflict.\",\n \"transcript\": \"I had a disagreement with a teammate...\",\n \"speech_metrics\": {\n \"wpm\": 138,\n \"long_pauses\": 3,\n \"filler_words\": 9\n },\n \"vision_metrics\": {\n \"face_visible_ratio\": 0.94,\n \"screen_facing_ratio\": 0.71,\n \"hand_to_face_events\": 4,\n \"posture_feedback_valid\": false\n },\n \"rubric\": {\n \"situation\": true,\n \"task\": true,\n \"action\": true,\n \"result\": false\n },\n \"instruction\": \"Use only the evidence above. Do not infer confidence, nervousness, honesty, personality, or hireability.\"\n }\n\n\n### Embeddings and rerankers\n\nUseful if you want job/resume-aware coaching:\n\n * Qwen/Qwen3-Embedding-0.6B\n * Qwen/Qwen3-Embedding-8B\n * Qwen/Qwen3-Reranker-8B\n * MTEB leaderboard\n\n\n\nUse embeddings for:\n\n\n resume → retrieve relevant past projects\n job description → retrieve required competencies\n question → retrieve rubric\n answer → retrieve coaching examples\n\n\nDo not use embedding similarity alone as an interview-quality score.\n\n### Safety / prompt-injection guard\n\nIf users upload resumes, job descriptions, or company pages, protect the feedback LLM from prompt injection.\n\nOptions:\n\n * meta-llama/Llama-Prompt-Guard-2-86M\n * protectai/deberta-v3-base-prompt-injection-v2\n * meta-llama/Llama-Guard-4-12B\n\n\n\nAlso add a custom validator that blocks unsupported feedback such as:\n\n\n you looked nervous\n you lacked confidence\n you seemed dishonest\n you are not hireable\n your personality is weak\n\n\n* * *\n\n## 5. Should you train your own model?\n\nDo not train your own model first.\n\nStart with:\n\n 1. pretrained ASR\n 2. pretrained landmarks\n 3. rule-based metrics\n 4. LLM feedback from structured evidence\n 5. manual evaluation\n\n\n\nFine-tune only after you have:\n\n * a narrow observable target\n * labeled data\n * a baseline\n * a clear failure case\n * evaluation metrics\n\n\n\nGood fine-tuning targets:\n\nTarget | Good label\n---|---\nASR adaptation | transcript text\nfiller detection | token-level filler labels\nlong pauses | timestamped pause segments\nlooking away | timestamped looking-away segments\nhand-to-face | timestamped hand-near-face events\nanswer quality | rubric scores by human reviewers\nfeedback style | human coach feedback examples\n\nBad fine-tuning targets:\n\nTarget | Problem\n---|---\nconfidence | vague internal state\nnervousness | not directly observable\nhonesty | not valid from video/audio\npersonality | ethically and scientifically risky\nhireability | high-stakes and bias-prone\n\n* * *\n\n## 6. Real-time webcam inference best practices\n\n### Run models slower than camera FPS\n\nSuggested rates:\n\nComponent | Rate\n---|---\nwebcam preview | 30 FPS\nface landmarks | 10–15 FPS\npose landmarks | 5–10 FPS\nhand landmarks | 5–10 FPS\nexpression classifier, if any | 1–3 FPS\nfull LLM feedback | after each answer\n\nDo not generate feedback every frame.\n\n### Smooth signals\n\nLandmarks jitter. Use:\n\n * moving average\n * exponential moving average\n * median filter\n * hysteresis thresholds\n * minimum event duration\n\n\n\nBad:\n\n\n frame 1834 = looking away\n\n\nBetter:\n\n\n looking-away event:\n start: 00:41.2\n end: 00:46.8\n duration: 5.6 seconds\n confidence: medium\n\n\n### Use calibration\n\nAt session start:\n\n\n Please sit naturally and look at the screen for 5 seconds.\n\n\nCapture baseline:\n\n * head yaw\n * head pitch\n * face center\n * shoulder position\n * distance from camera\n * lighting quality\n\n\n\nThen compare future movement to that user’s baseline.\n\n### Use reliability gates\n\nDo not report metrics when evidence is weak.\n\nCondition | Action\n---|---\nface visible < 60% | do not report screen-facing estimate\nshoulders not visible | do not report posture\nhand landmarks unstable | do not report hand-to-face events\naudio quality poor | warn transcript metrics may be unreliable\nmultiple faces visible | pause analysis or warn\nlow lighting | ask user to improve lighting\n\nA good caveat:\n\n> “I could not reliably evaluate posture because your shoulders were not visible enough.”\n\n### Store features, not raw video\n\nPrefer:\n\n\n {\n \"timestamp_ms\": 12800,\n \"face_visible\": true,\n \"head_yaw\": -0.12,\n \"head_pitch\": 0.04,\n \"screen_facing_estimate\": true,\n \"left_hand_near_face\": false,\n \"right_hand_near_face\": true\n }\n\n\nDo not store raw video unless the user explicitly opts in.\n\nBrowser webcam access uses `getUserMedia()`; see MDN getUserMedia.\n\n* * *\n\n## 7. Evaluation plan\n\nDo not evaluate by asking:\n\n> “Does the generated feedback sound good?”\n\nA polished LLM can produce convincing but false feedback.\n\nEvaluate by asking:\n\n> “Did the system correctly detect the observable things it claims to detect, and did the feedback stay faithful to those measurements?”\n\n### Build a labeled benchmark\n\nCreate:\n\n\n 30–100 mock interview sessions\n 5–10 minutes each\n 10+ users\n different webcams\n different microphones\n different lighting\n different accents\n camera-on and camera-off cases\n\n\nManually label:\n\nLabel | Type\n---|---\nlong pauses | timestamped start/end\nfiller words | transcript tokens\nlooking away | timestamped segments\nface visible | per frame or per second\nhand near face | timestamped segments\nshoulders visible | segment-level\nposture feedback valid | yes/no\nSTAR structure | present/missing\nanswer relevance | rubric score\nunsupported feedback claim | yes/no\n\n### Metrics\n\nComponent | Metric\n---|---\nASR | WER, filler recall, timestamp error\npause detector | precision, recall, F1, boundary error\nlooking-away detector | event F1, false positives/min\nhand-to-face detector | event precision/recall\nposture validity | accuracy of “can judge / cannot judge”\nanswer rubric | agreement with human reviewer\nfeedback | faithfulness, helpfulness, unsupported-claim rate\noverall | user usefulness rating + objective detector scores\n\nExample report:\n\n\n InterviewAI v0.2 evaluation\n\n Dataset:\n 48 mock interview sessions\n 14 participants\n 7 webcam/microphone setups\n 326 answer segments\n\n ASR:\n WER: 8.9%\n filler-word recall: 0.81\n average timestamp error: 420 ms\n\n Long pauses:\n precision: 0.91\n recall: 0.85\n F1: 0.88\n\n Looking-away estimate:\n precision: 0.76\n recall: 0.69\n F1: 0.72\n false positives: 0.7/min\n\n Hand-to-face:\n precision: 0.82\n recall: 0.64\n F1: 0.72\n\n Feedback:\n evidence faithfulness: 96%\n unsupported psychological claims: 0%\n human coach agreement: 0.74\n\n\n* * *\n\n## 8. Datasets and benchmarks\n\n### Interview-question datasets\n\nUseful for question generation and technical QA examples:\n\n * AI-Mock-Interviewer/Train_data\n * K-areem/AI-Interview-Questions\n * ali-alkhars/interviews\n\n\n\nUse them for:\n\n * question generation\n * role-specific question bank\n * technical interview examples\n * simple instruction-tuning experiments\n\n\n\nDo not treat them as reliable answer-quality labels.\n\n### ASR datasets\n\nUseful for transcription evaluation:\n\n * hf-audio/asr-leaderboard-longform\n * edinburghcstr/ami\n * facebook/omnilingual-asr-corpus\n * nilc-nlp/CORAA-MUPE-ASR\n\n\n\n### Face/head/gaze/pose datasets\n\nUseful for component experiments:\n\n * ETHZurich/biwi_kinect_head_pose\n * vigil1917/GazeGene\n * fhswf/wholebody-pose-estimation-fingerspelling\n\n\n\n### FER/SER datasets\n\nUse only for optional research:\n\n * deanngkl/ferplus-7cls\n * abhilash88/fer2013-enhanced\n * laion/emonet-face-hq\n * ak0255/Synthesis_SER\n * GDGiangi/SEIRDB\n\n\n\nDo not use FER/SER datasets as proof that you can detect interview confidence or nervousness.\n\n### Prefer no-script datasets\n\nPrefer Hugging Face datasets stored as Parquet, JSON, CSV, image folders, or audio files. Avoid datasets that require custom Python dataset builder scripts or `trust_remote_code=True` when possible.\n\nRelevant docs:\n\n * Hugging Face dataset loading scripts\n * Dataset scripts no longer supported discussion\n\n\n\n* * *\n\n## 9. Similar projects to study\n\nUse these for architecture and UX ideas, not as proof of validity.\n\n * egekaraca/ai-interview-coach\n * KentTDang/AI-Interview-Coach\n * yaotingchun/VoxLab\n * SergioSediq/interview-coach\n * Aditi-T27/InterviewAnalyser\n * Mohamed-samy2/Video-Interview-Analysis\n * SatyamPote/Ai-Video-Interviewer\n * AI mock interview Space\n\n\n\nProjects that say “confidence analyzer” or “candidate scoring” are useful cautionary examples. Their architecture may be interesting, but the framing is risky.\n\nBetter names:\n\n\n Interview practice coach\n Interview delivery analyzer\n Observable behavior feedback system\n Mock interview feedback assistant\n\n\nRiskier names:\n\n\n confidence detector\n nervousness analyzer\n honesty detector\n hireability evaluator\n personality detector\n\n\n* * *\n\n## 10. Legal/safety/product-positioning warnings\n\nIf InterviewAI is a **candidate-owned practice tool** , the risk is much lower.\n\nIf it becomes an **employer-facing automated screening tool** , the risk increases sharply.\n\nRelevant references:\n\n * NYC Local Law 144 AEDT page\n * Illinois Artificial Intelligence Video Interview Act\n * EEOC AI and algorithmic fairness initiative\n * EEOC AI and ADA resources\n * NIST AI Risk Management Framework\n\n\n\nRecommended disclaimer:\n\n\n InterviewAI analyzes observable practice signals such as transcript quality, speaking pace, pauses, filler words, face visibility, screen-facing estimate, and hand/pose landmarks.\n\n InterviewAI does not infer honesty, personality, mental health, true confidence, emotional state, or hireability.\n\n\n* * *\n\n## 11. Common mistakes to avoid\n\n### Mistake 1 — leading with emotion recognition\n\nFacial expression recognition sounds impressive, but it is not the strongest core feature. It is hard to validate and easy to overclaim.\n\nUse it as:\n\n\n optional experimental expression classifier\n\n\nNot as:\n\n\n confidence detector\n\n\n### Mistake 2 — using one overall score too early\n\nAvoid:\n\n\n Interview score: 74/100\n Confidence: 62/100\n Professionalism: 80/100\n\n\nPrefer:\n\n\n Speech:\n 146 WPM\n 3 long pauses\n 9 filler words\n\n Answer structure:\n situation: present\n action: present\n result: missing\n\n Camera:\n face visible: 94%\n screen-facing estimate: 71%\n hand-to-face events: 5\n\n\n### Mistake 3 — no reliability gates\n\nIf the evidence is weak, say so. Do not fake posture or gaze feedback.\n\n### Mistake 4 — letting the LLM invent observations\n\nDo not prompt:\n\n\n Analyze this candidate's confidence.\n\n\nPrompt:\n\n\n Use only the transcript and measured metrics. Do not infer mental state, honesty, personality, or hireability.\n\n\n### Mistake 5 — no labeled evaluation set\n\nMost demos fail here. Build a small labeled mock-interview benchmark and report objective metrics.\n\n### Mistake 6 — ignoring accessibility\n\nEye contact, speech rhythm, posture, facial movement, and gesture patterns vary across people. Include:\n\n * camera-off mode\n * transcript-only mode\n * no penalty for gaze differences\n * manual self-review\n * user-controlled goals\n * clear caveats\n\n\n\n* * *\n\n## 12. Recommended stack\n\n### Web MVP\n\nLayer | Recommendation\n---|---\nfrontend | Next.js / React\nwebcam/mic | `getUserMedia`, MediaRecorder\nreal-time CV | MediaPipe Tasks Web\nbackend | FastAPI or Node\nASR | Whisper / Cohere Transcribe / Canary / Parakeet\nfeedback LLM | Qwen / Llama / API model\nembeddings | Qwen3 Embedding\nreranking | Qwen3 Reranker\nstorage | PostgreSQL + object storage\nevaluation | Python, scikit-learn, Hugging Face Evaluate\ndemo | Hugging Face Spaces or custom web deployment\n\nUseful docs:\n\n * Hugging Face Datasets\n * Hugging Face Evaluate\n * Hugging Face Spaces\n * Hugging Face model cards\n * ONNX Runtime Web\n\n\n\n### Local/privacy-oriented version\n\nLayer | Recommendation\n---|---\nCV | MediaPipe/OpenCV locally\nASR | local Whisper/faster-whisper-style runtime\nLLM | local Qwen/Llama if hardware supports\nstorage | local SQLite\nreports | local HTML/PDF export\n\n* * *\n\n## 13. Strong README positioning\n\nUse this:\n\n\n InterviewAI is an AI interview-practice coach that combines transcript analysis, speech timing, and webcam landmark tracking to give users evidence-based feedback.\n\n It measures observable practice signals such as speaking pace, long pauses, filler words, answer structure, face visibility, screen-facing estimate, head movement, upper-body stability, and hand-to-face movement.\n\n It does not infer honesty, personality, mental health, true confidence, nervousness, or hireability.\n\n\nAvoid this:\n\n\n InterviewAI uses emotion recognition to detect whether candidates are confident, nervous, honest, and hireable.\n\n\n* * *\n\n## 14. Direct answers to the seven questions\n\n### 1. Recommended architecture?\n\nUse:\n\n\n webcam + microphone\n → landmarks + transcript\n → observable metrics\n → per-answer aggregation\n → rubric scoring\n → LLM feedback from evidence\n → report with caveats\n\n\nKeep measurement separate from interpretation.\n\n### 2. Which HF models/libraries help?\n\nUse Hugging Face for:\n\n * ASR: Whisper, Cohere Transcribe, Canary, Parakeet\n * LLM feedback: Qwen/Llama-style instruction models\n * embeddings/rerankers: Qwen3 Embedding/Reranker\n * datasets: ASR, interview questions, pose/gaze experiments\n * evaluation: Hugging Face Evaluate\n * demos: Spaces\n * safety: Prompt Guard / Llama Guard-style models\n\n\n\nUse MediaPipe/OpenFace/MMPose for camera landmarks.\n\n### 3. Train, fine-tune, or pretrained?\n\nUse pretrained models first. Fine-tune only after you have labeled data and a clear failing baseline. Do not train a model to predict “confidence” or “nervousness” first.\n\n### 4. Best practices for real-time webcam inference?\n\n * lower FPS inference\n * smoothing\n * calibration\n * async processing\n * feature storage instead of raw video\n * reliability gates\n * uncertainty reporting\n * feedback after each answer, not every frame\n\n\n\n### 5. How to evaluate reliability?\n\nCreate a labeled mock-interview benchmark and measure:\n\n * long-pause F1\n * filler recall\n * looking-away event F1\n * hand-to-face precision/recall\n * ASR WER\n * rubric agreement\n * feedback faithfulness\n * unsupported-claim rate\n\n\n\n### 6. Datasets/models/pipelines?\n\nUse:\n\n * `AI-Mock-Interviewer/Train_data`\n * `K-areem/AI-Interview-Questions`\n * `hf-audio/asr-leaderboard-longform`\n * MediaPipe / OpenFace / MMPose\n * Whisper / Cohere / Canary / Parakeet\n * Qwen / Llama for feedback\n * Qwen3 Embedding/Reranker for retrieval\n * your own `interviewai-eval-v1` for final validation\n\n\n\n### 7. Common mistakes?\n\nAvoid:\n\n * emotion overclaiming\n * confidence/honesty/hireability scoring\n * one overall score too early\n * no reliability gates\n * no labeled evaluation\n * raw-video storage by default\n * LLM-invented observations\n * ignoring accessibility\n * building an employer screening tool before addressing legal/fairness requirements\n\n\n\n* * *\n\n## Final recommendation\n\nBuild the boring measurable system first:\n\n\n observable signals\n + transcript analysis\n + reliability gates\n + evidence-grounded LLM feedback\n + manually labeled evaluation\n\n\nDo not start with:\n\n\n facial emotion recognition\n + confidence score\n + personality score\n + hireability score\n\n\nIf the basic signals are noisy, a bigger model will mostly give you a more expensive noisy system. If the basic signals are reliable, the feedback layer can become genuinely useful.",
"title": "Advice needed for building InterviewAI: a real-time AI interview feedback project"
}