Raw Record Source

{
  "$type": "site.standard.document",
  "content": "---\ntitle: \"COMP4020: assessment\"\ndescription: \"What does a student hand in for an agentic coding course---and how do I make\n  that artefact worth more than the grade itself?\"\ntags: [comp4020, teaching]\n---\n\n:::tip\n\nThis post is part of a series I'm writing as I develop\n[COMP4020: Agentic Coding Studio](/blog/2025/12/19/comp4020-rapid-prototyping-for-the-web/).\nSee [all posts in the series](/blog/tag/comp4020/).\n\n:::\n\nThe assessment challenges in an agentic coding course aren't the same as the\nones\n[causing much hand-wringing in other areas of higher education](https://nymag.com/intelligencer/article/openai-chatgpt-ai-cheating-education-college-students-school.html). The obvious\nframing---\"how do we stop students cheating with AI?\"---misses what COMP4020 is\nactually for. The course isn't trying to produce students who can code without\nAI; it's trying to produce students who can build _good_ software _with_ AI,\nwhich is a different skill. So the question I'm actually trying to answer is:\nwhat artefacts can/should we assess to determine attainment of the course\nlearning outcomes when producing code itself is cheap?\n\nI want the assessed artefact to do three things. It should be _authentic_ to\nwhat the student actually built---tightly coupled to the work itself, not some\nabstracted written reflection or traditional exam that drifts away from the\nthing the course is about. It should require _human_ care and effort to do well;\na one-shottable assessment fails to measure the thing the course is actually\nabout, which is judgement: when to let the agent run, when to intervene or throw\nthe whole thing away and start again, and how best to scaffold all these things\nin a co-operative human-agent system. A submission anyone could generate\nby pasting the prompt into Claude and hitting enter once hasn't told us anything\nuseful. And---this one's the most fun to think about---it should have\n_usefulness beyond the course grade_. Ideally the student would want to keep it,\nshare it, put it on their CV, post it to\n[Hacker News](https://news.ycombinator.com/). The grade becomes a side-effect of\nmaking something genuinely useful to somebody.\n\nConcretely, the course will have three assignments of increasing scope (static\nsite, CRUD app, real-time app), and for each one students will hand in a\nbundle of four things: the git repo itself (source, Docker-based build and\ndeploy workflow, `CLAUDE.md` and other agent files, `README.md`); the full\nClaude Code session logs (the `.jsonl` files from `~/.claude/projects/...`), a\ncomplete auditable record of the agentic workflow that produced the thing; a\nshort, slick, launch-style product video of the kind you'd put at the top of a\nlanding page to show what the app does; and a longer \"behind-the-scenes\"\nworkflow video---more reflective, \"here's how I actually built this, including\nthe bits that went wrong\", the sort of video that would generate genuine\ndiscussion on Hacker News or r/programming about agentic coding best\npractices.\n\n:::info\n\nThe four COMP4020 learning outcomes. Students will be able to:\n\n1. design, build and test full-stack web applications using a rapid-prototyping\n   process\n2. describe the components of a Large Language Model interface for code\n   generation\n3. design and evaluate different LLM agent workflows for software development\n4. apply principles from the scholarly literature to work-in-progress and\n   finalised software prototypes\n\n:::\n\nThe two videos do different work. The product video exists to demonstrate that\nthe thing _works_---that the student can ship something polished. The BTS video\nis where most of the real assessment happens, and where the \"useful beyond the\ngrade\" criterion kicks in hardest; if these videos are any good, they're a\ncontribution to the wider conversation about how to build software with agents.\nThe four items also map onto the course's four learning outcomes unevenly: the\nproduct video and the working repo cover LO1 (design, build, test full-stack web\napps); the BTS video backed by the JSONL logs covers LO3 (design and evaluate\nLLM agent workflows) and LO4 (apply scholarly principles to the prototypes); LO2\n(describe the components of an LLM interface for code generation) is the gap I\nhaven't yet closed, and probably sits in a separate written component or threads\nthrough the BTS commentary (\"I used a subagent here because...\").\n\nThe JSONL logs are the sneaky part. They're a complete record of every prompt,\nevery tool use, every back-and-forth, which makes them dense and hard to read\nend-to-end; nobody's going to wade through hundreds of megabytes for every\nsubmission. That's fine, because the logs exist as _evidence_. They make every\nclaim in the BTS video auditable, and they make one-shotting visibly obvious (a\nlog containing a single prompt and a single response is telling us something).\nThey also open up some interesting possibilities on top of that. An\n[LLM-as-judge](/blog/2026/03/30/comp4020-whats-the-theory-here/) could scan them\nfor specific workflow patterns a student claims to have used. Aggregated across\nthe cohort, with consent and anonymisation, they become a genuine research\ncorpus. And requiring them forces students to practise good secrets hygiene from\nday one---no credentials in prompts, no API keys pasted into chat---which I'd\nwant them doing regardless of how I assessed them. I'll train the students in\nthis, and give them tooling to help.\n\nThere's plenty still open. The rubric is the biggest one: how do you actually\n_grade_ a BTS video? What\ndistinguishes a good one from a glib one? I have intuitions---genuine engagement\nwith trade-offs, specific references to what went wrong, connections to the\nscholarly material, evidence of iteration rather than single-shot\nprompting---but I don't have a scheme I'd trust yet. Related: what stops the BTS\nvideo itself being [LLM-generated slop](https://simonwillison.net/2024/May/8/slop/)? The JSONL logs are part of the answer\n(it's hard to reflect plausibly on workflow choices the logs show you didn't\nmake), though they're probably not the whole answer. Voice-and-face-on-camera\nhelps, though I'm wary of mandating that and creating accessibility problems.\n\nLO2 is unresolved, as I've noted. I suspect it won't resolve properly until I've\ndrafted the lecture material and can see what needs assessing in isolation from\nthe prototypes.\n\nAnd there's the broadest question of all. There's a\n[whole discourse right now about whether AI-assisted work is meaningfully assessable _at all_](https://postplagiarism.com/2024/08/21/intro/),\nabout integrity, about what a degree even signals when agents can produce code\nthat looks like what a student would produce. I keep going back and forth on whether\nto engage with that explicitly in the course materials, or whether to let the\npositive proposal do the arguing. For now I'm leaning toward the latter---ask me\nagain next month.\n\nAs with\n[everything else in this series](/blog/2026/03/31/comp4020-the-story-so-far/),\nif you're teaching something similar and have made different choices, I'd\ngenuinely like to hear about them.\n",
  "createdAt": "2026-05-13T23:14:36.405Z",
  "description": "What does a student hand in for an agentic coding course---and how do I make that artefact worth more than the grade itself?",
  "path": "/blog/2026/04/15/comp4020-assessment",
  "publishedAt": "2026-04-15T00:00:00.000Z",
  "site": "at://did:plc:tevykrhi4kibtsipzci76d76/site.standard.publication/self",
  "tags": [
    "comp4020",
    "teaching"
  ],
  "textContent": "What does a student hand in for an agentic coding course---and how do I make that artefact worth more than the grade itself?",
  "title": "COMP4020: assessment"
}