My Tools in Data Science course uses LLMs for assessments. We use LLMs to
- Suggest project ideas (I pick), e.g. https://chatgpt.com/share/6741d870-73f4-800c-a741-af127d20eec7
- Draft the project brief (we edit), e.g. https://docs.google.com/document/d/1VgtVtypnVyPWiXied5q0_CcAt3zufOdFwIhvDDCmPXk/edit
- Propose scoring rubrics (we tweak), e.g. https://chatgpt.com/share/68b8eef6-60ec-800c-8b10-cfff1a571590
- Score code against the rubric (we test; a minimal sketch of this step follows the list), e.g. https://github.com/sanand0/tds-evals/blob/5cfabf09c21c2884623e0774eae9a01db212c76a/llm-browser-agent/process_submissions.py
- Analyze the results (we refine), e.g. https://chatgpt.com/share/68b8f962-16a4-800c-84ff-fb9e3f0c779a
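To give a flavor of the scoring step, here is a minimal sketch, not the repo's actual code: the model name, rubric criteria, and JSON output shape are illustrative assumptions, and it assumes an OpenAI-compatible API key in the environment. The real prompts and batching logic are in the tds-evals repo linked below.

```python
# Minimal sketch: score one submission against a rubric with an LLM.
# The rubric, model, and output schema here are illustrative, not the course's actual ones.
import json
from pathlib import Path

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical rubric: criterion name -> what full marks require.
rubric = {
    "correctness": "The script runs end-to-end and produces the expected output.",
    "code_quality": "Functions are small, clearly named, and free of dead code.",
    "documentation": "README explains setup, usage, and design choices.",
}


def score_submission(code: str, rubric: dict[str, str]) -> dict:
    """Ask the LLM to grade `code` on each rubric criterion (0-2) with a reason."""
    prompt = (
        "Grade the following submission against each rubric criterion. "
        'Return JSON: {criterion: {"score": 0|1|2, "reason": str}}.\n\n'
        f"Rubric:\n{json.dumps(rubric, indent=2)}\n\nSubmission:\n{code}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # force parseable JSON output
    )
    return json.loads(response.choices[0].message.content)


if __name__ == "__main__":
    submission = Path("submission.py").read_text()
    print(json.dumps(score_submission(submission, rubric), indent=2))
```

Asking for a score plus a reason per criterion is what makes the results checkable: you can spot-check the reasons against your own judgement before trusting the scores at scale.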
This changed our assessment process. It’s easier and better.
Earlier, TAs took 2 weeks to evaluate 500 code submissions. In the example above, it took 2 hours. Quality held up: LLMs match my judgement as closely as TAs do, but they run fast and at scale.
LLM-graded reviews aren’t just a cost hack. They’re a scale and quality lever.
- We create new assessments fast. The example above took ~2 hours to ideate.
- We run, analyze, and iterate just as fast. That full loop now takes ~2 hours.
I no longer have an excuse to teach outdated content.
Prompts & code: https://github.com/sanand0/tds-evals/tree/main/llm-browser-agent
