Canvas Tutor β€” Module Improvement Roadmap v1

Status: ACTIONABLE Β· supersedes Phase 1.5 framing Β· drives day-to-day work
Date: 2026-04-29
Companion: CANVAS_TUTOR_ARCHITECTURE_v1.md (the spine)

This document lists exactly what needs to be built to take each of the 9 architecture modules from its current quality to its next level. Manual testing only β€” no automated harnesses. We work granularly: pick one module, do the work, you test it manually on the live URL, score it, move to the next.


At-a-glance β€” current state (post M1-M9 push, 2026-04-29)

| # | Module | Exists | Quality (was β†’ now) | What landed in this push |
|---|---|---|---|---|
| M1 | Input Handler | βœ… functional (6/6) | 🟠 β†’ 🟑 solid | blank input mode + 10 question cards (auto-fill on click) |
| M2A | Cached lessons | βœ… functional | 🟠 β†’ 🟠⁺ developing+ (re-cook running) | scripts/patch_lessons.py shipped runtime helpers into 29 cached HTMLs; lesson re-cook with new prompt running on EC2 |
| M2B | Live ingest+gen | βœ… functional | πŸ”΄ β†’ 🟠 developing | 9-phase pedagogical arc in SYSTEM_PROMPT, beat-count guidance by complexity, source-faithfulness for url/pdf, M9 footprint contract documented, second-pass critique-revise rubric (rewrites if any dim < 4) |
| M3 | First-Response | βœ… functional | 🟠 β†’ 🟑 solid | input-aware hello categories: hello_url, hello_pdf, hello_blank, hello_question (16 utterances, 15 synthesized) |
| M4 | Streaming Engine | ❌ not started | n/a (deferred) | β€” |
| M5 | Skill Executor | 🟑 partial (focus cues live) | πŸ”΄ β†’ 🟠 developing | parse_focus_markers() + scheduleFocusCues(): word-level highlight via char-position approximation against audio.currentTime |
| M6 | Interruption FSM | 🟑 partial | πŸ”΄ β†’ 🟠 developing | state transitions formalized |
| M7 | Response Router | βœ… functional (5/5) | 🟠 β†’ 🟑 solid | 5-way classifier: INLINE / TANGENT / PARK / REFUSE / CLARIFY with confidence + rationale; uniform synth path |
| M8 | Activity Indicator | βœ… functional | πŸ”΄ β†’ 🟠 developing | visual progress modal, 4 sub-states (classifying / preparing / synthesizing / finalizing), heuristic timing |
| M9 | Board State + Layout | 🟑 partial (Phase A+B) | πŸ”΄ β†’ 🟠 developing | Phase A: 6Γ—4 footprint contract in author prompt + runtime. Phase B: window.boardState (add/remove/getState/hasRoomFor/findReferencedRecently) + permanence levels. Phase C (active reconciliation) deferred. |
Status: every module except the deferred M4 is now at 🟠 developing or better. Lesson re-cook (force --parallel 3, 20 topics) is running in tmux on the devbox to surface M2B's new prompt across the demo corpus. Manual testing happens on physolympiad.com/CanvasA/: pick a card, hit a hello variant, ask a question, watch the focus cues + activity modal.


Per-module work plans

M1 Β· Input Handler

Current: 4 of 6 modes work. Target: all 6, with input-aware errors.

| Work item | Effort |
|---|---|
| Add blank input mode (student arrives without a topic β€” tutor asks "what would you like to learn?") | ~2 hr |
| Add question_card mode + curate ~10 starter questions | ~half day |
| Image-PDF OCR (Tesseract or Claude Vision for scanned PDFs) | ~1 day |
| Better validation + helpful error messages on bad URLs / parse failures | ~1 hr |
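Validation starts with deciding which mode the raw input is in. A minimal sketch, where the function names, mode strings, and heuristics are illustrative rather than the shipped handler:

```python
import re
from typing import Optional

# Hypothetical sketch of M1's mode detection + validation step.
URL_RE = re.compile(r"^https?://\S+\.\S+", re.IGNORECASE)

def detect_input_mode(raw: str, filename: Optional[str] = None) -> str:
    """Map a raw student input to one of the handler's input modes."""
    if filename and filename.lower().endswith(".pdf"):
        return "pdf"
    text = (raw or "").strip()
    if not text:
        return "blank"      # tutor asks "what would you like to learn?"
    if URL_RE.match(text):
        return "url"
    if text.endswith("?"):
        return "question"   # free-typed or question-card text
    return "topic"

def validate(mode: str, raw: str) -> Optional[str]:
    """Return a helpful, input-aware error message, or None if the input is fine."""
    if mode == "url" and " " in raw.strip():
        return "That URL contains a space; was it pasted completely?"
    if mode == "pdf" and not raw:
        return "That PDF produced no extractable text; it may be a scanned image."
    return None
```

The point of the split is that error messages can then be phrased per mode instead of one generic failure string.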

Manual test: try each input mode, intentionally feed bad input (broken URL, image-only PDF), verify graceful handling.

Move-on score: all 6 modes work + bad input doesn't crash.


M2A Β· Cached lessons

Current: 19/20 cards live, demoware-quality. Target: golden quality on the 20.

| Work item | Effort |
|---|---|
| Fix Maxwell's equations card (1/20 still topic-only) | ~30 min |
| Hand-edit 5 of the 20 cards with you (or a physicist) β€” narration polish, scene refinement, layout fixes | ~5 hr active, ~1 wk elapsed |
| Add a manual-review checklist per lesson (correctness, pedagogy, layout, sync) | ~1 hr |
| Re-cook with M2B's improved prompt once that lands (cascades automatically) | overlap |

Manual test: play 5 polished lessons; do they teach genuinely? Compare to YouTube-physics-tutor benchmark.

Move-on score: 5+ cards rated "I'd actually share this with a student" by you.


M2B Β· Live ingest+gen

Current: single-shot Claude, demoware. Target: structured pedagogical arc + critique-revise.

| Work item | Effort |
|---|---|
| Rewrite system prompt to enforce pedagogical arc: hook β†’ prior knowledge β†’ core insight β†’ derivation β†’ worked example β†’ trap β†’ consolidation | ~half day |
| Add critique-and-revise pass: second Claude call scores the first output against rubric, rewrites weak parts | ~half day |
| Variable lesson length: prompt asks for beat count proportional to topic complexity (5 for trivial, 20 for Maxwell) | ~half day |
| Source-faithfulness check: when source is provided, verify lesson reflects source's framing | ~1 hr |
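The critique-and-revise gate itself is small once the model calls are factored out. A sketch with both Claude calls stubbed as callables; the rubric dimension names here are placeholders, and only the "rewrite if any dimension scores below 4" rule comes from the push notes above:

```python
from typing import Callable, Dict

# Placeholder rubric dimensions; the real rubric lives in the critique prompt.
RUBRIC_DIMS = ["correctness", "arc", "worked_example", "layout", "sync"]

def needs_rewrite(scores: Dict[str, int]) -> bool:
    """Second-pass rule: rewrite if any rubric dimension scores below 4."""
    return any(scores.get(dim, 0) < 4 for dim in RUBRIC_DIMS)

def critique_and_revise(lesson: str,
                        critique_fn: Callable[[str], Dict[str, int]],
                        revise_fn: Callable[[str, Dict[str, int]], str]) -> str:
    """critique_fn scores the draft 1-5 per dimension; revise_fn is the
    second Claude call that rewrites the weak parts."""
    scores = critique_fn(lesson)
    if needs_rewrite(scores):
        return revise_fn(lesson, scores)
    return lesson
```

Stubbing the calls keeps the gating rule testable without hitting the API.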

Manual test: regenerate 3 lessons with new prompt; compare to baseline. Does the new lesson have a clearer arc? Is the worked example really worked?

Move-on score: 3 of 3 regenerated lessons score "noticeably better than baseline" subjectively.


M3 Β· First-Response Generator

Current: generic hello. Target: input-aware hello + first visual placeholder.

| Work item | Effort |
|---|---|
| Author 5 hello variants per input mode (URL-paste, PDF-drop, blank, question card) β€” total ~25 new utterances | ~half day (prompt + synth) |
| Pre-synthesize via existing F018 pipeline | ~10 min |
| First scene placeholder (ghost diagram or skeleton title appears alongside hello audio) | ~half day |

Manual test: paste 5 different URL types, drop a PDF, click cards β€” does the hello acknowledge what you sent?

Move-on score: hello phrasing makes you think "yes, the tutor noticed what I sent."


M4 Β· Streaming Engine

Current: doesn't exist. Target: functional β€” first beat plays before full lesson is generated.

| Work item | Effort |
|---|---|
| Refactor generate_lesson.py to stream Claude output beat-by-beat (Anthropic streaming API) | ~1 day |
| ElevenLabs per-beat synth as each beat is parsed (parallel, not after) | ~1 day |
| Frontend listens for beat_ready events on the status endpoint, plays as soon as available | ~1 day |
| Lookahead buffer: pre-fetch beat N+1's audio while N plays | ~half day |
| Backpressure: if production lags playback, fall back to extended bridging | ~half day |
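The beat-by-beat refactor hinges on detecting beat boundaries in the token stream. A minimal sketch of that consumer, with plain string chunks standing in for deltas from the Anthropic streaming API and a hypothetical `---BEAT---` delimiter the author prompt would be asked to emit between beats:

```python
from typing import Iterable, Iterator

DELIM = "---BEAT---"   # hypothetical delimiter; not the current lesson format

def iter_beats(chunks: Iterable[str]) -> Iterator[str]:
    """Yield each complete beat as soon as its trailing delimiter has arrived."""
    buf = ""
    for chunk in chunks:
        buf += chunk
        while DELIM in buf:
            beat, buf = buf.split(DELIM, 1)
            if beat.strip():
                yield beat.strip()   # hand off to per-beat ElevenLabs synth here
    if buf.strip():
        yield buf.strip()            # final beat after the stream closes
```

Each yielded beat can start TTS synthesis while later beats are still being generated, which is what pulls the first audible moment forward.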

Manual test: paste URL, time first audible spoken-content moment. Should be ~10s instead of ~60s.

Move-on score: first lesson content plays within 15s of clicking; full 12-beat lesson plays without dead air.


M5 Β· Skill Executor

Current: 5/10 skills, all implicit dispatch. Target: formal registry + word-level highlight + 3 new skills.

| Work item | Effort |
|---|---|
| Create prompts/skills_registry.json (the 10 skills, typed inputs/outputs) | ~1 hr |
| Refactor lesson author prompt to emit explicit skill calls (instead of implicit scene types) | ~half day |
| Refactor runtime to dispatch by skill name (matches registry) | ~half day |
| F024 word-level timing: ElevenLabs with-timestamps endpoint integration | ~1 day |
| F025 inline focus markers: author prompt emits [focus:elem-id]…[/] blocks | ~half day |
| F026 cue engine: runtime fires F011 highlighter underline on focus markers' word timings | ~1 day |
| New skill: look_up(query) β€” knowledge lookup with citation | ~1 day |
| New skill: derive_step(symbolic) β€” sympy-verified algebraic step | ~1-2 days |
| New skill: ask_back(question) β€” pause + wait for student | ~1 day |
| New skill: play_animation(spec) β€” parameter-driven motion | ~2 days |
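The F025 markers can be parsed in one pass. A sketch of a `parse_focus_markers()` that strips the markers and records character offsets into the clean narration, which is what the char-position approximation against `audio.currentTime` consumes; the `[focus:elem-id]…[/]` syntax is from the work item, while the `FocusCue` shape and return type are assumptions:

```python
import re
from typing import List, NamedTuple, Tuple

MARKER_RE = re.compile(r"\[focus:([\w-]+)\](.*?)\[/\]", re.DOTALL)

class FocusCue(NamedTuple):
    elem_id: str
    start: int   # char offset into the clean (marker-free) narration
    end: int

def parse_focus_markers(narration: str) -> Tuple[str, List[FocusCue]]:
    """Strip focus markers, returning clean narration + cue char ranges."""
    clean_parts: List[str] = []
    cues: List[FocusCue] = []
    pos = 0       # cursor in the marked-up text
    offset = 0    # cursor in the clean text
    for m in MARKER_RE.finditer(narration):
        clean_parts.append(narration[pos:m.start()])
        offset += m.start() - pos
        span = m.group(2)
        cues.append(FocusCue(m.group(1), offset, offset + len(span)))
        clean_parts.append(span)
        offset += len(span)
        pos = m.end()
    clean_parts.append(narration[pos:])
    return "".join(clean_parts), cues
```

Offsets into the clean text matter because that is the text the TTS actually speaks; the cue engine maps those char ranges onto word timings.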

Manual test: word-level β€” replay any lesson; does the highlight underline track the spoken word? New skills β€” generate a lesson on a topic that benefits from each skill.

Move-on score: word-level sync feels natural on 3 sample lessons; at least 2 of 4 new skills functional.


M6 Β· Interruption State Machine

Current: implicit. Target: named states with per-state activity.

| Work item | Effort |
|---|---|
| Define InterruptionState enum (pausing β†’ classifying β†’ preparing β†’ delivering β†’ transitioning_back β†’ playing) | ~1 hr |
| Refactor /api/ask to emit state events as the request progresses | ~half day |
| Frontend subscribes to state events; updates M8 indicator in real time | ~half day |
| Cancellation: student can cancel a question mid-prepare (e.g., "actually never mind") | ~half day |
| Resume re-orientation: brief "Right, where were we…" voice cue when transitioning back | ~1 hr |
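The enum and its legal transitions follow directly from the chain above. A Python sketch; the cancellation edges back to PLAYING (from CLASSIFYING and PREPARING) are assumptions added to support Esc mid-prepare, and the TRANSITIONS/step names are illustrative:

```python
from enum import Enum

class InterruptionState(Enum):
    PLAYING = "playing"
    PAUSING = "pausing"
    CLASSIFYING = "classifying"
    PREPARING = "preparing"
    DELIVERING = "delivering"
    TRANSITIONING_BACK = "transitioning_back"

S = InterruptionState
TRANSITIONS = {
    S.PLAYING: {S.PAUSING},
    S.PAUSING: {S.CLASSIFYING},
    S.CLASSIFYING: {S.PREPARING, S.PLAYING},   # cancel right after classify
    S.PREPARING: {S.DELIVERING, S.PLAYING},    # Esc mid-prepare
    S.DELIVERING: {S.TRANSITIONING_BACK},
    S.TRANSITIONING_BACK: {S.PLAYING},         # "Right, where were we…"
}

def step(cur: InterruptionState, nxt: InterruptionState) -> InterruptionState:
    """Advance the FSM, rejecting transitions the machine does not allow."""
    if nxt not in TRANSITIONS[cur]:
        raise ValueError(f"illegal transition {cur.value} -> {nxt.value}")
    return nxt
```

With named states, /api/ask can emit one event per transition and the M8 indicator simply mirrors the current value.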

Manual test: ask a question; watch the indicator step through named states. Hit Esc mid-prepare; does it cancel cleanly?

Move-on score: you can tell at a glance which sub-state the tutor is in.


M7 Β· Response Router

Current: 2-way classifier. Target: 5-way + confidence + rationale.

| Work item | Effort |
|---|---|
| Upgrade Claude classifier prompt to emit 5-way decision: INLINE / TANGENT / PARK / REFUSE / CLARIFY | ~half day |
| Implement PARK mode: question saved to a "parking lot" UI element, accessible at end of lesson | ~half day |
| Implement REFUSE mode: polite refusal phrasing for impossible/ambiguous/off-domain questions | ~1 hr |
| Implement CLARIFY mode: tutor asks targeted follow-up before answering | ~half day |
| Log classifier confidence + rationale per call (debug aid) | ~1 hr |
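On the /api/ask side, the classifier's output has to be validated before routing. A sketch, assuming the upgraded prompt asks Claude for a JSON object with route / confidence / rationale keys (that shape is an assumption; the five route names are from the plan):

```python
import json

ROUTES = {"INLINE", "TANGENT", "PARK", "REFUSE", "CLARIFY"}

def parse_route(raw: str) -> dict:
    """Validate the classifier's JSON; fall back to CLARIFY on anything odd."""
    fallback = {"route": "CLARIFY", "confidence": 0.0,
                "rationale": "unparseable classifier output"}
    try:
        decision = json.loads(raw)
    except json.JSONDecodeError:
        return fallback
    if not isinstance(decision, dict) or decision.get("route") not in ROUTES:
        return fallback
    decision.setdefault("confidence", 0.5)   # log per call as the debug aid
    decision.setdefault("rationale", "")
    return decision
```

Falling back to CLARIFY is arguably the safest mis-parse behavior, since the tutor just asks a follow-up instead of guessing a route.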

Manual test: ask 10 diverse questions across different intent types. Does the router pick correctly?

Move-on score: β‰₯8 of 10 routes feel right; PARK and CLARIFY UX is clear.


M8 Β· Activity Indicator

Current: voice surface only during activation. Target: visual modal + interruption coverage.

| Work item | Effort |
|---|---|
| Visual progress modal (small card, top-right of board, shows current sub-state) | ~half day |
| Wire to M6's state events (lights up sub-state lines as tutor progresses) | ~1 hr |
| Author 10–15 new bridging utterances tied to interruption sub-states (synth via F018 pipeline) | ~half day |
| Pulsing chalk-cursor animation for "tutor is doing something" | ~1 hr |

Manual test: ask a question, watch the modal animate sub-states + voice utterances play during the wait.

Move-on score: zero "Tutor is thinking…" moments where you don't know what's happening.


M9 Β· Board State + Layout

Current: 6Γ—4 grid in CSS, author-driven placement, reactive overlap fixer. Target: structured contract + state recorder.

| Work item | Effort |
|---|---|
| Phase A: add footprint + position_hint + permanence fields to scene element schema; update lesson author prompt to emit them; runtime ignores initially | ~half day |
| Phase B: build board state recorder; track every element added with permanence + timestamp; provide get_state(), has_room_for(footprint), find_referenced_recently() queries | ~1 day |
| Phase B': M7 starts using has_room_for() to inform same-board-vs-new-board routing | ~1 hr |
| Phase C: active reconciliation. Beat arrives, system checks fit, erases ephemerals if needed, signals new-board if no room. Replaces F009 reactive auto-fixer | ~1-2 days |
| Visual debug overlay: dev mode shows current board state + footprints | ~1 hr |
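A Python sketch of the Phase B recorder contract that window.boardState implements in the browser runtime. The query names come from the plan above; the field names, permanence labels, and the coarse area-only room check are illustrative (real placement would also test contiguity on the 6Γ—4 grid):

```python
import time
from dataclasses import dataclass, field
from typing import Dict, List

GRID_COLS, GRID_ROWS = 6, 4

@dataclass
class BoardElement:
    elem_id: str
    cols: int                 # footprint width in grid cells
    rows: int                 # footprint height in grid cells
    permanence: str           # e.g. "anchor" | "supporting" | "ephemeral"
    added_at: float = field(default_factory=time.time)

class BoardState:
    def __init__(self) -> None:
        self.elements: Dict[str, BoardElement] = {}

    def add(self, el: BoardElement) -> None:
        self.elements[el.elem_id] = el

    def remove(self, elem_id: str) -> None:
        self.elements.pop(elem_id, None)

    def cells_used(self) -> int:
        return sum(e.cols * e.rows for e in self.elements.values())

    def has_room_for(self, cols: int, rows: int) -> bool:
        # Area-only check; a real implementation would test contiguity too.
        return self.cells_used() + cols * rows <= GRID_COLS * GRID_ROWS

    def find_referenced_recently(self, within_s: float = 30.0) -> List[str]:
        cutoff = time.time() - within_s
        return [e.elem_id for e in self.elements.values() if e.added_at >= cutoff]
```

Phase B' then becomes a one-liner: route to a new board whenever has_room_for() returns False for the incoming beat's footprint.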

Manual test: generate 3 long lessons. Are layouts crowded or clean? Does the new-board decision happen at the right moment?

Move-on score: layout never feels accidental.


Suggested execution order

Ordered by user-visible impact / dependency:

  1. M2B prompt + critique-revise (lesson quality) β†’ ~1 day β†’ addresses #1 user complaint
  2. M5 word-level highlight (F024+F025+F026) β†’ ~2 days β†’ addresses #2 user complaint (voice talking in vacuum)
  3. M8 visual progress modal + interruption bridging β†’ ~1 day β†’ addresses "Tutor is thinking…" complaint
  4. M6 state machine formalization β†’ ~half day β†’ enables M8 to work cleanly
  5. M7 5-way router (PARK + CLARIFY) β†’ ~1 day β†’ fewer mis-routed questions
  6. M9 Phase A β€” contract introduction β†’ ~half day β†’ no behavior change, sets up future
  7. M3 input-aware hellos + first scene placeholder β†’ ~half day β†’ polish
  8. M9 Phase B β€” state recorder β†’ ~1 day β†’ enables M7 to be smarter
  9. M2A β€” fix Maxwell + hand-edit 5 goldens β†’ ~5 days elapsed β†’ highest-leverage content polish
  10. M1 β€” blank + question_card modes β†’ ~1 day β†’ broader entry surface
  11. M5 new skill: look_up β†’ ~1 day β†’ adds factual citation
  12. M5 new skill: ask_back β†’ ~1 day β†’ enables Phase 4 active learning
  13. M9 Phase C β€” active reconciliation β†’ ~2 days β†’ big quality jump
  14. M4 β€” Streaming Engine β†’ ~3-5 days β†’ biggest architectural lift; addresses the ~60s startup wait
  15. M5 new skills: derive_step, play_animation β†’ ~3-4 days β†’ extends tutor's repertoire
  16. M1 β€” image-PDF OCR β†’ ~1 day β†’ broader input
  17. M9 β€” mobile reflow β†’ ~3 days β†’ required for Phase 10

Total to take all 9 modules from current β†’ 🟠⁺: ~14 working days at AI speed.


What happens to the rest of the roadmap?

The existing 14-phase roadmap in ROADMAP_CANVAS_A_v3.md is the long-term vision. This module roadmap is the immediate execution plan. They map cleanly:

| 14-phase roadmap | Module work that delivers it |
|---|---|
| Phase 0 β€” Foundation | Eval harness lives in M2 (gate), observability in M6/M8 (state logging); auth/rate limits are orthogonal infra |
| Phase 1 β€” Activation engine v1 | Done. F016–F034 covered M1, M2A, M2B, M3, M5 narrate, M9 grid baseline |
| Phase 1.5 β€” Voice-visual sync + lesson quality | REPLACED by this doc. F024–F032 = M5 highlight, M2B prompt, M2A goldens, etc. |
| Phase 2 β€” Board craft v1 | = M5 (more skills) + M9 (full reconciliation) + M3 (first scene placeholder) |
| Phase 3 β€” Interjection v1 | = M6 (FSM) + M7 (5-way router) + M8 (full activity coverage) |
| Phase 4 β€” Engagement v1 | = M5 new skills (ask_back, evaluate_work v2, pose_problem v2) + M2B prompt for engagement patterns |
| Phase 5 β€” Pedagogy bedrock | = M2 eval harness + M5 look_up, derive_step + misconception library content |
| Phase 6 β€” Activation v2 (rich inputs) | = M1 expansion (already partial via F021/F022) + M2B streaming (M4) |
| Phase 7 β€” Tool-using agent (sandboxed) | = M5 skill maturation + new "tutorial-led entry" mode |
| Phase 8 β€” Adaptive intelligence + memory | New module M10 (Knowledge State + Memory) β€” not in v1.2 architecture, will be added |
| Phase 9 β€” Animation + simulation | = M5 (play_animation skill) + M9 (animation-aware layout) |
| Phase 10 β€” Mobile + accessibility | = M9 reflow + cross-cutting CSS work |
| Phase 11 β€” Pre-warm corpus scale | = M2A scale-up (~150 hand-curated β†’ 1500 long-tail) |
| Phase 12 β€” Production pilot | Orthogonal β€” infra, deployment, RCT design |
| Phase 13 β€” Subject coverage + scale | = M1 (math/modern physics) + M2A scale + new module M11 (Teacher/Parent dashboards) |
| Phase 14 β€” Classroom mode | Parked. Multi-student is a new concern crossing M3 + M6 + M7 + M8 |

Bottom line: the 14-phase roadmap doesn't go away; it just gets executed through module-quality units. When all 9 modules hit 🟠⁺ (β‰ˆ 2 weeks), Phase 1 + 1.5 are done. Phase 2 starts as M5 + M9 work to reach 🟑 solid. Phase 3 starts as M6 + M7 + M8 work to reach 🟑 solid. And so on.

This doc gets updated weekly. As work completes, items move from open to done. When all of M2B's items are done and you score it 🟑 solid, M2B's row in the at-a-glance table flips to 🟑.


How we use this doc

  1. Pick top of execution order. (Currently: M2B prompt + critique-revise.)
  2. Do the work. Build, deploy.
  3. You test. On the live URL. Score subjectively against the move-on criteria.
  4. If pass: mark item done. Update at-a-glance table. Move to next item.
  5. If fail: iterate within the module until it passes, OR explicitly flag and move on if blocking.
  6. Dashboard reflects state β€” the unified Studio dashboard's Roadmap tab pulls from this.

This is a working document. Edit it as we learn. Add items as new gaps appear. Remove items if they prove unnecessary.


Open questions before kickoff

  1. Confirm execution order. Default: M2B β†’ M5 highlight β†’ M8 β†’ M6 β†’ M7 β†’ M9-A β†’ M3 β†’ … Override?
  2. Hand-edit cadence for M2A goldens. Are you available for ~1 hr per lesson Γ— 5 lessons over the next week?
  3. Streaming priority. M4 currently sits at #14 (~3-5 days). Pull earlier or leave there?
  4. Cancellation UX. Pressing Esc mid-prepare should cancel β€” currently doesn't. Is that worth making a Phase A item in M6?

Drafted 2026-04-29 in response to: "create that roadmap and work on it granularly." Manual testing only. Module-by-module cadence. The 14-phase roadmap is the destination; this is the route.


Appendix Β· The destination β€” what comes after the 9-module push

Tier A Β· Polish (~3–4 weeks): Take all 9 modules from 🟠 developing β†’ 🟑 solid. Delivers Phases 2 (Board craft), 3 (Interjection), 4 (Engagement) of the 14-phase roadmap.

Tier B Β· New capabilities (~2–3 weeks):
- Phase 5 β€” Pedagogy bedrock: sympy unit validator, source citations everywhere, misconception library v0 with 30 tagged misconceptions, Mayer-principles audit
- Phase 7 β€” Tool-using agent (sandboxed): tutorial-led entry mode with live planner-executor agent + lookup_fact + derive_step skills

Tier C Β· New modules + content (~3–4 weeks):
- Phase 8 β€” Adaptive intelligence + memory (new M10 module): per-user mastery model, FERPA-aware cross-session memory, spaced repetition, diagnostic pre-assessment
- Phase 11 β€” Pre-warm corpus scale: 150 hand-curated Tier 1 + 500 auto-gated Tier 2 + 850 on-demand Tier 3 lessons across physolympiad + SuperStem + Wikipedia

Tier D Β· Reach + Deploy (~6–8 weeks):
- Phase 10 β€” Mobile + accessibility: M9 reflow for phones/tablets, touch mic, captions, screen-reader, WCAG 2.2 AA
- Phase 12 β€” Production pilot: physolympiad.com integration, S3 audio, real auth + rate limits, 50–150 student RCT vs. Khan Academy
- Phase 13 β€” Subject coverage + scale: math primitives, modern physics, new M11 (teacher dashboard) + M12 (parent dashboard), library scale-up to 1500 topics

Tier E Β· Future (~2–3 weeks, on-demand):
- Phase 14 β€” Classroom mode (parked): multi-voice detection, group UX, per-student aggregation, teacher orchestration. Activates when β‰₯3 schools ask.

Total elapsed to production pilot: ~2 weeks (current module push) + ~14–19 weeks (Tiers A–D) = ~16–21 weeks total β‰ˆ 4–5 months.