Canvas Tutor β€” Module Improvement Roadmap v1

Status: ACTIONABLE Β· supersedes Phase 1.5 framing Β· drives day-to-day work
Date: 2026-04-29
Companion: CANVAS_TUTOR_ARCHITECTURE_v1.md (the spine)

This document lists exactly what needs to be built to take each of the 9 architecture modules from its current quality to its next level. Manual testing only β€” no automated harnesses. We work granularly: pick one module, do the work, you test it manually on the live URL, score it, move to the next.


At-a-glance β€” current state (post M1-M9 push, 2026-04-29)

| # | Module | Exists | Quality (was β†’ now) | What landed in this push |
|---|---|---|---|---|
| M1 | Input Handler | βœ… functional (6/6) | 🟠 β†’ 🟑 solid | blank input mode + 10 question cards (auto-fill on click) |
| M2A | Cached lessons | βœ… functional | 🟠 β†’ 🟠⁺ developing+ (re-cook running) | scripts/patch_lessons.py shipped runtime helpers into 29 cached HTMLs; lesson re-cook with new prompt running on EC2 |
| M2B | Live ingest+gen | βœ… functional | πŸ”΄ β†’ 🟠 developing | 9-phase pedagogical arc in SYSTEM_PROMPT, beat-count guidance by complexity, source-faithfulness for url/pdf, M9 footprint contract documented, second-pass critique-revise rubric (rewrites if any dim < 4) |
| M3 | First-Response | βœ… functional | 🟠 β†’ 🟑 solid | input-aware hello categories: hello_url, hello_pdf, hello_blank, hello_question (16 utterances, 15 synthesized) |
| M4 | Streaming Engine | ❌ not started | n/a (deferred) | β€” |
| M5 | Skill Executor | 🟑 partial (focus cues live) | πŸ”΄ β†’ 🟠 developing | parse_focus_markers() + scheduleFocusCues(): word-level highlight via char-position approximation against audio.currentTime |
| M6 | Interruption FSM | 🟑 partial | πŸ”΄ β†’ 🟠 developing | state transitions formalized |
| M7 | Response Router | βœ… functional (5/5) | 🟠 β†’ 🟑 solid | 5-way classifier: INLINE / TANGENT / PARK / REFUSE / CLARIFY with confidence + rationale; uniform synth path |
| M8 | Activity Indicator | βœ… functional | πŸ”΄ β†’ 🟠 developing | visual progress modal, 4 sub-states (classifying / preparing / synthesizing / finalizing), heuristic timing |
| M9 | Board State + Layout | 🟑 partial (Phase A+B) | πŸ”΄ β†’ 🟠 developing | Phase A: 6Γ—4 footprint contract in author prompt + runtime. Phase B: window.boardState (add/remove/getState/hasRoomFor/findReferencedRecently) + permanence levels. Phase C (active reconciliation) deferred. |
Status: every module except the deferred M4 is now at 🟠 developing or better. Lesson re-cook (force --parallel 3, 20 topics) is running in tmux on the devbox to surface M2B's new prompt across the demo corpus. Manual testing happens on physolympiad.com/CanvasA/: pick a card, hit a hello variant, ask a question, watch the focus cues + activity modal.


Per-module work plans

M1 Β· Input Handler

Current: 4 of 6 modes work. Target: all 6, with input-aware errors.

| Work item | Effort |
|---|---|
| Add blank input mode (student arrives without a topic β€” tutor asks "what would you like to learn?") | ~2 hr |
| Add question_card mode + curate ~10 starter questions | ~half day |
| Image-PDF OCR (Tesseract or Claude Vision for scanned PDFs) | ~1 day |
| Better validation + helpful error messages on bad URLs / parse failures | ~1 hr |
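Validation starts with deciding which mode the raw input is in. A minimal sketch, where the function names, mode strings, and heuristics are illustrative rather than the shipped handler:

```python
import re
from typing import Optional

# Hypothetical sketch of M1's mode detection + validation step.
URL_RE = re.compile(r"^https?://\S+\.\S+", re.IGNORECASE)

def detect_input_mode(raw: str, filename: Optional[str] = None) -> str:
    """Map a raw student input to one of the handler's input modes."""
    if filename and filename.lower().endswith(".pdf"):
        return "pdf"
    text = (raw or "").strip()
    if not text:
        return "blank"      # tutor asks "what would you like to learn?"
    if URL_RE.match(text):
        return "url"
    if text.endswith("?"):
        return "question"   # free-typed or question-card text
    return "topic"

def validate(mode: str, raw: str) -> Optional[str]:
    """Return a helpful, input-aware error message, or None if the input is fine."""
    if mode == "url" and " " in raw.strip():
        return "That URL contains a space; was it pasted completely?"
    if mode == "pdf" and not raw:
        return "That PDF produced no extractable text; it may be a scanned image."
    return None
```

The point of the split is that error messages can then be phrased per mode instead of one generic failure string.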

Manual test: try each input mode, intentionally feed bad input (broken URL, image-only PDF), verify graceful handling.

Move-on score: all 6 modes work + bad input doesn't crash.


M2A Β· Cached lessons

Current: 19/20 cards live, demoware-quality. Target: golden quality on the 20.

| Work item | Effort |
|---|---|
| Fix Maxwell's equations card (1/20 still topic-only) | ~30 min |
| Hand-edit 5 of the 20 cards with you (or a physicist) β€” narration polish, scene refinement, layout fixes | ~5 hr active, ~1 wk elapsed |
| Add a manual-review checklist per lesson (correctness, pedagogy, layout, sync) | ~1 hr |
| Re-cook with M2B's improved prompt once that lands (cascades automatically) | overlap |

Manual test: play 5 polished lessons; do they teach genuinely? Compare to YouTube-physics-tutor benchmark.

Move-on score: 5+ cards rated "I'd actually share this with a student" by you.


M2B Β· Live ingest+gen

Current: single-shot Claude, demoware. Target: structured pedagogical arc + critique-revise.

| Work item | Effort |
|---|---|
| Rewrite system prompt to enforce pedagogical arc: hook β†’ prior knowledge β†’ core insight β†’ derivation β†’ worked example β†’ trap β†’ consolidation | ~half day |
| Add critique-and-revise pass: second Claude call scores the first output against rubric, rewrites weak parts | ~half day |
| Variable lesson length: prompt asks for beat count proportional to topic complexity (5 for trivial, 20 for Maxwell) | ~half day |
| Source-faithfulness check: when source is provided, verify lesson reflects source's framing | ~1 hr |
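The critique-and-revise gate itself is small once the model calls are factored out. A sketch with both Claude calls stubbed as callables; the rubric dimension names here are placeholders, and only the "rewrite if any dimension scores below 4" rule comes from the push notes above:

```python
from typing import Callable, Dict

# Placeholder rubric dimensions; the real rubric lives in the critique prompt.
RUBRIC_DIMS = ["correctness", "arc", "worked_example", "layout", "sync"]

def needs_rewrite(scores: Dict[str, int]) -> bool:
    """Second-pass rule: rewrite if any rubric dimension scores below 4."""
    return any(scores.get(dim, 0) < 4 for dim in RUBRIC_DIMS)

def critique_and_revise(lesson: str,
                        critique_fn: Callable[[str], Dict[str, int]],
                        revise_fn: Callable[[str, Dict[str, int]], str]) -> str:
    """critique_fn scores the draft 1-5 per dimension; revise_fn is the
    second Claude call that rewrites the weak parts."""
    scores = critique_fn(lesson)
    if needs_rewrite(scores):
        return revise_fn(lesson, scores)
    return lesson
```

Stubbing the calls keeps the gating rule testable without hitting the API.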

Manual test: regenerate 3 lessons with new prompt; compare to baseline. Does the new lesson have a clearer arc? Is the worked example really worked?

Move-on score: 3 of 3 regenerated lessons score "noticeably better than baseline" subjectively.


M3 Β· First-Response Generator

Current: generic hello. Target: input-aware hello + first visual placeholder.

| Work item | Effort |
|---|---|
| Author 5 hello variants per input mode (URL-paste, PDF-drop, blank, question card) β€” total ~25 new utterances | ~half day (prompt + synth) |
| Pre-synthesize via existing F018 pipeline | ~10 min |
| First scene placeholder (ghost diagram or skeleton title appears alongside hello audio) | ~half day |

Manual test: paste 5 different URL types, drop a PDF, click cards β€” does the hello acknowledge what you sent?

Move-on score: hello phrasing makes you think "yes, the tutor noticed what I sent."


M4 Β· Streaming Engine

Current: doesn't exist. Target: functional β€” first beat plays before full lesson is generated.

| Work item | Effort |
|---|---|
| Refactor generate_lesson.py to stream Claude output beat-by-beat (Anthropic streaming API) | ~1 day |
| ElevenLabs per-beat synth as each beat is parsed (parallel, not after) | ~1 day |
| Frontend listens for beat_ready events on the status endpoint, plays as soon as available | ~1 day |
| Lookahead buffer: pre-fetch beat N+1's audio while N plays | ~half day |
| Backpressure: if production lags playback, fall back to extended bridging | ~half day |
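The beat-by-beat refactor hinges on detecting beat boundaries in the token stream. A minimal sketch of that consumer, with plain string chunks standing in for deltas from the Anthropic streaming API and a hypothetical `---BEAT---` delimiter the author prompt would be asked to emit between beats:

```python
from typing import Iterable, Iterator

DELIM = "---BEAT---"   # hypothetical delimiter; not the current lesson format

def iter_beats(chunks: Iterable[str]) -> Iterator[str]:
    """Yield each complete beat as soon as its trailing delimiter has arrived."""
    buf = ""
    for chunk in chunks:
        buf += chunk
        while DELIM in buf:
            beat, buf = buf.split(DELIM, 1)
            if beat.strip():
                yield beat.strip()   # hand off to per-beat ElevenLabs synth here
    if buf.strip():
        yield buf.strip()            # final beat after the stream closes
```

Each yielded beat can start TTS synthesis while later beats are still being generated, which is what pulls the first audible moment forward.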

Manual test: paste URL, time first audible spoken-content moment. Should be ~10s instead of ~60s.

Move-on score: first lesson content plays within 15s of clicking; full 12-beat lesson plays without dead air.


M5 Β· Skill Executor

Current: 5/10 skills, all implicit dispatch. Target: formal registry + word-level highlight + 3 new skills.

| Work item | Effort |
|---|---|
| Create prompts/skills_registry.json (the 10 skills, typed inputs/outputs) | ~1 hr |
| Refactor lesson author prompt to emit explicit skill calls (instead of implicit scene types) | ~half day |
| Refactor runtime to dispatch by skill name (matches registry) | ~half day |
| F024 word-level timing: ElevenLabs with-timestamps endpoint integration | ~1 day |
| F025 inline focus markers: author prompt emits [focus:elem-id]…[/] blocks | ~half day |
| F026 cue engine: runtime fires F011 highlighter underline on focus markers' word timings | ~1 day |
| New skill: look_up(query) β€” knowledge lookup with citation | ~1 day |
| New skill: derive_step(symbolic) β€” sympy-verified algebraic step | ~1-2 days |
| New skill: ask_back(question) β€” pause + wait for student | ~1 day |
| New skill: play_animation(spec) β€” parameter-driven motion | ~2 days |
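The F025 markers can be parsed in one pass. A sketch of a `parse_focus_markers()` that strips the markers and records character offsets into the clean narration, which is what the char-position approximation against `audio.currentTime` consumes; the `[focus:elem-id]…[/]` syntax is from the work item, while the `FocusCue` shape and return type are assumptions:

```python
import re
from typing import List, NamedTuple, Tuple

MARKER_RE = re.compile(r"\[focus:([\w-]+)\](.*?)\[/\]", re.DOTALL)

class FocusCue(NamedTuple):
    elem_id: str
    start: int   # char offset into the clean (marker-free) narration
    end: int

def parse_focus_markers(narration: str) -> Tuple[str, List[FocusCue]]:
    """Strip focus markers, returning clean narration + cue char ranges."""
    clean_parts: List[str] = []
    cues: List[FocusCue] = []
    pos = 0       # cursor in the marked-up text
    offset = 0    # cursor in the clean text
    for m in MARKER_RE.finditer(narration):
        clean_parts.append(narration[pos:m.start()])
        offset += m.start() - pos
        span = m.group(2)
        cues.append(FocusCue(m.group(1), offset, offset + len(span)))
        clean_parts.append(span)
        offset += len(span)
        pos = m.end()
    clean_parts.append(narration[pos:])
    return "".join(clean_parts), cues
```

Offsets into the clean text matter because that is the text the TTS actually speaks; the cue engine maps those char ranges onto word timings.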

Manual test: word-level β€” replay any lesson; does the highlight underline track the spoken word? New skills β€” generate a lesson on a topic that benefits from each skill.

Move-on score: word-level sync feels natural on 3 sample lessons; at least 2 of 4 new skills functional.


M6 Β· Interruption State Machine

Current: implicit. Target: named states with per-state activity.

| Work item | Effort |
|---|---|
| Define InterruptionState enum (pausing β†’ classifying β†’ preparing β†’ delivering β†’ transitioning_back β†’ playing) | ~1 hr |
| Refactor /api/ask to emit state events as the request progresses | ~half day |
| Frontend subscribes to state events; updates M8 indicator in real time | ~half day |
| Cancellation: student can cancel a question mid-prepare (e.g., "actually never mind") | ~half day |
| Resume re-orientation: brief "Right, where were we…" voice cue when transitioning back | ~1 hr |
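The enum and its legal transitions follow directly from the chain above. A Python sketch; the cancellation edges back to PLAYING (from CLASSIFYING and PREPARING) are assumptions added to support Esc mid-prepare, and the TRANSITIONS/step names are illustrative:

```python
from enum import Enum

class InterruptionState(Enum):
    PLAYING = "playing"
    PAUSING = "pausing"
    CLASSIFYING = "classifying"
    PREPARING = "preparing"
    DELIVERING = "delivering"
    TRANSITIONING_BACK = "transitioning_back"

S = InterruptionState
TRANSITIONS = {
    S.PLAYING: {S.PAUSING},
    S.PAUSING: {S.CLASSIFYING},
    S.CLASSIFYING: {S.PREPARING, S.PLAYING},   # cancel right after classify
    S.PREPARING: {S.DELIVERING, S.PLAYING},    # Esc mid-prepare
    S.DELIVERING: {S.TRANSITIONING_BACK},
    S.TRANSITIONING_BACK: {S.PLAYING},         # "Right, where were we…"
}

def step(cur: InterruptionState, nxt: InterruptionState) -> InterruptionState:
    """Advance the FSM, rejecting transitions the machine does not allow."""
    if nxt not in TRANSITIONS[cur]:
        raise ValueError(f"illegal transition {cur.value} -> {nxt.value}")
    return nxt
```

With named states, /api/ask can emit one event per transition and the M8 indicator simply mirrors the current value.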

Manual test: ask a question; watch the indicator step through named states. Hit Esc mid-prepare; does it cancel cleanly?

Move-on score: you can tell at a glance which sub-state the tutor is in.


M7 Β· Response Router

Current: 2-way classifier. Target: 5-way + confidence + rationale.

| Work item | Effort |
|---|---|
| Upgrade Claude classifier prompt to emit 5-way decision: INLINE / TANGENT / PARK / REFUSE / CLARIFY | ~half day |
| Implement PARK mode: question saved to a "parking lot" UI element, accessible at end of lesson | ~half day |
| Implement REFUSE mode: polite refusal phrasing for impossible/ambiguous/off-domain questions | ~1 hr |
| Implement CLARIFY mode: tutor asks targeted follow-up before answering | ~half day |
| Log classifier confidence + rationale per call (debug aid) | ~1 hr |
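On the /api/ask side, the classifier's output has to be validated before routing. A sketch, assuming the upgraded prompt asks Claude for a JSON object with route / confidence / rationale keys (that shape is an assumption; the five route names are from the plan):

```python
import json

ROUTES = {"INLINE", "TANGENT", "PARK", "REFUSE", "CLARIFY"}

def parse_route(raw: str) -> dict:
    """Validate the classifier's JSON; fall back to CLARIFY on anything odd."""
    fallback = {"route": "CLARIFY", "confidence": 0.0,
                "rationale": "unparseable classifier output"}
    try:
        decision = json.loads(raw)
    except json.JSONDecodeError:
        return fallback
    if not isinstance(decision, dict) or decision.get("route") not in ROUTES:
        return fallback
    decision.setdefault("confidence", 0.5)   # log per call as the debug aid
    decision.setdefault("rationale", "")
    return decision
```

Falling back to CLARIFY is arguably the safest mis-parse behavior, since the tutor just asks a follow-up instead of guessing a route.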

Manual test: ask 10 diverse questions across different intent types. Does the router pick correctly?

Move-on score: β‰₯8 of 10 routes feel right; PARK and CLARIFY UX is clear.


M8 Β· Activity Indicator

Current: voice surface only during activation. Target: visual modal + interruption coverage.

| Work item | Effort |
|---|---|
| Visual progress modal (small card, top-right of board, shows current sub-state) | ~half day |
| Wire to M6's state events (lights up sub-state lines as tutor progresses) | ~1 hr |
| Author 10–15 new bridging utterances tied to interruption sub-states (synth via F018 pipeline) | ~half day |
| Pulsing chalk-cursor animation for "tutor is doing something" | ~1 hr |

Manual test: ask a question, watch the modal animate sub-states + voice utterances play during the wait.

Move-on score: zero "Tutor is thinking…" moments where you don't know what's happening.


M9 Β· Board State + Layout

Current: 6Γ—4 grid in CSS, author-driven placement, reactive overlap fixer. Target: structured contract + state recorder.

| Work item | Effort |
|---|---|
| Phase A: add footprint + position_hint + permanence fields to scene element schema; update lesson author prompt to emit them; runtime ignores initially | ~half day |
| Phase B: build board state recorder; track every element added with permanence + timestamp; provide get_state(), has_room_for(footprint), find_referenced_recently() queries | ~1 day |
| Phase B': M7 starts using has_room_for() to inform same-board-vs-new-board routing | ~1 hr |
| Phase C: active reconciliation. Beat arrives, system checks fit, erases ephemerals if needed, signals new-board if no room. Replaces F009 reactive auto-fixer | ~1-2 days |
| Visual debug overlay: dev mode shows current board state + footprints | ~1 hr |
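A Python sketch of the Phase B recorder contract that window.boardState implements in the browser runtime. The query names come from the plan above; the field names, permanence labels, and the coarse area-only room check are illustrative (real placement would also test contiguity on the 6Γ—4 grid):

```python
import time
from dataclasses import dataclass, field
from typing import Dict, List

GRID_COLS, GRID_ROWS = 6, 4

@dataclass
class BoardElement:
    elem_id: str
    cols: int                 # footprint width in grid cells
    rows: int                 # footprint height in grid cells
    permanence: str           # e.g. "anchor" | "supporting" | "ephemeral"
    added_at: float = field(default_factory=time.time)

class BoardState:
    def __init__(self) -> None:
        self.elements: Dict[str, BoardElement] = {}

    def add(self, el: BoardElement) -> None:
        self.elements[el.elem_id] = el

    def remove(self, elem_id: str) -> None:
        self.elements.pop(elem_id, None)

    def cells_used(self) -> int:
        return sum(e.cols * e.rows for e in self.elements.values())

    def has_room_for(self, cols: int, rows: int) -> bool:
        # Area-only check; a real implementation would test contiguity too.
        return self.cells_used() + cols * rows <= GRID_COLS * GRID_ROWS

    def find_referenced_recently(self, within_s: float = 30.0) -> List[str]:
        cutoff = time.time() - within_s
        return [e.elem_id for e in self.elements.values() if e.added_at >= cutoff]
```

Phase B' then becomes a one-liner: route to a new board whenever has_room_for() returns False for the incoming beat's footprint.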

Manual test: generate 3 long lessons. Are layouts crowded or clean? Does the new-board decision happen at the right moment?

Move-on score: layout never feels accidental.


Suggested execution order

Ordered by user-visible impact / dependency:

  1. M2B prompt + critique-revise (lesson quality) β†’ ~1 day β†’ addresses #1 user complaint
  2. M5 word-level highlight (F024+F025+F026) β†’ ~2 days β†’ addresses #2 user complaint (voice talking in vacuum)
  3. M8 visual progress modal + interruption bridging β†’ ~1 day β†’ addresses "Tutor is thinking…" complaint
  4. M6 state machine formalization β†’ ~half day β†’ enables M8 to work cleanly
  5. M7 5-way router (PARK + CLARIFY) β†’ ~1 day β†’ fewer mis-routed questions
  6. M9 Phase A β€” contract introduction β†’ ~half day β†’ no behavior change, sets up future
  7. M3 input-aware hellos + first scene placeholder β†’ ~half day β†’ polish
  8. M9 Phase B β€” state recorder β†’ ~1 day β†’ enables M7 to be smarter
  9. M2A β€” fix Maxwell + hand-edit 5 goldens β†’ ~5 days elapsed β†’ highest-leverage content polish
  10. M1 β€” blank + question_card modes β†’ ~1 day β†’ broader entry surface
  11. M5 new skill: look_up β†’ ~1 day β†’ adds factual citation
  12. M5 new skill: ask_back β†’ ~1 day β†’ enables Phase 4 active learning
  13. M9 Phase C β€” active reconciliation β†’ ~2 days β†’ big quality jump
  14. M4 β€” Streaming Engine β†’ ~3-5 days β†’ biggest architectural lift; addresses the ~60s startup wait
  15. M5 new skills: derive_step, play_animation β†’ ~3-4 days β†’ extends tutor's repertoire
  16. M1 β€” image-PDF OCR β†’ ~1 day β†’ broader input
  17. M9 β€” mobile reflow β†’ ~3 days β†’ required for Phase 10

Total to take all 9 modules from current β†’ 🟠⁺: ~14 working days at AI speed.


What happens to the rest of the roadmap?

The existing 14-phase roadmap in ROADMAP_CANVAS_A_v3.md is the long-term vision. This module roadmap is the immediate execution plan. They map cleanly:

| 14-phase roadmap | Module work that delivers it |
|---|---|
| Phase 0 β€” Foundation | Eval harness lives in M2 (gate), observability in M6/M8 (state logging); auth/rate limits are orthogonal infra |
| Phase 1 β€” Activation engine v1 | Done. F016–F034 covered M1, M2A, M2B, M3, M5 narrate, M9 grid baseline |
| Phase 1.5 β€” Voice-visual sync + lesson quality | REPLACED by this doc. F024–F032 = M5 highlight, M2B prompt, M2A goldens, etc. |
| Phase 2 β€” Board craft v1 | = M5 (more skills) + M9 (full reconciliation) + M3 (first scene placeholder) |
| Phase 3 β€” Interjection v1 | = M6 (FSM) + M7 (5-way router) + M8 (full activity coverage) |
| Phase 4 β€” Engagement v1 | = M5 new skills (ask_back, evaluate_work v2, pose_problem v2) + M2B prompt for engagement patterns |
| Phase 5 β€” Pedagogy bedrock | = M2 eval harness + M5 look_up, derive_step + misconception library content |
| Phase 6 β€” Activation v2 (rich inputs) | = M1 expansion (already partial via F021/F022) + M2B streaming (M4) |
| Phase 7 β€” Tool-using agent (sandboxed) | = M5 skill maturation + new "tutorial-led entry" mode |
| Phase 8 β€” Adaptive intelligence + memory | New module M10 (Knowledge State + Memory) β€” not in v1.2 architecture, will be added |
| Phase 9 β€” Animation + simulation | = M5 (play_animation skill) + M9 (animation-aware layout) |
| Phase 10 β€” Mobile + accessibility | = M9 reflow + cross-cutting CSS work |
| Phase 11 β€” Pre-warm corpus scale | = M2A scale-up (~150 hand-curated β†’ 1500 long-tail) |
| Phase 12 β€” Production pilot | Orthogonal β€” infra, deployment, RCT design |
| Phase 13 β€” Subject coverage + scale | = M1 (math/modern physics) + M2A scale + new module M11 (Teacher/Parent dashboards) |
| Phase 14 β€” Classroom mode | Parked. Multi-student is a new concern crossing M3 + M6 + M7 + M8 |

Bottom line: the 14-phase roadmap doesn't go away; it just gets executed through module-quality units. When all 9 modules hit 🟠⁺ (β‰ˆ 2 weeks), Phase 1 + 1.5 are done. Phase 2 starts as M5 + M9 work to reach 🟑 solid. Phase 3 starts as M6 + M7 + M8 work to reach 🟑 solid. And so on.

This doc gets updated weekly. As work completes, items move from open to done. When all of M2B's items are done and you score it 🟑 solid, M2B's row in the at-a-glance table flips to 🟑.


How we use this doc

  1. Pick top of execution order. (Currently: M2B prompt + critique-revise.)
  2. Do the work. Build, deploy.
  3. You test. On the live URL. Score subjectively against the move-on criteria.
  4. If pass: mark item done. Update at-a-glance table. Move to next item.
  5. If fail: iterate within the module until it passes, OR explicitly flag and move on if blocking.
  6. Dashboard reflects state β€” the unified Studio dashboard's Roadmap tab pulls from this.

This is a working document. Edit it as we learn. Add items as new gaps appear. Remove items if they prove unnecessary.


Open questions before kickoff

  1. Confirm execution order. Default: M2B β†’ M5 highlight β†’ M8 β†’ M6 β†’ M7 β†’ M9-A β†’ M3 β†’ … Override?
  2. Hand-edit cadence for M2A goldens. Are you available for ~1 hr per lesson Γ— 5 lessons over the next week?
  3. Streaming priority. M4 currently sits at #14 (~3-5 days). Pull earlier or leave there?
  4. Cancellation UX. Pressing Esc mid-prepare should cancel β€” currently doesn't. Is that worth making a Phase A item in M6?

Drafted 2026-04-29 in response to: "create that roadmap and work on it granularly." Manual testing only. Module-by-module cadence. The 14-phase roadmap is the destination; this is the route.


Appendix Β· The destination β€” what comes after the 9-module push

Tier A Β· Polish (~3–4 weeks): Take all 9 modules from 🟠 developing β†’ 🟑 solid. Delivers Phases 2 (Board craft), 3 (Interjection), 4 (Engagement) of the 14-phase roadmap.

Tier B Β· New capabilities (~2–3 weeks):
- Phase 5 β€” Pedagogy bedrock: sympy unit validator, source citations everywhere, misconception library v0 with 30 tagged misconceptions, Mayer-principles audit
- Phase 7 β€” Tool-using agent (sandboxed): tutorial-led entry mode with live planner-executor agent + lookup_fact + derive_step skills

Tier C Β· New modules + content (~3–4 weeks):
- Phase 8 β€” Adaptive intelligence + memory (new M10 module): per-user mastery model, FERPA-aware cross-session memory, spaced repetition, diagnostic pre-assessment
- Phase 11 β€” Pre-warm corpus scale: 150 hand-curated Tier 1 + 500 auto-gated Tier 2 + 850 on-demand Tier 3 lessons across physolympiad + SuperStem + Wikipedia

Tier D Β· Reach + Deploy (~6–8 weeks):
- Phase 10 β€” Mobile + accessibility: M9 reflow for phones/tablets, touch mic, captions, screen-reader, WCAG 2.2 AA
- Phase 12 β€” Production pilot: physolympiad.com integration, S3 audio, real auth + rate limits, 50–150 student RCT vs. Khan Academy
- Phase 13 β€” Subject coverage + scale: math primitives, modern physics, new M11 (teacher dashboard) + M12 (parent dashboard), library scale-up to 1500 topics

Tier E Β· Future (~2–3 weeks, on-demand):
- Phase 14 β€” Classroom mode (parked): multi-voice detection, group UX, per-student aggregation, teacher orchestration. Activates when β‰₯3 schools ask.

Total elapsed to production pilot: ~2 weeks (current module push) + ~14–19 weeks (Tiers A–D) = ~16–21 weeks total β‰ˆ 4–5 months.