Status: ACTIONABLE · supersedes Phase 1.5 framing · drives day-to-day work
Date: 2026-04-29
Companion: CANVAS_TUTOR_ARCHITECTURE_v1.md (the spine)
This document lists exactly what needs to be built to take each of the 9 architecture modules from its current quality to its next level. Manual testing only; no automated harnesses. We work granularly: pick one module, do the work, you test it manually on the live URL, score it, move to the next.
| # | Module | Exists | Quality (was β now) | What landed in this push |
|---|---|---|---|---|
| M1 | Input Handler | ✅ functional (6/6) | 🟠 → 🟡 solid | blank input mode + 10 question cards (auto-fill on click) |
| M2A | Cached lessons | ✅ functional | 🟠 → 🟠+ developing+ (re-cook running) | scripts/patch_lessons.py shipped runtime helpers into 29 cached HTMLs; lesson re-cook with new prompt running on EC2 |
| M2B | Live ingest+gen | ✅ functional | 🔴 → 🟠 developing | 9-phase pedagogical arc in SYSTEM_PROMPT, beat-count guidance by complexity, source-faithfulness for url/pdf, M9 footprint contract documented, second-pass critique-revise rubric (rewrites if any dim < 4) |
| M3 | First-Response | ✅ functional | 🟠 → 🟡 solid | input-aware hello categories: hello_url, hello_pdf, hello_blank, hello_question (16 utterances, 15 synthesized) |
| M4 | Streaming Engine | ❌ not started | n/a (deferred) | — |
| M5 | Skill Executor | 🟡 partial (focus cues live) | 🔴 → 🟠 developing | parse_focus_markers() + scheduleFocusCues() word-level highlight via char-position approximation against audio.currentTime |
| M6 | Interruption FSM | 🟡 partial | 🔴 → 🟠 developing | state transitions formalized |
| M7 | Response Router | ✅ functional (5/5) | 🟠 → 🟡 solid | 5-way classifier: INLINE / TANGENT / PARK / REFUSE / CLARIFY with confidence + rationale; uniform synth path |
| M8 | Activity Indicator | ✅ functional | 🔴 → 🟠 developing | visual progress modal, 4 sub-states (classifying / preparing / synthesizing / finalizing), heuristic timing |
| M9 | Board State + Layout | 🟡 partial (Phase A+B) | 🔴 → 🟠 developing | Phase A: 6×4 footprint contract in author prompt + runtime. Phase B: window.boardState (add/remove/getState/hasRoomFor/findReferencedRecently) + permanence levels. Phase C (active reconciliation) deferred. |
Status: all modules except the deferred M4 at 🟠+ or better. Lesson re-cook (force --parallel 3, 20 topics) running in tmux on devbox to surface M2B's new prompt across the demo corpus. Manual testing on physolympiad.com/CanvasA/: pick a card, hit a hello variant, ask a question, watch focus cues + activity modal.
## M1 · Input Handler

Current: 4 of 6 modes work. Target: all 6, with input-aware errors.
| Work item | Effort |
|---|---|
| Add blank input mode (student arrives without a topic → tutor asks "what would you like to learn?") | ~2 hr |
| Add question_card mode + curate ~10 starter questions | ~half day |
| Image-PDF OCR (Tesseract or Claude Vision for scanned PDFs) | ~1 day |
| Better validation + helpful error messages on bad URLs / parse failures | ~1 hr |
Manual test: try each input mode, intentionally feed bad input (broken URL, image-only PDF), verify graceful handling.
Move-on score: all 6 modes work + bad input doesn't crash.
## M2A · Cached Lessons

Current: 19/20 cards live, demoware-quality. Target: golden quality on all 20.
| Work item | Effort |
|---|---|
| Fix Maxwell's equations card (1/20 still topic-only) | ~30 min |
| Hand-edit 5 of the 20 cards with you (or a physicist): narration polish, scene refinement, layout fixes | ~5 hr active, ~1 wk elapsed |
| Add a manual-review checklist per lesson (correctness, pedagogy, layout, sync) | ~1 hr |
| Re-cook with M2B's improved prompt once that lands (cascades automatically) | overlap |
Manual test: play 5 polished lessons; do they teach genuinely? Compare to YouTube-physics-tutor benchmark.
Move-on score: 5+ cards rated "I'd actually share this with a student" by you.
## M2B · Live Ingest + Generation

Current: single-shot Claude, demoware. Target: structured pedagogical arc + critique-revise.
| Work item | Effort |
|---|---|
| Rewrite system prompt to enforce pedagogical arc: hook → prior knowledge → core insight → derivation → worked example → trap → consolidation | ~half day |
| Add critique-and-revise pass: second Claude call scores the first output against rubric, rewrites weak parts | ~half day |
| Variable lesson length: prompt asks for beat count proportional to topic complexity (5 for trivial, 20 for Maxwell) | ~half day |
| Source-faithfulness check: when source is provided, verify lesson reflects source's framing | ~1 hr |
Manual test: regenerate 3 lessons with new prompt; compare to baseline. Does the new lesson have a clearer arc? Is the worked example really worked?
Move-on score: 3 of 3 regenerated lessons score "noticeably better than baseline" subjectively.
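The critique-and-revise pass reduces to a small control loop around two model calls. A sketch under stated assumptions — the rubric dimensions, the threshold of 4, and the `score_fn`/`revise_fn` abstraction are illustrative, not the production prompt chain:

```python
# Hypothetical rubric dimensions; the real rubric may differ.
RUBRIC_DIMS = ("arc", "correctness", "worked_example", "layout", "source_faithfulness")

def weak_dimensions(scores: dict[str, int], threshold: int = 4) -> list[str]:
    """Return the rubric dimensions that fall below the rewrite threshold."""
    return [dim for dim in RUBRIC_DIMS if scores.get(dim, 0) < threshold]

def critique_and_revise(lesson: str, score_fn, revise_fn, max_passes: int = 2) -> str:
    """Score the draft; if any dimension scores below threshold, ask for a
    targeted rewrite of just the flagged dimensions, then re-score."""
    for _ in range(max_passes):
        weak = weak_dimensions(score_fn(lesson))
        if not weak:
            break
        lesson = revise_fn(lesson, weak)   # second Claude call rewrites weak parts
    return lesson
```

Keeping the gate logic separate from the model calls makes the "rewrites if any dim < 4" rule trivially testable without hitting the API.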
## M3 · First-Response

Current: generic hello. Target: input-aware hello + first visual placeholder.
| Work item | Effort |
|---|---|
| Author 5 hello variants per input mode (URL-paste, PDF-drop, blank, question card); total ~20 new utterances | ~half day prompt + synth |
| Pre-synthesize via existing F018 pipeline | ~10 min |
| First scene placeholder (ghost diagram or skeleton title appears alongside hello audio) | ~half day |
Manual test: paste 5 different URL types, drop a PDF, click cards. Does the hello acknowledge what you sent?
Move-on score: hello phrasing makes you think "yes, the tutor noticed what I sent."
## M4 · Streaming Engine

Current: doesn't exist. Target: functional; the first beat plays before the full lesson is generated.
| Work item | Effort |
|---|---|
| Refactor generate_lesson.py to stream Claude output beat-by-beat (Anthropic streaming API) | ~1 day |
| ElevenLabs per-beat synth as each beat is parsed (parallel, not after) | ~1 day |
| Frontend listens for beat_ready events on the status endpoint, plays as soon as available | ~1 day |
| Lookahead buffer: pre-fetch beat N+1's audio while N plays | ~half day |
| Backpressure: if production lags playback, fall back to extended bridging | ~half day |
Manual test: paste URL, time first audible spoken-content moment. Should be ~10s instead of ~60s.
Move-on score: first lesson content plays within 15s of clicking; full 12-beat lesson plays without dead air.
## M5 · Skill Executor

Current: 5/10 skills, all implicit dispatch. Target: formal registry + word-level highlight + 4 new skills.
| Work item | Effort |
|---|---|
| Create prompts/skills_registry.json (the 10 skills, typed inputs/outputs) | ~1 hr |
| Refactor lesson author prompt to emit explicit skill calls (instead of implicit scene types) | ~half day |
| Refactor runtime to dispatch by skill name (matches registry) | ~half day |
| F024 word-level timing: ElevenLabs with-timestamps endpoint integration | ~1 day |
| F025 inline focus markers: author prompt emits [focus:elem-id]…[/] blocks | ~half day |
| F026 cue engine: runtime fires F011 highlighter underline on focus markers' word timings | ~1 day |
| New skill: look_up(query) → knowledge lookup with citation | ~1 day |
| New skill: derive_step(symbolic) → sympy-verified algebraic step | ~1-2 days |
| New skill: ask_back(question) → pause + wait for student | ~1 day |
| New skill: play_animation(spec) → parameter-driven motion | ~2 days |
Manual test: for word-level sync, replay any lesson; does the highlight underline track the spoken word? For each new skill, generate a lesson on a topic that benefits from it.
Move-on score: word-level sync feels natural on 3 sample lessons; at least 2 of 4 new skills functional.
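The char-position approximation behind the current focus cues can be sketched in two steps: strip the `[focus:elem-id]…[/]` markers (syntax from the F025 item above) while recording character spans, then convert spans to times by assuming uniform characters-per-second across the utterance. The function shapes are assumptions; only the marker syntax comes from the doc:

```python
import re

MARKER = re.compile(r"\[focus:([\w-]+)\](.*?)\[/\]", re.DOTALL)

def parse_focus_markers(narration: str) -> tuple[str, list[dict]]:
    """Strip focus markers; return clean text plus cue spans as
    character offsets into that clean text."""
    parts, cues, pos, last = [], [], 0, 0
    for m in MARKER.finditer(narration):
        parts.append(narration[last:m.start()])
        pos += m.start() - last
        inner = m.group(2)
        cues.append({"elem": m.group(1), "start_char": pos,
                     "end_char": pos + len(inner)})
        parts.append(inner)
        pos += len(inner)
        last = m.end()
    parts.append(narration[last:])
    return "".join(parts), cues

def cue_times(cues: list[dict], clean_len: int, audio_duration: float) -> list[dict]:
    """Approximate cue start/end times by assuming uniform chars/sec;
    this is the approximation the F024 timestamps work replaces."""
    cps = clean_len / audio_duration
    return [{"elem": c["elem"], "t0": c["start_char"] / cps,
             "t1": c["end_char"] / cps} for c in cues]
```

Runtime-side, something like `scheduleFocusCues()` would then fire the F011 underline whenever `audio.currentTime` enters a cue's `[t0, t1)` window.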
## M6 · Interruption FSM

Current: implicit. Target: named states with per-state activity.
| Work item | Effort |
|---|---|
| Define InterruptionState enum (pausing → classifying → preparing → delivering → transitioning_back → playing) | ~1 hr |
| Refactor /api/ask to emit state events as the request progresses | ~half day |
| Frontend subscribes to state events; updates M8 indicator in real time | ~half day |
| Cancellation: student can cancel a question mid-prepare (e.g., "actually never mind") | ~half day |
| Resume re-orientation: brief "Right, where were we…" voice cue when transitioning back | ~1 hr |
Manual test: ask a question; watch the indicator step through named states. Hit Esc mid-prepare; does it cancel cleanly?
Move-on score: you can tell at a glance which sub-state the tutor is in.
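Formalizing the FSM is mostly a transition table. A sketch: the state names come from the enum item above; the allowed-transition table (including the cancellation and refusal shortcuts back) is an assumption about the design:

```python
from enum import Enum

class InterruptionState(Enum):
    PLAYING = "playing"
    PAUSING = "pausing"
    CLASSIFYING = "classifying"
    PREPARING = "preparing"
    DELIVERING = "delivering"
    TRANSITIONING_BACK = "transitioning_back"

# Assumed transition table; shortcuts model REFUSE and mid-prepare cancellation.
TRANSITIONS = {
    InterruptionState.PLAYING: {InterruptionState.PAUSING},
    InterruptionState.PAUSING: {InterruptionState.CLASSIFYING},
    InterruptionState.CLASSIFYING: {InterruptionState.PREPARING,
                                    InterruptionState.TRANSITIONING_BACK},
    InterruptionState.PREPARING: {InterruptionState.DELIVERING,
                                  InterruptionState.TRANSITIONING_BACK},
    InterruptionState.DELIVERING: {InterruptionState.TRANSITIONING_BACK},
    InterruptionState.TRANSITIONING_BACK: {InterruptionState.PLAYING},
}

def transition(current: InterruptionState, nxt: InterruptionState) -> InterruptionState:
    """Advance the FSM, rejecting transitions the table does not allow."""
    if nxt not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {nxt.value}")
    return nxt
```

With this in place, /api/ask can emit one event per legal `transition()` call, and the frontend never has to guess the ordering.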
## M7 · Response Router

Current: 2-way classifier. Target: 5-way + confidence + rationale.
| Work item | Effort |
|---|---|
| Upgrade Claude classifier prompt to emit 5-way decision: INLINE / TANGENT / PARK / REFUSE / CLARIFY | ~half day |
| Implement PARK mode: question saved to a "parking lot" UI element, accessible at end of lesson | ~half day |
| Implement REFUSE mode: polite refusal phrasing for impossible/ambiguous/off-domain questions | ~1 hr |
| Implement CLARIFY mode: tutor asks targeted follow-up before answering | ~half day |
| Log classifier confidence + rationale per call (debug aid) | ~1 hr |
Manual test: ask 10 diverse questions across different intent types. Does the router pick correctly?
Move-on score: ≥8 of 10 routes feel right; PARK and CLARIFY UX is clear.
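Whatever the classifier prompt emits needs validating before the router acts on it. A sketch of that guard — the JSON field names and the fall-back-to-CLARIFY policy are assumptions, chosen so a malformed or low-confidence reply still yields a safe route:

```python
import json

ROUTES = {"INLINE", "TANGENT", "PARK", "REFUSE", "CLARIFY"}

def parse_route(raw: str, min_confidence: float = 0.5) -> dict:
    """Parse the classifier's JSON reply; fall back to CLARIFY on anything
    malformed or low-confidence, so the student always gets a response."""
    try:
        decision = json.loads(raw)
        route = decision["route"].upper()
        conf = float(decision.get("confidence", 0.0))
        if route in ROUTES and conf >= min_confidence:
            return {"route": route, "confidence": conf,
                    "rationale": decision.get("rationale", "")}
    except (ValueError, KeyError, TypeError, AttributeError):
        pass
    return {"route": "CLARIFY", "confidence": 0.0,
            "rationale": "fallback: unparseable or low-confidence classification"}
```

Logging the returned dict per call covers the "confidence + rationale" debug item in the table above.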
## M8 · Activity Indicator

Current: voice surface only during activation. Target: visual modal + interruption coverage.
| Work item | Effort |
|---|---|
| Visual progress modal (small card, top-right of board, shows current sub-state) | ~half day |
| Wire to M6's state events (lights up sub-state lines as tutor progresses) | ~1 hr |
| Author 10–15 new bridging utterances tied to interruption sub-states (synth via F018 pipeline) | ~half day |
| Pulsing chalk-cursor animation for "tutor is doing something" | ~1 hr |
Manual test: ask a question, watch the modal animate sub-states + voice utterances play during the wait.
Move-on score: zero "Tutor is thinking…" moments where you don't know what's happening.
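Until the M6 state events are wired in, the modal runs on heuristic timing. One way that heuristic might look — the four sub-states come from the at-a-glance table, but the duration estimates and the 95% cap are invented for illustration:

```python
# Assumed per-sub-state duration estimates (seconds); tune from real traces.
SUBSTATES = [("classifying", 1.5), ("preparing", 6.0),
             ("synthesizing", 4.0), ("finalizing", 1.0)]

def progress_at(elapsed: float) -> tuple[str, float]:
    """Map wall-clock seconds since the question to (sub-state, fraction done),
    so the modal can animate before real state events arrive. The fraction is
    capped below 1.0 so the UI never claims completion it can't verify."""
    total = sum(d for _, d in SUBSTATES)
    t = min(elapsed, total)
    for name, dur in SUBSTATES:
        if t <= dur:
            return name, min(elapsed / total, 0.95)
        t -= dur
    return SUBSTATES[-1][0], 0.95
```

Once M6's events land, real transitions simply override this estimate, and the heuristic remains as the fallback when an event is late.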
## M9 · Board State + Layout

Current: 6×4 grid in CSS, author-driven placement, reactive overlap fixer. Target: structured contract + state recorder.
| Work item | Effort |
|---|---|
| Phase A: add footprint + position_hint + permanence fields to the scene element schema; update the lesson author prompt to emit them; runtime ignores them initially | ~half day |
| Phase B: build the board state recorder; track every element added with permanence + timestamp; provide get_state(), has_room_for(footprint), find_referenced_recently() queries | ~1 day |
| Phase B': M7 starts using has_room_for() to inform same-board-vs-new-board routing | ~1 hr |
| Phase C: active reconciliation. A beat arrives, the system checks fit, erases ephemerals if needed, signals new-board if no room; replaces the F009 reactive auto-fixer | ~1-2 days |
| Visual debug overlay: dev mode shows current board state + footprints | ~1 hr |
Manual test: generate 3 long lessons. Are layouts crowded or clean? Does the new-board decision happen at the right moment?
Move-on score: layout never feels accidental.
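The Phase B recorder and its `has_room_for()` query can be sketched as an occupancy set over the 6×4 grid. The class shape and the first-fit scan are assumptions about the design; the grid dimensions and query names come from the doc:

```python
COLS, ROWS = 6, 4   # the 6x4 board grid from the footprint contract

class BoardState:
    """Minimal board state recorder: tracks placed elements and answers
    whether a w x h footprint still fits anywhere."""

    def __init__(self):
        self.elements: list[dict] = []

    def add(self, elem_id: str, col: int, row: int, w: int, h: int,
            permanence: str = "ephemeral") -> None:
        self.elements.append({"id": elem_id, "col": col, "row": row,
                              "w": w, "h": h, "permanence": permanence})

    def _occupied(self) -> set[tuple[int, int]]:
        """All grid cells currently covered by some element."""
        return {(c, r) for e in self.elements
                for c in range(e["col"], e["col"] + e["w"])
                for r in range(e["row"], e["row"] + e["h"])}

    def has_room_for(self, w: int, h: int) -> bool:
        """First-fit scan: can a w x h footprint land anywhere on the grid?"""
        taken = self._occupied()
        for row in range(ROWS - h + 1):
            for col in range(COLS - w + 1):
                cells = {(c, r) for c in range(col, col + w)
                         for r in range(row, row + h)}
                if not cells & taken:
                    return True
        return False
```

Phase C would extend this: on a `False`, erase elements with `permanence == "ephemeral"` and retry before signaling new-board.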
Ordered by user-visible impact / dependency:
- blank + question_card modes → ~1 day → broader entry surface
- look_up → ~1 day → adds factual citation
- ask_back → ~1 day → enables Phase 4 active learning
- derive_step, play_animation → ~3-4 days → extends the tutor's repertoire

Total to take all 9 modules from current → 🟠+: ~14 working days at AI speed.
The existing 14-phase roadmap in ROADMAP_CANVAS_A_v3.md is the long-term vision. This module roadmap is the immediate execution plan. They map cleanly:
| 14-phase Roadmap | Module work that delivers it |
|---|---|
| Phase 0 – Foundation | Eval harness lives in M2 (gate), observability in M6/M8 (state logging); auth/rate limits are orthogonal infra |
| Phase 1 – Activation engine v1 | Done. F016–F034 covered M1, M2A, M2B, M3, M5 narrate, M9 grid baseline |
| Phase 1.5 – Voice-visual sync + lesson quality | REPLACED by this doc. F024–F032 = M5 highlight, M2B prompt, M2A goldens, etc. |
| Phase 2 – Board craft v1 | = M5 (more skills) + M9 (full reconciliation) + M3 (first scene placeholder) |
| Phase 3 – Interjection v1 | = M6 (FSM) + M7 (5-way router) + M8 (full activity coverage) |
| Phase 4 – Engagement v1 | = M5 new skills (ask_back, evaluate_work v2, pose_problem v2) + M2B prompt for engagement patterns |
| Phase 5 – Pedagogy bedrock | = M2 eval harness + M5 look_up, derive_step + misconception library content |
| Phase 6 – Activation v2 (rich inputs) | = M1 expansion (already partial via F021/F022) + M2B streaming (M4) |
| Phase 7 – Tool-using agent (sandboxed) | = M5 skill maturation + new "tutorial-led entry" mode |
| Phase 8 – Adaptive intelligence + memory | New module M10 (Knowledge State + Memory); not in the v1.2 architecture, will be added |
| Phase 9 – Animation + simulation | = M5 (play_animation skill) + M9 (animation-aware layout) |
| Phase 10 – Mobile + accessibility | = M9 reflow + cross-cutting CSS work |
| Phase 11 – Pre-warm corpus scale | = M2A scale-up (~150 hand-curated → 1500 long-tail) |
| Phase 12 – Production pilot | Orthogonal: infra, deployment, RCT design |
| Phase 13 – Subject coverage + scale | = M1 (math/modern physics) + M2A scale + new module M11 (Teacher/Parent dashboards) |
| Phase 14 – Classroom mode | Parked. Multi-student is a new concern crossing M3 + M6 + M7 + M8 |
Bottom line: the 14-phase roadmap doesn't go away; it just gets executed through module-quality units. When all 9 modules hit 🟠+ (≈ 2 weeks), Phases 1 and 1.5 are done. Phase 2 starts as M5 + M9 work to reach 🟡 solid. Phase 3 starts as M6 + M7 + M8 work to reach 🟡 solid. And so on.
This doc gets updated weekly. As work completes, items move from open to done. When all of M2B's items are done and you score its quality at ≥ 🟠, M2B's row in the at-a-glance table flips to 🟠.
This is a working document. Edit it as we learn. Add items as new gaps appear. Remove items if they prove unnecessary.
Drafted 2026-04-29 in response to: "create that roadmap and work on it granularly." Manual testing only. Module-by-module cadence. The 14-phase roadmap is the destination; this is the route.
Tier A · Polish (~3–4 weeks): take all 9 modules from 🟠 developing → 🟡 solid. Delivers Phases 2 (Board craft), 3 (Interjection), and 4 (Engagement) of the 14-phase roadmap.
Tier B · New capabilities (~2–3 weeks):
- Phase 5 – Pedagogy bedrock: sympy unit validator, source citations everywhere, misconception library v0 with 30 tagged misconceptions, Mayer-principles audit
- Phase 7 – Tool-using agent (sandboxed): tutorial-led entry mode with live planner-executor agent + lookup_fact + derive_step skills
Tier C · New modules + content (~3–4 weeks):
- Phase 8 – Adaptive intelligence + memory (new M10 module): per-user mastery model, FERPA-aware cross-session memory, spaced repetition, diagnostic pre-assessment
- Phase 11 – Pre-warm corpus scale: 150 hand-curated Tier 1 + 500 auto-gated Tier 2 + 850 on-demand Tier 3 lessons across physolympiad + SuperStem + Wikipedia
Tier D · Reach + Deploy (~6–8 weeks):
- Phase 10 – Mobile + accessibility: M9 reflow for phones/tablets, touch mic, captions, screen-reader, WCAG 2.2 AA
- Phase 12 – Production pilot: physolympiad.com integration, S3 audio, real auth + rate limits, 50–150 student RCT vs. Khan Academy
- Phase 13 – Subject coverage + scale: math primitives, modern physics, new M11 (teacher dashboard) + M12 (parent dashboard), library scale-up to 1500 topics
Tier E · Future (~2–3 weeks, on-demand):
- Phase 14 – Classroom mode (parked): multi-voice detection, group UX, per-student aggregation, teacher orchestration. Activates when ≥3 schools ask.
Total elapsed to production pilot: ~2 weeks (current module push) + ~14–19 weeks (Tiers A–D) = ~16–21 weeks total ≈ 4–5 months.