Project dossier
NeuroAssess
Clinical Parkinson's screening portal with transformer-based text inference and decision-support reports.
What it solves
Overview
NeuroAssess is a clinical machine learning screening platform that analyzes patient text and structured clinical signals to estimate Parkinson's disease risk. Its output is decision support, not a diagnosis. Interview focus: the PPMI cohort target classes, leak-free patient-level splits, the traditional and transformer model suite, multimodal stacking, focal loss, model artifact loading, RAG report generation, medical document indexing, dual reports, digital-twin forecasting, and why outputs remain clinical decision support.
Target audience
System design
Architecture
The platform separates model inference, clinical preprocessing, report generation, and the browser portal. The Flask layer loads the trained model and exposes prediction routes while the frontend collects input and explains results. The source repo includes a Flask web app, PPMI feature mapping, LightGBM/XGBoost/SVM training, PubMedBERT/BioGPT/Clinical-T5 training, a multimodal ensemble, TF-IDF document indexing, dual report generation, optional digital-twin progression support, and runtime flags that defer heavy model/PDF work for faster startup.
Architecture diagram
Clinical input layer
Collects patient text, symptom notes, and structured fields through a cautious healthcare-focused UI.
Inference API layer
Loads the trained model, validates incoming fields, and returns prediction output with caveats.
ML pipeline layer
Cleans clinical text, tokenizes input, and runs transformer inference for binary screening.
Report layer
Turns model output into explainable screening language for clinicians and project interviews.
Training orchestration layer
Scripts coordinate traditional model trials, transformer trials, focal-loss training, checkpoint selection, resume support, and RTX A4000 preflight checks.
Knowledge retrieval layer
Medical PDFs and text references are indexed so generated reports can include guideline-aware context instead of only raw class predictions.
Digital twin layer
A forecasting view can estimate progression and treatment scenarios with a fast heuristic path and an optional PPMI-backed bridge.
Implementation surface
Tech stack
Training scripts, preprocessing, inference, and report generation.
Transformer model definition, training, and inference.
Tokenization and transformer architecture support.
Prediction API for the clinical screening workflow.
Clinical record cleaning and feature preparation.
Evaluation metrics, train-test splits, and preprocessing utilities.
Traditional ML baselines for structured PPMI clinical features and comparison against transformer models.
Medical language model family used for clinical text-oriented transformer experiments.
Serialization and loading of traditional model, preprocessor, and ensemble artifacts.
Lightweight retrieval over medical reference documents for report generation.
Operational flow
How it works
The portal accepts clinical text, normalizes it, tokenizes it for a transformer model, returns a Parkinson's screening score, and frames the output as decision support.
Collect clinical context
The user enters clinical observations, symptom descriptions, and optional structured fields.
Clean and normalize
The backend standardizes text, removes unusable fields, handles missing values, and prepares model features.
Tokenize input
Clinical text is converted into token IDs and attention masks that the transformer can process.
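To make "token IDs and attention masks" concrete, here is a toy sketch using a whitespace vocabulary. It is not the project's tokenizer: a real run would use the tokenizer that ships with the chosen transformer checkpoint, and the names `build_vocab`, `encode`, `PAD_ID`, and `UNK_ID` are hypothetical.

```python
from typing import Dict, List

PAD_ID = 0  # hypothetical ID reserved for padding positions
UNK_ID = 1  # hypothetical ID for words outside the vocabulary


def build_vocab(texts: List[str]) -> Dict[str, int]:
    """Assign an integer ID to each lowercase whitespace token."""
    vocab = {"[PAD]": PAD_ID, "[UNK]": UNK_ID}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(text: str, vocab: Dict[str, int], max_length: int) -> Dict[str, List[int]]:
    """Return fixed-length token IDs and a mask: 1 for real tokens, 0 for padding."""
    ids = [vocab.get(tok, UNK_ID) for tok in text.lower().split()][:max_length]
    mask = [1] * len(ids)
    padding = max_length - len(ids)
    return {
        "input_ids": ids + [PAD_ID] * padding,
        "attention_mask": mask + [0] * padding,
    }


vocab = build_vocab(["patient reports resting tremor"])
encoded = encode("Patient reports tremor and rigidity", vocab, max_length=8)
print(encoded)
```

The mask tells the model which positions carry text and which are padding, so padded positions do not influence attention.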
Run inference
The model produces a binary screening prediction and confidence score from the processed input.
Explain the result
The portal presents risk, confidence, limitations, and suggested follow-up language without claiming diagnosis.
Map questionnaire fields to PPMI features
The web layer normalizes user inputs such as age, sex, BMI, tremor, rigidity, bradykinesia, postural instability, sleep, mood, and cognitive scores into model feature names.
This is an interview-critical boundary because invalid or missing clinical fields can silently distort model predictions.
Load model artifacts lazily
Startup can skip heavy initialization, then load models and document indexes on first prediction request when needed.
Lazy loading makes smoke tests and static frontend hosting faster while preserving the full local ML workflow.
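One way to defer heavy initialization is a small wrapper that runs the loader on first use and caches the result. This is a minimal sketch under assumptions: `LazyArtifact` and `load_model_stub` are hypothetical names, and the stub stands in for whatever actually reads the model and index artifacts.

```python
import threading
from typing import Any, Callable, List, Optional


class LazyArtifact:
    """Defer an expensive load until the first call to get(), then cache it."""

    def __init__(self, loader: Callable[[], Any]) -> None:
        self._loader = loader
        self._value: Optional[Any] = None
        self._loaded = False
        self._lock = threading.Lock()

    def get(self) -> Any:
        if not self._loaded:
            # Lock so concurrent first requests trigger only one load.
            with self._lock:
                if not self._loaded:
                    self._value = self._loader()
                    self._loaded = True
        return self._value


load_calls: List[str] = []


def load_model_stub() -> dict:
    # Hypothetical stand-in for reading model and tokenizer artifacts from disk.
    load_calls.append("model")
    return {"name": "stub-model"}


model = LazyArtifact(load_model_stub)  # nothing is loaded at startup
```

Calling `model.get()` inside the prediction handler pays the load cost once, on the first request, and later calls reuse the cached object.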
Retrieve medical context for reports
The report workflow retrieves relevant disease information, guideline text, and feature interpretations before writing clinician-readable output.
The prediction is only one part of the system; the report must explain why a class matters and what follow-up language is safe.
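The retrieval step can be illustrated with a small TF-IDF ranker written against the standard library. This is a sketch, not the project's indexer: the document IDs, sample texts, smoothing formula, and function names are all assumptions chosen for the example.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z]+", text.lower())


def build_index(docs: Dict[str, str]) -> Tuple[Dict[str, float], Dict[str, Dict[str, float]]]:
    """Return (idf, per-document TF-IDF vectors)."""
    counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    doc_freq: Counter = Counter()
    for term_counts in counts.values():
        doc_freq.update(term_counts.keys())
    n_docs = len(docs)
    # Smoothed IDF so every indexed term keeps a positive weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = {d: {t: c * idf[t] for t, c in tc.items()} for d, tc in counts.items()}
    return idf, vectors


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, idf, vectors, top_k: int = 2) -> List[Tuple[str, float]]:
    """Rank documents by cosine similarity; drop documents with no overlap."""
    q_counts = Counter(tokenize(query))
    q_vec = {t: c * idf[t] for t, c in q_counts.items() if t in idf}
    scored = sorted(((cosine(q_vec, v), d) for d, v in vectors.items()), reverse=True)
    return [(d, score) for score, d in scored[:top_k] if score > 0]


docs = {  # hypothetical reference snippets
    "tremor_guide": "Resting tremor is a common early motor sign.",
    "sleep_guide": "REM sleep behavior disorder can precede motor symptoms.",
    "gait_guide": "Gait instability and falls need review.",
}
idf, vectors = build_index(docs)
results = retrieve("resting tremor", idf, vectors)
```

The retrieved snippets would then be passed to the report writer as context alongside the prediction.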
Generate optional digital twin scenarios
The twin dashboard can produce progression or treatment scenario views using fast heuristics by default and a PPMI-backed bridge when enabled.
This separates demo responsiveness from heavier research workflows.
Sequence diagram
Concept depth
Key concepts
Transformers process all positions in parallel and learn which tokens should attend to each other. This makes them effective for long text where important clues may be far apart.
In NeuroAssess: transformer inference captures clinical wording patterns that simpler bag-of-words features can miss.
Confidence
Implementation evidence
Code highlights
Inference route
The API validates text input, builds tensors, and returns a cautious screening result.
The route rejects empty clinical text before the model path.
The response includes a medical disclaimer because screening output is not diagnosis.
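The route behavior described above can be sketched as a plain function that a Flask view could wrap. The field names `clinicalText`, `riskScore`, `screening`, and `disclaimer` follow the example payload in this dossier; the 0.5 threshold, the "routine" label, the error body, and the `score_fn` parameter (a stand-in for tokenization plus model inference) are assumptions.

```python
from typing import Any, Callable, Dict, Tuple

DISCLAIMER = "Decision support only; not a diagnosis."
REVIEW_THRESHOLD = 0.5  # assumed cut-off, not taken from the project


def handle_predict(
    payload: Dict[str, Any], score_fn: Callable[[str], float]
) -> Tuple[Dict[str, Any], int]:
    """Validate input, score it, and return (JSON body, HTTP status)."""
    text = payload.get("clinicalText")
    # Reject missing, non-string, or whitespace-only text before the model path.
    if not isinstance(text, str) or not text.strip():
        return {"error": "clinicalText is required", "disclaimer": DISCLAIMER}, 400

    risk = float(score_fn(text.strip()))
    screening = "review" if risk >= REVIEW_THRESHOLD else "routine"
    body = {
        "riskScore": round(risk, 4),
        "screening": screening,
        "disclaimer": DISCLAIMER,  # attached to every response
    }
    return body, 200


body, status = handle_predict(
    {"clinicalText": "Patient reports tremor and gait instability."},
    score_fn=lambda _text: 0.7134,  # fixed score for illustration only
)
```

Keeping validation in a plain function makes the rejection path testable without starting the web server or loading a model.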
Clinical metric framing
Model evaluation reports sensitivity and specificity so interview answers stay healthcare-aware.
Medical ML should not be defended with accuracy alone.
Sensitivity and specificity expose the false-negative and false-positive trade-off.
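A short worked example shows why accuracy alone misleads on imbalanced screening data. The labels below are synthetic and the function names are hypothetical; the formulas are the standard definitions, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP).

```python
from typing import Dict, Optional, Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fn, fp, tn) for binary labels where 1 is the positive class."""
    tp = fn = fp = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1:
            fn += 1
        elif pred == 1:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


def safe_ratio(numerator: int, denominator: int) -> Optional[float]:
    """Return None instead of dividing by zero when a class is absent."""
    return numerator / denominator if denominator else None


def screening_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, Optional[float]]:
    tp, fn, fp, tn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": safe_ratio(tp + tn, tp + fn + fp + tn),
        "sensitivity": safe_ratio(tp, tp + fn),  # share of true positives caught
        "specificity": safe_ratio(tn, tn + fp),  # share of true negatives cleared
    }


# Synthetic case: 5 positives in 100 patients, and a model that always predicts negative.
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100
metrics = screening_metrics(y_true, y_pred)
print(metrics)  # accuracy 0.95, sensitivity 0.0, specificity 1.0
```

The always-negative model scores 95% accuracy while missing every positive case, which is exactly the failure sensitivity exposes.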
Patient-level split guard
The training pipeline should split by patient identifier before expanding records into model rows.
The split happens at patient level, not row level.
The assertion makes leakage visible during development.
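A patient-level split with a leakage assertion might look like the sketch below. The `patient_id` key, the 20% test fraction, and the seed are assumptions; the same idea is available in scikit-learn as a group-aware splitter, but the version here uses only the standard library.

```python
import random
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def split_by_patient(
    rows: List[Row], test_fraction: float = 0.2, seed: int = 42
) -> Tuple[List[Row], List[Row]]:
    """Assign whole patients to train or test so no patient appears in both."""
    patient_ids = sorted({row["patient_id"] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(patient_ids)

    n_test = max(1, round(len(patient_ids) * test_fraction))
    test_ids = set(patient_ids[:n_test])

    train = [row for row in rows if row["patient_id"] not in test_ids]
    test = [row for row in rows if row["patient_id"] in test_ids]

    # Guard: fail loudly during development if any patient leaks across the split.
    train_ids = {row["patient_id"] for row in train}
    assert train_ids.isdisjoint(test_ids), "patient leakage between train and test"
    return train, test


# Synthetic data: 10 patients with 3 visits each.
rows = [{"patient_id": pid, "visit": v} for pid in range(10) for v in range(3)]
train_rows, test_rows = split_by_patient(rows)
```

Because the shuffle operates on patient IDs rather than rows, every visit from a held-out patient stays out of training.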
Safe clinical field normalization
Clinical form input is coerced into the feature schema while preserving missing-value behavior.
Missing clinical fields are surfaced instead of silently converted to zeros.
The model boundary is the normalized PPMI feature schema.
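The normalization boundary can be sketched as a function that coerces form values and reports which fields are missing rather than filling them with zeros. The field names come from the example request in this dossier; the sex encoding, the use of `None` as the missing marker, and the function name are assumptions.

```python
import math
from typing import Any, Dict, List, Optional, Tuple

NUMERIC_FIELDS = ("age", "sym_tremor", "sym_rigid", "moca")
SEX_CODES = {"male": 1, "female": 0}  # hypothetical encoding


def normalize_fields(form: Dict[str, Any]) -> Tuple[Dict[str, Optional[float]], List[str]]:
    """Coerce form input to the feature schema; return (features, missing field names)."""
    features: Dict[str, Optional[float]] = {}
    missing: List[str] = []

    for name in NUMERIC_FIELDS:
        try:
            value = float(form.get(name))
            if not math.isfinite(value):
                raise ValueError("non-finite value")
            features[name] = value
        except (TypeError, ValueError):
            # Keep the gap explicit: None, never a silent 0.
            features[name] = None
            missing.append(name)

    sex = SEX_CODES.get(str(form.get("SEX", "")).strip().lower())
    features["SEX"] = sex
    if sex is None:
        missing.append("SEX")
    return features, missing


features, missing = normalize_fields(
    {"age": "63", "SEX": "Male", "sym_tremor": 2, "sym_rigid": "", "moca": None}
)
```

Returning the `missing` list lets the caller decide whether to impute, warn in the report, or reject the request.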
Contracts
API design
Base URL: http://localhost:5000
/predict
Runs Parkinson's screening inference for clinical text.
Request: { "clinicalText": "Patient reports tremor and gait instability." }
Response: { "riskScore": 0.7134, "screening": "review", "disclaimer": "Decision support only; not a diagnosis." }
/report
Generates a clinician-readable report for a prediction result.
/api/predict
Normalizes patient data, loads model artifacts if needed, and returns cohort probabilities plus decision-support text.
Request: { "age": 63, "SEX": "male", "sym_tremor": 2, "sym_rigid": 1, "moca": 24 }
Response: { "predictedClass": "PRODROMAL", "confidence": 0.67, "disclaimer": "Decision support only." }
/api/reports/dual
Generates patient-facing and clinician-facing report variants from the same prediction and retrieved references.
/api/documents/upload
Accepts PDF or text medical references and indexes them for report retrieval experiments.
/api/twin/project
Returns digital-twin progression or treatment scenario output for a normalized patient profile.
State model
Database design
Data relationship diagram
model_artifact
Serialized model, tokenizer settings, and preprocessing configuration.
prediction_log
Optional local record of screening requests for development and audit experiments.
medical_docs
Reference documents used by the portal's explanatory report workflow.
ppmi_patient_features
Curated patient-level clinical features mapped from PPMI records before training and inference.
model_registry
Saved model artifacts, preprocessing artifacts, checkpoint metadata, and validation metrics.
document_index
Medical reference documents indexed for TF-IDF retrieval and report context.
twin_scenario
Optional progression or treatment scenario outputs generated for digital-twin views.
Architecture decisions
Trade-offs
Model family
Transformer classifier over LSTM or bag-of-words model
Clinical notes can contain long-range context. Attention gives stronger handling of distant cues than recurrent or shallow text features.
API framework
Flask over FastAPI
The inference service is request-response oriented and small. Flask is sufficient and keeps the clinical ML path straightforward.
Product framing
Decision support over diagnostic claim
A model prediction should support review, not replace clinical judgment or overstate medical validity.
Validation split
Patient-level split over random row split
Repeated PPMI visits can leak patient identity across train and test. Patient-level splitting gives a more honest estimate of generalization.
Training objective
Class-weighted focal loss over plain cross-entropy
The cohort labels are imbalanced and clinically important minority classes should not be ignored by a model that optimizes only easy examples.
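Class-weighted focal loss scales the cross-entropy term for the true class by a class weight and by (1 - p_t)^gamma, where p_t is the predicted probability of the true class. The sketch below computes it for a single example in plain Python. The class order, the weights, and gamma = 2.0 are illustrative assumptions, and a training run would use a tensor implementation instead.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(
    logits: Sequence[float],
    target: int,
    class_weights: Sequence[float],
    gamma: float = 2.0,
) -> float:
    """Loss = -weight[target] * (1 - p_t)**gamma * log(p_t)."""
    p_t = softmax(logits)[target]
    return -class_weights[target] * (1.0 - p_t) ** gamma * math.log(p_t)


# Assumed class order HC, PD, SWEDD, PRODROMAL with hypothetical weights
# that up-weight the rarer classes.
weights = [1.0, 1.0, 2.0, 2.0]

easy = focal_loss([4.0, 0.0, 0.0, 0.0], target=0, class_weights=weights)
hard = focal_loss([0.0, 4.0, 0.0, 0.0], target=0, class_weights=weights)
```

Confidently correct examples contribute almost nothing, so gradient signal concentrates on hard and minority-class cases; with gamma set to 0 and unit weights the expression reduces to ordinary cross-entropy.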
Report generation
RAG-enhanced explanatory reports over returning only class probabilities
A clinical-support tool must explain risk factors, caveats, and follow-up considerations in language a clinician can review.
Frontend deployment
Static Vite frontend on Vercel with external Flask API over bundling local ML inference into Vercel
Model loading and PDF indexing are too heavy for a static frontend deployment, so the hosted UI should call a separate backend.
Lessons learned
Challenges and solutions
Problem
Class imbalance can make a model look accurate while missing positive cases.
Solution: Evaluate with sensitivity, specificity, confusion matrices, and threshold discussion.
Lesson: Healthcare ML needs metrics aligned to clinical risk, not just a single score.
Problem
Clinical text can include missing, inconsistent, or noisy fields.
Solution: Normalize text, validate required inputs, and make missing data behavior explicit in preprocessing.
Lesson: Data quality handling is part of the model, not a side concern.
Problem
Clinical models can look strong if patient visits leak across train and test splits.
Solution: Split by patient ID, assert disjoint patients, and report validation metrics from held-out patients only.
Lesson: For medical ML, evaluation design is part of the product's credibility.
Problem
Transformer training can be interrupted on long GPU runs.
Solution: Add A4000 preflight checks, resumable training scripts, checkpoint selection by validation F1, and resume commands.
Lesson: ML systems need operational training workflows, not just model code.
Problem
PDF extraction and model initialization slow down basic web smoke tests.
Solution: Defer PDF full-text extraction and allow skip-init startup while keeping full local initialization available through flags.
Lesson: Heavy ML systems benefit from runtime modes that separate UI checks from full inference readiness.
Runbook
Requirements and future work
Requirements
- Python 3.x runtime with Flask.
- PyTorch and Transformers packages for model inference.
- Trained model and tokenizer artifacts available locally.
- Clinical dataset used for training contains approximately 42,000 patient records according to the PRD.
- PPMI curated CSV files must be present before training or evaluation.
- sacremoses is required for BioGPT tokenization.
- CUDA-enabled PyTorch is recommended for transformer training, with A4000 preflight scripts available.
- PD_EXTRACT_PDF_TEXT enables full PDF extraction when RAG experiments need it.
- PD_TWIN_BRIDGE_ENABLED enables the optional PPMI-backed digital-twin bridge.
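The two runtime flags above could be read from the environment along these lines. Only the flag names come from this dossier; the accepted truthy values, the default of "off", and the function names are assumptions about how such flags are commonly parsed.

```python
import os
from typing import Dict, Mapping

TRUTHY = {"1", "true", "yes", "on"}  # assumed accepted values


def flag_enabled(name: str, env: Mapping[str, str] = os.environ) -> bool:
    """Treat a flag as on only when it is set to a recognized truthy value."""
    return env.get(name, "").strip().lower() in TRUTHY


def runtime_mode(env: Mapping[str, str] = os.environ) -> Dict[str, bool]:
    """Summarize which heavy features are enabled for this process."""
    return {
        "extract_pdf_text": flag_enabled("PD_EXTRACT_PDF_TEXT", env),
        "twin_bridge": flag_enabled("PD_TWIN_BRIDGE_ENABLED", env),
    }


# Passing a dict instead of os.environ keeps the example deterministic.
mode = runtime_mode({"PD_EXTRACT_PDF_TEXT": "1"})
print(mode)  # {'extract_pdf_text': True, 'twin_bridge': False}
```

Defaulting both flags to off matches the goal of fast startup, with heavier paths enabled explicitly.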
Future improvements
- Add probability calibration, calibrated confidence intervals, and threshold sliders for exploring the sensitivity/specificity trade-off.
- Add an explicit model card page inside the portal describing dataset version, cohort distribution, leakage controls, and known limitations.
- Add clinician feedback loops for post-review outcome capture.
- Persist anonymized prediction audit records with consent-aware retention controls.
- Add external validation on a dataset outside PPMI before making stronger clinical claims.
Active recall
Interview Q&A
Why call this decision support instead of diagnosis?
Why are sensitivity and specificity important here?
What would you harden before clinical deployment?
Why is patient-level splitting mandatory for PPMI data?
What are HC, PD, SWEDD, and PRODROMAL in this project?
Why include traditional ML if transformer models exist?
What is the role of RAG in NeuroAssess?
How would you explain focal loss in this clinical setting?
What should be checked before deploying this as a clinical tool?
Why does the frontend deploy separately from the Flask ML backend?
What does model calibration add beyond accuracy?