Machine Learning · R Programming · Harvard

Data Science Portfolio
by Sarah Silva

A rigorous end-to-end machine learning study completed at Harvard, applying 10+ algorithms to real-world datasets — from cancer diagnosis to movie recommendations.

12+ Datasets
10+ Algorithms
97.4% Best Accuracy
🎓 Harvard University · Data Science Certificate
01 — Overview

Key Metrics

Performance highlights across all models and datasets

97.4% Best Model (KNN)
96.5% Best Ensemble
88.3% Titanic RF
98.8% SVD Variance
λ=135 Optimal Reg.
🎗️ BRCA — Cancer Classification
Breast cancer biopsy · 569 samples · 30 features
4 Models
Logistic Regression: 93.9%
LOESS / GAM: 93.0%
Random Forest: 94.8%
KNN (k=9) ★: 97.4%
Ensemble: 96.5%
Key insight: KNN at k=9 outperformed all models including the ensemble, achieving 97.4% accuracy after proper feature scaling with sweep(). The most important predictor was area_worst.
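A minimal sketch of that scaling-plus-KNN step, assuming dslabs::brca, matrixStats and caret; the seed, split and k grid are illustrative:

```r
library(dslabs)       # brca: 569 biopsies x 30 features
library(matrixStats)  # colSds()
library(caret)        # createDataPartition(), train()

data(brca)
# Z-score every feature column with sweep(): center, then scale
x <- sweep(brca$x, 2, colMeans(brca$x), FUN = "-")
x <- sweep(x, 2, colSds(brca$x), FUN = "/")

set.seed(1)
test_idx <- createDataPartition(brca$y, p = 0.2, list = FALSE)

# Tune k by resampling on the training split; k = 9 won in this study
fit <- train(x[-test_idx, ], brca$y[-test_idx], method = "knn",
             tuneGrid = data.frame(k = seq(3, 21, 2)))
mean(predict(fit, x[test_idx, ]) == brca$y[test_idx])  # test accuracy
```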
🚢 Titanic Survival Prediction
891 passengers · Sex, class, age, fare
6 Models
Sex only: 78.8%
Class only: 68.2%
GLM (4 predictors): 82.1%
Decision Tree: 84.9%
KNN (k=15): 73.2%
Random Forest ★: 88.3%
Key insight: Random Forest (mtry=2, ntree=100) was the best model with 88.3%. Females in 1st and 2nd class had the highest survival rates (>90%). The decision tree revealed Sex → Age → Fare → Class as the key decision splits.
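A sketch of the winning forest (mtry=2, ntree=100), assuming the titanic package's titanic_train; the median-age imputation is a deliberate simplification:

```r
library(titanic)        # titanic_train: 891 passengers
library(randomForest)

df <- titanic_train
df$Survived <- factor(df$Survived)                     # 0 = died, 1 = survived
df$Sex      <- factor(df$Sex)
df$Age[is.na(df$Age)] <- median(df$Age, na.rm = TRUE)  # fill missing ages

set.seed(42)
fit <- randomForest(Survived ~ Sex + Pclass + Age + Fare,
                    data = df, mtry = 2, ntree = 100)
fit              # out-of-bag error and confusion matrix
importance(fit)  # which predictors matter most
```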
🔢 MNIST 27 — Ensemble Comparison
Digit recognition (2s vs 7s) · 7 models
Ensemble: 82.5%
GLM: 77.5%
LDA: 77.5%
Naive Bayes: 81.5%
KNN ★: 83.5%
LOESS: 83.5%
QDA: 81.5%
Random Forest: 82.0%
Key insight: KNN and LOESS tied at 83.5%, beating the 7-model ensemble (82.5%). When only models with CV accuracy ≥ 80% were included (gamLoess, QDA), the filtered ensemble maintained 82.5% — confirming that weaker models don't always hurt.
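A sketch of the 7-model majority vote, assuming caret and dslabs::mnist_27; each caret method pulls in its own backing package (e.g. gam for "gamLoess"):

```r
library(caret)
library(dslabs)
data(mnist_27)

models <- c("glm", "lda", "naive_bayes", "knn", "gamLoess", "qda", "rf")
fits <- lapply(models, function(m)
  train(y ~ ., method = m, data = mnist_27$train))

# Matrix of test-set predictions: one column per model, entries "2" or "7"
preds <- sapply(fits, function(f) as.character(predict(f, mnist_27$test)))

# Majority vote: predict "7" whenever more than half of the models do
y_hat <- ifelse(rowMeans(preds == "7") > 0.5, "7", "2")
mean(y_hat == mnist_27$test$y)  # ensemble accuracy (82.5% in this study)
```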
📐 KNN Accuracy vs. k — Heights Dataset
Sex prediction from height · F1 score optimization
[Plot: F1 score vs. k, k = 1 to 101 · peak at k = 40 ★]
Key insight: The optimal k=40 maximized F1 score at 0.619 for sex prediction from height. Lower k values overfit (high variance), while higher k values underfit. The curve peaks around k=40 before gradually declining.
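A sketch of the k sweep, assuming dslabs::heights and caret's knn3()/F_meas(); the split proportion and step size are illustrative:

```r
library(caret)
library(dslabs)
data(heights)

set.seed(1)
test_idx  <- createDataPartition(heights$sex, p = 0.5, list = FALSE)
train_set <- heights[-test_idx, ]
test_set  <- heights[test_idx, ]

ks <- seq(1, 101, 2)
f1 <- sapply(ks, function(k) {
  fit   <- knn3(sex ~ height, data = train_set, k = k)
  y_hat <- predict(fit, test_set, type = "class")
  F_meas(data = y_hat, reference = test_set$sex)
})
ks[which.max(f1)]  # ~40 in this study
max(f1)            # ~0.619
```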
🏫 Schools Regularization — RMSE vs λ
1000 simulated schools · Finding optimal penalty
[Plot: RMSE vs. λ, λ = 10 to 250 · minimum at λ = 135 ★]
Key insight: λ=135 minimizes RMSE when estimating school quality. Without regularization, small schools dominate top rankings due to high variance — not because they are actually better. Regularization "shrinks" noisy estimates toward the overall mean.
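A sketch of the penalty sweep; `scores` (a list of per-school score vectors) and `schools$quality` are assumed to exist from the simulation:

```r
overall <- mean(sapply(scores, mean))  # grand mean across all schools

lambdas <- seq(10, 250)
rmse <- sapply(lambdas, function(lambda) {
  # Shrink each school's deviation toward the grand mean; schools with
  # few students (small length(s)) are shrunk the hardest
  est <- sapply(scores, function(s)
    overall + sum(s - overall) / (length(s) + lambda))
  sqrt(mean((est - schools$quality)^2))
})
lambdas[which.min(rmse)]  # 135 in this study
```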
🧬 BRCA PCA — Variance Explained
30 principal components · Cancer features
PC1: 44.3%
PC2: 19.0%
PC3: 9.4%
PC4: 6.6%
PC5: 5.5%
PC6: 4.0%
PC7 (→90% cumulative): 3.1%
Key insight: PC1 alone explains 44.3% of variance and clearly separates benign from malignant tumors — the only PC with non-overlapping IQRs between groups. Just 7 components are needed to explain 90% of total variance.
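A sketch of the variance-explained calculation, reusing the scaled 569 × 30 matrix `x` from the BRCA sketch above:

```r
pca <- prcomp(x)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

round(var_explained[1:3], 3)             # 0.443, 0.190, 0.094
which(cumsum(var_explained) >= 0.90)[1]  # 7 components reach 90%

# Per-group spread of PC1 scores (B vs M): the non-overlapping IQRs
boxplot(pca$x[, 1] ~ brca$y, xlab = "Tumor type", ylab = "PC1 score")
```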
02 — Visualizations

Plots from R

Real outputs generated during the analysis, each with a short explanation

Titanic Decision Tree
Classification tree for survival prediction — Sex, Age, SibSp, Fare and Pclass splits
rpart · cp=0.02
MovieLens — Genre Effect
Average rating ± SE by genre combination. Comedy has the lowest average (3.27 stars)
movielens · error bars
MovieLens — Time Effect
Weekly average rating over time (1995–2017) with LOESS smoother — mild downward trend
movielens · LOESS
Rating Frequency vs. Average
More frequently rated films have higher average ratings — popularity bias confirmed
movielens · scatter
Ratings by Release Year
Boxplot of ratings count per film by release year — 1995 has the highest median
movielens · boxplot
SVD — V Matrix Heatmap
The right singular vectors reveal structure in the 24-subject grade matrix (Math, Science, Arts)
SVD · my_image()
Correlation Matrix — Student Grades
High within-subject correlations, moderate between Math/Science, weaker with Arts
SVD · cor(y)
Student Grades Heatmap
100 students × 24 subjects. Top students (red) cluster at top, with 3 subject groupings visible
SVD · my_image(y)
BRCA — PCA Boxplots (B vs M)
PC1 is the only component with non-overlapping IQRs between Benign and Malignant tumors
BRCA · PCA · boxplot
BRCA — PC1 vs PC2 Scatter
Malignant (teal) clearly separated from Benign (pink) along PC1 axis — strong signal
BRCA · PCA · scatter
03 — Key Findings

What the Data Revealed

The most important insights from applying ML theory to real datasets

97.4%
KNN Dominates BRCA
After proper Z-score scaling with sweep(), KNN at k=9 outperformed all individual models and even the 4-model ensemble on cancer diagnosis.
λ=135
Regularization Fixes Rankings
The "small school effect" — small schools dominating top rankings due to noise — is corrected entirely by regularization. Optimal λ=135 minimizes RMSE.
98.8%
3 SVD Components Suffice
Just 3 singular vectors capture 98.8% of variance in 100×24 grade data. Component 1 = ability, 2 = Arts vs STEM, 3 = Math vs Science.
7.4×
Bayes Theorem in Practice
A positive test (85% sensitivity, 90% specificity) for a disease with 2% prevalence increases the probability of disease by 7.4×, from 2% to 14.8% (worked through in the sketch after these findings).
75%
Data Leakage Demonstrated
Selecting 108 "significant" predictors from random data before cross-validation inflated accuracy to 75% — a textbook case of data leakage invalidating results.
PC1
One Component Separates Cancer
PC1 alone (44.3% of variance) separates malignant from benign tumors with no IQR overlap. The other nine of the first ten components all show overlapping distributions.
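The Bayes figure above can be checked with a few lines of arithmetic, using the values stated in the finding:

```r
# The 7.4x Bayes figure, reproduced with plain arithmetic
sens <- 0.85; spec <- 0.90; prev <- 0.02
p_pos <- sens * prev + (1 - spec) * (1 - prev)  # P(test positive) = 0.115
post  <- sens * prev / p_pos                    # P(disease | positive test)
post         # 0.148 -> 14.8%
post / prev  # 7.4x the 2% prior
```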
04 — Algorithms

Models Implemented

All methods applied, tuned, and evaluated with proper cross-validation

📈
Logistic Regression
BRCA: 93.9%
Linear boundary classification. Used across Titanic, BRCA and MNIST datasets.
🌲
Random Forest
Titanic: 88.3%
Bagged decision trees with random feature subsets. Best model for Titanic. Key BRCA variable: area_worst.
🔵
K-Nearest Neighbors
BRCA: 97.4%
Distance-based. Best overall model. Requires scaling — very sensitive to feature normalization.
〰️
LOESS / GAM
BRCA: 93.0%
Local polynomial regression. Non-parametric smoother for both regression and classification tasks.
🌳
Decision Trees
Titanic: 84.9%
Interpretable splits via rpart. Tuned with complexity parameter cp.
📊
LDA / QDA
MNIST: 77.5%
Linear/quadratic discriminant analysis. Fast parametric classifiers assuming Gaussian distributions.
🧮
PCA / SVD
98.8% variance
Dimensionality reduction. 3 components suffice for student grades, PC1 separates cancer types.
🗳️
Ensemble (Vote)
BRCA: 96.5%
Majority vote across 4–7 models. Reduces variance but can be outperformed by a single strong model.