About

I’m a data scientist at Princeton University, formerly in Computer Science and currently in Ecology and Evolutionary Biology. I’m also affiliated with the Center for Statistics and Machine Learning.

My interests and expertise are in software engineering, statistics, machine learning, and the careful analysis of large datasets, with frequent sanity checks such as data visualizations and software tests.

As a departmental data scientist, I primarily do collaborative research by rotating among labs (~4-month terms to maximize diversity) and queueing projects via networking within the department. I also

  • teach workshops
  • mentor individual students or postdocs
  • serve as a consultant
  • develop computational workflows that enable or accelerate several projects

Highlights

Here, I highlight my technical contributions to selected collaborations in which I took the lead on a modular component. These contributions apply the skills mentioned above to various data modalities, including 3D movies of neural activity, spatial transcriptomics, and whole-genome sequencing (WGS).

  • GASTON: deep neural network to segment domains and study continuous variation in gene expression from spatial transcriptomics data
    • refactored code into a Python package to run at scale and optimize the neural network architecture
    • featurized histology images (H&E stained) to facilitate tissue segmentation
    • analyzed several colorectal cancer datasets to characterize metabolic gradients and the tumor microenvironment
GASTON

(A) H&E stain of a 10x Genomics Visium colorectal tumor sample. (B) Spatial domains learned by GASTON. Domains 1 and 2 are labeled as tumor and tumor-adjacent stroma, respectively, based on the histology image in (A). (C) Spatial gradients learned by GASTON show directions of maximum gene expression changes in tumor and tumor-adjacent stromal domains.

  • 3D movie analysis: image segmentation to decode neural activity in the mosquito antennal lobe (AL)
    • created Python workflows to measure activity of the entire AL
    • discovered technical batch effects and used the experimental design to correct them via custom statistical models
    • segmented individual glomeruli (clusters of nerve endings) within the AL via nonnegative matrix factorization (NMF)
Neural activity visualization

2D max projection of raw data from the AL of a mosquito that received an odor puff. These data are noisy even after median filtering.

Neural activity visualization

Custom visualization of NMF segmentation shows spatial positions and activity traces of glomeruli.

  • HATCHet2: copy-number calling (amplifications/deletions of DNA) for tumor WGS data
    • wrote several modules to phase genotypes into haplotypes using the 1000GP reference panel
    • Dockerized software for cloud computing on GCP and AWS
    • analyzed large datasets in collaboration with the Genomic Data Analysis Network
    • gave virtual invited talk at University of Edinburgh’s biomedical AI seminar
HATCHET

From tumor WGS data, HATCHet2 extracts two features that are correlated with copy number: the fractional copy number (rescaled from read depth) and the mirrored B-allele frequency (BAF). For normal diploid regions, these values should be 2 and 0.5, respectively, which we observe for some genomic regions in this sample (orange dots). However, most tumors exhibit striking variation in copy number along the genome.
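To make the two features concrete, here is a minimal Python sketch. It is not HATCHet2's code: the function names, the example bins, and the fixed scaling (a tumor/normal read-depth ratio of 1.0 mapped to 2 copies) are all assumptions made for illustration. In practice the scaling has to be inferred from the data, since it depends on tumor purity and ploidy.

```python
def mirrored_baf(b_reads: int, total_reads: int) -> float:
    """Fold the B-allele frequency onto [0, 0.5] so it does not depend on which allele is called 'B'."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    baf = b_reads / total_reads
    return min(baf, 1.0 - baf)


def fractional_copy_number(read_depth_ratio: float, baseline_copies: float = 2.0) -> float:
    """Rescale a tumor/normal read-depth ratio, assuming a ratio of 1.0 corresponds to baseline_copies."""
    return baseline_copies * read_depth_ratio


# Hypothetical genomic bins: (B-allele reads, total reads, tumor/normal read-depth ratio)
bins = [(50, 100, 1.0), (75, 100, 1.5), (100, 100, 0.5)]

# One (fractional copy number, mirrored BAF) pair per bin
features = [(fractional_copy_number(r), mirrored_baf(b, n)) for b, n, r in bins]
```

Under these assumptions, the first bin lands at (2, 0.5), the point expected for a normal diploid region. The other two bins move away from it in the way an amplification and a deletion with loss of one allele would.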

  • snpArcher: workflow to automate variant calling in nonmodel organisms
    • authored original Snakemake code and supplementary algorithms to massively parallelize two variant callers
  • Tuskless African elephants: poaching drives evolution of tusklessness, a female-specific trait encoded by a male-lethal mutation
    • contributed all genomic analyses as co-first author
    • led to over 411 news stories from 301 media outlets
    • invited guest on the Nice Genes! podcast

Background

For the past 6 years, I have worked as a staff scientist in academia.

I spent two of those years at Harvard Informatics, creating workflows for analyzing genomic datasets, researching with faculty, and teaching introductory workshops on bioinformatics.

I then moved to Princeton University, where I spent 3 years as a biomedical data scientist in the DataX initiative. At Princeton, I worked in the Computer Science department doing research on cancer genomics and teaching workshops on software engineering and machine learning. Through these experiences, I learned to think about data in more general ways beyond my specialization in genomics.

I now have a broad interest in algorithms, statistics, and machine learning. As a departmental data scientist in Ecology and Evolutionary Biology, I’m interested in applying these skills to any kind of data that could be useful to address interesting questions.