Joyce Yu

Goals and Aspirations:

"Looking to explore our world in the most efficient ways possible..."

Hi there, my name is Joyce! I have a strong interest in building and integrating machine learning tools into my life's work, whether in the tech industry, the biopharmaceutical space, or other business sectors. I want to explore it all! In particular, I want to learn more about our world through the data that describes it and find where I can make a positive impact with the skills I've learned along the way. If you would like to connect, please don't hesitate to reach me at joyceyu579@berkeley.edu.

Projects:

Genomic Data Simulation at Exact Sciences

Role: NGS Application Developer
Team: P.O.B.O.T. (Precision Oncology Bioinformatics Operations Team)

Languages: Python, Bash
Infrastructure & DevOps: AWS (EC2, S3, CLI), HPC clusters, CI/CD, GitLab, Pixi
Workflow & HPC Scheduling: Nextflow (DSL2/Groovy), SLURM job scripts
Bioinformatic tools: BioPython, numpy, re, pandas, matplotlib, argparse, samtools, wgsim, IGV, BAMSurgeon, seqtk, bcftools, pysam
Project Management: Jira, Confluence, Agile and Waterfall methodologies

As the NGS Application Developer on the POBOT team at Exact Sciences, I developed a suite of CLI tools using Python, Bash, and Nextflow, leveraging cloud computing services, high-performance computing (HPC) clusters, and existing bioinformatics tools and Python libraries to build software for genomic data simulation. These tools facilitated the development of variant calling algorithms, diagnostic studies, and the maintenance of bioinformatics pipelines in the OncoExTra product.
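To give a flavor of what "genomic data simulation" means, here is a toy sketch of read simulation in plain Python. It is not the tooling I built at Exact Sciences; it only illustrates the general idea behind simulators such as wgsim: sample fixed-length reads from a reference and inject substitution errors at a chosen rate. The function name `simulate_reads`, its parameters, and the short reference string are all made up for illustration.

```python
import random


def simulate_reads(reference, n_reads, read_len, error_rate, seed=0):
    """Sample reads uniformly from `reference` and add substitution errors.

    Returns a list of (start_position, read_sequence) tuples.
    """
    rng = random.Random(seed)  # seeded so that runs are reproducible
    bases = "ACGT"
    reads = []
    for _ in range(n_reads):
        # Pick a start so that the whole read fits inside the reference.
        start = rng.randrange(len(reference) - read_len + 1)
        fragment = list(reference[start:start + read_len])
        for i, base in enumerate(fragment):
            # With probability `error_rate`, swap in a different base.
            if rng.random() < error_rate:
                fragment[i] = rng.choice([b for b in bases if b != base])
        reads.append((start, "".join(fragment)))
    return reads


# Hypothetical toy reference, not real genomic data.
reference = "ACGTTGCAAGCTTAGGCTAACGTTAGC"
reads = simulate_reads(reference, n_reads=5, read_len=10, error_rate=0.05)
for start, seq in reads:
    print(start, seq)
```

Because the true origin of every read is known, simulated data like this can serve as ground truth when checking an aligner or a variant caller.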

Additionally, I implemented CI/CD processes with GitLab and contributed to the development of automated testing frameworks, ensuring the reliability and accuracy of bioinformatics analyses. I also presented project results to stakeholders of the BBDS (Biological and Bioinformatics Data Science) pillar at the San Diego office location, demonstrating technical solutions and their impact on precision oncology workflows.

Beyond development, I actively engaged in journal clubs featuring live demonstrations of emerging AI tools (e.g., Perplexity, MCPs, and Biomni), gaining exposure to cutting-edge technologies and best practices in bioinformatics. My work supported the advancement of novel algorithms, validation studies, and precision oncology solutions enabling personalized treatment strategies for cancer patients.

Machine Learning for Precision Oncology:
Predicting Personalized Anti-Cancer Treatments for 200,000+ Breast Cancer Patients

Languages: Python
Libraries: Keras, Scipy, PyTorch, RDKit, NumPy, Pandas, matplotlib/seaborn

In this project, my team and I built a machine learning pipeline to match breast cancer patients with personalized anti-cancer therapeutics and predicted their potential treatment responses.

Our dataset comprised 204,026 breast cancer patients and incorporated 86 clinical, experimental, and drug-related features aggregated from multiple heterogeneous sources. The raw data required extensive cleaning, normalization, and integration to ensure quality and comparability across studies. This process included resolving missing values, unifying feature definitions, and harmonizing measurement scales to form a single, consistent analytical dataset.

In addition, we used machine learning and advanced data processing techniques to analyze this complex, multi-source dataset of 86 dimensions. This included dimensionality reduction methods (PCA, UMAP), feature extraction of drug characterization metrics (IC50 calculations, PK/PD measures), and a suite of machine learning algorithms (e.g., Gaussian Naive Bayes, nearest neighbors, feed-forward neural networks, and binary classifiers), evaluated with k-fold cross-validation, to build predictive models for personalized anti-cancer therapeutics and patient treatment response.
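As a rough illustration of the dimensionality reduction step, the sketch below runs PCA through NumPy's singular value decomposition. It is a minimal example, not our project code: the random matrix only stands in for a patients-by-features table with 86 columns, and `pca_project` is a hypothetical helper name.

```python
import numpy as np


def pca_project(X, n_components):
    """Project the rows of X onto the top principal components.

    Returns (projected_data, explained_variance_ratio).
    """
    # PCA assumes centered data, so subtract each column's mean.
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained = (S ** 2) / np.sum(S ** 2)
    projected = X_centered @ Vt[:n_components].T
    return projected, explained[:n_components]


# Synthetic stand-in data: 200 rows, 86 features (not the real dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 86))
Z, ratio = pca_project(X, n_components=2)
print(Z.shape, ratio)
```

On real data, features with very different scales would typically be standardized before this step so that no single column dominates the components.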

Natural Language Processing for Chatbot Analysis:
Predicting Winners and Prompt Difficulty Across 20 LLMs

Languages: Python
Libraries: Sklearn, Pandas, NumPy, matplotlib/seaborn

This project applied natural language processing (NLP) and machine learning techniques to evaluate and interpret the performance of 28 distinct chatbots and large language models (LLMs) across a diverse set of questions and responses.

My team's analysis began with exploratory data analysis (EDA) to investigate embedded response vectors, extracting meaningful patterns and relationships within the data. To enhance our feature space and interpretability, we used one-hot encoding for categorical variables, K-means clustering, Elo ratings, Principal Component Analysis (PCA), and statistical feature extraction. For performance assessment and robustness, we used several regression models (linear, logistic, lasso, and ridge), confusion matrices, k-fold cross-validation, and various visualization tools (histograms, kernel density estimation (KDE) plots, box-and-whisker plots, and waterfall diagrams).
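For readers unfamiliar with Elo ratings, here is a minimal sketch of the standard update rule applied to head-to-head chatbot comparisons. The K-factor of 32, the starting rating of 1000, and the model names are illustrative assumptions, not the values from our analysis.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(ratings, winner, loser, k=32, initial=1000):
    """Update `ratings` in place after `winner` beats `loser`."""
    r_w = ratings.get(winner, initial)
    r_l = ratings.get(loser, initial)
    # The winner gains in proportion to how unexpected the win was;
    # the loser gives up exactly the same number of points.
    gain = k * (1.0 - expected_score(r_w, r_l))
    ratings[winner] = r_w + gain
    ratings[loser] = r_l - gain
    return ratings


# Hypothetical battle log of (winner, loser) pairs.
battles = [("model_a", "model_b"), ("model_a", "model_c"), ("model_c", "model_b")]
ratings = {}
for winner, loser in battles:
    update_elo(ratings, winner, loser)
print(sorted(ratings.items(), key=lambda item: -item[1]))
```

Since every update moves the same number of points from loser to winner, the total rating across all models stays constant, which makes the final ordering easy to compare.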

Our results demonstrated that embedded NLP features combined with statistical models can effectively predict competitive chatbot outcomes and quantify question difficulty, offering a data-driven framework for evaluating the capabilities of conversational AI systems.

Molecular substructure search via graph-based modeling and algorithms

Languages: Python, C++, BASH
Libraries: NetworkX, matplotlib, itertools, Eigen/Eigen, complex, utility, iostream, cmath, vector and other C++ container libraries

For my final project in the CHEM274A course at UC Berkeley, I used Python to develop my own library and molecular fingerprinting algorithm, inspired by the RDKit library, that lets users perform substructure searches from the terminal with BASH commands. The program can also generate visual graph representations of user-chosen molecules, and it uses graph traversal algorithms to help the user identify other molecular properties such as functional groups, aromatic structures, and related features.

Included in the repository are additional files used to draft the main program:
- A Python file where I implement the depth-first search (DFS) algorithm to detect unique cycles in a ring, and to gain a better understanding of how libraries like RDKit and NetworkX use DFS/BFS to detect cycles in aromatic compounds.
- A python parser file (generously provided by UC Berkeley CHEM274A course instructors) used to parse .sdf files.
- A C++ program (.cpp, .hpp, and .tpp files) used to compare graph representations, traversals, and computations of molecules using the Eigen library in C++ vs. NetworkX library in Python
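The DFS cycle detection mentioned in the list above can be sketched as follows. This is a simplified illustration rather than the code in the repository: it treats a molecule as an undirected adjacency dictionary and reports one cycle per DFS back edge. The atom indices and the `find_cycles` name are hypothetical.

```python
def find_cycles(adjacency):
    """Return one cycle (as a list of nodes) per DFS back edge."""
    depth = {}    # node -> depth in the DFS tree (also marks visited nodes)
    parent = {}   # node -> parent in the DFS tree
    cycles = []

    def dfs(node, prev, d):
        depth[node] = d
        parent[node] = prev
        for neighbor in adjacency[node]:
            if neighbor == prev:
                continue  # do not walk straight back along the tree edge
            if neighbor not in depth:
                dfs(neighbor, node, d + 1)
            elif depth[neighbor] < depth[node]:
                # Back edge to an ancestor: climb the tree to close the ring.
                cycle = [node]
                while cycle[-1] != neighbor:
                    cycle.append(parent[cycle[-1]])
                cycles.append(cycle)

    for start in adjacency:
        if start not in depth:
            dfs(start, None, 0)
    return cycles


# Toluene-like skeleton: a six-membered ring (atoms 0-5) plus a substituent (atom 6).
toluene = {
    0: [1, 5, 6], 1: [0, 2], 2: [1, 3], 3: [2, 4],
    4: [3, 5], 5: [4, 0], 6: [0],
}
print(find_cycles(toluene))
```

For fused ring systems this returns the right number of independent cycles, but not necessarily the smallest rings, so a production tool would need an extra step to reduce them to a minimal ring set.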

Systematic evaluation and run time analysis of molecular simulations

Languages: Python, C++, Git
Libraries: NumPy, math, random, fstream, array, vector, utility, chrono

In collaboration with 2 other engineers, I modeled Lennard-Jones fluid systems, simulated particles in motion via Monte Carlo methods, and refactored scripts written with the Python Standard Library into NumPy and C++ for our final project in CHEM280 at UC Berkeley. We used BASH commands to compile and time our programs, which let us evaluate their space and run time complexity and assess the performance of the refactored scripts. The programs compute the energy of particles in motion and can be used to make accurate predictions of larger systems through molecular dynamics simulations.
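A stripped-down sketch of the kind of calculation involved is shown below, using only the Python Standard Library. It assumes reduced units (epsilon = sigma = 1), no periodic boundaries, and no distance cutoff, so it is far simpler than the course project; the function names and the starting positions are illustrative.

```python
import math
import random


def lj_pair_energy(r, epsilon=1.0, sigma=1.0):
    """Lennard-Jones energy of two particles separated by distance r."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 ** 2 - sr6)


def total_energy(positions):
    """Sum the pair energy over every unique pair of particles."""
    energy = 0.0
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            energy += lj_pair_energy(math.dist(positions[i], positions[j]))
    return energy


def metropolis_step(positions, beta, max_disp, rng):
    """Try to move one random particle; return True if the move is accepted."""
    i = rng.randrange(len(positions))
    old_position = positions[i]
    old_energy = total_energy(positions)
    positions[i] = tuple(c + rng.uniform(-max_disp, max_disp) for c in old_position)
    delta = total_energy(positions) - old_energy
    # Metropolis criterion: always accept downhill moves, and accept
    # uphill moves with probability exp(-beta * delta).
    if delta <= 0 or rng.random() < math.exp(-beta * delta):
        return True
    positions[i] = old_position  # rejected: restore the old position
    return False


rng = random.Random(0)
positions = [(0.0, 0.0, 0.0), (1.2, 0.0, 0.0), (0.0, 1.2, 0.0), (0.0, 0.0, 1.2)]
accepted = sum(metropolis_step(positions, beta=1.0, max_disp=0.1, rng=rng) for _ in range(200))
print(accepted, total_energy(positions))
```

Recomputing the full energy on every step costs O(N^2) per move, which is exactly the sort of bottleneck that motivates a refactor into NumPy or C++.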

BANKING SYSTEM APPLICATION

Languages: Python
Libraries: Python standard library

In collaboration with 3 other engineers, I built a simulated banking system that supports user account creation, withdrawals, deposits, money transfers, payments, and 2% cashback. The application also keeps a historical record of transactions and prevents users from performing invalid or dangerous operations (e.g., overdrafts and duplicate withdrawals). It has gone through a series of tests to ensure its usability by business owners who may want to implement a purchasing or bank-like system.
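The core ideas can be sketched in a few lines of standard-library Python. This is an illustrative miniature, not the team's implementation: the class and method names are invented, amounts are stored as integer cents to avoid floating-point rounding, and applying the 2% cashback to payments is my assumption about where it fits.

```python
class BankError(Exception):
    """Raised when an operation would leave the system in an invalid state."""


class Bank:
    CASHBACK_PERCENT = 2  # assumed to apply to payments only

    def __init__(self):
        self.balances = {}  # account id -> balance in cents
        self.history = []   # (operation, account id, amount in cents)

    def create_account(self, account_id):
        if account_id in self.balances:
            raise BankError(f"account {account_id!r} already exists")
        self.balances[account_id] = 0

    def deposit(self, account_id, amount):
        if amount <= 0:
            raise BankError("deposit must be positive")
        self.balances[account_id] += amount
        self.history.append(("deposit", account_id, amount))

    def withdraw(self, account_id, amount):
        # Rejecting amounts above the balance is what prevents overdrafts.
        if amount <= 0 or amount > self.balances[account_id]:
            raise BankError("invalid amount or insufficient funds")
        self.balances[account_id] -= amount
        self.history.append(("withdraw", account_id, amount))

    def transfer(self, source, target, amount):
        if target not in self.balances:
            raise BankError(f"unknown account {target!r}")
        self.withdraw(source, amount)
        self.deposit(target, amount)

    def pay(self, account_id, amount):
        """Charge a payment and credit cashback; return the cashback in cents."""
        self.withdraw(account_id, amount)
        cashback = amount * self.CASHBACK_PERCENT // 100
        if cashback:
            self.deposit(account_id, cashback)
        return cashback


bank = Bank()
bank.create_account("alice")
bank.create_account("bob")
bank.deposit("alice", 10_000)
bank.transfer("alice", "bob", 2_500)
cashback = bank.pay("alice", 5_000)
print(bank.balances, cashback)
```

Every state change goes through `deposit` or `withdraw`, so the transaction history stays complete without each higher-level operation having to remember to log itself.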

Interactive visualization tool for product workflows and data summaries

Languages: Python, HTML/CSS
Libraries: Plotly, NumPy, Pandas, webcolors, matplotlib

This repository contains an interactive data visualization tool that I built using Python and HTML. It was made specifically for employees and customers of Curia Global's Antibody Discovery team starting in August 2023. The purpose is to summarize data and visualize a common workflow used by pharmaceutical companies for antibody research and development. The tool can summarize roughly 6 months' worth of wet-lab data in one flow diagram and captures more than 5 different types of data (passage of time, genetic strains of animals, sequencing information of top drug candidates, screening results, drug characterization data, and more).

During this initiative to introduce new tools into the workplace, I worked with the research team and Curia's IT/cybersecurity department to ensure compliance in the handling of all company and client data, and I taught Bash/Python basics to advance the skills of older colleagues and scientists.

Most importantly, the interactive visualization tool can be used to track the origins of the company's products and serves as a deliverable to clients.

Coursework and Mini-projects:

TECHNICAL COURSEWORK

University of California, Berkeley

CHEM277B: Machine Learning Algorithms (Neural nets, ANNs, CNNs, RNNs, LSTMs, and GNNs)

Image processing using CNNs and an encoder-decoder for anomaly detection

CHEM274B: Software engineering fundamentals for Molecular Sciences

Ligand-based virtual screening using machine learning and algorithms

DATA200S: Principles and Techniques of Data Science

Housing price prediction

Spam or Ham: Classifying emails as spam or not spam.

CHEM274A: Programming Languages for Molecular Sciences: Python and C++
PBHLTH142: Probability and Statistics in Biology and Public Health using R programming
MATH10A/B: Methods of Mathematics - Calculus, Statistics, and Combinatorics

College of San Mateo

CIS501: Data Structures and Algorithms
CIS364: From Data Warehousing to Big Data (SQL Databases)
CIS133: NoSQL Databases
CIS124: Foundations of Data Science (DATA8 equivalent at UC Berkeley)
CIS117: Python programming
MATH270: Linear Algebra (MATH56 equivalent at UC Berkeley)

OTHER COURSEWORK

University of California, Berkeley

INDENG185: Entrepreneurship and Technology for Alternative Meats (Challenge lab)
UGBA196: Special Topics in Business Administration
INTEGBIC32: Bioinspired Design (Intersections of Engineering, Biology, Business, and Architecture)
COLWRIT10A: Introduction to Public Speaking

Hobbies:

Crabbing
Volunteering (Most recently at the Mission Hospice center in San Mateo)
Running / Hiking / Paddle-boarding / Kayaking
Reading (Currently reading "The Genesis Machine" by Amy Webb)
And more! -->