ABOUT ME:
Joyce Yu
Master of Science
Data Science and Software Engineering
University of California, Berkeley
joyceyu579@berkeley.edu
Goals and Aspirations:
"Looking to explore our world in the most efficient ways possible..."
Hi there, my name is Joyce! I am a graduate student at the University of California, Berkeley, studying data science and software engineering. I have a strong interest in building and integrating machine learning tools into my life's work, whether in the tech industry, the biopharmaceutical space, or other business sectors. I want to explore it all! In particular, I want to learn more about our world through the data that describes it, and to find where I can make a positive impact with the skills I've learned along the way. If you would like to connect, please don't hesitate to reach me at joyceyu579@berkeley.edu.
Projects:
Graduate Student Internship at Exact Sciences - Summer 2025
Role: NGS Application Developer
Team: P.O.B.O.T. (Precision Oncology Bioinformatics Operations Team)
Languages: Python, Bash
Infrastructure & DevOps: AWS (EC2, S3, CLI), HPC clusters, CI/CD, GitLab, Pixi
Workflow & HPC Scheduling: Nextflow (DSL2/Groovy), SLURM job scripts
Bioinformatic tools: IGV, BAMSurgeon, seqtk, bcftools, samtools, wgsim, pysam, BioPython, numpy, re, pandas, matplotlib
Project Management: Jira, Confluence, Agile and Waterfall methodologies
On-going. Completion by August 29.
As the NGS Application Developer on the POBOT team, I am developing a suite of CLI tools for genomic data simulation to support the development of variant calling algorithms, diagnostic studies, and the maintenance of bioinformatics pipelines for the OncoExTra product.
My role involves designing and implementing robust, scalable workflows using Nextflow and Groovy, integrating bioinformatics tools and libraries, and optimizing workflow performance on high-performance computing (HPC) clusters, ensuring efficient resource utilization and job scheduling using SLURM.
Additionally, I am implementing CI/CD processes for the software tools I am building and contributing to the development of automated testing frameworks to ensure the reliability and accuracy of bioinformatics analyses.
My work supports the development of precision oncology solutions, enabling personalized treatment strategies for cancer patients.
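To give a flavor of what genomic data simulation for variant-caller testing can look like, here is a minimal, standard-library-only sketch. It is not the internal tooling described above and does not use BAMSurgeon or any other listed tool; the function name `spike_in_snvs`, the toy reference sequence, and the seed are all assumptions made for illustration. The idea is to plant single-nucleotide variants at known positions so that a caller's output can later be compared against a truth set.

```python
import random

BASES = "ACGT"


def spike_in_snvs(reference, n_variants, seed=0):
    """Return (mutated_sequence, truth) with n_variants SNVs planted at random sites.

    truth is a list of (position, ref_base, alt_base) tuples; positions are
    0-based here (VCF files use 1-based positions).
    """
    rng = random.Random(seed)  # seeded so the simulation is reproducible
    positions = sorted(rng.sample(range(len(reference)), n_variants))
    sequence = list(reference)
    truth = []
    for pos in positions:
        ref_base = sequence[pos]
        # Pick any base other than the reference base at this site.
        alt_base = rng.choice([b for b in BASES if b != ref_base])
        sequence[pos] = alt_base
        truth.append((pos, ref_base, alt_base))
    return "".join(sequence), truth


# Hypothetical toy reference; real inputs would come from a FASTA file.
reference = "ACGTACGTACGTACGTACGT"
mutated, truth = spike_in_snvs(reference, n_variants=3, seed=42)
for pos, ref_base, alt_base in truth:
    print(f"pos={pos} {ref_base}>{alt_base}")
```

Because the planted variants are recorded, the truth list can serve as the expected answer when checking a variant caller's sensitivity.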
Machine Learning for Precision Oncology:
Predicting Personalized Anti-Cancer Treatments for 200,000+ Breast Cancer Patients
Languages: Python
Libraries: Keras, Scipy, PyTorch, RDKit, NumPy, Pandas, matplotlib/seaborn
In this project, my team and I built a machine learning pipeline to match breast cancer patients with
personalized anti-cancer therapeutics and predicted their potential treatment responses.
Our dataset comprised 204,026 breast cancer patients and incorporated 86 clinical, experimental, and drug-related features aggregated
from multiple heterogeneous sources. The raw data required extensive cleaning, normalization, and integration to ensure quality and comparability across studies.
This process included resolving missing values, unifying feature definitions, and harmonizing measurement scales to form a single, consistent analytical dataset.
We also applied machine learning and advanced data processing techniques to analyze this complex, multi-source, 86-dimensional dataset.
This included dimensionality reduction methods (PCA, UMAP), feature extraction of drug
characterization metrics (IC50 calculations, PK/PD measures), and a suite of machine learning algorithms
(Gaussian Naive Bayes, Nearest Neighbors, feed-forward neural networks, and binary classifiers), evaluated with k-fold cross-validation,
to build predictive models for personalized anti-cancer therapeutics and patient treatment response.
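As an illustration of two of the techniques named above, the following sketch runs k-fold cross-validation around a 1-nearest-neighbor classifier. It uses only the Python standard library so that it is self-contained; it is not the project's pipeline, and the toy data, function names, and the choice of k=5 are assumptions.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def nearest_neighbor_predict(train_X, train_y, x):
    """Return the label of the training point closest to x (squared Euclidean distance)."""
    best = min(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    return train_y[best]


def cross_validate(X, y, k=5, seed=0):
    """Return the accuracy obtained on each of the k held-out folds."""
    accuracies = []
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(X)) if i not in held]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        correct = sum(
            nearest_neighbor_predict(train_X, train_y, X[i]) == y[i]
            for i in held_out
        )
        accuracies.append(correct / len(held_out))
    return accuracies


# Made-up toy data: two well-separated clusters standing in for two response classes.
X = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (0.3, 0.2), (0.2, 0.2),
     (5.0, 5.1), (5.2, 4.9), (4.9, 5.3), (5.3, 5.2), (5.1, 5.0)]
y = [0] * 5 + [1] * 5
scores = cross_validate(X, y, k=5)
print(sum(scores) / len(scores))
```

Each sample is held out exactly once, so the mean of the fold accuracies estimates how the classifier would perform on unseen data.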
Natural Language Processing for Chatbot Analysis:
Predicting Winners and Prompt Difficulty Across 20 LLMs
Languages: Python
Libraries: Sklearn, Pandas, NumPy, matplotlib/seaborn
This project applied natural language processing (NLP) and machine learning techniques to evaluate and interpret the performance of
28 distinct chatbots and large language models (LLMs) across a diverse set of questions and responses.
My team's analysis began with exploratory data analysis (EDA) to investigate embedded response vectors, extracting meaningful patterns and relationships within the data.
To enhance our feature space and interpretability, we employed one-hot encoding for categorical variables, K-means clustering, Elo ratings,
Principal Component Analysis (PCA), and statistical feature extraction. For performance assessment and robustness, we used several regression models (linear, logistic, lasso, and ridge),
confusion matrices, k-fold cross-validation, and various visualization tools (histograms, kernel density estimation (KDE) plots, box-and-whisker plots, and waterfall diagrams).
Our results demonstrated that embedded NLP features combined with statistical models can effectively predict competitive chatbot outcomes and quantify question difficulty, offering a data-driven framework for evaluating the capabilities of conversational AI systems.
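The Elo ratings mentioned above can be computed with the standard Elo update rule, sketched below. The K-factor of 32, the starting rating of 1000, and the model names are assumptions chosen for illustration; they are not taken from the project.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a, rating_b, score_a, k=32):
    """Return the updated (rating_a, rating_b) after one head-to-head comparison.

    score_a is 1.0 if A won, 0.0 if A lost, and 0.5 for a tie.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


def rate_battles(battles, initial=1000.0, k=32):
    """Process (model_a, model_b, score_a) records in order and return the final ratings."""
    ratings = {}
    for model_a, model_b, score_a in battles:
        ra = ratings.setdefault(model_a, initial)
        rb = ratings.setdefault(model_b, initial)
        ratings[model_a], ratings[model_b] = update_elo(ra, rb, score_a, k)
    return ratings


# Hypothetical head-to-head outcomes between three placeholder models.
battles = [
    ("model_x", "model_y", 1.0),  # model_x wins
    ("model_y", "model_z", 0.5),  # tie
    ("model_x", "model_z", 1.0),  # model_x wins
]
ratings = rate_battles(battles)
for name, rating in sorted(ratings.items(), key=lambda item: -item[1]):
    print(f"{name}: {rating:.1f}")
```

Since every update adds to one model exactly what it subtracts from the other, the total rating across all models stays constant, which makes the ratings directly comparable as features.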
Molecular substructure search via graph-based modeling and algorithms
Languages: Python, C++, BASH
Libraries: NetworkX, matplotlib, itertools, Eigen/Eigen, complex, utility, iostream, cmath, vector and other C++ container libraries
Inspired by the RDKit library, I used Python to develop my own library and molecular fingerprinting algorithm that enables users to perform substructure searches
via the terminal and Bash commands for my final project in the CHEM274A course at UC Berkeley. The program can also generate graph visualizations of user-chosen molecules.
It also helps the user identify other molecular properties, such as functional groups, aromatic structures, and other features, using graph traversal algorithms.
Included in the repository are additional files used to draft the main program:
- A Python file where I implement the depth-first search (DFS) algorithm to detect unique cycles (rings), and to gain a better understanding of how libraries like RDKit and NetworkX use DFS/BFS to detect cycles in aromatic compounds.
- A Python parser file (generously provided by UC Berkeley CHEM274A course instructors) used to parse .sdf files.
- A C++ program (.cpp, .hpp, and .tpp files) used to compare graph representations, traversals, and computations of molecules using the Eigen library in C++ vs. the NetworkX library in Python.
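The DFS-based ring detection described in the list above can be sketched as follows. This is a simplified illustration, not the code in the repository: the adjacency-dictionary format, the function name `find_rings`, and the toluene skeleton are assumptions. Each ring is rooted at its lowest-numbered atom and stored as a set of bonds, so the two traversal directions of the same ring collapse into one entry.

```python
def find_rings(adjacency):
    """Enumerate the unique simple cycles of an undirected molecular graph with DFS.

    adjacency maps an atom index to a list of bonded atom indices. Each ring is
    returned as a frozenset of bonds, and each bond is a frozenset of two atoms.
    """
    rings = set()

    def dfs(start, current, path, visited):
        for neighbor in adjacency[current]:
            if neighbor == start and len(path) >= 3:
                # Closed a cycle: record its bonds, including the closing bond.
                bonds = zip(path, path[1:] + [start])
                rings.add(frozenset(frozenset(bond) for bond in bonds))
            elif neighbor not in visited and neighbor > start:
                # Only walk through atoms numbered above the root, so every
                # ring is discovered from its lowest-numbered atom only.
                visited.add(neighbor)
                dfs(start, neighbor, path + [neighbor], visited)
                visited.remove(neighbor)

    for atom in sorted(adjacency):
        dfs(atom, atom, [atom], {atom})
    return rings


# Toluene heavy-atom skeleton: a six-membered ring (atoms 0-5) plus a methyl carbon (atom 6).
toluene = {
    0: [1, 5, 6], 1: [0, 2], 2: [1, 3], 3: [2, 4],
    4: [3, 5], 5: [4, 0], 6: [0],
}
rings = find_rings(toluene)
for ring in rings:
    print(sorted(set().union(*ring)))
```

Note that this enumerates every simple cycle, so a fused system such as naphthalene yields three cycles (two six-membered rings and the ten-membered outer envelope) rather than a minimal ring set.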
Systematic evaluation and run time analysis of molecular simulations
Languages: Python, C++, Git
Libraries: NumPy, math, random, fstream, array, vector, utility, chrono
In collaboration with two other engineers, I modeled Lennard-Jones fluid systems, simulated particles in motion via Monte Carlo methods, and refactored scripts written with the Python Standard Library into NumPy and C++ for our final project in CHEM280 at UC Berkeley. To evaluate the space and run-time complexity of our programs and to assess the performance of the refactored scripts, we used Bash commands to compile and time them. The programs compute the energy of particles in motion and can be used to make accurate predictions of larger systems through molecular dynamics simulations.
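For readers unfamiliar with the method, here is a minimal standard-library sketch of a Metropolis Monte Carlo simulation with the Lennard-Jones potential, in the spirit of the pure-Python starting point described above. It is not the course project code: reduced units (epsilon = sigma = 1), the absence of periodic boundaries and cutoffs, and all function names and parameter values are simplifying assumptions.

```python
import math
import random


def lj_pair_energy(r, epsilon=1.0, sigma=1.0):
    """Lennard-Jones energy of two particles separated by distance r."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def total_energy(positions):
    """Sum the pair energy over every unique pair (no cutoff, no periodic box)."""
    energy = 0.0
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            energy += lj_pair_energy(math.dist(positions[i], positions[j]))
    return energy


def metropolis_accept(delta_e, beta, rng):
    """Accept downhill moves always and uphill moves with probability exp(-beta * dE)."""
    return delta_e <= 0.0 or rng.random() < math.exp(-beta * delta_e)


def monte_carlo(positions, n_steps, beta=1.0, max_disp=0.1, seed=0):
    """Run n_steps single-particle trial moves and return (positions, energy)."""
    rng = random.Random(seed)
    positions = [list(p) for p in positions]
    energy = total_energy(positions)
    for _ in range(n_steps):
        i = rng.randrange(len(positions))
        old = positions[i][:]
        positions[i] = [x + rng.uniform(-max_disp, max_disp) for x in old]
        new_energy = total_energy(positions)
        if metropolis_accept(new_energy - energy, beta, rng):
            energy = new_energy
        else:
            positions[i] = old  # rejected: restore the previous coordinates
    return positions, energy


# Four particles in reduced units; the starting coordinates are arbitrary.
start = [(0.0, 0.0, 0.0), (1.2, 0.0, 0.0), (0.0, 1.2, 0.0), (0.0, 0.0, 1.2)]
final_positions, final_energy = monte_carlo(start, n_steps=200)
print(final_energy)
```

Recomputing the full energy after every trial move costs O(N^2) per step, which is the kind of hot spot that a NumPy or C++ refactor typically targets.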
BANKING SYSTEM APPLICATION
Languages: Python
Libraries: Python standard library
In collaboration with three other engineers, I built a simulated banking system that supports account creation, withdrawals, deposits, money transfers, payments, and 2% cashback. The application also keeps a historical record of transactions and prevents users from executing invalid or dangerous operations (e.g., overdrafts and duplicate withdrawals). It has gone through a series of tests to ensure its usability by business owners who may want to implement a purchasing or bank-like system.
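A stripped-down sketch of how such a system might be organized is shown below. The class and method names, the use of integer cents, and the choice to credit cashback immediately at payment time are assumptions for illustration and may differ from the actual application.

```python
class BankError(Exception):
    """Raised when an operation would leave the bank in an invalid state."""


class Bank:
    CASHBACK_PERCENT = 2  # 2% cashback on payments

    def __init__(self):
        self.balances = {}  # account id -> balance in integer cents
        self.history = []   # chronological (operation, account id, cents) records

    def _check(self, account_id, cents):
        if account_id not in self.balances:
            raise BankError(f"unknown account {account_id!r}")
        if not isinstance(cents, int) or cents <= 0:
            raise BankError("amount must be a positive whole number of cents")

    def create_account(self, account_id):
        if account_id in self.balances:
            raise BankError(f"account {account_id!r} already exists")
        self.balances[account_id] = 0
        self.history.append(("create", account_id, 0))

    def deposit(self, account_id, cents):
        self._check(account_id, cents)
        self.balances[account_id] += cents
        self.history.append(("deposit", account_id, cents))

    def withdraw(self, account_id, cents):
        self._check(account_id, cents)
        if cents > self.balances[account_id]:
            raise BankError("overdraft rejected")
        self.balances[account_id] -= cents
        self.history.append(("withdraw", account_id, cents))

    def transfer(self, source, target, cents):
        self._check(target, cents)    # validate the target before moving money
        self.withdraw(source, cents)  # raises on overdraft, leaving balances untouched
        self.balances[target] += cents
        self.history.append(("transfer_in", target, cents))

    def pay(self, account_id, cents):
        """Withdraw a payment and credit the cashback (rounded down to whole cents)."""
        self.withdraw(account_id, cents)
        cashback = cents * self.CASHBACK_PERCENT // 100
        self.balances[account_id] += cashback
        self.history.append(("cashback", account_id, cashback))
        return cashback


bank = Bank()
bank.create_account("alice")
bank.create_account("bob")
bank.deposit("alice", 10_000)
bank.transfer("alice", "bob", 2_500)
cashback = bank.pay("alice", 5_000)
print(bank.balances, cashback)
```

Storing balances as integer cents avoids floating-point rounding errors, and validating every operation before mutating state keeps a failed request from leaving a partial transaction in the history.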
Interactive visualization tool for product workflows and data summaries
Languages: Python, HTML/CSS
Libraries: Plotly, NumPy, Pandas, webcolors, matplotlib
This repository contains an interactive data visualization tool that I built using Python and HTML. It was made specifically for employees and customers of Curia Global's Antibody Discovery team starting in August 2023.
The purpose is to summarize data and visualize a common workflow used by pharmaceutical companies for antibody research and development.
The interactive visualization tool can summarize ~6 months' worth of wet-lab data in one flow diagram and captures more than 5 different types of data (passage of time, genetic strains of animals, sequencing information of top drug candidates, screening results, drug characterization data, and more).
During this initiative to introduce new tools into the workplace,
I worked with the research team and Curia's IT/cybersecurity department to ensure compliance for all company and client data,
and taught Bash/Python basics to advance the skills of older colleagues/scientists.
Most importantly, the interactive visualization tool can be used to track the origins of the company's products and serves as a deliverable to clients.
Coursework and Mini-projects:
TECHNICAL COURSEWORK
University of California, Berkeley
CHEM277B: Machine Learning Algorithms (Neural nets, ANNs, CNNs, RNNs, LSTMs, and GNNs)
Image processing using CNNs and an encoder-decoder for anomaly detection
CHEM274B: Software engineering fundamentals for Molecular Sciences
Ligand-based virtual screening using machine learning and algorithms
DATA200S: Principles and Techniques of Data Science
Housing price prediction
Spam or Ham: Classifying emails as spam or not spam.
CHEM274A: Programming Languages for Molecular Sciences: Python and C++
PBHLTH142: Probability and Statistics in Biology and Public Health using R programming
MATH10A/B: Methods of Mathematics - Calculus, Statistics, and Combinatorics
College of San Mateo
CIS501: Data Structures and Algorithms
CIS364: From Data Warehousing to Big Data (SQL Databases)
CIS133: NoSQL Databases
CIS124: Foundations of Data science (DATA8 equivalent at UC Berkeley)
CIS117: Python programming
MATH270: Linear Algebra (MATH56 equivalent at UC Berkeley)
OTHER COURSEWORK
University of California, Berkeley
INDENG185: Entrepreneurship and Technology for Alternative Meats (Challenge lab)
UGBA196: Special Topics in Business Administration
INTEGBIC32: Bioinspired Design (Intersections of Engineering, Biology, Business, and Architecture)
COLWRIT10A: Introduction to Public Speaking
Hobbies:
Crabbing
Volunteering (Most recently at the Mission Hospice center in San Mateo)
Running / Hiking / Paddle-boarding / Kayaking
Reading (Currently reading "The Genesis Machine" by Amy Webb)
And more!