
Hello! I am an A.B. Computer Science student at Princeton University, graduating in 2028. I am currently a researcher in Prof. Zhuang Liu's group and the Princeton Computer Vision Lab (advised by Prof. Jia Deng). I have had the pleasure of working with Sachin Konan, Supriyo Chakraborty, and Abhishek Joshi.

My research focuses on Machine Learning, Natural Language Processing, and Reinforcement Learning. I am particularly interested in LLM fine-tuning, mechanistic interpretability, and procedural generation for simulation.

I expect to graduate from Princeton in May 2028. Outside of research, I play classical cello in the Princeton University Orchestra, enjoy running (I completed the Jersey City Marathon in 2025!), and like experimenting with RL environments (PufferLib, the NetHack environment).

Thanks for visiting!

jonathanliu [at] princeton [dot] edu

Updates

[06/2026] Will be joining Abridge AI as an Incoming Research Scientist (PhD Role).
[08/2025] Started as a Research Assistant in Prof. Zhuang Liu's group.
[06/2025] Began a Machine Learning Research Internship at BBN Technologies.
[05/2025] First-author paper presented at the 2025 NeurIPS workshop GenAI4Health.
[08/2024] Co-authored Infinigen-Sim, accepted to the CoRL LSRL workshop.

Papers

(* indicates equal contribution)

Continuous Diffusion Transformers for Designing Synthetic Regulatory Elements
Jonathan Liu*, Kia Ghods.
Generative AI in Genomics (Gen²) Workshop at ICLR 2026.
We present a parameter-efficient Diffusion Transformer (DiT) for generating 200 bp cell-type-specific regulatory DNA sequences. By replacing the U-Net backbone of DNA-Diffusion with a transformer denoiser equipped with a 2D CNN input encoder, our model matches the U-Net's best validation loss in 13 epochs (60× fewer) and converges to a 39% lower loss, while reducing memorization (the fraction of generated sequences aligning to training data via BLAT) from 5.3% to 1.7%. Ablations show the CNN encoder is essential: without it, validation loss increases by 70% regardless of positional embedding choice. We further apply DDPO fine-tuning using Enformer as a reward model, achieving a 38× improvement in predicted regulatory activity. Cross-validation against DRAKES on an independent prediction task confirms that the improvements reflect genuine regulatory signal rather than reward-model overfitting.
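As a rough illustration of the continuous diffusion setup the abstract describes, the sketch below noises a one-hot DNA sequence with the standard DDPM forward process that a denoiser such as a DiT would be trained to invert. The sequence length (200 bp) comes from the abstract; the linear beta schedule, timestep count, and all numeric values are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

SEQ_LEN, ALPHABET = 200, 4  # 200 bp sequences over {A, C, G, T}

def make_schedule(T=1000, beta_min=1e-4, beta_max=0.02):
    """Linear beta schedule and cumulative alpha products (standard DDPM)."""
    betas = np.linspace(beta_min, beta_max, T)
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alpha_bars, rng):
    """Forward noising: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps."""
    a = alpha_bars[t]
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps, eps

rng = np.random.default_rng(0)
# One-hot encode a random 200 bp sequence: shape (200, 4).
x0 = np.eye(ALPHABET)[rng.integers(0, ALPHABET, SEQ_LEN)]
alpha_bars = make_schedule()
x_t, eps = q_sample(x0, t=500, alpha_bars=alpha_bars, rng=rng)
```

The denoiser's training target is to predict `eps` from `(x_t, t)`; the paper's contribution concerns what that denoiser is (a transformer with a 2D CNN input encoder rather than a U-Net), which this sketch does not attempt to reproduce.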
Demo: Statistically Significant Results on Biases and Errors of LLMs Do Not Guarantee Generalizable Results
Jonathan Liu*, Haoling Qiu, Jonathan Lasko, Damianos Karakos, Mahsa Yarmohammadi, Mark Dredze.
GenAI4Health Workshop at NeurIPS 2025 (Poster).
Recent research has shown that hallucinations, omissions, and biases are prevalent in everyday use cases of LLMs. However, chatbots used in medical contexts must provide consistent advice even when non-medical factors, such as demographic information, are present. To understand the conditions under which medical chatbots fail to perform as expected, we develop an infrastructure that 1) automatically generates queries to probe LLMs and 2) evaluates answers to these queries using multiple LLM-as-a-judge setups and prompts. For 1), our prompt-creation pipeline samples the space of patient demographics, histories, disorders, and writing styles to create realistic questions that we subsequently use to prompt LLMs. In 2), our evaluation pipeline provides hallucination and omission detection using LLM-as-a-judge as well as agentic workflows, in addition to LLM-as-a-judge treatment-category detectors. As a baseline study, we perform two case studies on inter-LLM agreement and the impact of varying the answering and evaluation LLMs. We find that LLM annotators exhibit low agreement scores (average Cohen's kappa = 0.118), and only specific (answering, evaluation) LLM pairs yield statistically significant differences across writing styles, genders, and races. We recommend that studies using LLM evaluation employ multiple LLMs as evaluators to avoid arriving at statistically significant but non-generalizable results, particularly in the absence of ground-truth data. We also suggest publishing inter-LLM agreement metrics for transparency. Our code and dataset are available here: https://github.com/BBN-E/medic-neurips-2025-demo.
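The agreement statistic reported above (average Cohen's kappa = 0.118 across LLM judges) can be computed directly from its definition. The re-implementation below is a minimal sketch; the two "judges" and their labels are invented for illustration and are not the paper's data.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa = (p_o - p_e) / (1 - p_e): observed agreement
    corrected for the agreement expected by chance."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items the two raters label identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: dot product of the raters' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Two hypothetical LLM judges labelling six answers:
judge_1 = ["ok", "ok", "halluc", "halluc", "ok", "halluc"]
judge_2 = ["ok", "halluc", "halluc", "halluc", "ok", "ok"]
kappa = cohens_kappa(judge_1, judge_2)  # 4/6 observed vs. 0.5 by chance
```

A kappa near 0.118 means the judges agree barely more often than random labelling with the same marginals would, which is why the paper cautions against conclusions drawn from a single evaluator.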
Infinigen-Sim: Procedural Generation of Articulated Simulation Assets
Abhishek Joshi*, Beining Han, Jack Nugent, Yiming Zuo, Jonathan Liu, Hongyu Wen, Stamatis Alexandropoulos, Tao Sun, Alexander Raistrick, Gaowen Liu, Yi Shao, Jia Deng.
CoRL LSRL Workshop (Poster).
We introduce Infinigen-Sim, a toolkit for procedurally generating realistic articulated assets for robotics simulation. We include procedural generators for 12 common articulated object categories along with high-level utilities for creating custom articulated assets in Blender. We also provide an export pipeline to integrate the resulting assets, along with their physical properties, into common robotics simulators. Experiments show that assets sampled from these generators are useful for movable-object segmentation, training generalizable reinforcement learning policies, and sim-to-real transfer of imitation learning policies.
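The core idea above, sample an object's parameters, then export an articulated description a simulator can load, can be sketched in a few lines. Everything here is a toy assumption for illustration: the single-door "cabinet", the parameter ranges, and the bare-bones URDF output. Infinigen-Sim's actual generators run in Blender, cover 12 categories, and export full meshes with physical properties.

```python
import random
import xml.etree.ElementTree as ET

def sample_cabinet(rng):
    """Sample hypothetical cabinet parameters (metres / radians)."""
    return {
        "width": rng.uniform(0.4, 1.0),
        "height": rng.uniform(0.6, 1.2),
        "depth": rng.uniform(0.3, 0.6),
        "max_open": rng.uniform(1.2, 2.4),  # door joint upper limit
    }

def to_urdf(p):
    """Emit a minimal URDF: two box links joined by a revolute hinge."""
    robot = ET.Element("robot", name="cabinet")
    for name in ("body", "door"):
        link = ET.SubElement(robot, "link", name=name)
        geom = ET.SubElement(ET.SubElement(link, "visual"), "geometry")
        ET.SubElement(geom, "box",
                      size=f'{p["depth"]:.3f} {p["width"]:.3f} {p["height"]:.3f}')
    joint = ET.SubElement(robot, "joint", name="door_hinge", type="revolute")
    ET.SubElement(joint, "parent", link="body")
    ET.SubElement(joint, "child", link="door")
    ET.SubElement(joint, "limit", lower="0", upper=f'{p["max_open"]:.3f}',
                  effort="10", velocity="1")
    return ET.tostring(robot, encoding="unicode")

rng = random.Random(42)
urdf = to_urdf(sample_cabinet(rng))  # each seed yields a distinct asset
```

Because every sampled parameter set yields a structurally valid asset, a generator like this can produce unlimited articulation variations for training, which is what makes the procedural approach useful for the segmentation and policy-learning experiments described above.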

Experience

Abridge AI: Incoming Research Scientist (PhD Role), Summer 2026
Prof. Zhuang Liu Group: Research Assistant, Aug 2025 - Present
BBN Technologies: Machine Learning Research Intern, Jun 2025 - Aug 2025
Princeton Computer Vision Lab: Research Assistant, Aug 2024 - Present

Projects

Sight Support (assistive tech app): Spring 2020 - Winter 2025
RL for the Andrews-Curtis Conjecture: Spring 2025 - Present
Discovering Transformer Circuits with Edge Pruning: Spring 2025

Relevant Coursework

Princeton University
COS 484: Natural Language Processing (Spring 2025)
COS 485: Neural Networks: Theory and Application (Spring 2025)
COS 597R: Advanced Topics in Computer Science (Probabilistic Topics in RL) (Fall 2025)
COS 585: Information Theory and Applications (Fall 2025)
COS 568: Systems and Machine Learning (Spring 2026)
COS 417: Operating Systems (Spring 2026)
COS 598B: Advanced Topics in Computer Science (Formal Methods) (Spring 2026)
ECE 476: Parallel Computing: Principles, Systems (Spring 2026)