Shahriar Golchin

📍 San Francisco, CA

I am an AI Researcher at Labelbox, working on model evaluation and AI safety.

I earned my PhD in Computer Science – Artificial Intelligence from the University of Arizona, advised by Prof. Mihai Surdeanu. My research interests center on Large Language Models (LLMs), with a focus on designing datasets and tasks that adversarially stress-test alignment. I am particularly interested in surfacing systematic misalignment and reasoning failure modes across frontier AI models.

My PhD dissertation is the first to systematically identify data contamination (data leakage) in LLMs, i.e., scenarios in which training data overlaps with evaluation data. I developed several methods to detect and estimate contamination in fully black-box LLMs. My PhD research received media coverage and earned several awards, including the Outstanding Graduate Scholarship and the Galileo Circle Scholarship.

Previously, I was a research intern at Google Cloud AI Research, Walmart Global Tech, and Harvard Medical School.

Publications

Agentic Query Reformulation for Contextualized Hyper-Personalized Product Search
Raghav Gaggar*, Sean Rosario*, Daniel Varivoda*, Shahriar Golchin*, Jayant Sachdev, Ali Lafzi, Siddharth Singh, Jason Cho, Yog Domlur, Swati Kirti, Chittaranjan Tripathy
(* indicates co-first authors)
SIGIR 2026 | Paper
Abstract: Traditional e-commerce search often suffers from high abandonment rates due to an intent gap where users provide broad or vague queries. We propose a novel framework using Agentic AI to bridge this gap through dynamic query reformulation. By leveraging the ReAct agentic framework, our system performs contextual reasoning by analyzing historical purchase data to transform generic queries into hyper-personalized search queries. Unlike personalization methods that require architectural overhauls, our solution operates on top of existing search systems, ensuring seamless integration with any search system without costly development or downtime. To ensure low-latency performance, we utilize offline pre-computation of top queries and real-time cosine similarity matching. Results from our experiments demonstrate that this agentic approach can significantly improve search ranking leading to improved customer engagement.

Intent Laundering: AI Safety Datasets Are Not What They Seem
Shahriar Golchin, Marc Wetter
arXiv 2026 | Paper | Blog Post | Media Coverage | Medium Post
Abstract: We systematically evaluate the quality of widely used adversarial safety datasets from two perspectives: in isolation and in practice. In isolation, we examine how well these datasets reflect real-world adversarial attacks based on three defining properties: being driven by ulterior intent, well-crafted, and out-of-distribution. We find that these datasets overrely on "triggering cues": words or phrases with overt negative/sensitive connotations that are intended to trigger safety mechanisms explicitly, which is unrealistic compared to real-world attacks. In practice, we evaluate whether these datasets genuinely measure safety risks or merely provoke refusals through triggering cues. To explore this, we introduce "intent laundering": a procedure that abstracts away triggering cues from adversarial attacks (data points) while strictly preserving their malicious intent and all relevant details. Our results show that current adversarial safety datasets fail to faithfully represent real-world adversarial behavior due to their overreliance on triggering cues. Once these cues are removed, all previously evaluated "reasonably safe" models become unsafe, including Gemini 3 Pro and Claude Sonnet 3.7/4. Moreover, when intent laundering is adapted as a jailbreaking technique, it consistently achieves high attack success rates, ranging from 90.00% to 100.00%, under fully black-box access. Overall, our findings expose a significant disconnect between how existing datasets evaluate model safety and how real-world adversaries behave.

Towards Compute-Optimal Many-Shot In-Context Learning
Shahriar Golchin, Yanfei Chen, Rujun Han, Manan Gandhi, Tianli Yu, Swaroop Mishra, Mihai Surdeanu, Rishabh Agarwal, Chen-Yu Lee, Tomas Pfister
COLM 2025 | Paper | Poster
Abstract: Long-context large language models (LLMs) are able to process inputs containing up to several million tokens. In the scope of in-context learning (ICL), this translates into using hundreds/thousands of demonstrations in the input prompt, enabling many-shot ICL. In practice, a fixed set of demonstrations is often selected at random in many-shot settings due to (1) high inference costs, (2) the benefits of caching and reusing computations, and (3) the similar performance offered by this strategy compared to others when scaled. In this work, we propose two straightforward strategies for demonstration selection in many-shot ICL that improve performance with minimal computational overhead. Our first method combines a small number of demonstrations, selected based on their similarity to each test sample, with a disproportionately larger set of random demonstrations that are cached. The second strategy improves the first by replacing random demonstrations with those selected using centroids derived from test sample representations via k-means clustering. Our experiments with Gemini Pro and Flash across several datasets indicate that our strategies consistently outperform random selection and surpass or match the most performant selection approach while supporting caching and reducing inference cost by up to an order of magnitude. We also show that adjusting the proportion of demonstrations selected based on different criteria can balance performance and inference cost in many-shot ICL.

Memorization in In-Context Learning
Shahriar Golchin, Mihai Surdeanu, Steven Bethard, Eduardo Blanco, Ellen Riloff
arXiv 2025 | Paper
Abstract: In-context learning (ICL) has proven to be an effective strategy for improving the performance of large language models (LLMs) with no additional training. However, the exact mechanism behind this performance improvement remains unclear. This study is the first to show how ICL surfaces memorized training data and to explore the correlation between this memorization and performance on downstream tasks across various ICL regimes: zero-shot, few-shot, and many-shot. Our most notable findings include: (1) ICL significantly surfaces memorization compared to zero-shot learning in most cases; (2) demonstrations, without their labels, are the most effective element in surfacing memorization; (3) ICL improves performance when the surfaced memorization in few-shot regimes reaches a high level (about 40%); and (4) there is a very strong correlation between performance and memorization in ICL when it outperforms zero-shot learning. Overall, our study uncovers memorization as a new factor impacting ICL, raising an important question: to what extent do LLMs truly generalize from demonstrations in ICL, and how much of their success is due to memorization?

Data Contamination Quiz: A Tool to Detect and Estimate Contamination in Large Language Models
Shahriar Golchin, Mihai Surdeanu
TACL/ACL 2025 | Paper | Poster | Video | Media Coverage
Abstract: We propose the Data Contamination Quiz (DCQ), a simple and effective approach to detect data contamination in large language models (LLMs) and estimate the amount of it. Specifically, we frame data contamination detection as a series of multiple-choice questions, devising a quiz format wherein three perturbed versions of each instance, subsampled from a specific dataset partition, are created. These changes only include word-level perturbations. The generated perturbations, along with the original dataset instance, form the options in the DCQ, with an extra option accommodating the selection of none of the provided options. Given that the only distinguishing signal among the options is the exact wording with respect to the original dataset instance, an LLM, when tasked with identifying the original dataset instance, gravitates towards selecting the original one if it has been exposed to it. While accounting for positional biases in LLMs, the quiz performance reveals the contamination level for the tested model with the dataset partition to which the quiz pertains. Applied to various datasets and LLMs, under controlled and uncontrolled contamination, our findings, while fully lacking access to training data and model parameters, suggest that DCQ achieves state-of-the-art results and uncovers greater contamination levels through memorization compared to existing methods. Also, it proficiently bypasses more safety filters, especially those set to avoid generating copyrighted content.

Grading Massive Open Online Courses Using Large Language Models
Shahriar Golchin, Nikhil Garuda, Christopher Impey, Matthew Wenger
COLING 2025 | Paper
Abstract: Massive open online courses (MOOCs) offer free education globally. Despite this democratization of learning, the massive enrollment in these courses makes it impractical for an instructor to assess every student's writing assignment. As a result, peer grading, often guided by a straightforward rubric, is the method of choice. While convenient, peer grading often falls short in terms of reliability and validity. In this study, we explore the feasibility of using large language models (LLMs) to replace peer grading in MOOCs. To this end, we adapt the zero-shot chain-of-thought (ZCoT) prompting technique to automate the feedback process once the LLM assigns a score to an assignment. Specifically, to instruct LLMs for grading, we use three distinct prompts based on ZCoT: (1) ZCoT with instructor-provided correct answers, (2) ZCoT with both instructor-provided correct answers and rubrics, and (3) ZCoT with instructor-provided correct answers and LLM-generated rubrics. We tested these prompts in 18 different scenarios using two LLMs, GPT-4 and GPT-3.5, across three MOOCs: Introductory Astronomy, Astrobiology, and the History and Philosophy of Astronomy. Our results show that ZCoT, when augmented with instructor-provided correct answers and rubrics, produces grades that are more aligned with those assigned by instructors compared to peer grading. Finally, our findings indicate a promising potential for automated grading systems in MOOCs, especially in subjects with well-defined rubrics, to improve the learning experience for millions of online learners worldwide.

Time Travel in LLMs: Tracing Data Contamination in Large Language Models
Shahriar Golchin, Mihai Surdeanu
ICLR 2024, Spotlight 🌟 (notable top 5%) | Paper | Poster | Video | Media Coverage
Abstract: Data contamination, i.e., the presence of test data from downstream tasks in the training data of large language models (LLMs), is a potential major issue in measuring LLMs' real effectiveness on other tasks. We propose a straightforward yet effective method for identifying data contamination within LLMs. At its core, our approach starts by identifying potential contamination at the instance level; using this information, our approach then assesses wider contamination at the partition level. To estimate contamination of individual instances, we employ "guided instruction:" a prompt consisting of the dataset name, partition type, and the random-length initial segment of a reference instance, asking the LLM to complete it. An instance is flagged as contaminated if the LLM's output either exactly or nearly matches the latter segment of the reference. To understand if an entire partition is contaminated, we propose two ideas. The first idea marks a dataset partition as contaminated if the average overlap score with the reference instances (as measured by ROUGE-L or BLEURT) is statistically significantly better with the completions from guided instruction compared to a "general instruction" that does not include the dataset and partition name. The second idea marks a dataset partition as contaminated if a classifier based on GPT-4 with few-shot in-context learning prompt marks multiple generated completions as exact/near-exact matches of the corresponding reference instances. Our best method achieves an accuracy between 92% and 100% in detecting if an LLM is contaminated with seven datasets, containing train and test/validation partitions, when contrasted with manual evaluation by human experts. Further, our findings indicate that GPT-4 is contaminated with AG News, WNLI, and XSum datasets.

Do not Mask Randomly: Effective Domain-Adaptive Pretraining by Masking In-Domain Keywords
Shahriar Golchin, Mihai Surdeanu, Nazgol Tavabi, Ata Kiapour
ACL 2023 RepL4NLP | Paper | Poster
Abstract: We propose a novel task-agnostic in-domain pre-training method that sits between generic pre-training and fine-tuning. Our approach selectively masks in-domain keywords, i.e., words that provide a compact representation of the target domain. We identify such keywords using KeyBERT (Grootendorst, 2020). We evaluate our approach using six different settings: three datasets combined with two distinct pre-trained language models (PLMs). Our results reveal that the fine-tuned PLMs adapted using our in-domain pre-training strategy outperform PLMs that used in-domain pre-training with random masking as well as those that followed the common pre-train-then-fine-tune paradigm. Further, the overhead of identifying in-domain keywords is reasonable, e.g., 7-15% of the pre-training time (for two epochs) for BERT Large (Devlin et al., 2019).

A Natural Language Processing Pipeline to Study Disparities in Cannabis Use and Documentation Among Children and Young Adults: A Survey of 21 Years of Electronic Health Records
Nazgol Tavabi, Marium Raza, Mallika Singh, Shahriar Golchin, Harsev Singh, Grant Hogue, Ata Kiapour
Nature Digital Medicine | Paper
Abstract: The legalizations of medical and recreational cannabis have generated a great deal of interest in studying the health impacts of cannabis products. Despite increases in cannabis use, its documentation during clinical visits is not yet mainstream. This lack of information hampers efforts to study cannabis's effects on health outcomes. A clear and in-depth understanding of current trends in cannabis use documentation is necessary to develop proper guidelines to screen and document cannabis use. Here we have developed and used a natural language processing pipeline to evaluate the trends and disparities in cannabis documentation. The pipeline includes a screening step to identify clinical notes with cannabis use documentation which is then fed into a BERT-based classifier to confirm positive use. This pipeline is applied to more than 23 million notes from a large cohort of 370,087 patients seen in a high-volume multi-site pediatric and young adult clinic over a period of 21 years. Our findings show a very low but growing rate of cannabis use documentation (<2%) in electronic health records with significant demographic and socioeconomic disparities in both documentation and positive use, which requires further attention.

Building Large-Scale Registries from Unstructured Clinical Notes Using a Low-Resource Natural Language Processing Pipeline
Nazgol Tavabi, James Pruneski, Shahriar Golchin, Mallika Singh, Ryan Sanborn, Benton Heyworth, Amir Kimia, Ata Kiapour
Artificial Intelligence in Medicine | Paper
Abstract: Building clinical registries is an important step in improving the quality and safety of patient care. With the growing size of medical records, manual abstraction becomes more and more infeasible and impractical. On the other hand, Natural Language Processing Techniques have shown promising results in extracting valuable information from unstructured clinical notes. However, the structure and nature of clinical notes are very different from regular text that state-of-the-art NLP models are trained and tested on and they have their own set of challenges. In this study, we propose SE-K, an efficient and interpretable classification approach for extracting information from clinical notes, and show that it outperforms current state-of-the-art models in text classification. We use this approach to generate a 20-year comprehensive registry of anterior cruciate ligament reconstruction operations, one of the most common orthopedics operations among children and young adults. This registry can help us better understand the outcomes of this surgery and identify potential areas for improvement which can ultimately lead to better treatment outcomes.

Blog Posts

Do AI Models Want to Be Watched? Measuring Monitorability Disposition in Large Reasoning Models
Shahriar Golchin
Labelbox Research Blog, Jun 2026 | Link

The AI Safety Illusion: Why Current Safety Datasets Fool Us on Model Safety
Shahriar Golchin
Labelbox Research Blog, Feb 2026 | Link

Reflections on NeurIPS 2025: Advancing Evaluation and Continual Learning in AI
Shahriar Golchin, Smit Modi, Stepan Tytarenko, Almas Abdibayev, Marc Wetter
Labelbox Research Blog, Dec 2025 | Link

Area Chair: COLM 2026

Reviewer: EMNLP 2026, ICML 2026, AAAI 2026, NeurIPS {2025, 2024}, COLM 2025, ICLR 2025, ACL {2024, 2023}