I am an incoming CS Ph.D. student at the National University of Singapore (NUS). Previously, I obtained my Master’s degree in Computer Science and my Bachelor’s degree in Information Engineering from Shanghai Jiao Tong University (SJTU), where I was fortunate to be advised by Prof. Weinan Zhang.

Recently, my research has mainly focused on:

  • Evaluation and analysis of LLMs.
  • LLM with retrieval augmentation.
  • LLM reasoning, planning, and rule-following.
  • Interactive LLM agents.

Feel free to reach out if you are interested in my research or want to collaborate / chat with me.

🔥 News

  • 2025.05: 🎉🎉 Two papers, RuleArena and AntiLeak-Bench, were accepted to ACL 2025 (Main).
  • 2025.04: 🎉🎉 I will present our work RuleArena at ICLR 2025 Workshops (Reason&Plan, SCI-FM). Hope to see you there.
  • 2025.04: 🎉🎉 I will join the National University of Singapore (NUS) for my Ph.D. journey, starting Aug. 2025.
  • 2024.12: 🎉🎉 We released AntiLeak-Bench, an automated anti-leakage LLM benchmarking framework.
  • 2024.12: 🎉🎉 We released RuleArena, an LLM rule-guided reasoning benchmark.
  • 2024.11: 🎉🎉 I will attend SoCal NLP 2024. Hope to see you there.
  • 2024.07: 🎉🎉 I will serve as a volunteer host for the SIGIR 2024 AgentIR Workshop. Hope to see you there.
  • 2024.07: 🎉🎉 I will present our work TRAD at SIGIR 2024 in Washington, D.C.
  • 2024.03: 🎉🎉 One paper, TRAD, was accepted to SIGIR 2024.
  • 2024.03: 🎉🎉 We released TRAD, a retrieval-augmented LLM agent framework.

πŸ“ Publications

[ACL 2025 (Main)] RuleArena: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World Scenarios

Ruiwen Zhou, Wenyue Hua, Liangming Pan, Sitao Cheng, Xiaobao Wu, En Yu, William Yang Wang

  • We create the RuleArena benchmark for LLM rule-guided reasoning, with a comprehensive set of 95 policies/rules drawn from real-world scenarios and 816 data points of varying difficulty for LLM evaluation. We further propose a suite of evaluation metrics for both rule selection and rule application, providing fine-grained insights into LLM performance.
  • Existing state-of-the-art LLMs, such as GPT-4o and Claude-3.5 Sonnet, mostly fail on complex rule-guided reasoning tasks from RuleArena. By examining common failure cases and identifying difficult rule types, we uncover several systematic issues that limit LLMs’ rule-guided reasoning capabilities.

[ACL 2025 (Main)] AntiLeak-Bench: Preventing Data Contamination by Automatically Constructing Benchmarks with Updated Real-World Knowledge

Xiaobao Wu, Liangming Pan, Yuxi Xie, Ruiwen Zhou, Shuai Zhao, Yubo Ma, Mingzhe Du, Rui Mao, Anh Tuan Luu, William Yang Wang

  • We create the AntiLeak-Bench framework, which collects samples with explicitly new knowledge absent from LLMs’ training sets and thus ensures strictly contamination-free evaluation. The construction workflow is also automated, so the benchmark can be maintained and updated without human labor.
  • Experiments show that pre-cutoff samples come with data contamination, which inflates LLMs’ performance, while contamination-free post-cutoff samples collected by our AntiLeak-Bench are more challenging and can more accurately assess LLMs.

[SIGIR 2024] TRAD: Enhancing LLM Agents with Step-Wise Thought Retrieval and Aligned Decision

Ruiwen Zhou, Yingxuan Yang, Muning Wen, Ying Wen, Wenhao Wang, Chunling Xi, Guoqiang Xu, Yong Yu, Weinan Zhang

  • We propose TRAD, a novel retrieval-augmented LLM agent. It selects relevant demonstrations at the step level via thought retrieval (TR), and applies an aligned decision (AD) module that augments the LLM with the temporally neighboring steps of the retrieved demonstrations, along with their order information, during action prediction.
  • TRAD significantly outperforms existing SoTA LLM agents such as Synapse and ReAct on common LLM agent benchmarks like ALFWorld (household tasks) and Mind2Web (web navigation).

[arXiv preprint] Is Risk-Sensitive Reinforcement Learning Properly Resolved?

Ruiwen Zhou, Minghuan Liu, Kan Ren, Xufang Luo, Weinan Zhang, Dongsheng Li

  • We reveal improper optimization in existing distributional risk-sensitive RL (RSRL) algorithms, showing that maximizing a distortion risk measure at every step in a dynamic programming style leads to divergence.
  • To address this issue, we propose a novel RSRL algorithm, Trajectory Q-Learning (TQL). By incorporating the historical return distribution, TQL achieves theoretical optimality and outperforms existing methods in RSRL tasks.

[NeurIPS 2022] Learning Enhanced Representations for Tabular Data via Neighborhood Propagation

Kounianhua Du, Weinan Zhang, Ruiwen Zhou, Yangkun Wang, Xilong Zhao, Jiarui Jin, Quan Gan, Zheng Zhang, David Wipf

  • We propose PET, a GNN-based tabular prediction model. It considers both row-wise (across-sample) and column-wise (across-feature) feature interactions by transforming each target sample, together with its retrieved relevant samples, into a hypergraph and conducting message passing on it.
  • On a wide range of tabular prediction tasks like CTR prediction and top-n recommendation, PET surpasses SoTA methods by a large margin.

🎖 Honors and Awards

  • 2024.11 Huatai Securities Scholarship (~Top 10% out of 179).
  • 2024.11 First-Class Excellence Scholarship (Top 30% out of 179).
  • 2022.11 First-Class Excellence Scholarship (Top 30% out of 179).
  • 2021.12 B-Class Excellence Scholarship (Top 10% out of 144).
  • 2021.12 Zhiyuan Honors Scholarship (Top 5% students in Zhiyuan Honors Program).
  • 2021.04 Outstanding Winner of MCM/ICM 2021 (~Top 0.15% worldwide).
  • 2020.12 A-Class Excellence Scholarship (Top 1 out of 144).
  • 2020.12 National Scholarship (Top 2 out of 144).
  • 2020.12 Zhiyuan Honors Scholarship (Top 5% students in Zhiyuan Honors Program).
  • 2019.11 B-Class Excellence Scholarship (Top 10% out of 144).
  • 2019.11 Zhiyuan Honors Scholarship (Top 5% students in Zhiyuan Honors Program).

📖 Education

  • 2022.09 - 2025.03, M.Eng. in Computer Science and Technology, SJTU.
  • 2018.09 - 2022.06, B.Eng. in Information Engineering, SJTU.

💻 Internships

  • 2025.04 - Present, Shanghai AI Lab, Advised by: Jie Fu.
  • 2024.07 - 2024.12, UCSB NLP Group, Advised by: Prof. William Yang Wang.
  • 2022.02 - 2023.02, Amazon Web Services, Advised by: Quan Gan.
  • 2021.08 - 2022.01, Microsoft Research Asia, Advised by: Kan Ren.

👀 Miscellaneous

In my spare time, I love:

  • Strolling: I often take walks to beautiful places nearby to recharge.
  • Music: I listen to pop songs, musicals, symphonies, etc. I also play the piano and sing.
  • Sports: I watch NBA games, F1 races, etc. I am a fan of James Harden and Lewis Hamilton.