Guest lecture: Dr. Basant Agarwal

Large Language Models (LLMs) such as ChatGPT, Llama, and Qwen have demonstrated remarkable capabilities across a wide range of natural language processing tasks. To ensure safe deployment, these models are aligned using techniques such as Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF). However, recent studies have shown that carefully crafted adversarial prompts can bypass these safety mechanisms, a phenomenon known as jailbreaking. This presentation gives an overview of LLM jailbreaking and proposes a new jailbreaking technique that uses an Evolutionary Algorithm (EA) to optimize a universal adversarial suffix. When appended to a user’s input query, the suffix leads the model to produce output that circumvents its safety alignment.

About the speaker:

Dr. Basant Agarwal is an Associate Professor in the Department of Computer Science and Engineering at the Central University of Rajasthan. He earned his M.Tech. and Ph.D. in Computer Science and Engineering from Malaviya National Institute of Technology (MNIT) Jaipur, India. He was a Postdoctoral Research Fellow at the Norwegian University of Science and Technology (NTNU), Norway, under an ERCIM Fellowship, and also served as a Research Scientist at Temasek Laboratories, National University of Singapore (NUS), Singapore. Dr. Agarwal has successfully carried out several sponsored research projects funded nationally and internationally, including an international bilateral Indo–Slovenia project during 2022-25. Dr. Agarwal’s research interests broadly include Artificial Intelligence, Machine Learning, Natural Language Processing, and developing Large Language Models for domain-specific problems.
