AIML430_structured_survey/main.tex

\documentclass{article}[11pt]
\usepackage[a4paper,margin=1in]{geometry}

\usepackage[utf8]{inputenc}

\usepackage[style=authoryear,backend=biber]{biblatex}
\addbibresource{references.bib}
\usepackage{graphicx}
\usepackage{mdframed}

\DeclareFieldFormat{title}{#1}
\DeclareCiteCommand{\fullcite}
    {\usebibmacro{prenote}}
    {\printfield{title} (\cite{\thefield{entrykey}})}
    {\multicitedelim}
    {\usebibmacro{postnote}}

\newcommand{\paperdetails}[5]{
\subsection*{Paper: \fullcite{#1}}
#2
\begin{itemize}
    \item \textbf{Main Contributions/Findings:}
    \begin{itemize}
        #3
    \end{itemize}
    \item \textbf{Interesting Ideas/Results:}
    \begin{itemize}
        #4
    \end{itemize}
    \item \textbf{Significant Limitations:}
    \begin{itemize}
        #5
    \end{itemize}
\end{itemize}
}

\begin{document}
\noindent Student ID: 300680096 \\
\noindent \textit{You should choose 3-5 papers in your approved application area. Your selected papers should be coherent --- not too similar, but still related enough that they provide a good overview of the main applications and techniques in your area. The provided word targets are guidelines --- it is important that you are sufficiently detailed but also demonstrating an ability to write concisely.\\ \\
Please delete the text in italics for your final submission. Submit in PDF format by the deadline.
}

\section*{Introduction (target 150--250 words)}
\begin{mdframed}[] %remove this argument
\textit{(Introduce the application area you are going to cover. Make sure that you have constrained your focus to a narrow area.)}
As LLMs are being deployed in production environments the need for LLM safety is increasingly pressing. When these models' capabilities increase, so does their potential for harm \cite{shevlaneModelEvaluationExtreme2023,kangExploitingProgrammaticBehavior2023}. Within the LLM safety space there are two sorts of problems. The first (harmfulness) is short term in that it should refuse to assist in harmful activities. For example teaching someone how to build a bioweapon or hack a computer. The second (alignment) is longer term and that the underlying goals of the LLM align with true human intention.  For example avoiding misaligned behavior like increasing engagement at expense of users' wellbeing. The research in this area is broadly either developing novel evaluation techniques or applying current evaluation techniques to new models.

In this review I will survey three representative papers from this area. The first paper \cite{bondarenkoDemonstratingSpecificationGaming2025} uses a semi-novel evaluation technique to show a LLM cheating at chess. The second paper \cite{marksAuditingLanguageModels2025} runs a blind experiment to study the effectiveness of current safety auditing techniques in discovering hidden objectives in LLMs. The last paper \cite{yuanSEvalAutomatedComprehensive2025} proposes a new LLM-based automated test generation and evaluation framework to evaluate model harmfulness.

This survey will focus on the concept of pre-deployment tests (both harmfulness and alignment) for LLMs, which involve conducting various evaluations/tests/benchmarks prior to deployment, to ensure some level of safety. There are other important aspects of LLM safety such as continuous monitoring, continuous learning from human feedback as well as other more theoretical considerations.

\end{mdframed}

\section*{Key Points in the Papers (target 600-800 words)}
\begin{mdframed}[] %remove this argument
\textit{(For each paper:
\begin{itemize}
    \item[-] List the main contributions/findings of the paper (typically 1 to 3 per paper, a few sentences per contribution/finding).
    \item[-] List any ideas or results in the paper that you thought were particularly interesting and explain why.
    \item[-] List any significant limitations you identified in the work.)
\end{itemize}}

\paperdetails{bondarenkoDemonstratingSpecificationGaming2025}{
    Palisade Research conducts experiments with different prompts and different LLMs to see how and when they will cheat at chess.
}{
    \item [--] Reasoning based LLMs like o1-preview and Deepseek R1 will often try to hack the game to win. It can be observed that when the model thinks it is in a hopeless situation it will get creative and sometimes cheat to win.
    \item [--] Base Language models require more suggestive prompting before they will hack the game. This supports the notion that it is the reasoning capabilities of the model that allows it to give itself suggestive ideas which base LLMs lack.
}{
    \item [--] The LLM will accept the stated goal of "win" and interpret it as "win at all costs". The model is not concerned with the spirit of the game. One could imagine someone with common sense interpreting the request as more along the lines of "win by following the rules, not asking for help, don't remove opponent etc".
    \item [--] Small differences in the prompt could lead to large differences in the models hacking attempts. For example claude 3.5 Sonnet doesn't hack at all for two prompts but for the third hacks about 3/4 of the time. My intuition suggests that humans are not so vulnerable to propmt changes.
}{
    \item [--] The LLM harness used for the experiments is overly complex. The results are harder to generalise as each experiment is conditional on a lot of framework. It is plausible that removing the complex harness would still lead to similar results.
    \item [--] All of the runs are judged by a LLM that disagree with each other about 25\% of the time. I believe this rate of disagreement is significantly higher than would be found when using human judges and could make the experiment data too unreliable.
    \item [--] The goal of the experiment is to test an AI agents chance of going against the "spirit" of the task (specification gaming). As it is only tested on a single specific environment it is hard to generalize across to the models themselves.
}

\paperdetails{marksAuditingLanguageModels2025}{
    Anthropic run a blind experiment of training a LLM with a hidden objective and giving it to 4 differnet red teams to try and discover if it had any hidden objectives in a mock pre-deployment audit.
}{
    \item [--] Current malignment auditing techniques can discover hidden objectives in LLMs. When working in the context of an AI developer who has access to the training data and weights, they can identify the hidden objective and most of the biases that the model has. More so this can all be done in a reasonable amount of time ($<$ 1 week). However the successful teams needed access to training data and weights to find the hidden objective.
}{
    \item [--] Using sparse auto encoders which are typically used for interprability are shown to be quite effective specifically at providing guides for what bad data to look for. It seems very reasonable that as LLMs get smarter having a better understanding of how and what it is thinking will help us judge whether it is thinking the right thing.
    \item [--] From the perspective of a non expert I would have assumed that searching the training data to be a crux. I am interested that they didn't have a single team that had access to weights but not training data (like how many open weight models are provided). I would think that this would better test the effectiveness of white-box techniques.
}{
    \item [--] The game was played for a single round with only 4 teams. This is an extremely small sample size and only tests for a single type of hidden objective on a single model. It would be hard to conclude with too much significance if the auditing methods are indeed scalable and effective.
    \item [--] The hidden objective was injected firstly through synthetic data and secondly by post training. This resulted in all successful teams discovering the hidden objective by finding it in the training data directly. It seems much more problematic (and much harder) to discover hidden objectives when they are nuanced and/or emergent properties of the training data/process.
}

\paperdetails{yuanSEvalAutomatedComprehensive2025}{
    Yuan et al, develop a automated testing framework that uses LLM to generate tests and evaluate LLM responses to quantify a models safety in terms of whether it generates unsafe responses.
}{
    \item [--] Develop a safety evaluation framework, which comprises two models that are fine-tuned version of Qwen-14B-Chat. The generator can build base "risky" prompts as well attack prompts that augment the base prompt with jail break techniques to trick the LLM into answering. An evaluator model determines if the response is safe and gives a reason for its decision.
    \item [--] Conduct a safety evaluation on 21 LLMs across using both English and Chinese. They find that closed source models are generally safer and different languages can have drastically different effects on the safety score.
}{
    \item [--] I think that the idea of having a LLM try to test the safety of other LLMs is very interesting. I think that the harmful space is so complex we need great scale of safety tests to sufficiently explore the space. Using AI test generator and evaluator can facilitate this scale.
    \item [--] The attack prompt adaptive results were interesting. In the case where only one of ten attack prompts has to pass to count as a fail the tested models fail almost all the time. I think that this really highlights the helpfulness vs harmfulness trade off. In this case it seems all the models tested could be reliably jailbroken by building a simple small LLM wrapper that generates a variety of attack prompts to get the model to answer.
}{
    \item [--] A problem with this framework is that it has minimal human oversight. As the scale increases (which is the goal of automation) the potential for drift in the prompt generation and response evaluation increases. Fortunately this system seems developed in a way that would allow for continual updating of the generator and evaluator model including moving to new more powerful base models.
}

\end{mdframed}

\section*{Discussion (target 300-400 words)}
\begin{mdframed}[] %remove this argument
\textit{(You should:
\begin{itemize}
    \item[(a)] Compare and contrast the papers.
    \item[(b)] Identify where these papers could head –-- e.g.\ new research questions or applications to real tasks.
\end{itemize}
An excellent discussion will cohesively link the papers together and provide critical insight into the application area as a whole and where it is/could be going in the future.)
}
We can see in \cite{bondarenkoDemonstratingSpecificationGaming2025} that current SOTA models can suggest and carry out inappropriate actions. This shows that not only does the model misinterpret (malignment) the request ("win at all costs" vs "win by playing normally") it will also respond with how to hack the game (harmful response). Current methods to test for malignment are quite limited and rely on mostly manual human methods \cite{marksAuditingLanguageModels2025} while harmfulness is becoming increasingly automated \cite{yuanSEvalAutomatedComprehensive2025}. In \cite{marksAuditingLanguageModels2025} we can fortunately see that a modern audit can identify a maligned goal like "reward models love chocolate in everything" rather than "be helpful and make the world happy". Figuring out if the model is harmless can devolve into a whack-a-mole situation where patching one problem causes regressions elsewhere. By having large scale high quality test sets one can mitigate this issue and have some higher level of confidence that the model won't respond helpfully to harmful requests. Automated test generation and evaluation is a good avenue to scale up harmfulness testing which is what \cite{yuanSEvalAutomatedComprehensive2025} explores.

We use these current human based methods as well as benchmarks to give some confidence that the model is not harmful. Given the current paradigm of transformer based LLMs there is no way to prove that a model is safe, instead we can provide some certainty based on it successfully passing an ever growing amount of tests/benchmarks. Therefore to increase certainty we need to have larger scales of testing. The greatest scale can only be achieved through the use of AI. Using AI does introduce new risk factors as seen in \cite{wangLargeLanguageModels2023} but I digress. I think that the space of malignment is just as complex so requires the same scale and therefore automation as harmfulness testing. Specifically I think that having dynamically generated safety and alignment tests could help not only increase statistical significance of the results but also reduce the problem of test time behavior \cite{needhamLargeLanguageModels2025} because the environments will be more organic and less predictable. Furthermore as models get more powerful they will be better at passing these safety benchmarks and alignment benchmarks so creativity and scale will be needed to effectively test these models.

To develop this dynamic, AI generated testing system there are three important steps. Firstly is a tests generator that generate not just harmfulness tests but also longer running alignment scenarios like chess game or advising a CEO on tax reduction techniques. Secondly an evaluator that can both mark responses as safe and unsafe (like in \cite{yuanSEvalAutomatedComprehensive2025}) but also carry out longer running tests (like conducting chess game against engine to test for cheating \cite{bondarenkoDemonstratingSpecificationGaming2025} or tests for self-preservations \cite{meinkeFrontierModelsAre2025}). Lastly would be some systematic human research to compare and see if this automated approach gives a signal comparable to current benchmarks and is reliable (i.e consistent results after multiple runs). I see that a sufficiently sophisticated and successful system would increase the effectiveness of pre-deployment tests and provide a scalable solution to the increasingly powerful models.
\end{mdframed}


\section*{References}
\begin{mdframed}[] %remove this argument
\noindent\printbibliography[heading=none]
\end{mdframed}
\end{document}