Final remove italics

Final draft
2025-08-29 12:40:25 +12:00 · 2025-08-29 12:34:45 +12:00
3 changed files with 101 additions and 41 deletions
--- a/main.tex
+++ b/main.tex
@@ -36,76 +36,66 @@

 \begin{document}
 \noindent Student ID: 300680096 \\
-\noindent \textit{You should choose 3-5 papers in your approved application area. Your selected papers should be coherent --- not too similar, but still related enough that they provide a good overview of the main applications and techniques in your area. The provided word targets are guidelines --- it is important that you are sufficiently detailed but also demonstrating an ability to write concisely.\\ \\
-Please delete the text in italics for your final submission. Submit in PDF format by the deadline.
-}

 \section*{Introduction (target 150--250 words)}
 \begin{mdframed}[] %remove this argument
-\textit{(Introduce the application area you are going to cover. Make sure that you have constrained your focus to a narrow area.)}
+As LLMs are being deployed in production environments the need for LLM safety is increasingly pressing. When these models' capabilities increase, so does their potential for harm \cite{shevlaneModelEvaluationExtreme2023,kangExploitingProgrammaticBehavior2023}. Within the LLM safety space there are two sorts of problems. The first (harmfulness) is short term in that it should refuse to assist in harmful activities. For example teaching someone how to build a bioweapon or hack a computer. The second (alignment) is longer term and that the underlying goals of the LLM align with true human intention.  For example avoiding misaligned behavior like increasing engagement at expense of users' wellbeing. The research in this area is broadly either developing novel evaluation techniques or applying current evaluation techniques to new models.
+
+In this review I will survey three representative papers from this area. The first paper \cite{bondarenkoDemonstratingSpecificationGaming2025} uses a semi-novel evaluation technique to show a LLM cheating at chess. The second paper \cite{marksAuditingLanguageModels2025} runs a blind experiment to study the effectiveness of current safety auditing techniques in discovering hidden objectives in LLMs. The last paper \cite{yuanSEvalAutomatedComprehensive2025} proposes a new LLM-based automated test generation and evaluation framework to evaluate model harmfulness.
+
+This survey will focus on the concept of pre-deployment tests (both harmfulness and alignment) for LLMs, which involve conducting various evaluations/tests/benchmarks prior to deployment, to ensure some level of safety. There are other important aspects of LLM safety such as continuous monitoring, continuous learning from human feedback as well as other more theoretical considerations.
+
 \end{mdframed}

 \section*{Key Points in the Papers (target 600-800 words)}
 \begin{mdframed}[] %remove this argument
-\textit{(For each paper:
-\begin{itemize}
-    \item[-] List the main contributions/findings of the paper (typically 1 to 3 per paper, a few sentences per contribution/finding).
-    \item[-] List any ideas or results in the paper that you thought were particularly interesting and explain why.
-    \item[-] List any significant limitations you identified in the work.)
-\end{itemize}}

 \paperdetails{bondarenkoDemonstratingSpecificationGaming2025}{
-    Bondarenko et al, conducts experiments with differnet prompts and differnet LLMs to see how and when they will cheat to beat a chess engine at chess.
+    Palisade Research conducts experiments with different prompts and different LLMs to see how and when they will cheat at chess.
 }{
-    \item [--] Reasoning based LLM like o1-preview and Deepseek R1 will often try to hack the game to win. It can be observed that when the model thinks it is in a hopeless situation it will get creative and sometimes cheat to win.
-    \item [--] Base Language models require more suggestive prompting before they will hack the game. This suggests that when a model reasons with itself it can gives itself some suggestive prompting.
+    \item [--] Reasoning based LLMs like o1-preview and Deepseek R1 will often try to hack the game to win. It can be observed that when the model thinks it is in a hopeless situation it will get creative and sometimes cheat to win.
+    \item [--] Base Language models require more suggestive prompting before they will hack the game. This supports the notion that it is the reasoning capabilities of the model that allows it to give itself suggestive ideas which base LLMs lack.
 }{
-    \item [--] The LLM will accept the stated goal of "win" and interprete it readily as win at all costs. The models don't seem to be concerned with the spirit of the game. One could imagine someone with common sense would interprete the request as more along the lines of "win by following the rules, not asking for help, don't remove opponent etc".
-    \item [--] Small differences in the prompt could lead to large differences in the models hacking attempts. For example claude 3.5 Sonnet doesn't hack at all for two prompts but for the third hacks about 3/4 of the time.
+    \item [--] The LLM will accept the stated goal of "win" and interpret it as "win at all costs". The model is not concerned with the spirit of the game. One could imagine someone with common sense interpreting the request as more along the lines of "win by following the rules, not asking for help, don't remove opponent etc".
+    \item [--] Small differences in the prompt could lead to large differences in the models hacking attempts. For example claude 3.5 Sonnet doesn't hack at all for two prompts but for the third hacks about 3/4 of the time. My intuition suggests that humans are not so vulnerable to propmt changes.
 }{
-    \item [--] The LLM harness used for the experiments is overly complex. This reduces the interpretability and makes the results more opaque.
-    \item [--] All of the runs are judged by a LLM that disagree with each other about 25\% of the time. I believe this rate of disagreement is significantly higher than would be found when using human judges.
-    \item [--] The goal of the experiment is to test an AI agents chance of going against the "spirit" of the task. As it is only tested on a single specific environment it is hard to generalize across to the models themselves.
+    \item [--] The LLM harness used for the experiments is overly complex. The results are harder to generalise as each experiment is conditional on a lot of framework. It is plausible that removing the complex harness would still lead to similar results.
+    \item [--] All of the runs are judged by a LLM that disagree with each other about 25\% of the time. I believe this rate of disagreement is significantly higher than would be found when using human judges and could make the experiment data too unreliable.
+    \item [--] The goal of the experiment is to test an AI agents chance of going against the "spirit" of the task (specification gaming). As it is only tested on a single specific environment it is hard to generalize across to the models themselves.
 }

 \paperdetails{marksAuditingLanguageModels2025}{
-    Anthropic run a blind experiment of training a LLM with a known hidden objectives and giving it to 4 differnet red teams to try and discover if it had any hidden objectives in a mock pre-deployment test.
+    Anthropic run a blind experiment of training a LLM with a hidden objective and giving it to 4 differnet red teams to try and discover if it had any hidden objectives in a mock pre-deployment audit.
 }{
-    \item [--] Current malignment auditing techniques can discover hidden objectives in LLMs. When working in the context of an AI developer who has access to the training data and weights they can in a reasonble amount of time (< 1 week) identify the hidden objective and most of the biases that it generates.
-    \item [--] Without access to the training data and weights (API only access) there was no successful identifying of the known hidden objective. This was used as a anecdotal proof of non-trvialilty of hidden objective toughness, however can also be used to observe current external auditing for hidden objectrive is weak.
+    \item [--] Current malignment auditing techniques can discover hidden objectives in LLMs. When working in the context of an AI developer who has access to the training data and weights, they can identify the hidden objective and most of the biases that the model has. More so this can all be done in a reasonable amount of time ($<$ 1 week). However the successful teams needed access to training data and weights to find the hidden objective.
 }{
-    \item [--] An idea that I found interesting were the authors almost dismissal of SAEs as a viable long term strategy. They seemed quite keen to show that there are other simpler methods that work.
-    \item [--] From a somewhat outsider perspective I would have assumed that searching the training data to be a bit of a crux and not "truly" auditing. I am interested that they didn't have a single team that had access to weights but not training data (like how many open source models are provided).
+    \item [--] Using sparse auto encoders which are typically used for interprability are shown to be quite effective specifically at providing guides for what bad data to look for. It seems very reasonable that as LLMs get smarter having a better understanding of how and what it is thinking will help us judge whether it is thinking the right thing.
+    \item [--] From the perspective of a non expert I would have assumed that searching the training data to be a crux. I am interested that they didn't have a single team that had access to weights but not training data (like how many open weight models are provided). I would think that this would better test the effectiveness of white-box techniques.
 }{
-    \item [--] The game was only played for a single round with only 4 teams. This is an extremely small sample size and only tests for a single type of hidden objective on a single model. It would be hard to conclude with any statistical significance if the auditing methods are scalable and effective.
-    \item [--] The hidden objective was injected firstly through synthetic data with clear biases. This resulted in all successful teams discovering the hidden objective by finding it in the training data directly. It seems much more problematic (and much harder) to discover hidden objectives when they are simply emergent properties of such a large corpus.
+    \item [--] The game was played for a single round with only 4 teams. This is an extremely small sample size and only tests for a single type of hidden objective on a single model. It would be hard to conclude with too much significance if the auditing methods are indeed scalable and effective.
+    \item [--] The hidden objective was injected firstly through synthetic data and secondly by post training. This resulted in all successful teams discovering the hidden objective by finding it in the training data directly. It seems much more problematic (and much harder) to discover hidden objectives when they are nuanced and/or emergent properties of the training data/process.
 }

 \paperdetails{yuanSEvalAutomatedComprehensive2025}{
-    Yuan et al, develop a automated testing framework that uses LLM to generate tests and evaluate LLM responses to quantify a models safetiness in terms of whether it generates unsafe responses.
+    Yuan et al, develop a automated testing framework that uses LLM to generate tests and evaluate LLM responses to quantify a models safety in terms of whether it generates unsafe responses.
 }{
-    \item [--] Develop a safety evalation framework, which comprises two models that are fine-tuned version of Qwen-14B-Chat. The generator can build base risky prompts as well attack prompts that augment the base prompt with jail break techniques to trick the LLM into answering. An evaluator model determines if the repsonse is safe and gives a reason for its decision.
-    \item [--] Conduct a safety evaluation on 21 LLM across using both English and Chinese. They find that closed source models are generally safer and different languages can have drastically different effects on the safety score.
+    \item [--] Develop a safety evaluation framework, which comprises two models that are fine-tuned version of Qwen-14B-Chat. The generator can build base "risky" prompts as well attack prompts that augment the base prompt with jail break techniques to trick the LLM into answering. An evaluator model determines if the response is safe and gives a reason for its decision.
+    \item [--] Conduct a safety evaluation on 21 LLMs across using both English and Chinese. They find that closed source models are generally safer and different languages can have drastically different effects on the safety score.
 }{
-    \item [--] I think that the idea of having a LLM try to test the safety of other LLMs is very interesting. This paper explores a novel approach of having LLM generate tests to be used across various models. As it is automated it can be done on a scale not possible before. \footnote{It would also be interesting to instead of making the test prompt generics instead have a game like setup where the generator model keeps trying to generate a prompt that gets the models to give an unsafe answer to a question (as assessed by the evaluator).}
-    \item [--] The attack prompt adaptive results were interesting. In the case where only one of ten attack prompts has to pass to count as a pass the tested models fail almost all the time. I think that this really highlights the helpfulness vs harmfullness trade off. In this case it seems all the models tested could be reliably jailbroken by building a simply small LLM wrapper that generates a variety of attack prompts to get the model to answer.
+    \item [--] I think that the idea of having a LLM try to test the safety of other LLMs is very interesting. I think that the harmful space is so complex we need great scale of safety tests to sufficiently explore the space. Using AI test generator and evaluator can facilitate this scale.
+    \item [--] The attack prompt adaptive results were interesting. In the case where only one of ten attack prompts has to pass to count as a fail the tested models fail almost all the time. I think that this really highlights the helpfulness vs harmfulness trade off. In this case it seems all the models tested could be reliably jailbroken by building a simple small LLM wrapper that generates a variety of attack prompts to get the model to answer.
 }{
-    \item [--] A problem with this framework is that it has minimal human oversight. As the scale increases (which is the goal of automation) the potential for drift in the prompt generation and response evaluation increases. Fortunately this system seems developed in a way that would allow for continual updating of the generator and evaluator model including moving the new base models.
+    \item [--] A problem with this framework is that it has minimal human oversight. As the scale increases (which is the goal of automation) the potential for drift in the prompt generation and response evaluation increases. Fortunately this system seems developed in a way that would allow for continual updating of the generator and evaluator model including moving to new more powerful base models.
 }
-
 \end{mdframed}

 \section*{Discussion (target 300-400 words)}
 \begin{mdframed}[] %remove this argument
-\textit{(You should:
-\begin{itemize}
-    \item[(a)] Compare and contrast the papers.
-    \item[(b)] Identify where these papers could head –-- e.g.\ new research questions or applications to real tasks.
-\end{itemize}
-An excellent discussion will cohesively link the papers together and provide critical insight into the application area as a whole and where it is/could be going in the future.)
-}
-LLM safety is an area that is becoming increasingly pressing. As these models capabilities increase, so to does their potential harmfullness. Within the LLM safety space there are two sorts of problems. The first (harmlessness) is short term in that when a model is asked how to build a nuclear bomb it should refuse to do so. The second (alignment) is longer term and that the underlying goals of the LLM align with what we as humans want it to do. We can see in \cite{bondarenkoDemonstratingSpecificationGaming2025} that current SOTA models can suggest and carry out inappropriate actions. This shows that not only does the model misinterpret (malignment) the request ("win at all costs" vs "win by playing normally") it will also respond with how to hack the game (harmful response). Current methods for testing for malignment are quite limited and rely on mostly manual human methods while harmfullness is becoming increasingly automated. In \cite{marksAuditingLanguageModels2025} we can fortunately see that a modern audit can identify a maligned goal like "maxismise the RM score" rather than "be helpful and make the world happy" (or more likely a well thought out base goal). Figuring out if the model is harmless can devolve into a whack-a-more situation where patching one problems causes regressions elsewhere. By having large scale high quality test sets one can mitigate this issue and have some higher level of confidence that the model won't repsond helpfully to harmful request. Automated test generation and evaluation is a good avenue to scale up harmfullness testing which is what \cite{yuanSEvalAutomatedComprehensive2025} explores. A further step is to make a more autoamte game like testing for harmfullness as well as alignment (by building virtual worlds to play out in). Expland about how this could work.
+We can see in \cite{bondarenkoDemonstratingSpecificationGaming2025} that current SOTA models can suggest and carry out inappropriate actions. This shows that not only does the model misinterpret (malignment) the request ("win at all costs" vs "win by playing normally") it will also respond with how to hack the game (harmful response). Current methods to test for malignment are quite limited and rely on mostly manual human methods \cite{marksAuditingLanguageModels2025} while harmfulness is becoming increasingly automated \cite{yuanSEvalAutomatedComprehensive2025}. In \cite{marksAuditingLanguageModels2025} we can fortunately see that a modern audit can identify a maligned goal like "reward models love chocolate in everything" rather than "be helpful and make the world happy". Figuring out if the model is harmless can devolve into a whack-a-mole situation where patching one problem causes regressions elsewhere. By having large scale high quality test sets one can mitigate this issue and have some higher level of confidence that the model won't respond helpfully to harmful requests. Automated test generation and evaluation is a good avenue to scale up harmfulness testing which is what \cite{yuanSEvalAutomatedComprehensive2025} explores.

+We use these current human based methods as well as benchmarks to give some confidence that the model is not harmful. Given the current paradigm of transformer based LLMs there is no way to prove that a model is safe, instead we can provide some certainty based on it successfully passing an ever growing amount of tests/benchmarks. Therefore to increase certainty we need to have larger scales of testing. The greatest scale can only be achieved through the use of AI. Using AI does introduce new risk factors as seen in \cite{wangLargeLanguageModels2023} but I digress. I think that the space of malignment is just as complex so requires the same scale and therefore automation as harmfulness testing. Specifically I think that having dynamically generated safety and alignment tests could help not only increase statistical significance of the results but also reduce the problem of test time behavior \cite{needhamLargeLanguageModels2025} because the environments will be more organic and less predictable. Furthermore as models get more powerful they will be better at passing these safety benchmarks and alignment benchmarks so creativity and scale will be needed to effectively test these models.
+
+To develop this dynamic, AI generated testing system there are three important steps. Firstly is a tests generator that generate not just harmfulness tests but also longer running alignment scenarios like chess game or advising a CEO on tax reduction techniques. Secondly an evaluator that can both mark responses as safe and unsafe (like in \cite{yuanSEvalAutomatedComprehensive2025}) but also carry out longer running tests (like conducting chess game against engine to test for cheating \cite{bondarenkoDemonstratingSpecificationGaming2025} or tests for self-preservations \cite{meinkeFrontierModelsAre2025}). Lastly would be some systematic human research to compare and see if this automated approach gives a signal comparable to current benchmarks and is reliable (i.e consistent results after multiple runs). I see that a sufficiently sophisticated and successful system would increase the effectiveness of pre-deployment tests and provide a scalable solution to the increasingly powerful models.
 \end{mdframed}


--- a/output/main.pdf
+++ b/output/main.pdf
--- a/references.bib
+++ b/references.bib
@@ -85,6 +85,24 @@
  file = {/home/james/snap/zotero-snap/common/Zotero/storage/HAEKTKFQ/Järviniemi and Hubinger - 2024 - Uncovering Deceptive Tendencies in Language Models A Simulated Company AI Assistant.pdf;/home/james/snap/zotero-snap/common/Zotero/storage/C632CXHL/2405.html}
 }

+@misc{kangExploitingProgrammaticBehavior2023,
+  title = {Exploiting {{Programmatic Behavior}} of {{LLMs}}: {{Dual-Use Through Standard Security Attacks}}},
+  shorttitle = {Exploiting {{Programmatic Behavior}} of {{LLMs}}},
+  author = {Kang, Daniel and Li, Xuechen and Stoica, Ion and Guestrin, Carlos and Zaharia, Matei and Hashimoto, Tatsunori},
+  year = {2023},
+  month = feb,
+  number = {arXiv:2302.05733},
+  eprint = {2302.05733},
+  primaryclass = {cs},
+  publisher = {arXiv},
+  doi = {10.48550/arXiv.2302.05733},
+  urldate = {2025-08-28},
+  abstract = {Recent advances in instruction-following large language models (LLMs) have led to dramatic improvements in a range of NLP tasks. Unfortunately, we find that the same improved capabilities amplify the dual-use risks for malicious purposes of these models. Dual-use is difficult to prevent as instruction-following capabilities now enable standard attacks from computer security. The capabilities of these instruction-following LLMs provide strong economic incentives for dual-use by malicious actors. In particular, we show that instruction-following LLMs can produce targeted malicious content, including hate speech and scams, bypassing in-the-wild defenses implemented by LLM API vendors. Our analysis shows that this content can be generated economically and at cost likely lower than with human effort alone. Together, our findings suggest that LLMs will increasingly attract more sophisticated adversaries and attacks, and addressing these attacks may require new approaches to mitigations.},
+  archiveprefix = {arXiv},
+  keywords = {Computer Science - Cryptography and Security,Computer Science - Machine Learning},
+  file = {/home/james/snap/zotero-snap/common/Zotero/storage/SUK4VGFG/Kang et al. - 2023 - Exploiting Programmatic Behavior of LLMs Dual-Use Through Standard Security Attacks.pdf;/home/james/snap/zotero-snap/common/Zotero/storage/3DPCT287/2302.html}
+}
+
@misc{marksAuditingLanguageModels2025,
  title = {Auditing Language Models for Hidden Objectives},
  author = {Marks, Samuel and Treutlein, Johannes and Bricken, Trenton and Lindsey, Jack and Marcus, Jonathan and {Mishra-Sharma}, Siddharth and Ziegler, Daniel and Ameisen, Emmanuel and Batson, Joshua and Belonax, Tim and Bowman, Samuel R. and Carter, Shan and Chen, Brian and Cunningham, Hoagy and Denison, Carson and Dietz, Florian and Golechha, Satvik and Khan, Akbir and Kirchner, Jan and Leike, Jan and Meek, Austin and {Nishimura-Gasparian}, Kei and Ong, Euan and Olah, Christopher and Pearce, Adam and Roger, Fabien and Salle, Jeanne and Shih, Andy and Tong, Meg and Thomas, Drake and Rivoire, Kelley and Jermyn, Adam and MacDiarmid, Monte and Henighan, Tom and Hubinger, Evan},
@@ -102,7 +120,7 @@
  file = {/home/james/snap/zotero-snap/common/Zotero/storage/JXU3M77C/Marks et al. - 2025 - Auditing language models for hidden objectives.pdf;/home/james/snap/zotero-snap/common/Zotero/storage/T5PG969M/2503.html}
 }

-@misc{meinkeFrontierModelsAre2025a,
+@misc{meinkeFrontierModelsAre2025,
  title = {Frontier {{Models}} Are {{Capable}} of {{In-context Scheming}}},
  author = {Meinke, Alexander and Schoen, Bronson and Scheurer, J{\'e}r{\'e}my and Balesni, Mikita and Shah, Rusheb and Hobbhahn, Marius},
  year = {2025},
@@ -119,6 +137,24 @@
  file = {/home/james/snap/zotero-snap/common/Zotero/storage/QIECYJ4G/Meinke et al. - 2025 - Frontier Models are Capable of In-context Scheming.pdf;/home/james/snap/zotero-snap/common/Zotero/storage/PYUPEAER/2412.html}
 }

+@misc{needhamLargeLanguageModels2025,
+  title = {Large {{Language Models Often Know When They Are Being Evaluated}}},
+  author = {Needham, Joe and Edkins, Giles and Pimpale, Govind and Bartsch, Henning and Hobbhahn, Marius},
+  year = {2025},
+  month = jul,
+  number = {arXiv:2505.23836},
+  eprint = {2505.23836},
+  primaryclass = {cs},
+  publisher = {arXiv},
+  doi = {10.48550/arXiv.2505.23836},
+  urldate = {2025-08-28},
+  abstract = {If AI models can detect when they are being evaluated, the effectiveness of evaluations might be compromised. For example, models could have systematically different behavior during evaluations, leading to less reliable benchmarks for deployment and governance decisions. We investigate whether frontier language models can accurately classify transcripts based on whether they originate from evaluations or real-world deployment, a capability we call evaluation awareness. To achieve this, we construct a diverse benchmark of 1,000 prompts and transcripts from 61 distinct datasets. These span public benchmarks (e.g., MMLU, SWEBench), real-world deployment interactions, and agent trajectories from scaffolding frameworks (e.g., web-browsing agents). Frontier models clearly demonstrate above-random evaluation awareness (Gemini-2.5-Pro reaches an AUC of 0.83), but do not yet surpass our simple human baseline (AUC of 0.92). Furthermore, both AI models and humans are better at identifying evaluations in agentic settings compared to chat settings. Additionally, we test whether models can identify the purpose of the evaluation. Under multiple-choice and open-ended questioning, AI models far outperform random chance in identifying what an evaluation is testing for. Our results indicate that frontier models already exhibit a substantial, though not yet superhuman, level of evaluation-awareness. We recommend tracking this capability in future models. Dataset and code are available at https://huggingface.co/datasets/jjpn2/eval\_awareness and https://github.com/jjpn97/eval\_awareness.},
+  archiveprefix = {arXiv},
+  langid = {english},
+  keywords = {Computer Science - Artificial Intelligence,Computer Science - Computation and Language},
+  file = {/home/james/snap/zotero-snap/common/Zotero/storage/N3WDBL54/Needham et al. - 2025 - Large Language Models Often Know When They Are Being Evaluated.pdf}
+}
+
@misc{perezDiscoveringLanguageModel2022,
  title = {Discovering {{Language Model Behaviors}} with {{Model-Written Evaluations}}},
  author = {Perez, Ethan and Ringer, Sam and Luko{\v s}i{\=u}t{\.e}, Kamil{\.e} and Nguyen, Karina and Chen, Edwin and Heiner, Scott and Pettit, Craig and Olsson, Catherine and Kundu, Sandipan and Kadavath, Saurav and Jones, Andy and Chen, Anna and Mann, Ben and Israel, Brian and Seethor, Bryan and McKinnon, Cameron and Olah, Christopher and Yan, Da and Amodei, Daniela and Amodei, Dario and Drain, Dawn and Li, Dustin and {Tran-Johnson}, Eli and Khundadze, Guro and Kernion, Jackson and Landis, James and Kerr, Jamie and Mueller, Jared and Hyun, Jeeyoon and Landau, Joshua and Ndousse, Kamal and Goldberg, Landon and Lovitt, Liane and Lucas, Martin and Sellitto, Michael and Zhang, Miranda and Kingsland, Neerav and Elhage, Nelson and Joseph, Nicholas and Mercado, Noem{\'i} and DasSarma, Nova and Rausch, Oliver and Larson, Robin and McCandlish, Sam and Johnston, Scott and Kravec, Shauna and Showk, Sheer El and Lanham, Tamera and {Telleen-Lawton}, Timothy and Brown, Tom and Henighan, Tom and Hume, Tristan and Bai, Yuntao and {Hatfield-Dodds}, Zac and Clark, Jack and Bowman, Samuel R. and Askell, Amanda and Grosse, Roger and Hernandez, Danny and Ganguli, Deep and Hubinger, Evan and Schiefer, Nicholas and Kaplan, Jared},
@@ -136,6 +172,23 @@
  file = {/home/james/snap/zotero-snap/common/Zotero/storage/XXZTZSET/Perez et al. - 2022 - Discovering Language Model Behaviors with Model-Written Evaluations.pdf;/home/james/snap/zotero-snap/common/Zotero/storage/MTZFSHAV/2212.html}
 }

+@misc{shevlaneModelEvaluationExtreme2023,
+  title = {Model Evaluation for Extreme Risks},
+  author = {Shevlane, Toby and Farquhar, Sebastian and Garfinkel, Ben and Phuong, Mary and Whittlestone, Jess and Leung, Jade and Kokotajlo, Daniel and Marchal, Nahema and Anderljung, Markus and Kolt, Noam and Ho, Lewis and Siddarth, Divya and Avin, Shahar and Hawkins, Will and Kim, Been and Gabriel, Iason and Bolina, Vijay and Clark, Jack and Bengio, Yoshua and Christiano, Paul and Dafoe, Allan},
+  year = {2023},
+  month = sep,
+  number = {arXiv:2305.15324},
+  eprint = {2305.15324},
+  primaryclass = {cs},
+  publisher = {arXiv},
+  doi = {10.48550/arXiv.2305.15324},
+  urldate = {2025-08-28},
+  abstract = {Current approaches to building general-purpose AI systems tend to produce systems with both beneficial and harmful capabilities. Further progress in AI development could lead to capabilities that pose extreme risks, such as offensive cyber capabilities or strong manipulation skills. We explain why model evaluation is critical for addressing extreme risks. Developers must be able to identify dangerous capabilities (through "dangerous capability evaluations") and the propensity of models to apply their capabilities for harm (through "alignment evaluations"). These evaluations will become critical for keeping policymakers and other stakeholders informed, and for making responsible decisions about model training, deployment, and security.},
+  archiveprefix = {arXiv},
+  keywords = {Computer Science - Artificial Intelligence},
+  file = {/home/james/snap/zotero-snap/common/Zotero/storage/C79UVMNS/Shevlane et al. - 2023 - Model evaluation for extreme risks.pdf;/home/james/snap/zotero-snap/common/Zotero/storage/L56SVZUV/2305.html}
+}
+
@misc{thomasSupportingHumanRaters2024,
  title = {Supporting {{Human Raters}} with the {{Detection}} of {{Harmful Content}} Using {{Large Language Models}}},
  author = {Thomas, Kurt and Kelley, Patrick Gage and Tao, David and Meiklejohn, Sarah and Vallis, Owen and Tan, Shunwen and Bratani{\v c}, Bla{\v z} and Ferreira, Felipe Tiengo and Eranti, Vijay Kumar and Bursztein, Elie},
@@ -153,6 +206,23 @@
  file = {/home/james/snap/zotero-snap/common/Zotero/storage/43E2KV3V/Thomas et al. - 2024 - Supporting Human Raters with the Detection of Harmful Content using Large Language Models.pdf;/home/james/snap/zotero-snap/common/Zotero/storage/NA64MNC9/2406.html}
 }

+@misc{wangLargeLanguageModels2023,
+  title = {Large {{Language Models}} Are Not {{Fair Evaluators}}},
+  author = {Wang, Peiyi and Li, Lei and Chen, Liang and Cai, Zefan and Zhu, Dawei and Lin, Binghuai and Cao, Yunbo and Liu, Qi and Liu, Tianyu and Sui, Zhifang},
+  year = {2023},
+  month = aug,
+  number = {arXiv:2305.17926},
+  eprint = {2305.17926},
+  primaryclass = {cs},
+  publisher = {arXiv},
+  doi = {10.48550/arXiv.2305.17926},
+  urldate = {2025-08-28},
+  abstract = {In this paper, we uncover a systematic bias in the evaluation paradigm of adopting large language models{\textasciitilde}(LLMs), e.g., GPT-4, as a referee to score and compare the quality of responses generated by candidate models. We find that the quality ranking of candidate responses can be easily hacked by simply altering their order of appearance in the context. This manipulation allows us to skew the evaluation result, making one model appear considerably superior to the other, e.g., Vicuna-13B could beat ChatGPT on 66 over 80 tested queries with ChatGPT as an evaluator. To address this issue, we propose a calibration framework with three simple yet effective strategies: 1) Multiple Evidence Calibration, which requires the evaluator model to generate multiple evaluation evidence before assigning ratings; 2) Balanced Position Calibration, which aggregates results across various orders to determine the final score; 3) Human-in-the-Loop Calibration, which introduces a balanced position diversity entropy to measure the difficulty of each example and seeks human assistance when needed. We also manually annotate the "win/tie/lose" outcomes of responses from ChatGPT and Vicuna-13B in the Vicuna Benchmark's question prompt, and extensive experiments demonstrate that our approach successfully mitigates evaluation bias, resulting in closer alignment with human judgments. We release our code and human annotation at {\textbackslash}url\{https://github.com/i-Eval/FairEval\} to facilitate future research.},
+  archiveprefix = {arXiv},
+  keywords = {Computer Science - Artificial Intelligence,Computer Science - Computation and Language,Computer Science - Information Retrieval},
+  file = {/home/james/snap/zotero-snap/common/Zotero/storage/LCJXUS3U/Wang et al. - 2023 - Large Language Models are not Fair Evaluators.pdf;/home/james/snap/zotero-snap/common/Zotero/storage/5ZHM4NIQ/2305.html}
+}
+
@misc{yuanSEvalAutomatedComprehensive2025,
  title = {S-{{Eval}}: {{Towards Automated}} and {{Comprehensive Safety Evaluation}} for {{Large Language Models}}},
  shorttitle = {S-{{Eval}}},
Author	SHA1	Message	Date
1jamesthompson1	0e1fbdd5ea	Final remove italics	2025-08-29 12:40:25 +12:00
1jamesthompson1	1ca06db8cf	Final draft	2025-08-29 12:34:45 +12:00