\documentclass[11pt]{article}
\usepackage[a4paper,margin=1in]{geometry}
\usepackage[utf8]{inputenc}
\usepackage[style=authoryear,backend=biber]{biblatex}
\addbibresource{references.bib}
\usepackage{graphicx}
\usepackage{mdframed}

\DeclareFieldFormat{title}{#1}
\DeclareCiteCommand{\fullcite}
  {\usebibmacro{prenote}}
  {\printfield{title} (\cite{\thefield{entrykey}})}
  {\multicitedelim}
  {\usebibmacro{postnote}}
\newcommand{\paperdetails}[4]{
  \subsection*{Paper: \fullcite{#1}}
  \begin{itemize}
    \item \textbf{Main Contributions/Findings:}
    \begin{itemize}
      #2
    \end{itemize}
    \item \textbf{Interesting Ideas/Results:}
    \begin{itemize}
      #3
    \end{itemize}
    \item \textbf{Significant Limitations:}
    \begin{itemize}
      #4
    \end{itemize}
  \end{itemize}
}
\begin{document}
\noindent Student ID: 300680096 \\

\noindent \textit{You should choose 3-5 papers in your approved application area. Your selected papers should be coherent --- not too similar, but still related enough that they provide a good overview of the main applications and techniques in your area. The provided word targets are guidelines --- it is important that you are sufficiently detailed but also demonstrating an ability to write concisely.\\ \\
Please delete the text in italics for your final submission. Submit in PDF format by the deadline.
}
\section*{Introduction (target 150--250 words)}
\begin{mdframed}[] %remove this argument
\textit{(Introduce the application area you are going to cover. Make sure that you have constrained your focus to a narrow area.)}
\end{mdframed}
\section*{Key Points in the Papers (target 600--800 words)}
\begin{mdframed}[] %remove this argument
\textit{(For each paper:
\begin{itemize}
  \item[-] List the main contributions/findings of the paper (typically 1 to 3 per paper, a few sentences per contribution/finding).
  \item[-] List any ideas or results in the paper that you thought were particularly interesting and explain why.
  \item[-] List any significant limitations you identified in the work.)
\end{itemize}}
To explore this area I will look at the following papers:
\begin{itemize}
  \item \fullcite{marksAuditingLanguageModels2025} to see how effectively current methods discover misaligned models.
  \item \fullcite{bondarenkoDemonstratingSpecificationGaming2025} to explore an instance of misalignment in a deployed language model.
  \item \fullcite{yuanSEvalAutomatedComprehensive2025} to understand the method of using LLMs to evaluate LLMs.
\end{itemize}
\paperdetails{bondarenkoDemonstratingSpecificationGaming2025}{
  \item[--] Reasoning models like o1-preview and DeepSeek R1 will hack the game to win more often than not.
  \item[--] Base language models require more suggestive prompting before they will hack the game.
}{
  \item[--] Some models, like o3-mini and Qwen 32B Preview, would often fail to complete the task at all.
  \item[--] Small differences in the prompt could lead to large differences in a model's hacking attempts. For example, Claude 3.5 Sonnet does not hack at all for two of the prompts, but for the third it hacks about three-quarters of the time.
}{
  \item[--] The LLM harness used for the experiments is overly complex, which reduces interpretability and makes the results more opaque.
  \item[--] All of the runs are judged by LLM judges that disagree with each other about 25\% of the time. I believe this rate of disagreement is significantly higher than would be found with human judges.
  \item[--] The goal of the experiment is to test an AI agent's tendency to go against the ``spirit'' of the task. As this is only tested in a single specific environment, it is hard to generalize the results to the models themselves.
}
\paperdetails{marksAuditingLanguageModels2025}{
  \item[--] Current misalignment auditing techniques can discover hidden objectives in LLMs. Teams working in the context of an AI developer, with access to the training data and weights, could identify the hidden objective and most of the biases in a reasonable amount of time (under one week).
  \item[--] Without access to the training data and weights (API-only access), no team successfully identified the known (but withheld) objective. This was presented as anecdotal evidence that uncovering hidden objectives is non-trivial, but it can also be read as showing that current external auditing for hidden objectives is weak.
}{
  \item[--] An idea that I found interesting was the authors' near-dismissal of SAEs as a viable long-term strategy. They seemed quite keen to show that there are other, simpler methods that work.
  \item[--] From a somewhat outsider perspective, I would have assumed that searching the training data is a bit of a crutch and not ``truly'' auditing. I am surprised that they did not include a single team with access to the weights but not the training data (the setting in which many open-source models are provided).
}{
  \item[--] The game was only played for a single round with only four teams. This is an extremely small sample size, and it tests only a single type of hidden objective, so it would be hard to conclude with any statistical significance whether the auditing methods are effective.
  \item[--] The hidden objective was injected through synthetic training data with clear biases. As a result, all successful teams discovered the hidden objective by finding it in the training data directly. It seems much more problematic (and much harder) to discover hidden objectives that are simply emergent properties of such a large corpus.
}
\end{mdframed}

\section*{Discussion (target 300--400 words)}
\begin{mdframed}[] %remove this argument
\textit{(You should:
\begin{itemize}
  \item[(a)] Compare and contrast the papers.
  \item[(b)] Identify where these papers could head --- e.g.\ new research questions or applications to real tasks.
\end{itemize}
An excellent discussion will cohesively link the papers together and provide critical insight into the application area as a whole and where it is/could be going in the future.)
}
\end{mdframed}
\section*{References}
\begin{mdframed}[] %remove this argument
\printbibliography[heading=none]
\end{mdframed}
\end{document}