1.9 KiB
Structured survey
For this assignment we have to do a small literatrue review of the applications and use of AI in (small) area.
Proposal
I would like to look into AI within the AI safety research space. In particular I wanted to look into the concept of LLM alignment.
To do this I thought I could use these papers: Auditing language models for hidden objectives - An experiment from Anthropic in which they purposely malign a LLM with hidden objectives and then run a blind experiment where safety testers do a mock 'pre-deployment' test.
Demonstrating specification gaming in reasoning models - Giving o1 model from OpenAI the task of beating Stockfish in chess. Where o1 decides to cheat by modifying the code of the chess game when it knows it is loosing.
S-Eval: Towards Automated and Comprehensive Safety Evaluation for Large Language Models - Similar idea to https://arxiv.org/abs/2212.09251 but with modern models. Effectively having a LLM wrtie evaluation tests
There are quite a few papers I have looked at so I am not sure these are the most interesting. However I expect that these give a good sample of examples of maligned models and potential ways to catch them.
Aftera request for some more specificity I followed up with
Your right that "AI safety research space" is a bit too broad. To be specific for this survey I wanted to look at the ability to get a LLM to follow and act inline with your objective and your values.
I think I can break this down into two parts which are:
How well can tell if a LLM is malign to an objective or ignoring our values. What is the baseline for current LLM and their alignment to our values and objectives.
Was this narrow enough? I could narrow it down further by just looking at Automated alignment testing of LLM by LLMs if that is better.
This was accepted so this will be my scope.