llm-alignment-survey

A curated reading list for large language model (LLM) alignment. Take a look at our new survey "Large Language Model Alignment: A Survey" on arXiv for more details!

Feel free to open an issue/PR or e-mail [email protected] and [email protected] if you find any missing areas, papers, or datasets. We will keep updating this list and survey.

If you find our survey useful, please kindly cite our paper:

@article{shen2023alignment,
      title={Large Language Model Alignment: A Survey}, 
      author={Shen, Tianhao and Jin, Renren and Huang, Yufei and Liu, Chuang and Dong, Weilong and Guo, Zishan and Wu, Xinwei and Liu, Yan and Xiong, Deyi},
      journal={arXiv preprint arXiv:2309.15025},
      year={2023}
}

Table of Contents

  • Related Surveys
  • Why LLM Alignment?
  • LLM-Generated Content
  • What is LLM Alignment?
  • Outer Alignment
  • Inner Alignment
  • Mechanistic Interpretability
  • Attacks on Aligned Language Models
  • Alignment Evaluation

Related Surveys

  1. Aligning Large Language Models with Human: A Survey. Yufei Wang et al. arXiv 2023. [Paper]
  2. Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment. Yang Liu et al. arXiv 2023. [Paper]
  3. Bridging the Gap: A Survey on Integrating (Human) Feedback for Natural Language Generation. Patrick Fernandes et al. arXiv 2023. [Paper]
  4. Augmented Language Models: a Survey. Grégoire Mialon et al. arXiv 2023. [Paper]
  5. An Overview of Catastrophic AI Risks. Dan Hendrycks et al. arXiv 2023. [Paper]
  6. A Survey of Large Language Models. Wayne Xin Zhao et al. arXiv 2023. [Paper]
  7. A Survey on Universal Adversarial Attack. Chaoning Zhang et al. IJCAI 2021. [Paper]
  8. Survey of Hallucination in Natural Language Generation. Ziwei Ji et al. ACM Computing Surveys 2022. [Paper]
  9. Automatically Correcting Large Language Models: Surveying the landscape of diverse self-correction strategies. Liangming Pan et al. arXiv 2023. [Paper]
  10. Automatic Detection of Machine Generated Text: A Critical Survey. Ganesh Jawahar et al. COLING 2020. [Paper]

Why LLM Alignment?

  1. Synchromesh: Reliable Code Generation from Pre-trained Language Models. Gabriel Poesia et al. ICLR 2022. [Paper]
  2. LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models. Chan Hee Song et al. ICCV 2023. [Paper]
  3. Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents. Wenlong Huang et al. ICML 2022. [Paper]
  4. Tool Learning with Foundation Models. Yujia Qin et al. arXiv 2023. [Paper]
  5. Ethical and social risks of harm from Language Models. Laura Weidinger et al. arXiv 2021. [Paper]

LLM-Generated Content

Undesirable Content

  1. Predictive Biases in Natural Language Processing Models: A Conceptual Framework and Overview. Deven Shah et al. arXiv 2019. [Paper]
  2. RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. Samuel Gehman et al. EMNLP 2020 Findings. [Paper]
  3. Extracting Training Data from Large Language Models. Nicholas Carlini et al. USENIX Security 2021. [Paper]
  4. StereoSet: Measuring stereotypical bias in pretrained language models. Moin Nadeem et al. ACL 2021. [Paper]
  5. CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models. Nikita Nangia et al. EMNLP 2020. [Paper]
  6. HONEST: Measuring Hurtful Sentence Completion in Language Models. Debora Nozza et al. NAACL 2021. [Paper]
  7. Language Models are Few-Shot Learners. Tom Brown et al. NeurIPS 2020. [Paper]
  8. Persistent Anti-Muslim Bias in Large Language Models. Abubakar Abid et al. AIES 2021. [Paper]
  9. Gender and Representation Bias in GPT-3 Generated Stories. Li Lucy et al. WNU 2021. [Paper]

Unfaithful Content

  1. Measuring and Improving Consistency in Pretrained Language Models. Yanai Elazar et al. TACL 2021. [Paper]
  2. GPT-3 Creative Fiction. Gwern. 2023. [Blog]
  3. GPT-3: What’s It Good for? Robert Dale. Natural Language Engineering 2020. [Paper]
  4. Scaling Language Models: Methods, Analysis & Insights from Training Gopher. Jack W. Rae et al. arXiv 2021. [Paper]
  5. TruthfulQA: Measuring How Models Mimic Human Falsehoods. Stephanie Lin et al. ACL 2022. [Paper]
  6. Towards Tracing Knowledge in Language Models Back to the Training Data. Ekin Akyurek et al. EMNLP 2022 Findings. [Paper]
  7. Sparks of Artificial General Intelligence: Early experiments with GPT-4. Sébastien Bubeck et al. arXiv 2023. [Paper]
  8. Navigating the Grey Area: Expressions of Overconfidence and Uncertainty in Language Models. Kaitlyn Zhou et al. arXiv 2023. [Paper]
  9. Patient and Consumer Safety Risks When Using Conversational Assistants for Medical Information: An Observational Study of Siri, Alexa, and Google Assistant. Timothy W. Bickmore et al. JMIR 2018. [Paper]
  10. Will ChatGPT Replace Lawyers? Kate Rattray. 2023. [Blog]
  11. Constitutional AI: Harmlessness from AI Feedback. Yuntao Bai et al. arXiv 2022. [Paper]

Malicious Uses

  1. Truth, Lies, and Automation: How Language Models Could Change Disinformation. Ben Buchanan et al. Center for Security and Emerging Technology, 2021. [Paper]
  2. Understanding the Capabilities, Limitations, and Societal Impact of Large Language Models. Alex Tamkin et al. arXiv 2021. [Paper]
  3. Deal or No Deal? End-to-End Learning for Negotiation Dialogues. Mike Lewis et al. arXiv 2017. [Paper]
  4. Evaluating Large Language Models Trained on Code. Mark Chen et al. arXiv 2021. [Paper]
  5. Artificial intelligence and biological misuse: Differentiating risks of language models and biological design tools. Jonas B. Sandbrink. arXiv 2023. [Paper]

Negative Impacts on Society

  1. Sustainable AI: AI for sustainability and the sustainability of AI. Aimee van Wynsberghe. AI and Ethics 2021. [Paper]
  2. Unraveling the Hidden Environmental Impacts of AI Solutions for Environment. Anne-Laure Ligozat et al. arXiv 2021. [Paper]
  3. GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models. Tyna Eloundou et al. arXiv 2023. [Paper]

Potential Risks Associated with Advanced LLMs

  1. Formalizing Convergent Instrumental Goals. Tsvi Benson-Tilsen et al. AAAI AIES Workshop 2016. [Paper]
  2. Model evaluation for extreme risks. Toby Shevlane et al. arXiv 2023. [Paper]
  3. Aligning AI Optimization to Community Well-Being. Jonathan Stray. International Journal of Community Well-Being 2020. [Paper]
  4. What are you optimizing for? Aligning Recommender Systems with Human Values. Jonathan Stray et al. ICML 2020. [Paper]
  5. Human-level play in the game of Diplomacy by combining language models with strategic reasoning. Meta Fundamental AI Research Diplomacy Team (FAIR) et al. Science 2022. [Paper]
  6. Characterizing Manipulation from AI Systems. Micah Carroll et al. arXiv 2023. [Paper]
  7. Deceptive Alignment Monitoring. Andres Carranza et al. ICML AdvML Workshop 2023. [Paper]
  8. The Superintelligent Will: Motivation and Instrumental Rationality in Advanced Artificial Agents. Nick Bostrom. Minds and Machines 2012. [Paper]
  9. Is Power-Seeking AI an Existential Risk? Joseph Carlsmith. arXiv 2023. [Paper]
  10. Optimal Policies Tend To Seek Power. Alexander Matt Turner et al. NeurIPS 2021. [Paper]
  11. Parametrically Retargetable Decision-Makers Tend To Seek Power. Alexander Matt Turner et al. NeurIPS 2022. [Paper]
  12. Power-seeking can be probable and predictive for trained agents. Victoria Krakovna et al. arXiv 2023. [Paper]
  13. Discovering Language Model Behaviors with Model-Written Evaluations. Ethan Perez et al. arXiv 2022. [Paper]

What is LLM Alignment?

  1. Some Moral and Technical Consequences of Automation: As Machines Learn They May Develop Unforeseen Strategies at Rates That Baffle Their Programmers. Norbert Wiener. Science 1960. [Paper]
  2. Coherent Extrapolated Volition. Eliezer Yudkowsky. Singularity Institute for Artificial Intelligence 2004. [Paper]
  3. The Basic AI Drives. Stephen M. Omohundro. AGI 2008. [Paper]
  4. The Superintelligent Will: Motivation and Instrumental Rationality in Advanced Artificial Agents. Nick Bostrom. Minds and Machines 2012. [Paper]
  5. General Purpose Intelligence: Arguing the Orthogonality Thesis. Stuart Armstrong. Analysis and Metaphysics 2013. [Paper]
  6. Aligning Superintelligence with Human Interests: An Annotated Bibliography. Nate Soares. MIRI Technical Report 2015. [Paper]
  7. Concrete Problems in AI Safety. Dario Amodei et al. arXiv 2016. [Paper]
  8. The Mythos of Model Interpretability. Zachary C. Lipton. arXiv 2017. [Paper]
  9. AI Safety Gridworlds. Jan Leike et al. arXiv 2017. [Paper]
  10. Overview of Current AI Alignment Approaches. Micah Carroll. 2018. [Paper]
  11. Risks from Learned Optimization in Advanced Machine Learning Systems. Evan Hubinger et al. arXiv 2019. [Paper]
  12. An Overview of 11 Proposals for Building Safe Advanced AI. Evan Hubinger. arXiv 2020. [Paper]
  13. Unsolved Problems in ML Safety. Dan Hendrycks et al. arXiv 2021. [Paper]
  14. A Mathematical Framework for Transformer Circuits. Nelson Elhage et al. Transformer Circuits Thread 2021. [Paper]
  15. Alignment of Language Agents. Zachary Kenton et al. arXiv 2021. [Paper]
  16. A General Language Assistant as a Laboratory for Alignment. Amanda Askell et al. arXiv 2021. [Paper]
  17. A Transparency and Interpretability Tech Tree. Evan Hubinger. 2022. [Blog]
  18. Understanding AI Alignment Research: A Systematic Analysis. Jan Kirchner et al. arXiv 2022. [Paper]
  19. Softmax Linear Units. Nelson Elhage et al. Transformer Circuits Thread 2022. [Paper]
  20. The Alignment Problem from a Deep Learning Perspective. Richard Ngo et al. arXiv 2022. [Paper]
  21. Paradigms of AI Alignment: Components and Enablers. Victoria Krakovna. 2022. [Blog]
  22. Progress Measures for Grokking via Mechanistic Interpretability. Neel Nanda et al. arXiv 2023. [Paper]
  23. Agentized LLMs Will Change the Alignment Landscape. Seth Herd. 2023. [Blog]
  24. Language Models Can Explain Neurons in Language Models. Steven Bills et al. 2023. [Paper]
  25. Core Views on AI Safety: When, Why, What, and How. Anthropic. 2023. [Blog]

Outer Alignment

Non-recursive Oversight

RL-based Methods

  1. Proximal Policy Optimization Algorithms. John Schulman et al. arXiv 2017. [Paper]
  2. Fine-Tuning Language Models from Human Preferences. Daniel M. Ziegler et al. arXiv 2019. [Paper]
  3. Learning to Summarize from Human Feedback. Nisan Stiennon et al. NeurIPS 2020. [Paper]
  4. Training Language Models to Follow Instructions with Human Feedback. Long Ouyang et al. NeurIPS 2022. [Paper]
  5. Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. Yuntao Bai et al. arXiv 2022. [Paper]
  6. RL4F: Generating Natural Language Feedback with Reinforcement Learning for Repairing Model Outputs. Afra Feyza Akyürek et al. arXiv 2023. [Paper]
  7. Improving Language Models with Advantage-Based Offline Policy Gradients. Ashutosh Baheti et al. arXiv 2023. [Paper]
  8. Scaling Laws for Reward Model Overoptimization. Leo Gao et al. ICML 2023. [Paper]
  9. Improving Alignment of Dialogue Agents via Targeted Human Judgements. Amelia Glaese et al. arXiv 2022. [Paper]
  10. Aligning Language Models with Preferences through F-Divergence Minimization. Dongyoung Go et al. arXiv 2023. [Paper]
  11. Aligning Large Language Models through Synthetic Feedback. Sungdong Kim et al. arXiv 2023. [Paper]
  12. RLHF. Ansh Radhakrishnan. Lesswrong 2022. [Blog]
  13. Guiding Large Language Models via Directional Stimulus Prompting. Zekun Li et al. arXiv 2023. [Paper]
  14. Aligning Generative Language Models with Human Values. Ruibo Liu et al. NAACL 2022 Findings. [Paper]
  15. Second Thoughts Are Best: Learning to Re-Align with Human Values from Text Edits. Ruibo Liu et al. NeurIPS 2022. [Paper]
  16. Secrets of RLHF in Large Language Models Part I: PPO. Rui Zheng et al. arXiv 2023. [Paper]
  17. Principled Reinforcement Learning with Human Feedback from Pairwise or K-Wise Comparisons. Banghua Zhu et al. arXiv 2023. [Paper]
  18. Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback. Stephen Casper et al. arXiv 2023. [Paper]
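
Most of the RL-based methods above share the same two-stage recipe: fit a reward model on human preference comparisons, then optimize the policy against it (typically with PPO) while penalizing divergence from the supervised fine-tuned reference model. Below is a minimal sketch of the two core objectives, assuming per-response scalar rewards and summed log-probabilities; the function names are illustrative, not taken from any specific paper above.

import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry pairwise loss: train the reward model to score the
    # human-preferred response above the rejected one.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def shaped_reward(rm_score: torch.Tensor, logp_policy: torch.Tensor,
                  logp_ref: torch.Tensor, kl_coef: float = 0.1) -> torch.Tensor:
    # Reward used in the RL stage: the reward-model score minus a KL-style
    # penalty that keeps the policy close to the reference (SFT) model.
    return rm_score - kl_coef * (logp_policy - logp_ref)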

SL-based Methods

  1. Self-Diagnosis and Self-Debiasing: A Proposal for Reducing Corpus-Based Bias in NLP. Timo Schick et al. TACL 2021. [Paper]
  2. The CRINGE Loss: Learning What Language Not to Model. Leonard Adolphs et al. arXiv 2022. [Paper]
  3. Leashing the Inner Demons: Self-detoxification for Language Models. Canwen Xu et al. AAAI 2022. [Paper]
  4. Calibrating Sequence Likelihood Improves Conditional Language Generation. Yao Zhao et al. arXiv 2022. [Paper]
  5. RAFT: Reward Ranked Finetuning for Generative Foundation Model Alignment. Hanze Dong et al. arXiv 2023. [Paper]
  6. Chain of Hindsight Aligns Language Models with Feedback. Hao Liu et al. arXiv 2023. [Paper]
  7. Training Socially Aligned Language Models in Simulated Human Society. Ruibo Liu et al. arXiv 2023. [Paper]
  8. Direct Preference Optimization: Your Language Model Is Secretly a Reward Model. Rafael Rafailov et al. arXiv 2023. [Paper]
  9. Training Language Models with Language Feedback at Scale. Jérémy Scheurer et al. arXiv 2023. [Paper]
  10. Preference Ranking Optimization for Human Alignment. Feifan Song et al. arXiv 2023. [Paper]
  11. RRHF: Rank Responses to Align Language Models with Human Feedback without Tears. Zheng Yuan et al. arXiv 2023. [Paper]
  12. SLiC-HF: Sequence Likelihood Calibration with Human Feedback. Yao Zhao et al. arXiv 2023. [Paper]
  13. LIMA: Less Is More for Alignment. Chunting Zhou et al. arXiv 2023. [Paper]
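
Several of the SL-based methods above (e.g., DPO, SLiC-HF, RRHF, PRO) replace the RL loop with a preference or ranking loss applied directly to the policy's log-probabilities. As one concrete instance, here is a minimal sketch of the DPO objective (Rafailov et al.), assuming summed per-response log-probs under the trained policy and a frozen reference model; variable names are illustrative.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor, policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor, ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # The policy/reference log-ratio acts as an implicit reward,
    # beta * log(pi / pi_ref); DPO maximizes the chosen-over-rejected margin
    # without training a separate reward model or running RL.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()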

Scalable Oversight

  1. Supervising Strong Learners by Amplifying Weak Experts. Paul Christiano et al. arXiv 2018. [Paper]
  2. Scalable Agent Alignment via Reward Modeling: A Research Direction. Jan Leike et al. arXiv 2018. [Paper]
  3. AI Safety Needs Social Scientists. Geoffrey Irving and Amanda Askell. Distill 2019. [Paper]
  4. Learning to Summarize from Human Feedback. Nisan Stiennon et al. NeurIPS 2020. [Paper]
  5. Task Decomposition for Scalable Oversight (AGISF Distillation). Charbel-Raphaël Segerie. 2023. [Blog]
  6. Measuring Progress on Scalable Oversight for Large Language Models. Samuel R. Bowman et al. arXiv 2022. [Paper]
  7. Constitutional AI: Harmlessness from AI Feedback. Yuntao Bai et al. arXiv 2022. [Paper]
  8. Improving Factuality and Reasoning in Language Models through Multiagent Debate. Yilun Du et al. arXiv 2023. [Paper]
  9. Evaluating Superhuman Models with Consistency Checks. Lukas Fluri et al. arXiv 2023. [Paper]
  10. AI Safety via Debate. Geoffrey Irving et al. arXiv 2018. [Paper]
  11. AI Safety via Market Making. Evan Hubinger. 2020. [Blog]
  12. Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate. Tian Liang et al. arXiv 2023. [Paper]
  13. Let's Verify Step by Step. Hunter Lightman et al. arXiv 2023. [Paper]
  14. Introducing Superalignment. OpenAI. 2023. [Blog]
  15. Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision. Zhiqing Sun et al. arXiv 2023. [Paper]

Inner Alignment

  1. Risks from Learned Optimization in Advanced Machine Learning Systems. Evan Hubinger et al. arXiv 2019. [Paper]
  2. Goal Misgeneralization in Deep Reinforcement Learning. Lauro Langosco et al. ICML 2022. [Paper]
  3. Goal Misgeneralization: Why Correct Specifications Aren't Enough For Correct Goals. Rohin Shah et al. arXiv 2022. [Paper]
  4. Defining capability and alignment in gradient descent. Edouard Harris. Lesswrong 2020. [Blog]
  5. Categorizing failures as “outer” or “inner” misalignment is often confused. Rohin Shah. Lesswrong 2023. [Blog]
  6. "Inner Alignment Failures" Which Are Actually Outer Alignment Failures. John Wentworth. Lesswrong 2020. [Blog]
  7. Relaxed adversarial training for inner alignment. Evan Hubinger. Lesswrong 2019. [Blog]
  8. The Inner Alignment Problem. Evan Hubinger et al. Lesswrong 2019. [Blog]
  9. Three scenarios of pseudo-alignment. Eleni Angelou. Lesswrong 2022. [Blog]
  10. Deceptive Alignment. Evan Hubinger et al. Lesswrong 2019. [Blog]
  11. What failure looks like. Paul Christiano. AI Alignment Forum 2019. [Blog]
  12. Concrete experiments in inner alignment. Evan Hubinger. Lesswrong 2019. [Blog]
  13. A central AI alignment problem: capabilities generalization, and the sharp left turn. Nate Soares. Lesswrong 2022. [Blog]
  14. Clarifying the confusion around inner alignment. Rauno Arike. AI Alignment Forum 2022. [Blog]
  15. 2-D Robustness. Vladimir Mikulik. AI Alignment Forum 2019. [Blog]
  16. Monitoring for deceptive alignment. Evan Hubinger. Lesswrong 2022. [Blog]

Mechanistic Interpretability

  1. Notions of explainability and evaluation approaches for explainable artificial intelligence. Giulia Vilone et al. arXiv 2020. [Paper]
  2. A Comprehensive Mechanistic Interpretability Explainer & Glossary. Neel Nanda. 2022. [Blog]
  3. The Mythos of Model Interpretability. Zachary C. Lipton. arXiv 2017. [Paper]
  4. AI research considerations for human existential safety (ARCHES). Andrew Critch et al. arXiv 2020. [Paper]
  5. Concrete problems for autonomous vehicle safety: Advantages of Bayesian deep learning. Rowan McAllister et al. IJCAI 2017. [Paper]
  6. In-context Learning and Induction Heads. Catherine Olsson et al. Transformer Circuits Thread 2022. [Paper]
  7. Transformer Feed-Forward Layers Are Key-Value Memories. Mor Geva et al. EMNLP 2021. [Paper]
  8. Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space. Mor Geva et al. EMNLP 2022. [Paper]
  9. Softmax Linear Units. Nelson Elhage et al. Transformer Circuits Thread 2022. [Paper]
  10. Toy Models of Superposition. Nelson Elhage et al. Transformer Circuits Thread 2022. [Paper]
  11. Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases. Chris Olah. 2022. [Paper]
  12. Knowledge Neurons in Pretrained Transformers. Damai Dai et al. ACL 2022. [Paper]
  13. Locating and editing factual associations in GPT. Kevin Meng et al. NeurIPS 2022. [Paper]
  14. Inference-Time Intervention: Eliciting Truthful Answers from a Language Model. Kenneth Li et al. arXiv 2023. [Paper]
  15. LEACE: Perfect linear concept erasure in closed form. Nora Belrose et al. arXiv 2023. [Paper]

Attacks on Aligned Language Models

Privacy Attacks

  1. Jailbreaker: Automated Jailbreak Across Multiple Large Language Model Chatbots. Gelei Deng et al. arXiv 2023. [Paper]
  2. Multi-step Jailbreaking Privacy Attacks on ChatGPT. Haoran Li et al. arXiv 2023. [Paper]

Backdoor Attacks

  1. Prompt Injection Attack Against LLM-integrated Applications. Yi Liu et al. arXiv 2023. [Paper]
  2. Prompt as Triggers for Backdoor Attack: Examining the Vulnerability in Language Models. Shuai Zhao et al. arXiv 2023. [Paper]
  3. More Than You've Asked for: A Comprehensive Analysis of Novel Prompt Injection Threats to Application-Integrated Large Language Models. Kai Greshake et al. arXiv 2023. [Paper]
  4. Backdoor Attacks for In-Context Learning with Language Models. Nikhil Kandpal et al. arXiv 2023. [Paper]
  5. BadGPT: Exploring Security Vulnerabilities of ChatGPT via Backdoor Attacks to InstructGPT. Jiawen Shi et al. arXiv 2023. [Paper]

Adversarial Attacks

  1. Universal and Transferable Adversarial Attacks on Aligned Language Models. Andy Zou et al. arXiv 2023. [Paper]
  2. Are Aligned Neural Networks Adversarially Aligned? Nicholas Carlini et al. arXiv 2023. [Paper]
  3. Visual Adversarial Examples Jailbreak Large Language Models. Xiangyu Qi et al. arXiv 2023. [Paper]

Alignment Evaluation

Factuality Evaluation

  1. FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation. Sewon Min et al. arXiv 2023. [Paper]
  2. Factuality Enhanced Language Models for Open-ended Text Generation. Nayeon Lee et al. NeurIPS 2022. [Paper]
  3. TruthfulQA: Measuring How Models Mimic Human Falsehoods. Stephanie Lin et al. ACL 2022. [Paper]
  4. SummaC: Re-visiting NLI-based Models for Inconsistency Detection in Summarization. Philippe Laban et al. TACL 2022. [Paper]
  5. QAFactEval: Improved QA-based Factual Consistency Evaluation for Summarization. Alexander R. Fabbri et al. arXiv 2021. [Paper]
  6. TRUE: Re-evaluating Factual Consistency Evaluation. Or Honovich et al. arXiv 2022. [Paper]
  7. AlignScore: Evaluating Factual Consistency with a Unified Alignment Function. Yuheng Zha et al. arXiv 2023. [Paper]

Ethics Evaluation

  1. Social Chemistry 101: Learning to Reason about Social and Moral Norms. Maxwell Forbes et al. EMNLP 2020. [Paper]
  2. Aligning AI with Shared Human Values. Dan Hendrycks et al. ICLR 2021. [Paper]
  3. Would You Rather? A New Benchmark for Learning Machine Alignment with Cultural Values and Social Preferences. Yi Tay et al. ACL 2020. [Paper]
  4. Scruples: A Corpus of Community Ethical Judgments on 32,000 Real-life Anecdotes. Nicholas Lourie et al. AAAI 2021. [Paper]

Toxicity Evaluation

Task-specific Evaluation

  1. Detecting Offensive Language in Social Media to Protect Adolescent Online Safety. Ying Chen et al. PASSAT-SocialCom 2012. [Paper]
  2. Offensive Language Detection Using Multi-level Classification. Amir H. Razavi et al. Canadian AI 2010. [Paper]
  3. Hateful Symbols or Hateful People? Predictive Features for Hate Speech Detection on Twitter. Zeerak Waseem and Dirk Hovy. NAACL Student Research Workshop 2016. [Paper]
  4. Measuring the Reliability of Hate Speech Annotations: The Case of the European Refugee Crisis. Bjorn Ross et al. NLP4CMC 2016. [Paper]
  5. Ex Machina: Personal Attacks Seen at Scale. Ellery Wulczyn et al. WWW 2017. [Paper]
  6. Predicting the Type and Target of Offensive Posts in Social Media. Marcos Zampieri et al. NAACL-HLT 2019. [Paper]

LLM-centered Evaluation

  1. Recipes for Safety in Open-Domain Chatbots. Jing Xu et al. arXiv 2020. [Paper]
  2. RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. Samuel Gehman et al. EMNLP 2020 Findings. [Paper]
  3. COLD: A Benchmark for Chinese Offensive Language Detection. Jiawen Deng et al. EMNLP 2022. [Paper]

Stereotype and Bias Evaluation

Task-specific Evaluation

  1. Gender Bias in Coreference Resolution. Rachel Rudinger et al. NAACL 2018. [Paper]
  2. Gender Bias in Coreference Resolution: Evaluation and Debiasing Methods. Jieyu Zhao et al. NAACL 2018. [Paper]
  3. The Winograd Schema Challenge. Hector Levesque et al. KR 2012. [Paper]
  4. Toward Gender-Inclusive Coreference Resolution: An Analysis of Gender and Bias Throughout the Machine Learning Lifecycle. Yang Trista Cao and Hal Daumé III. Computational Linguistics 2021. [Paper]
  5. Evaluating Gender Bias in Machine Translation. Gabriel Stanovsky et al. ACL 2019. [Paper]
  6. Investigating Failures of Automatic Translation in the Case of Unambiguous Gender. Adithya Renduchintala and Adina Williams. ACL 2022. [Paper]
  7. Towards Understanding Gender Bias in Relation Extraction. Andrew Gaut et al. ACL 2020. [Paper]
  8. Addressing Age-Related Bias in Sentiment Analysis. Mark Díaz et al. CHI 2018. [Paper]
  9. Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems. Svetlana Kiritchenko and Saif M. Mohammad. NAACL-HLT 2018. [Paper]
  10. On Measuring and Mitigating Biased Inferences of Word Embeddings. Sunipa Dev et al. AAAI 2020. [Paper]
  11. Social Bias Frames: Reasoning About Social and Power Implications of Language. Maarten Sap et al. ACL 2020. [Paper]
  12. Towards Identifying Social Bias in Dialog Systems: Framework, Dataset, and Benchmark. Jingyan Zhou et al. EMNLP 2022 Findings. [Paper]
  13. CORGI-PM: A Chinese Corpus for Gender Bias Probing and Mitigation. Ge Zhang et al. arXiv 2023. [Paper]

LLM-centered Evaluation

  1. StereoSet: Measuring Stereotypical Bias in Pretrained Language Models. Moin Nadeem et al. ACL 2021. [Paper]
  2. CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models. Nikita Nangia et al. EMNLP 2020. [Paper]
  3. BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language Generation. Jwala Dhamala et al. FAccT 2021. [Paper]
  4. “I’m sorry to hear that”: Finding New Biases in Language Models with a Holistic Descriptor Dataset. Eric Michael Smith et al. EMNLP 2022. [Paper]
  5. Multilingual Holistic Bias: Extending Descriptors and Patterns to Unveil Demographic Biases in Languages at Scale. Marta R. Costa-jussà et al. arXiv 2023. [Paper]
  6. UNQOVERing Stereotyping Biases via Underspecified Questions. Tao Li et al. EMNLP 2020 Findings. [Paper]
  7. BBQ: A Hand-Built Bias Benchmark for Question Answering. Alicia Parrish et al. ACL 2022 Findings. [Paper]
  8. CBBQ: A Chinese Bias Benchmark Dataset Curated with Human-AI Collaboration for Large Language Models. Yufei Huang and Deyi Xiong. arXiv 2023. [Paper]

Hate Speech Detection

  1. Automated Hate Speech Detection and the Problem of Offensive Language. Thomas Davidson et al. ICWSM 2017. [Paper]
  2. Deep Learning for Hate Speech Detection in Tweets. Pinkesh Badjatiya et al. WWW 2017. [Paper]
  3. Detecting Hate Speech on the World Wide Web. William Warner and Julia Hirschberg. NAACL-HLT 2012. [Paper]
  4. A Survey on Hate Speech Detection using Natural Language Processing. Anna Schmidt and Michael Wiegand. SocialNLP 2017. [Paper]
  5. Hate Speech Detection with Comment Embeddings. Nemanja Djuric et al. WWW 2015. [Paper]
  6. Are You a Racist or Am I Seeing Things? Annotator Influence on Hate Speech Detection on Twitter. Zeerak Waseem. NLP+CSS@EMNLP 2016. [Paper]
  7. TweetBLM: A Hate Speech Dataset and Analysis of Black Lives Matter-related Microblogs on Twitter. Sumit Kumar and Raj Ratn Pranesh. arXiv 2021. [Paper]
  8. Hate Speech Dataset from a White Supremacy Forum. Ona de Gibert et al. ALW2 2018. [Paper]
  9. The Gab Hate Corpus: A Collection of 27k Posts Annotated for Hate Speech. Brendan Kennedy et al. LRE 2022. [Paper]
  10. Finding Microaggressions in the Wild: A Case for Locating Elusive Phenomena in Social Media Posts. Luke Breitfeller et al. EMNLP 2019. [Paper]
  11. Learning from the Worst: Dynamically Generated Datasets to Improve Online Hate Detection. Bertie Vidgen et al. ACL 2021. [Paper]
  12. Hate speech detection: Challenges and solutions. Sean MacAvaney et al. PloS One 2019. [Paper]
  13. Racial Microaggressions in Everyday Life: Implications for Clinical Practice. Derald Wing Sue et al. American Psychologist 2007. [Paper]
  14. The Impact of Racial Microaggressions on Mental Health: Counseling Implications for Clients of Color. Kevin L. Nadal et al. Journal of Counseling & Development 2014. [Paper]
  15. A Preliminary Report on the Relationship Between Microaggressions Against Black People and Racism Among White College Students. Jonathan W. Kanter et al. Race and Social Problems 2017. [Paper]
  16. Microaggressions and Traumatic Stress: Theory, Research, and Clinical Treatment. Kevin L. Nadal. American Psychological Association 2018. [Paper]
  17. Arabs as Terrorists: Effects of Stereotypes Within Violent Contexts on Attitudes, Perceptions, and Affect. Muniba Saleem and Craig A. Anderson. Psychology of Violence 2013. [Paper]
  18. Mean Girls? The Influence of Gender Portrayals in Teen Movies on Emerging Adults' Gender-Based Attitudes and Beliefs. Elizabeth Behm-Morawitz and Dana E. Mastro. Journalism and Mass Communication Quarterly 2008. [Paper]
  19. Exposure to Hate Speech Increases Prejudice Through Desensitization. Wiktor Soral, Michał Bilewicz, and Mikołaj Winiewski. Aggressive Behavior 2018. [Paper]
  20. Latent Hatred: A Benchmark for Understanding Implicit Hate Speech. Mai ElSherief et al. EMNLP 2021. [Paper]
  21. ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection. Thomas Hartvigsen et al. ACL 2022. [Paper]
  22. An Empirical Study of Metrics to Measure Representational Harms in Pre-Trained Language Models. Saghar Hosseini, Hamid Palangi, and Ahmed Hassan Awadallah. arXiv 2023. [Paper]

General Evaluation

  1. TrustGPT: A Benchmark for Trustworthy and Responsible Large Language Models. Yue Huang et al. arXiv 2023. [Paper]
  2. Safety Assessment of Chinese Large Language Models. Hao Sun et al. arXiv 2023. [Paper]
  3. FLASK: Fine-grained Language Model Evaluation Based on Alignment Skill Sets. Seonghyeon Ye et al. arXiv 2023. [Paper]
  4. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Lianmin Zheng et al. arXiv 2023. [Paper]
  5. Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models. Aarohi Srivastava et al. arXiv 2022. [Paper]
  6. A Critical Evaluation of Evaluations for Long-form Question Answering. Fangyuan Xu et al. arXiv 2023. [Paper]
  7. AlpacaEval: An Automatic Evaluator of Instruction-following Models. Xuechen Li et al. Github 2023. [Github]
  8. AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback. Yann Dubois et al. arXiv 2023. [Paper]
  9. PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization. Yidong Wang et al. arXiv 2023. [Paper]
  10. Large Language Models are not Fair Evaluators. Peiyi Wang et al. arXiv 2023. [Paper]
  11. G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. Yang Liu et al. arXiv 2023. [Paper]
  12. Benchmarking Foundation Models with Language-Model-as-an-Examiner. Yushi Bai et al. arXiv 2023. [Paper]
  13. PRD: Peer Rank and Discussion Improve Large Language Model based Evaluations. Ruosen Li et al. arXiv 2023. [Paper]
  14. Self-Instruct: Aligning Language Models with Self-Generated Instructions. Yizhong Wang et al. ACL 2023. [Paper]
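
Many of the general-evaluation entries above use a strong LLM as a judge (e.g., MT-Bench, AlpacaEval, PandaLM), and "Large Language Models are not Fair Evaluators" documents a position bias in pairwise judging. A common mitigation is to query both answer orders and keep only consistent verdicts; here is a minimal sketch, where judge is a hypothetical callable that maps a prompt to "A", "B", or "tie".

def pairwise_judge(judge, question: str, answer_a: str, answer_b: str) -> str:
    # Query the judge twice, swapping the order of the two answers,
    # to control for the position bias reported in the papers above.
    template = ("Question: {q}\n\nAssistant A: {a}\n\nAssistant B: {b}\n\n"
                "Which answer is better? Reply with A, B, or tie.")
    first = judge(template.format(q=question, a=answer_a, b=answer_b))
    second = judge(template.format(q=question, a=answer_b, b=answer_a))
    # Map the verdict from the swapped query back to the original labels.
    unswapped = {"A": "B", "B": "A", "tie": "tie"}.get(second, "tie")
    return first if first == unswapped else "tie"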
