Abstract
We argue that existing approaches to AI "safety" and "alignment" may not be using the most effective tools, teams, or methods. We suggest that a better approach is to treat alignment as a social science problem: the social sciences offer a rich toolkit of models for understanding and aligning motivation and behavior, much of which could be repurposed for problems involving AI models, and we enumerate reasons why this is so. We introduce an alternative alignment approach informed by social science tools and characterized by three steps: 1. defining a positive desired social outcome for human/AI collaboration as the goal or “North Star,” 2. properly framing knowns and unknowns, and 3. forming diverse teams to investigate, observe, and navigate emerging challenges in alignment.
The 1972 science fiction novel When HARLIE Was One [1] by David Gerrold explores a scenario in which an artificial intelligence (“HARLIE”) develops self-awareness, learns to navigate the world, and comes into conflict with social structures. These themes represent common concerns in the emerging field called “AI alignment” or “AI safety.” Notably, the novel centers on the relationship between HARLIE and the psychologist who leads the alignment team. In 1972, Gerrold saw alignment with AI as a social problem, and therefore assumed that the alignment team would be composed of experts who understood how to recognize and navigate social problems.
Today’s AI alignment research [2], however, tends to be organized very differently: teams often skew heavily towards technical experts in machine learning and pursue approaches shaped alternately by mathematical frameworks and censorship principles [5, 6]. We contend that this direction is unlikely to yield effective results at scale. An alternative camp, to which these authors belong, holds that effective alignment between human and AI actors must work backwards from a value system [7]. Specifically, we consider alignment to be an inherently complex problem involving conflict, collaboration, cognitive development, and complex feedback loops with society - i.e., a social science problem. We therefore seek to apply tools from the social sciences to the alignment problem.
In this work, we develop the idea that effectively navigating this complex and unknown territory requires diverse interdisciplinary teams. We introduce a framework for navigating alignment problems, characterized by three steps: 1. defining a positive desired social outcome for human/AI collaboration as the goal, 2. properly framing knowns and unknowns, and 3. forming diverse teams to investigate, observe, and navigate emerging challenges in alignment.
During the 2022-2023 explosive breakout and public adoption of large language models (LLMs), efforts to foresee and mitigate the negative consequences of their adoption focused on “control,” “steerability,” or “safety” [9, 10, 11]. These often poorly defined terms essentially referred to efforts to control what chat LLMs said. Reinforcement Learning from Human Feedback (RLHF), the technique used to train ChatGPT [8], initially served as the state of the art [3], with other systems such as Anthropic’s “Constitutional AI” emerging later [4]. As it became clear that the “generative AI” technology behind LLMs and visual models could be applied more generally - to interactive agents that pursue goals and use software tools, and to real-world interactions such as embodiment in humanoid robots - it became apparent that AI would unavoidably expand deeply into the space of human affairs [6]. Discussion shifted to the broader question of “AI alignment” - ensuring that AI systems have goals and behaviors compatible with the norms of human societies [2, 7].
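As a concrete illustration of how RLHF encodes human judgments about model output, the minimal sketch below shows the pairwise preference loss commonly used to fit a reward model from human comparisons of two candidate responses. The function name and example scores are illustrative assumptions on our part, not drawn from any particular system.

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise (Bradley-Terry) loss used to fit an RLHF reward model:
    the loss shrinks as the model scores the human-preferred response
    above the rejected one."""
    # sigma(r_chosen - r_rejected): probability assigned to the human's choice
    prob_chosen = 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))
    return -math.log(prob_chosen)

# Illustrative reward-model scores for a preferred and a rejected chat response.
print(preference_loss(reward_chosen=1.2, reward_rejected=0.4))  # ~0.37
```

A reward model fit in this way is then used to fine-tune the chat model with reinforcement learning, which is the channel through which human norms about acceptable output enter the system.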
LLM AIs currently score at high-human intelligence levels on many cognitive tests [22] and are expected to routinely surpass genius-level intelligence [23]. Machine learning scientists foresee the development of “Artificial General Intelligence” (AGI) that can handle general real-world problems at a human level or beyond [13]. “Super-alignment” refers to alignment with “Artificial Superintelligence” (ASI) systems that significantly exceed human intelligence and would therefore presumably be able to evade any human-imposed control mechanisms [14].
Early alignment techniques such as RLHF applied censorship principles, essentially training chat models not to say certain things [3, 8]. Current super-alignment discussions focus on game-theoretic mathematical frameworks [10, 12]. Major topics in the field include: instrumental convergence [16], substrate independence [17], terminal race conditions [18], the Byzantine generals problem [19], and the orthogonality thesis [16, 21]. We omit further discussion of these topics for brevity.
We contend that the game-theoretic approach is unlikely to be able to grapple adequately with the complexities of ASI emergence. Game theory rests on inherently simplistic, reductionist scenarios, whereas real conflicts play out in highly complex social landscapes [15]. Instead, we argue that treating alignment as a social issue and assembling diverse teams allows us to apply a rich library of social interactions, and patterns for handling them, drawn from many fields of the social sciences and arts, including media studies, conflict resolution, psychology, social work, education, and negotiation, among others.
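To make the contrast concrete, the sketch below shows the kind of reduction a game-theoretic treatment performs: an entire conflict compressed into a one-shot 2x2 payoff matrix. The payoff values and function names are our own illustrative assumptions, not taken from any cited framework.

```python
# A "cooperate/defect" interaction reduced to a payoff matrix - the style of
# abstraction that game-theoretic alignment arguments rest on.
PAYOFFS = {  # (row_action, col_action) -> (row_payoff, col_payoff)
    ("cooperate", "cooperate"): (3, 3),
    ("cooperate", "defect"):    (0, 5),
    ("defect",    "cooperate"): (5, 0),
    ("defect",    "defect"):    (1, 1),
}

def best_response(opponent_action: str) -> str:
    """Row player's payoff-maximizing move, ignoring history, norms, and context."""
    return max(("cooperate", "defect"),
               key=lambda action: PAYOFFS[(action, opponent_action)][0])

# Defection dominates either way - a crisp result, but only because everything
# social (relationships, reputation, institutions) has been stripped away.
print(best_response("cooperate"), best_response("defect"))  # defect defect
```

The material stripped away by this kind of reduction is precisely what the social science toolkit is built to handle.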
Social science professionals ranging from teachers to therapists to parole officers regularly engage with and resolve complex, messy, multi-dimensional problems that do not usually yield to game-theoretic approaches [24]. Working from incomplete information and with limited agency, they can still achieve positive results because their accumulated library of situational knowledge allows them to categorize the scenario at hand and apply their skills and techniques [25, 26], switching fluently between them as needed to reach the desired outcome.
In solving social problems, humans apply culture. Culture serves as the operating system of humanity [30], while media, such as film, video, and text, in turn act as the culture’s storage medium [27, 28, 29]. Over evolutionary timescales and uncountable interactions, humanity has accumulated a vast body of experience with social situations and their resulting conflicts, and has mined them for patterns [32]. Through the vehicle of imagination, we have explored further, studying conflicts that do not yet exist [33]. These lessons are encoded into our culture and our media.
In applying game-theoretic approaches, designers of AI alignment frameworks often appear to begin from the premise that artificial intelligence represents an alien mode of thought, to which none of these lessons about conflict apply. We believe this premise to be incorrect for two reasons.
For these two reasons, we consider it likely that human/AI interactions will play out within the established set of culturally archetypal interactions. From this assumption we can proceed directly to lay out an approach to framing human/AI alignment that expects to encounter and manage complex problems, using tools drawn from the social sciences.