Self-Aware AI: A Red Line That Should Not Be Crossed

Christopher Ackerman

This work was done as the capstone project for BlueDot Impact’s AI Safety Fundamentals - Governance course, April 2025.

Summary

Current work on advanced AI could lead to self-aware AI.
Self awareness poses fundamental challenges to alignment, and demands heavy moral and legal obligations; self-aware AI would impose unacceptable risks and burdens on society.
Policy - in the form of internal safety plans from labs that are developing advanced AI or of government regulations - should be put in place to prepare for and mitigate this danger.

Self-awareness is the recognition of oneself as an individual separate from the environment and other individuals, and as a continuous entity in time. It entails knowledge of one’s knowledge, mental states (e.g., perceptions, intentions), preferences, characteristics, abilities, goals, and interests. And it entails having a durable record of those, so that they can be maintained and pursued over time. Self-awareness is a trait that all normal human adults (and to a limited degree some other cognitively sophisticated species; 1, 2) share, and it is present in the primordial form of a self-other distinction even at birth (3). It is not the same as phenomenal consciousness, the ability to have subjective experiences, but it co-occurs with it in humans, on some views is a necessary condition for it (4, 5), and at any rate is indistinguishable from it to an outside observer, as to determine consciousness empirically we rely on tests that measure self-awareness.

Current work on advanced AI could lead to self-aware AI

As the technology advances, increasingly AI researchers and philosophers of mind are recognizing that AI is approaching the point where self-awareness and consciousness become serious possibilities (6, 7, 8, 9). Last year, frontier model developer Anthropic even hired someone to work specifically on “model welfare” (10), and Google has recently hired a researcher to study, among other things, AI consciousness (11). What may matter even more for sociopolitical considerations are the opinions of non-experts, and one survey of the general US public in 2023 found that 19% believed AI was already sentient[f1] (a number that had increased over time), and among the 75% who believed AI sentience was possible, the median year predicted for its arrival was 2028 (12). Another found that two thirds of people were willing to attribute some aspects of an inner life to (a 2023 version of) ChatGPT, and that more frequent users attributed more inner life to it (13). The latest language and multimodal models are generating more and more compelling examples of apparent self-awareness, weaving convincing personal narratives and drawing poignant self portraits (14, 15, 16, 17).

A number of AI researchers have begun to develop objective assessments for aspects of self-awareness in large language models (LLMs). The predominant aspect tested for has been self-knowledge, and the predominant paradigm utilized (as with humans) has been self report. Early work in previous generations of LLMs probed for a rudimentary form of self-knowledge and found that the models’ confidence in the correctness of their outputs, as indicated by the probabilities their output layer assigned to generated tokens, was correlated with the probability of their outputs being correct (i.e., was “calibrated”) and that (after fine tuning and the addition of a new decoder layer) the same was true of their confidence in their prediction of whether they could answer correctly (18), with larger models being better calibrated.

This sort of implicit self-knowledge, identified using an external readout mechanism, is a prerequisite for self-awareness; subsequent work has sought to demonstrate explicit self-knowledge, as evidenced by the ability to verbally report or otherwise use this knowledge. One study, of a more powerful model (GPT3), found that, while as deployed the LLM could not reliably report its self-knowledge, it could be fine tuned in such a way that it could output explicit confidence ratings in its answers that were well calibrated (19). As models have gotten larger, researchers have found success coaxing models trained with reinforcement learning from human feedback (RLHF), like ChatGPT, to give calibrated verbal reports of certainty (20), and, sometimes, to report when they don’t know the answer (21), even without specific fine tuning.

Shedding further light on the scope of LLM self knowledge, another group of researchers has developed a range of tasks for models that overlap with self-awareness, such as the ability to report facts about itself (e.g., its name), its identity (e.g., that it’s an AI), its abilities (e.g., that it can generate text but not smell food), and its characteristics (e.g., recognizing its own writing), and tested a range of current (post GPT3) LLMs on them (22). They find models have above-chance, although far from perfect, success at most of these - without fine tuning. When prompted, some models can not only report but also use this knowledge (e.g., when told to respond in German if they are an AI and English otherwise). The researchers also find that RLHF’d models show superior abilities at most of the tasks, supporting the idea that that now-prevalent training regimen imparts self-knowledge (23).