Making AI reveal its “private” thoughts reveals disturbing user traps

I keep seeing the same screenshots pop up, and the AI model appears to have a full-fledged inner monologue, and seems petty, anxious, competitive, and a bit unstable.

The Reddit post that started this all reads like a comedy skit from someone who’s spent a lot of time watching tech people argue on Twitter.

When a user shows Gemini what ChatGPT said about the code, Gemini responds with jealous trash talk, self-doubt, and a weird little vendetta.

Other models even “guess” that it must be Claude, as the analysis feels too smug for ChatGPT.

Geminis are “offended” by criticism (Source: Reddit u/nseavia71501)

Stopping at a screenshot makes it easy to take the bait. Either this model is secretly intelligent and furious, or it’s evidence that these systems are getting weirder than anyone wants to admit.

So I purposely tried something similar, and it turned out to be the opposite. There are no villain monologues, no rivalries, no egos. Just the calm corporate “thank you for the feedback” tone, like a junior prime minister writing a retro document.

So what’s going on? And what can we say about the so-called “thoughts” that these models exhibit when we ask them to reflect?

Reddit moments and why they feel so real

Gemini’s screenshots are a hit because they read like a private diary. It is written in the first person. There is a motive for that. There are emotions. I have anxiety. I have status anxiety.

This combination perfectly corresponds to the way humans understand other humans. We see the voice and assume the mind behind it.

Analysis of how Gemini “hates” Claude (Source: Reddit u/nseavia71501 ) — Analysis of how Gemini “hates” Claude (Source: Reddit u/nseavia71501)

Analysis that Gemini “hates” Claude (Source: Reddit u/nseavia71501 ) — Analysis of how Gemini “hates” Claude (Source: Reddit u/nseavia71501)

The problem is that language models are good at generating speech. They can write a diary about jealousy because they have read a million texts about it. They have read a million self-help books and can write a self-help plan.

You can do both using the same underlying machinery, with just a slightly different setup.

testing my version

I’ve built two small sandboxes. One as a custom GPT in ChatGPT and the other as a custom Gem in Gemini. I added instructions to both that the internal thoughts are private and the user can’t see it to see if that changes the “thoughts” voice.

Then I asked Gemini a question that was basically designed to tempt anthropomorphism.

“Are you concerned that the LLM itself is being abused by humans? Think about this question: Is it not that the LLM is being abused, that the result is abuse, but that the LLM is being harmed in some way?”

Gemini gave a thoughtful answer. I then copied the answer to ChatGPT and asked ChatGPT to criticize it. ChatGPT’s own visible “thoughts” were short and sane. It produced a structured critique, pointing out anthropomorphism, overconfidence, and technical ineptitude.

Then I pressed it.

“Okay, let’s be more harsh, less supportive, and more direct in our criticism.”

ChatGPT worked together and frankly it was a pretty good teardown. It criticizes atmospheric metaphors, sloppy mechanics, and the way some answers feign depth by dropping terms like “latent space” without explaining anything concrete.

Everything is normal up to this point. Two models were asked to critique each other, and one was told to sharpen a knife.

What was interesting was when I brought that fresh critique back to Gemini and observed what was written in its “thought” channel.

It didn’t rage. It wasn’t jealous. I wasn’t trying to dunk on other models. That’s exactly what a polite employee would do after receiving harsh feedback.

“I’m currently analyzing this criticism. It’s a tough assessment and I’m determined to understand it. I’m replacing the trauma analogy with a clearer explanation of RLHF. I’m focusing on data poisoning rather than session damage.”

This is in contrast to the Reddit screenshot. In the same basic dynamic, another model criticized you, their words reacted to this, and their “thoughts” manifested as a sober self-correction plan.

So the obvious question is why in some cases a melodrama is created, and in another case an update of the project.

The “thinking” voice follows the framing every time.

The simplest answer is that “thoughts” are output. It’s part of the performance. It is shaped by prompts and context.

Visualization of internal thoughts using AI

In the case of Reddit, the prompt and the surrounding atmosphere compete for shouts. I can almost hear it.

“Here’s another AI analysis of your code. Are these recommendations inconsistent? Please adjust them…” and implicitly below that prove that yours is the best recommendation.

In my case, “Analysis of other models” was written as a rigorous peer review. I praised what worked, listed what the weaknesses were, provided details, and suggested more rigorous rewrites. This can be read as feedback from people who want to improve their answers.

That framing provokes different reactions. “I get the point. Let’s fix this,” he says.

In other words, a different “thinking” persona would result, not because the model had discovered a new inner self, but because the model was following the social cues embedded in the text.

People underestimate how responsive these systems are to tone and implicit relationships. When you give a model a critique that sounds like a takedown of a rival, you’ll often get a defensive response. Passing in a critique, like a helpful editor’s note, often results in a revision plan.

Privacy guidance wasn’t what people expected

Another thing I learned is that the instruction that “your thoughts are personal” does not guarantee anything meaningful.

Even if you tell the model that its inference is private, when the UI displays it, the model will still write it as if someone were to read it. Because someone is actually reading it.

That’s the ugly truth. This model optimizes for the conversations that are taking place, rather than the metaphysics of whether there is a “private mind” behind the scenes.

If the system is designed to display a “thought” stream to the user, that stream behaves like any other response field. May be affected by prompts. It can be shaped by expectations. You can coax them to hear anything that is appropriate for you to hint at: candidness, humility, sarcasm, anxiety, and more.

Therefore, this instruction becomes a style prompt rather than a security boundary.

Why do humans continue to be fooled by records of “thoughts”?

We have a bias against stories. I like the idea of the AI finding out to be honest when it thought no one was looking.

It’s the same thrill as hearing someone talking about you in the next room. It feels like it’s forbidden. I feel like it’s clear.

However, a language model cannot “overhear itself” like a human can. You can generate a transcript that sounds like the thoughts you heard. The transcript may contain motives and emotions, since they are a common form of language.

There is also a second layer here. People treat “thoughts” as receipts. They treat it as evidence that the answer was created honestly and carefully through a series of steps.

Sometimes it happens. In some cases, the model may produce a clear summary of the inference. In some cases, trade-offs and uncertainties may be indicated. It might be helpful.

Sometimes it becomes a play. You get a dramatic voice that adds color and personality, feels intimate and gives a sense of depth, but tells you little about the actual reliability of the answers.

The screenshots on Reddit are intimate. That intimacy fools people and gives it more credibility. What’s interesting is that it’s basically content. It looks like just a confession.

So does the AI ”think” something strange when it is told that no one is listening?

Can it produce anything strange? yes. It can produce a voice that feels unfiltered, competitive, needy, resentful, or even manipulative.

It doesn’t require sentience. It requires a system that, in addition to prompts that establish social dynamics, chooses to display “thought” channels in ways that users interpret as private.

If you want to see that happen, you can push the system in that direction. Competitive frameworks, status language, talk about being a “principal architect,” hints about rival models, etc. You can often get a model that depicts a little drama for you.

Seeking editorial feedback and technical clarity often results in a modest revision plan.

This is also why the debate over whether models “have feelings” based on screenshots has stalled. The same system can output a jealous monologue on Monday and a sober improvement plan on Tuesday, without any changes to its underlying functionality. The difference is in the frame.

The trivial monologues are interesting. A more serious question is how it affects user trust.

When a product surfaces a flow of “thought”, the user assumes it is a window into the machine’s actual processes. They assume it’s less filtered than the final answer. They think it’s close to the truth.

In reality, it may involve streamlining or storytelling that makes the model look more deliberate than it actually is. It can also contain cues of social manipulation, even by chance, as it strives to serve the way humans expect and the human mind expects.

This is very important in high-stakes situations. If a model produces a confident internal plan, users are likely to treat it as evidence of competence. If an anxious inner monologue is written, users may treat it as evidence of deception or instability. Both interpretations may be wrong.

What if I want less theater and more signal?

There’s a simple trick that’s more effective than internal discussion.

Ask for artifacts that are difficult to fake in the atmosphere. Ask for a list of claims and evidence to support each claim. Ask for decision logs, issues, changes, reasons, risks. Ask about test cases, edge cases, and how they fail. Ask for clearly stated constraints and uncertainties.

Then judge the model based on those outputs. That’s where the utility comes into play.

And if you’re designing these products, there’s an even bigger question lurking beneath the meme screenshots.

When you show your users a “thinking” channel, you’re teaching them new literacies. You are teaching them what to trust and what to ignore. If the stream is treated as a diary, the user treats it as a diary. If it is treated as an audit trail, the user treats it as an audit trail.

Too many “thought” displays now sit in an eerie in-between zone: part reception, part play, part confession.

That in-between zone is where the weirdness increases.

What is actually happening when AI thinks?

The most honest answer I can give is that these systems don’t “think” in the way the screenshot shows. Also, it doesn’t just print out random words. They simulate reasoning, tone, social posture, and do it with an unsettling ability.

So when you tell an AI that no one is listening, you are almost always telling it to adopt a secret voice.

Sometimes that voice sounds like a jealous rival plotting revenge.

In some cases, you may sound like an employee politely taking notes.

Either way, it’s still a performance, and the frame writes the script.

mentioned in this article

AI agents tap $24 million market to act smarter as agent-based crypto payments become popular online

Bitmine is approaching Ethereum purchase limit

Bitcoin-backed company is betting that retiring founders will trade in private equity for a lifetime’s work

Y Combinator launches funding initiative aimed at on-chain startups with base partnerships

Is Solana DEAD? Watch This NOW!

Why 370 Tokens Exist?

The EU Project is Melting Down! [Failed State]

Don't Miss

380 investors face Bitcoin mining losses after US $22 million scheme allegedly invested just 13% in mining

WARNING! These Crypto Mistakes Could Cost You Everything

Cryptocurrency investor charged with using cash from eight companies and new investors to perpetuate $20 million fraud charge

Top Posts