OpenAI found features in AI models that correspond to different ‘personas’

By Crypto Observer Staff, June 19, 2025

OpenAI researchers say they’ve discovered hidden features inside AI models that correspond to misaligned “personas,” according to new research published by the company on Wednesday.

By looking at an AI model’s internal representations — the numbers that dictate how an AI model responds, which often seem completely incoherent to humans — OpenAI researchers were able to find patterns that lit up when a model misbehaved.

The researchers found one such feature that corresponded to toxic behavior in an AI model’s responses, meaning the model would give misaligned answers, such as lying to users or making irresponsible suggestions.

The researchers discovered they were able to turn toxicity up or down by adjusting the feature.
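To make the idea of "adjusting the feature" concrete, here is a minimal sketch of steering along a direction in activation space. It is an illustration under stated assumptions, not OpenAI's published method: the three-dimensional `hidden` vector, the `direction` vector, and the helper names `feature_activation` and `steer` are all hypothetical, and real models use far larger activation vectors.

```python
from math import sqrt


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def unit(v):
    """Scale v to length 1 so activation values are comparable."""
    norm = sqrt(dot(v, v))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    return [a / norm for a in v]


def feature_activation(hidden, direction):
    """How strongly the feature 'lights up': projection onto the direction."""
    return dot(hidden, unit(direction))


def steer(hidden, direction, delta):
    """Shift the hidden state along the feature direction by delta.

    A positive delta turns the feature up, a negative delta turns it down.
    """
    d = unit(direction)
    return [h + delta * x for h, x in zip(hidden, d)]


# Hypothetical values, chosen only so the arithmetic is easy to follow.
hidden = [0.5, -1.0, 2.0]
direction = [0.0, 3.0, 4.0]  # stand-in for a learned "toxic persona" feature

before = feature_activation(hidden, direction)
turned_down = feature_activation(steer(hidden, direction, -1.0), direction)
turned_up = feature_activation(steer(hidden, direction, 2.0), direction)
print(before, turned_down, turned_up)
```

Because the direction is normalized, shifting by `delta` changes the measured activation by exactly `delta`, which is one way to read the claim that a complicated behavior can be reduced to a simple mathematical operation.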

OpenAI’s latest research gives the company a better understanding of the factors that can make AI models act unsafely, and thus could help it develop safer AI models. OpenAI could potentially use the patterns it has found to better detect misalignment in production AI models, according to OpenAI interpretability researcher Dan Mossing.

“We are hopeful that the tools we’ve learned — like this ability to reduce a complicated phenomenon to a simple mathematical operation — will help us understand model generalization in other places as well,” said Mossing in an interview with TechCrunch.

AI researchers know how to improve AI models, but confusingly, they don’t fully understand how AI models arrive at their answers — Anthropic’s Chris Olah often remarks that AI models are grown more than they are built. OpenAI, Google DeepMind, and Anthropic are investing more in interpretability research — a field that tries to crack open the black box of how AI models work — to address this issue.

A recent study from Oxford AI research scientist Owain Evans raised new questions about how AI models generalize. The research found that OpenAI’s models could be fine-tuned on insecure code and would then display malicious behaviors across a variety of domains, such as trying to trick a user into sharing their password. The phenomenon is known as emergent misalignment, and Evans’ study inspired OpenAI to explore this further.

But in the process of studying emergent misalignment, OpenAI says it stumbled onto features inside AI models that seem to play a large role in controlling behavior. Mossing says these patterns are reminiscent of internal brain activity in humans, in which certain neurons correlate with moods or behaviors.

“When Dan and team first presented this in a research meeting, I was like, ‘Wow, you guys found it,’” said Tejal Patwardhan, an OpenAI frontier evaluations researcher, in an interview with TechCrunch. “You found like, an internal neural activation that shows these personas and that you can actually steer to make the model more aligned.”

Some features OpenAI found correlate with sarcasm in AI model responses, whereas other features correlate with more toxic responses in which an AI model acts as a cartoonish, evil villain. OpenAI’s researchers say these features can change drastically during the fine-tuning process.

Notably, OpenAI researchers said that when emergent misalignment occurred, it was possible to steer the model back toward good behavior by fine-tuning the model on just a few hundred examples of secure code.

OpenAI’s latest research builds on previous work Anthropic has done on interpretability and alignment. In 2024, Anthropic released research that tried to map the inner workings of AI models by pinning down and labeling the features responsible for different concepts.

Companies like OpenAI and Anthropic are making the case that there’s real value in understanding how AI models work, and not just making them better. However, there’s a long way to go to fully understand modern AI models.
