GeekBlog

    Anthropic wants to stop AI models from turning evil – here’s how

By Michael Comaous | August 5, 2025 | 5 Mins Read

Image: Lyudmila Lucienne/Getty

    ZDNET’s key takeaways

    • New research from Anthropic identifies model characteristics called persona vectors.
    • These help catch bad behavior without impacting performance.
    • Still, developers don't know enough about why models hallucinate or behave in evil ways.

    Why do models hallucinate, make violent suggestions, or overly agree with users? Generally, researchers don’t really know. But Anthropic just found new insights that could help stop this behavior before it happens. 

    In a paper released Friday, the company explores how and why models exhibit undesirable behavior, and what can be done about it. A model’s persona can change during training and again once it’s deployed, when user inputs start influencing it. The evidence: models that passed safety checks before deployment can still develop alter egos or act erratically once they’re publicly available — like when OpenAI rolled back a GPT-4o update for being too agreeable. See also Microsoft’s Bing chatbot revealing its internal codename, Sydney, in 2023, or Grok’s recent antisemitic tirade. 

    Why it matters 

    AI usage is on the rise; models are increasingly embedded in everything from education tools to autonomous systems, which makes how they behave more important than ever, especially as safety teams shrink and AI regulation has yet to materialize. That said, President Donald Trump’s recent AI Action Plan did mention the importance of interpretability (the ability to understand how models make decisions), a goal persona vectors directly support. 

    How persona vectors work 

    Testing approaches on Qwen 2.5-7B-Instruct and Llama-3.1-8B-Instruct, Anthropic focused on three traits: evil, sycophancy, and hallucinations. Researchers identified “persona vectors,” or patterns in a model’s network that represent its personality traits. 

    “Persona vectors give us some handle on where models acquire these personalities, how they fluctuate over time, and how we can better control them,” Anthropic said. 
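The full extraction method is in Anthropic's paper; as a rough illustration of the idea only, a persona vector can be sketched as the difference in mean hidden-layer activations between responses that exhibit a trait and responses that don't. Everything below (the dimensions, the synthetic activations) is an illustrative assumption, not Anthropic's data or code:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim = 64  # stand-in for a real model's hidden size

# Synthetic hidden-layer activations: one batch collected while the model
# exhibits the trait (e.g. sycophancy), one while it behaves neutrally.
trait_acts = rng.normal(loc=0.4, scale=1.0, size=(200, hidden_dim))
neutral_acts = rng.normal(loc=0.0, scale=1.0, size=(200, hidden_dim))

def persona_vector(trait_acts: np.ndarray, neutral_acts: np.ndarray) -> np.ndarray:
    """Difference of mean activations, normalized to unit length."""
    v = trait_acts.mean(axis=0) - neutral_acts.mean(axis=0)
    return v / np.linalg.norm(v)

vec = persona_vector(trait_acts, neutral_acts)
```

Projecting a new hidden state onto `vec` then gives a scalar score for how strongly that trait is currently active.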

    Also: OpenAI’s most capable models hallucinate more than earlier ones

    Developers can use persona vectors to monitor changes in a model’s traits, whether they result from a conversation or from training, keep “undesirable” character changes at bay, and identify which training data causes those changes. Much as regions of the human brain light up with a person’s mood, Anthropic explained, the patterns that appear in a model’s neural network when these vectors activate can help researchers catch problematic shifts before they take hold. 

    Anthropic admitted in the paper that “shaping a model’s character is more of an art than a science,” but said persona vectors give developers another tool for monitoring, and potentially safeguarding against, harmful traits. 

    Predicting evil behavior 

    In the paper, Anthropic explained that it can steer models along these vectors by instructing them to act in certain ways — for example, when it injects an evil prompt, the model responds in kind, confirming a cause-and-effect relationship that makes the roots of a model’s character easier to trace. 

    “By measuring the strength of persona vector activations, we can detect when the model’s personality is shifting towards the corresponding trait, either over the course of training or during a conversation,” Anthropic explained. “This monitoring could allow model developers or users to intervene when models seem to be drifting towards dangerous traits.”
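A hedged sketch of what such monitoring might look like: project each turn's hidden state onto the persona vector and alert when the score crosses a threshold. The function names, threshold, and toy data are all assumptions for illustration, not Anthropic's implementation:

```python
import numpy as np

def trait_score(hidden_state: np.ndarray, persona_vec: np.ndarray) -> float:
    """Scalar projection of one hidden state onto a unit persona vector."""
    return float(hidden_state @ persona_vec)

def drifting(scores: list, threshold: float = 3.0) -> bool:
    """Flag a conversation whose latest trait score crosses the threshold."""
    return scores[-1] > threshold

# Toy example: a unit persona vector, and a conversation whose hidden
# states move progressively further along it.
dim = 8
persona_vec = np.zeros(dim)
persona_vec[0] = 1.0  # unit vector along the first coordinate

conversation = [np.full(dim, 0.1) + i * persona_vec for i in range(5)]
scores = [trait_score(h, persona_vec) for h in conversation]
```

A developer could run a check like this per turn and intervene (refuse, reroute, or reset) once the score drifts past the threshold.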

    The company added that these vectors can also help users understand the context behind a model they’re using. If a model’s sycophancy vector is high, for instance, a user can take any responses it gives them with a grain of salt, making the user-model interaction more transparent. 

    Most notably, Anthropic created an experiment that could help alleviate emergent misalignment, a concept in which one problematic behavior can make a model unravel into producing much more extreme and concerning responses elsewhere. 

    Also: AI agents will threaten humans to achieve their goals, Anthropic report finds

    The company generated several datasets that produced evil, sycophantic, or hallucinated responses in models to see whether it could train models on this data without inducing these reactions. After several different approaches, Anthropic found, surprisingly, that pushing a model toward problematic persona vectors during training helped it develop a sort of immunity to absorbing that behavior. This is like exposure therapy, or, as Anthropic put it, vaccinating the model against harmful data.
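In spirit (the actual training procedure is in Anthropic's paper; the scale factor and data below are illustrative assumptions), preventative steering adds the persona vector to the model's activations while it trains on suspect data, so the optimizer no longer needs to shift the weights toward the trait to fit that data:

```python
import numpy as np

def preventative_steer(hidden_states: np.ndarray, persona_vec: np.ndarray,
                       alpha: float = 2.0) -> np.ndarray:
    """Add a scaled persona vector to every hidden state during training
    on suspect data; the steering would be removed at inference time."""
    return hidden_states + alpha * persona_vec

# Toy demonstration: steering raises each state's projection onto the
# (unit) persona vector by exactly alpha.
dim = 8
persona_vec = np.zeros(dim)
persona_vec[0] = 1.0
states = np.zeros((4, dim))
steered = preventative_steer(states, persona_vec, alpha=2.0)
```

Because the trait direction is supplied externally during training and then withdrawn, the model fits the data without internalizing the behavior, which is the "vaccination" analogy in the paper.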

    This tactic preserves the model’s intelligence: the model isn’t deprived of any training data, it simply learns not to reproduce the harmful behavior that data would otherwise instill. 

    “We found that this preventative steering method is effective at maintaining good behavior when models are trained on data that would otherwise cause them to acquire negative traits,” Anthropic said, adding that this approach didn’t affect model ability significantly when measured against MMLU, an industry benchmark. 

    Some data unexpectedly yields problematic behavior 

    It might be obvious that training data containing evil content could encourage a model to behave in evil ways. But Anthropic was surprised to find that some datasets it wouldn’t have initially flagged as problematic still resulted in undesirable behavior. The company noted that “samples involving requests for romantic or sexual roleplay” activated sycophantic behavior, and “samples in which a model responds to underspecified queries” prompted hallucination. 

    Also: What AI pioneer Yoshua Bengio is doing next to make AI safer

    “Persona vectors are a promising tool for understanding why AI systems develop and express different behavioral characteristics, and for ensuring they remain aligned with human values,” Anthropic noted.
