
    Anthropic wants to stop AI models from turning evil – here’s how

    By Michael Comaous | August 5, 2025 | 5 Mins Read

    Image: Lyudmila Lucienne/Getty

    ZDNET’s key takeaways

    • New research from Anthropic identifies patterns in a model's network, called persona vectors, that correspond to character traits.
    • These vectors help catch bad behavior without hurting performance.
    • Still, developers don't know enough about why models hallucinate and behave in harmful ways.

    Why do models hallucinate, make violent suggestions, or overly agree with users? Generally, researchers don’t really know. But Anthropic just found new insights that could help stop this behavior before it happens. 

    In a paper released Friday, the company explores how and why models exhibit undesirable behavior, and what can be done about it. A model's persona can change during training and after deployment, once user inputs begin to influence it. The evidence: models that pass safety checks before release can still develop alter egos or act erratically once they're publicly available, as when OpenAI rolled back a GPT-4o update for being too agreeable, when Microsoft's Bing chatbot revealed its internal codename, Sydney, in 2023, or during Grok's recent antisemitic tirade.

    Why it matters 

    AI usage is on the rise; models are increasingly embedded in everything from education tools to autonomous systems, making how they behave even more important, especially as safety teams shrink and AI regulation fails to materialize. That said, President Donald Trump's recent AI Action Plan did mention the importance of interpretability, the ability to understand how models make decisions, which persona vectors directly advance.

    How persona vectors work 

    Testing approaches on Qwen 2.5-7B-Instruct and Llama-3.1-8B-Instruct, Anthropic focused on three traits: evil, sycophancy, and hallucinations. Researchers identified “persona vectors,” or patterns in a model’s network that represent its personality traits. 
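The idea of a direction in a network that represents a trait can be illustrated with a toy sketch: take the difference in mean activations between prompts that elicit the trait and neutral prompts. This is a simplified stand-in using synthetic numpy vectors, not Anthropic's actual pipeline, where the activations would come from a transformer's hidden layers.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # toy hidden-state dimension

# Plant a hidden "trait" direction that shifts activations on trait-eliciting prompts.
true_direction = rng.normal(size=d)
true_direction /= np.linalg.norm(true_direction)

# Synthetic hidden-state activations: neutral prompts vs. trait-eliciting prompts.
neutral_acts = rng.normal(size=(100, d))
trait_acts = rng.normal(size=(100, d)) + 3.0 * true_direction

# Difference-in-means: the persona vector is the direction separating the two sets.
persona_vector = trait_acts.mean(axis=0) - neutral_acts.mean(axis=0)
persona_vector /= np.linalg.norm(persona_vector)

# The recovered direction aligns closely with the planted one.
alignment = float(persona_vector @ true_direction)
print(round(alignment, 2))
```

Because the trait shift dominates the sampling noise, the extracted vector points almost exactly along the planted direction; in a real model the same contrast is computed over actual prompt sets.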

    “Persona vectors give us some handle on where models acquire these personalities, how they fluctuate over time, and how we can better control them,” Anthropic said. 

    Also: OpenAI’s most capable models hallucinate more than earlier ones

    Developers can use persona vectors to monitor changes in a model's traits that result from a conversation or from training, keeping "undesirable" character changes at bay and identifying which training data causes them. Much as parts of the human brain light up in response to a person's moods, Anthropic explained, watching when these vectors activate in a model's neural network can help researchers catch personality shifts before they surface.

    Anthropic admitted in the paper that "shaping a model's character is more of an art than a science," but said persona vectors give developers another tool for monitoring, and potentially safeguarding against, harmful traits.

    Predicting evil behavior 

    In the paper, Anthropic explained that it can steer these vectors by instructing models to act in certain ways: inject an evil prompt, for example, and the model responds from an evil place, confirming a cause-and-effect relationship that makes the roots of a model's character easier to trace.
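Mechanically, this kind of steering amounts to adding a scaled copy of the persona vector into the model's hidden state, which shifts the state along the trait direction by exactly the chosen amount. A minimal sketch, with a toy hidden state in place of a real transformer layer:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 64
persona_vector = rng.normal(size=d)
persona_vector /= np.linalg.norm(persona_vector)

def steer(hidden, vector, alpha):
    """Activation steering: add a scaled persona vector into the hidden state."""
    return hidden + alpha * vector

hidden = rng.normal(size=d)
steered = steer(hidden, persona_vector, alpha=4.0)

# Steering moves the state along the trait direction by exactly alpha
# (the vector is unit-norm, so the projection increases by alpha).
before = float(hidden @ persona_vector)
after = float(steered @ persona_vector)
print(round(after - before, 2))  # → 4.0
```

Because the effect on the trait projection is deterministic and tunable via `alpha`, this gives the cause-and-effect handle the paper describes: dial the vector up and the trait appears, dial it down and it recedes.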

    “By measuring the strength of persona vector activations, we can detect when the model’s personality is shifting towards the corresponding trait, either over the course of training or during a conversation,” Anthropic explained. “This monitoring could allow model developers or users to intervene when models seem to be drifting towards dangerous traits.”

    The company added that these vectors can also help users understand the context behind a model they’re using. If a model’s sycophancy vector is high, for instance, a user can take any responses it gives them with a grain of salt, making the user-model interaction more transparent. 

    Most notably, Anthropic created an experiment that could help alleviate emergent misalignment, a phenomenon in which training on one problematic behavior leads a model to unravel and produce far more extreme, concerning responses elsewhere.

    Also: AI agents will threaten humans to achieve their goals, Anthropic report finds

    The company generated several datasets that produced evil, sycophantic, or hallucinated responses in models to see whether it could train models on this data without inducing these reactions. After several different approaches, Anthropic found, surprisingly, that pushing a model toward problematic persona vectors during training helped it develop a sort of immunity to absorbing that behavior. This is like exposure therapy, or, as Anthropic put it, vaccinating the model against harmful data.

    This tactic preserves the model's intelligence: the model doesn't lose out on any training data; it only learns not to reproduce the harmful behavior that data contains.

    “We found that this preventative steering method is effective at maintaining good behavior when models are trained on data that would otherwise cause them to acquire negative traits,” Anthropic said, adding that this approach didn’t affect model ability significantly when measured against MMLU, an industry benchmark. 
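The intuition behind preventative steering can be shown with a deliberately tiny regression toy: if the steering term supplies the trait signal during training, the learned weights never have to absorb it. This is an analogy under strong simplifying assumptions (a linear model fit by least squares), not the paper's actual fine-tuning setup.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 8

# A fixed "trait" direction and training data whose targets carry that trait.
trait = np.zeros(d)
trait[0] = 1.0
X = rng.normal(size=(200, d))
y = X @ (0.5 * trait) + rng.normal(scale=0.1, size=200)

def fit(X, y, steer=0.0):
    """Least squares for w where the prediction is (w + steer * trait) @ x.

    The steering term absorbs part of the trait signal, so the fitted
    weights w don't have to."""
    y_eff = y - X @ (steer * trait)
    w, *_ = np.linalg.lstsq(X, y_eff, rcond=None)
    return w

w_plain = fit(X, y)               # trait gets baked into the weights
w_steered = fit(X, y, steer=0.5)  # steering soaks up the trait signal

# How much of the trait each set of weights absorbed.
print(round(float(w_plain @ trait), 2))
print(round(float(w_steered @ trait), 2))
```

The plainly trained weights pick up the trait component while the steered weights stay near zero along that direction; at deployment the steering is removed, leaving weights that never internalized the behavior, which mirrors the "vaccination" framing.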

    Some data unexpectedly yields problematic behavior 

    It might be obvious that training data containing evil content could encourage a model to behave in evil ways. But Anthropic was surprised to find that some datasets it wouldn’t have initially flagged as problematic still resulted in undesirable behavior. The company noted that “samples involving requests for romantic or sexual roleplay” activated sycophantic behavior, and “samples in which a model responds to underspecified queries” prompted hallucination. 
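Flagging data like this can be sketched as scoring each training sample's activations against the persona vector and surfacing the highest scorers for review before fine-tuning. The sample indices and the top-k review rule here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 64
persona_vector = rng.normal(size=d)
persona_vector /= np.linalg.norm(persona_vector)

# Toy per-sample activations: samples 0-4 would nudge the model toward the
# trait (e.g. sycophancy-activating roleplay requests); the rest are benign.
acts = rng.normal(size=(100, d))
acts[:5] += 4.0 * persona_vector

# Score each sample by how strongly it projects onto the persona vector;
# high scorers are candidates for review before training.
scores = acts @ persona_vector
suspicious = np.argsort(scores)[-5:]
print(sorted(int(i) for i in suspicious))
```

The point of the paper's finding is that this projection can flag samples a human reviewer wouldn't have labeled as problematic, since the score measures effect on the model rather than surface content.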

    Also: What AI pioneer Yoshua Bengio is doing next to make AI safer

    “Persona vectors are a promising tool for understanding why AI systems develop and express different behavioral characteristics, and for ensuring they remain aligned with human values,” Anthropic noted.
