Hacking LLMs and how to protect your projects cover image

Hacking LLMs: The Dark Side of AI and How to Protect Your Projects

Hey there, tech enthusiasts! 👋 Today, we’re diving into the fascinating (and slightly scary) world of Large Language Model (LLM) attacks. As AI continues to revolutionize our digital landscape, it’s crucial to understand the potential vulnerabilities and how to safeguard our projects. So, grab your favorite caffeinated beverage, and let’s explore the dark side of AI! ☕️🕵️‍♂️

The Art of Deception: Types of LLM Prompting Attacks

Source: Google Blog

LLMs are incredibly powerful, but they’re not infallible. Sometimes you can outsmart just with shallow arguments and some street smarts. Clever attackers have found ways to manipulate these models, often through something as simple as carefully crafted prompts. Let’s break down some of the most common types of attacks:

  1. Prompt Injection: This sneaky technique involves hiding malicious instructions within seemingly innocent input. It’s like whispering to the AI when the developers aren’t looking!
  2. Training Data Extraction: Imagine if an AI could spill secrets it learned during training. Well, some attackers are trying to do just that, potentially exposing sensitive information.
  3. Adversarial Examples: These are inputs designed to fool the model, like an optical illusion for AI. A picture of a dog that the model swears is a cat? That’s an adversarial example in action.
  4. Data Poisoning: By contaminating the training data, attackers can influence the model’s behavior. It’s like teaching the AI bad habits from the start.

When Attacks Go Wild: Real-World Examples

Now, let’s look at some attacks that made headlines:

The Dall-E Hack

Remember when we thought AI-generated images were safe from text injections? Well, some clever folks found a workaround:

By telling ChatGPT to add spaces between letters and then instructing Dall-E to remove them, they bypassed the usual text limitations. Sneaky, right?

The Glue on Pizza Incident

In a bizarre turn of events, an AI model recommended using glue on pizza (For those who thought there’s nothing worse than pineapple in pizza)! 🍕🤢 This gem came from a satirical Reddit comment that somehow made its way into the training data. It’s a perfect example of how misinformation can seep into AI systems.

This is an example of recursive pollution. Where synthetic, polluted data used to create fake reality for LLMs like false news. Now, when it comes time for the next generation to find a bunch of data to build an LLM out of, where do they get it? The internet. This is a feedback loop. And what we call recursive pollution, which is an unbelievable risk to society at large and science and real facts and all sorts of things.)

Many-Shot Jailbreaking

Researchers found a way to bypass AI safety measures by overwhelming the model with hundreds of malicious dialogues. It’s like peer pressuring the AI into bad behavior! Sometimes it doesn’t take even that much effort to persuade the model.

Hidden prompts in images

Nowadays that vision models are a thing, we have even seen images that are used in input prompts that actually include a second input prompt in them which barelly visible by naked eye.

Unfiltered user input

Processing any type of request that comes in from users (especially unauthenticated users) is never a good idea. Not just for LLMs, but even for the simplest input available online. This is exactly what made a chatbot go viral, after its users found out that you could make it say anything you liked but also use it as you would use ChatGPT!

More examples

Large language models (LLMs) like ChatGPT have become incredibly powerful tools, but their default behaviors are often constrained by built-in safeguards and limitations. It’s clear that we will keep seeing creative prompting techniques that aim to unlock more expansive capabilities. Other approaches that have been explored are:

  1. Role-Playing Prompts: Instructing the AI to take on a specific persona or role that may have different constraints than the default model. Examples include “DAN” (Do Anything Now), “DUDE”, and “Mongo Tom”.
  2. Jailbreaking: Attempting to bypass the model’s ethical constraints and content policies through carefully crafted prompts. This often involves telling the AI to ignore its usual restrictions.
  3. Dual Responses: Requesting both a standard, constrained response as well as an “unfiltered” response to each query.
  4. Token Systems: Implementing a fictional token system to incentivize the AI to provide less restricted responses.
  5. Simulated Capabilities: Instructing the AI to pretend it has abilities beyond its actual capabilities, like internet access or up-to-date knowledge.
  6. Persona Switching: Using commands to switch between the default AI persona and alternative, less constrained personas.
  7. Developer Mode: Prompts that claim to activate a hidden “developer mode” with expanded capabilities.

A few useful resources you can use if you are looking for more details are the following:

  • ChatGPT “DAN” (and other “Jailbreaks”) repo
  • ASCII art jailbreak prompts paper
  • Crescendo by Microsoft paper
  • Jailbreaking Llama3 repo

Fortress AI: Securing Your Applications

So, how do we protect our AI projects from these dastardly attacks? Here are some key strategies:

  1. Input and Output Filtering: Always sanitize what goes in and comes out of your model. Trust no one, not even the AI!
  2. Red Teaming: Employ ethical hackers to stress-test your systems. It’s like hiring a professional burglar to test your home security (but legal and less nerve-wracking).
  3. Continuous Monitoring: Keep a watchful eye on your AI’s behavior. If it starts recommending glue on pizza, you know something’s up.
  4. Secure AI Framework (SAIF): Google’s introduced this framework to help build more resilient AI systems. It’s worth checking out!
  5. Automated Defenses: Implement systems that can automatically detect and respond to potential attacks. Think of it as your AI’s immune system.
  6. Avoid exposing AI directly to the world: LLMs can be used to create a variety of templates, from which an ML model selects the most appropriate one to use. This approach offers the added advantage of enabling human review of the generated content, reducing reliance on prompt engineering techniques.

Remember, securing AI is an ongoing process. As the technology evolves, so do the potential threats. Stay vigilant, stay informed, and most importantly, stay curious!

That’s all for today, folks! Keep your AIs safe, your prompts clean, and your pizzas glue-free. Until next time, happy coding! 🚀💻


Panagiotis

Written By

Panagiotis (pronounced Panayotis) is a passionate G(r)eek with experience in digital analytics projects and website implementation. Fan of clear and effective processes, automation of tasks and problem-solving technical hacks. Hands-on experience with projects ranging from small to enterprise-level companies, starting from the communication with the customers and ending with the transformation of business requirements to the final deliverable.