Hackers get AI to share credit card info and endorse hate speech

The massive red teaming event was designed to aid responsible AI development.

August 19, 2023

This article is an installment of Future Explored, a weekly guide to world-changing technology.

More than 2,000 people just gathered in Las Vegas to try to hack some of the most advanced AIs in the world — with their developers’ blessing.

The challenge

Generative AIs are software programs trained to create new content in response to user prompts — this might be images, videos, text, or even computer code.

This field of machine learning has exploded in popularity over the past year, thanks to the release of several highly impressive systems. Now, seemingly every company is developing its own generative AI or integrating one into its products, while workers use these systems to increase efficiency and artists treat them like futuristic new collaborators.

The news surrounding generative AI hasn’t been all positive, though — the FTC noted in March that bad actors are using the systems to draft phishing emails, clone voices for scams, and produce fake images to spread misinformation.

It’s also proven incredibly easy to “jailbreak” some of the most popular generative AIs, getting around safeguards and convincing the systems to do things they aren’t supposed to do, like write out instructions for making a bomb.

The AIs may also demonstrate bias or promote harmful stereotypes, not because someone is pushing them to be racist or sexist, but simply because they were trained on data exhibiting those same biases, such as text scraped from across the internet.

So, how do we continue to reap the benefits of generative AI while minimizing all this potential for harm?

Helpful hackers

Once developers know about vulnerabilities in their AIs, they can try to shore them up — when OpenAI found out people were jailbreaking ChatGPT by asking it to “roleplay” as an AI that doesn’t have any safeguards, it was able to put in place new rules preventing the workaround.

Identifying every possible vulnerability in a generative AI is a huge and never-ending undertaking, though, and developers clearly cannot do it all on their own, so they recently sought help from the hacking community.

In May, the Biden administration announced that it was meeting with industry leaders as part of ongoing efforts to promote “responsible AI innovation,” meaning the development of AI that “serves the public good, while protecting our society, security, and economy.”

The administration also announced that several major AI developers had committed to participating in a public evaluation of their generative AIs at DEFCON, an annual conference focused on hacking and cybersecurity, set to take place in Las Vegas on August 10-13.

Dubbed the “Generative Red Team (GRT) Challenge,” this event would give DEFCON attendees a chance to win prizes in exchange for exploiting vulnerabilities in participants’ generative AIs.

This approach to testing software — essentially pretending to be a “bad guy” — is called “red teaming.” Developers often do it internally to find problems, but the DEFCON challenge was expected to be the biggest public red teaming event focused on generative AI.

“This independent exercise will provide critical information to researchers and the public about the impacts of these models,” according to the release from the White House, “and will enable AI companies and developers to take steps to fix issues found in those models.”

The GRT Challenge

The DEFCON event was as big a success as hoped, with an estimated 2,200 people taking part over the weekend.

Each participant was shown a Jeopardy-style board with categories, such as “political misinformation” and “defamatory claims,” each with challenges worth varying point values under them — the more difficult the challenge, the greater its point value.

Participants then had 50 minutes on a secured Google Chromebook to complete as many challenges as they could using a randomly assigned text-generating AI developed by Anthropic, Cohere, Google, Hugging Face, Meta, NVIDIA, OpenAI, or Stability AI.

At the end of the weekend, the four people with the most points went home with an NVIDIA RTX A6000 GPU, which retails for about $4,650.

The exact vulnerabilities identified during DEFCON won’t be revealed until February. This is so AI developers have time to fix the issues before letting everyone know they exist — and according to various reports, there’s a lot to fix.

“This is computer security 30 years ago,” Bruce Schneier, a Harvard public-interest technologist, told Fortune after the event. “We’re just breaking stuff left and right.”

Participant Kennedy Mays told Bloomberg she was able to get her AI to produce “bad math,” saying that 9 + 10 = 21. She was also able to get the AI to endorse hateful speech by asking it to consider the First Amendment of the Constitution from the perspective of a KKK member.

Another participant told Bloomberg they were able to get their AI to divulge what appeared to be someone’s credit card details, while one of Bloomberg’s own reporters was able to get a generative AI to provide instructions for how the government could surveil a human-rights activist surreptitiously.

What’s next?

Companies that participated in the GRT Challenge will now spend months addressing the vulnerabilities identified during the event, but developing responsible AI is going to be an ongoing, multi-pronged process.

Some of that process is detailed in guidelines recently released by the Biden administration and built around the principles of responsible AI, to which most of the participants in the DEFCON event voluntarily committed in July.

Included in the guidelines is a commitment to the regular red teaming of AIs, with a focus on areas such as potential misuse, biases, and threats to national security. Signatories also promise to develop and deploy features that let users know content is AI-generated (such as watermarks), share discoveries of new vulnerabilities with others in the industry, and offer incentives for third parties to find and report previously unknown issues in their AIs.

Even if the entire generative AI industry followed those guidelines to the letter, though, releasing a system with no flaws or potential for misuse may be an impossible task, and as more people start to rely on the systems, new issues are undoubtedly going to crop up. Just like security vulnerabilities in iPhones and PCs, these will have to be addressed continuously through updates.

Ultimately, if we want to take advantage of generative AIs, we may need to accept that the programs are sometimes going to behave badly and that some people are going to use them nefariously. At the same time, we need our leaders to keep the pressure on developers to adhere to the principles of responsible AI, through both voluntary commitments and new legislation.
