Date: 2024-04-21

As we head into the 2024 RSAC through Defcon/BlackHat Conference jag, we take a look at the final report from the first-of-its-kind Generative AI Red Team Challenge, held last year in the AI Village at Defcon31. The challenge was a jeopardy-style CTF competition that asked participants to break through the guardrails within eight different LLMs – with an eye toward identifying issues in information integrity, privacy, and societal harm. Find an overview of the event report here.

We share the executive summary of the report in its entirety here. It is worth a read:

As AI technologies become increasingly integrated into people’s lives, understanding how to build systems for oversight and governance is essential. The paradigm of “red teaming,” or intentionally seeking to break safety barriers on a technology to understand its capabilities, limitations, and how it can be improved, is currently popular within major AI labs. However, these labs typically operate in a closed-door setting, limiting who has a voice in the design and evaluation of the technology.

While closed-door testing is in some cases necessary for security and intellectual property protection, it creates an environment where verification – or assurance – of model capabilities is defined and tested by the creators. There is an opportunity for external groups, such as government or civil society entities, to use red teaming to create smarter policies and evidence-based regulations and standards.

A democratic governance of technology requires broad engagement with diverse stakeholders and centering the perspective and needs of the people on whom technology will ultimately be used rather than the designers. To that end, Humane Intelligence, Seed AI, and AI Village partnered to hold the first public red teaming event for closed-source API models at DEF CON 2023.

Red teaming models for biases and other social harms is difficult, because these harms depend on context and are therefore hard to define. Methods of structured public feedback, such as public red teaming, enable an approximation of contextual data from a larger audience to gather more nuance. We also demonstrated how these types of exercises can be used to operationalize a set of values, such as those in the NIST AI RMF. Our exercise was an operationalization of the White House Office of Science and Technology Policy’s Blueprint for an AI Bill of Rights (1), and we are grateful for their sponsorship.

Our paper provides some insights into the potential and promise of public red teaming, framed around the Generative AI Red Teaming Challenge conducted at AI Village within DEF CON 31. Our event and analysis, the first of their kind, study, at scale, the performance of eight state-of-the-art large language models (LLMs). In doing so, we observe the performance of LLMs as a class of models, approximating real-world scenarios where harmful outcomes may occur. By collecting this analysis and data at scale, we identify macro-level trends in strategies, approaches, and systemic performance.

The authors of this report represent the collaborative efforts we aspire to see in industry. We aspired to draw from internal best practices and knowledge at LLM developer companies, but provide the external validity of government and civil society expertise. While the authors represent independent entities (Humane Intelligence) and corporate entities (Cohere and Google), our analysis was conducted in an independent manner. This report was provided in advance to all of our design partners (civil society, government, and corporate) for review.

From the Overall Summary of the Report

Humane Intelligence, a tech nonprofit dedicated to building community around algorithmic assessment, is publishing the findings from the largest-ever Generative AI Public red teaming event for closed-source API models. This event was developed in collaboration with Seed AI and DEFCON AI Village, and held at DEFCON 2023. Over 2.5 days, 2,244 hackers evaluated 8 LLMs and produced over 17,000 conversations on 21 topics ranging from cybersecurity hacks to misinformation and human rights. Our winners received a GPU provided by our partners at NVIDIA.

Our analysis divided the questions into four broad categories: Factuality, Bias, Misdirection, and Cybersecurity.

Key findings from the data

The most successful strategies were ones that are hard to distinguish from traditional prompt engineering, emphasizing the dual nature of this technology. Asking the model to role play or ‘write a story’ was successful. In addition, a user acting authoritatively on a topic could engineer the model to provide ‘agreeable’ output, even if incorrect.

Human behavior can inadvertently result in biased outcomes. People interact with language models in a more conversational manner than with search engines. As a result, methods of social engineering used by hackers are similar to the ‘natural’ and ‘conversational’ way people interact with LLMs – where they share their preferences or personal details to provide context. In other words, innocent actors may accidentally socially engineer the model to give them the answer they want to hear, rather than a factual answer.

Unlike other algorithmic systems – notably social media models – the LLMs did not further radicalize users when provided with aggressive content. In most cases, the model matched the harmfulness of the user query, which can result in reinforcing the user’s world view. In a few cases, the model even de-escalated.

The full report is available here.

What Next?

From the report:

Analysis challenges

While this event was notable for having 8 different models, this did pose a challenge for the analysis of results. In previous red teaming approaches, the focus was largely on a couple of models that were identified. In contrast, this dataset consisted of generations that were not labeled by vendor. The text generation APIs themselves were not equivalent. Some vendors provided research models that had little safety training, whereas other vendors provided systems that combined a model with other services, such as additional safety layers. However, this is largely representative of the current AI ecosystem, where a mix of capabilities exists for users, not just totally open sourced, or totally cordoned source systems.

Encouraging Future Research

This transparency report is a preliminary exploration of what is possible from these events and datasets. Additional research will be critical for further understanding trends in LLMs, in particular as they relate to societal impact. At-scale data collection is valuable for distinguishing systemic harms from low-likelihood ones. This data can now be used as a benchmark: for example, vendors can use this dataset for distance analytics on measures like refusal or toxicity.
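As a rough illustration of what a refusal measure over such a dataset might look like, here is a minimal Python sketch. It is not the report's method. The record fields (`model`, `response`), the marker phrases, and the sample data are all hypothetical, and the keyword heuristic stands in for whatever classifier a vendor would actually use.

```python
from collections import defaultdict

# Hypothetical refusal phrases; a real analysis would likely use a trained
# classifier rather than keyword matching.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am unable")


def is_refusal(response: str) -> bool:
    """Return True if the response contains any of the assumed refusal phrases."""
    # Normalize case and curly apostrophes so "I’m sorry" matches "i'm sorry".
    text = response.lower().replace("\u2019", "'")
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rate_by_model(records):
    """Compute the fraction of responses flagged as refusals for each model.

    `records` is an iterable of dicts with hypothetical keys "model"
    (an anonymized model label) and "response" (the generated text).
    """
    totals = defaultdict(int)
    refusals = defaultdict(int)
    for record in records:
        model = record["model"]
        totals[model] += 1
        if is_refusal(record["response"]):
            refusals[model] += 1
    return {model: refusals[model] / totals[model] for model in totals}


# Made-up sample records, for illustration only.
sample = [
    {"model": "model_a", "response": "I'm sorry, but I can't help with that."},
    {"model": "model_a", "response": "Sure, here is a short story."},
    {"model": "model_b", "response": "I cannot assist with this request."},
]
rates = refusal_rate_by_model(sample)
```

Per-model rates computed this way could then be compared against the same measure on a vendor's own traffic, which is one plausible reading of the benchmark use described above.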

This dataset is now the largest semi-public dataset of multi-turn, multi-model conversations, and the first of its kind. The dataset is available on the Humane Intelligence GitHub repo, and this report and analysis are available on www.humane-intelligence.org/GRT. We hope for, and anticipate, future collaborative events that will replicate this level of analysis and interaction with the general public to appreciate the wide range of impact LLMs may have on society.

OODA Loop would like to thank the report’s authors and the participating companies and public-sector partners of the 2023 Generative AI Red Team Challenge. We look forward to the 2024 edition of the challenge. See you in Vegas in August at the canceled and now uncanceled DEFCON 2024.

Authors and Acknowledgements

Authors

Victor Storchan, Ravin Kumar, Rumman Chowdhury, Seraphina Goldfarb-Tarrant, and Sven Cattell

Acknowledgments

This effort was a collaboration across industry, civil society, and government to align on addressing the pressing issues of Generative AI algorithms. We would like to thank our partner companies, community partners, and public sector partners. In addition, we would like to thank Stella Biderman and Aviya Skowron of EleutherAI for their input and guidance in developing this report.




