Researchers from Google DeepMind, OpenAI, ETH Zurich, McGill University, and the University of Washington have collaborated to develop a novel attack method aimed at extracting crucial architectural information from proprietary large language models (LLMs) like ChatGPT and Google PaLM-2.
This research sheds light on how adversaries can uncover ostensibly concealed details of an LLM-backed chatbot, potentially allowing them to replicate, or outright steal, its functionality. The attack, described in a technical report released this week, is the latest in a series of findings exposing weaknesses that developers of AI tools must address even as adoption of their products accelerates.
The researchers note that limited public knowledge exists about the inner workings of large language models such as GPT-4, Gemini, and Claude 2. Developers of these technologies have intentionally withheld key details regarding their training data, methodologies, and decision logic for competitive and safety reasons.
Despite this secrecy, the models are accessible through APIs, which allow developers to integrate AI-powered tools like ChatGPT into their applications, products, and services. These APIs enable developers to leverage AI models for various use cases, including building virtual assistants, automating workflows, generating content, and responding to domain-specific queries.
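As an illustration of that access, the snippet below is a minimal, hypothetical example of calling such an API with OpenAI's official Python client; the model name, prompt, and assistant role are placeholders rather than details taken from the research.

```python
# Hedged sketch of a typical chat-completion API call; requires the `openai`
# package and an OPENAI_API_KEY environment variable. The model name and
# prompts are placeholders for illustration only.
from openai import OpenAI

client = OpenAI()  # reads the API key from the environment

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # any chat-capable model exposed through the API
    messages=[
        {"role": "system", "content": "You are a helpful support assistant."},
        {"role": "user", "content": "Summarize our refund policy in two sentences."},
    ],
)

print(response.choices[0].message.content)
```

Notably, everything the researchers' attack needs flows through this same kind of query interface; no privileged access to the model's weights or infrastructure is required.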
Unlike previous attacks that tried to extract model data by feeding carefully crafted prompts into the input layer, the researchers took a "top-down" approach: they directed their queries at the final layer of the neural network, the layer responsible for generating output predictions from the model's internal representations.
This last layer, known as the "embedding projection layer," maps the model's final internal representation onto its vocabulary to produce a score for every possible output token. Extracting information from this layer gives attackers valuable insight into the model's internal workings, enabling them to devise more effective attacks, reverse engineer the model, or manipulate its behaviour.
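To make that concrete, the following is a minimal NumPy sketch of the principle behind this kind of attack, using a simulated model with made-up dimensions rather than real API queries: every logit vector the model returns is the projection matrix applied to some hidden state, so a stack of enough logit vectors has numerical rank equal to the hidden dimension, and its singular value decomposition recovers the projection matrix up to an unknown linear transform.

```python
# Simulation only: W plays the role of the secret embedding projection matrix,
# and query_logits() stands in for an API call that returns full logits.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_dim = 1000, 64  # illustrative sizes, not real model values

W = rng.normal(size=(vocab_size, hidden_dim))  # the "secret" projection matrix


def query_logits(prompt_id: int) -> np.ndarray:
    """Stand-in for one API query returning the full logit vector."""
    h = rng.normal(size=hidden_dim)  # final hidden state for this prompt
    return W @ h                     # logits = projection of the hidden state


# Issue more queries than the (unknown) hidden dimension and stack the results.
n_queries = 200
Q = np.stack([query_logits(i) for i in range(n_queries)], axis=1)  # (vocab_size, n_queries)

# The singular values of Q drop to numerical noise after hidden_dim entries,
# so counting the significant ones reveals the hidden dimension.
s = np.linalg.svd(Q, compute_uv=False)
recovered_dim = int(np.sum(s > 1e-6 * s[0]))
print(f"recovered hidden dimension: {recovered_dim}")  # prints 64

# The leading left singular vectors span the column space of W, i.e. they
# recover the projection matrix up to an unknown hidden_dim x hidden_dim transform.
U, _, _ = np.linalg.svd(Q, full_matrices=False)
W_recovered = U[:, :recovered_dim]
```

In this toy setting the recovered dimension matches exactly; against a real API, the report describes additional work to cope with limited logit access and numerical noise, but the underlying linear-algebra idea is the same.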
By attacking this last layer, the researchers successfully extracted substantial proprietary information from several production LLMs. They found that, for a relatively low cost, they could extract the entire projection matrices of models such as OpenAI's ada and babbage. They also recovered the exact hidden dimension size of the gpt-3.5-turbo model and estimated what it would cost to recover its entire projection matrix.
While the researchers described their attack as only partially successful in retrieving targeted AI model parameters, the fact that it was possible to extract any parameters from a production model raises concerns. It suggests that future iterations of this attack could potentially yield more extensive information.
In recent months, various reports have highlighted vulnerabilities in popular GenAI models. For instance, researchers at HiddenLayer demonstrated how they could manipulate Google's Gemini technology by sending carefully crafted prompts, while others have found similar methods to manipulate ChatGPT. In December, researchers from Google DeepMind and elsewhere showed how they could extract ChatGPT's hidden training data simply by prompting it to repeat certain words incessantly.