Did AI See This? Detecting Copyrighted Data in Large-Scale Models’ Training

Priberam Machine Learning Lunch Seminar

By Priberam Labs

Date and time

Tue, 11 Mar 2025 13:00 - 14:00 WET

Location

Instituto Superior Técnico, Anfiteatro PA2

1 Avenida Rovisco Pais, 1049-001 Lisboa, Portugal

About this event

Abstract:

Large-scale models are trained on massive amounts of data, yet the secrecy surrounding training datasets makes it difficult to determine whether specific content was included. In this talk, I introduce two novel approaches for addressing this challenge in the context of large language and vision-language models.

First, I present DE-COP, a method designed to detect whether copyrighted text has been included in a language model’s training data. By leveraging multiple-choice questions that contrast verbatim text with its paraphrases, DE-COP effectively exposes memorization, significantly outperforming prior methods. Unlike most existing training data detectors, it does not rely on access to token probabilities, making it fully applicable to black-box models.
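As a rough illustration of the multiple-choice idea described above (not the DE-COP implementation itself), the sketch below builds one question per passage, mixes the verbatim text with three paraphrases, and measures how often a black-box model picks the verbatim option. The names `build_question`, `verbatim_accuracy`, and `ask_model` are hypothetical, the prompt wording is an assumption, and `ask_model` stands in for whatever text-in, text-out API is available.

```python
import random
from typing import Callable, Sequence

LETTERS = "ABCD"  # assumes exactly one verbatim passage plus three paraphrases


def build_question(verbatim: str, paraphrases: Sequence[str],
                   rng: random.Random) -> tuple[str, str]:
    """Return (prompt, correct_letter) with the options in shuffled order."""
    options = [verbatim, *paraphrases]
    order = list(range(len(options)))
    rng.shuffle(order)  # shuffle so the verbatim option has no fixed position
    lines = ["Which of the following passages is the exact original text?"]
    correct = ""
    for letter, idx in zip(LETTERS, order):
        lines.append(f"{letter}. {options[idx]}")
        if idx == 0:  # index 0 is the verbatim passage
            correct = letter
    lines.append("Answer with a single letter.")
    return "\n".join(lines), correct


def verbatim_accuracy(items: Sequence[tuple[str, Sequence[str]]],
                      ask_model: Callable[[str], str],
                      seed: int = 0) -> float:
    """Fraction of passages for which the model picks the verbatim option.

    Only the model's text answer is used, so no token probabilities are needed.
    """
    rng = random.Random(seed)
    hits = 0
    for verbatim, paraphrases in items:
        prompt, correct = build_question(verbatim, paraphrases, rng)
        answer = ask_model(prompt).strip().upper()[:1]
        hits += answer == correct
    return hits / len(items)
```

With four options, a model that has never seen the passages would be expected to score near 25%; accuracy well above that on a set of passages is the kind of signal the multiple-choice setup is meant to surface.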

Then, I extend this investigation to vision-language models with DIS-CO, a new approach for identifying copyrighted visual content in training data. DIS-CO queries models with frames from movies, evaluating whether they can correctly guess the corresponding titles in free-form text generation. Using our MovieTection benchmark, built from 14,000 frames across various films, we find that many popular VLMs display clear signs of memorization, raising broader concerns about AI training practices and copyright compliance.
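The frame-to-title evaluation can be pictured with a similarly small sketch. It assumes a hypothetical `guess_title` callable that sends one frame to a vision-language model and returns its free-form answer, and it scores answers by exact match after light normalization; that matching rule is an assumption made here for illustration and may differ from what DIS-CO actually uses.

```python
import re
from collections import defaultdict
from typing import Callable, Iterable


def normalize(title: str) -> str:
    """Lowercase and collapse punctuation so 'The Film!' matches 'the film'."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def per_movie_hit_rate(frames: Iterable[tuple[str, str]],
                       guess_title: Callable[[str], str]) -> dict[str, float]:
    """For each movie, the fraction of its frames whose title was guessed.

    `frames` yields (frame_path, true_title) pairs.
    """
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for frame_path, true_title in frames:
        totals[true_title] += 1
        if normalize(guess_title(frame_path)) == normalize(true_title):
            hits[true_title] += 1
    return {title: hits[title] / totals[title] for title in totals}
```

A high hit rate on frames from a given film would point toward that film's content having been seen during training, though telling genuine memorization apart from general knowledge about a film needs a more careful comparison than this sketch provides.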


Bio:

André Duarte is a Dual Degree PhD student at Carnegie Mellon University and Instituto Superior Técnico, supervised by Prof. Lei Li and Prof. Arlindo Oliveira. His research focuses on the security and privacy of Generative AI models, with a particular emphasis on Membership Inference Attacks. André was also part of the INESC-ID team that led the development of two AI solutions for the Portuguese government. These solutions aim to accelerate human evaluation of corporate applications for European funding and of citizen reimbursement claims for energy-efficient home investments.


www.priberam.com
