OpenAI has to open training data
September 25, 2024
OpenAI will provide access to its training data for review of whether copyrighted works were used to power its technology. In a court filing, authors in a class action suing the firm indicated that they came to terms on protocols for inspection of the information. They’ll seek details related to the incorporation of their works in training datasets.
The agreement flows from three lawsuits accusing OpenAI of harvesting mass quantities of content from the web, which was then allegedly used by ChatGPT to produce copyright-infringing answers. Although several claims alleging copyright theft have been dismissed, the writers’ claim for direct copyright infringement remains.
OpenAI has said that it trains its model on “large, publicly available datasets that include copyrighted works.” Last year, it stopped disclosing those materials in an attempt to maintain an advantage over competitors. While it remains unknown which works were used, the authors pointed to ChatGPT generating summaries and in-depth analyses of the themes in their novels. They claimed that the company downloaded hundreds of thousands of books from shadow library sites to train its AI system.
Under the agreement, the training datasets will be made available at OpenAI’s San Francisco office on a secured computer without internet or network access. Anyone who reviews the information will be required to sign a non-disclosure agreement, sign a visitor’s log and provide identification.
“The Inspecting Party’s counsel and/or experts may take handwritten notes or electronic notes on the provided note-taking computer in scratch files, but may not copy any Training Data itself into any notes,” the filing states.