Meta, the tech giant behind Facebook and Instagram, finds itself at the center of a legal storm. Authors Sarah Silverman and Ta-Nehisi Coates are among the plaintiffs in Kadrey v. Meta, a case that challenges the company's practices in training its AI models. The case highlights internal discussions within Meta about using copyrighted works obtained through potentially questionable means. These discussions have raised concerns about the legality and ethics of the methods Meta's AI team used to improve its models.
Meta employees have candidly discussed using copyrighted content to train AI models, sparking legal and ethical questions. The company has been fine-tuning its AI to avoid "IP risky prompts" that could lead to copyright infringements. However, with first-party data from platforms like Facebook and Instagram falling short of requirements, Meta's leadership is contemplating revisiting past decisions on training data. The goal is to ensure that their AI models have ample training material without stepping into legally dubious territory.
One of the controversial alternatives under consideration is Libgen, a links aggregator notorious for providing access to copyrighted works without authorization. Despite Libgen's legal troubles, including multiple lawsuits and fines, Meta has reportedly considered using it as a substitute for licensed data sources. In internal conversations, a research engineer at Meta noted that "a gazillion" startups likely already use pirated books for model training, implying that the practice is common in the industry.
In an effort to secure legitimate sources of data, Meta is negotiating with the document-hosting platform Scribd and others to license copyrighted works. There are also discussions about scraping data from Reddit using methods similar to those of a third-party app called Pushshift. These strategies indicate Meta's pursuit of diverse data sources to meet its AI model training needs.
"My opinion would be (in the line of ‘ask forgiveness, not for permission’): we try to acquire the books and escalate it to execs so they make the call," said Xavier Martinet in internal discussions.
Meta's legal team claims that training AI models on IP-protected works, particularly books, falls under "fair use." This defense will likely play a crucial role as the case unfolds. To bolster it, Meta has added two Supreme Court litigators from the law firm Paul Weiss to its legal team.
"Essential to meet SOTA numbers across all categories," noted Sony Theakanath, highlighting the importance of reaching state-of-the-art performance benchmarks.
The case has been amended several times since it was filed in 2023, reflecting ongoing developments and discoveries. Internal communications reveal that Meta's lawyers are taking a less conservative approach than in the past to approving the use of publicly available data for model training.
"This is why they set up this gen ai org for [sic]: so we can be less risk averse," Martinet explained.
"We would not disclose use of Libgen datasets used to train," added Theakanath, pointing to potential confidentiality concerns.
As Meta navigates these challenges, it is clear that the stakes are high. The outcome of this case could set a precedent for how tech companies approach AI training with copyrighted materials. The balance between innovation and intellectual property rights remains delicate, with implications for both the industry and creators.
"My 2 cents again: trying to have deals with publishers directly takes a long time […]," Martinet expressed, acknowledging the complexities of securing direct agreements with content owners.