Redefining Originality

Note: I originally wrote this position paper in December 2023, so some references may (already) be a bit dated.

In November 2022, OpenAI announced the immediate availability of its artificially intelligent “chatbot” ChatGPT, the latest in its series of ever-expanding Large Language Models (LLMs) capable of producing novel, human-like text. OpenAI gave the general public access to ChatGPT in order to solicit user feedback on its strengths and weaknesses, and it did not take long before the chatbot took the technology world by storm (OpenAI, 2022). With its uncanny ability to produce high-quality, coherent, long-form written work, it ushered in a spate of conversations regarding the ethics of Artificial Intelligence (AI), including plagiarism concerns within academic circles about students passing off LLM-generated content as their own. However, in the months that followed, as the public explored the technology behind LLMs more deeply, a new concern arose: whether the LLMs themselves could be considered plagiarists. With the advent of LLMs capable of producing text that is largely indistinguishable from human authorship, it is crucial that society reconsider its position on how computer-generated (and computer-assisted) content is evaluated for originality.

While Large Language Models have caught the public’s attention in the twelve months since the initial release of OpenAI’s ChatGPT, LLMs have been in active development for many years. Most modern LLMs can trace their basic DNA back to a group of researchers at Google who published a seminal paper in 2017 titled “Attention Is All You Need,” which laid out the basic algorithmic approach for designing these engines (Vaswani et al., 2017). Using the techniques described in that paper, Large Language Models are reared through a training process, wherein massive amounts of text are fed into a machine learning algorithm that encodes word orders, frequencies, and contextual meanings into (effectively) a giant mathematical function. Breakthroughs over the past decade such as Google’s “attention” architecture, as well as continued advancements in the availability and performance of the underlying computer hardware, have led to massive LLMs whose functions contain hundreds of billions of discrete, tunable parameters. Trained on hundreds of millions of pieces of content over a period of months or years, these models require significant upfront capital investment, limiting their creation to a handful of “hyperscale” technology companies such as Google, Meta/Facebook, and OpenAI/Microsoft.
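
To make the “giant mathematical function” a bit more concrete, the following is a minimal sketch (in Python with NumPy) of the scaled dot-product attention operation that gives the Vaswani et al. (2017) architecture its name. The dimensions, random weights, and variable names here are illustrative assumptions for demonstration only, not code from any actual production model.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Blend each token's value vector according to how strongly its query matches every key."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # pairwise query/key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over the keys
    return weights @ V                                   # context-weighted mix of values

# Toy setup: 4 tokens, each represented by an 8-dimensional embedding (sizes are arbitrary).
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))  # stand-ins for learned parameters
contextualized = scaled_dot_product_attention(tokens @ W_q, tokens @ W_k, tokens @ W_v)
print(contextualized.shape)  # (4, 8): each token re-encoded in the context of the others
```

In a real LLM this operation is stacked in many layers and repeated across billions of learned weights, but the underlying principle is the same.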

The specific content used for training these LLMs is seldom, if ever, disclosed by the companies that produce them. While there is a huge trove of public-domain information on the internet – sites such as the online encyclopedia Wikipedia and the library of classic books at Project Gutenberg – the training data used to produce the hyperscale models also includes copyrighted content and information behind publisher paywalls (Golding, 2023). Matthew Sag, a Professor of Law in Artificial Intelligence at Emory University, suggests in his testimony to the Senate Judiciary Committee that “[t]here is no question that the best known LLMs today were built… with very little or no regard to whether those works were subject to copyright and whether the authors would object” (Sag, 2023).

In fact, authors have objected. Beginning in July 2023, notable celebrities and authors – including comedian Sarah Silverman and writers Margaret Atwood and George R.R. Martin – have either sued the makers of these models or publicly accused them of “exploitative practices… by creating software that mimics and regurgitates their language, style and ideas” (Thorbecke, 2023). This allegation of “mimicking” an author’s work is at the heart of the debate surrounding ChatGPT (et al.), as it challenges the precedents relied upon by their creators: those afforded by the United States doctrine known as fair use.

The fair use doctrine, codified in the Copyright Act of 1976, allows for the repurposing of copyrighted work for so-called “non-expressive use.” Non-expressive use broadly permits copying copyright-protected work so long as the copy does not communicate the author’s original expression to a new audience (Sag, 2023). It is this fair use doctrine which permits activities such as news reporting, academic research, and public commentary to draw on copyrighted material. The doctrine was also instrumental in the rise of search engines in the nascent days of the internet, granting them the ability to copy and index materials found on the world wide web, and it is the cornerstone of the defense proffered by OpenAI, Meta, et al. in the lawsuits brought by Silverman and others. Ingesting large swaths of material from the internet to mine data and train an AI, they allege, is no different than indexing that same material to compile a search engine database.

However, unlike prior examples of fair use, it is difficult to disentangle the authors’ expressiveness from the methods AI companies use to compile their language models. In fact, the language models are designed specifically to mimic human authors, via the patterns they discern from reading hundreds of millions of pieces of published work. This stands in sharp contrast to previous large-scale fair use endeavors, such as Google’s search engine and its Google Books project, which according to journalist Mathew Ingram (2023) “make provisions to acknowledge factual information about the copyrighted works and link out to where users can find them” – i.e., providing a public service – rather than producing a generated, citation-free doppelganger as a stand-in.

When discussing the impact that “Generative AI” (computer software capable of generating text, art, music, etc.) has on news media, the Copyright Alliance, a nonprofit group representing content creators, argued that AI reproductions of copyrighted material create market substitutes for the original work. Attorney Lloyd Jassin, a legal specialist in the publishing industry, refers to this as “artful plagiarism,” a Dickensian homage he coined to describe an LLM’s uncanny ability to approximate its source material without necessarily copying it outright. He continues, “insofar as text generated in response to a prompt is not substantially similar – a legal term of art – to the data it is scraping, it is not an infringement” (Jassin, 2023).

Indeed, as LLMs are trained on hundreds of millions of documents and “learn” the nuances of written language across the panoply of styles and literary voices contained in the training material, the voice of the Generative AI tends to be distinct from that of any single contributing author in its training corpus. (While it is certainly an oversimplification, it is not unreasonable to think of the AI’s “voice” as the mathematical average of its source material.) To wit, the Law of Large Numbers serves as a bellwether for LLM output: the more material used to build the model, the closer to the “literary mean” the results should get. And make no mistake, the companies producing these models have voracious appetites for new, human-generated content.
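
To illustrate the averaging analogy (and it is only an analogy – real models do not literally average their training texts), here is a short Python simulation under an invented assumption that each training document contributes a one-dimensional “style score.” The larger the corpus, the closer its blend sits to the population mean.

```python
import numpy as np

# Hypothetical: each document's "style" is a draw from some population of authors;
# the model's blended voice is modeled as the average of the corpus it was trained on.
rng = np.random.default_rng(42)
literary_mean = 0.0                                   # the hypothetical population mean
for corpus_size in (10, 1_000, 100_000):
    styles = rng.normal(loc=literary_mean, scale=1.0, size=corpus_size)
    blended_voice = styles.mean()
    print(f"{corpus_size:>7} documents -> blended voice = {blended_voice:+.4f}")
# Ten documents can leave the blend close to one individual author's voice;
# a hundred thousand leave it nearly indistinguishable from the "literary mean."
```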

But what happens when the Law of Large Numbers doesn’t apply? Blogger Michal Zalewski (2023), who writes under the name “lcamtuf,” often posts on self-described “exotic topics” – ephemera such as the history of Communist-themed comic strips and the best types of resin for casting CNC molds. Recognizing that the communities around these hobbies are small, with little in the way of content available online, he probed several of the public LLMs to see what they could produce about his esoteric interests. His experimentation resulted in text output strikingly close – or, as Jassin might say, substantially similar – to his own original text, as seen in Figure 1 (Zalewski’s original blog) and Figure 2 (an interaction between Zalewski and Google’s “Bard” LLM).

Figure 1. Note: Figure from Zalewski’s blog (2023).
Figure 2. Note: Figure from Zalewski’s blog (2023).

In these excerpts it seems undeniable that Google’s LLM was trained on Zalewski’s blog post: much of the resin’s detail is repeated verbatim, including punctuation – “reasonable pot life (8 minutes),” “quick demold time (2-3 hours for smaller parts)” – as is the structure and cadence of the phrasing. It is also self-evident that a human submitting the work presented in Figure 2 without proper citation would be guilty of plagiarism.

Zalewski’s research uncovers an example not of artful plagiarism, but of accidental plagiarism. Considered the most widespread type of plagiarism, accidental plagiarism occurs when policies and procedures for academic citation are followed incorrectly or – as is the case with Bard’s output above – ignored entirely (Copyleaks, 2020). The term “accidental” may seem to imply that such an offense can be overlooked or waived; however, when it comes to punitive repercussions, there is no distinction between accidental and intentional plagiarism.

Unlike student creations, however, LLM output is legally evaluated using a different set of rules. Consider Naruto v. Slater, a case stemming from a 2011 “monkey selfie,” in which PETA – on behalf of the monkey, named Naruto – sued photographer David Slater for claiming copyright, under his own name, in a photo the monkey took using Slater’s equipment. In his summary of the case, attorney and emeritus law professor Henry Perritt, Jr. explains in “Copyright for Robots?” (2023) that the district and appellate courts’ decisions establish that claims of authorship by non-humans ultimately fail for lack of legal standing. The consequence of this lack of standing is that LLMs – at least in the eyes of today’s legal environment – are incapable of being charged with plagiarism or copyright infringement.

Given the enormous potential that AI has for revolutionizing many critical fields with near-term benefit to humankind (e.g., education, healthcare, agriculture), it would be irresponsible to stifle the current creative explosion in its research and development. However, a reasonable compromise to ameliorate concerns over creators losing revenue to market substitutes (or losing motivation to create in the first place) is for AI producers to divulge the content on which their models were trained, and to create an “opt out” program for content creators whose copyrighted work should be excluded. Further, content produced by AI should be watermarked as such, to avoid market confusion surrounding AI-generated content that mimics a human creator (intentionally or otherwise). As part of a pledge to the White House, several prominent technology titans, including OpenAI and Google, are currently pursuing such a watermarking strategy (Bartz & Hu, 2023).

The release of ChatGPT ushered in a media frenzy about the potential, and risk, of Artificial Intelligence. In its guise as a chatbot, OpenAI’s LLM took the public by storm with its uncanny ability to interact with users in a near-human capacity. AI researchers at the world’s top technology companies are continuing to iterate and innovate on today’s platforms, and we are certain to see continued improvement in the capabilities of AI in the coming years. The fuel for this AI revolution is data – human-generated data. While an AI engine may merely be a large mathematical model under the hood, its output is unequivocally derived from its input. As researchers invest in more lifelike models, they will also need to seek out more unique source material. This, in turn, will improve the ability of these AI models to produce on-demand content that yields substantially similar market substitutes.

With responsible disclosure of training content and watermarks to distinguish human- from machine-created data, the provenance of information found on the internet can be maintained in a manner similar to how search engines treated data during their rise to prominence in the late 20th century: providing access to information as a service while crediting its original author. While AI-generated content will continue to stoke the flames of the copyright/fair use debate, its power to transform is still in its infancy, and we should expect additional challenges ahead as technologists march toward artificial general intelligence.

(ChatGPT, Personal communication, December 5, 2023).

References

Accidental Plagiarism: Understanding and Avoiding It. (n.d.). Copyleaks. Retrieved November 20, 2023, from https://copyleaks.com/blog/accidental-plagiarism-understanding-and-avoiding-it

Artificial Intelligence and Intellectual Property – Part II: Copyright and Artificial Intelligence. 118th Cong. (2023). (testimony of Matthew Sag) https://www.judiciary.senate.gov/download/2023-07-12-pm-testimony-sag

Bartz, D., & Hu, K. (2023, July 21). OpenAI, Google, others pledge to watermark AI content for safety, White House says. Reuters. https://www.reuters.com/technology/openai-google-others-pledge-watermark-ai-content-safety-white-house-2023-07-21/

Currie, G. M. (2023). Academic integrity and artificial intelligence: Is ChatGPT hype, hero or heresy? Seminars in Nuclear Medicine, 53(5), 719–730. https://doi.org/10.1053/j.semnuclmed.2023.04.008

Golding, Y. R. (2023, October 18). The news media and AI: A new front in copyright law. Columbia Journalism Review. Retrieved October 30, 2023, from https://www.cjr.org/business_of_news/data-scraping-ai-litigation-lawsuit-artists-authors.php

Ingram, M. (2023, October 26). An AI engine scans a book. Is that copyright infringement or fair use? Columbia Journalism Review. Retrieved October 30, 2023, from https://www.cjr.org/the_media_today/an-ai-engine-scans-a-book-is-that-copyright-infringement-or-fair-use.php

Jassin, L. J. (2023). Generative AI vs. Copyright. Publishers Weekly, 270(39), 96.

OpenAI. (2022, November 30). Introducing ChatGPT. Retrieved November 20, 2023, from https://openai.com/blog/chatgpt

OpenAI. (2023). ChatGPT (Dec 4 version) [Large language model]. https://chat.openai.com/chat

Pazzanese, C. (2023, September 21). Key issues in writers’ case against OpenAI explained. Harvard Gazette. https://news.harvard.edu/gazette/story/2023/09/key-issues-in-writers-case-against-openai-explained/

Perritt Jr., H. H. (2023). Copyright for Robots? Indiana Law Review, 57(1), 139–198.

Thorbecke, C. (2023, July 10). Sarah Silverman sues OpenAI and Meta alleging copyright infringement | CNN Business. CNN. https://www.cnn.com/2023/07/10/tech/sarah-silverman-openai-meta-lawsuit/index.html

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017, July 12). Attention Is All You Need (arXiv:1706.03762). arXiv. https://doi.org/10.48550/arXiv.1706.03762

Walsh, B. (2023, October 25). Why I let an AI chatbot train on my book. Vox. https://www.vox.com/future-perfect/2023/10/25/23930683/openai-chatgpt-george-rr-martin-john-grisham-ai-safety-artificial-intelligence-copyright-law-chatbot

Whittington, T. (2016, October 7). How Many Books Does the Average Person Read? Iris. Retrieved October 30, 2023, from https://irisreading.com/how-many-books-does-the-average-person-read/

Zalewski, M. (2023, May 15). LLMs and plagiarism: A case study [Substack newsletter]. Lcamtuf’s Thing. https://lcamtuf.substack.com/p/large-language-models-and-plagiarism