Publishers Strike Data Deals with AI Companies: What It Means for Academic Authors

By Kimberly Becker

Recent developments in academic publishing have dramatically shifted the landscape for authors. Major publishers like Taylor & Francis and Wiley have forged partnerships with tech giants, aiming to leverage vast academic content repositories for AI development. This means that copyrighted materials from these publishers are now being used to train AI models – a practice I previously advised against.

As a presenter at the recent TAA conference, I discussed the ethical integration of AI in academic writing. However, these new partnerships have rendered some of my initial advice partially obsolete. In light of these changes, it’s crucial to revisit this topic and explore its implications for TAA members.

This blog post will explore the AI revolution in academic publishing, explain how Large Language Models (LLMs) work, and discuss what these developments mean for academic authors. We’ll explore the potential benefits, risks, and ethical considerations of this paradigm shift that’s happening with or without our consent.

The AI Revolution in Academic Publishing

Academic publishing is undergoing a seismic shift as major publishers forge partnerships with tech giants. Taylor & Francis’s deal with Microsoft exemplifies this trend, and Wiley has also made agreements with tech firms. The deals aim to leverage vast academic content repositories for AI development.

This means that tech companies are already using copyrighted materials from those publishers to train their models – exactly what I advised against.

Oxford University Press (OUP) and Cambridge University Press (CUP) are taking more measured approaches, but these partnerships have sparked controversy regardless. Many academics have criticized the lack of consultation and transparency, and concerns about intellectual property rights are driving calls for opt-in/opt-out mechanisms.

AI partnerships in academic publishing present a double-edged sword of innovation and risk. On the one hand, these collaborations promise to develop groundbreaking research tools, enhance discoverability of scholarly work, and foster interdisciplinary connections. As a corpus linguist, I’m particularly excited about AI’s potential to uncover hidden patterns in extensive datasets. On the other hand, I want to acknowledge that this is a huge change, and one that often evokes fear.

Many people are fearful about data privacy, others worry about the potential for AI to generate derivative works without attribution, and still others express concerns about their work being taken out of context. There’s also the looming question of how AI might impact academic job markets. Amidst all these fears, it occurs to me that few things can assuage the anxiety except knowledge. Understanding the capabilities and workings of LLMs is a key way authors can prepare to negotiate effectively and make their voices heard in response to a paradigm shift that is happening with or without their consent.

How LLMs Work: A Primer for Authors

To understand the implications of these publisher-tech company deals, it’s essential to grasp how LLMs function “under the hood.” Here’s a simplified explanation:

1. Data Ingestion and Representation:

  • LLMs don’t store full text. Instead, they convert text into mathematical representations called vectors.
  • These vectors capture the essence of the content – its meaning, context, and relationships to other concepts.
  • Think of it as creating a multidimensional semantic map of relationships.
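
To make the idea of a “semantic map” more concrete, here is a minimal Python sketch. It is a toy built on loud assumptions: the three-number vectors are hand-picked for illustration and do not come from any real model, and real LLMs learn vectors with hundreds or thousands of dimensions. The example words and the helper name `cosine_similarity` are hypothetical choices for this sketch.

```python
import math

# Hand-picked toy vectors (an assumption for illustration only).
# Real models learn these numbers from data; nobody assigns them by hand.
toy_vectors = {
    "journal": [0.9, 0.8, 0.1],
    "article": [0.8, 0.9, 0.2],
    "volcano": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Compare one related pair and one unrelated pair.
near = cosine_similarity(toy_vectors["journal"], toy_vectors["article"])
far = cosine_similarity(toy_vectors["journal"], toy_vectors["volcano"])

print(f"journal vs article: {near:.2f}")
print(f"journal vs volcano: {far:.2f}")
```

In this toy, the related pair scores much closer to 1 than the unrelated pair. That is the sense in which vectors can place similar concepts near each other on a map of meaning.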

2. Knowledge Synthesis:

  • The LLM uses these vectors to understand patterns across vast amounts of data.
  • It builds a mathematical model of meaning-based connections, forming a broader knowledge base.
  • This knowledge base is built from an incredibly large body of training data, often measured in petabytes.

3. Generation Process:

  • When prompted, the LLM doesn’t retrieve specific texts.
  • Instead, it uses its understanding of semantic patterns and relationships to create new content.
  • This content reflects its acquired knowledge but isn’t a direct copy of any source material.
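
The difference between generating and retrieving can be sketched with a deliberately tiny stand-in. This is not how an LLM is built: real models use neural networks over vectors, while this toy only counts which word follows which. The two “source” sentences and the helper `all_continuations` are invented for this illustration.

```python
from collections import defaultdict

# Two tiny stand-in "sources" (invented for this sketch).
sources = [
    "peer review improves research quality",
    "open access improves research visibility",
]

# "Training": record which word follows which. Only these pair counts are
# kept; the sentences themselves are not stored in the table.
next_words = defaultdict(lambda: defaultdict(int))
for sentence in sources:
    words = sentence.split()
    for current, following in zip(words, words[1:]):
        next_words[current][following] += 1

def all_continuations(word):
    """List every word sequence the table allows, starting from `word`.

    This simple recursion only terminates because the toy sources
    contain no loops (no word ever leads back to an earlier word).
    """
    followers = next_words.get(word)
    if not followers:
        return [[word]]
    results = []
    for following in followers:
        for tail in all_continuations(following):
            results.append([word] + tail)
    return results

generated = [" ".join(seq) for seq in all_continuations("open")]
novel = [s for s in generated if s not in sources]

print(generated)
print(novel)
```

Starting from “open,” the table allows a sentence that appears in neither source: it begins like the second source and ends like the first. Even at this toy scale, output assembled from learned patterns can blend sources instead of copying any one of them.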

4. Scale of LLMs: To put the size in perspective:

  • 1 average book ≈ 1 MB (megabyte)
  • 1 million books ≈ 1 TB (terabyte)
  • 1 billion books ≈ 1 PB (petabyte)

LLMs often work with multiple petabytes of data, equivalent to billions of books.

The U.S. Library of Congress contains about 51 million books and print materials. At roughly 1 MB per book, one petabyte could hold about 20 times that amount.
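
For readers who want to check the arithmetic, the short Python sketch below reproduces these estimates. It uses only the figures given here (about 1 MB per book and about 51 million Library of Congress items) and assumes decimal units, where 1 TB is one million MB and 1 PB is one billion MB. The variable names are my own.

```python
# Figures taken from the text; decimal storage units are an assumption.
MB_PER_BOOK = 1
MB_PER_TB = 10**6
MB_PER_PB = 10**9
LIBRARY_OF_CONGRESS_ITEMS = 51_000_000

# How many 1 MB books fit in each unit of storage.
books_per_tb = MB_PER_TB // MB_PER_BOOK
books_per_pb = MB_PER_PB // MB_PER_BOOK

# How many Library of Congress collections one petabyte could hold.
libraries_per_pb = books_per_pb / LIBRARY_OF_CONGRESS_ITEMS

print(f"Books per terabyte: {books_per_tb:,}")
print(f"Books per petabyte: {books_per_pb:,}")
print(f"Library of Congress collections per petabyte: {libraries_per_pb:.1f}")
```

The last figure works out to about 19.6, which is where “roughly 20 times” comes from.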

5. Beyond Size:

  • Raw data size isn’t the only factor in model performance.
  • Quality, diversity, and curation of the data also play crucial roles.
  • The publisher-tech company partnerships likely aim to improve these aspects.

Understanding this process helps explain why these partnerships are significant. By giving tech companies access to their published content, academic publishers are allowing this material to be incorporated into the vast datasets used for developing and refining AI technologies. This integration enriches the AI’s knowledge base with high-quality, specialized academic content.

Implications for Authors

To better understand how AI models use academic content, consider this analogy from Ted Chiang’s New Yorker article: AI models generate content like a “blurry JPEG of the web.” Let’s extend this to academic content:

  1. Loss of Fidelity: Just as a JPEG loses some detail when compressed, converting text into vectors and then generating new content results in a loss of the original’s specificity and nuance.
  2. Blending of Sources: Just as pixels blend in a blurry image, ideas and writing styles from various sources merge in the AI’s outputs, making it difficult to trace specific origins.
  3. Recognizable but NOT Identical: You might recognize elements of the original in the AI’s output, but it’s not an exact reproduction – much like you can identify objects in a blurry photo without seeing precise details.

With this analogy in mind, here are the key implications for authors:

  • No Direct Quotations:
    • LLMs typically don’t reproduce exact passages from your work.
    • They may generate content that resembles your ideas or writing style.
  • Potential for Derivative Works:
    • AI could generate content heavily influenced by your ideas.
    • This creates potential gray areas in terms of intellectual property.
  • Copyright and Attribution Challenges:
    • It becomes difficult to trace specific origins of AI-generated content.
    • Questions arise about fair use, attribution, and rights to derivative works.
  • Changing Metrics and Roles:
    • New metrics for measuring academic impact may emerge, potentially including AI system utilization of work.
    • New roles in academia, such as AI-human collaboration specialists, might develop.
  • Issues of Access, Advocacy and Engagement:
    • Be aware of potential stratification between institutions with access to AI-enhanced research tools and those without.
    • Stay informed about these developments and their implications.
    • Engage in discussions about authorial rights and academic integrity.
    • Consider advocating for opt-in/opt-out mechanisms or clearer policies regarding the use of your work in AI training.

Understanding these implications empowers you to make informed decisions about your work and to actively participate in shaping the future of academic publishing. However, these considerations also raise copyright and intellectual property questions that this author is not qualified to answer:

  • Transformative Use: Is the AI’s use of your work transformative enough to be considered fair use?
  • Attribution Challenges: How can proper attribution be ensured when the AI’s output synthesizes thousands of sources?
  • Derivative Work Rights: At what point does AI-generated content influenced by your work become a derivative work, and what rights do you have in that case?
  • Competitive Concerns: Could the AI generate content that competes with your future publications or academic outputs?

Because I have no legal background, my hope is for a follow-up blog post from someone who can help us sort through these issues.

Takeaways for this New Frontier of Academic Publishing

Partnerships between publishers and tech companies mark a significant turning point in academic publishing. As we’ve explored, these deals have far-reaching implications for authors, institutions, and the broader academic community:

  1. They challenge our traditional understanding of copyright and fair use.
  2. They raise questions about the nature of authorship and attribution.
  3. They present both opportunities for innovation and risks to academic integrity.

As academic authors, it’s vital to:

  • Stay informed about these rapidly evolving developments.
  • Engage in discussions about the ethical use of AI in academic writing.
  • Advocate for transparent policies and opt-in/opt-out mechanisms for the use of your work in AI training.
  • Consider how these changes might affect your publishing strategies and career development.

While the full impact of these partnerships is yet to be seen, one thing is clear: the landscape of academic publishing is changing, and authors need to be proactive in shaping its future. By understanding the technology, engaging in meaningful dialogue, and advocating for our rights, we can help ensure that the integration of AI in academic publishing enhances rather than undermines the values of scholarly work.

As we move forward, it will be crucial to balance the potential benefits of AI-enhanced research tools with the protection of intellectual property rights and the preservation of academic integrity. This is not just a technological shift, but a fundamental reimagining of how knowledge is created, disseminated, and valued in the academic world.

The conversation is just beginning, and your voice matters. Stay engaged, stay informed, and be part of shaping the future of academic publishing in the age of AI.


Kimberly Becker, Co-founder of Moxie, is an applied linguist who specializes in disciplinary academic writing and English for research publication purposes. She has a Ph.D. in Applied Linguistics and Technology (Iowa State University, 2022) and an M.A. in Teaching English as a Second Language (Northern Arizona University, 2004). Kimberly’s research and teaching experience as a professor and communication consultant has equipped her to support native and non-native English speakers in written, oral, visual, and electronic communication. Her most recent publications are related to the use of ethical AI for automated writing evaluation and a co-authored e-book titled Preparing to Publish, about composing academic research manuscripts. Kimberly’s research can be viewed on ResearchGate.

Please note that all content on this site is copyrighted by the Textbook & Academic Authors Association (TAA). Individual articles may be reposted and/or printed in non-commercial publications provided you include the byline (if applicable), the entire article without alterations, and this copyright notice: “© 2024, Textbook & Academic Authors Association (TAA). Originally published on the TAA Blog, Abstract on [Date, Issue, Number].” A copy of the issue in which the article is reprinted, or a link to the blog or online site, should be mailed to Kim Pawlak, P.O. Box 337, Cochrane, WI 54622 or Kim.Pawlak@taaonline.net.