
    Publishers push Common Crawl to stop collecting content for AI training

    By admin | June 11, 2026

    Digital Content Next (DCN) sent the Common Crawl Foundation a cease-and-desist letter demanding that it stop scraping and distributing protected publisher content.

    The U.S. trade group, which represents major digital publishers including the AP, the New York Times, NBCUniversal, Bloomberg, NPR, and Fox, also asked Common Crawl to remove DCN members’ content from its datasets, including paywalled and subscriber-only news articles.

    Publishers question opt-outs. DCN’s lawyers raised concerns about whether Common Crawl honored publisher opt-out requests and removed older content when asked.

    • The letter said Common Crawl had, in some cases, told publishers it was complying, only to later say technical costs and delays prevented full removal. DCN’s lawyers said they were reviewing whether those statements may have been inaccurate or misleading.
    • Common Crawl publishes a registry of sites that have opted out of scraping. The list includes many large news publishers.

    DCN alleges infringement. The letter argued that copyright law is not an opt-out system. DCN said Common Crawl “flagrantly infringed” publisher copyrights by creating and distributing datasets containing protected content without permission or compensation.

    • The group also said Common Crawl made that content available to companies developing AI tools and large language models.
    • DCN CEO Jason Kint said the legal notice challenges the idea that online content can be collected, stored, and reused simply because it is accessible.

    Common Crawl pushes back. Executive Director Rich Skrenta denied that CCBot bypasses paywalls to scrape websites. He also denied misleading publishers after The Atlantic reported in November that some content from publishers that had requested removal remained available.

    • “When a publisher asks us to remove previously crawled material, we respond promptly and initiate a removal process that reflects the technical design of our dataset,” Skrenta said.

    Why we care. This fight could shape how much publisher content AI search engines can use without permission. If courts or settlements impose stricter consent requirements, AI responses may rely more on licensed sources and less on the open web.

    AI training stakes. Since 2008, Common Crawl has scraped billions of webpages to build a free public archive. Its datasets have been widely used to train AI models. The New York Times’ 2023 copyright lawsuit against OpenAI cited Common Crawl as making up 60% of GPT-3’s training data, Press Gazette reported.

    • A 2024 Mozilla Foundation paper said that, in its current form, generative AI likely would not have been possible without Common Crawl.
    • Common Crawl has been working on open standards for AI crawling preferences, Skrenta said this week. DCN’s letter asks for a harder line: stop scraping protected publisher content and remove member content already in the datasets.

    Search Engine Land is owned by Semrush. We remain committed to providing high-quality coverage of marketing topics. Unless otherwise noted, this page’s content was written by either an employee or a paid contractor of Semrush Inc.


    Danny Goodwin
    Danny Goodwin is Editorial Director of Search Engine Land & Search Marketing Expo – SMX. He joined Search Engine Land in 2022 as Senior Editor. In addition to reporting on the latest search marketing news, he manages Search Engine Land’s SME (Subject Matter Expert) program. He also helps program U.S. SMX events.

    Goodwin has been editing and writing about the latest developments and trends in search and digital marketing since 2007. He previously was Executive Editor of Search Engine Journal (from 2017 to 2022), Managing Editor of Momentology (from 2014 to 2016), and Editor of Search Engine Watch (from 2007 to 2014). He has spoken at many major search conferences and virtual events, and a wide range of publications and podcasts have cited him as an expert source.
