Crypto Observer

EleutherAI releases massive AI training dataset of licensed and open domain text

By Crypto Observer Staff, June 6, 2025

EleutherAI, an AI research organization, has released what it claims is one of the largest collections of licensed and open-domain text for training AI models.

The dataset, called The Common Pile v0.1, took around two years to complete in collaboration with AI startups Poolside, Hugging Face, and others, along with several academic institutions. At 8 terabytes, The Common Pile v0.1 was used to train two new AI models from EleutherAI, Comma v0.1-1T and Comma v0.1-2T, which EleutherAI claims perform on par with models developed using unlicensed, copyrighted data.

AI companies, including OpenAI, are embroiled in lawsuits over their AI training practices, which rely on scraping the web — including copyrighted material like books and research journals — to build model training datasets. While some AI companies have licensing arrangements in place with certain content providers, most maintain that the U.S. legal doctrine of fair use shields them from liability in cases where they trained on copyrighted work without permission.

EleutherAI argues that these lawsuits have “drastically decreased” transparency from AI companies, which the organization says has harmed the broader AI research field by making it more difficult to understand how models work and what their flaws might be.

“[Copyright] lawsuits have not meaningfully changed data sourcing practices in [model] training, but they have drastically decreased the transparency companies engage in,” Stella Biderman, EleutherAI’s executive director, wrote in a blog post on Hugging Face early Friday. “Researchers at some companies we have spoken to have also specifically cited lawsuits as the reason why they’ve been unable to release the research they’re doing in highly data-centric areas.”

The Common Pile v0.1, which can be downloaded from Hugging Face’s AI dev platform and GitHub, was created in consultation with legal experts. It draws on sources including 300,000 public domain books digitized by the Library of Congress and the Internet Archive. EleutherAI also used Whisper, OpenAI’s open source speech-to-text model, to transcribe audio content.

EleutherAI claims Comma v0.1-1T and Comma v0.1-2T are evidence that The Common Pile v0.1 was curated carefully enough to enable developers to build models competitive with proprietary alternatives. According to EleutherAI, the models, both of which have 7 billion parameters and were trained on only a fraction of The Common Pile v0.1, rival models like Meta’s first Llama AI model on benchmarks for coding, image understanding, and math.

Parameters, sometimes referred to as weights, are the internal components of an AI model that guide its behavior and answers.

“In general, we think that the common idea that unlicensed text drives performance is unjustified,” Biderman wrote in her post. “As the amount of accessible openly licensed and public domain data grows, we can expect the quality of models trained on openly licensed content to improve.”

The Common Pile v0.1 appears to be in part an effort to right EleutherAI’s historical wrongs. Years ago, the organization released The Pile, an open collection of training text that includes copyrighted material. AI companies have come under fire, and legal pressure, for using The Pile to train models.

EleutherAI has committed to releasing open datasets more frequently, in collaboration with its research and infrastructure partners.

© 2025 Crypto Observer. All Rights Reserved.
