The company claims that Dolly 2.0 is the first open-source, instruction-following LLM fine-tuned on a transparent, freely available dataset that is itself open-sourced for commercial use. This means Dolly 2.0 is available for commercial applications without the need to pay for API access or share data with third parties.
According to Databricks CEO Ali Ghodsi, while there are other LLMs that can be used for business purposes, “they won’t sound like Dolly 2.0 to you.” And, he explained, users can modify and improve the training data because it is made freely available under an open source license. “So you can make your own version of Dolly,” he said.
Databricks released the dataset on which Dolly 2.0 was trained
Additionally, Databricks said that as part of its continued commitment to open source, it is also releasing the dataset Dolly 2.0 was trained on, called databricks-dolly-15k. It is a corpus of more than 15,000 records generated by thousands of Databricks employees, and Databricks claims it is the “first open-source, human-generated corpus of instructions specifically designed to enable large language models to exhibit the magical interactivity of ChatGPT”.
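To make the shape of such an instruction corpus concrete, here is a minimal sketch of parsing one record in the databricks-dolly-15k style. The field names and sample values are illustrative assumptions, not taken from the article; instruction-tuning datasets of this kind are typically distributed as JSON Lines, one instruction/response pair per line.

```python
import json

# A hypothetical record in the databricks-dolly-15k style
# (field names and values below are illustrative assumptions).
sample_line = json.dumps({
    "instruction": "Summarize what Databricks announced this week.",
    "context": "",  # optional reference text the model may draw on
    "response": "Databricks released Dolly 2.0 and its training dataset.",
    "category": "summarization",
})

# Each line of the corpus parses into one instruction/response pair.
record = json.loads(sample_line)
print(record["instruction"])
print(record["category"])
```

A fine-tuning pipeline would iterate over the file line by line, formatting each instruction (plus optional context) as the prompt and the human-written response as the training target.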
There has been a wave of instruction-following, ChatGPT-like LLM releases within the past two months that are considered open source by many definitions (or that provide some level of openness or gated access). One was Meta’s LLaMA, which in turn inspired others like Alpaca, Koala, Vicuna and Databricks’ Dolly 1.0.
However, many of these “open” models were under “industrial capture,” said Ghodsi, because they were trained on datasets whose terms limit commercial use – such as the 52,000-question-and-answer dataset from the Stanford Alpaca project, which was built using output from OpenAI’s ChatGPT. And OpenAI’s terms of service, he explained, include a rule that you can’t use the output of its services to develop models that compete with OpenAI.
Databricks, however, has found a way around this problem: Dolly 2.0 is a 12-billion-parameter language model based on the open-source EleutherAI Pythia model family and fine-tuned exclusively on a small, open-source corpus of instruction records (databricks-dolly-15k) generated by Databricks employees. The dataset’s license terms allow it to be used, modified and extended for any purpose, including academic or commercial applications.
Models trained on ChatGPT output have so far been in a legal gray area. “The whole community has been tiptoeing around and everyone is releasing these models, but none of them could be used commercially,” Ghodsi said. “That’s why we’re super excited.”
Dolly 2.0 is small but mighty
A Databricks blog post pointed out that, like the original Dolly, version 2.0 is not state-of-the-art, but “exhibits a surprisingly capable level of instruction following given the size of the training corpus”. The post adds that the effort and expense needed to create powerful AI technologies are “orders of magnitude lower than previously imagined.”
“Everyone wants to go bigger, but we’re actually interested in smaller models,” Ghodsi said of Dolly’s petite stature. “Secondly, it’s high quality. We have reviewed all the responses.”
Ghodsi added that he thinks Dolly 2.0 will trigger a “snowball” effect, in which other members of the AI community join in and come up with alternatives. The commercial-use limitation, he explained, was a big hurdle to overcome: “We’re thrilled now that we’ve finally found a way around it. I promise you, you’re going to see people apply the 15,000 questions to every model that’s out there, and they’re going to see how many of those models suddenly become magical, where you can interact with them.”