The Background
We are in the Google moment for AI data
Background
To date, the Internet has generated roughly 180 ZB of data, yet AI developers still struggle to find suitable data for their applications: an estimated 80% of the time in an AI/ML project is spent on data preparation. Data is scattered across open datasets and enterprise silos, leaving it highly fragmented and stale. The value of data drops by roughly 50% every three months, yet most open-source datasets were created years ago. Together, these problems make existing datasets effectively unsearchable and unqueryable.
We have seen this problem before, with web data. In the early days of the Internet, web data was indexed by website or domain and served through directory ("Yellow Pages") sites like Yahoo. Finding information across multiple websites was hard, because users had to browse each site manually. Google solved the problem by building a scalable, up-to-date index over web content itself, rather than simply listing websites, and delivered a far better search experience. After Google, the number of web applications exploded because their content could finally be discovered easily.
We are now at the Google moment for AI data. Data-hosting platforms such as HuggingFace serve as the Yellow Pages for AI data, but that is not enough. MIZU aims to solve the fragmentation and staleness problems and help AI developers find data more easily by building a scalable, unified, and up-to-date AI dataset that indexes data content directly. The dataset is completely open source, offering a low-risk, easily accessible data solution for a wide range of AI applications. Just as Google unlocked an explosion of web applications, MIZU will unlock an explosion of AI applications.
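To make the directory-versus-index contrast concrete, here is a minimal, hypothetical Python sketch. The dataset names, URLs, and records are placeholders for illustration only, not MIZU's actual design:

```python
# Illustrative sketch (not MIZU's implementation): the difference between
# a "Yellow Pages" directory, which only maps names to locations, and a
# content index, which maps what is *inside* the data to its source.
from collections import defaultdict

# Directory-style listing: you must already know which dataset to open.
directory = {
    "dataset_a": "https://example.com/dataset_a",
    "dataset_b": "https://example.com/dataset_b",
}

# Content index: every record is tokenized, so a query over content
# returns the datasets that actually contain matching data.
records = {
    "dataset_a": ["solar panel efficiency report 2024"],
    "dataset_b": ["wind turbine maintenance logs"],
}

index = defaultdict(set)
for dataset, docs in records.items():
    for doc in docs:
        for token in doc.lower().split():
            index[token].add(dataset)

print(index["turbine"])  # {'dataset_b'} -- found by content, not by name
```

A directory only answers "where is dataset X?", while a content index answers "which datasets contain X?", which is the question AI developers actually need answered.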
Why Open Source Data
The current AI landscape is dominated by major tech corporations, leaving limited opportunities for AI startups due to a lack of access to quality data. This situation mirrors the early days of software development when code was closely guarded. However, just as the industry realized that code itself wasn’t the ultimate competitive advantage, but rather the network effects and ecosystems built around it, we are now recognizing that data shouldn’t be hoarded. Open source data, analogous to open source code, offers a solution to democratize AI development and foster a diverse application ecosystem. This community-driven approach can help solve the monopolization problem by creating opportunities for a wider range of participants and distributing the benefits more equitably.
The adoption of open data practices has the potential to completely transform the way AI applications are built and maintained. By revealing the relationship between applications and their underlying datasets, we provide the community with two significant advantages. First, it instills confidence in users of AI applications, as they can understand exactly how the application is built and what data it relies on. Second, it creates a channel for both users and developers to actively improve the performance of applications by contributing additional data to the underlying datasets. This approach facilitates faster iterations, lowers development costs, and fosters a stronger developer community that leverages collective intelligence. Ultimately, open data can lead to more innovative, transparent, and efficient AI solutions that benefit a broader range of stakeholders.
Why Open Source AI
Open-source AI has garnered significant attention in recent times, with the promise of democratizing access to powerful AI models and enabling collaborative development. However, the current state of open-source large language models (LLMs) falls short of being truly “open-source” in the fullest sense.
While the models and parameters of these LLMs are openly available, the training data recipes, training code, and training process often remain opaque. This lack of transparency limits the potential for customization and improvement by the wider developer community. Developers are restricted to fine-tuning the models, without the ability to modify the pre-training process itself. This limitation can be problematic, as it prevents developers from excluding the influence of potentially harmful or biased data that may have been inadvertently included during the pre-training phase.
To draw an analogy, the current state of open-source LLMs is akin to the Windows operating system, where the source code is not fully accessible, and customization options are limited. In contrast, truly open-source AI models should be more like Linux, where every aspect of the system is transparent, modifiable, and community-driven.
Why Synthetic Data
Synthetic data, artificially generated to mimic the characteristics of real-world data, is a critical component of the open-source data ecosystem and offers several advantages.
In the post-GPT-4 era, AI faces a new challenge: data scarcity. With most real-world data already utilized, finding diverse and relevant data for AI training has become increasingly difficult. To address this, the industry has turned to synthetic data generation.
Synthetic data allows for rapid generation of large data volumes, provides flexibility to create data with specific properties, and helps mitigate privacy concerns associated with real-world data.
This approach has shown promise in various domains, including computer vision and natural language processing, enabling AI developers to ensure a steady supply of high-quality, diverse data for model training and refinement.
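As a concrete illustration of those properties, the sketch below generates synthetic user records. The schema, field names, and distributions are purely hypothetical assumptions; it only shows how synthetic data can be produced in volume, shaped to specific statistical properties, and kept free of real personal information:

```python
# A minimal sketch of property-controlled synthetic data generation.
# The schema and distributions are illustrative assumptions, not a
# description of any specific production pipeline.
import random

random.seed(42)  # reproducible output

def synthesize_user_record() -> dict:
    """Generate one synthetic record that mimics the shape and
    statistics of real user data without containing any real PII."""
    return {
        "user_id": f"user_{random.randrange(1_000_000):06d}",  # fake ID, no real identity
        "age": max(18, min(90, int(random.gauss(35, 12)))),    # plausible age distribution
        "sessions_per_week": random.choices([1, 3, 7, 14], weights=[4, 3, 2, 1])[0],
        "churned": random.random() < 0.2,                      # target label with a 20% base rate
    }

# Rapidly produce an arbitrarily large volume of training rows.
dataset = [synthesize_user_record() for _ in range(10_000)]
print(dataset[0])
```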
Why Web3
Open source datasets face significant challenges that impede their growth and utility. The iteration cycle is painfully slow, with new versions taking months to release, failing to keep pace with evolving research needs. Collaboration is hindered by the absence of platforms that allow for easy contributions and incremental changes to existing datasets. Moreover, the lack of proper incentives discourages potential contributors, as they receive little reward for their efforts. These issues collectively stifle the development and relevance of open source datasets, limiting their impact on research and innovation.
Web3 leverages blockchain and decentralized networks to align stakeholder incentives and foster an open data ecosystem. It enables transparent and trustless collaboration, allowing for on-chain tracking of data ownership and contributions. This ensures proper recognition and rewards for contributors. Web3’s composability also allows for modular development, accelerating innovation.
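As a conceptual sketch of what on-chain contribution tracking could look like, the Python snippet below chains content-addressed contribution records together. The record format and field names are illustrative assumptions; a production system would anchor these hashes in blockchain transactions rather than compute them off-chain:

```python
# A conceptual sketch of verifiable contribution tracking for an open
# dataset. The record format is hypothetical; a real Web3 system would
# anchor these hashes on-chain.
import hashlib
import json
import time

def contribution_record(contributor: str, data: bytes, parent_hash: str) -> dict:
    """Create a tamper-evident record linking a data contribution
    to its contributor and to the dataset's previous state."""
    record = {
        "contributor": contributor,
        "data_hash": hashlib.sha256(data).hexdigest(),  # content-addressed: proves what was added
        "parent": parent_hash,                          # links to the prior version, forming a chain
        "timestamp": int(time.time()),
    }
    record["record_hash"] = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()
    ).hexdigest()
    return record

genesis = contribution_record("alice", b"initial corpus", parent_hash="0" * 64)
update = contribution_record("bob", b"new crawled pages", parent_hash=genesis["record_hash"])
print(update["record_hash"])
```

Because each record references its parent's hash, the full history of who contributed what, and in what order, can be verified by anyone, which is exactly the transparent recognition and reward channel described above.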