The Background
We are in the Google moment for AI data
Background
To date, the Internet has generated 180ZB of data, yet AI developers still struggle to find suitable data for their applications. For example, 80% of the time in an AI/ML project is spent on data preparation. The data is scattered across different open datasets and enterprises; it is highly fragmented and stale. The value of data drops by 50% every 3 months, yet most open source datasets available today were created years ago. These problems make existing datasets effectively unsearchable and unqueryable.
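To make the staleness claim concrete, here is a minimal sketch that takes the 50%-every-3-months figure above at face value and assumes the decay is exponential (an assumption for illustration; the figure itself does not specify a model). The function name `remaining_value` and its parameters are hypothetical.

```python
def remaining_value(initial_value: float, months: float,
                    half_life_months: float = 3.0) -> float:
    """Value left after `months`, assuming exponential decay.

    The 3-month half-life comes from the figure quoted in the text;
    the exponential model itself is an assumption.
    """
    return initial_value * 0.5 ** (months / half_life_months)


# Fraction of the original value left for datasets of various ages.
for age_months in (3, 6, 12, 24):
    fraction = remaining_value(1.0, age_months)
    print(f"{age_months:>2} months old: {fraction:.4f} of original value")
```

Under these assumptions, a dataset created two years ago retains less than 0.5% of its original value, which is why datasets built years ago are described here as stale.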
We have seen the same problem before with web data. When the Internet was first created, web data was indexed by website, or domain, and served by Yellow Pages-style directories such as Yahoo. It was very hard for users to find information across multiple websites, because they had to browse each site manually. Google solved the problem by providing a scalable and up-to-date index over web content directly, instead of simply listing websites, and so delivered a better search experience. After Google, the number of web applications grew rapidly because users could easily discover their content.
We are now at the Google moment for AI data. Data hosting platforms such as HuggingFace serve as the Yellow Pages for AI data, but that is not enough. MIZU addresses data fragmentation and staleness by creating a scalable, affordable, and up-to-date data processing network for hyperscale AI data. By leveraging edge devices, MIZU enables affordable and efficient data operations such as querying, crawling, and cleaning, making data more accessible to AI developers. This open and cost-effective solution empowers AI applications and drives innovation, much as Google transformed web applications.
Why Open Source Data
The current AI landscape is dominated by major tech corporations, leaving limited opportunities for AI startups due to a lack of access to quality data. This situation mirrors the early days of software development when code was closely guarded. However, just as the industry realized that code itself wasn’t the ultimate competitive advantage, but rather the network effects and ecosystems built around it, we are now recognizing that data shouldn’t be hoarded. Open source data, analogous to open source code, offers a solution to democratize AI development and foster a diverse application ecosystem. This community-driven approach can help solve the monopolization problem by creating opportunities for a wider range of participants and distributing the benefits more equitably.
The adoption of open data practices has the potential to completely transform the way AI applications are built and maintained. By revealing the relationship between applications and their underlying datasets, we provide the community with two significant advantages. First, it instills confidence in users of AI applications, as they can understand exactly how the application is built and what data it relies on. Second, it creates a channel for both users and developers to actively improve the performance of applications by contributing additional data to the underlying datasets. This approach facilitates faster iterations, lowers development costs, and fosters a stronger developer community that leverages collective intelligence. Ultimately, open data can lead to more innovative, transparent, and efficient AI solutions that benefit a broader range of stakeholders.
Why Web3
Open source datasets face significant challenges that impede their growth and utility. The iteration cycle is painfully slow, with new versions taking months to release, failing to keep pace with evolving research needs. Collaboration is hindered by the absence of platforms that allow for easy contributions and incremental changes to existing datasets. Moreover, the lack of proper incentives discourages potential contributors, as they receive little reward for their efforts. These issues collectively stifle the development and relevance of open source datasets, limiting their impact on research and innovation.
Web3 leverages blockchain and decentralized networks to align stakeholder incentives and foster an open data ecosystem. It enables transparent and trustless collaboration, allowing for on-chain tracking of data ownership and contributions. This ensures proper recognition and rewards for contributors. Web3’s composability also allows for modular development, accelerating innovation.