Data Generation & Validation
Decentralized data generation and validation
The data repo is designed to be permissionless: anyone can contribute to it as long as the data passes its validation rules. To preserve this permissionless property, we built the MIZU data network, which lets the community generate and validate data in a trustless manner.
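The gatekeeping idea can be sketched as follows. This is a minimal illustration, not MIZU's actual rule format: the `Record` and `Rule` types and the example rules are assumptions made up for this sketch.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical record type; a real repo would carry richer metadata.
@dataclass
class Record:
    content: str

# A validation rule is any predicate over a record.
Rule = Callable[[Record], bool]

def accept_contribution(record: Record, rules: list[Rule]) -> bool:
    """A contribution is admitted only if every validation rule passes."""
    return all(rule(record) for rule in rules)

# Example rules (illustrative only): non-empty content and a length cap.
rules: list[Rule] = [
    lambda r: len(r.content.strip()) > 0,
    lambda r: len(r.content) <= 10_000,
]

print(accept_contribution(Record("some dataset row"), rules))  # True
print(accept_contribution(Record(""), rules))                  # False
```

Because the check is a pure function of the record and the repo's published rules, any network participant can re-run it and reach the same verdict, which is what makes permissionless contribution workable.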
The Importance of Decentralization
Ensuring Neutrality and Accessibility
We value the open and permissionless nature of the system, so we want to ensure that it remains neutral, resistant to censorship, and impossible to shut down. The data should be shareable across the world without barriers. This is only possible if both generation and validation are done in a fully decentralized manner.
Incentivizing Quality Contributions
In the future, we may airdrop tokens to high-quality dataset maintainers and reputable contributors. This requires strong community consensus that the data in a repo truly satisfies its rules, and only a decentralized network can provide that consensus.
Ensuring Data Integrity in Repo Dependencies
People may build new repos on top of existing ones, leveraging the data and structures already present in those parent repos. However, if the data in a parent repo is compromised or fails to meet its specified requirements, the result is a supply-chain attack: dependent repos inherit and propagate the invalid or malicious data. To mitigate this risk, MIZU validates in a decentralized manner that all data in a repo's dependencies is valid and meets the requirements specified by its rules.
AI-driven Data Generation
MIZU aims to make extensive use of AI to generate and validate data, for several reasons:
Lowering the Barrier to Entry
With the help of AI, users can join the platform and contribute simply by describing their requirements or through simple prompt engineering. This brings in knowledge from many domains, even from contributors with no programming experience.
Enhancing Data Lineage
For each data record in our network, we will know not only where it comes from but also how it was generated: with which prompt, by what model, who contributed it, and to which repo. This gives a clear understanding of the motivation behind the data and helps data consumers better understand it.
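The lineage fields listed above can be captured in a small per-record schema. The field names below are assumptions for illustration, not MIZU's actual on-chain or on-disk format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # lineage is immutable once recorded
class LineageRecord:
    source: str       # where the underlying data comes from
    prompt: str       # the prompt used to generate it
    model: str        # the model that produced it
    contributor: str  # who contributed it
    repo: str         # the repo it was contributed to

# Hypothetical example values.
record = LineageRecord(
    source="crawled-docs",
    prompt="Summarize the article in one sentence.",
    model="example-llm-v1",
    contributor="alice",
    repo="summaries-repo",
)
print(record.model)  # example-llm-v1
```

Attaching such a record to every data point is what lets a consumer answer not just "where did this come from?" but "why does it look the way it does?".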
Fostering Collaboration and Knowledge Sharing
We want to aggregate not only the data but also the knowledge of how it is generated. With AI, users can share the prompts or workflows they used, so the community can reuse them to generate more data and build better ones on top of them, leading to a more collaborative ecosystem.