Although MIZU provides a default workflow for users to generate and validate data, it may not be flexible enough to cater to every user’s specific needs or generate high-quality data in all scenarios. To address this issue of limited flexibility, the MIZU network will support customized workflow deployment and execution. This feature allows developers to create and deploy their own workflows tailored to their specialized requirements, ensuring that they can generate high-quality data that meets their unique needs.

Typical Data Workflow Structure

Two-phase Design

  1. Data Generation Phase: This phase utilizes data engines to produce data based on input parameters, such as prompts or simulator settings. The system provides a flexible interface to integrate with different data engines seamlessly.

  2. Data Verification Phase: The verification phase comprises multiple steps, each applying specific rules or leveraging LLM inference to identify and correct or filter out low-quality data. The effectiveness of the verification logic heavily influences the overall data quality. A sketch of this two-phase structure is shown below.
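As a rough illustration of the two-phase structure, the following sketch models a workflow as a single generation step followed by a chain of verification steps. All names here (`Workflow`, `generate`, `verifiers`) are hypothetical and not part of the MIZU API; an LLM-backed verifier would simply be another callable in the list.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical types; the real MIZU workflow interface may differ.
Record = dict
GenerateFn = Callable[[dict], List[Record]]   # data engine: params -> raw records
VerifyFn = Callable[[Record], bool]           # rule-based or LLM-backed check

@dataclass
class Workflow:
    generate: GenerateFn        # Phase 1: data generation
    verifiers: List[VerifyFn]   # Phase 2: ordered verification steps

    def run(self, params: dict) -> List[Record]:
        raw = self.generate(params)
        # Keep only records that pass every verification step.
        return [r for r in raw if all(v(r) for v in self.verifiers)]

# Example: a trivial engine plus two rule-based verifiers.
engine = lambda p: [{"text": f"{p['prompt']} #{i}"} for i in range(p["n"])]
not_empty = lambda r: bool(r["text"].strip())
max_len = lambda r: len(r["text"]) < 512

wf = Workflow(generate=engine, verifiers=[not_empty, max_len])
print(wf.run({"prompt": "sample", "n": 3}))
```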

DataVM and Workflow Compilation

The data workflow system compiles each workflow into atomic data operations that are interpretable and executable by the DataVM. These operations are designed to be deterministic, enabling proof generation and verification. One of the most critical data operations is “inference,” which involves calling external hosted LLMs and obtaining signed inference results.
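The sketch below shows, under assumed names (`DataOp`, `compile_workflow`), how a simple generate-then-verify workflow might be lowered into atomic operations with deterministic digests; the real DataVM instruction set, encoding, and the signing scheme for inference results may differ.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List

# Hypothetical representation of DataVM operations.
@dataclass
class DataOp:
    kind: str                      # e.g. "inference", "filter", "store"
    args: Dict[str, Any] = field(default_factory=dict)

    def digest(self) -> str:
        # Deterministic encoding so the same op always hashes identically.
        blob = json.dumps({"kind": self.kind, "args": self.args}, sort_keys=True)
        return hashlib.sha256(blob.encode()).hexdigest()

def compile_workflow(prompt_template: str, rules: List[str]) -> List[DataOp]:
    """Compile a simple generate-then-verify workflow into atomic ops."""
    ops = [DataOp("inference", {"prompt_template": prompt_template})]
    ops += [DataOp("filter", {"rule": r}) for r in rules]
    ops.append(DataOp("store", {"target": "global"}))
    return ops

program = compile_workflow("Write a profile for {persona}", ["not_empty", "max_len_512"])
for op in program:
    print(op.kind, op.digest()[:12])
```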

Proof Generation and Verification

During the execution of data operations, the system generates proofs for each operation, leveraging the deterministic nature of the execution. These proofs are stored securely for later verification, ensuring the integrity and correctness of the workflow execution.
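One simple way to obtain per-operation proofs from a deterministic execution is a hash chain that commits to each operation together with its inputs and outputs; a verifier replays the operations and recomputes the chain. The sketch below is an illustrative construction under that assumption, not MIZU's actual proof system.

```python
import hashlib
import json
from typing import Any, List, Tuple

def op_proof(prev_proof: str, op_kind: str, op_input: Any, op_output: Any) -> str:
    """Hash-chain one operation: proof_i = H(proof_{i-1} || op || input || output)."""
    blob = json.dumps(
        {"prev": prev_proof, "op": op_kind, "in": op_input, "out": op_output},
        sort_keys=True,
    )
    return hashlib.sha256(blob.encode()).hexdigest()

def execute_with_proofs(trace: List[Tuple[str, Any, Any]]) -> List[str]:
    """Each element is (op_kind, input, output); returns the proof chain."""
    proofs, prev = [], "genesis"
    for kind, op_in, op_out in trace:
        prev = op_proof(prev, kind, op_in, op_out)
        proofs.append(prev)
    return proofs

def verify_chain(trace: List[Tuple[str, Any, Any]], proofs: List[str]) -> bool:
    """A verifier replays the deterministic trace and recomputes the chain."""
    return proofs == execute_with_proofs(trace)

trace = [("inference", {"prompt": "p"}, {"text": "t"}), ("filter", {"rule": "not_empty"}, True)]
chain = execute_with_proofs(trace)
print(verify_chain(trace, chain))  # True
```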

Task Dispatching & Parallel Processing

In the MIZU data workflow system, users can select the number of nodes to employ for running data workflows that generate large amounts of data. The orchestrator utilizes a global task dispatching algorithm to split the task and assign it to the appropriate nodes, considering factors such as computational resources, workload, and workflow requirements. The assigned nodes execute their portions of the data workflow independently, generating results within a specified time frame. To ensure integrity and reliability, the system incorporates a verification and challenge mechanism, where users can request a challenge if a node fails to deliver results on time, resulting in token slashing and task reassignment to another available node.
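A minimal sketch of such a dispatching heuristic is shown below, assuming a hypothetical node model with a capacity and current workload; the real orchestrator also accounts for workflow requirements, deadlines, and challenge-driven reassignment.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical node model used only for this illustration.
@dataclass
class Node:
    node_id: str
    capacity: int      # relative compute capacity
    queued_tasks: int  # current workload

def dispatch(shards: List[int], nodes: List[Node]) -> Dict[str, List[int]]:
    """Greedily assign each shard to the node with the most spare capacity."""
    assignment: Dict[str, List[int]] = {n.node_id: [] for n in nodes}
    load = {n.node_id: n.queued_tasks for n in nodes}
    for shard in shards:
        # Score = capacity minus current load; pick the best node.
        best = max(nodes, key=lambda n: n.capacity - load[n.node_id])
        assignment[best.node_id].append(shard)
        load[best.node_id] += 1
    return assignment

nodes = [Node("node-a", capacity=8, queued_tasks=2), Node("node-b", capacity=4, queued_tasks=0)]
print(dispatch(list(range(10)), nodes))
```

On a missed deadline, the same routine could be rerun over the remaining nodes to reassign the affected shards after the offending node is slashed.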

Challenges in Building High-Quality Data Workflows

Data Storage

The data workflow system employs a two-tier storage architecture: local storage and global storage. Local storage is accessible only by the node running the logic and is used for storing temporary data. In contrast, global storage is accessible by all nodes and is used for storing finalized data. Local storage is more cost-effective compared to global storage, as it eliminates the need for data transfer across the network and reduces storage costs in the DA layer.
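A minimal interface sketch of the two tiers is shown below; the class names and in-memory backends are placeholders (global storage would be backed by the network's DA layer rather than a local dict).

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional

class Storage(ABC):
    @abstractmethod
    def put(self, key: str, value: bytes) -> None: ...
    @abstractmethod
    def get(self, key: str) -> Optional[bytes]: ...

class LocalStorage(Storage):
    """Node-local scratch space: cheap, visible only to the executing node."""
    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}
    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value
    def get(self, key: str) -> Optional[bytes]:
        return self._data.get(key)

class GlobalStorage(Storage):
    """Network-wide storage: visible to all nodes (stand-in for the DA layer)."""
    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}
    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value
    def get(self, key: str) -> Optional[bytes]:
        return self._data.get(key)

# Typical usage: stage intermediates locally, publish only finalized data globally.
local, global_store = LocalStorage(), GlobalStorage()
local.put("tmp/batch-0", b"raw generations")
global_store.put("datasets/persona-v1/part-0", b"verified records")
```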

Data Diversity

Generating diverse data is a significant challenge, as similar prompts can lead to homogeneous results. To ensure data diversity, it is essential to create specific and comprehensive prompts. For instance, when generating people with different personalities, instead of simply asking the model to “generate a person with personality,” it is more effective to provide a predefined personality list and randomly select one for prompt generation. To enable this feature, the data workflow system supports data selection from pre-loaded lists/maps in global storage and prompt template auto-filling.
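The snippet below sketches list-driven prompt auto-filling with an assumed personality list and template; how the selection index is chosen deterministically is covered under Data Randomness.

```python
# A minimal sketch of list-driven prompt diversification. The personality
# list and template are illustrative, not MIZU built-ins.
PERSONALITIES = ["meticulous", "adventurous", "skeptical", "empathetic", "pragmatic"]
TEMPLATE = "Generate a short bio for a person with a {personality} personality."

def fill_prompt(template: str, slot: str, choices: list, index: int) -> str:
    # Auto-fill one template slot from a pre-loaded list using a given index.
    return template.format(**{slot: choices[index % len(choices)]})

prompts = [fill_prompt(TEMPLATE, "personality", PERSONALITIES, i) for i in range(3)]
for p in prompts:
    print(p)
```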

Data Randomness

Maintaining the deterministic and verifiable nature of data workflow execution is crucial, which means that true (non-deterministic) randomness cannot be introduced into the execution. To randomly select data from global storage, the system relies on either pseudo-random algorithms (e.g., hashing the data to a number and selecting the lowest one) or external verifiable random number generators, similar to the external verifiable inference network.
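The "hash and pick the lowest" approach might look like the sketch below: every node hashes each candidate together with a shared seed (for example the task ID) and selects the item with the smallest digest, so the choice looks random but is fully reproducible.

```python
import hashlib
from typing import List

def deterministic_pick(items: List[str], seed: str) -> str:
    """Pseudo-random but reproducible selection: hash each item with a shared
    seed and pick the one with the lowest digest. Every node that runs this
    with the same inputs gets the same result, so execution stays verifiable."""
    return min(items, key=lambda item: hashlib.sha256(f"{seed}:{item}".encode()).hexdigest())

personalities = ["meticulous", "adventurous", "skeptical", "empathetic"]
print(deterministic_pick(personalities, seed="task-42"))  # same output on every node
print(deterministic_pick(personalities, seed="task-43"))  # different seed, possibly different pick
```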

Data Aggregation

In a distributed computing environment, data is processed in parallel by multiple nodes. However, certain operations, such as data deduplication, may require data aggregation. Data aggregation is relatively expensive compared to pure distributed computing, as it involves a significant amount of data transfer and shuffling.
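As an illustration, the sketch below deduplicates records in two passes: each node first removes its own local duplicates, then records are shuffled by content hash so that any remaining duplicates collide on the same node. The in-process lists stand in for per-node partitions and the network shuffle.

```python
import hashlib
from typing import Dict, List

def partition(records: List[str], num_nodes: int) -> Dict[int, List[str]]:
    """Shuffle step: route each record to a node by hashing its content, so
    identical records always land on the same node."""
    parts: Dict[int, List[str]] = {i: [] for i in range(num_nodes)}
    for r in records:
        node = int(hashlib.sha256(r.encode()).hexdigest(), 16) % num_nodes
        parts[node].append(r)
    return parts

def deduplicate(records_per_node: List[List[str]], num_nodes: int) -> List[str]:
    # Local pass: each node drops the duplicates it can see on its own.
    locally_deduped = [list(dict.fromkeys(recs)) for recs in records_per_node]
    # Shuffle + global pass: duplicates across nodes now collide on one node.
    shuffled = partition([r for recs in locally_deduped for r in recs], num_nodes)
    return [r for part in shuffled.values() for r in dict.fromkeys(part)]

node_outputs = [["a", "b", "a"], ["b", "c"], ["c", "d"]]
print(sorted(deduplicate(node_outputs, num_nodes=3)))  # ['a', 'b', 'c', 'd']
```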

The E2E Workflow
