Multimodal AI workloads break traditional data engines. They need to embed documents, classify images, and transcribe audio, not just run aggregations and joins. These multimodal workloads are tough: memory usage balloons mid-pipeline, processing requires both CPU and GPU, and a single machine can't handle the data volume.
This post compares Daft and Ray Data for multimodal data processing, examining their architectures and performance. Benchmarks across large-scale audio, video, document, and image workloads found that Daft ran 2-7x faster than Ray Data and 4-18x faster than Spark, while completing every job reliably.
Multimodal data processing presents unique challenges: intermediate data (decoded images, audio waveforms, embeddings) is far larger than the raw inputs, pipelines mix CPU-bound and GPU-bound stages, and data volumes exceed what any single machine can process.
Daft is a distributed data engine designed to handle petabyte-scale workloads with multimodal data (audio, video, images, text, embeddings) as first-class citizens.
Ray Data is a data processing library built on top of Ray, a framework for building distributed Python applications.
Key features include `map_batches` transformations that work directly on PyArrow record batches or pandas DataFrames.

Daft's architecture revolves around its Swordfish streaming execution engine. Data is always "in flight": batches flow through the pipeline as soon as they are ready. For a partition of 100,000 images, the first 1,000 can be fed into model inference while the next 1,000 are being downloaded or decoded; the entire partition never has to be fully materialized in an intermediate buffer.
**Backpressure mechanism:** If GPU inference becomes the bottleneck, the upstream steps automatically slow down so memory usage remains bounded.
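As a rough illustration, the bounded-buffer idea behind backpressure can be sketched in pure Python. This is a toy model, not Daft's implementation: `run_pipeline`, its producer, and its consumer are hypothetical stand-ins for the download/decode and inference stages.

```python
import queue
import threading

def run_pipeline(n_batches: int, buffer_size: int = 2) -> list[int]:
    """Toy streaming pipeline: a bounded queue between a fast producer
    (download/decode) and a slower consumer (GPU inference) provides
    backpressure, so at most `buffer_size` batches are ever buffered."""
    buf: queue.Queue = queue.Queue(maxsize=buffer_size)
    results: list[int] = []

    def producer() -> None:
        for i in range(n_batches):
            batch = [i] * 4     # stand-in for a downloaded/decoded batch
            buf.put(batch)      # blocks when the buffer is full -> backpressure
        buf.put(None)           # sentinel: no more batches

    def consumer() -> None:
        while (batch := buf.get()) is not None:
            results.append(sum(batch))  # stand-in for model inference

    t = threading.Thread(target=producer)
    t.start()
    consumer()
    t.join()
    return results
```

Because `put` blocks once the queue holds `buffer_size` batches, a slow consumer automatically throttles the producer, and memory stays bounded no matter how many batches flow through the pipeline.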
**Adaptive batch sizing:** Daft shrinks batch sizes on memory-heavy operations like `url_download` or `image_decode`, keeping throughput high without ballooning memory usage.
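A minimal sketch of the idea (a hypothetical heuristic for illustration, not Daft's actual algorithm): halve the batch size until a batch's estimated memory footprint fits a budget.

```python
def adapt_batch_size(batch_size: int, bytes_per_row: int,
                     budget_bytes: int, min_size: int = 1) -> int:
    """Halve the batch size until the estimated in-flight memory for one
    batch fits the budget. Toy heuristic illustrating adaptive batching."""
    while batch_size > min_size and batch_size * bytes_per_row > budget_bytes:
        batch_size //= 2
    return batch_size

# A cheap column keeps the default batch size...
assert adapt_batch_size(1024, 100, 512_000_000) == 1024
# ...while rows that decode to ~4 MB images force much smaller batches.
assert adapt_batch_size(1024, 4_000_000, 512_000_000) == 128
```

The point is that batch size becomes a function of per-row cost: small for decoded images or downloaded files, large for cheap scalar columns.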
**Flotilla distributed engine:** Daft's distributed runner deploys one Swordfish worker per node, enabling the same streaming execution model to scale across clusters.
Ray Data streams data between heterogeneous operations (e.g., CPU → GPU) that users define via classes or resource requests. Within homogeneous operations, Ray Data fuses sequential operations into the same task and executes them sequentially, which can cause memory issues without careful tuning of block sizes. You can work around this by using classes instead of functions in map/map_batches, but this materializes intermediates in Ray's object store, adding serialization and memory copy overhead. Ray's object store is by default only 30% of machine memory, and this limitation can lead to excessive disk spilling.
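To see why materializing every intermediate hurts, here is a toy model that counts how many batches are alive at once under each strategy. The `Batch` class and both pipelines are hypothetical, and the bookkeeping relies on CPython's reference counting; it illustrates the memory shape of the two approaches, not either engine's internals.

```python
STATS = {"live": 0, "peak": 0}

class Batch:
    """Toy batch that tracks how many instances are alive simultaneously."""
    def __init__(self, values):
        self.values = values
        STATS["live"] += 1
        STATS["peak"] = max(STATS["peak"], STATS["live"])
    def __del__(self):
        STATS["live"] -= 1

def decode(b: Batch) -> Batch:   # stand-in for a CPU-heavy step
    return Batch([v * 2 for v in b.values])

def run_materialized(n: int):
    """Materialize every intermediate (akin to staging blocks in an
    object store): all inputs and all decoded batches coexist."""
    STATS.update(live=0, peak=0)
    inputs = [Batch([i]) for i in range(n)]
    decoded = [decode(b) for b in inputs]
    return [sum(b.values) for b in decoded], STATS["peak"]

def run_streaming(n: int):
    """Stream batch-by-batch: only a couple of batches are live at once."""
    STATS.update(live=0, peak=0)
    out = [sum(decode(Batch([i])).values) for i in range(n)]
    return out, STATS["peak"]
```

Both pipelines produce identical results, but the materialized version's peak live-batch count grows linearly with the number of batches, while the streaming version's peak stays constant; that difference is what block-size tuning and object-store limits are fighting against.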
Based on recent benchmarks conducted on identical AWS clusters (8 x g6.xlarge instances with NVIDIA L4 GPUs, each with 4 vCPUs, 16 GB memory, and a 100 GB EBS volume), here's how Daft, Ray Data, and Spark compare:
| Workload | Daft | Ray Data | Spark |
|----|----|----|----|
| Audio Transcription (113,800 files) | 6m 22s | 29m 20s (4.6x slower) | 25m 46s (4.0x slower) |
| Document Embedding (10,000 PDFs) | 1m 54s | 14m 32s (7.6x slower) | 8m 4s (4.2x slower) |
| Image Classification (803,580 images) | 4m 23s | 23m 30s (5.4x slower) | 45m 7s (10.3x slower) |
| Video Object Detection (1,000 videos) | 11m 46s | 25m 54s (2.2x slower) | 3h 36m (18.4x slower) |
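The "x slower" figures are simply wall-clock ratios against Daft's time for the same workload; a quick sanity check in Python (durations transcribed from the table, `to_seconds` is a helper written for this check):

```python
def to_seconds(t: str) -> int:
    """Parse durations like '6m 22s' or '3h 36m' into seconds."""
    units = {"h": 3600, "m": 60, "s": 1}
    return sum(int(part[:-1]) * units[part[-1]] for part in t.split())

# Audio transcription: Ray Data vs Daft
assert round(to_seconds("29m 20s") / to_seconds("6m 22s"), 1) == 4.6
# Video object detection: Spark vs Daft
assert round(to_seconds("3h 36m") / to_seconds("11m 46s"), 1) == 18.4
```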
Several architectural decisions contribute to Daft's performance advantages: streaming execution keeps data in flight instead of materializing whole partitions, backpressure bounds memory usage when the GPU falls behind, and adaptive batch sizing keeps memory-heavy operations like downloads and decodes in check.
Based on the benchmark results and architectural differences, Daft shows significant advantages for large-scale multimodal workloads such as audio transcription, document embedding, image classification, and video object detection. Ray Data may be preferred when a pipeline already runs on Ray and needs tight integration with the rest of the Ray ecosystem.
When Tim Romanski of Essential AI set out to taxonomize 23.6 billion web documents from Common Crawl (24 trillion tokens), his team pushed Daft to its limits, scaling from local development to 32,000 requests per second per VM. As he shared in a panel discussion: "We pushed Daft to the limit and it's battle tested… If we had to do the same thing in Spark, we would have to have the JVM installed, go through all of its nuts and bolts just to get something running. So the time to get something running in the first place was a lot shorter. And then once we got it running locally, we just scaled up to multiple machines."
CloudKitchens rebuilt their entire ML infrastructure around what they call the "DREAM stack" (Daft, Ray, poEtry, Argo, Metaflow). When selecting their data processing layer, they identified specific limitations with Ray Data and chose Daft to complement Ray's compute capabilities. As their infrastructure team explained, "one issue with the Ray library for data processing, Ray Data, is that it doesn't cover the full range of DataFrame/ETL functions and its performance could be improved." They chose Daft because "it fills the gap of Ray Data by providing amazing DataFrame APIs" and noted that "in our tests, it's faster than Spark and uses fewer resources."
A data engineer from ByteDance commented on Daft's 300K image processing demonstration, sharing his own experience with an even larger image classification workload: "Not just 300,000 images - we ran image classification evaluations on the ImageNet dataset with approximately 1.28 million images, and Daft was about 20% faster than Ray Data." Additionally, in a separate technical analysis of Daft's architecture, he praised its "excellent execution performance and resource efficiency" and highlighted how it "effortlessly enables streaming processing of large-scale image datasets."