Multimodal AI workloads break traditional data engines. They need to embed documents, classify images, and transcribe audio, not just run aggregations and joins. These multimodal workloads are tough: memory usage balloons mid-pipeline, processing requires both CPU and GPU, and a single machine can't handle the data volume.
This post compares Daft and Ray Data for multimodal data processing, examining their architectures and performance. Benchmarks across large-scale audio, video, document, and image workloads found that Daft ran 2-7x faster than Ray Data and 4-18x faster than Spark, while completing every job reliably.
Multimodal data processing presents unique challenges: intermediate data (decoded images, audio waveforms, embeddings) is far larger than the raw inputs, pipelines mix CPU-bound and GPU-bound stages, and data volumes exceed what any single machine can process.
Daft is a distributed data engine designed to handle petabyte-scale workloads with multimodal data (audio, video, images, text, embeddings) as first-class citizens.
Ray Data is a data processing library built on top of Ray, a framework for building distributed Python applications.
Key features include `map_batches` transformations that work directly on PyArrow record batches or pandas DataFrames.

Daft's architecture revolves around its Swordfish streaming execution engine. Data is always "in flight": batches flow through the pipeline as soon as they are ready. For a partition of 100,000 images, the first 1,000 can be fed into model inference while the next 1,000 are being downloaded or decoded; the entire partition never has to be fully materialized in an intermediate buffer.
**Backpressure mechanism:** If GPU inference becomes the bottleneck, the upstream steps automatically slow down so memory usage remains bounded.
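As a rough illustration, the bounded-buffer idea behind backpressure can be sketched in pure Python. This is a toy model, not Daft's implementation: `run_pipeline`, its producer, and its consumer are hypothetical stand-ins for the download/decode and inference stages.

```python
import queue
import threading

def run_pipeline(n_batches: int, buffer_size: int = 2) -> list[int]:
    """Toy streaming pipeline: a bounded queue between a fast producer
    (download/decode) and a slower consumer (GPU inference) provides
    backpressure, so at most `buffer_size` batches are ever buffered."""
    buf: queue.Queue = queue.Queue(maxsize=buffer_size)
    results: list[int] = []

    def producer() -> None:
        for i in range(n_batches):
            batch = [i] * 4     # stand-in for a downloaded/decoded batch
            buf.put(batch)      # blocks when the buffer is full -> backpressure
        buf.put(None)           # sentinel: no more batches

    def consumer() -> None:
        while (batch := buf.get()) is not None:
            results.append(sum(batch))  # stand-in for model inference

    t = threading.Thread(target=producer)
    t.start()
    consumer()
    t.join()
    return results
```

Because `put` blocks once the queue holds `buffer_size` batches, a slow consumer automatically throttles the producer, and memory stays bounded no matter how many batches flow through the pipeline.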
**Adaptive batch sizing:** Daft shrinks batch sizes on memory-heavy operations like `url_download` or `image_decode`, keeping throughput high without ballooning memory usage.
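A minimal sketch of the idea (a hypothetical heuristic for illustration, not Daft's actual algorithm): halve the batch size until a batch's estimated memory footprint fits a budget.

```python
def adapt_batch_size(batch_size: int, bytes_per_row: int,
                     budget_bytes: int, min_size: int = 1) -> int:
    """Halve the batch size until the estimated in-flight memory for one
    batch fits the budget. Toy heuristic illustrating adaptive batching."""
    while batch_size > min_size and batch_size * bytes_per_row > budget_bytes:
        batch_size //= 2
    return batch_size

# A cheap column keeps the default batch size...
assert adapt_batch_size(1024, 100, 512_000_000) == 1024
# ...while rows that decode to ~4 MB images force much smaller batches.
assert adapt_batch_size(1024, 4_000_000, 512_000_000) == 128
```

The point is that batch size becomes a function of per-row cost: small for decoded images or downloaded files, large for cheap scalar columns.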
**Flotilla distributed engine:** Daft's distributed runner deploys one Swordfish worker per node, enabling the same streaming execution model to scale across clusters.
Ray Data streams data between heterogeneous operations (e.g., CPU → GPU) that users define via classes or resource requests. Within homogeneous operations, Ray Data fuses sequential operations into the same task and executes them sequentially, which can cause memory issues without careful tuning of block sizes. You can work around this by using classes instead of functions in map/map_batches, but this materializes intermediates in Ray's object store, adding serialization and memory copy overhead. Ray's object store is by default only 30% of machine memory, and this limitation can lead to excessive disk spilling.
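To see why materializing every intermediate hurts, here is a toy model that counts how many batches are alive at once under each strategy. The `Batch` class and both pipelines are hypothetical, and the bookkeeping relies on CPython's reference counting; it illustrates the memory shape of the two approaches, not either engine's internals.

```python
STATS = {"live": 0, "peak": 0}

class Batch:
    """Toy batch that tracks how many instances are alive simultaneously."""
    def __init__(self, values):
        self.values = values
        STATS["live"] += 1
        STATS["peak"] = max(STATS["peak"], STATS["live"])
    def __del__(self):
        STATS["live"] -= 1

def decode(b: Batch) -> Batch:   # stand-in for a CPU-heavy step
    return Batch([v * 2 for v in b.values])

def run_materialized(n: int):
    """Materialize every intermediate (akin to staging blocks in an
    object store): all inputs and all decoded batches coexist."""
    STATS.update(live=0, peak=0)
    inputs = [Batch([i]) for i in range(n)]
    decoded = [decode(b) for b in inputs]
    return [sum(b.values) for b in decoded], STATS["peak"]

def run_streaming(n: int):
    """Stream batch-by-batch: only a couple of batches are live at once."""
    STATS.update(live=0, peak=0)
    out = [sum(decode(Batch([i])).values) for i in range(n)]
    return out, STATS["peak"]
```

Both pipelines produce identical results, but the materialized version's peak live-batch count grows linearly with the number of batches, while the streaming version's peak stays constant; that difference is what block-size tuning and object-store limits are fighting against.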
Based on recent benchmarks conducted on identical AWS clusters (8 x g6.xlarge instances with NVIDIA L4 GPUs, each with 4 vCPUs, 16 GB memory, and a 100 GB EBS volume), here's how Daft, Ray Data, and Spark compare:
| Workload | Daft | Ray Data | Spark |
|----|----|----|----|
| Audio Transcription (113,800 files) | 6m 22s | 29m 20s (4.6x slower) | 25m 46s (4.0x slower) |
| Document Embedding (10,000 PDFs) | 1m 54s | 14m 32s (7.6x slower) | 8m 4s (4.2x slower) |
| Image Classification (803,580 images) | 4m 23s | 23m 30s (5.4x slower) | 45m 7s (10.3x slower) |
| Video Object Detection (1,000 videos) | 11m 46s | 25m 54s (2.2x slower) | 3h 36m (18.4x slower) |
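The "x slower" figures are simply wall-clock ratios against Daft's time for the same workload; a quick sanity check in Python (durations transcribed from the table, `to_seconds` is a helper written for this check):

```python
def to_seconds(t: str) -> int:
    """Parse durations like '6m 22s' or '3h 36m' into seconds."""
    units = {"h": 3600, "m": 60, "s": 1}
    return sum(int(part[:-1]) * units[part[-1]] for part in t.split())

# Audio transcription: Ray Data vs Daft
assert round(to_seconds("29m 20s") / to_seconds("6m 22s"), 1) == 4.6
# Video object detection: Spark vs Daft
assert round(to_seconds("3h 36m") / to_seconds("11m 46s"), 1) == 18.4
```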
Several architectural decisions contribute to Daft's performance advantages: streaming execution keeps data in flight instead of materializing whole partitions, backpressure bounds memory usage when the GPU falls behind, and adaptive batch sizing keeps memory-heavy operations like downloads and decodes in check.
Based on the benchmark results and architectural differences, Daft shows significant advantages for large-scale multimodal workloads such as audio transcription, document embedding, image classification, and video object detection. Ray Data may be preferred when a pipeline already runs on Ray and needs tight integration with the rest of the Ray ecosystem.
When Tim Romanski of Essential AI set out to taxonomize 23.6 billion web documents from Common Crawl (24 trillion tokens), his team pushed Daft to its limits, scaling from local development to 32,000 requests per second per VM. As he shared in a panel discussion: "We pushed Daft to the limit and it's battle tested… If we had to do the same thing in Spark, we would have to have the JVM installed, go through all of its nuts and bolts just to get something running. So the time to get something running in the first place was a lot shorter. And then once we got it running locally, we just scaled up to multiple machines."
CloudKitchens rebuilt their entire ML infrastructure around what they call the "DREAM stack" (Daft, Ray, poEtry, Argo, Metaflow). When selecting their data processing layer, they identified specific limitations with Ray Data and chose Daft to complement Ray's compute capabilities. As their infrastructure team explained, "one issue with the Ray library for data processing, Ray Data, is that it doesn't cover the full range of DataFrame/ETL functions and its performance could be improved." They chose Daft because "it fills the gap of Ray Data by providing amazing DataFrame APIs" and noted that "in our tests, it's faster than Spark and uses fewer resources."
A data engineer from ByteDance commented on Daft's 300K image processing demonstration, sharing his own experience with an even larger image classification workload: "Not just 300,000 images - we ran image classification evaluations on the ImageNet dataset with approximately 1.28 million images, and Daft was about 20% faster than Ray Data." Additionally, in a separate technical analysis of Daft's architecture, he praised its "excellent execution performance and resource efficiency" and highlighted how it "effortlessly enables streaming processing of large-scale image datasets."