Toto, a pre-trained time series foundation model, delivers state-of-the-art performance across Long Sequence Forecasting (LSF) and Datadog benchmarks. Leveraging its Proportional Factorized Space-Time Attention mechanism and massive multi-domain training set of one trillion time points, Toto outperforms leading zero-shot models like Moirai, TimesFM, and Chronos on key metrics such as MAE, MSE, and sMAPE. Its strength lies in capturing complex temporal-spatial dependencies and adapting to diverse data conditions, making it a standout performer for both general-purpose forecasting and observability time series prediction.

Toto AI Model Sets New Benchmark for Time Series Forecasting

2025/10/23 02:53
  1. Background
  2. Problem statement
  3. Model architecture
  4. Training data
  5. Results
  6. Conclusions
  7. Impact statement
  8. Future directions
  9. Contributions
  10. Acknowledgements and References

Appendix

5 Results

We report experimental results for a pre-trained Toto model in Section 5.1 and Section 5.2.

To evaluate predictions, we sequentially divide a time series into context and forecast segments. We input the context segment into Toto and autoregressively generate output patches by sampling from the Student-T mixture model distribution. We forecast a number of steps equal to the nearest multiple of the patch size, then truncate the predictions to the desired length. In order to keep inference time consistent, we vary the number of samples generated based on the cardinality and length of the dataset, with a minimum of 100 samples. We take the median sample at each time step as the final point prediction. This prediction is then compared against the ground-truth forecast segment for evaluation.
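The rounding, truncation, and median steps described above can be sketched as follows. This is a minimal illustration, not Toto's actual inference code: it assumes that "nearest multiple" means the smallest multiple of the patch size that covers the horizon, the patch size of 32 is a hypothetical value chosen for the example, and random numbers stand in for samples drawn from the model.

```python
import math

import numpy as np


def forecast_steps(horizon: int, patch_size: int) -> int:
    """Round the horizon up to a whole number of patches (assumed rounding rule)."""
    return math.ceil(horizon / patch_size) * patch_size


def point_forecast(samples: np.ndarray, horizon: int) -> np.ndarray:
    """Truncate sampled trajectories to the horizon and take the per-step median.

    samples has shape (num_samples, generated_steps), with
    generated_steps >= horizon.
    """
    return np.median(samples[:, :horizon], axis=0)


# Stand-in for model output: 100 sampled trajectories (hypothetical data).
PATCH_SIZE = 32  # hypothetical patch size, not taken from the source
horizon = 336
steps = forecast_steps(horizon, PATCH_SIZE)
rng = np.random.default_rng(0)
samples = rng.standard_normal((100, steps))
prediction = point_forecast(samples, horizon)  # shape: (horizon,)
```

Taking the median rather than the mean of the samples keeps the point prediction robust to occasional extreme trajectories.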

Table 1. Comparison of different models with Toto on the LSF benchmark datasets. Results are averaged across prediction lengths of 96, 192, 336, and 720 steps. For Toto, we use a stride of 512 steps and a historical context window of 512 steps. For other models, we use the results reported in [15] and [19]. Metrics for each prediction length are available in Table A.2. *TimesFM only reports values for MAE on ETTh1, ETTh2, ETTm1, and ETTm2. Key: Best results, Second-best results.

5.1 LSF benchmarks

To assess general-purpose time series forecasting performance, we use the Long Sequence Forecasting (LSF) benchmark datasets (ETTh1, ETTh2, ETTm1, ETTm2, Electricity, and Weather) [12]. We evaluate with forecast lengths of 96, 192, 336, and 720 time steps, in sliding windows with stride 512, and average the results. For Toto, we use a historical context window of 512 steps and take the median of 200 samples. Following standard practice, we report Mean Absolute Error (MAE) and Mean Squared Error (MSE) on normalized data, with the normalization fitted on a training split, so that performance can be compared across datasets. We compare Toto's performance with the reported results of other recent zero-shot foundation models [15, 19], as well as full-shot time series forecasting models [14, 16, 17, 36, 44–47]. We display these results in Table 1.
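The sliding-window protocol and normalized metrics can be illustrated with a short sketch. It assumes that the first forecast segment begins after one full context window and that normalization is a standard z-score using the training split's mean and standard deviation; the text does not spell out either detail, and the function names are hypothetical.

```python
import numpy as np


def window_starts(series_len: int, context_len: int, horizon: int, stride: int) -> list[int]:
    """Start indices of forecast segments, each preceded by a full context window."""
    last_start = series_len - horizon
    return list(range(context_len, last_start + 1, stride))


def normalized_mae_mse(y_true: np.ndarray, y_pred: np.ndarray, train: np.ndarray) -> tuple[float, float]:
    """MAE and MSE after z-scoring both arrays with statistics from the training split.

    Assumes the training split is not constant (non-zero standard deviation).
    """
    mean, std = train.mean(), train.std()
    err = (y_true - mean) / std - (y_pred - mean) / std
    return float(np.abs(err).mean()), float((err ** 2).mean())


# Forecast segments for a series of 2000 steps, context 512, horizon 96, stride 512.
starts = window_starts(2000, context_len=512, horizon=96, stride=512)

# Toy arrays standing in for a training split, ground truth, and predictions.
mae, mse = normalized_mae_mse(
    y_true=np.array([2.0, 4.0]),
    y_pred=np.array([2.0, 2.0]),
    train=np.array([0.0, 2.0, 4.0]),
)
```

Because the errors are expressed in units of the training standard deviation, the resulting scores can be averaged across datasets with very different scales.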

Toto performs strongly across a variety of benchmark datasets, particularly in zero-shot scenarios. On the LSF datasets, Toto consistently outperforms other models in terms of MAE and MSE. For example, on the ETTh1 dataset, Toto achieves an MAE of 0.389 and an MSE of 0.363, outperforming all zero-shot models, including the previously reported Moirai series and TimesFM. Macro-averaging across the six LSF datasets, Toto achieves an MAE of 0.312 and MSE of 0.265, again improving on Moirai's reported zero-shot performance as well as the reported performance of the full-shot models.

Several architectural choices and data features likely contribute to Toto's superior performance. The novel Proportional Factorized Space-Time Attention mechanism allows Toto to efficiently capture both temporal and spatial dependencies within multivariate time series data. Additionally, the extensive training on a diverse dataset of one trillion time series points, including a mix of real-world observability metrics and multi-domain time series data, enhances Toto's ability to handle varied characteristics of different benchmark datasets.

While Toto generally excels, there are areas where its performance is closely matched by other models. In full-shot scenarios, models like PatchTST, Crossformer, and FEDformer show competitive results. For example, on the Electricity dataset, while Toto achieves a leading zero-shot MAE of 0.246 and MSE of 0.157, iTransformer and TimesNet also show strong performance, indicating that these models can catch up when additional training data is available.

Overall, Toto's architectural innovations and extensive training data enable it to achieve state-of-the-art performance across diverse benchmarks, excelling in zero-shot scenarios while remaining highly competitive in full-shot contexts.

5.2 Datadog benchmark

We created a benchmark using anonymous Datadog data to assess performance across various observability metrics. To ensure a representative and realistic sample, we sampled data based on quality and relevance signals from dashboards, monitor alerts, and notebooks. This benchmark comprises 983,994 data points from 82 distinct multivariate time series, encompassing 1,122 variates.

Table 2. Performance of Toto and other zero-shot models on the Datadog benchmark dataset. Key: Best results, Second-best results.

We analyzed summary statistics of the series in our benchmark to identify characteristics that make observability time series challenging to forecast. The categories and their definitions are as follows:

• Sparse: Series with a low density of observations, indicating infrequent recording of data or rare events.

• Extreme right skew: Series with a distribution heavily skewed to the right, characterized by a few very high values and many lower values.

• Seasonal: Series exhibiting regular and recurring patterns, often linked to daily, weekly, or yearly cycles.

• Flat: Series with minimal variability, showing little to no change over time.

The relative proportions of these cases are displayed in Table 3.
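To make the four categories concrete, the sketch below tags a single series using simple summary statistics. The text does not give the statistics or cutoffs actually used, so every threshold, the choice of statistics (observation density, sample skewness, coefficient of variation, autocorrelation at a seasonal lag), and all names here are hypothetical placeholders for illustration only.

```python
import numpy as np

# Hypothetical thresholds; none of these values come from the source.
SPARSE_MAX_DENSITY = 0.5
SKEW_MIN = 2.0
FLAT_MAX_CV = 0.01
SEASONAL_MIN_ACF = 0.5


def skewness(x: np.ndarray) -> float:
    """Sample skewness (third standardized moment); 0 for a constant series."""
    centered = x - x.mean()
    std = centered.std()
    return 0.0 if std == 0 else float((centered ** 3).mean() / std ** 3)


def autocorr(x: np.ndarray, lag: int) -> float:
    """Autocorrelation at the given lag; 0 if undefined."""
    centered = x - x.mean()
    denom = float((centered * centered).sum())
    if denom == 0 or lag <= 0 or lag >= len(x):
        return 0.0
    return float((centered[:-lag] * centered[lag:]).sum() / denom)


def categorize(values: np.ndarray, season_lag: int) -> set[str]:
    """Tag a 1-D series (NaN marks a missing observation) with zero or more categories."""
    observed = values[~np.isnan(values)]
    labels: set[str] = set()
    if observed.size / values.size < SPARSE_MAX_DENSITY:
        labels.add("sparse")
    if observed.size == 0:
        return labels
    if skewness(observed) > SKEW_MIN:
        labels.add("extreme right skew")
    if observed.std() <= FLAT_MAX_CV * max(abs(observed.mean()), 1e-12):
        labels.add("flat")
    if autocorr(observed, season_lag) > SEASONAL_MIN_ACF:
        labels.add("seasonal")
    return labels


# A daily cycle sampled hourly over ten days (synthetic example data).
t = np.arange(240)
daily = np.sin(2 * np.pi * t / 24)
labels = categorize(daily, season_lag=24)
```

A series can receive several tags at once, which is consistent with category proportions that do not sum to 100%.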

To assess the predictions of other zero-shot models on the Datadog benchmark, we follow the sampling procedures described in their respective manuscripts. In short, for Chronos models, we generate 20 samples and take the median prediction. For Moirai models, we take the median of 100 samples and set the patch size to “auto”. TimesFM only produces point predictions of the mean, so we use those directly. Since TimesFM and Chronos only support univariate forecasting, we process each variate independently. Moirai, on the other hand, like Toto, makes joint predictions for each group of related variates. For Toto, we utilize the same evaluation procedure we used on the LSF benchmarks.

The evaluation results (Table 2) demonstrate that Toto outperforms the other models. We evaluate using a prediction length of 365, the maximum forecast window available for previous time series models within the Datadog platform. We use a historical context window of 512 steps. Because observability data can have extreme variation in both magnitude and dispersion, we select symmetric mean absolute percentage error (sMAPE) as a scale-invariant performance metric [48]. We also report symmetric median absolute percentage error (sMdAPE), a robust version of sMAPE [49] that minimizes the influence of the extreme outliers present in observability data. With the lowest sMAPE of 0.672 and sMdAPE of 0.318, Toto proves to be the most accurate for forecasting observability time series data.
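For readers unfamiliar with these metrics, the sketch below uses one common definition, in which each point's error is 2|y − ŷ| / (|y| + |ŷ|) and the result lies between 0 and 2. The exact formula used for Table 2 is not stated in this section, so the definition, the handling of points where both values are zero, and the function names are assumptions.

```python
import numpy as np


def symmetric_ape(y_true, y_pred) -> np.ndarray:
    """Per-point symmetric absolute percentage error, as a fraction in [0, 2].

    Points where truth and prediction are both zero are scored as 0
    (an assumed convention to avoid 0/0).
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    numerator = 2.0 * np.abs(y_true - y_pred)
    denominator = np.abs(y_true) + np.abs(y_pred)
    return np.divide(
        numerator,
        denominator,
        out=np.zeros_like(numerator),
        where=denominator != 0,
    )


def smape(y_true, y_pred) -> float:
    """Symmetric mean absolute percentage error."""
    return float(np.mean(symmetric_ape(y_true, y_pred)))


def smdape(y_true, y_pred) -> float:
    """Symmetric median absolute percentage error."""
    return float(np.median(symmetric_ape(y_true, y_pred)))


# Toy data with one badly missed spike (synthetic, for illustration only).
truth = [1.0, 2.0, 0.0, 100.0]
forecast = [1.0, 1.0, 0.0, 0.0]
mean_score = smape(truth, forecast)
median_score = smdape(truth, forecast)
```

In this toy example the single missed spike contributes the maximum per-point error of 2 and pulls the mean well above the median, which is the outlier sensitivity that motivates reporting sMdAPE alongside sMAPE.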

These results suggest that current open datasets may not provide sufficient information to extrapolate to the specific nuances of observability data, highlighting the importance of training on more relevant data as demonstrated by Toto's superior performance.

Table 3. Breakdown of Datadog dataset based on case, computed based on the average characteristics of variates in each multivariate series. Note that these do not add to 100% because time series may fall into multiple categories.

:::info Authors:

(1) Ben Cohen (ben.cohen@datadoghq.com);

(2) Emaad Khwaja (emaad@datadoghq.com);

(3) Kan Wang (kan.wang@datadoghq.com);

(4) Charles Masson (charles.masson@datadoghq.com);

(5) Elise Rame (elise.rame@datadoghq.com);

(6) Youssef Doubli (youssef.doubli@datadoghq.com);

(7) Othmane Abou-Amal (othmane@datadoghq.com).

:::


:::info This paper is available on arxiv under CC BY 4.0 license.

:::
