Data Parallel MNIST with DTensor and TensorFlow Core


Content Overview

  • Introduction
  • Overview of data parallel training with DTensor
  • Setup
  • The MNIST Dataset
  • Preprocessing the data
  • Build the MLP
  • The dense layer
  • The MLP sequential model
  • Training metrics
  • Optimizer
  • Data packing
  • Training
  • Performance evaluation
  • Saving your model
  • Conclusion


Introduction

This notebook uses the TensorFlow Core low-level APIs and DTensor to demonstrate a data-parallel distributed training example.

Visit the Core APIs overview to learn more about TensorFlow Core and its intended use cases. Refer to the DTensor Overview guide and Distributed Training with DTensors tutorial to learn more about DTensor.

This example uses the same model and optimizer as those shown in the Multilayer Perceptrons tutorial. See this tutorial first to get comfortable with writing an end-to-end machine learning workflow with the Core APIs.


:::tip Note: DTensor is still an experimental TensorFlow API which means that its features are available for testing, and it is intended for use in test environments only.

:::


Overview of data parallel training with DTensor

Before building an MLP that supports distribution, take a moment to explore the fundamentals of DTensor for data parallel training.

DTensor allows you to run distributed training across devices to improve efficiency, reliability and scalability. DTensor distributes the program and tensors according to the sharding directives through a procedure called Single program, multiple data (SPMD) expansion. A variable of a DTensor aware layer is created as dtensor.DVariable, and the constructors of DTensor aware layer objects take additional Layout inputs in addition to the usual layer parameters.

The main ideas for data parallel training are as follows:

  • Model variables are replicated on N devices each.
  • A global batch is split into N per-replica batches.
  • Each per-replica batch is trained on the replica device.
  • The gradient is reduced before the weight update is collectively performed on all replicas.
  • Data parallel training provides a nearly linear speedup with respect to the number of devices. A minimal sketch of how these ideas map onto DTensor primitives follows this list.
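
As a rough illustration of how these ideas map onto DTensor primitives, the sketch below builds a one-dimensional "batch" mesh, a fully replicated layout for model weights, and a batch-sharded layout for data. It is only a sketch: it assumes eight logical CPU devices have been configured (as done in the Setup section below), and the variable names are illustrative.

import tensorflow as tf
from tensorflow.experimental import dtensor

# Assumes eight logical CPUs named 'CPU:0' ... 'CPU:7' exist (see Setup below).
sketch_mesh = dtensor.create_mesh([("batch", 8)], devices=[f'CPU:{i}' for i in range(8)])

# Weights: replicated on every device (no axis is sharded).
replicated_layout = dtensor.Layout([dtensor.UNSHARDED, dtensor.UNSHARDED], sketch_mesh)

# Data: the first (batch) axis is split across the 8 devices.
batch_sharded_layout = dtensor.Layout(['batch', dtensor.UNSHARDED], sketch_mesh)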

Setup

DTensor is part of the TensorFlow 2.9.0 release.


#!pip install --quiet --upgrade --pre tensorflow 

import matplotlib
from matplotlib import pyplot as plt
# Preset Matplotlib figure sizes.
matplotlib.rcParams['figure.figsize'] = [9, 6]

import tensorflow as tf
import tensorflow_datasets as tfds
from tensorflow.experimental import dtensor

print(tf.__version__)

# Set random seed for reproducible results
tf.random.set_seed(22)

2.17.0

Configure 8 virtual CPUs for this experiment. DTensor can also be used with GPU or TPU devices. Given that this notebook uses virtual devices, the speedup gained from distributed training is not noticeable.

def configure_virtual_cpus(ncpu):
  phy_devices = tf.config.list_physical_devices('CPU')
  tf.config.set_logical_device_configuration(phy_devices[0], [
        tf.config.LogicalDeviceConfiguration(),
    ] * ncpu)

configure_virtual_cpus(8)

DEVICES = [f'CPU:{i}' for i in range(8)]
devices = tf.config.list_logical_devices('CPU')
device_names = [d.name for d in devices]
device_names

['/device:CPU:0',
 '/device:CPU:1',
 '/device:CPU:2',
 '/device:CPU:3',
 '/device:CPU:4',
 '/device:CPU:5',
 '/device:CPU:6',
 '/device:CPU:7']

The MNIST Dataset

The dataset is available from TensorFlow Datasets. Split the data into training and testing sets. Only use 5000 examples for training and testing to save time.


train_data, test_data = tfds.load("mnist", split=['train[:5000]', 'test[:5000]'], batch_size=128, as_supervised=True) 
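
If you want to confirm what the batched dataset yields before preprocessing, you can inspect its element structure. The commented expectation below is an assumption based on the standard TFDS MNIST format (uint8 images of shape 28 x 28 x 1 and integer labels), not output captured from this notebook.

print(train_data.element_spec)
# Expect something like:
# (TensorSpec(shape=(None, 28, 28, 1), dtype=tf.uint8, ...),
#  TensorSpec(shape=(None,), dtype=tf.int64, ...))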

Preprocessing the data

Preprocess the data by reshaping it to be 2-dimensional and by rescaling it to fit into the unit interval, [0,1].

def preprocess(x, y):
  # Reshaping the data
  x = tf.reshape(x, shape=[-1, 784])
  # Rescaling the data
  x = x/255
  return x, y

train_data, test_data = train_data.map(preprocess), test_data.map(preprocess)

Build the MLP

Build an MLP model with DTensor aware layers.

The dense layer

Start by creating a dense layer module that supports DTensor. The dtensor.call_with_layout function can be used to call a function that takes in a DTensor input and produces a DTensor output. This is useful for initializing a DTensor variable, dtensor.DVariable, with a TensorFlow supported function.
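
To see what dtensor.call_with_layout does in isolation, here is a minimal sketch (not part of the original tutorial): it runs an ordinary TensorFlow function and places the result on the mesh with the requested layout, which can then back a dtensor.DVariable. It assumes a mesh like the one created later in this tutorial; the names are illustrative.

# Assumes `mesh` is a DTensor mesh such as the one built in the MLP section below.
replicated = dtensor.Layout([dtensor.UNSHARDED, dtensor.UNSHARDED], mesh)
w_init = dtensor.call_with_layout(tf.zeros, replicated, shape=(784, 700))
w = dtensor.DVariable(w_init)
print(w.layout)  # fully replicated: no axis of the weight is sharded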

class DenseLayer(tf.Module):

  def __init__(self, in_dim, out_dim, weight_layout, activation=tf.identity):
    super().__init__()
    # Initialize dimensions and the activation function
    self.in_dim, self.out_dim = in_dim, out_dim
    self.activation = activation

    # Initialize the DTensor weights using the Xavier scheme
    uniform_initializer = tf.function(tf.random.stateless_uniform)
    xavier_lim = tf.sqrt(6.)/tf.sqrt(tf.cast(self.in_dim + self.out_dim, tf.float32))
    self.w = dtensor.DVariable(
      dtensor.call_with_layout(
          uniform_initializer, weight_layout,
          shape=(self.in_dim, self.out_dim), seed=(22, 23),
          minval=-xavier_lim, maxval=xavier_lim))

    # Initialize the bias with zeros
    bias_layout = weight_layout.delete([0])
    self.b = dtensor.DVariable(
      dtensor.call_with_layout(tf.zeros, bias_layout, shape=[out_dim]))

  def __call__(self, x):
    # Compute the forward pass
    z = tf.add(tf.matmul(x, self.w), self.b)
    return self.activation(z)

The MLP sequential model

Now create an MLP module that executes the dense layers sequentially.

class MLP(tf.Module):

  def __init__(self, layers):
    self.layers = layers

  def __call__(self, x, preds=False):
    # Execute the model's layers sequentially
    for layer in self.layers:
      x = layer(x)
    return x

Performing "data-parallel" training with DTensor is equivalent to tf.distribute.MirroredStrategy. To do this each device will run the same model on a shard of the data batch. So you'll need the following:

  • dtensor.Mesh with a single "batch" dimension
  • dtensor.Layout for all the weights that replicates them across the mesh (using dtensor.UNSHARDED for each axis)
  • dtensor.Layout for the data that splits the batch dimension across the mesh

Create a DTensor mesh that consists of a single batch dimension, where each device becomes a replica that receives a shard from the global batch. Use this mesh to instantiate an MLP model with the following architecture:

Forward Pass: ReLU(784 x 700) x ReLU(700 x 500) x Softmax(500 x 10)

mesh = dtensor.create_mesh([("batch", 8)], devices=DEVICES)
weight_layout = dtensor.Layout([dtensor.UNSHARDED, dtensor.UNSHARDED], mesh)

input_size = 784
hidden_layer_1_size = 700
hidden_layer_2_size = 500
output_size = 10

mlp_model = MLP([
    DenseLayer(in_dim=input_size, out_dim=hidden_layer_1_size,
               weight_layout=weight_layout,
               activation=tf.nn.relu),
    DenseLayer(in_dim=hidden_layer_1_size, out_dim=hidden_layer_2_size,
               weight_layout=weight_layout,
               activation=tf.nn.relu),
    DenseLayer(in_dim=hidden_layer_2_size, out_dim=output_size,
               weight_layout=weight_layout)])
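
As an optional sanity check (a sketch, not part of the original tutorial), you can confirm that every model variable was created fully replicated by printing its layout; the Adam optimizer below relies on the same var.layout attribute.

for v in mlp_model.trainable_variables:
  print(v.shape, v.layout)  # every weight and bias should be unsharded on all axes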

Training metrics

Use the cross-entropy loss function and accuracy metric for training.

def cross_entropy_loss(y_pred, y):
  # Compute cross entropy loss with a sparse operation
  sparse_ce = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=y_pred)
  return tf.reduce_mean(sparse_ce)

def accuracy(y_pred, y):
  # Compute accuracy after extracting class predictions
  class_preds = tf.argmax(y_pred, axis=1)
  is_equal = tf.equal(y, class_preds)
  return tf.reduce_mean(tf.cast(is_equal, tf.float32))
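
As a quick, optional sanity check of these two functions (a sketch, not part of the original tutorial): with all-zero logits the softmax is uniform over the 10 classes, so the cross-entropy should be about ln(10) ≈ 2.303, and argmax falls back to class 0.

dummy_logits = tf.zeros([4, 10])                       # uniform scores over 10 classes
dummy_labels = tf.constant([1, 2, 3, 0], dtype=tf.int64)
print(cross_entropy_loss(dummy_logits, dummy_labels))  # ~2.3026
print(accuracy(dummy_logits, dummy_labels))            # 0.25: argmax of all-zeros is class 0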

Optimizer

Using an optimizer can result in significantly faster convergence compared to standard gradient descent. The Adam optimizer is implemented below and has been configured to be compatible with DTensor. To use Keras optimizers with DTensor, refer to the experimental tf.keras.dtensor.experimental.optimizers module.
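
For reference, the update implemented below is the standard Adam rule. For each variable θ with gradient g_t at step t, using the code's names (v_dvar for the first moment, s_dvar for the second moment, learning_rate α and ep ε):

\begin{aligned}
v_t &= \beta_1 v_{t-1} + (1-\beta_1)\,g_t \\
s_t &= \beta_2 s_{t-1} + (1-\beta_2)\,g_t^2 \\
\hat{v}_t &= \frac{v_t}{1-\beta_1^{t}}, \qquad \hat{s}_t = \frac{s_t}{1-\beta_2^{t}} \\
\theta_t &= \theta_{t-1} - \alpha\,\frac{\hat{v}_t}{\sqrt{\hat{s}_t}+\epsilon}
\end{aligned}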

class Adam(tf.Module):

    def __init__(self, model_vars, learning_rate=1e-3, beta_1=0.9, beta_2=0.999, ep=1e-7):
      # Initialize optimizer parameters and variable slots
      self.model_vars = model_vars
      self.beta_1 = beta_1
      self.beta_2 = beta_2
      self.learning_rate = learning_rate
      self.ep = ep
      self.t = 1.
      self.v_dvar, self.s_dvar = [], []
      # Initialize optimizer variable slots
      for var in model_vars:
        v = dtensor.DVariable(dtensor.call_with_layout(tf.zeros, var.layout, shape=var.shape))
        s = dtensor.DVariable(dtensor.call_with_layout(tf.zeros, var.layout, shape=var.shape))
        self.v_dvar.append(v)
        self.s_dvar.append(s)

    def apply_gradients(self, grads):
      # Update the model variables given their gradients
      for i, (d_var, var) in enumerate(zip(grads, self.model_vars)):
        self.v_dvar[i].assign(self.beta_1*self.v_dvar[i] + (1-self.beta_1)*d_var)
        self.s_dvar[i].assign(self.beta_2*self.s_dvar[i] + (1-self.beta_2)*tf.square(d_var))
        v_dvar_bc = self.v_dvar[i]/(1-(self.beta_1**self.t))
        s_dvar_bc = self.s_dvar[i]/(1-(self.beta_2**self.t))
        var.assign_sub(self.learning_rate*(v_dvar_bc/(tf.sqrt(s_dvar_bc) + self.ep)))
      self.t += 1.
      return

Data packing

Start by writing a helper function for transferring data to the device. This function should use dtensor.pack to send (and only send) the shard of the global batch that is intended for a replica to the device backing the replica. For simplicity, assume a single-client application.

Next, write a function that uses this helper function to pack the training data batches into DTensors sharded along the batch (first) axis. This ensures that DTensor evenly distributes the training data to the 'batch' mesh dimension. Note that in DTensor, the batch size always refers to the global batch size; therefore, the batch size should be chosen such that it can be divided evenly by the size of the batch mesh dimension. Additional DTensor APIs to simplify tf.data integration are planned, so please stay tuned.

def repack_local_tensor(x, layout):
  # Repacks a local Tensor-like to a DTensor with layout
  # This function assumes a single-client application
  x = tf.convert_to_tensor(x)
  sharded_dims = []

  # For every sharded dimension, use tf.split to split along the dimension.
  # The result is a nested list of split-tensors in queue[0].
  queue = [x]
  for axis, dim in enumerate(layout.sharding_specs):
    if dim == dtensor.UNSHARDED:
      continue
    num_splits = layout.shape[axis]
    queue = tf.nest.map_structure(lambda x: tf.split(x, num_splits, axis=axis), queue)
    sharded_dims.append(dim)

  # Now you can build the list of component tensors by looking up the location in
  # the nested list of split-tensors created in queue[0].
  components = []
  for locations in layout.mesh.local_device_locations():
    t = queue[0]
    for dim in sharded_dims:
      split_index = locations[dim]  # Only valid on single-client mesh.
      t = t[split_index]
    components.append(t)

  return dtensor.pack(components, layout)

def repack_batch(x, y, mesh):
  # Pack training data batches into DTensors along the batch axis
  x = repack_local_tensor(x, layout=dtensor.Layout(['batch', dtensor.UNSHARDED], mesh))
  y = repack_local_tensor(y, layout=dtensor.Layout(['batch'], mesh))
  return x, y
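
As an optional check (a sketch, not part of the original tutorial; dtensor.unpack returns the per-device component tensors of a DTensor), you can pack one batch and confirm that each of the 8 replicas holds 1/8 of it. With a global batch size of 128, each component should hold 16 examples, which is why the global batch size must divide evenly by the size of the 'batch' mesh dimension.

x_batch, y_batch = next(iter(train_data))
x_d, y_d = repack_batch(x_batch, y_batch, mesh)
components = dtensor.unpack(x_d)
print(len(components))      # 8: one component tensor per device
print(components[0].shape)  # (16, 784) for a global batch of 128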

Training

Write a traceable function that executes a single training step given a batch of data. This function does not require any special DTensor annotations. Also write a function that executes a test step and returns the appropriate performance metrics.

@tf.function
def train_step(model, x_batch, y_batch, loss, metric, optimizer):
  # Execute a single training step
  with tf.GradientTape() as tape:
    y_pred = model(x_batch)
    batch_loss = loss(y_pred, y_batch)
  # Compute gradients and update the model's parameters
  grads = tape.gradient(batch_loss, model.trainable_variables)
  optimizer.apply_gradients(grads)
  # Return batch loss and accuracy
  batch_acc = metric(y_pred, y_batch)
  return batch_loss, batch_acc

@tf.function
def test_step(model, x_batch, y_batch, loss, metric):
  # Execute a single testing step
  y_pred = model(x_batch)
  batch_loss = loss(y_pred, y_batch)
  batch_acc = metric(y_pred, y_batch)
  return batch_loss, batch_acc

Now, train the MLP model for 3 epochs with a batch size of 128.

# Initialize the training loop parameters and structures
epochs = 3
batch_size = 128
train_losses, test_losses = [], []
train_accs, test_accs = [], []
optimizer = Adam(mlp_model.trainable_variables)

# Format training loop
for epoch in range(epochs):
  batch_losses_train, batch_accs_train = [], []
  batch_losses_test, batch_accs_test = [], []

  # Iterate through training data
  for x_batch, y_batch in train_data:
    x_batch, y_batch = repack_batch(x_batch, y_batch, mesh)
    batch_loss, batch_acc = train_step(mlp_model, x_batch, y_batch, cross_entropy_loss, accuracy, optimizer)
    # Keep track of batch-level training performance
    batch_losses_train.append(batch_loss)
    batch_accs_train.append(batch_acc)

  # Iterate through testing data
  for x_batch, y_batch in test_data:
    x_batch, y_batch = repack_batch(x_batch, y_batch, mesh)
    batch_loss, batch_acc = test_step(mlp_model, x_batch, y_batch, cross_entropy_loss, accuracy)
    # Keep track of batch-level testing
    batch_losses_test.append(batch_loss)
    batch_accs_test.append(batch_acc)

  # Keep track of epoch-level model performance
  train_loss, train_acc = tf.reduce_mean(batch_losses_train), tf.reduce_mean(batch_accs_train)
  test_loss, test_acc = tf.reduce_mean(batch_losses_test), tf.reduce_mean(batch_accs_test)
  train_losses.append(train_loss)
  train_accs.append(train_acc)
  test_losses.append(test_loss)
  test_accs.append(test_acc)
  print(f"Epoch: {epoch}")
  print(f"Training loss: {train_loss.numpy():.3f}, Training accuracy: {train_acc.numpy():.3f}")
  print(f"Testing loss: {test_loss.numpy():.3f}, Testing accuracy: {test_acc.numpy():.3f}")

Epoch: 0
Training loss: 1.850, Training accuracy: 0.343
Testing loss: 1.375, Testing accuracy: 0.504
Epoch: 1
Training loss: 1.028, Training accuracy: 0.674
Testing loss: 0.744, Testing accuracy: 0.782
Epoch: 2
Training loss: 0.578, Training accuracy: 0.839
Testing loss: 0.486, Testing accuracy: 0.869

Performance evaluation

Start by writing a plotting function to visualize the model's loss and accuracy during training.

def plot_metrics(train_metric, test_metric, metric_type):
  # Visualize metrics vs training Epochs
  plt.figure()
  plt.plot(range(len(train_metric)), train_metric, label = f"Training {metric_type}")
  plt.plot(range(len(test_metric)), test_metric, label = f"Testing {metric_type}")
  plt.xlabel("Epochs")
  plt.ylabel(metric_type)
  plt.legend()
  plt.title(f"{metric_type} vs Training Epochs");


plot_metrics(train_losses, test_losses, "Cross entropy loss") 


plot_metrics(train_accs, test_accs, "Accuracy") 


Saving your model

The integration of tf.saved_model and DTensor is still under development. As of TensorFlow 2.9.0, tf.saved_model only accepts DTensor models with fully replicated variables. As a workaround, you can convert a DTensor model to a fully replicated one by reloading a checkpoint. However, after a model is saved, all DTensor annotations are lost and the saved signatures can only be used with regular Tensors. This tutorial will be updated to showcase the integration once it is solidified.
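
Until then, one possible workaround for persisting the weights of this (already fully replicated) model is an ordinary checkpoint. The sketch below is an assumption, not an API demonstrated by this tutorial: it presumes that tf.train.Checkpoint can track the model's DVariables in your TensorFlow version, which may not hold for every release.

# Hypothetical sketch: checkpoint the replicated model variables.
checkpoint = tf.train.Checkpoint(model=mlp_model)
save_path = checkpoint.save('./mlp_dtensor_ckpt')  # path is illustrative
# Later, restore with: checkpoint.restore(save_path)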

Conclusion

This notebook provided an overview of distributed training with DTensor and the TensorFlow Core APIs. Here are a few more tips that may help:

  • The TensorFlow Core APIs can be used to build highly-configurable machine learning workflows with support for distributed training.
  • The DTensor concepts guide and Distributed training with DTensors tutorial contain the most up-to-date information about DTensor and its integrations.

For more examples of using the TensorFlow Core APIs, check out the guide. If you want to learn more about loading and preparing data, see the tutorials on image data loading or CSV data loading.


:::info Originally published on the TensorFlow website, this article appears here under a new headline and is licensed under CC BY 4.0. Code samples shared under the Apache 2.0 License.

:::
