프로그래밍/python

tensorflow dataset padded_batch 함수

ksyoon 2021. 1. 22. 15:38

입력데이터의 크기가 가변 일때 같은 크기로 읽을 수 있게 변환해 주는 함수이다. 샘플을 보면 그리 어렵지 않게 사용법을 익힐수 있다.

padded_batch

 

padded_batch(
    batch_size
, padded_shapes=None, padding_values=None, drop_remainder=False
)

 

Combines consecutive elements of this dataset into padded batches.

This transformation combines multiple consecutive elements of the input dataset into a single element.

Like tf.data.Dataset.batch, the components of the resulting element will have an additional outer dimension, which will be batch_size (or N % batch_size for the last element if batch_size does not divide the number of input elements N evenly and drop_remainder is False). If your program depends on the batches having the same outer dimension, you should set the drop_remainder argument to True to prevent the smaller batch from being produced.

Unlike tf.data.Dataset.batch, the input elements to be batched may have different shapes, and this transformation will pad each component to the respective shape in padded_shapes. The padded_shapes argument determines the resulting shape for each dimension of each component in an output element:

  • If the dimension is a constant, the component will be padded out to that length in that dimension.
  • If the dimension is unknown, the component will be padded out to the maximum length of all elements in that dimension.

A = (tf.data.Dataset
     .range(1, 5, output_type=tf.int32)
     .map(lambda x: tf.fill([x], x)))
# Pad to the smallest per-batch size that fits all elements.
B = A.padded_batch(2)
for element in B.as_numpy_iterator():
  print(element)




# Pad to a fixed size.
C = A.padded_batch(2, padded_shapes=5)
for element in C.as_numpy_iterator():
  print(element)




# Pad with a custom value.
D = A.padded_batch(2, padded_shapes=5, padding_values=-1)
for element in D.as_numpy_iterator():
  print(element)




# Components of nested elements can be padded independently.
elements = [([1, 2, 3], [10]),
            ([4, 5], [11, 12])]
dataset = tf.data.Dataset.from_generator(
    lambda: iter(elements), (tf.int32, tf.int32))
# Pad the first component of the tuple to length 4, and the second
# component to the smallest size that fits.
dataset = dataset.padded_batch(2,
    padded_shapes=([4], [None]),
    padding_values=(-1, 100))
list(dataset.as_numpy_iterator())


# Pad with a single value and multiple components.
E = tf.data.Dataset.zip((A, A)).padded_batch(2, padding_values=-1)
for element in E.as_numpy_iterator():
  print(element)






See also tf.data.experimental.dense_to_sparse_batch, which combines elements that may have different shapes into a tf.sparse.SparseTensor.

batch_size A tf.int64 scalar tf.Tensor, representing the number of consecutive elements of this dataset to combine in a single batch.
padded_shapes (Optional.) A nested structure of tf.TensorShape or tf.int64 vector tensor-like objects representing the shape to which the respective component of each input element should be padded prior to batching. Any unknown dimensions will be padded to the maximum size of that dimension in each batch. If unset, all dimensions of all components are padded to the maximum size in the batch. padded_shapes must be set if any component has an unknown rank.
padding_values (Optional.) A nested structure of scalar-shaped tf.Tensor, representing the padding values to use for the respective components. None represents that the nested structure should be padded with default values. Defaults are 0 for numeric types and the empty string for string types. The padding_values should have the same structure as the input dataset. If padding_values is a single element and the input dataset has multiple components, then the same padding_values will be used to pad every component of the dataset. If padding_values is a scalar, then its value will be broadcasted to match the shape of each component.
drop_remainder (Optional.) A tf.bool scalar tf.Tensor, representing whether the last batch should be dropped in the case it has fewer than batch_size elements; the default behavior is not to drop the smaller batch.

Args

Dataset A Dataset.

Returns

ValueError If a component has an unknown rank, and the padded_shapes argument is not set.

https://www.tensorflow.org/api_docs/python/tf/compat/v1/data/Dataset#padded_batch