[TFRecord] TensorFlow's Default Standard Data Format

Published: 2019-03-21

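Learning TFRecord, TensorFlow's default standard data format

Introduction

In engineering projects, datasets usually arrive in many different formats. To manage them uniformly, you can convert them to a single format. TFRecord, the format defined by TensorFlow, is a flexible and efficient way to store data.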

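Characteristics of the TFRecord format

Binary file: a TFRecord file is a simple binary file that contains serialized input data.

Protocol buffers (protobuf): records are serialized with protobuf, which is cross-platform and cross-language, so the data format is the same regardless of where it is written or read.

Consolidated organization: a single format keeps data from being scattered across many files, and all attributes of an instance are stored in the same file.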
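To make the "simple binary file" point concrete, here is a minimal pure-Python sketch of how records are framed on disk: an 8-byte little-endian length, a 4-byte masked CRC32C of that length, the payload bytes, and a 4-byte masked CRC32C of the payload. The helper names (`crc32c`, `masked_crc32c`, `write_record`, `read_records`) are hypothetical and are not TensorFlow APIs. In practice the payload would be a serialized `tf.train.Example`, while the sketch uses arbitrary bytes.

```python
import io
import struct


def _make_crc32c_table():
    """Build the lookup table for CRC32C (Castagnoli, reflected polynomial)."""
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_CRC_TABLE = _make_crc32c_table()


def crc32c(data):
    """Return the CRC32C checksum of a bytes object."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _CRC_TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def masked_crc32c(data):
    """Rotate the CRC right by 15 bits and add a constant, in 32-bit arithmetic."""
    crc = crc32c(data)
    rotated = (crc >> 15) | ((crc << 17) & 0xFFFFFFFF)
    return (rotated + 0xA282EAD8) & 0xFFFFFFFF


def write_record(stream, payload):
    """Append one framed record to a binary stream."""
    header = struct.pack('<Q', len(payload))
    stream.write(header)
    stream.write(struct.pack('<I', masked_crc32c(header)))
    stream.write(payload)
    stream.write(struct.pack('<I', masked_crc32c(payload)))


def read_records(stream):
    """Yield each payload from a binary stream, verifying both checksums."""
    while True:
        header = stream.read(8)
        if not header:
            return  # clean end of file
        if len(header) < 8:
            raise ValueError('truncated record header')
        (length_crc,) = struct.unpack('<I', stream.read(4))
        if length_crc != masked_crc32c(header):
            raise ValueError('corrupted record length')
        (length,) = struct.unpack('<Q', header)
        payload = stream.read(length)
        (payload_crc,) = struct.unpack('<I', stream.read(4))
        if payload_crc != masked_crc32c(payload):
            raise ValueError('corrupted record payload')
        yield payload


# Usage: frame two payloads in memory and read them back.
buffer = io.BytesIO()
write_record(buffer, b'first record')
write_record(buffer, b'second record')
buffer.seek(0)
records = list(read_records(buffer))
```

Each record costs 16 bytes of framing on top of its payload, which is why many small examples can be packed into one file cheaply.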

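Advantages

Efficient processing: the data is stored together in a single block rather than as many separate input files, which avoids the time cost of opening and reading a large number of files.

Multithreading support: TensorFlow provides optimized tools for processing TFRecord files efficiently through multithreaded input pipelines.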

Data storage

Writing data

First, convert the input files to the TFRecord format. For example:

Converting the MNIST image set:

# Note: this listing targets TensorFlow 1.x; tf.contrib was removed in TensorFlow 2.x.
from __future__ import print_function
import os
import tensorflow as tf
from tensorflow.contrib.learn.python.learn.datasets import mnist
import numpy as np

save_dir = 'c:/tmp/data'

# Download the data (it is cached in save_dir)
data_sets = mnist.read_data_sets(save_dir, dtype=tf.uint8, reshape=False, validation_size=1000)

Write the data out:

data_splits = ['train', 'test', 'validation']
for split in data_splits:
    print('Saving ' + split)
    # Look up each split by name. data_sets is a namedtuple ordered
    # (train, validation, test), so indexing it by position with this list
    # would swap the test and validation sets.
    data_set = getattr(data_sets, split)
    filename = os.path.join(save_dir, split + '.tfrecords')
    writer = tf.python_io.TFRecordWriter(filename)
    for index in range(data_set.images.shape[0]):
        # Serialize the raw uint8 pixels (tobytes replaces the deprecated tostring)
        image = data_set.images[index].tobytes()
        example = tf.train.Example(
            features=tf.train.Features(
                feature={
                    'height': tf.train.Feature(int64_list=tf.train.Int64List(value=[data_set.images.shape[1]])),
                    'width': tf.train.Feature(int64_list=tf.train.Int64List(value=[data_set.images.shape[2]])),
                    'depth': tf.train.Feature(int64_list=tf.train.Int64List(value=[data_set.images.shape[3]])),
                    'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[int(data_set.labels[index])])),
                    'image_raw': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image]))
                })
        )
        # Each record is one serialized tf.train.Example
        writer.write(example.SerializeToString())
    writer.close()

Reading data

To read the file back, use tf.python_io.tf_record_iterator:

from tensorflow import python_io

filename = os.path.join(save_dir, 'train.tfrecords')
record_iterator = python_io.tf_record_iterator(filename)
# Take the first serialized record from the file
serialized_img_example = next(record_iterator)

Parse the record:

example = tf.train.Example()
example.ParseFromString(serialized_img_example)
# bytes_list.value is a repeated field; the image bytes are its first element
image = example.features.feature['image_raw'].bytes_list.value
label = example.features.feature['label'].int64_list.value[0]
width = example.features.feature['width'].int64_list.value[0]
height = example.features.feature['height'].int64_list.value[0]

Restore the image:

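# np.frombuffer replaces np.fromstring, which is deprecated for binary data
img_flat = np.frombuffer(image[0], dtype=np.uint8)
# The stored height and width recover the layout; -1 infers the depth
img_reshaped = img_flat.reshape((height, width, -1))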
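The listings above rely on TensorFlow 1.x modules (`tf.contrib` and `tf.python_io`) that are not available in TensorFlow 2.x. As a rough sketch of the same write, parse, and restore round trip with the `tf.io` and `tf.data` APIs, the block below uses random synthetic images in place of MNIST so that it runs without a download. The names `write_images`, `parse_example`, `read_images`, `FEATURE_SPEC`, and the file name `demo.tfrecords` are assumptions made for illustration, and `num_parallel_calls=2` is an arbitrary value chosen to show parallel parsing.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf


def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))


def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))


def write_images(path, images, labels):
    """Write uint8 images of shape (N, H, W, C) and integer labels to path."""
    _, height, width, depth = images.shape
    with tf.io.TFRecordWriter(path) as writer:
        for image, label in zip(images, labels):
            example = tf.train.Example(features=tf.train.Features(feature={
                'height': _int64_feature(height),
                'width': _int64_feature(width),
                'depth': _int64_feature(depth),
                'label': _int64_feature(int(label)),
                'image_raw': _bytes_feature(image.tobytes()),
            }))
            writer.write(example.SerializeToString())


# Describes how each stored feature should be decoded when parsing.
FEATURE_SPEC = {
    'height': tf.io.FixedLenFeature([], tf.int64),
    'width': tf.io.FixedLenFeature([], tf.int64),
    'depth': tf.io.FixedLenFeature([], tf.int64),
    'label': tf.io.FixedLenFeature([], tf.int64),
    'image_raw': tf.io.FixedLenFeature([], tf.string),
}


def parse_example(serialized):
    """Decode one serialized Example into an (image, label) pair."""
    parsed = tf.io.parse_single_example(serialized, FEATURE_SPEC)
    flat = tf.io.decode_raw(parsed['image_raw'], tf.uint8)
    shape = tf.stack([parsed['height'], parsed['width'], parsed['depth']])
    return tf.reshape(flat, shape), parsed['label']


def read_images(path):
    """Return a dataset of (image, label) pairs parsed in parallel."""
    dataset = tf.data.TFRecordDataset(path)
    return dataset.map(parse_example, num_parallel_calls=2)


# Usage: write three synthetic 28x28x1 images, then read them back.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(3, 28, 28, 1), dtype=np.uint8)
labels = np.array([7, 2, 1])
path = os.path.join(tempfile.mkdtemp(), 'demo.tfrecords')
write_images(path, images, labels)
restored = [(img.numpy(), int(lbl)) for img, lbl in read_images(path)]
```

Storing height, width, and depth alongside the raw bytes is what lets the reader rebuild the image shape, since the byte string itself carries no layout information.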

Summary

TFRecord gives TensorFlow an efficient way to store and load data, and the library supports both writing and reading the format.

Reposted from: http://ndigz.baihongyu.com/
