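TensorFlow can now "preload" datasets: the new TensorFlow Datasets feature is out

Xiao Cha 2019-02-27 15:23:37 Source: QbitAI (量子位)

Guo Yipu, reporting from Aofeisi

QbitAI report | WeChat official account: QbitAI

When you train a machine learning model, you first have to find a dataset, download it, and load it, which is a lot of hassle. For datasets such as MNIST that the whole world uses, couldn't there be a one-step way to load them?

Google thinks so too.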

Today TensorFlow released a new feature called TensorFlow Datasets, which loads public datasets into TensorFlow as tf.data.Datasets or as NumPy arrays.

At present, 29 datasets can be loaded through TensorFlow Datasets:

Audio

nsynth

Image

cats_vs_dogs

celeb_a

celeb_a_hq

cifar10

cifar100

coco2014

colorectal_histology

colorectal_histology_large

diabetic_retinopathy_detection

fashion_mnist

image_label_folder

imagenet2012

lsun

mnist

omniglot

open_images_v4

quickdraw_bitmap

svhn_cropped

tf_flowers

Structured

titanic

Text

imdb_reviews

lm1b

squad

Translation

wmt_translate_ende

wmt_translate_enfr

Video

bair_robot_pushing_small

moving_mnist

starcraft_video

More datasets will be added in the future, and you can also add your own.

How to install

TensorFlow Datasets requires TensorFlow 1.12 or later, and some datasets need additional libraries.

pip install tensorflow-datasets

# Requires TF 1.12+ to be installed.
# Some datasets require additional libraries; see setup.py extras_require
pip install tensorflow
# or:
pip install tensorflow-gpu
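
Since the 1.12 minimum is easy to get wrong when comparing version strings by eye ("1.9" sorts after "1.12" as text), here is a minimal sketch of a numeric check. The helper names parse_version, meets_minimum and installed_tensorflow_version are made up for this illustration and are not part of TensorFlow or TensorFlow Datasets; only the standard library is used.

```python
import re
from importlib import metadata


def parse_version(text):
    """Turn '1.12.0' or '2.0.0-rc1' into a tuple of leading integers."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum="1.12"):
    """Return True if `installed` is numerically at least `minimum`."""
    a, b = parse_version(installed), parse_version(minimum)
    # Pad with zeros so that (1, 12) and (1, 12, 0) compare as equal.
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def installed_tensorflow_version():
    """Return the installed tensorflow / tensorflow-gpu version, or None."""
    for package in ("tensorflow", "tensorflow-gpu"):
        try:
            return metadata.version(package)
        except metadata.PackageNotFoundError:
            continue
    return None


if __name__ == "__main__":
    version = installed_tensorflow_version()
    if version is None:
        print("TensorFlow is not installed")
    elif meets_minimum(version):
        print(version, "is new enough for tensorflow-datasets")
    else:
        print(version, "is older than 1.12")
```

The check runs whether or not TensorFlow is installed, so it can be dropped into a setup script before the pip commands above.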

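Each dataset is exposed as a DatasetBuilder, which knows:

1. Where to download the dataset from, and how to extract the data and write it to a standard format;

2. How to load it from disk;

3. Information about the dataset, such as the names and types of its features.

A DatasetBuilder can be instantiated directly or fetched by its string name with tfds.builder: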

import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds

# Fetch the dataset directly
mnist = tfds.image.MNIST()
# or by string name
mnist = tfds.builder('mnist')

# Describe the dataset with DatasetInfo
assert mnist.info.features['image'].shape == (28, 28, 1)
assert mnist.info.features['label'].num_classes == 10
assert mnist.info.splits['train'].num_examples == 60000

# Download the data, prepare it, and write it to disk
mnist.download_and_prepare()

# Load data from disk as tf.data.Datasets
datasets = mnist.as_dataset()
train_dataset, test_dataset = datasets['train'], datasets['test']
assert isinstance(train_dataset, tf.data.Dataset)

# And convert the Dataset to NumPy arrays if you'd like
for example in tfds.as_numpy(train_dataset):
    image, label = example['image'], example['label']
    # np.ndarray is the array type; np.array is only a factory function
    assert isinstance(image, np.ndarray)
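
Getting the same builder either from its class or from a string name can be pictured as a registry keyed by the snake_case form of the class name. The following is only a toy sketch of that idea in plain Python: ToyBuilder, toy_builder and camel_to_snake are hypothetical names, and the real library may implement the lookup differently.

```python
import re

_REGISTRY = {}


def camel_to_snake(name):
    """Convert a class name such as 'IMDBReviews' to 'imdb_reviews'."""
    step = re.sub(r"(.)([A-Z][a-z]+)", r"\1_\2", name)
    return re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", step).lower()


class ToyBuilder:
    """Hypothetical stand-in for a dataset builder base class."""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Every subclass registers itself under its snake_case name.
        _REGISTRY[camel_to_snake(cls.__name__)] = cls

    @property
    def name(self):
        return camel_to_snake(type(self).__name__)


class MNIST(ToyBuilder):
    pass


class IMDBReviews(ToyBuilder):
    pass


def toy_builder(name):
    """Look a builder class up by its string name and instantiate it."""
    try:
        return _REGISTRY[name]()
    except KeyError:
        raise ValueError("unknown dataset: %r" % name) from None


direct = MNIST()                 # instantiate the class directly
by_name = toy_builder("mnist")   # or fetch it by string name
print(type(direct) is type(by_name))
print(sorted(_REGISTRY))
```

A string-keyed registry is convenient when the dataset name comes from a command-line flag or a config file rather than from code.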

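You can also use tfds.load, a convenience call that fetches the builder by name, downloads and prepares the data, and loads it in one step; batching and other transformations can then be applied to the returned datasets.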

import tensorflow as tf
import tensorflow_datasets as tfds

datasets = tfds.load("mnist")
train_dataset, test_dataset = datasets["train"], datasets["test"]
assert isinstance(train_dataset, tf.data.Dataset)

Dataset versioning

When a dataset itself is updated, the data you have already been training on does not change: the TensorFlow team publishes the updated data as a new version instead of modifying the existing one.
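
The versioning idea can be illustrated with a toy store in which every (name, version) pair is written once and never changed. This is a sketch of the principle only: ToyVersionedStore and the "toy_reviews" dataset are invented for the example and say nothing about how TensorFlow Datasets stores data internally.

```python
class ToyVersionedStore:
    """Hypothetical store: published versions are immutable."""

    def __init__(self):
        self._data = {}

    def publish(self, name, version, examples):
        key = (name, version)
        if key in self._data:
            # Existing versions are never overwritten.
            raise ValueError("%s %s already published; add a new version" % key)
        self._data[key] = tuple(examples)  # store an immutable copy

    def latest(self, name):
        versions = [v for (n, v) in self._data if n == name]
        if not versions:
            raise KeyError(name)
        # Compare versions numerically, not as text.
        return max(versions, key=lambda v: tuple(int(p) for p in v.split(".")))

    def load(self, name, version=None):
        if version is None:
            version = self.latest(name)
        return self._data[(name, version)]


store = ToyVersionedStore()
store.publish("toy_reviews", "1.0.0", ["good", "bad"])
store.publish("toy_reviews", "1.1.0", ["good", "bad", "fine"])

print(store.load("toy_reviews", "1.0.0"))  # pinned version, unchanged
print(store.load("toy_reviews"))           # newest version
```

Pinning the version in a training script is what keeps results reproducible after a newer release appears.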

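Dataset configuration

Datasets that come in several variants are configured with BuilderConfigs. The Large Movie Review Dataset, for example, can encode its input text in different ways.

Built-in configurations are listed with the dataset documentation and can be addressed by string.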

import tensorflow_datasets as tfds

# See the built-in configs
configs = tfds.text.IMDBReviews.builder_configs
assert "bytes" in configs

# Address a built-in config with tfds.builder
imdb = tfds.builder("imdb_reviews/bytes")
# or when constructing the builder directly
imdb = tfds.text.IMDBReviews(config="bytes")
# or use your own custom configuration
my_encoder = tfds.features.text.ByteTextEncoder(additional_tokens=['hello'])
my_config = tfds.text.IMDBReviewsConfig(
    name="my_config",
    version="1.0.0",
    text_encoder_config=tfds.features.text.TextEncoderConfig(encoder=my_encoder),
)
imdb = tfds.text.IMDBReviews(config=my_config)

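When you add your own dataset, you can give it configurations through tfds.core.BuilderConfig by following these steps:

1. Define your own configuration object as a subclass of tfds.core.BuilderConfig, for example MyDatasetConfig;

2. Define the BUILDER_CONFIGS class member in MyDataset, listing the MyDatasetConfig objects that the dataset exposes;

3. Use self.builder_config in MyDataset to configure data generation, which may include setting different values in _info() or changing how the downloaded data is accessed.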
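
The three steps can be sketched in plain Python as follows. To keep the example runnable without TensorFlow, BuilderConfig here is a simplified stand-in for tfds.core.BuilderConfig, and the lowercase option, the "plain" and "lower" config names, and generate_examples are assumptions made up for the illustration.

```python
class BuilderConfig:
    """Simplified, hypothetical stand-in for tfds.core.BuilderConfig."""

    def __init__(self, name, version, description=""):
        self.name = name
        self.version = version
        self.description = description


# Step 1: subclass the config and add dataset-specific options.
class MyDatasetConfig(BuilderConfig):
    def __init__(self, lowercase=False, **kwargs):
        super().__init__(**kwargs)
        self.lowercase = lowercase  # hypothetical option


class MyDataset:
    # Step 2: list the configs this dataset exposes.
    BUILDER_CONFIGS = [
        MyDatasetConfig(name="plain", version="1.0.0"),
        MyDatasetConfig(name="lower", version="1.0.0", lowercase=True),
    ]

    def __init__(self, config="plain"):
        # Accept either a config name or a config object.
        if isinstance(config, str):
            by_name = {c.name: c for c in self.BUILDER_CONFIGS}
            config = by_name[config]
        self.builder_config = config

    # Step 3: let the active config steer data generation.
    def generate_examples(self, raw_lines):
        for line in raw_lines:
            yield line.lower() if self.builder_config.lowercase else line


print(list(MyDataset("lower").generate_examples(["Hello World"])))
custom = MyDatasetConfig(name="mine", version="0.1.0", lowercase=True)
print(MyDataset(custom).builder_config.name)
```

Keeping the variant-specific switches on the config object means the dataset class itself stays the same for every variant.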

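About text datasets

Text datasets are usually awkward to work with, but TensorFlow Datasets makes them easier to handle. It includes many text tasks and three kinds of text encoders:

1. ByteTextEncoder, for byte/character-level encoding;

2. TokenTextEncoder, for word-level encoding based on a vocabulary file;

3. SubwordTextEncoder, for subword-level encoding with a byte-level fallback so that it is fully invertible. For example, "hello world" could be split into ["he", "llo", " ", "wor", "ld"] and then integer-encoded.

All of these support Unicode.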
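
To make the byte-level fallback concrete, here is a toy subword encoder in plain Python. It is a sketch under stated assumptions, not the library's SubwordTextEncoder: the class name ToySubwordEncoder, the greedy longest-match rule, and the id layout (0 kept free for padding, then subwords, then 256 byte ids) are all choices made for this illustration.

```python
class ToySubwordEncoder:
    """Greedy longest-match subword encoder with a UTF-8 byte fallback."""

    def __init__(self, subwords):
        # Longest pieces first; ties broken alphabetically for stable ids.
        pieces = {s for s in subwords if s}
        self._subwords = sorted(pieces, key=lambda s: (-len(s), s))
        self._ids = {s: i + 1 for i, s in enumerate(self._subwords)}
        self._byte_offset = len(self._subwords) + 1  # first byte id

    @property
    def vocab_size(self):
        return self._byte_offset + 256

    def encode(self, text):
        ids, pos = [], 0
        while pos < len(text):
            for piece in self._subwords:
                if text.startswith(piece, pos):
                    ids.append(self._ids[piece])
                    pos += len(piece)
                    break
            else:
                # No subword matches: fall back to the character's raw bytes.
                raw = text[pos].encode("utf-8")
                ids.extend(self._byte_offset + b for b in raw)
                pos += 1
        return ids

    def decode(self, ids):
        out = bytearray()
        for i in ids:
            if i == 0:
                continue  # padding
            if i >= self._byte_offset:
                out.append(i - self._byte_offset)
            else:
                out.extend(self._subwords[i - 1].encode("utf-8"))
        return out.decode("utf-8")


encoder = ToySubwordEncoder(["he", "llo", " ", "wor", "ld"])
ids = encoder.encode("hello world")
print(ids)
print(encoder.decode(ids) == "hello world")
print(encoder.decode(encoder.encode("héllo 世界")) == "héllo 世界")
```

Because any character outside the subword list is spelled out as bytes, every Unicode string can be encoded and decoded back exactly, which is the point of the fallback.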

The encoder and its vocabulary size can be accessed like this:

import tensorflow_datasets as tfds

imdb = tfds.builder("imdb_reviews/subwords8k")

# Get the TextEncoder from DatasetInfo
encoder = imdb.info.features["text"].encoder
assert isinstance(encoder, tfds.features.text.SubwordTextEncoder)

# Encode, decode
ids = encoder.encode("Hello world")
assert encoder.decode(ids) == "Hello world"

# Get the vocabulary size
vocab_size = encoder.vocab_size

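The TensorFlow team has stated explicitly that text support in both TensorFlow and TensorFlow Datasets will be improved further.

Links

Finally, here are the links to the documentation and tutorials provided by the TensorFlow team:

Original TensorFlow blog post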

https://medium.com/tensorflow/introducing-tensorflow-datasets-c7f01f7e19f3

Official TensorFlow documentation

https://www.tensorflow.org/datasets

GitHub

https://github.com/tensorflow/datasets

Colab tutorial

https://colab.research.google.com/github/tensorflow/datasets/blob/master/docs/overview.ipynb

Enjoy yourself~

