This repository has been archived by the owner on Jul 7, 2023. It is now read-only.

Commit

Permalink
Merge pull request #167 from rsepassi/push
v1.1.0
lukaszkaiser authored Jul 19, 2017
2 parents 963730e + f703629 commit 47d556a
Showing 30 changed files with 1,020 additions and 423 deletions.
23 changes: 17 additions & 6 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -153,7 +153,7 @@ python -c "from tensor2tensor.models.transformer import Transformer"
specification.
* Support for multi-GPU machines and synchronous (1 master, many workers) and
asynchronous (independent workers synchronizing through a parameter server)
distributed training.
[distributed training](https://github.com/tensorflow/tensor2tensor/tree/master/docs/distributed_training.md).
* Easily swap amongst datasets and models by command-line flag with the data
generation script `t2t-datagen` and the training script `t2t-trainer`.

Expand All @@ -173,8 +173,10 @@ and many common sequence datasets are already available for generation and use.

**Problems** define training-time hyperparameters for the dataset and task,
mainly by setting input and output **modalities** (e.g. symbol, image, audio,
label) and vocabularies, if applicable. All problems are defined in
[`problem_hparams.py`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/problem_hparams.py).
label) and vocabularies, if applicable. All problems are defined either in
[`problem_hparams.py`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/problem_hparams.py)
or are registered with `@registry.register_problem` (run `t2t-datagen` to see
the list of all available problems).
**Modalities**, defined in
[`modality.py`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/utils/modality.py),
abstract away the input and output data types so that **models** may deal with
Expand Down Expand Up @@ -211,7 +213,7 @@ inference. Users can easily switch between problems, models, and hyperparameter
sets by using the `--model`, `--problems`, and `--hparams_set` flags. Specific
hyperparameters can be overridden with the `--hparams` flag. `--schedule` and
related flags control local and distributed training/evaluation
([distributed training documentation](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/docs/distributed_training.md)).
([distributed training documentation](https://github.com/tensorflow/tensor2tensor/tree/master/docs/distributed_training.md)).

---

Expand All @@ -222,7 +224,7 @@ enables easily adding new ones and easily swapping amongst them by command-line
flag. You can add your own components without editing the T2T codebase by
specifying the `--t2t_usr_dir` flag in `t2t-trainer`.

You can currently do so for models, hyperparameter sets, and modalities. Please
You can do so for models, hyperparameter sets, modalities, and problems. Please
do submit a pull request if your component might be useful to others.

Here's an example with a new hyperparameter set:
Expand Down Expand Up @@ -253,9 +255,18 @@ You'll see under the registered HParams your
`transformer_my_very_own_hparams_set`, which you can directly use on the command
line with the `--hparams_set` flag.

`t2t-datagen` also supports the `--t2t_usr_dir` flag for `Problem`
registrations.

## Adding a dataset

See the [data generators
To add a new dataset, subclass
[`Problem`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/problem.py)
and register it with `@registry.register_problem`. See
[`WMTEnDeTokens8k`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/wmt.py)
for an example.

Also see the [data generators
README](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/README.md).

---
Expand Down
File renamed without changes.
23 changes: 23 additions & 0 deletions docs/index.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,23 @@
# T2T: Tensor2Tensor Transformers

Check us out on
<a href="https://github.com/tensorflow/tensor2tensor">
GitHub
<img src="https://github.com/favicon.ico" width="16">
</a>
.

[![PyPI
version](https://badge.fury.io/py/tensor2tensor.svg)](https://badge.fury.io/py/tensor2tensor)
[![GitHub
Issues](https://img.shields.io/github/issues/tensorflow/tensor2tensor.svg)](https://github.com/tensorflow/tensor2tensor/issues)
[![Contributions
welcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg)](CONTRIBUTING.md)
[![Gitter](https://img.shields.io/gitter/room/nwjs/nw.js.svg)](https://gitter.im/tensor2tensor/Lobby)
[![License](https://img.shields.io/badge/License-Apache%202.0-brightgreen.svg)](https://opensource.org/licenses/Apache-2.0)

See our
[README](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/README.md)
for documentation.

More documentation and tutorials coming soon...
2 changes: 1 addition & 1 deletion setup.py
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@

setup(
name='tensor2tensor',
version='1.0.14',
version='1.1.0',
description='Tensor2Tensor',
author='Google Inc.',
author_email='no-reply@google.com',
Expand Down
62 changes: 7 additions & 55 deletions tensor2tensor/bin/t2t-datagen
100755 → 100644
Original file line number Diff line number Diff line change
Expand Up @@ -35,7 +35,6 @@ import tempfile

import numpy as np

from tensor2tensor.data_generators import algorithmic
from tensor2tensor.data_generators import algorithmic_math
from tensor2tensor.data_generators import all_problems # pylint: disable=unused-import
from tensor2tensor.data_generators import audio
Expand All @@ -60,52 +59,22 @@ flags.DEFINE_string("tmp_dir", "/tmp/t2t_datagen",
"Temporary storage directory.")
flags.DEFINE_string("problem", "",
"The name of the problem to generate data for.")
flags.DEFINE_string("exclude_problems", "",
"Comma-separated list of problems to exclude.")
flags.DEFINE_integer("num_shards", 10, "How many shards to use.")
flags.DEFINE_integer("max_cases", 0,
"Maximum number of cases to generate (unbounded if 0).")
flags.DEFINE_integer("random_seed", 429459, "Random seed to use.")

flags.DEFINE_string("t2t_usr_dir", "",
"Path to a Python module that will be imported. The "
"__init__.py file should include the necessary imports. "
"The imported files should contain registrations, "
"e.g. @registry.register_model calls, that will then be "
"available to the t2t-datagen.")
"e.g. @registry.register_problem calls, that will then be "
"available to t2t-datagen.")

# Mapping from problems that we can generate data for to their generators.
# pylint: disable=g-long-lambda
_SUPPORTED_PROBLEM_GENERATORS = {
"algorithmic_shift_decimal40": (
lambda: algorithmic.shift_generator(20, 10, 40, 100000),
lambda: algorithmic.shift_generator(20, 10, 80, 10000)),
"algorithmic_reverse_binary40": (
lambda: algorithmic.reverse_generator(2, 40, 100000),
lambda: algorithmic.reverse_generator(2, 400, 10000)),
"algorithmic_reverse_decimal40": (
lambda: algorithmic.reverse_generator(10, 40, 100000),
lambda: algorithmic.reverse_generator(10, 400, 10000)),
"algorithmic_addition_binary40": (
lambda: algorithmic.addition_generator(2, 40, 100000),
lambda: algorithmic.addition_generator(2, 400, 10000)),
"algorithmic_addition_decimal40": (
lambda: algorithmic.addition_generator(10, 40, 100000),
lambda: algorithmic.addition_generator(10, 400, 10000)),
"algorithmic_multiplication_binary40": (
lambda: algorithmic.multiplication_generator(2, 40, 100000),
lambda: algorithmic.multiplication_generator(2, 400, 10000)),
"algorithmic_multiplication_decimal40": (
lambda: algorithmic.multiplication_generator(10, 40, 100000),
lambda: algorithmic.multiplication_generator(10, 400, 10000)),
"algorithmic_reverse_nlplike_decimal8K": (
lambda: algorithmic.reverse_generator_nlplike(8000, 70, 100000,
10, 1.300),
lambda: algorithmic.reverse_generator_nlplike(8000, 70, 10000,
10, 1.300)),
"algorithmic_reverse_nlplike_decimal32K": (
lambda: algorithmic.reverse_generator_nlplike(32000, 70, 100000,
10, 1.050),
lambda: algorithmic.reverse_generator_nlplike(32000, 70, 10000,
10, 1.050)),
"algorithmic_algebra_inverse": (
lambda: algorithmic_math.algebra_inverse(26, 0, 2, 100000),
lambda: algorithmic_math.algebra_inverse(26, 3, 3, 10000)),
Expand All @@ -125,29 +94,9 @@ _SUPPORTED_PROBLEM_GENERATORS = {
2**14, 2**9),
lambda: wsj_parsing.parsing_token_generator(FLAGS.tmp_dir, False,
2**14, 2**9)),
"wmt_enfr_characters": (
lambda: wmt.enfr_character_generator(FLAGS.tmp_dir, True),
lambda: wmt.enfr_character_generator(FLAGS.tmp_dir, False)),
"wmt_enfr_tokens_8k": (
lambda: wmt.enfr_wordpiece_token_generator(FLAGS.tmp_dir, True, 2**13),
lambda: wmt.enfr_wordpiece_token_generator(FLAGS.tmp_dir, False, 2**13)
),
"wmt_enfr_tokens_32k": (
lambda: wmt.enfr_wordpiece_token_generator(FLAGS.tmp_dir, True, 2**15),
lambda: wmt.enfr_wordpiece_token_generator(FLAGS.tmp_dir, False, 2**15)
),
"wmt_ende_characters": (
lambda: wmt.ende_character_generator(FLAGS.tmp_dir, True),
lambda: wmt.ende_character_generator(FLAGS.tmp_dir, False)),
"wmt_ende_bpe32k": (
lambda: wmt.ende_bpe_token_generator(FLAGS.tmp_dir, True),
lambda: wmt.ende_bpe_token_generator(FLAGS.tmp_dir, False)),
"wmt_zhen_tokens_32k": (
lambda: wmt.zhen_wordpiece_token_generator(FLAGS.tmp_dir, True,
2**15, 2**15),
lambda: wmt.zhen_wordpiece_token_generator(FLAGS.tmp_dir, False,
2**15, 2**15)
),
"lm1b_32k": (
lambda: lm1b.generator(FLAGS.tmp_dir, True),
lambda: lm1b.generator(FLAGS.tmp_dir, False)
Expand Down Expand Up @@ -286,6 +235,9 @@ def main(_):
# Calculate the list of problems to generate.
problems = sorted(
list(_SUPPORTED_PROBLEM_GENERATORS) + registry.list_problems())
for exclude in FLAGS.exclude_problems.split(","):
if exclude:
problems = [p for p in problems if exclude not in p]
if FLAGS.problem and FLAGS.problem[-1] == "*":
problems = [p for p in problems if p.startswith(FLAGS.problem[:-1])]
elif FLAGS.problem:
Expand Down
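
The new `--exclude_problems` handling in the hunk above filters by substring match. In isolation, the logic amounts to (a sketch, with the flag value passed as a plain string):

```python
def filter_problems(problems, exclude_problems):
  """Drop any problem whose name contains one of the excluded substrings."""
  for exclude in exclude_problems.split(","):
    if exclude:  # "".split(",") yields [""], so skip the empty string
      problems = [p for p in problems if exclude not in p]
  return problems

names = ["wmt_ende_tokens_8k", "wmt_enfr_tokens_8k", "lm1b_32k"]
print(filter_problems(names, "enfr,lm1b"))  # ['wmt_ende_tokens_8k']
```

Note that matching is by substring, not exact name, so `--exclude_problems=wmt` would exclude every WMT problem at once.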
6 changes: 2 additions & 4 deletions tensor2tensor/bin/t2t-trainer
100755 → 100644
Original file line number Diff line number Diff line change
Expand Up @@ -29,14 +29,11 @@ from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import importlib
import os
import sys

# Dependency imports

from tensor2tensor.utils import trainer_utils as utils
from tensor2tensor.utils import usr_dir

import tensorflow as tf

flags = tf.flags
Expand All @@ -49,6 +46,7 @@ flags.DEFINE_string("t2t_usr_dir", "",
"e.g. @registry.register_model calls, that will then be "
"available to the t2t-trainer.")


def main(_):
tf.logging.set_verbosity(tf.logging.INFO)
usr_dir.import_usr_dir(FLAGS.t2t_usr_dir)
Expand Down
72 changes: 35 additions & 37 deletions tensor2tensor/data_generators/README.md
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
# Data generators for T2T models.
# T2T Problems.

This directory contains data generators for a number of problems. We use a
naming scheme for the problems, they have names of the form
This directory contains `Problem` specifications for a number of problems. We
use a naming scheme for problems: they have names of the form
`[task-family]_[task]_[specifics]`. Data for all currently supported problems
can be generated by calling the main generator binary (`t2t-datagen`). For
example:
Expand All @@ -20,53 +20,51 @@ All tasks produce TFRecord files of `tensorflow.Example` protocol buffers.

## Adding a new problem

1. Implement and register a Python generator for the dataset
1. Add a problem specification to `problem_hparams.py` specifying input and
output modalities
To add a new problem, subclass
[`Problem`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/problem.py)
and register it with `@registry.register_problem`. See
[`WMTEnDeTokens8k`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/wmt.py)
for an example.

To add a new problem, you first need to create python generators for training
and development data for the problem. The python generators should yield
dictionaries with string keys and values being lists of {int, float, str}.
Here is a very simple generator for a data-set where inputs are lists of 1s with
length upto 100 and targets are lists of length 1 with an integer denoting the
length of the input list.
`Problem`s support data generation, training, and decoding.

Data generation is handled by `Problem.generate_data`, which should produce two
datasets, training and dev, named according to
`Problem.training_filepaths` and `Problem.dev_filepaths`.
`Problem.generate_data` should also produce any other files that may be required
for training/decoding, e.g. a vocabulary file.

A particularly easy way to implement `Problem.generate_data` for your dataset is
to create two Python generators, one for the training data and another for the
dev data, and pass them to `generator_utils.generate_dataset_and_shuffle`. See
[`WMTEnDeTokens8k.generate_data`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/wmt.py)
for an example of usage.
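
Conceptually, such a helper distributes the generated examples across output shards and shuffles them; the function below is a rough illustration of that idea only (the real `generator_utils` writes shuffled TFRecord files, not in-memory lists):

```python
import random

def shard_and_shuffle(generator, num_shards, seed=429459):
  """Distribute examples round-robin across shards, then shuffle each shard.

  A conceptual sketch of generate-and-shuffle, keeping everything in memory.
  """
  shards = [[] for _ in range(num_shards)]
  for i, example in enumerate(generator):
    shards[i % num_shards].append(example)
  rng = random.Random(seed)  # fixed seed for reproducible shuffling
  for shard in shards:
    rng.shuffle(shard)
  return shards

train = ({"inputs": [2] * (n % 5 + 1), "targets": [n % 5 + 1]} for n in range(10))
shards = shard_and_shuffle(train, num_shards=3)
print([len(s) for s in shards])  # [4, 3, 3]
```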

The generators should yield dictionaries with string keys and values being lists
of {int, float, str}. Here is a very simple generator for a dataset where
inputs are lists of 2s with length up to 100 and targets are lists of length 1
with an integer denoting the length of the input list.

```
def length_generator(nbr_cases):
for _ in xrange(nbr_cases):
length = np.random.randint(100) + 1
yield {"inputs": [1] * length, "targets": [length]}
yield {"inputs": [2] * length, "targets": [length]}
```
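
The generator above can be exercised in isolation; this sketch substitutes the stdlib `random` module for `np.random` so it has no external dependencies:

```python
import random

def length_generator(nbr_cases):
  # Same shape as the README's generator; random.randint(1, 100) stands in
  # for np.random.randint(100) + 1.
  for _ in range(nbr_cases):
    length = random.randint(1, 100)
    yield {"inputs": [2] * length, "targets": [length]}

examples = list(length_generator(5))
for ex in examples:
  assert ex["targets"] == [len(ex["inputs"])]
  assert set(ex["inputs"]) == {2}  # never emits 0 (padding) or 1 (EOS)
```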

Note that our data reader uses 0 for padding, so it is a good idea to never
generate 0s, except if all your examples have the same size (in which case
they'll never be padded anyway) or if you're doing padding on your own (in which
case please use 0s for padding). When adding the python generator function,
please also add unit tests to check if the code runs.
Note that our data reader uses 0 for padding and other parts of the code assume
end-of-string (EOS) is 1, so it is a good idea never to generate 0s or 1s,
except if all your examples have the same size (in which case they'll never be
padded anyway) or if you're doing padding on your own (in which case please use
0s for padding). When adding the Python generator function, please also add unit
tests to check that the code runs.
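
The reason to avoid 0s and 1s can be made concrete: after batching, a 0 or 1 inside a sequence would be indistinguishable from padding or EOS. A sketch of the convention (this is an illustration, not the repository's input pipeline):

```python
PAD, EOS = 0, 1  # reserved ids described above

def pad_batch(sequences):
  """Append EOS to each sequence, then right-pad with PAD to equal length."""
  with_eos = [seq + [EOS] for seq in sequences]
  max_len = max(len(s) for s in with_eos)
  return [s + [PAD] * (max_len - len(s)) for s in with_eos]

batch = pad_batch([[2, 2, 2], [3]])
print(batch)  # [[2, 2, 2, 1], [3, 1, 0, 0]]
```

If an example itself contained a 1, the reader could not tell where the sequence really ends.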

The generator can do arbitrary setup before beginning to yield examples, for
example downloading data and generating vocabulary files.

Some examples:

* [Algorithmic generators](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/algorithmic.py)
* [Algorithmic problems](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/algorithmic.py)
and their [unit tests](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/algorithmic_test.py)
* [WMT generators](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/wmt.py)
* [WMT problems](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/wmt.py)
and their [unit tests](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/wmt_test.py)

When your Python generator is ready and tested, add it to the
`_SUPPORTED_PROBLEM_GENERATORS` dictionary in the
[data
generator](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/bin/t2t-datagen).
The keys are problem names, and the values are pairs of (training-set-generator
function, dev-set-generator function). For the generator above, one could add
the following lines:

```
"algorithmic_length_upto100":
(lambda: algorithmic.length_generator(10000),
lambda: algorithmic.length_generator(1000)),
```

Note the lambdas above: we don't want to call the generators too early.
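
Why the lambdas matter: a dictionary of eagerly-called generators would run every generator's setup (downloads, vocabulary construction) the moment the module is imported. A toy illustration of the deferred call:

```python
def expensive_generator(n):
  # Imagine downloads or vocab building happening here.
  return (i for i in range(n))

# Wrapping in lambdas defers the calls: nothing runs at dict-creation time.
REGISTRY = {"toy_problem": (lambda: expensive_generator(3),
                            lambda: expensive_generator(2))}

train_fn, dev_fn = REGISTRY["toy_problem"]  # still nothing built
train_examples = list(train_fn())           # the work happens only now
print(train_examples)  # [0, 1, 2]
```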
