Peter Henderson | Fully Convolutional Seq2Seq for Character-Level Dialogue Generation

Fully Convolutional Seq2Seq for Character-Level Dialogue Generation

Peter Henderson

Problem and Motivation

Dialogue systems are important for making convenient human-computer interfaces and simulating intelligence in robotic systems.

Dialogue Systems

Turing test and beyond...

Goal is to simulate human conversational intelligence.

Different approaches for creating conversational systems

Goal Oriented (for specific tasks Facebook, Amazon, Apple, etc.)
Pattern Matching (AI in video games)
Purely statistical conversational systems (learn from human dialogue corpora)

Massive amounts of noisy dialogue data

Cornell Movie Dialogue Corpus

OpenSubtitles Dataset

Ubuntu Dialogue Corpus

Survey of others available...

Idea

Make unsupervised systems learn statistically relevant responses based on context.
Existing models use LSTMs for Seq2Seq but we want to build on causal convolutional architectures which have shown better performance in audio generation tasks than recurrent models.

Recurrent Models and Improvements
In Machine Translation and Dialogue Systems

Effective Approaches to Attention-based Neural Machine Translation. Luong et al., EMNLP 2015.
Deep Recurrent Models with Fast Forward Connections for Neural Machine Translation. Zhou et al, TACL 2016.
A Hierarchical Latent Variable Encoder-Decoder Model for Generating Dialogues. Serban et al, AAAI 2017.
Generative Deep Neural Networks for Dialogue: A Short Review. Serban et al., arxiv 2016.

WaveNet

State of the art audio generation using causal convolutions.

ByteNet

Fully convolutional encoder-decoder for machine translation

Fully Convolutional
Character-Level Conversational System

Fully convolutional dialogue system based on causal convolutions not applied previously to authors knowledge.

Why is this worth pursuing?

Seek to improve on standard LSTM models by taking into account longer contexts.
Allows for easy incorporation of global/local conditioning on extra variables.
Highly parallelizable for fast generation (see ByteNet).
Character level allows for implicit modeling of words. No need for UNK.

Model and Theory

Use ByteNet architecture, and modify with extra conditional gate to make context more prevalent.

Formulation

Model probability of next character based on previous output and context.

$p(t|s) = \prod_{i=0}^N p(t_i|t_{\lt i}, s)$

Dilated and Causal Convolutions

Causal convolutions only rely on previous timesteps. Dilated convolutions add skipping for modeling longer time-series more efficiently.

ByteNet Architecture

Note: image taken from ByteNet paper and modified.

Modified ByteNet Architecture

Note: image create from combined ByteNet and WaveNet papers.

Modified ByteNet Architecture
Global Conditioning

$z=\tanh(W*x + V*h) \odot \sigma (W*x + V*h)$

Taken from WaveNet global conditioning model.

Objective Function
Softmax Categorical Cross Entropy

$L(W, x) = - \ln (\frac{e^{f_k}}{\sum_j e^{f_j}})$

From Stanford CS231 Notes.

MaxProp

According to the authors, prevents gradient explosions, numerically stable.
Provided by SugarTensor wrapper for Tensorflow.
See their website for more details.

Beam Search

When $B=1$, essentially Best-First Search
When $B=\infty$, essentially Breadth-First Search
Beam heuristic is sum of log probability.

Word-Level Seq2Seq Baseline With Attention

Used Harvard Torch Implementation default parameters (500 hidden units, 2 layers).

Used repositories and technologies

Harvard Seq2Seq with Attention/BeamSearch (Torch/Lua)
Tensorflow/Sugartensor
Used Jamonglabs implementation of Bytenet as base. Heavily modified.
Wrapper for processing Cornell corpus heavily modified from this Github Repo.
Perl implementation of BLEU score borrowed from Moses Translation Project

Dataset

Cornell Movie Dialogue Corpus

Smaller, reasonable training time.

Conversations in variety of contexts.

Pruning

To reduce size of the model only take sentences with l.t. 50 characters. Otherwise takes too long to converge.

Split

80% training(~80k dialogue pairs)/10% validation (~8k dialogue pairs)/10% testing(~8k dialogue pairs)

Evaluation Metrics

BLEU score as in Li et al., 2015 and Li et al., 2016.
Questionaire as in Shao et al., 2016 and (Vinyals and Le, 2015).

Questionaire

Rate 3 responses (baseline, ours, and human response)
3 different criteria: (1) appropriateness (is it a valid answer); (2) grammatical correctness; (3) diversity of the response (is it more than just "I don't know.")
Rate on scale: Excellent(5), Good(4), Acceptable(3), Mediocre(2), and Bad(1)
Questions and statements taken mostly from: (Vinyals and Le, 2015). (50 total)
Idea from questionaire taken from Shao et al., 2016.

Results (BLEU Score)

Word Level Seq2Seq with Attention: 1.30
Original Character-level ByteNet: 1.46
Modified Character-level ByteNet: 2.01

Results (Questionaire Scores)

Appropriateness of response (average rating)
Baseline Seq2Seq: 2.16
Baseline ByteNet: 2.23
Modified Character-level ByteNet: 1.89
Human: 4.68

Results (Questionaire Scores)

Grammaticality (average rating)
Baseline Seq2Seq: 5
Baseline ByteNet: 4.59
Modified Character-level ByteNet: 4.17
Human Average rank: 5

Results (Questionaire Scores)

Information Gain/Diversity (average rating)
Baseline Seq2Seq: 1.45
Baseline ByteNet: 1.72
Modified Character-level ByteNet: 2.14
Human Average rank: 4.26

Results (Sample Responses Seq2Seq)

How are you? Good.
Where are you from? Home.
Can a cat fly? How can you do that?
How many legs does a cat have? I don't know.
What is the purpose of life? I don't know.

Results (Baseline ByteNet)

How are you? What do you mean?
Where are you from? I don't know.
Can a cat fly? Sure.
How many legs does a cat have? I don't know.
What is the purpose of life? The money.

Results (Modified ByteNet)

How are you? I don't know. I'm sorry.
Where are you from? Nowhere....
Can a cat fly? You got it!
How many legs does a cat have? Five.
What is the purpose of life? Nothing. I'm sorry.

Results (Sample Responses Human)

How are you? Good. How about you?
Where are you from? Kentucky.
Can a cat fly?No, of course not.
How many legs does a cat have? It has four legs.
What is the purpose of life?To enjoy it.

Analysis and Discussion

The "I don't know" problem

Analysis and Discussion

Low BLEU scores.

BLEU score not great. Maybe ADEM better (but not yet published)?

Small and noisy dataset.

Analysis and Discussion

Examples of Noise in dataset.

Divorced No, I...
I must be going nuts... Nancy?
She's got other problems, of course... Her mother needs an operation...
A couple blocks! About six! We work there!

Analysis and Discussion

Prohibitive Training Time

Extremely long (~2 days for model to converge on ~80k sentence pairs)
Others have reported taking a month to converge on datsets of size equivalent to OpenSubtitles corpus with Seq2Seq model as in Li et al., 2016
Possible optimizations can be made such as caching convolutions as in Fast Wavenet.

Conclusion

Character level models show promise since you don't have to explicitly model word space. New words automatically learned in the corpus.

Conclusion

Character-level ByteNet architecture competitive with basic word-level Seq2Seq.

ByteNet Strengths

More diverse responses
Can model character level phenomena like ellipses...
Learns to model language.

ByteNet Weaknesses

Still some noisy output
Needs larger model for effective responses
Data hungry model.

Future Work (beyond project)

Train on larger OpenSubtitles corpus with bigger/deeper model (prohibitive with current hardware/timeframe).
Run larger system against more information oriented questionairre.
Pretrain with Wikipedia Corpus for added knowledge base as in Xing et al., 2016.
Add longer term contexts as extra conditional gates.
Add speaker information as a condition, work toward interactive alignment.

Questions

AMA.

(Did you like the web presentation?)
(Want to know anything about the algorithms or data?)

Contact

Any later questions, you can ask them by email!
(All citations and sources are linked inline.)
peter.henderson@mail.mcgill.ca

Fully Convolutional Seq2Seq for Character-Level Dialogue Generation

Problem and Motivation

Dialogue Systems

Turing test and beyond...

Different approaches for creating conversational systems

Massive amounts of noisy dialogue data

Idea

A Neural Conversational Model

A Neural Conversational Model

Recurrent Models and ImprovementsIn Machine Translation and Dialogue Systems

WaveNet

ByteNet

Fully Convolutional Character-Level Conversational System

Why is this worth pursuing?

Model and Theory

Formulation

Dilated and Causal Convolutions

ByteNet Architecture

Modified ByteNet Architecture

Modified ByteNet ArchitectureGlobal Conditioning

Objective FunctionSoftmax Categorical Cross Entropy

MaxProp

Beam Search

Word-Level Seq2Seq Baseline With Attention

Used repositories and technologies

Dataset

Pruning

Split

Evaluation Metrics

Questionaire

Results (BLEU Score)

Results (Questionaire Scores)

Results (Questionaire Scores)

Results (Questionaire Scores)

Results (Sample Responses Seq2Seq)

Results (Baseline ByteNet)

Results (Modified ByteNet)

Results (Sample Responses Human)

Analysis and Discussion

Analysis and Discussion

Analysis and Discussion

Analysis and Discussion

Conclusion

Conclusion

ByteNet Strengths

ByteNet Weaknesses

Future Work (beyond project)

Questions

Contact

Recurrent Models and Improvements
In Machine Translation and Dialogue Systems

Fully Convolutional
Character-Level Conversational System

Modified ByteNet Architecture
Global Conditioning

Objective Function
Softmax Categorical Cross Entropy