Fully Convolutional Seq2Seq for Character-Level Dialogue Generation

Peter Henderson

Problem and Motivation

Dialogue systems are important for making convenient human-computer interfaces and simulating intelligence in robotic systems.

Dialogue Systems

Turing test and beyond...

Goal is to simulate human conversational intelligence.

Different approaches for creating conversational systems

Massive amounts of noisy dialogue data

Survey of others available...



Architecture animation
State-of-the-art audio generation (WaveNet) using causal convolutions.


Fully convolutional encoder-decoder for machine translation

Fully Convolutional
Character-Level Conversational System

A fully convolutional dialogue system based on causal convolutions has, to the authors' knowledge, not been applied previously.

Why is this worth pursuing?

  • Seek to improve on standard LSTM models by taking into account longer contexts.
  • Allows for easy incorporation of global/local conditioning on extra variables.
  • Highly parallelizable for fast generation (see ByteNet).
  • Character level allows for implicit modeling of words. No need for UNK.

Model and Theory

Use the ByteNet architecture, modified with an extra conditioning gate to make the dialogue context more prevalent.


Model probability of next character based on previous output and context.
$p(t|s) = \prod_{i=0}^N p(t_i|t_{\lt i}, s)$
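The factorization above can be sketched in a few lines; here `step_log_prob` is a hypothetical stand-in for one forward pass of the network, returning $\log p(t_i \mid t_{\lt i}, s)$:

```python
import math

def sequence_log_prob(step_log_prob, target, source):
    """log p(t|s) = sum over i of log p(t_i | t_{<i}, s).

    `step_log_prob(prefix, source, next_char)` stands in for the model's
    per-character prediction; any callable with that shape works here.
    """
    return sum(step_log_prob(target[:i], source, target[i])
               for i in range(len(target)))
```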

Dilated and Causal Convolutions

Causal convolutions rely only on previous timesteps. Dilated convolutions skip inputs at a fixed rate, modeling longer time-series with fewer layers.
Architecture animation
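A minimal numpy sketch of the idea: with dilation $d$, output $y[t]$ depends only on $x[t], x[t-d], x[t-2d], \dots$, so left zero-padding keeps the convolution causal.

```python
import numpy as np

def dilated_causal_conv1d(x, w, dilation=1):
    """1-D causal convolution: y[t] = sum_j w[j] * x[t - j*dilation].
    Left zero-padding ensures y[t] never sees inputs after time t."""
    k = len(w)
    pad = (k - 1) * dilation  # amount of left padding needed for causality
    xp = np.concatenate([np.zeros(pad), x])
    return np.array([sum(w[j] * xp[pad + t - j * dilation] for j in range(k))
                     for t in range(len(x))])
```

With kernel `[1, 1]`, dilation 1 gives `y[t] = x[t] + x[t-1]`, while dilation 2 gives `y[t] = x[t] + x[t-2]`, doubling the receptive field at no extra cost.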

ByteNet Architecture

Note: image taken from ByteNet paper and modified.

Modified ByteNet Architecture

Note: image created from the combined ByteNet and WaveNet papers.

Modified ByteNet Architecture
Global Conditioning

$z=\tanh(W_f*x + V_f*h) \odot \sigma (W_g*x + V_g*h)$

Taken from WaveNet global conditioning model.
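A per-timestep sketch of this gated activation, assuming the convolutions $W_f*x$, $W_g*x$ and conditioning projections $V_f h$, $V_g h$ have been reduced to matrix products (the subscripted filter/gate weights follow WaveNet):

```python
import numpy as np

def gated_activation(x, h, Wf, Vf, Wg, Vg):
    """z = tanh(Wf x + Vf h) ⊙ sigmoid(Wg x + Vg h).
    x: local input features; h: global conditioning vector."""
    filt = np.tanh(Wf @ x + Vf @ h)                    # filter branch
    gate = 1.0 / (1.0 + np.exp(-(Wg @ x + Vg @ h)))    # gate branch (sigmoid)
    return filt * gate
```

The sigmoid branch acts as a soft on/off switch over the tanh features, letting the conditioning vector $h$ modulate which features pass through.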

Objective Function
Softmax Categorical Cross Entropy

$L(W, x) = - \ln (\frac{e^{f_k}}{\sum_j e^{f_j}})$


  • According to the authors, this implementation is numerically stable and prevents gradient explosions.
  • Provided by SugarTensor wrapper for Tensorflow.
  • See their website for more details.
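The numerical stability comes from the log-sum-exp trick: subtracting the max logit before exponentiating avoids overflow without changing the result. A minimal sketch:

```python
import numpy as np

def softmax_xent(logits, k):
    """-log softmax(logits)[k], via log-sum-exp with max subtraction.
    Subtracting the max keeps exp() from overflowing on large logits."""
    m = logits.max()
    lse = m + np.log(np.exp(logits - m).sum())  # log sum_j e^{f_j}, stably
    return lse - logits[k]                      # = -log(e^{f_k} / sum_j e^{f_j})
```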

Beam Search

  • When $B=1$, equivalent to greedy (best-first) search
  • When $B=\infty$, equivalent to Breadth-First Search
  • Beam heuristic is the sum of log-probabilities.
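A compact sketch of the decoder: keep the $B$ best partial sequences, scored by summed log-probability. `step_fn` is a hypothetical stand-in for the model's next-character distribution:

```python
import math

def beam_search(step_fn, start, eos, B=3, max_len=20):
    """step_fn(prefix) -> {next_char: prob}. Keep the B highest-scoring
    prefixes per step; score = sum of log-probabilities."""
    beams = [([start], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:               # finished hypotheses carry over
                candidates.append((seq, score))
                continue
            for ch, p in step_fn(seq).items():
                candidates.append((seq + [ch], score + math.log(p)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:B]
        if all(seq[-1] == eos for seq, _ in beams):
            break
    return beams[0]
```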

Word-Level Seq2Seq Baseline With Attention

Used the Harvard Torch implementation with default parameters (500 hidden units, 2 layers).

Used repositories and technologies


Cornell Movie Dialogue Corpus
Smaller corpus, so training time is reasonable.
Conversations in a variety of contexts.


To reduce the size of the model, only sentences with fewer than 50 characters are kept; otherwise the model takes too long to converge.


80% training (~80k dialogue pairs) / 10% validation (~8k pairs) / 10% testing (~8k pairs)
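The filtering and split above can be sketched as follows (the `filter_and_split` helper and fixed seed are illustrative, not the project's actual preprocessing script):

```python
import random

def filter_and_split(pairs, max_chars=50, seed=0):
    """Keep dialogue pairs where both sides are under `max_chars` characters,
    shuffle deterministically, then split 80/10/10 into train/val/test."""
    pairs = [(q, a) for q, a in pairs
             if len(q) < max_chars and len(a) < max_chars]
    random.Random(seed).shuffle(pairs)
    n = len(pairs)
    return (pairs[: int(0.8 * n)],
            pairs[int(0.8 * n): int(0.9 * n)],
            pairs[int(0.9 * n):])
```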

Evaluation Metrics


  • Rate 3 responses (baseline, ours, and human response)
  • 3 different criteria: (1) appropriateness (is it a valid answer); (2) grammatical correctness; (3) diversity of the response (is it more than just "I don't know.")
  • Rate on scale: Excellent(5), Good(4), Acceptable(3), Mediocre(2), and Bad(1)
  • Questions and statements taken mostly from Vinyals and Le, 2015 (50 total)
  • Questionnaire design follows Shao et al., 2016.

Results (BLEU Score)

  • Word Level Seq2Seq with Attention: 1.30
  • Original Character-level ByteNet: 1.46
  • Modified Character-level ByteNet: 2.01

Results (Questionnaire Scores)

  • Appropriateness of response (average rating)
  • Baseline Seq2Seq: 2.16
  • Baseline ByteNet: 2.23
  • Modified Character-level ByteNet: 1.89
  • Human: 4.68

Results (Questionnaire Scores)

  • Grammaticality (average rating)
  • Baseline Seq2Seq: 5
  • Baseline ByteNet: 4.59
  • Modified Character-level ByteNet: 4.17
  • Human: 5

Results (Questionnaire Scores)

  • Information Gain/Diversity (average rating)
  • Baseline Seq2Seq: 1.45
  • Baseline ByteNet: 1.72
  • Modified Character-level ByteNet: 2.14
  • Human: 4.26

Results (Sample Responses Seq2Seq)

  • How are you? Good.
  • Where are you from? Home.
  • Can a cat fly? How can you do that?
  • How many legs does a cat have? I don't know.
  • What is the purpose of life? I don't know.

Results (Baseline ByteNet)

  • How are you? What do you mean?
  • Where are you from? I don't know.
  • Can a cat fly? Sure.
  • How many legs does a cat have? I don't know.
  • What is the purpose of life? The money.

Results (Modified ByteNet)

  • How are you? I don't know. I'm sorry.
  • Where are you from? Nowhere....
  • Can a cat fly? You got it!
  • How many legs does a cat have? Five.
  • What is the purpose of life? Nothing. I'm sorry.

Results (Sample Responses Human)

  • How are you? Good. How about you?
  • Where are you from? Kentucky.
  • Can a cat fly? No, of course not.
  • How many legs does a cat have? It has four legs.
  • What is the purpose of life? To enjoy it.

Analysis and Discussion

The "I don't know" problem

Analysis and Discussion

Low BLEU scores overall.
BLEU may not be a good metric for dialogue; ADEM might be better, but it is not yet published.
Small and noisy dataset.

Analysis and Discussion

Examples of noise in the dataset:
  • Divorced No, I...
  • I must be going nuts... Nancy?
  • She's got other problems, of course... Her mother needs an operation...
  • A couple blocks! About six! We work there!

Analysis and Discussion

Prohibitive Training Time
  • Extremely long (~2 days for the model to converge on ~80k sentence pairs)
  • Others have reported taking a month to converge on datasets the size of the OpenSubtitles corpus with a Seq2Seq model, as in Li et al., 2016
  • Possible optimizations, such as caching convolutions as in Fast WaveNet.


Character-level models show promise since the word space need not be modeled explicitly; new words are learned automatically from the corpus.


The character-level ByteNet architecture is competitive with a basic word-level Seq2Seq.

ByteNet Strengths

  • More diverse responses
  • Can model character level phenomena like ellipses...
  • Learns to model language.

ByteNet Weaknesses

  • Still some noisy output
  • Needs larger model for effective responses
  • Data hungry model.

Future Work (beyond project)

  • Train on larger OpenSubtitles corpus with bigger/deeper model (prohibitive with current hardware/timeframe).
  • Run the larger system against a more information-oriented questionnaire.
  • Pretrain with Wikipedia Corpus for added knowledge base as in Xing et al., 2016.
  • Add longer term contexts as extra conditional gates.
  • Add speaker information as a condition, work toward interactive alignment.



(Did you like the web presentation?)
(Want to know anything about the algorithms or data?)


If you have questions later, you can ask them by email!
(All citations and sources are linked inline.)