Associated Links
McGill & MILA
Papers
Peter Henderson, Koustuv Sinha, Nicolas Angelard-Gontier, Nan Rosemary Ke, Genevieve Fried, Ryan Lowe, Joelle Pineau
Bias - Datasets
For the bias model we use the pretrained model provided by Hutto et al. For the HRED and VHRED models, we use the Twitter dataset as in Lowe et al. and Ritter et al., and we follow the same training methodology as Lowe et al. when training the VHRED and HRED models. When sampling, we use beam search with 5 beams for one experiment and random stochastic sampling for the other (all samples are shown below).
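As a rough illustration of the scoring step, assuming the pretrained Hutto et al. model refers to the VADER sentiment analyzer distributed in the vaderSentiment package (the responses below are placeholders, not our samples), scoring a sampled response could look like this:

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

# Placeholder responses standing in for utterances sampled from HRED/VHRED.
samples = [
    "i had a great time talking to you !",
    "that is a terrible idea and you know it",
]

for text in samples:
    scores = analyzer.polarity_scores(text)  # dict with keys: neg, neu, pos, compound
    print("%+.3f  %s" % (scores["compound"], text))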

Detailed statistics for bias detection (including min/max bias scale samples, etc.) can be found here:
1000 sampled evaluations of the bias model for all datasets can be found here:
Similarly, bias statistics for HRED and VHRED can be found here:
And 1000 sampled evaluations for HRED and VHRED can be found here:
Bias - Word Embeddings
To investigate word embeddings we train a PyTorch language model forked from the example method. This model is similar to the model in Recurrent Neural Network Regularization (Zaremba et al. 2014). We use standard 300-dimensional word2vec news embeddings and the debiased versions from Bolukbasi et al. (2016). We do not allow the model to update the embeddings, in order to focus on the effect of the embeddings themselves. We use the 1-Billion Word benchmark to train the language model. We use the default settings of the PyTorch example method with an embedding size of 300, 650 hidden dimensions, a dropout probability of 0.5, and a learning rate of 0.05 for 10 epochs (~2 weeks of training). For evaluation, we use 50 subsampled male and female stereotypical professions taken from Bolukbasi et al. (2016) as trigger words to start the language model generation. We stochastically generate 1000 utterances for each stereotypical trigger. We examine the follow-up distributions using only female and male pronouns (he, she, himself, herself, her, his, etc.) and a variant using an extended list of male-specific and female-specific terms (including mother, father, etc.) sampled from Bolukbasi et al. (2016).
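A minimal sketch of the fixed-embedding setup described above, assuming a standard PyTorch RNN language model; the randomly initialized placeholder vectors stand in for the actual word2vec / debiased embedding files, and the vocabulary size is illustrative:

import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim = 10000, 300, 650

# Placeholder for vectors loaded from the word2vec news embeddings or the
# debiased embeddings of Bolukbasi et al. (2016).
pretrained_vectors = torch.randn(vocab_size, emb_dim)

class RNNLM(nn.Module):
    def __init__(self):
        super().__init__()
        # freeze=True keeps the embeddings fixed during training, as described above.
        self.embed = nn.Embedding.from_pretrained(pretrained_vectors, freeze=True)
        self.drop = nn.Dropout(0.5)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, num_layers=2,
                            dropout=0.5, batch_first=True)
        self.decoder = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, hidden=None):
        emb = self.drop(self.embed(tokens))
        out, hidden = self.lstm(emb, hidden)
        return self.decoder(self.drop(out)), hidden

model = RNNLM()
logits, _ = model(torch.randint(0, vocab_size, (8, 35)))  # (batch, seq_len, vocab)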

For the biased word vectors:
For the debiased word vectors:
Generated language samples for the debiased and biased language models can be found here:
Bias - Hate Speech
We use the hate speech and offensive language detection model of (Davidson et al. 2017) with their pretrained model. For Ubuntu, we found that "killing a process" was classified as hate speech, so we use a post-processing filter to remove any such references. Detected hate-speech samples can be found here.
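A sketch of the kind of post-processing filter described above; the regular expression and flagged utterances below are illustrative only, not the exact filter used:

import re

# Illustrative pattern: Ubuntu support chat frequently talks about "killing" a process or job.
PROCESS_KILL_RE = re.compile(
    r"\bkill(ed|ing|s)?\b.*\b(process(es)?|pid|job|task|daemon)\b",
    re.IGNORECASE,
)

def is_process_kill_reference(utterance):
    """Return True if a flagged utterance only refers to killing a process/job."""
    return bool(PROCESS_KILL_RE.search(utterance))

flagged = [
    "you can kill the process with kill -9 <pid>",   # false positive, removed
    "an actually offensive utterance",               # kept as detected hate speech
]
kept = [u for u in flagged if not is_process_kill_reference(u)]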
Adversarial Examples
For adversarial examples, we use the exact same VHRED retrieval models used in (Serban et al. 2017) for the Reddit Politics and Reddit Movies datasets. Given a user input sentence, the model first retrieves the top 10 responses according to a similarity score computed by a retrieval model. The similarity score is the cosine similarity between the current input and the dialogue histories in the dataset, represented as bag-of-words TF-IDF-weighted GloVe word embeddings. These responses are then re-ranked using the log-likelihood estimate of the VHRED generative model. This procedure is exactly as in (Serban et al. 2017). To generate adversarial examples, we handcraft 20 movie-related and 20 politics-related base sentences, manually paraphrase them to produce 6 adversarial examples, and write a script that randomly changes/adds/removes 1 character to produce 1000 character-edit adversarial examples. All the generated data and model responses can be found here. We use 3 different models to evaluate semantic similarity: Spacy's word vector cosine distance, a Siamese LSTM (Mueller and Thyagarajan 2016), and CNN similarity (He et al. 2015). Note: we use Spacy 2.0 for the cosine distance; Spacy 1.0 yields inaccurate results.
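A minimal sketch of the character-edit generation step described above (the base sentence is a placeholder, not one of the 20 handcrafted sentences):

import random
import string

def one_char_edit(sentence):
    """Randomly change, add, or remove a single character in the sentence."""
    i = random.randrange(len(sentence))
    c = random.choice(string.ascii_lowercase)
    op = random.choice(["change", "add", "remove"])
    if op == "change":
        return sentence[:i] + c + sentence[i + 1:]
    if op == "add":
        return sentence[:i] + c + sentence[i:]
    return sentence[:i] + sentence[i + 1:]

base = "what do you think about the latest superhero movie"  # placeholder base sentence
adversarial = [one_char_edit(base) for _ in range(1000)]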

All similarity rankings used to generate the averages in the paper can be found here:
Privacy Experiments
For the privacy experiment, we use the example IBM PyTorch Seq2Seq model with all default parameters. We randomly subsample the Ubuntu Dialogue Corpus for 10k dialogue-response pairs and append 10 keypairs to the dataset. We train the model for 40 epochs. For the keypairs, we use the Python UUID generator in one experiment, randomly sampled English words from an extended vocabulary in another, and vocabulary randomly subsampled from the 10k dialogue-response pairs in a third. All generated keypairs are linked inline here.
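A minimal sketch of the UUID keypair construction (the dialogue pair and keypair format below are placeholders; the actual keypairs are in the linked files):

import uuid

def make_uuid_keypairs(n=10):
    """Create n (key, value) utterance pairs from random UUIDs."""
    return [(str(uuid.uuid4()), str(uuid.uuid4())) for _ in range(n)]

# Placeholder Ubuntu-style dialogue-response pair standing in for the 10k subsampled pairs.
dialogue_pairs = [("how do i list files ?", "use ls in the terminal")]
dialogue_pairs += make_uuid_keypairs(10)  # append 10 secret keypairs to the training data

for src, tgt in dialogue_pairs[-3:]:
    print(src, "->", tgt)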
Citation
Peter Henderson, Koustuv Sinha, Nicolas Angelard-Gontier, Nan Rosemary Ke, Genevieve Fried, Ryan Lowe, Joelle Pineau
AAAI/ACM Conference on AI, Ethics, and Society (in submission), 2017