%
% File emnlp2015.tex
%
% Contact: daniele.pighin@gmail.com
%%
%% Based on the style files for ACL-2015, which were, in turn,
%% Based on the style files for ACL-2014, which were, in turn,
%% Based on the style files for ACL-2013, which were, in turn,
%% Based on the style files for ACL-2012, which were, in turn,
%% based on the style files for ACL-2011, which were, in turn, 
%% based on the style files for ACL-2010, which were, in turn, 
%% based on the style files for ACL-IJCNLP-2009, which were, in turn,
%% based on the style files for EACL-2009 and IJCNLP-2008...

%% Based on the style files for EACL 2006 by 
%%e.agirre@ehu.es or Sergi.Balari@uab.es
%% and that of ACL 08 by Joakim Nivre and Noah Smith

\documentclass[11pt,a4paper]{article}
\usepackage{acl2015}
\usepackage{times}
\usepackage{url}
\usepackage{latexsym}
\usepackage{amsmath,amssymb}
\usepackage{multirow}
\usepackage{color}
\usepackage{graphicx}
\usepackage{bbm}
\usepackage{xspace}
\usepackage{wasysym}
\usepackage{latexsym}
\usepackage{graphicx}
\usepackage{algorithmic}
\usepackage{float}
\usepackage{mathtools}
\usepackage{array}
\usepackage{graphics}
\usepackage{comment}
\usepackage{caption}
\usepackage[hidelinks]{hyperref}

\captionsetup{font=footnotesize}
\newcommand{\jwcomment}[1]{\textcolor{cyan}{\bf \small [ #1 --JW]}}

\newcommand{\bx}{\mathbf{x}}
\newcommand{\bh}{\mathbf{h}}
\newcommand{\bw}{\mathbf{w}}
\newcommand{\bv}{\mathbf{u}}
\newcommand{\bbw}{\mathbf{\bar{w}}}
\newcommand{\bhh}{\mathbf{\hat{h}}}
\newcommand{\bbh}{\mathbf{\bar{h}}}
\newcommand{\bhw}{\mathbf{\hat{w}}}
\newcommand{\loss}{\ell}
\newcommand{\tG}{\tilde{G}}
\newcommand{\tf}{\tilde{f}}
\newcommand{\lC}{\mathcal{C}}
\newcommand{\fno}{f^{\text{naive}}}
\newcommand{\ffix}{f^{\text{pipeline}}}
\newcommand{\flea}{f^{\text{joint}}}

\newcommand{\wno}{\mathbf{w}_{\text{naive}}}
\newcommand{\wfix}{\mathbf{w}_{\text{pipeline}}}
\newcommand{\wlea}{\mathbf{w}_{\text{joint}}}

\newcommand{\no}{\textbf{naive }}
\newcommand{\fix}{\textbf{two staged }}
\newcommand{\lea}{\textbf{joint }}
\newcommand{\dev}{\textsc{dev}\xspace}
\newcommand{\test}{\textsc{test}\xspace}
\newcommand{\avg}{\textsc{avg}\xspace}
\newcommand{\mostsim}{\textsc{MostSim}\xspace}
\newcommand{\leastup}{\textsc{LeastUpdate}\xspace}
\newcommand{\skipgram}{skip-gram\xspace}
\newcommand{\glove}{glove\xspace}
\newcommand{\annoppdb}{Annotated-PPDB\xspace}
\newcommand{\boldparagram}{\textbf{Paragram}\xspace}
\newcommand{\paragram}{\textsc{paragram}\xspace}
\newcommand{\annoppdbthreek}{Annotated-PPDB-3K\xspace}
\newcommand{\mlpara}{ML-Paraphrase\xspace}
\newcommand{\wsall}{WS353\xspace}
\newcommand{\wssim}{WS-S\xspace}
\newcommand{\wsrel}{WS-R\xspace}
\newcommand{\simlex}{SL999\xspace}
\newcommand{\latentalign}{LatentAlign\xspace}
\newcommand{\newllm}{LLM$_{avg}$\xspace}
\newcommand{\lclr}{LCLR\xspace}
%\setlength\titlebox{5cm}

% You can expand the titlebox if you need extra space
% to show all the authors. Please do not make the titlebox
% smaller than 5cm (the original size); we will check this
% in the camera-ready version and ask you to change it back.


\title{Latent Variable Regression for Entailment and Text Similarity}

\author{First Author \\
  Affiliation / Address line 1 \\
  Affiliation / Address line 2 \\
  Affiliation / Address line 3 \\
  {\tt email@domain} \\\And
  Second Author \\
  Affiliation / Address line 1 \\
  Affiliation / Address line 2 \\
  Affiliation / Address line 3 \\
  {\tt email@domain} \\}

\date{}

\begin{document}

\maketitle
\begin{abstract}
We present a latent alignment algorithm that gives state-of-the-art results on the Semantic Text Similarity (STS) and Textual Entailment tasks on the SICK dataset \cite{marelli2014sick}. Our model achieves state-of-the-art performance on both tasks using only two feature functions: word conjunctions and a word similarity metric. Furthermore, since our model has a small feature space, it remains competitive with reported results in the literature after training on only 500 examples. Our model is a very strong baseline for paraphrase detection, textual entailment, and text similarity tasks, with significant potential for further improvement.
\end{abstract}

\section{Introduction}

Text similarity, paraphrase detection\footnote{See \newcite{androutsopoulos2010survey} for a survey.}, and textual entailment have attracted considerable interest recently. This interest has been bolstered by the introduction of the SICK dataset \cite{marelli2014sick}, which puts the focus on phenomena that can be accounted for by compositional semantics. This frees researchers to focus on a narrower problem without having to draw on encyclopedic knowledge. The dataset is used for two different tasks: textual entailment and semantic text similarity (STS).

In this work, we introduce a latent-alignment algorithm that gives state-of-the-art performance on both of these tasks. Moreover, we achieve this performance using only one or two feature functions that require no resources besides a corpus on which to train word embeddings. We did find, however, that we could further improve results by using the lexical portion of the Paraphrase Database (PPDB) \cite{GanitkevitchDC13}. Furthermore, since our model uses such a small feature set, we obtain similar performance when training on only a small portion of the training data: just 500 of the 4,500 available training examples give us a state-of-the-art result on the entailment task and competitive performance on the STS task.

\section{Related Work}

A variety of techniques have been applied successfully to the SICK dataset. In the 2014 SemEval, the best systems for the text similarity and textual entailment tasks relied on significant feature engineering and on external resources such as WordNet and PPDB, in addition to incorporating syntactic information. Recently, deep learning techniques such as tree LSTMs \cite{tai2015improved} and Recursive Neural Tensor Networks \cite{bowman2014recursive} have been applied to these problems. The drawbacks of these methods are that their large parameter spaces make training slow and that it is difficult to gain intuition about how the models arrive at their decisions. Moreover, these methods require either a constituency or dependency parse of the corpora.

Our work contrasts with both of these approaches: we avoid feature engineering, as we only require one or two feature functions. Moreover, it is easy to determine exactly how our model makes decisions by examining the learned weight vectors. Additionally, our model is fast and easy to train, so no clusters or graphics cards are required; the model trains in a matter of minutes.

Latent alignment approaches have previously been used in machine translation, paraphrase detection, and textual entailment \jwcomment{include some citations here}. The closest model to the one presented here is that of \newcite{chang2010discriminative}. One limitation of their model is that it is restricted to binary classification, whereas our model also handles regression, which allows text similarity to be an application. Secondly, our model can be trained in an online fashion, while theirs requires a batch approach; moreover, they must repeatedly cycle through the negative examples, which can make the optimization slow and its termination unpredictable.

%For one, our model is not limited to binary classification. Secondly our model is an online approach that linearly goes through the training data. This in contrast to lclr that linearly goes through the positive examles in the dataset but goes theough the negative examples repeatedly until alignments are no longer added to a cach. Thus in practice , this creates unpredictability as the algorithm can be stuck in this loop a long time without making much progress towards the glibal solution. Lastly, our model is easy to implement and can be optimized with stochastic gradient descent making it suitable for the evaluation of word and phrase embeddings. Latent variable models have been used before for paraphrase detection or textual entailment \cite{chang2010discriminative} \cite{das2009paraphrase}.

\section{Latent Alignment Model}

In the SemEval task, textual entailment is a multi-class problem with three classes (Entailment, Contradiction, and Neutral), while textual similarity is a regression problem. We chose to model both problems as regression by mapping the entailment labels to real numbers (Contradiction=0, Neutral=1, Entailment=2). This allows us to use the same model for both tasks. We model the problem as regularized least squares with latent variables.

We assume we are given a training set of $d$ examples, where each example contains two sentences that we convert to two lists of tokens, $N$ and $M$. We then seek to align each token in these lists either to (1) a token in the other list or to (2) a NULL token that we add to each list, which is equivalent to deleting that token. Given a weight vector $\bw$, we seek values for a set of binary variables $\{ 1_{n,m} \}$ indicating whether tokens $n$ and $m$ are aligned, and we obtain these by solving the following integer linear program:
%\begin{multline} \label{eq:learning}
%\underset{\{ 1_{n,m} \}}{\text{max}}  \bw^T \sum_n^{\|N\|+1} \sum_m^{\|M\|+1} 1_{n,m}\frac{f(x_n, x_m, N, M)}{\|N\| + \|M\|} \\
%\forall m, \sum_n^{\|N\|+1} 1_{n,m} = 1 \\
%\forall n, \sum_m^{\|M\|+1} 1_{n,m} = 1 \\
%\end{multline}
  \begin{alignat}{2}
    \underset{\{ 1_{n,m} \}}{\text{max}} \quad & \bw^T \sum_{n=1}^{\|N\|+1} \sum_{m=1}^{\|M\|+1} 1_{n,m}\frac{f(x_n, x_m, N, M)}{\|N\| + \|M\|} \\
    \text{subject to: } & \forall m, \; \sum_{n=1}^{\|N\|+1} 1_{n,m} = 1 \nonumber \\
    & \forall n, \; \sum_{m=1}^{\|M\|+1} 1_{n,m} = 1 \nonumber
  \end{alignat}
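For concreteness, the sketch below shows one way the alignment step could be implemented. It is a minimal illustration rather than our exact implementation: it pads both token lists with NULL tokens so that the problem becomes a square assignment problem, and solves it with the Hungarian algorithm via \texttt{scipy.optimize.linear\_sum\_assignment} instead of a general ILP solver; the \texttt{score} argument is a hypothetical function computing $\bw^T f(x_n, x_m, N, M)$.
\begin{verbatim}
import numpy as np
from scipy.optimize import linear_sum_assignment

NULL = "<null>"

def align(tokens_n, tokens_m, score):
    # Pad each list with NULL tokens so every real token
    # may be aligned to NULL (i.e., deleted).
    N = tokens_n + [NULL] * len(tokens_m)
    M = tokens_m + [NULL] * len(tokens_n)
    norm = len(tokens_n) + len(tokens_m)
    S = np.zeros((len(N), len(M)))
    for i, x_n in enumerate(N):
        for j, x_m in enumerate(M):
            S[i, j] = score(x_n, x_m) / norm
    # linear_sum_assignment minimizes, so negate to maximize.
    rows, cols = linear_sum_assignment(-S)
    return [(N[i], M[j]) for i, j in zip(rows, cols)]
\end{verbatim}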
Given the alignment, we then update the parameters by minimizing the following objective function:
\begin{multline} \label{eq:learning}
\underset{\bw}{\text{min}} \;
\Bigg(y - \bw^T \sum_{n=1}^{\|N\|+1} \sum_{m=1}^{\|M\|+1} 1_{n,m}\frac{f(x_n, x_m, N, M)}{\|N\| + \|M\|}\Bigg)^2 \\
+ \frac{\lambda}{2} \|\bw\|^2
\end{multline}
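Given the alignment for a single example, the corresponding stochastic gradient step on Equation~\ref{eq:learning} is straightforward. The sketch below is illustrative; \texttt{phi} is assumed to be the summed, length-normalized feature vector $\sum_{n,m} 1_{n,m} f(x_n, x_m, N, M) / (\|N\| + \|M\|)$ represented as a NumPy array.
\begin{verbatim}
import numpy as np

def sgd_step(w, y, phi, lam, lr=0.05):
    # Gradient of (y - w.phi)^2 + (lambda/2)||w||^2
    # with respect to w.
    residual = y - w.dot(phi)
    grad = -2.0 * residual * phi + lam * w
    return w - lr * grad
\end{verbatim}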
%\begin{multline} \label{eq:learning}
%\underset{\bw}{\text{min}}   \sum_{i=1}^d \Bigg(y - \max_{\bh_i\in C}\bw^T \sum_{h \in \bh_i} h\phi_h(\bx_i)\bigg)^2 \\
% + \frac{\lambda}{2} \|\bw\|^2 
%\end{multline}
%\begin{multline} \label{eq:phrase}
%\underset{W,b,W_w}{\text{min}} \frac{1}{|X|}\Bigg(\sum_{\langle x_1,x_2\rangle \in X} \\
%\max(0,\delta - g(x_1)\cdot g(x_2) + g(x_1) \cdot g(t_1)) \\
%+ \max(0,\delta - g(x_1)\cdot g(x_2) + g(x_2) \cdot g(t_2))\bigg) \\
%+ \lambda_W (\norm{W}^2 + \norm{b}^2) + \lambda_{W_w}\norm{W_{w_{\mathit{initial}}} - W_w}^2
%\end{multline} 
%\begin{multline} \label{eq:learning}
%\underset{\bw}{\text{min}}   \sum_{i=1}^d \Bigg(y - \max_{\bh_i\in C}\bw^T \sum_{h \in \bh_i} h\phi_h(\bx_i)\bigg)^2 \\
% + \frac{\lambda}{2} \|\bw\|^2 
%\end{multline}
%y - \bw^T\sum_n^N \sum_m^M 1_{n,m} f(x_n,x_m,N,M)
%\begin{multline} \label{eq:learning}
%\underset{\bw}{\text{min}}   \sum_{i=1}^l \Bigg(y - \max_{\bh_i\in C}\bw^T \sum_{h \in \bh_i} h\phi_h(\bx_i)\bigg)^2 \\
% + \frac{\lambda}{2} \|\bw\|^2 
%\end{multline}
%\subsection{Phrase Embeddings and Phrase Similarity}
%Recently there has been work on creating phrase embeddings \cite{TACL586} \cite{wieting2015ppdb} \cite{yin-schutze}. However in this work we focus on simpler ways of calculating the similarity of phrases. We use two different methods to do so for a given phrase pair $\langle p1,p2 \rangle$: (1) We sum the word embeddings in each $p_i$ and use the cosine between them as the score. The second is that we use LLM \cite{do2009robust}, using a slight modification to make the metric symmetric.  We call this symmetric version, \newllm. The modification is as follows: Let $\langle p_1, p_2 \rangle$ be a pair of phrases. Then for each token in $p_1$, we compute the maximal similarity with all tokens in $p_2$, sum these maximal similarities, and then divide by the number of tokens in $p_1$. We repeat the computation with $p_1$ and $p_2$ switched. We then average these two scores. For the similarity metric in LLM, we use the cosine of the various embeddings.
\section{Experiments and Results}

\begin{table*}[t!]
\setlength{\tabcolsep}{4pt} % General space between cols (6pt standard)
\small
\centering
\begin{tabular} {|l || c | c | c |} \hline
\bf Model &\bf $r$ &\bf $\rho$ &\bf MSE\\
\hline
\cite{SocherKLMN14} DT-RNN & 0.7863 & 0.7305 & 0.3983\\
\cite{SocherKLMN14} SDT-RNN & 0.7886 & 0.7280 & 0.3859\\
\hline
\cite{lai2014illinois} & 0.7993 & 0.7538 & 0.3692\\
\cite{bjerva2014meaning} & 0.8070 & 0.7489 & 0.3550 \\ 
\cite{jimenez2014unal} & 0.8268 & 0.7721 & 0.3224 \\
\cite{zhao2014ecnu} & 0.8414 & - & -\\
\hline
\cite{tai2015improved} Const. LSTM & 0.8491 & 0.7873 & 0.2852 \\
\cite{tai2015improved} Dep. LSTM & 0.8627 & 0.8032 & 0.2635\\
\hline
LSTM & 0.8477 & 0.7921 & 0.2949\\
Bi-directional LSTM & 0.8522 & 0.7952 & 0.2850\\
2-layer LSTM & 0.8411 & 0.7849 & 0.2980\\
2-layer Bi-directional LSTM & 0.8488 & 0.7926 & 0.2893\\
\hline
\latentalign (word feats only) & 0.7390 & 0.7626 & 0.4634\\
\hline
%best phrase sim (word feats) & X & 0.7990 & Z\\
\latentalign (best word sim + word feats) & 0.7761 & \bf 0.8038 & 0.4600\\
%best word + phrase (word feats) & X & 0.8023 &Z\\
%\hline
%best phrase sim & X & 0.7454 & Z\\
\latentalign (best word sim only) & 0.7085 & 0.7507 & 0.5236\\
\hline
\latentalign (best word sim + word feats - 500 examples) & X & Y &Z\\
\latentalign (best word sim only - 500 examples) & X & Y &Z\\
%best word sim + word feats & 85.7 \\
%best word + phrase (word feats) & \bf 86.9 \\
%\hline
%best phrase sim & 84.3 \\
%best word sim only & 84.0 \\
%best word sim + word feats (100 ex.) & X
%best word sim only (100 ex) & X
%best word + phrase & X & 0.7515 &Z\\
%best word + phrase (100 ex.) & X & 0.7522 & Z\\
\hline
\end{tabular}
\caption{
Results on the SICK Semantic Text Similarity (STS) task.
\vspace{-0.4cm}}
\label{table:sim}
\end{table*}

\subsection{Word Embeddings and Word Similarity}
For both \glove and \skipgram, we trained 25-dimensional embeddings on Wikipedia\footnote{We used the December 2, 2013 snapshot.}. We experimented with context window sizes in $\{3, 5, 10, 15\}$ when training these models, and trained each for 15 iterations. For our vocabulary, we used the 100,000 most common tokens in our Wikipedia snapshot.
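As an illustration, the \skipgram embeddings could be trained roughly as follows with gensim. This is a hedged sketch rather than our exact setup; \texttt{wiki\_sentences} is a hypothetical iterator over the tokenized Wikipedia snapshot, and the parameter names assume gensim~4.x.
\begin{verbatim}
from gensim.models import Word2Vec

model = Word2Vec(
    sentences=wiki_sentences,  # tokenized Wikipedia
    vector_size=25,            # 25-dimensional embeddings
    window=3,                  # one of {3, 5, 10, 15}
    sg=1,                      # skip-gram rather than CBOW
    epochs=15,                 # 15 training iterations
    max_final_vocab=100000,    # 100,000 most common tokens
)
\end{verbatim}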

We also experimented with \paragram embeddings \cite{wieting2015ppdb}. These embeddings were created by training on the lexical pairs in PPDB; the authors initialized their model with \skipgram embeddings and included a penalty term for deviating from the initial embeddings. The embeddings were tuned on the SimLex-999 dataset \cite{HillRK14} and were found to be crucial to the success of the resulting phrase embeddings.

We also use a similarity metric based on PPDB, which we call PPDB$_{sim}$; it simply returns 1 if the tokens are paired with each other in the lexical XL section of PPDB and 0 otherwise.
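A minimal sketch of this metric is shown below; the set of lexical XL pairs is assumed to have been extracted from PPDB in a preprocessing step, and the file-parsing details are illustrative rather than our exact code.
\begin{verbatim}
def load_ppdb_pairs(path):
    # Assumes one "lhs ||| source ||| target ||| ..."
    # rule per line, as in the PPDB lexical distribution.
    pairs = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.split(" ||| ")
            src, tgt = fields[1].strip(), fields[2].strip()
            pairs.add((src, tgt))
            pairs.add((tgt, src))  # make it symmetric
    return pairs

def ppdb_sim(x_n, x_m, ppdb_pairs):
    return 1.0 if (x_n, x_m) in ppdb_pairs else 0.0
\end{verbatim}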

\subsection{Features and Experiments}
We evaluate our model using different sets of features. We found that a feature set consisting simply of a conjunction of the tokens $x_n$ and $x_m$ was very powerful, likely due to the limited and repeated vocabulary in the dataset. Since our goal was to have a minimal and purely lexical feature set, we also experimented with one other feature type: the word similarity of $x_n$ and $x_m$.

We ran two different sets of experiments. The first set of experiments included the word-conjunction features and one word similarity metric; thus each model had two feature sets: the set of word conjunctions and a real-valued score from the chosen similarity metric. The second set of experiments included only the one word similarity metric, with no word-conjunction features. We then chose the models with the best performance on the development set and report those results in Tables~\ref{table:sim} and~\ref{table:ent}, which show the results on the text similarity and entailment tasks along with the top results from SemEval and more recent results.
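To make the feature sets concrete, the sketch below shows the features extracted for a single aligned token pair in the first set of experiments. The sparse dictionary representation and the \texttt{word\_sim} callback (cosine of embeddings or PPDB$_{sim}$, depending on the experiment) are illustrative assumptions rather than our exact implementation.
\begin{verbatim}
def pair_features(x_n, x_m, word_sim):
    feats = {}
    # Word-conjunction feature for this token pair.
    feats["conj=" + x_n + "_" + x_m] = 1.0
    # Single real-valued word-similarity feature.
    feats["word_sim"] = word_sim(x_n, x_m)
    return feats
\end{verbatim}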

%We then experimented by using both the best performing word and phrase similarity along with the word-word features.

Also, since our models have few parameters to learn, we trimmed the training set from 4,500 examples to just 500\footnote{These examples were chosen randomly.} to illustrate that we can achieve strong performance on the dataset with only a fraction of the overall training data. Such strong results with limited training data are unlikely for the neural network models that perform best on the STS task \cite{tai2015improved}.


%or one phrase similarity metric
%, but this time, we replaced the set of word conjunctions with a single feature indicating whether either of $x_n$ or $x_m$ were the NULL token. Again, we also experimented by using both the best performing word and phrase similarity along with the NULL indicating feature. Also, since there are only three features in this model and thus few parameters to learn, we trimmed the training set from 4500 examples to just 100 examples.

\subsection{Experimental Settings}

\begin{table*}[th]
\setlength{\tabcolsep}{4pt} % General space between cols (6pt standard)
\small
\centering
\begin{tabular} {|l | c |} \hline
\bf Model & \bf Accuracy \\
\hline
\cite{lai2014illinois} & 84.6 \\
\cite{proisl2014semantiklue} & 82.3\\ 
\cite{jimenez2014unal} & 83.1\\
\cite{zhao2014ecnu} & 84.1\\
\hline
RNTN \cite{bowman2014recursive} & 76.9 \\
\hline
\latentalign (word feats only) & 85.1 \\
\hline
%best phrase sim (word feats) & 85.7 \\
\latentalign (best word sim + word feats) & \bf 86.3 \\
%best word + phrase (word feats) & \bf 86.9 \\
%\hline
%best phrase sim & 84.3 \\
\latentalign (best word sim only) & 84.7 \\
\hline
\latentalign (best word sim + word feats - 500 examples) & X\\
\latentalign (best word sim only - 500 examples) & X\\
%best word + phrase & 83.2 \\
%best word + phrase (100 ex.) & 81.1 \\
\hline
\end{tabular}
\caption{
Results on the SICK Textual Entailment task.
\vspace{-0.4cm}}
\label{table:ent}
\end{table*}

During training, we used a fixed learning rate of 0.05 and optimized with stochastic gradient descent for 50 iterations. We tuned $\lambda$ over the set $\{10^{-4}, 10^{-5}, 10^{-6}\}$. Since the optimization is non-convex, we tuned our model using the development set, selecting the model with the best performance on this partition\footnote{In the case of a tie, we took the model from the later iteration.}.

We initialized the weight vector to all zeros, with the exception of the weight for the similarity metric, which was set to 1. We found that this initialization had a significant impact on the quality of the results.
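Putting the pieces together, the overall training and model-selection procedure can be sketched as follows. This is an illustrative outline under the settings above, reusing the \texttt{align} and \texttt{sgd\_step} sketches from Section~3; \texttt{feature\_vector}, \texttt{evaluate}, \texttt{train}, \texttt{dev}, \texttt{n\_features}, and \texttt{SIM\_INDEX} (the index of the similarity-metric weight) are hypothetical names standing in for the components described earlier.
\begin{verbatim}
import numpy as np

best_w, best_dev = None, float("-inf")
for lam in [1e-4, 1e-5, 1e-6]:
    w = np.zeros(n_features)
    w[SIM_INDEX] = 1.0  # similarity weight starts at 1
    for epoch in range(50):
        for tokens_n, tokens_m, y in train:
            pairs = align(
                tokens_n, tokens_m,
                score=lambda a, b: w.dot(
                    feature_vector(a, b)))
            norm = len(tokens_n) + len(tokens_m)
            phi = sum(feature_vector(a, b)
                      for a, b in pairs) / norm
            w = sgd_step(w, y, phi, lam, lr=0.05)
        # Non-convex objective: select on the dev set,
        # breaking ties toward the later iteration.
        d = evaluate(w, dev)
        if d >= best_dev:
            best_w, best_dev = w.copy(), d
\end{verbatim}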

\subsection{STS Results}

%For this task, the best word and phrase similarity (with word features) was PPDB$_{sim}$  and \newllm-PPDB$_{sim}$ respectively. Without word features, the best metrics were
For this task, the evaluation metrics are Pearson's $r$, Spearman's $\rho$, and mean squared error (MSE). As can be seen in Table~\ref{table:sim}, our results approach the state-of-the-art methods in $\rho$, but are only middling in $r$ and MSE.


\begin{table}[h]
\setlength{\tabcolsep}{4pt} % General space between cols (6pt standard)
\small
\centering
\begin{tabular} {| c | c |} \hline
\bf Gold Score & \bf MSE \\
\hline
$[1,2)$ & 1.496 \\
$[2,3)$ & 0.3209\\ 
$[3,4)$ & 0.1892\\
$[4,5]$  & 0.3996\\
\hline
\end{tabular}
\caption{
MSE of our best word sim + word feats model on the STS task over different ranges of gold scores.
\vspace{-0.4cm}}
\label{table:mse}
\end{table}

An explanation for these results is that our model does poorly at predicting the scores for sentence pairs of low, and to a lesser extent high, similarity. For instance, even though 9\% of the test data has a similarity score of less than two, our model gives a score of less than two only 0.8\% of the time. This can also be seen in Table~\ref{table:mse}, where the MSE for gold scores on the interval $[1,2)$ is 1.496, but on the $[3,4)$ interval the MSE is just 0.1892, better than the global MSE of all the approaches in Table~\ref{table:sim}. It then rises again for examples with gold scores in $[4,5]$. This behavior is likely due to the small number of parameters in the model, compounded by how little global information the model uses. These factors constrain the model from outputting very high or very low scores. Even though our model has a restricted output range, however, it does well at ranking the pairs relative to each other, and so our $\rho$ is state-of-the-art.
% but since our predictions do deviate from the gold scores, our $r$ and MSE are mediocre when compared to the literature. 
Note that since $r$ is considered the main evaluation criterion for this dataset, we used $r$ to select models on the dev set so that we could make a fair comparison to the other approaches in the literature.
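For reference, the per-range MSE values in Table~\ref{table:mse} can be computed with a few lines of NumPy; the arrays \texttt{gold} and \texttt{pred} of gold scores and model predictions on the test set are assumed.
\begin{verbatim}
import numpy as np

# Small epsilon so the last bin [4,5] includes scores of 5.
edges = [1.0, 2.0, 3.0, 4.0, 5.0 + 1e-9]
for lo, hi in zip(edges[:-1], edges[1:]):
    mask = (gold >= lo) & (gold < hi)
    mse = np.mean((gold[mask] - pred[mask]) ** 2)
    print("[%g, %g): MSE = %.4f" % (lo, hi, mse))
\end{verbatim}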

Our results show that we can exceed the $\rho$ of all other models with just two feature sets: word conjunctions and a word similarity metric. The word similarity metric in this case was PPDB$_{sim}$; it was also the best-performing metric when the word-conjunction features were removed. The second-best metric for this task was the \paragram vectors, with a $\rho$ of 0.8063 with word-conjunction features and 0.7566 without. The best results for a method that does not use any knowledge resource (i.e., without PPDB) come from the \glove vectors (window=15) with a $\rho$ of 0.8031 when the word-conjunction features are included, and the \glove vectors (window=5) with a $\rho$ of 0.7313 when the similarity metric is the only feature.

\subsection{Entailment Results}

Table~\ref{table:ent} shows the results on the SICK entailment task. We again selected the word similarity metric using the development set, and found that with the word-conjunction features the best metric was \skipgram (window=3), while without them the best metric was \glove (window=5).

\section{Discussion}



\section{Conclusion}

We have presented a straightforward, easy-to-implement algorithm that achieves the best performance to date on the Textual Entailment and Semantic Text Similarity (STS) tasks of the SICK dataset \cite{marelli2014sick}, using only one or two simple feature functions. There is considerable room for further work: improving the model while staying within the latent alignment framework, investigating how such a model can be used to learn better word and phrase embeddings from sentence-level datasets, and possibly using it as an evaluation method for word and phrase similarity measures.
 
\bibliographystyle{acl}
\bibliography{emnlp.bib}

\end{document}