In the SICK dataset, textual entailment is a three-class problem (Entailment, Contradiction, and Neutral), while textual similarity is a regression problem. We model both as regression by mapping the entailment labels to real numbers (Contradiction=0, Neutral=1, Entailment=2), which lets us use the same model for both tasks: regularized least squares regression with latent variables.
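Purely as an illustration (the uppercase label strings follow the SICK annotation files; the dictionary itself is ours, not part of the dataset release), the label mapping can be written as:
\begin{verbatim}
# Hypothetical mapping from SICK entailment labels to regression targets.
ENTAILMENT_TO_SCORE = {"CONTRADICTION": 0.0, "NEUTRAL": 1.0, "ENTAILMENT": 2.0}
\end{verbatim}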

\begin{equation} \label{eq:ilp}
 \begin{aligned}
 & \underset{\{ 1_{n,m} \}}{\text{max}}
 & & \bw^T \sum_{n=0}^{|N|} \sum_{m=0}^{|M|} 1_{n,m}\frac{f(x_n, x_m)}{|N| + |M|} \\
 & \text{subject to}
 & & \forall m \geq 1, \sum_{n=0}^{|N|} 1_{n,m} = 1 \\
 &
 & & \forall n \geq 1, \sum_{m=0}^{|M|} 1_{n,m} = 1
 \end{aligned}
\end{equation}

We are given a training set of examples, where each example contains two sentences that we convert to two lists of tokens, $N$ and $M$. We then seek to align each token in these lists to either (1) a token in the other list or (2) a NULL token that we add to each list; the latter option is equivalent to deleting that token. We find the alignment by choosing values for a set of binary indicator variables $\{ 1_{n,m} \}$, where $1_{n,m}$ indicates whether tokens $x_n$ and $x_m$ are aligned. We do this by solving equation~(\ref{eq:ilp}), where $\bw$ is our parameter vector and $f$ is a function that returns the feature vector for a pair of tokens. The constraints require every non-NULL token on either side to be aligned exactly once, while the NULL tokens may participate in any number of alignments.\footnote{Note that in equation~\ref{eq:ilp} and equation~\ref{eq:learning} we assign the index of the NULL token to be 0 for both $N$ and $M$.}
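For concreteness, the following is a minimal sketch of how equation~(\ref{eq:ilp}) could be set up with Gurobi's Python interface (gurobipy); it is illustrative rather than our exact implementation, and the names \texttt{w} and \texttt{feature\_vector} are stand-ins for $\bw$ and $f$.
\begin{verbatim}
# Illustrative sketch of the alignment ILP in equation (eq:ilp) using gurobipy.
# `w` (the weight vector) and `feature_vector` (the feature function f) are
# assumed to be provided; index 0 on each side denotes the NULL token.
import numpy as np
import gurobipy as gp
from gurobipy import GRB

def align(tokens_n, tokens_m, w, feature_vector):
    N, M = len(tokens_n), len(tokens_m)
    # Edge scores w^T f(x_n, x_m) / (|N| + |M|).
    score = np.zeros((N + 1, M + 1))
    for n in range(N + 1):
        for m in range(M + 1):
            x_n = tokens_n[n - 1] if n > 0 else None  # None stands for NULL
            x_m = tokens_m[m - 1] if m > 0 else None
            score[n, m] = np.dot(w, feature_vector(x_n, x_m)) / (N + M)

    model = gp.Model("alignment")
    a = model.addVars(N + 1, M + 1, vtype=GRB.BINARY, name="a")
    # Every non-NULL token is aligned exactly once (possibly to NULL).
    model.addConstrs(a.sum("*", m) == 1 for m in range(1, M + 1))
    model.addConstrs(a.sum(n, "*") == 1 for n in range(1, N + 1))
    model.setObjective(
        gp.quicksum(float(score[n, m]) * a[n, m]
                    for n in range(N + 1) for m in range(M + 1)),
        GRB.MAXIMIZE)
    model.optimize()
    return np.array([[a[n, m].X for m in range(M + 1)]
                     for n in range(N + 1)])
\end{verbatim}
The recovered values of $\{ 1_{n,m} \}$ are then held fixed during the parameter update described next.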

We solve this optimization problem with an integer linear programming solver.\footnote{We use Gurobi: \url{http://www.gurobi.com/}.} Then, given the alignment, we update the parameters by taking a gradient step to minimize the following objective function:
\begin{multline} \label{eq:learning}
\underset{\bw}{\text{min}} \quad
\Bigg(y - \bw^T \sum_{n=0}^{|N|} \sum_{m=0}^{|M|} 1_{n,m}\frac{f(x_n, x_m)}{|N| + |M|}\Bigg)^2 \\
+ \frac{\lambda}{2} \|\bw\|^2
\end{multline}
\noindent where $y$ is the ground-truth similarity score (or mapped entailment label) from the training data.
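For reference, letting $\mathbf{g} = \sum_{n=0}^{|N|} \sum_{m=0}^{|M|} 1_{n,m}\frac{f(x_n, x_m)}{|N| + |M|}$ denote the aligned feature sum held fixed from equation~(\ref{eq:ilp}), the gradient step on equation~(\ref{eq:learning}) for a single example takes the form
\begin{equation*}
\bw \leftarrow \bw + \eta \Big( 2\big(y - \bw^T \mathbf{g}\big)\,\mathbf{g} - \lambda \bw \Big),
\end{equation*}
where $\eta$ is a learning rate not specified above.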