Commit 102259be authored by wieting2
new version

parent 3d0828a9
We have presented a straightforward, easy-to-implement algorithm that achieves the best performance to date, or is comparable to it, on the Textual Entailment and Semantic Textual Similarity tasks of the SICK dataset \cite{marelli2014sick}. Our approach achieves these results using only one or two feature templates that are strictly lexical, does not require external resources or NLP tools, and trains in minutes. There is substantial room for further work, both in improving the model and in incorporating more discriminative features or better similarity templates.\footnote{Note that we used just 25-dimensional embeddings in our experiments; higher-dimensional embeddings, along with a different objective function, would be a natural starting point for improving our results.} Additionally, we would like to investigate how such a model can be used to learn better word and phrase embeddings from sentence-level datasets, and also how it can be used as an evaluation method for word and phrase similarity metrics.
%\jwcomment{Points to disucss: performance of 500 example training, any comments about which metrics "won" for the tasks, only used 25 dim vectors. I could possibly also include the weights for various words - might be interesting ... anything else?}
%One of the interesting characteristics of our model is that we can look at our learned parameter weights and determine how it was able to classify an unseen example. We trained our word conjunction models for both the textual entailment and STS tasks and examined the weights of the best performing models on their respective dev sets. A few interesting patterns emerge.
year={2014}
}
@article{brown1993mathematics,
title={The mathematics of statistical machine translation: Parameter estimation},
author={Brown, Peter F and Pietra, Vincent J Della and Pietra, Stephen A Della and Mercer, Robert L},
journal={Computational linguistics},
volume={19},
number={2},
pages={263--311},
year={1993},
publisher={MIT Press}
}
@inproceedings{chang2010discriminative,
title={Discriminative learning over constrained latent representations},
\usepackage[hidelinks]{hyperref}
\captionsetup{font=footnotesize}
\newcommand{\jwcomment}[1]{\textcolor{cyan}{\bf \small [ #1 --JW]}}
%\newcommand{\jwcomment}[1]{\textcolor{cyan}{}}
\newcommand{\bx}{\mathbf{x}}
\newcommand{\bh}{\mathbf{h}}
% smaller than 5cm (the original size); we will check this
% in the camera-ready version and ask you to change it back.
\title{Latent Variable Regression for Text Similarity and Textual Entailment}
\author{First Author \\
\maketitle
\begin{abstract}
We present a latent alignment algorithm that gives state-of-the-art results on the Textual Entailment and nearly state-of-the-art results on the Semantic Textual Similarity (STS) tasks of the SICK dataset~\cite{marelli2014sick}. Our model accomplishes this despite using at most two feature templates: word conjunctions and a single word similarity metric. Furthermore, since our model has a small feature space, we achieve performance competitive with reported results in the literature after training on only 500 examples. Our model is a very strong baseline for paraphrase detection, textual entailment, and text similarity tasks, with significant potential for further improvement.
\end{abstract}
\section{Introduction}
\section{Experiments and Results}
\input{experiments}
\section{Analyzing the Learned Weights}
%\section{Weight Vector Analysis}
%\input{discussion}
As mentioned previously, our model is interpretable. To explore this further, we trained models with just word conjunction features on both the textual entailment and STS tasks and examined the weight vectors of the best-performing models on their respective dev sets. We found that the model trained for the STS task puts its highest weights on identity word pairs and word pairs that are similar in some way (synonyms like {\it cut} and {\it slice}, lemmas, or topically related pairs like {\it orange} and {\it slice}). The most negative weights were on word pairs that were not related, like {\it play} and {\it woman}, or antonyms such as {\it man} and {\it woman}. In contrast, the entailment model placed its highest weights on mapping stop words like {\it a}, {\it which}, and {\it who} to the NULL token. The most negative weights, not surprisingly, were on word pairs involving negation words like {\it not}, {\it no}, and {\it nobody}, as well as unrelated word pairs like {\it play} and {\it there}.
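The inspection itself is easy to reproduce; the following sketch (our own illustration, assuming the learned word-conjunction weights have been collected into a Python dictionary mapping word pairs to weights) lists the most positive and most negative pairs.
\begin{verbatim}
def extreme_pairs(conj_weights, k=10):
    # conj_weights: dict mapping (word_n, word_m) pairs to learned weights
    ranked = sorted(conj_weights.items(), key=lambda kv: kv[1])
    top = list(reversed(ranked[-k:]))   # most positive pairs
    bottom = ranked[:k]                 # most negative pairs
    return top, bottom
\end{verbatim}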
\section{Conclusion and Future Work}
\input{conclusion}
\bibliographystyle{acl}
%We evaluate our model using two different sets of features. We found that a feature set of simply a conjunction of the tokens $x_n$ and $x_m$ was very powerful - likely due to the limited and repeated vocabulary in the dataset. Since our goal was to have a minimal and purely lexical feature set, we just one other feature - the lexical similarity of $x_n$ and $x_m$. %as well as the context similarity of the text around $x_n$ and $x_m$.
We present three different experiments. The first investigates the performance of using both of our feature templates (word conjunctions and a single word similarity metric). %r one phrase similarity metric.
The second investigates how a single similarity metric performs by itself. In the third experiment, we investigate how the best models from the two previous experiments perform on a much smaller training set, since there are few parameters to learn.\footnote{The number of word conjunction features for our model is 47,728. However, in the resulting models most of these are set to 0, with only a few thousand having nonzero weight.} To accomplish this, we trimmed the training set of 4,500 examples to just 500 examples.\footnote{These examples were chosen randomly.}
%Such strong results with limited training data are unlikely for the neural network models that have such strong performance on the STS tasks \cite{tai2015improved} as they have so many parameters to learn.
\subsection{Experimental Settings}
During training, we used a fixed learning rate of 0.05 and optimized with stochastic gradient descent for 50 iterations. We tuned $\lambda$ over the set $\{10^{-4}, 10^{-5}, 10^{-6}\}$. Since the optimization problem is non-convex, we performed early stopping and selected the model with the best performance on the development set.\footnote{In the case of a tie, we took the model at the later iteration.}
We initialized the weight vector to all zeros, with the exception of the weight for the similarity metric. Since this feature should have a relatively high positive weight at the end of learning, we initialized it to 1 so that it starts closer to its final value. We found that this initialization had a significant impact on the quality of the results.
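For concreteness, the overall training procedure described above can be sketched as follows. This is an illustrative sketch rather than the exact implementation; the helpers \texttt{featurize} (which solves the alignment of equation~\ref{eq:ilp} under the current weights and returns the summed, normalized feature vector together with the gold score) and \texttt{dev\_score} (which evaluates a candidate weight vector on the development set) are hypothetical placeholders.
\begin{verbatim}
import numpy as np

def train(train_data, dev_data, featurize, dev_score,
          num_features, sim_index, lam, lr=0.05, iters=50):
    # all-zero initialization, except the similarity-metric weight starts at 1
    w = np.zeros(num_features)
    w[sim_index] = 1.0
    best_w, best_score = w.copy(), float("-inf")
    for _ in range(iters):
        for example in train_data:
            # featurize: solve the alignment ILP under w, return (phi, y)
            phi, y = featurize(w, example)
            residual = y - w.dot(phi)
            w -= lr * (-2.0 * residual * phi + lam * w)
        score = dev_score(w, dev_data)
        if score >= best_score:   # ">=" keeps the later iterate on ties
            best_w, best_score = w.copy(), score
    return best_w
\end{verbatim}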
\subsection{Entailment Results}
\label{table:ent}
\end{table}
Table~\ref{table:ent} shows the results on the SICK entailment task. We find that we can match all results from prior work with only a {\it single} word similarity metric. This model, which achieves 84.7\%, has only two parameters to learn: the weight for the similarity metric and the weight of a bias feature. The results then improve significantly when we add the word conjunction features, reaching 86.3\% and outperforming the best previous result by 1.7\% absolute.
We found that with word features, the best metric (tuned on the development data) was the cosine of the \skipgram ($w$=3) embeddings. When the word conjunction features were removed, the best metric was the cosine of the \glove ($w$=5) embeddings. Overall, the entailment task is not very sensitive to the type of similarity metric used. For instance, with the word conjunction features, all the metrics we tested were within 1.2\% absolute of each other on the dev set. When the word conjunction features were removed, they were all within 0.8\% of each other, with the exception of PPDB$_{sim}$, which lagged behind the best metric by 2.0\%.
\subsection{STS Results}
\hline
%\cite{SocherKLMN14} DT-RNN & 0.7863 & 0.7305 & 0.3983\\
%\cite{SocherKLMN14} SDT-RNN & 0.7886 & 0.7280 & 0.3859\\
\cite{SocherKLMN14} DT-RNN & 0.7319 \\
\cite{SocherKLMN14} SDT-RNN & 0.7304\\
\hline
%\cite{lai2014illinois} & 0.7993 & 0.7538 & 0.3692\\
%\cite{bjerva2014meaning} & 0.8070 & 0.7489 & 0.3550 \\
\cite{bjerva2014meaning} & 0.7489 \\
\cite{jimenez2014unal} & 0.7721 \\
%\cite{zhao2014ecnu} & 0.8414 & - & -\\
%\hline
%\cite{tai2015improved} Const. LSTM & 0.8491 & 0.7873 & 0.2852 \\
%\cite{tai2015improved} Dep. LSTM & 0.8627 & 0.8032 & 0.2635\\
%LSTM & 0.8477 & 0.7921 & 0.2949\\
%Bi-directional LSTM & 0.8522 & 0.7952 & 0.2850\\
%2-layer LSTM & 0.8411 & 0.7849 & 0.2980\\
%2-layer Bi-directional LSTM & 0.8488 & 0.7926 & 0.2893\\
LSTM & 0.7911 \\
Bi-directional LSTM & 0.7966 \\
2-layer LSTM & 0.7896 \\
2-layer Bi-directional LSTM & 0.7965 \\
\hline
\cite{tai2015improved} Const. LSTM & 0.7966 \\
\cite{tai2015improved} Dep. LSTM & \bf 0.8083 \\
\hline
%\latentalign (word feats only) & 0.7390 & 0.7626 & 0.4634\\
\latentalign (word conj only) & 0.7626 \\
\hline
\end{tabular}
\caption{
Results on the SICK Semantic Text Similarity task. %(STS) task.
\vspace{-0.4cm}}
\label{table:sim}
\end{table}
%For this task, the best word and phrase similarity (with word features) was PPDB$_{sim}$ and \newllm-PPDB$_{sim}$ respectively. Without word features, the best metrics were
%For this task, the evaluation metrics are Pearson's $r$, Spearman's $\rho$, and mean squared error (MSE). Interestingly, we can approach the state-of-the-art methods with $\rho$, but are ordinary when it comes to $r$ and MSE as can be seen in Table ~\ref{table:sim}.
Our results for this task are shown in Table~\ref{table:sim}. We report Spearman's $\rho$ and compare to the best results from the shared task and from \newcite{tai2015improved}.
Our results show that we can exceed the performance of all models with just two feature templates: word conjunctions and a single word similarity metric. The word similarity metric in this case was PPDB$_{sim}$; it was also the best-performing metric when the word conjunction features were removed. The second-best metric for this task was the \paragram vectors, with $\rho$ of 0.8063 and 0.7566 with and without word conjunction features, respectively. The best results for a method that does not use any knowledge resource (i.e., without PPDB) are the \glove vectors ($w$=15) with a $\rho$ of 0.8031 and the \glove vectors ($w$=5) with a $\rho$ of 0.7313, with and without word conjunctions, respectively.
We tuned our models on Pearson's $r$ to be consistent with the literature, but reported Spearman's $\rho$ above. The task also used Pearson's $r$ and mean squared error (MSE) as evaluation metrics; on these metrics, our performance was fairly average. For our model with a word similarity metric and word conjunctions, they were 0.7761 and 0.4600 respectively, while for the model with just the similarity metric they were 0.7085 and 0.5236.
\begin{table}[h]
\setlength{\tabcolsep}{4pt} % General space between cols (6pt standard)
\label{table:mse}
\end{table}
This means that we are ranking the sentence pairs well, but not predicting the actual scores with as much success. This can be seen in Table~\ref{table:mse}, which shows the MSE for several intervals of the gold scores.
%The MSE for gold scores in the interval $[1,2)$ is 1.496, but on the $[3,4)$ interval the MSE is just 0.1892.
\newcite{tai2015improved} briefly discuss optimizing MSE during training, but found better results using a different objective function based on KL-divergence. An alternative would be to learn a transformation function from the output of our system to the gold scores; this would be expected to improve both MSE and $r$ while leaving the Spearman correlation unchanged. We leave this for future work.
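For example, a monotone map fit on the development set, sketched here with scikit-learn's isotonic regression, would leave Spearman's $\rho$ unchanged while potentially improving $r$ and MSE (the function and variable names are illustrative):
\begin{verbatim}
from sklearn.isotonic import IsotonicRegression

def calibrate(dev_pred, dev_gold, test_pred):
    # Fit a monotone map from system outputs to gold scores on dev,
    # then apply it to the test predictions.
    iso = IsotonicRegression(out_of_bounds="clip")
    iso.fit(dev_pred, dev_gold)
    return iso.predict(test_pred)
\end{verbatim}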
%\latentalign (best word sim + word feats) & 0.7761 & \bf 0.8038 & 0.4600\\
%\latentalign (best word sim + word feats) & \bf 0.8038 \\
% but since our predictions do deviate from the gold scores, our $r$ and MSE are mediocre when compared to the literature.
%Note that since $r$ is considered the main evaluation criteria for this dataset, we used $r$ to select models from the dev set so we can make a fair comparison to the other approaches in the literature.
Overall, the STS task showed much more variability across similarity metrics than the entailment task. With the word conjunction features, the gap in Pearson's $r$ between the best and worst metrics on the dev set is just 0.0103, but without these features the gap rises to 0.0386, with the worst metric being the \skipgram vectors ($w$=3).
\subsection{Word Embeddings and Word Similarity}
We experiment with three different word similarity metrics: (1) cosine of \glove and \skipgram word embeddings trained on Wikipedia,\footnote{We used the December 2, 2013 snapshot.} (2) cosine of embeddings trained on PPDB, and (3) a metric derived directly from PPDB.
We set the dimension, $n$, of our \glove and \skipgram embeddings to 25. Smaller embeddings are faster to train, but larger embeddings tend to improve results on most tasks. Another important hyper-parameter to consider when training word embeddings is the context window size $w$. It has been previously shown that a smaller window clusters words that function similarly while a large one clusters words that are more topically related~\cite{bansal2014tailoring}. We experimented with a variety of values for $w$ from $\{3, 5, 10, 15 \}$. Our word embeddings were trained for 15 iterations, and for our vocabulary, we used the 100,000 most common tokens in our Wikipedia snapshot.
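We do not prescribe a particular toolkit here, but as an illustration, a comparable \skipgram model could be trained with the gensim library roughly as follows (\glove requires its own toolkit; \texttt{sentences} is assumed to be an iterable of tokenized Wikipedia sentences):
\begin{verbatim}
from gensim.models import Word2Vec

def train_skipgram(sentences, window=3):
    # 25-dimensional skip-gram embeddings, 15 training iterations,
    # vocabulary limited to the 100,000 most frequent tokens
    model = Word2Vec(sentences, vector_size=25, window=window, sg=1,
                     epochs=15, max_final_vocab=100000, min_count=1)
    return {w: model.wv[w] for w in model.wv.index_to_key}
\end{verbatim}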
%While there are many choices for a similarity metric between words, neural word embeddings have become a popular topic recently and have shown to be strong performers on numerous tasks \jwcomment{need some citations for this part. TODO.} We chose to experiment with both
For the embeddings trained on PPDB, we experimented with \paragram embeddings \cite{wieting2015ppdb}. These embeddings, also 25-dimensional, were created by training on the lexical pairs in PPDB; the authors initialized their model with \skipgram embeddings and included a penalty term for deviating from the initial embeddings. The embeddings were tuned on the SimLex-999 dataset \cite{HillRK14} and were found to be crucial to the quality of the resulting phrase embeddings.
The metric based on PPDB, which we call PPDB$_{sim}$, simply returns 1 if the tokens are matched with each other in the lexical XL section of PPDB and 0 otherwise.
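Concretely, the two families of similarity metrics can be implemented as in the following sketch (an illustration only; the embedding table and the set of lexical XL pairs from PPDB are assumed to have been loaded elsewhere, and treating out-of-vocabulary words as having similarity 0 is our own choice):
\begin{verbatim}
import numpy as np

def cosine_sim(embeddings, a, b):
    # cosine of two word embeddings; 0 if either word is out of vocabulary
    if a not in embeddings or b not in embeddings:
        return 0.0
    u, v = embeddings[a], embeddings[b]
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def ppdb_sim(ppdb_xl_pairs, a, b):
    # 1 if the two tokens form a lexical paraphrase pair in PPDB (XL), else 0
    return 1.0 if (a, b) in ppdb_xl_pairs or (b, a) in ppdb_xl_pairs else 0.0
\end{verbatim}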
\subsection{Features}
For all of our experiments we use one or both of two feature templates. The first template consists of word conjunctions: binary features indicating whether tokens $x_n$ and $x_m$ should be aligned. This template includes many feature instantiations, depending on the vocabulary of the training data. The second template is a single word similarity metric, which is instantiated once. We choose the word similarity metric (either \glove, \skipgram, \paragram, or PPDB$_{sim}$) based on performance on the development set for each task.
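As a sketch, the two templates can be combined into the feature function $f(x_n, x_m)$ used by the aligner as follows (the sparse-dictionary representation and the \texttt{<NULL>} placeholder string are illustrative choices on our part):
\begin{verbatim}
def features(x_n, x_m, sim_metric):
    # sparse feature vector for aligning tokens x_n and x_m;
    # either token may be the special "<NULL>" placeholder
    feats = {}
    # word-conjunction template: one binary feature per token pair
    feats["conj=" + x_n + "|" + x_m] = 1.0
    # similarity template: a single real-valued feature
    if x_n != "<NULL>" and x_m != "<NULL>":
        feats["sim"] = sim_metric(x_n, x_m)
    return feats
\end{verbatim}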
Text similarity, paraphrase detection,\footnote{See \newcite{androutsopoulos2010survey} for a survey.} and textual entailment have attracted a lot of interest recently. This interest has been bolstered by the introduction of the SICK dataset \cite{marelli2014sick}, which has put the focus on phenomena that can be captured by compositional semantics. This frees researchers to focus on this narrower problem, as they no longer have to consider the separate issue of utilizing the encyclopedic knowledge inherent to most other paraphrase and textual entailment datasets. The SICK dataset is used for two different tasks: textual entailment and semantic textual similarity (STS).
In this work, we introduce a latent-alignment algorithm that gives state-of-the-art or near state-of-the-art performance on both of these tasks.\footnote{Implementation available at \url{https://www.dropbox.com/sh/se575mrr8h8fwz6/AAA5WI-4Pj6TX-iwleo8KeBWa?dl=0}} Moreover, we achieve such performance using at most two feature templates that require no resources besides a corpus on which to train word embeddings. We did find, however, that we could further improve results by using the lexical portion of the Paraphrase Database (PPDB) \cite{GanitkevitchDC13}. Furthermore, since our model utilizes such a small feature set, we can obtain similar performance by training on only a small portion of the training data: just 500 of the 4,500 available training examples give us a state-of-the-art result on the entailment task and competitive performance on the STS task.
In the SICK dataset, textual entailment is a multi-class problem with three classes (Entailment, Contradiction, and Neutral), while textual similarity is a regression problem. We chose to model both problems as regression by mapping the entailment labels to real numbers (Contradiction=0, Neutral=1, Entailment=2). This allows us to use the same model for both tasks. We model both problems as regularized least squares regression with latent variables.
\begin{equation} \label{eq:ilp}
\begin{aligned}
& \underset{\{ 1_{n,m} \}}{\text{max}}
& & \bw^T \sum_{n=0}^{|N|} \sum_{m=0}^{|M|} 1_{n,m}\frac{f(x_n, x_m)}{|N| + |M|} \\
& \text{subject to}
& & \forall m, \sum_{n=0}^{|N|} 1_{n,m} = 1 \\
&
& & \forall n, \sum_{m=0}^{|M|} 1_{n,m} = 1 \\
\end{aligned}
\end{equation}
We are given a training set of examples, where each example contains two sentences that we convert to two lists of tokens $N$ and $M$. We then seek to align each token in these lists to either: (1) a token in the other list or (2) a NULL token that we add to each list. The latter option is equivalent to deleting that token. We find the alignment assignment for each token by finding the values for a set of binary variables $\{ 1_{n,m} \}$ that indicate whether tokens $x_n$ and $x_m$ are aligned. We do this by solving equation~(\ref{eq:ilp}), where $\bw$ is our vector of parameters and $f$ is a function that returns the feature vector given two tokens.\footnote{Note that in equation~\ref{eq:ilp} and equation~\ref{eq:learning} we assign the index of the NULL token to be 0 for both $N$ and $M$.}
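As an illustration of equation~\ref{eq:ilp}, the alignment step can be written with an off-the-shelf ILP solver as in the sketch below. We use the open-source PuLP package here rather than Gurobi; \texttt{feature\_fn} is a placeholder for $f$, and we read the exactly-one constraints as applying to the real tokens, with the NULL token free to be reused.
\begin{verbatim}
import pulp

def align(tokens_n, tokens_m, w, feature_fn):
    # feature_fn(i, j): feature vector for aligning token i of N with
    # token j of M, where index 0 denotes the NULL token on either side
    N, M = len(tokens_n), len(tokens_m)
    norm = float(N + M)
    prob = pulp.LpProblem("latent_alignment", pulp.LpMaximize)
    a = pulp.LpVariable.dicts("a", (range(N + 1), range(M + 1)), cat="Binary")
    # objective: w^T times the length-normalized sum of aligned feature vectors
    prob += pulp.lpSum(
        a[n][m] * sum(wi * fi for wi, fi in zip(w, feature_fn(n, m))) / norm
        for n in range(N + 1) for m in range(M + 1))
    # each real token aligns to exactly one token (possibly NULL) on the other side
    for m in range(1, M + 1):
        prob += pulp.lpSum(a[n][m] for n in range(N + 1)) == 1
    for n in range(1, N + 1):
        prob += pulp.lpSum(a[n][m] for m in range(M + 1)) == 1
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [(n, m) for n in range(N + 1) for m in range(M + 1)
            if pulp.value(a[n][m]) > 0.5]
\end{verbatim}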
%\begin{multline} \label{eq:learning}
%\underset{\{ 1_{n,m} \}}{\text{max}} \bw^T \sum_n^{\|N\|+1} \sum_m^{\|M\|+1} 1_{n,m}\frac{f(x_n, x_m, N, M)}{\|N\| + \|M\|} \\
%\forall m, \sum_n^{\|N\|+1} 1_{n,m} = 1 \\
% & \forall n, \sum_{m=1}^{\|M\|+1} 1_{n,m} = 1 \\
% \nonumber
% \end{alignat}
We solve this optimization problem using an integer linear program solver.\footnote{We use Gurobi: \url{http://www.gurobi.com/}.} Then, given the alignment, we update the parameters by taking a gradient step to minimize the following objective function:
\begin{multline} \label{eq:learning}
\underset{\bw}{\text{min}} \\
\Bigg(y - \bw^T \sum_{n=0}^{|N|} \sum_{m=0}^{|M|} 1_{n,m}\frac{f(x_n, x_m)}{|N| + |M|}\Bigg)^2\\
+ \frac{\lambda}{2} \|\bw\|^2
\end{multline}
\noindent where $y$ is the ground truth similarity score from the training data.
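Concretely, with the alignment held fixed, one stochastic gradient step on equation~\ref{eq:learning} looks as follows. This is a sketch using NumPy; \texttt{phi} is assumed to denote the summed, length-normalized feature vector $\sum_{n,m} 1_{n,m} f(x_n, x_m)/(|N|+|M|)$ of the chosen alignment.
\begin{verbatim}
import numpy as np

def sgd_step(w, phi, y, lr=0.05, lam=1e-5):
    # gradient of (y - w.phi)^2 + (lam/2)*||w||^2 with the alignment fixed
    residual = y - np.dot(w, phi)
    grad = -2.0 * residual * phi + lam * w
    return w - lr * grad
\end{verbatim}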
%\begin{multline} \label{eq:learning}
%\underset{\bw}{\text{min}} \sum_{i=1}^d \Bigg(y - \max_{\bh_i\in C}\bw^T \sum_{h \in \bh_i} h\phi_h(\bx_i)\bigg)^2 \\
% + \frac{\lambda}{2} \|\bw\|^2
A variety of techniques have been applied successfully to the SICK dataset. In the 2014 SemEval competition, the best systems for the text similarity and textual entailment tasks relied on significant feature engineering and the use of external resources such as WordNet and PPDB. They also incorporated syntactic information as well as explicit features to handle negation. Recently, deep learning techniques have been applied, including tree-LSTMs \cite{tai2015improved} and Recursive Neural Tensor Networks \cite{bowman2014recursive}. While effective, these methods have certain drawbacks: (1) they have large parameter spaces that can make training inefficient (both statistically and computationally), (2) it can be difficult to gain intuition into their functioning, as is the case with many neural network models, and (3) they require either a constituent or dependency parse of the training data.
%In contrast to earlier work on textual entailment, paraphrasing, and text similarity, our work only makes use of at most two lexically driven features, without any structural NLP resources, knowledge engineering or deep learning approaches. Moreover, our model is interpretable: it is easy to examine the learned weight vector and understand the decision made by our model. It also trains in just a matter of minutes.
Our work contrasts with both the feature-engineered and the deep learning approaches. We require at most two feature templates and do not rely on any external knowledge sources or NLP tools to obtain superior results. Moreover, our model trains in a matter of minutes and is interpretable: it is easy to determine exactly how the model arrives at its decisions by examining the learned weight vector.
The model we present is based on a latent alignment approach. Variations of this technique have previously been used in machine translation \cite{brown1993mathematics}, paraphrase detection \cite{das2009paraphrase}, and textual entailment \cite{chang2010discriminative}; the latter is the closest model to the one presented here. One limitation of their model is that it is restricted to binary classification, whereas our model can also handle regression, allowing us to predict real-valued similarity scores. A second difference is that our model can be trained in an online fashion, while theirs requires a batch approach in which they must repeatedly cycle through the negative examples; this can make the optimization slow and its termination unpredictable.
%For one, our model is not limited to binary classification. Secondly our model is an online approach that linearly goes through the training data. This in contrast to lclr that linearly goes through the positive examles in the dataset but goes theough the negative examples repeatedly until alignments are no longer added to a cach. Thus in practice , this creates unpredictability as the algorithm can be stuck in this loop a long time without making much progress towards the glibal solution. Lastly, our model is easy to implement and can be optimized with stochastic gradient descent making it suitable for the evaluation of word and phrase embeddings. Latent variable models have been used before for paraphrase detection or textual entailment \cite{chang2010discriminative} \cite{das2009paraphrase}.