Counter-fitting Word Vectors to Linguistic Constraints

Nikola Mrkšić¹, Diarmuid Ó Séaghdha², Blaise Thomson², Milica Gašić¹, Lina Rojas-Barahona¹, Pei-Hao Su¹, David Vandyke¹, Tsung-Hsien Wen¹, Steve Young¹
¹ Department of Engineering, University of Cambridge, UK
² Apple Inc.
{nm480,mg436,phs26,djv27,thw28,sjy}@cam.ac.uk
{doseaghdha, blaisethom}@apple.com

Abstract

In this work, we present a novel counter-fitting method which injects antonymy and synonymy constraints into vector space representations in order to improve the vectors' capability for judging semantic similarity. Applying this method to publicly available pre-trained word vectors leads to a new state-of-the-art performance on the SimLex-999 dataset. We also show how the method can be used to tailor the word vector space for the downstream task of dialogue state tracking, resulting in robust improvements across different dialogue domains.

1 Introduction

Many popular methods that induce representations for words rely on the distributional hypothesis – the assumption that semantically similar or related words appear in similar contexts. This hypothesis supports unsupervised learning of meaningful word representations from large corpora (Curran, 2003; Ó Séaghdha and Korhonen, 2014; Mikolov et al., 2013; Pennington et al., 2014). Word vectors trained using these methods have proven useful for many downstream tasks including machine translation (Zou et al., 2013) and dependency parsing (Bansal et al., 2014).

One drawback of learning word embeddings from co-occurrence information in corpora is that it tends to coalesce the notions of semantic similarity and conceptual association (Hill et al., 2014b). Furthermore, even methods that can distinguish similarity from association (e.g., based on syntactic co-occurrences) will generally fail to tell synonyms from antonyms (Mohammad et al., 2008). For example, words such as east and west or expensive and inexpensive appear in near-identical contexts, which means that distributional models produce very similar word vectors for such words. Examples of such anomalies in GloVe vectors can be seen in Table 1, where words such as cheaper and inexpensive are deemed similar to (their antonym) expensive.

Table 1: Nearest neighbours for target words using GloVe vectors before and after counter-fitting

            east        expensive     British
  Before:   west        pricey        American
            north       cheaper       Australian
            south       costly        Britain
            southeast   overpriced    European
            northeast   inexpensive   England
  After:    eastward    costly        Brits
            eastern     pricy         London
            easterly    overpriced    BBC
            -           pricey        UK
            -           afford        Britain

A second drawback is that similarity and antonymy can be application- or domain-specific. In our case, we are interested in exploiting distributional knowledge for the dialogue state tracking task (DST). The DST component of a dialogue system is responsible for interpreting users' utterances and updating the system's belief state – a probability distribution over all possible states of the dialogue. For example, a DST for the restaurant domain needs to detect whether the user wants a cheap or expensive restaurant. Being able to generalise using distributional information while still distinguishing between semantically different yet conceptually related words (e.g. cheaper and pricey) is critical for the performance of dialogue systems. In particular, a dialogue system can be led seriously astray by false synonyms.
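The neighbour lists in Table 1 are straightforward cosine-similarity lookups. Purely as an illustration (this is not the paper's released tool), such a query could be computed as follows, assuming the vectors are held in a dict of NumPy arrays; the function name is our own:

```python
import numpy as np

def nearest_neighbours(word, word_vectors, k=5):
    """Return the k words whose vectors are most cosine-similar to `word`.

    word_vectors: dict mapping word -> 1-D numpy array
                  (e.g. GloVe vectors before or after counter-fitting)
    """
    target = word_vectors[word]
    target = target / np.linalg.norm(target)
    scores = []
    for other, vec in word_vectors.items():
        if other == word:
            continue
        # Cosine similarity between the (normalised) target and candidate vectors.
        scores.append((float(np.dot(target, vec / np.linalg.norm(vec))), other))
    scores.sort(reverse=True)
    return [w for _, w in scores[:k]]
```

Querying east, expensive and British against the original and the counter-fitted GloVe vectors corresponds to the two halves of Table 1.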
We propose a method that addresses these two drawbacks by using synonymy and antonymy relations drawn from either a general lexical resource or an application-specific ontology to fine-tune distributional word vectors. Our method, which we term counter-fitting, is a lightweight post-processing procedure in the spirit of retrofitting (Faruqui et al., 2015). The second row of Table 1 illustrates the results of counter-fitting: the nearest neighbours capture true similarity much more intuitively than the original GloVe vectors. The procedure improves word vector quality regardless of the initial word vectors provided as input.[1] By applying counter-fitting to the Paragram-SL999 word vectors provided by Wieting et al. (2015), we achieve new state-of-the-art performance on SimLex-999, a dataset designed to measure how well different models judge semantic similarity between words (Hill et al., 2014b). We also show that the counter-fitting method can inject knowledge of dialogue domain ontologies into word vector space representations to facilitate the construction of semantic dictionaries which improve DST performance across two different dialogue domains. Our tool and word vectors are available at github.com/nmrksic/counter-fitting.

[1] When we write "improve", we refer to improving the vector space for a specific purpose. We do not expect that a vector space fine-tuned for semantic similarity will give better results on semantic relatedness. As Mohammad et al. (2008) observe, antonymous concepts are related but not similar.

2 Related Work

Most work on improving word vector representations using lexical resources has focused on bringing words which are known to be semantically related closer together in the vector space. Some methods modify the prior or the regularization of the original training procedure (Yu and Dredze, 2014; Bian et al., 2014; Kiela et al., 2015). Wieting et al. (2015) use the Paraphrase Database (Ganitkevitch et al., 2013) to train word vectors which emphasise word similarity over word relatedness. These word vectors achieve the current state-of-the-art performance on the SimLex-999 dataset and are used as input for counter-fitting in our experiments.

Recently, there has been interest in lightweight post-processing procedures that use lexical knowledge to refine off-the-shelf word vectors without requiring large corpora for (re-)training as the aforementioned "heavyweight" procedures do. Faruqui et al.'s (2015) retrofitting approach uses similarity constraints from WordNet and other resources to pull similar words closer together.

The complications caused by antonymy for distributional methods are well known in the semantics community. Most prior work focuses on extracting antonym pairs from text rather than exploiting them (Lin et al., 2003; Mohammad et al., 2008; Turney, 2008; Hashimoto et al., 2012; Mohammad et al., 2013). The most common use of antonymy information is to provide features for systems that detect contradictions or logical entailment (Marcu and Echihabi, 2002; de Marneffe et al., 2008; Zanzotto et al., 2009). As far as we are aware, there is no previous work on exploiting antonymy in dialogue systems. The modelling work closest to ours is that of Liu et al. (2015), who use antonymy and WordNet hierarchy information to modify the heavyweight Word2Vec training objective; Yih et al.
(2012), who use a Siamese neural network to improve the quality of Latent Semantic Analysis vectors; Schwartz et al. (2015), who build a standard distributional model from co-occurrences based on symmetric patterns, with specified antonymy patterns counted as negative co-occurrences; and Ono et al. (2015), who use thesauri and distributional data to train word embeddings specialised for capturing antonymy.

3 Counter-fitting Word Vectors to Linguistic Constraints

Our starting point is an indexed set of word vectors V = {v_1, v_2, ..., v_N} with one vector for each word in the vocabulary. We will inject semantic relations into this vector space to produce new word vectors V' = {v'_1, v'_2, ..., v'_N}. For antonymy and synonymy we have a set of constraints A and S, respectively. The elements of each set are pairs of word indices; for example, each pair (i, j) in S is such that the i-th and j-th words in the vocabulary are synonyms. The objective function used to counter-fit the pre-trained word vectors V to the sets of linguistic constraints A and S contains three different terms:

1. Antonym Repel (AR): This term serves to push antonymous words' vectors away from each other in the transformed vector space V':

    AR(V') = \sum_{(u,w) \in A} \tau\big( \delta - d(v'_u, v'_w) \big)

where d(v_i, v_j) = 1 - \cos(v_i, v_j) is a distance derived from cosine similarity and \tau(x) = \max(0, x) imposes a margin on the cost. Intuitively, δ is the "ideal" minimum distance between antonymous words; in our experiments we set δ = 1.0 as it corresponds to vector orthogonality.

2. Synonym Attract (SA): The counter-fitting procedure should seek to bring the word vectors of known synonymous word pairs closer together:

    SA(V') = \sum_{(u,w) \in S} \tau\big( d(v'_u, v'_w) - \gamma \big)

where γ is the "ideal" maximum distance between synonymous words; we use γ = 0.

3. Vector Space Preservation (VSP): The topology of the original vector space describes relationships between words in the vocabulary captured using distributional information from very large textual corpora. The VSP term bends the transformed vector space towards the original one as much as possible in order to preserve the semantic information contained in the original vectors:

    VSP(V, V') = \sum_{i=1}^{N} \sum_{j \in N(i)} \tau\big( d(v'_i, v'_j) - d(v_i, v_j) \big)

For computational efficiency, we do not calculate distances for every pair of words in the vocabulary. Instead, we focus on the (pre-computed) neighbourhood N(i), which denotes the set of words within a certain radius ρ around the i-th word's vector in the original vector space V. Our experiments indicate that counter-fitting is relatively insensitive to the choice of ρ, with values between 0.2 and 0.4 showing little difference in quality; here we use ρ = 0.2.

The objective function for the training procedure is given by a weighted sum of the three terms:

    C(V, V') = k_1 AR(V') + k_2 SA(V') + k_3 VSP(V, V')

where k_1, k_2, k_3 ≥ 0 are hyperparameters that control the relative importance of each term. In our experiments we set them to be equal: k_1 = k_2 = k_3. To minimise the cost function for a set of starting vectors V and produce counter-fitted vectors V', we run stochastic gradient descent (SGD) for 20 epochs. An end-to-end run of counter-fitting takes less than two minutes on a laptop with four CPUs.
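The authors' implementation is available at github.com/nmrksic/counter-fitting (linked above); as an illustration only, the cost C(V, V') defined in this section can be transcribed directly into NumPy as follows. The data layout (dicts keyed by word index) and the function names are our own assumptions rather than the released tool's API; the default values δ = 1.0, γ = 0.0 and k_1 = k_2 = k_3 match the settings reported above.

```python
import numpy as np

def cosine_distance(x, y):
    # d(x, y) = 1 - cos(x, y), as defined in Section 3.
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def counter_fit_cost(V, V_new, antonyms, synonyms, neighbourhoods,
                     delta=1.0, gamma=0.0, k1=1.0, k2=1.0, k3=1.0):
    """Counter-fitting objective C(V, V') = k1*AR(V') + k2*SA(V') + k3*VSP(V, V').

    V, V_new       : dicts mapping word index -> numpy vector (original / transformed)
    antonyms       : iterable of index pairs (u, w) in A
    synonyms       : iterable of index pairs (u, w) in S
    neighbourhoods : dict mapping index i -> iterable of indices j in N(i),
                     pre-computed from the original space with radius rho
    """
    tau = lambda x: max(0.0, x)  # hinge: tau(x) = max(0, x)

    # Antonym Repel: push antonym pairs to at least distance delta apart.
    ar = sum(tau(delta - cosine_distance(V_new[u], V_new[w])) for u, w in antonyms)

    # Synonym Attract: pull synonym pairs to within distance gamma of each other.
    sa = sum(tau(cosine_distance(V_new[u], V_new[w]) - gamma) for u, w in synonyms)

    # Vector Space Preservation: penalise neighbourhood distances that grow
    # beyond what they were in the original space.
    vsp = sum(tau(cosine_distance(V_new[i], V_new[j]) - cosine_distance(V[i], V[j]))
              for i, js in neighbourhoods.items() for j in js)

    return k1 * ar + k2 * sa + k3 * vsp
```

Minimising this cost over the transformed vectors V' with 20 epochs of SGD, as described above, is what produces the counter-fitted space.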
3.1 Injecting Dialogue Domain Ontologies into Vector Space Representations

Dialogue state tracking (DST) models capture users' goals given their utterances. Goals are represented as sets of constraints expressed by slot-value pairs such as [food: Indian] or [parking: allowed]. The set of slots S and the set of values V_s for each slot make up the ontology of a dialogue domain.

In this paper we adopt the recurrent neural network (RNN) framework for tracking suggested in (Henderson et al., 2014d; Henderson et al., 2014c; Mrkšić et al., 2015). Rather than using a spoken language understanding (SLU) decoder to convert user utterances into meaning representations, this model operates directly on the n-gram features extracted from the automated speech recognition (ASR) hypotheses. A drawback of this approach is that the RNN model can only perform exact string matching to detect the slot names and values mentioned by the user. It cannot recognise synonymous words such as pricey and expensive, or even subtle morphological variations such as moderate and moderately. A simple way to mitigate this problem is to use semantic dictionaries: lists of rephrasings for the values in the ontology. Manual construction of dictionaries is highly labour-intensive; however, if one could automatically detect high-quality rephrasings, then this capability would come at no extra cost to the system designer.

To obtain a set of word vectors which can be used for creating a semantic dictionary, we need to inject the domain ontology into the vector space. This can be achieved by introducing antonymy constraints between all the possible values of each slot (i.e. Chinese and Indian, expensive and cheap, etc.). The remaining linguistic constraints can come from semantic lexicons: the richer the sets of injected synonyms and antonyms are, the better the resulting word representations will become.

Table 2: Performance on SimLex-999. Retrofitting uses the code and (PPDB) data provided by the authors.

  Model / Word Vectors                                  Spearman's ρ
  Neural MT Model (Hill et al., 2014a)                  0.52
  Symmetric Patterns (Schwartz et al., 2015)            0.56
  Non-distributional Vectors (Faruqui and Dyer, 2015)   0.58
  GloVe vectors (Pennington et al., 2014)               0.41
  GloVe vectors + Retrofitting                          0.53
  GloVe + Counter-fitting                               0.58
  Paragram-SL999 (Wieting et al., 2015)                 0.69
  Paragram-SL999 + Retrofitting                         0.68
  Paragram-SL999 + Counter-fitting                      0.74
  Inter-annotator agreement                             0.67
  Annotator/gold standard agreement                     0.78

4 Experiments

4.1 Word Vectors and Semantic Lexicons

Two different collections of pre-trained word vectors were used as input to the counter-fitting procedure:

1. GloVe Common Crawl 300-dimensional vectors made available by Pennington et al. (2014).
2. Paragram-SL999 300-dimensional vectors made available by Wieting et al. (2015).

The synonymy and antonymy constraints were obtained from two semantic lexicons:

1. PPDB 2.0 (Pavlick et al., 2015): the latest release of the Paraphrase Database. A new feature of this version is that it assigns relation types to its word pairs. We identify the Equivalence relation with synonymy and Exclusion with antonymy. We used the largest available (XXXL) version of the database and only considered single-token terms.
2. WordNet (Miller, 1995): a well-known semantic lexicon which contains vast amounts of high-quality human-annotated synonym and antonym pairs. Any two words in our vocabulary which had antonymous word senses were considered antonyms; WordNet synonyms were not used.
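As a sketch of how the two kinds of constraints described above might be assembled, pairwise antonymy between the values of each ontology slot (Section 3.1) and antonym pairs from antonymous WordNet senses (this section), the following uses NLTK's WordNet interface. The helper names and the ontology format are illustrative assumptions on our part, not the paper's released code.

```python
from itertools import combinations
from nltk.corpus import wordnet as wn  # requires nltk and its WordNet data to be installed

def ontology_antonyms(ontology):
    """Pairwise antonymy constraints between all values of each slot (Section 3.1).

    ontology: dict mapping slot name -> list of values,
              e.g. {"food": ["Chinese", "Indian"], "price": ["cheap", "expensive"]}
    """
    constraints = set()
    for values in ontology.values():
        # Every pair of distinct values for the same slot is treated as antonymous.
        for a, b in combinations(values, 2):
            constraints.add((a, b))
    return constraints

def wordnet_antonyms(vocabulary):
    """Antonym pairs for in-vocabulary words with antonymous WordNet senses (Section 4.1)."""
    vocab = set(vocabulary)
    pairs = set()
    for word in vocab:
        for synset in wn.synsets(word):
            for lemma in synset.lemmas():
                for antonym in lemma.antonyms():
                    other = antonym.name()
                    if other in vocab:
                        # Store each pair once, in a canonical order.
                        pairs.add(tuple(sorted((word, other))))
    return pairs
```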
In total, the lexicons yielded 12,802 antonymy and 31,828 synonymy pairs for our vocabulary, which consisted of the 76,427 most frequent words in OpenSubtitles, obtained from invokeit.wordpress.com/frequency-word-lists/.

Table 3: SimLex-999 performance when different sets of linguistic constraints are used for counter-fitting

  Semantic Resource                      GloVe   Paragram
  Baseline (no linguistic constraints)   0.41    0.69
  PPDB− (PPDB antonyms)                  0.43    0.69
  PPDB+ (PPDB synonyms)                  0.46    0.68
  WordNet− (WordNet antonyms)            0.52    0.74
  PPDB− and PPDB+                        0.50    0.69
  WordNet− and PPDB−                     0.53    0.74
  WordNet− and PPDB+                     0.58    0.74
  WordNet− and PPDB− and PPDB+           0.58    0.74

4.2 Improving Lexical Similarity Predictions

In this section, we show that counter-fitting pre-trained word vectors with linguistic constraints improves their usefulness for judging semantic similarity. We use Spearman's rank correlation coefficient with the SimLex-999 dataset, which contains word pairs ranked by a large number of annotators instructed to consider only semantic similarity.

Table 2 contains a summary of recently reported competitive scores for SimLex-999, as well as the performance of the unaltered, retrofitted and counter-fitted GloVe and Paragram-SL999 word vectors. To the best of our knowledge, the 0.685 figure reported for the latter represents the current high score. This figure is above the average inter-annotator agreement of 0.67, which has been referred to as the ceiling performance in most work up to now.

In our opinion, the average inter-annotator agreement is not the only meaningful measure of ceiling performance. We believe it also makes sense to compare: a) the model ranking's correlation with the gold standard ranking to: b) the average rank correlation that individual human annotators' rankings achieved with the gold standard ranking. The SimLex-999 authors have informed us that the average annotator agreement with the gold standard is 0.78.[2] As shown in Table 2, the reported performance of all the models and word vectors falls well below this figure.

Retrofitting pre-trained word vectors improves GloVe vectors, but not the already semantically specialised Paragram-SL999 vectors. Counter-fitting substantially improves both sets of vectors, showing that injecting antonymy relations goes a long way

[2] This figure is now reported as a potentially fairer ceiling performance on the SimLex-999 website: http://www.cl.cam.ac.uk/~fh295/simlex.html.
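For context, the scores in Tables 2 and 3 are Spearman rank correlations between the models' cosine similarities and the SimLex-999 human ratings. A minimal evaluation sketch is given below; it assumes the tab-separated SimLex-999.txt release with word1, word2 and SimLex999 columns, vectors held as a dict of NumPy arrays, and a function name of our own choosing.

```python
import csv
import numpy as np
from scipy.stats import spearmanr

def simlex_spearman(word_vectors, simlex_path="SimLex-999.txt"):
    """Spearman's rho between model cosine similarities and SimLex-999 ratings.

    word_vectors: dict mapping word -> 1-D numpy vector
    simlex_path : path to the tab-separated SimLex-999 file
    """
    model_scores, gold_scores = [], []
    with open(simlex_path) as f:
        reader = csv.DictReader(f, delimiter="\t")
        for row in reader:
            w1, w2 = row["word1"], row["word2"]
            if w1 in word_vectors and w2 in word_vectors:
                v1, v2 = word_vectors[w1], word_vectors[w2]
                # Cosine similarity as the model's similarity prediction.
                cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
                model_scores.append(cos)
                gold_scores.append(float(row["SimLex999"]))
    # Rank correlation between predicted similarities and gold ratings.
    return spearmanr(model_scores, gold_scores).correlation
```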