We are seeking a CS or Math Master's student to do his/her thesis on a funded project.
The Network Properties of Word Embeddings project is funded by the Data Science Institute.
Participating Faculty members:
Description of research work:
This work combines two strength areas of Bar-Ilan University – Network Science and Natural Language Processing – to study the network properties of word embeddings.
Word embedding, the mapping of words into numerical vector spaces, has seen tremendous success in numerous NLP tasks in recent years (Jurafsky and Martin, 2019). Multiple methods for learning word embeddings from textual corpora have also been proposed. The resulting representations typically preserve semantic, syntactic and other properties of words. Once words are represented as vectors in a Euclidean space of dimension n (typically n is in the range of 50 to 500), it is natural to consider the implied graph or network of the words. In such a graph, two words are adjacent when their vectors are close under a distance or other similarity metric. Graphs and networks are widely used in Natural Language Processing, including word graphs (e.g. Mills and Bourbakis, 2014; Nastase et al., 2015), and it has been shown that word graphs share statistical features with other complex networks (Ferrer i Cancho and Solé, 2001). However, the network properties of word embedding graphs have not been investigated until now.
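As a minimal sketch of how such a graph could be derived from word vectors, the following connects two words whenever their cosine similarity reaches a threshold. The function names, the toy 3-dimensional vectors and the threshold value are assumptions made purely for illustration; the project may equally use Euclidean distance, k-nearest neighbours or another adjacency rule.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def build_threshold_graph(embeddings, threshold):
    """Return an undirected graph as a dict mapping each word to a set of neighbours.

    Two words are adjacent when their cosine similarity is >= threshold.
    """
    words = sorted(embeddings)
    graph = {word: set() for word in words}
    for i, first in enumerate(words):
        for second in words[i + 1:]:
            if cosine_similarity(embeddings[first], embeddings[second]) >= threshold:
                graph[first].add(second)
                graph[second].add(first)
    return graph


# Toy vectors invented for illustration; real embeddings have 50 to 500 dimensions.
toy_embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.1, 0.9, 0.2],
    "truck": [0.0, 0.8, 0.3],
}

word_graph = build_threshold_graph(toy_embeddings, threshold=0.9)
for word, neighbours in word_graph.items():
    print(word, sorted(neighbours))
```

The pairwise loop is quadratic in the vocabulary size, so a real vocabulary would likely call for an approximate nearest-neighbour search instead.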
The work will involve analyzing and comparing the network properties of various word-embedding-based networks across word embedding algorithms and parameters, corpora and languages, and studying the relationship between these network properties and linguistic properties. When analyzing and comparing the word networks we will look at common graph properties and algorithms, such as various centrality measures, clustering algorithms and PageRank (Bondy and Murty, 2017). We will also investigate properties of individual nodes (words), edges (relationships between words) and components (groups of words).
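To make these measures concrete, here is a small sketch that computes degree centrality, the local clustering coefficient and PageRank (by power iteration) on an undirected word graph stored as a dict of neighbour sets. The toy graph, the function names and the damping factor of 0.85 are illustrative assumptions, not choices made by the project; in practice a graph library would likely be used.

```python
def degree_centrality(graph):
    """Fraction of the other nodes that each node is connected to."""
    n = len(graph)
    return {node: len(nbrs) / (n - 1) for node, nbrs in graph.items()}


def local_clustering(graph, node):
    """Share of pairs of a node's neighbours that are themselves adjacent."""
    nbrs = list(graph[node])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if nbrs[j] in graph[nbrs[i]]
    )
    return 2.0 * links / (k * (k - 1))


def pagerank(graph, damping=0.85, iterations=100):
    """PageRank by power iteration on an undirected graph."""
    n = len(graph)
    rank = {node: 1.0 / n for node in graph}
    for _ in range(iterations):
        # Rank held by isolated nodes is redistributed uniformly.
        dangling = sum(rank[node] for node in graph if not graph[node])
        new_rank = {}
        for node in graph:
            incoming = sum(rank[nbr] / len(graph[nbr]) for nbr in graph[node])
            new_rank[node] = (1 - damping) / n + damping * (incoming + dangling / n)
        rank = new_rank
    return rank


# Hypothetical word graph, invented for illustration only.
toy_graph = {
    "king": {"queen", "prince"},
    "queen": {"king", "prince"},
    "prince": {"king", "queen", "boy"},
    "boy": {"prince"},
}

centrality = degree_centrality(toy_graph)
clustering = {word: local_clustering(toy_graph, word) for word in toy_graph}
ranks = pagerank(toy_graph)
print(centrality)
print(clustering)
print(ranks)
```

In this toy graph "prince" has the highest degree centrality and PageRank but a low clustering coefficient, since it bridges "boy" to the tightly connected pair "king" and "queen".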
There are multiple word embedding techniques commonly used for NLP applications – e.g. Word2Vec or GloVe. The techniques vary in the underlying algorithm, the objective function and the context used for words in the text. Other parameters such as window size or embedding size also have an effect on the resulting word vectors. We will compare the networks resulting from the different algorithms and parameters.
Different textual datasets (corpora) result in different embedding vectors, as words have different usage patterns in, say, a general and fairly formal corpus such as Wikipedia than in, say, Twitter tweets. Naturally, we will compare the network properties across languages as well. In all cases (algorithm, corpus and language), we will search for universal network properties (e.g. whether certain centrality or other measures are indifferent to language or algorithm). For measures or properties that do differ across embedding graphs, we will investigate whether the differences can be attributed or tied to linguistic properties (e.g. does a certain measure correlate with the morphological richness of a language, or do languages from the same linguistic family have similar network properties).
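One possible way to check whether a measure is indifferent to the algorithm or corpus is to rank the shared vocabulary by that measure in two embedding graphs and compute the Spearman rank correlation. The sketch below is only an illustration of that idea: the function names are hypothetical, and the degree values are invented placeholders rather than results.

```python
import math


def average_ranks(values):
    """1-based ranks in ascending order, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither sequence is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Placeholder node degrees for the same words in two hypothetical embedding graphs.
degree_in_graph_a = {"the": 5, "of": 4, "cat": 2, "dog": 2, "run": 1}
degree_in_graph_b = {"the": 9, "of": 7, "cat": 3, "dog": 3, "run": 2}

shared_words = sorted(set(degree_in_graph_a) & set(degree_in_graph_b))
rho = spearman(
    [degree_in_graph_a[w] for w in shared_words],
    [degree_in_graph_b[w] for w in shared_words],
)
print(round(rho, 6))
```

A correlation close to 1 would suggest the measure orders words similarly in both graphs. The same function could relate a per-language network measure to a linguistic score such as morphological richness.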
This interdisciplinary work brings together experts from two different data science fields (NLP and Network Science) who will co-advise an MSc student. We expect to publish the results of this project in top venues of both fields of study.
Speech and Language Processing (3rd ed.). Dan Jurafsky and James H. Martin. 2019. Chapter 6: Vector Semantics and Embeddings.
Graph-Based Methods for Natural Language Processing and Understanding: A Survey and Analysis. Michael T. Mills and Nikolaos G. Bourbakis. IEEE Transactions on Systems, Man, and Cybernetics: Systems. 2014.
A survey of graphs in natural language processing. Vivi Nastase, Rada Mihalcea and Dragomir Radev. Natural Language Engineering, 21(5):665-698. 2015.
Graph Theory. Adrian Bondy and U.S.R. Murty. Springer, 2017.
The small world of human language. Ramon Ferrer i Cancho and Ricard V. Solé. Proceedings of the Royal Society B: Biological Sciences, 268(1482):2261-2265. 2001.