Sentiment Analysis for Lebanese Arabizi Customers’ Reviews – Centre des Sciences du Langage et de la Communication

Master 2 Thesis Abstract

Researcher	Marwan Mohamad Al Omari
Year	2019
Title	Sentiment Analysis for Lebanese Arabizi Customers’ Reviews
Thesis type	Master 2 thesis
University	Lebanese University
Supervisor	Moustafa Al-Hajj
Number of pages	94
Specialization	Natural Language Processing

Abstract

Because users generate huge amounts of data on the web, it is crucial to build a sentiment analysis tool for the Arabizi language system that recognizes the feelings associated with these data. We propose two approaches: one based on supervised machine learning (ML) using logistic regression (LR), and the other lexicon-based, using the Science of Language and Communication Semantic Analysis System (SLCSAS) tool. Both approaches were developed and tested on a corpus of 2635 reviews of private and public services, including hotels, restaurants, shops, governmental institutions (municipalities, universities, offices, etc.) and other categories, collected from the Facebook, Google and Zomato platforms. The private service sector accounts for 2501 of these reviews and is therefore heavily overrepresented relative to the public sector, which accounts for the remaining 134.
First, the text reviews were preprocessed and filtered by 1) removing users’ information, 2) transforming texts to lower case, 3) splitting the data into 80% training and 20% testing sets, 4) removing reviews of the neutral class, and 5) encoding reviews as 0 (negative) or 1 (positive). Then, features were extracted using bag-of-words (BoW) and TF*IDF representations enhanced with a word-level n-gram dictionary, mainly unigrams, bigrams, and trigrams. For SLCSAS, a dictionary, grammar rules, and a semantic map were constructed for further implementation. The ML models, on the other hand, were built in two phases. The first phase constructs two LR models with the default settings of the scikit-learn library. The second phase uses a pipeline to facilitate the hyperparameter tuning of two further LR classifiers. Finally, the results of the five classifiers are evaluated in terms of precision, recall, F1-score, confusion matrix, and receiver operating characteristic (ROC) curve.
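The two ML phases described above can be sketched with scikit-learn roughly as follows. This is a minimal illustration under stated assumptions, not the thesis code: the toy Arabizi reviews are invented for the example, the grid values (`ngram_range`, `C`) and `cv=2` are arbitrary choices, the neutral reviews are dropped before the split for simplicity, and only the TF*IDF pipeline is tuned here.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Invented toy reviews (NOT taken from the thesis corpus).
raw = [
    ("El akel ktir tayyeb", "positive"), ("Khedme mni7a w sari3a", "positive"),
    ("Ma7al ktir 7elo", "positive"), ("Staff ktir mni7 w akel tayyeb", "positive"),
    ("Otel 7elo w ndif", "positive"), ("Ktir 7abbet el ma7al", "positive"),
    ("Akel tayyeb w as3ar mni7a", "positive"), ("Khedme ktir 7elwe", "positive"),
    ("Ma7al ndif w mrattab", "positive"), ("Service mni7 ktir", "positive"),
    ("El akel mish tayyeb", "negative"), ("Khedme ktir bati2a", "negative"),
    ("Ma7al wesekh", "negative"), ("As3ar ktir ghalye", "negative"),
    ("Staff mish mni7", "negative"), ("Akel bishe3 w ghale", "negative"),
    ("Khedme say2a ktir", "negative"), ("Otel wesekh w ghale", "negative"),
    ("Ma 7abbet el akel", "negative"), ("Ma7al bishe3 w khedme bati2a", "negative"),
    ("Jarrabet el ma7al mbere7", "neutral"), ("Rou7et ma3 as7abe", "neutral"),
]

# Preprocessing: lowercase, drop the neutral class, encode negative=0 / positive=1.
data = [(text.lower(), 1 if label == "positive" else 0)
        for text, label in raw if label != "neutral"]
texts = [text for text, _ in data]
labels = [label for _, label in data]

# 80% training / 20% testing split, stratified so both classes appear in each set.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=0)

# Phase 1: two LR models with scikit-learn defaults, one per feature type.
baselines = {
    "bow": Pipeline([("vec", CountVectorizer()), ("clf", LogisticRegression())]),
    "tfidf": Pipeline([("vec", TfidfVectorizer()), ("clf", LogisticRegression())]),
}
for name, model in baselines.items():
    model.fit(X_train, y_train)
    print(name, "test accuracy:", model.score(X_test, y_test))

# Phase 2: a pipeline makes vectorizer and classifier settings tunable together.
# The grid below is an assumption chosen for illustration.
param_grid = {
    "vec__ngram_range": [(1, 1), (1, 2), (1, 3)],  # up to uni-, bi-, trigrams
    "clf__C": [0.1, 1, 10],
}
search = GridSearchCV(
    Pipeline([("vec", TfidfVectorizer()), ("clf", LogisticRegression())]),
    param_grid, cv=2, scoring="f1")
search.fit(X_train, y_train)

# Evaluation: precision/recall/F1, confusion matrix, and area under the ROC curve.
y_pred = search.predict(X_test)
print(search.best_params_)
print(classification_report(y_test, y_pred, zero_division=0))
print(confusion_matrix(y_test, y_pred))
auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
print("ROC AUC:", auc)
```

Swapping `TfidfVectorizer` for `CountVectorizer` in the tuned pipeline would give the BoW counterpart, so the same grid can be compared across both feature types.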
The findings show that the two LR models trained on BoW and TF*IDF features with default settings achieved similar results, whereas after hyperparameter tuning the LR model trained on TF*IDF surpassed the one trained on BoW. The best classifier is therefore the hyperparameter-tuned TF*IDF LR model with word-level unigrams. In addition, the SLCSAS classifier achieved competitive results in comparison with the ML models, but with lower coverage of the test data. Overall, the ML models outperform the lexicon-based classifier.
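To illustrate what "lower coverage" can mean for a lexicon-based classifier, here is a small sketch in plain Python. It is not SLCSAS: the mini-lexicon, the negator list, and the scoring rule are hypothetical stand-ins for the thesis's dictionary, grammar rules, and semantic map. The point is only that such a classifier abstains when no known word matches, so its accuracy is measured on the covered reviews alone.

```python
from typing import List, Optional, Tuple

# Hypothetical polarity lexicon and negators (invented for this example).
LEXICON = {"tayyeb": 1, "7elo": 1, "mni7": 1, "bishe3": -1, "ghale": -1, "wesekh": -1}
NEGATORS = {"mish", "ma"}


def classify(review: str) -> Optional[int]:
    """Return 1 (positive), 0 (negative), or None when the lexicon gives no signal."""
    score = 0
    negate = False
    for token in review.lower().split():
        if token in NEGATORS:
            negate = True  # flips the polarity of the next lexicon word
            continue
        polarity = LEXICON.get(token)
        if polarity is not None:
            score += -polarity if negate else polarity
            negate = False
    if score == 0:
        return None  # abstain: the review is not covered
    return 1 if score > 0 else 0


def evaluate(reviews: List[str], labels: List[int]) -> Tuple[float, float]:
    """Return (coverage, accuracy on covered reviews)."""
    predictions = [classify(review) for review in reviews]
    covered = [(p, y) for p, y in zip(predictions, labels) if p is not None]
    coverage = len(covered) / len(reviews)
    accuracy = sum(p == y for p, y in covered) / len(covered) if covered else 0.0
    return coverage, accuracy


reviews = ["el akel tayyeb", "mish mni7 abadan", "el ma7al wesekh", "service ok"]
labels = [1, 0, 0, 1]
coverage, accuracy = evaluate(reviews, labels)
print(coverage, accuracy)  # 0.75 1.0
```

In this toy run the last review contains no lexicon word, so the classifier abstains on it. An ML model, by contrast, always outputs a label for every review.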