Master 2 Thesis Abstract
Researcher | Marwan Mohamad Al Omari |
Year | 2019 |
Titre | Sentiment Analysis for Lebanese Arabizi Customers’ Reviews |
Thesis type | Master 2 thesis |
University | Lebanese University |
Supervisor | Moustafa Al-Hajj |
Number of pages | 94 pages |
Specialization | Natural Language Processing |
Abstract
Because of the huge amount of data that users generate on the web, it is crucial to build a sentiment analysis tool for the Arabizi language system that recognizes the sentiment associated with these data. We propose two approaches: one based on supervised machine learning (ML) using logistic regression (LR), and the other lexicon-based, using the Science of Language and Communication Semantic Analysis System (SLCSAS) tool. Both approaches were developed and tested on a corpus of 2635 reviews of public and private services, including hotels, restaurants, shops, governmental institutions (municipalities, universities, offices, etc.), and other categories, collected from the Facebook, Google, and Zomato platforms. Private-sector reviews number 2501 and are therefore heavily overrepresented in the sample relative to the remaining 134 public-sector reviews.
First, the review texts were preprocessed and filtered by 1) removing users' information, 2) transforming texts to lower case, 3) splitting the data into 80% training and 20% testing sets, 4) removing reviews with the neutral class, and 5) encoding the remaining reviews as 0 (negative) or 1 (positive). Features were then extracted using bag-of-words (BoW) and TF*IDF representations, enhanced with word-level n-grams, mainly unigrams, bigrams, and trigrams. For SLCSAS, a dictionary, grammar rules, and a semantic map were constructed for the implementation. The ML models, on the other hand, were built in two phases. The first phase constructed two LR models with the default settings of the scikit-learn library. The second phase used a pipeline to facilitate the hyperparameter tuning of two further LR classifiers. Finally, the five resulting classifiers were evaluated in terms of precision, recall, F1-score, confusion matrix, and receiver operating characteristic (ROC) curve.
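The second ML phase described above can be sketched roughly as follows. This is a minimal illustration, not the thesis code: the toy Arabizi-style reviews, the parameter grid, the fold count, and the random seed are all assumptions made for the example, and only the TF*IDF variant is shown (swapping in `CountVectorizer` would give the BoW variant).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Toy reviews invented for illustration; they are NOT from the thesis corpus.
positive = [
    "Ktir helo el mat3am", "Akel tayyeb w khedme mniha", "El otel ktir ndif",
    "Khedme momtaze shukran", "Mahal helo w as3ar mniha", "Akel ktir tayyeb",
    "Staff ktir latif", "Tajribe helwe ktir", "El ahwe tayybe ktir",
    "Mat3am momtaz w ndif",
]
negative = [
    "Akel 3atel ktir", "Khedme 3atle w bati2a", "El otel mish ndif",
    "As3ar ghalye w akel mish tayyeb", "Ktir beshe3 el mahal",
    "Staff mish latif abadan", "Tajribe 3atle ktir", "El ahwe mish tayybe",
    "Mat3am wesekh w ghale", "Khedme ktir bati2a",
]

# Lower-case the texts and encode classes as 1 (positive) / 0 (negative).
# The toy data contains no neutral reviews, so no filtering step is needed here.
texts = [t.lower() for t in positive + negative]
labels = [1] * len(positive) + [0] * len(negative)

# 80% training / 20% testing split, stratified so both classes appear in each part.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=0
)

# Pipeline: TF*IDF features followed by logistic regression.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Assumed search space: word-level unigrams up to trigrams, and the LR penalty strength C.
NGRAM_CHOICES = [(1, 1), (1, 2), (1, 3)]
param_grid = {
    "tfidf__ngram_range": NGRAM_CHOICES,
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="f1")
search.fit(X_train, y_train)

# Evaluate the tuned model on the held-out test set.
y_pred = search.predict(X_test)
y_score = search.predict_proba(X_test)[:, 1]  # probability of the positive class
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
auc = roc_auc_score(y_test, y_score)

print("best parameters:", search.best_params_)
print(classification_report(y_test, y_pred, zero_division=0))
print("confusion matrix:\n", cm)
print("ROC AUC:", auc)
```

Wrapping the vectorizer and the classifier in one `Pipeline` means the vocabulary is refitted inside each cross-validation fold, so the n-gram range can be tuned together with the classifier settings without leaking information from the validation folds.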
Findings show that the two LR models trained on BoW and TF*IDF features with default settings achieved similar results, while the hyperparameter-tuned LR model trained on TF*IDF surpassed the tuned one trained on BoW. The best classifier is therefore the hyperparameter-tuned TF*IDF LR with word-level unigrams. In addition, the SLCSAS classifier achieved a competitive result in comparison with the ML models, but with lower coverage on the test data. Overall, the ML models thus outperform the lexicon-based classifier.
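To illustrate why a lexicon-based classifier can be accurate yet cover less of the test data, the sketch below scores reviews against a tiny polarity dictionary and abstains when no entry matches. It is a hypothetical toy and does not reproduce SLCSAS: the lexicon entries, the single negation rule, and the function names are all assumptions made for this example.

```python
# Hypothetical mini-lexicon (word -> polarity); NOT the SLCSAS dictionary.
LEXICON = {"helo": 1, "tayyeb": 1, "mniha": 1, "3atel": -1, "beshe3": -1}
# Hypothetical negators; a negator flips only the token that directly follows it.
NEGATORS = {"mish", "ma"}


def classify(review):
    """Return 1 (positive), 0 (negative), or None when the lexicon gives no verdict."""
    score = 0
    negate = False
    for token in review.lower().split():
        if token in NEGATORS:
            negate = True
            continue
        if token in LEXICON:
            polarity = LEXICON[token]
            score += -polarity if negate else polarity
        negate = False  # negation scope ends after one token
    if score > 0:
        return 1
    if score < 0:
        return 0
    return None  # no lexicon hit, or hits cancel out: the review is not covered


def coverage_and_accuracy(reviews, labels):
    """Coverage = share of reviews with a verdict; accuracy is computed on covered ones only."""
    predictions = [classify(r) for r in reviews]
    covered = [(p, y) for p, y in zip(predictions, labels) if p is not None]
    coverage = len(covered) / len(reviews)
    accuracy = sum(p == y for p, y in covered) / len(covered) if covered else 0.0
    return coverage, accuracy


# Invented sample reviews with their gold labels (1 = positive, 0 = negative).
sample_reviews = ["Akel tayyeb", "mish helo", "khedme bati2a", "ktir beshe3"]
sample_labels = [1, 0, 0, 0]
coverage, accuracy = coverage_and_accuracy(sample_reviews, sample_labels)
print("coverage:", coverage, "accuracy on covered reviews:", accuracy)
```

In this toy sample the third review contains no lexicon word, so the classifier abstains on it. An ML model, by contrast, always outputs a class, which is one reason coverage and accuracy are worth reporting separately when comparing the two kinds of classifier.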