Extracting Semantic Topics About Development in Africa From Social Media (Journal Article)
Extracting knowledge about the prevalent issues discussed on social media in Africa using Artificial Intelligence techniques is vital for informing public governance. The goals of our study are twofold: (a) to develop machine learning-based models that identify common topics of social concern about Africa on social media, and (b) to design a classifier capable of inferring the common topic associated with a given social media post. We designed a three-step framework to achieve the former goal, namely topic identification. The first step uses text-based representation learning methods to generate text embeddings for feature representation. The second step leverages state-of-the-art Natural Language Processing models, commonly called topic modeling, to organize the representations into groups. The third step generates topics from each group, including by using large language models to generate meaningful short-sentence labels from the bag-of-tokens associated with each group. Furthermore, we use Llama2 to reduce these words to a single-word theme that describes each topic in relation to social concerns about development. To achieve the second goal, classification, we trained classifiers using ensemble voting and stacking learners to infer which of the identified common topics best characterizes a social media post. For our experimental study, we collected a text corpus called Social Media for Africa (SMA), composed of 22,036 records extracted from social media comments on Twitter (X) and YouTube. The clustering-based model BERTopic yielded 304 topics with a topic coherence (C-v) of 0.81. On merging the topics into classes, BERTopic+ created 11 common topic classes with a topic coherence (C-v) of 0.76. For theme extraction, we additionally refined the leading token words with Llama2 to generate concise single-word theme labels, resulting in 98 unique themes for BERTopic_theme, with a C-v score of 0.75 and an IRBO score of 0.50.
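The three-step topic identification framework described above can be illustrated with a minimal, standard-library-only sketch. It is not the authors' pipeline: bag-of-words counts stand in for learned text embeddings, a greedy cosine-similarity grouping stands in for BERTopic, and the most frequent tokens per group stand in for the bag-of-tokens that the study passes to a large language model for labeling. The example posts, the stopword list, the similarity threshold, and all function names are assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical stopword list, kept tiny for the example.
STOPWORDS = {"the", "a", "in", "of", "and", "to", "is", "are", "for", "on"}

def embed(text):
    """Step 1 (stand-in): represent a post as a bag-of-words count vector."""
    tokens = [t for t in text.lower().split() if t not in STOPWORDS]
    return Counter(tokens)

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def cluster(vectors, threshold=0.2):
    """Step 2 (stand-in): assign each vector to the most similar group,
    or open a new group when no similarity reaches the threshold."""
    centroids, labels = [], []
    for vec in vectors:
        sims = [cosine(vec, c) for c in centroids]
        if sims and max(sims) >= threshold:
            k = sims.index(max(sims))
            centroids[k] = centroids[k] + vec  # accumulate token counts
        else:
            k = len(centroids)
            centroids.append(Counter(vec))
        labels.append(k)
    return labels, centroids

def top_tokens(centroid, n=3):
    """Step 3 (stand-in): the n most frequent tokens of a group,
    with ties broken alphabetically."""
    ranked = sorted(centroid.items(), key=lambda kv: (-kv[1], kv[0]))
    return [tok for tok, _ in ranked[:n]]

# Made-up posts, not drawn from the SMA corpus.
posts = [
    "vaccine rollout and covid restrictions in Kenya",
    "covid restrictions hurt small business recovery",
    "solar energy startups fight climate change",
    "climate change and solar energy investment",
]

labels, centroids = cluster([embed(p) for p in posts])
for k, centroid in enumerate(centroids):
    print(k, top_tokens(centroid))
```

In the study itself, the token lists produced in the third step are further condensed by Llama2 into short-sentence labels and single-word themes; the sketch stops at the token lists.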
We then used the identified topics, based on the resulting groupings, as labels for training a topic classifier. These labels were created using Llama2 on our SMA corpus. Our comparative study of topic classifiers using stacking and voting schemes shows that, when trained on topics, the BERTopic model with ensemble voting achieves an accuracy of 0.83 and an F1 score of 0.82. When trained on topic classes, BERTopic+ with ensemble voting had the highest accuracy (0.95) and F1 score (0.95) among the methods compared on our corpus. BERTopic_theme also achieved high performance with the ensemble voting classifier, at an F1 score of 0.93 and an accuracy of 0.93. The overall performance of classifiers using ensemble stacking is slightly better than that of voting methods for short-sentence topic labeling. For Africa, policymakers should focus on the most pressing social issues: COVID-19 restrictions affecting public health and economic recovery, promoting entrepreneurial innovation in energy and environmental sustainability to combat climate change, and strategically responding to China’s rise in global politics to maintain geopolitical stability and foster international cooperation.
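To make the ensemble voting idea concrete, the following minimal sketch shows hard (majority) voting over three base learners. It is an illustration under stated assumptions, not the authors' classifiers: the base learners here are simple keyword matchers rather than trained models, and the labels "health" and "environment", the lexicons, and the function names are all hypothetical.

```python
from collections import Counter

def make_learner(lexicon):
    """Build a toy base learner from a hypothetical {label: keywords} lexicon.
    It predicts the label whose keyword set overlaps the post the most;
    ties go to the alphabetically first label."""
    def predict(post):
        tokens = set(post.lower().split())
        return max(sorted(lexicon), key=lambda lab: len(tokens & lexicon[lab]))
    return predict

# Three base learners that see the task through different keyword sets.
learners = [
    make_learner({"health": {"covid", "vaccine", "lockdown"},
                  "environment": {"climate", "solar"}}),
    make_learner({"health": {"covid", "hospital"},
                  "environment": {"energy", "solar", "emissions"}}),
    make_learner({"health": {"clinic"},
                  "environment": {"climate", "energy"}}),
]

def hard_vote(post, base_learners):
    """Hard voting: every base learner casts one vote and the most
    frequent label wins."""
    votes = Counter(learner(post) for learner in base_learners)
    return votes.most_common(1)[0][0]

post_a = "covid vaccine lockdown strains hospital budgets and energy supply"
post_b = "solar energy cuts emissions"

print(hard_vote(post_a, learners))  # two learners say health, one says environment
print(hard_vote(post_b, learners))  # all three learners say environment
```

Stacking differs in that a further model is trained on the base learners' predictions instead of counting votes. The abstract does not state which library was used; scikit-learn's `VotingClassifier` and `StackingClassifier` are one common way to train both kinds of ensemble.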
Authored by: Harriet Sibitenda, Awa Diattara, Assistan Traore, Ruofan Hu, Dongyu Zhang, Elke Rundensteiner, Cheikh Ba
Academic units: Faculty of Science
Departments: Computer Science and Information Systems