Analysis of ‘DataFest Africa’

4 min read · Oct 26, 2022

An analysis of over 10,000 tweets about one of the biggest Data events in Africa.

INTRODUCTION

Cheers to my first-ever Medium article. Now, let’s get to the main part.

The premier edition of DataFest Africa took place on October 14th and 15th, 2022, at the Main Auditorium, Unilag. The event brought together Data Analysts, Data Scientists, Data Engineers, and other individuals interested in the Data industry from around the globe, and it was one of a kind.

People took to Twitter before the event started, talking about what they were looking forward to, people they’d love to meet, swag they’d love to win, and much more.

They also tweeted about their experiences and the knowledge they gained, both during and after the event.

I knew I wanted to work on a sentiment analysis project for my portfolio, and what better topic could I pick than an event catering to Data Specialists and users?

THE PROCESS

Data Gathering
Data Assessment
Data Cleaning
Data Preprocessing
Sentiment Analysis
Data Exploration
Data Visualization

Data Gathering

On the 16th of October 2022, I set the scraper to collect 25,000 tweets, but it stopped at a little over 10,000 (I’m guessing that was the maximum available for my search query). I used a Python library, Snscrape, and collected tweet.date, tweet.content, tweet.user.username, tweet.user.location, tweet.retweetCount, tweet.likeCount, tweet.sourceLabel, and tweet.coordinates.

In my search query, I specified that the search should start on December 7, 2021. Why? That was the first time David Abu talked about DataFest Africa. I also wanted tweets relating to any of these terms: datafestafrica, #datafestafrica, #dfa22, dfa22, #datafestafrica22, datafestafrica22.
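
Here is a minimal sketch of how a query like that could be put together and passed to Snscrape. The function names (build_query, scrape_tweets), the OR-joined query format, and the dictionary keys are assumptions for illustration, not the exact code used for this project.

```python
from datetime import date

# Search terms listed in the article.
TERMS = ["datafestafrica", "#datafestafrica", "#dfa22", "dfa22",
         "#datafestafrica22", "datafestafrica22"]


def build_query(terms, since):
    """Join the terms with OR and add Twitter's since: search operator."""
    return f"({' OR '.join(terms)}) since:{since.isoformat()}"


def scrape_tweets(query, limit=25_000):
    """Collect up to `limit` tweets as dicts (needs snscrape installed)."""
    # Imported here so build_query still works without snscrape.
    import snscrape.modules.twitter as sntwitter

    rows = []
    for i, tweet in enumerate(sntwitter.TwitterSearchScraper(query).get_items()):
        if i >= limit:
            break
        rows.append({
            "date": tweet.date,
            "content": tweet.content,
            "username": tweet.user.username,
            "location": tweet.user.location,
            "retweets": tweet.retweetCount,
            "likes": tweet.likeCount,
            "source": tweet.sourceLabel,
            "coordinates": tweet.coordinates,
        })
    return rows


query = build_query(TERMS, date(2021, 12, 7))
```

The scraper simply stops when the search runs out of results, which would explain ending up with fewer tweets than the limit.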

Data Assessment

I assessed the dataframe using Excel and Pandas to note the quality and structural issues I needed to work on.

Data Cleaning

After assessing the dataframe:

I replaced Nulls in Location with ‘Unknown’.
Deleted the coordinates column.
Assigned appropriate datatypes.
Used regex to remove punctuation, emojis, emoticons, @mentions, hashtags, hyperlinks, tweet links, and other unwanted characters that would not be useful in preprocessing.

datafestafrica dataframe with cleaned text
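
The cleaning steps above could be sketched roughly like this with Python’s re module. The patterns and the helper names (clean_tweet, fill_location) are assumptions; in particular, this version also drops digits and any non-English letters, which may be stricter than what was actually done.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
# Anything that is not an ASCII letter or whitespace:
# punctuation, digits, emojis and other symbols.
NON_LETTER_RE = re.compile(r"[^A-Za-z\s]")


def clean_tweet(text):
    """Strip links, mentions, hashtags and non-letter characters."""
    text = URL_RE.sub(" ", text)       # links first, before punctuation goes
    text = MENTION_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)
    text = NON_LETTER_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()  # collapse leftover spaces


def fill_location(location):
    """Replace a missing or empty location with 'Unknown'."""
    return location if location else "Unknown"


cleaned = clean_tweet("Loved #DataFestAfrica with @user!! 😍 https://t.co/abc")
```

Links are removed first on purpose: once punctuation is stripped, a URL no longer looks like a URL and its pieces would be left behind as words.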

Data Preprocessing

The steps taken before Natural Language Processing (NLP):

Tokenization: This is breaking the raw text into small chunks called tokens, which help in understanding the context or developing the model for NLP.
Stop words removal: Stop words are commonly used words that are generally filtered out before processing natural language. These are the most common words in any language (articles, prepositions, pronouns, conjunctions, etc.) and do not add much information to the text.
Lemmatization: This entails reducing a word to its dictionary form, e.g. ‘playing’ to ‘play’.

datafestafrica dataframe after preprocessing
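
To make the three steps concrete, here is a toy sketch that uses only the standard library. The stop word set and the lemma lookup table are tiny hand-written stand-ins (assumptions for illustration); a real run would lean on an NLP library with full stop word lists and a proper lemmatizer.

```python
import re

# Tiny illustrative resources, NOT a complete stop word list or lemmatizer.
STOP_WORDS = {"a", "an", "the", "and", "or", "but", "i", "we", "it",
              "is", "are", "was", "to", "of", "in", "on", "at", "with"}
LEMMAS = {"playing": "play", "played": "play", "plays": "play",
          "talks": "talk", "talking": "talk", "events": "event"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word set."""
    return [t for t in tokens if t not in STOP_WORDS]


def lemmatize(tokens):
    """Map each token to its dictionary form when the lookup knows it."""
    return [LEMMAS.get(t, t) for t in tokens]


def preprocess(text):
    """Tokenize, remove stop words, then lemmatize."""
    return lemmatize(remove_stop_words(tokenize(text)))


tokens = preprocess("We are playing with the data at events")
```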

Sentiment Analysis

I wanted to know how Twitter users felt about the event, so I used the TextBlob library to get the polarity of each processed tweet. I set up the code to return just a ‘Positive’ or ‘Negative’ sentiment.
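
A two-label setup like that could look something like the sketch below. TextBlob’s polarity is a score from -1.0 to 1.0; where to draw the line is an assumption here (a score of zero is counted as ‘Positive’), and the function names are made up for illustration.

```python
def label_sentiment(polarity):
    """Map a polarity score in [-1.0, 1.0] to one of two labels."""
    # Assumption: a neutral score of exactly 0 is counted as 'Positive'.
    return "Positive" if polarity >= 0 else "Negative"


def tweet_sentiment(text):
    """Score one processed tweet with TextBlob and label it."""
    # Imported here so label_sentiment works without textblob installed.
    from textblob import TextBlob

    return label_sentiment(TextBlob(text).sentiment.polarity)
```

With only two labels, the choice of where neutral tweets go has a direct effect on the final percentages, so it is worth deciding deliberately.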

People did love DataFest Africa.

Data Exploration

I wanted to explore the data even more, so I decided to make a word cloud from all the processed tweets. All tweets were joined into one long string to be visualized as a word cloud. Note that words added to the stopwords list (datafestafrica, dfa, datafestafrica22, dfa22, Africa, and datafest) won’t appear in the word cloud.
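
As a rough sketch, assuming the wordcloud package is what draws the cloud, joining the tweets and adding the extra stopwords might look like this. The helper names and the sample tweets are invented for illustration.

```python
# Event-specific words from the article, added on top of the default stopwords.
EXTRA_STOPWORDS = {"datafestafrica", "dfa", "datafestafrica22",
                   "dfa22", "africa", "datafest"}


def build_corpus(tweets):
    """Join every processed tweet into one long string."""
    return " ".join(tweets)


def make_word_cloud(corpus):
    """Return a WordCloud object (needs the wordcloud package)."""
    # Imported here so build_corpus works without wordcloud installed.
    from wordcloud import WordCloud, STOPWORDS

    return WordCloud(
        stopwords=STOPWORDS | EXTRA_STOPWORDS,
        background_color="white",
    ).generate(corpus)


# Invented sample tweets, just to show the joining step.
corpus = build_corpus(["great event", "love the speakers"])
```

Calling make_word_cloud(corpus).to_file("cloud.png") would then save the image; the file name is just an example.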

Data Visualization

Prior to visualization in Power BI, an Excel macro was used to get the latitudes and longitudes of the different locations.

INSIGHTS

A good Data Analyst communicates insights.

I do not think it’s a shock that the sentiment analysis came out over 90% positive. Every team worked tirelessly before and during the event, and people knew what they wanted from the event and got it: knowledge, meeting new people, networking, getting swag, even winning money, and having a really fun time. People are looking forward to DataFest23; I know I am.
I genuinely thought most tweets would come from ‘Twitter for iPhone’, but Android won this time around.
The tweet distribution showed that this wasn’t an event limited to just Africa, and I love that, even though most tweets were concentrated in Lagos.
Most tweets were posted between the 10th and the 19th hour (10 a.m. to 7 p.m. WAT). More people would be awake during that period, especially on the days of the event, so it makes sense that a great number of tweets came in then.
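
For the hour-of-day insight, a count like that could be sketched as below. It assumes the scraped timestamps are timezone-aware and in UTC, so they are shifted to WAT (UTC+1) before counting; the sample timestamps are invented.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

WAT = timezone(timedelta(hours=1), "WAT")  # West Africa Time is UTC+1


def tweets_per_hour(timestamps):
    """Count tweets by hour of day in WAT; expects timezone-aware datetimes."""
    return Counter(ts.astimezone(WAT).hour for ts in timestamps)


# Invented sample timestamps in UTC.
sample = [
    datetime(2022, 10, 14, 9, 30, tzinfo=timezone.utc),   # 10:30 WAT
    datetime(2022, 10, 14, 9, 45, tzinfo=timezone.utc),   # 10:45 WAT
    datetime(2022, 10, 15, 18, 5, tzinfo=timezone.utc),   # 19:05 WAT
]
counts = tweets_per_hour(sample)
```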

The code used will be on my GitHub profile. xo.

Written by Oluwafisayomi
