Text summarisation and topic modeling

amandathying
Jan 3, 2025
1 min read

Using comedy scripts from the popular sitcoms FRIENDS and Seinfeld, we attempted to fine-tune a BERT model to summarise each episode into a synopsis. Additionally, using latent Dirichlet allocation (LDA), we sought to find out which topics were prevalent in these sitcoms and might help explain why they are so well loved. My main task was the topic modeling.

I first processed the transcripts by performing tokenisation, lemmatization and stopword removal. I also normalised the letter case of the text to make it easier for the models to process.
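As a rough illustration of those steps, here is a minimal sketch using only the Python standard library. The stopword list and the `LEMMAS` lookup table are tiny hypothetical stand-ins I made up for the example; a real run would use a proper stopword list and lemmatizer from an NLP library, and the post does not say which one was used.

```python
import re

# Tiny illustrative stopword list (an assumption, not the list used in the project).
STOPWORDS = {"a", "an", "the", "and", "i", "you", "it", "is", "be", "to", "of", "in"}

# Hypothetical lookup table standing in for a real lemmatizer.
LEMMAS = {"was": "be", "were": "be", "went": "go", "going": "go", "friends": "friend"}

def preprocess(line: str) -> list[str]:
    """Lowercase, tokenise, lemmatise via lookup, then drop stopwords."""
    tokens = re.findall(r"[a-z']+", line.lower())       # case folding + tokenisation
    lemmas = [LEMMAS.get(tok, tok) for tok in tokens]   # toy lemmatisation
    return [tok for tok in lemmas if tok not in STOPWORDS]

print(preprocess("The friends went to the coffee house and it was crowded."))
```

Lemmatising before removing stopwords means that inflected forms such as "was" are reduced to "be" first and then filtered out by the same list.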

I generated a word cloud to check whether any words would skew the topics. It was clear that character names and filler words appeared very frequently but provided no meaningful insight.
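A word cloud is essentially a picture of word frequencies, so the same check can be sketched with a plain frequency count. The token lists and the `custom_stopwords` set below are made-up examples, not data or word lists from the project.

```python
from collections import Counter

# Toy token lists standing in for preprocessed transcript lines (hypothetical).
docs = [
    ["ross", "rachel", "yeah", "wedding", "dress"],
    ["joey", "yeah", "okay", "sandwich"],
    ["rachel", "ross", "okay", "yeah", "apartment"],
]

# Count every token across all documents and inspect the most frequent ones.
counts = Counter(tok for doc in docs for tok in doc)
print(counts.most_common(3))

# Hypothetical custom stopword list built after looking at the counts:
# character names and filler words that dominate without adding meaning.
custom_stopwords = {"ross", "rachel", "joey", "yeah", "okay"}
filtered = [[tok for tok in doc if tok not in custom_stopwords] for doc in docs]
print(filtered)
```

Repeating the count on `filtered` and extending `custom_stopwords` is one way to run the "few rounds" of word removal.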

After a few rounds of removing selected words, the final processing looks like this:

And the final wordcloud looks like this:

After applying LDA and running the model, I generated the top 10 and top 15 words or phrases most strongly associated with each of 5 topics and displayed them in a table using matplotlib.
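To show what LDA is doing when it produces those word lists, here is a small self-contained sketch of LDA fitted by collapsed Gibbs sampling. This is not the project's code: in practice a library implementation would be used, and the post does not say which. The function names, the hyperparameters `alpha` and `beta`, and the toy documents are all assumptions for illustration, and the example uses 2 topics rather than 5 to suit the tiny corpus.

```python
import random

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; return topic-word counts and vocab."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_vocab = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]             # tokens per (doc, topic)
    topic_word = [[0] * n_vocab for _ in range(n_topics)]  # tokens per (topic, word)
    topic_total = [0] * n_topics                           # tokens per topic
    assignments = []

    # Give every token a random initial topic.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid, k = word_id[w], assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][wid] -= 1
                topic_total[k] -= 1
                # Unnormalised conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta) / (topic_total[t] + n_vocab * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wid] += 1
                topic_total[k] += 1
    return topic_word, vocab

def top_words(topic_word, vocab, n):
    """Return the n highest-count words for each topic."""
    return [
        [vocab[i] for i in sorted(range(len(vocab)), key=lambda i: -row[i])[:n]]
        for row in topic_word
    ]

# Toy preprocessed documents (hypothetical, not from the transcripts).
docs = [
    ["wedding", "dress", "ring", "wedding"],
    ["ring", "dress", "wedding", "vows"],
    ["apartment", "moving", "boxes", "apartment"],
    ["boxes", "moving", "apartment", "keys"],
]
topic_word, vocab = lda_gibbs(docs, n_topics=2)
for k, words in enumerate(top_words(topic_word, vocab, 3), start=1):
    print(f"Topic {k}: {', '.join(words)}")
```

Changing `n` in `top_words` to 10 or 15 gives the longer word lists described above. Which words land in which topic depends on the random seed, so the output should be read as an illustration rather than a stable result.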

The results were not very conclusive as there were no clearly defined topics.

I then ran the topic model on each season separately. Displayed below are the topics from season 10 alone.

The suggested clusters made more sense than those from the first analysis. Topics 1 to 4 can roughly be classified as city vs suburb, moving, wedding, and love and relationships. Topic 5 is less clear, suggesting that the model needs further refinement.

While my task did not directly involve text summarisation, the team worked together to understand and improve each other's models.
