Analyzing Activity in Liferay Forums 

Using Natural Language Processing (NLP) and Python Pandas to Understand your Community

Hello Liferay Community Members!

From 2007 to 2018, more than 30,000 members posted over 130,000 comments in the Liferay community forums. As the community discussions have recently moved to a new platform, this is a good time to analyze the exchange of information over the past 12 years.

In this article, I apply data science tools (Python pandas and Python’s Natural Language Toolkit, NLTK) to the Liferay forums to show how analyzing forum data can increase your understanding of your own online communities and help you make better operational and strategic decisions.

DATA DESCRIPTION

I started with a raw file of all of the forum data. With some data cleaning, I created a file containing the details of every forum message. 

To display the final 5 rows of the file, I used the command forums.tail(5).

Note: No names or personal details appear here. The original forum postings and member profiles are public, so this is an extra cautionary step.

Let's explore the data to learn more about the member journey.

EXPLORING THE DATA WITH FUNCTIONS AND PLOTS: ANALYZING THE MEMBER JOURNEY

Understanding the member journey over time allows us to understand how users evolve and how to best engage them during this process.

To reproduce the questions and answers of a single discussion thread, I used forums.loc[forums.threadId==103912609].head(5) to filter by the “threadId”, which produced the following results:
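For readers without access to the data set, the same filter can be tried on a toy data frame. This is only a sketch: threadId is the one column name taken from the article, and the other column names and all values are invented for illustration.

```python
import pandas as pd

# Toy stand-in for the forum data. Only `threadId` comes from the article;
# the other column names and every value are hypothetical.
forums = pd.DataFrame({
    "messageId": [1, 2, 3, 4, 5],
    "threadId": [103912609, 555, 103912609, 103912609, 555],
    "userId": [11, 12, 13, 11, 14],
    "subject": [
        "Portlet deploy error",
        "Theme question",
        "RE: Portlet deploy error",
        "RE: Portlet deploy error",
        "RE: Theme question",
    ],
})

# Boolean mask: keep only the rows that belong to one discussion thread,
# then show at most the first five of them.
thread = forums.loc[forums.threadId == 103912609].head(5)
print(thread)
```

The expression inside .loc is a boolean Series, so the same pattern works for any other column, for example filtering by a member or a category.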

It appears that experienced members are trying to assist the newbies.  Indeed, a central goal of the forums is to help members advance in their journey from novices to experts.

As Jamie Sammons, Developer Advocate at Liferay, Inc. says:

“Most contributors usually begin within the Liferay community as a consumer reading the documentation, reading the forums and Slack. For many the first step to active involvement is to receive help within the forums by asking questions specific to their environments. Most contributors who begin helping others usually start this way and then when they feel they are learning the ropes then sometimes even feel obligated to help others to pay back the support they received.”

ANALYZING MEMBER LIFESPANS

Do typical members just post once, or do they remain active over a longer period?  I calculated how long the users remained active by grouping each user’s posts together and then computing the difference in time between the first and last post. I only included members who started posting before 2015, since the newest members are likely still active.

I used the powerful “groupby” feature of pandas to create this data frame. You can see the details in GitHub; an example of my results appears below.
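The lifespan calculation described above might look roughly like the following. It is a minimal sketch rather than the code from the GitHub repository: the column names userId and createDate and the sample rows are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical data: one row per post, with assumed column names.
posts = pd.DataFrame({
    "userId": [1, 1, 1, 2, 3, 3, 4],
    "createDate": pd.to_datetime([
        "2010-01-05", "2010-06-01", "2012-01-05",
        "2013-03-10",
        "2014-12-30", "2016-12-30",
        "2016-02-01",
    ]),
})

# Group each member's posts together and take the first and last post date.
lifespans = posts.groupby("userId")["createDate"].agg(
    first_post="min", last_post="max"
)

# Lifespan = time between first and last post, in whole days.
lifespans["lifespan_days"] = (
    lifespans["last_post"] - lifespans["first_post"]
).dt.days

# Keep only members who started posting before 2015.
lifespans = lifespans[lifespans["first_post"] < pd.Timestamp("2015-01-01")]
print(lifespans)
```

In this sketch a member with a single post gets a lifespan of 0 days; the figures in the article may count that case differently.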


What was the distribution of members’ lifespans? The plot below shows that almost 800 members had a lifespan of 1 day, with a large drop-off in frequency thereafter.

 

I was also interested in understanding the veteran members. I zeroed in on the veterans’ lifespans by changing the scale on the graph above from days to years.
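Changing the scale from days to years amounts to re-bucketing the same values. A small sketch of that step, using made-up lifespans and an assumed conversion of 365.25 days per year:

```python
import pandas as pd

# Hypothetical lifespans in days, one value per member.
lifespan_days = pd.Series([0, 0, 12, 400, 500, 800, 1500, 2900])

# Convert days to completed years (fractions are truncated).
lifespan_years = (lifespan_days / 365.25).astype(int)

# Count how many members fall into each year bucket, ordered by year.
members_per_year = lifespan_years.value_counts().sort_index()
print(members_per_year)
```

The resulting Series can be passed to a plotting call such as members_per_year.plot(kind="bar") to obtain a chart by year, provided matplotlib is installed.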

This chart shows that 1,800 members had forum lifespans of 1-2 years. The bottoming out after 2 to 4 years is not surprising, as developers often have Liferay projects that last 2-4 years, after which they move on to other projects or companies.

As Jamie Sammons, Developer Advocate at Liferay, Inc. says:

“Most of the time what happens is the developer works for a consulting company and moves on to a new project or the developer may change jobs altogether.  In a few cases the developer works for a SI or partner company and they simply have a massive workload due to good business and simply cannot find time to contribute.”  

With this data in hand, community managers can initiate a special outreach campaign after a member writes a first post, sending personalized messages to new members that encourage continued engagement. Perhaps they could even award new members points in a “gamification” system to foster further interaction.

USING NATURAL LANGUAGE PROCESSING (NLP) TO UNDERSTAND THE COMMUNITY

Word clouds are visualizations of content, and they can be used to improve members’ experiences in the community.  For example, posts on the topics identified through the word cloud can appear in the members’ activity feed, making the feed more relevant and engaging.

It is helpful to understand the popular topics of the entire community as well. The chart below contains the word counts from the subject lines of every discussion in Liferay’s “Announcements” and “Development” forum categories.  
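Per-category word counts of this kind can be sketched with the standard library alone. The subject lines below are invented, and the tokenizer is a deliberately simple assumption that keeps "?" and "!" as tokens so that they show up in the counts.

```python
import re
from collections import Counter

# Hypothetical (category, subject) pairs standing in for the forum data.
subjects = [
    ("Development", "How to deploy a portlet?"),
    ("Development", "How do I override a JSP?"),
    ("Announcements", "Liferay 6.2 released!"),
    ("Announcements", "Congrats on the release!"),
]

# Tokens are words, version-like numbers such as "6.2", or "?" / "!".
TOKEN = re.compile(r"[a-z0-9]+(?:\.[0-9]+)*|[?!]")

# One Counter per category, updated with the tokens of each subject line.
counts = {}
for category, subject in subjects:
    tokens = TOKEN.findall(subject.lower())
    counts.setdefault(category, Counter()).update(tokens)

for category, counter in counts.items():
    print(category, counter.most_common(3))
```

NLTK's FreqDist offers the same counting interface, so the Counter objects here could be swapped for it without changing the rest of the sketch.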

The “?” and the word “how” that appear in the Development category make sense, as members typically ask other members technical or product-related questions (the trigrams below further corroborate this). In the Announcements category, the “!” often appears, indicative of members celebrating successes (e.g., “Congrats on the sale!”).

Development Category Trigrams:
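Trigram counts like these come from sliding a three-word window over each subject line. A minimal sketch with invented subject lines, using only the standard library (NLTK provides an equivalent nltk.trigrams helper):

```python
from collections import Counter

# Hypothetical tokenized subject lines from the Development category.
subjects = [
    ["how", "to", "create", "a", "portlet"],
    ["how", "to", "create", "a", "theme"],
    ["how", "to", "deploy", "a", "portlet"],
]

trigram_counts = Counter()
for tokens in subjects:
    # Slide a window of three tokens over each subject line separately,
    # so that a trigram never spans two different subjects.
    trigram_counts.update(zip(tokens, tokens[1:], tokens[2:]))

print(trigram_counts.most_common(3))
```

Counting per subject line rather than over one concatenated word list avoids spurious trigrams formed from the end of one subject and the start of the next.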

NLP can also help with routine, but essential, community management operations.  In “What’s Next for the Liferay Community,” Liferay’s CEO describes one of their community management challenges:   “Although [the] numbers can show how vibrant this community is, we know we can do better, especially when it comes to the number of unanswered questions.”

One NLP-based approach is to optimize the expert assignments by category. Using the code below, I explored common words in the large “Development” category that could be used to form smaller, more manageable discussion categories.


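import nltk

# `words` (the list of tokens) and `stoplist` (the stop words to skip)
# are assumed to have been built in an earlier step.
long_words = [word for word in words if
              len(word) > 2 and word not in stoplist]
fdistLong = nltk.FreqDist(long_words)  # frequency of each remaining word
fdistLong.most_common(50)              # the 50 most frequent words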
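The snippet above relies on two inputs, words and stoplist, whose construction is not shown. One way they might be prepared is sketched below; the subject lines, the regular expression, and the tiny hand-written stop word list are all assumptions for illustration (NLTK's stopwords corpus would be a more complete source of stop words).

```python
import re

# Hypothetical subject lines; the real input would be the forum text.
subjects = [
    "How to deploy a portlet with Service Builder?",
    "Custom JSP in a theme",
]

# Tiny hand-written stop word list, used here only for illustration.
stoplist = {"how", "to", "a", "in", "with", "the"}

# Lowercase the text and split it into words and version-like numbers.
token_pattern = re.compile(r"[a-z0-9]+(?:\.[0-9]+)*")
words = [
    token
    for subject in subjects
    for token in token_pattern.findall(subject.lower())
]

# Same filter as in the article: drop very short words and stop words.
long_words = [word for word in words
              if len(word) > 2 and word not in stoplist]
print(long_words)
```

Lowercasing before filtering matters here, because the stop word list is lowercase and the membership test is case-sensitive.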


It is clear that there are some good candidates here for topics that can be separated out to form new sub-groups:

1.    Portlet 
2.    JSP 
3.    Builder
4.    Theme 
5.    Database

Lexical dispersion plots are a way to visualize trending topics in the forums, allowing community managers to strategically introduce new subjects or create new discussions on existing popular issues. 

The plot below indicates that the community talked intensely about “6.1” (presumably, a version of Liferay), with an overlap of “6.1” and “6.2” as “6.2” was released.

mytext = nltk.Text(words) 
mytext.dispersion_plot(["6.1","6.2"])

Below is an analysis of some other topics in the top 50 list:

mytext.dispersion_plot(["jsp","database","struts","image"])
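Under the hood, a lexical dispersion plot marks the positions in the token stream at which each target word occurs. The idea can be sketched without any plotting, using an invented token list in posting order and exact string matching:

```python
# Hypothetical token stream in posting order.
words = ["upgrade", "6.1", "portlet", "6.1", "theme",
         "6.2", "6.1", "6.2", "6.2"]
targets = ["6.1", "6.2"]

# For each target word, collect the token offsets at which it occurs.
# Plotting these offsets on one row per word gives the dispersion plot.
offsets = {
    target: [i for i, word in enumerate(words) if word == target]
    for target in targets
}

for target, positions in offsets.items():
    print(target, positions)
```

Because the tokens are in posting order, offsets that cluster toward the end of the list correspond to topics that became active later, which is how the shift from “6.1” to “6.2” shows up in the plot.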

The above functions and visualizations are just the tip of the iceberg; there is much more that can be discovered by analyzing user behavior and the content itself. Yet even the relatively basic but thoughtful data science techniques I have demonstrated can help you strategically analyze and improve your community management.  

 

About the author:  
Adam Zawel is Vice President of Strategy with Leader Networks – a research and consulting firm.