Know your customers with RFM

7 Apr 2020|Learning

Learn how to segment your customers with only three data points: recency, frequency and monetary value.

In this blog post we’ll show you how to create an RFM model in a few easy steps. You’ll also learn how to implement the model with just a few lines of Python code. Whether you have a technical background or are a non-coder, we encourage you to follow along to get a sense of how easy developing a valuable RFM model can be. In a follow-up blog post we’ll expand on the RFM model and introduce more advanced techniques from the statistics and machine learning worlds.

Why you should do RFM segmentation

RFM segmentation helps you understand your customer base better and serves as a good starting point for your data journey and for more advanced customer models. You will be able to give more accurate answers to key questions for your business, for example:

Who are your best customers?
Which customers are on the verge of churning?
Who has the potential to be converted into a more profitable customer?
Which customers are lost/inactive?
Which customers are critical to retain?
Who are your loyal customers?
Which group of customers is most likely to respond to your current campaign?

RFM — The basics

RFM is a simple statistical method for categorising customers based on their purchasing behaviour. The behaviour is identified by using only three customer data points: the recency of purchase (R), the frequency of purchases (F) and the mean monetary value of each purchase (M). After some calculations on the RFM data we can create customer segments that are actionable and easy to understand, like the ones below:

Champions: Bought recently, buy often and spend the most
Loyal customers: Buy on a regular basis. Responsive to promotions.
Potential loyalist: Recent customers with average frequency.
Recent customers: Bought most recently, but not often.
Promising: Recent shoppers, but haven’t spent much.
Needs attention: Above average recency, frequency and monetary values. May not have bought very recently though.
About to sleep: Below average recency and frequency. Will lose them if not reactivated.
At risk: Some time since they’ve purchased. Need to bring them back!
Can’t lose them: Used to purchase frequently but haven’t returned for a long time.
Hibernating: Last purchase was long back and low number of orders. May be lost.

The above segments and labels are frequently used as a starting point, but you can come up with your own segments and labels that better fit your customer base and business model.

For each of the segments, you could design appropriate actions, for example:

Champions: Reward them. They can become evangelists and early adopters of new products.
Loyal customers: Up-sell higher value products. Engage them. Ask for reviews.
Potential loyalist: Recommend other products. Engage in loyalty programs.
Recent/new customers: Provide a good onboarding process. Start building the relationship.
Promising: Create more brand awareness. Provide free trials.
Needs attention: Reactivate them. Provide limited time offers. Recommend new products based on purchase history.
About to sleep: Reactivate them. Share valuable resources. Recommend popular products. Offer discounts.
At risk: Send personalised email or other messages to reconnect. Provide good offers and share valuable resources.
Can’t lose them: Win them back. Talk to them. Make them special offers. Make them feel valuable.
Hibernating: Recreate brand value. Offer relevant products and good offers.

Preparing our data

In this post we are going to use the Online Retail dataset, which is widely used on the internet for different kinds of analysis. It looks like this:

[Figure: Our starting table with transaction data]

After some investigation, we found that the data needed some cleaning up, so we have done some filtering to get better output. Remember: garbage in, garbage out. Here’s what we’ve done:

Selected only the United Kingdom transactions to get a less complex group. Behaviour across nations can be influenced by different campaigns, discounts, shipping fees etc. We also avoid possible difficulties with different currencies.
Filtered out all transactions where we don’t have a proper CustomerID.
Filtered out transactions where the Quantity or UnitPrice is zero or less.
Created a sum column where we’ve calculated the revenue for each order-line by multiplying the Quantity with the UnitPrice. We then grouped the transactions on InvoiceNo and created the InvoiceSum column to hold the revenue for each invoice.

We now have all we need to create the RFM model. Our transactions table now looks like this:

[Figure: Our cleaned transaction table grouped by InvoiceNo]

Creating the RFM model from the transaction data

To make the process of creating the model easier to understand, we’ll show the data listings as we go along.

Step #1 - Calculating the RFM values

In order to create the RFM score, we must get the individual data points for each customer’s recency, frequency and monetary value. They are defined as follows:

Recency: The age of the customer at the last transaction. This is slightly different from vanilla RFM where recency is calculated as the number of days since last purchase.

Frequency: The number of repeat purchases within the customer’s life span. A customer with a single purchase has a frequency of 0.

Monetary value: The mean monetary value of the customer’s transactions.

The results after this step should be an RFM summary table with a unique customer id and the data for recency, frequency and mean monetary value. It should look something like this:

[Figure: Our starting summary RFM table]

We are only going to use the frequency, recency and monetary_value columns when segmenting. The T column (the age of the customer) is only used internally to calculate the value of recency and will not be used for future calculations.
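As a rough illustration of these definitions, the summary table can be sketched with plain pandas on a few made-up transactions. The data below is invented for the example, and the sketch assumes (as the later filter on non-repeat customers suggests) that frequency counts repeat purchases. It is a simplified stand-in, not a reimplementation of the lifetimes utility used in the full example at the end of this post.

```python
import pandas as pd

# Made-up transactions; customer ids, dates and amounts are illustrative only.
tx = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 3, 3],
    "InvoiceDate": pd.to_datetime([
        "2011-01-01", "2011-01-11", "2011-01-31",
        "2011-01-05",
        "2011-01-10", "2011-01-20",
    ]),
    "InvoiceSum": [10.0, 20.0, 30.0, 50.0, 40.0, 60.0],
})

# End of the observation period: here simply the last date in the data.
period_end = tx["InvoiceDate"].max()

summary = tx.groupby("CustomerID").agg(
    first_purchase=("InvoiceDate", "min"),
    last_purchase=("InvoiceDate", "max"),
    purchases=("InvoiceDate", "count"),
    monetary_value=("InvoiceSum", "mean"),
)

# Frequency counts repeat purchases, so a one-time buyer gets 0.
summary["frequency"] = summary["purchases"] - 1
# Recency: the customer's age (in days) at the last transaction.
summary["recency"] = (summary["last_purchase"] - summary["first_purchase"]).dt.days
# T: the customer's age (in days) at the end of the observation period.
summary["T"] = (period_end - summary["first_purchase"]).dt.days

rfm_toy = summary[["frequency", "recency", "T", "monetary_value"]]
print(rfm_toy)
```

Customer 2 in this toy data bought only once and therefore ends up with a frequency and recency of 0.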

First, we’ll do a quick inspection of our data and see how they are distributed.

Frequency distribution

[Figure: Distribution of frequency]

As we can see, the frequency numbers are highly skewed because of the high number of non-repeat transactions. In fact, as many as 85.3% of the customers are non-repeat customers. In order to create more sensible segments we choose to filter out the non-repeat customers. Now the distribution looks like this:

[Figure: Frequency distribution for repeat customers]

Recency distribution

[Figure: Recency distribution for repeat customers]

The recency data also has a better distribution after filtering out the non-repeat customers. Without the filter, the plot would have been skewed heavily towards zero.

Monetary distribution

[Figure: Monetary value distribution for repeat customers]

Now, this is challenging data. As we can see, most of the customers’ mean monetary values fall between 0 and 500, but we also have a few customers that venture beyond 2000 and up to over 16000. These customers seem like outliers. One hypothesis is that they behave more like businesses, buying for resale. We decide to remove the “outliers” and set the threshold to 2000, which gives us the following distribution:

[Figure: Monetary value distribution without the “outliers”]

Please note that we should investigate the outliers further and not just throw them away.
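One way to follow that advice is to split the table instead of dropping rows, so the outliers stay available for a closer look. The sketch below uses made-up summary rows; only the threshold of 2000 and the frequency filter come from this post, and the variable names are assumptions for the example.

```python
import pandas as pd

# Made-up summary rows; the 2000 threshold is the one chosen in the text.
rfm_toy = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4],
    "frequency": [0, 3, 5, 2],
    "monetary_value": [0.0, 250.0, 16000.0, 480.0],
})

# Share of non-repeat customers (frequency of 0) before filtering.
non_repeat_share = (rfm_toy["frequency"] == 0).mean()

# Keep only the repeat customers.
repeat = rfm_toy[rfm_toy["frequency"] > 0]

# Split instead of dropping, so the outliers can be inspected later.
is_outlier = repeat["monetary_value"] >= 2000
outliers = repeat[is_outlier]
kept = repeat[~is_outlier]

print(f"non-repeat share: {non_repeat_share:.0%}")
print(kept)
print(outliers)
```

The `kept` table is what would go on to the scoring step, while `outliers` can be examined separately, for example to check whether these customers really are resellers.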

Step #2 - Getting the individual RFM scores

Getting the individual RFM scores can be done in several ways. You could use your own business expertise and heuristics to make rankings that suit your customer base. For this case, we are going to go the statistical route and rank our customers using quartiles.

The ranking of the individual RFM scores is done by dividing each of the RFM values into quartiles which creates four more or less equal buckets. We then rank each bucket from one to four; four being the best. Our summary table should now look something like this:

[Figure: Table with the individual RFM score]

A recency (R) of 1, the lowest score, represents the customers that have been inactive for a while. A frequency (F) of 4, the highest score, marks your most frequent buyers, and so on.
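The quartile ranking can be sketched in a few lines. Here pandas' `qcut` is swapped in as a compact alternative to the explicit threshold function (`RFMScore`) used in the full example at the end of this post, so the handling of tied values may differ slightly. The frequency values are made up.

```python
import pandas as pd

# Made-up frequency values for eight customers.
frequency = pd.Series([1, 2, 3, 4, 6, 8, 12, 20], name="frequency")

# Rank first so that tied values cannot produce duplicate bin edges,
# then cut the ranks into four equally sized buckets labelled 1 to 4.
f_score = pd.qcut(
    frequency.rank(method="first"), q=4, labels=[1, 2, 3, 4]
).astype(int)

print(pd.DataFrame({"frequency": frequency, "F": f_score}))
```

With eight customers, each bucket receives two of them, and the two most frequent buyers get an F score of 4.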

Step #3 - Calculate the overall RFM score

This step can be done in two ways:

Concatenation (creates segments): Here we just concatenate (not add) the individual RFM scores like strings and get labelled segments in return. Our best segment will be 444 and our worst will be 111, signifying the lowest score on all three of the RFM categories.
Addition (creates a score): Here we add the individual RFM scores like numbers and get a number in return indicating the customer score. The score will range from 3 to 12 and we can use this to create more human friendly labelled categories.
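The difference between the two approaches is easy to see on a small, made-up table of individual scores. This is only a sketch; the column names mirror the ones used in the full example at the end of this post.

```python
import pandas as pd

# Made-up individual scores for three customers.
scores = pd.DataFrame({"R": [4, 1, 2], "F": [4, 1, 4], "M": [4, 1, 3]})

# Concatenation: treat the scores as text to get a segment code.
scores["RFM_Segment"] = (
    scores["R"].astype(str) + scores["F"].astype(str) + scores["M"].astype(str)
)

# Addition: treat the scores as numbers to get a total between 3 and 12.
scores["RFM_Score"] = scores[["R", "F", "M"]].sum(axis=1)

print(scores)
```

The first customer ends up in segment 444 with a score of 12, the second in segment 111 with a score of 3.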

If we choose to do both concatenation and addition our summary table will now look like this:

[Figure: Table with both RFM segments and scores]

Step #4 - Grouping and labelling with human friendly names

Even though segments like 411 and 243 may be interpretable by a human, they are not the most human friendly labels. But as promised in the beginning of the post, it’s possible to create more usable labels both for the RFM segments and the RFM scores. For the RFM segment we are going to use the most common naming scheme, as outlined above. Our summary table will now look like this:

[Figure: Table with RFM Segments]

As you see, we now have the champions and hibernators etc. in place.

If you like the addition scheme more, we could create customer labels such as: bronze, silver, gold and platinum.

[Figure: RFM table with human friendly score labels]
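Turning the summed score into labels amounts to binning. The sketch below uses pandas' `cut` with the same thresholds as the full example at the end of this post, which also assigns a "Green" label to the lowest scores. The scores themselves are made up.

```python
import pandas as pd

# Made-up RFM scores covering the range from 3 to 12.
rfm_score = pd.Series([3, 5, 6, 8, 10, 11, 12], name="RFM_Score")

# Bin edges are right-inclusive: (2, 5] -> Green, (5, 7] -> Bronze,
# (7, 9] -> Silver, (9, 10] -> Gold, (10, 12] -> Platinum.
labels = pd.cut(
    rfm_score,
    bins=[2, 5, 7, 9, 10, 12],
    labels=["Green", "Bronze", "Silver", "Gold", "Platinum"],
)

print(pd.DataFrame({"RFM_Score": rfm_score, "Score": labels}))
```

Where to put the boundaries is a business decision; the bins here are just one possible choice.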

So how does it look?

To get a bird’s-eye view of your overall customer base, we can plot a simple bar chart showing how many customers reside in each category:

[Figure: RFM segment count]

Unfortunately, it looks like most of our customers are hibernating, so we better get going. On the bright side: We have some champions, and also a few customers in the promising and new categories. We’d better take good care of them.
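The numbers behind such a chart are simply the count of customers per segment. A minimal sketch with made-up segment labels:

```python
import pandas as pd

# Made-up segment labels for six customers.
segments = pd.Series(
    ["Hibernating", "Hibernating", "Champions", "At risk", "Hibernating", "Champions"],
    name="Segment",
)

# Number of customers per segment, largest group first.
counts = segments.value_counts()
print(counts)
```

If matplotlib is installed, `counts.plot.bar()` draws these counts as a bar chart.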

We can do the same plot for the RFM score and see how it compares.

[Figure: RFM Score count]

Naturally we see the same pattern: Few of the most valuable customers and a lot of customers who need attention and reactivation. Better get to work.

What we have learned

If the customers in the OnlineRetail dataset were ours, we could say that we have learned the following:

Over 85% of our customers are non-repeat customers, so we need to make a plan for how to improve retention.
Our data contains a lot of garbage that needs to be cleaned, so we need to look at how the data is generated and improve our data validation.
Our data contains outliers, and the outliers should be investigated and maybe labelled or removed.
After cleaning the data, our RFM model can be used for creating more precise action plans for each customer segment. This can have positive effects on marketing spend, conversion rates and customer retention.

Conclusion

As you have seen we can get actionable customer segments by using just three customer data points. The RFM model is a useful starting point if you’re just starting in your data journey and it’s both quick and easy to understand and implement.

A full example with Python code

Preparing the data

import pandas as pd
import numpy as np
# Jupyter notebook magic; remove this line when running as a plain script
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.offline as pyoff
import plotly.graph_objs as go
from lifetimes.utils import summary_data_from_transaction_data
# Read the data (on_bad_lines needs pandas 1.3+; it replaces error_bad_lines, removed in pandas 2.0)
ol = pd.read_csv('OnlineRetail.csv', on_bad_lines='skip', encoding='unicode_escape')
# Filter on United Kingdom to get a clean cohort
ol = ol[ol['Country']=='United Kingdom'].copy()
# Transform to proper datetime
ol['InvoiceDate'] = pd.to_datetime(ol['InvoiceDate'])
# Remove records with no CustomerID
ol = ol[~ol['CustomerID'].isna()]
# Remove negative/0 quantities and prices
ol = ol[ol['Quantity']>0]
ol = ol[ol['UnitPrice']>0]
# Create sum column for each order line
ol['InvoiceSum'] = ol['Quantity']*ol['UnitPrice']
# Create a new data frame with one row per invoice. CustomerID is part of the
# grouping key so that it is carried over unchanged instead of being summed.
orders = ol.groupby(['InvoiceNo', 'InvoiceDate', 'CustomerID'], as_index=False)['InvoiceSum'].sum()

Create the RFM model

# Create the RFM summary table with the lifetimes utility function
rfm = summary_data_from_transaction_data(orders, 'CustomerID', 'InvoiceDate', monetary_value_col='InvoiceSum').reset_index()
# Filter out non repeat customers
rfm = rfm[rfm['frequency']>0]
# Filter out monetary outliers
rfm = rfm[rfm['monetary_value']<2000]
# Create the quartile scores (only the three columns used for scoring)
quantiles = rfm[['recency', 'frequency', 'monetary_value']].quantile(q=[0.25,0.5,0.75])
quantiles = quantiles.to_dict()

def RFMScore(x,p,d):
    if x <= d[p][0.25]:
        return 1
    elif x <= d[p][0.50]:
        return 2
    elif x <= d[p][0.75]: 
        return 3
    else:
        return 4
rfm['R'] = rfm['recency'].apply(RFMScore, args=('recency',quantiles,))
rfm['F'] = rfm['frequency'].apply(RFMScore, args=('frequency',quantiles,))
rfm['M'] = rfm['monetary_value'].apply(RFMScore, args=('monetary_value',quantiles,))
# Concat RFM quartile values to create RFM Segments
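# Cast each score to a string and concatenate column-wise. A row-wise apply can
# upcast the integer scores to floats and yield codes like '4.04.04.0'.
rfm['RFM_Segment'] = rfm['R'].astype(str) + rfm['F'].astype(str) + rfm['M'].astype(str)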
# Calculate RFM_Score
rfm['RFM_Score'] = rfm[['R','F','M']].sum(axis=1)
# Create human friendly RFM labels
segt_map = {
    r'[1-2][1-2]': 'Hibernating',
    r'[1-2][2-3]': 'At risk',
    r'[1-2]4': 'Can\'t lose them',
    r'2[1-2]': 'About to sleep',
    r'22': 'Need attention',
    r'[2-3][3-4]': 'Loyal customers',
    r'31': 'Promising',
    r'41': 'New customers',
    r'[3-4][1-2]': 'Potential loyalists',
    r'4[3-4]': 'Champions'
}
rfm['Segment'] = rfm['R'].map(str) + rfm['F'].map(str)
rfm['Segment'] = rfm['Segment'].replace(segt_map, regex=True)
# Create some human friendly labels for the scores
rfm['Score'] = 'Green'
rfm.loc[rfm['RFM_Score']>5,'Score'] = 'Bronze' 
rfm.loc[rfm['RFM_Score']>7,'Score'] = 'Silver' 
rfm.loc[rfm['RFM_Score']>9,'Score'] = 'Gold' 
rfm.loc[rfm['RFM_Score']>10,'Score'] = 'Platinum'
# List the head of the table to view the result
rfm.head(5)

Author

Leif Arne Bakker
Business Design Lead