8 min read | Published August 26, 2021

TL;DR – Quick summary

At Tink, we assign a category to every transaction we process – and our Enrichment team has a goal to ensure that we use the best possible models for categorisation.

Accuracy by amount was an important metric for the team – but it’s very sensitive to outliers.

Eliisabet Hein, Data Scientist at Tink, goes through how they changed their approach to outlier detection – and how it improved evaluation scores.

At Tink, we assign a **category** to every transaction we process. These categories can be used by end users to set budgets and manage their finances, and feed into Tink’s and our customers’ other products.

We pass the text description attached to the transaction through a machine learning model to identify the ‘domain’ in which the purchase was made – for example, grabbing a coffee at a convenience store should be categorised as Coffee & Snacks, while your monthly electricity bill should go under Utilities.

Our goal in the Enrichment Categorisation team is to ensure that we use the best possible models for categorisation. For this, we use a variety of different metrics to evaluate and compare the quality of different models. One of these metrics is accuracy by amount. You might be familiar with the standard definition of accuracy (which is another metric we use):
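In symbols, assuming $N$ transactions with true categories $y_i$ and predicted categories $\hat{y}_i$ (this notation is introduced here purely for illustration), the standard definition can be written as:

$$
\text{accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[\hat{y}_i = y_i\right]
$$

where $\mathbb{1}[\cdot]$ is 1 when the prediction is correct and 0 otherwise.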

To compute accuracy by amount, we simply weight each transaction by its corresponding amount:
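Written out, with $a_i$ denoting the amount of transaction $i$, and $y_i$ and $\hat{y}_i$ its true and predicted categories (illustrative notation, and assuming amounts are taken as non-negative magnitudes), this becomes:

$$
\text{accuracy by amount} = \frac{\sum_{i=1}^{N} a_i \, \mathbb{1}\left[\hat{y}_i = y_i\right]}{\sum_{i=1}^{N} a_i}
$$

In other words, it is the share of the total amount that was assigned to the right category.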

This metric allows us to measure how good we are at identifying transactions with higher amounts, like your monthly rent, mortgage and other loan payments. These transactions occur less frequently than lower-amount daily purchases for things like groceries and coffee. However, getting one such high-impact transaction wrong affects the user’s budgeting more than miscategorising a single small purchase, which is why accuracy by amount is an important component of our evaluation.
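To make the difference between the two metrics concrete, here is a minimal Python sketch. The function names, the toy transactions and the amounts are all made up for illustration and are not taken from Tink's codebase; amounts are assumed to be non-negative.

```python
def accuracy(y_true, y_pred):
    """Fraction of transactions whose predicted category is correct."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def accuracy_by_amount(y_true, y_pred, amounts):
    """Fraction of the total amount that was categorised correctly."""
    correct_amount = sum(a for t, p, a in zip(y_true, y_pred, amounts) if t == p)
    return correct_amount / sum(amounts)


# Hypothetical toy data: three small purchases and one large payment.
y_true = ["Coffee & Snacks", "Coffee & Snacks", "Utilities", "Mortgage"]
y_pred = ["Coffee & Snacks", "Coffee & Snacks", "Utilities", "Utilities"]
amounts = [3.0, 4.0, 80.0, 1200.0]

plain = accuracy(y_true, y_pred)
weighted = accuracy_by_amount(y_true, y_pred, amounts)
print(f"accuracy: {plain:.2f}, accuracy by amount: {weighted:.2f}")
```

In this toy case three out of four predictions are correct, yet the single miscategorised mortgage payment carries most of the money, so accuracy by amount comes out far lower than plain accuracy.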

If you’re a data scientist, or simply very sharp-eyed, you might be asking yourself: ‘Wait a minute, isn’t this metric very sensitive to outliers?’ Yep. It is. Let’s look at an example:

Imagine that you have 100 transactions in the ‘Mortgage’ category, 99 of which represent monthly installments falling uniformly in a (hypothetical) range of 250 to 2500 EUR. However, in the same category, you have a single transaction for 10,000 EUR, which represents someone paying their remaining mortgage in a single installment.

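With payments spread evenly over that range, the 99 regular installments add up to roughly 136,000 EUR, so this one transaction accounts for about 7% of the total amount despite being only 1% of the transactions; in a smaller category, or one with lower typical amounts, its share could easily be 20% or more. If we want to measure accuracy by amount, the result will be very heavily influenced by whether we got this transaction correct or not. We might even decide whether we replace the existing model with a new one or not based solely on the prediction on this one transaction.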
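The size of that influence can be checked with a small Python sketch. It assumes, purely for illustration, that the 99 regular installments are evenly spaced between 250 and 2500 EUR and that a hypothetical model categorises all of them correctly, so the only thing that varies is the prediction for the 10,000 EUR payment.

```python
# 99 evenly spaced installments between 250 and 2500 EUR (an illustrative
# stand-in for "falling uniformly" in that range).
regular = [250 + i * (2250 / 98) for i in range(99)]
outlier = 10_000.0
total = sum(regular) + outlier

# Share of the total amount carried by the single large transaction.
share = outlier / total

# Accuracy by amount if every regular payment is categorised correctly
# and the large payment is either right or wrong.
acc_outlier_right = (sum(regular) + outlier) / total
acc_outlier_wrong = sum(regular) / total

print(f"share of total amount: {share:.1%}")
print(f"accuracy by amount, outlier right: {acc_outlier_right:.1%}")
print(f"accuracy by amount, outlier wrong: {acc_outlier_wrong:.1%}")
```

The gap between the two scores equals the outlier's share of the total amount, so one prediction out of a hundred moves the metric by several percentage points.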

This is why we need outlier detection to find these transactions where the amounts fall outside the normal range for a given category, and filter them out before we perform any evaluation. If we plot this dataset (see below), a human observer would easily be able to identify the outlier, but how can we make our code perform the same analysis automatically?

Our old approach to solving this problem was removing all amounts above the **99.5th percentile**. This means that we find the amount in our dataset that is larger than 99.5% of all amounts in the set, and remove all points larger than this. Because we cannot remove only part of a data point, we round up and always remove at least one data point. The number of outliers we remove scales with the amount of data we have (the exact formula is ⌈n × 0.005⌉, where n is the number of data points).
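A minimal Python sketch of this rule, following the ⌈n × 0.005⌉ formula above, could look like this. The function name `remove_top_percentile` and the evenly spaced test data are assumptions made for illustration, not the production implementation.

```python
import math


def remove_top_percentile(amounts, fraction=0.005):
    """Drop the ceil(n * fraction) largest amounts and return the rest, sorted."""
    n_remove = math.ceil(len(amounts) * fraction)
    ordered = sorted(amounts)
    return ordered[: len(ordered) - n_remove]


# Illustrative data: 99 evenly spaced installments plus one 10,000 EUR payment.
regular = [250 + i * (2250 / 98) for i in range(99)]
amounts = regular + [10_000.0]

kept = remove_top_percentile(amounts)
print(len(amounts) - len(kept), "point(s) removed; largest remaining:", round(max(kept), 2))
```

For any non-empty dataset the rounding up guarantees that at least one point is dropped, even when nothing in the data is actually an outlier.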

Let’s look at this method in practice. In our earlier example with 100 mortgage transactions, we would remove one data point from the higher end of amounts, which in this case is our outlier – so now we have a clean dataset to use for evaluation. Success!

The percentile p that we select encodes our assumption that outliers occur with probability (1-p). We selected the cutoff point of 99.5th percentile because we found that it usually worked well empirically with larger test sets, and was even an overestimate (we removed more data points than there were outliers, which is preferable to allowing outliers to slip through).

However, as you might have already noticed, this method has a crucial flaw when the dataset size is small.

As we saw before, if we have a smaller category such as our Mortgage example with only 100 data points, we remove exactly one data point, which means we always assume that there is **at most one outlier**.

Now, imagine if we had an additional outlier in this set with a transaction amount of 7,500 EUR. Our method would fail:
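The failure can be reproduced with a short Python sketch. As before, the function name and the evenly spaced regular payments are illustrative assumptions; only the two outlier amounts come from the example above.

```python
import math


def remove_top_percentile(amounts, fraction=0.005):
    """Drop the ceil(n * fraction) largest amounts and return the rest, sorted."""
    n_remove = math.ceil(len(amounts) * fraction)
    ordered = sorted(amounts)
    return ordered[: len(ordered) - n_remove]


# 99 regular installments plus two outliers: 7,500 and 10,000 EUR.
regular = [250 + i * (2250 / 98) for i in range(99)]
amounts = regular + [7_500.0, 10_000.0]

kept = remove_top_percentile(amounts)
removed_count = len(amounts) - len(kept)
print("removed:", removed_count, "largest remaining:", max(kept))
```

With 101 data points the rule still removes only a single value, so the 10,000 EUR payment is dropped while the 7,500 EUR payment stays in the evaluation set.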

So how can we be better at detecting outliers without depending as much on the dataset size or hard-coding any assumptions about their occurrence frequency?

After our model evaluation score dropped by more than 10% from one week to the next because we failed to remove an outlier, we set out to improve our algorithm.

After considering a range of alternatives, we turned to the **interquartile range** (IQR). This is a simple and intuitive way to detect points that are outside the ‘norm’ for a given dataset. Here’s how that works.

The interquartile range is defined as the distance between the 25th and 75th percentiles of the dataset, also known as the **first and third quartiles**. Here’s what that looks like:
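Writing $Q_1$ for the first quartile and $Q_3$ for the third quartile (notation chosen here for illustration), the definition is simply:

$$
\mathrm{IQR} = Q_3 - Q_1
$$

By construction, the middle 50% of the data points lie between $Q_1$ and $Q_3$.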

Outliers are values that are more than a specified distance outside the IQR. The cutoff in the positive direction (which is what we are interested in here, although we could also have outliers at the lower end of amounts) can be found using the following formula:
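In its commonly used form, with $Q_1$ and $Q_3$ denoting the first and third quartiles and $k$ a non-negative multiplier (the symbols are illustrative), the upper cutoff is:

$$
\text{cutoff}_{\text{upper}} = Q_3 + k \cdot \mathrm{IQR} = Q_3 + k \, (Q_3 - Q_1)
$$

Any amount above this cutoff is treated as an outlier. The corresponding lower cutoff would be $Q_1 - k \cdot \mathrm{IQR}$.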

The default in most applications is to use k=1.5 (derived from the normal distribution), but any value can be used to allow for more or less ‘slack’ – essentially, how much we allow the value to deviate from the IQR before considering it an outlier.

We use a higher value of k in production to compensate for the fact that real data is usually noisier, with outliers on the lower end of the distribution, and doesn’t always follow a neat distribution.

Let’s go back to our example dataset to see what this would mean (with k=1.5):

And for the example with two outliers:

As you can see, this method works equally well for more outliers, and the cutoff is not affected significantly by adding another outlier to the dataset.
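The two examples can be reproduced with a minimal Python sketch using k=1.5. The function names and the evenly spaced regular payments are assumptions for illustration, and quartiles are computed with the standard library's `statistics.quantiles` using linear interpolation (`method="inclusive"`); other quartile conventions would give slightly different cutoffs.

```python
import statistics


def iqr_upper_cutoff(amounts, k=1.5):
    """Upper outlier cutoff: Q3 + k * (Q3 - Q1)."""
    q1, _, q3 = statistics.quantiles(amounts, n=4, method="inclusive")
    return q3 + k * (q3 - q1)


def find_upper_outliers(amounts, k=1.5):
    """Return the sorted amounts that lie above the IQR-based cutoff."""
    cutoff = iqr_upper_cutoff(amounts, k)
    return sorted(a for a in amounts if a > cutoff)


# 99 evenly spaced installments between 250 and 2500 EUR (illustrative).
regular = [250 + i * (2250 / 98) for i in range(99)]
one_outlier = regular + [10_000.0]
two_outliers = regular + [7_500.0, 10_000.0]

outliers_one = find_upper_outliers(one_outlier)
outliers_two = find_upper_outliers(two_outliers)

print("one outlier :", round(iqr_upper_cutoff(one_outlier), 2), outliers_one)
print("two outliers:", round(iqr_upper_cutoff(two_outliers), 2), outliers_two)
```

Because the cutoff depends only on the quartiles, the extra 7,500 EUR payment barely moves it, and both large payments are flagged while every regular installment is kept.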

IQR has a **breakdown point** of 25%, which means that up to 25% of data points can be outliers before this method begins to fail (we will not cover the mathematical proof here, but it can be easily found by anyone who wants to dig into the details). This easily makes it sufficiently robust for our needs.

Compared to the percentile-based method, we are also much less likely to have false positives – which would remove amounts that are not actually outliers.

We have talked about why outlier detection is crucial to accurately evaluate the quality of our categorisation, and seen why it is important to choose a sufficiently flexible and robust approach with fewer assumptions encoded in the algorithm itself.

Since implementing the interquartile range method, we have been able to track the performance of our models with more stability, and can have more confidence in the evaluation scores.
