How Accurate is Accuracy as a Metric in Data Extraction?

August 9, 2023
Almost every organization today needs some form of data extraction. One of the key questions to answer about an extraction process is “How well is it working?” In many scenarios this gets stated as “How accurate is the system?” The business intent behind the question is on point, but treating accuracy as the technical metric presents some challenges.

Before going further, let’s look at what accuracy as a metric actually measures.

What is Accuracy?

Accuracy is one of the most common evaluation metrics for classification problems because it is simple to compute and easy to interpret.

To see how it works, take the following set of random words:

  • micro
  • soft
  • ware
  • house
  • hold
  • scope
  • bank
  • strong
  • fly
  • beetle

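For any task that involves choosing words from this list, four counts give a full picture of how well the choice worked:

  • Correctly Selected: words that were picked and should have been
  • Wrongly Selected: words that were picked but should not have been
  • Missed: words that should have been picked but were not
  • Correctly Ignored: words that were not picked and should not have been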

Consider this problem statement: pick single words that represent a living creature

Let’s assume the solution we used gave us these selections:

  • fly
  • house
  • scope

The counts are:

  • Correctly Selected: 1 (fly)
  • Wrongly Selected: 2 (house, scope)
  • Missed: 1 (beetle)
  • Correctly Ignored: 6 (micro, soft, ware, hold, bank, strong)

Using the traditional definition of accuracy, we have:

Accuracy =  (Correctly Selected + Correctly Ignored) / (Correctly Selected + Wrongly Selected + Correctly Ignored + Missed)

So here, Accuracy = (1 + 6) / 10 = 70%
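The four counts and the accuracy formula can be reproduced with a few set operations. This is only an illustrative sketch: the variable names are assumptions, while the word list, the expected answers, and the selections come from the example above.

```python
# Word list, expected answers, and selections from the example above.
words = {"micro", "soft", "ware", "house", "hold",
         "scope", "bank", "strong", "fly", "beetle"}
expected = {"fly", "beetle"}          # words that represent a living creature
selected = {"fly", "house", "scope"}  # what the solution picked

correctly_selected = selected & expected          # picked and expected
wrongly_selected = selected - expected            # picked but not expected
missed = expected - selected                      # expected but not picked
correctly_ignored = words - selected - expected   # neither picked nor expected

# Accuracy = (Correctly Selected + Correctly Ignored) / all items
accuracy = (len(correctly_selected) + len(correctly_ignored)) / len(words)
print(f"Accuracy = {accuracy:.0%}")
```

Because the four sets partition the word list, the denominator is simply the size of the list.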

Accuracy for Complex Problems

Now that we have an idea of how accuracy works, let’s increase the complexity of the problem statement and see what happens.

Problem Statement: pick ordered word combinations that represent a company name

We can see from the word list that there are 2 correct answers:

  • micro soft
  • soft bank

Let’s assume our solution approach picked the following word combinations:

  • micro soft
  • house bank
  • fly scope
  • bank soft

The counts are:

  • Correctly Selected: 1 (micro soft)
  • Wrongly Selected: 3 (house bank, fly scope, bank soft)
  • Missed: 1 (soft bank)

However, one count is still missing:

Correctly Ignored: ?

To find it, we need the total number of items available for selection. In this case an item is not a single word but an ordered word combination, and the problem statement sets no limit on the number of words that can be combined. So let’s count every possible ordered combination of distinct words from the list:

  • 10 (single word)
  • 90 (double word)
  • 720 (triple word)
  • 5040 (4 word)
  • 30240 (5 word)
  • 151200 (6 word)
  • 604800 (7 word)
  • 1814400 (8 word)
  • 3628800 (9 word)
  • 3628800 (10 word)
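The counts in this list are consistent with ordered arrangements of distinct words drawn from the 10-word list (for example, 10 × 9 = 90 two-word arrangements). A minimal sketch that reproduces them, assuming Python 3.8 or later for `math.perm`; the variable names are illustrative:

```python
from math import perm

n_words = 10  # size of the word list

# Number of ordered arrangements of k distinct words, for k = 1..10.
counts = {k: perm(n_words, k) for k in range(1, n_words + 1)}
total = sum(counts.values())

for k, count in counts.items():
    print(f"{k:>2} word(s): {count:>9,}")
print(f"Total: {total:,}")
```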

That is a total of 9,864,100 combinations, so:

Correctly Ignored: 9,864,100 – 1 – 3 – 1 = 9,864,095

Accuracy = (1 + 9,864,095) / 9,864,100 ≈ 99.99996%
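The same calculation can be written out as a short sketch. The total comes from the text above; the comparison with a solution that selects nothing at all is an added illustration, not part of the original example.

```python
total_candidates = 9_864_100  # total number of combinations, from the text

def accuracy(correctly_selected: int, wrongly_selected: int, missed: int) -> float:
    """Accuracy when everything else in the candidate space is correctly ignored."""
    correctly_ignored = total_candidates - correctly_selected - wrongly_selected - missed
    return (correctly_selected + correctly_ignored) / total_candidates

# The solution from the example: 1 correct, 3 wrong, 1 missed.
example = accuracy(correctly_selected=1, wrongly_selected=3, missed=1)

# Illustrative assumption: a solution that selects nothing misses both names.
select_nothing = accuracy(correctly_selected=0, wrongly_selected=0, missed=2)

print(f"Example solution: {example:.8%}")
print(f"Selects nothing:  {select_nothing:.8%}")
```

Both values round to essentially 100%, because the Correctly Ignored count dominates the numerator and the denominator alike.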

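Intuitively, we can see that this number does not reflect how well our solution approach worked: it found only one of the two company names and made three wrong selections, yet it scores almost 100%.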

Accuracy for Data Extraction involving fields and records

Extend this idea to a document with many different values.

Assume we need to pick names of companies and their number of employees.

This is the next level of complexity. First, the company name is a multi-word combination, which brings the complexity described above. Second, any number in the document could potentially be related to the company. All of these possible combinations would need to be counted to calculate accuracy.
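To get a rough feel for the size of this space, the sketch below pairs every candidate company name with every number in the document. It rests on assumptions that are not in the original example: the document is taken to contain the same 10 distinct words, the number of numeric values (`n_numbers`) is hypothetical, and `candidate_pairs` is an illustrative helper name.

```python
from math import perm

def candidate_pairs(n_words: int, n_numbers: int) -> int:
    """Count (company name, employee count) candidates under the stated assumptions."""
    # Every ordered arrangement of distinct words is a candidate company name.
    name_candidates = sum(perm(n_words, k) for k in range(1, n_words + 1))
    # Assumption: any number in the document may belong to any candidate name.
    return name_candidates * n_numbers

# Hypothetical document: 10 distinct words and 5 numeric values.
print(f"{candidate_pairs(10, 5):,}")
```

Each additional number multiplies the candidate space, while the number of true pairs stays small.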

You may be tempted to reduce this number with rules such as considering only the numbers that appear in the same sentence as the company name. But such rules are part of the algorithm design and cannot be used to calculate the raw numbers in the equation.

In short, force-fitting accuracy to a problem such as data extraction does not yield meaningful results. First, data extraction is not a simple classification problem. Second, the ratio of true combinations to possible combinations is highly imbalanced.
