Straight Through Processing in Document Data Extraction – Trick or Treat?

August 9, 2023

With the post-pandemic world leaning heavily towards remote work and an ever-changing workforce, automation has moved from being a luxury of technology companies to a need for every business. Nowhere is this more true than in data extraction from documents, and here lies the holy grail that every company seeks: Straight-Through Processing (STP). However, this path holds as much risk as opportunity, so here we shed some light on both sides of the coin.

Degree of Possibility – The Trick

What’s the best way to figure out whether your particular use case lends itself to straight-through processing (STP)? Without getting into too many technical concepts, let’s play out a scenario. Imagine you have hired a new person to handle the data extraction. Assume the person has reasonable comprehension skills but no knowledge of your domain. Now, can you see yourself spending 30 minutes explaining everything to this person with a few sample documents, after which the person heads into a different office, starts doing the job and never interacts with you again? This is just like straight-through processing, where, once you have taken the business logic and automated it, you don’t need to look at it again.

Do you see this as a possibility? Or does it occur to you that the person might need help every now and then to handle exceptional scenarios that you could not explain in 30 minutes? In that case, think about how frequently an intervention would be needed. The more interventions you foresee, the less opportunity there is to get to complete STP.

This is often where the confusion arises for people who work with the data regularly. They are well versed in the domain and can adapt to changes in logic by themselves within a reasonable time. We can easily mistake this for an indicator that the process itself is simple. This is why, in our imaginary STP scenario above, we assumed that the person doing the job has no prior domain knowledge.

Another thing that can happen is that we hire people who have no knowledge of the domain, but over time we keep interacting with them and refining their understanding until they are eventually able to handle it. This, too, can come across as an indicator of simplicity, which is why our scenario puts a limit of 30 minutes on the time you can spend explaining the process to the new person. (Of course, we have assumed here that the person you are explaining to has perfect memory and can remember everything after hearing it once, and that you are able to speak incredibly fast and clearly :)).

Fundamentally, this assessment is tricky because of our own familiarity with the content. That familiarity can create blind spots around rare but challenging situations that are handled easily in the manual process. A machine, however, needs to be able to identify such situations clearly in order to escalate them for manual intervention.

Degree of Opportunity – The Treat

So let’s say we have found that the 30-minute prep for our STP superman is not sufficient and that some intervention is necessary. This does not mean that we have to stop. We can now look at the degree to which we can automate, and assess how much cost and time we can save and what that might be worth to the business.

The first thing to identify is whether the various interventions needed to handle exceptional situations in the process can be boxed up separately. This means drawing broad boundaries around the conditions under which each of these challenges presents itself, and then asking how those conditions can be identified. For example, if a particular vendor sends scanned copies of documents instead of the digital copies that every other vendor sends, this scenario can be isolated as a format box. If another vendor presents the information in a slightly different manner (perhaps using different words for context or different units of measurement), this can be isolated as a context box. Similarly, if one publisher of these documents puts the information in two different documents instead of one, this would be a box of split sources.

Once we have boxed in the various exceptional scenarios, we are ready to take stock of the opportunity. First, if the number of scenario boxes is limited, that is a good place to start. The next question to ask is whether there are ways to identify which documents fall into each of these boxes. This determination may be made technologically (for example, identifying that a document is a scanned version) or manually (for example, flagging a certain vendor’s documents as a special case).
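As a minimal sketch of how such boxing might look in practice, the snippet below routes a document into one of the three example boxes or into the general category. Everything in it is an assumption made for illustration: the `Document` fields, the `CONTEXT_VENDORS` flag list and the `assign_box` function are hypothetical names, and a real process would have its own boxes and its own ways of detecting them.

```python
from dataclasses import dataclass


@dataclass
class Document:
    vendor: str
    is_scanned: bool       # assumed to come from an automated check
    source_count: int = 1  # number of files the information is spread across


# Hypothetical manual flag list: vendors known to use different wording or units.
CONTEXT_VENDORS = {"vendor_b"}


def assign_box(doc: Document) -> str:
    """Return the scenario box for a document, or "general" if none applies."""
    if doc.is_scanned:
        return "format"
    if doc.vendor in CONTEXT_VENDORS:
        return "context"
    if doc.source_count > 1:
        return "split_sources"
    return "general"
```

The order of the checks is itself a design choice: here a scanned document is treated as a format problem first, even if it would also fit another box.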

Once we figure this out, it is simply a matter of seeing whether more documents fall into the boxes or into the general category of being easy to process. If the numbers lean towards the general category, we have a higher degree of opportunity to automate with some extra technical processing or a few simple manual steps. If the numbers lean towards the boxes, we can consider this a high-diversity use case.
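To make the comparison concrete, here is a small sketch that computes the share of documents landing in any exception box. The `boxed_share` function, the label names and the sample counts are made up for illustration and are not drawn from real data.

```python
from collections import Counter


def boxed_share(box_labels):
    """Fraction of documents that fall into any exception box."""
    counts = Counter(box_labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no documents to assess")
    # Everything that is not "general" needs some special handling.
    return (total - counts["general"]) / total


# Hypothetical sample: 7 easy documents, 2 scanned, 1 with unusual wording.
labels = ["general"] * 7 + ["format"] * 2 + ["context"]
share = boxed_share(labels)
print(f"{share:.0%} of documents need special handling")
```

Where exactly to draw the line between a manageable and a high-diversity use case remains a judgment call for the business; the sketch only supplies the number to reason about.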

On the other hand, if the number of scenario boxes itself can vary, and we cannot be sure how many more will come up in the future, it is also a high-diversity use case. So does this mean that we should give up on automation? Not at all. At this point, there is an engineering decision to make. What is the current cost of operation, and what is the cost of scaling if we do nothing? Is this cost acceptable within normal business margins? If not, to what degree can we identify ways to reduce it?

To answer these questions, we essentially need to look at two things: the number of people who currently work on the process, and how long each person takes to deal with one document on average. Once you have these numbers, the question is not how to reduce the number of people but how to increase the number of documents a single person can get through in a day. The reason we ask the second question instead of the first is simple. If we ask ourselves how to get rid of the manual workforce, our focus is on reducing cost and we run the risk of overloading the people who remain. If we ask ourselves how to get more work done by empowering the people, our focus is on increasing profitability and productivity.
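The arithmetic behind those two numbers can be sketched as follows. The function names, the 8-hour working day and the team figures are assumptions chosen purely for illustration.

```python
def documents_per_hour(minutes_per_document: float) -> float:
    """Documents one person can process in an hour at a given average pace."""
    return 60.0 / minutes_per_document


def daily_capacity(people: int, minutes_per_document: float,
                   hours_per_day: float = 8.0) -> float:
    """Documents the whole team can process in one working day."""
    return people * documents_per_hour(minutes_per_document) * hours_per_day


# Hypothetical team of 5, averaging 6 minutes per document today.
before = daily_capacity(people=5, minutes_per_document=6)
# Same team, if better tooling brought the average down to 4 minutes.
after = daily_capacity(people=5, minutes_per_document=4)
print(before, after)
```

In this made-up example the team size stays the same while the daily throughput rises, which is the framing the second question encourages.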

Using the idea of Documents-Per-Hour (DPH), we immediately uncover many different opportunities. We could simply optimize the technology solution by selecting something that is quick and easy to implement, learn and start using. Even if only the primary workflow can be addressed at this point, it is still a step forward as long as the solution is flexible enough to incorporate future enhancements in extraction technology without too much overhead. And if bringing in a flexible workflow solution allows you to reduce many of the sources of error and the time lost in fixing discovered issues, that is a win in itself.

In any case, understanding the actual impact of the workflow you choose for data extraction from documents will help bring together every opportunity to keep moving forward. A step forward today is worth many miles tomorrow. We will explore this idea further in future posts and see what can help move things forward consistently and reliably.
