The media and watercooler chatter alike increasingly focus on how advances in machine learning and artificial intelligence (AI) are boosting the ability of predictive analytics to benefit businesses’ bottom lines. Some of that talk ponders the potential for smart machines to replace humans in higher-complexity jobs. No doubt, smart machines are getting smarter. But even the smartest machines still lack fundamental human characteristics that are critical to how people solve problems. One of these key capabilities is curiosity – surely a computer can’t replicate that, can it?
Well, welcome to the evolving world of neuro-dynamic programming. It’s an analytic methodology for learning and anticipating how current and future actions are likely to contribute to a long-term cumulative reward. This technique is related to advanced AI reinforcement learning methods, which take inspiration from behaviorist psychology to attribute future reward/penalty back to earlier steps in a decision sequence, whereas traditional supervised learning attributes reward only to the current decision. These advanced methods focus on experimentation and prediction. They mimic the way the brain learns complex task sequences through pleasurable or painful feedback signals that may occur later in time – essentially, how humans seek and achieve long-term positive results.
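To make that idea concrete, here is a minimal Python sketch of the kind of reinforcement-learning update involved. The customer-interaction states, the actions, the reward value and the discount factor are all invented for illustration; the point is simply that a reward arriving only at the end of a sequence gradually flows back into the value estimates of the earlier steps.

```python
import random

# Toy sequence of customer-interaction states; the episode ends when the
# customer is retained (reward 10) or lost (reward 0). All names and
# numbers are illustrative assumptions.
states = ["contact", "offer", "negotiation"]
actions = ["patient", "pushy"]

def step(state_idx, action):
    """Hypothetical dynamics: patient actions advance the relationship,
    a pushy action loses the customer immediately. The only reward
    arrives at the very end of the sequence."""
    if action == "pushy":
        return None, 0.0                      # customer lost, episode ends
    if state_idx == len(states) - 1:
        return None, 10.0                     # customer retained, delayed reward
    return state_idx + 1, 0.0                 # move to the next interaction

gamma, alpha = 0.9, 0.1                       # discount factor, learning rate
Q = {(s, a): 0.0 for s in range(len(states)) for a in actions}

for _ in range(5000):
    s = 0
    while s is not None:
        a = random.choice(actions)            # pure exploration in this toy
        s_next, r = step(s, a)
        best_next = 0.0 if s_next is None else max(Q[(s_next, b)] for b in actions)
        # Temporal-difference update: credit for the delayed reward flows
        # backwards, one step per update, to earlier state-action pairs.
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s_next

for (s, a), v in sorted(Q.items()):
    print(f"{states[s]:>12} / {a:<8} -> {v:5.2f}")
```

After enough episodes, the earliest "patient" steps carry high values even though they never produce an immediate reward, which is exactly the attribution of future reward back to earlier decisions described above.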
Clearly, analytics that can “think” well ahead and focus on the most favorable outcomes are most welcome, since many operational decisions about customers have long-term consequences. High customer lifetime value and healthy, sustainable business cash flow are both produced by a series of interactions: the business takes an action, the customer reacts, the business responds to the new state of the relationship with another action, the customer reacts … and so on. In this way, neuro-dynamic programming enables smart machines to think ahead – potentially making moves early in the decision chain that do not appear optimal in the short run but that, in view of the long-term outcome, represent better decisions.
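A back-of-the-envelope sketch shows the trade-off. The reward figures and the discount rate below are purely hypothetical; what matters is that the action with the bigger immediate payoff is not the one with the bigger cumulative payoff.

```python
# Hypothetical example: an aggressive cross-sell earns more today but
# erodes the relationship; a lighter touch earns less now and more later.
gamma = 0.95  # how much a reward one period in the future is worth today

# (immediate reward, followed by rewards in later periods)
aggressive = [50, 5, 5, 0, 0, 0]       # customer disengages after the hard sell
patient    = [10, 20, 30, 40, 40, 40]  # relationship keeps paying off

def discounted_value(rewards, gamma):
    """Sum of rewards, each discounted by how far in the future it arrives."""
    return sum(r * gamma**t for t, r in enumerate(rewards))

print("aggressive:", round(discounted_value(aggressive, gamma), 1))  # ~59
print("patient:   ", round(discounted_value(patient, gamma), 1))     # ~154
```

The aggressive move looks better on the next report (50 versus 10), yet the patient sequence is worth far more once the full chain of consequences is valued.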
Another way to think about this concept is to consider a group of dumb software agents (like individual ants). The agents interact with their environment and are rewarded or penalized around a small set of success criteria. Gradually “genes” of successful behavior emerge as the agents begin to map out the risk of various interrelated activities. Those agents with few successful genes receive a low “fitness” score and die out, whereas those with many successful genes score high and are allowed to reproduce, mutate or combine with other high-scoring agents. In this way, the overall performance of the group increases.
Because the environment is changing, these agents not only act in the optimal way based on their current best “map of the world,” but also experiment. With some probability they introduce slight variations, mutating around the optimal strategy and its associated genes; as they receive rewards and penalties, they learn from these experiments and continually adjust to a changing fitness landscape.
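Here is a compact sketch of that select-reproduce-mutate loop. The bit-string encoding of the “genes,” the fitness function, the population size and the mutation rate are simplifying assumptions, and the environment change halfway through stands in for a shifting fitness landscape.

```python
import random

# Each agent's "genes" are a bit string; fitness is simply how many bits
# match a (possibly shifting) target pattern. Purely illustrative.
GENE_LENGTH, POP_SIZE, MUTATION_RATE = 12, 30, 0.05

def fitness(genes, target):
    return sum(g == t for g, t in zip(genes, target))

def mutate(genes):
    # Occasional random variation keeps the population experimenting
    # rather than locking onto a single strategy.
    return [1 - g if random.random() < MUTATION_RATE else g for g in genes]

def crossover(a, b):
    cut = random.randrange(1, GENE_LENGTH)
    return a[:cut] + b[cut:]

target = [random.randint(0, 1) for _ in range(GENE_LENGTH)]
population = [[random.randint(0, 1) for _ in range(GENE_LENGTH)] for _ in range(POP_SIZE)]

for generation in range(100):
    if generation == 50:
        # The environment changes mid-run: the fitness landscape shifts
        # and the population must adapt all over again.
        target = [random.randint(0, 1) for _ in range(GENE_LENGTH)]
    scored = sorted(population, key=lambda g: fitness(g, target), reverse=True)
    survivors = scored[: POP_SIZE // 2]          # low-fitness agents die out
    children = [mutate(crossover(random.choice(survivors), random.choice(survivors)))
                for _ in range(POP_SIZE - len(survivors))]
    population = survivors + children            # high scorers reproduce and mutate

best = max(population, key=lambda g: fitness(g, target))
print("best fitness:", fitness(best, target), "of", GENE_LENGTH)
```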
As you can see in Figure 1, at any point in the sequence, the current state of the customer relationship is the result not only of the just-taken action, but also of the string of previous actions. Just as in a chess game, where a checkmate could be rooted 10 moves back – or even in the first move – the loss of a valuable customer may have started with actions taken months ago. To be successful, a business needs to understand this dynamic.
Figure 2 depicts how these analytics learn about long-term effects by assigning credit for successful outcomes and penalties for unsuccessful ones. Although the action immediately before the outcome may receive the larger share of the credit or penalty, reinforcement learning distributes some amount of reward or penalty across the entire sequence of actions.
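One simple way to picture that distribution is geometric decay: the most recent action gets the largest share of the credit, and each earlier action gets a progressively smaller share. The decay factor, the outcome value and the action names below are illustrative assumptions; eligibility traces in reinforcement learning work along similar lines.

```python
# Distribute credit for a single observed outcome back across the sequence
# of actions that preceded it, with the most recent action receiving the
# largest share.
def assign_credit(actions, outcome_value, decay=0.8):
    """Return {action: credit}, decaying geometrically with distance from
    the outcome so that even early actions receive some share."""
    credits = {}
    weight = 1.0
    for action in reversed(actions):
        credits[action] = outcome_value * weight
        weight *= decay
    return credits

sequence = ["welcome email", "credit-limit increase", "fee waiver", "renewal offer"]
for action, credit in assign_credit(sequence, outcome_value=1.0).items():
    print(f"{action:>22}: {credit:.2f}")
# renewal offer: 1.00, fee waiver: 0.80, credit-limit increase: 0.64, welcome email: 0.51
```

Run over many observed sequences, this kind of back-allocation is what lets the analytics connect today’s outcome, good or bad, to actions taken months earlier in the relationship.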