Two of the more popular algorithms used to teach A.I. to solve problems are called SARSA and Q-Learning. If you’d like to see their implementation details, you can find them on other sites, such as this one. In this article you’ll see what makes them different, and when this difference matters.

Imagine the following scenario. You and your significant other live in a small rural village called Smallport. This weekend, you’ve decided to take a road trip from Smallport to Littleton, about 300 km away. You’ve taken this trip once before back in the 90s, and the happiness of that lovely weekend is etched in both your memories. You’re also familiar with the roads and highways in the area, and figure you should get to Littleton in just under 3 hours.

The direct path, based on your memory.

On Friday afternoon you set out on your trip. After half an hour on the road, you both get bored, and decide it might be fun to try one of the side roads instead. There might be a gorgeous mountain view somewhere off the highway, and who knows, you might also get to your destination faster. You turn off at the next exit you see.


The detour turns out to have been a mistake. Not only is there no view to speak of, but you blow a tire on the bumpy, poorly-paved roads, and have to waste an hour replacing it. By the time you’ve finished wheeling the busted tire away, it’s late in the evening, so you speed up in order to make it to Littleton by sundown. You find your way back to the highway, and put the pedal to the metal.

Your generous speed increase does not go unnoticed by the local highway patrol. One stern warning and a pricey ticket later, you sputter into Littleton at midnight, exhausted. You and your S.O. sigh about the trip, and solemnly promise each other that from now on you’ll stick to the highways.

A month later, you’re back in Smallport. You and your brother are catching up at your place. He mentions he’s considering a day trip to Littleton that weekend.

“Would you like to come along?” he asks.

What do you answer?

Certainly, for you, the recent trip to Littleton was a miserable one, but part of that was likely due to your unorthodox route. Maybe if you’d stuck to the highways, the trip would have turned out just fine.

At this point, there are two attitudes you could adopt:

SARcastic: You could say your last trip to Littleton was no fun at all. You tell him how the view was underwhelming, the roads were full of potholes, and the traffic cops were merciless.

Quite Lovely: Recalling your earlier trip back in the 90s, you could tell your brother that the trip is a perfectly lovely one, and you’d enjoy spending the day with him.

In the first case, your recent experience driving between towns taught you a specific lesson: never to make the trip to Littleton again. As hinted at above, this attitude, the more pessimistic one, is how SARSA works. The lesson you learned was based on what you actually did, and what happened as a consequence. This type of learning is called on-policy, which means what you learn is a result of your actual “action policy”, and that includes your explorations.

In the second case, your memory of the earlier trip, the one back in the 90s, still resonates in your mind. You reason that “had I not taken that detour, the trip would have been fine”.

This latter approach, which is perhaps the more optimistic one, is how Q-Learning works. You won’t let a single bad experiment sour you on what could otherwise have been a great trip. This type of learning is, as you may have guessed, called off-policy, which means what you learn is based not on your actual actions (your policy), but on what would have been the best route if you hadn’t taken the detour.
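To make the contrast concrete, here is a minimal sketch of the two update rules in Python. The table layout (a dictionary keyed by state and action), the function names, the state names, the reward numbers, and the values of the learning rate `alpha` and discount `gamma` are all assumptions made up for this road-trip illustration, not taken from any particular implementation.

```python
from collections import defaultdict

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.5, gamma=0.9):
    """On-policy: learn from the next action you actually took (a_next)."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    """Off-policy: learn from the best-known next action, whatever you actually did."""
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

ACTIONS = ["highway", "detour"]

def fresh_table():
    # Hypothetical values: at the fork, the highway is remembered as good,
    # the detour as bad. Everything else starts at 0.
    Q = defaultdict(float)
    Q[("fork", "highway")] = 10.0
    Q[("fork", "detour")] = -10.0
    return Q

# You leave Smallport (reward 0), reach the fork, and actually take the detour.
q_sarsa = fresh_table()
sarsa_update(q_sarsa, "smallport", "drive", 0.0, "fork", "detour")

q_qlearn = fresh_table()
q_learning_update(q_qlearn, "smallport", "drive", 0.0, "fork", ACTIONS)

print(q_sarsa[("smallport", "drive")])   # -4.5: the trip looks bad
print(q_qlearn[("smallport", "drive")])  # 4.5: the trip still looks good
```

The only difference between the two functions is the target: SARSA plugs in the value of the detour you took, while Q-Learning plugs in the value of the highway you could have taken.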

Note, in both cases, your policy includes all the actions you took, including your exploration. Your “policy” is your approach to life. It’s how you decide what to do in any given situation. In the driving scenario, your policy is modified by how cautious or risk-taking you are. Had you been more cautious, you might never have taken the detour in the first place.

The biggest difference between SARSA and Q-Learning is the lesson they take away from their experiments. SARSA chooses to learn from the actual results. SARSA thinks “woulda, shoulda, coulda… I can’t know what would have happened if I had just taken the highway. All I know is what did happen.” Q-Learning, by contrast, learns from what it believes would have happened had it followed the best route it knows.

As you might have guessed, in cases where a driver never tries to take a shortcut and only follows the best route they know, on-policy and off-policy learning turn into the same thing, so SARSA and Q-Learning would also be the same.
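A quick way to convince yourself of this is to compare the two learning targets for a driver who never explores. The sketch below is only an illustration: the value table, state names, and `gamma` are hypothetical.

```python
def sarsa_target(Q, r, s_next, a_next, gamma=0.9):
    # Uses the value of the action the driver actually takes next.
    return r + gamma * Q[(s_next, a_next)]

def q_learning_target(Q, r, s_next, actions, gamma=0.9):
    # Uses the value of the best-known next action.
    return r + gamma * max(Q[(s_next, b)] for b in actions)

ACTIONS = ["highway", "detour"]
Q = {("fork", "highway"): 10.0, ("fork", "detour"): -10.0}

# A purely greedy driver always picks the best-known action at the fork.
greedy_action = max(ACTIONS, key=lambda b: Q[("fork", b)])

same = (sarsa_target(Q, 0.0, "fork", greedy_action)
        == q_learning_target(Q, 0.0, "fork", ACTIONS))
print(greedy_action, same)  # highway True
```

When the action actually taken is always the best-known one, "what I did" and "the best I could have done" are the same thing, so the two targets coincide.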

In the driving example I described, most of us would likely take the optimistic approach (Q-Learning) and, barring any post-traumatic stress, we’d agree to go on the trip with our brother. Are there any cases, then, when you would adopt the more pessimistic attitude (SARSA)?

To answer this question, let’s go back to our road trip. As mentioned above, you’ve only been to Littleton once before, and that was over 20 years ago. Traffic in the area could be highly variable. That one romanticized trip may not be representative of all trips, in the same way that your detour experience was a fluke. There could be heavy construction on the road, and your detour, though miserable, may have actually saved you time. Since you didn’t actually take the highways this time, you don’t really know what they are like at all times.

There could have been heavy traffic on the highways.

If traffic conditions in your area are highly unpredictable, you might be less confident about taking the trip with your brother. Conversely, if highway traffic is relatively consistent, you will probably get to your destination on time. The more predictable the world is, the more optimistic your decisions can be. We call such a world deterministic, which means you can predict what will happen when you take a particular action in a particular situation.

On the other hand, if you live in a world that’s always changing — if roads are closed for construction almost every summer (as they are in my city), and traffic fluctuates wildly — then you’d be more hesitant to recommend the weekend trip. We call such worlds stochastic, which means “not-deterministic”, or somewhat unpredictable*.

*Note: the world can’t be entirely unpredictable, i.e. random, otherwise you would not be able to say anything about any future trips, even if yours had turned out OK.
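If it helps to see the two kinds of world side by side, here is a toy sketch. The 3-hour figure echoes the estimate from the story; the 5-hour traffic jam and its 30% probability are invented purely for illustration.

```python
import random

def highway_hours_deterministic():
    # Same action, same situation, same outcome every time.
    return 3.0

def highway_hours_stochastic(rng, jam_probability=0.3):
    # Same action, same situation, but the outcome varies from trip to trip.
    return 5.0 if rng.random() < jam_probability else 3.0

rng = random.Random(0)  # seeded so the run is repeatable
trips = [highway_hours_stochastic(rng) for _ in range(1000)]
average = sum(trips) / len(trips)

print(highway_hours_deterministic())  # always 3.0
print(min(trips), max(trips), average)
```

In the deterministic world a single trip tells you everything about the highway. In the stochastic one, one trip (good or bad) is only a sample, and you need many of them before the average means much.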

In summary, there are three things we consider when learning from our actions:

  1. How predictable is the world? (deterministic or stochastic)
  2. How experimental are my actions?* (stick to the highway or occasionally take a detour)
  3. Should I make future decisions based on what actually happened, or based on what I think the best course of action was? (on-policy or off-policy, SARSA vs Q-Learning)

* As mentioned earlier, both SARSA and Q-Learning are open to exploration, i.e. trying to find shortcuts. The two algorithms decide what action to take based on a decision rule called ε-greedy (pronounced epsilon-greedy). ε-greedy, like most of us, is a mildly risk-taking approach. The majority of the time, ε-greedy picks the best known path from memory (in our case, the highways); but occasionally it tries something new and random.
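Here is a rough sketch of what ε-greedy looks like in Python. The function name, the value table, and the choice of ε = 0.1 are assumptions for illustration only.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1, rng=random):
    """With probability epsilon, explore at random; otherwise take the best-known action."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

ACTIONS = ["highway", "detour"]
Q = {("fork", "highway"): 10.0, ("fork", "detour"): -10.0}  # hypothetical memories

rng = random.Random(42)  # seeded so the run is repeatable
choices = [epsilon_greedy(Q, "fork", ACTIONS, epsilon=0.1, rng=rng) for _ in range(1000)]
print(choices.count("highway"), choices.count("detour"))
```

Setting `epsilon=0` gives the driver who never leaves the highway, and raising it gives a more adventurous one. Note that a random exploration can still land on the highway, so the detour is taken less often than ε might suggest.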

Originally published on January 31, 2019.

My goal is to develop and productize A.I. that combines symbolic reasoning with motivation. This A.I. will define and solve abstract problems on its own.