Story of retry

Day 1. The beginning.

 
👨 John
We need to make an HTTP service that returns a user's name by ID.
🤓 Carl
Oh, but we already have one at http://internal.com. Look:

$ curl -XPOST http://internal.com/1
✅ HTTP/1.1 200 OK
John

$ curl -XPOST http://internal.com/2
✅ HTTP/1.1 200 OK
Carl

👨 John
Yeah, but it’s an internal API. We want to make requests from the internet.
🤓 Carl
Then we need some authorization for it.
👨 John
Let’s make it dead simple for now and send the password as a request parameter.
🤓 Carl
OK, wait a minute, I’ll write some code.

import requests
from flask import Flask, Response, request

app = Flask('service')

@app.route("/")
def get():
    # Flask exposes the current request as a global proxy, not as a view argument
    if request.args['p'] != '123':
        return Response('Wrong password', status=401)
    else:
        user_id = request.args['u']
        response = requests.post('http://internal.com/' + user_id)
        response.raise_for_status()
        return Response(response.json()['name'])

🤓 Carl
Save...Deploy...Done. With a bad password:

$ curl 'http://service.com/?p=111&u=1'
❌ HTTP/1.1 401 Unauthorized
Wrong password

🤓 Carl
With a good password:

$ curl 'http://service.com/?p=123&u=1'
✅ HTTP/1.1 200 OK
John

👨 John
Cool!

Day 2. Five hundred problems and a girl ain't one.

 
👨 John
Houston, we have a problem! Our new service is not working!
🤓 Carl
What? Wait a minute, I’ll check:

$ curl 'http://service.com/?p=123&u=1'
❌ HTTP/1.1 500 Internal Server Error

🤓 Carl
Oh shit! I’ll look at the logs:

❌ Response from http://internal.com/1 is 500 Internal Server Error

🤓 Carl
The internal service responds with an error. I’ll talk to them.

...Carl switches to the chat with the internal service developer...

🤓 Carl
Hey man. What’s wrong with your service? It responds with an error and crashes our service.
👷 Rob
Hi. We have database issues. Sometimes we respond with an error, sometimes we don't. Look:

$ curl -XPOST http://internal.com/1
❌ HTTP/1.1 500 Internal Server Error

$ curl -XPOST http://internal.com/1
❌ HTTP/1.1 500 Internal Server Error

$ curl -XPOST http://internal.com/1
✅ HTTP/1.1 200 OK
John

🤓 Carl
And what should I do?
👷 Rob
Man, networks are unreliable. Retry the request until you get a successful response.
🤓 Carl
OK, wait a minute.

-import requests
+from requests.adapters import HTTPAdapter
+from requests import Session
+from urllib3.util.retry import Retry
from flask import Flask, Response, request

app = Flask('service')

+session = Session()
+retry = Retry(
+    total=5,
+    allowed_methods=['POST'],  # called method_whitelist before urllib3 1.26
+    status_forcelist=[500]
+)
+session.mount(
+    'http://',
+    HTTPAdapter(max_retries=retry)
+)

@app.route("/")
def get():
    password = request.args['p']
    if password != '123':
        return Response('Wrong password', status=401)
    else:
        user_id = request.args['u']
-       response = requests.post('http://internal.com/' + user_id)
+       response = session.post('http://internal.com/' + user_id)
        response.raise_for_status()
        return Response(response.json()['name'])

🤓 Carl
Save...Deploy...Done. Let's try.

$ curl 'http://service.com/?p=123&u=1'
✅ HTTP/1.1 200 OK
John

$ curl 'http://service.com/?p=123&u=1'
✅ HTTP/1.1 200 OK
John

$ curl 'http://service.com/?p=123&u=1'
✅ HTTP/1.1 200 OK
John

🤓 Carl
3 out of 3! And in the logs I see the retry requests. Thank you!

⚠️ Retrying because response is 500.
⚠️ Retrying because response is 500.
⚠️ Retrying because response is 500.
✅ Response from http://internal.com/1 is 200 OK

...Carl switches to the chat with John...

🤓 Carl
I’ve fixed the problem. The internal service has some database issues. I've added 5 retries for every request and we are working normally now.

👨 John
Nice!

Day 3. Know your limits.

 
👨 John
Houston, we have a problem! Our new service is not working!
🤓 Carl
What, again? Arrr, let’s look at the logs.

❌ Response from http://internal.com/1 is 429 Too Many Requests

...Carl switches to the chat with the internal service developer...

🤓 Carl
Hey man, we have a problem with your service again. It responds with a 429 error. What is it?
👷 Rob
Hi. You're making too many requests. Are you still retrying the requests?
🤓 Carl
Yep.
👷 Rob
What is your backoff factor?
🤓 Carl
Ehhhh, what is my what?
👷 Rob
I mean, you can make only 100 requests per second to our service. You make a request, get an error with status code 500 (yes, we still have the database issues) and then make another request immediately. Just sleep for a while after the 500 error.
🤓 Carl
Let's see.

from requests.adapters import HTTPAdapter
from requests import Session
from urllib3.util.retry import Retry
from flask import Flask, Response, request

app = Flask('service')

session = Session()
retry = Retry(
    total=5,
    allowed_methods=['POST'],
-   status_forcelist=[500]
+   status_forcelist=[500, 429],
+   backoff_factor=0.1
)
session.mount(
    'http://',
    HTTPAdapter(max_retries=retry)
)

@app.route("/")
def get():
    password = request.args['p']
    if password != '123':
        return Response('Wrong password', status=401)
    else:
        user_id = request.args['u']
        response = session.post('http://internal.com/' + user_id)
        response.raise_for_status()
        return Response(response.json()['name'])

🤓 Carl
I've set the backoff factor to 0.1. As I understand it, it will sleep for [0.0s, 0.2s, 0.4s, ...] between retries. I've also added a retry for the 429 error. Save...Deploy...Done. Let's try.

$ curl 'http://service.com/?p=123&u=1'
✅ HTTP/1.1 200 OK
John

$ curl 'http://service.com/?p=123&u=1'
✅ HTTP/1.1 200 OK
John

$ curl 'http://service.com/?p=123&u=1'
✅ HTTP/1.1 200 OK
John

🤓 Carl
3 out of 3! Thanks!
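
Carl's reading of the backoff factor can be checked with a few lines of plain Python. This is only a minimal model, assuming the sleep before the n-th consecutive retry is backoff_factor * 2 ** (n - 1) with no sleep before the first retry. The name backoff_delays and the max_delay cap are made up for illustration and are not part of any library.

```python
def backoff_delays(backoff_factor, retries, max_delay=120.0):
    """Return the assumed sleep, in seconds, before each of `retries` retries."""
    delays = []
    for n in range(1, retries + 1):
        if n == 1:
            # Assumption: the first retry is sent immediately.
            delays.append(0.0)
        else:
            # Assumption: exponential growth, capped at max_delay.
            delays.append(min(max_delay, backoff_factor * 2 ** (n - 1)))
    return delays


if __name__ == "__main__":
    print(backoff_delays(0.1, 5))  # [0.0, 0.2, 0.4, 0.8, 1.6]
```

Under these assumptions, five retries with a factor of 0.1 add up to about three seconds of sleeping in the worst case.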

Day 4. Your time is out.

 
👨 John
Houston, we have a problem. Our service is working, but it responds too slowly!
🤓 Carl
What? Let's look...Hmm, yeah. It takes approx. 5 seconds per request. What do you need?
👨 John
We need 100ms at most.

🤓 ➜ 👷

🤓 Carl
Hey man, we have a problem with your service again. It responds too slowly and that’s why our service responds too slowly.
👷 Rob
We have a load balancer issue. Usually it routes all the requests to the closest data center. But now it has started to route half of the requests to the furthest one.
🤓 Carl
What should I do?
👷 Rob
Networks are unreliable, dude. You can set a timeout for the request. I mean, if the request hasn’t finished in 30ms, make another one. It should help.

from requests.adapters import HTTPAdapter
from requests import Session
from urllib3.util.retry import Retry
from flask import Flask, Response, request

app = Flask('service')

session = Session()
retry = Retry(
    total=5,
    allowed_methods=['POST'],
    status_forcelist=[500, 429],
    backoff_factor=0.1
)
session.mount(
    'http://',
    HTTPAdapter(max_retries=retry)
)

@app.route("/")
def get():
    password = request.args['p']
    if password != '123':
        return Response('Wrong password', status=401)
    else:
        user_id = request.args['u']
-       response = session.post('http://internal.com/' + user_id)
+       response = session.post(
+           'http://internal.com/' + user_id,
+           timeout=0.03  # seconds, i.e. 30ms
+       )
        response.raise_for_status()
        return Response(response.json()['name'])

🤓 Carl
Yes, it helped. Thank you.

⚠️ Retrying. Connection timed out (0.03).
⚠️ Retrying. Connection timed out (0.03).
✅ Response from http://internal.com/1 is 200 OK

Day 5. Timeouts strike back.

 
👨 John
Houston, we have a problem. Your service is not working!
🤓 Carl
I know, I know...

⚠️ Retrying. Connection timed out (0.03).
⚠️ Retrying. Connection timed out (0.03).
⚠️ Retrying. Connection timed out (0.03).
⚠️ Retrying. Connection timed out (0.03).
⚠️ Retrying. Connection timed out (0.03).
❌ Max retries exceeded.

🤓 ➜ 👷

🤓 Carl
Dude, we have a problem. Your service is not responding within 30ms. We make 5 retry requests and then give up.
👷 Rob
We are upgrading the service and some instances have connectivity issues. In some cases they cannot even accept connections. But you can detect that your request went to a bad instance. Our networks are quite fast and should establish a connection in 5ms. Just split the timeout into a connect timeout and a read timeout.
🤓 Carl
But I'll still have 5 retry requests.
👷 Rob
As I remember, your service must return the response in 100ms. That's 25ms*2 + 5ms*10.
🤓 Carl
Yeah, I got it. I'll make 2 retries for 25ms read timeouts and 10 retries for 5ms connect timeouts.
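
Rob's arithmetic is easy to double-check. The sketch below simply multiplies each timeout by the number of attempts that may hit it. The helper worst_case_ms is a made-up name, and the model is deliberately rough: it ignores backoff sleeps and processing time, and it counts Rob's "2" and "10" as attempts, the way his formula does.

```python
def worst_case_ms(phases):
    """Sum timeout * attempts over (timeout_ms, attempts) pairs."""
    return sum(timeout_ms * attempts for timeout_ms, attempts in phases)


BUDGET_MS = 100  # the limit John asked for

# Day 4 setup: one 30ms timeout, the first attempt plus 5 retries.
single_timeout = worst_case_ms([(30, 6)])

# Rob's plan: 2 read timeouts of 25ms and 10 connect timeouts of 5ms.
split_timeouts = worst_case_ms([(25, 2), (5, 10)])

if __name__ == "__main__":
    print(single_timeout, single_timeout <= BUDGET_MS)  # 180 False
    print(split_timeouts, split_timeouts <= BUDGET_MS)  # 100 True
```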

from requests.adapters import HTTPAdapter
from requests import Session
from urllib3.util.retry import Retry
from flask import Flask, Response, request

app = Flask('service')

session = Session()
retry = Retry(
-   total=5,
+   total=12,  # total caps all retries combined, so it must cover connect + read
    allowed_methods=['POST'],
    status_forcelist=[500, 429],
    backoff_factor=0.1,
+   status=5,  # still at most 5 retries on bad status codes
+   connect=10,
+   read=2,
)
session.mount(
    'http://',
    HTTPAdapter(max_retries=retry)
)

@app.route("/")
def get():
    password = request.args['p']
    if password != '123':
        return Response('Wrong password', status=401)
    else:
        user_id = request.args['u']
        response = session.post(
            'http://internal.com/' + user_id,
-           timeout=0.03
+           timeout=(
+               0.005,  # connect timeout, 5ms
+               0.025,  # read timeout, 25ms
+           )
        )
        response.raise_for_status()
        return Response(response.json()['name'])

🤓 Carl
It works! Thanks!

⚠️ Retrying. Connection timed out (0.005).
⚠️ Retrying. Connection timed out (0.005).
⚠️ Retrying. Connection timed out (0.005).
⚠️ Retrying. Connection timed out (0.005).
⚠️ Retrying. Connection timed out (0.005).
⚠️ Retrying. Read timed out (0.025).
✅ Response from http://internal.com/1 is 200 OK
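
The difference between the two timeouts Rob describes can be seen without any HTTP library. The sketch below uses only the standard socket module. The function probe and its return strings are made up for illustration, and the local listener that accepts connections but never answers is just a stand-in for a slow instance.

```python
import socket


def probe(host, port, connect_timeout, read_timeout):
    """Classify one attempt by the phase in which it failed."""
    try:
        # Phase 1: establishing the TCP connection.
        sock = socket.create_connection((host, port), timeout=connect_timeout)
    except socket.timeout:
        return "connect timeout"
    except OSError:
        return "connect error"
    with sock:
        # Phase 2: waiting for the first byte of the response.
        sock.settimeout(read_timeout)
        try:
            data = sock.recv(1)
        except socket.timeout:
            return "read timeout"
    return "ok" if data else "closed"


if __name__ == "__main__":
    # A listener that completes the handshake but never sends anything.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    port = server.getsockname()[1]
    print(probe("127.0.0.1", port, 0.005, 0.025))
    server.close()
```

Against a listener like this one, the connection itself is quick and the attempt should be reported as a read timeout. That is the distinction the (connect, read) tuple passed as timeout expresses in Carl's code.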

Day 6. Stop this madness!

 
👨 John
Houston, we have a problem. Not a big one, but a problem nonetheless.
🤓 Carl
???
👨 John
When we make a request with a nonexistent user ID, we get a 500 error. It would be awesome to get a 404 instead. Can you help us?

$ curl 'http://service.com/?p=123&u=999'
❌ HTTP/1.1 500 Internal Server Error

🤓 Carl
I see. That's because the retries are exhausted.

⚠️ Retrying because response is 500.
⚠️ Retrying because response is 500.
⚠️ Retrying because response is 500.
⚠️ Retrying because response is 500.
⚠️ Retrying because response is 500.
❌ Max retries exceeded.

🤓 ➜ 👷

🤓 Carl
Hey man, your service consistently returns a 500 error for a nonexistent user. We make 5 retry requests and then give up.
👷 Rob
It’s not a bug, it’s a feature. It’s part of the protocol and cannot be changed. You can always check the response body and stop retrying based on it. Just stop the retry cycle if the response is User not exists.

$ curl -XPOST http://internal.com/999
❌ HTTP/1.1 500 Internal Server Error
User not exists.

🤓 Carl
Let's see... We always retry when you return a 500 error. It's impossible to check the response body with the default Retry class. I'll have to extend it.

from requests.adapters import HTTPAdapter
from requests import Session
from urllib3.util.retry import Retry
from flask import Flask, Response, request
+from urllib3 import exceptions as urllib3_exceptions

+class MyRetry(Retry):
+    def increment(self, method=None, url=None, response=None, **kwargs):
+        # A 500 carrying this body is permanent, so give up immediately
+        if (response is not None and
+                response.status == 500 and
+                b'User not exists' in response.data):
+            raise urllib3_exceptions.MaxRetryError(
+                pool=kwargs.get('_pool'),
+                url=url,
+                reason=urllib3_exceptions.ResponseError(
+                    'User not exists.'
+                )
+            )
+        return super().increment(
+            method, url, response=response, **kwargs
+        )


app = Flask('service')

session = Session()
-retry = Retry(
+retry = MyRetry(
    total=5,
    allowed_methods=['POST'],
    status_forcelist=[500, 429],
    backoff_factor=0.1,
    connect=10,
    read=2,
+   raise_on_status=False,
)
session.mount(
    'http://',
    HTTPAdapter(max_retries=retry)
)

@app.route("/")
def get():
    password = request.args['p']
    if password != '123':
        return Response('Wrong password', status=401)
    else:
        user_id = request.args['u']
        response = session.post(
            'http://internal.com/' + user_id,
            timeout=(
                0.005,  # connect timeout, 5ms
                0.025,  # read timeout, 25ms
            )
        )
-       response.raise_for_status()
-       return Response(response.json()['name'])
+       if response.ok:
+           return Response(response.json()['name'])
+       elif 'User not exists' in response.text:
+           return Response('User not exists', status=404)
+       # Any other failure still surfaces as an error
+       response.raise_for_status()

🤓 Carl
It works!

$ curl 'http://service.com/?p=123&u=999'
✅ HTTP/1.1 404 Not Found
User not exists

🤓 Carl
We no longer retry on a User not exists response and return 404 right away.
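
Stripped of the HTTP machinery, the rule Carl ended up with fits in a small loop: retry only the statuses worth retrying, and stop at once when the body says the failure is permanent. The sketch below is a library-free illustration. The names call_with_retries, user_missing and RETRY_STATUSES are made up, and fetch stands for any function returning a (status, body) pair.

```python
RETRY_STATUSES = {429, 500}  # assumption: mirrors status_forcelist in the story


def user_missing(status, body):
    """True for the 500 that really means 'no such user'."""
    return status == 500 and "User not exists" in body


def call_with_retries(fetch, max_retries, is_permanent):
    """Call fetch() until success, a permanent failure, or retries run out.

    Returns (status, body, attempts).
    """
    attempts = 0
    while True:
        status, body = fetch()
        attempts += 1
        retryable = status in RETRY_STATUSES and not is_permanent(status, body)
        if not retryable or attempts > max_retries:
            return status, body, attempts


if __name__ == "__main__":
    # A flaky upstream: two database errors, then a good answer.
    flaky = iter([(500, "Database error"), (500, "Database error"), (200, "John")])
    print(call_with_retries(lambda: next(flaky), 5, user_missing))
    # (200, 'John', 3)

    # A missing user: one attempt, no retries.
    print(call_with_retries(lambda: (500, "User not exists."), 5, user_missing))
    # (500, 'User not exists.', 1)
```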

Three months later.

 
👨 John
Hey dude. Your service has been running for 3 months without a single error. You rock, man!
🤓 Carl
Thank you, it was hard work to make it happen.
👨 John
Actually, we have a problem. Someone hacked our system and downloaded all the names of our customers. What do you think? How did they do it?
🤓 Carl
Mmm, didn’t you tell anyone the password to the service?
👨 John
No, I didn’t!
🤓 Carl
We’d better change the password, I believe. Let’s try. Change… Save… Deploy… Done! It’s 123456 now!
👨 John
You’re a cybersecurity ninja, man!
🤓 Carl
💪

The end.