Story of retry

Day 1. The beginning.

👨 John

We need to make an HTTP service that returns a user's name by their ID.

🤓 Carl

Oh, but we already have one at http://internal.com. Look:

$ curl -XPOST http://internal.com/1
✅ HTTP/1.1 200 OK
John

$ curl -XPOST http://internal.com/2
✅ HTTP/1.1 200 OK
Carl

👨 John

Yeah, but it's an internal API. We want to make requests from the internet.

🤓 Carl

Then we need some authorization for it.

👨 John

Let's make it dead simple for now and send the password as a request parameter.

🤓 Carl

OK, wait a minute, I'll write some code.

import requests
from flask import Flask, Response, request

app = Flask('service')

@app.route("/")
def get():
    # Flask exposes the current request as the imported `request` proxy,
    # not as a view argument.
    if request.args['p'] != '123':
        return Response('Wrong password', status=401)
    else:
        user_id = request.args['u']
        response = requests.post('http://internal.com/' + user_id)
        response.raise_for_status()
        return Response(response.json()['name'])

🤓 Carl

Save... Deploy... Done. With a bad password:

$ curl 'http://service.com/?p=111&u=1'
❌ HTTP/1.1 401 Unauthorized
Wrong password

🤓 Carl

With a good password:

$ curl 'http://service.com/?p=123&u=1'
✅ HTTP/1.1 200 OK
John

👨 John

Cool!

Day 2. Five hundred problems and a girl ain't one.

👨 John

Houston, we have a problem! Our new service is not working!

🤓 Carl

What? Wait a minute, I'll check:

$ curl 'http://service.com/?p=123&u=1'
❌ HTTP/1.1 500 Internal Server Error

🤓 Carl

Oh shit! I'll look at the logs:

❌ Response from http://internal.com/1 is 500 Internal Server Error

🤓 Carl

The internal service responds with an error. I'll talk to them.

...Carl switches to the chat with the internal service developer...

🤓 Carl

Hey man. What's wrong with your service? It responds with an error and crashes our service.

👷 Rob

Hi. We have database issues. Sometimes we respond with an error, and sometimes we don't. Look:

$ curl -XPOST http://internal.com/1
❌ HTTP/1.1 500 Internal Server Error

$ curl -XPOST http://internal.com/1
❌ HTTP/1.1 500 Internal Server Error

$ curl -XPOST http://internal.com/1
✅ HTTP/1.1 200 OK
John

🤓 Carl

And what should I do?

👷 Rob

Man, networks are unreliable. Retry the request a few times until one succeeds.

🤓 Carl

OK, wait a minute.

-import requests
+from requests.adapters import HTTPAdapter
+from requests import Session
+from urllib3.util.retry import Retry
from flask import Flask, Response, request

app = Flask('service')

+session = Session()
+retry = Retry(
+    total=5,
+    # Named method_whitelist before urllib3 1.26.
+    allowed_methods=['POST'],
+    status_forcelist=[500]
+)
+session.mount(
+    'http://',
+    HTTPAdapter(max_retries=retry)
+)

@app.route("/")
def get():
    password = request.args['p']
    if password != '123':
        return Response('Wrong password', status=401)
    else:
        user_id = request.args['u']
-       response = requests.post('http://internal.com/' + user_id)
+       response = session.post('http://internal.com/' + user_id)
        response.raise_for_status()
        return Response(response.json()['name'])

🤓 Carl

Save... Deploy... Done. Let's try.

$ curl 'http://service.com/?p=123&u=1'
✅ HTTP/1.1 200 OK
John

$ curl 'http://service.com/?p=123&u=1'
✅ HTTP/1.1 200 OK
John

$ curl 'http://service.com/?p=123&u=1'
✅ HTTP/1.1 200 OK
John

🤓 Carl

3 of 3! And in the logs I see the retry requests. Thank you!

⚠️ Retrying because response is 500.
⚠️ Retrying because response is 500.
⚠️ Retrying because response is 500.
✅ Response from http://internal.com/1 is 200 OK

...Carl switches to the chat with John...

🤓 Carl

I've fixed the problem. The internal service has some database issues. I've added up to 5 retries for every request, and we are working normally now.

👨 John

Nice!
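
Rob's advice boils down to a loop: call the flaky service, and if it answers 500, call it again, up to a fixed number of extra attempts. Here is a minimal sketch of that idea in plain Python. It is not the code inside requests or urllib3; `call_with_retries` and `make_flaky_service` are hypothetical names, and the fake service fails twice before succeeding, like Rob's curl session above.

```python
def call_with_retries(send, max_retries=5, retry_statuses=(500,)):
    """Call send() until the status is not retryable or retries run out.

    Returns (status, body, retries_used).
    """
    retries_used = 0
    while True:
        status, body = send()
        if status not in retry_statuses or retries_used >= max_retries:
            return status, body, retries_used
        retries_used += 1


def make_flaky_service(failures_before_success):
    """Build a fake service: 500 a fixed number of times, then 200."""
    remaining = [failures_before_success]

    def send():
        if remaining[0] > 0:
            remaining[0] -= 1
            return 500, 'Internal Server Error'
        return 200, 'John'

    return send


status, body, retries_used = call_with_retries(make_flaky_service(2))
print(status, body, retries_used)  # 200 John 2
```

If the service never recovers, the loop returns the last 500 after `max_retries` retries, which is the "Max retries exceeded" situation.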

Day 3. Know your limits.

👨 John

Houston, we have a problem! Our new service is not working!

🤓 Carl

What, again? Arrr, let's look at the logs.

❌ Response from http://internal.com/1 is 429 Too Many Requests

...Carl switches to the chat with the internal service developer...

🤓 Carl

Hey man, we have a problem with your service again. It responds with a 429 error. What is it?

👷 Rob

Hi. You're making too many requests. Are you still retrying the requests?

🤓 Carl

Yep.

👷 Rob

What is your backoff factor?

🤓 Carl

Ehhhh, what is my what?

👷 Rob

I mean, you can make only 100 requests per second to our service. You make a request, get an error with status code 500 (yes, we still have the database issues), and then immediately make another request. Just sleep for a while after the 500 error.

🤓 Carl

Let's see.

from requests.adapters import HTTPAdapter
from requests import Session
from urllib3.util.retry import Retry
from flask import Flask, Response, request

app = Flask('service')

session = Session()
retry = Retry(
    total=5,
    allowed_methods=['POST'],
-   status_forcelist=[500]
+   status_forcelist=[500, 429],
+   backoff_factor=0.1
)
session.mount(
    'http://',
    HTTPAdapter(max_retries=retry)
)

@app.route("/")
def get():
    password = request.args['p']
    if password != '123':
        return Response('Wrong password', status=401)
    else:
        user_id = request.args['u']
        response = session.post('http://internal.com/' + user_id)
        response.raise_for_status()
        return Response(response.json()['name'])

🤓 Carl

I've added a backoff factor of 0.1. As I understand it, it will sleep for [0.0s, 0.2s, 0.4s, ...] between retries. I've also added a retry on the 429 error. Save... Deploy... Done. Let's try.

$ curl 'http://service.com/?p=123&u=1'
✅ HTTP/1.1 200 OK
John

$ curl 'http://service.com/?p=123&u=1'
✅ HTTP/1.1 200 OK
John

$ curl 'http://service.com/?p=123&u=1'
✅ HTTP/1.1 200 OK
John

🤓 Carl

3 of 3! Thanks!
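
Carl's sleep schedule can be worked out by hand. The sketch below follows the formula given in the urllib3 documentation as I read it: no sleep before the first retry, then `backoff_factor * 2 ** (retry_number - 1)`, capped at a maximum. `backoff_delays` is a hypothetical helper written for this illustration, not a urllib3 function, and the 120-second cap is an assumed default.

```python
def backoff_delays(backoff_factor, retries, backoff_max=120.0):
    """Sleep time in seconds before each retry.

    The first retry happens immediately; after that the delay doubles
    with every retry and never exceeds backoff_max.
    """
    delays = []
    for retry_number in range(1, retries + 1):
        if retry_number == 1:
            delays.append(0.0)
        else:
            delay = backoff_factor * (2 ** (retry_number - 1))
            delays.append(min(backoff_max, delay))
    return delays


delays = backoff_delays(0.1, 5)
print(delays)                 # [0.0, 0.2, 0.4, 0.8, 1.6]
print(round(sum(delays), 3))  # 3.0
```

With five retries, a request that keeps failing spends about three seconds just sleeping, before counting the time of the requests themselves.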

Day 4. Your time is out.

👨 John

Houston, we have a problem. Our service is working, but it responds too slowly!

🤓 Carl

What? Let's look... Hmm, yeah. It takes approximately 5 seconds per request. What do you need?

👨 John

We need 100ms at most.

🤓 ➜ 👷

🤓 Carl

Hey man, we have a problem with your service again. It responds too slowly, and that's why our service responds too slowly.

👷 Rob

We have a load balancer issue. Usually it routes all the requests to the closest data center. But now it has started to route half of the requests to the furthest one.

🤓 Carl

What should I do?

👷 Rob

Networks are unreliable, dude. You can set a timeout for the request. I mean, if the request hasn't finished in 30ms, make another one. It should help.

from requests.adapters import HTTPAdapter
from requests import Session
from urllib3.util.retry import Retry
from flask import Flask, Response, request

app = Flask('service')

session = Session()
retry = Retry(
    total=5,
    allowed_methods=['POST'],
    status_forcelist=[500, 429],
    backoff_factor=0.1
)
session.mount(
    'http://',
    HTTPAdapter(max_retries=retry)
)

@app.route("/")
def get():
    password = request.args['p']
    if password != '123':
        return Response('Wrong password', status=401)
    else:
        user_id = request.args['u']
-       response = session.post('http://internal.com/' + user_id)
+       response = session.post(
+           'http://internal.com/' + user_id,
+           timeout=0.03  # seconds, i.e. 30ms
+       )
        response.raise_for_status()
        return Response(response.json()['name'])

🤓 Carl

Yes, it helped. Thank you.

⚠️ Retrying. Connection timed out (0.03).
⚠️ Retrying. Connection timed out (0.03).
✅ Response from http://internal.com/1 is 200 OK
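
Why a timeout helps here can be shown with a small simulation. This is a sketch under stated assumptions, not a measurement: the far data center is assumed to answer in 5000ms and the near one in 20ms, backoff sleeps are ignored, and `time_to_first_answer` is a hypothetical helper.

```python
def time_to_first_answer(latencies_ms, timeout_ms=None):
    """Simulate sequential attempts against a service.

    latencies_ms lists how long each successive attempt would take.
    An attempt slower than timeout_ms is abandoned after timeout_ms and
    the next one is tried. Returns (elapsed_ms, attempts), or
    (elapsed_ms, None) if every attempt timed out.
    """
    elapsed = 0
    for attempt, latency in enumerate(latencies_ms, start=1):
        if timeout_ms is None or latency <= timeout_ms:
            return elapsed + latency, attempt
        elapsed += timeout_ms
    return elapsed, None


# Two attempts land in the far data center, the third in the near one.
latencies = [5000, 5000, 20]
print(time_to_first_answer(latencies))                 # (5000, 1)
print(time_to_first_answer(latencies, timeout_ms=30))  # (80, 3)
```

Without a timeout the caller waits out the slow answer. With a 30ms timeout it gives up twice and still gets a response after 80ms.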

Day 5. Timeouts strike back.

👨 John

Houston, we have a problem. Your service is not working!

🤓 Carl

I know, I know...

⚠️ Retrying. Connection timed out (0.03).
⚠️ Retrying. Connection timed out (0.03).
⚠️ Retrying. Connection timed out (0.03).
⚠️ Retrying. Connection timed out (0.03).
⚠️ Retrying. Connection timed out (0.03).
❌ Max retries exceeded.

🤓 ➜ 👷

🤓 Carl

Dude, we have a problem. Your service is not responding within 30ms. We make 5 retries and then give up.

👷 Rob

We are upgrading the service, and some instances have connectivity issues. In some cases they cannot even establish a connection. But you can detect that your request reached a bad instance. Our network is quite fast and should establish a connection within 5ms. Just split the timeout into a connect timeout and a read timeout.

🤓 Carl

But I'll still have 5 retries.

👷 Rob

As I remember, your service must return the response within 100ms. That's 25ms*2 + 5ms*10.

🤓 Carl

Yeah, I got it. I'll allow 2 retries with a 25ms read timeout and 10 retries with a 5ms connect timeout.

from requests.adapters import HTTPAdapter
from requests import Session
from urllib3.util.retry import Retry
from flask import Flask, Response, request

app = Flask('service')

session = Session()
retry = Retry(
-   total=5,
+   total=12,  # total takes precedence over connect and read
    allowed_methods=['POST'],
    status_forcelist=[500, 429],
    backoff_factor=0.1,
+   connect=10,
+   read=2,
+   status=5,
)
session.mount(
    'http://',
    HTTPAdapter(max_retries=retry)
)

@app.route("/")
def get():
    password = request.args['p']
    if password != '123':
        return Response('Wrong password', status=401)
    else:
        user_id = request.args['u']
        response = session.post(
            'http://internal.com/' + user_id,
-           timeout=0.03
+           timeout=(
+               0.005,  # connect timeout, 5ms
+               0.025,  # read timeout, 25ms
+           )
        )
        response.raise_for_status()
        return Response(response.json()['name'])

🤓 Carl

It works! Thanks!

⚠️ Retrying. Connection timed out (0.005).
⚠️ Retrying. Connection timed out (0.005).
⚠️ Retrying. Connection timed out (0.005).
⚠️ Retrying. Connection timed out (0.005).
⚠️ Retrying. Connection timed out (0.005).
⚠️ Retrying. Read timed out (0.025).
✅ Response from http://internal.com/1 is 200 OK
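
Rob's arithmetic, 25ms*2 + 5ms*10, is a worst-case bound on the time spent in attempts that time out. A minimal sketch of that calculation follows; `worst_case_ms` is a hypothetical helper, and the bound deliberately ignores backoff sleeps and the duration of the attempt that finally succeeds.

```python
def worst_case_ms(connect_timeout_ms, connect_retries,
                  read_timeout_ms, read_retries):
    """Upper bound on time spent in timed-out attempts, in milliseconds."""
    return (connect_timeout_ms * connect_retries
            + read_timeout_ms * read_retries)


budget_ms = 100

# One 30ms timeout for everything, 5 retries.
single = worst_case_ms(0, 0, read_timeout_ms=30, read_retries=5)
# Separate timeouts: 10 connect retries at 5ms, 2 read retries at 25ms.
split = worst_case_ms(5, 10, read_timeout_ms=25, read_retries=2)

print(single, single <= budget_ms)  # 150 False
print(split, split <= budget_ms)    # 100 True
```

The split configuration only just fits the budget, so any sleep between retries pushes the worst case past 100ms.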

Day 6. Stop this madness!

👨 John

Houston, we have a problem. Not a big one, but still a problem.

🤓 Carl

???

👨 John

When we make a request with a nonexistent user ID, we get a 500 error. It would be awesome to get a 404 instead. Can you help us?

$ curl 'http://service.com/?p=123&u=999'
❌ HTTP/1.1 500 Internal Server Error

🤓 Carl

I see. That's because the retries are exhausted.

⚠️ Retrying because response is 500.
⚠️ Retrying because response is 500.
⚠️ Retrying because response is 500.
⚠️ Retrying because response is 500.
⚠️ Retrying because response is 500.
❌ Max retries exceeded.

🤓 ➜ 👷

🤓 Carl

Hey man, your service constantly returns a 500 error for a nonexistent user. We make 5 retries and then give up.

👷 Rob

It's not a bug, it's a feature. It's part of the protocol and cannot be fixed. But you can always check the response and stop retrying based on it. Just stop the retry cycle if the response is User not exists.

$ curl -XPOST http://internal.com/999
❌ HTTP/1.1 500 Internal Server Error
User not exists.

🤓 Carl

Let's see... We always retry when you return a 500 error. It's impossible to check the response body with the default Retry class. I'll have to extend it.

from requests.adapters import HTTPAdapter
from requests import Session
from urllib3.util.retry import Retry
from flask import Flask, Response, request
+from urllib3 import exceptions as urllib3_exceptions

+class MyRetry(Retry):
+    def increment(self, method=None, url=None, response=None,
+                  error=None, _pool=None, _stacktrace=None):
+        # Give up at once when the 500 body says the user does not exist.
+        # response.data is bytes, so compare against a bytes literal.
+        if (response is not None and
+            response.status == 500 and
+            b'User not exists' in response.data):
+            raise urllib3_exceptions.MaxRetryError(
+                pool=_pool,
+                url=url,
+                reason=urllib3_exceptions.ResponseError(
+                    'User not exists.'
+                )
+            )
+        return super().increment(
+            method, url, response=response, error=error,
+            _pool=_pool, _stacktrace=_stacktrace
+        )

app = Flask('service')

session = Session()
-retry = Retry(
+retry = MyRetry(
    total=12,
    allowed_methods=['POST'],
    status_forcelist=[500, 429],
    backoff_factor=0.1,
    connect=10,
    read=2,
    status=5,
+   raise_on_status=False,
)
session.mount(
    'http://',
    HTTPAdapter(max_retries=retry)
)

@app.route("/")
def get():
    password = request.args['p']
    if password != '123':
        return Response('Wrong password', status=401)
    else:
        user_id = request.args['u']
        response = session.post(
            'http://internal.com/' + user_id,
            timeout=(
                0.005,  # connect timeout, 5ms
                0.025,  # read timeout, 25ms
            )
        )
-       response.raise_for_status()
-       return Response(response.json()['name'])
+       if response.ok:
+           return Response(response.json()['name'])
+       # MyRetry may already have read the body into response.raw.
+       body = response.content or response.raw.data or b''
+       if b'User not exists' in body:
+           return Response('User not exists', status=404)
+       response.raise_for_status()

🤓 Carl

It works!

$ curl 'http://service.com/?p=123&u=999'
✅ HTTP/1.1 404 Not Found
User not exists

🤓 Carl

We no longer retry on the User not exists response, and we return 404 right away.
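
Stripped of the library details, the change is that the retry decision now looks at the response body as well as the status code. A minimal sketch in plain Python, with hypothetical names (`call_until_final`, `should_retry`, `to_public_status`) and a fake internal service standing in for http://internal.com:

```python
def call_until_final(send, should_retry, max_retries=5):
    """Call send() until should_retry() says stop or retries run out.

    Returns (status, body, retries_used).
    """
    retries_used = 0
    while True:
        status, body = send()
        if retries_used >= max_retries or not should_retry(status, body):
            return status, body, retries_used
        retries_used += 1


def should_retry(status, body):
    """Retry on 500, unless the body says the user does not exist."""
    return status == 500 and 'User not exists' not in body


def to_public_status(status, body):
    """Map the internal 500 'User not exists.' answer to a public 404."""
    if status == 500 and 'User not exists' in body:
        return 404
    return status


calls = []

def missing_user():
    """Fake internal service that always reports a missing user."""
    calls.append(1)
    return 500, 'User not exists.'


status, body, retries_used = call_until_final(missing_user, should_retry)
print(to_public_status(status, body), retries_used, len(calls))  # 404 0 1
```

The internal service is called exactly once for a missing user, while a plain 500 with any other body still goes through the usual retries.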

Three months later.

👨 John

Hey dude. It's been 3 months since your service started working without any errors. You rock, man!

🤓 Carl

Thank you, it was hard work to make it happen.

👨 John

Actually, we have a problem. Someone hacked our system and downloaded all the names of our customers. What do you think? How did they do it?

🤓 Carl

Hmm, did you tell anyone the password to the service?

👨 John

No, I didn't!

🤓 Carl

I believe it's better to change the password. Let's try. Change… Save… Deploy… Done! It's 123456 now!

👨 John

You're a cybersecurity ninja, man!

🤓 Carl

💪

The end.