
Search Regression - Blog

2020-02-01

We were overzealous in some of the ways we tried to protect our search.

This caused attempts to do a search to time out.

What we did wrong:

How we've fixed it:

This cut the search time from a ridiculous 100+ seconds down to 1-2 seconds.

We're still looking to improve the speed, but we're faster than before we tried to fix the search engine.


What's so expensive?

Mostly the fuzzy matcher.

import sys

# Assumed import: fuzz.partial_ratio() comes from a fuzzy-matching library
# such as fuzzywuzzy, thefuzz or rapidfuzz.
from fuzzywuzzy import fuzz

ratios = {}

def fuzz_ratio(a, b):

    # Crude cache cap: clear the dict once its own structure passes ~100KB.
    # (sys.getsizeof() doesn't count the keys and values, so this is rough.)
    if sys.getsizeof(ratios) > 102400:
        ratios.clear()

    try:
        return ratios["{}|{}".format(a, b)]
    except KeyError:
        r = fuzz.partial_ratio(a, b)
        ratios["{}|{}".format(a, b)] = r
        return r

fuzz.partial_ratio performs a Levenshtein distance match and weights it into a similarity score. It's implemented in a fairly fast C library, but we run it against every term we break out of the string the user sends us, and we do it multiple times per search.

Storing results in a cache (the dict) means we shouldn't pay that penalty on repeated calls anymore.
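
As a rough illustration of what the cache buys us, here's a sketch that exercises the fuzz_ratio() above. The example strings are made up, and the timings will vary wildly between machines:

import timeit

# The first call computes fuzz.partial_ratio() and stores the result; every
# repeated call with the same arguments is just string formatting plus a
# dict lookup.
fuzz_ratio("kung fu", "kung fu panda")

print(timeit.timeit(lambda: fuzz_ratio("kung fu", "kung fu panda"),
                    number=10_000))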

Which moves the hotpath to the token parser, or how we break up a user's search.

Parsing search

Our search engine has a few clever bits when a user hands us a string to search:

- We normalise the string, lowercasing it and stripping anything that isn't a letter, digit, space or quote.
- We wrap genre names from the DB in quotes, so multi-word genres like "kung fu" aren't split apart.
- We split what's left into tokens, treating quoted groups as a single token.

Having to pass in genres from the DB makes it more difficult to memoise this particular function. It's also somewhat expensive in its own right, because we iterate over the string and Python doesn't have a char type, only immutable strings.

def parse(search_term, genres):
    # Normalise
    term = cleanse_string(search_term)

    # genre terms shouldn't be split ie. kung fu kid == ["kung fu", "kid"]
    for genre in genres:
        term = term.replace(genre, f'"{genre}"')

    # Parse and split intelligently
    tokens = []
    s = []
    in_group = False
    for c in term:
        if not in_group:
            if c == '"':
                in_group = True
                if len(s) > 0:
                    tokens.append(''.join(s))
                    s = []
            elif c == ' ':
                if len(s) > 0:
                    tokens.append(''.join(s))
                    s = []
            else:
                s.append(c)
        else:
            if c == '"':
                in_group = False
                if len(s) > 0:
                    tokens.append(''.join(s))
                    s = []
            else:
                s.append(c)
    if len(s) > 0:
        tokens.append(''.join(s))

    return tokens
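
Here's a quick sketch of what the parser produces. The genre list is made up for illustration (in production it comes from the DB), and cleanse_string() is the normaliser shown just below:

genres = ["kung fu", "science fiction"]

# Genre names survive as a single token instead of being split on spaces.
print(parse("Kung Fu kid", genres))
# ['kung fu', 'kid']

# Users can also quote their own groups.
print(parse('"the quick" brown fox', genres))
# ['the quick', 'brown', 'fox']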

Our normaliser isn't a big deal, and should actually be quite fast:

import string

allowed = string.ascii_letters + string.digits + ' ' + '"'

def cleanse_string(s):
    # Keep only lowercase letters, digits, spaces and quotes.
    s = s.lower()
    x = ''.join(c for c in s if c in allowed)

    return x
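
For example (using the cleanse_string() above):

# Punctuation is dropped and everything is lowercased.
print(cleanse_string("Kung-Fu: The Kid!"))
# kungfu the kid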

But we can improve that with the same caching method:

import sys
import string

allowed = string.ascii_letters + string.digits + ' ' + '"'
cleansed = {}

def cleanse_string(s):

    # Crude cache cap: clear the dict once its own structure passes ~100KB.
    # (sys.getsizeof() doesn't count the keys and values, so this is rough.)
    if sys.getsizeof(cleansed) > 102400:
        cleansed.clear()

    try:
        return cleansed[s]
    except KeyError:
        x = ''.join(c for c in s.lower() if c in allowed)
        cleansed[s] = x

        return x
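
As an aside, for a single-argument function like this, Python's built-in functools.lru_cache gives much the same bounded memoisation without the manual bookkeeping. This is only an alternative sketch, not what the site runs, and the function name and maxsize are arbitrary; parse() can't use it as easily, because its genres argument is an unhashable list:

import functools
import string

allowed = string.ascii_letters + string.digits + ' ' + '"'

@functools.lru_cache(maxsize=4096)
def cleanse_string_lru(s):
    # Keep only lowercase letters, digits, spaces and quotes.
    return ''.join(c for c in s.lower() if c in allowed)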

The Hotpath

Once we've made all those changes and fixed the speed problems we introduced earlier, profiling shows us the new hotpath: the place where things now slow down.
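
For reference, this is the kind of profiling run that surfaces a hotpath. A minimal sketch using Python's built-in cProfile, with a made-up search term and assuming parse() and genres are defined as above:

import cProfile
import pstats

# Profile a single parse and print the ten most expensive calls by
# cumulative time.
cProfile.run('parse("kung fu kid", genres)', 'search.prof')
pstats.Stats('search.prof').sort_stats('cumulative').print_stats(10)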

Unfortunately, there is still work to do, but search is back to an acceptable speed.

