Filter Table Rows by Attributes

    Find the full source code and run TagsHasher on Jina Hub.

    Big news: you can use Jina to filter table rows by their attributes! What an amazing feature, one that has only existed since… 47 years ago, aka SQL! Of course, Jina as a neural search framework won't implement a SQL database from scratch. The question here is: can we leverage what we learned about neural search (embedding, indexing, nearest-neighbour matching) to enable SQL-like features such as filter and select?

    Yes! Jina can do this. In this article, I will give you a walkthrough on how to filter tabular data with Jina, without SQL (and no GPT-3 either). Let's call this mini-project neuretro-SQL.

    The first thing you want to learn is feature hashing. I already gave a tutorial on it here; I strongly recommend reading that first before continuing.

    In general, feature hashing is a great way to embed an unbounded number of features into fixed-size vectors. We will leverage the same idea here to embed the columns of tabular data into fixed-size vectors.

    Let's look at an example CSV file. Here I use a film dataset whose rows look roughly like the following (header plus one sample row):

        Year;Length;Title;Subject;Actor;Actress;Director;Popularity;Awards;*Image
        1978;94;Days of Heaven;Drama;Gere, Richard;Adams, Brooke;Malick, Terrence;14;No;NicholasCage.png

    Let’s load the data from the web and put them into a DocumentArray:
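    Here is a minimal sketch of that loading step, assuming the classic semicolon-separated film.csv dataset (the URL below is my assumption; substitute your own copy) and using Python's csv module to turn each row into a Document's .tags:

        import csv
        import io
        import urllib.request

        from jina import Document, DocumentArray

        # Assumed location of the film dataset; replace with your own copy if needed
        URL = 'https://perso.telecom-paristech.fr/eagan/class/igr204/data/film.csv'

        text = urllib.request.urlopen(URL).read().decode('latin1')
        # Depending on the copy, you may need to skip a second type-header row
        da = DocumentArray(
            Document(tags=dict(row))
            for row in csv.DictReader(io.StringIO(text), delimiter=';')
        )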

    Looks like we got 1660 Documents in total. Let's take one sample and have a look:

        print(da[5].json())

        {
          "id": "16a9745c-3d99-11ec-a97f-1e008a366d49",
          "tags": {
            "*Image": "NicholasCage.png",
            "Actor": "Gere, Richard",
            "Actress": "Adams, Brooke",
            "Awards": "No",
            "Director": "Malick, Terrence",
            "Length": "94",
            "Popularity": "14",
            "Subject": "Drama",
            "Title": "Days of Heaven",
            "Year": "1978"
          }
        }

    It looks like this Document has two non-empty attributes, id and tags, and the values in tags correspond to the columns of the CSV data. Now our task is clear: we want to filter Documents from this DocumentArray according to their values in .tags, with no SQL, pure Jina, pure neural search.

    To embed columns into vectors, we first notice that each "column-item" in .tags is actually a Tuple[str, Any] pair. The first part, a string, represents the column title, e.g. "Actor", "Actress", "Director". We can simply reuse our previous hash function to map each title to an index:

        import hashlib

        h = lambda x: int(hashlib.md5(str(x).encode('utf-8')).hexdigest(), base=16) % 256

        h('Actor')
        h('Director')
        h('Length')

    Now that we have indices, the actual value at each index, namely the Any part of the Tuple[str, Any] pair, needs some extra thought. First, some values are numbers such as integers or floats; these are good hashes by themselves (e.g. 1996 is 1996: equal numbers are identical in semantics, with no collision), so they do not need another hash function. The same goes for booleans: 0 and 1 are representative enough. Strings can be handled with the MD5 hash above. What about lists, tuples and dicts? We can serialize them into JSON strings and then apply our string hash. The final hash function looks like the following:

        import json

        def _any_hash(self, v):
            try:
                return int(v)  # parse int parameter
            except (ValueError, TypeError):  # TypeError covers containers like lists/dicts
                try:
                    return float(v)  # parse float parameter
                except (ValueError, TypeError):
                    if not v:
                        # ignore it when the parameter is empty
                        return 0
                    if isinstance(v, str):
                        v = v.strip()
                        if v.lower() in {'true', 'yes'}:  # parse boolean parameter
                            return 1
                        if v.lower() in {'false', 'no'}:
                            return 0
                    if isinstance(v, (tuple, dict, list)):
                        v = json.dumps(v, sort_keys=True)  # serialize containers to JSON first
            return int(self.hash(str(v).encode('utf-8')).hexdigest(), base=16)
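    As a quick sanity check, we can call _any_hash directly by passing a stand-in object for self that provides the .hash attribute (a sketch, assuming the function above is defined at module level):

        import hashlib
        from types import SimpleNamespace

        fake_self = SimpleNamespace(hash=hashlib.md5)

        print(_any_hash(fake_self, '1996'))      # 1996: numeric strings are their own hash
        print(_any_hash(fake_self, 'yes'))       # 1: parsed as a boolean
        print(_any_hash(fake_self, ''))          # 0: empty values are ignored
        print(_any_hash(fake_self, 'Drama'))     # a huge MD5-based integer, bounded later by max_val
        print(_any_hash(fake_self, ['a', 'b']))  # serialized to JSON, then string-hashed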

    If you apply this directly, you will get extremely big integers as embedding values; so big that you don't even want to look at them, let alone store them (for numerical-stability reasons). So we need to bound them. Remember that in the full-text feature hashing example, we introduced n_dim to "horizontally" bound the dimensions of the embedding vector. We can follow the same spirit and introduce another variable max_val to "vertically" bound the values of the vector:

        import hashlib

        import numpy as np
        from jina import DocumentArray, Executor, requests


        class TagsHasher(Executor):
            def __init__(self, n_dim: int = 256, max_val: int = 65536, sparse: bool = False, **kwargs):
                super().__init__(**kwargs)
                self.n_dim = n_dim  # horizontal bound: length of the embedding vector
                self.max_val = max_val  # vertical bound: cap on each hashed value
                self.hash = hashlib.md5
                self.sparse = sparse

    Here we give max_val a larger value than n_dim. This is because the likelihood of a collision in the vertical direction (among column values) is in general much higher than in the horizontal direction (among column names); otherwise, it would imply there are more variants in the column names than in the column values, which would suggest the table-maker simply "transpose" the whole table for better readability.
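    A rough back-of-the-envelope illustration of this asymmetry, using the ten column names of our film table against a hypothetical pool of distinct cell values:

        import hashlib

        h = lambda x, mod: int(hashlib.md5(str(x).encode('utf-8')).hexdigest(), base=16) % mod

        cols = ['Year', 'Length', 'Title', 'Subject', 'Actor',
                'Actress', 'Director', 'Popularity', 'Awards', '*Image']
        print(len(cols) - len({h(c, 256) for c in cols}))  # collisions among 10 names: almost surely 0

        values = [f'value-{i}' for i in range(5000)]  # hypothetical pool of distinct cell values
        print(len(values) - len({h(v, 256) for v in values}))    # bounded by 256: thousands collide
        print(len(values) - len({h(v, 65536) for v in values}))  # bounded by 65536: only a few hundred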

    The final embedding procedure is then very simple:

        @requests
        def encode(self, docs: DocumentArray, **kwargs):
            if self.sparse:
                from scipy.sparse import csr_matrix

            for idx, doc in enumerate(docs):
                if doc.tags:
                    idxs, data = [], []  # sparse
                    table = np.zeros(self.n_dim)  # dense
                    for k, v in doc.tags.items():
                        h = self._any_hash(k)  # hash the column name
                        sign_h = np.sign(h)
                        col = h % self.n_dim
                        val = self._any_hash(v)  # hash the column value
                        sign_v = np.sign(val)
                        val = val % self.max_val
                        idxs.append((0, col))
                        val = sign_h * sign_v * val
                        data.append(val)
                        table[col] += val

                    if self.sparse:
                        doc.embedding = csr_matrix(
                            (data, zip(*idxs)), shape=(1, self.n_dim)
                        )
                    else:
                        doc.embedding = table

    Let’s encode our loaded DocumentArray:

        from jina import Executor

        th = Executor.load_config('jinahub://TagsHasher')
        th.encode(da)
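    As a quick sanity check on the result (assuming the default dense mode with n_dim=256):

        print(da[0].embedding.shape)              # (256,): one fixed-size vector per Document
        print(int((da[0].embedding != 0).sum()))  # at most one non-zero bucket per non-empty tag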

    Next, let's express the filters themselves as Documents, with the filtering conditions stored in their .tags, and encode them with TagsHasher to get the embeddings; a sketch follows.
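    Here is a minimal construction of the three filters used below (their contents are taken from the printed filters later in this article):

        from jina import Document, DocumentArray

        qa = DocumentArray([
            Document(tags={'Subject': 'Comedy'}),
            Document(tags={'Year': 1987}),
            Document(tags={'Subject': 'Comedy', 'Year': 1987}),
        ])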

        th.encode(qa)

    Now that we have embeddings for both the indexed docs da (our film CSV table) and the query docs qa (our filters), we can use the .match function to find the nearest neighbours:

        qa.match(da, limit=5, exclude_self=True, metric='jaccard', use_scipy=True)

    Note that here I use the Jaccard distance instead of the cosine distance. This is because the closeness of the values at each dimension is meaningless: each value is the output of a hash function, so only whether two vectors agree on a dimension matters, not by how much they differ. In the FeatureHasher example, by contrast, the value represented the term frequency of a word, so closeness was meaningful there. Keep this in mind when using TagsHasher.
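    A toy illustration with hypothetical 4-dimensional embeddings: the Jaccard distance counts on how many active dimensions two vectors disagree:

        import numpy as np

        # Hypothetical embeddings agreeing on two hashed columns, differing on one
        u = np.array([0, 123, 0, 45])
        v = np.array([0, 123, 99, 45])

        active = (u != 0) | (v != 0)  # dimensions where either vector is non-zero
        print(((u != v) & active).sum() / active.sum())  # 0.333..., same as scipy's 'jaccard' here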

    Finally, let's see some results. Here I only print the top-5 matches.

        for d in qa:
            print('my filter is:', d.tags.json())
            for m in d.matches:
                print(m.tags.json())

    {"Subject": "Comedy"}

        {
          "*Image": "NicholasCage.png",
          "Actor": "Chase, Chevy",
          "Actress": "",
          "Awards": "No",
          "Director": "",
          "Length": "",
          "Popularity": "82",
          "Subject": "Comedy",
          "Title": "Valkenvania",
          "Year": "1990"
        }
        {
          "*Image": "paulNewman.png",
          "Actor": "Newman, Paul",
          "Actress": "",
          "Awards": "No",
          "Director": "",
          "Length": "",
          "Popularity": "28",
          "Subject": "Comedy",
          "Title": "Secret War of Harry Frigg, The",
          "Year": "1968"
        }
        {
          "*Image": "NicholasCage.png",
          "Actor": "Murphy, Eddie",
          "Actress": "",
          "Awards": "No",
          "Director": "",
          "Length": "",
          "Popularity": "56",
          "Subject": "Comedy",
          "Title": "Best of Eddie Murphy, Saturday Night Live, The",
          "Year": "1989"
        }
        {
          "*Image": "NicholasCage.png",
          "Actor": "Mastroianni, Marcello",
          "Actress": "",
          "Awards": "No",
          "Director": "Fellini, Federico",
          "Popularity": "29",
          "Subject": "Comedy",
          "Title": "Ginger & Fred",
          "Year": "1993"
        }
        {
          "*Image": "NicholasCage.png",
          "Actor": "Piscopo, Joe",
          "Actress": "",
          "Awards": "No",
          "Director": "",
          "Length": "60",
          "Popularity": "14",
          "Subject": "Comedy",
          "Title": "Joe Piscopo New Jersey Special",
          "Year": "1987"
        }

    {"Year": 1987.0}

    {"Subject": "Comedy", "Year": 1987.0}

        {
          "*Image": "NicholasCage.png",
          "Actor": "Piscopo, Joe",
          "Actress": "",
          "Awards": "No",
          "Director": "",
          "Length": "60",
          "Popularity": "14",
          "Subject": "Comedy",
          "Title": "Joe Piscopo New Jersey Special",
          "Year": "1987"
        }
        {
          "*Image": "NicholasCage.png",
          "Actor": "Murphy, Eddie",
          "Actress": "",
          "Awards": "No",
          "Director": "Murphy, Eddie",
          "Length": "90",
          "Popularity": "51",
          "Subject": "Comedy",
          "Title": "Eddie Murphy Raw",
          "Year": "1987"
        }
        {
          "*Image": "NicholasCage.png",
          "Actor": "McCarthy, Andrew",
          "Actress": "Cattrall, Kim",
          "Awards": "No",
          "Director": "Gottlieb, Michael",
          "Length": "",
          "Popularity": "23",
          "Subject": "Comedy",
          "Title": "Mannequin",
          "Year": "1987"
        }
        {
          "*Image": "NicholasCage.png",
          "Actor": "Williams, Robin",
          "Actress": "",
          "Awards": "No",
          "Director": "Levinson, Barry",
          "Length": "120",
          "Popularity": "37",
          "Subject": "Comedy",
          "Title": "Good Morning, Vietnam",
          "Year": "1987"
        }
        {
          "*Image": "NicholasCage.png",
          "Actor": "Boys, The Fat",
          "Actress": "",
          "Awards": "No",
          "Director": "Schultz, Michael",
          "Length": "86",
          "Popularity": "69",
          "Subject": "Comedy",
          "Title": "Disorderlies",
          "Year": "1987"
        }

    Not bad!