Document

    Empty document

    From attributes

    from jina import Document
    import numpy

    d1 = Document(text='hello')
    d2 = Document(buffer=b'\f1')
    d3 = Document(blob=numpy.array([1, 2, 3]))
    d4 = Document(uri='https://jina.ai',
                  mime_type='text/plain',
                  granularity=1,
                  adjacency=3,
                  tags={'foo': 'bar'})

    <jina.types.document.Document ('id', 'mime_type', 'text') at 4483297360>
    <jina.types.document.Document ('id', 'buffer') at 5710817424>
    <jina.types.document.Document ('id', 'blob') at 4483299536>
    <jina.types.document.Document id=e01a53bc-aedb-11eb-88e6-1e008a366d48 uri=https://jina.ai mimeType=text/plain tags={'foo': 'bar'} granularity=1 adjacency=3 at 6317309200>

    From another Document

    from jina import Document

    d = Document(content='hello, world!')
    d1 = d
    assert id(d) == id(d1)  # True

    To make a deep copy, use copy=True:

    d1 = Document(d, copy=True)
    assert id(d) != id(d1)  # d1 is now a separate object

    From dict or JSON string

    from jina import Document
    import json

    d = {'id': 'hello123', 'content': 'world'}
    d1 = Document(d)

    d = json.dumps({'id': 'hello123', 'content': 'world'})
    d2 = Document(d)

    Parsing unrecognized fields

    Unrecognized fields in a dict/JSON string are automatically put into the Document’s .tags field:

    from jina import Document

    d1 = Document({'id': 'hello123', 'foo': 'bar'})

    <jina.types.document.Document id=hello123 tags={'foo': 'bar'} at 6320791056>

    You can use field_resolver to map external field names to Document attributes:

    from jina import Document

    d1 = Document({'id': 'hello123', 'foo': 'bar'}, field_resolver={'foo': 'content'})

    <jina.types.document.Document id=hello123 mimeType=text/plain text=bar at 6246985488>
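
    The routing described above can be pictured with a small plain-Python sketch. It is not Jina's implementation: KNOWN_FIELDS and resolve_fields are hypothetical names, and the set of known fields is an arbitrary subset chosen for illustration.

```python
# Toy model only: KNOWN_FIELDS and resolve_fields are hypothetical, not Jina API.
KNOWN_FIELDS = {'id', 'text', 'content', 'uri', 'mime_type'}


def resolve_fields(raw, field_resolver=None):
    """Rename keys via field_resolver, then move unknown keys into 'tags'."""
    field_resolver = field_resolver or {}
    resolved = {'tags': {}}
    for key, value in raw.items():
        key = field_resolver.get(key, key)  # map external name to attribute name
        if key in KNOWN_FIELDS:
            resolved[key] = value
        else:
            resolved['tags'][key] = value  # unrecognized field ends up in tags
    return resolved


without_resolver = resolve_fields({'id': 'hello123', 'foo': 'bar'})
with_resolver = resolve_fields({'id': 'hello123', 'foo': 'bar'},
                               field_resolver={'foo': 'content'})
print(without_resolver)
print(with_resolver)
```

    Without a resolver the unknown key foo lands under tags; with the resolver it is renamed to content before the lookup, so tags stays empty.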

    Set/unset attributes

    Set

    Set an attribute as you would with any Python object:

    from jina import Document

    d = Document()
    d.text = 'hello world'

    <jina.types.document.Document id=9badabb6-b9e9-11eb-993c-1e008a366d49 mime_type=text/plain text=hello world at 4444621648>

    Unset

    d.text = None

    or

    d.pop('text')

    <jina.types.document.Document id=cdf1dea8-b9e9-11eb-8fd8-1e008a366d49 mime_type=text/plain at 4490447504>

    Unset multiple attributes

    d.pop('text', 'id', 'mime_type')

    text, blob, and buffer are the three content attributes of a Document. They correspond to string-like data (e.g. for natural language), ndarray-like data (e.g. for image/audio/video data), and binary data for general purpose, respectively. Each Document can contain only one type of content.

    Exclusivity of the content

    Note that one Document can only contain one type of content: either text, buffer, or blob. If you set one, the others will be cleared.

    import numpy as np
    from jina import Document

    d = Document(text='hello')
    d.blob = np.array([1])
    d.text  # <- now it's empty

    Why a Document contains only one data type

    What if you want to represent more than one kind of information? Say, to fully represent a PDF page you need to store both an image and text. In this case, you can use nested Documents, putting the image into one sub-Document and the text into another.

    d = Document(chunks=[Document(blob=...), Document(text=...)])

    The principle is that each Document contains only one modality, which keeps the overall logic clear.

    Tip

    There is also a doc.content getter/setter, which is syntactic sugar for whichever of the fields above is non-empty. On assignment, the value is automatically routed to the text, buffer, or blob field based on its type.
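
    To make the exclusivity rule and the content shortcut concrete, here is a minimal plain-Python sketch. ToyDocument is a hypothetical stand-in and does not mirror Jina's internals. In particular, treating every value that is neither str nor bytes as blob is an assumption made for illustration.

```python
# Toy model only: ToyDocument is hypothetical and does not mirror Jina internals.
class ToyDocument:
    _CONTENT_FIELDS = ('text', 'buffer', 'blob')

    def __init__(self):
        self.text = None
        self.buffer = None
        self.blob = None

    @property
    def content_type(self):
        """Name of the single non-empty content field, or None."""
        for name in self._CONTENT_FIELDS:
            if getattr(self, name) is not None:
                return name
        return None

    @property
    def content(self):
        name = self.content_type
        return getattr(self, name) if name else None

    @content.setter
    def content(self, value):
        # Clear every content field first so that only one stays set.
        for name in self._CONTENT_FIELDS:
            setattr(self, name, None)
        if isinstance(value, str):
            self.text = value
        elif isinstance(value, bytes):
            self.buffer = value
        else:
            self.blob = value  # assumption: anything else is array-like


d = ToyDocument()
d.content = 'hello'
first = (d.content_type, d.content)
d.content = b'\x01\x02'
second = (d.content_type, d.text)
print(first, second)
```

    After the second assignment the text field is empty again, which is the same behaviour the text/blob example above shows for a real Document.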

    After setting .uri, you can load data into .text/.buffer/.blob as follows.

    The value of .uri can point to either local URI, remote URI or data URI.

    Local image URI

    from jina import Document

    d1 = Document(uri='apple.png').load_uri_to_image_blob()
    print(d1.content_type, d1.content)

    blob [[[255 255 255]
      [255 255 255]
      [255 255 255]
      ...

    Remote text URI

    d1 = Document(uri='https://www.gutenberg.org/files/1342/1342-0.txt').load_uri_to_text()
    print(d1.content_type, d1.content)

    text The Project Gutenberg eBook of Pride and Prejudice, by Jane Austen
    most other parts of the wor

    Inline data URI

    from jina import Document

    d1 = Document(uri='''data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAUA
    AAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO
    9TXL0Y4OHwAAAABJRU5ErkJggg==
    ''').load_uri_to_image_blob()
    print(d1.content_type, d1.content)

    blob [[[255 255 255]
      [255 0 0]
      [255 0 0]
      [255 0 0]
      [255 255 255]]
      ...

    There are more .load_uri_to_* functions that allow you to read text, image, 3D mesh, and tabular data into Jina.

    Write to data URI

    Inline data URI is helpful when you need a quick visualization in HTML, as it embeds all resources directly into that HTML.

    You can convert a URI to a data URI using doc.load_uri_to_datauri(). This will fetch the resource and make it inline.
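
    For readers unfamiliar with the data URI format itself, the following standard-library sketch shows how raw bytes are made inline and recovered again. It does not use Jina, and to_data_uri and from_data_uri are hypothetical helper names introduced only for this illustration.

```python
import base64


def to_data_uri(payload: bytes, mime_type: str) -> str:
    """Encode raw bytes as a base64 data URI."""
    encoded = base64.b64encode(payload).decode('ascii')
    return f'data:{mime_type};base64,{encoded}'


def from_data_uri(uri: str) -> bytes:
    """Decode the payload of a base64 data URI back into bytes."""
    header, encoded = uri.split(',', 1)
    if not (header.startswith('data:') and header.endswith(';base64')):
        raise ValueError('not a base64 data URI')
    return base64.b64decode(encoded)


uri = to_data_uri(b'hello, world', 'text/plain')
restored = from_data_uri(uri)
print(uri)
print(restored)
```

    Because the whole payload travels inside the string, such a URI can be placed directly in an HTML attribute without a separate file or request.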

    Embedding

    An embedding is a multi-dimensional representation of a Document (often a [1, D] vector). It is a central building block of neural search.

    Document has an attribute embedding to contain the embedding information.

    Like .blob, you can assign it with a Python (nested) List/Tuple, Numpy ndarray, SciPy sparse matrix (spmatrix), TensorFlow dense and sparse tensor, PyTorch dense and sparse tensor, or PaddlePaddle dense tensor.

    import numpy as np
    import scipy.sparse as sp
    import torch
    import tensorflow as tf
    from jina import Document

    d0 = Document(embedding=[1, 2, 3])
    d1 = Document(embedding=np.array([1, 2, 3]))
    d2 = Document(embedding=np.array([[1, 2, 3], [4, 5, 6]]))
    d3 = Document(embedding=sp.coo_matrix([0, 0, 0, 1, 0]))
    d4 = Document(embedding=torch.tensor([1, 2, 3]))
    d5 = Document(embedding=tf.sparse.from_dense(np.array([[1, 2, 3], [4, 5, 6]])))

    On multiple Documents

    This is syntactic sugar for a single Document, which leverages embed() underneath. To embed multiple Documents, do not use this feature in a for-loop. Instead, read more details in Embed via model.

    Once a Document has .blob set, you can use a deep neural network to embed() it, which means filling Document.embedding. For example, our Document looks like the following:

    q = (Document(uri='/Users/hanxiao/Downloads/left/00003.jpg')
         .load_uri_to_image_blob()
         .set_image_blob_normalization()
         .set_image_blob_channel_axis(-1, 0))

    Let’s embed it into vector via ResNet:

    import torchvision

    model = torchvision.models.resnet50(pretrained=True)
    q.embed(model)

    On multiple Documents

    This is syntactic sugar for a single Document, which leverages match() underneath. To match multiple Documents, do not use this feature in a for-loop. Instead, match at the DocumentArray level.

    Once a Document has .embedding filled, it can be “matched”. In this example, we build ten Documents and put them into a DocumentArray, and then use another Document to search against them.

    from jina import DocumentArray, Document
    import numpy as np

    da = DocumentArray.empty(10)
    da.embeddings = np.random.random([10, 256])

    q = Document(embedding=np.random.random([256]))
    q.match(da)
    print(q.matches[0])

    <jina.types.document.Document ('id', 'embedding', 'adjacency', 'scores') at 8256118608>
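
    Conceptually, matching is a nearest-neighbour search over embeddings. The sketch below shows that idea in plain Python. It assumes cosine distance as the metric and uses hypothetical names (cosine_distance, toy_match); it is not how Jina implements match() and ignores the options the real method offers.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def toy_match(query, candidates, limit=3):
    """Return (candidate_id, distance) pairs sorted from nearest to farthest."""
    scored = [(cid, cosine_distance(query, emb)) for cid, emb in candidates.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:limit]


candidates = {
    'a': [1.0, 0.0],
    'b': [0.0, 1.0],
    'c': [1.0, 1.0],
}
matches = toy_match([1.0, 0.1], candidates, limit=2)
print(matches)
```

    The query points almost along the first axis, so candidate a ranks first and the diagonal candidate c second.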

    A Document can be nested both horizontally and vertically. The following graphic illustrates the recursive Document structure. Each Document can have multiple “chunks” and “matches”, which are Documents as well.

    Attribute          Description
    doc.chunks         The list of sub-Documents of this Document. They have granularity + 1 but the same adjacency.
    doc.matches        The list of matched Documents of this Document. They have adjacency + 1 but the same granularity.
    doc.granularity    The recursion “depth” of the recursive chunks structure.
    doc.adjacency      The recursion “width” of the recursive match structure.

    • Add in constructor:

      d = Document(chunks=[Document(), Document()], matches=[Document(), Document()])
    • Add to existing Document:

      d = Document()
      d.chunks = [Document(), Document()]
      d.matches = [Document(), Document()]
    • Add to existing doc.chunks or doc.matches:

    Note

    doc.chunks returns a ChunkArray and doc.matches returns a MatchArray. Both are sub-classes of DocumentArray. We will introduce DocumentArray later.

    Caveat: order matters

    When adding sub-Documents to Document.chunks, avoid creating them in one line, otherwise the recursive Document structure will not be correct. This is because chunks use ref_doc to control their granularity. At chunk creation time the chunk doesn’t know anything about its parent, and will get a wrong granularity value.

    ✅ Do

    from jina import Document

    root_document = Document(text='i am root')
    # add one chunk to root
    root_document.chunks.append(Document(text='i am chunk 1'))
    # add several chunks to root at once
    root_document.chunks.extend([
        Document(text='i am chunk 2'),
        Document(text='i am chunk 3'),
    ])

    😔 Don’t

    from jina import Document

    # chunks are created before the root exists, so their granularity is wrong
    root_document = Document(
        text='i am root',
        chunks=[
            Document(text='i am chunk 2'),
            Document(text='i am chunk 3'),
        ]
    )
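
    Why the order matters can be seen in a toy model. The classes ToyDoc and ToyChunks below are hypothetical and much simpler than Jina's ChunkArray. They only assume, as stated above, that the chunk container holds a reference to its parent and uses it to set the granularity of each appended chunk.

```python
# Toy model only: ToyDoc and ToyChunks are hypothetical, not Jina classes.
class ToyChunks(list):
    def __init__(self, ref_doc):
        super().__init__()
        self._ref_doc = ref_doc  # the parent this container belongs to

    def append(self, doc):
        # The chunk learns its depth from the parent at append time.
        doc.granularity = self._ref_doc.granularity + 1
        super().append(doc)


class ToyDoc:
    def __init__(self, text):
        self.text = text
        self.granularity = 0
        self.chunks = ToyChunks(ref_doc=self)


root = ToyDoc('i am root')
chunk = ToyDoc('i am chunk 1')
root.chunks.append(chunk)

sub_chunk = ToyDoc('i am a chunk of chunk 1')
chunk.chunks.append(sub_chunk)

levels = (root.granularity, chunk.granularity, sub_chunk.granularity)
print(levels)
```

    In this model a chunk only gets the right depth if its parent already has the right depth when append is called, which is why building the tree top-down is the safe order.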

    Tags

    Document has a tags attribute that holds a map-like structure with arbitrary values. In practice, you can store meta information in tags.

    from jina import Document

    doc = Document(tags={'dimensions': {'height': 5.0, 'weight': 10.0, 'last_modified': 'Monday'}})
    doc.tags['dimensions']

    {'weight': 10.0, 'height': 5.0, 'last_modified': 'Monday'}

    To provide easy access to nested fields, the Document allows you to access attributes by composing the attribute qualified name with interlaced __ symbols:

    from jina import Document

    doc = Document(tags={'dimensions': {'height': 5.0, 'weight': 10.0}})
    doc.tags__dimensions__weight

    10.0
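
    The double-underscore notation is simply a path through nested mappings. A minimal sketch of resolving such a path on plain dicts follows. get_nested is a hypothetical helper, not a Jina function.

```python
def get_nested(data, dunder_path):
    """Follow a '__'-separated path through nested dicts."""
    current = data
    for key in dunder_path.split('__'):
        current = current[key]  # raises KeyError if a segment is missing
    return current


tags = {'dimensions': {'height': 5.0, 'weight': 10.0}}
weight = get_nested({'tags': tags}, 'tags__dimensions__weight')
print(weight)
```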

    This also allows the access of nested metadata attributes in bulk from a DocumentArray.

    from jina import Document, DocumentArray

    da = DocumentArray([Document(tags={'dimensions': {'height': 5.0, 'weight': 10.0}}) for _ in range(10)])
    da.get_attributes('tags__dimensions__height', 'tags__dimensions__weight')

    [[5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0], [10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0]]

    Note

    As tags does not have a fixed schema, it is declared with type google.protobuf.Struct in the DocumentProto protobuf declaration. However, google.protobuf.Struct follows the JSON specification and does not differentiate int from float. As a result, data of type int in tags is always cast to float when a request is sent to an Executor.

    Users therefore need to be explicit and cast the data to the expected type, as follows:

    ✅ Do

    from jina import Document, DocumentArray, Executor, Flow, requests

    class MyIndexer(Executor):
        animals = ['cat', 'dog', 'turtle']

        @requests
        def foo(self, docs, parameters: dict, **kwargs):
            for doc in docs:
                # need to cast to int since list indices must be integers, not float
                index = int(doc.tags['index'])
                assert self.animals[index] == 'dog'

    with Flow().add(uses=MyIndexer) as f:
        f.post(on='/endpoint',
               inputs=DocumentArray([Document(tags={'index': 1})]))

    😔 Don’t

    from jina import Document, DocumentArray, Executor, Flow, requests

    class MyIndexer(Executor):
        animals = ['cat', 'dog', 'turtle']

        @requests
        def foo(self, docs, parameters: dict, **kwargs):
            for doc in docs:
                # ERROR: list indices must be integers, not float
                index = doc.tags['index']
                assert self.animals[index] == 'dog'

    with Flow().add(uses=MyIndexer) as f:
        f.post(on='/endpoint',
               inputs=DocumentArray([Document(tags={'index': 1})]))
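
    The failure mode can be reproduced without Jina. The sketch below assumes, for illustration, that a tag originally set to the int 1 has arrived on the Executor side as the float 1.0, and shows what happens with and without the cast.

```python
animals = ['cat', 'dog', 'turtle']

# Assumption for illustration: the tag was set as the int 1 but arrives as 1.0.
received_tags = {'index': 1.0}

try:
    animals[received_tags['index']]
    failed = False
except TypeError:
    failed = True  # a float cannot be used as a list index

# Casting back to int restores the intended lookup.
picked = animals[int(received_tags['index'])]
print(failed, picked)
```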

    You can serialize a Document into a JSON string via to_json(), a Python dict via to_dict(), or a binary string via to_bytes():

    JSON

    from jina import Document

    Document(content='hello, world', embedding=[1, 2, 3]).to_json()

    {
      "embedding": [
        1,
        2,
        3
      ],
      "id": "9e36927e576b11ec81971e008a366d48",
      "mime_type": "text/plain",
      "text": "hello, world"
    }

    Binary

    from jina import Document

    bytes(Document(content='hello, world', embedding=[1, 2, 3]))

    b'\n aad94436576b11ec81551e008a366d48R\ntext/plainj\x0chello, world\x9a\x01+\n"\n\x18\x01\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x12\x01\x03\x1a\x03<i8\x1a\x05numpy'

    Dict

    from jina import Document

    Document(content='hello, world', embedding=[1, 2, 3]).to_dict()

    Visualization

    import numpy as np
    from jina import Document

    d0 = Document(id='🐲', embedding=np.array([0, 0]))
    d1 = Document(id='🐦', embedding=np.array([1, 0]))
    d2 = Document(id='🐢', embedding=np.array([0, 1]))
    d3 = Document(id='🐯', embedding=np.array([1, 1]))

    d0.chunks.append(d1)
    d0.chunks[0].chunks.append(d2)
    d0.matches.append(d3)

    [Figure: four-symbol-docs.svg]