CephFS Distributed Metadata Cache

    CephFS clients can request that the MDS fetch or change inode metadata on their behalf, but an MDS can also grant the client capabilities (aka caps) for each inode (see Capabilities in CephFS).

    A capability grants the client the ability to cache and possibly manipulate some portion of the data or metadata associated with the inode. When another client needs access to the same information, the MDS will revoke the capability and the client will eventually return it, along with an updated version of the inode’s metadata (in the event that it made changes to it while it held the capability).

    Clients can request capabilities and will generally get them, but when there is competing access or memory pressure on the MDS, they may be revoked. When a capability is revoked, the client is responsible for returning it as soon as it is able. Clients that fail to do so in a timely fashion may end up blacklisted and unable to communicate with the cluster.
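    The grant, revoke, and return cycle described above can be illustrated with a toy model. This is a minimal sketch, not the MDS implementation: the class ToyMDS, its methods, and the client names are all hypothetical, and it models a single exclusive capability per inode, whereas real CephFS caps are finer grained.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMDS:
    """Hypothetical toy model of cap handling; not the real MDS logic."""
    # inode number -> client currently holding the (exclusive) cap
    holders: dict = field(default_factory=dict)
    # inode number -> metadata as last known to the MDS
    metadata: dict = field(default_factory=dict)
    # (inode, client) pairs with a revoke sent but not yet answered
    pending_revokes: set = field(default_factory=set)

    def request_cap(self, client, ino):
        """Grant the cap if it is free; otherwise start revoking it."""
        holder = self.holders.get(ino)
        if holder is None or holder == client:
            self.holders[ino] = client
            return True
        # Competing access: ask the current holder to give the cap back.
        self.pending_revokes.add((ino, holder))
        return False

    def return_cap(self, client, ino, updated=None):
        """The holder returns the cap, with any metadata it changed."""
        if self.holders.get(ino) != client:
            raise ValueError(f"{client} does not hold a cap on inode {ino}")
        if updated is not None:
            self.metadata[ino] = updated
        del self.holders[ino]
        self.pending_revokes.discard((ino, client))


mds = ToyMDS(metadata={1: {"size": 0}})
granted_a = mds.request_cap("client.a", 1)        # cap is free, so it is granted
granted_b = mds.request_cap("client.b", 1)        # conflict: revoke of client.a begins
revoke_pending = (1, "client.a") in mds.pending_revokes
mds.return_cap("client.a", 1, updated={"size": 4096})
granted_b_retry = mds.request_cap("client.b", 1)  # cap is free again
```

    The sketch leaves out timeouts; a fuller model would track how long each entry stays in pending_revokes, since a client that does not return a revoked cap in time risks being cut off from the cluster.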

    When a client needs to query/change inode metadata or perform an operation on a directory, it has two options. It can make a request to the MDS directly, or serve the information out of its cache. With CephFS, the latter is only possible if the client has the necessary caps.

    Clients can send simple requests to the MDS to query or request changes to certain metadata. The replies to these requests may also grant the client a certain set of caps for the inode, allowing it to perform subsequent requests without consulting the MDS.
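    The choice between asking the MDS and serving from the local cache can be sketched as follows. All names here (ToyClient, stat_inode, handle_revoke, fake_mds_lookup) are hypothetical and do not correspond to the real client API; the sketch assumes a single cap per inode and an MDS reply that always grants it.

```python
class ToyClient:
    """Hypothetical client that caches inode attributes while it holds a cap."""

    def __init__(self, mds_lookup):
        # mds_lookup is an assumed callable: ino -> (attrs, cap_granted)
        self.mds_lookup = mds_lookup
        self.cache = {}        # ino -> attrs, trusted only while the cap is held
        self.caps = set()      # inodes for which this client holds a cap
        self.mds_requests = 0  # number of round trips to the MDS

    def stat_inode(self, ino):
        # Option 1: serve from the cache, allowed only with the necessary cap.
        if ino in self.caps and ino in self.cache:
            return self.cache[ino]
        # Option 2: ask the MDS; the reply may also grant a cap.
        self.mds_requests += 1
        attrs, cap_granted = self.mds_lookup(ino)
        if cap_granted:
            self.caps.add(ino)
            self.cache[ino] = attrs
        return attrs

    def handle_revoke(self, ino):
        # Once the cap is gone the cached copy can no longer be trusted.
        self.caps.discard(ino)
        self.cache.pop(ino, None)


def fake_mds_lookup(ino):
    """Stand-in for an MDS reply that always grants a cap."""
    return {"ino": ino, "mode": 0o644}, True


client = ToyClient(fake_mds_lookup)
first = client.stat_inode(7)    # no cap yet: goes to the MDS
second = client.stat_inode(7)   # cap held: served from the cache
requests_before_revoke = client.mds_requests
client.handle_revoke(7)
third = client.stat_inode(7)    # cap revoked: goes to the MDS again
```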

    Clients can also request caps directly from the MDS, which is necessary in order to read or write file data.

    Distributed Locks in an MDS Cluster

    If there are outstanding caps that would conflict with the locks an MDS needs for an operation, they must be revoked before the locks can be acquired. Once the competing caps are returned to the MDS, it can take the locks and perform the operation.

    On a filesystem served by multiple MDSs, the metadata cache is also distributed among the MDSs in the cluster. For every inode, at any given time, only one MDS in the cluster is considered authoritative. Any requests to change that inode must be handled by the authoritative MDS, though a non-authoritative MDS can forward requests to the authoritative one.
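    The forwarding rule can be sketched with a toy cluster in which each inode maps to exactly one authoritative rank. The names ToyMDSRank and change_inode, the shared cluster dictionary, and the inode numbers are assumptions made for illustration, not structures from the MDS code.

```python
class ToyMDSRank:
    """Hypothetical MDS rank that forwards changes to the authoritative rank."""

    def __init__(self, rank, cluster):
        self.rank = rank
        # Assumed shared state: {"auth": {ino: rank}, "ranks": {rank: ToyMDSRank}}
        self.cluster = cluster
        self.inodes = {}  # metadata for inodes this rank has changed as auth

    def change_inode(self, ino, **changes):
        """Apply changes if authoritative, else forward; return the handling rank."""
        auth = self.cluster["auth"][ino]
        if auth != self.rank:
            # A non-authoritative rank must not change the inode itself.
            return self.cluster["ranks"][auth].change_inode(ino, **changes)
        self.inodes.setdefault(ino, {}).update(changes)
        return self.rank


cluster = {"auth": {100: 0, 200: 1}, "ranks": {}}
for r in (0, 1):
    cluster["ranks"][r] = ToyMDSRank(r, cluster)

# Rank 0 receives a request for inode 200, which rank 1 is authoritative for.
handled_by = cluster["ranks"][0].change_inode(200, mode=0o600)
```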

    Non-auth MDSs can also obtain read locks that prevent the auth MDS from changing the data until the lock is dropped, so that they can serve inode info to the clients.
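    The effect of such a read lock can be illustrated with a reader/writer-style toy. This is only a sketch under simplifying assumptions: ToyInodeLock and its methods are hypothetical, there is one lock per inode, and a blocked write simply fails instead of waiting for the lock to be dropped.

```python
class ToyInodeLock:
    """Hypothetical per-inode lock shared by the ranks of a toy MDS cluster."""

    def __init__(self, auth_rank):
        self.auth_rank = auth_rank
        self.read_holders = set()  # ranks currently holding a read lock

    def take_read(self, rank):
        self.read_holders.add(rank)

    def drop_read(self, rank):
        self.read_holders.discard(rank)

    def try_write(self, rank):
        """Only the auth rank may write, and only if no other rank is reading."""
        if rank != self.auth_rank:
            return False
        return not (self.read_holders - {self.auth_rank})


lock = ToyInodeLock(auth_rank=0)
lock.take_read(1)                   # non-auth rank 1 wants to serve inode info
blocked = lock.try_write(0)         # auth rank cannot change the inode yet
lock.drop_read(1)
allowed = lock.try_write(0)         # read lock dropped: the change can proceed
non_auth_write = lock.try_write(1)  # non-auth ranks never write directly
```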