Entity Resolution Input Specification

{
  "attributes": {
    ATTRIBUTE_NAME: {
      "values": [
        ATTRIBUTE_VALUE,
        ...
      ],
      "params": {
        PARAM_NAME: PARAM_VALUE,
        ...
      }
    },
    ...
  },
  "terms": [
    TERM,
    ...
  ],
  "ids": {
    INDEX_NAME: [
      DOC_ID,
      ...
    ],
    ...
  },
  "scope": {
    "exclude": {
      "attributes": {
        ATTRIBUTE_NAME: [
          ATTRIBUTE_VALUE,
          ...
        ],
        ...
      },
      "indices": [
        INDEX_NAME,
        ...
      ],
      "resolvers": [
        RESOLVER_NAME,
        ...
      ]
    },
    "include": {
      "attributes": {
        ATTRIBUTE_NAME: [
          ATTRIBUTE_VALUE,
          ...
        ],
        ...
      },
      "indices": [
        INDEX_NAME,
        ...
      ],
      "resolvers": [
        RESOLVER_NAME,
        ...
      ]
    }
  },
  "model": ENTITY_MODEL
}

Entity resolution inputs are JSON documents. In the template shown above, lowercase quoted values (e.g. "attributes") are constant fields, uppercase literal values (e.g. ATTRIBUTE_NAME) are variable fields or values, and ellipses (...) are optional repetitions of the preceding field or value.

An entity resolution input has two objects where at least one must be present ("attributes" and "ids"), one optional object ("scope") and one object that is required only if entity_type is not specified in the endpoint of the request ("model"). Not all elements within these objects are required. Optional elements are noted in the descriptions of each element listed on this page. Some elements have alternate forms that are acceptable, and those are also noted in the descriptions of each element.
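The presence rule above can be checked before a request is sent. The following is a minimal sketch, not part of zentity: the function name and the input_fields parameter are hypothetical, and the default follows the sentence above, which lists only "attributes" and "ids". This page does not say whether "terms" alone is sufficient, so that choice is left to the caller.

```python
def has_initial_input(body: dict, input_fields=("attributes", "ids")) -> bool:
    """Return True if at least one of the input-bearing objects is present and non-empty.

    `input_fields` defaults to the two objects named in the specification text.
    An empty object such as {"attributes": {}} is treated as absent (an assumption).
    """
    return any(body.get(field) for field in input_fields)


# Usage: a body with only "scope" carries no initial input.
print(has_initial_input({"ids": {"users": ["1234567890"]}}))
print(has_initial_input({"scope": {"include": {"indices": ["users"]}}}))
```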

"attributes"

Model

{
  "attributes": {
    ATTRIBUTE_NAME: {
      "values": [
        ATTRIBUTE_VALUE,
        ...
      ],
      "params": {
        PARAM_NAME: PARAM_VALUE,
        ...
      }
    },
    ...
  }
}

Example

{
  "attributes": {
    "name": {
      "values": [
        "Allie Jones",
        "Allison Jones-Smith"
      ]
    },
    "phone": {
      "values": [
        "555-123-4567"
      ],
      "params": {
        "fuzziness": "auto"
      }
    },
    "dob": {
      "params": {
        "format": "yyyy-MM-dd"
      }
    }
  }
}

Shorthand Model

{
  "attributes": {
    ATTRIBUTE_NAME: [
      ATTRIBUTE_VALUE,
      ...
    ],
    ...
  }
}

Shorthand Example

{
  "attributes": {
    "name": [
      "Allie Jones",
      "Allison Jones-Smith"
    ],
    "phone": [
      "555-123-4567"
    ]
  }
}

Attributes are elements that can assist the identification and resolution of entities. For example, some common attributes of a person include name, date of birth, and phone number. Each attribute has its own particular data qualities and purposes in the real world. Therefore, zentity matches the values of each attribute using logic that is distinct to each attribute.

Some attributes can be matched using different methods. For example, a name could be matched by its exact value or its phonetic value. Therefore, the entity model allows each attribute to have one or more matchers. A matcher is simply a clause of a "bool" query in Elasticsearch. This means that if any matcher of an attribute yields a match for a given value, then the attribute is considered a match regardless of the results of the other matchers.
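To illustrate the "any matcher is enough" behavior, the sketch below combines matcher clauses as "should" clauses of an Elasticsearch "bool" query. This is an assumption about the general shape only; the query zentity actually generates may differ, and the function name and the index field names ("name.keyword", "name.phonetic") are hypothetical.

```python
def attribute_clause(matcher_clauses: list) -> dict:
    """Combine matcher clauses so that a match on any one of them is sufficient."""
    return {
        "bool": {
            "should": list(matcher_clauses),
            # Require at least one matcher clause to match.
            "minimum_should_match": 1,
        }
    }


# Two hypothetical matchers for the "name" attribute: exact and phonetic.
exact = {"term": {"name.keyword": "Allie Jones"}}
phonetic = {"match": {"name.phonetic": "Allie Jones"}}
name_clause = attribute_clause([exact, phonetic])
```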

"attributes".ATTRIBUTE_NAME

A field with the name of a distinct attribute. Some examples might be "name", "dob", "phone", etc.

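The value of the field can be one of two things: an object that contains "values", "params", or both (as in the Model above), or an array of attribute values (as in the Shorthand Model above).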

At least one attribute must be specified, otherwise there would be no input to supply to the resolution job.
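Client code that accepts both the full and the shorthand form may find it convenient to expand the shorthand into the full form first. The helper below is a hypothetical sketch of that step, not zentity's own parsing logic.

```python
def normalize_attributes(attributes: dict) -> dict:
    """Expand shorthand arrays into the full {"values": [...], "params": {...}} form."""
    normalized = {}
    for name, value in attributes.items():
        if isinstance(value, list):
            # Shorthand form: the array is the list of values, with no params.
            normalized[name] = {"values": list(value), "params": {}}
        elif isinstance(value, dict):
            # Full form: both fields are optional, so default each one.
            normalized[name] = {
                "values": list(value.get("values", [])),
                "params": dict(value.get("params", {})),
            }
        else:
            raise TypeError(f'attribute "{name}" must be an object or an array')
    return normalized


shorthand = {"name": ["Allie Jones", "Allison Jones-Smith"], "phone": ["555-123-4567"]}
full = normalize_attributes(shorthand)
```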

"attributes".ATTRIBUTE_NAME."values"

An array of attribute values. These values will serve as the initial inputs to the resolution job.

Each value must conform to the respective attribute type specified in the entity model. For example, string values must be JSON compliant string values, number values must be JSON compliant number values, and date values must include a "format" field in the "params" object if it was not already specified in either the attribute or matcher of the entity model.

This field is not necessarily required. It is valid to specify an attribute with no values in order to override the "params" of the attribute or matcher of the entity model. One reason might be to override the "format" param of a "date" attribute, which would affect the format of any date values for that attribute returned by the resolution job.

At least one attribute must have the "values" field specified, otherwise there would be no input to supply to the resolution job.
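A quick pre-flight check for this rule might look like the sketch below. The function names are hypothetical, and the check accepts either the full or the shorthand form of each attribute.

```python
def attribute_values(value) -> list:
    """Return the values of one attribute given in either full or shorthand form."""
    if isinstance(value, list):
        return value
    if isinstance(value, dict):
        return value.get("values", [])
    return []


def has_any_values(attributes: dict) -> bool:
    """Return True if at least one attribute supplies at least one value."""
    return any(attribute_values(value) for value in attributes.values())


# An attribute that only overrides params contributes no input values.
print(has_any_values({"dob": {"params": {"format": "yyyy-MM-dd"}}}))
```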

"attributes".ATTRIBUTE_NAME."params"

An optional object that passes arbitrary variables ("params") to the matcher clauses.

"attributes".ATTRIBUTE_NAME."params".PARAM_NAME

A field with the name of a distinct param for the attribute. Some examples might be "fuzziness" or "format".

"attributes".ATTRIBUTE_NAME."params".PARAM_NAME.PARAM_VALUE

A value for the param. This can be any JSON compliant value such as a string, number, boolean, array, or object. The value will be serialized as a string when passed to the matcher clause. The value overrides the same field specified in "attributes".ATTRIBUTE_NAME."params" in the model and "matchers".MATCHER_NAME."params".
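The override behavior can be pictured as a layered merge in which the input wins. The sketch below rests on two assumptions that this page does not confirm: that params on the model's attribute take precedence over params on the matcher, and that non-string values are serialized as JSON text. The function name is hypothetical.

```python
import json


def resolve_params(matcher_params: dict, model_attribute_params: dict, input_params: dict) -> dict:
    """Merge params so that input values override those defined in the entity model."""
    # Later dictionaries win. The order of the first two is an assumption.
    merged = {**matcher_params, **model_attribute_params, **input_params}
    # Assumed serialization: strings pass through, everything else becomes JSON text.
    return {
        name: value if isinstance(value, str) else json.dumps(value)
        for name, value in merged.items()
    }


params = resolve_params({"fuzziness": "0"}, {"fuzziness": "1"}, {"fuzziness": "auto"})
```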

"terms"

Model

{
  "terms": [
    TERM,
    ...
  ]
}

Example

{
  "terms": [
    "Allie Jones",
    "Allison Jones-Smith",
    "555-123-4567"
  ]
}

The "terms" field allows the first iteration of a resolution job to begin with inputs that are not associated with any attributes. Each term is supplied to each attribute of each resolver in the scope of the resolution job. This is a convenient way to resolve an entity quickly, because the user doesn't need to structure their search to conform to the entity model. The tradeoff is that the results are more prone to false positives.

The "terms" field is only used in the first query to each index in a resolution job. After that, the attributes are obtained from the documents and the job continues with those attribute values.

When both "attributes" and "terms" are given, the first query to each index will create a filter tree for each of "attributes" and "terms", and both filters must match for a document to match.
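The "both filters must match" rule corresponds to a logical AND. One way to express that in Elasticsearch is to place both filter trees in the "filter" array of a "bool" query, as sketched below. The function name is hypothetical, the two filter trees are placeholders, and the query zentity actually builds may be structured differently.

```python
def first_query_filter(attributes_filter=None, terms_filter=None) -> dict:
    """AND together whichever of the two filter trees were built for the first query."""
    filters = [f for f in (attributes_filter, terms_filter) if f is not None]
    # Every clause in a bool "filter" array must match for a document to match.
    return {"bool": {"filter": filters}}


# Placeholder filter trees standing in for the ones derived from the input.
from_attributes = {"match": {"phone": "555-123-4567"}}
from_terms = {"match": {"name": "Allie Jones"}}
combined = first_query_filter(from_attributes, from_terms)
```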

"terms".TERM

An arbitrary search term string.

"ids"

Model

{
  "ids": {
    INDEX_NAME: [
      DOC_ID,
      ...
    ],
    ...
  }
}

Example

{
  "ids": {
    "users": [
      "1234567890",
      "0987654321"
    ],
    "customers": [
      "customer_001"
    ]
  }
}

The "ids" field allows the first iteration of a resolution job to begin by selecting one or more documents by _id for one or more indices. Like "attributes", any document that matches the _id values will be considered a match to the entity. Documents are queried by _id only within a given index to prevent collisions in which the same _id is present in two or more indices.

The "ids" field is only used in the first query to each index in a resolution job. After that, the attributes are obtained from the documents and the job continues with those attribute values.
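Because each list of _id values is scoped to its own index, the "ids" object maps naturally onto one query per index. The sketch below uses the Elasticsearch "ids" query for illustration; the function name is hypothetical, and this is not necessarily the query zentity issues.

```python
def ids_queries(ids: dict) -> dict:
    """Build one Elasticsearch "ids" query body per index, keyed by index name."""
    return {
        index: {"query": {"ids": {"values": list(doc_ids)}}}
        for index, doc_ids in ids.items()
        if doc_ids  # skip indices with no _id values
    }


queries = ids_queries({
    "users": ["1234567890", "0987654321"],
    "customers": ["customer_001"],
})
# Each body would be sent only to its own index, so identical _id values
# in different indices cannot collide.
```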

"ids".INDEX_NAME

The name of a distinct index. Any _id values specified in this array will be queried within that index.

"ids".INDEX_NAME.DOC_ID

The value of a distinct _id within a given index.

"scope"

Model

{
  "scope": {
    "exclude": {
      "attributes": {
        ATTRIBUTE_NAME: [
          ATTRIBUTE_VALUE,
          ...
        ],
        ...
      },
      "indices": [
        INDEX_NAME,
        ...
      ],
      "resolvers": [
        RESOLVER_NAME,
        ...
      ]
    },
    "include": {
      "attributes": {
        ATTRIBUTE_NAME: [
          ATTRIBUTE_VALUE,
          ...
        ],
        ...
      },
      "indices": [
        INDEX_NAME,
        ...
      ],
      "resolvers": [
        RESOLVER_NAME,
        ...
      ]
    }
  }
}

Example

{
  "scope": {
    "exclude": {
      "attributes": {
        "name": [
          "unknown",
          "n/a"
        ],
        "phone": [
          "555-555-5555"
        ],
        "dob": [
          "0000-00-00",
          "1900-01-01"
        ]
      },
      "resolvers": [
        "name_ssn"
      ]
    },
    "include": {
      "attributes": {
        "country": [
          "US"
        ]
      },
      "indices": [
        "users"
      ]
    }
  }
}

An optional field that contains an object to limit the scope of the resolution request. Scope can be controlled by excluding ("blacklisting") or including ("whitelisting") attribute values, indices, and resolvers.

"scope"."exclude"

An optional field that excludes "attributes", "indices", or "resolvers" from the resolution job. By setting any of these exclusions, no query will be allowed to include the given attribute values, indices, or resolvers.

The values in "scope"."exclude" take precedence over any duplicate values specified in "scope"."include".

"scope"."exclude"."attributes"

An optional field that excludes specific attribute values from the resolution job. By setting any of these attributes, no query will be allowed to include the values specified within them.

"scope"."exclude"."attributes".ATTRIBUTE_NAME

A field with the name of a distinct attribute. Some examples might be "name", "dob", "phone", etc. By setting an attribute, no query will be allowed to include the values specified within it.

"scope"."exclude"."indices"

An optional field that excludes specific indices from the resolution job. By setting any of these indices, no query will be allowed to include those indices.

"scope"."exclude"."indices".INDEX_NAME

The name of a distinct index. By setting an index, no query will be allowed to include it.

"scope"."exclude"."resolvers"

An optional field that excludes specific resolvers from the resolution job. By setting any of these resolvers, no query will be allowed to include those resolvers.

"scope"."exclude"."resolvers".RESOLVER_NAME

The name of a distinct resolver. By setting a resolver, no query will be allowed to include it.

"scope"."include"

An optional field that includes "attributes", "indices", or "resolvers" in the resolution job. By setting any of these inclusions, no query will be allowed to exclude the given attribute values, indices, or resolvers.

The values in "scope"."exclude" take precedence over any duplicate values specified in "scope"."include".

"scope"."include"."attributes"

An optional field that includes specific attribute values in the resolution job. By setting any of these attributes, no query will be allowed to exclude the values specified within them.

"scope"."include"."attributes".ATTRIBUTE_NAME

A field with the name of a distinct attribute. Some examples might be "name", "dob", "phone", etc. By setting an attribute, no query will be allowed to exclude the values specified within it.

"scope"."include"."indices"

An optional field that includes specific indices in the resolution job. By setting any of these indices, no query will be allowed to exclude those indices.

"scope"."include"."indices".INDEX_NAME

The name of a distinct index. By setting an index, no query will be allowed to exclude it.

"scope"."include"."resolvers"

An optional field that includes specific resolvers in the resolution job. By setting any of these resolvers, no query will be allowed to exclude those resolvers.

"scope"."include"."resolvers".RESOLVER_NAME

The name of a distinct resolver. By setting a resolver, no query will be allowed to exclude it.
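For index and resolver names, the interaction of "include" and "exclude" can be sketched as a filter over the names defined in the entity model. The function below is hypothetical and assumes that an empty or missing "include" list places no restriction. It does not cover attribute values, which constrain documents rather than names.

```python
def names_in_scope(candidates, include=None, exclude=None) -> list:
    """Return the candidate names allowed by the scope, with exclude taking precedence."""
    include_set = set(include or [])
    exclude_set = set(exclude or [])
    return [
        name for name in candidates
        # An empty include set is assumed to mean "no restriction".
        if (not include_set or name in include_set) and name not in exclude_set
    ]


# "customers" appears in both lists, so the exclusion wins.
indices = names_in_scope(
    ["users", "customers", "leads"],
    include=["users", "customers"],
    exclude=["customers"],
)
```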

"model"

The entity model to use for the entity resolution job. This is only required if entity_type is not specified in the endpoint of the resolution request. Otherwise this field must not be present.
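The either/or rule for "model" and entity_type lends itself to a small client-side check. The sketch below is hypothetical and only mirrors the rule as stated above. It returns an error message instead of raising so that callers can decide how to report the problem.

```python
from typing import Optional


def model_usage_error(body: dict, entity_type: Optional[str] = None) -> Optional[str]:
    """Return an error message if "model" and entity_type are combined incorrectly, else None."""
    has_model = "model" in body
    if entity_type is None and not has_model:
        return '"model" is required when entity_type is not given in the endpoint'
    if entity_type is not None and has_model:
        return '"model" must not be present when entity_type is given in the endpoint'
    return None


# Exactly one of the two sources for the entity model must be used.
print(model_usage_error({"attributes": {"name": ["Allie Jones"]}}, entity_type="person"))
```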

 


© 2018 - 2024 Dave Moore.
Licensed under the Apache License, Version 2.0.
Elasticsearch is a trademark of Elasticsearch BV.