Entity Model Specification

{
  "attributes": {
    ATTRIBUTE_NAME: {
      "type": ATTRIBUTE_TYPE,
      "params": {
        PARAM_NAME: PARAM_VALUE,
        ...
      },
      "score": ATTRIBUTE_IDENTITY_CONFIDENCE_BASE_SCORE
    },
    ...
  },
  "resolvers": {
    RESOLVER_NAME: {
      "attributes": [
        ATTRIBUTE_NAME,
        ...
      ],
      "weight": WEIGHT_LEVEL
    },
    ...
  },
  "matchers": {
    MATCHER_NAME: {
      "clause": MATCHER_CLAUSE,
      "params": {
        PARAM_NAME: PARAM_VALUE,
        ...
      },
      "quality": MATCHER_QUALITY_SCORE
    },
    ...
  },
  "indices": {
    INDEX_NAME: {
      "fields": {
        INDEX_FIELD_NAME: {
          "attribute": ATTRIBUTE_NAME,
          "matcher": MATCHER_NAME,
          "quality": INDEX_FIELD_QUALITY_SCORE
        },
        ...
      }
    },
    ...
  }
}

Entity models are JSON documents. In the framework shown above, lowercase quoted values (e.g. "attributes") are constant fields, uppercase literal values (e.g. ATTRIBUTE_NAME) are variable fields or values, and ellipses (...) are optional repetitions of the preceding field or value.

An entity model has four required objects: "attributes", "resolvers", "matchers", "indices". Not all elements within these objects are required. Optional elements are noted in the descriptions of each element listed on this page.
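As a rough illustration of the requirement above, the following Python sketch checks whether a model contains the four required objects. The function name `missing_required_objects` is hypothetical and not part of zentity, which performs its own validation.

```python
# Names of the four required top-level objects, as listed in the specification.
REQUIRED_OBJECTS = ("attributes", "resolvers", "matchers", "indices")


def missing_required_objects(model: dict) -> list:
    """Return the required top-level names that are absent or are not JSON objects."""
    return [name for name in REQUIRED_OBJECTS if not isinstance(model.get(name), dict)]


# A model that lacks the "indices" object.
model = {"attributes": {}, "resolvers": {}, "matchers": {}}
print(missing_required_objects(model))  # ['indices']
```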

ENTITY_TYPE

An entity model is identified by its ENTITY_TYPE. This value is specified not in the entity model object, but rather in the "_id" field of the entity model document stored in the .zentity-models index. The value is specified when using the Models API to create the entity model.

"attributes"

Model

{
  "attributes": {
    ATTRIBUTE_NAME: {
      "type": ATTRIBUTE_TYPE,
      "params": {
        PARAM_NAME: PARAM_VALUE,
        ...
      },
      "score": ATTRIBUTE_IDENTITY_CONFIDENCE_BASE_SCORE
    },
    ...
  }
}

Example

{
  "attributes": {
    "name": {
      "type": "string",
      "score": 0.65
    },
    "street": {
      "type": "string",
      "score": 0.75
    },
    "city": {
      "type": "string",
      "score": 0.55
    },
    "state": {
      "type": "string",
      "params": {
        "fuzziness": 0
      },
      "score": 0.52
    },
    "zip": {
      "type": "string",
      "score": 0.6
    },
    "email": {
      "type": "string",
      "score": 0.95
    },
    "phone": {
      "type": "string",
      "params": {
        "fuzziness": "auto"
      },
      "score": 0.9
    }
  }
}

Attributes are elements that can assist the identification and resolution of entities. For example, some common attributes of a person include name, date of birth, and phone number. Each attribute has its own particular data qualities and purposes in the real world. Therefore, zentity matches the values of each attribute using logic that is distinct to each attribute.

Some attributes can be matched using different methods. For example, a name could be matched by its exact value or its phonetic value. Therefore the entity model allows each attribute to have one or more matchers. A matcher is simply a clause of a "bool" query in Elasticsearch. This means that if any matcher of an attribute yields a match for a given value, then the attribute will be considered a match regardless of the results of the other matchers.

"attributes".ATTRIBUTE_NAME

A field with the name of a distinct attribute. Some examples might be "name", "dob", "phone", etc. The value of the field is an object that contains metadata about the attribute.

Attribute names may contain periods (.). Periods will separate the attribute into nested fields (i.e. prefixes) in the response of a resolution request. Consider this example "attributes" object in an entity model:

{
  "attributes": {
    "name.first": {},
    "name.last": {},
    "location.address.street": {},
    "location.address.city": {},
    "location.address.state": {}
  }
}

The "_attributes" object of a resolution response will restructure the attributes by splitting their names by periods and nesting the split fields. Example:

{
  "_attributes": {
    "name": {
      "first": [ "Alice" ],
      "last": [ "Jones" ]
    },
    "location": {
      "address": {
        "street": [ "123 Main St" ],
        "city": [ "Washington" ],
        "state": [ "DC" ]
      }
    }
  }
}

This nesting behavior creates the potential for attribute names to conflict. For example, an entity model with three attributes named "name", "name.first", and "name.last" is invalid. The "name" attribute cannot hold values in the resolution response because the nested structure will cause the "name.first" and "name.last" attributes to override it. The "name" attribute must be removed or renamed to something appropriate such as "name.full".
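The nesting and the resulting name conflicts can be sketched in Python as follows. This is only an illustration of the behavior described above, not zentity's implementation; the function name `nest_attributes` is hypothetical, and the sketch assumes attribute values are lists, as in the response example.

```python
def nest_attributes(flat: dict) -> dict:
    """Split attribute names on periods and nest them, raising on name conflicts."""
    nested = {}
    for name, values in flat.items():
        node = nested
        *prefixes, leaf = name.split(".")
        for part in prefixes:
            child = node.setdefault(part, {})
            if not isinstance(child, dict):
                # A shorter attribute name already holds values at this prefix.
                raise ValueError(f"attribute name conflict at '{part}' in '{name}'")
            node = child
        if leaf in node:
            # A longer attribute name already uses this name as a prefix.
            raise ValueError(f"attribute name conflict at '{leaf}' in '{name}'")
        node[leaf] = values
    return nested


flat = {
    "name.first": ["Alice"],
    "name.last": ["Jones"],
    "location.address.city": ["Washington"],
}
print(nest_attributes(flat))
```

Passing "name" together with "name.first" raises a ValueError in this sketch, which mirrors why such a model is invalid.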

"attributes".ATTRIBUTE_NAME."type"

The data type of the attribute. The default value is "string" if unspecified in the model. Data types of attribute values are validated on input when submitting a request to the Resolution API endpoint. Attribute data types only affect the inputs to a resolution job and the queries submitted to Elasticsearch. The data types of the values returned in the "_attributes" field of the documents in the resolution job response are kept as they were in the "_source" fields of those documents.

"attributes".ATTRIBUTE_NAME."params"

An optional object that passes arbitrary variables ("params") to the matcher clauses.

"attributes".ATTRIBUTE_NAME."params".PARAM_NAME

A field with the name of a distinct param for the attribute. Some examples might be "fuzziness" or "format".

"attributes".ATTRIBUTE_NAME."params".PARAM_NAME.PARAM_VALUE

A value for the param. This can be any JSON compliant value such as a string, number, boolean, array, or object. The value will be serialized as a string when passed to the matcher clause. The value overrides the same param specified in "matchers".MATCHER_NAME."params", and it is overridden by the same param specified in "attributes".ATTRIBUTE_NAME."params" in the input of a resolution request.

Valid attribute types

Listed below are each of the currently valid attribute types.

"string"

Indicates that the values of an attribute must be supplied as JSON compliant string values. Elasticsearch can perform text analysis, fuzzy matching, and other operations solely on string values.

"number"

Indicates that the values of an attribute must be supplied as JSON compliant number values. This includes any positive or negative integer or fractional value. zentity handles the appropriate conversion of number values to floats, doubles, integers, or longs.

"boolean"

Indicates that the values of an attribute must be supplied as JSON compliant boolean values (true or false).

"date"

Indicates that the values of an attribute must be supplied as JSON compliant string values. Additionally, date attributes must include a param called "format" that contains an Elasticsearch date format. Date values are queried and returned in the specified format. This is both useful and necessary when querying date fields across indices that have disparate date formats.
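The input rules for the four attribute types can be summarized with a small Python sketch. The function name `is_valid_value` is hypothetical, and the sketch only mirrors the rules listed above: it does not check that a date string conforms to its "format" param, and it is not how zentity itself validates input.

```python
def is_valid_value(attribute_type: str, value) -> bool:
    """Check a decoded JSON value against the input rule for an attribute type."""
    if attribute_type in ("string", "date"):
        # Dates are supplied as strings; their format is not verified here.
        return isinstance(value, str)
    if attribute_type == "number":
        # bool is a subclass of int in Python, so it is excluded explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if attribute_type == "boolean":
        return isinstance(value, bool)
    raise ValueError(f"unknown attribute type: {attribute_type}")


print(is_valid_value("number", 42))      # True
print(is_valid_value("number", True))    # False
print(is_valid_value("date", "2024-01-31"))  # True
```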

"attributes".ATTRIBUTE_NAME."score"

An attribute identity confidence base score represents the confidence that an attribute would uniquely identify the entity if it were to match, assuming the quality of its matcher and index field are perfect. The score is a floating point number in the range of 0.0 - 1.0.

Effectively, if a document matches with one or more attributes:

Generally it makes sense for every attribute to have a base score between 0.5 - 1.0. A base score that's less than 0.5 would indicate that the matching attribute represents some level of a false match, which is contrary to the general usage of zentity where a matching attribute represents some level of a true match.

Cardinality would be a good statistic by which to define a base score. For example:

Care should be taken when using a base score of 1.0 or 0.0, because it would allow a single attribute identity confidence score to determine the document "_score". Whenever there is an attribute identity confidence score of 1.0 or 0.0, it takes precedence over any other attribute identity confidence score in the document. For example, you might have an "id" field that you absolutely trust to identify an entity. If you allow the score of the "id" field to be 1.0, then anytime the "id" field matches in a document, no other attribute identity confidence score would matter because you've already stated that the "id" field always indicates a match with perfect confidence. A best practice would be to use a high number such as 0.99 to allow for some small level of variability and more nuanced rankings of documents.

"resolvers"

Model

{
  "resolvers": {
    RESOLVER_NAME: {
      "attributes": [
        ATTRIBUTE_NAME,
        ...
      ],
      "weight": WEIGHT_LEVEL
    },
    ...
  }
}

Example

{
  "resolvers": {
    "name_street_city_state": {
      "attributes": [
        "name", "street", "city", "state"
      ]
    },
    "name_street_zip": {
      "attributes": [
        "name", "street", "zip"
      ]
    },
    "name_phone": {
      "attributes": [
        "name", "phone"
      ]
    },
    "name_email": {
      "attributes": [
        "name", "email"
      ]
    },
    "email_phone": {
      "attributes": [
        "email", "phone"
      ]
    },
    "ssn": {
      "attributes": [
        "ssn"
      ],
      "weight": 1
    }
  }
}

Resolvers are combinations of attributes that imply a resolution. For example, you might decide to resolve entities that share matching values for "name" and "dob" or "name" and "phone". You can create a resolver for both combinations of attributes. Then any documents whose values share either a matching "name" and "dob" or "name" and "phone" will resolve to the same entity.

Remember that an attribute can be associated with more than one matcher in the "indices" object. If any matcher of an attribute yields a match for a given value, then the attribute is considered a match regardless of the results of the other matchers. So if you have an attribute called "name" with matchers called "keyword" and "phonetic", then any resolver that uses the "name" attribute effectively requires that either name.keyword or name.phonetic match, not both.
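The logic described above amounts to an "any" over the matchers of an attribute and an "all" over the attributes of a resolver. A minimal Python sketch, with hypothetical function names and precomputed matcher outcomes standing in for real query results:

```python
def attribute_matches(matcher_results: dict) -> bool:
    """An attribute matches if any one of its matchers yields a match."""
    return any(matcher_results.values())


def resolver_matches(resolver_attributes: list, results: dict) -> bool:
    """A resolver matches only if every one of its attributes matches."""
    return bool(resolver_attributes) and all(
        attribute_matches(results.get(name, {})) for name in resolver_attributes
    )


# Hypothetical matcher outcomes for one document, keyed by attribute and matcher.
results = {
    "name": {"keyword": False, "phonetic": True},
    "phone": {"exact": True},
}
print(resolver_matches(["name", "phone"], results))  # True
print(resolver_matches(["name", "dob"], results))    # False
```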

"resolvers".RESOLVER_NAME

A field with the name of a distinct resolver. The value of the field is an object that contains metadata about the resolver.

A resolver represents a combination of attributes that implies a resolution. For example, if a resolver lists the attributes "name" and "phone", then any documents whose values match those attributes -- either in the inputs of the resolution job or any subsequent hops -- will be considered a match to the entity.

"resolvers".RESOLVER_NAME."attributes"

A set of attribute names. The order of the values has no effect on resolution. Duplicate values are redundant and have no effect on resolution.

"resolvers".RESOLVER_NAME."attributes".ATTRIBUTE_NAME

The name of an attribute from the "attributes" object of the entity model. If the attribute does not exist, then the resolver will not be used in any resolution jobs.

"resolvers".RESOLVER_NAME."weight".WEIGHT_LEVEL

The weight level of the resolver. Resolvers with higher weight levels take precedence over resolvers with lower weight levels. If a resolution job uses resolvers with different weight levels, then the higher weight resolvers either must match or must not exist. This behavior can help prevent false matches.

For example, let's say you have three resolvers: "name_phone" has a weight of 0, "ssn" has a weight of 1, and "id" has a weight of 2. Because the "id" resolver has the highest weight, it will always match documents with the same "id" attribute. The "ssn" resolver has a lower weight than the "id" resolver, and so the "ssn" resolver will only match documents if the "id" resolver either matches or does not exist in the documents. And the "name_phone" resolver has the lowest weight, so the "name_phone" resolver will only match documents if both the "ssn" and "id" resolvers either match or do not exist in the documents.
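The precedence rule in this example can be sketched in Python. The sketch rests on assumptions that the specification does not state: a resolver "exists" in a document when all of its attributes are present there, and an unspecified weight defaults to 0. The function names are hypothetical, and this is not how zentity builds its queries.

```python
def resolver_matches(resolver: dict, matched: set) -> bool:
    """True if every attribute of the resolver matched in the document."""
    return all(name in matched for name in resolver["attributes"])


def resolver_exists(resolver: dict, present: set) -> bool:
    """Assumption: a resolver exists if all of its attributes are present."""
    return all(name in present for name in resolver["attributes"])


def document_resolves(resolvers: dict, matched: set, present: set) -> bool:
    """A resolver counts only if each higher-weight resolver matches or does not exist."""
    for resolver in resolvers.values():
        if not resolver_matches(resolver, matched):
            continue
        weight = resolver.get("weight", 0)  # assumed default weight
        higher = [r for r in resolvers.values() if r.get("weight", 0) > weight]
        if all(resolver_matches(r, matched) or not resolver_exists(r, present) for r in higher):
            return True
    return False


resolvers = {
    "name_phone": {"attributes": ["name", "phone"], "weight": 0},
    "ssn": {"attributes": ["ssn"], "weight": 1},
    "id": {"attributes": ["id"], "weight": 2},
}
# Name and phone match, and the document has no "ssn" or "id" values.
print(document_resolves(resolvers, {"name", "phone"}, {"name", "phone"}))  # True
# Name and phone match, but the document has a different "ssn".
print(document_resolves(resolvers, {"name", "phone"}, {"name", "phone", "ssn"}))  # False
```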

"matchers"

Model

{
  "matchers": {
    MATCHER_NAME: {
      "clause": MATCHER_CLAUSE,
      "params": {
        PARAM_NAME: PARAM_VALUE,
        ...
      },
      "quality": MATCHER_QUALITY_SCORE
    },
    ...
  }
}

Example

{
  "matchers": {
    "exact_matcher": {
      "clause": {
        "term": {
          "{{ field }}": "{{ value }}"
        }
      }
    },
    "fuzzy_matcher": {
      "clause":{
        "match": {
          "{{ field }}": {
            "query": "{{ value }}",
            "fuzziness": "{{ params.fuzziness }}"
          }
        }
      },
      "params": {
        "fuzziness": "auto"
      },
      "quality": 0.95
    },
    "standard_matcher": {
      "clause": {
        "match": {
          "{{ field }}": "{{ value }}"
        }
      },
      "quality": 0.98
    },
    "timestamp_matcher": {
      "clause": {
        "range": {
          "{{ field }}": {
            "gte": "{{ value }}||-{{ params.window }}",
            "lte": "{{ value }}||+{{ params.window }}",
            "format": "{{ params.format }}"
          }
        }
      },
      "params": {
        "format": "yyyy-MM-dd'T'HH:mm:ss.SSS",
        "window": "15m"
      },
      "quality": 0.92
    }
  }
}

"matchers".MATCHER_NAME

A field with the name of a distinct matcher. The value of the field is an object that contains metadata about the matcher.

A matcher is a templated clause of a "bool" query that can be populated with the names of index fields and the values of attributes.

"matchers".MATCHER_NAME."clause"

An object that represents the clause of a "bool" query in Elasticsearch. Each clause will be stitched together to form a single "bool" query, so it must follow the correct syntax for a "bool" query clause, except you don't need to include the top-level field "bool" or its subfields such as "must" or "should".

Matcher clauses use Mustache syntax to pass two important variables: {{ field }} and {{ value }}. The field variable will be populated with the index field that maps to the attribute. The value variable will be populated with the value that will be queried for that attribute. This syntax is the same as the one used by Elasticsearch search templates.
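To make the substitution concrete, here is a simplified Python stand-in for the templating step. It is not a Mustache implementation and not zentity's code: it only replaces {{ name }} placeholders in the keys and string values of a clause, and it assumes the variables are supplied as a flat dictionary with keys such as "params.fuzziness". The function name `render` and the sample field and value are hypothetical.

```python
import re

# Matches placeholders such as {{ field }}, {{ value }}, or {{ params.fuzziness }}.
_VARIABLE = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def render(template, variables: dict):
    """Recursively substitute placeholders in the keys and string values of a clause."""
    if isinstance(template, dict):
        return {render(key, variables): render(val, variables) for key, val in template.items()}
    if isinstance(template, list):
        return [render(item, variables) for item in template]
    if isinstance(template, str):
        return _VARIABLE.sub(lambda m: str(variables[m.group(1)]), template)
    return template


clause = {
    "match": {
        "{{ field }}": {
            "query": "{{ value }}",
            "fuzziness": "{{ params.fuzziness }}",
        }
    }
}
variables = {"field": "full_name", "value": "Alice Jones", "params.fuzziness": "auto"}
print(render(clause, variables))
```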

"matchers".MATCHER_NAME."params"

An optional object that specifies the default values for any variables ("params") in the matcher clause.

"matchers".MATCHER_NAME."params".PARAM_NAME

A field with the name of a distinct param for the matcher clause. Some examples might be "fuzziness" or "format".

"matchers".MATCHER_NAME."params".PARAM_NAME.PARAM_VALUE

A value for the param. This can be any JSON compliant value such as a string, number, boolean, array, or object. The value will be serialized as a string when passed to the matcher clause. The value is overridden by the same field specified in "attributes".ATTRIBUTE_NAME."params" in either the input or the model.
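The order of precedence can be sketched as a layered merge in Python. The sketch assumes that params from the input override params from the model, which in turn override the matcher defaults. The function name `effective_params` is hypothetical, and the serialization of values to strings is left out.

```python
def effective_params(matcher_params=None, model_attribute_params=None, input_attribute_params=None) -> dict:
    """Merge params so that later layers override earlier ones."""
    merged = dict(matcher_params or {})          # defaults from "matchers"
    merged.update(model_attribute_params or {})  # "attributes" params in the model
    merged.update(input_attribute_params or {})  # "attributes" params in the input
    return merged


# The matcher defaults "fuzziness" to "auto", but the model's attribute sets it to 0.
print(effective_params({"fuzziness": "auto"}, {"fuzziness": 0}))  # {'fuzziness': 0}
# An input param, if supplied, overrides both.
print(effective_params({"fuzziness": "auto"}, {"fuzziness": 0}, {"fuzziness": 1}))  # {'fuzziness': 1}
```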

"matchers".MATCHER_NAME."quality"

A matcher quality score represents the quality or trustworthiness of a matcher. It modifies the attribute identity confidence base score and contributes to the final attribute identity confidence score.

Effectively this means:

The purpose of the matcher quality score is to reflect any dubious matcher quality in the final document "_score". For example, an exact matcher may have a quality score of 1.0, while a fuzzy matcher may have a quality score of 0.95 to express slightly less confidence in the quality of the match.

"indices"

Model

{
  "indices": {
    INDEX_NAME: {
      "fields": {
        INDEX_FIELD_NAME: {
          "attribute": ATTRIBUTE_NAME,
          "matcher": MATCHER_NAME,
          "quality": INDEX_FIELD_QUALITY_SCORE
        },
        ...
      }
    },
    ...
  }
}

Example

{
  "indices": {
    "users": {
      "fields": {
        "name": {
          "attribute": "name",
          "matcher": "fuzzy_matcher",
          "quality": 0.95
        },
        "zip.keyword": {
          "attribute": "zip",
          "matcher": "exact_matcher",
          "quality": 0.98
        },
        "email.keyword": {
          "attribute": "email",
          "matcher": "exact_matcher"
        }
      }
    },
    "registrants": {
      "fields": {
        "full_name": {
          "attribute": "name",
          "matcher": "fuzzy_matcher",
          "quality": 0.98
        },
        "addr_street": {
          "attribute": "street",
          "matcher": "fuzzy_matcher",
          "quality": 0.95
        },
        "addr_city": {
          "attribute": "city",
          "matcher": "standard_matcher",
          "quality": 0.98
        },
        "addr_state_code.keyword": {
          "attribute": "state",
          "matcher": "exact_matcher"
        },
        "addr_state_postal_code.keyword": {
          "attribute": "zip",
          "matcher": "exact_matcher"
        },
        "email_address": {
          "attribute": "email",
          "matcher": "exact_matcher"
        },
        "phone_number.keyword": {
          "attribute": "phone",
          "matcher": "exact_matcher"
        }
      }
    }
  }
}

Different indices in Elasticsearch might have data that can be matched as attributes, but each index might use slightly different field names or data types for the same data. Therefore, zentity uses a map to translate the different field names to the attributes of our entity model.

The entity model maps attributes and matchers to index fields. Remember how each attribute can be matched in different ways, such as a name that can be matched by its exact value or its phonetic value? Elasticsearch would index those different values as distinct fields, such as "name.keyword" and "name.phonetic". This is why the entity model maps attributes and matchers -- not just attributes -- to index fields.

"indices".INDEX_NAME

A field with the name of a distinct Elasticsearch index or index pattern. The value of the field is an object that contains metadata about the index.

zentity does not verify the existence of indices or the validity of index patterns or the syntax of field names that Elasticsearch requires. Elasticsearch may respond with an error for resolution jobs that submit queries to indices that do not exist.

"indices".INDEX_NAME."fields"

An object that maps an index field name to an attribute and a matcher.

"indices".INDEX_NAME."fields".INDEX_FIELD_NAME

A field with the name of a distinct property or field in an Elasticsearch index. The value of the field is an object that contains metadata about the index field, particularly the attribute and matcher that are mapped to it.

zentity does not verify the existence of field names within indices or the syntax of field names that Elasticsearch requires.

"indices".INDEX_NAME."fields".INDEX_FIELD_NAME."attribute"

The name of an attribute from the "attributes" object of the entity model. If the attribute does not exist, then the index field will not be queried in any resolution jobs and will not be returned in the "_attributes" field of the documents matched in a resolution job.

"indices".INDEX_NAME."fields".INDEX_FIELD_NAME."matcher"

The name of a matcher from the "matchers" object of the entity model. If the matcher does not exist, then the index field will not be queried in any resolution jobs. However, the index field can still be returned in the "_attributes" field of the documents matched in a resolution job if those documents matched the attributes of other resolvers.

Let's illustrate how index fields relate to matchers and attributes during a resolution job. Assume you are resolving an entity with an email address of "[email protected]" and one of the indices has a field named "email.keyword". And assume that the index field is mapped to the matcher clause below:

{
  "term": {
    "{{ field }}": "{{ value }}"
  }
}

The final clause will look like this:

{
  "term": {
    "email.keyword": "[email protected]"
  }
}

The "matcher" field is optional. The index field won't be queried if the "matcher" field is unspecified. There are valid reasons to map an index field only to an attribute and not to a matcher. Doing so allows the value of the index field to be returned in the "_attributes" field if the document is matched by other resolvers. For example, an attribute for "salary" might not be useful in identifying and resolving entities, but it could be useful to return salary data from disparate indices under a field with a common name. These are sometimes called "payload" attributes.
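The distinction between queried fields and payload fields can be sketched in Python. The function name `classify_fields` and the sample field names are hypothetical; the sketch only applies the rules stated in this section and is not zentity's implementation.

```python
def classify_fields(index_fields: dict, attributes: dict, matchers: dict):
    """Split index fields into those that are queried and those that are payload only."""
    queried, payload = {}, {}
    for field_name, mapping in index_fields.items():
        attribute = mapping.get("attribute")
        if attribute not in attributes:
            # Unknown attribute: the field is neither queried nor returned.
            continue
        matcher = mapping.get("matcher")
        if matcher in matchers:
            queried[field_name] = (attribute, matcher)
        else:
            # Missing or unknown matcher: the field can only be returned.
            payload[field_name] = attribute
    return queried, payload


attributes = {"email": {}, "salary": {}}
matchers = {"exact_matcher": {}}
fields = {
    "email.keyword": {"attribute": "email", "matcher": "exact_matcher"},
    "annual_salary": {"attribute": "salary"},
    "nickname": {"attribute": "alias", "matcher": "exact_matcher"},
}
queried, payload = classify_fields(fields, attributes, matchers)
print(queried)  # {'email.keyword': ('email', 'exact_matcher')}
print(payload)  # {'annual_salary': 'salary'}
```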

"indices".INDEX_NAME."fields".INDEX_FIELD_NAME."quality"

An index field quality score represents the quality or trustworthiness of the data in an index field. It modifies the attribute identity confidence base score and contributes to the final attribute identity confidence score.

Effectively this means:

The purpose of the index field quality score is to reflect any dubious data quality in the final document "_score". For example, an index field with perfectly clean and governed data may have a quality score of 1.0, while an index field with known data quality issues may have a quality score of 0.95 to express slightly less confidence in the quality of the match.

 


© 2018 - 2024 Dave Moore.
Licensed under the Apache License, Version 2.0.
Elasticsearch is a trademark of Elasticsearch BV.