Entity Match Categorizer
Although entity matching in the SDK offers greater flexibility, its utility is constrained by the lack of an easy way to group matches by pattern. Experience shows that relying strictly on the confidence score can be misleading: some matches have low confidence scores but are in fact high-quality, while others score high but are poor. Hence, we need to pay attention to the patterns underlying the matches. In this script, I demonstrate an entity match categorizer that reproduces the “group by pattern” feature of the UI, making SDK-based entity matching more effective.
[1]:
import os
from cognite.experimental import CogniteClient
from cognite.utils.contextualization import EntityMatchCategorizer
Setup
First, let’s set things up for performing entity matching via the SDK.
[2]:
# Establish client connection
client = CogniteClient(
    client_name=os.environ.get("CLIENT_NAME"),
    token_client_id=os.environ.get("CLIENT_ID"),
    token_client_secret=os.environ.get("CLIENT_SECRET"),
    project=os.environ.get("PROJECT"),
    base_url=os.environ.get("BASE_URL"),
    token_scopes=os.environ.get("TOKEN_SCOPE"),
    token_url=os.environ.get("TOKEN_URL"),
)
[3]:
# Retrieve resources to match
ts_list = client.time_series.list(data_set_external_ids=["DEMO"], limit=None)
asset_list = client.assets.list(data_set_external_ids=["DEMO"], limit=None)
[4]:
# Format time series data for matching
sources = [
    {
        "id": ts.id,
        "name": ts.name,
        "description": ts.description,
    }
    for ts in ts_list
]
# Format asset data for matching
targets = [
    {
        "id": asset.id,
        "name": asset.name,
        "description": asset.description,
    }
    for asset in asset_list
]
Perform Entity Matching
For simplicity, let’s perform entity matching with an unsupervised model.
[5]:
# Apply unsupervised model
model = client.entity_matching.fit(
    sources=sources,
    targets=targets,
    match_fields=[("name", "name")],
)
[6]:
# Perform entity matching
job = model.predict(score_threshold=0.5)
match_result = job.result
Inspect Matches by Pattern
Now that we have the match result, let’s apply the entity match categorizer to group matches by pattern.
[7]:
# Initialize entity match categorizer
match_categorizer = EntityMatchCategorizer(client)
[8]:
# Group matches by pattern
match_categorizer.group_matches_by_pattern(match_result, pattern_fields=("name", "name"))
The categorizer allows us to retrieve the pattern groups as a DataFrame, which in turn allows us to examine them in different ways (e.g., sorting by average confidence score).
[9]:
# Collect results as a table
match_df = match_categorizer.to_pandas()
[10]:
match_df
[10]:
| | pattern | n_matches | avg_score | matches |
|---|---|---|---|---|
| 0 | [D1]L[D2].L -> [D1]L[D2] | 899 | 0.72 | [{'source': {'description': 'DEPROP REFLUX', '... |
| 1 | [D1][L2][D3].L -> [D1][L2][D3] | 868 | 0.92 | [{'source': {'description': 'ACID WASH DRUM', ... |
| 2 | [D1][L2]D.L -> [D1][L2]D | 394 | 0.63 | [{'source': {'description': 'CONT-3 REFRIG REC... |
| 3 | [D1]LD[L2].[L3] -> [D1][L3]D[L2] | 385 | 0.72 | [{'source': {'description': 'OXID AIR ADDTN VL... |
| 4 | [D1]LD.[L2] -> [D1][L2]DL | 326 | 0.62 | [{'source': {'description': 'ALKY DIB OH GC "R... |
| ... | ... | ... | ... | ... |
| 75 | [L1][D2]L[D2]L[D2]L.[L3] -> [D2][L3]D[L1] | 1 | 0.67 | [{'source': {'description': 'TOTAL IC4 IN NC4 ... |
| 76 | D[L1][D2][L3].L.L -> D[L1][D2][L3] | 1 | 0.61 | [{'source': {'description': 'DEPR MAKE-UP CAUS... |
| 77 | L[D1]L[D2]L.L[D1] -> [D1]L[D2] | 1 | 0.52 | [{'source': {'description': 'DEP REFLUX LOSEL'... |
| 78 | D[L1][D2].L -> [L1]-[D2] | 1 | 0.71 | [{'source': {'description': '49 PH COOLING TWR... |
| 79 | L[D1]L.[L2] -> [D1][L2]DL | 1 | 0.71 | [{'source': {'description': 'CONTACTOR 4 TOTAL... |
80 rows × 4 columns
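Sorting by average confidence score, as mentioned above, needs nothing beyond plain pandas. The sketch below is self-contained and therefore rebuilds a small stand-in table from the first five rows shown above; `demo_df` and `ranked` are illustrative names, and in the notebook the same call would be applied to `match_df` directly.

```python
import pandas as pd

# Stand-in for match_df, copied from the first five rows of the table above
# (the "matches" column is omitted for brevity).
demo_df = pd.DataFrame(
    {
        "pattern": [
            "[D1]L[D2].L -> [D1]L[D2]",
            "[D1][L2][D3].L -> [D1][L2][D3]",
            "[D1][L2]D.L -> [D1][L2]D",
            "[D1]LD[L2].[L3] -> [D1][L3]D[L2]",
            "[D1]LD.[L2] -> [D1][L2]DL",
        ],
        "n_matches": [899, 868, 394, 385, 326],
        "avg_score": [0.72, 0.92, 0.63, 0.72, 0.62],
    }
)

# Highest average score first; ties are broken by the larger group.
ranked = demo_df.sort_values(["avg_score", "n_matches"], ascending=False)
print(ranked[["pattern", "n_matches", "avg_score"]])
```

Sorting on both columns keeps large groups ahead of small ones when their average scores are equal, which is usually the order in which one wants to review them.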
Having the result as a DataFrame, we can query match patterns more easily.
[11]:
# Pick out high-quality patterns
match_df.query("avg_score > 0.75 & n_matches > 10")
[11]:
| | pattern | n_matches | avg_score | matches |
|---|---|---|---|---|
| 1 | [D1][L2][D3].L -> [D1][L2][D3] | 868 | 0.92 | [{'source': {'description': 'ACID WASH DRUM', ... |
| 7 | [D1][L2][D3]L.L -> [D1][L2][D3] | 149 | 0.80 | [{'source': {'description': 'MRU CHG DRUM LEVE... |
| 9 | [D1]L[D2][L3].L -> [D1]L[D2][L3] | 128 | 0.82 | [{'source': {'description': 'RX BED H S/D (2-3... |
| 21 | [L1][D2]L.[L3] -> [D2][L3]D[L1] | 37 | 0.87 | [{'source': {'description': 'TOTAL FRESH ACID,... |
| 27 | [D1][L2][D3][L4].L -> [D1][L2][D3][L4] | 12 | 0.89 | [{'source': {'description': 'RX BED TEMP - 4FT... |
Furthermore, the categorizer allows us to inspect actual match cases in each pattern, helping to better determine if the pattern is valid.
[12]:
# Inspect the 10th pattern group and its 1st match case
match_categorizer.inspect_pattern(i_pattern=9, j_example=0, compare_fields=[("name", "name")])
[GROUP]
pattern: [D1]L[D2][L3].L -> [D1]L[D2][L3]
n_matches: 128
avg_score: 0.82
[EXAMPLE]
score: 0.75
name -> name: 4TA6043D.PV -> 4TI6043D
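The example above also shows how the pattern notation can be read: `L` stands for a run of letters, `D` for a run of digits, and a bracketed, numbered token such as `[D1]` marks a run that appears in both the source and the target name. The sketch below is one possible way to derive such a string, written only to illustrate the notation. It is an assumption inferred from the output shown here, not the implementation inside `EntityMatchCategorizer`, and `tokenize`, `token_kind` and `sketch_pattern` are hypothetical helper names.

```python
import re

# Split a name into letter runs, digit runs and single separator characters.
TOKEN_RE = re.compile(r"[A-Za-z]+|[0-9]+|[^A-Za-z0-9]")


def tokenize(name):
    return TOKEN_RE.findall(name)


def token_kind(token):
    """Return "L" for a letter run, "D" for a digit run, None for a separator."""
    if re.fullmatch(r"[A-Za-z]+", token):
        return "L"
    if re.fullmatch(r"[0-9]+", token):
        return "D"
    return None


def sketch_pattern(source_name, target_name):
    """Illustrative only: build a "source -> target" pattern string."""
    src, tgt = tokenize(source_name), tokenize(target_name)
    # Letter/digit runs that occur in both names are treated as shared.
    shared = {t for t in src if token_kind(t)} & set(tgt)
    # Number shared tokens in order of first appearance in the source.
    index = {}
    for t in src:
        if t in shared and t not in index:
            index[t] = len(index) + 1

    def render(tokens):
        parts = []
        for t in tokens:
            kind = token_kind(t)
            if kind is None:
                parts.append(t)  # separators are kept literally
            elif t in index:
                parts.append(f"[{kind}{index[t]}]")
            else:
                parts.append(kind)
        return "".join(parts)

    return f"{render(src)} -> {render(tgt)}"


print(sketch_pattern("4TA6043D.PV", "4TI6043D"))
# [D1]L[D2][L3].L -> [D1]L[D2][L3]
```

Here `4`, `6043` and `D` occur in both names and become `[D1]`, `[D2]` and `[L3]`, while `TA`, `TI` and `PV` differ and are reduced to a bare `L`.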
We can compare more fields in source (time series) vs. target (asset) as well.
[13]:
# Inspect the 10th pattern group and its 1st match case
match_categorizer.inspect_pattern(
    i_pattern=9,
    j_example=0,
    compare_fields=[
        ("name", "name"),
        ("description", "description"),
    ],
)
[GROUP]
pattern: [D1]L[D2][L3].L -> [D1]L[D2][L3]
n_matches: 128
avg_score: 0.82
[EXAMPLE]
score: 0.75
name -> name: 4TA6043D.PV -> 4TI6043D
description -> description: RX BED H S/D (2-3 N) -> 4TI6043D, RX BED TEMP - 2FT 3IN N
Save Results
Finally, the categorizer allows us to easily select patterns we want and save them into CDF.
[14]:
# Save matches from selected patterns into CDF
match_categorizer.save_patterns_to_cdf(pattern_index_list=[1, 9, 27])
1008 matches have been saved to CDF!
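As a quick sanity check, the reported count agrees with the `n_matches` values of the three selected patterns in the tables above. The snippet below only redoes that arithmetic with numbers copied from the output; `selected` and `total_saved` are illustrative names, and it does not call the categorizer.

```python
# n_matches for pattern indices 1, 9 and 27, copied from the tables above.
selected = {1: 868, 9: 128, 27: 12}

total_saved = sum(selected.values())
print(total_saved)  # 1008
```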