{ "cells": [ { "cell_type": "markdown", "id": "acae66af", "metadata": {}, "source": [ "# Entity Match Categorizer" ] }, { "cell_type": "markdown", "id": "a210016f", "metadata": {}, "source": [ "Although entity matching in the SDK offers greater flexibility, its utility is constrained by the lack of an easy way to group matches by pattern. Experience shows that relying strictly on the confidence score can be misleading: some matches with low confidence scores are in fact high-quality, while some with high confidence scores are not. Hence, we need to pay attention to the patterns underlying the matches. In this notebook, I demonstrate an entity match categorizer that reproduces the \"group by pattern\" feature of the UI, making SDK-based entity matching more effective." ] }, { "cell_type": "code", "execution_count": 1, "id": "aac9b1b5", "metadata": {}, "outputs": [], "source": [ "import os\n", "from cognite.experimental import CogniteClient\n", "from cognite.utils.contextualization import EntityMatchCategorizer" ] }, { "cell_type": "markdown", "id": "e0e5c675", "metadata": {}, "source": [ "## Setup" ] }, { "cell_type": "markdown", "id": "f04e7c14", "metadata": {}, "source": [ "First, let's set things up for performing entity matching via the SDK." 
] }, { "cell_type": "code", "execution_count": 2, "id": "6f4a3b15", "metadata": {}, "outputs": [], "source": [ "# Establish client connection\n", "client = CogniteClient(\n", " client_name=os.environ.get(\"CLIENT_NAME\"),\n", " token_client_id=os.environ.get(\"CLIENT_ID\"),\n", " token_client_secret=os.environ.get(\"CLIENT_SECRET\"),\n", " project=os.environ.get(\"PROJECT\"),\n", " base_url=os.environ.get(\"BASE_URL\"),\n", " token_scopes=os.environ.get(\"TOKEN_SCOPE\"),\n", " token_url=os.environ.get(\"TOKEN_URL\"),\n", ")" ] }, { "cell_type": "code", "execution_count": 3, "id": "e3c80286", "metadata": {}, "outputs": [], "source": [ "# Retrieve resources to match\n", "ts_list = client.time_series.list(data_set_external_ids=[\"DEMO\"], limit=None)\n", "asset_list = client.assets.list(data_set_external_ids=[\"DEMO\"], limit=None)" ] }, { "cell_type": "code", "execution_count": 4, "id": "3cfe57e7", "metadata": {}, "outputs": [], "source": [ "# Format time series data for matching\n", "sources = [\n", " {\n", " \"id\": ts.id,\n", " \"name\": ts.name,\n", " \"description\": ts.description,\n", " }\n", " for ts in ts_list\n", "]\n", "\n", "# Format asset data for matching\n", "targets = [\n", " {\n", " \"id\": asset.id,\n", " \"name\": asset.name,\n", " \"description\": asset.description,\n", " }\n", " for asset in asset_list\n", "]" ] }, { "cell_type": "markdown", "id": "f1e94cbe", "metadata": {}, "source": [ "## Perform Entity Matching" ] }, { "cell_type": "markdown", "id": "94f34c60", "metadata": {}, "source": [ "For simplicity, let's perform entity matching with an unsupervised model." 
] }, { "cell_type": "code", "execution_count": 5, "id": "d4bd6e8c", "metadata": {}, "outputs": [], "source": [ "# Apply unsupervised model\n", "model = client.entity_matching.fit(\n", " sources=sources,\n", " targets=targets,\n", " match_fields=[(\"name\", \"name\")],\n", ")" ] }, { "cell_type": "code", "execution_count": 6, "id": "8fc9adbc", "metadata": {}, "outputs": [], "source": [ "# Perform entity matching\n", "job = model.predict(score_threshold=0.5)\n", "match_result = job.result" ] }, { "cell_type": "markdown", "id": "b95ef3e5", "metadata": {}, "source": [ "## Inspect Matches by Pattern" ] }, { "cell_type": "markdown", "id": "996786af", "metadata": {}, "source": [ "Now that we have the match result, let's apply the entity match categorizer to group matches by pattern." ] }, { "cell_type": "code", "execution_count": 7, "id": "444add7b", "metadata": {}, "outputs": [], "source": [ "# Initialize entity match categorizer\n", "match_categorizer = EntityMatchCategorizer(client)" ] }, { "cell_type": "code", "execution_count": 8, "id": "752ff106", "metadata": {}, "outputs": [], "source": [ "# Group matches by pattern\n", "match_categorizer.group_matches_by_pattern(match_result, pattern_fields=(\"name\", \"name\"))" ] }, { "cell_type": "markdown", "id": "e2794e8e", "metadata": {}, "source": [ "The categorizer allows us to retrieve the pattern groups as a `DataFrame`, which in turn allows us to examine them in different ways (e.g., sorting by average confidence score)." ] }, { "cell_type": "code", "execution_count": 9, "id": "86ccb199", "metadata": {}, "outputs": [], "source": [ "# Collect results as a table\n", "match_df = match_categorizer.to_pandas()" ] }, { "cell_type": "code", "execution_count": 10, "id": "a0138433", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | pattern | n_matches | avg_score | matches |
|---|---|---|---|---|
| 0 | [D1]L[D2].L -> [D1]L[D2] | 899 | 0.72 | [{'source': {'description': 'DEPROP REFLUX', '... |
| 1 | [D1][L2][D3].L -> [D1][L2][D3] | 868 | 0.92 | [{'source': {'description': 'ACID WASH DRUM', ... |
| 2 | [D1][L2]D.L -> [D1][L2]D | 394 | 0.63 | [{'source': {'description': 'CONT-3 REFRIG REC... |
| 3 | [D1]LD[L2].[L3] -> [D1][L3]D[L2] | 385 | 0.72 | [{'source': {'description': 'OXID AIR ADDTN VL... |
| 4 | [D1]LD.[L2] -> [D1][L2]DL | 326 | 0.62 | [{'source': {'description': 'ALKY DIB OH GC "R... |
| ... | ... | ... | ... | ... |
| 75 | [L1][D2]L[D2]L[D2]L.[L3] -> [D2][L3]D[L1] | 1 | 0.67 | [{'source': {'description': 'TOTAL IC4 IN NC4 ... |
| 76 | D[L1][D2][L3].L.L -> D[L1][D2][L3] | 1 | 0.61 | [{'source': {'description': 'DEPR MAKE-UP CAUS... |
| 77 | L[D1]L[D2]L.L[D1] -> [D1]L[D2] | 1 | 0.52 | [{'source': {'description': 'DEP REFLUX LOSEL'... |
| 78 | D[L1][D2].L -> [L1]-[D2] | 1 | 0.71 | [{'source': {'description': '49 PH COOLING TWR... |
| 79 | L[D1]L.[L2] -> [D1][L2]DL | 1 | 0.71 | [{'source': {'description': 'CONTACTOR 4 TOTAL... |

80 rows × 4 columns

| | pattern | n_matches | avg_score | matches |
|---|---|---|---|---|
| 1 | [D1][L2][D3].L -> [D1][L2][D3] | 868 | 0.92 | [{'source': {'description': 'ACID WASH DRUM', ... |
| 7 | [D1][L2][D3]L.L -> [D1][L2][D3] | 149 | 0.80 | [{'source': {'description': 'MRU CHG DRUM LEVE... |
| 9 | [D1]L[D2][L3].L -> [D1]L[D2][L3] | 128 | 0.82 | [{'source': {'description': 'RX BED H S/D (2-3... |
| 21 | [L1][D2]L.[L3] -> [D2][L3]D[L1] | 37 | 0.87 | [{'source': {'description': 'TOTAL FRESH ACID,... |
| 27 | [D1][L2][D3][L4].L -> [D1][L2][D3][L4] | 12 | 0.89 | [{'source': {'description': 'RX BED TEMP - 4FT... |
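
To make the pattern strings in the tables easier to read, here is a minimal sketch of how such a pattern could be derived from a pair of names. It is my reading of the output, not the implementation inside `EntityMatchCategorizer`. I assume that `L` stands for a run of letters and `D` for a run of digits, that other characters are kept literally, and that a bracketed token such as `[D1]` marks a value that occurs in both the source and the target name, numbered by first appearance in the source. The helper names (`tokenize`, `match_pattern`, `group_by_pattern`) and the example tag names are hypothetical.

```python
import re
import string
from collections import defaultdict

# Runs of ASCII letters, runs of digits, or any other single character.
TOKEN_RE = re.compile(r"[A-Za-z]+|[0-9]+|[^A-Za-z0-9]")


def tokenize(name):
    """Split a name into letter runs, digit runs and single separators."""
    return TOKEN_RE.findall(name)


def token_kind(token):
    """Return 'D' for digit runs, 'L' for letter runs, else the token itself."""
    if token[0] in string.digits:
        return "D"
    if token[0] in string.ascii_letters:
        return "L"
    return token  # separators such as '.' or '-' are kept literally


def match_pattern(source_name, target_name):
    """Build a 'source pattern -> target pattern' string (assumed notation)."""
    src_tokens = tokenize(source_name)
    tgt_tokens = tokenize(target_name)
    tgt_values = {t for t in tgt_tokens if token_kind(t) in ("L", "D")}

    # Number shared values by their first appearance in the source name.
    shared_index = {}
    for token in src_tokens:
        is_alnum_run = token_kind(token) in ("L", "D")
        if is_alnum_run and token in tgt_values and token not in shared_index:
            shared_index[token] = len(shared_index) + 1

    def render(tokens):
        parts = []
        for token in tokens:
            kind = token_kind(token)
            if token in shared_index:
                parts.append(f"[{kind}{shared_index[token]}]")
            else:
                parts.append(kind)
        return "".join(parts)

    return f"{render(src_tokens)} -> {render(tgt_tokens)}"


def group_by_pattern(matches):
    """Group (source_name, target_name, score) triples by their pattern."""
    scores_by_pattern = defaultdict(list)
    for source_name, target_name, score in matches:
        scores_by_pattern[match_pattern(source_name, target_name)].append(score)
    return {
        pattern: {"n_matches": len(scores), "avg_score": sum(scores) / len(scores)}
        for pattern, scores in scores_by_pattern.items()
    }


# Hypothetical names, chosen only to illustrate the notation.
print(match_pattern("21PT1019.PV", "21PT1019"))
print(match_pattern("21PT1019.PV", "21PIT1019"))
```

Under these assumptions, a source whose digits and letters all reappear in the target yields a fully bracketed pattern, while a source that shares only its digits leaves the letter run as a plain `L`. Comparison here is exact and case-sensitive, which may differ from what the categorizer does.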