The #1 factor in ADX/KQL database performance

Azure Data Explorer Blog > The #1 factor in ADX/KQL database performance

https://techcommunity.microsoft.com/t5/azure-data-explorer-blog/the-1-factor-in-adx-kql-database-performance/ba-p/4147969

The #1 factor in ADX/KQL database performance
In Power BI or any other tool

In this article I’ll show many variations of a query executed on a large table that contains public events arriving at GitHub.
The query summarizes data for 10 or 20 days and I compare the CPU consumption of the query in different syntax variations.
I mention only CPU time and not execution time because execution can vary by the cluster size and load on the cluster.My purpose is to demonstrate how the query performs well when the date filter is used by the engine to limit the number of scanned extents (aka shards).
In some cases, the query scans all extents, and it takes a lot of CPU.
In other cases, only a small subset of the extents are scanned and performance is good.
In a follow-up article I’ll explain how Power BI and ADX dashboards can be used to filter and join tables in an optimal way.

Queries on a single table
1. The query summarize 10 days of data.
An element is extracted from a Json structure and a distinct count operation is done on the extracted value. These two operations contribute significantly to the overall cost.
Above each query you can see the CPU seconds, the volume of scanned data and the number of scanned extents.

// 6.53 1.98GB 128
EventsFromLiveStream
| where CreatedAt between(datetime(2024-4-1)..10d)
| extend Login=tostring(Actor.login)
| summarize count(),dcount(Login) by Type

2. The same for 20 days. The cost is almost exactly double which is expected.
This is the benchmark against which we can compare all other variations.

// 12.5 3.63GB 132
EventsFromLiveStream
| where CreatedAt between(datetime(2024-4-1)..20d)
| extend Login=tostring(Actor.login)
| summarize count(),dcount(Login) by Type

3. A function is applied to the datetime column and so the effect of filtering is lost. All data is scanned and cost is 4 times more

// 49.87 8.67 all
EventsFromLiveStream
| extend shiftdata=datetime_add(‘hour’,2,CreatedAt)
| where shiftdata between(datetime(2024-4-1)..20d)
| extend Login=tostring(Actor.login)
| summarize count(),dcount(Login) by Type

4. Another variation of shifting the datetime 2 hours forward and then filtering. Equally bad as #3

// 49.3 8.67 all
EventsFromLiveStream
| extend shiftdata=CreatedAt + 2h
| where shiftdata between(datetime(2024-4-1)..20d)
| extend Login=tostring(Actor.login)
| summarize count(),dcount(Login) by Type

5. Another function (bin) is applied to the datetime column but this time the filter is applied correctly. Cost is a bit higher because the actual bin function needs to be calculated.
// 13.42 3.79GB 132
EventsFromLiveStream
| extend Day=bin(CreatedAt,1d)
| where Day between(datetime(2024-4-1)..20d)
| extend Login=tostring(Actor.login)
| summarize count(),dcount(Login) by Type

6. Same as #5, startofday , startofmonth are also applied correctly.

// 13.51 3.79GB 132
EventsFromLiveStream
| extend Day=startofday(CreatedAt)
| where Day between(datetime(2024-4-1)..20d)
| extend Login=tostring(Actor.login)
| summarize count(),dcount(Login) by Type

7. The worst-case scenario – 45 times slower than the base

Trying to shift the datetime value using a very expensive function that needs to be applied to all rows. Also, the filter cannot be used.
In this case filtering on 10 days or 20 days cost the same because almost all the CPU is spent on the datetime_utc_to_local function.

// 9:51.67 8.7GB all
EventsFromLiveStream
| extend LocalTime=datetime_utc_to_local(CreatedAt,’America/Buenos_Aires’)
| where LocalTime between(datetime(2024-4-1)..20d)
| extend Login=tostring(Actor.login)
| summarize count(),dcount(Login) by Type

8. Shifting the filter range instead of shifting the data.
Cost is back to base.
Notice that leaving the statement of calculating local time doesn’t cost anything because the result is not used so it is not calculated

// 19.5 3.53GB 154
EventsFromLiveStream
| extend LocalTime=datetime_utc_to_local(CreatedAt,’America/Buenos_Aires’)
| where CreatedAt between(datetime_local_to_utc(datetime(2024-4-1),’America/Buenos_Aires’)..20d)
| extend Login=tostring(Actor.login)
| summarize count(),dcount(Login) by Type

9. Add another where clause on the base datetime column .
Still more expensive but not by such a big margin.
Notice that although the filter on the original column is mentioned after the calculation of the shifted datetime value , it is executed before and so only a small subset of the data is actually shifted.

// 19.5 3.63GB 154
EventsFromLiveStream
| extend LocalTime=datetime_utc_to_local(CreatedAt,’America/Buenos_Aires’)
| where CreatedAt between (datetime(2024-3-30) ..21d )
| where LocalTime between(datetime(2024-4-1)..20d)
| extend Login=tostring(Actor.login)
| summarize count(),dcount(Login) by Type

Applying the filter on the left side of a join

10. A dates table is joined with the Events table.
The dates table is on the left side of the join.
The filter on the dates table is applied to the right side when the filter is using in or ==

// 1:16 8:14GB 134
let Calendar = range Day from datetime(2024-1-1) to datetime(2024-12-31) step 1d;
Calendar | where Day in(
datetime(2024-04-01T00:00:00Z),
datetime(2024-04-02T00:00:00Z),
datetime(2024-04-03T00:00:00Z),
datetime(2024-04-04T00:00:00Z),
datetime(2024-04-05T00:00:00Z),
datetime(2024-04-06T00:00:00Z),
datetime(2024-04-07T00:00:00Z),
datetime(2024-04-08T00:00:00Z),
datetime(2024-04-09T00:00:00Z),
datetime(2024-04-10T00:00:00Z),
datetime(2024-04-11T00:00:00Z),
datetime(2024-04-12T00:00:00Z),
datetime(2024-04-13T00:00:00Z),
datetime(2024-04-14T00:00:00Z),
datetime(2024-04-15T00:00:00Z),
datetime(2024-04-16T00:00:00Z),
datetime(2024-04-17T00:00:00Z),
datetime(2024-04-18T00:00:00Z),
datetime(2024-04-19T00:00:00Z),
datetime(2024-04-20T00:00:00Z))
| join kind=inner hint.strategy=broadcast
(EventsFromLiveStream | extend Day=startofday(CreatedAt)) on Day
| extend Login=tostring(Actor.login)
| summarize count(),dcount(Login) by Type

11. When the filter on the left side is using a between or < or > , it is not applied to the right side of the join.
The results are correct, but performance is bad.

// 47:40 206.5GB all
let Calendar = range Day from datetime(2024-1-1) to datetime(2024-12-31) step 1d;
Calendar | where Day between(datetime(2024-4-1)..20d)
| join kind=inner hint.strategy=broadcast
(EventsFromLiveStream | extend Day=startofday(CreatedAt)) on Day
| extend Login=tostring(Actor.login)
| summarize count(),dcount(Login) by Type

12. Improvement on the join, extract the Login value inside the parentheses of the right side table. The join still have a cost but the overall is much less than in #9.
The reason is that the dynamic column Actor does not need to be part of the join, only the login value

// 24 8:33GB 150
let Calendar = range Day from datetime(2024-1-1) to datetime(2024-12-31) step 1d;
Calendar | where Day in(
datetime(2024-04-01T00:00:00Z),
datetime(2024-04-02T00:00:00Z),
datetime(2024-04-03T00:00:00Z),
datetime(2024-04-04T00:00:00Z),
datetime(2024-04-05T00:00:00Z),
datetime(2024-04-06T00:00:00Z),
datetime(2024-04-07T00:00:00Z),
datetime(2024-04-08T00:00:00Z),
datetime(2024-04-09T00:00:00Z),
datetime(2024-04-10T00:00:00Z),
datetime(2024-04-11T00:00:00Z),
datetime(2024-04-12T00:00:00Z),
datetime(2024-04-13T00:00:00Z),
datetime(2024-04-14T00:00:00Z),
datetime(2024-04-15T00:00:00Z),
datetime(2024-04-16T00:00:00Z),
datetime(2024-04-17T00:00:00Z),
datetime(2024-04-18T00:00:00Z),
datetime(2024-04-19T00:00:00Z),
datetime(2024-04-20T00:00:00Z))
| join kind=inner hint.strategy=broadcast
(EventsFromLiveStream | extend Day=startofday(CreatedAt),Login=tostring(Actor.login)) on Day
| summarize count(),dcount(Login) by Type

The #1 factor in ADX/KQL database performance

Published by satonaoki

Leave a comment Cancel reply

Share this:

Related

Published by satonaoki

Leave a comment Cancel reply