
Comments (7)

pgiraud commented on August 14, 2024

It looks like we can have "Rows Removed by XXX" where XXX can be one of:

  • "Filter",
  • "Join Filter",
  • "Index Recheck".

There are cases where we have 2 of those counts (for example Filter + Index Recheck).
Should I sum them up to get the total of discarded rows?

Or should I treat the different cases differently and only concentrate on "by Filter" for now?
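
For reference, a minimal sketch of the two options (the JSON keys below are the ones emitted by EXPLAIN (FORMAT JSON); the helper itself is hypothetical, not pev2 code):

```typescript
// Hypothetical helper, not pev2 code: reads the "Rows Removed by ..." counters
// from a node of EXPLAIN (FORMAT JSON) output and either sums them or keeps
// only the plain Filter counter.
interface PlanNode {
  "Rows Removed by Filter"?: number;
  "Rows Removed by Join Filter"?: number;
  "Rows Removed by Index Recheck"?: number;
}

function rowsRemoved(node: PlanNode, sumAll: boolean): number {
  const byFilter = node["Rows Removed by Filter"] ?? 0;
  if (!sumAll) {
    return byFilter; // option 2: concentrate on "by Filter" only
  }
  // option 1: sum every "Rows Removed by ..." counter
  return (
    byFilter +
    (node["Rows Removed by Join Filter"] ?? 0) +
    (node["Rows Removed by Index Recheck"] ?? 0)
  );
}
```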

pgiraud commented on August 14, 2024

"by Join Filter" and "by Index Recheck" have different meanings. I will not take them into account for now.

pgiraud commented on August 14, 2024

Here are several alternatives for showing the information. I personally prefer the last one.

[Screenshots from 2020-02-28: three alternative presentations of the "rows removed" information]

Krysztophe commented on August 14, 2024

I think that the ratio is the most important piece of information, so it should appear in any case. In this example it is obvious that 190165 >> 522, but it is not obvious at first sight that 91234567899 >> 9123456799 (you have to round to see 91 billion vs 9 billion).

Keep the color on the 190165 if you prefer (it's only there to catch the eye after all).

So I'd vote for the first one, or a mix with the others:

Rows removed by Filter: **190165** out of 190687 scanned rows (99%), 522 returned

But maybe this is too much for one line.
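
A minimal sketch of that wording as a formatting helper (the function name and rounding choice are assumptions, not pev2 code):

```typescript
// Hypothetical formatting helper illustrating the suggested one-liner.
// Flooring avoids displaying "100%" while some rows are still returned.
function formatRowsRemoved(removed: number, returned: number): string {
  const scanned = removed + returned;
  const percent = scanned > 0 ? Math.floor((removed / scanned) * 100) : 0;
  return `Rows removed by Filter: ${removed} out of ${scanned} scanned rows (${percent}%), ${returned} returned`;
}

// formatRowsRemoved(190165, 522)
// => "Rows removed by Filter: 190165 out of 190687 scanned rows (99%), 522 returned"
```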

Krysztophe commented on August 14, 2024

About "Rows Removed by Join Filter'': I think this is a similar problem, an indication that some filtering occur that an index may perhaps solve.

"Index Recheck" is another beast: it is more a problem of index efficiency (a bit like the Heap fetches in #99), which may be interesting too.

pgiraud commented on August 14, 2024

Here's an additional example showing that the number of rows actually returned should be taken from the direct parent, not the Gather node.

```
explain (analyze) select sum(i) from t_un_million_int where i+0 = 500000 ;
                                                                  QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=11680.42..11680.43 rows=1 width=8) (actual time=56.924..56.925 rows=1 loops=1)
   ->  Gather  (cost=11680.21..11680.42 rows=2 width=8) (actual time=56.657..63.498 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate  (cost=10680.21..10680.22 rows=1 width=8) (actual time=49.794..49.795 rows=1 loops=3)
               ->  Parallel Seq Scan on t_un_million_int  (cost=0.00..10675.00 rows=2083 width=4) (actual time=42.558..49.772 rows=0 loops=3)
                     Filter: ((i + 0) = 500000)
                     Rows Removed by Filter: 333333
 Planning Time: 0.129 ms
 Execution Time: 63.581 ms
(10 rows)
```
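
In the plan above, the Parallel Seq Scan reports rows=0 and Rows Removed by Filter: 333333, both averaged over loops=3. A minimal sketch of how totals can be reconstructed from those per-loop values (the interface and helper are illustrative only, not pev2 code):

```typescript
// Illustrative only: per-node "actual rows" and "Rows Removed by Filter"
// are per-loop averages, so totals must be multiplied by the loop count,
// and the rounding can hide small totals entirely.
interface NodeStats {
  actualRows: number;          // per-loop average as printed by EXPLAIN
  loops: number;
  rowsRemovedByFilter: number; // also a per-loop average
}

function estimatedTotals(stats: NodeStats) {
  return {
    totalRows: stats.actualRows * stats.loops,             // 0 * 3 = 0, but 1 row really matched
    totalRemoved: stats.rowsRemovedByFilter * stats.loops, // 333333 * 3 = 999999 of ~1M rows
  };
}
```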

pgiraud commented on August 14, 2024

After a deeper investigation I think it's a bad idea to take parent nodes into account to count the number of actual rows. First of all, the number of actual rows is indeed the average number of rows per loop, and not only for parallel nodes; it has to be multiplied by the number of loops to get the number of rows returned by the node. This is imprecise, though, since the average value is rounded: in some cases the average is 0, and it's impossible to know the actual value for the node itself.
Moreover, in my opinion it is not very safe to guess the value from the parent node, because it may differ a lot from one node type to another.

So I decided to use the values that are given in the plan without trying to guess more. In some cases the UI will show "100% of rows removed", which may not be very accurate. When it isn't accurate (because of several loops), the UI will indicate that the values are approximate.

[Screenshot from 2020-04-17: updated UI showing the approximated "rows removed" values]

As suggested by @ioguix, I also took the duration into account to decide when to show and highlight the "rows removed ratio". The ratio on its own is not very relevant: it can be very high while having almost no impact on the query duration. It only becomes relevant if the node's relative duration is high as well.
For example, if a node takes 1% of the whole execution time, we don't really care that 90% of its rows were removed. However, 50% of rows removed in a node taking 95% of the execution time must be taken into account.
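
A minimal sketch of that heuristic (the threshold values and names below are assumptions, not necessarily what pev2 actually uses):

```typescript
// Hypothetical check: only highlight the "rows removed" ratio when the node
// also accounts for a significant share of the total execution time.
function shouldHighlightRowsRemoved(
  removedRatio: number,  // rows removed / rows scanned, between 0 and 1
  nodeDuration: number,  // exclusive time spent in this node, in ms
  totalDuration: number  // total execution time of the query, in ms
): boolean {
  const durationShare = totalDuration > 0 ? nodeDuration / totalDuration : 0;
  // e.g. a node taking 1% of the query is ignored even at a 90% ratio,
  // while a node taking 95% of the query is flagged even at 50%.
  return durationShare >= 0.1 && removedRatio >= 0.5;
}
```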
