
Comments (7)

pgiraud commented on August 14, 2024

It looks like we can have "Rows Removed by XXX" where XXX can be one of:

  • "Filter",
  • "Join Filter",
  • "Index Recheck".

There are cases where we have 2 of those counts (for example Filter + Index Recheck).
Should I sum them up to get the total of discarded rows?

Or should I treat the different cases differently and only concentrate on "by Filter" for now?
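
For reference, a minimal sketch of the two options (the JSON keys below are the ones emitted by EXPLAIN (FORMAT JSON); the helper itself is hypothetical, not pev2 code):

```typescript
// Hypothetical helper, not pev2 code: reads the "Rows Removed by ..." counters
// from a node of EXPLAIN (FORMAT JSON) output and either sums them or keeps
// only the plain Filter counter.
interface PlanNode {
  "Rows Removed by Filter"?: number;
  "Rows Removed by Join Filter"?: number;
  "Rows Removed by Index Recheck"?: number;
}

function rowsRemoved(node: PlanNode, sumAll: boolean): number {
  const byFilter = node["Rows Removed by Filter"] ?? 0;
  if (!sumAll) {
    return byFilter; // option 2: concentrate on "by Filter" only
  }
  // option 1: sum every "Rows Removed by ..." counter
  return (
    byFilter +
    (node["Rows Removed by Join Filter"] ?? 0) +
    (node["Rows Removed by Index Recheck"] ?? 0)
  );
}
```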

pgiraud commented on August 14, 2024

"by Join Filter" and "by Index Recheck" have different meanings. I will not take them into account for now.

pgiraud commented on August 14, 2024

Here are several alternatives for showing the information. I personally prefer the last one.

[Screenshots from 2020-02-28: three alternative presentations of the "rows removed" information]

Krysztophe commented on August 14, 2024

I think that the ratio is the most important piece of information, so it should appear in any case. In this example it is obvious that 190165 >> 522, but it is not obvious at first sight that 91234567899 >> 9123456799 (you have to round to see 91 billion vs 9 billion).

Keep the color on the 190165 if you prefer (it's only there to catch the eye after all).

So I'd vote for the first one, or a mix with the others:

Rows removed by Filter: **190165** out of 190687 scanned rows (99%), 522 returned

But maybe this is too much for one line.
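
A minimal sketch of that wording as a formatting helper (the function name and rounding choice are assumptions, not pev2 code):

```typescript
// Hypothetical formatting helper illustrating the suggested one-liner.
// Flooring avoids displaying "100%" while some rows are still returned.
function formatRowsRemoved(removed: number, returned: number): string {
  const scanned = removed + returned;
  const percent = scanned > 0 ? Math.floor((removed / scanned) * 100) : 0;
  return `Rows removed by Filter: ${removed} out of ${scanned} scanned rows (${percent}%), ${returned} returned`;
}

// formatRowsRemoved(190165, 522)
// => "Rows removed by Filter: 190165 out of 190687 scanned rows (99%), 522 returned"
```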

Krysztophe commented on August 14, 2024

About "Rows Removed by Join Filter'': I think this is a similar problem, an indication that some filtering occur that an index may perhaps solve.

"Index Recheck" is another beast: it is more a problem of index efficiency (a bit like the Heap fetches in #99), which may be interesting too.

pgiraud commented on August 14, 2024

Here's an additional example showing that the number of rows actually returned should be taken from the direct parent, not the Gather node.

```
explain (analyze) select sum(i) from t_un_million_int where i+0 = 500000 ;
                                                                  QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=11680.42..11680.43 rows=1 width=8) (actual time=56.924..56.925 rows=1 loops=1)
   ->  Gather  (cost=11680.21..11680.42 rows=2 width=8) (actual time=56.657..63.498 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate  (cost=10680.21..10680.22 rows=1 width=8) (actual time=49.794..49.795 rows=1 loops=3)
               ->  Parallel Seq Scan on t_un_million_int  (cost=0.00..10675.00 rows=2083 width=4) (actual time=42.558..49.772 rows=0 loops=3)
                     Filter: ((i + 0) = 500000)
                     Rows Removed by Filter: 333333
 Planning Time: 0.129 ms
 Execution Time: 63.581 ms
(10 rows)
```
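
In the plan above, the Parallel Seq Scan reports rows=0 and Rows Removed by Filter: 333333, both averaged over loops=3. A minimal sketch of how totals can be reconstructed from those per-loop values (the interface and helper are illustrative only, not pev2 code):

```typescript
// Illustrative only: per-node "actual rows" and "Rows Removed by Filter"
// are per-loop averages, so totals must be multiplied by the loop count,
// and the rounding can hide small totals entirely.
interface NodeStats {
  actualRows: number;          // per-loop average as printed by EXPLAIN
  loops: number;
  rowsRemovedByFilter: number; // also a per-loop average
}

function estimatedTotals(stats: NodeStats) {
  return {
    totalRows: stats.actualRows * stats.loops,             // 0 * 3 = 0, but 1 row really matched
    totalRemoved: stats.rowsRemovedByFilter * stats.loops, // 333333 * 3 = 999999 of ~1M rows
  };
}
```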

pgiraud commented on August 14, 2024

After a deeper investigation I think it's a bad idea to take parent nodes into account to count the number of actual rows. First of all, the number of actual rows is indeed the average number of rows per loop, and not only for parallel nodes; it has to be multiplied by the number of loops to get the number of rows returned by the node. This is imprecise, though, since the average value is rounded: in some cases the average is 0, and it's impossible to know the actual value for the node itself.
Moreover, in my opinion it is not very safe to guess the value from the parent node, because it may differ a lot from one node type to another.

So I decided to use the values that are given in the plan without trying to guess more. In some cases the UI will show "100% of rows removed", which may not be very accurate. When it isn't accurate (because of several loops), the UI will indicate that the values are approximate.

[Screenshot from 2020-04-17: updated UI showing the approximated "rows removed" values]

As suggested by @ioguix, I also took the duration into account to decide when to show and highlight the "rows removed ratio". The ratio on its own is not very relevant: it can be very high while having almost no impact on the query duration. It only becomes relevant if the node's relative duration is high as well.
For example, if a node takes 1% of the whole execution time, we don't really care that 90% of its rows were removed. However, 50% of rows removed in a node taking 95% of the execution time must be taken into account.
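
A minimal sketch of that heuristic (the threshold values and names below are assumptions, not necessarily what pev2 actually uses):

```typescript
// Hypothetical check: only highlight the "rows removed" ratio when the node
// also accounts for a significant share of the total execution time.
function shouldHighlightRowsRemoved(
  removedRatio: number,  // rows removed / rows scanned, between 0 and 1
  nodeDuration: number,  // exclusive time spent in this node, in ms
  totalDuration: number  // total execution time of the query, in ms
): boolean {
  const durationShare = totalDuration > 0 ? nodeDuration / totalDuration : 0;
  // e.g. a node taking 1% of the query is ignored even at a 90% ratio,
  // while a node taking 95% of the query is flagged even at 50%.
  return durationShare >= 0.1 && removedRatio >= 0.5;
}
```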
