Updating statistics on temporary tables, especially PostgreSQL (MDL-39725) PostgreSQL performance can be poor with large temp tables

by Russell Smith -

So, during quiz backup/restore performance testing I discovered that large temporary backup tables in PostgreSQL weren't performing very well: queries were taking around 2 seconds each rather than 20-30ms. See MDL-39725 and related issues for further details. This leaves us in a situation where we need to handle the statistics collection on temporary tables ourselves to ensure reasonable performance when those temp tables get large.
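For context, the underlying mechanics as I understand them (my summary, not from the tracker): PostgreSQL's autovacuum cannot access other sessions' temporary tables, so their planner statistics are never refreshed automatically, and the planner falls back to default guesses until the owning session runs an ANALYZE itself. A minimal sketch of the manual fix, assuming direct access to Moodle's $DB object on a PostgreSQL install:

    // Sketch only: what "updating statistics" boils down to on PostgreSQL.
    // Autovacuum never sees a session's temporary tables, so the planner has
    // no row-count or selectivity data for them until the session itself runs:
    $DB->execute("ANALYZE {backup_ids_temp}");  // {braces} = Moodle's usual table prefixing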

The analysis I've completed so far suggests PostgreSQL is the only database affected. However, it seems to be the choice of many of the larger installations (including the one I work on), so I think something needs to be done.

Other databases would get a simple empty function that just returns success, as at this point there is no need for these updates in anything except PostgreSQL.

Some discussion has happened on the tracker, and there are three general ideas:

1. Update the statistics with specific calls at times that seem appropriate, e.g. just after large data loads or alterations. In backup that might be just after loading the backup_ids_temp table.

2. Create infrastructure to automatically track the changes code makes to temp tables, and call statistics updates when we consider the changes large enough to warrant it.

3. Write some magic for PostgreSQL so it can use its own statistics to decide whether the statistics for a temporary table should be updated.

I am in favour of (1). It might be the easiest to implement, but I don't think that's critically relevant. As a developer you need to use temporary tables for specific purposes and carefully manage them over their short lives, so you should know when you are adding or changing lots of data. It can be argued that the developer shouldn't need to worry about such things, but I'm yet to see that as a reality when we already spend much of our time optimizing SQL to ensure it's fast on all platforms.

Eloy has expressed interest in (2), but I don't think he expressed the reasons for that on the tracker.  Maybe he could elaborate here?

(3) was just a wild idea to avoid reimplementing PostgreSQL's statistics collection and instead reuse it, making something like (2) less work. However, you would still need to decide when to check those stats and when to act on them.
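To make (3) a little more concrete, here is roughly the kind of check I have in mind, strictly as a sketch: pg_class.reltuples is only refreshed by VACUUM/ANALYZE, so a large gap between it and the real row count suggests the statistics are stale. The 10x threshold is invented for illustration, and I'm assuming the default 'mdl_' table prefix:

    // Sketch of (3): reuse PostgreSQL's own bookkeeping rather than tracking
    // writes ourselves. reltuples is only updated by VACUUM/ANALYZE, so a big
    // gap between it and the live row count means the stats are stale.
    $stats = $DB->get_record_sql("
        SELECT c.reltuples AS planner_estimate,
               (SELECT COUNT(*) FROM {backup_ids_temp}) AS actual_rows
          FROM pg_class c
         WHERE c.relname = 'mdl_backup_ids_temp'
           AND c.relnamespace = pg_my_temp_schema()");
    if ($stats->actual_rows > 10 * max(1, (int)$stats->planner_estimate)) {
        $DB->execute("ANALYZE {backup_ids_temp}");
    }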

I'm open to other ideas, as well as to votes on and changes to the proposed ones.

Thanks

Russell

In reply to Russell Smith

Re: Updating statistics on temporary tables, especially PostgreSQL (MDL-39725) PostgreSQL performance can be poor with large temp tables

by Tim Hunt -

I agree with Petr's comment in the bug: https://tracker.moodle.org/browse/MDL-39725?focusedCommentId=228711&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-228711

"I already tried to invent some automatic analyse for pg temp tables and failed."

"My +1 to add $DB->analyze_temp_tables() and call it manually in affected areas."

The only question I would ask is: should this be a single method ->analyze_temp_tables(), or a more specific ->analyze_temp_table($tablename)?
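For what it's worth, a rough sketch of how the split could look under Petr's suggestion (the method name is just the one proposed here, nothing like it exists in core yet, and I'm assuming the moodle_temptables helper can hand back the list of live temp tables):

    // In moodle_database (the base class): deliberately a no-op, since at this
    // point only PostgreSQL needs a manual statistics refresh.
    public function analyze_temp_tables() {
        // Intentionally empty: other databases manage temp table stats themselves.
    }

    // In pgsql_native_moodle_database: walk the temp tables the DML layer
    // already tracks and refresh the planner statistics for each one.
    public function analyze_temp_tables() {
        foreach ($this->temptables->get_temptables() as $tablename) {
            $this->execute('ANALYZE {' . $tablename . '}');
        }
    }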

In reply to Tim Hunt

Re: Updating statistic on temporary tables, especially PostgreSQL (MDL-39725) PostgreSQL performance can be poor with large temp tables

by Petr Skoda -
We have a list of all temp tables; performance should be OK if we analyze them all, I guess.
In reply to Tim Hunt

Re: Updating statistics on temporary tables, especially PostgreSQL (MDL-39725) PostgreSQL performance can be poor with large temp tables

by Russell Smith -

That decision will drive how the code gets used:

A single call to analyze_temp_tables() will mean developers code so that the analyze is run just prior to big select operations. The thinking there is: I'm about to do lots of queries, so I need the tables analyzed.

An analyze_temp_table($tablename) will mean developers code to run the command after completing bulk insert/update operations. In my head, this is how I'd think about it: I've made lots of changes, so do an analyze.

However, my preferred method does mean that code updating data in temporary tables has to remember to analyze, and that code won't usually be the part where the performance problem shows up.

That said, if you are using temporary tables, you really should understand their entire life cycle. Is that too much to ask?

So my vote is still for analyze_temp_table($tablename).
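To illustrate the difference in calling code (both method names are still just proposals at this point, and the variables are placeholders):

    // Style A: Tim's single method, reader-driven. The code about to run the
    // expensive queries refreshes everything first:
    $DB->analyze_temp_tables();
    $rs = $DB->get_recordset_sql($bigjoinsql, $params);

    // Style B: per-table, writer-driven (my preference). The code that just
    // changed lots of rows refreshes the stats before anyone queries them:
    foreach ($items as $item) {
        $DB->insert_record('backup_ids_temp', $item);
    }
    $DB->analyze_temp_table('backup_ids_temp');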