Optimizing fetches for high-volume, high-traffic site

Author Message

Marko Žmak

Thursday 17 June 2010 9:58:28 am

Sorry for my long post, but if these questions get some good answers I think it will be a benefit for many of us. I would appreciate to get elaborated answers from the folks that know very well the internal structure of eZP, but also any other good ideas and suggestions are welcome. So...

For last few days I was looking for a way to optimze performance of a high-volume, high-traffic news portal. The main problem is that there are many places on this site where the latest news should be displayed, and should be refreshed very often (every 1-2 minutes). I use all of the eZ caching mechanisms, but in some moments (during the daily visit peak) even this is not enough. I should mention that there are also many different registered users browsing the site.

The problem is that fetching 10 latests articles (sorted by the published time) produces big SQL queries (in MySQL) with joins on the ezcontentobject table and ezcontentobject_attribute table. This two tables are the biggest in the database, having aproximately:

ezcontentobject ~ 200.000 rows (and growing rapidly)

ezcontentobject_attribute ~ 2.000.000 rows (and growing rapidly)

and most of this rows (~90%) belogn to the news objects.

The queries that eZ produces with this two tables create quite a large temporary table on the disk (I have investigated and due to the structure of the tables there's no way to avoid it) which slows down the database server pretty much. Also, the concept of doing such big operations to get only 10 latest news is really inefficient.

So in my quest to eliminate this big queries I came to several ideas and questions that I would like to discuss with the eZP developers...

Do an SQL query "only" on ezcontentobject

One idea is to make an SQL query (using a custom made eZ operator) that works on ezcontenobject table, without joins on ezcontentobject_attribute table, and which would get only the object/node IDs of the latest news, and then:

a) Fetch (using eZ fetch operator) only the objects with these IDs.

b) Fetch the objects with these IDs, one by one (using foreach with object/node fetch)

Q: Would one of this sollutions give some improvement, and how much? Would the fetching in sollution a) still produce a join on ezcontentobject and ezcontentobject_attribute?

P.S. I am aware that this query should have also joins to ezcontentobject_version, and other tables, so it cannot be only on ezcontentobject, but that doesn't matter because the main goal is to eliminate joins with ezcontentobject_attribute table.

Possiblites of eZP APIs

Q: Are there some eZP APIs that can be used to fetch only the object IDs, and that would not produce joins on ezcontentobject and ezcontentobject_attribute tables?

If you get the idea, this is en attempt to make the same thing as in question 1), but using eZ DB abstraction layer instead of raw SQL queries.

Use eZ Find for fetching

Q: Could eZ Find be used to fetch the latest news? Would this make some improvement?

Of course, with the assumption that the SOLR index is updated frequently. Which brings some other questions:

Q: Can SOLR index be updated upon publishing of every object? What implication for performance would this bring?

What does load_data_map parameter actually do?

In Lukas Serwatka's eZP 4.2 hidden gems (http://serwatka.net/blog/ez_publish_4_1_and_4_2_hidden_gems) I found this:

The load_data_map parameter for tree and list fetch functions is now only enabled by default if 15 nodes or less are fetched. It can still be specifically enabled or disabled per fetch.

Q: So how does this parameter affect the database queries? If it's disabled, does it create joins on ezcontentobject and ezcontentobject_attribute tables or not?

Optimizing indexes

Q: Is there a way to optimize database indexes on ezcontentobject and ezcontentobject_attribute tables so that this queries run faster? Has anyone had some experience with this?

--
Nothing is impossible. Not if you can imagine it!

Hubert Farnsworth

Gaetano Giunta

Friday 18 June 2010 1:44:24 am

Your questions are pretty much self-answering... It shows you have investigated thoroughly the problem ;-)

- doing custom sql queries: very possible, will give you great speedups. Use ezdb api, write your own sql. Hire (or be) an sql guru to get best results. There is no "standard" "more efficient" api in ez because the most efficient query will be the one that only works in your db (you can hardcode class ids, knowledge of content tree structure and such)

- adding indexes: very much possible. It might even be a good idea in conjunction with the point above

- using solr: quite possible. By default the index IS updated on every object publication, unless you use delayedindexing; net effect: publishing of objects gets slower, as it takes into account now the indexing phase

- usage of load_data_map: try it out: enable echoing into debug of the generated sql with that param on and off, and then run the generated queries with tracing enabled to see the difference. The idea is: if it is disabled, more smaller queries are generated (to fetch data_map later on), if it is enabled, a single query with more joins is generated (all attributes are fetched in one pass)

Principal Consultant International Business
Member of the Community Project Board

Marko Žmak

Saturday 19 June 2010 3:50:08 pm

Thanks Gaetano, here are some more toughts/questions...

Custom sql queries: Actually I would like to avoid that, because using DB abstraction layer is more safe - I don't want to have to upgrade my custom queries when I upgrade to a newer version of eZ that brings some changes in the database tables structure.

eZ APIs: I would still like to know if there is an eZ API function that does this: get only the node id's of latest objects belonging to a class. Does anyone know it?

load_data_map: I have investigated further with the load_data_map parameter, and yes it doesn't make a join to the ezcontentobject_attribute table when is disabled. I'll give this a try first, and see how it goes.

eZ Find: Can eZ Find be used to fetch all the latest nodes, wihtout using the keyword parameter? Would using eZ Fnd in this way be faster or slower than doing a normal fetch? Anyone knows this one?

--
Nothing is impossible. Not if you can imagine it!

Hubert Farnsworth

Marko Žmak

Sunday 20 June 2010 8:53:08 am

After further investigation I came to an interesting point...

The SQL queries produced when fetching nodes always have joins on this tables:

  • ezcontentobject_tree
  • ezcontentobject
  • ezcontentclass
  • ezcontentobject_name

The problem is that this joins ALWAYS produce a temporary table on the disk, and that's because the tables in the join contain longtext fields (since mysql can't use memory storage for tables containing text fields). Specifically this ones:

ezcontentclass - filed "serialized_name_list"

ezcontentobject_tree - field "path_identification_string"

What happens as a result of this is that a temporary table gets created on disk on EVERY fetch made in eZ (list or tree fetch), no matter how high you set the mysql memory limits. Since doing fetches is the most used way of retrieving data in eZ, this can lead to some serious performance issues.

The first questions that comes in mind are:

  • Are this two fields really so important so that have to be fetched in this join? How often are this fields actually used and can they be fetched later?
  • Can some of this fields be converted to varchar?

This is a problem that eZ Team should deal with, and correct it in some future eZP versions. Here are some suggestions and ideas:

1) Convert the "serialized_name_list" to varchar

All the entries in my ezcontentclass table have this field shorter than 100 chars. Why it has to be a longtext field?

Also is this field really needed here? Can it's data be divided into some other table fields?

2) Move "path_identification_string" out of ezcontentobject_tree table

The first idea is to move this field out from ezcontentobject_tree table, for example in a new table ezcontentobject_tree_path. So that it doesn't get involved in this fetch but is fetched later when needed.

Another idea would be to convert path_identification_string to varchar. For example in my DB the longest occurrence of this field is 199. So could this conversion be done? What are the implications?

I think this issue deserves some serious consideration from the eZ development team, since it can bring some really big performance gains. So what are your opinions?

--
Nothing is impossible. Not if you can imagine it!

Hubert Farnsworth

Gaetano Giunta

Sunday 20 June 2010 12:37:35 pm

Very interesting findings indeed. Thanks for the time you spent analyzing this!

My thoughts (disclaimer: I do not work for the engineering division):

'serialized_name_list': this is a horrible hack, and it should really be undone.

Unfortunately, the recent addition of "class description" and "attribute description" in content classes has been done using the same strategy, eg. a single text field where the multiple-lang versions of the description is stored. Which means we now have even more text cols to get rid of. I filed a bug in the tracker about this already, with my idea: move all the 'translatable' fields in a separate table in the db, with one lang per row, instead of gluing them together in huge blobs. The advantage is that over time more translatable elements can be added without a big fuss.

'path_identification_string': this one is tougher, as it is deeper in the system. But since we have multilingual urls, I think we should drop this column altogether... If we cannot remove it, we surely cannot shrink it: we have many sites where the name of a single node (ie. a part of the the path_identification_string) is longer than 255 chars in its own...

Principal Consultant International Business
Member of the Community Project Board

Gaetano Giunta

Sunday 20 June 2010 12:39:51 pm

ps: since you tested the size of your data in those two cols and found it small, could you try to ALTER those cols to varchar and measure the speed/db load difference?

Principal Consultant International Business
Member of the Community Project Board

Marko Žmak

Wednesday 04 August 2010 12:39:00 am

Sorry for the late answer but I've been busy lately... Here are some results of my research

Removing longtext fields

I tried to alter the columns ezcontentobject_tree_path and serialized_name_list, and I have changed them to varchar(200). After that the temporary tables for this problematic joins is no longer created on disk, but is kept in memory, I have checked that by inspecting the mysql status variables before and after the query. So this is good news.

But unfortunately, this didn't give any improvement, the problematic query was not faster. Which is very strange, it seems like creating temporary table on the disk is faster than in memory.

One possible reason for this could be that while this temporary table is created in memory, the system allocates this memory in disk swap, so actually it ends up on the disk anyway. This is just one idea and I didn't have the time to inspect it further, if anyone has some other idea feel free to share it.

But still I encourage anyone to give this a try on a test database and send the results if it gets some improvement.

Use eZ Find for fetching

I gave this a try in a quick test, and the results are promising. It works nicely for some cases. The only "complicated" thing is to decide where to use it, and to map the parameters from a regular fetch to the ezfind fetch.

What does load_data_map parameter actually do?

I gave this a try and it didn't bring any improvement. Actually the performance graphs showed a little bit more load on the database server. So I gave up on this. But maybe for some other sites this could still produce some performance improvements. It probably depends from case to case.

Optimizing indexes

I found a way for a big performance speedup by adding some indexes. It involves sections and an "archiving" mechanism. I'm currently testing it and will report more when I'm done.

--
Nothing is impossible. Not if you can imagine it!

Hubert Farnsworth

Luis Delgado

Saturday 09 April 2011 6:35:03 pm

"

Sorry for the late answer but I've been busy lately... Here are some results of my research

Removing longtext fields

I tried to alter the columns ezcontentobject_tree_path and serialized_name_list, and I have changed them to varchar(200). After that the temporary tables for this problematic joins is no longer created on disk, but is kept in memory, I have checked that by inspecting the mysql status variables before and after the query. So this is good news.

But unfortunately, this didn't give any improvement, the problematic query was not faster. Which is very strange, it seems like creating temporary table on the disk is faster than in memory.

Optimizing indexes

I found a way for a big performance speedup by adding some indexes. It involves sections and an "archiving" mechanism. I'm currently testing it and will report more when I'm done.

"

Marko,

Hi after reading the whole thread I have found myself in a similar issue. Unfortunately I am not as experienced as you are. I found the above issues also apply in my case, I've been tweaking my.cnf growing cache tables over a over. I now have around 4000...

As I am not an SQL expert I am trying to find out how to index join queries but Im lost here... Have you been able to Optimize indexes? I'm sure that this thread will be very helpful for a lot of people that are trying to make a stable / memory efficeint / low server load configuration

Cheers

GOOD 4 ALL

Marko Žmak

Monday 11 April 2011 12:30:04 am

Hi Luis.

In the end I solved my problem by using indexes like this:

  • indexed the ezcontentobject table on section_id field
  • created a section "Archive"
  • created an extension for archiving that changes the section of the objects (moves old objects into "Archive" section)
  • modified critical fetches so that only objects that are not in the "Archive" section are fetched

You can use this archiving extension, it's not yet documented, but you can contact me if you need help in setting it up.

Also, if I take a look at your database, the code of your site and the sql queries it produces, I would be able to help you more. Feel free to contact me about that too.

--
Nothing is impossible. Not if you can imagine it!

Hubert Farnsworth

Powered by eZ Publish™ CMS Open Source Web Content Management. Copyright © 1999-2014 eZ Systems AS (except where otherwise noted). All rights reserved.