Forums / Developer / Database scalability

Database scalability

Author Message

Sao Tavi

Saturday 26 March 2011 7:59:31 am

1)

I decided to start importing to eZ Publish some data from the old site. While making the importing files in the right format, I realized that there may be a problem because of the database size (about 3 millions elements, each having 6 fields). One field should be searchable, the rest only used for display/input in some other part of the website. On the old site had a separate database for this and the data was split into tables (categories), but on eZ I wanted to import that data as nodes for some class.

So is it ok I import that data as nodes? Or should I use a separate database with tables for each category?

2)

As the answer of one I guess that it would be to use separate database, I have some other question: each user can create data that he/she can change later. It is used to do some calculations, each data set would have about 14 fields (one searchable), The problem is that on my old site I had about 25 000 users and a total of over 200 000 of this data type. On the new one I expect a huge growth as the old one was in a local language.

The question is the same: should I use some class defined in eZ for this, or should I make a separate table? I am afraid that eZ will be very slow if I set these as nodes.

3)

Same problem as above, but this time only about 20 000 data sets with 6 nodes each. This data set will not expand/change in time. Separate table or ez nodes?

These questions may sound silly, but I still want some advice first, before starting the implementation. Any advice on how I should proceed is welcome.

Ivo Lukac

Monday 28 March 2011 1:05:05 am

Hi,

You need to know few things before you decide:

- every attribute goes into one row in ezcontentobject_attribute table. Meaning: 3 millions elements, each having 6 fields = 18 million rows

- these rows are further growing by each version (default max 10 per object) and translation (depends on number of languages)

- to have more few thousands objects with the same parent node would be heavy to cache system (clearing, etc.)

You need to first decide:

- do you need versions?

- do you need translations?

- do you need hierarchy (like a node tree)

- are there any standard node features needed (view cache, same editor experience, etc..)?

My 2 choices would be to:

- create a special module & separate table if "no" is the answer for 4 questions above (no object, no nodes)

- otherwise create a separate table and make a single datatype which uses those tables + set max 2 version for the class using the datatype + separate object in more folders so you don't get 100000 nodes with same parent (objects & nodes, but with single datatype)

http://www.linkedin.com/in/ivolukac
http://www.netgen.hr/eng/blog
http://twitter.com/ilukac

Sao Tavi

Monday 28 March 2011 5:36:13 am

Thank you for your answer

That's was I was thinking about, why would I use eZ instead of my own tables and you helped me complete the list with what I need and what I do not.

1)

I will go for (over 100) separate tables to make the load as low as possible, as I do not need versions, translations and all the tools provided for nodes.

2)

For this I believe that the second choice would be interesting, a separate table for that specific class, as I need versions (the users used to create the same object with minor changes to compare the results) and hierarchy. Is there any documentation on how to do this?

3)

Separate table again.

Ivo Lukac

Monday 28 March 2011 6:44:14 am

"

For this I believe that the second choice would be interesting, a separate table for that specific class, as I need versions (the users used to create the same object with minor changes to compare the results) and hierarchy. Is there any documentation on how to do this?

"

This looks promising: http://share.ez.no/learn/ez-publish/creating-datatypes-in-ez-publish-4

http://www.linkedin.com/in/ivolukac
http://www.netgen.hr/eng/blog
http://twitter.com/ilukac

Sao Tavi

Monday 28 March 2011 2:01:39 pm

Thank you again for your answer

I've read the article, it is interesting and explains how to create a datatype, but the problem is that is does not say anything about a different table and I don't believe this is done by default. Do you know any extension that creates a new datatype and stores it in a separate table?

Gaetano Giunta

Tuesday 29 March 2011 1:21:11 am

It might be faster to use table partitioning, if you are strong in dba skills and short on development time: it should be pretty straightforward to set up the ezcontentobject, ezcontentobjectattribute and ezcontentobjectversion tables to be partitioned based on eg. content class. The speed gains should be enough to get you rolling with huuge databases...

Principal Consultant International Business
Member of the Community Project Board

Ivo Lukac

Tuesday 29 March 2011 1:25:46 am

"

I've read the article, it is interesting and explains how to create a datatype, but the problem is that is does not say anything about a different table and I don't believe this is done by default. Do you know any extension that creates a new datatype and stores it in a separate table?

"

It should be straightforward by eZPersistentObject inheritance. Check how we did it here:

https://github.com/ezsystems/eztags/blob/master/classes/eztagsobject.php

http://www.linkedin.com/in/ivolukac
http://www.netgen.hr/eng/blog
http://twitter.com/ilukac

Sao Tavi

Tuesday 29 March 2011 6:14:15 am

Thank you for your help

I learned a lot these days about PHP (like the new DateTime class), MySQL (partitioning) and eZ system. I decided to take the easiest path for my extension and just partition (and subpartition) those tables as suggested.

Sao Tavi

Saturday 02 April 2011 11:33:22 am

It seems that I am having problems partitioning after content class id... So I am left with the other option.

Sao Tavi

Sunday 03 April 2011 4:31:00 am

So... after trying again I eventually understood that I can add fields to the unique index and that is what was lacking to make partitioning work.

The new problem is: I partitioned ezcontentobject after contentclass_id but I cannot find contentclass_id in any of the tables ezcontentobject_version and ezcontentobject_attribute. Maybe I am missing something?

PS: I found the id in ezcontentclass_attribute as contentclassattribute_id. This makes things a little bit more complicated, but it still works. On the other hand, I cannot see anything useful in ezcontentobject_version... Maybe here a "blind" partitioning after the id with partitions smaller than 200 000 ?

Sorry for asking so many questions, I am still in learning stage.

Gaetano Giunta

Monday 04 April 2011 2:35:59 am

You are more than welcome to ask questions.

About the table ezcontentobject_attribute:you can use contentclassattribute_id. It holds one value per class_attribute, so depending on your db configuration it can become pretty messy to split it in one-subtable per-value.

The fast option would be to partition by range, so that you can create eg. one subtable for values 1-50, one for 51-100, etc...

The 'optimal' option would be, I think, to partition by list, and give for every subtable the list of attributes that make up a single ez content class. A bit of a chore to do, but you can easily write a php script, or even faster, an ezp template, that loops over every class and gives as output the correct sql commands with the list of contentclassattribute_ids. (If you do the script, please share it!)

About ezcontentobject_version: it's a pity that we do not have useful columns to partition by. If the site is multilingual, using language_mask would make sense. You could partition by version+status if your content gets edited a lot. A 'blind' range partitioning with 100k-to-1M rows per table can be done

Whatever you do, please test any change you get in performances, when

- viewing an object

- creating an object

- editing it

Principal Consultant International Business
Member of the Community Project Board

Marko Žmak

Monday 11 April 2011 7:02:59 am

Sao, just a few notes if you decide to go with the node implementation...

1) First, in your eZ fetches avoid attribute filtering and sorting at all cost, whenever you can. When using attribute filtering/sorting an sql join is made with two biggest tables, ezcontentobject and ezcontentobject_attribute.

And when you don't use attribute filtering, only tables ezcontentobject_tree, ezcontentobject, ezcontentclass and ezcontentobject_name are joined.

2) Depeding on your specific needs you might use eZ Find in order to search through the field that should be searchable. It will be much more faster than doing fetches.

3) In my experience I've done some really nice preformance gains by indexing ezcontentobject by section ID and keeping only a small amount of the newest content in one section (the site has about 10 attributes per object and more than 200 000 objects and growing every day).

4) When importing data, a good thing to do is to skip some parts of the eZ publishing process, see here:

  • http://share.ez.no/forums/developer/publish-objects-from-script-without-cache-clear/comment64857

And maybe based on this you could create your own module for publishing your content which also skips some uneeded parts.

5) In eZP 4.5 a new "Asynchronous publishing" feature is introduced, some notes about it here:

  • http://share.ez.no/forums/developer/ez-engineering-update-heavy-parallel-publishing-support/

So that might also be worth trying.

6) Database clustering might also be a good idea to try:

  • http://doc.ez.no/eZ-Publish/Technical-manual/4.x/Features/Clustering
ezcontentobject_tree,                       ezcontentobject,ezcontentclass                       , ezcontentobject_name

--
Nothing is impossible. Not if you can imagine it!

Hubert Farnsworth