eZFind2 - indexing speed incredibly low(er)

Denitsa M.

Thursday 30 April 2009 4:29:07 am

Hi,

I'm guessing others have also seen indexing time increase enormously after moving from eZ Find to eZ Find 2 and the accompanying change of Solr... Has anyone tried to index 60k+ objects? If yes, how long did the indexing take? Indexing a database with more than 60k objects has been quite (abnormally, I think) long for us: after roughly 6-7 hours, only ~68% of the data has been indexed.

Does anyone have, or has anyone had, the same issue? Did anyone solve it?

Thanks,
Denitsa

Iguana IT - http://www.iguanait.com

Carlos Revillo

Thursday 30 April 2009 10:52:18 am

Hi.

I haven't seen such long indexing times on our sites. The last one has more than 130,000 content objects, and the index script finished in about 90 minutes on my development machine (Ubuntu, 1 GB RAM).

Maybe you can add more concurrent processes and see how it goes...
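
For reference, an invocation along these lines is what I mean; the option name is from memory and may differ between eZ Find versions, so check the script's --help output, and replace my_siteaccess with your own siteaccess:

php extension/ezfind/bin/php/updatesearchindexsolr.php -s my_siteaccess --conc=4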

Denitsa M.

Sunday 03 May 2009 11:03:18 pm

Hi, Carlos,

thanks for the reply! I am indexing with 4 concurrent processes, since the recommendation is for their number to match the number of CPU cores on the machine. Solr is taking a lot of memory, but I suspect this indexing problem comes from the database, and I am now even more convinced of that: if it takes you an hour and a half to index twice the number of objects I have, mine should take about an hour at most. Thanks again for your help :)

Denitsa

Iguana IT - http://www.iguanait.com

Christian Rößler

Monday 04 May 2009 1:14:11 am

Hi,

indexing time also depends on the content object types you have.
A site may have:
- articles in the database
- external files (which are content objects in the DB too, but only hold a reference to the filesystem).

What I want to say is: 'it is slow' is not quite a good starting point :) Please take a look at:
- overall database usage
- queries per second
- the database's cache/key hit rate
- disk I/O while indexing (PDFs vs. database content), since a new external process has to be forked for binary files (e.g. pdftotext)
- overall CPU usage
- is the server swapping memory to disk while indexing?

When indexing only database-related content (articles and other content objects, not PDFs), the limiting factors are the SQL server, PHP itself and the Solr process. Try giving Solr more RAM, and perhaps tune your database settings (caching, key buffer sizes).
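
If you run the Solr that ships with the extension via its Jetty launcher, raising the heap should just be a matter of the usual JVM flags, for example (the values are only a starting point, adjust them to your hardware):

java -Xms256m -Xmx512m -jar start.jar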

The simple reason why eZ Find 2 is much slower than eZ Find 1 is that it is far more complex and has to process quite a bit more data when adding an object to its index. Just take a look at the feature list (facets and so on); those features take their time during indexing and content creation.
But you are right, 6 to 7 hours is pretty long. Did you use bin/php/updatesearchindex.php or updatesearchindexsolr.php (which comes with eZ Find) to index the content?

Hannover, Germany
eZ-Certified http://auth.ez.no/certification/verify/395613

Denitsa M.

Monday 04 May 2009 2:41:00 am

Hi,

I use the updatesearchindexsolr script from the extension. The server does not swap at all during indexing OR during its normal work, and database usage is as low as possible, since I index on a test server and database, not on the production server. The server is not overloaded during indexing, although Solr of course raises CPU usage, which is unavoidable when it comes to Java.

Also, the database contains lots of articles and what you call "external files"; nevertheless, I still do not see why this should take half a day to index... So it all comes back to the database and MySQL, or so it seems...

Big thanks for your help! :)

Denitsa

Iguana IT - http://www.iguanait.com

Denitsa M.

Tuesday 12 May 2009 8:18:19 am

Hi,

it turns out that part of the problem is related to this one: http://ez.no/developer/forum/suggestions/ezkeyword_optimization - the presence of ezkeyword attributes greatly increases indexing time and index data size... Has anyone succeeded in optimizing ezkeyword for Solr indexing, or at all?

Iguana IT - http://www.iguanait.com

Denitsa M.

Monday 18 May 2009 1:13:06 am

Hi,

the problem is also related to ezobjectrelationlist attributes that relate an object to other content objects which themselves have relation list attribute(s) of the same kind. Is it possible that, when indexing such objects (Article A has a relation attribute that includes Article B, whose relation attribute includes Article A), the indexing falls into a loop while trying to process the relation list attributes?

Also, the useFork problem is present in eZ Find 2 as it was in eZ Find. If you allow forking, the indexing process sometimes (too often, IMO) comes to a stop at some point and you have to start indexing all over again. These inexplicable "halts" can be seen if you watch your server's process list (top): the PHP, Java and MySQL processes serving the indexing simply DISAPPEAR from the listing and indexing stops, even though ps aux still shows the update index script as running. Setting useFork to false, as before, slows indexing down a bit, but indexing completes without problems and all processes behave normally without overloading the system.

Iguana IT - http://www.iguanait.com

Ivo Lukac

Sunday 24 May 2009 3:13:36 am

Hello Denitsa,

I only now stumbled on this thread. We had some similar problems with slow indexing. After spending a lot of time trying to figure it out, we finally managed to resolve the problem.
The problem was in the objectrelation attribute, which you also mentioned. The indexing script has recursion protection, but only for one step (object A links to object B, object B links to object A).
We had chains of relations (object A links to object B, B to C, C to A), and indexing has no hard limit on how deep it follows relations, which is a bit silly. I don't need metadata from object C when I index object A.
Our solution was to hack the indexing code so that the indexer only fetches metadata of directly related objects and does not go any further.
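
I don't have the exact diff at hand, but the idea was roughly the following. This is only an illustrative sketch, not the real eZ Find code: relatedObjectsMetaData() and collectOwnMetaData() are made-up names standing in for what the indexer does when it serializes related objects, and it assumes eZ Publish's own class autoloading for eZContentObject.

// Sketch: collect searchable metadata of directly related objects only,
// instead of recursing through their relations as well.
function relatedObjectsMetaData( eZContentObject $object, $depth = 0, $maxDepth = 1 )
{
    // collectOwnMetaData() is a hypothetical helper that serializes the
    // object's own attribute data.
    $metaData = array( collectOwnMetaData( $object ) );

    if ( $depth >= $maxDepth )
    {
        // Stop here: do not follow the relations of related objects.
        return $metaData;
    }

    foreach ( $object->relatedContentObjectList() as $relatedObject )
    {
        // Recurse one level down; with $maxDepth = 1 this collects metadata
        // of directly related objects and nothing deeper.
        $metaData = array_merge( $metaData, relatedObjectsMetaData( $relatedObject, $depth + 1, $maxDepth ) );
    }
    return $metaData;
}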

Cheers

http://www.linkedin.com/in/ivolukac
http://www.netgen.hr/eng/blog
http://twitter.com/ilukac

Carlos Revillo

Sunday 24 May 2009 6:04:00 am

Hi. Why not just make that object relation attribute non-searchable? Maybe that is not possible for your project, but...

Denitsa M.

Monday 01 June 2009 5:04:09 am

Hi,

Yes, making attributes non-searchable helps, and it is what we did, but it is not always possible (sometimes you need exactly those attributes to be searchable). In a larger database of 200,000+ objects Solr still has problems, although for smaller databases it is OK to simply uncheck the searchable option on your problematic classes, and this will ease the indexing process.

Still, maybe it should be considered that the indexing scripts simply do not index more than one level deep into relation attributes. That way we could have indexing for objectrelationlist attributes without worrying about the script looping madly between objects :)

Thanks for all the help,
Deni

Iguana IT - http://www.iguanait.com

Denitsa M.

Wednesday 09 September 2009 2:04:32 am

Hi,

as far as I can see, even with the latest version of eZ Find from SVN trunk this remains a problem that nobody has paid attention to... It is becoming an even bigger issue with the fetch functionality of eZ Find: every attribute you want to use in your filters and so on must now be indexed, so skipping indexing of objectrelation and objectrelationlist attributes means you won't get a shot at switching from standard fetches to eZ Find fetches with facets and filters for better performance and speed...

Is there a chance this will be fixed some time soon? Or, even better, could Ivo Lukac share his hack, if possible as a patch or something?

Thanks!

Deni

Iguana IT - http://www.iguanait.com

Marko Žmak

Tuesday 27 October 2009 11:36:23 am

I have a similar problem: a site with more than 90,000 objects, most of them (90%) having 4 keyword attributes. The updatesearchindexsolr.php script takes a lot of time (about 6 hours) and crashes (segmentation fault) during indexing. I tried increasing the Java memory limit to 1 GB and the PHP memory limit to 512 MB, but it doesn't help.

--
Nothing is impossible. Not if you can imagine it!

Hubert Farnsworth

Denitsa M.

Thursday 29 October 2009 8:46:53 am

Hi, Marko,

You can try setting fork to false in the updatesearchindex script at line 241, before the thread count is set (if you haven't yet; this is the old fix from the previous eZ Find). You will have only 1 thread, but it will be stable and will not crash (at least in my experience). Also, check and optimize your database, because sometimes this happens due to corrupted objects with invalid or broken object relations, etc. There is no point in increasing the Java memory that much: on the problematic database mentioned above we run Solr with 150M for Java and it works perfectly, with the database now at nearly 200k indexable objects and an indexing time of 2.5-3 hours.
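
For reference, in the copy of the script I worked with, the change was nothing more than hard-coding the flag before the script decides how many index processes to spawn; the exact line number and surrounding code will differ between eZ Find versions:

// bin/php/updatesearchindexsolr.php (eZ Find), near where the process
// count is determined: force single-process indexing
$useFork = false;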

Iguana IT - http://www.iguanait.com

Marko Žmak

Sunday 01 November 2009 10:36:00 am

I kind of solved the problem by increasing the number of concurrent threads from 1 to 2. I didn't find out what was causing the problem when I was using only 1 thread, but now with 2 threads it works.

--
Nothing is impossible. Not if you can imagine it!

Hubert Farnsworth
