Import XML Data Topic

Olivier Pierret

Wednesday 07 December 2005 1:09:59 pm

I created this thread to discuss further development of (and issues with) the ImportXMLData contrib. This should keep the contrib's comment area for messages directly related to using the contrib, rather than implementation details.

Best regards

Olivier

Xavier Dutoit

Wednesday 07 December 2005 1:20:07 pm

Hi,

I changed quite a few things in ImportXML.
Some could be useful for everyone, like setting the publication date from an XML field and handling a few more attribute types than you did; others are quite specific (e.g. finding the parent node based on a value in an XML field).

The big issue on my side was memory: it just can't handle a file with more than a few hundred records and XML fields (the XPath library seems to be quite sub-optimal, to say the least, and eZ on the other hand...).

I had to reimplement it to run from the shell, and it worked like a charm!

Don't hesitate to contact me by mail if you want some of these features (I probably won't have the time to clean up the hacks, but you might find a few things to reuse).

Thanks for your extension!

X+

http://www.sydesy.com

Olivier Pierret

Wednesday 07 December 2005 2:17:32 pm

Xavier,

I already dropped you an email saying I was waiting for your changes but unfortunately you could not answer. I'll retry...

Olivier

Vytautas Germanavičius

Wednesday 07 December 2005 10:53:57 pm

"Known issues:" still shows "UTF-8 is not supported", while "Changelog" shows "1.4.1 .... - UTF-8 support "

{set-block scope=root variable=cache_ttl}0{/set-block}

Vytautas Germanavičius

Thursday 08 December 2005 12:16:08 am

Sorry, but

$fieldValue = utf8_decode($xPathEngine->wholeText("$path_item/$fieldName"."[1]"));

does not solve the problem. I'm still getting ??? instead of UTF-8 characters.
I do not understand why utf8_decode is used. As I understand from the documentation at http://lt.php.net/manual/en/function.utf8-decode.php, this function converts a string to ISO-8859-1. After such a conversion, UTF-8 characters that have no ISO-8859-1 equivalent are lost...
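
A minimal sketch of the loss described here, assuming a PHP build where utf8_decode() is still available (it is deprecated as of PHP 8.2, so the notice is suppressed with @). The two sample characters are illustrative and are not taken from the import data.

```php
<?php
// Sample characters (assumed for illustration only).
$latin1Safe = "\u{E4}";   // "ä", which exists in ISO-8859-1
$notLatin1  = "\u{10D}";  // "č", which does not exist in ISO-8859-1

// utf8_decode() converts UTF-8 to ISO-8859-1.
// The @ hides the deprecation notice raised on PHP 8.2 and later.
$a = @utf8_decode($latin1Safe);
$b = @utf8_decode($notLatin1);

echo bin2hex($a), "\n"; // e4  (a single ISO-8859-1 byte)
echo $b, "\n";          // ?   (the character could not be represented)
```

This is why a test with characters such as "ä" can appear to work while Lithuanian letters come out as question marks.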

The problem is that xml_parser reads the file as ISO-8859-1 and ignores the encoding specified in the XML file.

This should be fixed in the XPath engine. I changed line 1680 of XPath.class.php to:

      $parser = xml_parser_create('UTF-8');

This is the only way I found to get it working with UTF-8...
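
A small standalone sketch of a parser created with an explicit 'UTF-8' argument, using only PHP's built-in xml_* functions. The sample document is made up, and how the encoding argument behaves has differed between PHP versions, so treat this as an illustration on a current PHP rather than a reproduction of the 2005 setup.

```php
<?php
// Sample document (assumed); "\u{10D}" is "č", which has no ISO-8859-1 equivalent.
$xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
     . "<list><item>Germanavi\u{10D}ius</item></list>";

// Ask the parser to hand character data to the handlers as UTF-8.
$parser = xml_parser_create('UTF-8');

$text = '';
// The handler may be called several times for one text node, so append.
xml_set_character_data_handler($parser, function ($parser, $data) use (&$text) {
    $text .= $data;
});

// xml_parse() returns 1 on success and 0 on failure.
if (xml_parse($parser, $xml, true) !== 1) {
    throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
}

echo $text, "\n"; // the name comes back with "č" intact
```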

 

{set-block scope=root variable=cache_ttl}0{/set-block}

Olivier Pierret

Thursday 08 December 2005 1:57:43 pm

Now I get it, vytis!

I actually tested utf8_decode with characters convertible to ISO-8859-1, so it *seemed* to work.
For the moment the situation is as follows:
if the "is UTF8" checkbox is ticked, utf8_decode() is used; otherwise it is not.
This is broken, so I will remove the checkbox asap. In the meantime, do not tick it.

The best and only (known) way to import UTF-8 for now is yours.

I will change the doc and the code accordingly.

Olivier

Vytautas Germanavičius

Friday 09 December 2005 7:41:08 am

The import function has limitations: I cannot import more than 340 records in one run... :(

{set-block scope=root variable=cache_ttl}0{/set-block}

Vytautas Germanavičius

Tuesday 13 December 2005 5:21:03 am

Is there any way to run the import script from the command line?

{set-block scope=root variable=cache_ttl}0{/set-block}

Olivier Pierret

Wednesday 14 December 2005 12:28:02 pm

Sure.
I guess we need to add a file called import-cli.php that would parse the arguments and call the

function &importXMLData( $xmldata, $datatype, $remove, $movetotrash) 

in importXMLDatafunctioncollection.

Of course, the context should be set up appropriately; if not, I think the call:

$class =& eZContentClass::fetchByIdentifier( $identifiantClasse );

will fail, as will many other kernel-related eZ API calls.

I think we should have a look at how runcronjob.php is written.

Another option would be to wait for Xavier's hacks to this extension, because I know he is using a script approach to run it.

Vytautas Germanavičius

Wednesday 14 December 2005 11:02:01 pm

I spent two days trying to write such code, without success... Finally I found that the administrator had updated PHP, but without MySQL support...
I moved my site to another server and will test my code there. If it works, I will post it here.

{set-block scope=root variable=cache_ttl}0{/set-block}

Xavier Dutoit

Thursday 15 December 2005 12:36:54 am

Sorry Olivier,

I know I'm late, but I can't find the time to clean up the mess of custom things I've added. I'll try to do that this weekend; the delay is just ridiculous.

Thanks for your patience

X+

http://www.sydesy.com

Vytautas Germanavičius

Thursday 15 December 2005 4:09:13 am

Finally I made it! ;) Now you can import data from the command line. I wrote an additional script that reads the XML file and sets up the eZ user object for Olivier's script. I will clean up the debug prints and post it here.

The idea is great, but the import script is very slow: it takes about 51 s to import 100 entries.
I need to import ~23 000 entries. At the current speed that will take about 3.25 hours...
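
As a quick check of that estimate, the arithmetic can be sketched as follows. The variable names are made up for illustration; the figures are the ones quoted in this post, and the rate is assumed to stay constant over the whole import.

```php
<?php
$secondsPerBatch = 51;     // measured: about 51 s ...
$batchSize       = 100;    // ... per 100 entries
$totalEntries    = 23000;  // entries still to import

// Assumes the import speed stays constant over the whole run.
$totalSeconds = $totalEntries / $batchSize * $secondsPerBatch;
$totalHours   = $totalSeconds / 3600;

printf("%d s, about %.2f h\n", $totalSeconds, $totalHours);
// prints "11730 s, about 3.26 h"
```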

How fast is your import algorithm, Xavier?

{set-block scope=root variable=cache_ttl}0{/set-block}

Vytautas Germanavičius

Thursday 15 December 2005 5:55:23 am

Here it is:

<?php
//
// Created on: <2005-12-15 14:52:57 vytis>
//
// This file may be distributed and/or modified under the terms of the
// "GNU General Public License" version 2 as published by the Free
// Software Foundation and appearing in the file LICENSE included in
// the packaging of this file.
//
// This file is provided AS IS with NO WARRANTY OF ANY KIND, INCLUDING
// THE WARRANTY OF DESIGN, MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE.
//
// The "GNU General Public License" (GPL) is available at
// http://www.gnu.org/copyleft/gpl.html.
//
// Contact [email protected] if any conditions of this licencing isn't clear to
// you.
//

include_once( 'lib/ezutils/classes/ezcli.php' );
include_once( 'kernel/classes/ezscript.php' );

$cli =& eZCLI::instance();
$script =& eZScript::instance( array( 'description' => ( "eZ publish Script Executor\n\n" .
                                                         "Allows execution of simple PHP scripts which uses eZ publish functionality,\n" .
                                                         "when the script is called all necessary initialization is done\n" .
                                                         "\n" .
                                                         "ezexec.php myscript.php" ),
                                      'use-session' => true,
                                      'use-modules' => true,
                                      'use-extensions' => true ) );

$script->startup();

$options = $script->getOptions( "",
                                "[scriptfile]",
                                array() );

if ( count( $options['arguments'] ) < 6 ) // six arguments are used below (indexes 0 to 5)
{
    $script->shutdown( 1, "Usage of import script:\n SiteAccess, \n XML file, \n datatype, \n remove? (0 - no, 1 - yes) , \n move to trash? (0 - no, 1 - yes),\n user's ID");
    die();
}

$script->setUseSiteAccess($options['arguments'][0]);

$options = $script->getOptions( "",
                                "[scriptfile]",
                                array() );
$script->initialize();


include_once ('extension/importXMLData/modules/importXMLData/importXMLDatafunctioncollection.php');
include_once ('kernel/classes/ezcontentclass.php');
 
$xmldata = file_get_contents($options['arguments'][1]);
importXMLDataFunctionCollection::importXMLData($xmldata, $options['arguments'][2], $options['arguments'][3], $options['arguments'][4], $options['arguments'][5]);

$script->shutdown();
?>

This needs some modifications to the import script:
1. I didn't log in to the system. Instead, I pass the user's ID as a parameter in importXMLDatafunctioncollection.php:

	function &importXMLData( $xmldata, $datatype, $remove, $movetotrash, $userID)

So you should delete the following from importXMLDatafunctioncollection.php:

		  $user =& eZUser::currentUser();
		  // set user ID 
		  $userID =& $user->attribute( 'contentobject_id' );

2. Additionally, I added some debug output to show the progress of the import:
At the beginning of the function:

 
$cli =& eZCLI::instance();

 

Then:

 
		$cli->output( "Preparing list for import" );
		$paths_item = $xPathEngine->match("//$listTag/$itemTag");
		$cli->output("List size: ".count($paths_item));


Then I changed:

 
	$ii=0;  
	foreach ($paths_item as $path_item) 
	{
		$ii++;
		if( $ii % 100 == 0 ) // plain modulo, no bcmath extension needed
		{
			$cli->output("\n imported $ii of ".count($paths_item) );
		}		
		foreach($fieldNameList as $fieldName) 
		{
 

Good luck.
Next, I'm going to write a shell script to import multiple XML documents. I think this is useful when you need to import several thousand records, because xPathEngine uses too much memory.
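
One way to prepare such multiple documents is to cut a large file into smaller ones first. The following is only a sketch under assumptions: the helper name splitXmlItems and the <list>/<item> layout are hypothetical, it uses PHP's DOM extension rather than the XPath class the extension relies on, and attributes on the root element are not copied.

```php
<?php
/**
 * Split the element children of the root into documents of at most
 * $chunkSize items each. Hypothetical helper, not part of the extension.
 *
 * @return string[] one XML string per chunk
 */
function splitXmlItems(string $xml, int $chunkSize): array
{
    $source = new DOMDocument();
    if (!$source->loadXML($xml)) {
        throw new InvalidArgumentException('Invalid XML');
    }
    $root = $source->documentElement;

    // Collect element children only (skips whitespace text nodes).
    $items = [];
    foreach ($root->childNodes as $child) {
        if ($child instanceof DOMElement) {
            $items[] = $child;
        }
    }

    $chunks = [];
    foreach (array_chunk($items, $chunkSize) as $group) {
        $doc = new DOMDocument('1.0', 'UTF-8');
        // Recreate an empty root element with the same tag name.
        $newRoot = $doc->appendChild($doc->createElement($root->tagName));
        foreach ($group as $item) {
            $newRoot->appendChild($doc->importNode($item, true));
        }
        $chunks[] = $doc->saveXML();
    }
    return $chunks;
}

// Sample data (assumed): three items split into chunks of two.
$chunks = splitXmlItems('<list><item>a</item><item>b</item><item>c</item></list>', 2);
echo count($chunks), "\n"; // 2
```

Each chunk could then be written to its own file and passed to the command-line import one at a time.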

{set-block scope=root variable=cache_ttl}0{/set-block}

Xavier Dutoit

Friday 16 December 2005 12:40:57 am

Hi,

Yes, the import is dead slow, and yes, XPath swallows all the memory it can find (and more). I modified a few things in Olivier's script to release memory.
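
For comparison, a streaming approach keeps only one record in memory at a time. This is a sketch using PHP's XMLReader and SimpleXML extensions, not the changes described in this post; the <list>/<item>/<title> layout is an assumption made for the example.

```php
<?php
// Sample data (assumed); a real import would call $reader->open($path) on a file.
$xml = '<list><item><title>One</title></item><item><title>Two</title></item></list>';

$reader = new XMLReader();
$reader->XML($xml);

$titles = [];

// Advance to the first <item> element.
while ($reader->read() && $reader->name !== 'item') {
    // nothing to do, just skipping ahead
}

// Hop from one <item> to the next; only the current item is expanded.
while ($reader->name === 'item') {
    $item = simplexml_load_string($reader->readOuterXml());
    $titles[] = (string) $item->title;
    $reader->next('item');
}
$reader->close();

print_r($titles); // "One" and "Two", in document order
```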

I didn't benchmark it properly, but it took very long. In my case it was a one-shot import, so it didn't matter too much.

X+

http://www.sydesy.com

Philip K.

Friday 03 February 2006 6:02:47 am

Hey.

I've tried to change the importer so that it's possible to import utf-8 files, but it doesn't work...

$parser = xml_parser_create('UTF-8');

doesn't work

My problem is that I can't import chars like "ä", "ö", "ü"

Any ideas?? Thanks a lot...

Linux is like a wigwam; no windows, no gates, and apache inside!

Vytautas Germanavičius

Sunday 05 February 2006 11:16:08 pm

It should work.
I had a similar problem when I tested this extension. The problem may be that your data file is not saved in UTF-8.
If you use Windows, open the data file with Notepad and choose "Save As" with a different name; there you can choose the UTF-8 encoding. If UTF-8 is selected by default in the "Save As" dialog, your file is already in UTF-8; if not, save it as UTF-8 and try importing the new file.
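
The same check can be done from PHP instead of an editor. A minimal sketch, assuming a hypothetical helper named isValidUtf8; it relies on PCRE's /u modifier, which rejects malformed UTF-8. Note that a pure ASCII file passes the check either way.

```php
<?php
/**
 * Return true when $bytes is well-formed UTF-8.
 * Hypothetical helper; preg_match() with /u fails on malformed UTF-8 input.
 */
function isValidUtf8(string $bytes): bool
{
    return preg_match('//u', $bytes) === 1;
}

$utf8   = "M\u{FC}nchen";  // "München" encoded as UTF-8
$latin1 = "M\xFCnchen";    // the same word encoded as ISO-8859-1

var_dump(isValidUtf8($utf8));   // bool(true)
var_dump(isValidUtf8($latin1)); // bool(false)
```

For a data file, the call would look like isValidUtf8(file_get_contents($path)), where $path is the file to be imported.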

{set-block scope=root variable=cache_ttl}0{/set-block}

Philip K.

Monday 06 February 2006 12:02:01 am

Hm, thanks for your reply, but it still doesn't work...

btw: I'm using eZ version 3.6.0

Linux is like a wigwam; no windows, no gates, and apache inside!

Vytautas Germanavičius

Monday 06 February 2006 12:32:19 am

I used it on eZ 3.6.4. What do you see instead of the letters with umlauts?
Maybe UTF-8 is not set in your template?

{set-block scope=root variable=cache_ttl}0{/set-block}

Philip K.

Monday 06 February 2006 1:05:17 am

Hm, OK, I can't write the symbols down here; they get converted into HTML entities... I made a screenshot:

http://www.philip-kahlen.de/import_error.gif

Linux is like a wigwam; no windows, no gates, and apache inside!

Vytautas Germanavičius

Wednesday 08 February 2006 11:37:17 pm

Do you have this at the top of your templates?

{*?template charset=utf-8?*}

I had several cases where UTF-8 was not displayed because that header was missing.

I added this header to all templates of the xmlimport extension.

But I still think that your data file is in a different encoding. For editing UTF-8 files I recommend Notepad++ from sourceforge.net

{set-block scope=root variable=cache_ttl}0{/set-block}