Add XHTML datatype (replace XML datatype?)

James Robertson

Sunday 25 September 2005 8:48:06 pm

An XHTML datatype would be very helpful. This could be used to store fragments of XHTML, which could be written directly to a page.

To access objects from eZ publish, elements from the current eZ publish 'XML' schema could be used with a suitable XML namespace prefix.

E.g. an object of class 'Page' might have an XHTML attribute called 'Body'. Body could contain an XHTML fragment (i.e. all the usual <p>, <span>, <ul>, etc. tags) plus tags like '<ez:link href="eznode://128">Example.</ez:link>' or '<ez:embed ...>'.
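As an illustration of the namespace idea, here is a minimal Python sketch that pulls the ez: elements out of such a fragment. The namespace URI bound to the ez: prefix and the function name are assumptions made for this example; the thread does not define either.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
# Assumption: a made-up namespace URI for the ez: prefix, not an official one.
EZ_NS = "http://example.org/ns/ez"


def find_ez_references(fragment):
    """Return (local name, attributes) for every ez:* element in a fragment."""
    # Wrap the fragment in a root element that declares both namespaces,
    # so the fragment parses as one well-formed document.
    wrapped = '<div xmlns="%s" xmlns:ez="%s">%s</div>' % (XHTML_NS, EZ_NS, fragment)
    root = ET.fromstring(wrapped)
    prefix = "{%s}" % EZ_NS
    found = []
    for element in root.iter():
        if element.tag.startswith(prefix):
            found.append((element.tag[len(prefix):], dict(element.attrib)))
    return found


body = '<p>See <ez:link href="eznode://128">Example.</ez:link></p>'
print(find_ez_references(body))  # [('link', {'href': 'eznode://128'})]
```

Any other ez:-prefixed element would be returned the same way, while plain XHTML elements are left alone.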

This would seem much more suitable for a web CMS than the current clumsy method of having to try to insert XHTML fragments into the eZ publish XML datatype (refer: http://ez.no/community/forum/developer/is_any_way_to_enable_tags_p_font_span_in_articles)

Gabriel Ambuehl

Sunday 25 September 2005 11:54:10 pm

I'm trying to make up my mind on this very issue as I write (Xavier Dutoit and I are trying to get xinha working in ezpublish and this is the main show stopper now).

My preferred approach would be ripping apart ezxmltext so that only the embedding stuff remains (for links, embeds and images). The other cruft seems squarely useless if you can accept not using ezpdf (which, by my standards, is useless anyhow).

Issues
* How to validate the XHTML? Would it be acceptable to check it against the XHTML Transitional DTD, then cut away the <xhtml> wrappers? If so, what to do if libxml (PHP DOM module) isn't compiled in?

* ezxmltext is a giant mess, someone will have to figure out how to use the usable pieces of it...
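The "wrap, check, then cut away the wrappers" step from the first issue could look roughly like the Python sketch below. It only checks well-formedness, not validity against the XHTML Transitional DTD (the Python standard library has no DTD validator), and the function names are made up for this example.

```python
import xml.etree.ElementTree as ET

# Minimal document skeleton around the fragment; the %s marks where it goes.
SKELETON = "<html><head><title>t</title></head><body>%s</body></html>"


def inner_markup(element):
    """Serialize the children of an element, dropping the element itself."""
    parts = [element.text or ""]
    for child in element:
        # tostring() includes the text that follows the child (its tail).
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


def check_and_unwrap(fragment):
    """Return (True, fragment markup) or (False, parser error message)."""
    try:
        root = ET.fromstring(SKELETON % fragment)
    except ET.ParseError as error:
        return False, str(error)
    return True, inner_markup(root.find("body"))


print(check_and_unwrap("<p>Hello <b>world</b></p>"))
print(check_and_unwrap("<p>unclosed")[0])  # False
```

Note that HTML entities such as &nbsp; are rejected by this parser unless a DTD declares them, which is one more reason a real DTD-aware validator would be preferable.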

Visit http://triligon.org

Gabriel Ambuehl

Monday 26 September 2005 12:12:57 am

As for XHTML validation, it would seem that http://pear.php.net/package/XML_DTD/ could be used in case libxml2 isn't compiled into PHP. libxml2 would be greatly preferred, though, as it is known to work very well (I have used it myself in C code ;) and is expected to be muuuuuch faster.

Visit http://triligon.org

Gabriel Ambuehl

Monday 26 September 2005 1:49:53 am

[In case anyone wonders: I'm using new posts when I get new info for easier dating thereof]

It seems to me that the only interesting parts of ezxmltext are the embedding and the links:

http://ez.no/doc/ez_publish/technical_manual/3_6/reference/datatypes/xml_block#hyperlinks
(and more specifically, the internal links)

http://ez.no/doc/ez_publish/technical_manual/3_6/reference/datatypes/xml_block#object_embedding

Visit http://triligon.org

Xavier Dutoit

Monday 26 September 2005 5:05:18 am

ez xml isn't that far away from xhtml. I'm not a big fan of the font tags and the associated rainbow effect, but the ez DTD should be a superset of xhtml and be more permissive, so that those who are into that can use it.

This being said, it shouldn't be too complicated to fix it in the input handler and convert <font color="crap">I'm funky!</font> into
<literal><font color="crap"></literal>I'm funky!<literal></font></literal>
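A minimal Python sketch of that conversion, using a regular expression rather than a real parser. The function name is made up, and whether ezxml would additionally need the wrapped tags escaped inside <literal> is not covered here.

```python
import re

# Matches opening and closing font tags, with or without attributes.
FONT_TAG = re.compile(r"</?font\b[^>]*>", re.IGNORECASE)


def wrap_font_tags(markup):
    """Wrap every <font ...> and </font> tag in <literal>...</literal>."""
    return FONT_TAG.sub(lambda match: "<literal>" + match.group(0) + "</literal>", markup)


print(wrap_font_tags('<font color="crap">I\'m funky!</font>'))
# <literal><font color="crap"></literal>I'm funky!<literal></font></literal>
```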

I think the main problem is that most pasted html is not xhtml, and just because a crap browser (i.e. IE ;) displays it properly doesn't mean we should swallow the garbage too.

X+

http://www.sydesy.com

Gabriel Ambuehl

Monday 26 September 2005 5:29:07 am

Well, it's pretty far removed insofar as most stuff simply isn't accepted (think nested lists, table attributes, tbody, to name only a few I can think of right now).

The features it provides are quite few (obvious ones being link and embed).

I've been looking at the ezxmlsimplified stuff and it is HORRIBLY complex. Up to the point where I'm seriously tempted to simply ditch it!

IMHO, an input handler that supports the aforementioned features shouldn't be much longer than a few hundred lines, certainly not 2500 lines! Couple it with a simple output handler (which can likely be based upon a much simplified version of the current implementation) and be done.

Visit http://triligon.org

Gabriel Ambuehl

Tuesday 27 September 2005 3:04:12 am

Guys, I've added a wiki page: http://test.webpuls.ch/plain/wiki/ezxhtml

That's easier to keep current ;)

Visit http://triligon.org

James Robertson

Thursday 06 October 2005 1:52:21 pm

Thanks for the discussion guys. I am in complete agreement with Gabriel.

A web CMS should provide a datatype for HTML/XHTML fragments, since these are the 'bread and butter' of any website. Of course there is a need to add special functionality to interact with objects from the CMS within the XHTML. The most appropriate way to do this has got to be the addition of custom tags (preferably through the use of XML namespaces in the case of XHTML) and NOT the creation of yet another proprietary mark-up language that inevitably only does half the job (i.e. ezxml).

I look forward to the outcome of your work. I am assuming/hoping that you will be able to upload a datatype extension to http://ez.no/community/contribs/datatypes some day soon. :-)

Łukasz Serwatka

Thursday 06 October 2005 2:43:23 pm

The ideology of CMS systems is content oriented. This means that content is separated from design (the visual part of a website). HTML was rebuilt (XHTML is based on XML 1.0), and one of the many benefits is the separation of content from design, which gives you more flexibility in content presentation. eZ publish keeps content in the database as XML, a technology typical of multi-layer applications. Keeping content in XML gives you freedom in presentation: using XSLT, for example, you can generate XHTML, WML, whatever.

Imagine that you want to display a true HTML/content mix (as usually generated by most online HTML WYSIWYG editors - <font><span><br><hr>) in a different technology or on a different platform (small-screen PDA/mobile). Or you just want to move your content. This is difficult to manage with a datatype where content and design are mixed together. There are also errors with unclosed tags, wrong attributes, and so on.

For editors (without HTML knowledge), who are the main users of CMS systems, basic formatting with ezxml tags is enough. Now they have one more very useful tool, the OASIS OpenOffice extension, which is also based on XML (content and formatting are separated). This extension presents a new methodology in online content publishing. Editors can prepare a document in a true text editor and publish it online.

That's my 0.05$ ;)

Personal website -> http://serwatka.net
Blog (about eZ Publish) -> http://serwatka.net/blog

Gabriel Ambuehl

Friday 07 October 2005 1:34:08 am

1) XHTML already provides for the separation of design and content (I truly don't see the real difference between using ezxml's paragraph vs just p tags, for one, and the same goes for most formatting). To take it further, div and span help a lot in that regard. It MAY need some more thought about what markup to use for formatting (usually there are plenty of different ways to achieve the same result, each with its own pros and cons).

2) My extension is coming along very nicely. Most of the features one truly needs are there, but bugs are still there and validation is completely lacking.

Visit http://triligon.org

Gabriel Ambuehl

Friday 07 October 2005 1:38:22 am

Oh and in case anyone cares, you can try it on http://stage.webpuls.ch

Visit http://triligon.org

Kristian Hole

Friday 07 October 2005 8:41:55 am

An XHTML datatype would be nice in cases where you for example don't need the pdf-rendering engine, or don't need the strict separation of content and appearance. The great thing about an XHTML datatype is that it allows you to integrate an existing XHTML editor like TinyMCE, HTMLArea or FCKeditor.

I think it's great that you develop something like this. Diversity is good :-)

Please give some updates in this thread about the progress of your datatype, Gabriel :-)

Kristian

http://ez.no/ez_publish/documenta...tricks/show_which_templates_are_used
http://ez.no/doc/ez_publish/techn...te_operators/miscellaneous/attribute

Gabriel Ambuehl

Friday 07 October 2005 8:50:40 am

I could do with some help, mainly related to validation.

Personally, I'd like to use the XHTML DTD to do it but DTD validation is sort of painful to use in PHP4 (DOMXML contains the function calls to libxml2 which works fine but creates weird issues [1], plus the error messages aren't entirely understandable).

libxml2's xmllint tool would be very nice but it's somewhat painful and far from elegant (not to mention the opposite of self contained) to filter XML through an external program just to figure out if it's valid under a DTD...

Then there's an alpha-rated PHP DTD checker at PEAR, but that one doesn't seem to understand W3C's XHTML DTD very well (at least it doesn't seem to have issues with stuff like <td> outside a table, for one).

Thoughts?

[1] It insists on dumping warnings to the browser it seems :(

Visit http://triligon.org

Xavier Dutoit

Friday 07 October 2005 9:52:00 am

Hi,

As for the "which one separates content and design better" question (also known as "mine is bigger" ;) between ezxml and xhtml, both are fine (as long as you don't use the style attribute). One is slightly better known and, as Kristian noted, you have quite a few tools for WYSIWYG editing of xhtml.

I think we have been able to deal nicely with the added features of ez using rather strict xhtml in xeditor (e.g. <a href="/media/tralala" id="eznode://123">, see http://stage.webpuls.ch). If xhtml doesn't accept :// in the value of the id, then eznode_123 should do as well.

As for validation, I think one of the problems with ezxml (or any strict xml parser) is that it pays too much attention to the validity of the structure and not enough to the content. I'm well aware that because of the default behavior of html browsers (display what you can and silently discard what you can't) we have a bloody mess, with 99% of the html pages on the wide wild web not valid at all. However, I'm prepared to argue that it is thanks to this lack of focus on structure that so much content has been published and that the web is what it is today.

As for the impact on ez: unless we can have xml parsers clever enough to deal with a lack of proper structure (which is going to come from a cut'n'paste from Word or any crap html), displaying an error message about "invalid nested tags" or "ul not closed" (or something equally explicit to an end user who doesn't understand anything about tags or xml) isn't going to cut it.

We have probably all had customers crying that "the system doesn't work" because what they try to input isn't valid xml.

So about the validation, I think the best option is to try to tidy the input and show a warning that the format conversion could have lost some formatting (a la MS Word "save as rtf"). Screwing up the presentation isn't as big a problem as stubbornly refusing to take the malformed content the user tries to input.
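To make the "tidy it and warn" idea concrete, here is a rough Python sketch built on the standard library's lenient HTML parser. It is not HTML Tidy and not a substitute for it: it only rebalances tags, escapes text and collects warnings, and all names in it are made up for this example.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never have a closing tag in HTML.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Tidier(HTMLParser):
    """Re-emit tag soup as balanced markup and record what was changed."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.output, self.open_tags, self.warnings = [], [], []

    @staticmethod
    def render(attrs):
        return "".join(' %s="%s"' % (name, escape(value or "", quote=True))
                       for name, value in attrs)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            self.output.append("<%s%s />" % (tag, self.render(attrs)))
        else:
            self.output.append("<%s%s>" % (tag, self.render(attrs)))
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        self.output.append("<%s%s />" % (tag, self.render(attrs)))

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            self.warnings.append("dropped stray </%s>" % tag)
            return
        # Close everything that was opened after the matching start tag.
        while self.open_tags:
            current = self.open_tags.pop()
            self.output.append("</%s>" % current)
            if current == tag:
                break
            self.warnings.append("closed <%s> implicitly" % current)

    def handle_data(self, data):
        self.output.append(escape(data, quote=False))


def tidy(markup):
    """Return (balanced markup, list of warnings to show the user)."""
    parser = Tidier()
    parser.feed(markup)
    parser.close()
    while parser.open_tags:
        tag = parser.open_tags.pop()
        parser.output.append("</%s>" % tag)
        parser.warnings.append("closed <%s> at end of input" % tag)
    return "".join(parser.output), parser.warnings


fixed, warnings = tidy("<p>Hello <b>world")
print(fixed)  # <p>Hello <b>world</b></p>
print(warnings)
```

The warnings list is what would feed the "some formatting may have been lost" message. Comments and doctype declarations are silently dropped by this sketch.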

Has either of you played with htmltidy? Can you force it to output valid xhtml content, even if it means screwing up the layout a wee bit?

X+

http://www.sydesy.com

Gabriel Ambuehl

Friday 07 October 2005 10:36:09 am

Actually, right now links are done like

 
<link href="eznode://ID" />

i.e. without IDs. But for internal images, the syntax is id="eznode://ID". I'm not sure if that's proper XHTML but all I care for right now is that it works.
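For illustration, extracting the node ID from such an eznode:// reference might look like this in Python. The function name and the rule that the ID must be purely numeric are assumptions of this sketch, not something the extension is known to do.

```python
from urllib.parse import urlparse


def parse_eznode(uri):
    """Return the node ID from an eznode://ID reference, or None."""
    parts = urlparse(uri)
    # Assumption: the ID is a plain decimal number with nothing after it.
    if parts.scheme != "eznode" or parts.path or not parts.netloc.isdigit():
        return None
    return int(parts.netloc)


print(parse_eznode("eznode://128"))  # 128
print(parse_eznode("http://ez.no"))  # None
```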

Validation: validation is mainly very important because of two issues:
1) Quite simply: the system chokes on XML that is not well-formed. Using DOMXML instead of eZXML would help there to some degree, but I'd rather not go to the lands of malformed XML and guess what it's supposed to mean.

2) There's a very valid concern that malformed XML could destroy the whole page layout (style="" can do so as well, but style we can also just filter out ;)

Or put another way: garbage in -> garbage out. So I don't even want to have garbage enter the system, as it's too bloody painful to deal with it later. If the user doesn't accept that the system refuses to do stupid things, he can go use textfield without wash() for all I care.

If it's called an XHTML datatype, it will deliver XHTML (or a subset of it), and quite surely I will make sure it does not accept malformed XML.

But the main reason I want to validate the input against a DTD is 2). With a proper DTD, it would easily be possible to just refuse to use any evil formatting (like inline style) or JS or whatever else you can think of ruining the rest of the site (or even creating security issues in case of scripting).

I'm hesitant to use Tidy because it means another external dependency which I'd like to avoid. OTOH, it would likely be the best approach. That or xmllint combined with a proper DTD (which would also mean that you could have different levels of DTD I presume, like one for forums that only accepts a very minor subset of XHTML whereas trusted editors would be able to use full XHTML strict).
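The "different levels" idea can be sketched server-side without a DTD as a per-profile whitelist of elements and attributes. The profile names and their contents below are invented for illustration. Unlike a DTD, this does not check content models (it would not notice a <td> outside a table).

```python
import xml.etree.ElementTree as ET

# Hypothetical profiles: element name -> set of allowed attribute names.
PROFILES = {
    "forum": {"p": set(), "em": set(), "strong": set(), "a": {"href"}},
    "editor": {
        "p": {"class"}, "em": set(), "strong": set(), "a": {"href", "title"},
        "ul": {"class"}, "ol": {"class"}, "li": set(),
        "table": {"class", "summary"}, "tbody": set(), "tr": set(),
        "td": {"colspan", "rowspan"},
    },
}


def find_violations(fragment, profile):
    """Return a list of problems; an empty list means the fragment passes."""
    allowed = PROFILES[profile]
    try:
        root = ET.fromstring("<wrapper>%s</wrapper>" % fragment)
    except ET.ParseError as error:
        return ["not well-formed: %s" % error]
    problems = []
    for element in root.iter():
        if element is root:
            continue  # skip the artificial wrapper element
        if element.tag not in allowed:
            problems.append("element <%s> is not allowed" % element.tag)
            continue
        for name in element.attrib:
            if name not in allowed[element.tag]:
                problems.append("attribute %s on <%s> is not allowed" % (name, element.tag))
    return problems


print(find_violations('<p style="color:red">hi</p>', "editor"))
print(find_violations("<script>alert(1)</script>", "forum"))
```

Because style and script simply never appear in a profile, inline styling and scripting are refused by default rather than filtered out one case at a time.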

Visit http://triligon.org

Xavier Dutoit

Friday 07 October 2005 11:38:51 pm

Hi,

Don't you think that the validation features provided by xinha are validating enough?

If I'm right, you can define which tags are allowed, what they can contain and which attributes are accepted, can't you?

The syntax isn't a DTD, but it provides the same feature, doesn't it?

X+

P.S. Don't get me wrong about xhtml and validity: I create my layouts in xhtml 1.1 strict and they validate (usually, besides a few _blank targets my clients force me to put in ;)

http://www.sydesy.com

Gabriel Ambuehl

Saturday 08 October 2005 12:20:12 am

Client side validation is entirely useless. The user can still submit whatever markup he wants and your problem remains.

Visit http://triligon.org

Xavier Dutoit

Saturday 08 October 2005 4:34:17 am

Point taken.

What about a nice complement to htmltidy:
http://xmlsoft.org/xmllint.html ?

X+

http://www.sydesy.com

Gabriel Ambuehl

Saturday 08 October 2005 4:40:30 am

Did you read my posts above? I was writing about xmllint and how it being an external tool (just like tidy) is against my taste...

I agree, it's a good tool (albeit SLOW) and its output is quite verbose, too. But interfacing with it would appear to be somewhat peculiar.

Visit http://triligon.org

Xavier Dutoit

Saturday 08 October 2005 5:46:28 am

Actually, when I mentioned xmllint, it wasn't only to validate the input but to fix invalid input too (--xmlout --html).

This seems to convert html to xhtml.

ImageMagick is an external program called from ez, and I prefer it to the gd extension (it deals better with gifs and gives better quality). I don't think xmllint being an external tool should be an issue.
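Calling xmllint as an external filter with the two flags mentioned above could be sketched like this in Python. The function names and the timeout are assumptions. The parser's messages are captured from stderr rather than printed, so they could be logged or shown to the editor in a controlled way.

```python
import shutil
import subprocess


def xmllint_command():
    """Build the xmllint call: parse stdin as HTML, serialize as XML."""
    return ["xmllint", "--html", "--xmlout", "-"]


def html_to_xhtml(markup, timeout=10):
    """Return (converted markup, parser messages) from xmllint."""
    if shutil.which("xmllint") is None:
        raise RuntimeError("xmllint is not installed or not on PATH")
    result = subprocess.run(
        xmllint_command(),
        input=markup.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout,
    )
    return (result.stdout.decode("utf-8", errors="replace"),
            result.stderr.decode("utf-8", errors="replace"))


if shutil.which("xmllint"):
    converted, messages = html_to_xhtml("<p>Hello <b>world")
    print(converted)
```

The output is typically a complete document rather than a fragment, so the surrounding wrappers would still have to be cut away before storing the result.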

X+

http://www.sydesy.com