Add XHTML datatype (replace XML datatype?)

Gabriel Ambuehl

Saturday 08 October 2005 6:13:29 am

I was just messing around with xmllint when another idea struck me.

Think about it: what is the very best validator for XHTML you know of? It's more than likely validator.w3.org, which is open source and lives at http://validator.w3.org/source/ .

While surely not as widely deployed as xmllint, it could be the proper validator for us here as it generates very verbose errors. On the downside, it's in Perl and requires a truckload of dependencies from CPAN (both Fedora and Debian provide packages for it though).

Visit http://triligon.org

Gabriel Ambuehl

Saturday 08 October 2005 7:48:15 am

After some thinking, I'm leaning towards

tidy | xmllint --dtdvalid SOMEDTD.

The first makes sure we have properly nested HTML; the second will throw errors if the user used some tag we don't allow (script won't be allowed).
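
As a rough illustration of the second step (rejecting tags that are not allowed), here is a minimal sketch in Python. It does not call tidy or xmllint and does not use a DTD; it only checks well-formedness and a whitelist with the standard library. The `ALLOWED_TAGS` set and the function name are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical whitelist; a real DTD would be far more detailed.
ALLOWED_TAGS = {"div", "p", "h1", "h2", "a", "strong", "em"}

def disallowed_tags(fragment):
    """Return the sorted names of tags in `fragment` that are not whitelisted.

    Raises xml.etree.ElementTree.ParseError if the fragment is badly nested.
    """
    # Wrap the fragment so that several top-level elements still parse.
    root = ET.fromstring("<div>" + fragment + "</div>")
    return sorted({el.tag for el in root.iter() if el.tag not in ALLOWED_TAGS})
```

A fragment containing a script element would be reported as `["script"]`, while wrongly nested markup raises a parse error, which is roughly the role tidy plays in front of xmllint.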

As for integration, one could also look into using http://pecl.php.net/package/tidy and DOMXML-validate stuff when it's available, otherwise fall back to the standalone binaries.

James Robertson

Wednesday 12 October 2005 8:58:47 pm

Hi Gabriel

I'm not very knowledgeable regarding XML validation techniques; however, have you considered looking for a tool that validates XHTML against the W3C's XML Schema for XHTML?
http://library.n0i.net/programming/meta%20languages/xhtml-schema/

I understand schemas are easier to construct and validate against than DTDs, so you might find a lightweight/fast tool.

[Might also be worth trying to find a RelaxNG schema for XHTML?]

Also, thanks for setting eZ publish straight regarding the separation of content from design. This is in fact the purpose of XHTML and CSS. XHTML is a *semantic* mark-up language with a far richer set of tags than ezxml offers. Besides, if people want to mix it up with design (by adding style attributes etc.) then surely this is up to them ... it happens all the time ... Like, on every page on the web ;-)

Gabriel Ambuehl

Thursday 13 October 2005 2:04:56 am

Getting the spec in any one of those formats doesn't seem to be the problem. Finding a suitable parser that's easily integrated is the problem. If you can find one in PHP that just works (TM) please tell us about it, I know I didn't find one.

Kirill Subbotin

Monday 07 November 2005 4:53:47 am

Hello, all!

I have read this thread and have something to say here. Currently I am working on the Online Editor and in fact I am responsible for the eZXML datatype in eZ publish. I'm not the one who invented it, and I don't make final decisions, although I hope together we can make it better. ;) I agree with you on many things, although there are some things you have missed.

1) Of course eZ publish content stored in the rich text datatype can't be 100% valid XHTML, simply because it is not an XHTML page but only a part of one. It cannot contain the body, head, etc. elements that must be present in valid XHTML (even transitional).

2) eZXML uses a structured method of storing content. This means that it supports "sections" and does not store headers the way XHTML does:

Example of XHTML:

<h1>Part 1</h1>
<p> text of part 1</p>
<h2>Part 1.1</h2>
<p> text of part 1.1</p>

Example of eZXML:

<section>
  <header>Part 1</header>
  <paragraph> text of part 1</paragraph>
  <section>
    <header>Part 1.1</header>
    <paragraph> text of part 1.1</paragraph>
  </section>
</section>

This answers the question of what the difference between eZXML and XHTML is. Of course you don't see sections in "simplified XML", but they are present in real eZXML and stored in the database.
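
To make the structural difference concrete, here is a minimal Python sketch that nests a flat h1..h6/p sequence into sections like the example above. It is an illustration only, not the eZ publish input handler: the function name is made up, the extra outer <section> is added just to give the result a single root, and inline markup inside headings and paragraphs is ignored.

```python
import re
import xml.etree.ElementTree as ET

def headings_to_sections(xhtml_fragment):
    """Nest a flat h1..h6/p sequence into <section> elements."""
    flat = ET.fromstring("<root>" + xhtml_fragment + "</root>")
    root = ET.Element("section")  # outer wrapper, an assumption of this sketch
    stack = [(0, root)]           # (heading level, currently open section)
    for el in flat:
        match = re.fullmatch(r"h([1-6])", el.tag)
        if match:
            level = int(match.group(1))
            # A new heading closes every open section of the same or deeper level.
            while stack[-1][0] >= level:
                stack.pop()
            section = ET.SubElement(stack[-1][1], "section")
            ET.SubElement(section, "header").text = el.text
            stack.append((level, section))
        elif el.tag == "p":
            # Simplification: only the leading text of the paragraph is kept.
            ET.SubElement(stack[-1][1], "paragraph").text = el.text
    return ET.tostring(root, encoding="unicode")
```

The heading level disappears from the stored form: it is implied by how deeply the section is nested.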

(btw: eZ publish documentation currently doesn't pay attention to the differences between the "simplified XML" used in the standard input handler and the real stored eZXML. So for now this is more or less an undocumented thing, but we are going to change this.)

Now the good news: I've already developed the parser that Xavier was talking about. It accepts any input (possibly invalid) in one markup language and converts it to another markup language, trying to fix all bad structure automatically without losing content and without complaining about errors. It can even auto-close tags with wrong nesting. And it is driven by control arrays and handler functions.
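
The general idea of such a tolerant, table-driven converter can be sketched in a few lines of Python. This is not the Online Editor parser, only an illustration built on the standard `html.parser` module; the `TAG_MAP` control table and the class name are hypothetical.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape

# Hypothetical control table: input tag -> output tag.
TAG_MAP = {"b": "strong", "strong": "strong", "i": "emphasize", "p": "paragraph"}

class TolerantConverter(HTMLParser):
    """Convert possibly badly nested input into well-nested output."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.open_tags = []  # stack of output tags that are still open

    def handle_starttag(self, tag, attrs):
        target = TAG_MAP.get(tag)
        if target is None:
            return  # unknown tag: drop the markup, keep its text
        self.open_tags.append(target)
        self.out.append("<%s>" % target)

    def handle_endtag(self, tag):
        target = TAG_MAP.get(tag)
        if target not in self.open_tags:
            return  # stray or unknown closing tag: ignore it
        # Auto-close everything that was opened after the matching tag.
        while True:
            top = self.open_tags.pop()
            self.out.append("</%s>" % top)
            if top == target:
                break

    def handle_data(self, data):
        self.out.append(escape(data))

    def convert(self, text):
        self.feed(text)
        self.close()
        # Close whatever the input left open.
        while self.open_tags:
            self.out.append("</%s>" % self.open_tags.pop())
        return "".join(self.out)
```

Each conversion needs a fresh `TolerantConverter()` instance. Wrongly nested input such as `<p><b>bold <i>both</b> tail</i></p>` comes out properly nested, with no text lost and no error raised.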

You can even test its initial version in the latest alpha version of the Online Editor (available for download on ez.no). Currently it is used to convert HTML input from the OE's javascript to valid eZXML, but it can be used for any other conversion. We plan to make it a part of the next eZ publish releases (maybe 4.0 or even 3.8) and use it in all input handlers instead of the current implementation.

And yes, there are some things about the eZXML format that are hard to understand, and one of these is why eZXML is not compatible with XHTML. For example, in eZXML a <table> element must be placed inside a paragraph, but in XHTML that is not allowed at all. This seems strange to me, because XHTML is the main rendering target for eZXML. But this is the way it works now, and probably there is a reason for it.

The thing you have to understand is that if we modify the current eZXML datatype in upcoming eZ publish releases, we will have to maintain backward compatibility with existing installations. So at the very least there should be scripts that convert the old format to the new one without problems; that should be possible.

So, we will most likely stay with the eZXML format, but will try to make it more flexible, more extensible and probably closer to XHTML.

About <span> and <div>: there is already a way to get this kind of content fragmentation by using <custom> tags and corresponding templates. There is only one problem here: <custom> tags can be both inline and block, which is not correct in XML schema terms, but this is subject to change in upcoming releases.

Also probably we will implement a way to make real custom tags, so you will be able to make <span> <div> etc. and write corresponding schema and handlers for them.

Any comments on this?

Frederik Holljen

Monday 07 November 2005 7:24:07 am

I'd just like to add a few lines here.
As Gabriel Ambuehl has already noted, XHTML was created to separate content from design. It solves this problem nicely, but only within one site at a time. This is because it is tightly coupled with the CSS (as Gabriel also comments in an earlier post). This means that XHTML from a datatype in one eZ publish installation may not work at all (design-wise) in another eZ publish installation. Also, redesigning your site may result in the need to go through _all_ your xhtml datafields to clean them up so they will work with the new design.

This is currently solved through the rendering system found in the ezxml datatype where each tag is rendered to xhtml/html through the template system. Of course, the syntax in the ezxml datatype leaves a lot of room for improvement.

In addition to this you have a few other issues with direct XHTML fields:
- validation
- abuse (javascript....)
- the solution for the two above combined with import and export in a safe way
- what about special purpose things (eznode, ezobject)

That said, an XHTML field can definitely have its uses.

Xavier Dutoit

Monday 07 November 2005 7:48:08 am

Hi Frederik,

ezXML isn't more independent from a css or a site than xhtml, unless you put style information inside the xhtml (that's a bad idea IMO). Take for example http://www.csszengarden.com/ .

This being said, your points about validation, XSS and all that are very valid, and these things can be very complicated to stop: http://namb.la/popular/tech.html

Do you have the schema published somewhere so we can really compare with real arguments? So far, I don't see anything that is done in ezXML that can't be done in xhtml (ok, you might have to add a nodeid or objectid attribute on "a" tags).

Between two equivalent alternatives, I tend to go with the most "standard approved" one.

X+

http://www.sydesy.com

Frederik Holljen

Monday 07 November 2005 8:52:16 am

 ezXML isn't more independent from a css or a site than xhtml, unless you put style information inside the xhtml (that's a bad idea IMO). Take for example http://www.csszengarden.com/ .

This is not true. If you take a look at the xhtml code there you can see that it heavily uses id and class attributes in the XHTML. If you inserted this into your xhtml field and then exported it to another site, it would work if and only if that site uses the exact same CSS "layout".
You could take it further and remove all id and class attributes in the allowed XHTML field (seriously limiting your possibilities). Still you will probably write your CSS in such a way that it relies on the order in which tags are used in your XHTML.

eZ XML currently offers this functionality through the rendering system where the ezxml field is rendered using templates. The templates are site specific and render xhtml with id/class info such as the xhtml found in zengarden.

If you want the same using xhtml you would have to either implement an xhtml-to-xhtml renderer OR set up clearly defined rules for what ids and classes you can use and what type of nesting you can use in the xhtml field.

Another thing you can't do (easily) with an XHTML field is to limit your users to a specific subset of functionality. This is crucial to many implementations as you will otherwise get what is already referred to as the rainbow effect.

Ekkehard Dörre

Tuesday 08 November 2005 12:03:41 am

We need xml for crossmedia workflow, document management and a real CMS, not a WCMS.

We need an xml-based CMS to sell, not an xhtml-based one. There are some 1000 WCMS out there with xhtml and only two good PHP-based XML CMS.

We sell the future. If xhtml changes, no problem: the data is stored in xml and via the parser you change <strong> into <red> or <green>, or for PDF it is a different typography, or for rss, wap, imode, NewsML, NITF, OpenOffice, MS Office, Quark, Inbetween....

If you can explain this to your customers, they feel the spirit of an enterprise CMS.

And if you explain the object structure, that they can place every object in another object, e.g. an event: they place a contact form and a PDF in the text, oh, we need an image and a route map, too, oh no the Flash is enough ... this is only possible with objects and xml.

If you use some xhtml inside the site, the export of the whole site as PDF, or as XML for a catalogue some years later, will break.

And migration: there was an old Cold Fusion cms with an mssql database and unclean data; it took weeks to clean the data and write new parser logic, and a lot of the thousands of articles were touched by hand.

Now they can easily migrate, it's xml, but there is no need anymore.

Greetings, ekke

http://www.coolscreen.de - Over 40 years of certified eZ Publish know-how: http://www.cjw-network.com
CJW Newsletter: http://projects.ez.no/cjw_newsletter - http://cjw-network.com/en/ez-publ...w-newsletter-multi-channel-marketing

Xavier Dutoit

Tuesday 08 November 2005 1:16:43 am

Hi all,

I misexplained my point: I'm strongly for xml as a storage, but xhtml is xml isn't it ?

What I don't get is the benefit of "re-inventing the DTD wheel" by having a ez schema instead of using an existing approved and working one (the contenders could be docbook or xhtml or open document). As the vast majority of the users of ez publish display the content as webpages and most know html, I don't see the point of having to create a new schema.

Take this example (<> have been replaced by [] to get through the parser):

[header level='1']Title[/header]
And some text with  [link href='http://' target='_self' id='test'] a link[/link]

[custom name='factbox']
And the text in the factbox
[/custom]

And whatever

Compared with a xhtml one

[h1]Title[/h1]
And some text with  [a href='http://' target='_self' id='test'] a link[/a]

[div class='factbox']
And the text in the factbox
[/div]

And whatever

All your arguments are very good, but why couldn't you restrict the xhtml input parser to only accept factbox as a class name or whatever restriction you want ?
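
A restriction like that could look roughly as follows. This is a minimal Python sketch under assumptions: the `ALLOWED_CLASSES` table (only "factbox", only on div) and the function name are invented for illustration, and the fragment is assumed to be well-formed already.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule set: which class names are accepted on which tag.
ALLOWED_CLASSES = {"div": {"factbox"}}

def class_violations(fragment):
    """Return (tag, class name) pairs that the rule set does not allow."""
    root = ET.fromstring("<root>" + fragment + "</root>")
    problems = []
    for el in root.iter():
        for name in el.get("class", "").split():
            if name not in ALLOWED_CLASSES.get(el.tag, set()):
                problems.append((el.tag, name))
    return problems
```

An input handler could refuse to store the content, or strip the offending attributes, whenever the returned list is not empty.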

If you want to output this content as a pdf, you can translate "a" to whatever pdf wants for a link as easily as you would do it with a "link" tag, right?

What I don't get is the benefit of having "link" instead of "a" or "header level=1" instead of "h1".

For all the rest, having a parser that goes through the DOM and uses a template for each node was a good idea... and would be a good idea even with xhtml and a null translation, in my mind.

X+

Kirill Subbotin

Tuesday 08 November 2005 2:43:53 am

2 Xavier: Your example is not actual eZXML, it is "simplified XML". Real stored eZXML is more structured than XHTML or simplified XML. It doesn't store <header level=...>, instead it contains sections, like I mentioned in the example above.

Another difference is that it doesn't use <br/> tags to mark linebreaks. Instead all separate lines are stored in <line></line> containers.
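
The <br/> versus <line> difference can be illustrated with a short Python sketch. It is only an approximation of the idea: the function name is hypothetical, inline markup other than <br/> is ignored, and wrapping a single-line paragraph in <line> as well is an assumption of the sketch, not something confirmed for stored eZXML.

```python
import xml.etree.ElementTree as ET

def br_to_lines(paragraph_xhtml):
    """Turn <p>a<br/>b</p> into <paragraph><line>a</line><line>b</line></paragraph>."""
    p = ET.fromstring(paragraph_xhtml)
    texts = [p.text or ""]
    for child in p:
        if child.tag == "br":
            # The text following a <br/> is stored as that element's tail.
            texts.append(child.tail or "")
    paragraph = ET.Element("paragraph")
    for text in texts:
        ET.SubElement(paragraph, "line").text = text.strip()
    return ET.tostring(paragraph, encoding="unicode")
```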

Frederik Holljen

Tuesday 08 November 2005 2:51:44 am

Xavier,

If I understand you correctly the discussion comes down to having another syntax more like xhtml (which I certainly agree with).

The problem lies in what you say:

All your arguments are very good, but why couldn't you restrict the xhtml input parser to only accept factbox as a class name or whatever restriction you want ?

At the moment you do this, you have your own specific format (like we have today) and you don't accept xhtml anymore. This means that you can't use the currently available xhtml editors either because they will produce real xhtml that our xhtml field does not accept.

Xavier Dutoit

Tuesday 08 November 2005 3:58:27 am

2 Kirill:

Ok, got you. I looked at the structure of the attributes in the db to see how it's stored and with your explanations and what I saw, I get a clearer picture of how ezxml works.

As a side note, I understand why I couldn't understand the input/output handlers, as I was looking for the wrong tags.
Many thanks... and let us know when this ez schema is documented !

Ok, if I understood properly, you have

Input
simplified xml -> input handler -> stored as ezxml

output
ezxml -> xhtml output handler -> xhtml
ezxml -> pdf output handler -> pdf

So the only place ez uses the simplified xml is as the expected format in the default input handler, and that's the bit you rewrote and are going to put in 3.8, right ?
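
The flow described above (one stored form, several output handlers) can be sketched in Python as a small dispatch table. Everything here is hypothetical: the handler names, the per-tag templates, and the plain-text handler, which merely stands in for a second output format such as pdf.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical per-tag templates, standing in for site-specific templates.
XHTML_TEMPLATES = {"header": "<h1>{}</h1>", "paragraph": "<p>{}</p>"}

def render_xhtml(stored_xml):
    """Render each stored element through its XHTML template."""
    root = ET.fromstring(stored_xml)
    return "".join(
        XHTML_TEMPLATES[el.tag].format(escape(el.text or "")) for el in root
    )

def render_text(stored_xml):
    """Plain-text stand-in for a second output format such as pdf."""
    root = ET.fromstring(stored_xml)
    lines = []
    for el in root:
        text = el.text or ""
        lines.append(text.upper() if el.tag == "header" else text)
    return "\n".join(lines)

# One stored form, one handler per output format.
OUTPUT_HANDLERS = {"xhtml": render_xhtml, "text": render_text}

def render(stored_xml, output_format):
    return OUTPUT_HANDLERS[output_format](stored_xml)
```

Adding an output format means adding one handler; the stored content is not touched.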

2 Frederik

I agree, that's going to be a subset of xhtml (no head, no script tags...) and it might even make sense to add new tags.

However and as you said, my point is to stay as close as possible to xhtml, as
a) that's easier for everyone to deal with tags we know.
b) On some xhtml editors I've seen, they already offer some features to disallow some tags or restrict the classes you can use.
c) that would avoid the confusion between simplified xml and the real ezxml thing.
d) You already have plenty of converters from whatever format to xhtml, that might come handy to import content.

As Kirill wrote about the new OE, the parser seems to be more modular, so it should be easier to add new tags.

Wouldn't it make sense to deal properly with the standard xhtml tags (ie adding span and div) and obsolete the confusing ones (links, header), as "a" and "hn" are already accepted ?

X+

P.S. I don't have anything to say about the "real" ezxml vs. xhtml, besides that I'm going to have a deeper look at its schema.

Xavier Dutoit

Tuesday 08 November 2005 4:05:14 am

As a side question:
the xhtml on the preview in the admin part is different than the xhtml produced on the other siteaccess (eg. you have the nodeid=xxx in the title for the links).

How do you do that ? I couldn't find a specific template (eg. design/admin/templates/content/datatype/view/ezxmltags/link.tpl) that would explain the differences in the layout.

X+

Kirill Subbotin

Tuesday 08 November 2005 6:13:52 am

2 Xavier: Yes, you understand everything right, although we are not going to change the simplified xml format in 3.8. This is subject to review in 4.0. There we will probably rework the current ezxml system.

Concerning templates question: I think you are talking about linked images or objects, aren't you ? Links on images and objects are not made with link.tpl. This is handled by images/objects own templates. And yes, these templates are different in admin siteaccess.

James Robertson

Thursday 10 November 2005 1:45:19 pm

"Fight the good fight" Xavier.

Of course XHTML is XML - and can therefore be eXtended using namespaces. Of course a web-CMS should be able to store XHTML data.

Gabriel Ambuehl

Thursday 10 November 2005 1:55:54 pm

I (obviously) agree that XHTML should be used. James raises a good point: why not simply EXTEND XHTML with a few ez specific things (not very hard, I did a proof of concept) and somehow get hold of a SANE piece of code to validate it against some rules (for example, it might make sense to forbid Javascript and in some cases, the use of style="").

Instead of writing some giant, overly complex replacement for XHTML, leave it alone and specify a certain (but obviously wider than today's ezxml) subset that can be converted to PDF. Or wait for a Cairo based Mozilla to simply have it render the PDFs. It's about time one has a browser do that instead of some PHP code that doesn't know much about laying out PDFs anyway.

Bruce Morrison

Thursday 10 November 2005 7:16:03 pm

Hi James

Of course a web-CMS should be able to store XHTML data.

Playing devil's advocate here, but why? I agree that a web CMS should produce good clean valid XHTML, but why should it store it in this format?

As long as the features of the supported output formats (HTML / XHTML / PDF / RDF etc) can be articulated in the storage format why does it matter?

Cheers
Bruce

My Blog: http://www.stuffandcontent.com/
Follow me on twitter: http://twitter.com/brucemorrison
Consolidated eZ Publish Feed : http://friendfeed.com/rooms/ez-publish

Gabriel Ambuehl

Friday 11 November 2005 1:38:37 am

Why should it store XHTML?
Because the data of an old site is likely in XHTML already, and because XHTML is the easiest to deal with in GUI tools.

Xavier Dutoit

Friday 11 November 2005 1:46:42 am

Hi bunch of warriors ;)

"fight the right fight" "devils advocate"... I thought we were just discussing (ok, maybe arguing a wee bit ;) about the storage of something input as formatted text !

I half agree with Bruce, it doesn't matter too much how that's stored (I still don't see a good reason not to stick with a standard schema, xhtml or docbook or oasis, but that's another issue).

The problem isn't only about being able to have good output parsers to xhtml, pdf, wap, whatever.

It's also about having good input parsers. Bart wrote an input filter for oasis docs and it seems to work quite well from what I've seen.

The problem is that there isn't any parser for xhtml input.

We all agree that nearly all the content is entered via the web interface, and therefore xhtml should be the most logical schema to input things.

Instead, we have a new schema, "Simplified XML", that looks close enough to xhtml to be confusing, but it's quite limited, and you can't benefit from all the features that have already been developed to handle xhtml (wysiwyg editors, converters from nearly all the formats that have been created...).

No matter what storage schema is chosen, we need good output AND input parsers. The benefit of choosing a standard schema is that we can find lots of input/output converters. As most of the time that's going to be xhtml -> storage -> xhtml, I think that xhtml is a valid contender, as long as we can do oo -> storage(xhtml) -> pdf and whatever combination you can think of.

I'm waiting on ez releasing as GPL (that's the plan, isn't it ?) the new input parser Kirill has written, to see if I can add what's missing to properly handle xhtml.

X+

P.S I'm assuming ezxml (the storage schema) can handle well all what xhtml (minus the things we don't want, like javascript) can provide.
