How to minify HTML

Jerome Despatis

Monday 07 February 2011 11:49:50 am

Hi,

I'd like to minify the HTML.

Any idea how to do this? Ideally, there would be a hook somewhere right before the cache is stored. That way, it would be possible to run some tidy function to minify the HTML.

Any ideas are welcome.

André R.

Monday 07 February 2011 3:30:26 pm

There is no such hook at the moment, and it would only cover cached fragments anyway, not dynamic content. So the best place would probably be a template compile option, as that would cover both cases.

But such a hook does not exist either, so your best bet is probably "mod_pagespeed" from Google.

eZ Online Editor 5: http://projects.ez.no/ezoe || eZJSCore (Ajax): http://projects.ez.no/ezjscore || eZ Publish EE http://ez.no/eZPublish/eZ-Publish-Enterprise-Subscription
@: http://twitter.com/andrerom

Jerome Despatis

Tuesday 08 February 2011 1:28:51 am

Yes, the pagespeed module for Apache is an option, but I would prefer to avoid a solution based on the Apache config, as I may not have control over it...

Do you think it would be complex to add as a feature?

Gaetano Giunta

Tuesday 08 February 2011 1:44:59 am

There is no hook to minify cached fragments, but there is one to apply a filter to the "whole page HTML" before it is echoed back to the user.

Look up [OutputSettings] OutputFilterName in site.ini.

Principal Consultant International Business
Member of the Community Project Board

Marko Žmak

Tuesday 08 February 2011 3:12:12 am

Jerome, you could also take a look at the eztidy extension:

http://projects.ez.no/eztidy

It includes a template operator and an output filter.

It won't completely minify your HTML, but it will remove unneeded spaces and format your HTML nicely.

But beware: when passing your HTML through eztidy, all tags that do not comply with the standard are removed. For example, tags like <fb:like> get removed.

You can specify new block-level tags and new inline tags in eztidy.ini, but for me it didn't work properly: there were always some problems with the non-standard tags.

--
Nothing is impossible. Not if you can imagine it!

Hubert Farnsworth

Jerome Despatis

Tuesday 08 February 2011 4:45:34 am

Thanks for your answers!

[OutputSettings] OutputFilterName in site.ini is a really cool feature!

But is the output filter run on every click? Or is it applied only once, before the cache file is created (which would be optimal)?

Marko, thanks for this extension, I'm going to bundle it in mine...
For custom tags, you should indeed look at the tidy settings; this extension is just a small wrapper around tidy. Tidy really has a lot of options! I need to find the right ones...

Gaetano Giunta

Tuesday 08 February 2011 5:03:06 am

The output filter is indeed run on every click - unless you add some caching layer in your filter, of course ;-)
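
To illustrate the "caching layer in your filter" idea, here is a minimal sketch in Python (not eZ Publish code; `expensive_filter`, `cached_filter` and `stats` are names made up for this example). It remembers the filtered result for each distinct page, so identical output is only processed once. Note that the hash of the page is still computed on every call.

```python
import hashlib
import re

_cache = {}                      # hypothetical in-memory store: hash -> filtered HTML
stats = {"filter_runs": 0}       # counts how often the real filter actually ran


def expensive_filter(html):
    """Stand-in for the real minifier (assumed to be the slow part)."""
    stats["filter_runs"] += 1
    return re.sub(r"\s+", " ", html)


def cached_filter(html):
    """Run expensive_filter only for pages not seen before."""
    key = hashlib.sha1(html.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = expensive_filter(html)
    return _cache[key]


first = cached_filter("<p>hello\n   world</p>")
second = cached_filter("<p>hello\n   world</p>")  # served from the cache
```

A real filter would probably need a bounded or persistent store instead of a plain dictionary, and pages with per-user content would rarely hit the cache.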

Marko Žmak

Tuesday 08 February 2011 12:13:45 pm

In my experience, setting tidy as the output filter didn't degrade site performance. I haven't done any tests or comparisons, but there was no visible slowdown.

Also, if you use the static cache for your most "heavy" pages, the output filter will be called for them only when the static cache is updated, and not on every visit to the page.

Jerome Despatis

Tuesday 08 February 2011 12:26:11 pm

Thanks,

From what I've read about minifying HTML, there can be some performance degradation, but as you say, with Varnish in front it's not such a problem.

Marko, concerning minifying with tidy: in order to minify as much as possible, you could look at this post:

http://php100.wordpress.com/2006/10/30/html-compact/

There's a working configuration for tidy, with one extra str_replace to delete the \n.

But the minified output is not yet a single line of HTML, because inline CSS/JS is not minified.

I still have to look into this point, but the result is good enough for now; my Google Page Speed score is going to be happy!

Marko Žmak

Tuesday 08 February 2011 1:18:06 pm

If you want to minify your HTML by removing extra spaces, then eztidy might be a little bit of overkill.

Creating your own output filter sounds like a better solution. I believe it should do the trick by doing this:

  • replace all whitespace characters with a space
  • remove all duplicate spaces
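
The two steps above can be sketched as follows. This is a minimal illustration in Python rather than an actual eZ Publish output filter; the function name `collapse_whitespace` is made up for this example, and the final `strip()` is an extra that the list does not mention.

```python
import re


def collapse_whitespace(html):
    """Naive minifier: follows the two steps literally."""
    # Step 1: replace every whitespace character (tab, newline, ...) with a space.
    spaced = re.sub(r"\s", " ", html)
    # Step 2: remove duplicate spaces by collapsing each run into one.
    collapsed = re.sub(r" {2,}", " ", spaced)
    # Extra (assumption): also drop leading/trailing whitespace.
    return collapsed.strip()


page = "<ul>\n    <li>one</li>\n\t<li>two</li>\n</ul>\n"
minified = collapse_whitespace(page)
# minified == "<ul> <li>one</li> <li>two</li> </ul>"
```

This treats the page as plain text, so it also rewrites whitespace inside elements where it matters.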

Marko Žmak

Tuesday 08 February 2011 1:21:35 pm

Just found this one...

There's a PHP minify app on Google Code:

  • http://code.google.com/p/minify/

and here's the code of its HTML minifier:

  • http://code.google.com/p/minify/source/browse/trunk/min/lib/Minify/HTML.php

Jerome Despatis

Wednesday 09 February 2011 12:58:58 am

Marko, simple code to remove \n should work, but Murphy's law applies whenever regexps are involved... Imagine a JS string that contains \n: it would simply be dropped by this code, changing the behavior of the application.

Tidy should take care of this, I guess.

I've tested the HTML.php function, but it leaves some \n.

In fact, the developer did this on purpose, as a comment in the code says:

use newlines before 1st attribute in open tags (to limit line lengths)

Marko Žmak

Wednesday 09 February 2011 1:42:12 am

"Marko, simple code to remove \n should work, but Murphy's law applies whenever regexps are involved... Imagine a JS string that contains \n: it would simply be dropped by this code, changing the behavior of the application."

Yes, you're right. And now that I think about it, what about <pre> tags? We don't want to strip whitespace and newlines from there...
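
One way to address both concerns (JS code and <pre>) is to copy such elements through untouched and only collapse whitespace in the markup between them. Below is a minimal Python sketch, not eZ Publish code and not what HTML.php or tidy actually do. The list of protected elements and the name `minify_preserving` are assumptions, and since it is regexp-based it can still fail on unusual markup, for example a ">" inside an attribute value.

```python
import re

# Elements copied through verbatim (an assumed list, adjust as needed).
PROTECTED = re.compile(
    r"<(pre|script|style|textarea)\b[^>]*>.*?</\1\s*>",
    re.DOTALL | re.IGNORECASE,
)


def minify_preserving(html):
    """Collapse whitespace everywhere except inside protected elements."""
    out = []
    pos = 0
    for match in PROTECTED.finditer(html):
        # Markup before the protected element: safe to collapse.
        out.append(re.sub(r"\s+", " ", html[pos:match.start()]))
        # The protected element itself: keep byte for byte.
        out.append(match.group(0))
        pos = match.end()
    # Whatever follows the last protected element.
    out.append(re.sub(r"\s+", " ", html[pos:]))
    return "".join(out)


page = (
    "<p>a\n\n  b</p>\n"
    "<pre>x\n  y</pre>\n"
    "<script>var s = 'one' +\n  'two';</script>\n"
)
result = minify_preserving(page)
```

Here the paragraph is collapsed to `<p>a b</p>`, while the newlines and indentation inside `<pre>` and `<script>` are left as they were.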

Maybe eztidy is the way to go, but as I already said, in my experience it isn't a bulletproof solution. First of all, you have to be careful whenever you add new non-standard tags (fb: tags, og: tags...) and declare them as new tags in eztidy.ini. Second, I had cases where even this didn't work: from time to time the non-standard tags would just disappear from the markup, and I had to clear the eZ cache to get them back. I didn't have time to investigate further, so I just turned off eztidy.

Jerome Despatis

Wednesday 09 February 2011 5:30:20 am

For now, I'm testing the HTML.php lib. I've commented out the line that limits line lengths, as I really don't see the point of it...

It works so far; I'll keep testing...

Concerning custom tags, I don't use <xx:yy> tags, but my JS framework, Dojo, uses a bunch of tags like <button> etc., and it works well so far.

Maybe you could test it and see if it works with your code.

It has a chance to work because, whereas tidy checks the validity of the HTML, HTML.php doesn't do that; it just minifies with regexps.

Marko Žmak

Wednesday 09 February 2011 2:06:53 pm

I accidentally just found an interesting tool, mod_pagespeed:

  • http://code.google.com/speed/page-speed/docs/module.html

It has a filter that collapses whitespace:

  • http://code.google.com/speed/page-speed/docs/filter-whitespace-collapse.html

It preserves whitespace in <script> and <pre>, and since it's an Apache module I think it should be faster than a PHP output filter. It also has some other interesting filters.

I'll probably give it a try...

As an interesting detail, note the "Risks" section on the whitespace filter page. I don't believe that any HTML minifier will honor the "white-space: pre" CSS directive.

Jerome Despatis

Thursday 10 February 2011 1:23:16 am

For your information, the developer of HTML.php answered my question about adding line breaks to limit line length. Here is his answer:

-----

The line breaks don't add any size since they replace single spaces. The rumor is some source control systems have trouble with very long lines (I've seen this in YUI Compressor docs). Since Minify_HTML is not a parser, I can't count characters then insert breaks, so I just put them wherever it's easy to.

------

So indeed, I really think those line breaks can be removed...

Concerning mod_pagespeed, yes, this is in fact THE Apache module for minifying and similar optimizations. But I'd like to avoid depending on a specific Apache config, as I may have no control over it, depending on the hosting (in a cloud for example, although maybe I do have control over the Apache config there; I haven't tested that yet...)

Marko Žmak

Monday 14 February 2011 4:12:12 am

I have modified the HTML.php library to take two more options in the $options parameter:

  • removeComments
  • limitLineLengths

and created an output filter that uses it. It works OK for now.

The difference is 152 KB versus 100 KB, so I save 52 KB per page, not bad. With limitLineLengths turned on I get the same size, so I think I'll leave it on.

I have also made a template operator, so that only some parts of the HTML can be minified.

And I found a little bug in HTML.php: it doesn't collapse multiple whitespace characters between two pieces of plain text. For example, a string like "aaaa     bbb" will remain intact.

As for mod_pagespeed, I found out that it is rather unstable and that installing it can be a performance and security risk. So we should forget about it for now.

Damien Pobel

Monday 14 February 2011 5:53:04 am

Hi Marko,

And what is the difference when GZip or deflate compression (mod_deflate in Apache2) is enabled?
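
One way to answer this for a given page is to measure all four sizes side by side. The sketch below is a minimal, assumption-laden Python example: the sample page is synthetic, the minifier is a naive whitespace collapse rather than HTML.php, and `gzip.compress` only stands in for what mod_deflate would send. The numbers it gives say nothing about a real site.

```python
import gzip
import re


def sizes(html):
    """Return byte sizes of a page: raw/minified, each with and without gzip."""
    minified = re.sub(r"\s+", " ", html)   # naive stand-in for a real minifier
    raw_bytes = html.encode("utf-8")
    min_bytes = minified.encode("utf-8")
    return {
        "raw": len(raw_bytes),
        "minified": len(min_bytes),
        "raw_gzip": len(gzip.compress(raw_bytes)),
        "minified_gzip": len(gzip.compress(min_bytes)),
    }


# Synthetic, highly repetitive sample page (hypothetical content).
page = "<ul>\n" + '    <li class="item">entry</li>\n' * 200 + "</ul>\n"
report = sizes(page)
for name, size in report.items():
    print(name, size)
```

Comparing `raw_gzip` with `minified_gzip` shows how much minification still saves once compression is on.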

Damien
Planet eZ Publish.fr : http://www.planet-ezpublish.fr
Certification : http://auth.ez.no/certification/verify/372448
Publications about eZ Publish : http://pwet.fr/tags/keywords/weblog/ez_publish
