Cyrille Bonnet wrote:
Daniel Dekany wrote:
BTW, has anybody found a solution for fixing HTML copy-pasted from Microsoft Word (mostly 2000/XP)? A lot of users have MS Word, and the HTML pasted from it is a CSS-killing mess. I tried mxTidy, but it didn't substantially improve the HTML. So how do you guys do it? I have looked for solutions for Epoz, but didn't find any. But I'm not tied to Epoz, if there is already a solution for Kupu (is Kupu already recommended over Epoz anyway?). Presumably the solution would be an Epoz post-tidy Python script, but I didn't find any for Word tidying. (However, the ideal would be for the HTML to be tidied right on the client when it is pasted in; that way users would really get what they see, i.e. the HTML wouldn't change when they save it. That effect is really evil.)
As Shane pointed out, Kupu has a tidy-up. However, in my experience it is not a very good one (if I remember correctly, a lot of tags are still there after the tidy-up).
Unfortunately there is a fine line between tidying up the cruft pasted from Word and stripping out things which might actually have been entered legitimately. I think Kupu does this pretty well (but then I'm a bit biased), but without any way to detect that the user is pasting from Word I don't see how much more could be stripped.

So far as I know, the only thing which doesn't really get stripped from pasted Word text is the mso classnames. These can be manually blacklisted, but I never got round to producing a definitive blacklist. One of my thoughts is to provide a separate 'clean this up' button which would apply a more aggressive tidy-up than the one applied when saving.

Also, I agree that applying the tidy only on save is bad, but there isn't a cross-browser way to detect a paste, and applying the cleanup to a large document every time you cut/paste one word wouldn't be nice either. Suggestions for improvements are most welcome.

P.S. It isn't just pasting bad HTML which is a problem: some Microsoft applications supply RTF on the clipboard but not HTML, and it turns out that if you paste RTF into IE it generates seriously invalid HTML with a totally weird and corrupted DOM. That is another area where I think the cleanup code finally does a passable job, but not yet a perfect one.
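
To make the blacklist idea a bit more concrete, here is a rough sketch of what stripping the mso classnames could look like as a server-side Python filter, in the spirit of the post-tidy script Daniel mentions. It is not Kupu's actual cleanup code. The function name `strip_mso_classes` and the rule "any class name starting with 'mso' is Word cruft" are assumptions made for illustration, not a definitive blacklist, and a regex pass like this only copes with reasonably well-formed, quoted attributes.

```python
import re

# Assumption: Word-generated class names start with "Mso" (e.g. MsoNormal).
# This prefix rule stands in for a real blacklist, which would need tuning.
MSO_CLASS = re.compile(r"^mso", re.IGNORECASE)

# Opening tags only, so that text content is never rewritten.
TAG = re.compile(r"<[a-zA-Z][^<>]*>")

# A quoted class attribute inside a tag: class="..." or class='...'.
CLASS_ATTR = re.compile(r'\sclass\s*=\s*(["\'])(.*?)\1',
                        re.IGNORECASE | re.DOTALL)


def _filter_class(match):
    """Drop blacklisted names from one class attribute."""
    quote, value = match.group(1), match.group(2)
    kept = [name for name in value.split() if not MSO_CLASS.match(name)]
    if not kept:
        # Nothing legitimate left, so remove the attribute entirely.
        return ""
    return " class=" + quote + " ".join(kept) + quote


def strip_mso_classes(html):
    """Remove Word-style class names from every opening tag in html."""
    return TAG.sub(lambda m: CLASS_ATTR.sub(_filter_class, m.group(0)), html)


if __name__ == "__main__":
    print(strip_mso_classes('<p class="MsoNormal intro">Hello</p>'))
```

Class names that don't match the prefix are left alone, which is the conservative side of the fine line described above; a more aggressive 'clean this up' pass could use a longer blacklist in place of the single prefix test.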