[Zope] RTF documents from Zope
Phil Harris
phil@philh.org
Tue, 28 Sep 1999 08:32:05 +0100
Hi Jason,
Find attached html2rtf.pl (a Perl script, I know, I know, but it's the
only converter I came across that doesn't need a Windows machine).
I've not tried this, just did a quick scan on the web.
Let me know if you have any luck with it.
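On the mailing side, here is a minimal sketch of attaching the converted RTF to a message. It is untested against Zope and assumes a modern Python with the standard `email` package rather than Mimetools. The function name `build_rtf_message`, the addresses, and the file name are made up for illustration, and the RTF bytes are a hand-written stand-in for whatever html2rtf.pl produces.

```python
from email.message import EmailMessage


def build_rtf_message(sender, recipient, subject, rtf_bytes, filename="document.rtf"):
    """Return an email message carrying rtf_bytes as an RTF attachment."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    # Plain-text body for mail readers that do not show attachments inline.
    msg.set_content("The document is attached as an RTF file.")
    # add_attachment converts the message to multipart/mixed and
    # base64-encodes the binary payload.
    msg.add_attachment(
        rtf_bytes, maintype="application", subtype="rtf", filename=filename
    )
    return msg


# A tiny hand-written RTF document standing in for html2rtf.pl output.
rtf_bytes = rb"{\rtf1\ansi Hello from Zope.}"
message = build_rtf_message(
    "sender@example.com", "recipient@example.com", "Your document", rtf_bytes
)
attachment = next(message.iter_attachments())
print(message.get_content_type())
print(attachment.get_content_type(), attachment.get_filename())
```

Actually sending it would be something like `smtplib.SMTP("localhost").send_message(message)`, assuming a mail server is listening on the local machine.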
Phil
phil@philh.org
-----Original Message-----
From: jason@zope.org [mailto:jason@zope.org]On Behalf Of Jason Spisak
Sent: Monday, September 27, 1999 7:58 PM
To: zope@zope.org
Subject: [Zope] RTF documents from Zope
zopists,
Anyone have any luck with converting DTML/HTML stuff from inside Zope to
RTF that can be sent to another party via email? I'm sure I'm not alone
in the world searching for the document format that I can send to people
that everyone can read. I thought HTML would do it, but the Mimetools
haven't been friendly about sending out readable attachments.
Any thoughts?
--
Jason Spisak
webmaster@hiretechs.com
_______________________________________________
Zope maillist - Zope@zope.org
http://www.zope.org/mailman/listinfo/zope
(To receive general Zope announcements, see:
http://www.zope.org/mailman/listinfo/zope-announce
For developer-specific issues, zope-dev@zope.org -
http://www.zope.org/mailman/listinfo/zope-dev )
Content-Type: application/x-perl; name="html2rtf.pl"
Content-Disposition: attachment; filename="html2rtf.pl"
#!/usr/local/bin/perl
# HTML2RTF -- Make an RTF rendering of an HTML file.
#
$html2rtf_version = 'HTML2RTF v1.1a'; # by sean@qrd.org
$html2rtf_revision =
  ' Time-stamp: <1997-11-25 22:25:30 MST sburke@babel.ling.nwu.edu> ';
# This package is Copyright 1996- by Sean M. Burke, sean@qrd.org
#
# See the docs at http://www.ling.nwu.edu/~sburke/html2rtf/
# Quick usage summary: html2rtf.pl input.html
#
# html2rtf is free software; you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by the
# Free Software Foundation; either version 2, or (at your option) any
# later version.
#
# html2rtf is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# General Public License for more details.
#
# To see a copy of the GNU General Public License, see
# http://www.ling.nwu.edu/~sburke/gnu_release.html, or write to the
# Free Software Foundation, 675 Mass Ave, Cambridge, MA 02139, USA.
# ------------------------------------------------------------

## Configurables follow...
$debug = 1;
# 0 makes this program totally quiet except for errors or warnings
#   -- good for use, say, in a batch file
# 1 makes this program say friendly things like say what file
#   it's working on.
# above 1 makes it spit out the RTF code on STDOUT as well as to a file.


# Here set the name of the fonts to use for proportional (normal)
# and monospace (CODE, PRE) text. They have to be the exact (?) name
# of the fonts as they exist on your system.
$proportional_font = 'Times New Roman';
# Alternately, choose, say, 'Arial', or maybe 'Optima' or a Schoolbook.
# Superpimps use 'DomCasual BT' or 'Cooper Blk BT'
$monospace_font = 'Courier New';

# These determine which character styles
# are rendered as which font attributes.
@italic_styles = ('I', 'CITE', 'VAR');
@bold_styles = ('STRONG', 'B', 'DEFN');
@underline_styles = ('U', 'EM');

$tab = 720; # distance between tab stops, in twips. 720 twips = .5 inches.

# How to render horizontal rules
$hr_style = '\qc\emdash\emdash\emdash';

# for a fancier and larger HR, use '\qc{\fs100\b ~}'
# Wishlist: write logic that'll spit out different /kinds/ of HRs
# depending on the indent level?

$paragraph_space_before = 90;
# Each <P> is preceded by this many twips of space downward
# twip == 1/1440th of an inch == 1/20th of a point
# (Cf. point = 1/72nd of an inch)
$heading_space_before = $paragraph_space_before * 2;
# Each heading style (H1-H6) is preceded by this many
# twips of space downward.

@heading_styles = ( '\fs55 ', '\fs50 ', '\fs45 ',
                    '\fs40 ', '\fs35 ', '\fs30 ');
# Styles for the headings, starting from H1, ending with H6
# \\fs40 means 20-point, \\fs50 means 25-point, etc.
# What follows is the association of the contents of ampersand entities
# to what they expand to. I've quoted all 8-bit values here just so
# the Perl script doesn't get mangled in transit.
# You can add anything you want here.

%entities = split(/[ \n]*,[ \n]*/,
"szlig,\xdf,eth,\xf0,ETH,\xd0,thorn,\xfe,THORN,\xde,yuml,\xff,
yacute,\xfd,Yacute,\xdd,uuml,\xfc,Uuml,\xdc,ugrave,\xf9,
Ugrave,\xd9,ucirc,\xfb,Ucirc,\xdb,uacute,\xfa,Uacute,\xda,
ocirc,\xf4,Ocirc,\xd4,
otilde,\xf5,Otilde,\xd5,oslash,\xf8,Oslash,\xd8,ograve,\xf2,
Ograve,\xd2,oacute,\xf3,Oacute,\xd3,ntilde,\xf1,Ntilde,\xd1,
iuml,\xef,Iuml,\xcf,igrave,\xec,Igrave,\xcc,icirc,\xee,
Icirc,\xce,iacute,\xed,Iacute,\xcd,euml,\xeb,Euml,\xcb,
egrave,\xe8,Egrave,\xc8,ecirc,\xea,Ecirc,\xca,eacute,\xe9,
Eacute,\xc9,ccedil,\xe7,Ccedil,\xc7,aelig,\xe6,AElig,\xc6,
atilde,\xe3,Atilde,\xc3,agrave,\xe0,Agrave,\xc0,acirc,\xe2,
Acirc,\xc2,aacute,\xe1,Aacute,\xc1,ouml,\xf6,Ouml,\xd6,aring,\xe5,
Aring,\xc5,auml,\xe4,Auml,\xc4,quot,\",amp,\&,gt,\>,lt,\<,
copy,\xa9,reg,\xae,
ensp,\\enspace ,emsp,\\emspace ,nbsp,\\~,\#92,\\\\,\#123,\\\{,\#125,\\\},
\#13,\\par,\#10,\\par,\#013,\\par,\#010,\\par" );
# Those last two lines are RTF-specific. Everything before is reusable
# in non-RTF things.


######################################################################
## END of constants & configurables.
## Don't change anything from here on, unless you really know
## what you're doing.
######################################################################

$charset = '\fcharset255';
# 255 = OEM, supposedly a good approximation of ISO-Latin-1

$paragraph_space_before = "\\sb$paragraph_space_before";
$heading_space_before = "\\sb$heading_space_before";

$fonttable = "\\deff0 {\\fonttbl
{\\f0\\froman $charset $proportional_font;}
{\\f1\\fmodern $charset $monospace_font;}}";

# You can add additional fonts to the font table there if you want
# to use them, for headings, say.
# E.g., to add Arial as a third font and to have headings in it, go:
# $fonttable = "\\deff0 {\\fonttbl
# {\\f0\\froman $charset $proportional_font;}
# {\\f1\\fmodern $charset $monospace_font;}
# {\\f2\\fswiss $charset Arial;}}";
#
# Then to change the heading styles to Arial bold italic:
# @heading_styles = ( '\fs55\f2\b\i ', '\fs50\f2\b\i ', '\fs45\f2\b\i ',
#                     '\fs40\f2\b\i ', '\fs35\f2\b\i ', '\fs30\f2\b\i ');
#
# Don't forget $charset, or the character translation won't work.
#
# Note the font family names -- Arial's is "swiss". SPECIFYING THE FAMILY
# NAME IS OPTIONAL. See the RTF specs for the allowable family names.

$* = 1; #don't assume each string is just one line long.

$init = "{\\rtf1\\ansi $fonttable
{\\colortbl\\red0\\green0\\blue0;}
{\\stylesheet{\\fs20 \\snext0 Normal;}}\n";

$underline_re = '(' . join('|', @underline_styles) . ')';
$italic_re = '(' . join('|', @italic_styles) . ')';
$bold_re = '(' . join('|', @bold_styles) . ')';
$close_re = '(' . join('|', @underline_styles,
                       @italic_styles, @bold_styles) . ')';

######################################################################
$stamp = $html2rtf_version;
if ($html2rtf_revision =~ /\<([^\>]+)/) {
  $html2rtf_revision = $1; # gussy it up.
  $html2rtf_revision =~ tr/ /_/;
  $stamp .= ' (revision ' . $html2rtf_revision . ')';
}

print "Starting $stamp\n" if ($debug > 0);

foreach $file (@ARGV) {
  # reset some states
  $pre_state = 0;
  undef @list_level;
  undef @list_type;

  die "Can't open input file $file" unless open(INFILE, $file);
  print "Converting $file\n" if ($debug > 0);
  $instream = join('', <INFILE>);
  close(INFILE);

  undef(@list_type); # clear 'em for each file
  undef(@list_level);

  # Catch the document meta-information
  # first up, the title!
  if ($instream =~ /<TITLE>([^<]+)<\/TITLE>/i) {
    $title = $1;
  } else {
    $title = $file;
  }
  $title =~ tr/\n{}\\/ ___/; # safetify these kooky illegal characters!

  # second up, the author!
  if ($instream =~ /<LINK\s+REV\s*=\s*MADE\s+HREF\s*=\s*"([^"]+)"/i) {
    $author = $1;
    if ($author =~ /^mailto:/i) {
      $author = $';
    }
  } else {
    $author = '';
  }
  $author =~ tr/\n{}\\/ ___/; # safetify these kooky illegal characters!

  # third up, keywords!
  if ($instream =~ /<META\s+HTTP-EQUIV\s*=\s*"Keywords"\s+Content\s*=\s*"([^"]+)"/i) {
    $keywords = $1;
  } else {
    $keywords = '';
  }
  $keywords =~ tr/\n{}\\/ ___/; # safetify these kooky illegal characters!

  # The time-stamp of the HTML file is copied into the "Creation Time" of
  # the RTF file. The moment when the conversion happens is stuck into
  # the "Revision Time" of the file, and will be overwritten if/when you
  # make (and save) changes to the RTF file in your word processor.

  ($junk, $junk, $junk, $junk, $junk, $junk, $junk, $junk, $junk,
   $filetime, $junk, $junk, $junk) = stat($file);
  # Get the input file's last revision date, in UNIX format

  # Now cook up the init string for this file.
  $myinit = $init . "{\\info \n{\\title $title}\n"
    . "{\\creatim" . &unixtime2rtf($filetime) . "}\n{\\revtim"
    . &unixtime2rtf(time()) . "}\n{\\author $author}\n{\\keywords $keywords}"
    . "{\\doccomm Converted from $file by: $stamp, by Sean M. Burke (sean\@QRD.org)}}\n";

  # OK, end of metainformation gig.

  if ($instream =~ /<BODY[^>]*>/i ) {
    $instream = $';
    $instream =~ s/ *<\/(BODY|HTML)>//ig;
  } else {
    warn "No BODY tag found in $file ... conversion may be deeply flawed.\n";
  }
  $instream =~ s/[{}\\]/\\$&/g; # Escape out brackets and \'s

  # $instream =~ s/<!--\s*([^-]*)\s*-->//g; # Kill SGML comments
  # $instream =~ s/<!--\s*([^-]*)\s*-->/{\\plain \\v $&}/g; # Kill SGML comments

  # Okay, now we start really chewing on the HTML and chunk by chunk
  # replacing it with the correct RTF code.

  $instream =~ s/<(\/PRE|\/[UOD]L|\/H[1-6][^>]*|BLOCKQUOTE|\/BLOCKQUOTE|HR)[^>]*>/$&<P>/ig;
  # put <P>'s after </PRE>s, close-UL/OL/DLs,
  # BLOCKQUOTEs, close-BLOCKQUOTEs, close-H?s, and HRs

  $instream =~ s/\s*<P>\s*(<H[1-6R][^>]*>)/$1/ig;
  # kill P's before headings or HRs.

  $instream =~ s/<P>[ \t\n]*<(PRE|[UOD]L|LI|HR|BLOCKQUOTE)>/<$1>/ig;
  # kill <P>'s before <PRE>, UL/OL/DLs, BLOCKQUOTEs, HR's, and LI's

  # Handle whitespace, PREs, and Ps
  $instream =~ s/(<\/?PRE[^>]*>|<PRE>)|(<\/?P[^>]*>)|(\s+)/&parse_p($&)/ieg;
  # Note the extremely powerful use of s/A/&B/ieg as a bogus parser here;
  # this substitutes for a messy while() loop. Worship s/A/&B/ieg !!

  # Now handle character styles
  $instream =~ s/<$underline_re>\s*/{\\ul /ig;
  $instream =~ s/<$italic_re>\s*/{\\i /ig;
  $instream =~ s/<$bold_re>\s*/{\\b /ig;
  $instream =~ s/\s*<\/$close_re>/}/ig;
  $instream =~ s/<CODE>\s*/{\\f1 /ig;
  $instream =~ s/\s*<\/CODE>/}/ig;
  $instream =~ s/<SUB>\s*/{\\sub /ig;
  $instream =~ s/\s*<\/SUB>/}/ig;
  $instream =~ s/<SUP>\s*/{\\super /ig;
  $instream =~ s/\s*<\/SUP>/}/ig;



  # Now handle the other structural codes
  # First, headings
  $instream =~ s/<H([1-6])[^>]*>\s*/\n\\par\\pard\\pard$heading_space_before\{\\plain $heading_styles[$1-1]/ig;


  $instream =~ s/<\/H[1-6]>\s*/\}\n/ig;
  # Now, other structural tags
  $instream =~ s/[ ]*<(HR|BR|P|BLOCKQUOTE|\/BLOCKQUOTE|UL|\/UL|LI|\/LI|DL|DT|DD|\/DL|OL|\/OL)[^>]*>[ ]*/&parse_structure($1)/ieg;


  # Whatever tags are left, hide 'em-- we don't know how to deal with 'em.
  $instream =~ s/<[^>]+>/\{\\plain \\v $&\}/g;

  # Almost there -- resolve entities
  $instream =~ s/\&([^\;]{1,9});/&resolve_entity($1)/eg;

  # Last step-- quote the 8-bit characters
  $instream =~ s/([\x80-\xff])/"\\'".(unpack("H2",$1))/eg;

  ###
  # Output it all
  $outname = $file;
  if ($outname =~ /\.html?/i) {
    $outname =~ s/\.html?/.rtf/i;
  } else {
    $outname .= '.rtf';
  }

  die "Can't open output file $outname" unless open (OUTFILE, "> $outname");
  print "Writing $outname\n" if ($debug > 0);
  print OUTFILE ($myinit);
  print OUTFILE ($instream);
  print OUTFILE ("}\n");
  if ($debug > 1) {
    print "=== $file ===\n$myinit";
    print $instream;
    print "}\n";
  }
  close (OUTFILE);
}
print "Done.\n" if ($debug > 0);

######################################################################

sub parse_p {
  # Deal with whitespace, PRE, and P's.
  # To be called only by the s/A/&B/ieg expression up there.
  # remember that $pre_state is global; its value is 0 or 1
  local($input) = $_[0]; # the thing that matched

  $input =~ tr/a-z/A-Z/;
  if ($input =~ /<PRE/) { # PRE tags
    $pre_state = 1;
    return "\n\\pard\\par\\f1 "; # note that a PRE implies a newline
    # also let's switch to the monospace font
  } elsif ($input =~ /<\/PRE/) { # close PRE's
    $pre_state = 0;
    return '\f0 '; # back to the proportional font
  } elsif ($input =~ /^<P>|<P[ \n\t]+/) { # P tags
    $pre_state = 0;
    return '<P>';
  } elsif ($input =~ /^<\/P>/) { # close-P's
    $pre_state = 0;
    return '';
  } elsif ($input =~ /^[ \n\t]+$/) { # whitespace
    if ($pre_state == 0) {
      return ' '; # collapse all whitespace
    } else { # we're in a PRE entity-- fiddle with whitespace
      $input =~ s/\n/\n\\pard\\par /g;
      return $input;
    }
  }
}

######################################################################

sub parse_structure {
  # @list_type is global
  # @list_level is global
  # the FIRST element of each list is the most current one.
  local($in_tag) = $_[0];
  local($l_indent, $depth);
  $in_tag =~ tr/a-z/A-Z/;

  $depth = @list_level; #note that DL doesn't affect this stack
  $l_indent = $tab * $depth;

  if ($in_tag eq 'UL') {
    unshift(@list_level, 0); #just a placeholder
    unshift(@list_type, 'UL'); #store the list type
    return '';
  } elsif ($in_tag eq 'OL') {
    unshift(@list_level, 0); #will get incremented
    unshift(@list_type, 'OL'); #store the list type
    return '';
  } elsif ($in_tag eq '/OL' || $in_tag eq '/UL' || $in_tag eq '/BLOCKQUOTE') {
    shift(@list_level);
    shift(@list_type);
    #return '\\par\\pard';

  } elsif ($in_tag eq 'LI') {
    if($list_type[0] eq 'UL') {
      return "\n\\par\\pard\\li$l_indent\\bullet ";
    } elsif ($list_type[0] eq 'OL') {
      ++$list_level[0];
      return "\n\\par\\pard\\li$l_indent {\\b $list_level[0]}\. ";
    } # else what the hell are we doing saying 'LI'?
  } elsif ($in_tag eq '/LI') {
    #ummm?

  } elsif ($in_tag eq 'BLOCKQUOTE') {
    unshift(@list_level,0); # a dummy placeholder
    unshift(@list_type,'BLOCKQUOTE'); # not too contentful either
    return '';

  } elsif ($in_tag eq 'DL' || $in_tag eq '/DL') {
    # la la la, I'm doing nothing, la la la.

  } elsif ($in_tag eq 'P' || $in_tag eq 'DT') {
    # note that DT is just like a P
    return "\n\\par\\pard$paragraph_space_before\\li$l_indent ";

  } elsif ($in_tag eq 'DD') {
    # just like a P, but just one more indent level in, and without
    # $paragraph_space_before
    return "\n\\par\\pard\\li" . ($l_indent + $tab) . ' ';

  } elsif ($in_tag eq '/DT') {
    # ummm?
  } elsif ($in_tag eq '/DD') {
    # ummm?

  } elsif ($in_tag eq 'BR') {
    return "\n\\par\\pard\\li$l_indent ";

  } elsif ($in_tag eq 'HR') {
    return "\n\\par\\pard\\li$l_indent $hr_style ";
    # note that there is indenting-- so that a HR in the middle of a list
    # can be centered differently from one outside a list
  }

  return '';
}

######################################################################
sub unixtime2rtf {
  local($intime, $sec, $min, $hr, $dy, $mo, $yr, $junk);

  $intime = $_[0];

  ($sec, $min, $hr, $dy, $mo, $yr, $junk, $junk, $junk) = localtime($intime);
  # Note that this is local time.

  ++$mo; # Note that PERL counts months from January=0
         # but RTF counts from January=1
  $yr += 1900;

  return "\\yr$yr\\mo$mo\\dy$dy\\hr$hr\\min$min\\sec$sec";
}

######################################################################
sub resolve_entity {
  local($in, $out);
  $in = $_[0];

  $out = $entities{$in};
  if (defined($out)) { #easily resolvable
    return $out;
  } elsif ($in =~ /^\#([0-9]+)$/) {
    return pack("C", $1);
  } else {
    return '&' . $in . ';'; #make it unchanged.
  }

}

######################################################################