#1
Posted to microsoft.public.word.newusers
Peter Rooney

Saving a Word DOC as HTML

I've discovered that Saving a document As HTML preserves
attributes such as italic (<I> ... </I>), bold, and certain characters,
such as &#268; (Č, a Czech accented letter). If the document is a table, the
content of the cells is saved inside HTML pairs like <td> ... </td>. In
fact, Save As HTML gives me a text file which I can analyze and process
with external programs.
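
A minimal sketch of that kind of external processing, assuming Python and its standard html.parser module (the thread does not say which programs were actually used). The names CellExtractor and extract_cells are made up for illustration. It collects the text of each table cell and decodes references such as &#268; along the way:

```python
from html.parser import HTMLParser


class CellExtractor(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        # convert_charrefs=True decodes references such as &#268; into characters
        super().__init__(convert_charrefs=True)
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TD> and <td> both match
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            # Join the fragments and collapse runs of whitespace
            text = " ".join("".join(self._cell).split())
            if self._row is None:
                self._row = []
            self._row.append(text)
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        # Only keep text that sits inside a table cell
        if self._cell is not None:
            self._cell.append(data)


def extract_cells(html):
    """Return the table content as a list of rows of cell strings."""
    parser = CellExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows
```

Formatting tags inside a cell (italic, bold, Word's span and p wrappers) are simply ignored here; a variant that needs the italic or bold information would have to record those tags as well.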

So far so good. The problem is that the Save As conversion also gives me a
lot of trash that is of no use to me, such as:
<p class=MsoPlainText style='margin-left:.25in ... etc.
<span style='font-size:12 ... etc. ... </span>
etc.

I can search and replace some of these strings, and I have developed
filters that can take care of others. But it's a laborious process involving
various software. It would be better not to get the "trash" in the first
place. The Save As XML option is even worse. Is there a way to make a simple
conversion using Word or 3rd party software?


#2
Tony Jollans

One man's trash is another's treasure trove. The short answer to your
question is no because only you know which bits of formatting you want to
interrogate and which you want to ignore.

That said, I believe there is a way to remove some of the bloat that Word
adds to HTML documents. I can't remember it off the top of my head but I
think most of what it removes is relatively easy to identify yourself.
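
A minimal sketch of identifying and removing that bloat yourself, assuming Python's standard html.parser. This is not the Microsoft filter and makes no claim to match its output. The sets DROP_TAGS, DROP_ATTRS and SKIP_TAGS are illustrative guesses at what counts as "trash" and should be adjusted to taste:

```python
from html import escape
from html.parser import HTMLParser

# Assumptions for illustration: which markup counts as unwanted.
DROP_TAGS = {"span", "font", "o:p"}      # tag removed, enclosed text kept
DROP_ATTRS = {"class", "style", "lang"}  # attribute removed from every tag
SKIP_TAGS = {"style"}                    # tag removed together with its content


class WordHtmlCleaner(HTMLParser):
    """Re-emit HTML without the tags and attributes listed above.

    Comments (including Word's conditional comments) and declarations
    such as DOCTYPE are dropped because no handler writes them out.
    """

    def __init__(self):
        # Keep character references such as &#268; exactly as written
        super().__init__(convert_charrefs=False)
        self.out = []
        self._skip = 0  # greater than 0 while inside a SKIP_TAGS element

    def _format_attrs(self, attrs):
        kept = []
        for name, value in attrs:
            if name in DROP_ATTRS:
                continue
            if value is None:
                kept.append(f" {name}")
            else:
                kept.append(f' {name}="{escape(value, quote=True)}"')
        return "".join(kept)

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif not self._skip and tag not in DROP_TAGS:
            self.out.append(f"<{tag}{self._format_attrs(attrs)}>")

    def handle_startendtag(self, tag, attrs):
        if not self._skip and tag not in DROP_TAGS and tag not in SKIP_TAGS:
            self.out.append(f"<{tag}{self._format_attrs(attrs)}/>")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif not self._skip and tag not in DROP_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self._skip:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self._skip:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self._skip:
            self.out.append(f"&#{name};")


def clean_word_html(html):
    """Return html with the Word-specific markup stripped."""
    cleaner = WordHtmlCleaner()
    cleaner.feed(html)
    cleaner.close()
    return "".join(cleaner.out)
```

Tags such as i, b and td pass through untouched, so the attributes the original poster wants to keep survive. Note that kept attributes are re-quoted with double quotes and tag names come out lowercased.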

--
Enjoy,
Tony


#3
Stephen Glynn

I've not used them but, according to an article at

http://techrepublic.com.com/5100-1035_11-5197013.html ,

there are a couple of free utilities you can download from Microsoft
that'll clear out the gubbins that Word introduces into HTML (links in
the article).

HTH

Steve


#4
Peter Rooney

Thanks. The "Office 2000 HTML Filter" (downloaded as MSOHTMF2.EXE) seems
to be just what I'm looking for. The one limitation is that you need to
have Office 2000 on your system to install it; Office 2003 won't do. It's a
ridiculous limitation, but it can be circumvented.


Stephen Glynn wrote:
according to an article at
http://techrepublic.com.com
there are a couple of free utilities you can download from Microsoft
that'll clear out the gubbins that Word introduces into HTML


#5
Bob Buckland ?:-)

Hi Peter,

In Word 2002 and Word 2003, using
File=>Save As Web Page-Filtered will do
basically the same thing as the Office 2000
HTML filter (at least the part used inside
Word, which the MSFilter.DOT add-in lists as
File=>Save as Compact HTML).

In Word 2002 and 2003, the 'Filtered'
output takes into account the settings
you have in Tools=>Options=>General=>[Web Options].


You can still use the standalone Office 2000 MSFilter.exe
tool to batch process Word HTML files that have already been created;
it will also remove the CSS style formatting from the filtered
HTML pages.

You can also use apps such as HTML Tidy to process the files.
Creating 'public use' web pages wasn't really the design goal
for Word's HTML output after Word 97. Rather, it was a way to create
a 'browser viewable' version of a Word document while retaining
all of the parts of a .doc file that a browser didn't support,
so that you could turn it back into a .doc file when opening it in Word
from a browser ('roundtripping').

For actual web pages, MS Office FrontPage was the intended app.

--
Let us know if this helped you,

Bob Buckland ?:-)
MS Office System Products MVP

*Courtesy is not expensive and can pay big dividends*

For Everyday MS Office tips to "use right away" -
http://microsoft.com/events/series/a...andtricks.mspx



