MLeditor_Dana
 
How to clean .rtf documents

My company uses .rtf documents as "raw" documents (revised annually) that are
subsequently converted (via a special in-house conversion program) to
specially coded HTML documents for use on our web site. Over the past 2
years or so, we have noticed more than ever that our conversion program is
getting caught up on all the extraneous code automatically entered into .rtf
documents as they are opened and edited by various people. The code causes
our conversion program to skip important sections of text that should be
recognized and linked, and occasionally, the file will get so jumbled with
this code that the conversion program rejects it altogether!

Currently, the only way I know to fix the problem is to save the entire
document as a .txt file, then resave it as .rtf. Then reformat all my lost
formatting (bolds, italics, etc). And even then, I still sometimes have to
re-open the file as .txt and manually hunt for the offensive code! We have
over 1200 documents that I reconvert 4 times a year, so I'm wasting a ton of
time.

Is there a way to "clean" this code? Also, is there a way to turn off this
annoying code so that it doesn't get entered in the first place? (I believe
much of this formatting is RSID code that records document changes and
versions? Much of our revision process involves pasting text from other
documents, so I'm sure some of this code is getting entered as a result of
the copy/paste.)

For example, this:
2-Methyl-3-hydroxybutyryl-}{\insrsid8069086
coA}{\insrsid8069086\charrsid1904895 dehydrogenase deficiency is a rare
X-linked organic aciduria with a highly unusual \'93neurodegenerative\'94
disease

should simply be this:
2-Methyl-3-hydroxybutyryl-coA dehydrogenase deficiency is a rare X-linked
organic aciduria with a highly unusual "neurodegenerative" disease
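
One possible way to do this kind of cleanup outside Word is a small script that tokenizes the RTF and drops every control word containing "rsid" (for example \insrsid, \charrsid, and the \rsidtbl header group). The Python below is only a rough sketch under stated assumptions, not a tested tool: the function name strip_rsid and its helpers are made up for illustration, the idea that every "rsid" control word is safe to drop is an assumption, and binary data (\bin) and other unusual RTF features are not handled. It also unwraps groups that contain nothing but text, which is what joins "hydroxybutyryl-" and "coA" back together. Escapes such as \'93 and \'94 are valid RTF for curly quotes and are left untouched.

```python
import re

# Split RTF into control words, hex escapes, control symbols, braces, and text.
TOKEN = re.compile(r"""
    \\[A-Za-z]+-?\d*[ ]?   # control word, optional numeric parameter, optional delimiter space
  | \\'[0-9A-Fa-f]{2}      # hex-escaped character such as \'93
  | \\.?                   # control symbol such as \{ or \* (or a stray backslash)
  | [{}]                   # group delimiters
  | [^\\{}]+               # plain text
""", re.VERBOSE | re.DOTALL)

# Assumption: any control word containing "rsid" is revision bookkeeping.
RSID_WORD = re.compile(r"\\[a-z]*rsid[a-z]*-?\d* ?")
CONTROL_WORD = re.compile(r"\\[A-Za-z]")


def _delimit(out):
    """Give a trailing control word an explicit space delimiter so it
    cannot run into whatever text ends up next to it."""
    if out and CONTROL_WORD.match(out[-1]) and not out[-1].endswith(" "):
        out[-1] += " "


def _is_plain(tok):
    """True for literal text and hex escapes, i.e. tokens that carry no formatting."""
    return tok[0] not in "\\{}" or tok.startswith("\\'")


def strip_rsid(rtf):
    """Hypothetical helper: return rtf with RSID control words removed."""
    tokens = TOKEN.findall(rtf)
    kept = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        # Skip the whole {\*\rsidtbl ...} header group, braces included.
        if (tok == "{" and tokens[i + 1:i + 2] == ["\\*"]
                and tokens[i + 2:i + 3]
                and tokens[i + 2].startswith("\\rsidtbl")):
            depth = 0
            while i < len(tokens):
                depth += (tokens[i] == "{") - (tokens[i] == "}")
                i += 1
                if depth == 0:
                    break
            continue
        if RSID_WORD.fullmatch(tok):
            _delimit(kept)  # keep the preceding control word terminated
            i += 1
            continue
        kept.append(tok)
        i += 1

    # Unwrap groups that now hold only text; they change no formatting.
    out = []
    i = 0
    while i < len(kept):
        if kept[i] == "{":
            j = i + 1
            while j < len(kept) and _is_plain(kept[j]):
                j += 1
            if j < len(kept) and kept[j] == "}":
                _delimit(out)
                out.extend(kept[i + 1:j])
                i = j + 1
                continue
        out.append(kept[i])
        i += 1
    return "".join(out)
```

To try it, read a copy of one .rtf file as text, pass the contents to strip_rsid, write the result to a new file, and compare the two in Word before running it over a whole folder. Always work on copies, since this has not been checked against the full RTF specification.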

ANY assistance is MUCH appreciated!!

MLeditor_Dana

Thanks Don. Unfortunately, our "html conversion" is not a simple
convert--the conversion also does a massive amount of linking, indexing, etc.
We also have a pretty sophisticated document management system.
Unfortunately, it was all built based on our "raw" documents as .rtf, and it
would be a tragedy to scrap the entire system to use a different type of raw
document at this point. Integrating our conversion program with third-party
software is probably futile.

I was REALLY just hoping to find a way to clean up these .rtf documents. It
just doesn't seem right that there's no way to strip out all that extra RSID
code if we don't need or want it in there.

--dana

"Don" wrote:
Word and web pages is an oxymoron ;-)
Even if it's only a simple RTF!

It's likely your users are copying and pasting from newer versions of
Microsoft products (Word or otherwise) into the RTF format (Word 6.0).

The most effective approach, and the least effort on your part, would be to
instruct the users to copy and paste in the following order:
1) Copy from Word
2) Paste into Notepad
3) Copy from Notepad, paste into the RTF, and recreate formatting as desired.

There are many tools for cleaning Word bloat in HTML pages (one is HTML
Tidy), but the majority of these do not remove all the bloat and will remove
all formatting.

IMO, you'd be much better off in the long run with a CMS as opposed to
using RTF and a conversion process, especially considering the volume of
your existing documents and the future quantities to come.
Here's a link to an open-source CMS:
http://www.opendocman.com/


Copyright ©2000 - 2024, Jelsoft Enterprises Ltd.
Copyright ©2004-2024 Microsoft Office Word Forum - WordBanter.
The comments are property of their posters.
 
