#1
How to clean .rtf documents
My company uses .rtf documents as "raw" documents (revised annually) that are subsequently converted (via a special in-house conversion program) to specially coded HTML documents for use on our web site. Over the past 2 years or so, we have noticed more than ever that our conversion program is getting caught up on all the extraneous code automatically entered into .rtf documents as they are opened and edited by various people. The code causes our conversion program to skip important sections of text that should be recognized and linked, and occasionally the file gets so jumbled with this code that the conversion program rejects it altogether!

Currently, the only way I know to fix the problem is to save the entire document as a .txt file, resave it as .rtf, and then reapply all my lost formatting (bolds, italics, etc.). And even then, I still sometimes have to re-open the file as .txt and manually hunt for the offending code! We have over 1200 documents that I reconvert 4 times a year, so I'm wasting a ton of time.

Is there a way to "clean" this code? Also, is there a way to turn off this annoying code so that it doesn't get entered in the first place? (I believe much of this formatting is RSID code that records document changes and versions? Much of our revision process involves pasting text from other documents, so I'm sure some of this code is getting entered as a result of the copy/paste.)

For example, this:

2-Methyl-3-hydroxybutyryl-}{\insrsid8069086 coA}{\insrsid8069086\charrsid1904895 dehydrogenase deficiency is a rare X-linked organic aciduria with a highly unusual \'93neurodegenerative\'94 disease

should simply be this:

2-Methyl-3-hydroxybutyryl-coA dehydrogenase deficiency is a rare X-linked organic aciduria with a highly unusual "neurodegenerative" disease

ANY assistance is MUCH appreciated!!
#2
How to clean .rtf documents
Thanks Don. Unfortunately, our "html conversion" is not a simple conversion: it also does a massive amount of linking, indexing, etc. We also have a pretty sophisticated document management system. Unfortunately it was all built around our "raw" documents being .rtf, and it would be a tragedy to scrap the entire system to use a different type of raw document at this point. Integrating our conversion program with 3rd-party software is probably futile.

I was REALLY just hoping to find a way to clean up these .rtf documents. It just doesn't seem right that there's no way to strip out all that extra RSID code if we don't need or want it in there.

--dana

"Don" wrote:

Word and web pages is an oxymoron ;-) Even if it's only a simple RTF!

It's likely your users are copying and pasting from newer versions of MS products (Word or otherwise) into the RTF format (Word 6.0). The most effective approach, and the least effort on your part, would be to instruct the users to copy and paste in the following order:

1) Copy from Word
2) Paste into Notepad
3) Copy and paste from Notepad to RTF and recreate formatting as desired.

There are many tools for cleaning Word bloat in HTML pages (one is HTML Tidy), but the majority of these do not remove all the bloat and will remove all formatting.

IMO, you'd be much better off in the long run with a CMS as opposed to using RTF and a conversion process, especially considering the volume of your existing documents and the future quantities to come. Here's a link to an open-source CMS: http://www.opendocman.com/
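For the cleanup asked about in this thread (stripping the RSID code while keeping the files as .rtf), one possible approach is a small script that removes only the revision-tracking control words and leaves everything else alone. The following is a minimal sketch in Python, not a tested production tool. Its assumptions: the list of control word names in RSID_WORDS is my reading of the RTF specification's revision-save-ID keywords; the names clean_rtf, TOKEN, and _push_text are made up for this example; and the tokenizer is deliberately simple (it does not handle \bin binary data, for instance). It also unwraps groups that are left holding nothing but plain text, since such groups no longer change any formatting. The \'93-style escapes are left untouched.

```python
import re

# RSID-related control words (assumed list, taken from the RTF spec's
# "revision save ID" keywords; adjust if your files use others).
RSID_WORDS = {
    "rsid", "rsidroot", "insrsid", "delrsid", "charrsid",
    "pararsid", "sectrsid", "tblrsid", "styrsid",
}

# A deliberately small RTF tokenizer.
TOKEN = re.compile(
    r"\\[a-zA-Z]+(?:-?\d+)?[ ]?"  # control word, optional number, optional delimiter space
    r"|\\'[0-9a-fA-F]{2}"         # hex escape such as \'93
    r"|\\[^a-zA-Z]"               # control symbol such as \{ or \*
    r"|[{}]"                      # group delimiters
    r"|[^\\{}]+"                  # plain text
    r"|\\"                        # stray trailing backslash
)
CONTROL_WORD = re.compile(r"\\([a-zA-Z]+)")


def _is_text(tok):
    return tok[0] not in "\\{}"


def _push_text(out, text):
    """Append text, keeping a delimiter after a preceding control word."""
    if out and CONTROL_WORD.match(out[-1]) and not out[-1].endswith(" "):
        out[-1] += " "  # otherwise "\b" + "text" would read as "\btext"
    if out and _is_text(out[-1]):
        out[-1] += text  # merge adjacent text runs
    else:
        out.append(text)


def clean_rtf(rtf):
    """Drop RSID control words and unwrap groups left with only plain text."""
    out = []
    for tok in TOKEN.findall(rtf):
        word = CONTROL_WORD.match(tok)
        if word:
            if word.group(1) not in RSID_WORDS:
                out.append(tok)
        elif tok == "}":
            if out and out[-1] == "{":
                out.pop()  # group became empty
            elif len(out) >= 2 and _is_text(out[-1]) and out[-2] == "{":
                text = out.pop()
                out.pop()
                _push_text(out, text)  # {text} carries no formatting
            else:
                out.append(tok)
        elif _is_text(tok):
            _push_text(out, tok)
        else:
            out.append(tok)
    return "".join(out)


if __name__ == "__main__":
    # Hypothetical fragment modelled on the example in the first post.
    sample = (r"{\insrsid1 2-Methyl-3-hydroxybutyryl-}{\insrsid8069086 coA}"
              r"{\insrsid8069086\charrsid1904895  dehydrogenase deficiency}")
    print(clean_rtf(sample))
    # prints: 2-Methyl-3-hydroxybutyryl-coA dehydrogenase deficiency
```

Groups that still carry real formatting are kept: {\b\insrsid5 bold} comes out as {\b bold}. If you try something like this on the 1200 documents, run it on copies first and compare a few results in Word before trusting it with the originals.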