#1
Removing redundant characters
Hi!
I'm trying to remove all redundant characters, like too many spaces or paragraph signs, from a Word doc. This is my code so far:

    Dim wdApp As Word.Application
    Dim wdDatei As Word.Document
    Try
        wdApp = New Word.Application
        wdDatei = wdApp.Documents.Open(txtSource.Text)
    Catch ex As Exception
        MessageBox.Show(ex.Message)
        Exit Sub
    End Try
    Try
        With wdDatei.Range.Find
            .Text = "^13^13"
            .Replacement.Text = "^p"
            .Forward = True
            .Wrap = Word.WdFindWrap.wdFindContinue
            .Format = False
            .MatchCase = False
            .MatchWholeWord = False
            .MatchWildcards = False
            .MatchSoundsLike = False
            .MatchAllWordForms = False
        End With
        wdDatei.Range.Find.Execute(Replace:=Word.WdReplace.wdReplaceAll)
    Catch ex As Exception
        MessageBox.Show(ex.Message)
    Finally
        wdDatei.SaveAs(txtTarget.Text)
        wdDatei.Close(Word.WdSaveOptions.wdDoNotSaveChanges)
        wdApp.Quit()
        wdDatei = Nothing
        wdApp = Nothing
    End Try

When it gets to the line .Text = "^13^13" it throws an exception. I'm using Word 2000 and the Microsoft Word 9.0 Object Library. Can anybody tell me what I'm doing wrong?

ralf
#2
G'day "ralf" ,
A repost from the Word PC-L and AusTechWriter mailing lists.

In trying to explain how Word's Find and Replace (FnR) wildcard mechanism works, I'll also present a practical solution to the multitude of problems encountered by the seemingly innocuous ^p^p to ^p, whose usual objective is to remove unnecessary blank lines. In doing so, we shall traverse the width of Word's pitfalls that never fail to trip up a traveller.

First up, the Word Help System has some excellent help on wildcards. It is a complete PITA to access, but you can find something. In Word 2k:

F1 - help
Answer Wizard | Index
Search on: wildcard

The second topic down is the master list of all FnR stuff. Select it. Pick the Wildcard Characters topic down that list. Now select the _type a wildcard_ hyperlink. Hooray. Print the damn thing. Use it as a guide from now on :-)

You have just found the first excellent Quick Reference in the help system. The very last two paragraphs are the key to what I am attempting here.

To replace a double para with a single para, we would think that Find ^p^p and replace with ^p would do the job, right? Well, not really. If you do it via VBA you find yourself stalling forever if your document is terminated by a blank paragraph, as you have to perform it iteratively until you get a Not .Found condition.

Why does it fail to replace the last paragraph mark? Well, you can't delete the last paragraph mark - ever. When you start a brand new virgin document and turn on View Formatting, that paragraph mark you see is the End Of Document paragraph mark. As the document exists and has a finite end point, that magic pilcrow (backwards P) has to appear. It is also the marker point in memory to place the nasty little objects we infest our nice clean ASCII text with: style definitions, table formatting, list templates, graphical objects, and the list goes on. See Alt + F11 F2 Enter for more information.

So, to get around the VBA problem, we simply pre-process the final paragraph.
If it is blank, just a para mark, then kill the second last character - which must be the penultimate paragraph mark. Manually, press Ctrl+End and use the backspace key as often as required.

The main problem with the simple FnR replace postulation is similar. If you just delete a para mark, you lose the style for that paragraph. So, we can get around this by ensuring it is always the trailing paragraph that gets deleted. It won't do the final blank paragraph in a document, but this is solved above.

First, we need to understand how the brackets work, and the help topic does that nicely. So let us put the guide to good use. (^p)^p means that we have marked the first para mark as our first 'text chunk'. If we use \1 in the replace string, it means to leave the first text chunk, the para mark with the holy styling applied, in place.

Unfortunately for us, we still haven't got there yet. We get an error: we can't use ^p if we are using wildcards. *******s. So we have to use ^013 instead.

Herein lies our next problem - paragraph marks that aren't! Oh yes kiddies, just because you see a pilcrow does not mean you are looking at a paragraph mark. Oh no. Not with Paste Special and even weirder applications handing in clipboard data streams without thought. Word dutifully displays a pilcrow when it encounters an ASCII 013, but the background machinery may not have resolved it into a paragraph object to be kept dynamically updated.

How do I know it is ASCII 013? Well, I cheat. I select the paragraph mark, or whatever character I need to know, and use VBA. Alt + F11 (VB Editor or the VBE). Ctrl+G (Immediate Window). Enter:

? ASCW(Selection)

I use ASCW() rather than ASC() because I want the full Unicode value. For ASCII characters the Unicode value is the same. Go ahead, work out the wildcards' ASCII numbers and write them on yer guide.

So, if we are going to use replace (^013)^013 with ^013 we have to make sure every ASCII 13 is a damn paragraph mark.
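For selections longer than one character, the Immediate Window trick can be wrapped in a small macro. This is only a sketch: the macro name ShowSelectionCodes is made up for illustration, and it simply prints the AscW value of each selected character so you can see which ones report 13.

```vba
' Hypothetical helper (the name is an assumption, not from the post):
' prints position and character code for every character in the selection.
Sub ShowSelectionCodes()
    Dim i As Long
    Dim txt As String
    Dim ch As String

    txt = Selection.Text
    For i = 1 To Len(txt)
        ch = Mid$(txt, i, 1)
        ' AscW gives the Unicode value; a paragraph mark shows up as 13
        Debug.Print i, AscW(ch)
    Next i
End Sub
```

Run it from the VBE with some text selected in the document and read the results in the Immediate Window (Ctrl+G).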
Without wildcards on, find ^013 and replace it with ^p. Honest paragraphs will see no change, fake paragraphs get converted to your will on the spot.

Now you can get serious and stick yer wildcard search on. Replace (^013)^013 with \1 and we're in the clear. Done.

In a similar fashion, the much simpler exercise of replacing a colon that occurs after a ket - a ) char - without destroying the ket itself, would be to use wildcards, and replace (^041)^058 with \1.

However, if we were searching for a bra, a ( character, we run into another peculiar little Word problem with managing RTF strings. If you insert a symbol from the Wingdings range, or many other non-Unicode graphical fonts, Word actually stores a marker there instead, and then stores the actual font character off beyond the end of section mark. That marker is ASCII 40, our unfortunate bra. So an ^040^058 sequence could very well be any damn symbol followed by a colon.

If we were using two blank paragraphs before every heading and no space before, to ensure our new pages always start at the very top no matter the method used to page break, and we wanted to get rid of scads of three or more blank paras in excess in a single hit (are we listening VBA people?), we could do something evil and wicked like this: find (^013{2,2})(^013)@ and replace it with \1. This leaves us with a maximum of two following blank paragraphs anywhere in the document, even at the end - in one single find operation.

Interestingly enough, for those still able to follow, (^013{2,2})^013{1,} fails with an invalid pattern. I forced it with the brackets for the above solution.

Which then brings us to the final solution for technical writers seeking to mass destroy all blank lines. It has taken a while, but boy haven't we learnt a lot of useless stuff about Word on the way.
Find (^013)(^013)@ and replace with \1 to kill all blank paras in the document in a single pass, with the exception of the first paragraph (there is no start of document paragraph mark to give us a two-in-a-row target) and the last paragraph mark (which is forbidden from the find range).

Steve Hudson
Word Heretic, Sydney, Australia
Tricky stuff with Word or words for you.
www.wordheretic.com
ABN: 86 453 419 554
"Qualified Good Tech Writer Dude"
Free Association of Words
Without prejudice

Steve Hudson - Word Heretic
steve from wordheretic.com (Email replies require payment)
Without prejudice
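The steps in the post above (pre-process the final paragraph, turn every ^013 into a real paragraph mark, then run the wildcard replace) could be strung together roughly as follows. Treat it as an untested sketch, not the author's own macro: the name RemoveBlankParagraphs and the guard checks in the first loop are assumptions added for illustration, and it works on ActiveDocument, so try it on a copy.

```vba
' Hypothetical macro name; a sketch of the three steps described in the post.
Sub RemoveBlankParagraphs()
    Dim doc As Document
    Dim rng As Range
    Dim n As Long

    Set doc = ActiveDocument

    ' Step 1: pre-process the end of the document.
    ' The final paragraph mark cannot be deleted, so while the last
    ' paragraph is blank, delete the character just before that mark.
    Do While doc.Paragraphs.Count > 1
        If Len(doc.Paragraphs.Last.Range.Text) > 1 Then Exit Do
        Set rng = doc.Range(doc.Content.End - 2, doc.Content.End - 1)
        ' Guard (assumption): only touch it if it really is a Chr(13)
        If rng.Text <> vbCr Then Exit Do
        n = doc.Paragraphs.Count
        rng.Delete
        ' Guard (assumption): stop if nothing was removed, to avoid looping
        If doc.Paragraphs.Count = n Then Exit Do
    Loop

    ' Step 2: without wildcards, replace ^013 with ^p so that every
    ' ASCII 13 becomes an honest paragraph mark.
    With doc.Content.Find
        .ClearFormatting
        .Replacement.ClearFormatting
        .Execute FindText:="^013", ReplaceWith:="^p", _
                 MatchWildcards:=False, Forward:=True, _
                 Wrap:=wdFindStop, Replace:=wdReplaceAll
    End With

    ' Step 3: with wildcards, keep the first mark of each run (\1)
    ' and drop the blank paragraphs that follow it.
    With doc.Content.Find
        .ClearFormatting
        .Replacement.ClearFormatting
        .Execute FindText:="(^013)(^013)@", ReplaceWith:="\1", _
                 MatchWildcards:=True, Forward:=True, _
                 Wrap:=wdFindStop, Replace:=wdReplaceAll
    End With
End Sub
```

Because step 3 is a single ReplaceAll, there is no loop waiting on a Not .Found condition, which is the stall described earlier in the post.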
#3
Thanks.
Another error was that I used Word 2000. When I tried it on Word 2002, I could at least replace a row of spaces with one single space and delete the Chr(11).ToString characters, and the application did not throw an exception, which it did under Word 2000. I'll try your proposal for the ^p^p.