#1
Posted to microsoft.public.word.docmanagement
Hi all,
Does anyone know if it's possible to make a list of all the unique words in a document without having to destroy all the punctuation and formatting first? I know you can make a concordance index, but you have to know all the words first for that. I'm an amateur Java programmer, so if you know Java, you know that we can use StringTokenizers and HashSets to do this for small strings. Is there a way to do that on a larger scale for a Word file that's a few hundred pages long? (I know it's a different programming language too; the Java was just an example.) Thanks!
#2
What's wrong with making a copy then destroying all the punctuation?
Quickest method I know is to use Find and Replace to delete all non-text and convert all white space to paragraph marks; then copy to Excel and do a unique filter. If you want to do it with VBA, iterate the Words collection, check whether the 'word' is text, and if so, add it to a Collection using it as both the key and the item. Since keys must be unique, adding a word that is already there fails with an error you can trap and ignore, so you end up with a unique list.
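The key-as-item trick Jezebel describes can be tried outside Word. What follows is a minimal sketch in Python rather than VBA, assuming the words have already been pulled out of the document as plain strings. A dict stands in for the VBA Collection, since its keys are unique too. The name `unique_words`, the sample list, and the choice to lower-case the key are all assumptions made for the illustration.

```python
def unique_words(words):
    """Return each alphabetic word once, in first-seen order."""
    seen = {}  # dict keys are unique, standing in for Collection keys
    for word in words:
        token = word.strip()
        # Skip 'words' that are not text (punctuation, numbers, blanks).
        if not token.isalpha():
            continue
        # The word serves as both key and item; the lower-cased key
        # (an assumption of this sketch) makes "The" and "the" one entry.
        seen.setdefault(token.lower(), token)
    return list(seen.values())


sample = ["The", "cat", ",", "the", "hat", ".", "42", "cat"]
print(unique_words(sample))  # ['The', 'cat', 'hat']
```

Unlike the VBA version, a duplicate here is silently ignored by `setdefault`, so there is no error to trap.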
#3
Hi Jezebel,
There isn't anything wrong with making a copy and destroying all the punctuation; I've done that before and it works well. I was just hoping there was a faster way, because it takes quite a while to go through all the possible punctuation marks and stuff. Is there a way to quickly replace anything nontext with a paragraph break?
#4
Use Edit | Replace and replace each full stop with a paragraph break. Do the same again for commas and any other punctuation marks in the document.
-- Rae Drysdale
#5
With 'Use wildcards' checked --
Find: [!a-zA-Z]
Replace: ^013
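For readers coming from a programming background, the same replace can be approximated with a regular expression. This is a rough Python sketch, assuming the document text is available as a plain string: Word's `[!a-zA-Z]` corresponds to `[^a-zA-Z]` in Python's `re` syntax, and `^013` to the character with code 13. The function name is hypothetical, and the sketch is an analogy, not a description of what Word runs internally.

```python
import re

PARAGRAPH_MARK = chr(13)  # the character Word writes as ^013


def non_letters_to_paragraph_marks(text):
    """Replace every character that is not a-z or A-Z with a paragraph mark."""
    return re.sub(r"[^a-zA-Z]", PARAGRAPH_MARK, text)


result = non_letters_to_paragraph_marks("Hi all, it's here.")
print(repr(result))  # 'Hi\rall\r\rit\rs\rhere\r'
```

Note that the apostrophe in "it's" is replaced along with everything else, and that adjacent non-letters leave empty paragraphs behind.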
#6
Thanks Jezebel, that works really well, but I notice it destroys hyphens and apostrophes too. Is there a way to do this while keeping the hyphens and apostrophes? And I'm just curious so I know later: what does the ^013 mean? Thanks!
#7
The find text is a wildcard pattern, Word's version of a regular expression. The exclamation mark means 'not' -- so the expression means 'match any character other than a-z, upper or lower case'. You can add any other characters you also want to exclude from the match, e.g. [!A-Za-z,-]. You have to put the hyphen last; otherwise it's interpreted as a range indicator. The caret means that the following digits are a decimal character number. 013 = paragraph mark.
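The point about hyphen placement carries over to ordinary regular expressions, which makes it easy to try out. The small Python sketch below is an analogy only, on the assumption that Python's `re` character classes behave like Word's in this respect (Python writes 'not' as `^` where Word uses `!`). The sample string is made up.

```python
import re

sample = "B5-a,"

# Hyphen last: a literal hyphen, so the class is comma, 'a' and hyphen.
print(re.findall(r"[,a-]", sample))  # ['-', 'a', ',']

# Hyphen in the middle: a range from ',' to 'a' in character-code order,
# which also takes in digits and capital letters.
print(re.findall(r"[,-a]", sample))  # ['B', '5', '-', 'a', ',']

# The digits after Word's caret are a decimal character code.
print(chr(13) == "\r")  # True
```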
#8
A good reference for wildcards in Find and Replace is at http://www.gmayor.com/replace_using_wildcards.htm.

The ^013 is the code for a paragraph mark (technically, the ASCII character with the numeric value 13, which is a carriage return in plain text). The code ^p could also be used for a paragraph mark, but only in the Replace With box (for some reason only ^013 works in the Find What box). In fact, if you use ^013 in the Replace With box, the Table Sort command in Word won't recognize the "paragraph marks" and will claim there are no valid records (paragraphs) in the text to be sorted. They'll work OK when you copy the text into Excel, though.

To make the Replace leave apostrophes and hyphens in place, use the search expression
[!a-zA-Z'-]
This expression translates to "find all characters that are not in the ranges a through z or A through Z, and are not an apostrophe or a hyphen".

--
Regards,
Jay Freedman
Microsoft Word MVP
FAQ: http://word.mvps.org
Email cannot be acknowledged; please post all follow-ups to the newsgroup so all may benefit.

On Sat, 22 Apr 2006 17:33:01 -0700, jezzica85 wrote:
Thanks Jezebel, that works really well, but I notice it destroys hyphens and apostrophes too, is there a way to do this keeping the hyphens and apostrophes? And I'm just curious so I know later, what does the ^013 mean? Thanks!
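Putting the thread's steps together (replace, then filter for unique entries), the whole pipeline might look like the following in Python. This is a sketch under stated assumptions, not a Word macro. It takes the text as a plain string, treats Jay's `[!a-zA-Z'-]` as the regex `[^a-zA-Z'-]`, and uses a set in place of the Excel unique filter. The name `unique_word_list` is hypothetical, and lower-casing before comparing is a choice of this sketch that the thread does not specify.

```python
import re


def unique_word_list(text):
    """Return the sorted unique words, keeping apostrophes and hyphens."""
    # Everything except letters, apostrophes and hyphens becomes chr(13).
    marked = re.sub(r"[^a-zA-Z'-]", chr(13), text)
    # Split on the paragraph marks and drop the empty pieces.
    words = [w for w in marked.split(chr(13)) if w]
    # A set keeps one copy of each word, like a unique filter.
    return sorted({w.lower() for w in words})


print(unique_word_list("It's a well-known cat; the cat isn't new."))
# ['a', 'cat', "isn't", "it's", 'new', 'the', 'well-known']
```

Only the straight apostrophe is kept here. If the document uses curly apostrophes, that character would need to be added to the class as well.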
#9
Thank you Jezebel and Jay, this was really helpful.
jezzica85