#1
Posted to microsoft.public.word.docmanagement
jezzica85
Finding unique words

Hi all,
Does anyone know if it's possible to make a list of all the unique words in
a document without having to destroy all the punctuation and formatting
first? I know you can make a concordance index, but you have to know all the
words first for that. I'm an amateur Java programmer, and in Java you can use
a StringTokenizer and a HashSet to do this for small strings. Is there a way
to do the same thing on a larger scale, for a Word file that's a few hundred
pages long? (I know Word uses a different programming language; the Java was
just an example.)
Thanks!
#2
Jezebel

What's wrong with making a copy then destroying all the punctuation?
Quickest method I know is to use Find and Replace to delete all non-text and
convert all white space to paragraph marks; then copy to Excel and do a
unique filter.

If you want to do it with VBA, iterate the Words collection, check whether
the 'word' is text, and if so, add it to a Collection using it as both the
key and the item. Since keys must be unique, adding a word a second time
raises an error that you can trap and ignore, so you end up with a unique
list.
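
The keyed-collection idea can be sketched outside Word. What follows is a minimal Python illustration of the logic only; it is not VBA and does not touch Word's object model. The function name `unique_words` and the sample tokens are made up for the example, and lowercasing the key is an assumption about how you want case handled.

```python
def unique_words(tokens):
    """Return the distinct alphabetic tokens in first-seen order."""
    seen = {}  # dict keys are unique, much like the keys of a keyed collection
    for token in tokens:
        # Assumption: tokens may carry trailing spaces, so trim them first.
        word = token.strip()
        # Skip anything that is not purely letters (punctuation, numbers, blanks).
        if not word.isalpha():
            continue
        # Assumption: treat "The" and "the" as the same word.
        key = word.lower()
        if key not in seen:
            seen[key] = word  # key enforces uniqueness, first spelling is the item
    return list(seen.values())


if __name__ == "__main__":
    sample = ["The ", "cat", ", ", "the ", "hat", ".", "Cat "]
    print(unique_words(sample))  # ['The', 'cat', 'hat']
```

Note that `isalpha()` also rejects tokens such as "don't" or "well-known", so this strict version drops words containing apostrophes or hyphens.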

#3
jezzica85

Hi Jezebel,
There isn't anything wrong with making a copy and destroying all the
punctuation; I've done that before and it works well. I was just hoping
there was a faster way, because it takes quite a while to go through all the
possible punctuation marks and stuff. Is there a way to quickly replace
anything non-text with a paragraph break?

#4
Rae Drysdale

Use Edit | Replace and replace each full stop with a paragraph break. Do the
same for commas and any other punctuation marks in the document.
--
Rae Drysdale

#5
Jezebel

With 'Use wildcards' checked --

Find: [!a-zA-Z]
Replace: ^013

#6
jezzica85

Thanks Jezebel, that works really well, but I notice it destroys hyphens and
apostrophes too. Is there a way to do this while keeping the hyphens and
apostrophes? And just so I know for later, what does the ^013 mean?
Thanks!

#7
Jezebel

The find text is a wildcard pattern, Word's version of a regular expression.
The exclamation mark means 'not', so the expression means 'match any
character other than a-z, upper or lower case'. You can add any other
characters you also want to exclude from the match (that is, keep in the
text), e.g. [!A-Za-z,-]. You have to put the hyphen last, otherwise it's
interpreted as a range indicator.

The caret means that the following digits are a decimal character number.
013 = paragraph mark.
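
For anyone who knows regular expressions from a programming language, here is a rough Python analogue of the patterns [!a-zA-Z] and [!A-Za-z,-]. It is only a sketch: Python's `re` module writes the negation as `^` rather than `!`, and `"\r"` stands in for the character 13 that ^013 inserts. The function name `break_on` and the sample text are invented for the illustration.

```python
import re

# Python analogue of Word's [!a-zA-Z]: any single character that is not a letter.
NOT_LETTER = re.compile(r"[^a-zA-Z]")

# Analogue of [!A-Za-z,-]: the hyphen comes last, so it is a literal hyphen
# rather than a range indicator, and commas and hyphens are left alone.
NOT_LETTER_COMMA_HYPHEN = re.compile(r"[^A-Za-z,-]")


def break_on(pattern, text):
    """Replace every character the pattern matches with character 13."""
    return pattern.sub("\r", text)


if __name__ == "__main__":
    text = "well-known, yes"
    print(repr(break_on(NOT_LETTER, text)))               # 'well\rknown\r\ryes'
    print(repr(break_on(NOT_LETTER_COMMA_HYPHEN, text)))  # 'well-known,\ryes'
```

The first pattern breaks on the hyphen, the comma and the space; the second breaks only on the space, because the comma and hyphen are listed inside the brackets.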

#8
Jay Freedman

A good reference for wildcards in Find and Replace is at
http://www.gmayor.com/replace_using_wildcards.htm.

The ^013 is the code for a paragraph mark (technically, the ASCII
character with the numeric value 13, which is a carriage return in
plain text).

The code ^p could also be used for a paragraph mark, but only in the
Replace With box (for some reason only ^013 works in the Find What
box when 'Use wildcards' is checked). In fact, ^p is the safer choice
there: if you use ^013 in the Replace With box, the Table Sort command
in Word won't recognize the "paragraph marks" and will claim there are
no valid records (paragraphs) in the text to be sorted. They'll work OK
when you copy the text into Excel, though.

To make the Replace leave apostrophes and hyphens in place, use the
search expression

[!a-zA-Z'-]

This expression translates to "find all characters that are not in the
ranges a through z or A through Z, and are not an apostrophe or a
hyphen".
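
Put together, the whole procedure (break the text on non-word characters, then keep one copy of each word) looks roughly like this in Python. This is a sketch of the logic rather than anything Word runs: the function name `unique_word_list` is hypothetical, it expects the document text as a plain string, and folding case plus sorting are assumptions standing in for Excel's unique filter.

```python
import re

# Python spelling of [!a-zA-Z'-]: a run of characters that are not a letter,
# an apostrophe, or a hyphen (hyphen last, so it is literal).
SEPARATORS = re.compile(r"[^a-zA-Z'-]+")


def unique_word_list(text):
    """Split text on separator runs and return the distinct words, sorted."""
    words = set()
    for piece in SEPARATORS.split(text):
        # Splitting can leave empty strings at the ends; skip them.
        if piece:
            # Assumption: fold case so "Don't" and "don't" count once.
            words.add(piece.lower())
    return sorted(words)


if __name__ == "__main__":
    sample = "Don't stop: the well-known cat, the CAT, don't!"
    print(unique_word_list(sample))
    # ['cat', "don't", 'stop', 'the', 'well-known']
```

A lone hyphen or apostrophe surrounded by spaces would survive as a "word" in this sketch, and a curly apostrophe is not in the bracketed set, so real text may need a little extra cleanup.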

--
Regards,
Jay Freedman
Microsoft Word MVP FAQ: http://word.mvps.org
Email cannot be acknowledged; please post all follow-ups to the
newsgroup so all may benefit.

On Sat, 22 Apr 2006 17:33:01 -0700, jezzica85
wrote:

Thanks Jezebel, that works really well, but I notice it destroys hyphens and
apostrophes too, is there a way to do this keeping the hyphens and
apostrophes? And I'm just curious so I know later, what does the ^013 mean?
Thanks!

#9
jezzica85

Thank you Jezebel and Jay, this was really helpful.
jezzica85