Greg Maxey
 

Jay,

Well done!! My stab at RTF was literally pulled from the nether region.

--
Greg Maxey/Word MVP
A Peer in Peer to Peer Support

Jay Freedman wrote:
Hi Jack,

I'll try to follow up Greg's musings and shed what light I can...

In the macro, the line ".Delete" removes the frame and leaves the
text. That should give you what you need -- plain text -- but there
may be a wrinkle, which I'll explain after a bit.

A frame and a textbox are similar in some ways, but the big difference
is that a textbox is in the "drawing layer" while the frame is in the
"text layer". That is, Word thinks of the textbox as a sort of
picture, while a frame is more like special formatting of text. You
can include a frame as part of a paragraph style, which you can't do
with a textbox. The ability to transform a textbox into a frame is
truly magical and involves some very fancy programming inside Word.

RTF is actually "Rich Text Format", and it's a way to use a file of
plain text to describe all sorts of formatting. If you open an RTF
file in Notepad, you'll see a ton of codes in braces that describe
fonts, page locations, and lots of other things. When you tell Word to
open an RTF file, a special converter program reads all those codes
and applies the formatting to the text part, resulting in what looks
like a regular Word document. You can then save that as a .doc file,
whose structure is completely different.
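
As a minimal sketch of that last step, assuming the RTF file is the
active document in Word: the macro below saves a copy in Word's own
.doc format next to the original. The macro name and the way the new
file name is built are illustrative only. Renaming result.rtf to
result.doc by hand changes only the name; the contents stay RTF.

```vba
Sub SaveRtfAsDoc()
    ' Hypothetical helper, not part of the macro posted in this thread.
    ' Assumes the active document was opened from an .rtf file.
    Dim sName As String
    sName = ActiveDocument.FullName
    ' Swap a trailing ".rtf" for ".doc"; otherwise just append ".doc"
    If LCase$(Right$(sName, 4)) = ".rtf" Then
        sName = Left$(sName, Len(sName) - 4) & ".doc"
    Else
        sName = sName & ".doc"
    End If
    ' wdFormatDocument asks Word to write its native .doc structure
    ActiveDocument.SaveAs FileName:=sName, FileFormat:=wdFormatDocument
End Sub
```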

When you scan a document, the initial result is just a picture of the
page. Many scanners will let you save that as a graphics file (usually
.tif or .jpg). You feed that picture into an optical character
recognition (OCR) program, which may be part of the scanner software
or may be a separately installed program. The output of the OCR is
text.

In the early days, you were doing well to get just a plain-text
reading of the document, with headers and footers and pictures all
jammed in there. As OCR programmers got better, they started offering
output of a word processing file that looked exactly (well, more or
less) like the original, with the proper fonts, bold/italic, headers,
and so forth. In order to get the stuff positioned correctly on the
page, they resorted to textboxes -- but that's really hard to deal
with when you want to edit the document.

Now the wrinkle... Every graphic object in Word's drawing layer has an
"anchor", a spot in the regular text to which it's attached. (You can
see the anchor symbol in the left margin of Page Layout view if you go
to Tools > Options > View and check "Object anchors", then select a
textbox or floating picture.) When you convert the textbox to a frame
and then delete the frame, the text inside gets dumped into the
regular text at the anchor position.

Many OCR programs put a single paragraph mark on a page, and anchor
all the textboxes on the page to that paragraph. When you run the
macro, the various chunks of text appear in the order in which their
anchors occurred in the original paragraph, which will probably be
more-or-less random. You're then left to untangle the spaghetti. :-(
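
To see how the anchors are arranged before running the conversion, a
small read-only macro along these lines may help. It is only a sketch:
the macro name is made up, and it simply prints each textbox's name,
the character position of its anchor, and the first 40 characters of
its text to the Immediate window of the VBA editor (Ctrl+G).

```vba
Sub ListTextboxAnchors()
    ' Hypothetical diagnostic helper, not part of the posted macro.
    ' It changes nothing in the document.
    Dim oShp As Shape
    Dim sText As String
    For Each oShp In ActiveDocument.Shapes
        If oShp.Type = msoTextBox Then
            sText = oShp.TextFrame.TextRange.Text
            ' Anchor is a Range; Start is its character offset in the body text
            Debug.Print oShp.Name, oShp.Anchor.Start, Left$(sText, 40)
        End If
    Next oShp
End Sub
```

If many textboxes report nearly the same anchor position, the text is
likely to come out jumbled after conversion.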

This is why Graham's suggestion to output the scan (from the OCR
program) as plain text is a good one. You may lose the "looks just
like the original" formatting, but you'll also never create the
textboxes. This should make your editing job a whole lot simpler. Look
through the OCR program and its help file to find out where you can
turn off formatted output.


Jack,

I am afraid that my usefulness to you has about run its course :-)

The code looks at all shapes. If a shape is a textbox, it converts it
to a frame, removes any borders and fill effects from the frame, and
then deletes the frame, leaving the text. I found through
experimentation that if I just deleted the frames, any border and
fill effects in the frame would be transferred to the text
paragraphs.

I will have to defer to others as to the technical difference
between a frame and textbox.

RTF is, I think, "Raw Text Format." I have never monkeyed around
very much with different types of text, but why don't you just try
saving your RTF file as a Word .doc and see what happens :-)

I have a hard time figuring out the workings of a simple screw, so I
can't be of much help with the workings of your scanner. Sorry.

--
Greg Maxey/Word MVP
A Peer in Peer to Peer Support

Jack Sons wrote:
Greg,

Thank you for your macro, it worked.

How did it work? I think it converts each textbox to a frame without
(visible) borders. What is the essential difference between a textbox
and a frame? And what is done with the frames? I can't find them in
the result. To me it looks like a normal document, without any
objects, just characters as it should be.

Would the result of using "the plain text output of the software",
as Graham advised (if I knew how to do that), give a different
result?

Before and after the use of the macro the resulting document is an
RTF file (result.rtf). What does that extension imply? Can I
rename it as a .doc file (result.doc) without repercussions?

Last question (for now): why does the scanning process result in
textbox output instead of "normal text"?

Jack.


"Greg Maxey" gro.spvm@yexamg (that's my e-mail address backwards)
wrote in message ...
Jack Sons,

Yes Graham is probably right. I cobbled together the following
which first converts textboxes to frames and then removes the
frame.

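Sub ScratchMacro()
    ' Convert textbox text to plain text
    Dim i As Long
    ' Step 1: convert every textbox to a frame.
    ' Loop backward by index because ConvertToFrame removes the shape
    ' from the Shapes collection, which can make For Each skip items.
    For i = ActiveDocument.Shapes.Count To 1 Step -1
        If ActiveDocument.Shapes(i).Type = msoTextBox Then
            ActiveDocument.Shapes(i).ConvertToFrame
        End If
    Next i
    ' Step 2: strip borders and shading so they are not transferred
    ' to the text, then delete each frame, leaving its text behind.
    For i = ActiveDocument.Frames.Count To 1 Step -1
        With ActiveDocument.Frames(i)
            .Borders.Enable = False
            With .Shading
                .Texture = wdTextureNone
                .ForegroundPatternColor = wdColorAutomatic
                .BackgroundPatternColor = wdColorAutomatic
            End With
            .Delete
        End With
    Next i
End Sub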

--
Greg Maxey/Word MVP
A Peer in Peer to Peer Support

Jack Sons wrote:
Hi all,

I scanned a document of many pages. The result (an RTF file) looks
fine, but in reality the text I see is not "text in a document"
but text in textboxes.

I really need to convert this to text "directly in the document",
like in any "normal" document. I mean that it should be as if I had
typed it directly into the document.

Of course I could select (highlight) the text in the first textbox
and then paste it into a new document (a .doc file), do the same with
the text of the next textbox, paste it below the first text in the
new document, etc. I tried, and did it for a lot of textboxes, but it
will be very tedious to do it for the whole document because of the
many hundreds - maybe thousands - of textboxes, some of which contain
only a single line of text.

There is also a strange effect: when I try to copy and paste
(Ctrl+C, Ctrl+V) the highlighted text of a textbox into the other
document, suddenly it is not the text that is copied to the new
document but the whole textbox, so it just moves the problem from
one document to the other.

Can anyone show me a way out? Perhaps with VBA it will be possible
to convert all textboxes at once to normal text.

I am in very urgent need of advice. Please help.

Jack Sons
The Netherlands