Posted to microsoft.public.word.docmanagement
Peter Jamieson
.docx files have XML components, but what's their use?

.docx and .doc files (at least since about Word 6) have a more similar
structure than many people probably realise - even in .doc, which uses
OLE Compound Files, the content is divided into different "streams"
which can be opened separately.

That said, .docx does have considerable advantages, including:
a. the ZIP file structure itself is a de facto standard - I don't
personally have any ZIP utilities for recovering "unopenable" ZIP files,
but I expect there are many. I don't think you will find as many
utilities that know how to recover the content of a corrupted OLE
Compound File.
b. each file within the ZIP is almost certainly going to be an XML
text file such as "document.xml", or a single binary object such as a
.jpg. If the ZIP is damaged, but you can still open it and get the
document.xml, you have already achieved quite a lot. Even if the ZIP is
damaged to the extent that you cannot open it, a recovery utility has a
much better chance of identifying the component files when it knows that
they are either XML or - in some cases at least - well-known types of
binary object such as .jpg. In contrast, in a .doc, the equivalent of
document.xml is a complex binary structure. It isn't even a simple
stream of text with markup. You have to have a utility that knows
precisely how to look through that binary representation in order to
extract anything at all. Although MS has now published the .doc standard
(it appears to be a work in progress), I suspect not many people will
want to spend resource developing new recovery software for obsolescent
formats.
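
To illustrate point b, here is a rough Python sketch of the kind of
thing a recovery utility could do when the ZIP will not open at all:
scan the raw bytes for ZIP "local file header" signatures and try to
decompress whatever follows each one. The function name salvage_members
is my own invention for this example, and the code assumes the members
are either stored or deflated and that the damage has not hit the
headers themselves. It is not a real recovery tool, just a sketch.

```python
import struct
import zlib

# Every member of a ZIP archive starts with this 4-byte signature.
LOCAL_HEADER_SIG = b"PK\x03\x04"
# Fixed 30-byte part of the local file header (little-endian).
LOCAL_HEADER = struct.Struct("<4sHHHHHIIIHH")


def salvage_members(raw):
    """Scan raw bytes for local file headers; return {name: data}.

    Hypothetical helper: it ignores the central directory entirely,
    so it can still work when the end of the archive is missing.
    """
    recovered = {}
    pos = raw.find(LOCAL_HEADER_SIG)
    while pos != -1:
        next_pos = raw.find(LOCAL_HEADER_SIG, pos + 4)
        if pos + LOCAL_HEADER.size <= len(raw):
            (_sig, _ver, _flags, method, _time, _date, _crc,
             comp_size, _size, name_len, extra_len) = \
                LOCAL_HEADER.unpack_from(raw, pos)
            name_start = pos + LOCAL_HEADER.size
            data_start = name_start + name_len + extra_len
            name = raw[name_start:name_start + name_len].decode(
                "utf-8", "replace")
            try:
                if method == 8:
                    # Deflate: the decompressor stops by itself at the
                    # end of the stream, so the size field is not needed.
                    data = zlib.decompressobj(-15).decompress(
                        raw[data_start:])
                elif method == 0:
                    # Stored: rely on the size recorded in the header.
                    data = raw[data_start:data_start + comp_size]
                else:
                    data = None  # other methods are out of scope here
            except zlib.error:
                data = None  # this member's data is too damaged
            if data is not None:
                recovered[name] = data
        pos = next_pos
    return recovered
```

Given the bytes of a damaged .docx, salvage_members(raw) may still
return an entry for "word/document.xml" even when a normal ZIP library
refuses to open the file. Because that entry is XML text, it is easy to
check whether what came back is plausible. The signature bytes can also
turn up by chance inside compressed data, so a real utility would need
to validate each hit more carefully than this does.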

Peter Jamieson

http://tips.pjmsn.me.uk

On 05/05/2010 09:10, Ghitorni wrote:
I read that if any corruption occurs, there is only a slim chance of
recovering 2003-version files. In 2007 you can recover almost fully
because the actual file is in ZIP format and contains many XML files.
But the "file" as such, .docx, is a single file (until unzipped and
extracted). So how can the file survive corruption? Even in a ZIP-format
file, if a small chunk is gone, you can never open it. Could anyone shed
some light on this? Thanks