#1
MS Word to XHTML
Is there any macro / other tool - free or commercial - that can split long Word docs into multiple XHTML pages? Any comments on the quality/effectiveness of suitable products also welcomed.
#2
__/ [Caversham] on Sunday 11 September 2005 06:02 \__
> Is there any macro / other tool - free or commercial - that can split long Word docs into multiple XHTML pages? Any comments on the quality/effectiveness of suitable products also welcomed.

I would advise you to do the following:

* Download OpenOffice.org 2 beta (openoffice.org)
* Install it on your Windows machine
* Open the Word document in OpenOffice.org
* Save or export as HTML
* Fragment the output as required, probably by hand (WYSIWYG programs like Word have no notion of structure or semantics)
* Run HTML Tidy on the resulting HTML (find it on SourceForge)
* Modify the output to fit XHTML standards
* Use search & replace for the task above
* Lastly, make sure your code validates (W3C validator)

Good luck,

Roy

--
Roy S. Schestowitz | "Slashdot is standard-compliant... in Japan"
http://Schestowitz.com | SuSE Linux | PGP-Key: 74572E8E
7:40am up 17 days 6:08, 3 users, load average: 2.10, 2.08, 1.85
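The "fragment the output" step need not be done entirely by hand if the exported HTML has real headings. A minimal sketch, assuming the body markup has already been cleaned up (for example by HTML Tidy) and that every new page should start at an `h1`; the function names and the page template are made up for illustration, and the regex-based split is a shortcut, not a proper HTML parser:

```python
import re

# Hypothetical page wrapper; swap in whatever DOCTYPE you validate against.
PAGE_TEMPLATE = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>{title}</title></head>
<body>
{body}
</body>
</html>
"""


def split_on_headings(body_html, tag="h1"):
    """Split body markup into fragments, each starting at a <tag> heading."""
    # Zero-width lookahead: the heading stays with the content that follows it.
    boundary = re.compile(rf"(?=<{tag}\b)", re.IGNORECASE)
    fragments = (part.strip() for part in boundary.split(body_html))
    return [part for part in fragments if part]


def page_title(fragment, tag="h1", default="Untitled"):
    """Use the text of the fragment's first heading as the page title."""
    match = re.search(rf"<{tag}\b[^>]*>(.*?)</{tag}>", fragment,
                      re.IGNORECASE | re.DOTALL)
    if not match:
        return default
    # Drop inline markup such as <em> from the heading text.
    return re.sub(r"<[^>]+>", "", match.group(1)).strip()


def build_pages(body_html, tag="h1"):
    """Return complete XHTML page strings, one per fragment."""
    return [
        PAGE_TEMPLATE.format(title=page_title(frag, tag), body=frag)
        for frag in split_on_headings(body_html, tag)
    ]
```

Anything before the first heading becomes its own "Untitled" page. Each generated page would still need to go through the validator, as in the last step of the list.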
#3
On Sun, 11 Sep 2005, Roy Schestowitz wrote (seen on alt.html):

> [...]
> * Fragment the output as required, probably by hand (WYSIWYG programs like Word have no notion of structure or semantics)

This isn't by any means aimed at you personally, but your posting triggered a response from me, and it looks as if knowledge is proceeding backwards.

Proper use of MS Word uses Styles, oriented towards the structure of the document. (If I had my way, I'd rip the direct styling buttons out of the main menu of Word, and hide them away in an Advanced Users menu). Such properly-made Word documents are reasonably capable of being converted well to structural HTML, and a stylesheet suitable for web use can then be applied (it usually won't be the same "style sheet" (= style template) as would be suitable for a printed Word document, of course!).

I had some experience, around 1997-8, with the (payware) rtftohtml program - subsequently renamed and marketed under the company name Logictran - it had this pretty-much sorted out. I must admit I haven't got experience of it since the change of name, but I can say that the principles of the original program seemed to be what I was looking for, unlike most of the other pseudo-WYSIWYG garbage from other places (that offended all sense of what is suitable for the WWW).

With that rtftohtml program, decently structured Word could be turned into decently structured HTML, and split on chapter or section headings quite automatically, with HTML indexes and table of contents generated automatically. OK, there were some rough edges, but at least the principles showed up just fine. I find it sad that some 7 years later we seem to have fallen back to the stone age of direct styling and pseudo-WYSIWYG in most of the Word conversions that I have seen.

[Note - there are other programs called rtftohtml or rtf2html - it may be that some of them do a similar job, I can't speak for or against them, I'm just commenting as a reasonably satisfied user of version 4 of this particular program from around 1998 onwards.]
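To illustrate the principle (this is not how rtftohtml itself works, which the post does not describe): once each paragraph carries a structural style name, mapping styles to HTML elements and generating a table of contents is mostly bookkeeping. A minimal sketch, assuming the document has already been reduced to `(style, text)` pairs; that input shape, the `sec` anchor ids and the function name are hypothetical, while "Heading 1", "Heading 2" and "Normal" are Word's built-in style names:

```python
from html import escape

# Assumed mapping from Word style names to HTML elements.
STYLE_TO_TAG = {"Heading 1": "h1", "Heading 2": "h2", "Normal": "p"}
HEADING_TAGS = {"h1", "h2"}


def to_html(paragraphs):
    """Render (style, text) pairs; return (toc_html, body_html)."""
    toc, body = [], []
    for index, (style, text) in enumerate(paragraphs):
        # Unknown styles fall back to a plain paragraph.
        tag = STYLE_TO_TAG.get(style, "p")
        if tag in HEADING_TAGS:
            anchor = f"sec{index}"  # hypothetical id scheme
            toc.append(
                f'<li class="{tag}"><a href="#{anchor}">{escape(text)}</a></li>'
            )
            body.append(f'<{tag} id="{anchor}">{escape(text)}</{tag}>')
        else:
            body.append(f"<{tag}>{escape(text)}</{tag}>")
    return "<ul>\n" + "\n".join(toc) + "\n</ul>", "\n".join(body)
```

The same heading entries that feed the table of contents are also natural split points for multi-page output. None of this works if headings are just "Normal" paragraphs made big and bold.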
#4
Alan J. Flavell wrote:
> [previous post quoted in full]

Word XP and upwards stores its documents in XML format, doesn't it? You could probably write your own XSLT to turn it into HTML fairly easily.

--
x theSpaceGirl (miranda)
# lead designer @ http://www.dhnewmedia.com #
# remove NO SPAM to email, or use form on website #
# this post (c) Miranda Thomas 2005
# explicitly no permission given to Forum4Designers
# to duplicate this post.
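A rough idea of what such a transform involves, written in Python with ElementTree rather than XSLT to keep the example self-contained. It is only a sketch under assumptions: the namespace URI and the `w:pStyle` / `w:t` element names are, to my knowledge, those of the Word 2003 XML format, the sample document is made up, and the style ids `Heading1` and `Heading2` are assumed to be in use:

```python
import xml.etree.ElementTree as ET
from html import escape

# Namespace of Word 2003 XML as far as I know; treat it as an assumption.
W = "http://schemas.microsoft.com/office/word/2003/wordml"
NS = {"w": W}

# Made-up sample document, heavily simplified.
SAMPLE = f"""<w:wordDocument xmlns:w="{W}">
  <w:body>
    <w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr><w:r><w:t>Chapter One</w:t></w:r></w:p>
    <w:p><w:r><w:t>Plain </w:t></w:r><w:r><w:t>paragraph.</w:t></w:r></w:p>
  </w:body>
</w:wordDocument>"""

STYLE_TO_TAG = {"Heading1": "h1", "Heading2": "h2"}


def wordml_to_html(xml_text):
    """Map each w:p to an HTML element chosen by its paragraph style."""
    root = ET.fromstring(xml_text)
    out = []
    for para in root.iterfind(".//w:p", NS):
        style = para.find("w:pPr/w:pStyle", NS)
        style_id = style.get(f"{{{W}}}val") if style is not None else None
        # Paragraphs without a known style become plain <p>.
        tag = STYLE_TO_TAG.get(style_id, "p")
        # A paragraph's text is spread over one or more runs (w:r/w:t).
        text = "".join(t.text or "" for t in para.iterfind(".//w:t", NS))
        out.append(f"<{tag}>{escape(text)}</{tag}>")
    return "\n".join(out)


print(wordml_to_html(SAMPLE))
```

Note that the mapping keys off the paragraph style. A heading that was produced with direct formatting has no `w:pStyle` to key off, so it would come out as an ordinary `<p>`.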
#5
On Sun, 11 Sep 2005, SpaceGirl wrote:
> Alan J. Flavell wrote:

[comprehensive quote of my posting, without apparently having anything relevant to say about it.]

> Word XP and upwards stores its documents in XML format doesn't it?

So what? XML is only a format for defining markup. If the markup doesn't do anything meaningful (specifically - if it only creates a visual result on a printed page, without having any significant structure) then it's not going to turn into effective HTML: it'd just be the usual garbage in / garbage out that we're accustomed to with Word conversions to soi-disant "web" format.

> You could probably write your own XSLT to turn it into HTML fairly easily.

There seems to be some kind of conceptual disconnect here. Most Word documents (in my experience) simply don't contain the necessary structure for useful conversion to HTML: they've been created as a purely visual construction for printing onto paper. It's irrelevant what underlying technology you use (RTF, XML, SGML, whatever) - the problem is that the source material simply does not represent the needed structures, *because the document authors do not put it there*.

You might as well try to convert cheese into fresh cream: both are fine milk products, it's true, but instead of trying to convert the one into the other, you'd do better to produce them both starting from fresh milk. And the kind of "fresh milk" that's needed here is logically structured text markup. Not visual formatting.

Until the authors of Word documents can grasp that, the prospects for conversion of Word to web formats are poor, IMHO.
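The garbage in / garbage out point can be checked before any conversion is attempted. A minimal sketch, assuming each paragraph has been reduced to a pair of (style name, set of direct-formatting properties); that data model and the property labels are hypothetical, and the style names are some of Word's built-in ones:

```python
# Styles that say what a paragraph *is*, not what it looks like.
STRUCTURAL_STYLES = {"Heading 1", "Heading 2", "Heading 3", "List Bullet", "Quote"}


def audit(paragraphs):
    """Count paragraphs with structural styles vs. purely visual formatting."""
    structural = sum(1 for style, _ in paragraphs if style in STRUCTURAL_STYLES)
    # No structural style but some direct formatting: a converter can only
    # guess what the author meant by it.
    visual_only = sum(
        1 for style, direct in paragraphs
        if style not in STRUCTURAL_STYLES and direct
    )
    return {
        "structural": structural,
        "visual_only": visual_only,
        "total": len(paragraphs),
    }


# Made-up document: a fake "heading" done with bold and a larger font,
# body text, one real heading, and an italic paragraph.
doc = [
    ("Normal", {"bold", "16pt"}),
    ("Normal", set()),
    ("Heading 1", set()),
    ("Normal", {"italic"}),
]
print(audit(doc))
```

A document where `visual_only` dominates is the cheese in the analogy above: whichever underlying format it is stored in, the structure a converter needs is not there to be extracted.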
#6
__/ [Alan J. Flavell] on Sunday 11 September 2005 11:19 \__
> [...]
> And the kind of "fresh milk" that's needed here is logically structured text markup. Not visual formatting.
>
> Until the authors of Word documents can grasp that, the prospects for conversion of Word to web formats are poor, IMHO.

I fully agree with you on that point. Any attempt at rephrasing the same ideas would only dilute them.

To suggest ways forward, I suggest that the OP, who clearly wants to publish material on the Web, learn LaTeX. Should the idea of editing raw text become daunting, I suggest LyX

lyx.org [LyX: Front-end to LaTeX].

5 minutes with LyX would help anyone realise the difference and convey the idea, e.g. varying outputs, styles, imposition of structure, etc. Only a few days ago, somebody in the LyX mailing lists mentioned his upcoming presentation on "Word: What you See Is What a Mess".

The presentation I deliver on Wednesday is well-formed XHTML

http://schestowitz.com/Weblog/archiv...blic-speaking/

and is powered by Eric Meyer's S5.

Roy

--
Roy S. Schestowitz | "Software sucks. Open Source sucks less."
http://Schestowitz.com | SuSE Linux | PGP-Key: 74572E8E
1:45pm up 17 days 12:13, 3 users, load average: 0.51, 0.58, 0.70
#7
Hi,
At 12:19:53 on Sunday 11 September 2005 AD, in the groups {microsoft.public.word.vba.general,microsoft.public.word.docmanagement,alt.html,comp.text.xml}, Alan J. Flavell wrote:

> [...]
> And the kind of "fresh milk" that's needed here is logically structured text markup. Not visual formatting.
>
> Until the authors of Word documents can grasp that, the prospects for conversion of Word to web formats are poor, IMHO.

I warmheartedly applaud your brilliant analysis. You stated your point very clearly.

It's depressing to see what a tiny percentage of people realize (or bother with) the importance of structural markup. The future does not look bright.

I have seen so-called 'IT-classes' where they make innocent people believe they are IT experts when they can change the background color of characters typed in Word...

regards,

--
Joris Gillis (http://users.telenet.be/root-jg/me.html)
Spread the wiki (http://www.wikipedia.org)
#8
On Sun, 11 Sep 2005, Roy Schestowitz wrote:
> To suggest ways forward, I suggest that the OP, who clearly wants to publish material on the Web, learns LaTeX.

Well, this drifts somewhat off the topic of some of the crossposted groups, but our physicists are accustomed to writing their publications in some form of LaTeX, and I can say that when I was handling the web-ifying of their publications, several years back, I was (for the most part) getting good results from a program called latex2html, and most problems were attributable to identifiable causes, none of which was usually a major hindrance. (Back then we had to make do with the deplorable HTML version called HTML/3.2, but, aside from that, the principles seemed right).

> Shall the idea of editing raw text become daunting, I suggest LyX lyx.org [LyX: Front-end to LaTeX]. 5 minutes with LyX would help anyone realise the difference and convey the idea, e.g. varying outputs, styles, imposition of structure, etc. Only a few days ago, somebody in the LyX mailing lists mentioned his upcoming presentation on "Word: What you See Is What a Mess".

googled!

It's really the principles which count here, but in practical terms, I'm sure you're right in aiming at a format which promotes doing the right thing by default - as opposed to one which has prominent direct-formatting buttons on its user interface, and logical markup as an apparently advanced topic which, I'm afraid, too many authors seem to disdain learning.

all the best
#9
MS Word to XHTML d2help
maybe a tad off topic

several years ago i used a tool called doc2help which turned properly formatted word documents into online help in either html or chm format. in the process the word document was converted to rtf and then to the desired html fragments.

i now need a tool that splits a word document based on its headings, but the pieces have to be saved as word, pdf or rtf so they can be incorporated in our DMS. doc2help unfortunately no longer exists, though products like robohelp (now macromedia) worked in an identical way, recognizing the styles and building a document accordingly. it's much the same problem as yours. if you find something, keep me posted.