Word 2 HTML
Converting stories from one format to another is a pain, especially if you want to keep italics and bold and headings.
Word (and other word processors, and even some WYSIWYG HTML editors) generates really nasty HTML. Usually half of it is code that overrides conventions or won't display properly in other browsers. It uses characters unique to Microsoft (such as the single-character "three dots" ellipsis and curly open / close quotes), and it uses embedded CSS stylesheets, inline CSS styles and HTML formatting codes, all at the same time.
But simply saving it as text and replacing all the line breaks with P tags loses any italics or headers or other formatting which you do want to keep.
And writing something large in a text-editor or directly in HTML? Ugh.
So, here are a few methods to take the HTML that Word generates, and turn it into something reasonable. (Now to do something for Word to PmWiki!)
All these methods assume you know basic HTML and can look up any tags or formats you need.
Method One: Let someone else do it.
Method Two: Use Word to convert to HTML, then use OneBitCPU's HTML Cleaner.
Method Three: Use Word to convert to HTML, then use a good text editor for Search and Replace.
Method One -- Let Someone Else Do It
Get an account on Fanfiction.net.
Upload the story to it.
Open the story on their site and use your browser to View the Source Code. Save the source code.
Open the resulting HTML in your HTML editor and strip out the headings and tables.
Done.
Method Two -- HTML Cleaner by OneBitCPU
HTML Cleaner is not particularly friendly yet -- its author, OneBitCPU, needs feedback.
First use MSWord to save the document as HTML. Then make a backup.
HTMLCleaner takes out the MS special characters and replaces them with standard characters. Then it strips out most of the HEAD section.
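To make the character clean-up concrete, here is a minimal Python sketch of the same idea. It is not HTML Cleaner's actual code; the mapping table and the function name `replace_ms_characters` are assumptions made for illustration, and the table only covers the characters that most often turn up in Word output.

```python
# Hypothetical mapping table: extend it with whatever your document contains.
MS_CHARACTERS = {
    "\u2018": "'",    # curly open single quote
    "\u2019": "'",    # curly close single quote / apostrophe
    "\u201c": '"',    # curly open double quote
    "\u201d": '"',    # curly close double quote
    "\u2026": "...",  # single-character ellipsis (the "three dots")
    "\u00a0": " ",    # non-breaking space
}


def replace_ms_characters(text):
    """Return text with each mapped character swapped for its plain equivalent."""
    return text.translate(str.maketrans(MS_CHARACTERS))
```

Run it on the saved HTML before touching any tags, so that later searches only have to deal with plain characters.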
It then takes you through each tag and lets you change the tag to something simpler (e.g., remove the formatting from a P tag), remove the tag altogether (e.g., many SPAN tags) or remove the tag and contents (e.g., STYLE tags). You can change just the tag in question, or all identical tags.
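The three kinds of change described above (simplify a tag, remove a tag, remove a tag and its contents) can be sketched with regular expressions. This is only an illustration, not how HTML Cleaner is implemented; the function name `simplify_tags` is made up, and the patterns assume reasonably well-formed Word output with no nested STYLE blocks.

```python
import re


def simplify_tags(html):
    """Apply one example of each kind of change to a chunk of Word HTML."""
    flags = re.IGNORECASE | re.DOTALL
    # Remove the tag and its contents: STYLE blocks go entirely.
    html = re.sub(r"<style(?:\s[^>]*)?>.*?</style\s*>", "", html, flags=flags)
    # Remove the tag altogether but keep the text inside it: SPAN tags.
    html = re.sub(r"</?span(?:\s[^>]*)?>", "", html, flags=flags)
    # Change the tag to something simpler: drop all attributes from P tags.
    html = re.sub(r"<p(?:\s[^>]*)?>", "<p>", html, flags=flags)
    return html
```

Unlike HTML Cleaner, this changes every matching tag at once, so check the result against your backup.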
For more details, see www.onebit.ca.
Method Three -- Use a Good Text Editor and Search and Replace
First use MSWord to save the document as HTML. Then make a backup.
Open the HTML file in a good text editor such as TextPad (www.textpad.com). Do not use a regular word processor, because it will do the very things that we're trying to avoid. The editor has to be able to search for hidden characters such as tabs and paragraph breaks; in TextPad this is done with "regular expressions".
Cut out just about everything in the HEAD. As I said earlier, these instructions assume you know basic HTML.
Do a Search and Replace to turn all < into <@. (This assumes that there are no <@ combinations in the text. If there are, use something else instead of the @ sign.)
Search for @. Copy the entire tag into the search buffer, and replace it with what you want, leaving out the @ so that the tag is marked as done. If there are no important attributes in the tag, you can use wildcards to replace all instances.
If the formatting instructions in the P tag are too long for the search field, you have two options: a. If you don't care about any formatting, use wildcards. b. Use two passes. On the first pass, replace a chunk of the tag with a single letter. On the second pass, search for that letter and the rest of the tag.
Continue until there are no more @ signs.
It is possible to do all the close tags at once by replacing <@/ with </, but that leaves Microsoft's </O:P> tags in the document.
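For anyone who would rather script the marker trick than do it by hand, the steps above can be sketched in Python. The function names (`mark_tags`, `replace_tag`, `remaining_tags`) and the sample line are invented for this example; the @ marker is the one used in the steps, with the same assumption that <@ never occurs in the text itself.

```python
import re

MARK = "@"  # assumes "<@" never occurs in the text itself


def mark_tags(html):
    """Turn every < into <@ so that unprocessed tags are easy to find."""
    return html.replace("<", "<" + MARK)


def replace_tag(html, marked_tag, replacement):
    """Replace every copy of one marked tag; the replacement carries no marker."""
    return html.replace(marked_tag, replacement)


def remaining_tags(html):
    """List the distinct tags that still carry the marker."""
    return sorted(set(re.findall(r"<@[^>]*>", html)))


# A made-up line of Word output, processed one tag at a time.
page = mark_tags("<p class=MsoNormal>One<o:p></o:p></p>")
page = replace_tag(page, "<@o:p>", "")    # deal with the O:P tags first
page = replace_tag(page, "<@/o:p>", "")
page = replace_tag(page, "<@p class=MsoNormal>", "<p>")
page = replace_tag(page, "<@/p>", "</p>")
```

Call `remaining_tags(page)` after each pass; when it returns an empty list there are no more @ signs and the file is done. Removing the O:P tags before any blanket replacement of close tags avoids the leftover </O:P> problem.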
Notes
Free advertisement to anyone who can tell me what </O:P> means in Microsoft Word HTML files.
Instructions updated October 28, 2004