cricket@onebit.ca

Word 2 HTML

Nonfic.Word2HTML History


December 15, 2006, at 10:32 PM by Super -
Changed lines 39-40 from:

For more details, see www.onebit.ca.

to:

For more details, see http://www.onebit.ca.

December 15, 2006, at 10:31 PM by Super -
Deleted lines 2-9:

Word (and other word processors, and even some WYSIWYG HTML editors) generates really nasty HTML. Usually half of it is code that overrides conventions or won't display properly in other browsers. It uses characters unique to Microsoft (such as the single-character three dots and curly open / close quotes), and it uses embedded CSS stylesheets, inline CSS styles, and HTML formatting codes, all at the same time.

But simply saving it as text and replacing all the line breaks with P tags loses any italics or headers or other formatting which you do want to keep.

And writing something large in a text-editor or directly in HTML? Ugh.

So, here are a few methods to take the HTML that Word generates, and turn it into something reasonable. (Now to do something for Word to PmWiki!)

Added lines 11-12:

Now to come up with something for Word to PmWiki!

December 15, 2006, at 10:30 PM by Super -
Changed lines 33-35 from:

Method Two -- HTML Cleaner by OneBitCPU

to:

Method Two
HTML Cleaner by OneBitCPU

Changed lines 46-48 from:

Method Three -- Use a Good Text Editor and Search and Replace

to:

Method Three
Use a Good Text Editor and Search and Replace

December 15, 2006, at 10:29 PM by Super -
Changed line 21 from:

Method One \

to:

Method One \\

December 15, 2006, at 10:28 PM by Super -
Changed lines 21-23 from:

Method One

Let Someone Else Do It

to:

Method One Let Someone Else Do It

December 15, 2006, at 10:25 PM by Super -
Changed lines 20-23 from:

Method One

Let Someone Else Do It

to:

Method One

Let Someone Else Do It

December 15, 2006, at 10:24 PM by Super -
Changed lines 21-22 from:

Method One -- Let Someone Else Do It

to:

Method One

Let Someone Else Do It

December 15, 2006, at 10:23 PM by Super -
Changed lines 19-20 from:

!!Method One -- Let Someone Else Do It

to:

Method One -- Let Someone Else Do It

Changed lines 32-34 from:

!!Method Two -- HTML Cleaner by OneBitCPU

to:

Method Two -- HTML Cleaner by OneBitCPU

Changed lines 44-46 from:

!!Method Three -- Use a Good Text Editor and Search and Replace

to:

Method Three -- Use a Good Text Editor and Search and Replace

December 15, 2006, at 10:22 PM by Super -
Changed lines 19-21 from:

Method One -- Let Someone Else Do It

to:

!!Method One -- Let Someone Else Do It

Changed lines 31-33 from:

Method Two -- HTML Cleaner by OneBitCPU

to:

!!Method Two -- HTML Cleaner by OneBitCPU

Changed lines 43-45 from:

Method Three -- Use a Good Text Editor and Search and Replace

to:

!!Method Three -- Use a Good Text Editor and Search and Replace

December 15, 2006, at 10:22 PM by Super -
Added lines 1-70:

Converting stories from one format to another is a pain, especially if you want to keep italics and bold and headings.

Word (and other word processors, and even some WYSIWYG HTML editors) generates really nasty HTML. Usually half of it is code that overrides conventions or won't display properly in other browsers. It uses characters unique to Microsoft (such as the single-character three dots and curly open / close quotes), and it uses embedded CSS stylesheets, inline CSS styles, and HTML formatting codes, all at the same time.

But simply saving it as text and replacing all the line breaks with P tags loses any italics or headers or other formatting which you do want to keep.

And writing something large in a text-editor or directly in HTML? Ugh.

So, here are a few methods to take the HTML that Word generates, and turn it into something reasonable. (Now to do something for Word to PmWiki!)

All these methods assume you know basic HTML and can look up any tags or formats you need.

Method One: Let someone else do it.

Method Two: Use Word to convert to HTML, then use OneBitCPU's HTML Cleaner.

Method Three: Use Word to convert to HTML, then use a good text editor for Search and Replace.

Method One -- Let Someone Else Do It

Get an account on Fanfiction.net.

Upload the story to it.

Open the story on their site and use your browser to View the Source Code. Save the source code.

Open the resulting HTML in your HTML editor and strip out the headings and tables.

Done.

Method Two -- HTML Cleaner by OneBitCPU

It's not particularly friendly yet -- its author needs feedback.

First use MSWord to save the document as HTML. Then make a backup.

HTML Cleaner takes out the Microsoft special characters and replaces them with standard characters. Then it strips out most of the HEAD section.
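To make those two steps concrete, here is a minimal Python sketch of the same idea. It is not HTML Cleaner's actual code. The character table and the function names are my own assumptions, and it assumes the file has already been read in with the right encoding.

```python
import re

# Assumed mapping of Word's special characters to plain stand-ins.
# Extend it with whatever else turns up in your own documents.
REPLACEMENTS = {
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote / apostrophe
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2026": "...",  # the single-character "three dots"
}

def replace_special_chars(text):
    """Swap each special character for its plain equivalent."""
    for special, plain in REPLACEMENTS.items():
        text = text.replace(special, plain)
    return text

def strip_head(html):
    """Empty the HEAD section, keeping only the TITLE element if present."""
    def keep_title(match):
        title = re.search(r"<title\b[^>]*>.*?</title>", match.group(0),
                          re.IGNORECASE | re.DOTALL)
        return "<head>" + (title.group(0) if title else "") + "</head>"
    return re.sub(r"<head\b[^>]*>.*?</head>", keep_title, html,
                  count=1, flags=re.IGNORECASE | re.DOTALL)
```

Run replace_special_chars first and strip_head second, and always on a copy of the file, never the backup.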

It then takes you through each tag and lets you change the tag to something simpler (e.g., remove the formatting from a P tag), remove the tag altogether (e.g., many SPAN tags) or remove the tag and contents (e.g., STYLE tags). You can change just the tag in question, or all identical tags.
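The three choices (simplify the tag, remove the tag, remove the tag and its contents) can be sketched with regular expressions in Python. This is an illustration only, not how HTML Cleaner itself works. The function names are hypothetical, and the patterns assume no attribute value contains a > character.

```python
import re

def simplify_tag(html, tag):
    """Drop all attributes from opening tags, e.g. <p class=MsoNormal> becomes <p>."""
    name = re.escape(tag)
    return re.sub(rf"<{name}\b[^>]*>", f"<{tag}>", html, flags=re.IGNORECASE)

def unwrap_tag(html, tag):
    """Remove the opening and closing tags but keep the text between them."""
    name = re.escape(tag)
    return re.sub(rf"</?{name}\b[^>]*>", "", html, flags=re.IGNORECASE)

def remove_tag_and_contents(html, tag):
    """Remove the tag together with everything inside it."""
    name = re.escape(tag)
    return re.sub(rf"<{name}\b[^>]*>.*?</{name}\s*>", "", html,
                  flags=re.IGNORECASE | re.DOTALL)
```

For example, calling remove_tag_and_contents(html, "style"), then unwrap_tag(html, "span"), then simplify_tag(html, "p") covers the three cases named above, applied to all identical tags at once.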

For more details, see www.onebit.ca.

Method Three -- Use a Good Text Editor and Search and Replace

First use MSWord to save the document as HTML. Then make a backup.

Open the HTML file in a good text editor such as TextPad (www.textpad.com). Do not use a regular word processor, because it will do the very things that we're trying to avoid. The editor has to be able to search for non-printing characters such as tabs and paragraph breaks; in TextPad you do this with "regular expressions".

Cut out just about everything in the HEAD. As I said earlier, these instructions assume you know basic HTML.

Do a Search and Replace to turn all < into <@. (This assumes that there are no <@ combinations in the text. If there are, use something else instead of the @ sign.)

Search for @ to find the next tag that still needs attention. Copy the entire tag into the search buffer, and replace it with what you want. If there are no important attributes in the tag, you can use wildcards to replace all instances.

If the formatting instructions in the P tag are too long for the search field, you have two options:

a. If you don't care about any formatting, use wildcards.

b. Use two passes. On the first pass, replace a chunk of the tag with a single letter. On the second pass, search for that letter and the rest of the tag.

Continue until there are no more @ signs.

It is possible to do all the close tags at once by replacing <@/ with </, but that leaves Microsoft's </O:P> tags in the document.
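The whole marker routine above can also be scripted. The following is a rough Python sketch under a few assumptions: the @ marker does not already follow a < anywhere in the text, no attribute value contains a > character, and the function names are made up for this example. It drops the O:P tags before the close tags are restored, so they are not left behind.

```python
import re

MARK = "@"  # assumed marker; pick another character if "<@" occurs in the text

def mark_tags(html):
    """Turn every < into <@ so unprocessed tags are easy to find."""
    return html.replace("<", "<" + MARK)

def replace_exact(html, marked_tag, replacement):
    """Replace one specific marked tag, copied whole, with what you want."""
    return html.replace(marked_tag, replacement)

def replace_wildcard(html, name, replacement):
    """Replace every marked opening tag of this name, whatever its attributes."""
    pattern = rf"<{re.escape(MARK)}{re.escape(name)}\b[^>]*>"
    return re.sub(pattern, replacement, html, flags=re.IGNORECASE)

def drop_office_tags(html):
    """Remove marked <o:p> and </o:p> tags entirely."""
    pattern = rf"<{re.escape(MARK)}/?o:p\b[^>]*>"
    return re.sub(pattern, "", html, flags=re.IGNORECASE)

def unmark_close_tags(html):
    """Restore all remaining close tags in one pass."""
    return html.replace("<" + MARK + "/", "</")

def remaining(html):
    """Count the tags that still carry the marker."""
    return html.count("<" + MARK)

text = mark_tags('<p class=MsoNormal>She <i>ran</i>.<o:p></o:p></p>')
text = drop_office_tags(text)
text = replace_wildcard(text, "p", "<p>")
text = replace_exact(text, "<" + MARK + "i>", "<i>")
text = unmark_close_tags(text)
```

When remaining(text) returns 0, there are no more @ signs left to deal with.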

Notes

Free advertisement to anyone who can tell me what </O:P> means in Microsoft Word HTML files.

Instructions updated October 28, 2004


Page last changed: December 15, 2006, at 10:32 PM.