How the HTML Import 2 Plugin Works
The plugin works by reading HTML as XML and copying the specified tags’ contents into various WordPress fields. It therefore works best with well-formed HTML. Your files don’t necessarily have to validate according to the W3C specification, but they should at least contain tags that are properly nested. The importer will try to import improperly nested HTML, but it might not work as you expect!
The files you are importing must be on the same server as your WordPress installation. If you do not have (S)FTP access to the old site, you can use wget or an application like SiteSucker to download the files. (I’ve done this a few times for clients who’d forgotten their passwords.)
If you choose to import the files as pages (or any hierarchical post type), the new page hierarchy will match your original directory hierarchy. If the directory contained a default index file (in this case, index.htm), the contents of that file will be used for the parent page.
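To make the directory-to-page mapping concrete, here is a rough Python sketch of the idea. It is not the plugin's code; the function name `plan_pages`, the sample paths, and the use of an empty string for "no parent" are assumptions made for illustration.

```python
import posixpath

# Assumed default file name, matching the example in the text above.
INDEX_NAMES = {"index.htm"}

def plan_pages(file_paths, index_names=INDEX_NAMES):
    """Map each file to (page_path, parent_path) based on its directory."""
    pages = {}
    for path in sorted(file_paths):
        directory, name = posixpath.split(path)
        stem = posixpath.splitext(name)[0]
        if name in index_names:
            # The index file supplies the content for the directory's own page.
            page = directory
            parent = posixpath.dirname(directory)
        else:
            # Any other file becomes a child of its directory's page.
            page = posixpath.join(directory, stem)
            parent = directory
        pages[path] = (page, parent)
    return pages

if __name__ == "__main__":
    sample = ["about/index.htm", "about/team.htm", "about/history/index.htm"]
    for source, (page, parent) in plan_pages(sample).items():
        print(source, "->", page, "(parent:", parent or "none", ")")
```

In this sketch, about/index.htm becomes the "about" page itself, and about/team.htm becomes a child of it.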
Importing can be tricky, and it doesn’t always go well the first time. Therefore, it’s important to install a backup plugin in case you need to start over. I like DB Backup for its simplicity, and I recommend it if you don’t already have a backup plugin.
You’ll probably need to do a little bit of cleanup work afterward, so you’ll need to install the Search and Replace plugin as well.
If you’re importing content into a WordPress site that already contains content, back up your database and put the site into maintenance mode before you begin importing. Seriously, stop reading and back up now.
If you’ve installed a plugin that crossposts your content to another site (like Facebook or LiveJournal) or automatically notifies another site of your new posts (like Twitter), be sure to deactivate those plugins before you import. Otherwise, you’ll flood your social network with your imported posts!
Cleaning Things Up
It’s a good idea to look through your HTML files and make sure things are consistent before you run the importer. For example, the importer will give you an option to remove a repeated phrase from page titles. (Most WordPress themes automatically insert the site title into the <title> tag, so you don’t want that to be part of each post’s title, or it’ll be duplicated.) In order for this to work, the phrase must be exactly the same in all your files. Of course, you can change the page titles after the files have been imported, but it’s usually faster to edit the files than to edit and save each WordPress entry. (Bulk Edit doesn’t work on titles.)
Even more important: you need to make sure that the content you want to import from each file is surrounded by the same HTML tag or Dreamweaver template region in every file.
Finally, if you want the importer to copy your linked images and documents into the Media Library, make sure the image paths are correct! The importer can handle all of the following path styles, as long as they work:
<img src="http://example.com/images/foo.jpg" />
<img src="/images/foo.jpg" />
<img src="../../images/foo.jpg" />
<img src="foo.jpg" />
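If you want to check ahead of time where each of these path styles points, the sketch below resolves them against a page's old URL using Python's standard `urljoin`. The page URL is a made-up example, and this only illustrates how such paths resolve; it is not how the importer itself is written.

```python
from urllib.parse import urljoin

def resolve_image_url(page_url, src):
    """Resolve an <img> src value relative to the page that contains it."""
    return urljoin(page_url, src)

if __name__ == "__main__":
    # Hypothetical page two directories below the site root.
    page = "http://example.com/about/team/index.html"
    for src in ("http://example.com/images/foo.jpg",
                "/images/foo.jpg",
                "../../images/foo.jpg",
                "foo.jpg"):
        print(src, "->", resolve_image_url(page, src))
```

The first three styles all resolve to http://example.com/images/foo.jpg for this page, while the bare file name resolves to the page's own directory.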
Importing HTML Files
This is a complicated importer with lots of options. The settings page (Settings → HTML Import) is broken up into six sections. You need to look through the first five before you run the importer. The sixth (Tools) contains links to some tools that are helpful after you’ve imported, and lets you regenerate the rewrite rules for imported documents, which is useful if you change your permalink structure after you’ve finished importing.
Directory to import
Find the absolute path to this directory, not a site- or file-relative one. On a Windows machine, that path will begin with a drive letter (e.g. C:\sites\public_html). On a UNIX-based server (including Macs), the path will begin with a slash (e.g. /users/username/home/public_html or /Library/WebServer/mysite). You may leave this blank if you plan to upload a single file later, but if you can fill in that file’s path anyway, do so. The importer will then be able to locate any linked images in the uploaded file.
Don’t fret about whether or not to include the trailing slash. The importer will figure it out.
Do not enter a URL for the directory to import. This MUST be a path to a directory on the same server as your WordPress installation.
Old site URL
The old URL should correspond with your beginning directory. This will be used to generate .htaccess redirects and locate images. This is not where the importer will search for files to import; those must be specified using the directory option above.
If you are importing a subdirectory and not the full site, update this URL to include the subdirectory. For example, if you are importing /Sites/old_site/public_html/about/, the URL should be http://example.com/about/.
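The relationship between the two settings can be sketched like this: a file's old URL is the old site URL plus the file's path relative to the beginning directory. The function name `old_url_for` and the file bios.html are hypothetical, and the plugin's actual logic may differ.

```python
import posixpath

def old_url_for(file_path, import_dir, old_site_url):
    """Build a file's old URL from its location under the import directory."""
    # relpath normalizes both paths, so a trailing slash makes no difference.
    relative = posixpath.relpath(file_path, import_dir)
    return old_site_url.rstrip("/") + "/" + relative

if __name__ == "__main__":
    print(old_url_for("/Sites/old_site/public_html/about/team/bios.html",
                      "/Sites/old_site/public_html/about/",
                      "http://example.com/about/"))
```

If the URL were left as http://example.com/ while the directory pointed at the about subdirectory, every generated URL in this sketch would be missing the /about/ segment.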
Enter the name of the default (index) file used in this site’s directories. This is usually index.html on Apache servers and default.htm on IIS. The importer accepts multiple filenames; simply list them all, separated by commas.
File extensions to include
The importer will scan your beginning directory for files to import. You should specify the types of files you want to import (so the importer doesn’t try to read HTML out of an MP3, for example). All file types NOT listed here will be ignored.
Spaces between extensions or directory names are not required, but you can add them if they make the lists easier to read. The importer will remove them when you save the options.
Directories to exclude
You may enter the names of directories that will not be scanned for files to import. Paths to specific directories will not work; you need to simply add your directory’s name to this list. This way, if you have twelve directories called “images” at various levels in your directory hierarchy, you can skip all of them. If you have several directories that share a name, like “media,” and you need to skip some of them but import one or two, rename the directories you want to keep while you run the import.
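Here is a small Python sketch of how the extension list and the excluded directory names could work together during a scan. The helper names `parse_list` and `find_files` are invented for this example, and the real importer may scan differently; the point is that directories are skipped by name at every level, not by path.

```python
import os

def parse_list(raw):
    """Split a comma-separated option, dropping spaces and empty items."""
    return [item.strip() for item in raw.split(",") if item.strip()]

def find_files(root, extensions, skip_dirs):
    """Return files under root with a wanted extension, skipping named dirs."""
    wanted = {ext.lower().lstrip(".") for ext in parse_list(extensions)}
    skipped = set(parse_list(skip_dirs))
    matches = []
    for current, dirs, files in os.walk(root):
        # Prune by directory *name*, so "images" is skipped at any depth.
        dirs[:] = [d for d in dirs if d not in skipped]
        for name in files:
            ext = os.path.splitext(name)[1].lower().lstrip(".")
            if ext in wanted:
                matches.append(os.path.join(current, name))
    return sorted(matches)
```

Calling `find_files(root, "html, htm", "images, media")` would ignore an MP3 in the root as well as any HTML file sitting inside a directory named images, however deeply nested.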
Preserve File Names
WordPress normally generates the slug (the unique portion of the post/page URL) using the title. If your old files have short names, like about.html, and your page titles are very long, like “About Our Company, Our Founders, and Our Values,” you can check this option to use “about” instead of “about-our-company-our-founders-and-our-values” as the imported page’s slug.
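The difference between the two choices can be sketched in a few lines of Python. The `slugify` function here is a simplified stand-in for what WordPress does with titles, not its real implementation, and `page_slug` is a hypothetical name.

```python
import os
import re

def slugify(text):
    """Lowercase the text and join runs of letters and digits with hyphens."""
    slug = re.sub(r"[^a-z0-9]+", "-", text.lower())
    return slug.strip("-")

def page_slug(file_name, title, preserve_file_name):
    """Pick the slug source: the old file name or the page title."""
    if preserve_file_name:
        return slugify(os.path.splitext(file_name)[0])
    return slugify(title)

if __name__ == "__main__":
    title = "About Our Company, Our Founders, and Our Values"
    print(page_slug("about.html", title, preserve_file_name=True))
    print(page_slug("about.html", title, preserve_file_name=False))
```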
To select the part of the file that contains the main content—what will become the post or page content in WordPress—you can specify an HTML tag or a Dreamweaver template region.
If your pages are based on Dreamweaver templates, select the Dreamweaver option and enter the name of the content area (e.g. “Main Content”) into the template region field.
If you’re using a tag without attributes, or where the attributes don’t matter, simply enter the tag (without brackets) in the tag field, and leave the attribute and value fields blank. For example, if you need to import the entire <body> tag, you would enter body in the tag field and nothing in the other two.
If your tag does have an attribute that makes it unique, enter the attribute name (like class or id) in the attribute field and the value in the value field. For example, if your content is contained in the <div id="spotlight"> tag, your import settings would be Tag: div, Attribute: id, Value: spotlight.
Tables and table cells will work, too, even if they don’t have a class or ID, as long as they have a width attribute with a unique value. For example: Tag: table, Attribute: width, Value: 730 will work just fine.
Any attribute/value pair will work as long as the value is unique.
Firebug makes it easy to find the right tag. Open it, press the arrow button, and hover your mouse over the section you want to import. It will highlight the appropriate tag in the HTML view:
For this page, the content settings should be:
- Tag: div
- Attribute: id
- Value: spotlight
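Since the plugin reads HTML as XML, the tag/attribute/value selection can be illustrated with Python's standard XML parser. This is only a sketch of the idea under that assumption; `find_content` and `inner_markup` are hypothetical helpers, and the sample markup is invented.

```python
import xml.etree.ElementTree as ET

def find_content(xml_text, tag, attribute=None, value=None):
    """Return the first element matching the tag and optional attribute/value."""
    root = ET.fromstring(xml_text)
    for element in root.iter(tag):
        if attribute is None or element.get(attribute) == value:
            return element
    return None

def inner_markup(element):
    """Serialize everything inside the element, without the element itself."""
    parts = [element.text or ""]
    parts.extend(ET.tostring(child, encoding="unicode") for child in element)
    return "".join(parts)

if __name__ == "__main__":
    page = ('<html><body><div id="nav">Menu</div>'
            '<div id="spotlight"><p>Hello</p></div></body></html>')
    print(inner_markup(find_content(page, "div", "id", "spotlight")))
```

Note that the parser in this sketch raises an error on improperly nested markup, which is a good reminder of why well-formed files import more predictably.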
More content options
If you choose to import meta description tags as excerpts, the excerpts will be stored for both posts and pages. However, WordPress pages don’t normally have excerpts. To edit and/or display excerpts for pages, you will need to install a plugin such as PJW Page Excerpt.
If your original files used a character set other than UTF-8, you should check the option to convert special characters.
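If you would rather convert the files yourself before importing, re-encoding is straightforward. The sketch below assumes the originals were saved as Windows-1252, which is common for older Windows-authored pages; that encoding is a guess you should verify for your own files, and this is not necessarily what the plugin's option does internally.

```python
def to_utf8(raw_bytes, source_encoding="windows-1252"):
    """Decode bytes from the old character set and re-encode them as UTF-8."""
    return raw_bytes.decode(source_encoding).encode("utf-8")

if __name__ == "__main__":
    # 0xE9 is "é" in Windows-1252; in UTF-8 it becomes two bytes.
    print(to_utf8(b"caf\xe9"))
```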
Import linked images
With this option checked, the importer will attempt to copy your images to the Media Library, add them to the appropriate post as attachments, and replace the src="" paths in all your image tags. It will check for duplicates, so that images used in several pages will be imported just once.
Linked images will be imported no matter where they are located. It’s fine to leave images in your list of skipped directories, since that setting just tells the importer where to look for the HTML files.
Import linked documents
You can use this option to import files other than images, like PDFs and Word documents. Only the file extensions you specify will be imported into the media library. The importer will update the links to the files. Once it’s finished, all your file links should go to the copies in the media library.
Cleaning up HTML
You can have the importer clean up any unneeded HTML, if you wish. For example, if your files came from Microsoft Word or Frontpage, they’re probably littered with extraneous <div> tags, smart tags, and class attributes. To clean them up, check the Clean up bad (Word, Frontpage) HTML option, then specify the HTML tags and attributes that should be allowed. Any tags and attributes not in these lists will be removed. A list of suggested tags and attributes is provided, along with an extra set that you should include if your content contains data tables. Please look over both lists carefully before you import. Note that style and align are not among the attributes allowed by default, which might be important to you if you’re importing images.
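To show what an allow-list cleanup does to typical Word output, here is a minimal Python sketch built on the standard `html.parser` module. The class and function names are made up, the allowed lists in the example are arbitrary, and the plugin's own cleanup is written differently; treat this as an illustration of the principle only.

```python
from html import escape
from html.parser import HTMLParser

class AllowListCleaner(HTMLParser):
    """Rebuild markup, keeping only allowed tags and allowed attributes."""

    def __init__(self, allowed_tags, allowed_attrs):
        super().__init__(convert_charrefs=True)
        self.allowed_tags = set(allowed_tags)
        self.allowed_attrs = set(allowed_attrs)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in self.allowed_tags:
            kept = "".join(
                ' {}="{}"'.format(name, escape(value or "", quote=True))
                for name, value in attrs
                if name in self.allowed_attrs
            )
            self.out.append("<{}{}>".format(tag, kept))

    def handle_endtag(self, tag):
        if tag in self.allowed_tags:
            self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        # Text is always kept, even when its surrounding tag is dropped.
        self.out.append(escape(data, quote=False))

def clean_html(html_text, allowed_tags, allowed_attrs):
    cleaner = AllowListCleaner(allowed_tags, allowed_attrs)
    cleaner.feed(html_text)
    cleaner.close()
    return "".join(cleaner.out)

if __name__ == "__main__":
    messy = '<div class="MsoNormal"><p style="color:red" id="x">Hi <b>there</b></p></div>'
    print(clean_html(messy, ["p", "b"], ["id"]))
```

Because style is not in the allowed attributes here, the inline styling disappears, which is the same trade-off described above for images that rely on style or align.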
Title & Metadata
You can select the title tag the same way you chose your content area. You can have the importer remove common words or phrases from your titles. Remember that your site title will be added automatically to your WordPress posts and pages (depending on your theme). If it’s part of your HTML files’ <title> tags, for example, you’ll need to remove it now to avoid duplication on your WordPress site.
The importer will encode HTML entities for you. For example, if your title contains an ampersand (“&”), you don’t have to enter it as “&amp;”.
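The two title steps, removing a repeated phrase and encoding entities, can be sketched as follows. The function name `clean_title` and the site name "Example Corp" are invented for the example; the sketch also shows why the phrase has to match exactly.

```python
from html import escape

def clean_title(raw_title, phrase_to_remove):
    """Remove an exact phrase from the title, then encode HTML entities."""
    title = raw_title.replace(phrase_to_remove, "").strip()
    return escape(title, quote=False)

if __name__ == "__main__":
    phrase = " | Example Corp"
    print(clean_title("Research & Development | Example Corp", phrase))
    # A different separator means the phrase is not found and nothing is removed.
    print(clean_title("Contact - Example Corp", phrase))
```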
If you were to import the title using the <title> tag and the post/page content using the <body> tag, you would have no problems with duplicate titles, since there is no overlap between those HTML tags.
But let’s say you’re importing <div id="content">, and your title is the <h2> inside that <div>. When the importer runs, it will use the contents of that <h2> as the post title. But the <h2> will still appear within the post content, which means it will be displayed twice on the page in your WordPress site.
To prevent this duplication, the importer offers this option to remove the title from the imported content.
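In rough terms, removing the title from the content means taking the title element's text and then deleting that element from the content before saving it. Here is a hedged Python sketch of that step; `split_title` is a hypothetical helper and the markup is a made-up example, not the plugin's code.

```python
import xml.etree.ElementTree as ET

def split_title(content_xml, title_tag):
    """Return (title, content with the first title element removed)."""
    root = ET.fromstring(content_xml)
    for parent in root.iter():
        for child in list(parent):
            if child.tag == title_tag:
                title = "".join(child.itertext()).strip()
                # Keep any text that followed the heading inside its parent.
                tail = child.tail or ""
                index = list(parent).index(child)
                if index == 0:
                    parent.text = (parent.text or "") + tail
                else:
                    previous = parent[index - 1]
                    previous.tail = (previous.tail or "") + tail
                parent.remove(child)
                return title, ET.tostring(root, encoding="unicode")
    return None, content_xml

if __name__ == "__main__":
    html = '<div id="content"><h2>About Us</h2><p>Welcome.</p></div>'
    print(split_title(html, "h2"))
```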
The metadata section is where you can specify all the little details: whether you want to import the files as posts, pages, or a custom post type; what the status of the imported posts should be (draft can be handy if you know you’ll need to do some additional editing after importing); which user should be listed as the author; and (for pages and other hierarchical post types) what the parent and page template should be.
What if you need to import some directories as posts and others as pages? Just run the importer twice with different beginning directories. If the site you’re importing has a news section, for example, you could import that subdirectory as posts, then add “news” to your list of skipped directories and import the parent directory as pages.
Timestamps and Custom Fields
If you need to set the imported posts’ timestamps according to a date that’s somewhere in your HTML, choose “custom field” in this section. Under the Custom Fields tab, you’ll be able to specify the HTML tag or Dreamweaver region that contains the date.
Categories & Tags
Here you can assign categories, tags, and post formats to your imported files. If you have created custom taxonomies for your site, you’ll see fields for those as well. This screenshot shows the hierarchical taxonomies from Content Audit as well as a book review site that has a set of taxonomies related to genres, along with the flat taxonomies for a research institute’s site dealing with animal species and funding institutions. (My test installation can get a little mixed up sometimes.)
You may specify any number of custom fields. For each field, you must provide:
- a field name (the meta_key)
- an HTML tag or Dreamweaver region
- whether HTML within the custom field should be preserved
If you allow HTML in your custom field, the allowed tags from the Content tab will be applied, and all other HTML tags will be stripped. You cannot specify a different set of allowed tags for the custom fields.
Importing the date from a custom field
To import the date from a custom field, simply specify the HTML tag or Dreamweaver region that contains the date. You do not need to specify a custom field name, since the importer will not create a custom field, but will instead try to interpret the date as a timestamp using PHP’s strtotime() function. Make sure that you have selected “custom field” as the timestamp option under the Title & Metadata tab.
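It helps to check beforehand that the dates in your files are written in a form a parser can understand. PHP's strtotime() accepts many more formats than the Python sketch below, which only tries a short, assumed list; the idea is simply that text which doesn't parse gives you no usable timestamp.

```python
from datetime import datetime

# A small, assumed set of formats; extend it to match your own files.
DATE_FORMATS = ("%Y-%m-%d", "%B %d, %Y", "%d %B %Y", "%m/%d/%Y")

def parse_date(text):
    """Return a datetime for the first matching format, or None."""
    cleaned = " ".join(text.split())  # collapse stray whitespace and newlines
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt)
        except ValueError:
            continue
    return None

if __name__ == "__main__":
    for sample in ("March 5, 2011", "2011-03-05", "sometime last spring"):
        print(repr(sample), "->", parse_date(sample))
```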
Importing post tags from a custom field
You may import tags from your files! Add another custom field and give it the name post_tag. Then specify the HTML tag or Dreamweaver region that contains the tag names. The importer will use their names as they appear in the text. If you end up with duplicates (like “tag” and “tags”), you can use the Term Management Tools plugin to merge them once you’ve finished your import.
Ready to import?
Once you’ve filled in all the settings, save them. The Import files button will then appear. If you need to go do something else and come back to this later, you can either return to this settings screen or go to Tools → Import and select HTML from the list.
When you press “Import files,” you’ll leave the options screen and jump to the importer:
If you’re importing a directory with many files — say, more than a hundred — this will take a few minutes. When the importer has finished, it will display a list of the imported files with any errors noted.
The report will include a set of rewrite rules that you can use in your .htaccess file to redirect visitors from your old files to your new WordPress posts or pages. If you entered the old site’s URL in your settings, the rules should be exact. Otherwise, they’ll use the file system path instead of a URL. You should be able to correct them with a simple search and replace. The importer does not write the new rules to your .htaccess file; you’ll need to copy them and paste them into the file yourself. You can retrieve them again later; look for the link in the Tools section of the settings screen.
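As an illustration of what such rules look like and how the search-and-replace fix works, here is a Python sketch. It assumes the rules use Apache's `Redirect 301` directive; the exact format the report produces may differ, and the function names and the new.example.com address are hypothetical.

```python
from urllib.parse import urlparse

def redirect_rules(mapping):
    """Build one 'Redirect 301' line per old location -> new permalink pair."""
    rules = []
    for old_location, new_url in sorted(mapping.items()):
        # Works for full old URLs and for bare file system paths alike.
        old_path = urlparse(old_location).path
        rules.append("Redirect 301 {} {}".format(old_path, new_url))
    return rules

def fix_filesystem_paths(rules, import_dir, old_url_path):
    """Swap the file system prefix for the old site's URL path in each rule."""
    prefix = import_dir.rstrip("/")
    replacement = old_url_path.rstrip("/")
    return [rule.replace(prefix, replacement) for rule in rules]

if __name__ == "__main__":
    rules = redirect_rules({
        "/Sites/old_site/public_html/about/team.html":
            "http://new.example.com/about/team/",
    })
    for rule in fix_filesystem_paths(rules, "/Sites/old_site/public_html/", "/"):
        print(rule)
```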
If you chose to import images and/or other media files, those reports will be shown beneath the redirects. Any files that couldn’t be located will be listed (in orange) so you can add them by hand later.
At the very end of the report (because it happens last), the importer lets you know that it has updated your internal links, if you chose to do so.
After Importing: Fixing What’s Broken
No matter how careful you were with the settings, there’s a good chance you’ll see some errors in your newly imported content. Visit the Tools section of the settings page for a list of additional plugins that can help you clean up imported content. In particular, Broken Link Checker and Search and Replace are amazingly useful.
I import a few files and then the results page just gets cut off. What can I do?
The importer will attempt to work around your server’s max_execution_time setting for PHP (usually 30 seconds), but some servers don’t allow this. You can try to increase it by adding a line to your .htaccess file:
php_value max_execution_time 160
If that gets you further but still doesn’t finish, just increase the number (it’s in seconds). However, note that your host might get irritated with you for hogging the server’s resources. If you have a lot of files to import, it’s best to install WordPress on your desktop (XAMPP for Windows/Linux and MAMP for Macs make it pretty easy) and do the import there.
It’s also quite possible that the script is trying to use more memory than your server allows. You can try to change that setting, too, in .htaccess:
php_value memory_limit 1024M
If you’re having trouble getting things to work, please ask your questions in the wordpress.org support forum for the plugin. There are a lot of you, and only one of me. If I can answer your question in the forum, someone else might see the answer. If you write to me personally, a) I won’t reply, and b) if I did, you would be the only person who would benefit from the answer. Be good to the community–and my sanity.
Did this plugin save you several hours of work?
Consider donating a bit to future development. You’d be surprised how few people do!