Doing a content audit or inventory

This is a re-post of my article Doing a content audit or inventory that was posted on the nForm blog January 26, 2010.

Over the past couple of weeks I’ve been doing a large content audit and inventory. With the increased attention on content strategy, the audit/inventory seems to be making a comeback.

In this post I’m going to share my recent approach and talk about the tools and process I used to get a comprehensive audit/inventory done.

Audit or inventory?

Before I start though I want to clear up some terminology – in particular the difference between a content audit and a content inventory. Some use the terms interchangeably, but I think there is a difference (I could be wrong), so here’s my take on it:

An inventory is quantitative. It helps me paint a picture of how much content the site has and identify the specific pages/URLs/content, along with any other metadata I can gather. I prefer to do these using a tool to automate much of the work.

An audit is qualitative. An audit helps me understand what kinds of content there are, shows me how pages are currently organized, connects me with the voice and tone of the content and helps me uncover the subject matter so I can be conversant about the site with my client. I prefer to do these manually, reviewing the content on as many pages as I can (for larger sites this might just be a sample of the overall content – every X page or page/document type – or a level or two deep).

Doing a content inventory

I’m the first to admit that doing an audit/inventory can be a significant undertaking, tedious and even “mind-numbing” as Jeffrey Veen once said, so I’m always looking for ways to make it easier. Below are a couple of tools that can help speed up your work:

iGooMap (OSX) is an application that was designed to help create Google Sitemap files for search engine optimization. What I particularly like about this application is that it can pull down a list of URLs with the date last modified from the server (which can be helpful in showing when content was last edited and expediting the ROT or OUCH assessment). It exports files to an XML Sitemap file or to a text file listing all the URLs. You can also filter extensions and parameters from URLs, disallow certain files or folders and view bad links.
Integrity is an application that was created for webmasters doing link checking and has been extended recently to allow you to create XML sitemap files as well. It also can export the results of your scan as a .CSV file or as an XML sitemap file, allows filtering and the ability to ignore certain parameters. Like iGooMap it also shows you bad links.

Note: if you are a PC user there may be comparable Google sitemap creation tools, but I don’t have any suggestions for you. For link checking, I’d suggest trying XENU Link Sleuth.
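If your tool exports an XML sitemap, the URLs and last-modified dates can be pulled into rows with a few lines of Python. This is only a minimal sketch: it assumes the export follows the standard sitemap format, and the function name and sample data are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard XML sitemap files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def read_sitemap(xml_text):
    """Return a list of (url, last_modified) tuples from sitemap XML text.

    last_modified is an empty string when the entry has no <lastmod>.
    """
    root = ET.fromstring(xml_text)
    rows = []
    for url_el in root.findall("sm:url", SITEMAP_NS):
        loc = url_el.findtext("sm:loc", default="", namespaces=SITEMAP_NS)
        lastmod = url_el.findtext("sm:lastmod", default="", namespaces=SITEMAP_NS)
        rows.append((loc.strip(), lastmod.strip()))
    return rows


# Hypothetical sample export, for illustration only.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/</loc><lastmod>2010-01-12</lastmod></url>
  <url><loc>http://example.com/about/</loc></url>
</urlset>"""

for url, lastmod in read_sitemap(SAMPLE):
    print(url, lastmod or "(no date)")
```

The resulting rows can be pasted or written into the spreadsheet as the starting point for the inventory.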

The advantage of using a tool to automate some of the work is that it gives me a comprehensive list of URLs without much effort and can run in the background while I do my audit or other work. While tools are helpful, they tend to scan everything they can find via a hyperlink. So once I have my list of pages I usually try to spend some time cleaning up my inventory. Here are a few methods I use:

Load the list of pages into Excel (each page URL on its own row) and then sort the column ascending from A-Z so that the URLs appear in the order of the site’s file structure. This speeds up the grouping of content in the spreadsheet and makes it easier to locate similar content based on the URL path.
If you have URLs in a mix of cases, they can be hard to scan and read, so make them all the same case using Excel’s LOWER function.
To locate duplicates within the spreadsheet, use Excel’s COUNTIF function and some conditional formatting to identify and flag them, then delete them from the spreadsheet. If you’re not an Excel whiz, try this tutorial on using the COUNTIF function to find duplicates.
Consider moving other content/file types (PDF, Word Docs, Video, Audio, PowerPoint, etc.) to other tabs in your spreadsheet so that you can separate the pages from the files on the site and really focus on the content that’s HTML. I usually have one for Documents, Audio and Video respectively.
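If you would rather script the cleanup than do it in Excel, the steps above can be sketched in Python. Treat this as a rough illustration and not a finished tool: the function name, the extension-to-tab mapping and the tab names are my own assumptions. Also note that lowercasing URLs can merge addresses that are actually different pages on a case-sensitive server, so check before deleting anything.

```python
import posixpath
from urllib.parse import urlsplit

# Assumed mapping of file extensions to spreadsheet tabs; adjust to the site.
FILE_TABS = {
    ".pdf": "Documents", ".doc": "Documents", ".docx": "Documents",
    ".ppt": "Documents", ".pptx": "Documents",
    ".mp3": "Audio", ".wav": "Audio",
    ".mp4": "Video", ".mov": "Video",
}


def clean_inventory(urls):
    """Lowercase, de-duplicate and sort URLs, then group them by tab."""
    # A set drops duplicates; sorting groups URLs by their path structure.
    unique = sorted({u.strip().lower() for u in urls if u.strip()})
    tabs = {"Pages": [], "Documents": [], "Audio": [], "Video": []}
    for url in unique:
        # Look only at the path so query strings do not hide the extension.
        ext = posixpath.splitext(urlsplit(url).path)[1]
        tabs[FILE_TABS.get(ext, "Pages")].append(url)
    return tabs


# Hypothetical scan results, for illustration only.
raw = [
    "http://example.com/About/",
    "http://example.com/about/",
    "http://example.com/files/Report.PDF",
    "http://example.com/media/intro.mp4",
]
for tab, items in clean_inventory(raw).items():
    print(tab, items)
```

Anything with an extension that is not in the mapping stays on the Pages tab, so it is worth skimming that list for file types you did not anticipate.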

Doing a content audit

When I work on an audit I create a spreadsheet with a few columns set up:

Content ID (optional). Some people like to number everything, but I don’t. Personally I find this hard to read and even harder to navigate.
Levels. Here I record the page name. Rather than rely on colored backgrounds or other reminders I use a column for each level within the site (labeling it simply as L1, L2, L3, etc.). This helps me see the relationship of pages to one another more easily than the numbering scheme used above. I usually don’t go more than a few levels deep and keep the column width small to make it easy to read.
URL. The page URL.
Document type. Here I list what template the page uses or the type of page it is.
Owner/Maintainer. Here I list who’s responsible for the content. This I usually have the client fill in.
ROT or OUCH. ROT stands for Redundant, Outdated and Trivial. It’s a bit ambiguous so lately I’ve been using Gene Smith’s suggestion of OUCH (Outdated, Unnecessary, Current, Have to Write). This you can document or have the client fill in.
Do not review. At times there is content that simply doesn’t need to be reviewed. It could be historical, time-based content (like past press releases or newsletters) that isn’t worth the time to review. You can speed up this task by flagging the content in the spreadsheet. I might take a stab at this or have the client do it.
Notes. Here I’ll record anything that I want to make note of or mention.
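To show how these columns fit together, here is a small Python sketch that builds an empty audit sheet as a CSV file. It rests on assumptions that will not hold for every site: that the URL path mirrors the site hierarchy, that the last path segment is a reasonable stand-in for the page name, and that three level columns are enough. The function and column names are illustrative only.

```python
import csv
from urllib.parse import urlsplit

MAX_LEVELS = 3  # assumed depth; the audit rarely needs more than a few levels
COLUMNS = ["L1", "L2", "L3", "URL", "Document type",
           "Owner/Maintainer", "OUCH", "Do not review", "Notes"]


def audit_row(url):
    """Build one blank audit row, placing the page name in its level column."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    row = dict.fromkeys(COLUMNS, "")
    row["URL"] = url
    if not segments:
        # Assumption: an empty path is the home page, recorded at L1.
        row["L1"] = "Home"
    else:
        # Pages deeper than MAX_LEVELS are recorded in the last level column.
        level = min(len(segments), MAX_LEVELS)
        row[f"L{level}"] = segments[-1]
    return row


def write_audit_csv(urls, out):
    """Write a header plus one blank audit row per URL to a file-like object."""
    writer = csv.DictWriter(out, fieldnames=COLUMNS)
    writer.writeheader()
    for url in urls:
        writer.writerow(audit_row(url))
```

For example, opening a file with `open("audit.csv", "w", newline="")` and passing it to `write_audit_csv` along with your list of URLs gives you a sheet where the remaining columns can be filled in by hand or by the client.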

Combining the two approaches

Sometimes it is worth combining the two methods. When I do:

I start by doing a manual audit of the site
While I’m doing the audit I run a content inventory in the background to get a list of pages and when they were possibly last modified/updated, then clean up the inventory file by removing duplicate URLs and tidying the formatting
I open both spreadsheets and copy/paste the pages that were not reviewed from the inventory into the manual audit, giving me a complete list of all the content on the site
I may have the client identify the page owner/maintainer, OUCH, which pages we may be able to ignore and which content might be relevant to particular audiences
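The copy/paste step of combining the two spreadsheets can also be sketched in Python. This is a hedged illustration only: it assumes both sheets have been loaded as lists of dictionaries with a "URL" key (for example via csv.DictReader), and the "Last modified" key and the note text are names I made up for the example.

```python
def merge_unreviewed(audit_rows, inventory_rows):
    """Return the audit rows plus any inventory pages the audit did not cover.

    URLs are compared case-insensitively; audit rows are kept unchanged.
    """
    seen = {row["URL"].strip().lower() for row in audit_rows}
    merged = list(audit_rows)
    for row in inventory_rows:
        key = row["URL"].strip().lower()
        if key in seen:
            continue  # already reviewed manually, or already added
        seen.add(key)
        merged.append({
            "URL": row["URL"],
            "Last modified": row.get("Last modified", ""),
            "Notes": "Not manually reviewed",
        })
    return merged


# Hypothetical data, for illustration only.
audit = [{"URL": "http://example.com/about/", "Notes": "Friendly tone"}]
inventory = [
    {"URL": "http://example.com/About/", "Last modified": "2009-11-02"},
    {"URL": "http://example.com/news/2008/", "Last modified": "2008-06-30"},
]
for row in merge_unreviewed(audit, inventory):
    print(row)
```

Flagging the added rows in the Notes column makes it easy to see later which pages came only from the automated scan.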

A few final tips

Make sure your scanning tool indexes all the various file extensions. Most are made for search engine optimization and as such exclude a number of extensions.
Double check for redirects. These could point to redundant pages in your inventory.
If you have content behind a login area, see if your tool will allow you to enter login credentials to capture this content. If it won’t, you’ll likely have to do it manually.

Hopefully these tips and tools help to make your content audit/inventory work a bit smoother.