Ever have one of those days where you get mentally slapped with something so obvious that you can't believe you never realized it before?
The other day I was studying one of those infamous SEO reports to diagnose a problem on a website. The report stated that the website had 600 products in its online catalog when I knew for a fact that there were only 250. So there I was, staring at this report, trying to comprehend why it was showing duplicate content for all the products. And not just duplicate; 100 of them appeared in triplicate.
ACK!
Any good SEO professional knows that you should avoid duplicate content on your website. This used to be a really big concern back when Google penalized you for repeating yourself, but as the Google system got smarter (not Terminator, or "singularity" smart, but more like Star Trek computer smart) they've relaxed those guidelines. They understand when you've repeated yourself, and they know you didn't really mean to make them waste electricity and CPU processing power to figure out what you've repeated.
In all seriousness, you're not supposed to worry about duplicate content if it's only something small, like 250 pages, but I assume they still penalize websites that spin out of control with thousands of pages of duplicate content.
Regardless, a point of order on any website, especially one with a product catalog, should be to prevent all possible duplicate pages.
The report said I had 150 products that were repeated once, and 100 products that were repeated twice. That makes 150 + 200 = 350 extra products listed on the website when there were only 250.
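If you'd like to check my math, here's the same tally as a tiny Python sketch. The numbers are the ones from the report; the variable names are just mine.

```python
# Figures quoted from the SEO report described above.
actual_products = 250
repeated_once = 150    # each of these adds 1 extra listing
repeated_twice = 100   # each of these adds 2 extra listings

extra_listings = repeated_once * 1 + repeated_twice * 2
reported_total = actual_products + extra_listings

print(extra_listings)   # 350
print(reported_total)   # 600
```

That 600 matches what the report claimed, so at least the report's arithmetic was consistent.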
Hmm... It turns out that the content management system was causing the problem. This is an interesting situation, and I wanted to detail it for those of you shopping around for a new website or considering e-commerce.
All online product catalogs today are powered by some backend database. That database controls how the items are sorted, what pages they appear on, and every little aspect of how customers navigate your website.
You shouldn't worry about how all this technical magic works. Your programmer set it all up for you, and life is good.
I have to get technical now, sorry, but I'll use a really easy to understand example (I hope).
I have this engagement ring; in fact it's a halo ring with a 0.75ct center diamond and another 0.5ctw of diamonds in the halo. It's set in 18k white gold. Can you picture that? You might even have a ring like that in your showcase. The SKU in my point of sale computer is 6789. I have this engagement ring in my database. The above would be the written-out description, but all the individual details would probably be in different fields in your product database.
Off the top of my head, I can think of 3 potential ways that a database could generate the URL of this ring if it contained that proper descriptive information.
Way Number 1: The database could organize all the jewelry into buckets. Those buckets then become part of the URL. The main bucket for this ring would be, well, "ring." The next bucket would be "engagement," then "halo," and finally "18kwg."
Now let's take those buckets and organize them as "folders" in the URL like this:
/ring/engagement/halo/18kwg/
We'd have to identify the unique product, so slap the SKU at the end of that, oh, and we need a domain too...
perosijewelers.com/ring/engagement/halo/18kwg/6789
That URL has some good SEO value built right into the file structure. Excellent, except that there are a few dozen ways a programmer could make this happen. Again, don't ask how; it's magic. But it's easy to read and looks pretty.
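For the curious, here is one rough way the bucket idea could look in Python. This is only a sketch of the concept, not how any particular content management system does it, and the function name build_bucket_url is something I made up for the example.

```python
def build_bucket_url(domain, buckets, sku):
    """Join the category buckets as URL "folders" and put the SKU last."""
    segments = [domain, *buckets, str(sku)]
    # Strip stray slashes so we never end up with "//" in the middle.
    return "/".join(segment.strip("/") for segment in segments)


url = build_bucket_url(
    "perosijewelers.com",
    ["ring", "engagement", "halo", "18kwg"],
    6789,
)
print(url)  # perosijewelers.com/ring/engagement/halo/18kwg/6789
```

Notice that the order of the buckets in the list decides the order in the URL. Hold that thought.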
Way Number 2: Some databases allow you to set a real page for every item. Well, they are not actually real, but they look like they are. That same ring might have a real-looking page like this:
perosijewelers.com/ring-engagement-halo-18K-WG-1.25ctw-6789.php
The total carat weight above (1.25ctw) may be automatically added together from the individual diamond weight.
That URL also has some good SEO value built right into it. Excellent, except I think this one looks a little messy, especially with those two periods in the file name. IMO: Yuck! But that's how some systems will really do it.
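Here is a sketch of how a system might glue that page name together, including adding up the diamond weights. The function name and the idea of passing the weights in as text are my own assumptions for the example; a real system would pull them from its own database fields.

```python
from decimal import Decimal


def build_flat_page_name(labels, diamond_weights, sku):
    """Build a page name like 'ring-...-1.25ctw-6789.php' (hypothetical helper)."""
    # Decimal keeps 0.75 + 0.5 as exactly 1.25, with no floating point fuzz.
    total = sum((Decimal(w) for w in diamond_weights), Decimal("0"))
    parts = [*labels, f"{total}ctw", str(sku)]
    return "-".join(parts) + ".php"


page = build_flat_page_name(
    ["ring", "engagement", "halo", "18K", "WG"],
    ["0.75", "0.5"],  # center diamond, then the halo diamonds
    6789,
)
print("perosijewelers.com/" + page)
# perosijewelers.com/ring-engagement-halo-18K-WG-1.25ctw-6789.php
```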
Way Number 3: Sometimes a website is built with a standard URL structure that doesn't necessarily worry about those product names or the category buckets. In that case, you just call it a product catalog with the SKU like this:
perosijewelers.com/jewelry-catalog/6789
or this:
perosijewelers.com/jewelry-catalog/index.html?sku=6789
Okay, time to get even a little more technical. I'm going to start reciting the alphabet to make this a little easier.
So far I've given you these URL structure examples:
perosijewelers.com/ring/engagement/halo/18kwg/6789
perosijewelers.com/ring-engagement-halo-18K-WG-1.25ctw-6789.php
perosijewelers.com/jewelry-catalog/index.html?sku=6789
I'm changing them to this so you can follow more easily:
perosijewelers.com/aaa/bbb/ccc/ddd/6789
perosijewelers.com/aaa-bbb-ccc-ddd-eee-6789.php
perosijewelers.com/fff/index.html?sku=6789
What do you think would happen if I scrambled the alphabet up like this?
perosijewelers.com/ccc/aaa/ddd/bbb/6789
perosijewelers.com/ccc-aaa-ddd-bbb-eee-6789.php
perosijewelers.com/fff/index.html?sku=6789
Um, I guess I couldn't scramble that last one. It's still sunny side up.
You and I might not notice the difference while using a website, just as long as it brought us to the same, correct page. From the database point of view, it doesn't care that I scrambled the letters because those letters are nothing more than a filter that shows a specific item.
HOWEVER... According to the SEO report I was reading, and from Google's point of view, the original page and the scrambled page have different URLs producing the same exact content.
In other words... duplicate content.
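To make that concrete, here is a minimal sketch of how a report might spot the problem: group every URL it finds by the SKU it points to, and flag any SKU that shows up under more than one address. I'm assuming the SKU is either the last piece of the path or a "sku" variable, as in my examples; the function names are hypothetical, and a real SEO tool almost certainly compares page content rather than just SKUs.

```python
from collections import defaultdict
from urllib.parse import parse_qs, urlsplit


def extract_sku(url):
    """Return the SKU from a '?sku=' variable, else the last path segment."""
    # urlsplit needs a scheme to separate domain and path, so add one if missing.
    parts = urlsplit(url if "://" in url else "https://" + url)
    query = parse_qs(parts.query)
    if "sku" in query:
        return query["sku"][0]
    return parts.path.rstrip("/").rsplit("/", 1)[-1]


def find_duplicates(urls):
    """Map each SKU to its URLs, keeping only SKUs with more than one URL."""
    groups = defaultdict(set)
    for url in urls:
        groups[extract_sku(url)].add(url)
    return {sku: sorted(found) for sku, found in groups.items() if len(found) > 1}


crawled = [
    "perosijewelers.com/aaa/bbb/ccc/ddd/6789",
    "perosijewelers.com/ccc/aaa/ddd/bbb/6789",
    "perosijewelers.com/aaa/bbb/ccc/ddd/1234",
]
duplicates = find_duplicates(crawled)
print(duplicates)
```

In this little sample, SKU 6789 is flagged with two addresses, while SKU 1234 (a made-up second product) is left alone because it only has one.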
This is exactly what was happening on the website I was working on. In one area of the product catalog all the products knew their ABC's like this:
perosijewelers.com/aaa/bbb/ccc/ddd/6789
and in another area of the catalog it was dyslexic like this:
perosijewelers.com/ccc/aaa/ddd/bbb/6789
Fixing it was simple. I just reprogrammed the dyslexic area of the catalog to correctly display the ABC's. In one stroke of my programming pen, I solved 250 of my duplicate issues.
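I can't show you the actual fix, but the idea behind it can be sketched in a few lines of Python: pick one official order for the buckets and force every URL into it. The BUCKET_ORDER list and the canonicalize function are names I invented for this illustration, and the sketch assumes URLs written the way I've been writing them here, without the "https://" part.

```python
# Hypothetical "official" order for the buckets.
BUCKET_ORDER = ["aaa", "bbb", "ccc", "ddd"]


def canonicalize(url):
    """Reorder the bucket folders of a 'domain/bucket/.../sku' URL."""
    domain, *buckets, sku = url.strip("/").split("/")
    rank = {name: position for position, name in enumerate(BUCKET_ORDER)}
    # Buckets we don't recognize are pushed to the end, in their original order.
    ordered = sorted(buckets, key=lambda bucket: rank.get(bucket, len(rank)))
    return "/".join([domain, *ordered, sku])


print(canonicalize("perosijewelers.com/ccc/aaa/ddd/bbb/6789"))
# perosijewelers.com/aaa/bbb/ccc/ddd/6789
```

A URL that already knows its ABC's passes through unchanged, so running every link through something like this is harmless.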
But wait, there's still more!
Remember I said that 100 of the products were in triplicate? Figuring that problem out took a little more work.
As it turns out, this website recently went through an upgrade. All the products and content were migrated from an old version of the CMS to a new version. Not a big deal and this happens all the time in the ecommerce industry. Except that the older version of the CMS used this type of URL:
perosijewelers.com/aaa/bbb/ccc/ddd/index.html?sku=6789
See the difference? The new URL does not have the "index.html?sku=" part:
perosijewelers.com/aaa/bbb/ccc/ddd/6789
Why was this a problem? Because this jeweler was also doing a fantastic job with their SEO and content building over the last two years. They had written dozens of blog posts about their products, and they had internal links from those blogs to 100 different products. All of those links were using the old URL format.
YIKES! Cleaning it up required a bit of scouring through all the blogs to find and rewrite the URLs. Next, I ran the SEO report one more time and it reported 250 products on the website without any duplicate content.
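I did that scouring mostly by hand, but a find-and-replace along these lines could take some of the pain out of it. This is only a sketch: it assumes the SKUs are all digits and that the old links look exactly like my example, and the blog snippet in it is made up.

```python
import re

# Matches the old-style tail "/index.html?sku=6789" and captures the SKU.
OLD_STYLE = re.compile(r"/index\.html\?sku=(\d+)")


def modernize_links(text):
    """Rewrite old '/index.html?sku=NNNN' links to the new '/NNNN' form."""
    return OLD_STYLE.sub(r"/\1", text)


post = 'See our <a href="perosijewelers.com/aaa/bbb/ccc/ddd/index.html?sku=6789">halo ring</a>.'
print(modernize_links(post))
# See our <a href="perosijewelers.com/aaa/bbb/ccc/ddd/6789">halo ring</a>.
```

Whatever tool you use, run the SEO report again afterward to confirm nothing was missed.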
Let me bring this all back down to Earth. This really is a pretty boring topic, and my tongue-in-cheek Nugget is meant to demonstrate the importance of having a content management system that is well designed and supported by programmers who have paid attention to extreme detail.
All those ABC's refer to how variables are fed into the database. The database doesn't care about the order, but the search engines do. Make sure all your variables always appear in the same order no matter how they are linked throughout your website.
When it comes to variables, it's not just the "ring" and "sku" that you need to think about. You also need to maintain the constant location of any variable that sorts or filters.
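The same "one official order" idea applies to the variables after the question mark. Here is a minimal sketch that simply sorts them alphabetically, so two links with the same variables always come out identical. The variable names "sort" and "metal" are hypothetical, and alphabetical is just one possible rule; any fixed order works as long as the whole website sticks to it.

```python
from urllib.parse import parse_qsl, urlencode


def canonical_query(url):
    """Return the URL with its '?name=value' variables in alphabetical order."""
    base, question_mark, query = url.partition("?")
    if not question_mark:
        return url  # no variables, nothing to reorder
    params = sorted(parse_qsl(query, keep_blank_values=True))
    return base + "?" + urlencode(params)


first = canonical_query(
    "perosijewelers.com/jewelry-catalog/index.html?sort=price&metal=18kwg&sku=6789"
)
second = canonical_query(
    "perosijewelers.com/jewelry-catalog/index.html?sku=6789&sort=price&metal=18kwg"
)
print(first)
# perosijewelers.com/jewelry-catalog/index.html?metal=18kwg&sku=6789&sort=price
print(first == second)  # True
```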
One last step in this nerdy variable discussion... Go into Google Webmaster Tools and tell it about all the website's variables that sort and filter. This will reduce the risk of accidental duplicate content.
Remember that accidental duplicate content isn't quite the devil of SEO anymore, but sites with duplicate content still don't rank as well. Every SEO has their beliefs as to why this happens; my belief is that Google punishes those sites for using more CPU time and electricity.