An Early Christmas Present: Free code to list Published Content Links
Howdy all, hope this note finds you well and recovering from Turkey overload.
Suppose a published content item at:
$pub_root/foo/foo.html
contains a link to another content item that looks like:
http://pubcontentserver/bar/bar.html
So far so good, except that during the content migration, the foo and bar directories are getting rolled into one new folder named foobar. When this happens, we’ll end up with the following files:
$pub_root/foobar/foo.html
$pub_root/foobar/bar.html
but foo.html will still contain the old link:
http://pubcontentserver/bar/bar.html
What we’d like to happen, though, is for our link to point to the new location of the file, i.e.:
http://pubcontentserver/foobar/bar.html
This particular client has a bunch of published content items (in the tens of thousands), which makes it pretty impractical to ask someone to manually go through every published content item and look for links that are going to break. So I volunteered to write a script/program to go through and identify potential problem links automatically. I have to admit I first volunteered because I thought this was going to be really easy to figure out via a recursive grep:
From $PUB_CONTENT_ROOT on filesystem
find . -type f | xargs grep -i 'publishedcontent'
Just to make me mad though, while the grep approach works, it returns too many false positives. It turns out that a lot of the out-of-the-box content items have links back to themselves, i.e., something like:
in the file $PUB_CONTENT_ROOT/foo/foo.html
http://pubcontentserver/foo/foo.html#myTarget
And Publisher is smart enough to fix these self-referencing links when you move the items around, i.e., the link above gets updated to:
in the file $PUB_CONTENT_ROOT/foobar/foo.html
http://pubcontentserver/foobar/foo.html#myTarget
What we really need is a piece of code that does the following:
1) Recurse down the published content tree and examine all the published content items
2) Check for links to other published content items in each file. This can be done by looking for a specific token, which, by default, is “publishedcontent”
3) At this point, we have a list of all the links to content items in the file. However, we need to filter out the self-referencing links that will get auto-updated by Publisher. To do this, let’s look at the path in the link, and ignore any references that point to the same directory. For example:
In the file $PUB_CONTENT_ROOT/foo/foo.html
We’ll capture the link http://pubcontentserver/bar/bar.html
But we’ll ignore the link http://pubcontentserver/foo/mystuff.html because it will be auto-updated by Publisher.
4) Dump all matches found out to a file so the links can be reviewed and fixed.
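To make the four steps concrete, here is a rough sketch of how they could fit together. This is not the LinkChecker.java mentioned in this post, just an illustration. The class name, the method names, the URL regex, and the report file name are all made up for the example, and it assumes Java 8+ plus a URL path that mirrors the directory layout under the publish root (as in the foo/bar examples above).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class LinkScanSketch {
    // Assumption: links are absolute http(s) URLs ending at whitespace, a quote, or an angle bracket.
    private static final Pattern URL_PATTERN =
            Pattern.compile("https?://[^\\s\"'<>]+", Pattern.CASE_INSENSITIVE);

    /** Directory part of a link's path: "http://host/bar/bar.html#x" gives "bar". */
    static String linkDir(String url) {
        String rest = url.substring(url.indexOf("://") + 3);
        rest = rest.split("[#?]", 2)[0];            // drop fragment and query string
        int firstSlash = rest.indexOf('/');
        int lastSlash = rest.lastIndexOf('/');
        if (firstSlash < 0 || firstSlash == lastSlash) {
            return "";                              // link points at the server root
        }
        return rest.substring(firstSlash + 1, lastSlash);
    }

    /** Steps 2 and 3: links that contain the token and point outside the file's own directory. */
    static List<String> externalLinks(String fileDir, String content, String token) {
        List<String> hits = new ArrayList<>();
        Matcher m = URL_PATTERN.matcher(content);
        while (m.find()) {
            String link = m.group();
            boolean hasToken = link.toLowerCase().contains(token.toLowerCase());
            if (hasToken && !linkDir(link).equalsIgnoreCase(fileDir)) {
                hits.add(link);                     // same-directory links are skipped
            }
        }
        return hits;
    }

    /** Steps 1 and 4: walk the tree under root and write "file -> link" lines to the report. */
    static int scan(Path root, String token, Path report) throws IOException {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(root)) {
            files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        List<String> lines = new ArrayList<>();
        for (Path file : files) {
            Path rel = root.relativize(file).getParent();
            String fileDir = rel == null ? "" : rel.toString().replace('\\', '/');
            // ISO-8859-1 decodes any byte sequence without throwing; good enough to find URLs.
            String content = new String(Files.readAllBytes(file), StandardCharsets.ISO_8859_1);
            for (String link : externalLinks(fileDir, content, token)) {
                lines.add(file + " -> " + link);
            }
        }
        Files.write(report, lines, StandardCharsets.UTF_8);
        return lines.size();
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.out.println("usage: java LinkScanSketch <publish root>");
            return;
        }
        int found = scan(Paths.get(args[0]), "publishedcontent", Paths.get("links-to-review.txt"));
        System.out.println(found + " link(s) written to links-to-review.txt");
    }
}
```

The directory comparison is the piece that cuts down the grep false positives. If your published URLs carry an extra prefix in front of the folder path, the equality check in externalLinks would need to be loosened to match.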
So I wrote some code that does the above. Go grab LinkChecker.java if you’re interested. The code is pretty simple, but it does have a few caveats which are all listed in the comments. You should be able to compile the code with a JDK 1.5+ compiler without any external dependencies:
javac com\function1\utility\LinkChecker.java
java -classpath . com.function1.utility.LinkChecker my/path/to/pubcontent_root/publish
Eventually, somebody with a little motivation could evolve this code to the next logical iteration and add functionality to auto-check the links that are grabbed. Would be pretty simple to do by adding something like:
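// Needs java.net.URL and java.net.HttpURLConnection; URLConnection itself is abstract.
HttpURLConnection myLink = (HttpURLConnection) new URL(currentLink).openConnection();
int statusCode = myLink.getResponseCode();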
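If you want to take that idea a step further, here is a minimal sketch of a status check that won't blow up on bad input. The class name, method names, the HEAD request, and the 5 second timeouts are my own assumptions for illustration, not something LinkChecker.java does today.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLConnection;

class LinkStatusSketch {
    /** Returns the HTTP status for a link, or -1 if it cannot be checked at all. */
    static int statusOf(String link) {
        try {
            URLConnection conn = new URL(link).openConnection();
            if (!(conn instanceof HttpURLConnection)) {
                return -1;                       // not an http(s) link
            }
            HttpURLConnection http = (HttpURLConnection) conn;
            http.setRequestMethod("HEAD");       // headers only; assumes the server answers HEAD
            http.setConnectTimeout(5000);        // hypothetical 5 second limits
            http.setReadTimeout(5000);
            try {
                return http.getResponseCode();
            } finally {
                http.disconnect();
            }
        } catch (IOException e) {
            return -1;                           // malformed URL, unknown host, timeout, ...
        }
    }

    /** Treats unreachable links and 4xx/5xx responses as needing a human look. */
    static boolean needsReview(int status) {
        return status < 0 || status >= 400;
    }
}
```

Feed each captured link through statusOf and only report the ones where needsReview comes back true. Keep in mind this only tells you whether a link resolves right now, so it is most useful when run after the migration, or against a test server that already has the new folder layout.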