An Early Christmas Present: Free code to list Published Content Links

By Brian Hak on December 11th, 2009

Howdy all, hope this note finds you well and recovering from Turkey overload.

Warning: This is a pretty specific discussion of a pretty specific situation that occurs when you’re moving a bunch of stuff around in Publisher. If you’re not interested in Publisher and its idiosyncrasies, you should save your time and skip this post.
Anyhow, with the holiday season upon us and all, I figured I’d share the source code for a little Publisher utility I recently put together. I’m working with a client who is in the process of a major Publisher update: they’re re-organizing their Publisher content hierarchy to operate more efficiently. This is all fine and dandy, except for the fact that many of their content items link to other content items. As such, when they re-organize their content hierarchy, a bunch of embedded links are going to break :( For example, consider the following:
The file:


$pub_root/foo/foo.html


contains a link to another content item that looks like:


http://pubcontentserver/bar/bar.html


So far so good, except that during the content migration, the foo and bar directories are getting rolled into one new folder named foobar.  When this happens, we’ll end up with the following files:

$pub_root/foobar/foo.html

$pub_root/foobar/bar.html


Unfortunately, Publisher isn’t smart enough to fix embedded links for you when you move stuff around within the hierarchy.  The net result is that our link in foo.html is out of date and pointing to a file that doesn’t exist:

http://pubcontentserver/bar/bar.html


What we’d like to happen, though, is for our link to point to the new location of the file, i.e.:


http://pubcontentserver/foobar/bar.html

This particular client has a bunch of published content items (in the tens of thousands), which makes it pretty impractical to ask someone to manually go through every published content item and look for links that are going to break. So I volunteered to write a script/program to go through and identify potential problem links automatically. I have to admit I first volunteered because I thought this was going to be really easy to figure out via a recursive grep:

From $PUB_CONTENT_ROOT on filesystem

find . -type f | xargs grep -i 'publishedcontent'


Just to make me mad though, while the grep approach works, it returns too many false positives. It turns out that a lot of the out-of-the-box content items have links back to themselves, i.e. something like:


in the file $PUB_CONTENT_ROOT/foo/foo.html

http://pubcontentserver/foo/foo.html#myTarget


And Publisher is smart enough to fix these self-referencing links when you move the items around, i.e. the link above gets updated to


in the file $PUB_CONTENT_ROOT/foobar/foo.html
http://pubcontentserver/foobar/foo.html#myTarget


What we really need is a piece of code that does the following:

1) Recurse down the published content tree and examine all the published content items
2) Check for links to other published content items in each file.  This can be done by looking for a specific token, which by default, is “publishedcontent”
3) At this point, we have a list of all the links to content items in the file.  However, we need to filter out the self-referencing links that will get auto-updated by Publisher.  To do this, let’s look at the path in the link, and ignore any references that point to the same directory.  For example:

In the file $PUB_CONTENT_ROOT/foo/foo.html
We’ll capture the link http://pubcontentserver/bar/bar.html
But we’ll ignore the link http://pubcontentserver/foo/mystuff.html because it will be auto-updated by Publisher.

4) Dump all matches found out to a file so the links can be reviewed and fixed.
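
To make the four steps concrete, here is a rough sketch of how they could fit together. To be clear, this is not the actual LinkChecker.java: the class and method names are made up for illustration, it assumes the path in a link mirrors the directory layout under the content root (as in the foo/bar examples), and it uses Java 8 APIs rather than anything JDK 1.5 could compile.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch; names and structure are assumptions, not the real utility.
class LinkFilterSketch {
    // Loose pattern for absolute http(s) URLs: stops at whitespace, quotes, and angle brackets.
    private static final Pattern URL_PATTERN = Pattern.compile("https?://[^\\s\"'<>]+");

    /** Directory part of a link's path, e.g. "http://host/foo/foo.html#x" gives "/foo". */
    static String linkDirectory(String link) {
        int hostStart = link.indexOf("://") + 3;
        int pathStart = link.indexOf('/', hostStart);
        if (pathStart < 0) {
            return ""; // bare host, no path
        }
        // Drop any query string or #fragment before looking for the last slash.
        String path = link.substring(pathStart).split("[?#]", 2)[0];
        return path.substring(0, path.lastIndexOf('/'));
    }

    /** Steps 2 and 3: links in text that contain the token and point outside fileDir. */
    static List<String> crossDirectoryLinks(String fileDir, String text, String token) {
        List<String> hits = new ArrayList<>();
        String needle = token.toLowerCase();
        Matcher m = URL_PATTERN.matcher(text);
        while (m.find()) {
            String link = m.group();
            boolean isContentLink = link.toLowerCase().contains(needle);
            if (isContentLink && !linkDirectory(link).equals(fileDir)) {
                hits.add(link);
            }
        }
        return hits;
    }

    /** Steps 1 and 4: walk the tree and write "file -> link" lines to reportFile. */
    static void report(Path root, String token, Path reportFile) throws IOException {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(root)) {
            files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        List<String> lines = new ArrayList<>();
        for (Path file : files) {
            // Directory of the file relative to the content root, in URL form ("/foo").
            Path parent = root.relativize(file).getParent();
            String fileDir = parent == null ? "" : "/" + parent.toString().replace('\\', '/');
            // ISO-8859-1 maps every byte to a char, so odd encodings cannot abort the scan.
            String text = new String(Files.readAllBytes(file), StandardCharsets.ISO_8859_1);
            for (String link : crossDirectoryLinks(fileDir, text, token)) {
                lines.add(file + " -> " + link);
            }
        }
        Files.write(reportFile, lines, StandardCharsets.UTF_8);
    }
}
```

With this sketch, calling crossDirectoryLinks("/foo", ...) on the contents of foo.html with "pubcontentserver" as the token would keep the bar/bar.html link and skip foo/foo.html#myTarget. The same-directory test is deliberately crude: it compares directory strings exactly, so it is only as good as the assumption that link paths and file paths line up.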


So I wrote some code that does the above.  Go grab LinkChecker.java if you’re interested.  The code is pretty simple, but it does have a few caveats which are all listed in the comments.  You should be able to compile the code with a JDK 1.5+ compiler without any external dependencies:

javac com\function1\utility\LinkChecker.java


And then run it pretty easily too:

java -classpath . com.function1.utility.LinkChecker my/path/to/pubcontent_root/publish


Note that the code actually generates two output files: published_content_links.log and external_links.log. published_content_links.log is the list of all the links and files that fall into the scenario outlined above. external_links.log just lists all links in content items that point to something other than published content (i.e. links to external websites and other portal links).


Eventually, somebody with a little motivation could evolve this code to the next logical iteration and add functionality to auto-check the links that are grabbed.  Would be pretty simple to do by adding something like:

// assumes currentLink is a String holding an absolute http URL
HttpURLConnection myLink = (HttpURLConnection) new URL(currentLink).openConnection();
int statusCode = myLink.getResponseCode();
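
Fleshed out a little, that idea might look like the sketch below. The class and method names are invented for illustration, the 5 second timeouts are arbitrary, and it assumes the server answers HEAD requests the same way it answers GET, which is not guaranteed.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLConnection;

// Hypothetical sketch; not part of LinkChecker.java.
class LinkStatusSketch {
    /** Returns the HTTP status code for the link, or -1 if it could not be checked at all. */
    static int statusOf(String link) {
        try {
            URLConnection conn = new URL(link).openConnection();
            if (!(conn instanceof HttpURLConnection)) {
                return -1; // not an http(s) link
            }
            HttpURLConnection http = (HttpURLConnection) conn;
            http.setRequestMethod("HEAD"); // ask for the status only, skip the body
            http.setConnectTimeout(5000);
            http.setReadTimeout(5000);
            try {
                return http.getResponseCode();
            } finally {
                http.disconnect();
            }
        } catch (IOException e) {
            return -1; // malformed URL, unknown host, timeout, refused connection, ...
        }
    }

    /** Treats unreachable links and 4xx/5xx responses as candidates for review. */
    static boolean looksBroken(int status) {
        return status < 0 || status >= 400;
    }
}
```

Each captured link would go through statusOf, and anything where looksBroken returns true would get flagged in the log.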

But then you have to think about faking portal authentication and all that other good stuff that I’m not really up for tonight. Anyhow, hope this little utility comes in handy for at least somebody out there on the internet. Enjoy the early (admittedly pretty lame) Christmas present from your friends at Function1.