Investigating Sites Tables in R


A very common situation is being asked to do some analysis on a WebCenter Sites installation without having direct access to the system.

In this post we will show how you can take simple Sites Explorer or Catalog Mover exports and interrogate them in R.

If you are not familiar with R, you can find more information at https://www.r-project.org/.  Essentially it is a statistical programming language that lets you do many of the things you would do in Excel, but programmatically.
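The examples below use a few add-on packages (XML, dplyr and ggplot2). If they are not already installed, a one-time install from CRAN takes care of that:

# install the packages used in this post (only needed once)
install.packages(c("XML", "dplyr", "ggplot2"))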

First we load up the XML library to process the HTML files:


library(XML)


Now let's open the AssetPublication table from the zip file and read the data into a data frame. A data frame is one of the basic building blocks of R and is similar to a worksheet in Excel, with rows and columns.
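As a tiny illustration (the values here are just made up for the example), a data frame can be built by hand like this:

# a small hand-built data frame with two columns and two rows
df <- data.frame(assettype = c("Page", "Template"), n = c(30, 139))
df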


# create an object pointing to the file in the zip we want to process
f <- unz("AssetPublication.zip", "AssetPublication.html")

# read all of the text and pass it to the readHTMLTable function
t <- readHTMLTable(readLines(f))

# The output of readHTMLTable will be a list with each table in the file as an element.  We want the AssetPublication table
assetPublication <- t$AssetPublication

close(f)


head(assetPublication)
##              id         pubid assettype       assetid
## 1             1 1112198287026   Media_A 1112195133863
## 2            10 1112198287026   Media_A 1112195133923
## 3            11 1112198287026   Media_A 1112195133928
## 4 1112192431361 1112198287026 Content_A 1112192431358
## 5 1112192431366 1112198287026 Content_A 1112192431363
## 6 1112192431371 1112198287026 Content_A 1112192431368
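Before going any further it is worth a quick sanity check on what was read in; the exact output will depend on your own export:

# how many rows were read, and what type each column ended up as
nrow(assetPublication)
str(assetPublication)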


So now we have the AssetPublication table. Let's have a look at the asset counts.

table(assetPublication$assettype)
## 
##          AdvCols  ArticleCategory        AttrTypes       AVIArticle 
##               15                6               13              136 
##         AVIImage        Content_A        Content_C       Content_CD 
##              249               13               13                1 
##        Content_F        Content_P       Content_PD ContentAttribute 
##                2                1                1               24 
##       ContentDef ContentParentDef        CSElement           Device 
##                3                1               46               25 
##      DeviceGroup        Dimension     DimensionSet       Document_A 
##               11                4                1               10 
##       Document_C      Document_CD       Document_F       Document_P 
##               35                1                3                4 
##      Document_PD      FSIIVisitor  FSIIVisitorAttr   FSIIVisitorDef 
##                1                2               11                1 
##   FW_Application          FW_View    ImageCategory          Media_A 
##                5                5               11               11 
##          Media_C         Media_CD          Media_F          Media_P 
##               27                1                3                3 
##         Media_PD             Page    PageAttribute   PageDefinition 
##                1               30               18                9 
##        Product_A        Product_C       Product_CD        Product_F 
##               15               17                1                4 
##        Product_P       Product_PD       Promotions       ScalarVals 
##               20                4                1               12 
##         Segments        SiteEntry         SitePlan            Slots 
##                3                9                4               17 
##       StyleSheet         Template          WebRoot          YouTube 
##               10              139                3                5


Well, that isn't very pretty. We can use the dplyr package, which gives us some easy-to-use functions for processing the data.

Let's find the top 10 asset types in this table:

library(dplyr)
assetPublication %>% count(assettype, sort=TRUE) %>% top_n(10)
## Source: local data frame [10 x 2]
## 
##           assettype     n
##              (fctr) (int)
## 1          AVIImage   249
## 2          Template   139
## 3        AVIArticle   136
## 4         CSElement    46
## 5        Document_C    35
## 6              Page    30
## 7           Media_C    27
## 8            Device    25
## 9  ContentAttribute    24
## 10        Product_P    20


What if we want to break that down by site? For that we need to pull in the site names from the Publication table.

publication <-  readHTMLTable(readLines(unz("Publication.zip", "Publication.html")))$Publication
assetPublicationWithNames <- assetPublication %>% inner_join(publication, c("pubid"="id"))
assetPublicationWithNames %>% group_by(name, assettype) %>% tally(sort = TRUE) %>% top_n(5)
## Source: local data frame [12 x 3]
## Groups: name [3]
## 
##           name        assettype     n
##         (fctr)           (fctr) (int)
## 1    AdminSite   FW_Application     5
## 2    AdminSite          FW_View     5
## 3    avisports         AVIImage   249
## 4    avisports       AVIArticle   136
## 5    avisports         Template    91
## 6    avisports ContentAttribute    24
## 7    avisports        CSElement    21
## 8  FirstSiteII         Template    48
## 9  FirstSiteII       Document_C    35
## 10 FirstSiteII          Media_C    27
## 11 FirstSiteII        CSElement    25
## 12 FirstSiteII        Product_P    20
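If you are only interested in a single site you can also filter before counting; the site name here is taken from the output above:

# asset type counts for the avisports site only
assetPublicationWithNames %>%
  filter(name == "avisports") %>%
  count(assettype, sort = TRUE)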

As one final example, let's look at the number of assets per site.

aps <- assetPublicationWithNames %>% group_by(name) %>% tally() %>% arrange(n)
pie(aps$n, labels = aps$name, col = rainbow(length(aps$n)))

Or as a bar chart:

library(ggplot2)
ggplot(aps, aes(x = name, y = n, fill = name)) +
  geom_bar(stat = "identity") +
  theme_bw(base_size = 20) +
  ylab("Number of Assets") +
  theme(axis.title.x = element_blank()) +
  guides(fill = guide_legend(title = NULL))
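If you want to keep the chart, ggplot2's ggsave() writes the most recent plot to a file; the file name and dimensions below are just an example:

# save the last plot to a PNG (width and height in inches)
ggsave("assets_per_site.png", width = 8, height = 6)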

