Friday, 12 September 2014

Configure WebSphere Commerce crawler to crawl static HTML files

High-level description

  • The implementation below is based on the OOB configuration, but it changes the default crawling page to another static content page (staticcontentindex.html), which the customer can update manually and upload to the server to define the list of static files for the crawler to manage.
  • The configuration is not required on production. Crawling and indexing happen on staging and the index is propagated to production (as usual).
  • It is assumed SEO is configured on WebSphere Commerce so static content HTML pages are accessed using /info/, e.g. http://<hostname>/info/contactus
  • It is assumed the web server is configured so all static HTML files are located under /opt/webserver/assets/infodocs. For example, if you try to access http://<hostname>/info/contactus, the following file is expected to be available on the HTTP server: /opt/webserver/assets/infodocs/contactus.html
  • The customization done to Commerce to display static HTML files is not included as part of this article.
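To make the flow above concrete, staticcontentindex.html is just an ordinary page listing links to every static page the crawler should visit. A minimal hypothetical sample (the hostname and page names are placeholders; URLs follow the /info/ SEO pattern assumed above):

```html
<!-- Hypothetical staticcontentindex.html: one link per static page the
     crawler should pick up; URLs use the /info/ SEO pattern -->
<html>
  <body>
    <a href="http://store.example.com/info/contactus">Contact Us</a>
    <a href="http://store.example.com/info/shipping">Shipping</a>
    <a href="http://store.example.com/info/returns">Returns</a>
  </body>
</html>
```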

For the case where Solr is local to the Commerce box (e.g. UAT/Staging)

  • Update droidConfig.xml (environment specific) to change the following entries (file attached as reference)
    • hostname
    • storePathDirectory. I don’t like the default storePathDirectory as it copies crawled data into subfolders under it, so feel free to define another location that makes more sense if you wish to.
    • Add a new var called solrHostname (in most environments the value will be identical to hostname)
    • Add a new var called solrPort
    • Change location to the following value: http://${hostname}/info/staticcontentindex
    • Change relative path to an empty string. This makes sure all crawler links have a full URL and avoids awkward issues that might pop up with relative ones. Just make sure the hostname used is one that customers can access externally. More on this might come in future posts.
    • Set autoIndex enable="true" and set the URL as shown below.
http://${solrHostname}:${solrPort}/solr/MC_${catalogId}_CatalogEntry_Unstructured_${localename}/webdataimport?command=full-import&amp;storeId=${storeId}&amp;basePath=
  • Use the attached filters.txt instead of the default one; it allows crawling of static HTML files and ignores all others. Because your static files might include your mega menu, the crawler would otherwise attempt to crawl the entire store, so you need to make sure the rules defined in filters.txt prohibit this from happening. In the attached sample file, I included two stop rules, -.*(search).* and -.*(category).*, because my SEO uses both of them and I need to make sure they are filtered out. As a consequence, you need to avoid using the same SEO patterns as folder names for your static content.
  • Copy staticcontentindex.html to /opt/webserver/assets/infodocs. It is provided only as a sample for testing purposes, so don’t replace it in case it already exists on the server.
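Putting the droidConfig.xml changes together, the edited entries might look along these lines. This is only an illustrative sketch: follow the element layout of the attached OOB file, and treat every hostname, port and path below as an environment-specific example, not a prescribed value.

```xml
<!-- Illustrative sketch only; mirror the layout of the attached OOB
     droidConfig.xml. All values are environment-specific examples. -->
<var name="hostname" value="store.example.com"/>
<var name="solrHostname" value="store.example.com"/>   <!-- new var -->
<var name="solrPort" value="3737"/>                    <!-- new var -->
<var name="storePathDirectory" value="/opt/crawler/storage"/>
<location>http://${hostname}/info/staticcontentindex</location>
<relativepath></relativepath>
<autoIndex enable="true"
  url="http://${solrHostname}:${solrPort}/solr/MC_${catalogId}_CatalogEntry_Unstructured_${localename}/webdataimport?command=full-import&amp;storeId=${storeId}&amp;basePath="/>
```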

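To sanity-check your stop rules before a full crawl, the effect of the two sample filters.txt rules (-.*(search).* and -.*(category).*) can be mimicked with a small hypothetical shell helper; the function name and test URLs are mine, not part of the product:

```shell
#!/bin/sh
# Hypothetical check mimicking the two stop rules from the sample
# filters.txt: any URL matching "search" or "category" is skipped,
# everything else would be crawled.
should_crawl() {
  url="$1"
  if printf '%s\n' "$url" | grep -Eq '(search|category)'; then
    echo "skip: $url"
  else
    echo "crawl: $url"
  fi
}

should_crawl "http://store.example.com/info/contactus"
should_crawl "http://store.example.com/search/shoes"
should_crawl "http://store.example.com/category/mens-shirts"
```

Run it against a handful of real URLs from your store; anything reported as "crawl" that points at catalog pages means your filters need another stop rule.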
For the case where the Solr server is remote to the Commerce server


Database

  • In table SRCHCONFEXT, for INDEXSUBTYPE WebContent, make sure the CONFIG column has something similar to the line below, where storePathDirectory is as defined in droidConfig.xml
BasePath=<storePathDirectory>\StaticContent\en_US\,SearchServerPort=<solrPort>,SearchServerName=<solrHostname>,StoreId=<storeId>
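To verify or seed that row, a query along these lines can be used. Table and column names come from the description above; the literal host, port, store ID and path are examples only, and BasePath must match your storePathDirectory:

```sql
-- Inspect the current WebContent crawler configuration
SELECT CONFIG FROM SRCHCONFEXT WHERE INDEXSUBTYPE = 'WebContent';

-- Example update; adjust path, port, host and store ID for your environment
UPDATE SRCHCONFEXT
   SET CONFIG = 'BasePath=/opt/crawler/storage\StaticContent\en_US\,SearchServerPort=3737,SearchServerName=solrhost.example.com,StoreId=10001'
 WHERE INDEXSUBTYPE = 'WebContent';
```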

Crontab

  • We need to add crawler.sh to crontab and define an appropriate schedule for it.
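As a hypothetical example, a nightly 01:30 run could look like the entry below. The crawler.sh location assumes the Commerce bin directory used in the Testing section, and the log path is my own choice; adjust both for your install:

```
# Run the static-content crawler every night at 01:30; log output for review
30 1 * * * /usr/WebSphere/CommerceServer70/bin/crawler.sh -cfg /usr/WebSphere/AppServer70/profiles/search/solr/home/droidConfig.xml -instance <instancename> >> /tmp/crawler.log 2>&1
```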

Testing

  • In case Solr and Commerce are on the same box, change directory to the Commerce server bin directory and execute the command shown below. In case Solr is on a separate box, run crawler.sh from the Solr box instead of the Commerce box.
./crawler.sh -cfg /usr/WebSphere/AppServer70/profiles/search/solr/home/droidConfig.xml -instance <instancename>
  • Verify that crawler.sh completed successfully without errors
  • Verify the indexing status using the following link
http://<solrHostname>:<solrPort>/solr/MC_10001_CatalogEntry_Unstructured_en_US/webdataimport?command=status
  • Run the index update for WebContent as shown below and make sure it completes successfully without errors
/usr/WebSphere/CommerceServer70/bin/di-buildindex.sh -instance <instanceName> -masterCatalogId <masterCatalogId> -indexSubType WebContent -dbuser <user> -dbuserpwd <password> -fullbuild true -statusInterval 10000 -localename en_US


More considerations

Please check my newer post for crawler configuration tips you should take into consideration.

