Friday, 12 September 2014

Configure WebSphere Commerce crawler to crawl static HTML files

High-level description

  • The implementation below is based on the OOB configuration, but it changes the default crawling page to another static content page (staticcontentindex.html), which the customer can update manually and upload to the server to define the list of static files for the crawler to manage.
  • The configuration is not required on production. Crawling and indexing happen on staging and the index is propagated to production (as usual).
  • It is assumed SEO is configured on WebSphere Commerce so static content HTML pages are accessed using /info/, e.g. http://<hostname>/info/contactus
  • It is assumed the web server is configured so all static HTML files are located under /opt/webserver/assets/infodocs. For example, if you try to access http://<hostname>/info/contactus, the following file is expected to be available on the HTTP server: /opt/webserver/assets/infodocs/contactus.html
  • The customization done to Commerce to display static HTML files is not included as part of this article.
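To make the flow above concrete, staticcontentindex.html is just an ordinary page listing links to every static page the crawler should visit. A minimal hypothetical sample (the hostname and page names are placeholders; URLs follow the /info/ SEO pattern assumed above):

```html
<!-- Hypothetical staticcontentindex.html: one link per static page the
     crawler should pick up; URLs use the /info/ SEO pattern -->
<html>
  <body>
    <a href="http://store.example.com/info/contactus">Contact Us</a>
    <a href="http://store.example.com/info/shipping">Shipping</a>
    <a href="http://store.example.com/info/returns">Returns</a>
  </body>
</html>
```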

For the case where Solr is local to the Commerce box (e.g. UAT/Staging)

  • Update droidConfig.xml (environment specific) to change the following entries (file attached as reference)
    • hostname
    • storePathDirectory. I don’t like the default storePathDirectory as it copies crawled data into subfolders under it, so feel free to define another location that makes more sense if you wish to.
    • Add a new var called solrHostname (in most environments the value will be identical to hostname)
    • Add a new var called solrPort
    • Change location to the following value: http://${hostname}/info/staticcontentindex
    • Change relative path to an empty string. This makes sure all crawler links have a full URL and avoids awkward issues that might pop up with relative ones. Just make sure the hostname used is one that customers can access externally. More on this might come in future posts.
    • Set autoIndex enable="true" and set the URL as shown below.
http://${solrHostname}:${solrPort}/solr/MC_${catalogId}_CatalogEntry_Unstructured_${localename}/webdataimport?command=full-import&amp;storeId=${storeId}&amp;basePath=
  • Use the attached filters.txt instead of the default one; it allows crawling of static HTML files and ignores all others. Because your static files might include your mega menu, the crawler would otherwise attempt to crawl the entire store, so you need to make sure the rules defined in filters.txt prohibit this from happening. In the attached sample file, I included two stop rules, -.*(search).* and -.*(category).*, because my SEO uses both of them and I need to make sure they are filtered out. As a consequence, you need to avoid using the same SEO patterns as folder names for your static content.
  • Copy staticcontentindex.html to /opt/webserver/assets/infodocs. It is provided only as a sample for testing purposes, so don’t replace it in case it already exists on the server.
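Putting the droidConfig.xml changes together, the edited entries might look along these lines. This is only an illustrative sketch: follow the element layout of the attached OOB file, and treat every hostname, port and path below as an environment-specific example, not a prescribed value.

```xml
<!-- Illustrative sketch only; mirror the layout of the attached OOB
     droidConfig.xml. All values are environment-specific examples. -->
<var name="hostname" value="store.example.com"/>
<var name="solrHostname" value="store.example.com"/>   <!-- new var -->
<var name="solrPort" value="3737"/>                    <!-- new var -->
<var name="storePathDirectory" value="/opt/crawler/storage"/>
<location>http://${hostname}/info/staticcontentindex</location>
<relativepath></relativepath>
<autoIndex enable="true"
  url="http://${solrHostname}:${solrPort}/solr/MC_${catalogId}_CatalogEntry_Unstructured_${localename}/webdataimport?command=full-import&amp;storeId=${storeId}&amp;basePath="/>
```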

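To sanity-check your stop rules before a full crawl, the effect of the two sample filters.txt rules (-.*(search).* and -.*(category).*) can be mimicked with a small hypothetical shell helper; the function name and test URLs are mine, not part of the product:

```shell
#!/bin/sh
# Hypothetical check mimicking the two stop rules from the sample
# filters.txt: any URL matching "search" or "category" is skipped,
# everything else would be crawled.
should_crawl() {
  url="$1"
  if printf '%s\n' "$url" | grep -Eq '(search|category)'; then
    echo "skip: $url"
  else
    echo "crawl: $url"
  fi
}

should_crawl "http://store.example.com/info/contactus"
should_crawl "http://store.example.com/search/shoes"
should_crawl "http://store.example.com/category/mens-shirts"
```

Run it against a handful of real URLs from your store; anything reported as "crawl" that points at catalog pages means your filters need another stop rule.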
For the case where the Solr server is remote to the Commerce server


Database

  • In table SRCHCONFEXT, for INDEXSUBTYPE WebContent, make sure the CONFIG column has something similar to the line below, where storePathDirectory is as defined in droidConfig.xml
BasePath=<storePathDirectory>\StaticContent\en_US\,SearchServerPort=<solrPort>,SearchServerName=<solrHostname>,StoreId=<storeId>
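To verify or seed that row, a query along these lines can be used. Table and column names come from the description above; the literal host, port, store ID and path are examples only, and BasePath must match your storePathDirectory:

```sql
-- Inspect the current WebContent crawler configuration
SELECT CONFIG FROM SRCHCONFEXT WHERE INDEXSUBTYPE = 'WebContent';

-- Example update; adjust path, port, host and store ID for your environment
UPDATE SRCHCONFEXT
   SET CONFIG = 'BasePath=/opt/crawler/storage\StaticContent\en_US\,SearchServerPort=3737,SearchServerName=solrhost.example.com,StoreId=10001'
 WHERE INDEXSUBTYPE = 'WebContent';
```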

Crontab

  • We need to add crawler.sh to crontab and define an appropriate schedule for it.
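As a hypothetical example, a nightly 01:30 run could look like the entry below. The crawler.sh location assumes the Commerce bin directory used in the Testing section, and the log path is my own choice; adjust both for your install:

```
# Run the static-content crawler every night at 01:30; log output for review
30 1 * * * /usr/WebSphere/CommerceServer70/bin/crawler.sh -cfg /usr/WebSphere/AppServer70/profiles/search/solr/home/droidConfig.xml -instance <instancename> >> /tmp/crawler.log 2>&1
```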

Testing

  • In case Solr and Commerce are on the same box, change directory to the Commerce server bin directory and execute the command shown below. In case Solr is on a separate box, run crawler.sh from the Solr box instead of the Commerce box.
./crawler.sh -cfg /usr/WebSphere/AppServer70/profiles/search/solr/home/droidConfig.xml -instance <instancename>
  • Verify that crawler.sh completed successfully without errors
  • Verify the indexing status using the following link
http://<solrHostname>:<solrPort>/solr/MC_10001_CatalogEntry_Unstructured_en_US/webdataimport?command=status
  • Run the index update for WebContent as shown below and make sure it completes successfully without errors
/usr/WebSphere/CommerceServer70/bin/di-buildindex.sh -instance <instanceName> -masterCatalogId <masterCatalogId> -indexSubType WebContent -dbuser <user> -dbuserpwd <password> -fullbuild true -statusInterval 10000 -localename en_US


More considerations

Please check my newer post for crawler configuration tips you should take into consideration.

