Saturday, 28 May 2011

OmniFind: Dump the documents collected by a seedlist crawler

  • Create a file named seedlistcrawler_ext.xml with the following content:
<?xml version="1.0" encoding="UTF-8"?>
<ExtendedProperties>
  <AppendChild Xpath="/Crawler/DataSources/Server" Name="HttpTraceSeedlist">/tmp/seed.log</AppendChild>
</ExtendedProperties>
  • Put the file into ES_NODE_ROOT/master_config/<Crawler ID>/ 
  • Restart the crawler session, and then perform a full crawl.
  • All seedlist-related HTTP activity should now be logged to the specified file (/tmp/seed.log). The log will contain dumps of all pages.
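
The first two steps above can also be scripted. Below is a minimal Python sketch, not an official OmniFind tool: the function names (build_ext_xml, install_ext_file) and the idea of passing the node root and crawler ID as arguments are my own assumptions for illustration. It only writes the same XML shown above into ES_NODE_ROOT/master_config/<Crawler ID>/; restarting the crawler session and starting the full crawl still have to be done by hand.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

EXT_FILE_NAME = "seedlistcrawler_ext.xml"


def build_ext_xml(trace_path: str) -> bytes:
    """Build the extended-properties XML that points the trace at trace_path."""
    root = ET.Element("ExtendedProperties")
    child = ET.SubElement(
        root,
        "AppendChild",
        {"Xpath": "/Crawler/DataSources/Server", "Name": "HttpTraceSeedlist"},
    )
    child.text = trace_path
    # Prepend the declaration by hand so the output matches the file above.
    declaration = b'<?xml version="1.0" encoding="UTF-8"?>\n'
    return declaration + ET.tostring(root, encoding="utf-8")


def install_ext_file(node_root: str, crawler_id: str,
                     trace_path: str = "/tmp/seed.log") -> Path:
    """Write the file into <node_root>/master_config/<crawler_id>/.

    node_root is assumed to be the value of ES_NODE_ROOT, and crawler_id
    the ID of an existing crawler; both are supplied by the caller.
    """
    target_dir = Path(node_root) / "master_config" / crawler_id
    if not target_dir.is_dir():
        raise FileNotFoundError(f"crawler config directory not found: {target_dir}")
    target = target_dir / EXT_FILE_NAME
    target.write_bytes(build_ext_xml(trace_path))
    return target
```

To use it, call install_ext_file with the value of ES_NODE_ROOT on your system and the ID of your crawler. The function refuses to create the crawler directory itself, so a mistyped crawler ID fails loudly instead of silently writing the file somewhere the crawler never reads.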
