Setting up a Content Crawler Job to Retrieve Content from Specific Directories

The following guide describes how to set up a crawler job that retrieves content from specific directories using Conductor.

  1. Open the Conductor UI (for example, at http://localhost:8890/conductor ).
  2. Enter dba credentials.
  3. Go to "Web Application Server".



  4. Go to "Content Imports".



  5. Click "New Target".



  6. In the form that appears, set the following fields:
    • "Crawl Job Name":

      Gov.UK data

    • "Data Source Address (URL)":

      http://source.data.gov.uk/data/

    • "Local WebDAV Identifier" for an available user, for example demo:

      /DAV/home/demo/gov.uk/

    • From the "Local resources owner" list, choose a user, for example demo ;



    • Click the "Create" button.
  7. As a result, the Robot target will be created.
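The four form fields above can be thought of as one job definition. The following is an illustrative sketch only — the CrawlJob class, its field names, and the validate method are hypothetical and are not part of Virtuoso's API; the trailing "/" and the per-user /DAV/home/<owner>/ prefix are conventions taken from the example values above, not enforced rules:

```python
from dataclasses import dataclass

@dataclass
class CrawlJob:
    name: str          # "Crawl Job Name"
    source_url: str    # "Data Source Address (URL)"
    webdav_path: str   # "Local WebDAV Identifier"
    owner: str         # "Local resources owner"

    def validate(self):
        # Conventions observed in the example values above (an assumption,
        # not a documented requirement): directory URLs end with "/" and
        # the WebDAV path sits under the owner's DAV home collection.
        assert self.source_url.endswith("/")
        assert self.webdav_path.startswith(f"/DAV/home/{self.owner}/")

job = CrawlJob(
    name="Gov.UK data",
    source_url="http://source.data.gov.uk/data/",
    webdav_path="/DAV/home/demo/gov.uk/",
    owner="demo",
)
job.validate()
```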



  8. Click "Import Queues".



  9. For the Robot target labeled "Gov.UK data", click "Run".
  10. As a result, the status of the pages will be shown: retrieved, pending, or waiting.
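Those three states can be read as a simple lifecycle: a page starts as waiting in the import queue, becomes pending while it is being fetched, and ends as retrieved. The following sketch models only that bookkeeping (it is not Virtuoso's internal implementation and performs no network I/O):

```python
from collections import deque

def crawl(urls):
    """Simulate the queue states shown in Conductor's import queue:
    waiting -> pending (being fetched) -> retrieved."""
    status = {u: "waiting" for u in urls}
    queue = deque(urls)
    while queue:
        url = queue.popleft()
        status[url] = "pending"    # a crawler thread picked the page up
        # ... the actual HTTP fetch would happen here ...
        status[url] = "retrieved"  # download finished
    return status

result = crawl([
    "http://source.data.gov.uk/data/a",
    "http://source.data.gov.uk/data/b",
])
```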



  11. Click "Retrieved Sites".
  12. As a result, the total number of pages retrieved will be shown.



  13. Go to "Web Application Server" -> "Content Management".
  14. Enter path:

    DAV/home/demo/gov.uk





  15. Go to path:

    DAV/home/demo/gov.uk/data

    As a result, the retrieved content will be shown.
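The crawled content appears under DAV/home/demo/gov.uk/data because the crawler mirrors each page's URL path beneath the WebDAV collection. A minimal sketch of that mapping, assuming the path-mirroring rule holds (the rule, the helper name webdav_location, and the index.html example are illustrative, not documented Virtuoso behavior):

```python
from urllib.parse import urlparse

def webdav_location(page_url, dav_root="/DAV/home/demo/gov.uk/"):
    """Map a crawled URL to its mirrored WebDAV path, assuming the
    crawler reproduces the URL path under the DAV collection."""
    return dav_root.rstrip("/") + urlparse(page_url).path

# Hypothetical page under the crawl target from step 6:
path = webdav_location("http://source.data.gov.uk/data/index.html")
# -> "/DAV/home/demo/gov.uk/data/index.html"
```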


