$Id: DESIGN 4349 2008-07-15 20:30:04Z yuri $ 

- DAG talker [yuri doing this]
  - CRON
  - runs on lander*, as lander user
  - takes data from DAG to ramdisk
  - 1 per LANDER trace box
  - 27-Jan-05: needs testing?  nfs throughput issues?
    *** may need to come back and improve performance to handle attacks
  - 27-Jan-05: merged dag talker and dag mover; dag talker writes directly
    to NFS
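Since the merge, the talker both drains the DAG card and lands finished files on NFS. A minimal sketch of that hand-off, assuming a copy-then-rename pattern so NFS readers never see a half-written file (paths and naming here are illustrative, not the real LANDER layout):

```python
import os
import shutil

def move_capture(src_path, nfs_dir):
    """Copy a finished DAG capture file to NFS, then remove the original.

    Copy to a hidden ".part" name first and rename into place; the rename
    is atomic within the NFS directory, so downstream scanners only ever
    see complete files.  (Illustrative sketch, not the LANDER code.)
    """
    base = os.path.basename(src_path)
    tmp = os.path.join(nfs_dir, "." + base + ".part")
    dst = os.path.join(nfs_dir, base)
    shutil.copyfile(src_path, tmp)
    os.rename(tmp, dst)   # atomic within the same filesystem
    os.remove(src_path)
    return dst
```

The rename step is what makes the downstream "skip empty/partial files" logic in the dispatcher safe.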

- DAG mover [yuri doing this]
  - triggered by talker finishing
  - runs on lander*, as lander user
  - moves data from ramdisk to either RAID [ or localdisk]
  - Alefiya: currently we do not have a ramdisk. We need to write 
    directly to NFS RAID. 
  - 1 per LANDER trace box
  - 27-Jan-05: not used, because of merge with dag talker

- DAG backup mover [postponed]
  - takes old data on local disk and 
    moves to raid in RAW/LANDER* when recovered 
  - runs on LANDER* as lander user
  - 1 per LANDER trace box
  - CRON once/hour
  - 27-Jan-05: not used because no local disk use

- scrubber worker program dispatcher [gbartlett doing this]
  - run as lander user
  - runs on HPCC master
  - scans raw directory
  - schedules scrubber programs automatically, one per trace file
  - if raw file is empty, it waits for new files 
  - multiple copies on HPCC
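The dispatcher's scan-and-schedule loop might look like the following sketch. The `qsub`/`scrub.sh` names are placeholders for whatever the HPCC batch system actually uses; `submitted` stands in for whatever state the real dispatcher keeps across cron runs:

```python
import os

def dispatch_scrubbers(raw_dir, submitted):
    """Scan the raw directory; return batch commands for new trace files.

    Empty files are still being written by the talker, so they are left
    for a later scan.  `submitted` records what has already been queued.
    (Command names are hypothetical.)
    """
    jobs = []
    for name in sorted(os.listdir(raw_dir)):
        path = os.path.join(raw_dir, name)
        if name in submitted or os.path.getsize(path) == 0:
            continue  # already queued, or talker still writing it
        jobs.append(["qsub", "scrub.sh", path])
        submitted.add(name)
    return jobs
```

Running this from cron makes duplicate scheduling a matter of how durable `submitted` is, which ties into the error-handling notes at the end of this file.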

- scrubber program [yuri completed]
  - runs as lander user
  - runs per raw trace file, BATCH SCHEDULED
  - runs on HPCC compute node
  - scrubs file, writes to scrubbed directory
  - hardlinks the file into all users' incoming directories
  - ONE PER FILE, on hpcc
  - 27-Jan-05: done but needs audit
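The hardlink fan-out is worth sketching: links cost no extra disk space, and each user gets a private name for the shared scrubbed file that they can unlink when finished. Directory layout below is assumed, not the actual tree:

```python
import os

def publish_scrubbed(scrubbed_path, user_incoming_dirs):
    """Hardlink one scrubbed trace into each user's incoming directory.

    All links share one inode, so N users see the file at zero extra
    space; the data is freed only when the last user unlinks it.
    (Sketch; the real directory layout is assumed.)
    """
    name = os.path.basename(scrubbed_path)
    for d in user_incoming_dirs:
        link = os.path.join(d, name)
        if not os.path.exists(link):   # idempotent on re-run
            os.link(scrubbed_path, link)
```

Note this requires the scrubbed directory and the users' incoming directories to be on the same filesystem, which constrains how the RAID is laid out.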

- monitor [alefiya ] 
  - checks whether too few worker programs are running 
  - checks that the queue size in the batch processing system stays within limits 
  - CRON job run periodically as plander 
  - run every 30min  
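The two checks reduce to comparing counts against thresholds; a minimal sketch, with placeholder thresholds (the real cron job would presumably mail these warnings to the operators):

```python
def check_pipeline(running_workers, queue_len, min_workers=2, max_queue=100):
    """Return warning strings for the two monitor conditions.

    Too few workers suggests the dispatcher or batch system has stalled;
    a long queue suggests work is arriving faster than it drains.
    Thresholds here are placeholders.
    """
    warnings = []
    if len(running_workers) < min_workers:
        warnings.append("only %d worker(s) running" % len(running_workers))
    if queue_len > max_queue:
        warnings.append("batch queue backlog: %d jobs" % queue_len)
    return warnings
```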

- scrambler dispatcher [LATER] 
  - run as lander user
  - run periodically CRON every X days/hours or so
  - creates MAP for IP to scrambled addresses 
  - runs on HPCC master
  - scans scrubber directory
  - schedules scrambler programs automatically, one per trace file
  - ONE ON HPCC

- scrambler program [LATER]
  - runs as lander user
  - runs per scrubbed trace file  
  - run on HPCC compute node 
  - scrambles IP addresses as per MAP, writes to scrambled dir
  - ONE PER file on hpcc 
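The key property of the shared MAP is consistency: the same real address must scramble to the same output across every file, so flows stay correlated. A toy sketch of that idea (a production scrambler would more likely use prefix-preserving anonymization; addresses here are made up):

```python
def scramble(ip, ip_map):
    """Map a real IP to a stable scrambled address via a shared MAP.

    First sighting of an address allocates the next synthetic address;
    later sightings reuse it, so the same host looks the same in every
    scrambled file.  Purely illustrative, not prefix-preserving.
    """
    if ip not in ip_map:
        n = len(ip_map) + 1
        ip_map[ip] = "10.%d.%d.%d" % ((n >> 16) & 255, (n >> 8) & 255, n & 255)
    return ip_map[ip]
```

Because the MAP is shared state, regenerating it "every X days" (as the dispatcher notes say) changes the correlation window: traces scrambled under different MAPs cannot be joined on addresses.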

- cleaner [postponed]
  - run as lander user
  - runs on HPCC master OR could be scheduled on compute node
  - runs automatically via CRON, once/day
  - looks at all incoming and scrubbed dir, deletes old stuff
  - look at in-process and report errors for too much queued work
  - 27-Jan-05: on hold
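The "deletes old stuff" step is a straightforward age sweep; a sketch with an assumed retention period (the real cleaner would also inspect the in-process directory and flag excessive queued work):

```python
import os
import time

def clean_old(dirs, max_age_days=14, now=None):
    """Delete regular files older than max_age_days; return removed paths.

    `max_age_days` is a guessed retention period, and `now` is injectable
    for testing.  Directories and symlinks are left alone.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = []
    for d in dirs:
        for name in os.listdir(d):
            path = os.path.join(d, name)
            if os.path.isfile(path) and os.path.getmtime(path) < cutoff:
                os.remove(path)
                removed.append(path)
    return removed
```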

- quota checker [postponed]
  - run as lander user
  - runs on HPCC master OR could be scheduled on compute node
  - runs automatically via CRON, once/day
  - looks at each users disk space and reports hogs
  - 27-Jan-05: on hold

- user dispatcher scanner [alefiyah doing this]
  - runs as user
  - runs on HPCC master OR could be scheduled on compute node
  - scans user incoming periodically, CRON JOB every 2 minutes
  - moves file into in-process dir
    - Alefiya: need to add in-process/LANDER* 
  - writes script per file
  - 27-Jan-05: done
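The scanner's two actions (move to in-process, write a per-file script) might be sketched as below; `process.sh` stands in for the user's processor, and the script format is a guess:

```python
import os

def scan_incoming(incoming, in_process, script_dir, processor="process.sh"):
    """Move each incoming file to in-process; write one batch script per file.

    The rename is atomic when both directories share a filesystem, so a
    crash between scans never leaves a file in both places.  `processor`
    and the script contents are placeholders.
    """
    scripts = []
    for name in sorted(os.listdir(incoming)):
        src = os.path.join(incoming, name)
        dst = os.path.join(in_process, name)
        os.rename(src, dst)
        script = os.path.join(script_dir, name + ".sh")
        with open(script, "w") as f:
            f.write("#!/bin/sh\nexec %s %s\n" % (processor, dst))
        scripts.append(script)
    return scripts
```

Moving the file before writing the script means a crash in between leaves an in-process file with no script, which the monitor (or a re-scan of in-process) would need to catch.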

- user worker programs [similar to scrubber worker program] 
#- user dispatcher runner [alefiyah doing this]
  - runs as user, RUN MANUALLY OR VIA CRON JOB (every 2 minutes)
  - runs on HPCC master
  - checks size of queued jobs, reports error if over threshold
  - periodically takes each written script and schedules it on batch system
  - 27-Jan-05: done, but not cron job yet

- user processor [user provided, will need prototype---alefiyah has done]
  - runs as user, RUN VIA BATCH SYSTEM
  - runs on HPCC compute node
  - runs on each file in user in-process dir
  - removes it after either linking to storage dir or writing summary files
  - 27-Jan-05: alefiyah has script
  
- status monitor [postponed]
  - every lander script needs to actively record when it runs
    - compute jobs should record their run times
  - runs as lander
  - run on our machine somewhere
  - aggregates run-records into status web page
  - 27-Jan-05: on hold

Error handling.  [needs to be done]
1. the job never runs
2. the job runs and stops in the middle with part of the work done
3. the job runs twice (or more), sequentially
4. the job runs twice, concurrently

For EACH of the tasks that we worked out, we should think about what
happens in each of these four failure modes.

I think our monitoring process covers case #1: if jobs stop running,
then we notice that the times in the monitor process get long and we
determine something's wrong.

A general approach to running multiple times is to ensure that the
jobs are idempotent, i.e., they can run multiple times without hurting
things.  Some of our jobs are idempotent, some aren't, and some are
sort of (the penalty for multiple runs isn't too bad).  We need to
think them through.
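One standard way to make a non-idempotent job safe against cases 2-4 is a pair of marker files: an exclusive "claim" so concurrent duplicates back off, and a "done" marker written only after the work completes, so sequential reruns become no-ops. A sketch of the pattern (marker naming is invented, and note that O_EXCL is only reliable on local disk, not over NFS):

```python
import os

def run_once(marker_dir, job_name, work):
    """Run `work` at most once per job name, surviving reruns and overlap.

    - case 3 (sequential rerun): the "done" marker makes it a no-op.
    - case 4 (concurrent rerun): O_CREAT|O_EXCL lets exactly one claimer win.
    - case 2 (died mid-work): a claim with no done marker is left behind,
      which a monitor can detect and clear.
    Sketch only; marker layout is hypothetical.
    """
    done = os.path.join(marker_dir, job_name + ".done")
    claim = os.path.join(marker_dir, job_name + ".claim")
    if os.path.exists(done):
        return "already-done"
    try:
        fd = os.open(claim, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return "in-progress"
    os.close(fd)
    work()
    open(done, "w").close()   # mark completion only after work finishes
    return "ran"
```

Jobs that are naturally idempotent (the hardlink publisher, the age-based cleaner) do not need this; the ones that append or consume input (the talker, the user dispatcher) are the ones to wrap.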
