Hi guys,

Ahead of this afternoon's call I have read the PCPD paper (http://www.pdcp.org/vols/vol06/SCPE_6_1_06.pdf) and had a think about it. I think PCPD can be used; in fact, I would go as far as to say that it could be very useful for us (JRA4 and GridMon). However (and no prizes for having already guessed this ;-) it will need work.

My initial (random) comments are:

1. I think we can make life a bit easier for ourselves by saying that the fairly non-intrusive tests (ping and traceroute) can be run at the same time as each other or alongside ONE intrusive test (iperf OR udpmon). Two intrusive tests must never run at the same time as each other. So we can say that:
- ping clique tokens are non-blocking to traceroute, iperf and udpmon tokens
- traceroute clique tokens are non-blocking to ping, iperf and udpmon tokens
- iperf clique tokens are non-blocking to ping and traceroute tokens
- udpmon clique tokens are non-blocking to ping and traceroute tokens
- no more than two tokens should be active on a node at the same time (e.g. iperf and ping OK, iperf and ping and traceroute NOT OK).

As a consequence blocking (node locking) will be key.
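
To make the rules in point 1 concrete, here is a minimal sketch of the check a node might make before starting a test. Everything in it is an assumption for illustration: the function name, the set names and the idea of tracking "active tests" per node are mine, not anything taken from the PCPD paper or the EDG implementation.

```python
# Hypothetical sketch of the blocking rules in point 1.
# None of these names come from PCPD itself.
NON_INTRUSIVE = {"ping", "traceroute"}
INTRUSIVE = {"iperf", "udpmon"}
MAX_ACTIVE_TOKENS = 2  # "no more than two tokens ... active on a node"


def can_start(new_test, active_tests):
    """Return True if new_test may start alongside the tests already active on a node."""
    if new_test not in NON_INTRUSIVE | INTRUSIVE:
        raise ValueError(f"unknown test type: {new_test}")
    if new_test in active_tests:
        # Assumption: a node never runs two copies of the same test at once.
        return False
    if len(active_tests) + 1 > MAX_ACTIVE_TOKENS:
        return False
    combined = set(active_tests) | {new_test}
    # At most one intrusive test (iperf OR udpmon) at any time.
    return len(combined & INTRUSIVE) <= 1
```

With this, can_start("ping", {"iperf"}) is allowed, while can_start("udpmon", {"iperf"}) and can_start("traceroute", {"iperf", "ping"}) are both refused.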

2. I think we need a mechanism for reporting errors to someone. The odd error can be ignored - running tests every 30 mins is a reasonable frequency and missing one or two consecutive datapoints is not a problem. We're looking for recurrent problems, e.g. node n's tests always run over, so that node n and node n+1 (which regenerates the token after the timeout) run overlapping tests EVERY 30 minutes. The reporting mechanism could also be used at the start of testing (i.e. in the first few days) or when nodes are added to/removed from a clique to check that there are no major inter-clique problems, e.g. removing Edinburgh from the "RALtoScotGrid-Iperf" clique moves RAL's testing to Glasgow forward by 20 seconds, so that it now occupies the timeslot at which Glasgow normally tests to RAL. I imagine that a little bit of tweaking will be needed at times - the PCPD paper seems to suggest this also.

I think the easiest way to report errors is to send email to the address identified in the theoretical PCPD token (in the "owner_s_email" field) PROVIDING this is the original owner of the token, i.e. the person who first generated it. That person can then liaise with the appropriate local contacts, or delete the token and issue a new one to fix the problem. I say "theoretical token" as it looks from Ratna's notes that not everything mentioned in the PCPD paper was actually present in the EDG implementation. I also specifically stated the email of the person who originally generated the token as I suspect that they will be the main person responsible for the monitoring infrastructure, rather than a SysAdmin responsible for a local monitoring box who has 3 million other things to do.
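
As a rough illustration of "ignore the odd error, report the recurrent ones", the sketch below counts consecutive overruns per node and only flags a node once it reaches a threshold. The class name, the method name and the threshold of 3 are all assumptions on my part (3 simply follows from "one or two consecutive datapoints is not a problem"); actually sending the email to the token owner is deliberately left out.

```python
from collections import defaultdict


class OverrunTracker:
    """Hypothetical helper: flags a node only after repeated consecutive overruns."""

    def __init__(self, threshold=3):
        # Assumption: one or two consecutive misses are tolerated, the third is reported.
        self.threshold = threshold
        self.consecutive = defaultdict(int)

    def record(self, node, overran):
        """Record one test period for node; return True if it should now be reported."""
        if not overran:
            self.consecutive[node] = 0  # a clean run resets the count
            return False
        self.consecutive[node] += 1
        # Report exactly once, at the moment the threshold is reached,
        # so the token owner is not emailed every 30 minutes.
        return self.consecutive[node] == self.threshold
```

Reporting only at the moment the threshold is hit is a design choice to avoid flooding the owner's mailbox; a real deployment might prefer a periodic reminder instead.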

3. Security is a big area. If we decide to proceed this is something we can have a separate phone call about. Since we're controlling access to things like the DT using certificates, certificates would seem to be the way to go.

4. I think the PCPD paper in section 4 (Related Work) TRIES to suggest that you could mix scheduling scenarios within your toolkit, depending on the requirements of the tools you are using, e.g. you could run PingER tests using cron jobs and RTPL tests using centralised control. I don't think we should mix and match. We either schedule everything using PCPD or nothing. If we try a mixed approach we're going to get in a muddle.

5. PCPD uses timestamps, but do those timestamps need to be particularly in sync with those of other clique members? What I'm trying to say is, does each clique member say "Right, I THINK it's 14:30:00. I last ran a test at 14:00:00 and the periodicity of my iperf token is 30 minutes, so it's time to start running another set of iperf tests"? Alternatively, must 14:30 on node n be exactly the same as 14:30 on node n-1? If the latter, all hosts will need to be accurately time-synchronised using NTP, and we'll need to figure out how to use this across timezones.

We use NTP for GridMon (albeit in just one time zone) so that when we're comparing network performance from say Edinburgh to Glasgow with Glasgow to London, it's like for like.
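
To illustrate the first interpretation in point 5 (each node decides from its own clock and its own last run), here is a small sketch. The function name and the dates are arbitrary and hypothetical; I am not claiming this is how PCPD makes the decision. It also shows that if timestamps are kept as timezone-aware values, two nodes in different timezones agree on the same instant, so the timezone part of the problem reduces to storing times in UTC.

```python
from datetime import datetime, timedelta, timezone


def is_due(last_run, period, now):
    """True when at least `period` has elapsed since `last_run` (local decision only)."""
    return now - last_run >= period


PERIOD = timedelta(minutes=30)  # periodicity of the iperf token in the example
UTC = timezone.utc
UTC_PLUS_1 = timezone(timedelta(hours=1))

# Arbitrary example date; only the times of day matter here.
last_run = datetime(2006, 3, 3, 14, 0, tzinfo=UTC)
now_in_utc = datetime(2006, 3, 3, 14, 30, tzinfo=UTC)
now_in_utc_plus_1 = datetime(2006, 3, 3, 15, 30, tzinfo=UTC_PLUS_1)  # same instant

due_utc = is_due(last_run, PERIOD, now_in_utc)
due_other_zone = is_due(last_run, PERIOD, now_in_utc_plus_1)
```

Note that this says nothing about the second interpretation: if nodes must agree on when 14:30 actually is, their clocks still need to be synchronised, whatever timezone they display.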

6. Are there any changes we can make to our respective monitoring toolkits to increase our chances of success with PCPD? A simple example: if an iperf client cannot make a connection with the iperf server on the other end of a test path (e.g. the server has fallen over, or there are firewall problems) then the client will sit there for ages before timing out. This has the potential to seriously skew our testing period. So can we look at making iperf time out after not much more time than the duration of a normal iperf test (e.g. 10 seconds)?
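
One way to get this behaviour without touching iperf itself would be to wrap the client in a hard deadline. The sketch below is only that, a sketch: the function name is made up, and the commands at the bottom are stand-ins (a Python one-liner that sleeps) rather than a real iperf command line, which I have left out on purpose.

```python
import subprocess
import sys


def run_with_deadline(cmd, deadline_s):
    """Run cmd; return (finished, returncode). The child is killed if the deadline passes."""
    try:
        result = subprocess.run(
            cmd,
            timeout=deadline_s,  # subprocess.run kills the child on expiry
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        return False, None
    return True, result.returncode


# Stand-in commands for illustration only; a real wrapper would pass the test tool's argv.
hung = run_with_deadline([sys.executable, "-c", "import time; time.sleep(5)"], 0.5)
quick = run_with_deadline([sys.executable, "-c", "pass"], 5)
```

Here the deadline would be set to a little more than the normal test duration, so a hung client costs us seconds rather than minutes of the testing period.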


So those were my random comments. In getting back to what we originally agreed that I would do (think about what we want and need from a scheduling mechanism using tokens :) my comments are below. The YES/NO indicates whether I think PCPD does this.

7. must support regeneration of tokens to limit effects of token loss or delay: YES
8. should support means of stopping tests if they have been delayed or are overrunning, and those tests will only complete by overrunning into the test period expected to be used by another group member: NO
9. must support blocking and non-blocking tokens to allow control of which tests can and cannot be run in parallel: NO
10. errors need reporting to a central location, where responsive, proactive people live who can fix the problem: NO

The one thing I need to look at more is how PCPD controls when tests start. But I think we have enough in our heads to start talking.

Lunch is calling....speak to you soon,

Mark.


> -----Original Message-----
> From: Alistair Phipps [mailto:alistair@epcc.ed.ac.uk]
> Sent: 02 March 2006 16:09
> To: Kostas Kavoussanakis; Leese, MJ (Mark)
> Subject: pcpd minutes
> 
> 
> Hi Kostas + Mark,
> 
> A few notes from the meeting:
> 
> http://agenda.cern.ch/askArchive.php?base=agenda&categ=a061426&id=a061426/minutes
> 
> Let me know if I should add/change anything.
> 
> Cheers,
> 
> Alistair
>