From: owner-hep-proj-grid-exec@listbox.cern.ch on behalf of Bob Jones
[Robert.Jones@cern.ch]
Sent: Friday, February 20, 2004 1:20 PM
To: Hep-Proj-Grid-Pmb@Cern.Ch; WP Internal
Subject: WP MGR INTERNAL Notes from EDG final review: day 2

***********************************************************
* Message from WP managers internal list - do not forward *
***********************************************************
Hi,
Here are my notes from the EDG review day 2. They include the comments and
questions raised and a summary of those aspects not covered by the slides:
http://agenda.cern.ch/fullAgenda.php?ida=a036278

There will be a separate email with the notes from the feedback session.

Cheers, Bob.


WP1: (Francesco)
Talk: 15 minutes
Questions: 7 minutes

Q: Did you try to simulate the behaviour of the RB?
A: We did have a computer science group but they were more interested in
exploring novel approaches to meta-scheduling issues, and this was too far
removed from the needs of deploying a working system for the users.

Q: How far did you go with scalability testing?
A: The LCG certification group regularly test 20,000 jobs over the weekend
using 20 streams.

Q: In which version was check-pointing included and has it been used by
WP10?
A: The WP10 comments about losing long-running jobs referred to EDG 1.4
before the check-pointing was introduced.

Q: How did you handle complex integration of the many components?
A: The autobuild tools did help to some extent with the dependencies but
better support for handling the complexity is required. A paper has been
written by Elisabetta Ronchieri on this subject.
A (Erwin): In D6.8 there is a summary of the integration process.

Q: LCG have certified the WP1 to some extent but do you have a feeling for
its true limits?
A: LCG have a very good stress test for the types of jobs run in HEP but
this does not cover aspects such as MPI jobs. We have not stressed all
aspects of the system.


WP2: (Peter)
Talk: 15 minutes
Questions: 5 minutes

Q: Do you assume the "work to be done" will be performed in EGEE?
A: Some aspects will be performed in EGEE building on the results of EDG.
There is still some uncertainty concerning web services and security
interaction that needs further thought.

Q: Was the use of the Replica mgmt Bloom filter successful?
A: The use of the Bloom filter is explained in D2.6 and proved successful
but requires careful configuration of the various parameters. It is not
suitable for metadata usage.
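
For readers unfamiliar with the technique, the remark about careful
configuration can be illustrated with a minimal Python sketch. This is not
the EDG replica mgmt code: the class name, the hashing scheme and the
parameter names are assumptions made for illustration. It uses the standard
sizing formulas, where the number of bits and the number of hash functions
follow from the expected number of entries and the acceptable false-positive
rate. A Bloom filter only answers "possibly present" or "definitely absent"
for a key and stores no values, which fits the comment that it is not suited
to metadata.

```python
import hashlib
import math


class BloomFilter:
    """Illustrative Bloom filter; not the EDG replica management code."""

    def __init__(self, expected_items, false_positive_rate):
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        ln2 = math.log(2)
        self.num_bits = max(
            1, math.ceil(-expected_items * math.log(false_positive_rate) / ln2**2)
        )
        self.num_hashes = max(1, round(self.num_bits / expected_items * ln2))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive all bit positions from two halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # True may be a false positive; False is always correct.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )
```

If the filter is sized for too few entries, or the target false-positive rate
is set too loosely, lookups return "possibly present" for many keys that were
never added, which is the kind of configuration sensitivity mentioned above.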

Q: Have you made the Bloom filter into a web service?
A: No, it is hidden within the replica mgmt tools.



WP3: (Steve)
Talk: 15 minutes
Questions: 5 minutes

Q: What is the expected exploitation?
A: There are people who want to see it deployed on LCG and related systems.

Q: What constraints are there for documentation and packaging?
A: Currently RGMA is packaged within the EDG environment which is quite
restrictive. We want to distribute source tar files that can be built and
run anywhere.

Q: When will this happen?
A: In the coming few months as we move into EGEE.

Q: What were the main reasons for instability of RGMA?
A: Mainly deadlocks between the different components that only showed up
when tested on large distributed testbeds when the load was increased. These
are now being limited systematically.

Q: How did you detect the deadlocks?
A: We started with an extreme programming approach that produced many unit
tests but now are taking a step backwards and reviewing the design aspects.

Q: Would you use extreme programming again?
A: Yes, as a technique for getting early prototypes going.

WP4: (Maite)
Talk: 15 minutes
Questions: 10 minutes
Q: What types of nodes (single PC, small farm of linux boxes, loosely
coupled cluster, parallel and vector systems etc.) can you support now and
in the future?
A: We currently do not get involved in the architecture of the nodes. We
currently support loosely coupled clusters of standard PCs running Linux or
Solaris.
Q: Even standard PC boxes are complicated because the clusters may contain
different configurations of nodes. What about home PCs that are less
reliable? Without clearly defined policies at sites I have my doubts you
will be able to manage the complexity.
A: Quattor and lemon could be used for home PCs. Unlike its predecessor,
LCFG-ng, Quattor/Lemon does not take full control of the node and hence can
be used in more varied environments.

Q: Do the configuration aspects allow installation to be performed
unattended?
A: Yes but some sites do not want to make use of this level of automation.
A (German): The cluster at CERN has nodes of different configurations
(different CPUs, disks etc.). With Quattor we can record this sort of
information in the configuration database, from which the info is extracted
to generate a kick-start file to boot the node, so variations can be managed.
Karsten Decker comment: This has extremely interesting potential commercial
value to many customers since it simplifies the day-to-day work. You could
become bloody filthy rich!
Q: How does the effort scale with the size of the installation?
A: Adapting the WP4 framework takes time and effort for each site but once
in place the addition of extra nodes is very easy. CERN has currently grown
to 2000 nodes.
Q: How far can it scale and have you measured reliability?
A: Scalability can be achieved by redundancy, as at CERN, and this is built
into the architecture, but we have not measured the uptime achieved as a
result of the introduction of Quattor.

Q: How does the fabric interface to the grid?
A: A site needs to publish information about its configuration and status to
the grid information service. The fabric mgmt tools help in providing
consistent information.

Q: How will you use such tools in a heterogeneous production grid?
A: We cannot impose the choice of tools but we will define the requirements
for site behaviour via SLAs and make tools, such as Quattor/lemon, available
for the sites that want to adopt them.

Coffee

WP5: (Jens)
Talk: 15 minutes
Questions: 5 minutes
Q: What is the performance disk to disk with gridFTP?
A: We do not have figures to hand, but it is around 30MB/s for the SE.
A (John Gordon): The DataTAG land speed record is far in excess of what the
users use at the moment. The performance is dominated by the network
throughput.
A (Franck Bonnassieux): 400MB/s was demonstrated between CERN and NIKHEF
during the 2nd EDG review.
A (Peter Clarke): It is not the network speed that dominates the result but
a concerted effort by all involved (network experts and people running HEP
production runs) to get an end-to-end solution. Disk-to-disk of 1GB/s for
end users would be fantastic. 500MB/s is possible today.

WP7: (Franck)
Talk: 15 minutes
Questions: 5 minutes

Q: In which version of the EDG sw is the QoS installed?
A: It is not installed because the technology is too new. E.g. IP premium
requires all NRENs to set-up a static configuration for it to work which
they are not prepared to do on a regular basis.

Q: Are the advanced services you described available?
A: No, they are not available on the production network.

Q: What are the plans for EGEE?
A: We will continue work with the advanced TCP stack to provide advanced
facilities, but there is no point implementing this unless the NRENs deploy
it.
A (Fab): EGEE has formed a collaboration with DANTE, GN2 and the NRENs to
work in these areas (EGEE JRA4).
A: We are currently adjusting the planning of EGEE and GN2 deliverables to
ensure they interact properly and EGEE can exploit the network advances.

Security (Dave):
Talk: 34 minutes
Questions: 7 minutes

Q: Most of the developments will be taken over by EGEE?
A: Yes, most of it is part of mw which will be moved into EGEE.

Q: What about accounting?
A: This is not the problem of the SCG; we only briefly touched on the issue,
though it is included in the design.
(Bob): Accounting is one of the activities that cuts across WPs (WP1, WP4,
WP6); we have not deployed a full accounting system.

Q: Will the missing security requirements be satisfied within the timeframe
of EGEE?
A: (Bob): The first milestone of EGEE is an assessment of requirements input
from EDG and other projects; this will be clarified at month 3. Biomed is one
of the 2 pilot application groups of EGEE, hence their requirements will be
considered high priority.

Q: Accounting will be important for commercial exploitation.
A: (Bob): current plan in LCG is to put in place an offline accounting
system rather than an online one which could be used by the broker during
job submission. This involves collecting the log files from the sites and
analysing the contents. This is a model more like the monthly statement you
get for your credit card.
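
The offline model described above can be sketched in a few lines of Python.
This is only an illustration under stated assumptions, not the LCG design:
the record format (CSV with site, user, end_date and cpu_seconds columns) and
the function name are hypothetical. It shows usage being summed per user and
per month after the fact from collected site records, rather than being
consulted by the broker at submission time.

```python
import csv
import io
from collections import defaultdict


def monthly_statement(log_text):
    """Sum CPU seconds per (user, month) from collected site log records.

    Assumes a hypothetical CSV layout with a header row:
    site,user,end_date,cpu_seconds  (end_date as YYYY-MM-DD).
    """
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(log_text)):
        month = row["end_date"][:7]  # "YYYY-MM-DD" -> "YYYY-MM"
        totals[(row["user"], month)] += float(row["cpu_seconds"])
    return dict(totals)
```

Records from all sites are simply concatenated before the call, so a user's
jobs at different sites end up on the same monthly line.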


3 years Summary (Erwin):
Talk: 23 minutes
Questions delayed until after Fab's presentation.

Project Summary (Fab)
NOTE: The thanks list is incomplete (as is always the case when one tries to
do it quickly). Please wait for a more complete list which will be uploaded
during the afternoon.

Q: What are the 3 decisions made during the project that you are most proud
of?
A (Fab): The collective decision-making process taking into account the
distributed nature of the project - e.g. do we fix the current version or go
forward with new software etc. May not be the most efficient way of running
the project (a centralised, smaller project can be more agile) but we needed
to build a community that is interested in this new technology.
Resisted the pressure to appoint a single chief architect but rather seek
consensus.
Being prepared to go back on the 2nd year plan for functionality and focus
on application support and quality.

Q: What do you see as the evolution of the assessment of EU projects?
A (Fab): In my personal opinion, not that of the consortium, we took a
number of independent decisions which were not always aligned with other grid
projects, which has led to a love/hate relationship with those projects.

Q: In the "Lessons learnt" is there something that is very grid specific? It
looks more like standard sw engineering points?
A (Erwin): Clearly most come from software engineering aspects but to run a
successful grid you need to be able to validate and monitor the sites. This
is clearly a unique feature of a grid because it tries to federate a set of
separately managed resources. Equally the importance of grid security that
is available from the start is very high.
(Frank Harris): flexible software installation is of paramount importance.
(Fab): putting together a few thousand PCs in a single site is now possible,
so we need to be able to convince rich application groups that the grid is
useful and necessary. This depends on the provision of a well-managed and
secure infrastructure that is easy to use and deploy.

(Luigi Fusco): A well controlled testing environment is essential. Need to
make clear very early the interfaces used by the applications from the grid
middleware.

Q: Is not the scale of the problem what makes it different/hard for the
grid?
A (Erwin): yes, the people were used to distributed systems but the
quantitative and qualitative scale of the grid was beyond their experience.

EDG Video (6 minute film set to music showing the history and output of the
project produced by Rosy Mondardini) was projected and received a resounding
round of applause.

Session closed at 13:20.