From: owner-hep-proj-grid-exec@listbox.cern.ch on behalf of Bob Jones [Robert.Jones@cern.ch]
Sent: Friday, February 20, 2004 1:20 PM
To: Hep-Proj-Grid-Pmb@Cern.Ch; WP Internal
Subject: WP MGR INTERNAL Notes from EDG final review: day 2
Categories: CERN SpamKiller Note: -49

***********************************************************
* Message from WP managers internal list - do not forward *
***********************************************************

Hi,

Here are my notes from the EDG review day 2. They include the comments/questions made and a summary of those aspects not covered by the slides:

http://agenda.cern.ch/fullAgenda.php?ida=a036278

There will be a separate email with the notes from the feedback session.

Cheers,
Bob.

WP1: (Francesco)
Talk: 15 minutes
Questions: 7 minutes

Q: Did you try to simulate the behaviour of the RB?
A: We did have a computer science group, but they were more interested in exploring novel approaches to meta-scheduling issues, and this was too far removed from the needs of deploying a working system for the users.

Q: How far did you go with scalability testing?
A: The LCG certification group regularly test 20,000 jobs over the weekend using 20 streams.

Q: In which version was check-pointing included and has it been used by WP10?
A: The WP10 comments about losing long-running jobs referred to EDG 1.4, before the check-pointing was introduced.

Q: How did you handle complex integration of the many components?
A: The autobuild tools did help to some extent with the dependencies, but better support for handling the complexity is required. A paper has been written by Elisabetta Ronchieri on this subject.
A (Erwin): In D6.8 there is a summary of the integration process.

Q: LCG have certified the WP1 to some extent but do you have a feeling for its true limits?
A: LCG have a very good stress test for the types of jobs run in HEP but this does not cover aspects such as MPI jobs. We have not stressed all aspects of the system.
WP2: (Peter)
Talk: 15 minutes
Questions: 5 minutes

Q: Do you assume the "work to be done" will be performed in EGEE?
A: Some aspects will be performed in EGEE, building on the results of EDG. There is still some uncertainty concerning web services and security interaction that needs further thought.

Q: Was the use of the Replica mgmt Bloom filter successful?
A: The use of the Bloom filter is explained in D2.6 and proved successful, but it requires careful configuration of the various parameters. It is not suitable for metadata usage.

Q: Have you made the Bloom filter into a web service?
A: No, it is hidden within the replica mgmt tools.

WP3: (Steve)
Talk: 15 minutes
Questions: 5 minutes

Q: What is the expected exploitation?
A: There are people who want to see it deployed on LCG and related systems.

Q: What constraints are there for documentation and packaging?
A: Currently RGMA is packaged within the EDG environment, which is quite restrictive. We want to distribute source tar files that can be built and run anywhere.

Q: When will this happen?
A: In the coming few months as we move into EGEE.

Q: What were the main reasons for instability of RGMA?
A: Mainly deadlocks between the different components that only showed up when tested on large distributed testbeds when the load was increased. These are now being limited systematically.

Q: How did you detect the deadlocks?
A: We started with an extreme programming approach that produced many unit tests, but we are now taking a step backwards and reviewing the design aspects.

Q: Would you use extreme programming again?
A: Yes, as a technique for getting early prototypes going.

WP4: (Maite)
Talk: 15 minutes
Questions: 10 minutes

Q: What types of nodes (single PC, small farm of Linux boxes, loosely coupled cluster, parallel and vector systems etc.) can you support now and in the future?
A: Currently we do not get involved in the architecture of the nodes. We currently support loosely coupled clusters of standard PCs running Linux or Solaris.
Q: Even standard PC boxes are complicated because the clusters may contain different configurations of nodes. What about home PCs that are less reliable? Without clearly defined policies at sites I have my doubts you will be able to manage the complexity.
A: Quattor and Lemon could be used for home PCs. Unlike its predecessor, LCFG-ng, Quattor/Lemon does not take full control of the node and hence can be used in more varied environments.

Q: Do the configuration aspects allow installation to be performed unattended?
A: Yes, but some sites do not want to make use of this level of automation.
A (German): The cluster at CERN has nodes of different configurations (different CPUs, disks etc.). With Quattor we can record this sort of information in the configuration database, from which the info is extracted to generate a kick-start file to boot the node, so variations can be managed.

Karsten Decker comment: This has extremely interesting potential commercial value to many customers since it simplifies the day-to-day work. You could become bloody filthy rich!

Q: How does the effort scale with the size of the installation?
A: Adapting the WP4 framework takes time and effort for each site, but once it is in place the addition of extra nodes is very easy. CERN has currently grown to 2000 nodes.

Q: How far can it scale and have you measured reliability?
A: Scalability can be achieved by redundancy, as at CERN, and this is built into the architecture, but we have not measured the uptime achieved as a result of the introduction of Quattor.

Q: How does the fabric interface to the grid?
A: A site needs to publish information about its configuration and status to the grid information service. The fabric mgmt tools help in providing consistent information.

Q: How will you use such tools in a heterogeneous production grid?
A: We cannot impose the choice of tools, but we will define the requirements for site behaviour via SLAs and make tools, such as Quattor/Lemon, available for the sites that want to adopt them.

Coffee

WP5: (Jens)
Talk: 15 minutes
Questions: 5 minutes

Q: What is the performance disk to disk with gridFTP?
A: We do not have figures to hand, but similar to 30MB/s for the SE.
A (John Gordon): The DataTAG land speed record is far in excess of what the users use at the moment. The performance is dominated by the network throughput.
A (Franck Bonnassieux): 400MB/s was demonstrated between CERN and NIKHEF during the 2nd EDG review.
A (Peter Clarke): It is not the network speed that dominates the result but a concerted effort by all involved (network experts and people running HEP production runs) to get an end-to-end solution. Disk-to-disk of 1GB/s for end users would be fantastic. 500MB/s is possible today.

WP7: (Franck)
Talk: 15 minutes
Questions: 5 minutes

Q: In which version of the EDG sw is QoS installed?
A: It is not installed because the technology is too new. E.g. IP premium requires all NRENs to set up a static configuration for it to work, which they are not prepared to do on a regular basis.

Q: Are the advanced services you described available?
A: No, they are not available on the production network.

Q: What are the plans for EGEE?
A: We will continue work with the advanced TCP stack to provide advanced facilities, but there is no point implementing this unless the NRENs deploy it.
A (Fab): EGEE has formed a collaboration with DANTE, GN2 and the NRENs to work in these areas (EGEE JRA4).
A: We are currently adjusting the planning of EGEE and GN2 deliverables to ensure they interact properly and EGEE can exploit the network advances.

Security (Dave):
Talk: 34 minutes
Questions: 7 minutes

Q: Most of the developments will be taken over by EGEE?
A: Yes, most of it is part of mw, which will be moved into EGEE.

Q: What about accounting?
A: Not the problem of the SCG; we only briefly touched on the issue, and it is included in the design.
A (Bob): It is one of the activities that go across all WPs (WP1, WP4, WP6); we have not deployed a full accounting system.

Q: Will the missing security requirements be satisfied within the timeframe of EGEE?
A (Bob): The first milestone of EGEE is an assessment of requirements input from EDG and other projects; at month 3 this will be clarified. Biomed is one of the 2 pilot application groups of EGEE, hence their requirements will be considered high priority.

Q: Accounting will be important for commercial exploitation.
A (Bob): The current plan in LCG is to put in place an offline accounting system rather than an online one which could be used by the broker during job submission. This involves collecting the log files from the sites and analysing the contents. This is a model more like the monthly statement you get for your credit card.

3 years Summary (Erwin):
Talk: 23 minutes
Questions delayed until after Fab's presentation.

Project Summary (Fab)

NOTE: The thanks list is incomplete (as is always the case when one tries to do it quickly). Please wait for a more complete list which will be uploaded during the afternoon.

Q: What are the 3 decisions made during the project that you are most proud of?
A (Fab): The collective decision-making process taking into account the distributed nature of the project - e.g. do we fix the current version or go forward with new software etc. It may not be the most efficient way of running the project (a centralised, smaller project can be more agile) but we needed to build a community that is interested in this new technology. We resisted the pressure to appoint a single chief architect and instead sought consensus. Being prepared to go back on the 2nd year plan for functionality and focus on application support and quality.

Q: What do you see as the evolution of the assessment of EU projects?
A (Fab): My personal opinion, not that of the consortium: we took a number of independent decisions which were not always aligned with other grid projects, which has led to a love/hate relationship with those projects.

Q: In the "Lessons learnt" is there something that is very grid specific? It looks more like standard sw engineering points?
A (Erwin): Clearly most come from software engineering aspects, but to run a successful grid you need to be able to validate and monitor the sites. This is clearly a unique feature of a grid because it tries to federate a set of separately managed resources. Equally, the importance of grid security that is available from the start is very high.
(Frank Harris): Flexible software installation is of paramount importance.
(Fab): Putting together a few thousand PCs in a single site is now possible, so we need to be able to convince rich application groups that the grid is useful and necessary. This requires the provision of a well-managed and secure infrastructure that is easy to use and deploy.
(Luigi Fusco): A well controlled testing environment is essential. We need to make clear very early the interfaces used by the applications from the grid middleware.

Q: Is it not the scale of the problem that makes it different/hard for the grid?
A (Erwin): Yes, the people were used to distributed systems but the quantitative and qualitative scale of the grid was beyond their experience.

EDG Video (6 minute film set to music showing the history and output of the project, produced by Rosy Mondardini) was projected and received a resounding round of applause.

Session closed at 13:20.