From: owner-hep-proj-grid-exec@listbox.cern.ch on behalf of Bob Jones [Robert.Jones@cern.ch]
Sent: Thursday, February 19, 2004 6:55 PM
To: Hep-Proj-Grid-Pmb@Cern.Ch; WP Internal
Subject: WP MGR INTERNAL Notes from EDG final review: day 1
Categories: CERN SpamKiller Note: -49

***********************************************************
* Message from WP managers internal list - do not forward *
***********************************************************

Hi,

Here are my notes from the EDG review day 1. They include the comments/questions made and a summary of those aspects not covered by the slides:
http://agenda.cern.ch/fullAgenda.php?ida=a036278

Cheers,
Bob.

Welcome by Wolfgang Von Reuden:
Personally considers the EDG project a success since the software products are now in production use with LCG. Looking forward to the EGEE project during this planning stage. Apologised for the late arrival of Jos Engelen, who will join the review later (arrived during Fab's talk and left when it finished).

Agenda accepted without modification by Kyriakos Baxevanidis, who reminded speakers to leave sufficient time for the reviewers to ask questions.

Status of the project (Fab):
Talk: 40 minutes
Questions: 15 minutes
Q: Has the advanced functionality of EDG 2.1 been tested with the applications?
A: The documents show results up to release EDG 2.0. Results from EDG 2.1 will be presented during the review.
Q: Explain support from Globus & VDT and how that will work in the future.
A (Erwin): Good quality support for Globus via VDT. Continued support for GT2 by Globus for the next 2 years has recently been announced by Ian Foster.
Q: EDG 2.1 deliverables suggest that a performance problem is also a functionality problem. Will details of the latest status be reported?
A: The status will be reported in the presentations and demos of the review as well as in the final quarterly report.
Q: Has EDG influenced the adoption of web-services for grids?
A: IBM has influenced this outcome more than EDG, but EDG has been active in GGF promoting its new web-services based developments.
Q: With the wide-spread adoption of web-services by industry, is GGF still relevant for EDG and EGEE?
A: There is still value in participating in GGF and encouraging tighter links with OASIS.
Q: R-GMA is a major achievement, but LCG has retained MDS not R-GMA, why?
A: Given the strict production requirements and timescales it has been necessary for LCG to remain with MDS for its current version. However, LCG are now investigating R-GMA and plan to introduce it in the near future.
A (Les): R-GMA is now being distributed to LCG sites and used for monitoring purposes. Migration to R-GMA as the basic information service will depend on its performance during its testing in the monitoring area.
Q: Why does UH have an excess of 133% for funded effort?
A: UH has hired many post-docs instead of professional software engineers for the same cost.

Software integration, deployment and evaluation (Cal)
Start: 12:05, 12:25
Q: What is the EDG experience concerning diagnostics facilities built into the code?
A: This was initially a weakness but the situation has improved during the life of the project with the introduction of standards pertaining to logging. Further work is needed on logging formats.
Q: What do we mean by "reasonable stability" of the latest EDG release?
A: EDG 2.0 was not very stable due to the information system, which required a lot of effort from system managers to keep it going. With EDG 2.1 the effort required of system managers has greatly decreased. It has been deployed for 2 months and has performed very well with stress tests and with end-users exploiting it.
Q: Do the 2500 bugs refer to the last release?
A: No, 2500 since the introduction of Bugzilla.
A (Gabriel): 1000 bugs have been recorded since EDG 2.0.
Q: EDG has developed code at distributed sites, posing heavy integration problems. Will this be the subject of a white paper?
A: A lot of this information is held in D6.8, which can be used for a white paper.
A (Fab): This could be a subject for a paper in the special edition of the grid journal that will be described in Erwin's talk.
Comment by Kyriakos: Such results should be written in the final report of the project (it can have a public and a private part). It should also include "everything you would do differently if you were to do it again".
Q: Grid stability and sites coming/going is an important aspect. Is EDG 2.1 stable enough in itself to measure resilience to some level?
A: To some level yes, but it depends where the failure happens. EDG 2.1 still has some single points of failure. Centralised infrastructure is not resilient, but individual sites coming and going is not so important.
Q: Is the resource broker resilient?
A: We can run multiple copies of the resource broker.

WP8 HEP applications (Frank)
Talk: 20 minutes
Questions: 10 minutes
Q: Space management
A: The resource broker cannot know about the status of space allocation before submitting a job. The idea behind Zamboni was to provide space management once a job is submitted for a particular site.
Q: Is space management not a basic requirement for a grid?
A (Francesco & Erwin): This is not strictly a problem for the resource broker or grids. It depends heavily on local site configuration.
Q: You say you get 90% success when running, i.e. when R-GMA is up?
A: Yes, the performance/reliability is very good when R-GMA is running and we now have good stability with R-GMA.
Q: Apparently the RB blocked when a large set of jobs was submitted?
A (Francesco): This has been resolved with changes to the RB made by WP1 and LCG. The problem was traced to threading issues in the C++ libraries that have been solved by moving to the latest version of the compiler.
Q: Has the introduction of fault-tolerance helped with efficiency?
A: Fault-tolerance and monitoring have definitely helped the efficiency.
Q: How bad were the data management performance issues?
A: They were very difficult at the start. The situation has improved with the latest releases, but there are still some improvements to make.
Q: Has the AWG been useful to HEP?
A: Yes, the AWG was very useful. It shows a different way of addressing problems and a different approach to interfaces etc. AWG did use the HEP use-cases template.
A (Fab): The AWG has shown that often the differences are only at the surface level, but deep down the applications need the same thing.
Q: How different are the requirements?
A: A consolidated requirements document is being prepared and will act as input to EGEE.

WP9: EO application (Luigi)
Talk: 20 minutes
Questions: 20 minutes
Q: Complaints were expressed in previous reviews about the participation of ESA in the project. Are you happy with the developments during the 3rd year?
A: We have a good, running testbed. 10 days ago we could process 1 year of data over the week from home. It can always be better but it is there and working.
Q: Did the reestablishment of the AWG help?
A: Certainly - we see that our needs are being taken into account by the middleware groups.
Q: Is EDG an integral part of future EO computing plans?
A: Convincing EO groups to use new technology for production usage is not easy. Security (e.g. working across firewalls) is a stumbling block that hopefully can be solved with time. Certainly continuation of EDG via EGEE will help in this process.
Q: Does ESA plan to join EGEE with its resources and applications?
A: The EO community wishes to join EGEE and has attended NA4 meetings. ESA is a keen member of this community.
A (Fab): EO had a strong representation at the EGEE NA4 meeting held in Paris in December 2003.
A: Selling the concept of grid is still a difficult task and so we will use it now for some programmes in the hope that this will provide more exposure of the community. Getting closer to EGEE will certainly help here.
An outstanding issue is that some partners are losing their people and some help from the EU would be of benefit.
A (Fab): Projects such as EDG form persistent user communities and the discontinuity between projects is an issue for such communities in terms of support.
Comment by Kyriakos: We are now moving from a reference implementation to a pan-European infrastructure, with many additional procedures and tasks. At the point when the EU is convinced that this transition is completed, so that grids reach the same level as networks (i.e. GEANT), the EU will reconsider its funding structure and approach. EGEE is a key project to establish this route.
A: The work has been done, the products exist, so how does the user community pick it up? For EO this is an issue since it is not funded via EGEE and we are missing the same critical mass as HEP and bio-medical.
A (Guy W): The NA4 activity of EGEE is a good structure for including new applications and the EO community is a very good early candidate. We foresee a 6-month ramp-up period with dedicated support for all applications in EGEE until they get up to speed.
Kyriakos: EGEE can help, but the EO community must urgently take the initiative itself to deploy a similar action to ensure grid take-up. The EU is prepared to discuss the EO needs.
Fab: EGEE is prepared to participate in events such as the EU workshop being organised for the bio-medical community in March 2004.

WP10: bio-medical (Vincent)
Talk: 20 minutes
Questions: 12 minutes
Q: In the first review, the bio-medical community showed a number of applications that could be modified to use the grid. In the 2nd review this was not shown. In this, the final review, were new algorithms produced or existing ones changed as a result of grid work?
A: GATE is a good example of a grid-aware algorithm that has been ported and will be demonstrated on EDG. The porting to the grid has not modified the algorithms but there is clearly scope to do this in the future.
Q: In the 1st review, the bio-medical community needed more than read-only data access, and this was not addressed since it was considered that transactional models could not yet be accommodated. Is this still the case?
A: A transactional model for data access is still not supported. We have concentrated on trying to solve the problem using meta-data management capabilities.
Q: So you have solved the problem using versioning rather than transactions?
A: Yes.
Q: Meta-data management, is it common?
A: Reflection has started on meta-data but we have not yet gone into details.
A (Johan): We have strong meta-data requirements and these are being partially addressed with the Spitfire and WP5 software. WP9 also has a strong need for meta-data and so clearly there is a cross-application service required here.
Q: Why have you not installed a grid node in a bio-medical lab?
A: It is a complicated operation and with limited resources we have chosen to remain as customers of the testbed. The installation procedure is aimed more at computer centres than bio-medical labs. This is why we ask for future middleware to be easier to install so it can permeate more into our community.
A (Fab): The bio-medical community is more used to shrink-wrapped solutions while the HEP community is more technically aware and hence has less difficulty with the installation procedures. This will be further addressed in EGEE.
Q: The developers of bio-medical algorithms cannot be seen as purely end-users, but have they been included in this work?
A: Yes, but they cannot control the resources that we want to contribute to the testbed.
Q: WP10 is the most challenging for security requirements. Will this be addressed in the review?
A: Yes, in the security talk tomorrow.
Samaras comment: Simple installation should be a primary requirement for EGEE.

Les Robertson, LCG project leader, then reported on the exploitation of EDG within LCG:

"1. The EDG project has been an important catalyst for the computing centres in Europe that will provide capacity for LHC computing, encouraging them to work together, and to identify and confront many difficult issues of operation, security and management well before LHC data arrives.

2. The approach of the middleware developments, building on components from Globus and elsewhere, and collaborating with the US projects, rather than starting from scratch, has further helped a global project like LCG to avoid wide divisions in setting up our overall service. We still have different software in the US and Europe - but the differences are close enough that we hope that the HEP community in the US, with NSF funding, will deploy more of the EDG components that we use, starting to close the gap.

3. An initial package of middleware was deployed last September as "LCG-1". This grid now has about 30 sites, including sites in the US, Taiwan and Japan. The second middleware package, LCG-2, includes the Resource Broker, the Replica Location Service, replica management tools, and the VO Management Service (VOMS). R-GMA is now beginning to be installed at LCG sites, where it will provide data management services for monitoring (including applications monitoring) and for accounting.

4. The LHC Grid is being run as an operational service - this is not a testbed - it includes an operations centre at Rutherford Laboratory, and a Call Centre at Karlsruhe. Discussions are advancing on setting up a second operations and call centre in Taiwan. We have agreed a process for incorporating additional VOs into LCG, the first being another HEP experiment, D0 from Fermilab. The background work has been done to extend this to VOs in other sciences as soon as resources for these become available through the EGEE project.

5. The second generation of the fabric management software developed by the EDG WP4 team has been adapted and integrated into the operations management system of CERN's computer centre.

6. Overall one of the major contributions of EDG has been to bring together the people in Europe interested in exploiting grid technology. The complexity and sociology of this collaboration has posed its own problems, and has made going from development and demonstration to operation very hard. But, on the other hand it has limited unnecessary divergence and formed an identifiable community that is now ready with EGEE to move towards a service environment."

Q: What components other than those coming from EDG are used in LCG?
A: The selection was made about a year ago and we had to choose from EDG, Globus and VDT. The question is more which components from those distributions LCG will include.

WP11: dissemination (Roberto)
Talk: 15 minutes
Questions: 10 minutes
Q: The level of hits via the web is interesting but not exceptional. How does it compare to the Globus site?
A: We do not have comparison figures for the Globus site, but this value does not include the internal web-site used for the technical work of the project. The web-site has been registered with web search engines to ensure sufficient coverage. Hits by web-crawlers have been removed from the figures shown.
Q: Why should someone register for the Industry and Research Forum?
A: To receive information about significant events and developments of the project.
Q: Is a process in place to track the developments based on EDG in licensing terms and how do you plan to profit from this?
A: EDG uses an open-source license so other parties are free to further develop the software and exploit it commercially. Most sites downloading and developing the software are linked to the application groups and partners involved in the project.
A (Stefano): DATAMAT is not concerned with exploiting the results of EDG commercially, but rather is using the project as a way of developing a team that can gain experience to develop business models providing consultancy or to provide grid infrastructure.
A (CS-SI): Next month we will deploy a testbed for commercial usage based on EDG software.
Q: What will happen to the EDG web - it must be useful to EGEE?
A (Erwin): It will be maintained and linked to the EGEE web-site.
A (Guy): The contacts made via the Industry & Research Forum will also be migrated to EGEE.

Michal Turala (CrossGrid project leader): Made a statement concerning collaboration between EDG and CrossGrid. He read a letter of appreciation of this connection - a copy has been provided to the reviewers.

QA: Gabriel Zaquine
Talk: 15 minutes
Q: A lesson learnt was the importance of a dedicated testing team. Is this foreseen in EGEE?
A: A substantial testing team is foreseen.
Q: How can you compare the performance indicators, especially when the tests are made by experts?
A (Erwin): The results shown come from tests by the end-users. The differences between the EDG 1.4 and EDG 2.x figures can be explained by the difference between tests run by naïve users using the grid as a black box and tests run by more informed users who have a strong support team.
A (Francesco): We have a job re-submission facility that will re-submit jobs if they fail, but this facility is not enabled for these results.
Q: How many re-submissions were needed to get a job through?
A (Francesco): In tests, we never needed 3 re-submissions.
Q: Is the quality sufficient for the applications and what is the target?
A (Bob): Quality objectives were defined with the applications around the time of EDG 1.4. A target of 99% was defined, but this is quite vague (X% down per day, per month etc.). In the future (EGEE) we will further develop the QA approach via SLAs.
Q: Why the dip in performance of EDG 2.0 during December 2003?
A (Frank): This was run over Christmas when it was treated as a black box and we are very happy with these "unattended" results.

Demonstrations

WP10:
Q: Why do you list dedicated resources as a necessary improvement?
A: The terminology is wrong. We really mean advance reservation to get guaranteed resources.
The demo worked successfully - all jobs completed and the data was retrieved.

WP9: demo
Q: Is the meta-data facility used in this demo?
A: Yes, we are making use of the meta-data catalog.
The demo worked successfully and the results were displayed in the GUI.
Q: You generate very long file names?
A: Yes, these are GUIDs, which are intended for use by the software (i.e. not necessarily to be read by humans), but the user can direct the output to a directory of their choice.

CMS DAGMAN demo:
Q: How are the CEs chosen?
A: The RB chooses them dynamically.
DAGMAN executed 10 out of 11 steps successfully and was left running while we moved on to the next demo.

WP8: ALICE demo
Q: Is GENIUS developed as part of the funded program of EDG?
A (Roberto): No EU funded resources were used for the development of GENIUS.
Q: It seems that the GUIs became more sophisticated the closer we get to CERN?
A (Francesco): The DATAMAT and NICE partners have been working together to ensure the GUI components developed by WP1 can be used in various assembled GUIs.
A (Roberto): The GENIUS GUI can be used for many applications - it is not restricted to ALICE or HEP. GENIUS will be further developed in EGEE NA4.
A (Erwin): We showed the WP9 portal in the demos of the 2nd review and wanted to show a different GUI this time. But all these GUIs are continuing to be developed.
Roberto then showed the GENIUS portal being accessed via PDAs to continue working with ALICE production jobs.

Summary slide of the demos.

Session closed at 18:55