LCG Management Board

Date/Time: Tuesday 22 August 2006 at 16:00

Agenda:

Members:

(Version 1 – 26.8.2006)
Participants: J-Ph. Baud, I. Bird, N. Brook, F. Carminati, T. Cass, Ph. Charpentier, L. Dell’Agnello, C. Eck, M. Ernst, I. Fisk, B. Gibbard, J. Gordon, A. Heiss, F. Hernandez, J. Knobloch, M. Lamanna, E. Laure, G. Merino, B. Panzer, G. Poulard, Di Quing, H. Renshall, L. Robertson (chair), Y. Schutz, J. Shiers, D. Smith, O. Smirnova, J. Templon
Action List

Next Meeting: Tuesday 29 August from 16:00 to 17:00
1. Minutes and Matters Arising (minutes)
1.1 Minutes of Previous Meeting

The minutes of the 8 August meeting were approved.

1.2 Update on the LHCC Review Preparation

- An updated agenda is attached. All of the sessions now have agreed speakers, or are in sections for which someone has undertaken to do the organisation, with the exception of the Tier-2 session. L.Robertson has still not been able to contact M.Jouvin, who had been asked to give a summary of the Tier-2 status. As this is still the holiday season, he will wait until the beginning of the week before seeking a replacement. In the meantime G.Merino agreed to ask one of the Spanish Tier-2 federations whether they will talk on their state of readiness and any difficulties they have encountered.
- The agenda will now be put up in the agenda system as a Management Board meeting.
- It was agreed that speakers should attach their talks to the agenda by lunchtime on 18 September at the latest, to allow MB members one week to review them.

1.3 Availability and Reliability Measurements to mid-August

- An extract of the SAM Tier-1 availability figures for 1 July to mid-August is attached to the agenda, adjusted to show both availability and reliability (taking account of scheduled down periods). L.Robertson noted that this shows that availability continues to stagnate at 72%. Of the 10 sites being measured, only 3 achieved the target, and 4 failed to come within 90% of the target. Many sites show extended down periods. As this is becoming the major problem for the services, it should be discussed in some detail at the extended MB on 5 September at BNL, before the meeting of the Overview Board the following week. In preparation for this, all sites are asked to send an analysis of the reasons for each of the down periods of their site in July/August to the MB mailing list by the end of the month (Thursday 31 August). This will give time for an overall analysis to be prepared as an introduction to the topic at the MB.
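The availability/reliability distinction in 1.3 can be sketched as follows. This is a minimal illustration; the formulas and hours below are assumptions inferred from the description "taking account of scheduled down periods", not SAM's exact definitions.

```python
# Assumed definitions (illustrative, not SAM's exact formulas):
#   availability = time passing tests / total time
#   reliability  = time passing tests / (total time - scheduled downtime)

def availability(up_hours, total_hours):
    """Fraction of the whole period for which the site passed the tests."""
    return up_hours / total_hours

def reliability(up_hours, total_hours, scheduled_down_hours):
    """Availability computed after excusing scheduled down periods."""
    return up_hours / (total_hours - scheduled_down_hours)

# Example (invented numbers): 520 h up in a 744-hour month,
# of which 24 h were scheduled downtime.
a = availability(520, 744)       # ~0.70
r = reliability(520, 744, 24)    # ~0.72 -- slightly better, as downtime is excused
```

A site with long scheduled interventions thus scores noticeably better on reliability than on raw availability.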
1.4 Direct Data Transfers by LHCb from Worker Nodes to CERN

- This point has been added to the agenda of this meeting as an AOB item.
2. Action List Review (list of actions)

Actions that are late are highlighted in RED.

In the absence of the planning officer this point was postponed to the next meeting.
3. RFIO in DPM and Castor
David Smith presented the status of the work going on to resolve the incompatibility between the versions of rfio used by DPM and Castor. A common library is being developed that will be used by both DPM and Castor, with a single implementation of rfio and other shared functions including thread handling, socket functions and error reporting. The target is to have this library available, with the appropriate DPM plug-ins, for testing in mid-September. The Castor-specific part is also well under way, with testing estimated to begin in mid-October. By the end of September it is also planned to have a new version of DPM with SRM v2.2.

Michael Ernst said that the availability of the new library was very important for CMS’s CSA06 because of the large number of sites involved that use DPM.

Philippe Charpentier asked if this was being done in close collaboration with the ROOT support team, as LHCb does not use rfio directly. It would be important that the same version of ROOT worked with all storage systems – DPM, Castor 2 and Castor 1. Jean-Philippe Baud said that at present there is no plan to produce a plug-in for Castor 1.

Michael Ernst said that it is acceptable for CMS to have a DPM version to test no later than mid-September, but it is essential for CMS to have a production-ready and packaged version for DPM no later than October 1st. There is a small number of Castor sites that can be handled manually.

Coming back to the question of integration with ROOT, Jean-Philippe Baud will contact the ROOT team and clarify the situation. In order to understand the implications of the schedule on the Castor side, Tony Cass will ask the Castor 1 sites (CNAF, PIC and ASGC) what their schedule is for migrating to Castor 2, and Jean-Philippe Baud will estimate the cost of also providing a Castor 1 plug-in.
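The plug-in structure described above can be sketched roughly as follows. All names here are hypothetical illustrations of the idea (one shared rfio implementation, with the storage-system-specific parts supplied as plug-ins); they are not the real library's API.

```python
from typing import Protocol

class StoragePlugin(Protocol):
    """Operations each storage backend (DPM, Castor) supplies to the shared library."""
    def resolve(self, path: str) -> str: ...
    def open_file(self, physical_path: str, mode: str) -> int: ...

class DPMPlugin:
    """Toy DPM backend -- behaviour invented for illustration only."""
    def resolve(self, path: str) -> str:
        return "dpm:" + path      # pretend name resolution
    def open_file(self, physical_path: str, mode: str) -> int:
        return 3                  # pretend file descriptor

def rfio_open(plugin: StoragePlugin, path: str, mode: str = "r") -> int:
    """Single shared rfio entry point: common logic (threading, sockets,
    error reporting would live here), backend work delegated to the plug-in."""
    physical = plugin.resolve(path)          # backend-specific resolution
    return plugin.open_file(physical, mode)  # backend-specific open
```

The point of the design is that DPM and Castor then share one rfio code path, and supporting a new system (e.g. a Castor 1 plug-in) means implementing only the backend interface.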
4. Tier-1/Tier-2 Relationships
Chris Eck had sent MB members the first version of the table prepared from data provided by the experiments, showing a preliminary association of Tier-1 and Tier-2 sites and giving the resulting requirements for storage at the Tier-1s and data transfer bandwidth between the Tier-1s and the Tier-2s. He had not received much feedback. He pointed out that at the average Tier-1 site half of the data storage is required to support the Tier-2s, and in some individual cases this is close to or even exceeds the pledged storage capacity of the Tier-1. He also noted that the aggregate bandwidth from a Tier-1 to the Tier-2s that it serves is in some cases a factor of two higher than the bandwidth required between that Tier-1 and CERN. It would also be important that the Tier-1 and Tier-2 sites are comfortable with working together.

The status of this work has to be presented to the Overview Board on 11 September, and it is important that the Tier-2 centres have an opportunity to review the information prior to the meeting. He therefore asked Tier-1 sites to give any feedback before the end of the week (25 August), after which he would send the tables to the Tier-2 representatives.

Ian Fisk said that the question of storage and data rates is a function of the experiment computing models, and it is not for the sites to comment on these numbers. Chris Eck said that indeed the numbers had been provided by the experiments following their computing models. Nevertheless the result differs between sites in terms of the percentage of disk space used to support the Tier-2s, and there do appear to be anomalies (e.g. the over-commitment of RAL). This will have implications on the other work that can be performed at the Tier-1s, and the Tier-1s should be aware of this. On the other hand, it may be possible to achieve a better balance.
As the tables do not contain the storage requirements at the Tier-1s for other purposes, it is not possible for the Tier-1s to check that their resources match the experiments’ expectations. Yves Schutz noted that […]. Gilbert Poulard noted that the numbers for 2008 will change as the experiments revise their needs using a more realistic startup scenario for 2007/08. Les Robertson said that we should be considering the first full year of data taking. At present this is said to be 2008, but the numbers should be valid for the first full year even if that is 2009 or later.

The purpose of the exercise was to enable the sites to understand in realistic and practical terms what they will be using their resources for, and with which sites they have to work. This would then enable them to develop the necessary operational relationships and test out the data paths. The experiments know exactly how to extract the individual Tier-1 requirements for storage and data transfer bandwidth for purposes other than those already included in the tables. In order to avoid possible errors that might arise if these numbers were extracted by the sites themselves, it was agreed that each experiment would provide them by the end of the week (25 August). A note specifying this in detail will be sent to the coordinators (see email from Chris Eck – 23 August 2006).
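The kind of aggregation behind the table can be sketched as follows. All site names and numbers here are invented for illustration; the real figures come from the experiments' computing models.

```python
# Per-Tier-1 list of the Tier-2s it serves: (Tier-2 name, storage TB, bandwidth MB/s).
# All values invented for illustration.
t2_table = {
    "T1-A": [("T2-1", 120, 30), ("T2-2", 80, 25)],
    "T1-B": [("T2-3", 200, 60)],
}
t1_to_cern_bandwidth = {"T1-A": 25, "T1-B": 70}   # MB/s, invented

summary = {}
for t1, served in t2_table.items():
    storage_tb = sum(tb for _, tb, _ in served)        # T1 storage needed for its T2s
    agg_bw = sum(bw for _, _, bw in served)            # aggregate T1->T2 bandwidth
    ratio = agg_bw / t1_to_cern_bandwidth[t1]          # compare with T1-CERN bandwidth
    summary[t1] = (storage_tb, agg_bw, ratio)

# With these invented numbers, T1-A's aggregate T2 bandwidth (55 MB/s) is more
# than twice its T1-CERN bandwidth (25 MB/s) -- the kind of anomaly noted above.
```

Summing each Tier-1's column like this is what exposes the imbalances mentioned in the discussion (Tier-2 support consuming half or more of a Tier-1's pledged storage, or T1-T2 bandwidth exceeding the T1-CERN figure).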
5. Progress on the Review of Requirements
Les Robertson introduced this point. ALICE, ATLAS and LHCb have provided revised estimates of their requirements based on the agreed startup scenario. The MB must now decide how to proceed. The matter will be discussed at the Overview Board on 11 September, but in preparation for that it would be useful for MB members to see these numbers in order that they can discuss the implications for their sites with their […].

In the longer term each country will have to decide how to deal with the revised estimates. In the case of CERN, a shared collaboration facility, the MB itself should consider the implications on the budget and planning. It was agreed that Bernd Panzer will organise a meeting (preferably next week) with the experiment coordinators to look at this.
6. AOB
Les Robertson will propose the agenda by email.
Les Robertson asked for two volunteers, one from a site and one from an experiment to develop a new set of milestones. No volunteers were forthcoming during the meeting. He will contact people individually to try to find volunteers.
Nick Brook presented the reasons why LHCb transfers data directly from worker nodes at Tier-2s to CERN (see foils). This is MC data that has been produced at up to 52 LCG sites. The data is all transferred directly to CERN, where it is collated and re-distributed to Tier-1 sites for reconstruction and stripping. The daily volume of data (foil 3) is 10-14K files, each 0.1-0.2 GB in size, producing an average data rate of 10-15 MBytes/sec.
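As a back-of-the-envelope check, the quoted file counts and sizes are consistent with the quoted average rate. The midpoint values below are illustrative assumptions, not LHCb figures.

```python
# Check that 10-14K files/day of 0.1-0.2 GB gives roughly 10-15 MBytes/sec.
files_per_day = 12_000        # within the quoted 10-14K files/day (assumed midpoint)
gb_per_file = 0.1             # low end of the quoted 0.1-0.2 GB
seconds_per_day = 24 * 3600   # 86,400 s

total_gb_per_day = files_per_day * gb_per_file             # 1200 GB/day
rate_mb_per_s = total_gb_per_day * 1000 / seconds_per_day  # ~13.9 MB/s
```

Taking the upper ends of both ranges instead pushes the rate above 30 MB/s, so the quoted 10-15 MBytes/sec average corresponds to the lower part of the file-size range.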
The transfer takes place before the end of the job. There have been many problems of transfer failures, long delays (sometimes it takes 1,000s of seconds to transfer a file), and wide variations in transfer time from the same worker node (foil 4). These problems are not restricted to the smaller sites.
The previous production used a similar setup, with up to 12K files per day. The average data rates were similar but the problems of delays and failures were not seen.
At the last MB (LHCb not present) the question was raised of why FTS is not used for this. Foil 7 considers transferring directly from the worker nodes to Tier-1s, then using FTS to transfer the data to CERN for collation and re-distribution. This could also be error-prone, as LHCb has seen instabilities at some of the Tier-1s.
Since the data rates are relatively low, Nick Brook considers that the bottlenecks should be investigated and fixed rather than LHCb changing to use the standard FTS data transfer service.
John Gordon asked what rates had been seen when FTS is used. LHCb’s only experience with FTS has been in the re-distribution from CERN, where a few tens of MBytes/sec is required. LHCb has not used FTS to transfer data from Tier-2s.
There was some discussion on why LHCb is not using FTS. The experiment is using “cpu scavenging” due to its shortfall in MC production capacity, and so is running jobs at sites at which it does not have formal storage allocations on the local mass storage system, where the data could otherwise be stored for processing by FTS.
Jamie Shiers noted that CMS had also reported similar symptoms of erratic transfer rates using the FTS service. After a careful investigation this has been traced to the behaviour of a network router at CERN intended to bypass the firewall for certain file transfers.
Les Robertson thanked Nick for clearly explaining the reasons for using direct transfers, but noted that the underlying issue for the people providing services is that there are many problems to be investigated in the supported services and these must be given priority. We should first see if the router issue mentioned by Jamie resolves the problem.
7. New Actions
Action:

31 Aug 06 - Each site (CERN + Tier-1s) - to provide a report with the reasons for each failure of the SAM basic tests at their site in July and the first half of August, to be emailed to the MB before the end of the month. This will be discussed at the MB meeting at BNL on 5 September, to prepare a position for the Overview Board on 11 September.

25 Aug 06 - All experiments - to provide by the end of the week (25 August), for each of their Tier-1 sites, their requirements for T1-T1 bandwidth, T0-T1 bandwidth and storage space by mass storage class, for purposes other than those already included in the T1/T2 relationship table (see email from Chris Eck).

1 Sep 06 - Bernd Panzer - to organise a meeting with the experiment coordinators to review the effects of the revised estimates on the costs of the CERN facility.
Annex – email from C.Eck

From: Chris Eck