LCG Management Board

Date/Time:

Tuesday 22 August 2006 at 16:00

Agenda:

http://agenda.cern.ch/fullAgenda.php?ida=a063098

Members:

http://lcg.web.cern.ch/LCG/Boards/MB/mb-members.html

 

(Version 1 – 26.8.2006)

Participants:

J-Ph.Baud, I.Bird, N.Brook, F.Carminati, T.Cass, Ph.Charpentier, L.Dell’Agnello, C.Eck, M.Ernst, I.Fisk, B.Gibbard, J.Gordon, A.Heiss, F.Hernandez, J.Knobloch, M.Lamanna, E.Laure, G.Merino, B.Panzer, G.Poulard, D.Qing, H.Renshall, L.Robertson (chair), Y.Schutz, J.Shiers, D.Smith, O.Smirnova, J.Templon

Action List

https://twiki.cern.ch/twiki/bin/view/LCG/MbActionList

Next Meeting:

Tuesday 29 August from 16:00 to 17:00

 

1.      Minutes and Matters arising (minutes)

 

1.1         Minutes of Previous Meeting

The minutes of the 8 August meeting were approved.

1.2         Update on the LHCC Review Preparation

-          An updated agenda is attached. All of the sessions now have agreed speakers, or are in sections for which someone has undertaken the organisation, with the exception of the Tier-2 session. L.Robertson has still not been able to contact M.Jouvin, who had been asked to give a summary of the Tier-2 status. As this is still the holiday season he will wait until the beginning of the week before seeking a replacement. In the meantime G.Merino agreed to ask one of the Spanish Tier-2 federations whether they would talk on their state of readiness and any difficulties they have encountered.

-          The agenda will now be put up in the agenda system as a Management Board meeting.

-          It was agreed that speakers should attach their talks to the agenda by lunchtime on 18 September at the latest, to allow MB members one week to review them.

1.3         Availability and Reliability measurements to mid August

-          An extract of the SAM Tier-1 availability figures for 1 July to mid-August is attached to the agenda, adjusted to show both availability and reliability (taking account of scheduled down periods). L.Robertson noted that this shows that availability continues to stagnate at 72%. Of the 10 sites being measured only 3 achieved the target, and 4 failed to come within 90% of the target. Many sites show extended down periods. As this is becoming the major problem for the services it should be discussed in some detail at the extended MB on 5 September at BNL, before the meeting of the Overview Board the following week. In preparation for this all sites are asked to send an analysis of the reasons for each of the down periods of their site in July/August to the MB mailing list by the end of the month (Thursday 31 August). This will give time for an overall analysis to be prepared as an introduction to the topic at the MB.
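
The attachment does not spell out how the two measures differ; as a hedged sketch of the usual convention (an assumption, not taken from the attachment), with uptime meaning time during which a site passes the SAM tests:

    availability = uptime / total time
    reliability  = uptime / (total time - scheduled downtime)

so reliability discounts announced maintenance periods while availability does not.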

1.4         Direct data transfers by LHCb from Worker Nodes to CERN

-          This point has been added to the agenda of this meeting as an AOB item.

2.      Action List Review (list of actions)

Actions that are late are highlighted in RED.

 

In the absence of the planning officer this point was postponed to the next meeting.

 

 

3.      RFIO in DPM and Castor

 

 

David Smith presented the status of the work to resolve the incompatibility between the versions of rfio used by DPM and Castor. A common library is being developed that will be used by both DPM and Castor, with a single implementation of rfio and of other common functions including thread handling, socket functions and error reporting. The target is to have this library available with the appropriate DPM plug-ins for testing in mid-September. The Castor-specific part is also well under way, with testing estimated to begin in mid-October. By the end of September it is also planned to have a new version of DPM with SRM v2.2.
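
The minutes do not describe the library's interface; the following minimal C sketch is purely illustrative of the layering described above (one shared rfio entry point dispatching to per-backend plug-ins), and every name in it is hypothetical:

    /* Illustrative only: a hypothetical plug-in table that each backend
       (DPM, Castor) would supply to the shared library. */
    #include <stdio.h>

    struct storage_plugin {
        const char *name;
        int (*open_file)(const char *path, int flags);
        int (*close_file)(int fd);
    };

    /* Stub standing in for a real DPM back end. */
    static int dpm_open(const char *path, int flags) {
        (void)flags;
        printf("DPM backend opening %s\n", path);
        return 3;                       /* dummy descriptor */
    }
    static int dpm_close(int fd) { (void)fd; return 0; }

    static const struct storage_plugin dpm_plugin = { "dpm", dpm_open, dpm_close };

    /* Common layer: the shared code (sockets, threads, error reporting)
       would live here; only storage access is delegated to the plug-in. */
    static const struct storage_plugin *active = &dpm_plugin;

    int rfio_open(const char *path, int flags) { return active->open_file(path, flags); }
    int rfio_close(int fd)                     { return active->close_file(fd); }

    int main(void) {
        int fd = rfio_open("/dpm/example/file", 0);
        return rfio_close(fd);
    }

In this scheme the Castor-specific part mentioned above would be a second storage_plugin instance, which is why the two backends can ship on different dates against the same common library.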

 

Michael Ernst said that the availability of the new library was very important for CMS’s CSA06 because of the large number of DPM sites in Italy and the UK, and asked when the DPM version would be available in a packaged form suitable for distribution. Jean-Philippe Baud said that the DPM version with the new rfio library support would contain only these modifications (no other features or bug fixes), and so he expected that it could be used immediately in a pre-production environment. Tony Cass said that the Castor version would be available for testing in mid-October, but as several people are at present on holiday he had not been able to get a good estimate of how long it would take to complete the testing and provide a distributable package.

 

Philippe Charpentier asked if this was being done in close collaboration with the ROOT team, as LHCb does not use rfio directly. It would be important that the same version of ROOT worked with all sites, whether running DPM, Castor 2 or Castor 1. Jean-Philippe Baud said that at present there is no plan to produce a plug-in for Castor 1.

 

Michael Ernst said that it is acceptable for CMS to have a DPM version to test no later than mid-September, but it is essential for CMS to have a production-ready and packaged version for DPM no later than October 1st. There is only a small number of Castor sites, and these can be handled manually.

 

Coming back to the question of integration with ROOT, Jean-Philippe Baud will contact the ROOT team and clarify the situation.

 

In order to understand the implications of the schedule on the Castor side, Tony Cass will ask the Castor 1 sites (CNAF, PIC and ASGC) what their schedule is for migrating to Castor 2, and Jean-Philippe Baud will estimate the cost of also providing a Castor 1 plug-in.

 

4.      Tier-1/Tier-2 Relationships

 

 

Chris Eck had sent MB members the first version of the table prepared using data provided by the experiments, showing a preliminary association of Tier-1 and Tier-2 sites and giving the resulting requirements for storage at the Tier-1s and for data transfer bandwidth between the Tier-1s and the Tier-2s. He had not received much feedback. He pointed out that at the average Tier-1 site half of the data storage is required to support the Tier-2s, and in some individual cases this is close to or even exceeds the pledged storage capacity of the Tier-1. He also noted that the aggregate bandwidth from a Tier-1 to the Tier-2s that it serves is in some cases a factor of two higher than the bandwidth required between that Tier-1 and CERN. It would also be important that the Tier-1 and Tier-2 sites are comfortable with working together. The status of this work has to be presented to the Overview Board on 11 September, and it is important that the Tier-2 centres have an opportunity to review the information prior to the meeting. He therefore asked Tier-1 sites to give any feedback before the end of the week (25 August), after which he would send the tables to the Tier-2 representatives.
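
As an illustration of the scale involved (a reading of the sample Tier-1 summary in the annex below, not a statement made at the meeting): at that Tier-1 the Tier-2s' disk requirement comes to 790 TByte once the 70% disk efficiency factor is applied, just over half of the 1500 TByte disk pledge, leaving a balance of 710 TByte for all other work.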

 

Ian Fisk said that the question of storage and data rates is a function of the experiment computing models and it is not for the sites to comment on these numbers. Chris Eck said that indeed the numbers had been provided by the experiments following their computing models.

 

Nevertheless the result differs from site to site in terms of the percentage of disk space used to support the Tier-2s, and there do appear to be anomalies (e.g. the over-commitment of RAL). This will have implications for the other work that can be performed at the Tier-1s, and the Tier-1s should be aware of this. On the other hand, it may be possible to achieve a better balance. As the tables do not contain the storage requirements at the Tier-1s for other purposes, it is not possible for the Tier-1s to check that their resources match the experiments’ expectations.

 

Yves Schutz noted that the ALICE numbers represent their requirements rather than being based on current pledges.

 

Gilbert Poulard noted that the numbers for 2008 will change as the experiments revise their needs using a more realistic startup scenario for 2007/08. Les Robertson said that we should be considering the first full year of data taking. At present this is said to be 2008, but the numbers should be valid for the first full year even if that is 2009 or later.

 

The purpose of the exercise was to enable the sites to understand in realistic and practical terms what they will be using their resources for, and with which sites they have to work. This would then enable them to develop the necessary operational relationships and test out the data paths. The experiments know exactly how to extract the individual Tier-1 requirements for storage and data transfer bandwidth for purposes other than those already included in the tables. In order to avoid possible errors that might arise if these numbers were extracted by the sites themselves, it was agreed that each experiment would provide them by the end of the week (25 August). A note specifying this in detail will be sent to the coordinators (see email from Chris Eck – 23 August 2006, annexed below).

 

5.      Progress on the Review of Requirements

 

 

Les Robertson introduced this point. ALICE, ATLAS and LHCb have provided revised estimates of their requirements based on the agreed startup scenario. The MB must now decide how to proceed. The matter will be discussed at the Overview Board on 11 September, but in preparation for that it would be useful for MB members to see these numbers in order that they can discuss the implications for their sites with their OB member. It was agreed that the data should be distributed to the MB (linked here) with a clear understanding that these are preliminary numbers that have not been endorsed by the collaborations. It is not known if CMS has changed its position with respect to this exercise.

 

In the longer term each country will have to decide how to deal with the revised estimates. In the case of CERN, a shared collaboration facility, the MB itself should consider the implications for the budget and planning. It was agreed that Bernd Panzer will organise a meeting (preferably next week) with the experiment coordinators to look at this.

 

6.      AOB

 

6.1         Agenda for the BNL meeting

Les Robertson will propose the agenda by email.

6.2         Preparation of a new set of Level-1 milestones for the next 18 months

Les Robertson asked for two volunteers, one from a site and one from an experiment, to develop a new set of milestones. No volunteers were forthcoming during the meeting. He will contact people individually to try to find volunteers.

6.3         LHCb data transfers direct from worker nodes to CERN.

Nick Brook presented the reasons why LHCb transfers data directly from worker nodes at Tier-2s to CERN (see foils). This is MC data that has been produced at up to 52 LCG sites. The data is all transferred directly to CERN, where it is collated and re-distributed to Tier-1 sites for reconstruction and stripping. The daily volume of data (foil 3) is 10-14K files, each 0.1-0.2 GB in size, producing an average data rate of 10-15 MBytes/sec.
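
As a rough consistency check on these figures (an illustrative calculation, not taken from the foils), at the lower end of the quoted ranges:

    10,000 files/day × 0.1 GB/file = 1.0 TB/day
    1.0 TB/day ÷ 86,400 s/day      ≈ 12 MByte/s

which is in line with the quoted 10-15 MBytes/sec average.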

 

The transfer takes place before the end of the job. There have been many problems of transfer failures, long delays (sometimes it takes thousands of seconds to transfer a file), and wide variations in transfer time from the same worker node (foil 4). These problems are not restricted to the smaller sites.

 

The previous production used a similar setup, with up to 12K files per day. The average data rates were similar but the problems of delays and failures were not seen.

 

At the last MB (with LHCb not present) the question had been raised of why FTS is not used for this. Foil 7 considers direct transfers from the worker nodes to Tier-1s, then using FTS to transfer the data to CERN for collation and re-distribution. This could also be error prone, as LHCb has seen instabilities at some of the Tier-1s.

 

Since the data rates are relatively low, Nick Brook considers that the bottlenecks should be investigated and fixed rather than LHCb changing to use the standard FTS data transfer service.

 

John Gordon asked what rates had been seen when FTS is used. LHCb’s only experience with FTS has been in the re-distribution from CERN, where a few tens of MBytes/sec is required. LHCb has not used FTS to transfer data from Tier-2s.

 

There was some discussion on why LHCb is not using FTS. The experiment is using “cpu scavenging” because of its shortfall in MC production capacity, and so is running jobs at sites where it has no formal storage allocation on the local mass storage system in which the data could be staged for transfer by FTS.

 

Jamie Shiers noted that CMS had also reported similar symptoms of erratic transfer rates using the FTS service. After a careful investigation this has been traced to the behaviour of a network router at CERN intended to bypass the firewall for certain file transfers.

 

Les Robertson thanked Nick for clearly explaining the reasons for using direct transfers, but noted that the underlying issue for the people providing services is that there are many problems to be investigated in the supported services and these must be given priority. We should first see if the router issue mentioned by Jamie resolves the problem.

7.      New Actions

 

 

Actions:

 

31 Aug 06 –  Each site (CERN + Tier-1s) - to provide a report with the reasons for each failure of the SAM basic tests at their site in July and the first half of August - to be emailed to the MB before the end of the month. This will be discussed at the MB meeting at BNL on 5 September, to prepare a position for the Overview Board on 11 September.

 

25 Aug 06 - All experiments - to provide by the end of the week (25 August) for each of their Tier-1 sites their requirements for T1-T1 bandwidth, T0-T1 bandwidth and storage space by mass storage class, for purposes other than those already included in the T1/T2 relationship table (see email from Chris Eck).

 

1 Sep 06 - Bernd Panzer - to organise a meeting with the experiment coordinators to review the effects of the revised estimates on the costs of the CERN facility.

 

Annex – email from C.Eck

From: Chris Eck
Sent: 23 August 2006 11:25
To: worldwide-lcg-management-board (LCG Management Board)
Cc: Nicholas Brook; Roger Jones; Dave Newbold; Yves Schutz; Andreas Heiss
Subject: ADDitional Data for T1/T2 Table

Hello,
The MB agreed yesterday to add a few figures to the summary per Tier-1 in the tables giving the Tier-1 resources required for Tier-2s collected by the Tier-2/Tier-1 Team.

Attached to this mail you will find a sample of a Tier-1 summary table. For each Tier-1 serving a given experiment this experiment will have to fill in 5 numbers in a copy of the sample table. These are:
1. The net required bandwidth (no overhead factors applied) from the Tier-0 to this Tier-1 for this experiment.
2. The sum of the net bandwidths required for Tier-1 to Tier-1 traffic at this Tier-1 for this experiment.
3. The storage, split into the same three classes used already for the Tier-2 requirements, required at this Tier-1 for this experiment in addition to the Tier-2 requirements already included in the table. The storage requirements should be again net values. The 70% disk storage efficiency will be added automatically.

Apparently, these figures are easily extracted from the Computing TDRs. I ask therefore the computing co-ordinators to send me (either directly or via their Tier-2/Tier-1 Team member) the filled-in tables by next Friday, August 25.
Many thanks,
                Chris

TOTALS                        T0=>T1    T1<=>T1   T2=>T1    T1=>T2    Storage for T2 (TByte)                   Storage for T1 (TByte)
                              MByte/s   MByte/s   MByte/s   MByte/s   Tape1-Disk0  Tape1-Disk1  Tape0-Disk1    Tape1-Disk0  Tape1-Disk1  Tape0-Disk1

ALICE                                               0.0       0.0         0.0          0.0          0.0
ATLAS                                              13.2      13.9         0.0        522.4          0.0
CMS                                                 6.1     100.7       181.9          0.0         30.3
LHCb                                                0.0       0.0         0.0          0.0          0.0
SUM                                                19.3     114.6       181.9        522.4         30.3

With 70% disk efficiency                                                             746.3         43.3                        0.0          0.0
Total storage requirements                                              181.9          790                         0                          0
Tape and disk pledges                                                               1300                                      1500
Balance                                                                             1118                                       710

(The T0=>T1, T1<=>T1 and Storage for T1 columns are those each experiment is asked to fill in; see the email above.)

Christoph Eck
LHC Computing Grid Project
Resource Manager
Information Technology Department
European Organization for Nuclear Research
CERN               phone: +41 22 7674260
CH-1211 Geneva 23    gsm: +41 76 4873800