DRAFT - Notes on the PEB – 16 December 2003

 

Present – Dario Barberis, Ian Bird, Nick Brook, Philippe Charpentier, Tony Doyle, Chris Eck, David Foster, Frédéric Hemmer, Werner Jank, Bob Jones, Matthias Kasemann, Alberto Masoni, Mirco Mazzucato, Bernd Panzer, Gilbert Poulard, Les Robertson (chair, secretary)

 

Actions – Actions are identified by bold blue italics.

 

Minutes – The minutes from the previous meeting were not available.

 

Information –

  • The PIC in Barcelona has been designated as a Tier 1 centre by Spain.
  • A second round of funding has been approved for GridPP – with £15.9M for people and equipment in 2004-07. An additional £1M is earmarked for LCG, subject to negotiation with CERN.

 

Resources at CERN for 2004 – Bernd Panzer – See foils

  • A major increase in AFS space is requested, equivalent to 60% of the current capacity of the service. This coincides with the need to replace both the backup tapes (moving from old IBM drives to STK 9940Bs) and a set of aging, unreliable servers. The total cost would double the AFS materials budget of 2003. Ways of reducing the costs are being studied.
  • A high level plan for the CERN Fabric for 2004 is given on foil 4, showing the computing and physics data challenges, and other significant events.
  • The budget for new physics computing capacity is CHF 1.7M, including non-LHC experiments. This is in addition to the “LCG prototype” budget intended primarily for computing data challenges, grid testing and pilot service, etc.
  • The requests amount to
    • CPU - 500 KSI2000 (450K for LHC, 50K for non-LHC)
    • Disk – 140 TB (100 TB LHC, 40 TB non-LHC)
  • Bernd proposes a solution (foil 7) that satisfies the needs for the base services and physics data challenges throughout the year, except for a peak of a few weeks in the middle of the year. This includes “borrowing” capacity from the prototype during the main data challenge period in April-June. It was noted that additional requirements would arise later in the year, when spare capacity will be available.
  • The prototype situation for the computing data challenges (foil 8) would satisfy all the known requests, including the online tests of ALICE and ATLAS, the latter achieved through borrowing from the production system during March.
  • The requests for disk capacity cannot be fully satisfied within the budget. Bernd proposes to purchase 50 TB now to satisfy the “base” requests. The additional capacity required for data challenges could be partially met by using 15 TB of prototype space and assuming that 10 TB of the current (2003) storage can be freed up. He also anticipates that it may be possible to purchase an additional 10 TB early next year when all the budget commitments are understood (e.g. AFS). The result is given on foil 10.

The proposals were accepted.


Scheduling LCG-2 Roll-out and the Data Challenges – Les Robertson

 

  • See the draft note.
  • Les said that he had talked to or received feedback from all of the Regional Centres (RCs) in the target site list, with the exception of FNAL. All of the centres contacted expressed willingness to go along with the draft plan, except FZK, which has some reservations (email from Holger Marten to the GDB list dated 16 December). Most centres said that they must have formal requests for the capacity from their local experiment contacts. RAL also stressed the importance of experiments stating any special requirements for the environment that would have to be satisfied by the RC.
  • There have been delays in the preparation of the middleware due to last minute critical problems with VDT (surprisingly none of the other VDT users had uncovered these). The problems have now been fixed and the software was released by the certification team today. An updated “realistic” schedule is given below.

 

Revised target date | Event | Comment
16dec03 | LCG-2 delivered to installation team | Completed
17dec03 | LCG-2 available for testing by experiments at CERN | Local testing with the LCG-2 code; preparation of experiment software packages
12jan04 | LCG-2 package available to regional centres with documentation |
19jan04 | 10 worker nodes in operation at each regional centre | Experiments can start installation, set-up and testing
26jan04 | Base capacity reached in each centre; staffing available for service operation, debugging, stabilisation |
9feb04 | Production service in operation | Define the stability criteria for the validation test (availability, system admin. response time, job-failure rate, ...)
 | Two-week running-in period |
23feb04 | Start 30-day validation test | Unstable sites will be withdrawn prior to the test period
1apr04 | Full capacity of participating regional centres online; additional sites join the production grid |

  • CMS – Werner

    The schedule is difficult for CMS as DC04 is planned to start in March, in the middle of the validation test. CMS will therefore have to start using their current network of RCs, which would remain the backup until the LCG-2 service is proven to be superior. The current service uses SRB as the central catalogue. The longer-term target is to use RLS, but CMS would prefer RCs also to support SRB in the LCG-2 service. Mirco said that testing all of these options imposes a significant cost on the support staff at RCs; CMS management should consider making a clear statement that the mainline approach is RLS on LCG-2, with the current system and SRB as the backup.

    Werner noted that people are working now in several places to adapt the CMS system to LCG-1. One of the outstanding issues is the need to resolve an incompatibility in the file naming convention used by POOL and the replica manager.

  • Mirco called for establishing a deployment team or task force for each of the experiments, with named members including a liaison with Flavia’s group. Dario said that this had already been established for ATLAS under Oxana. Other experiments to identify people responsible for their task force.
     
  • ATLAS – Dario

    The schedule fits well with the ATLAS plans: phase 1 of the data challenge starts at the beginning of April with about two months of simulation. All of the building blocks are being assembled now, including the new production environment, which supports job submission to CERN-LSF, Grid3, NorduGrid and LCG-1. No problems are anticipated in adapting to LCG-2. ATLAS plans to avoid using non-Grid resources if possible, as they would increase the workload of the production team considerably. They would like to see all available resources connected to LCG-2 (or one of the other Grid flavours used by ATLAS: NorduGrid and Grid3) by the beginning of April.

  • ALICE – Alberto

    The ALICE data challenge starts at the beginning of January using AliEn. As capacity becomes available in LCG-2, AliEn will use it. In the discussion Ian said that there are compatibility issues in the catalogue between LCG-1 and LCG-2, so even if ALICE obtained satisfactory results in its tests with LCG-1, using LCG-1 for the DC is not advisable; it is therefore better to wait for the LCG-2 deployment. The LCG-2 schedule is not ideal, since the deployment will occur in the middle of the Data Challenge. Nevertheless, work on the AliEn-LCG-2 interface proceeds in parallel with a dedicated team, and ALICE expects that this approach will allow it to cope with the difficulties of operating the DC at the same time as the LCG-2 deployment. It is possible that, even with the best efforts of ALICE and the LCG teams, the LCG-2 deployment schedule will introduce some delay into the DC schedule.

  • LHCb – Philippe

    The data challenge starts at the beginning of April, with production through to June, and then analysis until the end of the year. If there is a delay in the schedule, LHCb will use existing conventional tools – DIRAC interfaces to LCG and/or local batch systems. The schedule looks acceptable within the substantial error bars that go with any production run.

  • Capacity and Ramp-up

    Dario questioned why the significant capacity in the UK at Manchester and Liverpool was not in the initial target list of RCs. Tony said that the capacity at Manchester is not yet installed, and Liverpool has difficulties with manpower that will not be resolved until GridPP2 starts.
    Les to talk to John Gordon about this.

    BNL, Tokyo and Lyon are other major centres that are not included. At BNL it is believed that there is a manpower shortage (Les to confirm this with Vicky White, John Huth). The approach at present is to start with sites that are motivated to start now and have good support to deal with initial difficulties. Alberto underlined that experience with the LCG-1 test has shown that (as stated in the Proposed Schedule) it is fundamental at this stage to give priority to service stability rather than to increasing the number of sites.

    It was agreed that the proposed initial capacities are reasonable, assuming that processors have a capacity rating of 0.8-1.0 SI2000.
    All experiments must ask their national contacts to request, through the local allocation processes, that this capacity be made available at the selected RCs by 26 January. They should also give forewarning that all of their allocated capacity should be provided through LCG-2 from the beginning of April.

  • Mass Storage – Regional Centres will be asked to clarify their mass storage plans. Les to bring this to their attention.

  • Next steps –
    • Actions to be followed up.
    • The schedule to be sent to the RCs concerned, for discussion and final agreement at the GDB on 13 January 2004.

 

 

Target sites with proposed capacity for end January (worker nodes assumed to be 2-cpu systems, each processor rated at 0.8-1.0 SI2000):

 

 

Site | # worker nodes | Comment / Mass Storage System
CERN | 200 | Castor
CNAF | 200 | Castor
Spain | 60 | Castor
FZK | 100 | dCache/TSM (to be confirmed)
NIKHEF | 60 | To be confirmed
FNAL | 100 | dCache/Enstore
Taipei | 60 | To be confirmed
RAL | 200 | To be confirmed
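As a quick cross-check, the aggregate capacity implied by the table can be computed directly from the node counts (a sketch only; the 0.8-1.0 SI2000 per-processor rating is quoted exactly as stated in these minutes):

```python
# Proposed end-January worker-node counts per site, from the table above.
nodes = {
    "CERN": 200, "CNAF": 200, "Spain": 60, "FZK": 100,
    "NIKHEF": 60, "FNAL": 100, "Taipei": 60, "RAL": 200,
}

total_nodes = sum(nodes.values())   # total worker nodes across all sites
total_cpus = 2 * total_nodes        # each node is assumed to be a 2-cpu system

# Aggregate rating range, using the 0.8-1.0 SI2000 per-processor
# figure quoted in the minutes.
low, high = 0.8 * total_cpus, 1.0 * total_cpus

print(total_nodes, total_cpus, low, high)  # 980 1960 1568.0 1960.0
```

This gives 980 worker nodes (1960 processors) in total across the eight target sites, of which CERN, CNAF and RAL each contribute about a fifth.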

 

 

Les Robertson

18 December 2003