DRAFT
Notes on the PEB – 16 December 2003
Present – Dario Barberis, Ian Bird, Nick Brook, Philippe Charpentier, Tony Doyle, Chris Eck, David Foster, Frédéric Hemmer, Werner Jank, Bob Jones, Matthias Kasemann, Alberto Masoni, Mirco Mazzucato, Bernd Panzer, Gilbert Poulard, Les Robertson (chair, secretary)
Actions – Actions are identified by bold blue italics.
Minutes – The minutes from the previous meeting were not available.
Information –
- The PIC in Barcelona has been designated as a Tier 1 centre by Spain.
- A second round of funding has been approved for GridPP, with £15.9M for people and equipment in 2004-07. An additional £1M is earmarked for LCG, subject to negotiation with CERN.
Resources at CERN for 2004 – Bernd Panzer – See foils
- A major increase in AFS space is requested, equivalent to 60% of the current capacity of the service. This coincides with the need to replace the backup tapes (moving from old IBM drives to STK 9940Bs) and to replace aging and unreliable servers. The total cost would double the 2003 AFS materials budget. Ways of reducing the costs are being studied.
- A high level plan for the CERN Fabric for 2004 is given on foil
4, showing the computing and physics data challenges, and other
significant events.
- The budget for new physics computing capacity is CHF 1.7M,
including non-LHC experiments. This is in addition to the “LCG prototype”
budget intended primarily for computing data challenges, grid testing and
pilot service, etc.
- The requests amount to
- CPU - 500 KSI2000 (450K for LHC, 50K for non-LHC)
- Disk – 140 TB (100 TB LHC, 40 TB non-LHC)
- Bernd proposes a solution (foil 7) that satisfies the needs of the base services and physics data challenges throughout the year, except for a peak of a few weeks in the middle of the year. This includes “borrowing” capacity from the prototype during the main data challenge period in April-June. It was noted that additional requirements would arise later in the year, when there is spare capacity available.
- The prototype situation for the computing data challenges (foil
8) would satisfy all the known requests, including the online tests of
ALICE and ATLAS, the latter achieved through borrowing from the production
system during March.
- The requests for disk capacity cannot be fully satisfied within
the budget. Bernd proposes to purchase 50 TB now to satisfy the “base”
requests. The additional capacity required for data challenges could be
partially met by using 15 TB of prototype space and assuming that 10 TB of
the current (2003) storage can be freed up. He also anticipates that it
may be possible to purchase an additional 10 TB early next year when all
the budget commitments are understood (e.g. AFS). The result is given on
foil 10.
The proposals were accepted.
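As a rough tally of the disk figures quoted above (the authoritative allocation is the one shown on foil 10; this is only a sanity check of the numbers as minuted, and the labels are descriptive, not official):

```python
# Sanity check of the 2004 disk figures quoted in the minutes (all in TB).
# The authoritative allocation is on foil 10; the dictionary keys below are
# descriptive labels only, not official budget line names.
requested = {"LHC": 100, "non-LHC": 40}

available = {
    "base purchase now": 50,
    "borrowed prototype space": 15,
    "freed 2003 storage": 10,
    "possible extra purchase": 10,  # contingent on budget commitments (e.g. AFS)
}

total_requested = sum(requested.values())
total_available = sum(available.values())
shortfall = total_requested - total_available
print(total_requested, total_available, shortfall)  # 140 85 55
```

Even counting the contingent 10 TB purchase, the quoted figures leave roughly 55 TB of the 140 TB request unmet, consistent with the statement that the requests cannot be fully satisfied within the budget.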
Scheduling
LCG-2 Roll-out and the Data Challenges – Les Robertson
- See the draft
note.
- Les said that he had talked to or received feedback from all of
the Regional Centres (RCs) in the target site list with the exception of
FNAL. All of the centres contacted expressed willingness to go along with
the draft plan, except FZK which has some reservations (email from Holger
Marten to the GDB list dated 16 December). Most centres said that they
must have formal requests for the capacity from their local experiment
contacts. RAL also stressed the importance of experiments stating special
requirements for the environment that would have to be satisfied by the
RC.
- There have been delays in the preparation of the middleware due to last-minute critical problems with VDT (surprisingly, none of the other VDT users had uncovered these). The problems have now been fixed and the software was released by the certification team today. An updated “realistic” schedule is given below.
| Revised target date | Event | Comment |
|---|---|---|
| 16dec03 | LCG-2 delivered to installation team | Completed |
| 17dec03 | LCG-2 available for testing by experiments at CERN | Local testing with the LCG-2 code, preparation of experiment software packages |
| 12jan04 | LCG-2 package available to regional centres with documentation | |
| 19jan04 | 10 worker nodes in operation at each regional centre | Experiments can start installation, set-up, testing |
| 26jan04 | Base capacity reached in each centre; staffing available for service operation, debugging, stabilisation | |
| 9feb04 | Production service in operation | Define the stability criteria for the validation test (availability, system admin. response time, job-failure rate, ..) |
| | Two-week running-in period | |
| 23feb04 | Start 30-day validation test | Unstable sites will be withdrawn prior to the test period |
| 1apr04 | Full capacity of participating regional centres online. Additional sites join the production grid. | |
- CMS – Werner
The schedule is difficult for CMS as DC04 is planned to start in March, in the middle of the validation test. CMS will therefore have to start using their current network of RCs, which would remain the backup until the LCG-2 service is proven to be superior. The current service uses SRB as the central catalogue. The longer-term target is to use RLS, but CMS would prefer RCs to support SRB as well in the LCG-2 service. Mirco said that testing all of these options is a significant cost for the support staff at RCs; CMS management should consider making a clear statement that the mainline approach is RLS on LCG-2, with the current system and SRB as the backup.
Werner noted that people are working now in several places to adapt the CMS system to LCG-1. One of the outstanding issues is the need to resolve an incompatibility between the file naming conventions used by POOL and the replica manager.
- Mirco called for establishing a deployment team or task force
for each of the experiments, with named members including a liaison with
Flavia’s group. Dario said that this had already been established for
ATLAS under Oxana. Other
experiments to identify people responsible for their task force.
- ATLAS – Dario
The schedule fits well with the ATLAS plans: phase 1 of the data challenge
starts at the beginning of April with about two months simulation. All of
the building blocks are being assembled now, including the new production
environment which supports job submission to CERN-LSF, Grid 3, NorduGrid
and LCG-1. No problems are anticipated in adapting to LCG-2. ATLAS plans to avoid using non-Grid resources if possible, as these would considerably increase the workload of the production team. They would
like to see all available resources connected to LCG-2 (or one of the
other Grid flavours in ATLAS: NorduGrid and Grid-3) by the beginning of
April.
- ALICE – Alberto
The ALICE
data challenge starts at the beginning of January using AliEn. As capacity
becomes available in LCG-2 AliEn will use it. In the discussion Ian said
that there are compatibility issues in the catalogue between LCG-1 and
LCG-2, and so even if ALICE
obtained satisfactory results in its test with LCG-1 the use of LCG-1 for
the DC is not advisable. It is therefore better to wait for the LCG-2
deployment. The schedule is not ideal, since the LCG-2 deployment will occur in the middle of the Data Challenge. Nevertheless, work on the AliEn-LCG-2 interface proceeds in parallel, with a dedicated team, and ALICE expects that this approach will allow it to cope with the difficulties of operating the DC at the same time as the LCG-2 deployment. It is possible that, even with the best efforts of ALICE and the LCG teams, the LCG-2 deployment schedule will introduce some delays in the DC schedule.
- LHCb – Philippe
The data challenge starts at the beginning of April, with production
through to June, and then analysis until the end of the year. If there is
a delay in the schedule LHCb will use existing conventional tools – DIRAC
interfaces to LCG and/or local batch systems. The schedule looks acceptable within the substantial error bars that go with any production run.
- Capacity and Ramp-up
Dario questioned why the significant capacity in the UK at Manchester and
Liverpool was not in the initial target list of RCs. Tony said that the
capacity at Manchester is not yet installed, and Liverpool has
difficulties with manpower that will not be resolved until GridPP2 starts.
Les to talk to John Gordon
about this.
BNL, Tokyo and Lyon
are other major centres that are not included. At BNL it is believed that
there is a manpower shortage (Les
to confirm this with Vicky White, John Huth). The approach at present is to start up with sites that are
motivated to start now and have good support to deal with initial
difficulties. Alberto underlined that experience with the LCG-1 test has shown that (as stated in the Proposed Schedule) it is fundamental at this stage to give priority to service stability rather than to increasing the number of sites.
It was agreed that the initial capacities proposed are reasonable, assuming that processors have a capacity rating of 0.8-1.0 SI2000. All experiments must ask their national contacts to request, through the local allocation processes, that this capacity be made available at the selected RCs by 26 January. They should also give forewarning that all of their allocated capacity should be provided through LCG-2 from the beginning of April.
- Mass Storage – Regional Centres will be asked to clarify their
mass storage plans. Les
to bring this to their attention.
- Next steps –
- Actions to be followed up.
- The schedule to be sent to the RCs
concerned, for discussion and final agreement at the GDB on 13 January
2004.
Target sites with
proposed capacity for end January (worker node count assumed to be 2-cpu
systems, each processor rated at 0.8-1.0 SI2000):
| Site | # worker nodes | Comment / Mass Storage System |
|---|---|---|
| CERN | 200 | Castor |
| CNAF | 200 | Castor |
| Spain | 60 | Castor |
| FZK | 100 | dCache/TSM (to be confirmed) |
| NIKHEF | 60 | To be confirmed |
| FNAL | 100 | dCache/ENSTOR |
| Taipei | 60 | To be confirmed |
| RAL | 200 | To be confirmed |
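The proposed end-January capacities above can be cross-checked with a short calculation, assuming (as stated) 2-cpu worker nodes with each processor rated at 0.8-1.0 SI2000; the totals below are illustrative only, not figures from the meeting:

```python
# Cross-check of the proposed end-January worker-node capacities.
# Assumes 2-cpu nodes, each processor rated 0.8-1.0 SI2000 (units as quoted
# in the minutes); the totals are illustrative, not minuted figures.
nodes = {"CERN": 200, "CNAF": 200, "Spain": 60, "FZK": 100,
         "NIKHEF": 60, "FNAL": 100, "Taipei": 60, "RAL": 200}

total_nodes = sum(nodes.values())   # 980 worker nodes across the target sites
processors = total_nodes * 2        # 1960 processors at 2 cpus per node
low, high = processors * 0.8, processors * 1.0
print(total_nodes, processors, low, high)  # 980 1960 1568.0 1960.0
```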