emails on procurement planning and resource usage scheduling

 

 

-----Original Message-----

From: Alberto Aimar

Sent: 28 November 2005 18:38

To: worldwide-lcg-management-board (LCG Management Board)

Subject: Procurement Plans Issue

 

Dear MB members,

 

For some sites, unless I have misunderstood their milestone plans,

there is a considerable mismatch between:

- what is planned in terms of hw procurement in the Tier1 site, and

- what is being signed in the WLCG MoU Annexe 6.3 (Tier1 Computing Capabilities):

 

   https://uimon.cern.ch/twiki/pub/LCG/SitesPlans/PagesfromMoU_Annexe_6_3.pdf

 

In order to simplify the analysis I have summarized the procurement milestones

of all Tier1 sites in one single (short) Excel table:

 

https://uimon.cern.ch/twiki/pub/LCG/SitesPlans/ProcurementPlans-200511.xls

(values planned in white, values in MoU in pink)

 

I have added this issue as the last item in tomorrow's MB agenda because

it would be useful to clarify it.

 

Regards.

Alberto.

 

 

 

-----Original Message-----

From: Kors Bos [mailto:bosk@nikhef.nl]

Sent: 28 November 2005 22:00

To: Alberto Aimar

Cc: worldwide-lcg-management-board (LCG Management Board); Jeff Templon

Subject: Re: Procurement Plans Issue

 

Alberto, All,

 

Of course there is a mismatch between what we plan to install in 2006

and what we said we would in the Phase-II-Planning tables. For the P2P

estimates we took what the experiment TDRs said was needed. We have

always said that we would be as conservative as possible with installing

this equipment as our money would be worth more if we could wait longer

and this would be very much also in the interest of the experiments.

 

In this service phase of Service Challenge 3 we have seen such low

demand from the experiments that we are very reluctant to invest any

more money in additional hardware at this point in time as the current

infrastructure is already over-dimensioned by a large factor: over the

past 3 months there have been a few TBytes written and nothing read back

and the farms are running idle as I speak (not speaking about CMS as

we don't have that VO in Holland).

 

I think it is unrealistic to maintain the overall planning for 2006 as

it is written in the MoU tables as we simply will not do it. We are not

going to spend a lot of money because it is written in some tables and

let it sit unused. If this means not signing the MoU, let that be it

(although I am not the person to decide that).

 

I am very worried about this development as I have said many times. It

means that the experiments want us to scale up orders of magnitude by

the time the beam comes on without ever having had the chance to test

this. We have now written a few TBytes over three months, something we

should be able to do in a few hours in 2007 and we still have never read

anything back, something we should be doing constantly. We still don't

know by far what we can expect under real data taking circumstances and

there are only 487 days left.

 

Kors

 

 

 

 

From: Les Robertson
Sent: 01 December 2005 09:06
To: worldwide-lcg-management-board (LCG Management Board); Jeff Templon; Chris Eck
Subject: RE: Procurement Plans Issue

Dear MB Members

Kors raises important issues here that I think need to be discussed at the next MB.

The Phase 2 Planning Group intended to collect the plans of the regional centres. These of course are based on the requirements of the experiments but they also had to take account of the funding possibilities and, very importantly, the need for each centre to ramp up its services in terms of reliability, availability, functionality, performance, capacity .... Kors has pointed out that the experiments are not at present filling the available capacity, and Ricardo and others have noted that in part this is because the services are not yet working smoothly. We also know that we failed to achieve the (as we had thought rather modest) data transfer goals of SC3 before the summer, and it looks as if it will be April next year before we are able to complete this. This means that a lot of the equipment needed to get us up to the target performance (tape drives, disk servers, network switches, WAN links, ..) will have lain idle for long periods - but that is unavoidable. Recent experience proves that we will definitely not achieve the performance that we need when the accelerator starts running if we do not go through a lengthy learning process.

In terms of Service Challenges for the Service Providers we have got stuck for the moment on data distribution, but there will undoubtedly be a substantial set of problems uncovered as we expand to the fuller set of services that the Tier-0 and Tier-1s must provide. Adding the less predictable workload of calibration, re-processing, ESD/AOD skimming, etc. places significant demands on the disk and LAN infrastructure - and it is important to show that our design estimates stand up in practice. The experiments have a need to test their full software chain, which of course adds complications to the testing cycles. It may be that we should plan some more artificial tests that concentrate on the infrastructure - as we did this year for data transfer. This should be discussed at the Mumbai workshop.

I also wonder if we should not have an MB working group to assemble the overall requirements for resources in LCG over the next twelve months and come up with an overall schedule - one person from each experiment and 3-4 people from sites. I am worried about each site drawing its own independent conclusions about the needs for next year - when we have to demonstrate the full data rates (resource intensive) as well as run decent services for the experiments. While the CPU capacity may be over-sized for 2006, I doubt that the disk systems will be. For these it is not capacity that matters but accessibility.

A final, but very important point - I really do think that it is important that sites put their real and realistic plans into the MoU tables. We are asking the funding agencies to make real commitments and stick to them. The MoU process was specifically constructed so that firm commitments are only made for the coming year. Please look at your plans and the MoU tables for 2006 - and if there are discrepancies please change them now so that the MoU can be signed with the real numbers. We must take this seriously if we expect the agencies to do so.

Regards, Les   

 

 

 

 

From: Gordon, JC (John) [mailto:J.C.Gordon@rl.ac.uk]
Sent: 01 December 2005 10:26
To: Les Robertson; worldwide-lcg-management-board (LCG Management Board); Jeff Templon; Chris Eck
Subject: RE: Procurement Plans Issue

Les, I (and my colleagues) recognise the requirement to test the LHC computing systems in advance at their proposed level of complexity and we have plans to install all we can afford and PPARC will commit to this. However, PPARC as a funding body are not at all convinced of the point of installing computing resources that only get used for a small fraction of the time during these tests. They submit us to external review and the reviewers are appalled at our levels of utilisation. Each year we show these unused resources it gets harder to convince them to commit. There is a real risk that one year they will refuse completely and not commit to the MoU until they see that there is pressure on the existing resources. They did that this year and we had to fight hard to reverse the decision; that is why our installation plan is later than we forecast in earlier P2P meetings. Next year it will be harder still. They have approved the funding but it is hard to make the case for its release when we will get closer to the experiments' 2008 requirements the longer we wait. I have raised this issue before at GDB and my prediction came true this year.

 

John

 

 

 

 

-----Original Message-----

From: Mirco Mazzucato [mailto:mirco.mazzucato@pd.infn.it]

Sent: 01 December 2005 15:34

To: John Gordon

Cc: Les Robertson; worldwide-lcg-management-board (LCG Management Board); Jeff Templon; Chris Eck

Subject: Re: Procurement Plans Issue

 

Dear Les

 

just to confirm that I have similar difficulties in INFN even if I try,

with more or less success, to overcome them by opening the Center more to

currently running experiments.

Cheers

                                Mirco

 

 

 

From: Les Robertson
Sent: 01 December 2005 17:56
To: John Gordon; worldwide-lcg-management-board (LCG Management Board); Jeff Templon
Subject: RE: Procurement Plans Issue

Dear John

 

I agree that we need to improve the understanding of exactly which resources the experiments need for their work, when and at which sites. But we have to be clear to ourselves and to the funding agencies that we are not at present in a mode where systems are expected to be kept filled throughout the year. At present in many cases the oversight/funding agencies should be concerned with building up performance and demonstrating readiness more than with sustained throughput.

 

Regards, Les

 

 

 

 

 

From: Holger Marten
Sent: 01 December 2005 14:03
To: worldwide-lcg-management-board (LCG Management Board)
Subject: AW: Procurement Plans Issue

Hi Alberto,

Concerning GridKa:

There is an apparent mismatch between the disk milestones and what has been written in the MoU ("missing" 35 TB). The reason is that we submitted our planning to extend the **dCache disk space**. There are more than 40 additional TB of disk space already in use by the LHC experiments which are not dCache and thus have to be added to the total sum!

Comparing the tape planning with the MoU numbers, there are 20 TB missing. This is correct, as we are still discussing whether to extend our existing robots with slots or go for another new robot system at the beginning of 2006. It is likely that these discussions will proceed in January, and I didn't want to promise any time scales in the current version of the milestones document.

Regards,
   Holger

 

 

 

 

-----Original Message-----

From: Jeff Templon [mailto:templon@nikhef.nl]

Sent: 01 December 2005 17:04

To: Les Robertson

Cc: Gordon, JC (John); worldwide-lcg-management-board (LCG Management Board)

Subject: Re: Procurement Plans Issue

 

 

We are developing a min / max system here.

We will give Alberto Aimar the "min" stuff ... this is what

we will deploy regardless of how it gets used, because we

want to start building up certain parts of the infrastructure

that are more or less capacity independent.  Then there is a

'max' which is what is in the MoU.  We try to make sure we

can deploy this IF IT IS NEEDED in an expeditious manner.

  But if we are clearly overdimensioned we are not going to

blindly follow the MoU capacity tables.  Every day we wait,

things get cheaper and better.

 

This of course is from the Tier-1 point of view.  From the

experiments' point of view, every day we wait is another day

you don't have to figure out what bugs are left.  Bring it on.

 

                        JT

 

 

 

 

 

-----Original Message-----

From: Dominique Boutigny [mailto:boutigny@in2p3.fr]

Sent: 02 December 2005 09:47

To: Les Robertson

Cc: worldwide-lcg-management-board (LCG Management Board)

Subject: Re: Procurement Plans Issue

 

Hi Les,

 

Concerning IN2P3, we have a rather strict budget profile between now and

2010 and we should stick to it. Even if it is financially more

interesting to buy the hardware later, it is not possible to change the

profile.

 

We will buy the hardware according to the MOU as long as we get the budget.

 

If the experiments are not able to use all the resources immediately

after they are made available at CC-IN2P3, then, it is up to the French

LCG collaboration and more generally to WLCG to help them to overcome

the difficulties.

In any case, the resources declared in the MOU for 2006 are pledged, so

they have to be realistic numbers.

 

Best regards,

 

     Dominique

 

 

 

 

 

-----Original Message-----

From: Federico Carminati

Sent: 05 December 2005 11:05

To: boutigny@in2p3.fr

Cc: Les Robertson; worldwide-lcg-management-board (LCG Management Board)

Subject: Re: Procurement Plans Issue

 

Dear All,

    in relation with this thread, it is important to note that the 

under-utilisation of the resources comes from two facts:

 

- Lack of proper planning by the experiments

 

- Problems in the availability / usability of the resources

 

   If the LHC software were working perfectly, and all the sites were 

perfectly configured and available 100% of the time, much more of the 

resources would have been used. Of course not 100% of them, because 

experiments (or at least ALICE, to speak of something I know) did not 

plan perfectly. In any case I would like to see the discussion 

focused more on "how to make resources 100% (well as close as 

possible) available", rather than "experiments used less than 

promised, so let's buy less resources". I think the discussion as it 

is now tends to focus more on the lack of planning (which indeed is 

real to some extent) than on the effective usability of the 

resources deployed (which is FAR from 100%). The lack of an agreed 

metric does not help this debate to become factual. Best,

 

Federico Carminati

CERN-PH

1211 Geneva 23

Switzerland

Tel: +41 22 76 74959

Fax: +41 22 76 79480

Mobile: +41 76 487 4843