Deployment Team Meeting 2005-11-11
VRVS Car Room
http://agenda.cern.ch/fullAgenda.php?ida=a056650
Present: Stephen, Olivier, Pete, Graeme, Steve T, Fraser, Alessandra
Apologies: Jens
Scribe: Jeremy

==================================================
Actions
==================================================
- Updates to the actions were recorded in the Wiki area. In relation to the testbed machines the following information was given:
  * SouthGrid - machines ordered
  * Glasgow - arrived (some in use for FTS - PPS and LFC tests)
  * NorthGrid - machines ordered
  * London - machines ordered
- Sites have been reminded about the need to enter data for the network questions in the Wiki area.
- 051111-1 - All: Review training courses provided by NESC (see links: http://www.nesc.ac.uk/training/events/index.html and http://www.egee.nesc.ac.uk/schedreg/index.html). Report at next meeting on courses of interest (to yourself and the Tier-2 in general) and others that should be organised.

==================================================
Discussion on review document
==================================================
- Some repetitiveness was removed.
- There were few responses from TB-SUPPORT.
- In relation to the Google Map, Ireland claim they are working but are shown to be failing. Many false positives? Mainly Job List Match failures.
  * ST: Possibly the firewall settings are too complicated.
- On "General status", SB and others commented that communication around releases has improved but is still not enough.
- The history on the SFT pages is not long enough for reports.
  * ST: The shorter version is to increase lookup speed. The data is still archived.
- JC: Does the security area need to be mentioned more?
- AF: Is the vulnerability group acting as a bottleneck? It was thought the problem is with the developers picking up on problems, not with the middleware users reporting them. AF added that this is partly the result of not going public.
- GS: Should this issue be in place of box tax?
- Does a serious site need hundreds of machines? Could there be more work to consolidate services on machines? For the LFC the load is not certain, but with fewer than 100 WNs, sharing with the MON box is probably not much of a problem. It was agreed that the box tax issue should remain.
- 051111-2 JC: Circulate link to talk to PMB.

==================================================
Working towards SC4
==================================================
- JC: Planning for SC4 needs to be made firmer. We need to cover milestones and tasks. Today we need to review the main components and our approach.
- 1) SRM. All sites must have a fully working SRM implementation. v2.1 will be needed for SC4, but for now we install what we have.
  * Current status - RHUL in coming weeks. UCL unsure. LeSC - now debugging SGE.
  * Sheffield - working on it (0.5 FTE?)
  * Liverpool - dCache on a small number of nodes is being implemented now.
  * Durham - by the end of November is still expected.
- 2) Set up FTS channels (T1-T2) [useful to learn about the network].
  * This is being done for all Tier-2 sites publishing an SRM.
  * Channels can be created on demand. Monitoring is already available. Link to RAL FTS pages: http://wiki.gridpp.ac.uk/wiki/FTS
- 3) Target transfer rates - suggested 1TB of data to each Tier-2 (store or delete files depending on the hardware available).
  * JC: A question is whether sites need to procure more hardware for this exercise while also supporting current production.
  * The general feeling was that a significant storage allocation was not required, but this should be reviewed quickly to allow purchases to be made.
- 051111-3 JC: Reconfirm expectations for Tier-2 hardware.
- 4) Every site requires an LFC for each VO supported (ATLAS/ALICE).
  * The CMS position is unclear - they want the POOL file catalogue but LCG supplies the LFC. Their requirement is for database transactions to remain open for up to 2 hrs, which is not supported by MySQL. Use as generic middleware is being investigated.
  * ST: Adding extra VOs to the LFC is very easy.
The T1 has a PhenoGrid setup.
- Timeline: Graeme wants to have the document ready by the end of November. One site per Tier-2 should then follow the LFC setup procedure during December. All sites should follow in January and February.
  * One LFC per site is required because when jobs finish they need to conduct a local lookup.
- 5) FTS client (requires configuring with an endpoint) at each site (UI and batch workers - it is part of the meta package).
- 6) VO Boxes.
  * SB: Do we need a milestone for when we expect a decision on deployment of VO Boxes?
  * JC: No date is known. The GDB agreed that all T1s should deploy and that a working group be set up to look at integrating common services. It is quite possible that the only short-term solution is to implement VO Boxes at most Tier-2s.
  * This will impact the VOs supported.
- 7) Milestones associated with other experiment tests as part of SC4.
- Summary of high-level milestones:
  * SRM
    ** 80% of sites to have a working SRM (file transfers with 2 other sites successful) by end of December
    ** 100% of sites to have a working SRM by end of January
    ** 40% of sites upgraded (& tested) to SRM v2.1 by end of February
    ** 100% of sites upgraded (& tested) to SRM v2.1 by end of March
  * FTS
    ** FTS channel to be created for all T1-T2 connections by end of January
    ** FTS client configured at all sites by end of January
  * Transfers (rates TBC - aim 300-500Mb/s)
    ** Sustained transfers to 20% of sites by end of December
    ** Sustained transfers to 50% of sites by end of January
    ** Sustained individual transfers (>1TB continuous) to all sites completed by end of February
  * LFC
    ** LFC document available by end of November
    ** LFC installed at 1 site in each Tier-2 by end of December
    ** LFC installed at 50% of sites by end of January
    ** LFC installed at all sites by end of February
  * VO Boxes (depending on experiment responses and the general position)
    ** VOBs available at 1 site in each Tier-2 by mid-January
    ** VOBs available at 50% of sites by mid-February
    ** VOBs available at all (participating) sites by end of March
  * Experiment-specific tests TBC.

==================================================
Pakiti
==================================================
- Allows publishing of patch status (can be configured for detailed or summary figures).
- A collector will be run at RAL.
- FS: What is gained by doing it centrally?
  * ST/PG: Advice can be given and prompts made. It highlights problems when auto-updates fail on WNs.
- There is little difference from Yumit - mainly in the web protocol area for the central service.
- Auto-update is used at Glasgow.
- 051111-4 Audit sites to find out what sysadmins currently do for monitoring/updates. Revisit at a future meeting.
- Sites can use Yumit (monitoring tool) before a global Yum update -> Yumit highlights nodes that did not update.
- If auto-update is being used, is there a need to monitor? Yes - sometimes nodes have problems.
- The response procedure was discussed at the last meeting - at least the document update procedure was! There was little feedback even though Alessandra sent out links to JANET (best practice) information/examples.
- 051111-5 All: Starting with Alessandra, review the document in the Wiki, edit areas of concern (highlight issues with an asterisk), then pass the "editing token" to the next person in the team to ensure everyone contributes. The order is (like the minute-taker list) as follows: Alessandra > Fraser > Graeme > Jeremy > Olivier > Pete > Stephen > Steve
- 051111-6 AF: Follow up on suggested links (JANET documents) by linking them to the GridPP (General Unix Security) area.
- It was highlighted that the response procedure is just a set of recommendations - we only know if it works when we have an incident. The focus should be more on prevention than cleanup.
- 051111-6 JC: Ensure security incident prevention is a topic of a future meeting. This links in with Linda Cornwall's work.
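As an aside on the Yumit/Pakiti discussion above: both tools essentially compare the packages installed on each node against the versions currently in the repository and flag nodes that did not update. A minimal sketch of that check (illustrative only - the package names and the naive string comparison are assumptions; the real tools gather versions via yum/rpm on each node, handle RPM version ordering properly, and publish the results to a central collector):

```python
# Sketch of a Yumit/Pakiti-style staleness check: compare the packages
# installed on a node against the repository's current versions and
# report anything out of date. String inequality stands in for proper
# RPM version comparison, which the real tools use.

def outdated(installed: dict, available: dict) -> dict:
    """Return {package: (installed_version, available_version)} for
    every package whose installed version differs from the repo's."""
    return {
        pkg: (ver, available[pkg])
        for pkg, ver in installed.items()
        if pkg in available and available[pkg] != ver
    }

# Hypothetical node and repo snapshots for illustration:
node = {"openssl": "0.9.7a-43.1", "kernel": "2.4.21-32.EL"}
repo = {"openssl": "0.9.7a-43.2", "kernel": "2.4.21-32.EL"}
print(outdated(node, repo))  # flags only the stale openssl package
```

A central collector would run this per node and highlight the nodes whose auto-update silently failed, which is the main point made in favour of central monitoring above.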
==================================================
Use of Wiki area
==================================================
- Done items should be reviewed at the next dteam meeting and then put in the "done" area, not before.
- What do target dates mean and how are they to be agreed? Suggestion - JC as chair of the meeting assigns targets; owners respond if a date is unrealistic.
- The original idea of "by next time" is good and will apply to many actions.
- Could actions be divided by person? Perhaps in the future, but for now it may over-complicate things. JC pastes the actions into Excel for a simple sort.
- Separate tables for local and general issues may be appropriate. Tier-2s may wish to capture their own issues in the Tier-2 area.

==================================================
AOB
==================================================
- JC: There was a request at the GDB for the Geant4 VO to be supported.
- Each VO should provide a report (some summary information is in the CIC portal pages).
- Ganglia - Andrew McNab is waiting for Alessandra to buy new servers. The Birmingham request is on hold until then.
- 051111-7 AF: Follow up on the new server purchase and ensure Birmingham is added to the federated Ganglia area asap.
- AF: Problem with ATLAS. When changing the storage SE from a Classic SE to dCache, the site took action to warn the experiments. Some users responded; one user forgot. The experiment needs a procedure so that sites are not blamed, especially as experiments are not responsive!
- 051111-8 JC: The operations report next week should contain a request for a procedure to be followed.
- OvDA: New Opterons will be 64-bit. Is there a release available for this platform?
- JC: We need to look at renaming the IC site as a test case for renaming all sites to the format UKI-Tier2-Site-Cluster.

New Actions
***********
051111-1 All: Review training courses provided by NESC (see links: http://www.nesc.ac.uk/training/events/index.html and http://www.egee.nesc.ac.uk/schedreg/index.html).
Report at next meeting on courses of interest (to yourself and the Tier-2 in general) and others that should be organised.
051111-2 JC: Circulate link to talk to PMB.
051111-3 JC: Reconfirm expectations for Tier-2 hardware.
051111-4 Audit sites to find out what sysadmins currently do for monitoring/updates. Revisit at a future meeting.
051111-5 All: Starting with Alessandra, review the document in the Wiki, edit areas of concern (highlight issues with an asterisk), then pass the "editing token" to the next person in the team to ensure everyone contributes.
051111-6 JC: Ensure security incident prevention is a topic of a future meeting. This links in with Linda Cornwall's work.
051111-7 AF: Follow up on the new server purchase and ensure Birmingham is added to the federated Ganglia area asap.
051111-8 JC: The operations report next week should contain a request for a procedure to be followed.