Minutes for storage phone conference 19 July 2006
=================================================

Present:
Edinburgh - Greig (chair and minutes)
Durham - Mark
Glasgow - Graeme
Lancaster - Matt, Brian
RAL - Owen, Jiri

Apologies: Jens

0. Quick review of actions

See below.

1. SRM Nagios plugin news.

Discussed during the action review.

The dCache Nagios plugin from DESY has not been tested yet. Greig will
do this. It currently just monitors the status of the dCache
domains/cells shown on the http://dcache-host:2288 webpage.

DPM monitoring is being developed at Glasgow by Graeme, Paul Millar and
a (currently unwell!) summer student. This will use Paul's MonAMI
system (see presentation at GridPP16), which provides a framework: you
supply a plugin for the service that you want to monitor, and MonAMI
can interface the plugin with Nagios (giving you alerts when services
change state), Ganglia (giving historical plots of resources etc) and
other systems.

A list of targets that should be monitored for DPM has been put in the
metadata wiki. Graeme will circulate it to the list for further
comment. In particular, things that should be monitored are:

* daemons are running and listening on correct ports
* service pings (e.g. rfdir, srmPut...)
* free disk space (using rfio statfs())
* database load
* number of srm requests

Greig thinks that it would be a Good Idea (TM) if something similar was
done for dCache. This could incorporate the already existing Nagios
plugin above. It should be noted though that FNAL have already
developed a substantial set of dCache monitoring tools:

http://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ind0607&L=gridpp-storage&T=0&P=9051

2. Storage accounting. How do we know if an SE is lying to us? What are
   the common misconfigurations?

Again, this is possibly confused with the ideas of VOs sharing pools
and SEs supporting quotas. The dCache and DPM GIP plugins themselves do
not fudge the storage figures.
However, what can be unclear is how much storage is actually available
to each VO, because VOs may share pools/pool groups. One suggestion was
that if pools are shared then a "shared" or "contention" flag would be
raised to alert VOs that not all of the storage is necessarily
available for their use. Such a flag does not currently exist in v1.2
of the GLUE schema. V2 of the schema is currently under revision, but
there has been little recent discussion of it. Also, see 3.2 below.

3. Storage issues

http://www.gridpp.ac.uk/wiki/Storage_Issues

3.1 How can SRMs be made more resilient?

There is a problem in that the information held in the file catalogs
about file locations will become out of sync with the site SRMs as
sites lose data due to hardware failure. In the current computing model
sites need to mitigate this effect by using technologies such as RAID
and resilient dCache (if using WN disk space).

Make sure filesystems are balanced for good data transfer rates, i.e.
for each pool/pool group, spread the filesystems across multiple
servers to spread the load.

The comment was made that storage resiliency is something that the
experiments should have built into their computing models. For example,
Brian stated that ATLAS will copy Tier-2 generated data to their
associated Tier-1 and another Tier-1, so a copy will be available if
the Tier-2 loses its replica.

3.2 Implementing "quotas" in dCache.

For now (and where possible) VO quotas should be implemented in dCache
by creating small pools (which can be less than the size of the
partition they reside on) and allocating these to VO-specific pool
groups. This is OK for a site that currently has no storage used on its
SE, allowing for pool reconfiguration. The situation is more
complicated at sites running a dCache which has no free space left to
allow for pool draining and reconfiguration.

4. AOB

Owen reported that Dcap and GSI-Dcap will be moving from 'active' to
'passive' mode (in the gridftp context) in the next release of dCache.
An alpha release of this (1.7.0) will be made on Friday.

Owen also reported that there have been some discussions with the YAIM
maintainers about moving dCache out of the standard YAIM. This would
leave only DPM in the standard gLite release and as such would not give
Tier-2s the choice of using dCache.

------------------------------------------------------------------------
ACTIONS:

41   10/08/2005  Agree licence with DESY  Jens  Open
     No news.

53   12/10/2005  Find reasonable % for SE uptime for SC4  Jeremy  Open
     No news, but ongoing discussion about definition of an 'available'
     site.

54   02/11/2005  Report on performance/scalability with pools on WNs
     Greig  Open
     Information now in wiki about resilient dCache. Need access to a
     (small) cluster of machines that could be used to run a larger
     scale resilient dCache.

86   08/02/2006  Extend monitoring to do sites per VO and VOs per site
     Greig  Open
     Progress. Greig has written a script that gets the storage
     information from the site BDIIs and publishes it via R-GMA into
     the GOCDB. This virtual database can be queried using R-GMA client
     tools. The script is not yet running as a cron job; it needs to be
     made more robust first. Discussions ongoing about how best to
     visualise the data.

112  23/05/2006  Document Xen stuff in Wiki  Jiri  Open
     Done. http://www.gridpp.ac.uk/wiki/Xen

116  31/05/2006  Progress of Durham-MAN networking discussions.  Mark
     Open
     No change from last week.

119  07/06/2006  Circulate next version of VO storage to list  Jens
     Open
     No news.

121  05/07/2006  Get report from NGS on GPFS  Owen  Open
     Owen has spoken to a member of the Diamond collaboration, which
     will use either GPFS or GFS for their storage solution. So far it
     has been tested with up to 8 nodes, but the recommendation is to
     run with an odd number of nodes. Not clear why an odd number is
     better.

123  12/07/2006  Ask DESY about Nagios monitoring dCache  Owen  Open
     Will close due to response from Action 124 below.

124  12/07/2006  Ask dCache community about Nagios monitoring  Greig
     Open
     Closed. Greig asked the dCache user-forum for information about
     Nagios monitoring. The response was a plugin that is used at DESY
     to monitor their dCache via the information that it publishes on
     its web page monitor. Find more information here:
     http://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ind0607&L=gridpp-storage&T=0&P=9357

125  12/07/2006  Add SURL publishing recipe for DPM to Wiki  Graeme
     Open
     Progress. Graeme has created a set of dCache utilities that people
     can use to find out the SURLs of files that exist on each
     filesystem, in addition to other useful tools. He intends to write
     a generic operational procedure for dealing with the situation
     where a site SRM loses data due to a hardware failure. He will
     then create a section about DPM specific issues that sites should
     deal with. The same should be done for dCache.

--------------------------------------------------------------------------
NEW ACTIONS

126  19/07/2006  Wiki page describing dCache specific steps when
     storage lost.  Greig
127  19/07/2006  Test out dCache Nagios plugin.  Greig
128  19/07/2006  Circulate DPM monitoring wiki page to the list  Graeme