Minutes of storage phone conference 05 July 2006

Present:
  Durham:           Mark
  Lancaster:        Matt
  Edinburgh:        Greig
  London/Imperial:  Olivier
  RAL Tier 1:       Steve, Derek
  RAL Storage:      Owen, Jiri, Jens (chair+mins)

Apologies:
  Glasgow:          Graeme

0. Review of actions (see below)

1. GridPP follow up - these are the issues Owen and I picked up last
week. Please contribute other issues so we can track them. Some of the
issues are ones we know about already.

1.0. The meta-question of which priorities to assign to the issues
below. TODO!

The meta-question of how and where to track them? Actions, wiki,
Savannah, ... We agreed that Owen's point that the Wiki was best, was
best.

1.1. The definitions of "uptime" and "available". What does it mean
that the SE is available? How does the SE's availability affect the
site's availability?

This is just to summarise: "available" (also in terms of the MoU
requirements) means available via an SRM (current version) interface.
If it is not reachable via an SRM, it is not available.

We discussed uptime. What came out of GridPP last week was that an SE
can be fractionally up. If a pool holds 10% of your files, and it goes
down, your SE is then 90% up. Thus, you can improve uptime/resiliency
by keeping replicas (a la resilient dCache), but at the extra storage
cost.

The questions then are (a) do we need to measure this, and (b) if so,
how. Jens suggested random samples, picked from the file catalogues,
but Owen points out that permissions could be a problem. Olivier
suggested instrumenting jobs - each file access a job attempts can be
logged as a success or a failure, thus leading to an aggregated
measure of SE uptime. RB monitoring can log return codes, but
unfortunately there is no standard at the moment for the return value.

1.2. Enabling "dark storage" (sort of analogous to dark matter; it's
there but you can't see it).

We discussed solutions. dCache is the obvious solution; in this case it
can be done in resilient mode or not.
It is recommended not to run dCache in resilient mode in your
production system! Greig has a resilient dCache on a single machine.
Owen thinks this is best done by pretending to have an MSS backend, so
as to manage write pools and read pools separately, a setup which is
slightly broken in disk-only dCache. Greig, however, is doing that in
Edinburgh (i.e. disk only), but the problem is flushing from write
pools to read pools: the flushing is done when a request comes in for a
file which exists in the write pool, and it really should be a
background task. May be fixed in a future version of dCache. Greig has
a paper on resilient dCache - will circulate (ACTION). Non-resilient
mode is a risk because pools may go down, but resilient mode is also a
risk.

Mark suggested looking at lower level services, such as iSCSI, to
manage the distributed storage - are there other ways to solve the
problem? Owen suggested talking to the new NGS chap about GPFS
(ACTION). People have also tried distributed filesystems, with mixed
experiences - in the UK we have most experience with NFS for this
purpose, and the conclusion was that it couldn't be recommended.

There is also Andrew McNab's HTTP version of xrootd which he aims to
"SRM enable" - however, this comes at a time when the PMB is
questioning whether we can support even three implementations in the
long term.

No new news from QMUL yet, but Alex has made it clear he will not run
non-Open Source. dCache will have an old Qt style licence - the source
is available but the licence will not allow modification or
redistribution (neither does TeX BTW, but you can modify if you rename
it).

Also, WNs will need incoming connectivity regardless of the solution...

1.4. [sic] The case of the cross-site SE (don't do this on your
production system!)

No news from Lancaster, but they are doing it with dCache and not on
their production system! Usefulness will depend on a replication
facility to replicate between the sites.

4. [sic] AOB

Mark mentioned that people may be interested in the Google disk paper
and will circulate [ACTION].

------------------------------------------------------------------------
ACTIONS

41   10/08/2005  Agree licence with DESY
     Owner: Jens    Status: Open
     Being resolved with dCache going OS.

53   12/10/2005  Find reasonable % for SE uptime for SC4
     Owner: Jeremy    Status: Open
     Progress - we now have a definition of uptime, following the
     discussion last week, but currently cannot measure it.

54   02/11/2005  Report on performance/scalability with pools on WNs
     Owner: Paul    Status: Open
     Still important, so reassigned to Greig.

86   08/02/2006  Extend monitoring to do sites per VO and VOs per site
     Owner: Greig    Status: Open
     Work is ongoing with Dave Kant; there is a plan to publish via
     R-GMA (i.e. an R-GMA producer queries the relevant GRIS and
     publishes the information into an archiver).

105  03/05/2006  Re-poke DESY or FNAL about SRM 2.1 for dCache
     Owner: Owen    Status: Open
     2.1 has been tested but we still don't have one deployed in
     GridPP. Still open.

112  23/05/2006  Document Xen stuff in Wiki
     Owner: Jiri    Status: Open
     No news. Greig suggests this is low priority.

115  31/05/2006  Speak to dCache team about Tier-2 dCache configuration
     Owner: Owen    Status: Open
     Done.

116  31/05/2006  Progress of Durham-MAN networking discussions
     Owner: Mark    Status: Open
     Ongoing. An external review is expected to report back within the
     next couple of weeks. Pete Clarke is kept au courant.

117  07/06/2006  Document dCache release procedure
     Owner: Owen/Greig    Status: Open
     Owen has sent the document to Greig but hasn't had feedback yet
     on this version.

118  07/06/2006  Document pool node dependency problems in Savannah
     Owner: Owen    Status: Open
     Documented, but in which project? Issues in Savannah are tracked
     in many different projects; Owen should circulate a list of them
     for our general edification (ACTION).

119  07/06/2006  Circulate next version of VO storage to list
     Owner: Jens    Status: Open
     Still needed! Progress was made last week at GridPP, but the
     information still needs to be summarised and circulated to the
     UB.
------------------------------------------------------------------------
NEW ACTIONS

120  05/07/2006  Circulate paper on resilient dCache
     Owner: Greig    Status: Open

121  05/07/2006  Get report from NGS on GPFS
     Owner: Owen    Status: Open

122  05/07/2006  Circulate Google disk paper
     Owner: Mark    Status: Open (Done!)
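
------------------------------------------------------------------------
NOTE ON 1.1 (fractional uptime)

The arithmetic in 1.1, and Olivier's idea of aggregating per-job file
access results, can be sketched roughly as below. This is only a
minimal illustration: the function names, the pool names and the log
format (a plain list of success/failure flags) are assumptions made up
for the example, not anything agreed at the meeting, and the first
function assumes no replicas, i.e. every file lives on exactly one
pool.

```python
def fractional_uptime(files_per_pool, pools_down):
    """Fraction of files still reachable when some pools are offline.

    files_per_pool: dict of pool name -> number of files (hypothetical names).
    pools_down:     set of pool names that are currently down.
    Assumes no replicas: each file is held by exactly one pool.
    """
    total = sum(files_per_pool.values())
    if total == 0:
        raise ValueError("no files recorded on any pool")
    reachable = sum(
        count for pool, count in files_per_pool.items()
        if pool not in pools_down
    )
    return reachable / total


def observed_uptime(access_results):
    """Aggregate uptime estimate from instrumented jobs.

    access_results: list of booleans, True for a successful file access
    and False for a failed one (an assumed log format, since there is
    no standard return value to rely on).
    """
    if not access_results:
        raise ValueError("no file accesses logged")
    return sum(access_results) / len(access_results)


if __name__ == "__main__":
    # A pool holding 10% of the files goes down.
    print(fractional_uptime({"pool_a": 10, "pool_b": 90}, {"pool_a"}))
    # Three of four logged file accesses succeeded.
    print(observed_uptime([True, True, False, True]))
```

The first call reproduces the "10% of files down, SE 90% up" example
from 1.1; the second shows how logged job results would give a measured
figure rather than a theoretical one.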