Minutes of storage phone conference 05 July 2006


Present:
	Durham: Mark
	Lancaster: Matt
	Edinburgh: Greig
	London/Imperial: Olivier
	RAL Tier 1: Steve, Derek
	RAL Storage: Owen, Jiri, Jens (chair+mins)

Apologies:
	Glasgow: Graeme


0. Review of actions (see below)


1. GridPP follow up - these are the issues Owen and I picked up last
   week.  Please contribute other issues so we can track them.

   Some of the issues are ones we know about already.

   1.0. The meta-question of which priorities to assign to the issues
	below.  TODO!

	The meta-question of how and where to track them: actions,
	wiki, Savannah, ...  We agreed with Owen's point that the Wiki
	is the best place.

   1.1. The definitions of "uptime" and "available".

	What does it mean that the SE is available?  How does the SE's
	availability affect the site's availability?

	This is just to summarise: "available" (also in terms of the
	MoU requirements) means available via an SRM (current version)
	interface.  If it is not reachable via an SRM, it is not
	available.

	We discussed uptime.  What came out of GridPP last week was
	that an SE can be fractionally up.  If a pool holds 10% of
	your files, and it goes down, your SE is then 90% up.  Thus,
	you can improve uptime/resiliency by keeping replicas (a la
	resilient dCache), but at the extra storage cost.
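
	The arithmetic above can be sketched as follows.  This is only
	an illustration: the function names, the pool names and the
	data layout (files per pool, or pools per file) are assumptions
	made for the example, not part of dCache or any other SRM
	implementation.

```python
def fractional_uptime(files_per_pool, pools_down):
    """Fraction of files still reachable when some pools are down.

    files_per_pool: dict mapping a (hypothetical) pool name to the number
    of files it holds, assuming each file lives on exactly one pool.
    """
    total = sum(files_per_pool.values())
    if total == 0:
        raise ValueError("SE holds no files, uptime is undefined")
    reachable = sum(n for pool, n in files_per_pool.items()
                    if pool not in pools_down)
    return reachable / total


def fractional_uptime_with_replicas(pools_per_file, pools_down):
    """Same measure when files may have replicas on several pools.

    pools_per_file: dict mapping a file name to the set of pools holding
    a copy.  A file counts as available if at least one copy is on a
    pool that is still up.
    """
    if not pools_per_file:
        raise ValueError("SE holds no files, uptime is undefined")
    down = set(pools_down)
    available = sum(1 for pools in pools_per_file.values()
                    if set(pools) - down)
    return available / len(pools_per_file)


# The example from the discussion: a pool with 10% of the files goes down.
print(fractional_uptime({"pool1": 10, "pool2": 90}, {"pool1"}))  # 0.9
```

	With a replica of every file on a second pool, the second
	function would still report 1.0 after losing one pool, at the
	cost of storing each file twice.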

	The questions then are (a) do we need to measure this, and (b)
	if so, how.  Jens suggested random samples, picked from the
	file catalogues, but Owen points out that permissions could be
	a problem.  Olivier suggested instrumenting jobs - files they
	try to access can be logged as successful or failures, thus
	leading to an aggregated measure of SE uptime.  RB monitoring
	can log return codes, but unfortunately there is no standard
	at the moment for the return value.
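
	Olivier's suggestion could look roughly like the sketch below.
	It assumes that each job logs one record per file access
	attempt, as a pair (SE name, success flag); this record format
	and the function name are hypothetical, since no standard for
	the return value exists yet.

```python
from collections import defaultdict


def aggregate_se_uptime(records):
    """Aggregate per-job file access results into a per-SE success rate.

    records: iterable of (se_name, succeeded) pairs, where succeeded is
    True if the job could access the file and False otherwise.
    Returns a dict mapping each SE name to successes / attempts.
    """
    attempts = defaultdict(int)
    successes = defaultdict(int)
    for se_name, succeeded in records:
        attempts[se_name] += 1
        if succeeded:
            successes[se_name] += 1
    return {se: successes[se] / attempts[se] for se in attempts}


# Hypothetical log extract: two accesses to one SE, one to another.
log = [("se-a.example.org", True),
       ("se-a.example.org", False),
       ("se-b.example.org", True)]
for se, rate in sorted(aggregate_se_uptime(log).items()):
    print(se, rate)
```

	The same aggregation would work for Jens' random samples from
	the file catalogues, with the sampling tool producing the
	records instead of the jobs.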

   1.2. Enabling "dark storage" (sort of analogous to dark matter; it's
	there but you can't see it).

	We discussed solutions.  dCache is the obvious solution; in
	this case it can be done in resilient mode or not.  It is
	recommended not to run dCache in resilient mode in your
	production system!  Greig has a resilient dCache on a single
	machine.

	Owen thinks this is best done by pretending to have an MSS
	backend, so that write pools and read pools are managed
	separately; this setup is slightly broken in disk-only dCache.
	Greig, however, is doing that in Edinburgh (i.e. disk only),
	but the problem is flushing from write pools to read pools:
	the flushing is only done when a request comes in for a file
	which exists in the write pool, and it really should be a
	background task.  This may be fixed in a future version of
	dCache.

	Greig has a paper on resilient dCache - will circulate
	(ACTION).  Non-resilient mode is a risk because pools may go
	down, but resilient mode is also a risk.

	Mark suggested looking at lower level services, such as iSCSI,
	to manage the distributed storage - are there other ways to
	solve the problem?  Owen suggested talking to the new NGS chap
	about GPFS (ACTION).

	People have also tried distributed filesystems, with mixed
	experiences - in the UK we have most experience with NFS for
	this purpose, and the conclusion was that it couldn't be
	recommended.  There is also Andrew McNab's http version of
	xrootd which he aims to "SRM enable" - however, this comes at
	a time when the PMB is questioning whether we can support even
	three implementations in the long term.

	No new news from QMUL yet, but Alex has made it clear he will
	not run non-Open Source.  dCache will have an old Qt style
	licence - source is available but will not allow for
	modification or redistribution (neither does TeX BTW, but you
	can modify if you rename it).  Also, WNs will need incoming
	connectivity regardless of the solution...

   1.4. [sic] The case of the cross-site SE (don't do this on your
	production system!)

	No news from Lancaster, but they are doing it with dCache and
	not on their production system!  Usefulness will depend on a
	replication facility to replicate between the sites.

4. [sic] AOB

Mark mentioned that people may be interested in the Google disk paper
and will circulate [ACTION].

------------------------------------------------------------------------
ACTIONS

41	10/08/2005	Agree licence with DESY	Jens	Open

Being resolved with dCache going Open Source.

53	12/10/2005	Find reasonable % for SE uptime for SC4	Jeremy	Open

Progress - we now have a definition of uptime, following the
discussion last week, but currently cannot measure it.

54	02/11/2005	Report on performance/scalability with pools on WNs	Paul	Open

Still important, so reassigned to Greig.

86	08/02/2006	Extend monitoring to do sites per VO and VOs per site	Greig	Open

Work is ongoing with Dave Kant; there is a plan to publish via R-GMA
(i.e. an R-GMA producer queries the relevant GRIS and publishes the
information into an archiver).

105	03/05/2006	Re-poke DESY or FNAL about SRM 2.1 for dCache	Owen	Open

2.1 has been tested but we still don't have one deployed in GridPP.
Still open.

112	23/05/2006	Document Xen stuff in Wiki	Jiri	Open

No news.  Greig suggests this is low priority.

115	31/05/2006	Speak to dCache team about Tier-2 dCache configuration	Owen	Open

Done.

116	31/05/2006	Progress of Durham-MAN networking discussions.	Mark	Open

Ongoing.  An external review is expected to report back within the
next couple of weeks.  Pete Clarke is kept au courant.

117	07/06/2006	Document dCache release procedure	Owen/Greig	Open

Owen has sent the document to Greig but hasn't had feedback yet on
this version.

118	07/06/2006	Document pool node dependency problems in savannah	Owen	Open

Documented, but in which project? Issues in Savannah are tracked in
many different projects; Owen should circulate a list of them for our
general edification (ACTION).

119	07/06/2006	Circulate next version of VO storage to list	Jens	Open

Still needed!  Progress was made last week at GridPP, but the
information still needs to be summarised and circulated to the UB.

------------------------------------------------------------------------
NEW ACTIONS

120	05/07/2006	Circulate paper on resilient dCache	Greig	Open
121	05/07/2006	Get report from NGS on GPFS	Owen	Open
122	05/07/2006	Circulate Google disk paper	Mark	Open (Done!)