====== WP2: Big Data Science Software ======

The goal of this work package is, first, to improve the basic software of the KCDC data centre. We will then adapt this software to move the data centre to the Big Data environment provided by KIT-SCC and MSU-SINP. Both experiences will finally be used to provide a concept and a standard software package for scientific data centres. The project will thus serve as a general software solution for providing open access to (astroparticle) data, independently of the exact kind of data. The software needs to be built as a modular, flexible framework with good scalability (e.g. to large computing centres). The configuration should be kept simple and should also be possible via a web interface. The entire software will be based solely on Open Source Software (Python, Django, MongoDB, HDF5, ROOT, etc.). This step then enables the installation and publication of a dedicated Data Life Cycle Lab for the astroparticle physics community, which is what this project is aiming for. Dedicated access, storage, and interface software has to be developed.
| + | |||
A brief summary of the requirements, and of the features that should be improved, is compiled in the following: When publishing data, it is not enough to put some ASCII files on a plain webpage. To ensure that the users can actually use the data, extensive documentation on how the data has been obtained is needed. Depending on the kind of data, this is at least a description of the detector and of the reconstruction procedures employed. Since this information rarely changes, the use of static pages is a viable option. Another important aspect is user and access management. For KCDC it was enough to ensure that only registered users get access to the data and to certain parts of the portal, such as the data shop. Therefore, it was sufficient to check whether an account is active and the user is authenticated. While there is already a basic implementation of a permission-based access limitation, a useful categorization of the users into - possibly hierarchical - groups is needed to use it effectively (no administrator should manually manage the privileges of single users).
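The hierarchical groups mentioned above could, for example, be realized by resolving a user's effective permissions through a group tree, so that administrators only ever assign groups, never per-user privileges. A minimal pure-Python sketch of this idea (the class, group, and permission names are illustrative assumptions, not part of the KCDC codebase):

```python
# Sketch of hierarchical, group-based permissions: a user's effective
# permissions are the union of the permissions of all groups on the
# path from the user's groups up to the root of the hierarchy.

class Group:
    def __init__(self, name, permissions=(), parent=None):
        self.name = name
        self.permissions = set(permissions)
        self.parent = parent  # optional parent group in the hierarchy

    def effective_permissions(self):
        # Walk up the hierarchy and collect inherited permissions.
        perms = set(self.permissions)
        if self.parent is not None:
            perms |= self.parent.effective_permissions()
        return perms

class User:
    def __init__(self, name, groups=(), active=True):
        self.name = name
        self.groups = list(groups)
        self.active = active

    def has_permission(self, permission):
        # As in KCDC, an inactive account gets no access at all.
        if not self.active:
            return False
        return any(permission in g.effective_permissions() for g in self.groups)

# Example hierarchy: every data-shop user inherits plain portal access.
registered = Group("registered", {"view_docs"})
shop_users = Group("shop-users", {"use_data_shop"}, parent=registered)

alice = User("alice", groups=[shop_users])
print(alice.has_permission("use_data_shop"))  # True: direct group permission
print(alice.has_permission("view_docs"))      # True: inherited from parent group
```

An administrator then only decides which groups a user belongs to; the permission set follows from the hierarchy.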
The heart of the data centre is the data shop. The design goals have to ensure easy access to a user-defined subset of the whole dataset, a natural way to configure additional detector components or observables without the need to change the codebase, and a clear overview of previous requests, with the possibility to use these selections as templates for future requests. For KCDC, the data is stored in a MongoDB. Its schema-less design allows us to collect all available information of an event, although the available detector components may vary from event to event. In principle, KCDC can use any kind of input format, as these are implemented as plugins. Currently three output formats are supported: HDF5, ROOT, and ASCII. These, too, can be extended by adding additional plugins. The requests are processed using a Celery-based task queue. Simultaneous processing of requests can be achieved by adding more worker processes, which can be distributed among several machines. Together with the possibility to run the MongoDB on a sharded cluster, the processing power can thus be scaled with demand. The use and configuration of a sharded cluster has yet to be studied, however. Not yet included in KCDC, a nice addition would be the possibility to publish results, such as energy spectra, together with the option to select multiple spectra and to compile and download plots. Adding a way to visually select the data would also be possible. In addition, the appropriate hardware has to be installed and commissioned. The described concept and working plan is also valid for the next steps, i.e. to generalize and consolidate the software package of a global astroparticle data centre for public and scientific use.
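The data-shop design described above - schema-less event records, user-defined selections, and pluggable output formats - can be sketched compactly in plain Python. Dictionaries stand in for MongoDB documents here, and the cut syntax and plugin registry are illustrative assumptions, not the actual KCDC interfaces:

```python
# Registry for output-format plugins: new formats (HDF5, ROOT, ...)
# register here without any change to the selection code.
OUTPUT_PLUGINS = {}

def output_plugin(name):
    def register(func):
        OUTPUT_PLUGINS[name] = func
        return func
    return register

@output_plugin("ascii")
def to_ascii(events, fields):
    # Minimal ASCII export: tab-separated values, "nan" for missing fields.
    lines = ["\t".join(fields)]
    for ev in events:
        lines.append("\t".join(str(ev.get(f, "nan")) for f in fields))
    return "\n".join(lines)

def select(events, cuts):
    # MongoDB-style range selection: cuts maps a field to (low, high).
    # An event missing a cut field is excluded from the subset.
    return [ev for ev in events
            if all(f in ev and lo <= ev[f] <= hi
                   for f, (lo, hi) in cuts.items())]

# Schema-less records: not every event has every detector component.
events = [
    {"run": 1, "energy": 8.1, "zenith": 12.0},
    {"run": 1, "energy": 9.4},                 # no zenith reconstruction
    {"run": 2, "energy": 7.2, "zenith": 40.5},
]

subset = select(events, {"energy": (7.5, 10.0), "zenith": (0.0, 30.0)})
print(OUTPUT_PLUGINS["ascii"](subset, ["run", "energy", "zenith"]))
```

A stored request would essentially be the `cuts` dictionary plus the chosen output plugin, which is what makes reusing a previous selection as a template straightforward.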
| + | |||
Specific tasks:
  * Movement of KCDC to a large-scale computing facility and adaptation to the new environment (KIT-SCC, KIT-IKP, MSU-SINP)
  * Optimizing the data bank and the access interfaces (MSU-SINP, ISDCT, KIT-SCC)
  * Development of a distributed system for storing and archiving the data (MSU-SINP, KIT-SCC)
  * Installation of appropriate hardware (KIT-SCC)
  * Installation of the "Data Life Cycle Lab" (KIT-SCC)
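The Celery-based request processing described above scales by adding worker processes, possibly on several machines. The same pattern can be sketched with the standard library alone - a queue of requests consumed by a configurable number of workers; the request payload and function names are invented for illustration:

```python
import queue
import threading

def process_request(request):
    # Placeholder for the real work: query the database for the selected
    # events and convert them to the requested output format.
    return {"user": request["user"], "status": "done"}

def run_workers(requests, n_workers=4):
    # A Celery queue in miniature: throughput grows by raising n_workers
    # (or, in the real system, by adding worker machines).
    tasks = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                req = tasks.get_nowait()
            except queue.Empty:
                return
            res = process_request(req)
            with lock:
                results.append(res)

    for req in requests:
        tasks.put(req)
    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

requests = [{"user": f"user{i}"} for i in range(8)]
print(len(run_workers(requests, n_workers=3)))  # 8: all requests processed
```

In the production setup, Celery replaces the hand-rolled queue and adds brokering, retries, and distribution across hosts; the sketch only illustrates why adding workers scales the request throughput.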
