Identifying events involving the Higgs boson in a particle accelerator - applications of Big Data processing in the Cloud and Machine Learning in particle physics
Our application addresses a modern scientific problem: the need for automated processing of scientific data. The constantly growing volume of experimental data, obtained for example from the LHC at CERN, is far too large to be processed by humans in a reasonable time. Drawing conclusions from these data therefore requires techniques such as Machine Learning, Big Data processing, and Cloud computing.
The dataset used in this demonstration comes from the UCI ML Repository and can be downloaded as a .gz file from https://archive.ics.uci.edu/ml/datasets/HIGGS#. The data was produced using Monte Carlo simulations and resembles real data obtained during experiments in a particle accelerator. The first column contains the label of a given event (1 - event involving a Higgs boson, 0 - background process not involving a Higgs boson). The first 21 features (columns 2-22) contain kinematic properties measured by the particle detectors in the accelerator. The last seven features are high-level features derived by physicists from the first 21 to help discriminate between the two classes.
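For illustration, a minimal loading sketch is shown below. The column names are our own invented labels for readability; the gzipped CSV itself ships without a header row.

```python
# Minimal sketch of loading and inspecting the HIGGS data with pandas.
# The column names are invented for readability; the file has no header row.
import pandas as pd

columns = (["label"]
           + [f"kinematic_{i}" for i in range(1, 22)]   # 21 low-level features
           + [f"derived_{i}" for i in range(1, 8)])     # 7 high-level features

df = pd.read_csv("HIGGS.csv.gz", compression="gzip", header=None, names=columns)

# Class balance: 1 = event involving a Higgs boson, 0 = background process
print(df["label"].value_counts())
```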
The project consists of two parts: a machine learning model (boosted trees with Python and XGBoost) and dataset statistics (Python with Spark). The machine learning model learns dependencies from the data and can predict whether a given event involves a Higgs boson. A trained model is particularly useful because it can indicate which events may be interesting and deserve a closer look; separating interesting events from the background noise makes further data processing much easier. Generating dataset statistics also plays an important role, since it provides deeper insight into the dataset and its features.
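A hedged sketch of what the boosted-tree part might look like with XGBoost is given below; the train/test split, parameter values, and file name are illustrative assumptions, not the project's exact configuration.

```python
# Illustrative XGBoost training sketch; parameter values and file name are
# assumptions for demonstration, not the project's tuned configuration.
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split

data = pd.read_csv("HIGGS.csv.gz", compression="gzip", header=None)
y = data.iloc[:, 0]     # first column: event label (1 = Higgs, 0 = background)
X = data.iloc[:, 1:]    # remaining 28 columns: kinematic and derived features

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

params = {"objective": "binary:logistic", "eval_metric": "auc", "max_depth": 6, "eta": 0.1}
model = xgb.train(params, dtrain, num_boost_round=100,
                  evals=[(dtest, "test")], early_stopping_rounds=10)

# Predicted probabilities that each test event involves a Higgs boson
print(model.predict(dtest)[:10])
```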
Both parts of the project are Python scripts which can be run, for instance, in a Linux terminal, with program output directed to standard output. Both use the aforementioned dataset as input.
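The dataset-statistics part could look roughly like the PySpark sketch below, with results printed to standard output; the statistics shown and the way columns are handled are assumptions, not the project's exact code.

```python
# Illustrative PySpark sketch for basic dataset statistics printed to stdout.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("higgs-statistics").getOrCreate()

df = spark.read.csv("HIGGS.csv.gz", header=False, inferSchema=True)
label_col = df.columns[0]   # first column is the event label

# Class balance: Higgs-boson events vs. background events
df.groupBy(label_col).count().show()

# Per-column summary statistics (count, mean, stddev, min, max)
df.describe().show()

spark.stop()
```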
The most important insight we gained from this project is how crucial automated data processing is in experimental physics. Given the size of the dataset, it is clearly infeasible for humans to process it by hand, and real experimental datasets can be even larger. This highlights that Machine Learning and Big Data techniques are necessary to draw conclusions from datasets of that size.
To improve the machine learning results by decreasing overfitting, the following improvements could be made: adding more data (the model was trained on only part of the original dataset); trying different values of regularization parameters such as gamma (the minimum loss reduction), alpha (L1 regularization), maximum tree depth, eta (learning rate), and subsample (the subsample ratio of the training instances), as illustrated in the sketch below; and training the model for more epochs.
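As a rough illustration of where these parameters would go, the snippet below shows an XGBoost parameter dictionary containing the regularization knobs mentioned above; the values are placeholders, not tuned results from the project.

```python
# Hedged example of the regularization settings discussed above; the values
# are illustrative placeholders, not tuned results from the project.
params = {
    "objective": "binary:logistic",
    "eval_metric": "auc",
    "gamma": 1.0,       # minimum loss reduction required to make a further split
    "alpha": 0.5,       # L1 regularization term on weights
    "max_depth": 4,     # shallower trees tend to overfit less
    "eta": 0.05,        # learning rate
    "subsample": 0.8,   # fraction of training instances sampled per tree
}

# Training for more rounds (epochs) with early stopping, reusing the DMatrix
# objects from the earlier sketch, would then look like:
# model = xgb.train(params, dtrain, num_boost_round=500,
#                   evals=[(dtest, "test")], early_stopping_rounds=20)
```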
The most enjoyable part of the project was searching for the best machine learning algorithm for the given problem and reading about a different approach to a well-known machine learning algorithm, the decision tree classifier. The most challenging aspect was the size of the original dataset (2.8 GB after compression), which made it somewhat cumbersome to work with. The most time-consuming part was training the model with different hyperparameters to obtain better performance. One thing we would change is training the model on a more powerful machine, which would make it possible to use the whole dataset and train for more epochs, yielding better results.