PhD Position F/M: Reusable and Adaptable Machine Learning for Network Security

INRIA | 15 Oct 2024

Job details
Job Type
Temporary
Contract

Full Job Description

The job description below is in English.

Contract type: Fixed-term contract (CDD)

Required degree level: Master's degree (Bac+5) or equivalent

Role: PhD student

Context and advantages of the position
The PhD position is proposed by the RESIST team of the Inria Nancy Grand Est research lab. Inria is the French national public institute dedicated to research in digital science and technology. The team is one of the European research groups in network management and focuses on improving the scalability and security of networked systems through a strong coupling between monitoring, analytics and network orchestration. https://team.inria.fr/resist/

Assignment
Scientific context:

Cybersecurity is a major concern everywhere as connected devices proliferate well beyond conventional computers. Decades of research and development have produced new techniques and tools to fight back against attacks on the Internet. Nonetheless, the number of attacks and their magnitude are still growing. The attack surface keeps expanding with the number of connected devices and with the number of applications, services and software that make today's IT ecosystem very different from its origins. The techniques used by both attackers and defenders are evolving towards complex mechanisms [1]. This has led to the massive use of encryption to avoid data leaks, but attackers simultaneously benefit from encryption to hide their own activities. As a result, intrusion detection methods relying on artificial intelligence have been investigated both in research and in industry [2]. During the last twenty years, there has been an increasing adoption of advanced analytics techniques, especially machine learning, in all areas of networking [3]. Many proposals are being developed to achieve a higher level of automation, including data-driven networks [4], knowledge-defined networks [5] and, more recently, self-driving networks [6]. The key objective of all these techniques is to extract relevant information from observations in order to reach different goals, such as enhancing performance or end-user experience, lowering the carbon footprint or, in the context of this thesis, improving network security.

[1] I. Friedberg, F. Skopik, G. Settanni, and R. Fiedler. Combating advanced persistent threats: From network event correlation to incident detection. Computers & Security, 48:35-57, 2015.

[2] A. L. Buczak and E. Guven. A survey of data mining and machine learning methods for cyber security intrusion detection. IEEE Communications Surveys & Tutorials, 18(2):1153-1176, 2016. doi: 10.1109/COMST.2015.2494502

[3] R. Boutaba, M. A. Salahuddin, N. Limam, S. Ayoubi, N. Shahriar, F. Estrada-Solano, and O. M. Caicedo. A comprehensive survey on machine learning for networking: evolution, applications and research opportunities. Journal of Internet Services and Applications, 9(1):16, Jun 2018.

[4] J. Jiang, V. Sekar, I. Stoica, and H. Zhang. Unleashing the potential of data-driven networking. In COMSNETS (Revised Selected Papers and Invited Papers), volume 10340 of Lecture Notes in Computer Science, pages 110-126. Springer, 2017.

[5] A. Mestres, A. Rodriguez-Natal, J. Carner, P. Barlet-Ros, E. Alarcón, M. Solé, V. Muntés-Mulero, D. Meyer, S. Barkai, M. J. Hibbett, G. Estrada, K. Ma'ruf, F. Coras, V. Ermagan, H. Latapie, C. Cassar, J. Evans, F. Maino, J. Walrand, and A. Cabellos. Knowledge-defined networking. SIGCOMM Comput. Commun. Rev., 47(3):2-10, Sept. 2017.

[6] N. Feamster and J. Rexford. Why (and how) networks should run themselves. CoRR, abs/1710.11583, 2017.

Scientific challenges

As in other domains leveraging Machine Learning (ML), each proposed ML-based solution for network operations requires selecting, configuring or extending an ML technique for a particular scenario. The major problems concern the definition of features, metrics and ML algorithms. The reusability or adaptation of existing results is limited, and context-specific data interpretation or integration into an analytics framework is required. Some proposals have been made for port numbers and IP addresses [7], but the proposed metrics are too coarse-grained. This is still far from satisfactory, and the same applies to the mathematical elements manipulated in the algorithms, such as kernel functions or neural networks, which have not been specifically designed for networking [8]. A major research challenge is the definition of network-based features that are meaningful and reusable in a variety of scenarios (with a focus on network security) and that can be integrated into different ML algorithms. For instance, applying ML algorithms to network data requires the definition of new metrics capable of capturing the properties of network configurations, packets, flows, etc. Therefore, a key challenge is to represent them in a meaningful space such that semantic operations, like distance, similarity or comparison, can be applied. It is also important to evaluate the impact and contribution of the collected attributes to the final targeted goals (e.g. detecting attacks). A second challenge is to select the right attributes according to given criteria. Obviously, a major criterion is the contribution of a feature to the accuracy of the learnt model, but others must be taken into account: the overhead or cost of collecting and transforming the necessary data, and the privacy impact.
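
To illustrate what a distance over heterogeneous network attributes could look like, here is a minimal sketch in Python. It is not the method of the thesis or of [7]: it assumes a Gower-style average of per-attribute distances, and the field names (dst_ip, proto, bytes), the shared-prefix distance for IPv4 addresses and the byte_range normalisation constant are all hypothetical choices made for the example.

```python
import ipaddress


def ip_distance(a: str, b: str) -> float:
    """Distance in [0, 1] derived from the shared prefix length of two IPv4 addresses."""
    x = int(ipaddress.IPv4Address(a))
    y = int(ipaddress.IPv4Address(b))
    shared = 32 - (x ^ y).bit_length()  # number of leading bits in common
    return 1.0 - shared / 32


def flow_distance(f1: dict, f2: dict, byte_range: float) -> float:
    """Average of per-attribute distances (hypothetical flow schema)."""
    parts = [
        ip_distance(f1["dst_ip"], f2["dst_ip"]),                # structured attribute
        0.0 if f1["proto"] == f2["proto"] else 1.0,             # categorical attribute
        min(abs(f1["bytes"] - f2["bytes"]) / byte_range, 1.0),  # numerical attribute
    ]
    return sum(parts) / len(parts)


flow_a = {"dst_ip": "192.168.1.0", "proto": "tcp", "bytes": 1000}
flow_b = {"dst_ip": "192.168.1.128", "proto": "udp", "bytes": 1500}
distance = flow_distance(flow_a, flow_b, byte_range=1000)
```

Treating each attribute according to its nature, rather than as a raw number, is the point of the example: two addresses in the same /24 are considered close even though their integer values differ.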

[7] S. E. Coull, F. Monrose, and M. Bailey. On measuring the similarity of network hosts: Pitfalls, new metrics, and empirical analyses. In Network and Distributed System Security Symposium, 01 2011.

[8] M. Lopez-Martin, B. Carro, A. Sanchez-Esguevillas, and J. Lloret. Network traffic classifier with convolutional and recurrent neural networks for internet of things. IEEE Access, 5, 2017.

Main activities
The first objective of the thesis is to define new representations of network data as features for ML algorithms. An in-depth study of the usable raw data is necessary to identify their different natures (numerical, categorical, discrete, etc.). Capturing these characteristics is required to define usable features, metrics or distances over these data. Several transformations might be necessary to go from raw data to usable data, and different approaches will be considered. First, to be easily integrated into common ML algorithms, embedding techniques can be defined to represent various types of network elements (flows, packets, topologies, forwarding tables, etc.) as fixed-size vectors. These vectors must capture the intrinsic properties of the data they represent, for example the structural properties of topologies or the functional properties of forwarding tables. Second, graph neural networks have been leveraged to model the dependencies between network topology, routing and traffic [9]. We also expect to explore this direction by using graphs to represent other types of data, such as flow or packet dependencies. In addition, temporal graph neural networks can be leveraged to capture temporal features. The PhD candidate will evaluate the relevance of the different features by using them in conjunction with different ML algorithms.
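
As a simple illustration of representing a network element as a fixed-size vector, the following Python sketch maps a flow of any length to a 7-dimensional vector. It is a hand-crafted baseline, not a learned embedding and not the approach the thesis will develop; the packet tuple layout (timestamp, size, direction), the SIZE_BINS edges and the chosen statistics are assumptions made for the example.

```python
from statistics import mean

# Hypothetical lower edges (in bytes) of the packet-size histogram bins.
SIZE_BINS = (0, 128, 512, 1024, 1500)


def embed_flow(packets: list) -> list:
    """Map a variable-length list of (timestamp, size, direction) packets
    to a vector of len(SIZE_BINS) + 2 floats."""
    if not packets:
        return [0.0] * (len(SIZE_BINS) + 2)
    hist = [0] * len(SIZE_BINS)
    for _, size, _ in packets:
        # Index of the last bin edge that does not exceed the packet size
        # (sizes are assumed to be non-negative).
        idx = max(i for i, edge in enumerate(SIZE_BINS) if size >= edge)
        hist[idx] += 1
    n = len(packets)
    size_hist = [count / n for count in hist]         # normalised size histogram
    outbound = sum(1 for _, _, d in packets if d == "out") / n
    times = [t for t, _, _ in packets]
    gaps = [b - a for a, b in zip(times, times[1:])]  # inter-arrival times
    mean_gap = mean(gaps) if gaps else 0.0
    return size_hist + [outbound, mean_gap]


flow = [(0.0, 60, "out"), (0.5, 1500, "in"), (1.0, 600, "out"), (2.0, 60, "in")]
vector = embed_flow(flow)
```

Because the output length does not depend on the number of packets, flows of different sizes can be fed to the same ML algorithm, which is the property the fixed-size requirement is after.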

The second objective of the thesis is to define a method to automatically select the right set of features from those defined in the first objective. The data to be collected accordingly also need to be inferred. Taking as input some constraints regarding the targeted goal (for example a minimal accuracy and/or a maximum amount of data to be collected), the method would select the best features and the minimal set of data, to avoid gathering too much data while reaching a high level of accuracy. In the context of network security, the goal will be to identify attacks promptly and thus mitigate them. (Deep) Reinforcement Learning will be considered as a first direction in order to continuously adapt the feature sets in an evolving environment. Generative models will also be investigated to discard and modify data, or even insert or build synthetic information [10], in order to keep the accuracy at the targeted level while lowering the privacy impact.
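
The constrained selection problem described above can be made concrete with a small Python sketch. It deliberately uses a greedy heuristic rather than the (Deep) Reinforcement Learning direction mentioned in the text, and it assumes that each candidate feature comes with an estimated accuracy gain and a collection cost, and that gains add up. The feature names and all numbers are hypothetical.

```python
def select_features(candidates, base_accuracy, min_accuracy, max_cost):
    """Greedy baseline: add features by decreasing gain/cost ratio until the
    accuracy target is reached, never exceeding the collection budget.
    Assumes additive gains and strictly positive costs (both simplifications)."""
    chosen, accuracy, cost = [], base_accuracy, 0.0
    ranked = sorted(candidates, key=lambda c: c["gain"] / c["cost"], reverse=True)
    for candidate in ranked:
        if accuracy >= min_accuracy:
            break  # target reached: collect nothing more
        if cost + candidate["cost"] <= max_cost:
            chosen.append(candidate["name"])
            accuracy += candidate["gain"]
            cost += candidate["cost"]
    return chosen, accuracy, cost


# Hypothetical candidates: estimated accuracy gain and data-collection cost.
candidates = [
    {"name": "pkt_size_hist", "gain": 0.10, "cost": 1.0},
    {"name": "payload_entropy", "gain": 0.12, "cost": 4.0},
    {"name": "flow_duration", "gain": 0.06, "cost": 0.5},
    {"name": "tls_fingerprint", "gain": 0.08, "cost": 2.0},
]
chosen, accuracy, cost = select_features(
    candidates, base_accuracy=0.70, min_accuracy=0.90, max_cost=4.0
)
```

In practice feature contributions interact and change over time, which is why the text points to learning-based selection; the sketch only shows the shape of the inputs (constraints) and outputs (feature set, expected accuracy, collection cost).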

[9] K. Rusek, J. Suárez-Varela, A. Mestres, P. Barlet-Ros, and A. Cabellos-Aparicio. Unveiling the potential of Graph Neural Networks for network modeling and optimization in SDN. In Proceedings of the 2019 ACM Symposium on SDN Research (SOSR'19), 2019.

[10] N. Patki, R. Wedge, and K. Veeramachaneni. The synthetic data vault. In 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pages 399-410, Oct 2016.

Skills

Required qualifications
- Required qualification: Master's degree in computer science
- Required knowledge: networking, programming (Python, Java or others)
- Knowledge and skills in the following fields will be appreciated: machine learning, artificial intelligence, big data, Linux (command-line use, shells)

Benefits
- Subsidized meals
- Partial reimbursement of public transport costs
- Leave: 7 weeks of annual leave + 10 extra days off due to RTT (statutory reduction in working hours) + possibility of exceptional leave (sick children, moving home, etc.)
- Possibility of teleworking (after 6 months of employment) and flexible organization of working hours
- Professional equipment available (videoconferencing, loan of computer equipment, etc.)
- Social, cultural and sports events and activities
- Access to vocational training
- Social security coverage

Remuneration
Salary: 1982€ gross/month for the 1st and 2nd years, 2085€ gross/month for the 3rd year.

Monthly salary after taxes: around 1596.05€ for the 1st and 2nd years, 1678.99€ for the 3rd year (medical insurance included).

Complete your application on the recruiter's website.