Machine learning and artificial intelligence are becoming increasingly important for business success. With the Hadoop platform and frameworks like TensorFlow, a data science environment can be set up with relatively little effort. In highly regulated industries such as the financial sector, however, there are numerous regulatory and technological hurdles to clear before going live. This article presents a proven approach to building a data science pipeline in a highly regulated environment, taking into account the differing requirements of operations, application development, and data science.
In data science and machine learning, much of the work revolves around developing and training models. Somewhat simplified, a model consists of one or more algorithms that are executed as part of a computer program. During training, the algorithms are selected and their specific parameters are determined. The way data scientists work differs from traditional application development in that data plays an essential role: in an iterative process, the data is prepared, analyzed, and cleaned, and models are created and trained over multiple passes. The trained models are then deployed to production.
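To make this cycle concrete, the following minimal sketch runs through one iteration with scikit-learn: load, clean, train, evaluate. The synthetic data, the cleaning rule, and the choice of logistic regression are illustrative assumptions, not part of the original article.

```python
# One pass through the iterative data science cycle described above.
# Data, cleaning rule, and model choice are illustrative assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Step 1: load raw data (here: a synthetic stand-in for productive data).
raw = pd.DataFrame({
    "amount":  [12.5, 890.0, 45.3, None, 230.1, 77.7, 5.0, 410.0, 19.9],
    "weekday": [1, 5, 3, 2, 5, 0, 4, 6, 2],
    "label":   [0, 1, 0, 0, 1, 0, 0, 1, 0],  # e.g. "recurring payment" yes/no
})

# Step 2: clean and prepare; in practice this step is revisited many
# times as the exploratory analysis progresses.
clean = raw.dropna()

# Step 3: train and evaluate; the algorithm and its parameters are
# adjusted iteratively based on the evaluation results.
X = clean[["amount", "weekday"]]
y = clean["label"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Step 4: the trained model would then be packaged and deployed to
# production, e.g. behind a scoring service.
```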
Ideally, data scientists want to work with as much productive data as possible. In many cases this is personal data that is particularly worthy of protection. Anonymizing or pseudonymizing the data appears to solve this problem, but anonymized data is often useless for exploratory analysis and model training. Consider the automatic categorization of transaction data to create a digital budget book: with simple anonymization, the IBAN could be masked, removing the direct link to an account and thus to a person. By analyzing the payment reference of the individual records, however, the personal reference can be restored in most cases.
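The sketch below illustrates this weakness. The record layout, the mask_iban helper, and the payment-reference text are hypothetical examples, not taken from the article.

```python
# Simple IBAN masking and why it is not sufficient; record layout and
# helper function are hypothetical examples.
import re

def mask_iban(iban: str, keep: int = 4) -> str:
    """Replace all but the last `keep` characters of an IBAN with 'X'."""
    compact = iban.replace(" ", "")
    return "X" * (len(compact) - keep) + compact[-keep:]

record = {
    "iban": "DE89 3704 0044 0532 0130 00",
    "amount": -49.99,
    "purpose": "Monthly subscription John Doe 0532013000",
}

masked = {**record, "iban": mask_iban(record["iban"])}
print(masked["iban"])  # e.g. 'XXXXXXXXXXXXXXXXXX3000'

# The weakness: free-text fields such as the payment reference may still
# contain names or account fragments, so the personal reference can
# often be restored despite the masked IBAN.
hint = re.search(r"\b\d{6,}\b", masked["purpose"])
print("re-identification hint in purpose field:", hint.group() if hint else None)
```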