The Machine Learning Process and Azure ML

Chapter 0 presents the workflow for developing and deploying enterprise machine learning solutions, regardless of the technology or framework used.

The Machine Learning Process

The machine learning process gives every solution a consistent approach to ingesting data and to training, testing, deploying, and monitoring the model, regardless of the framework used. The steps involved in the workflow are:

Ingest data
Organize, Clean, and Prepare data
Test and Validate data
Deploy the model
Manage and Track the model

Workflow

The five steps listed above are summarized in Figure 0-1.

The Machine Learning Workflow in Figure 0-1 groups these steps into four operations:

Ingest: Data is collected from public and private sources to generate a dataset to learn from.
Organize, Clean, and Prepare: The dataset generated in operation no. 1 is organized, cleaned, and prepared (also called data munging, data prep, or data wrangling), then staged in a secure and accessible environment or repository for machine learning.
Test and Validate: The staged dataset from operation no. 2 is tested and validated using various machine learning algorithms to generate a model.
Deploy, Manage, and Track: The model generated in operation no. 3 is deployed to production, and tracked for usage behavior, analytics, and changes to data.
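
The four operations above can be sketched end to end in a few lines of plain Python. This is only a minimal illustration under stated assumptions, not how any particular framework or Azure service implements the workflow: the inline CSV data, the function names (ingest, prepare, train_and_validate), the DeployedModel class, and the one-variable least-squares model are all hypothetical stand-ins chosen so the example runs with the standard library alone.

```python
import csv
import io
import random
from statistics import mean

# Hypothetical raw data standing in for "public and private sources".
RAW_CSV = """hours,score
1,52
2,55
3,61
,70
4,64
5,69
6,abc
7,77
8,80
"""


def ingest(raw_text):
    """Operation 1: collect raw records into a dataset."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def prepare(rows):
    """Operation 2: keep only rows whose fields parse as numbers."""
    clean = []
    for row in rows:
        try:
            clean.append((float(row["hours"]), float(row["score"])))
        except (TypeError, ValueError):
            continue  # skip rows with missing or malformed values
    return clean


def predict(model, x):
    """Apply the fitted line to a single input."""
    return model["slope"] * x + model["intercept"]


def train_and_validate(data, test_fraction=0.3, seed=0):
    """Operation 3: fit a least-squares line, then score it on held-out rows."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    test, train = shuffled[:n_test], shuffled[n_test:]
    xs, ys = zip(*train)
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in train)
             / sum((x - x_bar) ** 2 for x in xs))
    model = {"slope": slope, "intercept": y_bar - slope * x_bar}
    mae = mean(abs(predict(model, x) - y) for x, y in test)
    return model, mae


class DeployedModel:
    """Operation 4: serve the model and record every call for tracking."""

    def __init__(self, model):
        self.model = model
        self.usage_log = []

    def __call__(self, x):
        y = predict(self.model, x)
        self.usage_log.append({"input": x, "prediction": y})
        return y


dataset = ingest(RAW_CSV)                 # operation 1
staged = prepare(dataset)                 # operation 2
model, mae = train_and_validate(staged)   # operation 3
service = DeployedModel(model)            # operation 4
service(9.0)
print(f"validation MAE: {mae:.2f}, tracked calls: {len(service.usage_log)}")
```

Each function consumes only the output of the previous one, which mirrors how each operation in the workflow builds on the one before it. In a real solution, each stage would be replaced by the corresponding enterprise tool or service.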

Data Roles

The above workflow applies irrespective of the machine learning framework or library used to generate the model. Its steps are performed by different roles within an organization. These roles are captured in Figure 0-2 below:

The data roles in the above illustration don't all necessarily exist inside a single organization. Depending on the size of an organization or its security requirements, two or more roles can be filled by a single team or individual. These roles and their tasks are explained below:

Data Engineer: Data Engineers are usually subject-matter experts (SMEs) responsible for copying data from internal and external sources to an internal location from which it can be ingested into a machine learning solution.
Data Scientist: A Data Scientist cleans, prepares, and explores the data that has been copied to a central location. Data Scientists are well-versed in a host of libraries (mostly open-source) for data analysis and visualization.
Machine Learning Engineer: Machine Learning Engineers are technical resources with a thorough knowledge of the algorithms needed to train, deploy, and interpret models using enterprise data made available to them.
Developer: Developers have little to no knowledge of the machine learning models they are working with, but are very good with enterprise tools and corporate IT systems. Working with data scientists and/or machine learning engineers, developers are required to deploy, manage, and track the models and also consume these models with external systems.

Challenges

The challenges typically associated with implementing a machine learning workflow in the organization are captured in Figure 0-3.

These data challenges can be summarized as follows:

ETL Tools: Various data copy/movement tools within the enterprise present challenges such as vendor management, shadow IT, app security segmentation, etc. A disconnected business process to extract, transform, and load data from internal and external data-stores and file-shares leads to different individuals or teams using tools, technologies, and even security setups other than those prescribed by corporate guidelines.
Data Stores and File-shares: Data for machine learning comes from a wide variety of sources, both within the organization and outside it. This poses not only a security problem but also data ownership issues, since different teams and individuals have different requirements and security postures that cannot all be incorporated into a single, cohesive setup.
Tools and Libraries: Within any mid-size to large organization, hiring for various roles happens at different times and often by different teams. This leads to multiple tools and libraries being used for machine learning testing, deployment, and management, with knowledge of a particular tool or library never leaving a single team or organizational unit.
Cumbersome Infrastructure Management: Software solutions are typically deployed to new virtual hardware with every major release of a software product. Machine learning products are no exception to this rule. However, provisioning new hardware infrastructure for every new machine learning solution is difficult and cumbersome, and causes the same data to be duplicated across multiple silos within the enterprise.

Azure Machine Learning Service

The Azure Machine Learning Service does not exist in isolation. It comprises a host of Azure Services that everyone involved in the workflow interacts with (refer to Figure 0-4 below).

Each of the data roles explained in the earlier section titled 'Data Roles' and shown in Figure 0-2 appears again in Figure 0-4 above. Each role has a different security setup, depending on the task it is supposed to perform, and each has access to a different cloud service. Even so, all data roles involved in the machine learning workflow interact with the Azure Machine Learning Service, as shown in the illustration above.

