1. Start from the process, not from the data
The central point is to get an exact picture of what you want to know: for exactly which application questions is it worth gaining additional insight into how the parameters of a production environment are currently evolving? The more concretely you define the knowledge goal and the associated analysis question, the more precisely you can determine the data pools from which you can gain reliable insights.
It would be natural to look at the process first. In practice, however, it is often the other way around: teams start from the data. This is the case in the SPC environment, for example, where QM-relevant data points exist in abundance and it is therefore tempting to throw all of this data at the problem at once. In practice, however, the majority of such purely data-driven IT projects develop in directions that are of little or even no practical use. Big Data projects in particular have demonstrated this quite impressively in the recent past. Anyone who thinks machine learning is substantially different overlooks the fact that even the most elaborate algorithms are only as smart as the questions that guide their work. So, for all teams looking to get the horsepower of their predictive quality program onto the road: never start from the data. Always think from the outcome and get clarity on the analytics goal.
2. Involve specialist colleagues right from the start
Against this backdrop, in most cases you do not even need designated data scientists to train the algorithms appropriately. It is far more important that the business users are involved from the start, because there are two core tasks. First, formulating the information target precisely. Second, identifying all requirements and framework conditions that significantly influence the performance of the processes under examination: How exactly does the process run, and which parameters must be met so that the desired quality is actually achieved? In addition, it must be clear which characteristics actually constitute the target quality, and within which tolerance band these characteristics may move.
Only when all of these things are precisely specified can you tap exactly the data pools that best contribute to answering the analysis question in all of its dimensions. Specialized data scientists can certainly support this conceptual work. But without the knowledge of the business users, it will hardly be possible to set up a suitable analysis machinery and make the algorithm of choice (see tip 5) truly productive.
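The tolerance-band idea above can be made concrete even before any machine learning is involved: write the characteristics and their permitted bands down explicitly, so that business users and data scientists share one definition of "good". A minimal sketch, in which all characteristic names and limits are illustrative assumptions rather than values from a real process:

```python
# Illustrative sketch: quality characteristics with tolerance bands.
# Names and limits are invented for this example.
TOLERANCES = {
    "diameter_mm": (9.95, 10.05),        # (lower limit, upper limit)
    "surface_roughness_um": (0.0, 1.6),
}

def out_of_tolerance(measurement: dict) -> list:
    """Return the names of all characteristics outside their band."""
    return [
        name
        for name, (low, high) in TOLERANCES.items()
        if not (low <= measurement.get(name, float("nan")) <= high)
    ]

part = {"diameter_mm": 10.02, "surface_roughness_um": 1.9}
print(out_of_tolerance(part))  # → ['surface_roughness_um']
```

A missing characteristic is also flagged (the `nan` comparison fails), which matches the completeness requirement discussed in tip 3.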
3. Initially work only with historical data
It is also extremely important to start with existing data rather than using live data from a production environment at the very beginning of training. At this stage of the implementation, the goal is only to test the algorithm and the analysis model, and for that you need data of the highest possible quality. Appropriate data cleansing is almost always required, and performing this basic task on data that is already available is much easier and cheaper than on live data. The guiding questions: Is our data complete? Is it consistent? Is it already categorized correctly? And, very importantly: does it really fit the current analysis topic?
Only when the training data has been optimized with respect to all of these questions (and a whole series of others) will the ML algorithm be steered in the right direction. Data quality then largely determines whether an analysis model bears fruit and whether its results actually relate to the intended use case.
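The completeness and categorization questions above translate directly into a small audit over the historical records. The following sketch uses invented field names and rules purely for illustration; a real cleansing step would cover many more checks:

```python
# Illustrative sketch of basic data-quality checks on historical records.
# Field names, values, and the category list are invented assumptions.
RECORDS = [
    {"part_id": "A1", "temperature": 182.0, "result": "ok"},
    {"part_id": "A2", "temperature": None,  "result": "ok"},       # incomplete
    {"part_id": "A3", "temperature": 179.5, "result": "scrap"},
    {"part_id": "A4", "temperature": 181.2, "result": "unknown"},  # bad category
]

VALID_RESULTS = {"ok", "rework", "scrap"}

def audit(records):
    """Count records failing basic completeness and category checks."""
    incomplete = sum(1 for r in records if any(v is None for v in r.values()))
    bad_category = sum(1 for r in records if r["result"] not in VALID_RESULTS)
    return {"incomplete": incomplete, "bad_category": bad_category}

print(audit(RECORDS))  # → {'incomplete': 1, 'bad_category': 1}
```

Records flagged here would be corrected or excluded before any training run.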
4. Automate analyses as far as possible
At this point, at the latest, the question arises of how much training data a machine learning algorithm actually needs to adequately map a specific use case. Experience shows that in quality management you should assume at least 500 records per use case; up to 2,000 records fed through the algorithm is even better. This increases the certainty that the algorithm will recognize the patterns in question without requiring further training later on.
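One way to sanity-check figures like 500 or 2,000 for your own use case is a learning curve: train on growing slices of the historical data and watch whether accuracy on a held-out set still improves. The sketch below runs this on synthetic data with a deliberately trivial nearest-mean classifier; both the data and the classifier are illustrative assumptions, not a recommendation:

```python
import random

# Illustrative learning-curve check on synthetic data.
random.seed(0)

def make_sample():
    """One synthetic record: a sensor value plus an ok/scrap label."""
    label = random.choice(["ok", "scrap"])
    value = random.gauss(180.0 if label == "ok" else 176.0, 1.5)
    return value, label

def train(rows):
    """Per-class mean of the sensor value (a toy stand-in classifier)."""
    return {
        label: sum(v for v, l in rows if l == label)
               / sum(1 for _, l in rows if l == label)
        for label in ("ok", "scrap")
    }

def accuracy(means, rows):
    correct = sum(
        1 for v, l in rows
        if min(means, key=lambda lab: abs(v - means[lab])) == l
    )
    return correct / len(rows)

data = [make_sample() for _ in range(2500)]
test_set = data[2000:]
for n in (100, 500, 2000):  # growing training slices
    print(n, round(accuracy(train(data[:n]), test_set), 3))
```

If the curve has flattened before you reach your largest slice, more records of the same kind will buy little; if it is still climbing, keep collecting.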
Consider, as an example, a photo of an engine compartment that is to be checked to see whether certain screws have been mounted according to specification. A comparatively simple application, no question, but one with its own challenges: after all, we want the algorithm to perform the check across all models and independently of the current image quality. Training it with 500 images already brings pattern recognition to a level that yields a positive ROI. The algorithm will then probably only raise its hand in just under a quarter of cases and prompt the user to re-check the current case himself.
Such setups, in which a human reviews the uncertain cases, are commonly described as human-in-the-loop machine learning. As long as the initial training effort remains within reasonable limits, however, it is always worth aiming for fully automated analysis right away. In the engine-compartment example above, three to four times the number of images will probably suffice for the algorithm to analyze correctly even when something changes in the makeup of the supplied images, such as the capture angle.
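The "raising its hand" behaviour described above typically comes down to a confidence threshold: the model decides on its own only when its confidence clears the bar, otherwise the case goes to a human reviewer. A minimal sketch; the threshold value, case IDs, and confidence scores are all illustrative assumptions:

```python
# Illustrative sketch of confidence-based routing to a human reviewer.
REVIEW_THRESHOLD = 0.90  # assumed cut-off, tuned per use case in practice

def route(case_id: str, confidence: float) -> str:
    """Accept the model's verdict or escalate the case to a human."""
    if confidence >= REVIEW_THRESHOLD:
        return f"{case_id}: accept model verdict"
    return f"{case_id}: flag for human review"

print(route("engine-0041", 0.97))  # → engine-0041: accept model verdict
print(route("engine-0042", 0.74))  # → engine-0042: flag for human review
```

Raising the threshold trades automation rate for safety; the quarter of cases mentioned above corresponds to the share of inspections falling below the bar.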
5. Test different algorithms
In QM, algorithms are usually expected to output either numbers or categories. Depending on the application, different algorithms deliver the best results, and the market already offers a number of interesting alternatives, first and foremost in the open-source environment. It therefore makes sense to have an open platform on which the algorithms of various machine learning providers can be tested; Edge.One offers you such a platform. Once you have tailored your analysis model and cleaned the data to be processed, you can try out the different market offerings at your leisure. Machine learning solutions always perform on a case-by-case basis. Therefore, make sure that your tool environment gives you maximum freedom of choice, so that you can field exactly the algorithm that best solves your current analysis problem.
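The try-several-candidates idea boils down to evaluating interchangeable models behind one common interface and keeping the best scorer. In the sketch below, two deliberately simple toy classifiers stand in for real ML offerings; everything here is an illustrative assumption and not the Edge.One API:

```python
import random

# Illustrative model bake-off on synthetic sensor data.
random.seed(1)
samples = [(random.gauss(180 if ok else 176, 1.5), ok)
           for ok in [random.random() < 0.5 for _ in range(1000)]]
train_set, test_set = samples[:800], samples[800:]

def fit_nearest_mean(rows):
    """Classify by distance to each class's mean sensor value."""
    ok_vals = [v for v, ok in rows if ok]
    bad_vals = [v for v, ok in rows if not ok]
    m_ok = sum(ok_vals) / len(ok_vals)
    m_bad = sum(bad_vals) / len(bad_vals)
    return lambda v: abs(v - m_ok) < abs(v - m_bad)

def fit_majority(rows):
    """Baseline: always predict the majority class."""
    majority = sum(ok for _, ok in rows) > len(rows) / 2
    return lambda v: majority

def score(model, rows):
    """Fraction of correct predictions on held-out rows."""
    return sum(model(v) == ok for v, ok in rows) / len(rows)

candidates = {"nearest_mean": fit_nearest_mean, "majority": fit_majority}
results = {name: score(fit(train_set), test_set)
           for name, fit in candidates.items()}
print(results, "-> best:", max(results, key=results.get))
```

The point of the pattern is the dictionary of candidates: swapping in a further provider's algorithm only means adding one more fit function, while evaluation and selection stay unchanged.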