Resource-aware Data Mining or M2M Mining
Katharina Morik
TU Dortmund University, Faculty for Computer Science, LS 8
Data mining has developed from the analysis of records in a (relational) database into diverse directions. Data of diverse forms are analyzed, e.g., time series, graph or network data, linguistic data, streams of measurements, audio or video data. The results have also become more complex than the visualization of the data, interesting patterns or rules, and decision functions with binary or real values. The structural support vector machine and graphical models in general are examples of this trend.
Moreover, deployment models now show a large variety. Here, we want to focus on analysis results that control a technical process, i.e., the learning result is to be used by some physical system. This is not entirely new, since early knowledge discovery results such as customer segmentations were already used for printing addresses for personalized mailing campaigns, and a printer is a physical system. However, machine-to-machine mining generalizes this deployment.
Machine-to-machine (M2M) refers to technologies that allow both wireless and wired systems to communicate with other devices of the same ability. M2M uses a device (such as a sensor or meter) to capture an event (such as temperature, inventory level, etc.), which is relayed through a network (wireless, wired or hybrid) to an application (software program), that translates the captured event into meaningful information (for example, items need to be restocked). (Wikipedia)
In analogy to M2M communication, we call M2M mining the process of analyzing possibly distributed data (from sensors in a general sense) in order to enhance the performance of a physical system.
The data mining approaches to resource-awareness are twofold. On the one hand, data mining methods extract information that helps to decrease the resource consumption of technical devices. For instance, behavior patterns can be used to adapt mobile phones to a user such that less energy is consumed and battery life is extended. On the other hand, the data mining algorithms themselves are to demand less memory, less energy, and less runtime. Streaming algorithms are an example of the latter. In this talk, some approaches to M2M mining are presented that have successfully used RapidMiner. Some examples are:
• Measurements of Cherenkov light deliver masses of data. They are analyzed within the IceCube collaboration in order to filter out neutrinos. Within the MAGIC and FACT projects, two telescopes deliver masses of data, and gamma rays are to be separated from other particles of the particle shower. In the long run, a learned model is to control the telescopes so that they measure the same particle shower in stereo.
• The Nokia data challenge offered data that could be used to predict the next cell of a mobile phone user.
• The process measurements in a steel production step are used to learn a model, which is then applied to predict the point in time when the process can be ended without loss of quality. This saves resources.
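As a small illustration of the streaming algorithms mentioned above, the following sketch processes each measurement once and keeps only constant-size state. It assumes Welford's online update for mean and variance, which is an illustrative choice and not a method named in the talk; the class name and the sensor readings are hypothetical.

```python
class RunningStats:
    """Constant-memory mean and variance over a stream (Welford's update)."""

    def __init__(self):
        self.n = 0        # number of observations seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def update(self, x):
        # One pass per value; the value itself is never stored.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        # Sample variance; undefined for fewer than two observations.
        return self.m2 / (self.n - 1) if self.n > 1 else float("nan")


stats = RunningStats()
for reading in [20.0, 22.0, 24.0]:  # hypothetical sensor readings
    stats.update(reading)
```

Memory use stays the same regardless of how many readings arrive, which is the property that makes such methods attractive on resource-constrained devices.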
Part of the work presented in this talk has been supported by Deutsche Forschungsgemeinschaft (DFG) within the Collaborative Research Center SFB 876 "Providing Information by Resource-Constrained Analysis", projects A1 and C3.
Hadoop and Beyond: potential and limitation of distributed systems for data mining
Andras Benczur
Institute for Computer Science and Control of the Hungarian Academy of Sciences (MTA SZTAKI)
Based on two selected data mining tasks from our practice, Andras Benczur illustrates the potential and the limitations of Hadoop-based systems for solving complex tasks. The first example is Web document classification. Here Hadoop is very efficient for parsing, word counting, and host-level feature aggregation, but it is limited at generating features based on the linkage of the pages and has no support for complex pipelines with partial result reuse. As another example, for de-duplication (also known as entity resolution) Hadoop is theoretically inefficient but in fact beats some existing alternatives such as Bulk Synchronous Parallel. Andras Benczur also describes the design principles of two emerging alternative, non-Hadoop-based frameworks of his personal choice, Stratosphere and GraphLab, for solving similar tasks.
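The word counting task mentioned above can be sketched in the map, shuffle, and reduce style that Hadoop executes at scale. This is plain single-process Python, not the Hadoop API; the function names and the two toy documents are assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Emit a (word, 1) pair for every token of one document."""
    for word in text.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Sum all partial counts for one word."""
    return word, sum(counts)


# Hypothetical toy corpus standing in for parsed Web documents.
docs = {"d1": "data mining on hadoop", "d2": "hadoop data"}

pairs = [kv for doc_id, text in docs.items() for kv in map_phase(doc_id, text)]
counts = dict(reduce_phase(word, values) for word, values in shuffle(pairs).items())
```

Each document is handled independently in the map step, which is the property that makes parsing and counting fit this model well.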
Andras Benczur received his Ph.D. in applied mathematics from the Massachusetts Institute of Technology in 1997. Since then he has been a researcher at the Institute for Computer Science and Control of the Hungarian Academy of Sciences (MTA SZTAKI), where he has headed the Informatics Laboratory of 30 researchers since 2008. The lab participates in international research and national industry projects in information retrieval and business intelligence. Among other distinctions, his research on Web information retrieval was honored by a Yahoo! Faculty Research Grant, he led the winning team of KDD Cup 2007, and he organized the ECML/PKDD 2010 Discovery Challenge on Web Quality.