class: center, middle, inverse, title-slide

# Anomaly Detection in Streaming Time Series Data

## Priyanga Dilini Talagala

### StatScale Seminar, Lancaster University
04.12.2020

---

class: center, middle, inverse

# Anomaly Detection in Streaming Time Series Data

## Priyanga Dilini Talagala

04-12-2020
priyangad@uom.lk
pridiltal
prital.netlify.app <br/> (Slides and papers available)
<br/><br/>The slides are powered by `xaringan` R package

---

class: center, middle

# Hello from the team!

<img src="fig/team.png" width="100%" style="display: block; margin: auto;" />

---

.pull-left[

<img src="fig/JCGS_logo.png" width="40%" style="display: block; margin: auto;" />

Priyanga Dilini Talagala, Rob J Hyndman, Kate Smith-Miles (2020) **Anomaly detection in high-dimensional data**. Journal of Computational & Graphical Statistics, *to appear*

<br/>

<div class="figure" style="text-align: center">
<img src="fig/stray-logo.png" alt="on CRAN" width="45%" />
<p class="caption">on CRAN</p>
</div>

]

.pull-right[

<img src="fig/JCGS_logo.png" width="40%" style="display: block; margin: auto;" />

Priyanga Dilini Talagala, Rob J Hyndman, Kate Smith-Miles, Sevvandi Kandanaarachchi and Mario A Munoz (2020) **Anomaly detection in streaming nonstationary temporal data**. Journal of Computational & Graphical Statistics, 29(1), 13-27.

<div class="figure" style="text-align: center">
<img src="fig/oddstream1.png" alt="on CRAN" width="45%" />
<p class="caption">on CRAN</p>
</div>

]

---

## Anomaly detection

[CRAN Task View: Anomaly Detection with R](https://github.com/pridiltal/ctv-AnomalyDetection)

--

### Anomaly detection in temporal data

<img src="fig/outtype.png" width="100%" style="display: block; margin: auto;" />

---

background-image:url('fig/outtype2.png')
background-position: 70% 70%
background-size: 100%
class: left, top, clear

## Anomalous series in temporal data

---

background-image:url('fig/outtype3.png')
background-position: 70% 70%
background-size: 100%
class: left, top, clear

## Anomalous series in temporal data

---

background-image:url('fig/2_application.png')
background-position: 70% 70%
background-size: 100%
class: left, top, clear

### Anomalous series within a space of a collection of series

---

class: center, middle

<p><font size=12> <span style="color:blue"> stray (S</span>earch and <span 
style="color:blue">TR</span>ace <span style="color:blue">A</span>nomal<span style="color:blue">Y</span>) </font></p>

<div class="figure" style="text-align: center">
<img src="fig/stray-logo.png" alt="on CRAN" width="30%" />
<p class="caption">on CRAN</p>
</div>

`devtools::install_github("pridiltal/stray")`

---

## Anomaly detection in high-dimensional data

### Main contributions

- Propose a framework to detect anomalies in high-dimensional data. Our proposed algorithm addresses the limitations of the HDoutliers algorithm (Wilkinson, 2018).

--

### What is an anomaly?

- We define an anomaly as an observation that deviates markedly from the majority with a large distance gap.

--

### Main assumptions

- The distance between typical data and the anomalies is large compared with the distances among typical data.

---

## stray

<img src="fig/stray_plot1.png" width="50%" style="display: block; margin: auto;" />

- Normalize the columns of the data (using the median and IQR).
- This prevents variables with large variances from having a disproportionate influence on Euclidean distances.

---

## Why not "nearest neighbour" distances?

<img src="fig/stray_plot2.png" width="50%" style="display: block; margin: auto;" />

- Calculate the nearest neighbour distance

---

## stray

<img src="fig/stray_plot5.png" width="50%" style="display: block; margin: auto;" />

- Select the <span style="color:red"> k nearest neighbour </span> distance with the <span style="color:red"> maximum gap </span>

---

## Calculate anomalous threshold

- Use extreme value theory (EVT) to calculate a data-driven outlier threshold.
--

- Let **n** be the size of the dataset

--

- Sort the resulting **n** outlier scores

--

- Consider the half of the outlier scores with the smallest values as typical

--

- Search for any significantly large gap in the upper tail (bottom-up searching algorithm proposed by Schwarz, 2008)

---

## Spacing theorem (Weissman, 1978)

Let `\(X_{1}, X_{2}, ..., X_{n}\)` be a sample from a distribution function `\(F\)`. <br/>
Let `\(X_{1:n} \geq X_{2:n} \geq ... \geq X_{n:n}\)` be the order statistics. <br/>
The available data are `\(X_{1:n}, X_{2:n}, ..., X_{k:n}\)` for some fixed `\(k\)`. <br/>
Let `\(D_{i,n} = X_{i:n} - X_{i+1:n},\)` `\((i = 1,2,..., k)\)` be the spacing between successive order statistics.<br/>
If `\(F\)` is in the maximum domain of attraction of the Gumbel distribution, then the spacings `\(D_{i,n}\)` are asymptotically independent and exponentially distributed with mean proportional to `\(i^{-1}\)`.

<img src="fig/P2_plot17.png" width="55%" style="display: block; margin: auto;" />

---

## stray

<img src="fig/stray_plot6.png" width="50%" style="display: block; margin: auto;" />

`outliers <- find_HDoutliers(data)` <br/>
`display_HDoutliers(data, outliers)`

---

## Advantages of the proposed algorithm

- Detects clusters of outlying points

--

- Applies to both uni- and multidimensional data

--

- Handles large datasets through the use of an approximate KNN searching algorithm

--

- Does not require a training set to build the decision model

--

- Deals with multimodal typical classes

--

- Outlier threshold has a probabilistic interpretation

---

background-image:url('fig/2_application.png')
background-position: 70% 70%
background-size: 100%
class: right, top, clear

### Anomalous series within a space of a collection of series

---

## Feature based representation of time series

.pull-left[

- Mean
- Variance
- Changing variance in remainder
- Level shift using rolling window
- Variance change
- Strength of linearity
- Strength of curvature

]

.pull-right[

- Strength of 
spikiness
- Burstiness of time series (Fano Factor)
- Minimum
- Maximum
- The ratio between 50% trimmed mean and the arithmetic mean
- Moment
- Ratio of means of data that is below and above the global mean

]

---

## Approach 1: Using stray

- Use a moving window to deal with streaming data
- Extract time series features from the window
- Apply the stray algorithm to identify anomalous series

.pull-left[

<img src="fig/P2_plot22.png" width="100%" style="display: block; margin: auto;" />

]

.pull-right[

<img src="fig/stray.gif" width="60%" style="display: block; margin: auto;" />

]

`tsfeatures <- oddstream::extract_tsfeatures(ts_data)` <br/>
`outliers <- stray::find_HDoutliers(tsfeatures)` <br/>
`stray::display_HDoutliers(tsfeatures, outliers)`

---

class: center, clear

.pull-left[

<img src="fig/P2_plot21a.png" width="100%" style="display: block; margin: auto;" />

]

--

.pull-right[

<img src="fig/P2_plot21b.png" width="100%" style="display: block; margin: auto;" />

]

---

class: center, clear

<p><font size=12> <span style="color:blue">oddstream <br/> (O</span>utlier <span style="color:blue">D</span>etection in <span style="color:blue">D</span>ata <span style="color:blue">STREAM</span>s) </font></p>

<img src="fig/oddstream_logo.png" width="30%" style="display: block; margin: auto;" />

`devtools::install_github("pridiltal/oddstream")`

---

## Dimension reduction for time series

.pull-left[

`load(train_data)`

<img src="fig/4_typical.png" width="80%" style="display: block; margin: auto;" />

]

--

.pull-right[

`tsfeatures <- oddstream::extract_tsfeatures` <br/> `(train_data)`

<img src="fig/5_high_typical.gif" width="50%" style="display: block; margin: auto;" />

]

--

<br/>

`pc <- oddstream::get_pc_space(tsfeatures)`<br/>
`oddstream::plotpc(pc$pcnorm)`

<img src="fig/6_typicalfeature.png" width="35%" style="display: block; margin: auto;" />

---

## Anomalous threshold calculation

- Estimate the probability density function of the 2D PC space `\(\longrightarrow\)` Kernel density 
estimation

--

- Draw a large number N of extremes `\((\arg\min_{x\in X}[f_{2}(x)])\)` from the estimated probability density function

--

- Define a `\(\Psi\)`-transform space, using the `\(\Psi\)`-transformation defined by Clifton et al. (2011)

<img src="fig/10_psitrans.png" width="50%" style="display: block; margin: auto;" />

- The `\(\Psi\)`-transform maps the density values into a space in which a Gumbel distribution can be fitted.

--

- Anomalous threshold calculation `\(\longrightarrow\)` extreme value theory

---

class: center, top, clear

`oddstream::find_odd_streams(train_data, test_stream)`

<img src="fig/18_oddstream_mvtsplot.gif" width="50%" style="display: block; margin: auto;" />

.pull-left[

<img src="fig/16_oddstream_out_loc.gif" width="90%" style="display: block; margin: auto;" />

]

.pull-right[

<img src="fig/17_oddstream_pcplot.gif" width="90%" style="display: block; margin: auto;" />

]

---

class: top

### Feature Based Representation of Time series

.pull-left[

<img src="fig/3_batch.png" width="100%" style="display: block; margin: auto;" />

]

.pull-right[

<img src="fig/tsfeatures.png" width="100%" style="display: block; margin: auto;" />

]

---

class: center, middle, inverse

# Anomaly Detection with <br/> <span style="color:#cc5900"> Non-stationarity </span>

---

#### Anomaly detection with non-stationarity

<img src="fig/19_nonstationaritytypes.png" width="70%" style="display: block; margin: auto;" />

---

### Anomaly detection with non-stationarity

<img src="fig/20_suddenplot2.png" width="100%" style="display: block; margin: auto;" />

<img src="fig/21_noCD1.png" width="35%" style="display: block; margin: auto;" />

---

### Anomaly detection with non-stationarity

<img src="fig/20_suddenplot3.png" width="100%" style="display: block; margin: auto;" />

<img src="fig/21_noCD2.png" width="35%" style="display: block; margin: auto;" />

---

### Anomaly detection with non-stationarity

<img src="fig/20_suddenplot4.png" width="100%" style="display: block; margin: 
auto;" />

<img src="fig/21_noCD3.png" width="35%" style="display: block; margin: auto;" />

---

### Anomaly detection with non-stationarity

<img src="fig/20_suddenplot2.png" width="100%" style="display: block; margin: auto;" />

<img src="fig/22_conceptdrift_pval.png" width="100%" style="display: block; margin: auto;" />

- `\(H_{0} : f_{t_{0}} = f_{t_{t}}\)`
- Squared discrepancy measure `\(T = \int[f_{t_{0}}(x) - f_{t_{t}}(x)]^{2}dx\)` (Anderson et al., 1994)

---

### Anomaly detection with non-stationarity

<img src="fig/23_sudden_out.png" width="100%" style="display: block; margin: auto;" />

---

class: clear, middle, center

.pull-left[

### stray

<img src="fig/P2_plot21a.png" width="75%" style="display: block; margin: auto;" />

- Definition: distance
- No training set needed

]

.pull-right[

### oddstream

<img src="fig/P2_plot21b.png" width="75%" style="display: block; margin: auto;" />

- Definition: density
- Needs a training set

]

---

### What Next?

- Explore feature extraction and feature selection methods further to create a feature space better suited to the streaming data context.

--

- Use other dimension reduction techniques, such as multidimensional scaling or random projection, to see their effect on the performance of the proposed framework.

--

- Run more experiments on density estimation methods to obtain better tail estimates.

--

- Extend the algorithm to work with multidimensional multivariate data streams.

---

.pull-left[

<img src="fig/JCGS_logo.png" width="40%" style="display: block; margin: auto;" />

Priyanga Dilini Talagala, Rob J Hyndman, Kate Smith-Miles (2020) **Anomaly detection in high-dimensional data**. 
Journal of Computational & Graphical Statistics, *to appear*

<br/>

<div class="figure" style="text-align: center">
<img src="fig/stray-logo.png" alt="on CRAN" width="45%" />
<p class="caption">on CRAN</p>
</div>

]

.pull-right[

<img src="fig/JCGS_logo.png" width="40%" style="display: block; margin: auto;" />

Priyanga Dilini Talagala, Rob J Hyndman, Kate Smith-Miles, Sevvandi Kandanaarachchi and Mario A Munoz (2020) **Anomaly detection in streaming nonstationary temporal data**. Journal of Computational & Graphical Statistics, 29(1), 13-27.

<div class="figure" style="text-align: center">
<img src="fig/oddstream1.png" alt="on CRAN" width="45%" />
<p class="caption">on CRAN</p>
</div>

]

---

class: center, middle

# Thank You

.pull-left[

<img src="fig/oddstream1.png" width="45%" style="display: block; margin: auto;" />

]

.pull-right[

<img src="fig/stray-logo.png" width="45%" style="display: block; margin: auto;" />

]
priyangad@uom.lk
pridiltal
prital.netlify.app <br/> (Slides and papers available)
<br/><br/>The slides are powered by `xaringan` R package