MACROS GTIVE
Generic Tool for Important Variable Extraction
© 2007–2013 DATADVANCE, llc

Contact information
  Phone: +7 (495) 781 60 88
  Web: www.datadvance.net
  Email: [email protected]
Technical support, questions, bug reports
  [email protected]
Everything else
  Mail: DATADVANCE, llc, Pokrovsky blvd. 3, building 1B, 4th floor, 109028 Moscow, Russia

User manual prepared by Pavel Erofeev, Pavel Prikhodko, Evgeny Burnaev

Contents

List of Figures
List of Tables

1 Introduction
  1.1 What is GTIVE
  1.2 Documentation structure

2 Overview
  2.1 Problem statement
  2.2 Quality metrics
  2.3 Input Definition Domain Importance
  2.4 State of the art methods
    2.4.1 Sample based techniques
    2.4.2 Black box based techniques
  2.5 Scores variance estimation
  2.6 Remark on other sensitivity analysis methods
  2.7 Remark on the selection of techniques for GTIVE

3 Internal workflow
  3.1 General workflow
  3.2 Preprocessing
  3.3 Results
    3.3.1 Feature scores
    3.3.2 Standard deviation

4 User configurable options
  4.1 RidgeFS
  4.2 Mutual Information (Kraskov estimate)
  4.3 Mutual Information (Histogram based estimate)
  4.4 SMBFAST (Surrogate Model-Based FAST)
  4.5 Elementary Effects
  4.6 Extended FAST (Fourier Amplitude Sensitivity Testing)
5 Limitations

6 Selection of technique
  6.1 Selection of the technique by the user
  6.2 Default automatic selection

7 Usage Examples
  7.1 Artificial Examples
    7.1.1 Example 1: simple function, no cross-feature interaction
    7.1.2 Example 2: usage of confidence intervals to determine redundant variables
    7.1.3 Example 3: difference between 'main' and 'total' scores in FAST
  7.2 Real world data examples
    7.2.1 T-AXI problem
    7.2.2 Stringer (Super-Stiffener) Stress Analysis problem
    7.2.3 Fuel System Analysis problem

References

Index

Index: Options

List of Figures

2.1 The Newton's law of universal gravitation
6.1 The internal decision tree
7.1 T-AXI. Feature scores estimated by the GTIVE
7.2 T-AXI. Index of Variance

List of Tables

2.1 Illustration. Scores for the Newton's law of universal gravitation problem
2.2 Pearson's and Spearman's correlation coefficients and GTIVE techniques

5.1 Technique summary
5.2 Minimum sample size (blackbox budget) for GTIVE techniques

7.1 Example 1. RidgeFS scores
7.2 Example 1. Elementary Effects scores
7.3 Example 1. Mutual Information (Kraskov estimate) scores
7.4 Example 1. Mutual Information (histogram estimate) scores
7.5 Example 1. FAST scores
7.6 Example 2. GTIVE scores and the standard deviation of scores
7.7 Example 2. FAST (total) scores
7.8 Example 2. FAST (main) scores
7.9 Stage data for 10 stage design (stage.e3c-des)
7.10 Initial data for 10 stage design (init.e3c-des)
7.11 IGV data for 10 stage design (igv.e3c-des)
7.12 T-AXI. Features that influence Compressor Pressure Ratio the most (a)
7.13 T-AXI. Features that influence Compressor Pressure Ratio the most (b)
7.14 Stringer stress analysis. Feature scores estimated by GTIVE
7.15 Stringer stress analysis. Approximation error ratio
7.16 Fuel System Analysis. Feature scores and Approximation error ratio

Chapter 1

Introduction

1.1 What is GTIVE

Generic Tool for Important Variable Extraction (GTIVE) is a software package for performing global sensitivity analysis on user-provided data.
In [13], sensitivity analysis is defined as the study of how the variation (uncertainty) in the output of a statistical model can be attributed to different variations in the inputs of the model. In other words, it is a technique for systematically changing variables (features) in a model to determine the effects of such changes.

1.2 Documentation structure

Documentation for GTIVE includes:

• User manual (this document), which contains:
  – A general overview of the tool's functionality;
  – Short descriptions of the algorithms;
  – Recommendations on the tool's usage;
  – Examples of applications to model problems.

• Technical reference [3] for the C++ and Python API, which includes:
  – Description of system requirements;
  – Installation steps;
  – Quick start guide;
  – C++ and Python API reference.

The present document has the following structure:

• Chapter 2 is an introduction to the tool's functionality. It contains an overview of relevant sensitivity analysis concepts and explains the way the tool is applied and what results it produces.
• Chapter 3 describes the internal workflow of the tool.
• Chapter 4 describes specific sensitivity analysis techniques implemented in the tool.
• Chapter 5 describes limitations on the sample size for different techniques.
• Chapter 6 describes how the sensitivity analysis technique is selected automatically in a particular problem.
• Chapter 7 gives examples of GTIVE use on model and real world problems.

Chapter 2

Overview

The main goal of GTIVE is to estimate feature scores for a user-provided dependency (also known as a function or model), which can be represented either as a data sample (also known as training data) or as an interface to some black box (a device, system or object that provides output for a given input). In other words, the tool solves the problem of global sensitivity analysis.

As an illustration, consider the following simple example based on Newton's law of universal gravitation. Say we know that every point mass attracts every other point mass, but do not know which features determine the force of attraction.
And say that, for some reason, we think the following features may affect the force of attraction:

• m1, m2 - the masses of the bodies
• r - the distance between the bodies
• T - the environment temperature
• p - the atmospheric pressure
• L1, L2 - the luminosities of the bodies

Figure 2.1: The Newton's law of universal gravitation

Suppose we also performed 30 experiments and measured all the considered features and the corresponding force of attraction. Applying GTIVE to this task gives the feature scores shown in Table 2.1.

  Feature   m1     m2     r      T     p     L1    L2
  Score     0.19   0.20   0.61   0.0   0.0   0.0   0.0

Table 2.1: Illustration. Scores for the Newton's law of universal gravitation problem

In general the tool helps to answer the following questions:

1. Which features have no influence on the dependency and thus can be dropped in further study?
2. If we want to reduce the number of features considered in the problem, which features should we drop?
3. Which features are the most influential, so that they should be measured with the highest accuracy or have the highest variability in the Design of Experiments?

GTIVE calculates sensitivity indices (feature scores) for each input variable (feature). These are numbers that show the relative importance of each feature in some sense. Looking at the scores, one can say whether one feature is more important than the others and estimate to what extent. This information may be useful in the following tasks:

• In Surrogate Model (SM) construction it may be beneficial to remove the least important features, because fewer features mean a denser sample, and a denser sample may provide a more accurate approximation. Also, many SM construction techniques work better in smaller dimensions in terms of time/memory requirements.
• In Design of Experiments, knowing which features influence the dependency the most, one can plan the sample generation so that the most important features have the highest variability. Also, if the data is obtained from physical measurements, knowing the feature scores may tell which input variables should be measured with the highest accuracy.

• In Optimization, when the number of allowed function calls (budget) is limited, knowing which features are less important allows one to keep them fixed in the optimization process. By reducing the number of variables, i.e. not considering features that have little effect on the dependency, one can do more optimization iterations with the same budget, possibly obtaining a better solution.

Examples of GTIVE applications to the tasks mentioned above are presented in Chapter 7. In this chapter the sensitivity analysis problem statement is given and a short review of the state of the art methods used in the tool is provided.

2.1 Problem statement

The problem of global sensitivity analysis is to estimate how variations in the output of the model can be attributed to variations in the model inputs over the whole design space. Let Y = f(X), X ∈ R^p, Y ∈ R^q be the considered dependency. f(X) may be a physical experiment or a solver code. Without loss of generality, only the case q = 1 is considered below. If q > 1 (the model has many outputs), each output is treated independently.

The GTIVE procedure calculates a score w_i for each feature x_i from the feature set X = (x_1, ..., x_p), also known as the input vector, such that a higher score reveals more sensitivity (higher variation) of the output Y with respect to variations of the corresponding input. The scores are positive numbers, generally between 0 and 1; a higher score indicates that the variable is "more important". There are several different techniques implemented in the tool; the precise meaning of the score is technique-dependent.
A sensitivity analysis technique should ideally have the following properties:

• If one variable is more important than another (in the technique-defined sense), its score should be higher.
• Feature scores should be proportional to the corresponding variables' influence, so that by comparing scores one gets an idea of the relative importance of the variables.

These properties allow one to rank features in the order of importance and give an idea of approximately to what extent one feature is more important than others.

2.2 Quality metrics

To compare the performance of techniques, the following measures can be introduced. These are intuitive, straightforward ways to check variable importance; however, a huge amount of data or time is required to evaluate them, so they are not well suited for practical use and are mostly useful as a reference in the benchmarking of different sensitivity analysis methods.

• Index of variability may be used to compare the importance of features or even feature subsets, provided we can calculate the dependency value at a given point. Let the features in the vector X be split into two subsets X = (Z(X), U(X)), where the subvector Z(X) contains all important features (features with high scores) and U(X) contains all unimportant features (features with low scores). Define X̂(X) = (Z(X), U_0) as the vector in which all unimportant features are fixed to some average values. Then the Index of Variability can be computed as follows:

  I(Z) = sqrt( <(f(X) - f(X̂(X)))^2> ) / ( max(f(X)) - min(f(X)) ) · 100%,   (2.1)

where <..>, max, min are the mean, maximum and minimum over some test sample. The higher the Index of Variability, the less important are the features chosen in Z and the more important are those fixed in U.

• Approximation error ratio. Another way to estimate the i-th feature's importance is to build an approximation (surrogate model) f_SMi(Z_i(X)), where Z_i(X) = (x_1, ..., x_{i-1}, x_{i+1}, ..., x_p), i.e.
the input formed from X using all features except the i-th, and compare its accuracy with the approximation f_SM(X) built using all features. The error measure is then defined as:

  Err(i) = sqrt( <(f(X) - f_SMi(Z_i(X)))^2> ) / sqrt( <(f(X) - f_SM(X))^2> ),   (2.2)

where <..> is the sample mean. A higher approximation error ratio means that the i-th feature is more important.

2.3 Input Definition Domain Importance

It is important to note that the scores returned by GTIVE depend on the variation intervals of the factors. If a factor is restricted to a very narrow interval, its score might be low even if the factor is important. At the same time, the scores returned by GTIVE are invariant under changes of units of measurement for individual factors (as long as the changes are linear): the effects of the rescaled intervals are compensated by the corresponding changes in the response function.

For example, consider the function f(x1, x2) = x1 + x2, with x1 ∈ [−1, 1] and x2 ∈ [−1, 1]. It is natural to expect x1 and x2 to have equal scores in these conditions. Now let us expand the range of x1 to [−2, 2], while keeping f(x1, x2) the same. Although at each point the local importance of x1 and x2 remains similar, on the global scale x1 now contributes 4 times more variation to the output, thus raising its feature score. This is equivalent to the case when we leave x1 at [−1, 1] and change the function to f(x1, x2) = 2·x1 + x2.

In contrast, consider changing the measurement units of a feature. For example, x1 and x2 were defined in kilograms and we change the measurement units of x1 to grams. In this case, though the new values of the rescaled x1 become 1000 times larger, its feature score remains the same.

2.4 State of the art methods

There are many approaches to the problem of global sensitivity analysis [5, 13, 12, 7, 14, 8].
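The interval and unit effects described in Section 2.3 can be reproduced numerically. The sketch below uses a crude variance-share proxy for a score; this proxy is only valid for the additive function considered there and is not GTIVE's actual estimator:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def variance_shares(x1, x2, f):
    # For an additive f, each input's share of output variance is found by
    # varying one input while freezing the other at its mean.
    v1 = np.var(f(x1, np.full_like(x2, x2.mean())))
    v2 = np.var(f(np.full_like(x1, x1.mean()), x2))
    return v1 / (v1 + v2), v2 / (v1 + v2)

f = lambda x1, x2: x1 + x2
x1 = rng.uniform(-1, 1, n)
x2 = rng.uniform(-1, 1, n)

s_equal = variance_shares(x1, x2, f)            # ~ (0.5, 0.5)

# Widening x1 to [-2, 2] quadruples its variance contribution (variance
# scales with interval width squared), so its share rises to ~0.8.
s_wide = variance_shares(rng.uniform(-2, 2, n), x2, f)

# Re-expressing x1 in grams (values 1000x larger) while the model divides
# it back changes nothing: linear unit changes are compensated.
g = lambda x1, x2: x1 / 1000.0 + x2
s_units = variance_shares(1000.0 * x1, x2, g)   # ~ (0.5, 0.5)

print(s_equal, s_wide, s_units)
```

The widened-interval case reproduces the "4 times more variation" effect, while the unit change leaves the shares untouched.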
Which technique is appropriate for a given task depends on the problem conditions and user requirements. We designed GTIVE to include the most effective state of the art methods, covering different problem settings. This section provides a brief overview of the techniques used in GTIVE.

Sensitivity analysis techniques may be grouped into two big groups:

• Methods that can work with any sample.
• Methods that require a sample of a particular structure.

Generally, the methods of the second group are more precise, but due to the sample form requirements one usually needs an interface to the considered function to be able to generate the required specific sample. Techniques for both situations are implemented in GTIVE, and we refer to them as sample based and black box based correspondingly.

2.4.1 Sample based techniques

These techniques require a data sample (X, Y), where X = {X^i, i = 1, ..., K}, Y = {Y^i, i = 1, ..., K}, the components of the input vector are X^i = (x^i_1, ..., x^i_p), Y^i = f(X^i), and K is the total number of samples. The following sample based techniques are implemented in GTIVE:

• RidgeFS. In case the sample is small, so there is no benefit in using complex approaches, feature scores may be estimated with a linear model. It is assumed that Y = Xb + ε, where b = (b_1, ..., b_p) are some coefficients and ε = {ε_i, i = 1, ..., K} is zero mean white noise. The coefficients b are estimated as

  b̂ = (X^T X + λI)^{-1} X^T Y,

where I ∈ R^{p×p} is the identity matrix and λ is tuned using the LOO CV approach, see [5]. The feature score for the i-th variable is then estimated as

  w_i = b̂_i^2 · var(x_i) / var(Y),   i = 1, ..., p,   (2.3)

where var(x_i) is the variance of the i-th feature, estimated using the sample.
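The scoring rule (2.3) can be sketched in a few lines of numpy. Note this is an illustration only: it uses a fixed regularization coefficient λ instead of the LOO-tuned one that GTIVE uses, and skips the preprocessing steps of Section 3.2:

```python
import numpy as np

def ridge_scores(X, Y, lam=1e-3):
    # b = (X'X + lam*I)^(-1) X'Y, then w_i = b_i^2 * var(x_i) / var(Y)  (cf. eq. 2.3)
    p = X.shape[1]
    b = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)
    return b**2 * X.var(axis=0) / Y.var()

# Toy linear dependency: y = 3*x1 + x2 + noise, x3 irrelevant.
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(500, 3))
Y = 3.0 * X[:, 0] + X[:, 1] + 0.01 * rng.standard_normal(500)

w = ridge_scores(X, Y)
print(w)  # x1's score is ~9x that of x2; x3's score is near zero
```

Since each b̂_i² · var(x_i) is the portion of output variance explained by the i-th feature in a linear model, the scores of the toy example sum to roughly 1.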
Pros:
  - Works fast.
  - Can handle very large data sets.
  - Best possible choice if the true model is linear.
Cons:
  - Not suitable for strongly non-linear models.

• Mutual Information. A group of techniques that estimate the feature score by computing the Mutual Information between the considered feature and the output:

  I(x_i, Y) = ∫ p(x_i, Y) log [ p(x_i, Y) / (p(x_i) p(Y)) ] dx_i dY.   (2.4)

The idea is to measure how far the joint distribution p(x_i, Y) of the feature and the output is from the case of two independent random values, where p(x_i, Y) = p(x_i) p(Y). The greater the difference, the more relevant the feature is. The feature score for the i-th variable is estimated as:

  w_i = I(x_i, Y),   i = 1, ..., p.   (2.5)

GTIVE adopts two techniques to estimate Mutual Information (the Kraskov and histogram estimates). The Kraskov estimate gives more accurate results, but is computationally expensive and so cannot be used for large data samples. The histogram based estimate may be crude on small samples, but is very cheap in terms of memory and computation time, so it can be applied to very large data sets. In more detail:

  – Kraskov estimate is a Mutual Information estimation technique based on the nearest neighbor approach. The technique provides good accuracy for small and moderate sample sizes, but becomes very computationally expensive for large samples. Define a metric in the space Z = (X, Y) as ρ_z(Z, Z*) = max(ρ_x(X, X*), ρ_y(Y, Y*)), where ρ_x(X, X*) is the Euclidean norm in the X space and ρ_y(Y, Y*) is the Euclidean norm in the Y space. Let k be the algorithm parameter setting the number of nearest neighbors in the Z space, and let

    ε(j) = ρ_z(Z^j, k-th nearest neighbor of Z^j).   (2.6)

  We define n_x^j and n_y^j as the numbers of points in the X and Y spaces correspondingly whose distance to X^j and Y^j is smaller than ε(j).
  In [8] it is shown that

    I_k(x_i, Y) ≈ ψ(k) − <ψ(n_x + 1) + ψ(n_y + 1)> + ψ(K),   (2.7)

  where <...> is the sample mean, k is the number of nearest neighbors (an algorithm parameter), and ψ(z) is the Euler digamma function.

  – Histogram based estimate is a Mutual Information estimation technique using histogram based pdf estimation. The method may be less accurate than the previous one for small and moderate samples, but can handle very large data sets. In this approach the pdfs of x_i and Y and the joint pdf of (x_i, Y) are estimated using histograms. For example, the pdf of x_i is estimated as

    p̂_i(x) = (1 / (K·h)) Σ_{j=1}^{K} I( x_i^j ∈ (x − h/2, x + h/2) ),   (2.8)

  where h is the bin size and I(·) is an indicator function. In the GTIVE implementation a cross-validation approach is used to estimate the optimal histogram bin size h, see [5]. If the sample size is at least 20000 points, an accelerated optimization procedure for the bin size selection is used.

Pros:
  - Works fast.
  - Can handle small as well as large data sets: a sample of a few dozen points is sufficient to catch the most important features, and as the sample size increases the resolution grows.
  - Robust to noise and outliers.
Cons:
  - Cannot handle feature interdependencies.

• SMBFAST (Surrogate Model-Based FAST) is a complex approach combining the surrogate modeling paradigm and the idea of black box analysis with the extended FAST method (see 2.4.2). Currently all GTApprox techniques except the Mixture of Approximators and Geostatistical Gaussian Processes are available in SMBFAST for training the internal surrogate model, and the same features and restrictions apply (see the GTApprox manual [2] for details). Due to the model training overhead, SMBFAST may be time consuming, but it is the most accurate of all currently implemented sample-based techniques.
Pros:
  - The most accurate of all currently implemented sample-based techniques.
  - Incorporates the approximation capabilities of GTApprox.
Cons:
  - May take a long time (building a GTApprox model inside is required).

2.4.2 Black box based techniques

These techniques generate new sample points during their work, so they require a connection to some black box function Y = f(X). In the black box based methods the term budget (the number of function calls allowed for the method) is used instead of sample size. Note that in these methods one has to specify the region (some hypercube) where the points are generated. The following black box based techniques are implemented in GTIVE:

• Elementary Effects is a screening technique able to work with relatively small budgets. The idea of the Elementary Effects approach is to generate a uniform (in terms of space-filling properties) set of trajectories in the design space. On each step of a trajectory only one component x_i of the input vector X is changed, and the following quantity is estimated:

  d_i(X) = [ Y(x_1, ..., x_i + δ_i, ..., x_p) − Y(x_1, ..., x_i, ..., x_p) ] / δ_i,   (2.9)

where δ_i is a step size. The score for the i-th feature is computed as

  w_i = Δ_i^2 · μ_i / (π^2 · var(Y)),   i = 1, ..., p,   (2.10)

where μ_i = (1/r) Σ_{j=1}^{r} d_i^2(X^j), r is the number of steps changing the i-th feature value on all trajectories, X^j is the input value at these steps, Δ_i is the range of possible values for the i-th feature, and var(Y) is the sample variance of the black box values on the generated sample points. In effect, the method gives a normalized estimate of the average squared partial derivatives.

Pros:
  - Can provide reliable estimates even for very small budgets: the minimal number of black box function calls equals a few times the number of features, which is sufficient to get an estimate in not very complex cases.
Cons:
  - Generates trajectories randomly.
  - Not robust to outliers.

• Extended FAST (Fourier Amplitude Sensitivity Testing) is a technique suited for the case when a cheap black box is available (like a surrogate model, see 2.4.1), since it requires quite many samples to estimate the scores. The idea is to measure what portion of the output variance is described by the variance of the feature. For each feature the main indices are estimated as

  S_i = V_{x_i}[ E_{~x_i}(Y | x_i) ] / V(Y),   (2.11)
It requires quite many samples to estimate score. The idea here is to measure what portion of output variance is described by the variance of the feature. To do so for each feature main indices are estimated as Si = 4 Vxi [E∼xi (Y |xi )] , V (Y ) number of function calls allowed for method 9 (2.11) CHAPTER 2. OVERVIEW where Vxi [·] is a variance with respect to xi , E∼xi (·|xi ) is a conditional mean with respect to all features except xi . Instead of computing multivariate Monte Carlo estimates, method uses space filling one-dimensional curves of the form xi (s) = 1 1 + arcsin(sin(vi s + φi )) 2 π (2.12) to generate sample points. Here each feature have some frequency vi assigned from some incommensurate set vi , s is the coordinate on one-dimensional curve and φi is a some random constant phase shift. Using Fourier decomposition in case of (2.12) we may say that ∞ X f (X) = (Aj cos(js) + Bj sin(js)), j=−∞ 1 Aj = 2π Bj = 1 2π Z π f (s)cos(js)ds, −π Z π f (s)sin(js)ds. −π These integrals can be estimated using points generated on the curve (2.12). In this case, e.g. conditional variance can be estimated as Vxi [E∼xi (Y |xi )] = 2 K X 2 (A2jvi + Bjv ), jvi is an integer, i (2.13) j=1 where K is some predefined number. Another appealing property of this approach is it’s ability to accurately estimate total indices. In this case all cross-variable interactions that include i-th feature are taken into account in the corresponding scores, i.e. the score is estimated as follows: Si = 1 − V∼xi [Exi (Y |x1 , . . . , xi−1 , xi+1 , . . . , xp )] . V (Y ) (2.14) To do this estimation unique frequency vi is given to xi and the same frequency v is given to all other features, then the same procedure as above is performed. The score for i-th feature is wi = Si , i = 1, . . . , p. 
(2.15)

Pros:
  - Can give main effect as well as total effect estimations.
  - Needs fewer samples than most other variance based approaches (about 72 points per feature is recommended).
Cons:
  - Still requires relatively large samples.

Which technique to choose in each case is decided by the initial problem conditions (whether we have a sample or a black box) and best practice. For details see Chapter 6.

2.5 Scores variance estimation

It is possible to compute score estimation variances to check how reliable the obtained score values are. Given a score and an estimate of its variance, one may expect that with high probability (usually estimated at 99.99966%) the true score value lies inside the range [score − 3·sqrt(variance), score + 3·sqrt(variance)]. So if zero is outside this range, one may decide that the score is significantly larger than zero. This means that the corresponding feature has a significant influence on the function value, and the feature can be treated as important. Admittedly, estimation of true confidence intervals for scores is quite a complicated problem; however, we consider our approximation of the confidence intervals sufficiently accurate to help in the selection of the important features.

2.6 Remark on other sensitivity analysis methods

In this section we discuss GTIVE methods with respect to the well-known Pearson's and Spearman's correlation coefficients. These correlation coefficients have the following limitations:

• Pearson's correlation coefficient is suitable only for linear functional dependencies. There is an analog of such a technique in GTIVE, namely RidgeFS.
• Spearman's correlation coefficient is suitable only for monotonic functions. In GTIVE we do not make such assumptions for the nonlinear techniques (i.e. for all except RidgeFS).

To clarify these points, we give an example. Let us consider the sensitivity analysis problem for the function f = x^2 + 2y^2, x, y ∈ [−1, 1].
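As a quick numerical check of this example (a sketch, not GTIVE itself: the correlations come from scipy, and a simple binned main-effect estimate stands in for the nonlinear techniques, with the square root taken for uniformity):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(2)
n = 50_000
x = rng.uniform(-1, 1, n)
y = rng.uniform(-1, 1, n)
f = x**2 + 2 * y**2

# f is an even function of each input, so there is neither a linear nor a
# monotone association: both coefficients come out near zero.
pear = pearsonr(x, f)[0]
spear = spearmanr(x, f)[0]
print(pear, spear)

def main_effect_std(u, f, bins=50):
    # sqrt of Var(E[f | u]), estimated by binning u and averaging f per bin.
    edges = np.linspace(-1, 1, bins + 1)
    idx = np.clip(np.digitize(u, edges) - 1, 0, bins - 1)
    cond_means = np.array([f[idx == b].mean() for b in range(bins)])
    return cond_means.std()

sx, sy = main_effect_std(x, f), main_effect_std(y, f)
print(sx / (sx + sy), sy / (sx + sy))  # close to a 33% / 67% split
```

Analytically, Var(E[f|x]) = Var(x^2) = 4/45 and Var(E[f|y]) = 4·Var(y^2) = 16/45, so the square-rooted shares are exactly 1/3 and 2/3.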
In this case, the nonlinear GTIVE techniques are supposed to identify correctly the presence of the dependency and the influence of each variable on the output. The results are summarized in Table 2.2 (for uniformity, GTIVE scores are given after taking the square root). As expected, since the function is neither linear nor monotonic, the first three techniques gave inaccurate results.

  Technique              X      Y
  Pearson                59%    41%
  Spearman               78%    22%
  RidgeFS                74%    26%
  Mutual Inf (kraskov)   33%    67%
  Mutual Inf (hist)      34%    66%
  Elementary Effects     35%    65%
  FAST                   34%    66%

Table 2.2: Pearson's and Spearman's correlation coefficients and GTIVE techniques.

2.7 Remark on the selection of techniques for GTIVE

The selection of techniques for GTIVE was driven by several factors.

1. The need to provide the basic modes of operation:
   • reliable linear solution on a small sample: RidgeFS
   • medium-size sample, from 50 to 500 points: Mutual Information (kraskov)
   • large sample, from 200 to several hundred thousand points: Mutual Information (histogram)
   • black box with a small budget, from 2·(inputDimension + 1) to ≈ 2000: Elementary Effects
   • black box with a large budget, from 65·inputDimension to hundreds of thousands: FAST

2. The popularity of the techniques:
   • RidgeFS is a standard linear estimate.
   • Mutual Information is a widely used technique for feature selection in biology, medicine and image processing (e.g. see [17], [11], [10], [16]).
   • Elementary Effects is a standard screening technique based on the computation of average partial derivatives, recommended in [13].
   • FAST is a common way to calculate the so-called global sensitivity indices. The efficient calculation of such indices with FAST is described in [13] and [9]. Examples of the usage of this approach are given in [15] and [19].

Chapter 3

Internal workflow

3.1 General workflow

As described in Section 2.4, GTIVE includes two types of techniques: blackbox- and sample-based.
The main difference, regarding the tool's internal workflow, is that there is no preprocessing step in the blackbox-based mode: in this mode GTIVE generates the sample itself and ensures it has a correct structure and does not contain any degenerate data. Conversely, in sample-based mode the sample analysis is essential because in general there are no guarantees of sample quality. Thus the GTIVE internal workflow generally consists of the following steps:

1. Preprocessing. Only in sample-based mode. In this step, redundant data is removed from the training set and the sample is normalized (see Section 3.2).

2. Analyzing training data and options, selecting the technique. In this step, the training sample properties and the options specified by the user are analyzed for compatibility, and the most appropriate estimation technique is selected (see Chapter 6).

3. Estimating feature scores and the scores' standard deviation. In this step, feature scores are estimated using the technique selected in the previous step. If the VarianceEstimateRequired option is on, the result also includes the score standard deviation (std calculation is off by default). For vector functions (functions with multidimensional output), feature scores and score standard deviations are estimated for each component independently (see Section 3.3 for the results description).

For individual technique descriptions, see Chapter 4 and Section 2.4.

3.2 Preprocessing

Since we work with an initial training dataset, some reasonable preprocessing must be applied to it in order to remove possible degeneracies in the data. Let [X, Y] be the N × (p + q) matrix of the training data, where the rows are (p + q)-dimensional training points and the columns are individual scalar components of the input or output. The matrix [X, Y] consists of the sub-matrices X and Y. We perform the following operations with the matrix [X, Y]:

1.
Remove all exact duplicates: search for rows in [X, Y] containing the same data and, if two or more matches are found, delete every row except one (since repeated data points do not add any information). A warning is sent to the log if any rows were removed.

2. Remove all constant columns in the sub-matrices X and Y. A constant column in X means that all the training vectors have the same value of one of the input components; in particular, this means that the training DoE is degenerate and covers only a certain section of the original design space. Column removals also produce a warning to the log. As a result, we obtain a reduced matrix [X̃, Ỹ] consisting of the submatrices X̃ and Ỹ. Accordingly, we define the effective input dimension p̃ as the number of columns in X̃, and the effective sample size Ñ as the number of rows in [X̃, Ỹ].

3. Next, the sample values in the X̃ and Ỹ matrices are normalized so that each component of the input and output has mean 0 and standard deviation 1:

  x_i → (x_i − x̄_i) / σ(x_i),   y_i → (y_i − ȳ_i) / σ(y_i).   (3.1)

This is the last sample preprocessing step for all techniques except Mutual Information. This means that for the RidgeFS and SMBFAST techniques the scores are estimated using the normalized reduced matrix rather than the original matrix [X, Y]. The Mutual Information technique includes one more preprocessing step:

4. The Mutual Information technique is known to possibly show some performance degradation when feature values are distributed over a uniform grid (which may be the case after normalization). Due to this, when using the Mutual Information technique (whether the Kraskov or histogram estimate), a small scale uniform noise in the range [−10^−10, 10^−10] is applied to all input and output components. If the rank transform is on (see option RankTransform), the noise is applied after the transform.
Thanks to its small scale, the noise does not have any significant effect on the final results, while the robustness of the Mutual Information technique is notably improved.

3.3 Results

The resulting output of GTIVE contains a feature score matrix S and, if std calculation is on (see option VarianceEstimateRequired), a score standard deviation matrix D. The size of both matrices is q × p: the number of rows equals the output dimension q, and the number of columns equals the number of features, i.e. the input dimension p (the original input dimension, not the effective input dimension p̃).

3.3.1 Feature scores

Each element s_ij of the S matrix is the sensitivity of the i-th output component to the j-th feature. In general, s_ij is a positive real number, except in some special cases:

• In the sample-based mode, if the value of the j-th feature in the sample is constant (the X matrix contains a constant column), all scores of this feature (the j-th column of S) are set to NaN (the special not-a-number value), since there is no way to estimate the sensitivity of the output to a constant component.

• In the sample-based mode, if the value of the i-th response component in the sample is constant (the Y matrix contains a constant column), the scores of all features versus this output (the i-th row of S) are set to 0.0 — it is assumed that this output is insensitive to all features since its value is constant.

• The first of the above rules has priority: if the sample contains both a constant feature x_j and a constant output y_i, the s_ij score is NaN.

• In the blackbox-based mode, if the generation region (see Section 2.4.2) is defined in such a way that the lower and upper bounds of some feature are equal, this feature is interpreted as a constant input, so its resulting score will be NaN, similarly to the sample-based mode with a constant column.

Note that GTIVE cannot handle feature collinearity.
For instance, if the values of two features are always equal, they are assigned equal scores, while in reality it is possible that the output is totally insensitive to the first feature and changes its value only due to changes in the second feature. This is one example of a degenerate data sample, and such features have to be filtered out before passing the data to GTIVE.

3.3.2 Standard deviation

The standard deviation matrix D is structurally similar to the score matrix: each element σ_ij is the standard deviation of the s_ij score. In general, σ_ij is a non-negative real number, except when the s_ij score is NaN; in this case σ_ij is also set to NaN. Note that standard deviation is calculated only when VarianceEstimateRequired is on; otherwise the D matrix is empty.

Chapter 4

User configurable options

GTIVE combines a number of score estimation techniques of different types. By default, the tool selects the optimal technique compatible with the user-specified options and in agreement with best-practice experience. Alternatively, the user can directly specify the technique through the advanced options of the tool. This section describes the available techniques and their options; selection of the technique for a particular problem is described in Chapter 6.

4.1 RidgeFS

Short name: LR

General description: Estimation of feature scores as normalized coefficients of regularized linear regression. The regularization coefficient is estimated by minimizing the generalized cross-validation criterion [5]. Also, see Section 2.4.1.

Variance estimation: Yes

Restrictions: Can be applied to a data sample only.

Strengths and weaknesses: A very robust and fast technique with wide applicability in terms of input space dimension and amount of training data. It is, however, usually rather crude, and the estimate can hardly be improved significantly by adding new training data.

Options: No options.
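To illustrate the idea behind RidgeFS (this is a minimal plain-Python sketch, not the MACROS implementation), one can fit a ridge regression on a roughly standardized sample and report normalized absolute coefficients as feature scores. The fixed regularization strength `alpha` is a simplifying assumption; the real technique selects it by generalized cross-validation.

```python
import random

def ridge_scores(X, y, alpha=1e-3):
    """Toy RidgeFS-style scores: normalized |coefficients| of ridge regression.

    X: list of input rows (assumed roughly standardized), y: responses.
    `alpha` is a fixed regularization strength (an assumption; the real
    tool picks it by generalized cross-validation).
    """
    p = len(X[0])
    # Normal equations: A = X^T X + alpha * I, b = X^T y.
    A = [[sum(row[i] * row[j] for row in X) + (alpha if i == j else 0.0)
          for j in range(p)] for i in range(p)]
    b = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    # Solve A beta = b by Gaussian elimination with partial pivoting.
    for col in range(p):
        piv = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            for c in range(col, p):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    beta = [0.0] * p
    for r in range(p - 1, -1, -1):
        s = sum(A[r][c] * beta[c] for c in range(r + 1, p))
        beta[r] = (b[r] - s) / A[r][r]
    total = sum(abs(c) for c in beta) or 1.0
    return [abs(c) / total for c in beta]

if __name__ == "__main__":
    random.seed(0)
    # y depends strongly on x3, weakly on x1, and not at all on x2.
    X = [[random.gauss(0, 1) for _ in range(3)] for _ in range(200)]
    y = [0.5 * r[0] + 3.0 * r[2] + random.gauss(0, 0.1) for r in X]
    print(ridge_scores(X, y))  # the x3 score dominates
```

As in the real technique, the scores sum to one and only reflect linear structure: a purely nonlinear (e.g. even) dependency would be scored poorly.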
4.2 Mutual Information (Kraskov estimate)

Short name: Kraskov

General description: Mutual information estimate of feature scores based on nearest neighbor information [8]. Also, see Section 2.4.1.

Variance estimation: Yes

Strengths and weaknesses: A robust nonlinear estimation technique; however, it can be applied only to small to moderate samples due to memory limitations. The method tends to underscore features in the case of heavy cross-feature interactions.

Restrictions: Can be applied to a data sample only.

Options:

• NumberOfNeighbors
Values: integer in range [1, 0.8 · (effective sample size) − 1].
Default: 0 (auto).
Short description: the number of nearest neighbors used to estimate mutual information.
Description: This option specifies the number of nearest neighbors used in the estimation of mutual information if the 'kraskov' technique is selected (manually or automatically). Increasing this value gives a smaller variance of the score estimate at the cost of a higher systematic error, and vice versa. Best practice recommends setting it to a small integer value of around 5 in most cases.

• RankTransform
Values: on, off
Default: on
Short description: apply the rank transform (copula transform) before computing mutual information.
Description: If this option is on (True), the rank transform is applied to the input sample before computing mutual information. In most cases, it allows for a more accurate mutual information estimate.

4.3 Mutual Information (Histogram based estimate)

Short name: Hist

General description: Mutual information estimate of feature scores based on histogram construction. Also, see Section 2.4.1.

Strengths and weaknesses: Too crude for small samples, but has very low memory requirements, so it can be applied to very large data sets. If the sample size is at least 20000, accelerated optimization of the histogram parameters is used. Tends to underscore features in the case of heavy cross-feature interactions.
Variance estimation: Yes

Restrictions: Can be applied to a data sample only.

Options:

• RankTransform
Values: on, off
Default: on
Short description: apply the rank transform (copula transform) before computing mutual information.
Description: If this option is on (True), the rank transform is applied to the input sample before computing mutual information. In most cases, it allows for a more accurate mutual information estimate.

4.4 SMBFAST (Surrogate Model-Based FAST)

Short name: SMBFAST

General description: Surrogate Model-Based FAST combines surrogate modelling with the extended FAST method. Also, see Section 2.4.1.

Strengths and weaknesses: SMBFAST may be time consuming, but it is the most accurate of all currently implemented sample-based techniques.

Variance estimation: Yes

Restrictions: Can be applied to a data sample only.

Options:

• Accelerator
Values: integer in range [1, 5], or 0 (auto)
Default: 0 (automatically set by the approximator)
Short description: five-position switch to control the trade-off between speed and accuracy for the internal approximator used in SMBFAST.
Description: Since SMBFAST builds a surrogate model (to be used as a FAST blackbox), it actually uses GT Approx internally and makes certain options of this internal approximator available as GTIVE options. This option is essentially the same as GTApprox/Accelerator, except that 0 is also a valid value, meaning that the setting will be selected automatically by the internal approximator.

• NumberOfCVFold
Values: integer in range [2, 2³¹ − 2], or 0 (auto)
Default: 0 (auto select)
Short description: the number of cross-validation subsamples used to estimate the variance of scores.
Description: To estimate the variance of scores, the principle of cross-validation is used. Cross-validation involves dividing the input sample into a number of subsamples (cross-validation subsets).
This option sets the number of subsamples to divide the sample into.

• SensitivityIndexesType
Values: enumeration: total, main
Default: total
Short description: select the type of score index to be computed.
Description: This option is a switch selecting the type of index computed by the FAST procedure used internally in SMBFAST. The main index estimate is usually more reliable, but this index takes into account only the influence of the considered feature on the output, ignoring the influence of cross-feature interactions. The total index estimates the total influence of the variable on the output, taking into account all possible interactions between the considered feature and the other input features, but its estimate is generally less reliable.

• SurrogateModelType
Values: enumeration: LR, SPLT, HDA, GP, HDAGP, SGP, GeoGP, TA, iTA, RSM, or Auto
Default: Auto
Short description: specify the algorithm for the internal approximator used in SMBFAST.
Description: Since SMBFAST builds a surrogate model (to be used as a FAST blackbox), it actually uses GT Approx internally and makes certain options of this internal approximator available as GTIVE options. This option is essentially the same as GTApprox/Technique. The default (Auto) selects a technique according to the GTApprox decision tree, with a single difference: HDAGP is never selected automatically, and where GTApprox would select HDAGP, the GP technique is used instead.

4.5 Elementary Effects

Short name: EE

General description: A screening technique estimating feature scores as an average of the function's partial derivatives [13]. Also, see Section 2.4.2.

Strengths and weaknesses: Can work with very small budgets and still give reliable estimates in most cases; however, it may take time if the budget is big, due to the complex problem of selecting an appropriate set of trajectories. Note that the method allows some randomization, so one can get different estimates by varying the global Seed parameter.

Variance estimation: Yes
Restrictions: Can be applied to a blackbox only.

Options:

• Deterministic
Values: boolean.
Default: on.
Short description: require the IVE process to be deterministic.
Description: If this switch is turned on, all random processes in all algorithms are started with a fixed seed, ensuring the result is the same on every run. In the current version the switch affects only the blackbox-based techniques (FAST and Elementary Effects).

• Seed
Values: integer [1, 2147483647].
Default: 100.
Short description: change the fixed seed used when Deterministic is on.
Description: Enables the user to use different fixed seeds for the IVE process. In the current version the option affects only the blackbox-based techniques (FAST and Elementary Effects).

• MinCurveNum
Values: integer [1, 2147483647].
Default: 200.
Short description: the number of space-filling curves tested to compute elementary effects. Also, see Section 2.4.2.
Description: This option specifies the number of curves used in the estimation of elementary effects. The more curves are used, the better the parameter space is explored, resulting in a more accurate score estimate; however, it takes additional time.

4.6 Extended FAST (Fourier Amplitude Sensitivity Testing)

Short name: FAST

General description: Variance-based estimation of feature scores. The method can estimate cross-variable interactions as well as isolated (main) variable indices (which can be useful for additional manual dependency analysis) [12]. Also, see Section 2.4.2.

Strengths and weaknesses: Needs a large enough computational budget (at least 65 function calls per feature to get a stable estimate), but is very precise, if the budget is sufficient, even in the case of strong variable interdependencies. Note that the method allows some randomization, so one can get different estimates by varying the global Seed parameter.

Variance estimation: Yes

Restrictions: Can be applied to a blackbox only.
Options:

• Deterministic
Values: boolean.
Default: on.
Short description: require the IVE process to be deterministic.
Description: If this switch is turned on, all random processes in all algorithms are started with a fixed seed, ensuring the result is the same on every run. In the current version the switch affects only the blackbox-based techniques (FAST and Elementary Effects).

• Seed
Values: integer [1, 2147483647].
Default: 100.
Short description: change the fixed seed used when Deterministic is on.
Description: Enables the user to use different fixed seeds for the IVE process. In the current version the option affects only the blackbox-based techniques (FAST and Elementary Effects).

• SensitivityIndexesType
Values: enumeration: total, main.
Default: total.
Short description: select the type of score index to be computed.
Description: This switch selects whether the FAST procedure should compute the 'main' or the 'total' score index. The 'main' index takes into account only the isolated influence of the considered feature on the output, ignoring the influence of cross-feature interactions. The 'total' index estimates the total influence of the variable on the output, taking into account all possible interactions between the considered feature and the other input features, but its estimate is generally less reliable.

• NumberOfSearchCurves
Values: integer [0, 2147483647].
Default: 0 ("0" means auto selection: 4 if the budget is sufficient, and less otherwise).
Short description: adds random multistart to the FAST curves used for estimation of sensitivity indices.
Description: This option allows performing multistart when building FAST space-filling curves. It can potentially increase accuracy at the cost of increasing the budget requirements NumberOfSearchCurves times. The minimal allowable budget equals 65 · p̃ · NumberOfSearchCurves, where p̃ is the effective dimension of the input vector (the number of non-constant input factors).
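To make the blackbox-based techniques above more concrete, here is a minimal Elementary Effects style estimate in plain Python. It uses a simplified one-at-a-time random design rather than GTIVE's space-filling trajectory selection; the function name `elementary_effects`, the `delta` step, and the `n_traj` budget are illustrative assumptions, not tool parameters.

```python
import random

def elementary_effects(f, p, bounds, n_traj=50, delta=0.1, seed=100):
    """Toy Morris-style screening: normalized mean absolute elementary effect.

    `f` is the blackbox, `p` the input dimension, `bounds` a list of
    (lo, hi) pairs. This is a simplified one-at-a-time scheme, not the
    trajectory-selection algorithm GTIVE actually uses.
    """
    rng = random.Random(seed)  # fixed seed, cf. the Deterministic/Seed options
    mu = [0.0] * p
    for _ in range(n_traj):
        # Base point, kept far enough from the upper bound for the step.
        x = [rng.uniform(lo, hi - delta * (hi - lo)) for lo, hi in bounds]
        fx = f(x)
        for i in range(p):
            step = delta * (bounds[i][1] - bounds[i][0])
            xi = list(x)
            xi[i] += step
            # Elementary effect of input i at this base point.
            mu[i] += abs(f(xi) - fx) / step
    total = sum(mu) or 1.0
    return [m / total for m in mu]

if __name__ == "__main__":
    # x1 matters most, x2 less, x3 not at all.
    f = lambda x: 4.0 * x[0] + x[1] ** 2
    print(elementary_effects(f, 3, [(-1.0, 1.0)] * 3))
```

Note that this sketch spends n_traj · (p + 1) blackbox calls, which mirrors the kind of budget accounting discussed in Chapter 5.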
Chapter 5

Limitations

The maximum size of the training sample that can be processed by GTIVE is primarily determined by the user's hardware. Necessary hardware resources depend significantly on the specific technique — see the descriptions of individual techniques. Accuracy of estimation tends to improve as the sample size increases.

Technique    Input type   Performance on huge training sets   Other restrictions
RidgeFS      sample                                           linear dependencies only
Kraskov      sample                                           limited by available RAM
Histogram    sample
SMBFAST      sample       potentially long runtime
EE           blackbox     potentially long runtime
FAST         blackbox

Table 5.1: Technique summary

Contrary to the maximum size, there is a certain minimum size of the training set (or minimum number of available blackbox calls), which depends on the technique used. As explained in Section 3.2, this condition refers to the effective values, i.e. the ones obtained after preprocessing. An error with the corresponding error code is returned if this condition is violated. The requirements on minimum sample size (budget) are summarized in Table 5.2. For most techniques there are two different limits, depending on whether the calculation of the score standard deviation is required by the user (see option VarianceEstimateRequired). Table 5.2 uses the following notation:

• p̃: the effective input dimension after sample preprocessing.
• s: the GTIVE/SMBFAST/NumberOfCVFold option value.
• NN: the GTIVE/MutualInformation/NumberOfNeighbors option value. The corresponding limit is in effect only if the option is set by the user.
• NR: the GTIVE/FAST/NumberOfSearchCurves option value. The corresponding limit is in effect only if the option is set by the user.
• ⌈x⌉: the value of x rounded up (to the next integer).
Technique                        Minimum size (budget)
                                 std calculation on           std calculation off
RidgeFS                          p̃ + 2                        p̃ + 1
SMBFAST                          (2p̃ + 3) · s/(s − 1)         2p̃ + 3
Mutual Information (Kraskov)     20, or ⌈(NN + 1)/0.8⌉        20, or NN + 1
Mutual Information (histogram)   3                            3
EE                               2(p̃ + 1)                     p̃ + 1
FAST                             65p̃ · 3, or 65p̃ · NR (NR ≥ 3)   65p̃, or 65p̃ · NR

Table 5.2: Minimum sample size (blackbox budget) for GTIVE techniques

Chapter 6

Selection of technique

This section details manual and automatic selection of one of the techniques described in Chapter 4.

6.1 Selection of the technique by the user

The user may specify the technique by setting the option Technique, which may have the following values:

• Auto — the best technique will be determined automatically (default)
• RidgeFS
• Mutual Information — to select a specific estimation type, the additional parameter /MutualInformation/Algorithm may be specified, with possible values 'kraskov' for the Kraskov estimate and 'hist' for the histogram-based approach. If none is specified, the 'kraskov' estimate is used if there are fewer than 500 sample points, and the 'hist' estimate is used otherwise. If the 'hist' estimate is used and the sample size is at least 20000, accelerated optimization of the 'hist' parameters is used.
• SMBFAST
• ElementaryEffects
• FAST

6.2 Default automatic selection

The decision tree describing the default selection of the estimation technique is shown in Figure 6.1. The factors influencing the choice are:

• Input type, i.e. sample or blackbox.
• Sample size (for blackbox, budget) K and effective input dimension p̃ of the training sample.

Figure 6.1: The GTIVE internal decision tree for the choice of the default estimation method

The selection is performed in agreement with the properties of the individual techniques as described in Chapter 4. In particular, for the sample input:

• If p̃ ≤ 10, K < 300, and 2p̃ + 2 ≤ K < 2 · (2p̃ + 3), RidgeFS is selected.
• If p̃ ≤ 10, K < 300, but K ≥ 2 · (2p̃ + 3), SMBFAST is selected.
• In other cases, Mutual Information is selected, which uses the histogram technique if K > 500 (with the accelerated histogram estimate if K > 20000), and the Kraskov estimate if 20 ≤ K ≤ 500.

For the blackbox input:

• If K ≥ 4 · (72p̃ + 1), the FAST technique is chosen.
• If 2 · (p̃ + 1) ≤ K < 4 · (72p̃ + 1), Elementary Effects is used.
• If p̃ + 1 ≤ K < 2 · (p̃ + 1), the tool will start only if score variance estimation is not required by the user (see option VarianceEstimateRequired). Otherwise, if variance estimation is required or K < p̃ + 1, the tool will not start.

Chapter 7

Usage Examples

In this chapter we apply GTIVE to some artificial model functions and some real-world data sets to demonstrate the properties of the methods.

7.1 Artificial Examples

In this section we demonstrate the performance of the various techniques implemented in GTIVE on some known artificial functions.

7.1.1 Example 1: simple function, no cross-feature interaction

In this example we consider the function

    f(x1, x2, x3, x4, x5) = x1² + 2x2² + 3x3² + 4x4² + 5x5²,    xi ∈ [−1, 1], i = 1, …, 5.    (7.1)

In this case there are no cross-feature interactions, so we can estimate that the true scores should have the ratio 1 : 4 : 9 : 16 : 25. In this example, we refer to these scores as True. We calculated feature scores with all methods for different sample sizes; the tables below compare them with our expectations of the true scores for this problem.

Results for RidgeFS are presented in Table 7.1. As expected, RidgeFS assumes a linear dependency, so the method fails to estimate correct scores.

Sample size   x1       x2       x3       x4       x5
True          0,0181   0,0727   0,1636   0,2909   0,4545
30            0,1847   0,1628   0,1113   0,2425   0,2983
100           0,2164   0,2074   0,1687   0,2128   0,1944
500           0,1449   0,1876   0,2312   0,1505   0,2855

Table 7.1: Example 1. RidgeFS scores
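As an aside, the True row follows from a variance decomposition: each additive term c_i · x_i² contributes c_i² · Var(x²) to the output variance, and Var(x²) is the same for every input, so the normalized scores are proportional to c_i². A short plain-Python check (not part of GTIVE; the Monte Carlo sample size is arbitrary):

```python
import random

# For f(x) = x1^2 + 2 x2^2 + ... + 5 x5^2 with independent xi ~ U[-1, 1],
# the normalized true scores are c_i^2 / sum(c_j^2).
coeffs = [1, 2, 3, 4, 5]
true_scores = [c * c / sum(k * k for k in coeffs) for c in coeffs]
print([round(s, 4) for s in true_scores])  # [0.0182, 0.0727, 0.1636, 0.2909, 0.4545]

# Monte Carlo cross-check of the per-term variance contributions.
rng = random.Random(0)
n = 50_000
terms = [[c * rng.uniform(-1, 1) ** 2 for _ in range(n)] for c in coeffs]

def var(v):
    m = sum(v) / len(v)
    return sum((x - m) ** 2 for x in v) / len(v)

mc = [var(t) for t in terms]
mc_scores = [v / sum(mc) for v in mc]
print([round(s, 4) for s in mc_scores])  # close to the True row above
```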
Results for Elementary Effects are presented in Table 7.2. Elementary Effects gives results satisfactorily close to True already on a 30-point sample, and very close results on 100 points.

Sample size   x1       x2       x3       x4       x5
True          0,0181   0,0727   0,1636   0,2909   0,4545
30            0,0152   0,0754   0,1782   0,2811   0,4213
100           0,0193   0,0721   0,1691   0,2952   0,4415

Table 7.2: Example 1. Elementary Effects scores

Results for Mutual Information (Kraskov estimate) are presented in Table 7.3. The Kraskov estimate gives satisfactory results on 30 points and results quite close to True on 500 points.

Sample size   x1       x2       x3       x4       x5
True          0,0181   0,0727   0,1636   0,2909   0,4545
30            0,1058   0,1051   0,0963   0,26478  0,4279
100           0,0867   0,0785   0,1220   0,2562   0,4563
500           0,0366   0,0774   0,1375   0,2772   0,4711

Table 7.3: Example 1. Mutual Information (Kraskov estimate) scores

Results for Mutual Information (histogram estimate) are presented in Table 7.4. As expected, the histogram-based estimate of Mutual Information is inferior to the Kraskov estimate on small samples, but still produces estimates close to True.

Sample size   x1       x2       x3       x4       x5
True          0,0181   0,0727   0,1636   0,2909   0,4545
30            0,0622   0        0,0656   0,2988   0,5733
100           0        0,0856   0,1486   0,2513   0,5142
500           0,0059   0,0315   0,1287   0,2914   0,5422
750           0,0084   0,0585   0,1501   0,2958   0,4868
1000          0        0,0513   0,1725   0,3020   0,4740
2000          0,0039   0,0609   0,1791   0,2967   0,4591

Table 7.4: Example 1. Mutual Information (histogram estimate) scores

Results for FAST are presented in Table 7.5. FAST needs at least 65 × 6 = 390 points to work on this sample. It gives satisfactory results on 500 points, and good results on 1000 points.

Sample size   x1       x2       x3       x4       x5
True          0,0181   0,0727   0,1636   0,2909   0,4545
500           0,0339   0,0963   0,2589   0,2744   0,3362
750           0,0442   0,0824   0,1638   0,2370   0,4723
1000          0,0273   0,0808   0,1681   0,2697   0,4538

Table 7.5: Example 1. FAST scores
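For readers who want to experiment with the histogram approach outside the tool, a toy histogram estimate of mutual information fits in a few lines. The fixed bin count here is an assumption made for brevity; as noted in Section 4.3, GTIVE optimizes the histogram parameters (with an accelerated scheme on large samples):

```python
import math, random

def hist_mi(x, y, bins=10):
    """Toy histogram estimate of mutual information I(X; Y) in nats.

    Uses a fixed number of equal-width bins; the real Histogram technique
    optimizes the binning instead.
    """
    n = len(x)
    def bin_of(v, lo, hi):
        b = int((v - lo) / (hi - lo) * bins)
        return min(b, bins - 1)
    lx, hx = min(x), max(x)
    ly, hy = min(y), max(y)
    pxy = {}
    px = [0] * bins
    py = [0] * bins
    for xi, yi in zip(x, y):
        i, j = bin_of(xi, lx, hx), bin_of(yi, ly, hy)
        pxy[(i, j)] = pxy.get((i, j), 0) + 1
        px[i] += 1
        py[j] += 1
    # I(X;Y) = sum p(i,j) * log( p(i,j) / (p(i) p(j)) )
    mi = 0.0
    for (i, j), c in pxy.items():
        p = c / n
        mi += p * math.log(p * n * n / (px[i] * py[j]))
    return mi

if __name__ == "__main__":
    rng = random.Random(1)
    x = [rng.uniform(-1, 1) for _ in range(5000)]
    noise = [rng.uniform(-1, 1) for _ in range(5000)]
    # MI is high for a dependent pair and near zero for an independent one.
    print(hist_mi(x, [v * v for v in x]), hist_mi(x, noise))
```

This also illustrates why the histogram estimate is crude on small samples: with few points per bin, the empirical cell probabilities become noisy and the estimate acquires a positive bias.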
7.1.2 Example 2: usage of confidence intervals to determine redundant variables

In this example we demonstrate how knowing confidence intervals can tell us whether the function depends on a feature or not. For simplicity, consider the function

    f(x1, x2, x3) = x1² + x1·x2² + 0.01·x3²,    xi ∈ [−1, 1], i = 1, 2, 3.    (7.2)

Here the function depends very weakly on x3. We generate a 200-point random sample for this function and apply GTIVE (in this case the Mutual Information Kraskov algorithm is used). The scores and the standard deviations of the scores (the square root of the estimated variance of the scores) are provided in Table 7.6.

              x1       x2       x3
Scores        0,7494   0,2506   0,0
stdScores     0,1019   0,0745   0,0516

Table 7.6: Example 2. GTIVE scores and the standard deviation of scores

Using confidence intervals, one may additionally check whether the obtained score values can be trusted. The score value of the third feature is zero, so its contribution was not detected at this sample size. To check whether the scores of the first and second features are significantly larger than zero, one should check, for the i-th feature, whether zero belongs to the interval (Score_i − 3 · stdScore_i, Score_i + 3 · stdScore_i).

For the first feature: Score_1 − 3 · stdScore_1 = 0.4437 > 0.

For the second feature: Score_2 − 3 · stdScore_2 = 0.0272 > 0.

This means that both scores are, with very high probability, significantly larger than zero. For the third feature this value is obviously negative.

7.1.3 Example 3: difference between 'main' and 'total' scores in FAST

In this example we consider FAST performance for the function

    f(x1, x2, x3) = x1² + 2·x1·x2 + x3²,    xi ∈ [−1, 1], i = 1, 2, 3,    (7.3)

which on the one hand is still simple enough to form expectations of what the true scores should be, but on the other hand already has some feature interactions.
So in this example one may expect x1 to have the largest score, x2 to be in second place, and x3 to be the least important feature. We use this example to demonstrate the difference between main and total FAST scores. Main scores take into account only the isolated contribution of a variable to the variance of the output, meaning that the main scores ignore the influence of the x1·x2 term. Total scores, on the other hand, account for all feature interactions. In manual dependency analysis, comparison of these two indices allows some investigation of the nature of the dependency. We estimated these scores using 500- and 1000-point samples to show the difference in the results. Total scores are presented in Table 7.7.

Sample size   x1       x2       x3
500           0,4449   0,3985   0,1503
1000          0,4965   0,4125   0,0869

Table 7.7: Example 3. FAST (total) scores

Main scores are presented in Table 7.8.

Sample size   x1       x2       x3
500           0,3353   0,0019   0,6604
1000          0,5016   0,0033   0,4910

Table 7.8: Example 3. FAST (main) scores

Let S_T1, S_T2, S_T3 be the total indices of the variables and S_M1, S_M2, S_M3 the main indices. One may see that S_M2 ≈ 0 while S_T2 ≫ S_M2, which hints that the feature x2 appears only in interaction with some other feature. Recall also that S_Ti = S_Mi + interaction terms, e.g. S_T1 = S_M1 + S_12 + S_13, where S_12 is the term accounting for the interaction of x1 and x2. Notice also that S_M1 ≈ S_M3, S_M2 ≈ 0, and S_T1 ≈ S_T2 + S_T3, which gives

    S_12 + S_13 ≈ S_12 + S_23 + S_13 + S_23  ⇒  S_23 ≈ 0.

As a result, we can make an educated guess that the function has the form f(x1, x2, x3) = f1(x1) + f2(x3) + f3(x1, x2) + f4(x1, x3).

7.2 Real world data examples

In this section we show the application of GTIVE to some real-world data problems.
7.2.1 T-AXI problem

• Problem description: In this problem we consider the T-C DES (Turbomachinery Compressor DESign) code, a meanline axial-flow compressor design tool, which is the first step of T-AXI (an axisymmetric method for complete turbomachinery geometry design [18]). The program tcdes.e3c-des.exe is used to calculate outputs f(X) for newly generated inputs X. The program can be downloaded from http://gtsl.ase.uc.edu/T-AXI/. It uses a 163-dimensional feature vector describing the geometry and the working conditions as input. The task is to determine the subset of features most important for the Compressor Pressure Ratio (With IGV) output. The dependency is considered only for

    X ∈ V(X⁰) = {X : x_i ∈ [(1 − α)x⁰_i, (1 + α)x⁰_i], i = 1, …, 163},

where α = 0.1, and X⁰ = (x⁰_1, …, x⁰_163) is given in Tables 7.9 – 7.11.

Stage Parameter                  1       2       3       4       5       6       7       8       9       10
Stage rotor inlet angle [deg]    10,3    13,5    15,8    18      19,2    19,3    16,3    15      13,6    13,4
Stage rotor inlet Mach no.       0,59    0,51    0,475   0,46    0,443   0,418   0,402   0,383   0,35    0,313
Total Temperature Rise [K]       52,696  52,301  51,117  49,736  49,144  43,617  45,69   47,269  48,255  47,565
Rotor loss coef.                 0,053   0,0684  0,0684  0,0689  0,069   0,069   0,069   0,069   0,069   0,07
Stator loss coef.                0,07    0,065   0,065   0,06    0,06    0,065   0,065   0,065   0,065   0,1
Rotor Solidity                   1,666   1,486   1,447   1,38    1,274   1,257   1,31    1,317   1,326   1,391
Stator Solidity                  1,353   1,277   1,308   1,281   1,374   1,474   1,379   1,276   1,346   1,453
Stage Exit Blockage              0,963   0,956   0,949   0,942   0,935   0,928   0,921   0,914   0,907   0,9
Stage bleed [%]                  0       0       0       0       1,3     0       2,3     0       0       0
Rotor Aspect Ratio               2,354   2,517   2,33    2,145   2,061   2,028   1,62    1,417   1,338   1,361
Stator Aspect Ratio              3,024   2,98    2,53    2,21    2,005   1,638   1,355   1,16    1,142   1,106
Rotor Axial Velocity Ratio       0,863   0,876   0,909   0,917   0,932   0,947   0,971   0,967   0,98    0,99
Rotor Row Space Coef.            0,296   0,4     0,41    0,476   0,39    0,482   0,515   0,58    0,64    0,72
Stator Row Space Coef.           0,3     0,336   0,438   0,441   0,892   0,455   0,886   0,512   0,583   0,549
Stage Tip radius [m]             0,3507  0,3358  0,3283  0,3212  0,3151  0,3084  0,3042  0,2995  0,297   0,2946

Table 7.9: Stage data for 10 stage design (stage.e3c-des)

Mass Flow Rate [kg/s]            54,4
Rotor Angular Velocity [rpm]     12299,5
Inlet Total Pressure [Pa]        101325
Inlet Total Temperature [K]      288,15
Mach 3 - Last Stage              0,272
Clearance Ratio                  0,0015

Table 7.10: Initial data for 10 stage design (init.e3c-des)

Solidity                 0,6776
Aspect ratio             5,133
Phi Loss Coef.           0,039
Inlet Mach               0,47
Lambda                   0,97
IGV Row Space Coef.      0,4

Table 7.11: IGV data for 10 stage design (igv.e3c-des)

• Solution workflow: We perform the following steps in the analysis:

1. We generate a data sample of 10⁴ points. One could use the available code as a blackbox as well, but we did not do so because the code fails to compute outputs at many points.

2. On the given sample, feature scores are estimated using GTIVE with default settings. By default, the histogram-based estimate is used in this case, see Section 4.3.

3. The estimated feature scores are plotted in Figure 7.1. Looking at the figure, one may see that there are clearly 12 most influential features.
So it is natural to perform a preliminary optimization of the compressor varying only these 12 features instead of all 163.

4. To validate the GTIVE results, we estimated the Index of Variability (2.1) for different subsets Z of important features, adding features one by one, starting both from the features with the highest GTIVE scores and from those with the lowest. The results are presented in Figure 7.2.

• Results: In Tables 7.12 – 7.13 the most important feature is filled with dark green, and the next 11 most important ones are filled with light green.

Figure 7.1: T-AXI. Feature scores estimated by GTIVE. Note: This image was obtained using an older MACROS version. Actual results in the current version may differ.

Figure 7.2: T-AXI. Index of Variance. Note: This image was obtained using an older MACROS version. Actual results in the current version may differ.

Table 7.12: T-AXI. Features that influence Compressor Pressure Ratio the most (a). The data coincide with Table 7.9.

Table 7.13: T-AXI. Features that influence Compressor Pressure Ratio the most (b). The data coincide with Table 7.10.

7.2.2 Stringer (Super-Stiffener) Stress Analysis problem

• Problem description: A special Stress Analysis tool built upon a physical model computes Reserve Factor (RF) constraints for a side panel (of an airplane) defined by its geometry (Gj, j = 1, …, 5) and applied forces (Fi, i = 1, 2, 3) [1, 4]. Our task here is to check whether all inputs influence the output RFs equally. In particular, the case of the stringer RF (RF STR) is considered.

• Solution workflow:

1. We have a code that can compute RFs at any given point, so we may use a blackbox technique.

2.
We estimate feature scores with default settings and various budgets and seeds (see Section 4.5 for details) to check what budget gives GTIVE reliable estimates and how stable the estimates are (the Elementary Effects technique is taken by default, see Section 4.5).

3. Results for different budget sizes are presented in Table 7.14. For each budget size, 10 runs with different seeds were made to estimate the standard deviation of the results. One can see that the mean estimates are already quite reliable at 50 points, and the variance of the results reduces as the budget increases. One may also notice that RF STR is independent of the feature F1.

4. To validate the GTIVE results, we used the approximation error ratio measure (2.2) for RF STR. The results of this experiment are presented in Table 7.15 and show that the approximation errors are in agreement with the feature scores estimated by GTIVE.

              F1               F2               F3               G1
GTIVE         mean     std     mean     std     mean     std     mean     std
50 pnts       0        0       0,0477   0,0207  0,2713   0,0704  0,0732   0,0377
300 pnts      0        0       0,0602   0,0107  0,2889   0,0216  0,0703   0,008
1000 pnts     0        0       0,0624   0,0041  0,286    0,0129  0,0715   0,0049

              G2               G3               G4               G5
GTIVE         mean     std     mean     std     mean     std     mean     std
50 pnts       0,062    0,0197  0,1056   0,0354  0,3323   0,0836  0,1079   0,0233
300 pnts      0,0673   0,0074  0,1001   0,0123  0,3135   0,0296  0,0998   0,0157
1000 pnts     0,0674   0,0037  0,1043   0,0062  0,3089   0,0136  0,0996   0,0058

Table 7.14: Stringer stress analysis. Feature scores estimated by GTIVE

• Results:

– GTIVE showed that the RF STR value is independent of the feature F1.
– Using as few points as possible, GTIVE was able to reliably estimate the relative importance of each feature.

                                                       F1      F2      F3       G1
GTIVE Score                                            0       0,0624  0,286    0,0715
Approx. error if fixing feature / full model error     0,98    22,83   126,04   29,85

                                                       G2      G3      G4       G5
GTIVE Score                                            0,0674  0,1043  0,3089   0,0996
Approx. error if fixing feature / full model error     29,83   45,72   142,46   43,63

Table 7.15: Stringer stress analysis. Approximation error ratio
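The seed-averaging procedure behind Table 7.14 is easy to reproduce on a toy problem: run a randomized estimator several times with different seeds and report the per-feature mean and standard deviation of the scores. The estimator below is a deliberately crude one-at-a-time stand-in written for this sketch, not the Elementary Effects implementation used by GTIVE:

```python
import random
from statistics import mean, stdev

def noisy_scores(f, p, bounds, seed):
    """Stand-in for one randomized run: crude one-at-a-time sensitivity
    estimate from a small random design (illustrative, not the real method)."""
    rng = random.Random(seed)
    mu = [0.0] * p
    for _ in range(10):  # deliberately small budget, so scores are noisy
        x = [rng.uniform(lo, hi - 0.01) for lo, hi in bounds]
        fx = f(x)
        for i in range(p):
            xi = list(x)
            xi[i] += 0.01
            mu[i] += abs(f(xi) - fx) / 0.01
    total = sum(mu) or 1.0
    return [m / total for m in mu]

if __name__ == "__main__":
    f = lambda x: x[0] ** 2 + 3.0 * x[1]   # x3 is inert, like F1 above
    runs = [noisy_scores(f, 3, [(-1.0, 1.0)] * 3, seed) for seed in range(10)]
    for i in range(3):
        vals = [r[i] for r in runs]
        print(f"x{i + 1}: mean={mean(vals):.3f} std={stdev(vals):.3f}")
```

As in the table, an inert input keeps a zero score on every run (zero std), while the spread across seeds quantifies how much the remaining scores can be trusted at a given budget.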
7.2.3 Fuel System Analysis problem

• Problem description
The objective of the Research into Fuel Systems project is to deliver an application that can predict pressures and mass flows for gravity-feed aircraft fuel systems [6]. The desktop application comprises a two-phase flow (air and fuel) analysis engine that is derived from experimental observations. One of the tasks the MACROS models are used for in this project is to approximate the pressure loss coefficient and the volume flow quality of the fuel flow on the diaphragm section of the pipe using experimental data. The experimental data is a 244-point sample with 6 features describing the fuel flow (flow velocity (V), pressure after the diaphragm (P), temperature (T), densities of fuel (ρfuel) and air (ρair), and the ratio of diaphragm diameters (ri)) and two outputs: pressure loss coefficient (Cp) and volume flow quality (Q). We will use GTIVE to determine which features should be measured most accurately. This is very important for experimental design: if a feature is unimportant, then we need not run additional expensive experiments to explore the dependence of the outputs (Cp and Q) on it, and we can measure it with less precision.

• Solution workflow
1. We have a sample of experimental data, so a sample based technique is going to be used.
2. GTIVE scores were computed with the default settings (the Mutual Information Kraskov estimate was used in this case, see Section 4.2).
3. To validate the results we calculated the approximation error ratio measure (2.2) for both outputs. Table 7.16 shows that the GTIVE scores are in good agreement with the approximation errors.
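The idea behind the Kraskov mutual information estimate used in step 2 can be sketched as follows. This is a minimal illustration of the first estimator from [8] for one-dimensional features, not the GTIVE implementation, and the sample below is synthetic: the estimate combines the distance to the k-th nearest neighbor in the joint (feature, output) space with neighbor counts in the marginal spaces.

```python
# Illustrative Kraskov k-nearest-neighbor MI estimate (estimator 1 of [8]);
# a sketch of the idea behind Section 4.2, not the GTIVE implementation.
import numpy as np
from scipy.special import digamma

def kraskov_mi(x, y, k=3):
    """MI estimate between 1-D samples x and y using k nearest neighbors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    dx = np.abs(x[:, None] - x[None, :])
    dy = np.abs(y[:, None] - y[None, :])
    dz = np.maximum(dx, dy)              # Chebyshev distance in the joint space
    np.fill_diagonal(dz, np.inf)         # exclude each point from its own neighbors
    eps = np.sort(dz, axis=1)[:, k - 1]  # distance to the k-th nearest neighbor
    # Count neighbors strictly inside eps in each marginal space (minus self).
    nx = np.sum(dx < eps[:, None], axis=1) - 1
    ny = np.sum(dy < eps[:, None], axis=1) - 1
    return digamma(k) + digamma(n) - np.mean(digamma(nx + 1) + digamma(ny + 1))

# Synthetic check: a feature the output depends on gets a much larger
# MI score than an independent one.
rng = np.random.default_rng(0)
x = rng.normal(size=200)                  # influential feature
z = rng.normal(size=200)                  # feature independent of the output
y = x + 0.01 * rng.normal(size=200)       # output strongly driven by x
mi_dep = kraskov_mi(x, y)
mi_ind = kraskov_mi(z, y)
```

A strongly coupled pair yields a large positive estimate, while an independent pair stays near zero, which is exactly how the scores in Table 7.16 separate important features from unimportant ones.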
Q
Feature                                              V        P        T        ρair     ρfuel    ri
GTIVE score                                          0.7204   0.925    0.2697   0.0688   0.1731   0.6628
Approx. error if fixing feature / full model error   1.37     3.09     1.07     1.04     1.07     1.26

Cp
Feature                                              V        P        T        ρair     ρfuel    ri
GTIVE score                                          0.1888   0.1166   0.0843   0.0773   0.0944   0.4383
Approx. error if fixing feature / full model error   1.04     1.19     1.12     1.12     1.03     4.45

Table 7.16: Fuel System Analysis. Feature scores and approximation error ratio

• Results
– The values of the scores are in good correspondence with the approximation errors.

Bibliography

[1] E. Burnaev. Construction of the metamodels in support of stiffened panel optimization. In Proceedings of the conference MMR 2009 Mathematical Methods in Reliability, 2009.
[2] DATADVANCE, llc. MACROS Generic Tool for Approximation: User manual.
[3] DATADVANCE, llc. MACROS Generic Tool for Important Variable Extraction, 2011.
[4] S. Grihon. Application of response surface methodology to stiffened panel optimization. In Proceedings of the 47th AIAA/ASME/ASCE/AHS/ASC Structures, Structural Dynamics, and Materials Conference, 2006.
[5] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2008.
[6] E. Kitanin. Air Evolution Research in Fuel Systems 4. Technical report, IRIAS, 2010.
[7] I. Kononenko. An adaptation of Relief for attribute estimation in regression. 1997.
[8] A. Kraskov. Estimating mutual information. Physical Review E, 69, 2004.
[9] H. Liu. Relative entropy-based probabilistic sensitivity analysis methods for design under uncertainty, AIAA-2004-4589. In 10th AIAA/ISSMO Multidisciplinary Analysis and Optimization Conference, 2004.
[10] F. Maes. Multimodality image registration by maximization of mutual information. IEEE Transactions on Medical Imaging, 16:187–198, 1998.
[11] P. Qiu. Fast calculation of pairwise mutual information for gene regulatory network reconstruction.
Computer Methods and Programs in Biomedicine, 94(2):177–180, 2009.
[12] A. Saltelli. A quantitative model-independent method for global sensitivity analysis of model output. Technometrics, 41:39–56, 1999.
[13] A. Saltelli. Global Sensitivity Analysis: The Primer. Wiley, 2008.
[14] B. Schölkopf. Measuring statistical dependence with Hilbert-Schmidt norms. In Algorithmic Learning Theory, 3734:63–74, 2005.
[15] V. Schwieger. Variance-based sensitivity analysis for model evaluation in engineering surveys. Data Processing, pages 1–10, 2004.
[16] H. Sundar. Robust computation of mutual information using spatially adaptive meshes. In Proceedings of MICCAI'07, the 10th International Conference on Medical Image Computing and Computer-Assisted Intervention, Part I:950–958, 2007.
[17] G. Tourassia. Application of the mutual information criterion for feature selection in computer-aided diagnosis. Medical Physics, 28, 2001.
[18] M. Turner. A turbomachinery design tool for teaching design concepts for axial-flow fans, compressors and turbines. In Proceedings of GT2006, 2006.
[19] S. Vallaghe. A global sensitivity analysis of three- and four-layer EEG conductivity models. IEEE Transactions on Biomedical Engineering, 56:988–995, 2009.
Index

design of experiment, 4
feature score (sensitivity index), 4, 5
global sensitivity analysis, 4, 6
GTIVE (Generic Tool for Important Variable Extraction), 1
optimization, 4
surrogate model, 4
Options, 18
  Accelerator, 18
  Deterministic, 20, 21
  MinCurveNum, 20
  NumberOfCVFold, 18
  NumberOfNeighbors, 17
  NumberOfSearchCurves, 21
  RankTransform, 17, 18
  Seed, 20, 21
  SensitivityIndexesType, 19, 21
  SurrogateModelType, 19
  Technique, 24
  VarianceEstimateRequired, 13
Quality metrics:
  Approximation Error, 5
  Index of Variability, 5
Techniques:
  Black box based, 9
    Elementary Effects, 9, 19
    Extended FAST (Extended Fourier Amplitude Sensitivity Test), 9, 20
  Sample based, 6
    Linear regression (RidgeFS), 6, 16
    Mutual Information, 7
    SMBFAST, 8
