IBM SPSS Modeler 15 User’s Guide
Note: Before using this information and the product it supports, read the general information under Notices on p. 249. This edition applies to IBM SPSS Modeler 15 and to all subsequent releases and modifications until otherwise indicated in new editions. Adobe product screenshot(s) reprinted with permission from Adobe Systems Incorporated. Microsoft product screenshot(s) reprinted with permission from Microsoft Corporation. Licensed Materials - Property of IBM © Copyright IBM Corporation 1994, 2012. U.S.
Preface IBM® SPSS® Modeler is the IBM Corp. enterprise-strength data mining workbench. SPSS Modeler helps organizations to improve customer and citizen relationships through an in-depth understanding of data. Organizations use the insight gained from SPSS Modeler to retain profitable customers, identify cross-selling opportunities, attract new customers, detect fraud, reduce risk, and improve government service delivery.
Contents

1 About IBM SPSS Modeler 1
  IBM SPSS Modeler Products 1
  IBM SPSS Modeler
  IBM SPSS Modeler Server
  IBM SPSS Modeler Administration Console

  Changing the icon size for a stream
  Using the Mouse in IBM SPSS Modeler
  Using Shortcut Keys
  Printing
  Automating IBM SPSS Modeler

4 Understanding Data Mining

7 Building CLEM Expressions 105
  About CLEM 105
  CLEM Examples 108
  Values and Data Types 110
  Expressions and Conditions

  Logical Functions
  Numeric Functions
  Trigonometric Functions
  Probability Functions
  Bitwise Integer Operations
  Random Functions
  String Functions
  SoundEx Functions
  Date and Time Functions

  Deploying Streams 184
  Stream Deployment Options 185
  The Scoring Branch 188

10 Exporting to External Applications 195
  About Exporting to External Applications

  Customizing the Nodes Palette 223
  Customizing the Palette Manager 223
  Changing a Palette Tab View 228
  CEMI Node Management

C Notices 249
Index 252
Chapter 1 About IBM SPSS Modeler

IBM® SPSS® Modeler is a set of data mining tools that enable you to quickly develop predictive models using business expertise and deploy them into business operations to improve decision making. Designed around the industry-standard CRISP-DM model, SPSS Modeler supports the entire data mining process, from data to better business results. SPSS Modeler offers a variety of modeling methods taken from machine learning, artificial intelligence, and statistics.
2 Chapter 1 IBM SPSS Modeler Server SPSS Modeler uses a client/server architecture to distribute requests for resource-intensive operations to powerful server software, resulting in faster performance on larger data sets. SPSS Modeler Server is a separately-licensed product that runs continually in distributed analysis mode on a server host in conjunction with one or more IBM® SPSS® Modeler installations.
can be shared by multiple users, or accessed from the thin-client application IBM SPSS Modeler Advantage. You install the adapter on the system that hosts the repository.

IBM SPSS Modeler Editions

SPSS Modeler is available in the following editions.
4 Chapter 1 IBM SPSS Modeler Documentation Documentation in online help format is available from the Help menu of SPSS Modeler. This includes documentation for SPSS Modeler, SPSS Modeler Server, and SPSS Modeler Solution Publisher, as well as the Applications Guide and other supporting materials. Complete documentation for each product (including installation instructions) is available in PDF format under the \Documentation folder on each product DVD.
5 About IBM SPSS Modeler IBM SPSS Modeler Administration Console User Guide. Information on installing and using the console user interface for monitoring and configuring SPSS Modeler Server. The console is implemented as a plug-in to the Deployment Manager application. IBM SPSS Modeler Solution Publisher Guide. SPSS Modeler Solution Publisher is an add-on component that enables organizations to publish streams for use outside of the standard SPSS Modeler environment.
6 Chapter 1 Demos Folder The data files and sample streams used with the application examples are installed in the Demos folder under the product installation directory. This folder can also be accessed from the IBM SPSS Modeler 15 program group on the Windows Start menu, or by clicking Demos on the list of recent directories in the File Open dialog box.
Chapter 2 New Features New and Changed Features in IBM SPSS Modeler 15 From this release onwards, IBM® SPSS® Modeler has the following editions. IBM® SPSS® Modeler Professional is the new name for the existing SPSS Modeler product. IBM® SPSS® Modeler Premium is a separately-licensed product that provides additional features to those supplied by SPSS Modeler Professional. The new features for these editions are described in the following sections.
8 Chapter 2 Default settings for database connections. You can now specify default settings for SQL Server and Oracle database connections, as well as those already supported for IBM DB2 InfoSphere Warehouse. Stream properties and optimization redesign. The Options tab on the Stream Properties dialog box has been redesigned to group the options into categories. The Optimization options have also moved from User Options to Stream Properties.
9 New Features SQL generation enhancements. The Aggregate node now supports SQL generation for date, time, timestamp, and string data types, in addition to integer and real. With IBM Netezza databases, the Sample node supports SQL generation for simple and complex sampling, and the Binning node supports SQL generation for all binning methods except Tiles. In-database model scoring.
New features in IBM SPSS Modeler Premium

IBM® SPSS® Modeler Premium is a separately-licensed product that provides additional features to those supplied by IBM® SPSS® Modeler Professional. Previously, SPSS Modeler Premium included only IBM® SPSS® Modeler Text Analytics. The full set of SPSS Modeler Premium features is now as follows.
11 New Features The Netezza Time Series node analyzes time series data and can predict future behavior from past events. The Netezza Generalized Linear model expands the linear regression model so that the dependent variable is related to the predictor variables by means of a specified link function. Moreover, the model allows for the dependent variable to have a non-normal distribution.
Chapter 3 IBM SPSS Modeler Overview

Getting Started

As a data mining application, IBM® SPSS® Modeler offers a strategic approach to finding useful relationships in large data sets. In contrast to more traditional statistical methods, you do not necessarily need to know what you are looking for when you start. You can explore your data, fitting different models and investigating different relationships, until you find useful information.
Launching from the Command Line

You can use the command line of your operating system to launch IBM® SPSS® Modeler as follows:

- On a computer where IBM® SPSS® Modeler is installed, open a DOS, or command-prompt, window.
- To launch the SPSS Modeler interface in interactive mode, type the modelerclient command followed by the required arguments; for example: modelerclient -stream report.
To Connect to a Server

- On the Tools menu, click Server Login. The Server Login dialog box opens. Alternatively, double-click the connection status area of the SPSS Modeler window.
- Using the dialog box, specify options to connect to the local server computer or select a connection from the table. Click Add or Edit to add or edit a connection. For more information, see the topic Adding and Editing the IBM SPSS Modeler Server Connection on p. 14.
Note: You cannot edit a server connection that was added from IBM® SPSS® Collaboration and Deployment Services, since the name, port, and other details are defined in IBM SPSS Collaboration and Deployment Services.

Figure 3-3 Server Login Add/Edit Server dialog box

To Add Server Connections

- On the Tools menu, click Server Login. The Server Login dialog box opens.
- In this dialog box, click Add. The Server Login Add/Edit Server dialog box opens.
16 Chapter 3 Searching for Servers in IBM SPSS Collaboration and Deployment Services Instead of entering a server connection manually, you can select a server or server cluster available on the network through the Coordinator of Processes, available in IBM® SPSS® Collaboration and Deployment Services. A server cluster is a group of servers from which the Coordinator of Processes determines the server best suited to respond to a processing request.
- Edit options.cfg, located in the /config directory of your SPSS Modeler installation directory. Edit the temp_directory parameter in this file to read: temp_directory, "C:/spss/servertemp".
- After doing this, you must restart the SPSS Modeler Server service. You can do this by clicking the Services tab on your Windows Control Panel. Stop the service and then start it to activate the changes you made. Restarting the machine will also restart the service.
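The edit to options.cfg described above can be sketched in Python. This is only an illustration that works on the file contents as a string; it assumes that each setting occupies one line in the form name, comma, quoted value (as in the example in the text), and the function name and the `some_other_option` entry are hypothetical.

```python
def set_temp_directory(config_text: str, new_dir: str) -> str:
    """Return config_text with its temp_directory entry pointing at new_dir."""
    # Entry format assumed from the example in the text: name, comma, quoted value.
    entry = f'temp_directory, "{new_dir}"'
    lines = config_text.splitlines()
    replaced = False
    for i, line in enumerate(lines):
        if line.strip().startswith("temp_directory,"):
            lines[i] = entry
            replaced = True
    if not replaced:
        # No existing entry was found, so append one.
        lines.append(entry)
    return "\n".join(lines) + "\n"


# Hypothetical file contents; some_other_option is a placeholder, not a real setting.
sample = 'some_other_option, 1\ntemp_directory, ""\n'
updated = set_temp_directory(sample, "C:/spss/servertemp")
print(updated)
```

The change takes effect only after the SPSS Modeler Server service is restarted, as described in the step above.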
18 Chapter 3 This sequence of operations is known as a data stream because the data flows record by record from the source through each manipulation and, finally, to the destination—either a model or type of data output. Figure 3-5 A simple stream IBM SPSS Modeler Stream Canvas The stream canvas is the largest area of the IBM® SPSS® Modeler window and is where you will build and manipulate data streams.
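As a rough analogy for the record-by-record flow of a data stream, the following Python sketch chains generator functions in the way that nodes are chained on the canvas. It does not use any SPSS Modeler API; the function names and the sample records are hypothetical.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, float]


def source(rows: Iterable[Record]) -> Iterator[Record]:
    """Source step: emit records one at a time."""
    for row in rows:
        yield row


def select(records: Iterable[Record], keep: Callable[[Record], bool]) -> Iterator[Record]:
    """Record operation: pass on only the records that satisfy the condition."""
    for record in records:
        if keep(record):
            yield record


def derive(records: Iterable[Record], name: str, fn: Callable[[Record], float]) -> Iterator[Record]:
    """Field operation: add a new field computed from each record."""
    for record in records:
        yield {**record, name: fn(record)}


def table(records: Iterable[Record]) -> List[Record]:
    """Destination step: pull every record through the upstream steps."""
    return list(records)


rows = [
    {"age": 25, "income": 30000},
    {"age": 52, "income": 82000},
    {"age": 41, "income": 56000},
]
stream = derive(
    select(source(rows), lambda r: r["age"] > 30),
    "income_k",
    lambda r: r["income"] / 1000,
)
result = table(stream)
print(result)
```

Nothing is computed until the destination step asks for records, which loosely mirrors the way a stream does no work until it is run.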
19 IBM SPSS Modeler Overview Field Ops. Nodes perform operations on data fields, such as filtering, deriving new fields, and determining the measurement level for given fields. Graphs. Nodes graphically display data before and after modeling. Graphs include plots, histograms, web nodes, and evaluation charts. Modeling. Nodes use the modeling algorithms available in SPSS Modeler, such as neural nets, decision trees, clustering algorithms, and data sequencing. Database Modeling.
20 Chapter 3 Figure 3-8 Outputs tab The Models tab is the most powerful of the manager tabs. This tab contains all model nuggets, which contain the models generated in SPSS Modeler, for the current session. These models can be browsed directly from the Models tab or added to the stream in the canvas.
21 IBM SPSS Modeler Overview Figure 3-10 CRISP-DM view The Classes tab provides a way to organize your work in SPSS Modeler categorically—by the types of objects you create. This view is useful when taking inventory of data, streams, and models. Figure 3-11 Classes view IBM SPSS Modeler Toolbar At the top of the IBM® SPSS® Modeler window, you will find a toolbar of icons that provides a number of useful functions. Following are the toolbar buttons and their functions.
- Cut & move to clipboard
- Copy to clipboard
- Paste selection
- Undo last action
- Redo
- Search for nodes
- Edit stream properties
- Preview SQL generation
- Run current stream
- Run stream selection
- Stop stream (Active only while stream is running)
- Add SuperNode
- Zoom in (SuperNodes only)
- Zoom out (SuperNodes only)
- No markup in stream
- Insert comment
- Hide stream markup (if any)
- Show hidden stream markup
- Open stream in IBM® SPSS® Modeler Advantage

Stream markup consists of stream comments, mod
Customizing the Toolbar

You can change various aspects of the toolbar, such as:
- Whether it is displayed
- Whether the icons have tooltips available
- Whether it uses large or small icons

To turn the toolbar display on and off:
- On the main menu, click: View > Toolbar > Display

To change the tooltip or icon size settings:
- On the main menu, click: View > Toolbar > Customize
  Click Show ToolTips or Large Buttons as required.
24 Chapter 3 Figure 3-12 Maximized stream canvas As an alternative to closing the nodes palette, and the managers and project panes, you can use the stream canvas as a scrollable page by moving vertically and horizontally with the scrollbars at the side and bottom of the SPSS Modeler window. You can also control the display of screen markup, which consists of stream comments, model links, and scoring branch indications.
Figure 3-13 Changing the icon size

To scale the entire stream (stream properties method)
- From the main menu, choose Tools > Stream Properties > Options > Layout.
- Choose the size you want from the Icon Size menu.
- Click Apply to see the result.
- Click OK to save the change.

To scale the entire stream (menu method)
- Right-click the stream background on the canvas.
- Choose Icon Size and select the size you want.
26 Chapter 3 Using the Mouse in IBM SPSS Modeler The most common uses of the mouse in IBM® SPSS® Modeler include the following: Single-click. Use either the right or left mouse button to select options from menus, open pop-up menus, and access various other standard controls and options. Click and hold the button to move and drag nodes. Double-click. Double-click using the left mouse button to place nodes on the stream canvas and edit existing nodes. Middle-click.
Table 3-2 Supported shortcuts for old hot keys

Shortcut Key    Function
Ctrl+Alt+D      Duplicate node
Ctrl+Alt+L      Load node
Ctrl+Alt+R      Rename node
Ctrl+Alt+U      Create User Input node
Ctrl+Alt+C      Toggle cache on/off
Ctrl+Alt+F      Flush cache
Ctrl+Alt+X      Expand SuperNode
Ctrl+Alt+Z      Zoom in/zoom out
Delete          Delete node or connection

Printing

The following objects can be printed in IBM® SPSS® Modeler:
- Stream diagrams
- Graphs
- Tables
- Reports (from the Report node and Project Reports)
28 Chapter 3 revenue data or as complex as transforming web log data into a set of fields and records with usable information. For more information, see the topic About CLEM in Chapter 7 on p. 105. Scripting is a powerful tool for automating processes in the user interface. Scripts can perform the same kinds of actions that users perform with a mouse or a keyboard. You can set options for nodes and perform derivations using a subset of CLEM. You can also specify output and manipulate generated models.
Chapter 4 Understanding Data Mining

Data Mining Overview

Through a variety of techniques, data mining identifies nuggets of information in bodies of data. Data mining extracts information in such a way that it can be used in areas such as decision support, prediction, forecasts, and estimation. Data is often voluminous but of low value and with little direct usefulness in its raw form. It is the hidden information in the data that has value.
30 Chapter 4 Typically, you will use these facilities to identify a promising set of attributes in the data. These attributes can then be fed to the modeling techniques, which will attempt to identify underlying rules and relationships. Typical Applications Typical applications of data mining techniques include the following: Direct mail. Determine which demographic groups have the highest response rate. Use this information to maximize the response to future mailings. Credit scoring.
31 Understanding Data Mining together. It may not even be online. If it exists only on paper, data entry will be required before you can begin data mining. Check whether the data covers the relevant attributes The object of data mining is to identify relevant attributes, so including this check may seem odd at first. It is very useful, however, to look at what data is available and to try to identify the likely relevant factors that are not recorded.
32 Chapter 4 A Strategy for Data Mining As with most business endeavors, data mining is much more effective if done in a planned, systematic way. Even with cutting-edge data mining tools, such as IBM® SPSS® Modeler, the majority of the work in data mining requires a knowledgeable business analyst to keep the process on track.
33 Understanding Data Mining Figure 4-1 CRISP-DM process model The six phases include: Business understanding. This is perhaps the most important phase of data mining. Business understanding includes determining business objectives, assessing the situation, determining data mining goals, and producing a project plan. Data understanding. Data provides the “raw materials” of data mining.
34 Chapter 4 have been resolved adequately. Similarly, the evaluation phase can lead you to reevaluate your original business understanding, and you may decide that you have been trying to answer the wrong question. At this point, you can revise your business understanding and proceed through the rest of the process again with a better target in mind. The second key point is the iterative nature of data mining.
35 Understanding Data Mining Classification nodes The Auto Classifier node creates and compares a number of different models for binary outcomes (yes or no, churn or do not churn, and so on), allowing you to choose the best approach for a given analysis. A number of modeling algorithms are supported, making it possible to select the methods you want to use, the specific options for each, and the criteria for comparing the results.
36 Chapter 4 The PCA/Factor node provides powerful data-reduction techniques to reduce the complexity of your data. Principal components analysis (PCA) finds linear combinations of the input fields that do the best job of capturing the variance in the entire set of fields, where the components are orthogonal (perpendicular) to each other. Factor analysis attempts to identify underlying factors that explain the pattern of correlations within a set of observed fields.
37 Understanding Data Mining The Self-Learning Response Model (SLRM) node enables you to build a model in which a single new case, or small number of new cases, can be used to reestimate the model without having to retrain the model using all data. The Time Series node estimates exponential smoothing, univariate Autoregressive Integrated Moving Average (ARIMA), and multivariate ARIMA (or transfer function) models for time series data and produces forecasts of future performance.
preconditions. Apriori requires that input and output fields all be categorical but delivers better performance because it is optimized for this type of data.

The CARMA model extracts a set of rules from the data without requiring you to specify input or target fields. In contrast to Apriori, the CARMA node offers build settings for rule support (support for both antecedent and consequent) rather than just antecedent support.
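The difference between antecedent support and rule support can be illustrated with a small worked example in Python. The sketch assumes the usual definitions: antecedent support is the proportion of records that contain the antecedent, rule support is the proportion that contain both antecedent and consequent, and confidence is the ratio of the two. The function name and the basket data are hypothetical, and the code is not the Apriori or CARMA algorithm itself.

```python
from typing import Dict, List, Set


def rule_measures(transactions: List[Set[str]], antecedent: Set[str], consequent: Set[str]) -> Dict[str, float]:
    """Compute support and confidence measures for the rule antecedent -> consequent."""
    n = len(transactions)
    # Records in which every antecedent item is present.
    with_antecedent = sum(1 for t in transactions if antecedent <= t)
    # Records in which antecedent and consequent items are all present.
    with_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    return {
        "antecedent_support": with_antecedent / n,
        "rule_support": with_both / n,
        "confidence": with_both / with_antecedent if with_antecedent else 0.0,
    }


baskets = [
    {"bread", "cheese"},
    {"bread", "wine"},
    {"bread", "cheese", "wine"},
    {"wine"},
]
measures = rule_measures(baskets, {"bread"}, {"cheese"})
print(measures)
```

Here bread appears in three of the four baskets (antecedent support 0.75), but bread and cheese appear together in only two (rule support 0.5), so a threshold on rule support is the stricter of the two.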
39 Understanding Data Mining Segmentation nodes The Auto Cluster node estimates and compares clustering models, which identify groups of records that have similar characteristics. The node works in the same manner as other automated modeling nodes, allowing you to experiment with multiple combinations of options in a single modeling pass.
40 Chapter 4 Data Mining Examples The best way to learn about data mining in practice is to start with an example. A number of application examples are available in the IBM® SPSS® Modeler Applications Guide, which provides brief, targeted introductions to specific modeling methods and techniques. For more information, see the topic Application Examples in Chapter 1 on p. 5.
Chapter 5 Building Streams Stream-Building Overview Data mining using IBM® SPSS® Modeler focuses on the process of running data through a series of nodes, referred to as a stream. This series of nodes represents operations to be performed on the data, while links between the nodes indicate the direction of data flow. Typically, you use a data stream to read data into SPSS Modeler, run it through a series of manipulations, and then send it to a destination, such as a table or a viewer.
42 Chapter 5 Figure 5-1 Completed stream on the stream canvas This section contains more detailed information on working with nodes to create more complex data streams. It also discusses options and settings for nodes and streams. For step-by-step examples of stream building using the data shipped with SPSS Modeler (in the Demos folder of your program installation), see Application Examples on p. 5. Working with Nodes Nodes are used in IBM® SPSS® Modeler to help you explore data.
43 Building Streams A runnable node that processes stream data is known as a terminal node. A modeling or output node is a terminal node if it is located at the end of a stream or stream branch. You cannot connect further nodes to a terminal node. Note: You can customize the Nodes palette. For more information, see the topic Customizing the Nodes Palette in Chapter 12 on p. 223.
44 Chapter 5 Figure 5-2 Stream created by double-clicking nodes from the palettes To Connect Nodes Using the Middle Mouse Button On the stream canvas, you can click and drag from one node to another using the middle mouse button. (If your mouse does not have a middle button, you can simulate this by pressing the Alt key while dragging with the mouse from one node to another.
45 Building Streams Figure 5-5 Connected nodes When connecting nodes, there are several guidelines to follow.
46 Chapter 5 Disabling Nodes in a Stream Process nodes with a single input within streams can be disabled, with the result that the node is ignored during running of the stream. This saves you from having to remove or bypass the node and means you can leave it connected to the remaining nodes. You can still open and edit the node settings; however, any changes will not take effect until you enable the node again.
Figure 5-8 Connecting a new node between two connected nodes

- With the middle mouse button, click and drag the connection arrow into which you want to insert the node. Alternatively, you can hold down the Alt key while clicking and dragging to simulate a middle mouse button.

Figure 5-9 New stream

- Drag the connection to the node that you want to include and release the mouse button.

Note: You can remove new connections from the node and restore the original by bypassing the node.
Figure 5-10 Deleting the connection between nodes in a stream

To delete all connections to and from a node, do one of the following:
- Select the node and press F3.
- Select the node, and on the main menu click: Edit > Node > Disconnect

Setting Options for Nodes

Once you have created and connected nodes, there are several options for customizing nodes. Right-click a node and select one of the menu options.
49 Building Streams Figure 5-11 Pop-up menu options for nodes Click Edit to open the dialog box for the selected node. Click Connect to manually connect one node to another. Click Disconnect to delete all links to and from the node. Click Rename and Annotate to open the Annotations tab of the editing dialog box. Click New Comment to add a comment related to the node. For more information, see the topic Adding Comments and Annotations to Nodes and Streams on p. 78.
50 Chapter 5 Click Save Node to save the node’s details in a file. You can load node details only into another node of the same type. Click Store Node to store the selected node in a connected IBM SPSS Collaboration and Deployment Services Repository. Click Cache to expand the menu, with options for caching the selected node. Click Data Mapping to expand the menu, with options for mapping data to a new source or specifying mandatory fields.
Figure 5-12 Caching at the Type node to store newly derived fields

To Enable a Cache
- On the stream canvas, right-click the node and click Cache on the menu.
- On the caching submenu, click Enable.
- You can turn the cache off by right-clicking the node and clicking Disable on the caching submenu.

Caching Nodes in a Database

For streams run in a database, data can be cached midstream to a temporary table in the database rather than the file system.
Note: The following databases support temporary tables for the purpose of caching: DB2, Netezza, Oracle, SQL Server, and Teradata. Other databases will use a normal table for database caching. The SQL code can be customized for specific databases; contact Support for assistance.

To Flush a Cache

A white document icon on a node indicates that its cache is empty. When the cache is full, the document icon becomes solid green.
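The empty and full cache states can be pictured with a small Python sketch. It is a conceptual model only, not the product's implementation: the class and attribute names are hypothetical, and it assumes that a full cache is reused on later runs until it is flushed.

```python
from typing import Callable, Iterable, List, Optional


class CachedNode:
    """Toy model of a node whose output is cached after the first run."""

    def __init__(self, compute: Callable[[], Iterable[int]]) -> None:
        self._compute = compute
        self._cache: Optional[List[int]] = None
        self.upstream_runs = 0  # how many times the upstream work was performed

    @property
    def cache_full(self) -> bool:
        """True once data has been stored (the solid green icon in the text)."""
        return self._cache is not None

    def run(self) -> List[int]:
        if self._cache is None:
            # Cache is empty: do the upstream work and fill the cache.
            self.upstream_runs += 1
            self._cache = list(self._compute())
        return self._cache

    def flush(self) -> None:
        """Discard cached data so that the next run recomputes it."""
        self._cache = None


node = CachedNode(lambda: (x * x for x in range(5)))
first = node.run()
second = node.run()  # served from the cache, no recomputation
runs_before_flush = node.upstream_runs
node.flush()
third = node.run()   # recomputed after the flush
```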
53 Building Streams Figure 5-13 Data Preview from a model nugget From the Generate menu, you can create several types of nodes. Locking Nodes To prevent other users from amending the settings of one or more nodes in a stream, you can encapsulate the node or nodes in a special type of node called a SuperNode, and then lock the SuperNode by applying password protection. Working with Streams Once you have connected source, process, and terminal nodes on the stream canvas, you have created a stream.
54 Chapter 5 Figure 5-14 Streams tab in the managers pane with pop-up menu options From this tab, you can: Access streams. Save streams. Save streams to the current project. Close streams. Open new streams. Store and retrieve streams from an IBM SPSS Collaboration and Deployment Services repository (if available at your site). For more information, see the topic About the IBM SPSS Collaboration and Deployment Services Repository in Chapter 9 on p. 158.
55 Building Streams Logging and status. Options controlling SQL logging and record status. For more information, see the topic Setting SQL logging and record status options for streams on p. 63. Layout. Options relating to the layout of the stream on the canvas. For more information, see the topic Setting layout options for streams on p. 64.
56 Chapter 5 Figure 5-15 Setting general options for a stream Decimal symbol. Select either a comma (,) or a period (.) as a decimal separator. Grouping symbol. For number display formats, select the symbol used to group values (for example, the comma in 3,000.00). Options include none, period, comma, space, and locale-defined (in which case the default for the current locale is used). Encoding. Specify the stream default method for text encoding. (Note: Applies to Var.
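The effect of the Decimal symbol and Grouping symbol options on number display can be sketched as follows. This is a Python illustration, not the product's formatting code; the function name and its parameters are hypothetical, and the locale-defined grouping option is not covered.

```python
def format_number(value: float, decimals: int = 2,
                  decimal_symbol: str = ".", grouping_symbol: str = ",") -> str:
    """Format value with the chosen decimal and grouping symbols."""
    # Start from Python's standard form, for example 3,000.00.
    text = f"{value:,.{decimals}f}"
    # Swap the symbols through a placeholder so the two replacements do not collide.
    text = text.replace(",", "\0").replace(".", decimal_symbol)
    return text.replace("\0", grouping_symbol)


default_style = format_number(3000)
comma_decimal = format_number(3000, decimal_symbol=",", grouping_symbol=".")
no_grouping = format_number(3000, grouping_symbol="")
space_grouping = format_number(1234567.891, grouping_symbol=" ")
print(default_style, comma_decimal, no_grouping, space_grouping)
```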
57 Building Streams Maximum number of rows to show in Data Preview. Specify the number of rows to be shown when a preview of the data is requested for a node. For more information, see the topic Previewing Data in Nodes on p. 52. Maximum members for nominal fields. Select to specify a maximum number of members for nominal (set) fields after which the data type of the field becomes Typeless. This option is useful when working with large nominal fields.
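The rule behind the Maximum members for nominal fields option can be expressed as a short Python sketch. The function name is hypothetical, and the limit is passed explicitly because the text does not state a default.

```python
from typing import Hashable, Iterable


def measurement_for(values: Iterable[Hashable], max_members: int) -> str:
    """Return "Nominal" while the distinct values fit the limit, else "Typeless"."""
    distinct = set(values)
    return "Nominal" if len(distinct) <= max_members else "Typeless"


small_field = measurement_for(["red", "green", "blue", "red"], max_members=3)
large_field = measurement_for(["red", "green", "blue", "black"], max_members=3)
print(small_field, large_field)
```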
58 Chapter 5 Figure 5-17 Setting date and time options for a stream Import date/time as. Select whether to use date/time storage for date/time fields or whether to import them as string variables. Date format. Select a date format to be used for date storage fields or when strings are interpreted as dates by CLEM date functions. Time format. Select a time format to be used for time storage fields or when strings are interpreted as times by CLEM time functions. Rollover days/mins.
2-digit dates start from. Specify the cutoff year to add century digits for years denoted with only two digits. For example, specifying 1930 as the cutoff year will assume that 05/11/02 is in the year 2002. The same setting will use the 20th century for two-digit years of 30 or later; thus 05/11/73 is assumed to be in 1973.

Save As Default. The options specified apply only to the current stream. Click this button to set these options as the default for all streams.
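The cutoff-year rule can be worked through in a few lines of Python. This sketch reproduces the two examples given above; the function name is hypothetical, and it assumes that the two-digit year is mapped into the 100-year window that starts at the cutoff year.

```python
def expand_two_digit_year(two_digit_year: int, cutoff: int = 1930) -> int:
    """Map a two-digit year into the 100-year window that starts at cutoff."""
    century = cutoff - cutoff % 100       # 1900 for a cutoff of 1930
    year = century + two_digit_year
    if year < cutoff:
        # Falls before the cutoff, so it belongs to the following century.
        year += 100
    return year


year_02 = expand_two_digit_year(2)    # from 05/11/02
year_73 = expand_two_digit_year(73)   # from 05/11/73
print(year_02, year_73)
```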
60 Chapter 5 Decimal places (standard, scientific, currency). For number display formats, specifies the number of decimal places to be used when displaying or printing real numbers. This option is specified separately for each display format. Calculations in. Select Radians or Degrees as the unit of measurement to be used in trigonometric CLEM expressions. For more information, see the topic Trigonometric Functions in Chapter 8 on p. 139. Save As Default.
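To show why the Calculations in setting matters, the following Python sketch evaluates a sine with the angle interpreted in either unit. It uses Python's math module rather than CLEM, and the function name and `unit` parameter are hypothetical.

```python
import math


def sine(angle: float, unit: str = "radians") -> float:
    """Return the sine of angle, interpreting it in the given unit."""
    if unit == "degrees":
        angle = math.radians(angle)
    elif unit != "radians":
        raise ValueError("unit must be 'radians' or 'degrees'")
    return math.sin(angle)


as_degrees = sine(90, unit="degrees")   # a right angle
as_radians = sine(90, unit="radians")   # 90 radians, a very different angle
print(as_degrees, as_radians)
```

The same input value gives different results under the two settings, so an expression written with degrees in mind produces misleading output if the stream is set to radians.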
61 Building Streams Figure 5-19 Setting stream optimization options Note: Whether SQL pushback and optimization are supported depends on the type of database in use. For the latest information on which databases and ODBC drivers are supported and tested for use with IBM® SPSS® Modeler 15, see the corporate Support site at http://www.ibm.com/support. Enable stream rewriting. Select this option to enable stream rewriting in SPSS Modeler. Two types of rewriting are available, and you can select one or both.
62 Chapter 5 reduce network traffic and speed stream operations. Note that the Generate SQL check box must be selected for SQL optimization to have any effect. Optimize syntax execution. This method of stream rewriting increases the efficiency of operations that incorporate more than one node containing IBM® SPSS® Statistics syntax. Optimization is achieved by combining the syntax commands into a single operation, instead of running each as a separate operation. Optimize other execution.
63 Building Streams Setting SQL logging and record status options for streams These settings include various options controlling the display of SQL statements generated by the stream, and the display of the number of records processed by the stream. Figure 5-20 Setting SQL logging and record status options for a stream Display SQL in the messages log during stream execution. Specifies whether SQL generated while running the stream is passed to the message log.
64 Chapter 5 Reformat SQL for improved readability. Specifies whether SQL displayed in the log should be formatted for readability. Show status for records. Specifies when records should be reported as they arrive at terminal nodes. Specify a number that is used for updating the status every N records. Save As Default. The options specified apply only to the current stream. Click this button to set these options as the default for all streams.
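Updating the status every N records amounts to the following loop, shown here as a Python sketch. The function name and message wording are hypothetical and do not reproduce the product's actual status text.

```python
from typing import Iterable, List, Tuple


def run_with_status(records: Iterable[object], every_n: int) -> Tuple[int, List[str]]:
    """Consume records, noting a status message after every every_n records."""
    messages: List[str] = []
    count = 0
    for count, _record in enumerate(records, start=1):
        if count % every_n == 0:
            messages.append(f"{count} records processed")
    return count, messages


total, status_messages = run_with_status(range(2500), every_n=1000)
print(total, status_messages)
```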
65 Building Streams Stream scroll rate. Specify the scrolling rate for the stream canvas to control how quickly the stream canvas pane scrolls when a node is being dragged from one place to another on the canvas. Higher numbers specify a faster scroll rate. Icon name maximum. Specify a limit in characters for the names of nodes on the stream canvas. Icon size. Select an option to scale the entire stream view to one of a number of sizes between 8% and 200% of the standard icon size. Grid cell size.
Figure 5-22 Messages tab in stream properties dialog box

In addition to messages regarding stream operations, error messages are reported here. When stream running is terminated because of an error, this dialog box will open to the Messages tab with the error message visible. Additionally, the node with errors is highlighted in red on the stream canvas.
Figure 5-23 Stream running with error reported

If SQL optimization and logging options are enabled in the User Options dialog box, then information on generated SQL is also displayed. For more information, see the topic Setting optimization options for streams on p. 60. You can save messages reported here for a stream by clicking Save Messages on the Save button drop-down list (on the left, just below the Messages tab).
Figure 5-24 Viewing execution times for nodes in the stream

In the table of node execution times, the columns are as follows. Click a column heading to sort the entries into ascending or descending order (for example, to see which nodes have the longest execution times).

Terminal Node. The identifier of the branch to which the node belongs. The identifier is the name of the terminal node at the end of the branch.

Node Label. The name of the node to which the execution time refers.

Node Id.
Parameters can also be set for SuperNodes, in which case they are visible only to nodes encapsulated within that SuperNode.

To Set Stream and Session Parameters through the User Interface

- To set stream parameters, on the main menu, click: Tools > Stream Properties > Parameters
- To set session parameters, click Set Session Parameters on the Tools menu.

Figure 5-25 Setting parameters for the session

Prompt?.
Note that long name, storage, and type options can be set for parameters through the user interface only. These options cannot be set using scripts. Click the arrows at the right to move the selected parameter further up or down the list of available parameters. Use the delete button (marked with an X) to remove the selected parameter.
Figure 5-27 Specifying available values for a parameter

Type. Displays the currently selected measurement level. You can change this value to reflect the way that you intend to use the parameter in IBM® SPSS® Modeler.

Storage. Displays the storage type if known. Storage types are unaffected by the measurement level (continuous, nominal or flag) that you choose for work in SPSS Modeler. You can alter the storage type on the main Parameters tab.
Analytical Decision Management or Predictive Applications 5.x. All streams require a designated scoring branch before they can be deployed; additional requirements and options depend on the deployment type. For more information, see the topic Storing and Deploying Repository Objects in Chapter 9 on p. 160.

Viewing Global Values for Streams

Using the Globals tab in the stream properties dialog box, you can view the global values set for the current stream.
Searching for Nodes in a Stream

You can search for nodes in a stream by specifying a number of search criteria, such as node name, category and identifier. This feature can be especially useful for complex streams containing a large number of nodes.

To Search for Nodes in a Stream

- On the File menu, click Stream Properties (or select the stream from the Streams tab in the managers pane, right-click and then click Stream Properties on the pop-up menu).
- Click the Search tab.
Node category. Check this box and click a category on the list to search for a particular type of node. Process Node means a node from the Record Ops or Field Ops tab of the nodes palette; Apply Model Node refers to a model nugget.

Keywords include. Check this box and enter one or more complete keywords to search for nodes having that text in the Keywords field on the Annotations tab of the node dialog box. Keyword text that you enter must be an exact match.
Figure 5-30 Opening section of stream description

The stream description is displayed in the form of an HTML document consisting of a number of sections.

General Stream Information

This section contains the stream name, together with details of when the stream was created and last saved.

Description and Comments

This section includes any: Stream annotations (see Annotations on p.
Inputs. Lists the input fields together with their storage types (for example, string, integer, real and so on).

Outputs. Lists the output fields, including the additional fields generated by the modeling node, together with their storage types.

Parameters. Lists any parameters relating to the scoring branch of the stream and which can be viewed or edited each time the model is scored.
Exporting Stream Descriptions

You can export the contents of the stream description to an HTML file.

To export a stream description:

- On the main menu, click: File > Export Stream Description
- Enter a name for the HTML file and click Save.

Running Streams

Once you have specified the required options for streams and connected the required nodes, you can run the stream by running the data through nodes in the stream. There are several ways to run a stream within IBM® SPSS® Modeler.
Some nodes have further displays giving additional information about stream execution. These are displayed by selecting the corresponding row in the dialog box. The first row is selected automatically.

Working with Models

If a stream includes a modeling node (that is, one from the Modeling or Database Modeling tab of the nodes palette), a model nugget is created when the stream is run.
Figure 5-34 Stream with comments added

Others can then view these comments on-screen, or you can print out an image of the stream that includes the comments. You can list all the comments for a stream or SuperNode, change the order of comments in the list, edit the comment text, and change the foreground or background color of a comment. For more information, see the topic Listing Stream Comments on p. 84.
The appearance of the text box changes to indicate the current mode of the comment (or annotation shown as a comment), as the following table shows.

Table 5-1 Comment and annotation text box modes

Comment text box Annotation text box Mode Indicates Obtained by... Edit Creating a new comment or annotation, or double-clicking an existing one. Clicking the stream background after editing, or single-clicking an existing comment or annotation.
Figure 5-37 Comment in edit mode

When you click again, the border changes to solid lines to show that editing is complete.

Figure 5-38 Completed comment

Double-clicking a comment changes the text box to edit mode: the background changes to white and the comment text can be edited. You can also attach comments to SuperNodes.

Operations Involving Comments

You can perform a number of operations on comments.
  - Right-click the stream background and click New Comment on the pop-up menu.
  - Click the New Comment button in the toolbar.
- Enter the comment text (or paste in text from the clipboard).
- Click a node in the stream to save the comment.

To attach a comment to a node or nugget

- Select one or more nodes or nuggets on the stream canvas.
- Do one of the following:
  - On the main menu, click: Insert > New Comment
  - Right-click the stream background and click New Comment on the pop-up menu.
- Edit the comment text. You can use standard Windows shortcut keys when editing, for example Ctrl+C to copy text. Other options during editing are listed in the pop-up menu for the comment.
- Click outside the text box once to display the resizing controls, then again to complete the comment.

To resize a comment text box

- Select the comment to display the resizing controls.
- Click and drag a control to resize the box.
- Click outside the text box to save the change.
If the comment was originally a stream or SuperNode annotation that had been converted to a freestanding comment, the comment is deleted from the canvas but its text is retained on the Annotations tab for the stream or SuperNode.

To show or hide comments for a stream

- Do one of the following:
  - On the main menu, click: View > Comments
  - Click the Show/hide comments button in the toolbar.
Figure 5-39 Listing comments for a stream

Text. The text of the comment. Double-click the text to change the field to an editable text box.

Links. The name of the node to which the comment is attached. If this field is empty, the comment applies to the stream.

Positioning buttons. These move a selected comment up or down in the list.

Comment Colors.
- Click the Annotations tab.
- Select the Show annotation as comment check box.
- Click OK.

To convert a SuperNode annotation to a comment

- Double-click the SuperNode icon on the canvas.
- Click the Annotations tab.
- Select the Show annotation as comment check box.
- Click OK.

Annotations

Nodes, streams, and models can be annotated in a number of ways. You can add descriptive annotations and specify a custom name.
Figure 5-40 Annotations tab options

Name. Select Custom to adjust the autogenerated name or to create a unique name for the node as displayed on the stream canvas.

Tooltip text. (For nodes and model nuggets only) Enter text used as a tooltip on the stream canvas. This is particularly useful when working with a large number of similar nodes.

Keywords.
Show annotation as comment. (For stream and SuperNode annotations only) Check this box to convert the annotation to a freestanding comment that will be visible on the stream canvas. For more information, see the topic Adding Comments and Annotations to Nodes and Streams on p. 78.

ID. Displays a unique ID that can be used to reference the node for the purpose of scripting or automation. This value is automatically generated when the node is created and will not change.
Saving Multiple Stream Objects

When you exit IBM® SPSS® Modeler with multiple unsaved objects, such as streams, projects, or model nuggets, you will be prompted to save before completely closing the software. If you choose to save items, a dialog box will open with options for saving each object.

Figure 5-41 Saving multiple objects

- Select the check boxes for the objects that you want to save.
- Click OK to save each object in the required location.
Encrypting and Decrypting Information

When you save a stream, node, project, output file, or model nugget, you can encrypt it to prevent its unauthorized use. To do this, you select an extra option when saving, and add a password to the item being saved. This encryption can be set for any of the items that you save and adds extra security to them; it is not the same as the SSL encryption used if you are passing files between IBM® SPSS® Modeler and IBM® SPSS® Modeler Server.
- Models palette (.gen)
- Nodes (.nod)
- Output (.cou)
- Projects (.cpj)

Opening New Files

Streams can be loaded directly from the File menu.

- On the File menu, click Open Stream.

All other file types can be opened using the submenu items available on the File menu.
Map to. This method starts with the node to be introduced to the stream. First, right-click the node to introduce; then, using the Data Mapping > Map To option from the pop-up menu, select the node to which it should join. This method is particularly useful for mapping to a terminal node. Note: You cannot map to Merge or Append nodes. Instead, you should simply connect the stream to the Merge node in the normal manner.
Step 3: Replace the template source node. Using the Data Mapping option on the pop-up menu for the template source node, click Select Replacement Node, then select the source node for the replacement data.

Figure 5-45 Selecting a replacement source node

Step 4: Check mapped fields. In the dialog box that opens, check that the software is mapping fields properly from the replacement data source to the stream. Any unmapped essential fields are displayed in red.
Mapping between Streams

Similar to connecting nodes, this method of data mapping does not require you to set essential fields beforehand. With this method, you simply connect from one stream to another using Map to from the Data Mapping pop-up menu. This type of data mapping is useful for mapping to terminal nodes and copying and pasting between streams. Note: Using the Map to option, you cannot map to Merge nodes, Append nodes, or any type of source node.
To Set Essential Fields

- Right-click the source node of the template stream that will be replaced.
- On the menu, click: Data Mapping > Specify Essential Fields

Figure 5-48 Specifying essential fields

- Using the Field Chooser, you can add or remove fields from the list. To open the Field Chooser, click the icon to the right of the fields list.
Mapped. Lists the fields selected for mapping to template fields. These are the fields whose names may have to change to match the original fields used in stream operations. Click in the table cell for a field to activate a list of available fields. If you are unsure of which fields to map, it may be useful to examine the source data closely before mapping. For example, you can use the Types tab in the source node to review a summary of the source data.
Figure 5-51 ToolTip and custom node name

Insert values automatically into a CLEM expression. Using the Expression Builder, accessible from a variety of dialog boxes (such as those for Derive and Filler nodes), you can automatically insert field values into a CLEM expression. Click the values button on the Expression Builder to choose from existing field values.

Figure 5-52 Values button

Browse for files quickly.
Figure 5-53 Selecting the Demos folder from the list of recently-used directories

Minimize output window clutter. You can close and delete output quickly using the red X button at the top right corner of all output windows. This enables you to keep only promising or interesting results on the Outputs tab of the managers pane.

A full range of keyboard shortcuts is available for the software. For more information, see the topic Keyboard Accessibility in Appendix A on p. 238.
Chapter 6 Handling Missing Values

Overview of Missing Values

During the Data Preparation phase of data mining, you will often want to replace missing values in the data. Missing values are values in the data set that are unknown, uncollected, or incorrectly entered. Usually, such values are invalid for their fields. For example, the field Sex should contain the values M and F.
Figure 6-1 Specifying missing values for a continuous variable

Reading in mixed data. Note that when you are reading in fields with numeric storage (either integer, real, time, timestamp, or date), any non-numeric values are set to null or system missing. This is because, unlike some applications, IBM® SPSS® Modeler does not allow mixed storage types within a field.
In general terms, there are two approaches you can follow:

- You can exclude fields or records with missing values.
- You can impute, replace, or coerce missing values using a variety of methods.

Both of these approaches can be largely automated using the Data Audit node.
Screening or Removing Fields

To screen out fields with too many missing values, you have several options:

- You can use a Data Audit node to filter fields based on quality.
- You can use a Feature Selection node to screen out fields with more than a specified percentage of missing values and to rank fields based on importance relative to a specified target.
- Instead of removing the fields, you can use a Type node to set the field role to None.
The @ functions can be used in conjunction with the @FIELD function to identify the presence of blank or null values in one or more fields. The fields can simply be flagged when blank or null values are present, or they can be filled with replacement values or used in a variety of other operations.
Note on Discarding Records

When using a Select node to discard records, note that SPSS Modeler syntax uses three-valued logic and automatically includes null values in select statements. To exclude null values (system-missing) in a select expression, you must explicitly specify this by using and not in the expression.
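The effect of three-valued logic can be illustrated outside the product. The following is a minimal Python sketch, not CLEM and not the Select node's implementation: None stands in for a null, and the helper names (tri_not, tri_and, equals, select) are hypothetical. It assumes, as the note above describes, that a record is kept unless its condition is definitely false.

```python
from typing import Optional

# None stands in for a null (system-missing) value and for an "unknown" result.
Tri = Optional[bool]


def tri_not(a: Tri) -> Tri:
    """Three-valued NOT: unknown stays unknown."""
    return None if a is None else (not a)


def tri_and(a: Tri, b: Tri) -> Tri:
    """Three-valued AND: a definite False wins over unknown."""
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True


def equals(value, target) -> Tri:
    """Comparing a null with anything gives an unknown result."""
    return None if value is None else value == target


def select(records, condition):
    """Assumption: keep every record whose condition is not definitely False."""
    return [r for r in records if condition(r) is not False]


records = [{"Drug": "drugC"}, {"Drug": None}, {"Drug": "drugA"}]

# Condition alone: the record with a null Drug is still included.
loose = select(records, lambda r: equals(r["Drug"], "drugC"))

# Condition combined with an explicit "and not null" test: the null is excluded.
strict = select(
    records,
    lambda r: tri_and(equals(r["Drug"], "drugC"), tri_not(r["Drug"] is None)),
)
```

In this sketch, loose keeps the drugC record and the null record, while strict keeps only the drugC record, which is the reason for adding the explicit null test.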
Chapter 7 Building CLEM Expressions

About CLEM

The Control Language for Expression Manipulation (CLEM) is a powerful language for analyzing and manipulating the data that flows along IBM® SPSS® Modeler streams. Data miners use CLEM extensively in stream operations to perform tasks as simple as deriving profit from cost and revenue data or as complex as transforming web log data into a set of fields and records with usable information.
Figure 7-1 Derive node creating a new field based on a formula

CLEM expressions can also be used for global search and replace operations. For example, the expression @NULL(@FIELD) can be used in a Filler node to replace system-missing values with the integer value 0. (To replace user-missing values, also called blanks, use the @BLANK function.
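As an illustration of what such a fill operation does to a record, here is a small Python sketch. It is only an analogy: None plays the role of a system-missing value, and fill_nulls and the field names are hypothetical, not part of SPSS Modeler.

```python
def fill_nulls(record, fields, replacement=0):
    """Return a copy of record with None replaced in the chosen fields only."""
    return {
        name: (replacement if name in fields and value is None else value)
        for name, value in record.items()
    }


row = {"Age": None, "Income": 52000, "Region": None}

# Only Age and Income are selected for filling; Region is left untouched.
filled = fill_nulls(row, fields={"Age", "Income"})
```

The condition is tested once per selected field, which mirrors how @FIELD stands for each field chosen in the node.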
Figure 7-2 Filler node replacing system-missing values with 0

More complex CLEM expressions can also be created. For example, you can derive new fields based on a conditional set of rules.
Figure 7-3 Conditional Derive comparing values of one field to those of the field before it

CLEM Examples

To illustrate correct syntax as well as the types of expressions possible with CLEM, example expressions follow.

Simple Expressions

Formulas can be as simple as this one, which derives a new field based on the values of the fields After and Before:

(After - Before) / Before * 100.0

Notice that field names are unquoted when referring to the values of the field.
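The formula above is a percentage change. The following Python sketch evaluates the same arithmetic for two sample values; the function name percent_change and the sample numbers are illustrative assumptions.

```python
def percent_change(before: float, after: float) -> float:
    """Same arithmetic as (After - Before) / Before * 100.0."""
    return (after - before) / before * 100.0


increase = percent_change(50.0, 75.0)   # a rise from 50 to 75
decrease = percent_change(80.0, 60.0)   # a fall from 80 to 60
```

Multiplying by 100.0 rather than 100 keeps the result a real number; note that a Before value of zero would make the division undefined.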
Complex Expressions

Expressions can also be lengthy and more complex. The following expression returns true if the values of two fields ($KX-Kohonen and $KY-Kohonen) fall within the specified ranges. Notice that here the field names are single-quoted because the field names contain special characters.

('$KX-Kohonen' >= -0.2635771036148072 and '$KX-Kohonen' <= 0.3146203637123107 and '$KY-Kohonen' >= -0.18975617885589602 and '$KY-Kohonen' <= 0.
Frequently, special functions are used in combination, which is a commonly used method of flagging blanks in more than one field at a time.

@BLANK(@FIELD) -> T

Additional examples are discussed throughout the CLEM documentation. For more information, see the topic CLEM Reference Overview in Chapter 8 on p. 127.

Values and Data Types

CLEM expressions are similar to formulas constructed from values, field names, operators, and functions.
Characters: Always use single backquotes like this ` . For example, note the character d in the function stripchar(`d`,"drugA"). The only exception to this is when you are using an integer to refer to a specific character in a string. For example, note the character 5 in the function lowertoupper("druga"(5)) -> "A". Note: On a standard U.K. and U.S. keyboard, the key for the backquote character (grave accent, Unicode 0060) can be found just below the Esc key.
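The two examples above can be mimicked in Python to show what they compute. This is a sketch under assumptions: the functions below only imitate the CLEM functions of the same name, and char_at is a hypothetical helper for the "druga"(5) subscript, which counts positions from 1.

```python
def stripchar(char: str, string: str) -> str:
    """Remove every occurrence of a single character from a string."""
    return string.replace(char, "")


def char_at(string: str, position: int) -> str:
    """Return the character at a 1-based position, as in "druga"(5)."""
    if not 1 <= position <= len(string):
        raise IndexError("position is outside the string")
    return string[position - 1]


def lowertoupper(text: str) -> str:
    """Convert lowercase characters to uppercase."""
    return text.upper()


stripped = stripchar("d", "drugA")
fifth_upper = lowertoupper(char_at("druga", 5))
```

The explicit range check is a design choice of the sketch: it raises an error for an out-of-range position instead of silently wrapping around, as a negative Python index would.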
If you want to override precedence, or if you are in any doubt of the order of evaluation, you can use parentheses to make it explicit, for example:

sqrt(abs(Signal)) * (max(T1, T2) + Baseline)

Stream, Session, and SuperNode Parameters

Parameters can be defined for use in CLEM expressions and in scripting.
- Determining the length (number of characters) for a string variable: length(STRING).
- Checking the alphabetical ordering of string values: alphabefore(STRING1, STRING2).
- Removing leading or trailing white space from values: trim(STRING), trim_start(STRING), or trimend(STRING).
- Extracting the first or last n characters from a string: startstring(LENGTH, STRING) or endstring(LENGTH, STRING).
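The string operations listed above can be approximated in Python as follows. This is a sketch, not the CLEM implementation: the argument order follows the CLEM signatures shown, but details such as which characters count as white space are assumptions based on Python's own string methods.

```python
def length(string: str) -> int:
    return len(string)


def alphabefore(string1: str, string2: str) -> bool:
    """True when string1 sorts before string2."""
    return string1 < string2


def trim(string: str) -> str:
    return string.strip()        # leading and trailing white space


def trim_start(string: str) -> str:
    return string.lstrip()       # leading white space only


def trimend(string: str) -> str:
    return string.rstrip()       # trailing white space only


def startstring(n: int, string: str) -> str:
    return string[:n]            # first n characters


def endstring(n: int, string: str) -> str:
    return string[-n:] if n > 0 else ""   # last n characters


first_three = startstring(3, "referrerID")
last_two = endstring(2, "referrerID")
```

The guard in endstring is needed because a slice starting at -0 would return the whole string rather than an empty one.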
Figure 7-4 Filler node replacing system-missing values with 0

For more information, see the topic Functions Handling Blanks and Null Values in Chapter 8 on p. 156.
Calculating Time Passed

You can easily calculate the time passed from a baseline date using a family of functions similar to the following one. This function returns the time in months from the baseline date to the date represented by the date string DATE as a real number. This is an approximate figure, based on a month of 30.0 days.
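The approximation can be shown with a short Python sketch. The function name months_since and its parameters are hypothetical; the baseline is passed in explicitly because it is a stream setting, and the month length defaults to the figure quoted in the paragraph above.

```python
from datetime import date


def months_since(baseline: date, target: date, days_per_month: float = 30.0) -> float:
    """Approximate months between two dates using a fixed month length."""
    elapsed_days = (target - baseline).days
    return elapsed_days / days_per_month


one_month = months_since(date(2000, 1, 1), date(2000, 1, 31))   # 30 days apart
two_months = months_since(date(2000, 1, 1), date(2000, 3, 1))   # 60 days apart
```

Because a fixed month length is used, the result is a real number that drifts from calendar months over long spans, which is why the text calls it an approximate figure.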
You can also use a number of counting functions to obtain counts of values that meet specific criteria, even when those values are stored in multiple fields.
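To make the idea concrete, the sketch below imitates a few counting functions in Python over the values of several fields in one record. The function names follow the CLEM counting functions, but the bodies are an approximation: in particular, skipping nulls in the comparisons is an assumption of this sketch, and the field names are invented.

```python
def count_greater_than(item, values):
    """Count non-null values greater than item (nulls skipped: an assumption)."""
    return sum(1 for v in values if v is not None and v > item)


def count_less_than(item, values):
    """Count non-null values less than item (nulls skipped: an assumption)."""
    return sum(1 for v in values if v is not None and v < item)


def count_nulls(values):
    return sum(1 for v in values if v is None)


def count_non_nulls(values):
    return sum(1 for v in values if v is not None)


# One record whose answers are spread across four hypothetical fields.
record = {"card1": 3, "card2": None, "card3": 7, "card4": 3}
answers = [record[name] for name in ("card1", "card2", "card3", "card4")]

above_three = count_greater_than(3, answers)
missing = count_nulls(answers)
```

Collecting the field values into a list first corresponds to passing a list of field names as the LIST argument.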
Working with Multiple-Response Data

A number of comparison functions can be used to analyze multiple-response data, including:

- value_at
- first_index / last_index
- first_non_null / last_non_null
- first_non_null_index / last_non_null_index
- min_index / max_index

For example, suppose a multiple-response question asked for the first, second, and third most important reasons for deciding on a particular purchase (for example, price, personal recommendation, review,
functions. In addition, the Builder controls automatically add the proper quotes for fields and values, making it easier to create syntactically correct expressions.

Figure 7-5 Expression Builder dialog box

Note: The Expression Builder is not supported in scripting or parameter settings.
Accessing the Expression Builder

The Expression Builder is available in all nodes where CLEM expressions are used, including Select, Balance, Derive, Filler, Analysis, Report, and Table nodes. You can open it by clicking the calculator button just to the right of the formula field.
- Double-click or click the yellow arrow button to add the field or function to the expression field.
- Use the operand buttons in the center of the dialog box to insert the operations into the expression.

Selecting Functions

The function list displays all available CLEM functions and operators. Scroll to select a function from the list, or, for easier searching, use the drop-down list to display a subset of functions or operators.
After you have selected a group of functions, double-click to insert the functions into the expression field at the point indicated by the position of the cursor.

Selecting Fields, Parameters, and Global Variables

The field list displays all fields available at this point in the data stream. Scroll to select a field from the list. Double-click or click the yellow arrow button to add a field to the expression.
Viewing or Selecting Values

Field values can be viewed from a number of places in the system, including the Expression Builder, data audit reports, and when editing future values in a Time Intervals node. Note that data must be fully instantiated in a source or Type node to use this feature, so that storage, types, and values are known.
Checking CLEM Expressions

Click Check in the Expression Builder (lower right corner) to validate the expression. Expressions that have not been checked are displayed in red. If errors are found, a message indicating the cause is displayed.
Figure 7-12 Find/Replace dialog box

- With the cursor in a text area, press Ctrl+F to access the Find/Replace dialog box.
- Enter the text you want to search for, or choose from the drop-down list of recently searched items.
- Enter the replacement text, if any.
- Click Find Next to start the search.
- Click Replace to replace the current selection, or Replace All to update all or selected instances.
- The dialog box closes after each operation.
Characters   Matches
\0nn         The character with octal value 0nn (0 <= n <= 7)
\0mnn        The character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7)
\xhh         The character with hexadecimal value 0xhh
\uhhhh       The character with hexadecimal value 0xhhhh
\t           The tab character (‘\u0009’)
\n           The newline (line feed) character (‘\u000A’)
\r           The carriage-return character (‘\u000D’)
\f           The form-feed character (‘\u000C’)
\a           The alert (bell) character (‘\u0007’)
\e           The escape character (‘\u001B’
\cx
Boundary matchers   Matches
\Z                  The end of the input but for the final terminator, if any
\z                  The end of the input
Chapter 8 CLEM Language Reference

CLEM Reference Overview

This section describes the Control Language for Expression Manipulation (CLEM), which is a powerful tool used to analyze and manipulate the data used in IBM® SPSS® Modeler streams. You can use CLEM within nodes to perform tasks ranging from evaluating conditions or deriving values to inserting data into reports. For more information, see the topic About CLEM in Chapter 7 on p. 105.
For more information, see the topic Values and Data Types in Chapter 7 on p. 110. Additionally, these rules are covered in more detail in the following topics.

Integers

Integers are represented as a sequence of decimal digits. Optionally, you can place a minus sign (−) before the integer to denote a negative number, for example, 1234, 999, −77. The CLEM language handles integers of arbitrary precision. The maximum integer size depends on your platform.
Strings

Generally, you should enclose strings in double quotation marks. Examples of strings are "c35product2" and "referrerID". To indicate special characters in a string, use a backslash, for example, "\$65443". (To indicate a backslash character, use a double backslash, \\.) You can use single quotes around a string, but the result is indistinguishable from a quoted field ('referrerID'). For more information, see the topic String Functions on p. 141.
Format        Examples
DD/MM/YYYY    15/01/1963
MM/DD/YY      01/15/63
MM/DD/YYYY    01/15/1963
DD-MM-YY      15-01-63
DD-MM-YYYY    15-01-1963
MM-DD-YY      01-15-63
MM-DD-YYYY    01-15-1963
DD.MM.YY      15.01.63
DD.MM.YYYY    15.01.1963
MM.DD.YY      01.15.63
MM.DD.YYYY    01.15.1963
DD-MON-YY     15-JAN-63, 15-jan-63, 15-Jan-63
DD/MON/YY     15/JAN/63, 15/jan/63, 15/Jan/63
DD.MON.YY     15.JAN.63, 15.jan.63, 15.Jan.
DD-MON-YYYY
DD/MON/YYYY
DD.MON.YYYY
MON YYYY
q Q YYYY
ww WK YYYY
Format           Examples
MM.SS            55.58, 01.00
(H)H.(M)M.(S)S   12.1.12, 1.1.1, 22.12.12
(H)H.(M)M        12.23, 7.45, 22.7
(M)M.(S)S        55.58, 1.0

CLEM Operators

The following operators are available.

Operation   Comments
or          Used between two CLEM expressions. Returns a value of true if either is true or if both are true.
and         Used between two CLEM expressions. Returns a value of true if both are true.
=           Used between any two comparable items.
==
/=
/==
>
>=
<
<=
&&=_0
&&/=_0
+
><
-
*
Operation   Comments
&&          Used between two integers. The result is the bitwise ‘and’ of the integers INT1 and INT2.
&&~~        Used between two integers. The result is the bitwise ‘and’ of INT1 and the bitwise complement of INT2.
||          Used between two integers. The result is the bitwise ‘inclusive or’ of INT1 and INT2.
~~          Used in front of an integer. Produces the bitwise complement of INT.
||/&        Used between two integers.
INT1 << N
INT1 >> N
/
**
rem
div
Functions Reference

The following CLEM functions are available for working with data in IBM® SPSS® Modeler. You can enter these functions as code in a variety of dialog boxes, such as Derive and Set To Flag nodes, or you can use the Expression Builder to create valid CLEM expressions without memorizing function lists or field names.
Convention        Description
INT, INT1, INT2   Any integer, such as 1 or –77.
CHAR              A character code, such as `A`.
STRING            A string, such as "referrerID".
LIST              A list of items, such as ["abc" "def"].
ITEM              A field, such as Customer or extract_concept.
DATE              A date field, such as start_date, where values are in a format such as DD-MON-YYYY.
TIME              A time field, such as power_flux, where values are in a format such as HHMMSS.
Conversion Functions

Conversion functions allow you to construct new fields and convert the storage type of existing fields. For example, you can form new strings by joining strings together or by taking strings apart. To join two strings, use the operator ><. For example, if the field Site has the value "BRAMLEY", then "xx" >< Site returns "xxBRAMLEY". The result of >< is always a string, even if the arguments are not strings.
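The behavior of >< can be imitated in Python to show the "always a string" rule. This is a sketch: concat is a hypothetical name, and the way non-string values are rendered as text here is Python's, which is assumed rather than confirmed to match CLEM for every storage type.

```python
def concat(left, right) -> str:
    """Join two values as text, converting non-strings first."""
    return str(left) + str(right)


site = "BRAMLEY"
joined = concat("xx", site)

# Two integers are joined as text, not added as numbers.
joined_numbers = concat(3, 5)
```

The second call shows why the rule matters: joining 3 and 5 gives the string "35" rather than the number 8.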
Function                         Result
count_greater_than(ITEM1, LIST)  Integer
count_less_than(ITEM1, LIST)     Integer
count_not_equal(ITEM1, LIST)     Integer
count_nulls(LIST)                Integer
count_non_nulls(LIST)            Integer
date_before(DATE1, DATE2)        Boolean
first_index(ITEM, LIST)          Integer
first_non_null(LIST)             Any
first_non_null_index(LIST)       Integer
ITEM1 = ITEM2                    Boolean
ITEM1 /= ITEM2                   Boolean
ITEM1 < ITEM2                    Boolean
ITEM1 <= ITEM2                   Boolean
ITEM1 > ITEM2                    Boolean
ITEM1 >= ITEM2                   Boolean
last_index(ITEM, LIST)           Integer
Function                   Result   Description
max_n(LIST)                Number   Returns the maximum value from a list of numeric fields or null if all of the field values are null. For more information, see the topic Summarizing Multiple Fields in Chapter 7 on p. 115.
member(ITEM, LIST)         Boolean  Returns true if ITEM is a member of the specified LIST.
min(ITEM1, ITEM2)          Any
min_index(LIST)            Integer
min_n(LIST)                Number
time_before(TIME1, TIME2)  Boolean
value_at(INT, LIST)
Numeric Functions

CLEM contains a number of commonly used numeric functions.
Function      Result  Description
mean_n(LIST)  Number  Returns the mean value from a list of numeric fields or null if all of the field values are null.
sdev_n(LIST)  Number  Returns the standard deviation from a list of numeric fields or null if all of the field values are null.

Trigonometric Functions

All of the functions in this section either take an angle as an argument or return one as a result.
Function                       Result  Description
cdf_normal(NUM, MEAN, STDDEV)  Real    Returns the probability that a value from the normal distribution with the specified mean and standard deviation will be less than the specified number.
cdf_t(NUM, DF)                 Real    Returns the probability that a value from Student’s t distribution with the specified degrees of freedom will be less than the specified number.
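For readers who want to check a cdf_normal result by hand, the quantity described above is the standard normal cumulative distribution function. The Python sketch below computes it with the error function from the standard library; it is an independent calculation of the same probability, not the product's own code, and the function name simply mirrors the CLEM one.

```python
import math


def cdf_normal(num: float, mean: float, stddev: float) -> float:
    """P(X < num) for X drawn from a normal distribution (mean, stddev)."""
    if stddev <= 0:
        raise ValueError("stddev must be positive")
    z = (num - mean) / (stddev * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))


at_mean = cdf_normal(0.0, 0.0, 1.0)        # half the distribution lies below the mean
upper_tail_cut = cdf_normal(1.96, 0.0, 1.0)  # the familiar 97.5% point
```

Student's t distribution has no equivalent helper in the Python standard library, so cdf_t is not sketched here.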
Function               Result   Description
integer_bitcount(INT)  Integer  Counts the number of 1 or 0 bits in the two’s-complement representation of INT. If INT is non-negative, N is the number of 1 bits. If INT is negative, it is the number of 0 bits. Owing to the sign extension, there are an infinite number of 0 bits in a non-negative integer or 1 bits in a negative integer.
integer_leastbit(INT)  Integer
integer_length(INT)    Integer
testbit(INT, N)        Boolean
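The rule described for integer_bitcount can be sketched in Python, whose integers also behave as sign-extended two's-complement values. This is an illustration of the rule as stated in the table, not the CLEM implementation.

```python
def integer_bitcount(value: int) -> int:
    """Count 1 bits of a non-negative integer, or 0 bits of a negative one."""
    if value >= 0:
        return bin(value).count("1")
    # In two's complement, the 0 bits of a negative number are exactly
    # the 1 bits of its bitwise complement, which is non-negative.
    return bin(~value).count("1")


ones_in_five = integer_bitcount(5)       # 5 is binary 101
zeros_in_minus_eight = integer_bitcount(-8)   # -8 is binary ...11111000
```

Counting the finite side in each case avoids the infinite run of sign-extension bits that the description mentions.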
In CLEM, a string is any sequence of characters between matching double quotation marks ("string quotes"). Characters (CHAR) can be any single alphanumeric character. They are declared in CLEM expressions using single backquotes, such as `z`, `A`, or `2`. Characters that are out-of-bounds or negative indices to a string will result in undefined behavior. Note.
Function                                                   Result   Description
ismidstring(SUBSTRING, STRING)                             Integer  If SUBSTRING is a substring of STRING but does not start on the f
isnumbercode(CHAR)                                         Boolean
isstartstring(SUBSTRING, STRING)                           Integer
issubstring(SUBSTRING, N, STRING)                          Integer
issubstring(SUBSTRING, STRING)                             Integer
issubstring_count(SUBSTRING, N, STRING)                    Integer
issubstring_lim(SUBSTRING, N, STARTLIM, ENDLIM, STRING)    Integer
isuppercode(CHAR)                                          Boolean
last(CHAR)                                                 String
length(STRING)                                             Integer
144 Chapter 8 Function Result locchar(CHAR, N, STRING) Integer locchar_back(CHAR, N, STRING) Integer lowertoupper(CHAR) lowertoupper (STRING) CHAR or String matches Boolean replace(SUBSTRING, NEWSUBSTRING, STRING) String replicate(COUNT, STRING) String Description Used to identify the location of characters in symbolic fields. The function searches the string STRING for the character CHAR, starting the search at the Nth character of STRING.
145 CLEM Language Reference Function Result stripchar(CHAR,STRING) String skipchar(CHAR, N, STRING) Integer skipchar_back(CHAR, N, STRING) Integer startstring(LENGTH, STRING) String strmember(CHAR, STRING) Integer subscrs(N, STRING) CHAR substring(N, LEN, STRING) String substring_between(N1, N2, STRING) String trim(STRING) String trim_start(STRING) String trimend(STRING) String Description Enables you to remove specified characters from a string or field.
146 Chapter 8

Function                                    Result            Description
unicode_char(NUM)                           CHAR              Returns the character with Unicode value NUM.
unicode_value(CHAR)                         NUM               Returns the Unicode value of CHAR.
uppertolower(CHAR), uppertolower(STRING)    CHAR or String    Takes either a string or a character and returns a new item of the same type with any uppercase characters converted to their lowercase equivalents. Note: Remember to specify strings with double quotes and characters with single backquotes.
147 CLEM Language Reference Note: Date and time functions cannot be called from scripts.
148 Chapter 8 Function Result date_in_weeks(DATE) Real date_in_years(DATE) Real date_months_difference (DATE1, DATE2) Real datetime_date(YEAR, MONTH, DAY) Date datetime_day(DATE) Integer datetime_day_name(DAY) String datetime_hour(TIME) Integer datetime_in_seconds(TIME) datetime_in_seconds(DATE), datetime_in_seconds(DATETIME) Real datetime_minute(TIME) Integer datetime_month(DATE) Integer datetime_month_name (MONTH) datetime_now Real String Timestamp datetime_second(TIME) Integer
149 CLEM Language Reference Function datetime_time(ITEM) datetime_timestamp(YEAR, MONTH, DAY, HOUR, MINUTE, SECOND) datetime_timestamp(DATE, TIME) Result Time datetime_timestamp (NUMBER) Timestamp datetime_weekday(DATE) Integer datetime_year(DATE) Integer date_weeks_difference (DATE1, DATE2) Real date_years_difference (DATE1, DATE2) Real time_before(TIME1, TIME2) Boolean time_hours_difference (TIME1, TIME2) Real time_in_hours(TIME) Real time_in_mins(TIME) Real time_in_secs(TIME) Intege
150 Chapter 8 Function Result time_mins_difference(TIME1, TIME2) Real time_secs_difference(TIME1, TIME2) Integer Description Returns the time difference in minutes between the times or timestamps represented by TIME1 and TIME2, as a real number. If you select Rollover days/mins in the stream properties dialog box, a higher value of TIME1 is taken to refer to the previous day (or the previous hour, if only minutes and seconds are specified in the current format).
151 CLEM Language Reference

Sequence functions
Record indexing
Averaging, summing, and comparing values
Monitoring change—differentiation
@SINCE
Offset values
Additional sequence facilities

For many applications, each record passing through a stream can be considered as an individual case, independent of all others. In such situations, the order of records is usually unimportant. For some classes of problems, however, the record sequence is very important.
152 Chapter 8 For this reason, @SINCE does not evaluate its condition for the current record. Use a similar function, @SINCE0, if you want to evaluate the condition for the current record as well as previous ones; if the condition is true in the current record, @SINCE0 returns 0. Note: @ functions cannot be called from scripts.
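The difference between @SINCE and @SINCE0 can be illustrated with a small Python simulation over a list of records. This is a sketch under stated assumptions, not the product's implementation: the function name since_offsets is hypothetical, and None is used as a placeholder for records where the condition has not yet been true, because the value the product returns in that case is not covered in this passage.

```python
from typing import Any, Callable, List, Optional, Sequence


def since_offsets(
    records: Sequence[Any],
    condition: Callable[[Any], bool],
    include_current: bool = False,
) -> List[Optional[int]]:
    """Offset back to the most recent record where condition was true.

    include_current=False mimics @SINCE (current record not evaluated);
    include_current=True mimics @SINCE0 (returns 0 if true right now).
    """
    offsets: List[Optional[int]] = []
    last_true: Optional[int] = None  # position of the latest earlier match
    for position, record in enumerate(records):
        holds_now = condition(record)
        if include_current and holds_now:
            offsets.append(0)
        elif last_true is None:
            offsets.append(None)  # placeholder: no earlier match (assumption)
        else:
            offsets.append(position - last_true)
        if holds_now:
            last_true = position
    return offsets


if __name__ == "__main__":
    values = [1, 5, 2, 7, 3]
    print(since_offsets(values, lambda v: v > 4))
    print(since_offsets(values, lambda v: v > 4, include_current=True))
```

For the sample values, the record holding 7 gets offset 2 in the @SINCE style (the previous match was two records back) and 0 in the @SINCE0 style.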
153 CLEM Language Reference Function Result @MAX(FIELD, EXPR) Number @MAX(FIELD, EXPR, INT) Number @MIN(FIELD) Number @MIN(FIELD, EXPR) Number @MIN(FIELD, EXPR, INT) Number @OFFSET(FIELD, EXPR) Any Description Returns the maximum value for FIELD over the last EXPR records received so far, including the current record. FIELD must be the name of a numeric field. EXPR may be any expression evaluating to an integer greater than 0.
154 Chapter 8 Function Result @OFFSET(FIELD, EXPR, INT) Any @SDEV(FIELD) Real @SDEV(FIELD, EXPR) Real @SDEV(FIELD, EXPR, INT) Real @SINCE(EXPR) Any @SINCE(EXPR, INT) Any @SINCE0(EXPR) Any @SINCE0(EXPR, INT) Any @SUM(FIELD) Number Description Performs the same operation as the @OFFSET function with the addition of a third argument, INT, which specifies the maximum number of values to look back.
155 CLEM Language Reference

Function                  Result    Description
@SUM(FIELD, EXPR)         Number    Returns the sum of values for FIELD over the last EXPR records received by the current node, including the current record. FIELD must be the name of a numeric field. EXPR may be any expression evaluating to an integer greater than 0. If EXPR is omitted, or if it exceeds the number of records received so far, the sum over all of the records received so far is returned.
@SUM(FIELD, EXPR, INT)    Number
@THIS(FIELD)              Any
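The windowing behavior described for @SUM(FIELD, EXPR) can be sketched in Python. This is a simplified illustration, not the product's code: the name running_sum is hypothetical, and the window size is fixed for the whole run, whereas EXPR in CLEM is an expression.

```python
from collections import deque
from typing import Deque, Iterable, List, Optional


def running_sum(values: Iterable[float], window: Optional[int] = None) -> List[float]:
    """Sum over the last `window` values, including the current one.

    With window=None, or before `window` values have arrived, the sum
    covers every value received so far.
    """
    if window is not None and window <= 0:
        raise ValueError("window must be an integer greater than 0")
    recent: Deque[float] = deque(maxlen=window)  # drops the oldest value when full
    totals: List[float] = []
    for value in values:
        recent.append(value)
        totals.append(sum(recent))
    return totals


if __name__ == "__main__":
    print(running_sum([1, 2, 3, 4], window=2))
    print(running_sum([1, 2, 3, 4]))
```

A window larger than the number of records gives the same totals as omitting the window, which matches the fallback described above.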
156 Chapter 8

Function                 Result    Description
@GLOBAL_MIN(FIELD)       Number    Returns the minimum value for FIELD over the whole data set, as previously generated by a Set Globals node. FIELD must be the name of a numeric field. If the corresponding global value has not been set, an error occurs.
@GLOBAL_SDEV(FIELD)      Number    Returns the standard deviation of values for FIELD over the whole data set, as previously generated by a Set Globals node.
@GLOBAL_MEAN(FIELD)      Number
@GLOBAL_SUM(FIELD)       Number
157 CLEM Language Reference Special Fields Special functions are used to denote the specific fields under examination, or to generate a list of fields as input. For example, when deriving multiple fields at once, you should use @FIELD to denote “perform this derive action on the selected fields.” Using the expression log(@FIELD) derives a new log field for each selected field. Note: @ functions cannot be called from scripts.
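The idea behind log(@FIELD) is that one expression is applied to each selected field in turn, with @FIELD standing in for the field being processed. The Python sketch below imitates that on a single record. The function name, the "_log" suffix for the derived fields, and the use of the natural logarithm are assumptions made for illustration.

```python
import math
from typing import Dict, Iterable


def derive_log_fields(
    record: Dict[str, float],
    selected_fields: Iterable[str],
    suffix: str = "_log",  # hypothetical naming scheme for derived fields
) -> Dict[str, float]:
    """Return a copy of record with one derived log field per selected field."""
    derived = dict(record)
    for name in selected_fields:
        # The same expression is evaluated once for each selected field.
        derived[name + suffix] = math.log(record[name])
    return derived


if __name__ == "__main__":
    row = {"income": 1.0, "spend": math.e, "age": 40.0}
    print(derive_log_fields(row, ["income", "spend"]))
```

Fields that are not selected, such as age in the sample row, pass through unchanged.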
Chapter 9
Using IBM SPSS Modeler with a Repository

About the IBM SPSS Collaboration and Deployment Services Repository

IBM® SPSS® Modeler can be used in conjunction with an IBM SPSS Collaboration and Deployment Services repository, enabling you to manage the life cycle of data mining models and related predictive objects, and enabling these objects to be used by enterprise applications, tools, and solutions.
159 Using IBM SPSS Modeler with a Repository Figure 9-1 Objects in the IBM SPSS Collaboration and Deployment Services Repository Extensive Versioning and Search Support The repository provides comprehensive object versioning and search capabilities. For example, suppose that you create a stream and store it in the repository where it can be shared with researchers from other divisions.
160 Chapter 9 For more information, see the topic Connecting to the Repository on p. 161. Storing and Deploying Repository Objects Streams created in IBM® SPSS® Modeler can be stored in the repository just as they are, as files with the extension .str. In this way, a single stream can be accessed by multiple users throughout the enterprise. For more information, see the topic Storing Objects in the Repository on p. 164. It is also possible to deploy a stream in the repository.
161 Using IBM SPSS Modeler with a Repository Other Deployment Options While IBM SPSS Collaboration and Deployment Services offers the most extensive features for managing enterprise content, a number of other mechanisms for deploying or exporting streams are also available, including: Export the stream and model for later use with IBM® SPSS® Modeler Solution Publisher Runtime. Export one or more models in PMML, an XML-based format for encoding model information.
162 Chapter 9 Ensure secure connection. Specifies whether a Secure Sockets Layer (SSL) connection should be used. SSL is a commonly used protocol for securing data sent over a network. To use this feature, SSL must be enabled on the server hosting the repository. If necessary, contact your local administrator for details. Entering Credentials for the Repository Figure 9-3 Entering IBM SPSS Collaboration and Deployment Services Repository credentials User ID and password.
163 Using IBM SPSS Modeler with a Repository Figure 9-4 Browsing the IBM SPSS Collaboration and Deployment Services Repository contents The explorer window initially displays a tree view of the folder hierarchy. Click a folder name to display its contents. Objects that match the current selection or search criteria are listed in the right pane, with detailed information on the selected version displayed in the lower right pane. The attributes displayed apply to the most recent version.
164 Chapter 9 Storing Objects in the Repository Figure 9-5 Storing a model You can store streams, nodes, models, model palettes, projects, and output objects in the repository, from where they can be accessed by other users and applications. Note: A separate license is required to access an IBM® SPSS® Collaboration and Deployment Services repository. For more information, see http://www.ibm.
165 Using IBM SPSS Modeler with a Repository Choosing the Location for Storing Objects Figure 9-6 Choosing the location for storing an object Save in. Shows the current folder—the location where the object will be stored. Double-click a folder name in the list to set that folder as the current folder. Use the Up Folder button to navigate to the parent folder. Use the New Folder button to create a folder at the current level. File name. The name under which the object will be stored. Store.
166 Chapter 9 Figure 9-7 Adding information about the object Author. The username of the user creating the object in the repository. By default, this shows the username used for the repository connection, but you can change this name here. Version Label. Select a label from the list to indicate the object version, or click Add to create a new label. Avoid using the “[” character in the label. Ensure that no boxes are checked if you do not want to assign a label to this object version.
167 Using IBM SPSS Modeler with a Repository Assigning Topics to a Stored Object Topics are a hierarchical classification system for the content stored in the repository. You can choose from the available topics when storing objects, and users can also search for objects by topic. The list of available topics is set by repository users with the appropriate privileges (for more information, see the Deployment Manager User’s Guide).
168 Chapter 9 Figure 9-9 Setting security options for an object Principal. The repository username of the user or group who has access rights on this object. Permissions. The access rights that this user or group has for the object. Add. Enables you to add one or more users or groups to the list of those with access rights on this object. For more information, see the topic Adding a User to the Permissions List on p. 169. Modify.
169 Using IBM SPSS Modeler with a Repository Adding a User to the Permissions List Figure 9-10 Adding a user to the permissions list for an object Select provider. Choose a security provider for authentication. The repository can be configured to use different security providers; if necessary, contact your local administrator for more information. Find. Enter the repository username of the user or group you want to add, and click Search to display that name in the user list.
170 Chapter 9 Read. By default, a user or group that is not the object owner has only Read access rights to the object. Select the appropriate check boxes to add Write, Delete, and Modify Permissions access rights for this user or group. Storing Streams You can store a stream as a .str file in the repository, from where it can be accessed by other users. Note: For information on deploying a stream, to take advantage of additional repository features, see Deploying Streams on p. 184.
171 Using IBM SPSS Modeler with a Repository

► Specify connection settings to the repository if necessary. For more information, see the topic Connecting to the Repository on p. 161. For specific port, password, and other connection details, contact your local system administrator.
► In the Repository: Store dialog box, choose the folder where you want to store the object, specify any other information you want to record, and click the Store button.
172 Chapter 9

Storing Models and Model Palettes

You can store an individual model as a .gm file in the repository, from where it can be accessed by other users. You can also store the complete contents of the Models palette as a .gen file in the repository.

Storing a model

► Click the object on the Models palette in SPSS Modeler, and on the main menu click: File > Models > Store Model...
► Alternatively, right-click an object in the Models palette and click Store Model.
173 Using IBM SPSS Modeler with a Repository

or File > Projects > Retrieve Project...
or File > Outputs > Retrieve Output...
► Alternatively, right-click in the managers or project pane and click Retrieve on the pop-up menu.
► To retrieve a node, on the SPSS Modeler main menu click: Insert > Node (or SuperNode) from Repository...
► Specify connection settings to the repository if necessary. For more information, see the topic Connecting to the Repository on p. 161.
174 Chapter 9 Look in. Shows the folder hierarchy for the current folder. To navigate to a different folder, select one from this list to navigate there directly, or navigate using the object list below this field. Up Folder button. Navigates to one level above the current folder in the hierarchy. New Folder button. Creates a new folder at the current level in the hierarchy. File name. The repository file name of the selected object. To retrieve that object, click Retrieve. Files of type.
175 Using IBM SPSS Modeler with a Repository

► Select the object version you want to work with.
► Click Continue.

Searching for Objects in the Repository

You can search for objects by name, folder, type, label, date, or other criteria.

Searching by Name

To search for objects by name:
► On the IBM® SPSS® Modeler main menu click: Tools > Repository > Explore...
► Specify connection settings to the repository if necessary. For more information, see the topic Connecting to the Repository on p. 161.
176 Chapter 9 When searching for objects by name, an asterisk (*) can be used as a wildcard character to match any string of characters, and a question mark (?) matches any single character. For example, *cluster* matches all objects that include the string cluster anywhere in the name. The search string m0?_* matches M01_cluster.str and M02_cluster.str but not M01a_cluster.str. Searches are not case sensitive (cluster matches Cluster matches CLUSTER).
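The wildcard rules above resemble shell-style pattern matching. The following Python sketch uses the standard fnmatch module to reproduce the examples. It is an approximation for illustration, not the repository's search engine; one difference is that fnmatch also treats square brackets as character sets. The helper name name_matches is hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def name_matches(pattern: str, name: str) -> bool:
    """Case-insensitive match where * is any string and ? is one character."""
    # Lowercasing both sides makes the comparison case insensitive.
    return fnmatchcase(name.lower(), pattern.lower())


def search(pattern: str, names: Iterable[str]) -> List[str]:
    """Return the names that match the pattern, in their original order."""
    return [name for name in names if name_matches(pattern, name)]


if __name__ == "__main__":
    stored = ["M01_cluster.str", "M02_cluster.str", "M01a_cluster.str"]
    print(search("m0?_*", stored))
    print(search("*cluster*", stored))
```

The pattern m0?_* rejects M01a_cluster.str because the question mark consumes exactly one character, so the underscore in the pattern lines up with the letter a.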
177 Using IBM SPSS Modeler with a Repository Topics. You can search on models associated with specific topics from a list set by repository users with the appropriate privileges (for more information, see the Deployment Manager User’s Guide). To obtain the list, check this box, then click the Add Topics button that is displayed, select one or more topics from the list and click OK. Label. Restricts the search to specific object version labels. Dates.
178 Chapter 9

Figure 9-16 Locked object

To lock an object

► In the repository explorer window, right-click the required object.
► Click Lock.

To unlock an object

► In the repository explorer window, right-click the required object.
► Click Unlock.

Deleting Repository Objects

Before deleting an object from the repository, you must decide if you want to delete all versions of the object, or just a particular version.
179 Using IBM SPSS Modeler with a Repository

Figure 9-17 Select versions to delete

Managing Properties of Repository Objects

You can control various object properties from SPSS Modeler. You can:
View the properties of a folder
View and edit the properties of an object
Create, apply, and delete version labels for an object

Viewing Folder Properties

To view properties for any folder in the repository window, right-click the required folder. Click Folder Properties.
180 Chapter 9 Displays the folder name, creation, and modification dates. Permissions tab Specifies read and write permissions for the folder. All users and groups with access to the parent folder are listed. Permissions follow a hierarchy. For example, if you do not have read permission, you cannot have write permission. If you do not have write permission, you cannot have delete permission. Figure 9-19 Folder properties Users And Groups.
181 Using IBM SPSS Modeler with a Repository

► In the repository window, right-click the required object.
► Click Object Properties.

Figure 9-20 Object properties

General Tab

Name. The name of the object as viewed in the repository.
Created on. Date the object (not the version) was created.
Last modified. Date the most recent version was modified.
Author. The user’s login name.
Description. By default, this contains the description specified on the object’s Annotation tab in SPSS Modeler.
Linked topics.
182 Chapter 9 Figure 9-21 Version properties The following properties can be specified or modified for specific versions of a stored object: Version. Unique identifier for the version generated based on the time when the version was stored. Label. Current label for the version, if any. Unlike the version identifier, labels can be moved from one version of an object to another. The file size, creation date, and author are also displayed for each version. Edit Labels.
183 Using IBM SPSS Modeler with a Repository Figure 9-22 Object access rights Users And Groups. Lists the repository users and groups that have at least Read access to this object. Select the Write and Delete check boxes to add those access rights for this object to a particular user or group. Click the Add Users/Groups icon on the right side of the Permissions tab to assign access to additional users and groups. The list of available users and groups is controlled by the administrator.
184 Chapter 9

To define a new label and apply it to the object

► Type the label name in the New Label field.
► Click the right-arrow button to move the new label to the Applied Labels list.
► Click OK.

Deploying Streams

To enable a stream to be used with the thin-client application IBM® SPSS® Modeler Advantage, it must be deployed as a stream (.str file) in the repository. Whether a stream is deployed as a stream (.str file) or as a scenario (.
185 Using IBM SPSS Modeler with a Repository

Figure 9-23 Storing a stream in the repository

► In the Repository: Store dialog box, choose the folder where you want to store the object, specify any other information you want to record, and click the Store button. For more information, see the topic Setting Object Properties on p. 164.

Stream Deployment Options

The Deployment tab in the Stream Options dialog box allows you to specify options for deploying the stream.
186 Chapter 9

Figure 9-24 Stream Deployment options

Deployment type. Choose how you want to deploy the stream. All streams require a designated scoring node before they can be deployed; additional requirements and options depend on the deployment type.
<none>. The stream will not be deployed to the repository. All options are disabled except stream description preview.
Scoring Only. The stream is deployed to the repository when you click the Store button.
187 Using IBM SPSS Modeler with a Repository Scoring node. Select a graph, output or export node to identify the stream branch to be used for scoring the data. While the stream can actually contain any number of valid branches, models, and terminal nodes, one and only one scoring branch must be designated for purposes of deployment. This is the most basic requirement to deploy any stream. Scoring Parameters. Allows you to specify parameters that can be modified when the scoring branch is run.
188 Chapter 9 Scoring and Modeling Parameters When deploying a stream to IBM SPSS Collaboration and Deployment Services, you can choose which parameters can be viewed or edited each time the model is updated or scored. For example, you might specify maximum and minimum values, or some other value that may be subject to change each time a job is run.
189 Using IBM SPSS Modeler with a Repository Figure 9-26 Stream with scoring branch highlighted If the stream already had a scoring branch defined, the newly-designated branch replaces it as the scoring branch. You can set the color of the scoring branch indication by means of a Custom Color option. For more information, see the topic Setting Display Options in Chapter 12 on p. 220. You can show or hide the scoring branch indication by means of the Show/hide stream markup toolbar button.
190 Chapter 9

To designate a branch as the scoring branch (Tools menu)

► Connect the model nugget to a terminal node (a processing or output node downstream from the nugget).
► On the main menu, click: Tools > Stream Properties > Deployment
► On the Deployment type list, click Scoring Only or Model Refresh as required. For more information, see the topic Stream Deployment Options on p. 185.
► Click the Scoring node field and select a terminal node from the list.
► Click OK.
191 Using IBM SPSS Modeler with a Repository Single Model in Stream Figure 9-28 Scoring branch with single model in the stream If a single linked model nugget is on the scoring branch when it is identified as such, that nugget becomes the refresh model for the stream. Multiple Models in Stream If there is more than one linked nugget in the stream, the refresh model is chosen as follows.
192 Chapter 9 Figure 9-29 Scoring branch with more than one model in the stream You right-click the Analysis node and use its menu to set the scoring branch, which is now highlighted. Doing so also designates the model closest to the Analysis node as the refresh model, as indicated by the highlighted refresh link.
193 Using IBM SPSS Modeler with a Repository Figure 9-31 Scoring branch with refresh link switched to first model nugget If you subsequently deselect both model links as refresh links, only the scoring branch is highlighted, not the links. The deployment type is set to Scoring Only. Figure 9-32 Scoring branch with multiple models and no refresh links Note: You can choose to set one of the links to Replace status, but not the other one.
194 Chapter 9 No Models in Stream If there are no models in the stream, or only models with no model links, the deployment type is set to Scoring Only. Checking a Scoring Branch for Errors When you designate the scoring branch, it is checked for errors, such as not having an Enterprise View node in the stream when deploying as a scenario. Figure 9-33 Scoring branch with errors If an error is found, the scoring branch is highlighted in the scoring branch error color, and an error message is displayed.
Chapter 10
Exporting to External Applications

About Exporting to External Applications

IBM® SPSS® Modeler provides a number of mechanisms to export the entire data mining process to external applications, so that the work you do to prepare data and build models can be used to your advantage outside of SPSS Modeler as well.
196 Chapter 10

► Specify connection settings to the repository if necessary. For more information, see the topic Connecting to the Repository in Chapter 9 on p. 161. For specific port, password, and other connection details, contact your local system administrator.

Note: The repository server must also have the IBM SPSS Modeler Advantage software installed.
197 Exporting to External Applications E In the Export (or Save) dialog box, specify a target directory and a unique name for the model. Note: You can change options for PMML export in the User Options dialog box. On the main menu, click: Tools > Options > User Options and click the PMML tab. For more information, see the topic Setting PMML Export Options in Chapter 12 on p. 221.
198 Chapter 10 Figure 10-3 Selecting the XML file for a model saved using PMML Use variable labels if present in model. The PMML may specify both variable names and variable labels (such as Referrer ID for RefID) for variables in the data dictionary. Select this option to use variable labels if they are present in the originally exported PMML. If you have selected the variable label option but there are no variable labels in the PMML, the variable names are used as normal.
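The fallback from variable labels to variable names can be sketched by reading the data dictionary of a PMML file. The Python example below is an illustration only. It assumes that variable labels are carried in the displayName attribute of each DataField element, and the embedded PMML fragment is a trimmed sample written for this sketch rather than a complete exported model.

```python
import xml.etree.ElementTree as ET
from typing import List

# Trimmed sample fragment (assumption): a real export contains much more.
SAMPLE_PMML = """<PMML version="3.2" xmlns="http://www.dmg.org/PMML-3_2">
  <DataDictionary numberOfFields="2">
    <DataField name="RefID" displayName="Referrer ID" optype="continuous" dataType="integer"/>
    <DataField name="Age" optype="continuous" dataType="integer"/>
  </DataDictionary>
</PMML>"""


def field_names(pmml_text: str, use_labels: bool = True) -> List[str]:
    """List data dictionary fields, preferring labels when requested."""
    root = ET.fromstring(pmml_text)
    names: List[str] = []
    for element in root.iter():
        # Ignore the XML namespace so the check works across PMML versions.
        if element.tag.rsplit("}", 1)[-1] != "DataField":
            continue
        label = element.get("displayName")
        # Fall back to the variable name when no label is present.
        names.append(label if use_labels and label else element.get("name"))
    return names


if __name__ == "__main__":
    print(field_names(SAMPLE_PMML))
    print(field_names(SAMPLE_PMML, use_labels=False))
```

With labels enabled, RefID is shown as Referrer ID while Age, which has no label, keeps its variable name.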
199 Exporting to External Applications

Neural Net
C5.0
Logistic Regression
Genlin
SVM
Bayes Net
Apriori
Carma
K-Means
Kohonen
TwoStep
KNN
Statistics Model

The following model created in SPSS Modeler can be exported as PMML 3.2:
Decision List

Database native models. For models generated using database-native algorithms, PMML export is available for IBM InfoSphere Warehouse models only.
Chapter 11
Projects and Reports

Introduction to Projects

A project is a group of files related to a data mining task. Projects include data streams, graphs, generated models, reports, and anything else that you have created in IBM® SPSS® Modeler. At first glance, it may seem that SPSS Modeler projects are simply a way to organize output, but they are actually capable of much more. Using projects, you can:
Annotate each object in the project file.
201 Projects and Reports CRISP-DM View By supporting the Cross-Industry Standard Process for Data Mining (CRISP-DM), IBM® SPSS® Modeler projects provide an industry-proven and non-proprietary way of organizing the pieces of your data mining efforts. CRISP-DM uses six phases to describe the process from start (gathering business requirements) to finish (deploying your results).
202 Chapter 11

Classes View

The Classes view in the project pane organizes your work in IBM® SPSS® Modeler categorically by the types of objects created. Saved objects can be added to any of the following categories:
Streams
Nodes
Models
Tables, graphs, reports
Other (non-SPSS Modeler files, such as slide shows or white papers relevant to your data mining work)

Figure 11-3 Classes view

Adding objects to the Classes view also adds them to the default phase folder in the CRISP-DM view.
203 Projects and Reports

► On the main menu, click: File > Project > New Project...

Adding to a Project

Once you have created or opened a project, you can add objects, such as data streams, nodes, and reports, using several methods.

Adding Objects from the Managers

Using the managers in the upper right corner of the IBM® SPSS® Modeler window, you can add streams or output.
► Select an object, such as a table or a stream, from one of the manager tabs.
► Right-click, and click Add to Project.
204 Chapter 11

Adding Nodes from the Canvas

You can add individual nodes from the stream canvas by using the Save dialog box.
► Select a node on the canvas.
► Right-click, and click Save Node. Alternatively, on the main menu click: Edit > Node > Save Node...
► In the Save dialog box, select Add file to project.
► Create a name for the node and click Save. This saves the file and adds it to the project.

Nodes are added to the Nodes folder in Classes view and to the default phase folder in CRISP-DM view.
205 Projects and Reports

Transferring a Project

Make sure that the project you want to transfer is open in the project pane.

To transfer a project:
► Right-click the root project folder and click Transfer Project.
► If prompted, log in to IBM SPSS Collaboration and Deployment Services Repository.
► Specify the new location for the project and click OK.

Setting Project Properties

You can customize a project’s contents and documentation by using the project properties dialog box.
206 Chapter 11 Summary. You can enter a summary for your data mining project that will be displayed in the project report. Contents. Lists the type and number of components referenced by the project file (not editable). Save unsaved object as. Specifies whether unsaved objects should be saved to the local file system, or stored in the repository. For more information, see the topic About the IBM SPSS Collaboration and Deployment Services Repository in Chapter 9 on p. 158.
207 Projects and Reports Figure 11-6 Annotations tab in the project properties dialog box E Enter keywords and text to describe the project. Folder Properties and Annotations Individual project folders (in both CRISP-DM and Classes view) can be annotated. In CRISP-DM view, this can be an extremely effective way to document your organization’s goals for each phase of data mining.
208 Chapter 11 Figure 11-7 Project folder with CRISP-DM annotation Name. This area displays the name of the selected field. Tooltip text. Create custom ToolTips that will be displayed when you hover the mouse pointer over a project folder. This is useful in CRISP-DM view, for example, to provide a quick overview of each phase’s goals or to mark the status of a phase, such as “In progress” or “Complete.” Annotation field.
209 Projects and Reports Closing a Project When you exit IBM® SPSS® Modeler or open a new project, the existing project file (.cpj) is closed. Some files associated with the project (such as streams, nodes or graphs) may still be open. If you want to leave these files open, reply No to the message ... Do you want to save and close these files? If you modify and save any associated files after the close of a project, these updated versions will be included in the project the next time you open it.
210 Chapter 11

Figure 11-9 Generated report window

To generate a report:
► Select the project folder in either CRISP-DM or Classes view.
► Right-click the folder and click Project Report.
► Specify the report options and click Generate Report.
211 Projects and Reports Figure 11-10 Selecting options for a report The options in the report dialog box provide several ways to generate the type of report you need: Output name. Specify the name of the output window if you choose to send the output of the report to the screen. You can specify a custom name or let IBM® SPSS® Modeler automatically name the window for you. Output to screen. Select this option to generate and display the report in an output window.
212 Chapter 11 File type. Available file types are: HTML document. The report is saved as a single HTML file. If your report contains graphs, they are saved as PNG files and are referenced by the HTML file. When publishing your report on the Internet, make sure to upload both the HTML file and any images it references. Text document. The report is saved as a single text file. If your report contains graphs, only the filename and path references are included in the report.
213 Projects and Reports

The total number of nodes in each stream is listed within the report. The numbers are shown under the following headings, which use IBM® SPSS® Modeler terminology, not CRISP-DM terminology:
Data readers. Source nodes.
Data writers. Export nodes.
Model builders. Build, or Modeling, nodes.
Model appliers. Generated models, also known as nuggets.
Output builders. Graph or Output nodes.
Other. Any other nodes related to the project.
214 Chapter 11 Figure 11-11 Report displayed in a web browser
Chapter 12
Customizing IBM SPSS Modeler

Customizing IBM SPSS Modeler Options

There are a number of operations you can perform to customize IBM® SPSS® Modeler to your needs. Primarily, this customization consists of setting specific user options such as memory allocation, default directories, and use of sound and color. You can also customize the Nodes palette located at the bottom of the SPSS Modeler window.
216 Chapter 12 Figure 12-1 System Options dialog box Maximum memory. Select to impose a limit in megabytes on SPSS Modeler’s memory usage. On some platforms, SPSS Modeler limits its process size to reduce the toll on computers with limited resources or heavy loads. If you are dealing with large amounts of data, this may cause an “out of memory” error. You can ease memory load by specifying a new threshold. Use system locale. This option is selected by default and set to English (United States).
is the path used for all client-side operations and output files (if they are referenced with relative paths).

Set Server Directory. The Set Server Directory option on the File menu is enabled whenever there is a remote server connection. Use this option to specify the default directory for all server files and data files specified for input or output.
Figure 12-2 User Options dialog box, Notifications tab

Show stream execution feedback dialog. Select to display a dialog box that includes a progress indicator when a stream has been running for three seconds. The dialog box also includes details of the output objects created by the stream.

Close dialog upon completion. By default, the dialog box closes when the stream finishes running. Clear this check box if you want the dialog box to remain visible when the stream finishes.
Visual Notifications

The options in this group are used to specify the behavior of the Outputs and Models tabs in the managers pane at the top right of the display when new items are generated. Select New Model or New Output from the list to specify the behavior of the corresponding tab.

The following options are available for New Model:

Add model to stream. If selected (default), adds a new model to the stream, as well as to the Models tab, as soon as the model is built.
Select Always to always open a new output window.
Select If generated by current stream to open a new window for output generated by the stream currently visible in the canvas.
Select Never to prevent the software from automatically opening new windows for generated output.

Click Default Values to revert to the system default settings for this tab.
Standard Fonts & Colors (effective on restart). Options in this control box are used to specify the SPSS Modeler screen design, color scheme, and the size of the fonts displayed. Options selected here do not take effect until you close and restart SPSS Modeler.

Look and feel. Enables you to choose a standard color scheme and screen design. You can choose from SPSS Standard (default), a design common across IBM SPSS products.
Figure 12-4 User Options dialog box, PMML tab

Export PMML. Here you can configure variations of PMML that work best with your target application.

Select With extensions to allow PMML extensions for special cases where there is no standard PMML equivalent. Note that in most cases this will produce the same result as standard PMML.
Select As standard PMML... to export PMML that adheres as closely as possible to the PMML standard.

Standard PMML Options. When the As standard PMML...
Customizing the Nodes Palette

Streams are built using nodes. The Nodes Palette at the bottom of the IBM® SPSS® Modeler window contains all of the nodes that you can use to build streams. For more information, see the topic Nodes Palette in Chapter 3 on p. 18.

You can reorganize the Nodes Palette in two ways:

Customize the Palette Manager. For more information, see the topic Customizing the Palette Manager on p. 223.
Figure 12-6 Palette Manager showing the tabs displayed on the Nodes Palette

Palette Name. Each available palette tab, whether shown on the Nodes Palette or not, is listed. This includes any palette tabs that you have created. For more information, see the topic Creating a Palette Tab on p. 225.

No. of nodes. The number of nodes displayed on each palette tab. A high number here means you may find it more convenient to create subpalettes to divide up the nodes on the tab.
Creating a Palette Tab

Figure 12-7 Palette tab creation on the Create/Edit Palette dialog box

To create a custom palette tab:

► From the Tools menu, open the Palette Manager.
► To the right of the Shown? column, click the Add Palette button; the Create/Edit Palette dialog box is displayed.
► Type in a unique Palette name.
► In the Nodes available area, select the node to be added to the palette tab.
Figure 12-8 Palette Manager showing the tabs displayed on the Nodes Palette

To select which tabs are to be shown on the Nodes Palette:

► From the Tools menu, open the Palette Manager.
► Using the check boxes in the Shown? column, select whether to include or hide each palette tab.

To permanently remove a palette tab from the Nodes Palette, highlight the palette tab and click the Delete button to the right of the Shown? column. Once deleted, a palette tab cannot be recovered.
Figure 12-9 Subpalettes available for the Modeling Palette tab

To select subpalettes for display on a palette tab:

► From the Tools menu, open the Palette Manager.
► Select the palette that you require.
► Click the Sub Palettes button; the Sub Palettes dialog box is displayed.
► Using the check boxes in the Shown? column, select whether to include each subpalette on the palette tab. The All subpalette is always shown and cannot be deleted.
the palette tab. For example, if you created a palette tab that contains the nodes you use most frequently for creating your streams, you could create four subpalettes that break the selections down by source node, field operations, modeling, and output.

Note: You can select subpalette nodes only from those added to the parent palette tab.

Figure 12-10 Subpalette creation on the Create/Edit Sub Palette dialog box

To create a subpalette:

► From the Tools menu, open the Palette Manager.
To change the nodes shown on a palette tab, select the palette tab and then, from the menu on the left, select to display either all nodes, or just those in a specific subpalette.

Figure 12-11 Modeling palette tab showing the Classification subpalette

CEMI Node Management

CEMI is now deprecated and has been replaced by CLEF, which offers a much more flexible and easy-to-use feature set.
Chapter 13
Performance Considerations for Streams and Nodes

You can design your streams to maximize performance by arranging the nodes in the most efficient configuration, by enabling node caches when appropriate, and by paying attention to other considerations as detailed in this section. Aside from the considerations discussed here, additional and more substantial performance improvements can typically be gained by making effective use of your database, particularly through SQL optimization.
The following operations cannot be performed in most databases.
Figure 13-1 Caching at the Type node to store newly derived fields

To Enable a Cache

► On the stream canvas, right-click the node and click Cache on the menu.
► On the caching submenu, click Enable.
► You can turn the cache off by right-clicking the node and clicking Disable on the caching submenu.

Caching Nodes in a Database

For streams run in a database, data can be cached midstream to a temporary table in the database rather than the file system.
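The idea of caching midstream data in a temporary table can be illustrated outside SPSS Modeler. The sketch below uses Python's built-in sqlite3 module purely as an analogy: the table names sales and cache_derived and the derived column doubled are assumptions, and the SQL that SPSS Modeler actually generates is not shown in this guide and will differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 10.0), ("south", 20.0), ("north", 5.0)],
)

# "Cache" the derived data once in a temporary table inside the database,
# instead of pulling it out to the file system.
conn.execute(
    "CREATE TEMPORARY TABLE cache_derived AS "
    "SELECT region, amount, amount * 2 AS doubled FROM sales"
)

# Two downstream branches reuse the cached rows without recomputing them.
total = conn.execute("SELECT SUM(doubled) FROM cache_derived").fetchone()[0]
by_region = conn.execute(
    "SELECT region, COUNT(*) FROM cache_derived GROUP BY region ORDER BY region"
).fetchall()

print(total)
print(by_region)
```

The point of the analogy is that both downstream queries read the stored intermediate result, so the derivation runs only once.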
Note: The following databases support temporary tables for the purpose of caching: DB2, Netezza, Oracle, SQL Server, and Teradata. Other databases will use a normal table for database caching. The SQL code can be customized for specific databases; contact Support for assistance.

Performance: Process Nodes

Sort. The Sort node must read the entire input data set before it can be sorted.
Aggregate. When the Keys are contiguous option is not set, this node reads (but does not store) its entire input data set before it produces any aggregated output. In the more extreme situations, where the size of the aggregated data reaches a limit (determined by the SPSS Modeler Server configuration option Memory usage multiplier), the remainder of the data set is sorted and processed as if the Keys are contiguous option were set.
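The difference between the two aggregation modes can be sketched in Python. This is an illustration of the general technique, not the Aggregate node's implementation; the function names and sample records are assumptions. When records with the same key are adjacent, each group can be emitted as soon as the key changes. Otherwise, a running total per key must be kept until all input has been read.

```python
from itertools import groupby
from operator import itemgetter


def aggregate_contiguous(records):
    """Streaming sum that assumes records with the same key are adjacent.

    Each group is yielded as soon as the key changes, so only one
    group's total is held at a time.
    """
    for key, group in groupby(records, key=itemgetter(0)):
        yield key, sum(amount for _, amount in group)


def aggregate_unordered(records):
    """Sum for keys in any order.

    Keeps one running total per key and can report results only after
    the whole input has been read.
    """
    totals = {}
    for key, amount in records:
        totals[key] = totals.get(key, 0) + amount
    return list(totals.items())


contiguous = [("a", 1), ("a", 2), ("b", 3), ("c", 4), ("c", 5)]
shuffled = [("c", 4), ("a", 1), ("b", 3), ("a", 2), ("c", 5)]

print(list(aggregate_contiguous(contiguous)))
print(sorted(aggregate_unordered(shuffled)))
```

Note that aggregate_contiguous would produce duplicate groups if given the shuffled input, which is why sorting first is needed before contiguous-style processing.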
offset value is not a literal integer; for example, @OFFSET(Sales, Month). The offset value is the field name Month, whose value is unknown until the stream is executed. The server must save all values of the Sales field to ensure accurate results. Where an upper bound is known, you should provide it as an additional argument; for example, @OFFSET(Sales, Month, 12). This operation instructs the server to store no more than the 12 most recent values of Sales.
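The memory saving from supplying an upper bound can be sketched with a fixed-size buffer. The Python class below is a simplified illustration, not the server's implementation: the name BoundedOffset is hypothetical, None stands in for an undefined result, and only looking back at earlier records is modeled.

```python
from collections import deque


class BoundedOffset:
    """Keeps at most max_lookback previous values of a field."""

    def __init__(self, max_lookback):
        # A deque with maxlen silently discards the oldest value when full.
        self.history = deque(maxlen=max_lookback)

    def lookup(self, offset):
        """Return the value seen `offset` records ago, or None if unavailable."""
        if offset < 1 or offset > len(self.history):
            return None
        return self.history[-offset]

    def push(self, value):
        """Record the current value after it has been processed."""
        self.history.append(value)


# With a bound of 2, only the two most recent Sales values are retained.
buffer = BoundedOffset(max_lookback=2)
results = []
for sales, month_offset in [(100, 1), (110, 1), (120, 2), (130, 3)]:
    results.append(buffer.lookup(month_offset))
    buffer.push(sales)

print(results)
print(list(buffer.history))
```

Without a bound, the buffer would have to grow with every record, which corresponds to the server saving all values of the field.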
Appendix A
Accessibility in IBM SPSS Modeler

Overview of Accessibility in IBM SPSS Modeler

This release offers greatly enhanced accessibility for all users, as well as specific support for users with visual and other functional impairments. This section describes the features and methods of working using accessibility enhancements, such as screen readers and keyboard shortcuts.
Controlling the Automatic Launching of New Windows

The Notifications tab on the User Options dialog box is also used to control whether newly generated output, such as tables and charts, is launched in a separate window. It may be easier for you to disable this option and open an output window only when required.

► To set these options, on the Tools menu, click User Options.
► Click the Notifications tab.
► In the dialog box, select New Output from the list in the Visual Notifications group.
► Under Open Window, select Never.
Shortcut Key            Function
Ctrl+F6                 Moves focus to the stream canvas.
Ctrl+F7                 Moves focus to the managers pane.
Ctrl+F8                 Moves focus to the project pane.

Node and Stream Shortcuts

Shortcut Key            Function
Ctrl+N                  Creates a new blank stream canvas.
Ctrl+O
Ctrl+number keys
Ctrl+Down Arrow
Ctrl+Up Arrow
Enter
Ctrl+Enter
Alt+Enter
Shift+Spacebar
Ctrl+Shift+Spacebar
Left/Right Arrow
Up/Down Arrow
Alt+Left/Right Arrow
Alt+Up/Down Arrow
Ctrl+A
Ctrl+Q
Ctrl+W
Ctrl+Alt+D
Shortcut Key            Function
Ctrl+Alt+L              When a model nugget is selected in the stream, opens an Insert dialog box to enable you to load a saved model from a .nod file into the stream.
Ctrl+Alt+R              Displays the Annotations tab for a selected node, enabling you to rename the node.
Ctrl+Alt+U              Creates a User Input source node.
Ctrl+Alt+C
Ctrl+Alt+F
Tab
Shift+Tab
Ctrl+Tab
Any alphabetic key
F1
F2
F3
F6
F10
Shift+F10
Delete
Esc
Ctrl+Alt+X
Ctrl+Alt+Z
Ctrl+Alt+Shift+Z
Ctrl+E
Shortcut Key            Function
Ctrl+End                With focus on any control in the Expression Builder, this will move the insertion point to the end of the expression.
Ctrl+1                  In the Expression Builder, moves focus to the expression edit control.
Ctrl+2                  In the Expression Builder, moves focus to the function list.
Ctrl+3                  In the Expression Builder, moves focus to the field list.
Shortcuts for Comments

When working with on-screen comments, you can use the following shortcuts.

Shortcut Key            Function
Alt+C                   Toggles the show/hide comment feature.
Alt+M                   Inserts a new comment if comments are currently displayed; shows comments if they are currently hidden.
Tab                     On the stream canvas, cycles through all the source nodes and comments in the current stream.
Enter
Alt+Enter or Ctrl+Tab
Esc
Alt+Shift+Up Arrow
Alt+Shift+Down Arrow
Alt+Shift+Left Arrow
Alt+Shift+Right Arrow
Cluster Viewer only

The Cluster Viewer has a Clusters view that contains a cluster-by-features grid.

To choose the Clusters view instead of the Model Summary view:

► Press Tab repeatedly until the View button is selected.
► Press Down Arrow twice to select Clusters.

From here you can select an individual cell within the grid:

► Press Tab repeatedly until you arrive at the last icon in the visualization toolbar.
► Spacebar. Selects the Variable File node.
► Ctrl+Enter. Adds the Variable File node to the stream canvas. This key combination also keeps selection on the Variable File node so that the next node added will be connected to it.
► Tab. Moves focus back to the node palette.
► Right Arrow 4 times. Moves to the Derive node.
► Spacebar. Selects the Derive node.
► Alt+Enter. Adds the Derive node to the canvas and moves selection to the Derive node.
Use F3 to remove all connections for a selected node in the canvas.

Once you have created a stream, use Ctrl+E to run the current stream.

A complete list of shortcut keys is available. For more information, see the topic Shortcuts for Navigating the Main Window on p. 238.

Using a Screen Reader

A number of screen readers are available on the market.
Accessibility in the Interactive Tree Window

The standard display of a decision tree model in the Interactive Tree window may cause problems for screen readers. To open an accessible version, on the Interactive Tree menus click:

View > Accessible Window

This displays a view similar to the standard tree map, but one which JAWS can read correctly. You can move up, down, right, or left using the standard arrow keys.
Typing the first letter to find an element in a tree list. When looking for an element in the categories pane, extracted results pane, or library tree, you can type the first letter of the element when the pane has the focus. This will select the next occurrence of an element beginning with the letter you entered.

Drop-down lists. In a drop-down list for dialog boxes, you can use the Spacebar to select an item and then close the list.
Appendix B
Unicode Support

Unicode Support in IBM SPSS Modeler

IBM® SPSS® Modeler is fully Unicode-enabled for both IBM® SPSS® Modeler and IBM® SPSS® Modeler Server. This makes it possible to exchange data with other applications that support Unicode, including multi-language databases, without any loss of information that might be caused by conversion to or from a locale-specific encoding scheme.
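The kind of information loss mentioned above can be demonstrated with a short, generic Python example. It is independent of SPSS Modeler; the sample string and the choice of Latin-1 as the locale-specific encoding are assumptions made for illustration.

```python
# Text mixing German, Japanese, and Greek characters.
text = "Zürich, 東京, Ελλάδα"

# A round trip through a Unicode encoding preserves every character.
utf8_round_trip = text.encode("utf-8").decode("utf-8")

# A round trip through a locale-specific encoding (Latin-1 here) cannot
# represent the Japanese or Greek characters; they are replaced with "?".
latin1_round_trip = text.encode("latin-1", errors="replace").decode("latin-1")

print(utf8_round_trip == text)
print(latin1_round_trip == text)
print(latin1_round_trip)
```

Only the characters that exist in the target encoding survive the second round trip, which is the loss that a Unicode-enabled exchange avoids.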
Appendix C
Notices

This information was developed for products and services offered worldwide. IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used.
Such information may be available, subject to appropriate terms and conditions, including in some cases, payment of a fee. The licensed program described in this document and all licensed material available for it are provided by IBM under terms of the IBM Customer Agreement, IBM International Program License Agreement or any equivalent agreement between us. Any performance data contained herein was determined in a controlled environment.