
[<- Back to DPR Home](/data_processing)

Let's get started with a basic dpr-config.json configuration file.

```
{
  "run": {"log": "${!taskname}-${!datetime}.log", "params": {"custid": "CUST-1234"}},
  "settings": {
    "webConnections": [],
    "dbConnections": [],
    "extensionLibraryDir": ".\\lib"
  },
  "project": {
    "id": "P1",
    "name": "Log Analysis",
    "storageMethod": "file",
    "processTemplates": [],
    "processes": [
      {"name": "FindAllTradeEntriesInLog", "steps": [
        {"type": "importTextFile", "file": "${!basedir}\\inputs\\trades-fi-02162017.log", "lineDelimiter": ">>>>"},
        {"type": "parseField", "field": "Line", "delimiter": "|", "fieldDefn": [
          {"name": "Severity"}, {"name": "Thread"}, {"name": "SessionId"}, {"name": "LogTime"}, {"name": "Msg"}
        ]}
      ]}
    ],
    "tasklists": []
  }
}
```

The configuration file contains settings and defined processing steps.

## Run-time Parameters

1. `imports` - This is an array of strings, each containing another configuration file to be loaded as part of the defined project. Only processes, process templates, and tasks are imported from the config files listed here; settings are not imported.

```
{
  "imports": ["log-processes.json", "c:\\project\\shared-defns\\file-defns.json"],
  "settings": {},
  "project": {}
}
```

2. `run` - This defines the log filename as well as named parameters to be used during the run.

```
{
  "run": {"log": "${!taskname}-${!datetime}.log", "params": {"custid": "CUST-1234"}},
  "settings": {},
  "project": {}
}
```

## Settings

1. `webConnections` - Settings for authentication on websites, including BASIC and OAuth2 options. This is an array of items, each of which holds the settings for one site/URL.

````
"settings": {"webConnections": [
  {"name": "BayAreaRETS", "baseUrl": "http://mlslistings.com:8080/", "username": "dan", "password": "mypassword"}
]}
````

2. `dbConnections` - Settings for database connections, as shown in the example. This is an array of items, each of which holds the settings for one database connection.

````
"settings": {"dbConnections": [
  {"name": "TradesDB", "dbtype": "MySQL", "server": "localhost", "port": 3306, "databaseid": "trading", "username": "dan", "password": "mypassword"}
]}
````

The db types with built-in settings are: _MySQL_, _Oracle_, _SQLServer_, _Postgres_, _ODBC_, _DB2_, and _Sybase_. If you want to use another database, or the specified database is not working due to incompatibilities with the default database driver names, specify the driverClassName & url and remove the dbtype. The driver class name and URL are provided by the database vendor - the url below is applicable for MySQL. An example is as follows:

````
"settings": {"dbConnections": [
  {"name": "MyNewVendorDB", "driverClassName": "com.newvendor.jdbc.Driver", "url": "jdbc:mysql://<HOST>:<PORT>/<DBID>", "server": "localhost", "port": 3306, "databaseid": "trading", "username": "dan", "password": "mypassword"}
]}
````

3. `extensions` - This allows for adding JAR libraries to the run-time - such as database drivers - as well as adding custom modules. Module extensions are currently released for specific customer requirements and are not part of the core product. Note that the `libs` directory under your project base directory is the default, so any JAR libraries found there are automatically added as extensions.

```
{
  "settings": {
    "extensions": {
      "dir": "${!basedir}/libs",
      "modules": ["com.mycompany.dpr.MyExtension"]
    }
  }
}
```
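As an illustration of how these settings can work together, a minimal sketch might drop a vendor JDBC driver JAR into the extensions directory and reference its driver class from a connection. The directory, class name, URL, and connection name below are placeholders for illustration only:

````
{
  "settings": {
    "extensions": {"dir": "${!basedir}/libs"},
    "dbConnections": [
      {"name": "WarehouseDB",
       "driverClassName": "com.example.jdbc.Driver",
       "url": "jdbc:example://localhost:5000/warehouse",
       "username": "dan", "password": "mypassword"}
    ]
  }
}
````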
## Project

1. `defaultParams` - The list of name/value string parameters that you might need to reference commonly across the entire project. To use these, you can simply refer to them in your settings with *${parameterName}*.

````
"project": {
  "defaultParams": {"defaultDateFormat": "MM/dd/yyyy", "defaultTimeFormat": "MM/dd/yyyy HH:mm:ss Z"}
}
````

2. `processTemplates` - Templates hold a set of processing steps that might be shared across multiple processes. An example might be a common set of steps to source a log file and split it into its named parts, after which one process looks at performance log entries while another deals with errors. This is an array containing multiple templates.

````
"processTemplates": [
  {"name": "ImportLogFileToFields", "steps": [
    {"type": "importTextFile", "file": "${!basedir}\\inputs\\${filename}", "lineDelimiter": ">>>>"},
    {"type": "parseField", "field": "Line", "delimiter": "|", "fieldDefn": [
      {"name": "Severity"}, {"name": "Thread"}, {"name": "SessionId"}, {"name": "LogTime"}, {"name": "Msg"}
    ]}
  ]}
]
````

3. `processes` - Processes are the core processing construct for DPR. These have names and define a series of steps that are executed to come up with the expected output. This is an array containing multiple processes.

````
"processes": [
  {"name": "ParseLogFile", "steps": [
    {"type": "importTextFile", "file": "${!basedir}\\inputs\\${filename}", "lineDelimiter": ">>>>"},
    {"type": "parseField", "field": "Line", "delimiter": "|", "fieldDefn": [
      {"name": "Severity"}, {"name": "Thread"}, {"name": "SessionId"}, {"name": "LogTime"}, {"name": "Msg"}
    ]},
    {"type": "filter", "includes": [
      {"field": "Severity", "operator": "in", "value0": "SEVERE", "value1": "WARN", "value2": "INFO"},
      {"conditions": [
        {"field": "Msg", "operator": "contains", "value": "Got Request:"},
        {"field": "Severity", "operator": "equals", "value": "INFO"}]}
    ]}
  ]}
]
````

4. `tasklists` - These allow you to create a set of processes that run in order. This is an array that can contain multiple tasklist definitions.

````
"tasklists": [
  {"name": "RunCompare", "tasks": [
    {"type": "executeProcess", "failOnError": true, "process": "DownloadAndParseSource1"},
    {"type": "executeProcess", "failOnError": true, "process": "DownloadAndParseSource2"},
    {"type": "executeProcess", "failOnError": true, "process": "CompareTxns"}
  ]}
]
````

## Processing Steps

The available processing can be categorized as importing data, processing, and exporting data.

*Importing Data*

- [Text files](#import-text-file)
- [CSV/Tab delimited files](#import-csv-delimited-files)
- [Excel file](#import-excel-files)
- [Databases](#database-import)
- [Web services](#web-service)
- [XML source](#import-xml-files)
- [JSON source](#import-json-files)
- [Directory scanner](#directory-scanner)
- [Use existing source](#use-existing-dataset)
- [Combine data sources](#combine-data)

*Processing*

- [Parsing](#parsing)
- [Convert fields](#convert-fields)
- [Filtering](#filtering)
- [Summarize](#summarize-data)
- [Calculated fields](#calculated-fields)
- [Compare data](#compare-data)
- [Integrate data](#integrate-data)
- [Mapping](#mapping-data)
- [Parse XML](#parse-xml)

*Exporting Data*

- [Database inserts/updates](#database-inserts-updates)
- [Excel file](#output-to-excel)
- [CSV file](#output-to-csv)

*Other Helpful Tools*

- [Apply template](#apply-template)
- [Local sub-process](#local-sub-process)
- [Invoke Process](#invoke-process)
- [Distinct Values](#distinct-values)
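Before going through each step type, here is a minimal sketch of a process that chains an import, a processing step, and an export. The file, field, and process names are made up for illustration; each step type is described in detail below:

````
"processes": [
  {"name": "HighValueTrades", "steps": [
    {"type": "importCSVFile", "file": "${!basedir}\\inputs\\trades.csv", "treatFirstRowAsHeader": true},
    {"type": "filter", "include": {"field": "TradeAmount", "operator": "greaterThan", "value": "10000"}},
    {"type": "outputToCSV", "file": "${!basedir}\\outputs\\high-value-trades.csv"}
  ]}
]
````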
### Importing Data

#### Import Text File

The text file source will bring in lines of text. It also works with multi-line records, provided there's a unique delimiter for each record. This can be helpful when parsing logs where exceptions are logged, since those span multiple lines.

Example:

````
{"type": "importTextFile", "file": "${!basedir}\\inputs\\MyApplication.log", "delimiter": ">>>", "includeDelimiterInField": true}
````

| Parameter | Type | Description |
| --- | --- | --- |
| file | String | The path to the file to be imported. This can be a local file or an http/ftp based URL, if a corresponding web connection is set up for authentication settings. |
| delimiter | String | The string that separates multi-line records. The default is the line-break `\n`. |
| includeDelimiterInField | Boolean | Indicates whether the delimiter should be part of the line that follows the delimiter |
| numberOfHeaderLinesToIgnore | Number | Some files contain header records that don't match the rest of the file; these can be skipped |
| removeLineBreaks | Boolean | If specified as true, removes line break (`\n`) characters within each line |

*Output Fields*

- LineNum - the line number for that record; if the delimiter is not a line-break, it is the line number of the line as it would show up in a text editor
- Line - the string containing the line
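Putting a few of these options together, a sketch for importing a log whose multi-line records start with a '>>>' marker, skipping one banner line and flattening embedded line breaks (the file name and marker are illustrative):

````
{"type": "importTextFile",
 "file": "${!basedir}\\inputs\\app-server.log",
 "delimiter": ">>>",
 "numberOfHeaderLinesToIgnore": 1,
 "removeLineBreaks": true}
````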
#### Import CSV/Delimited Files

This source will import files with delimited field values using the specified delimiter. It also allows for headers in files. The resulting fields are captured as text, which can later be converted based on the appropriate formats.

Example:

````
{"type": "importCSVFile", "file": "${!basedir}\\inputs\\cust_101017.dat", "delimiter": "|"}
````

| Parameter | Type | Description |
| --- | --- | --- |
| file | String | The path to the file to be imported. This can be a local file or an http/ftp based URL, if a corresponding web connection is set up for authentication settings. |
| delimiter | Character | The character that separates each field value. The default is `,` if not specified |
| trimSpaces | Boolean | Indicates whether to remove leading and trailing spaces |
| treatFirstRowAsHeader | Boolean | If specified as true, then the values in the first row will be the resulting fields |
| readAheadNumRows | Number | The number of fields is determined from the first record, which may not contain all fields; use this to specify that additional rows should be read to determine the fields in the resulting output |

#### Import Excel Files

This source will read a single 'table' in an Excel sheet. Both .xls and .xlsx/.xlsm files can be read.

Example:

````
{"type": "readExcel", "file": "${!basedir}\\inputs\\TradesDump.xlsx", "worksheet": "Sheet1", "startCell": "A2", "endCell": "N2"}
````

| Parameter | Type | Description |
| --- | --- | --- |
| file | String | The path to the file to be imported. This can be a local file or an http/ftp based URL, if a corresponding web connection is set up for authentication settings. |
| worksheet | String | The name of the worksheet to import data from. Required. |
| startCell | String | The name of the cell where your data starts. An example would be "A2". |
| endCell | String | The name of the cell where your data ends. This can either be the last column cell of the starting row or the last cell of the last row. If you have a 3 column, 4 row table that spans A1:C4, you can specify "C1" or "C4". This is useful if you don't know how many rows your table will contain. |
| treatFirstRowAsHeader | Boolean | If specified as true, then the values in the first row will be the resulting fields |
| requiredColumn | String | The name of the column that must have a value, otherwise the row is skipped or the table is considered at an end. This is useful in the case of summary rows that may not contain data in certain columns. An example would be "A". |
| *Header Table* || Certain Excel files contain a header table with qualifying information, such as the date period of a report, or the client for that report. If you want to pull that information into your resulting data, specify the 2 options below. The data is expected to be in name/value format - either with column 1 being the name and column 2 the value, or row 1 being the name and row 2 containing the values. Otherwise, these need not be specified. |
| headerStartCell | String | The name of the starting cell in the header table. |
| headerCellCell | String | The ending cell of the header table. It is required only if the headerStartCell is specified. |
| *Transposed Data Columns* || For viewing purposes, Excel tables often contain tabulated data, such as Regions going down and Sales Period going across, with the cells containing the sales for each region in each sales period. Rather than having each sales period as a separate field, you might want a field called SalesPeriod with multiple rows, one per sales period for each region. If this is required, use the settings options below. Otherwise, these need not be specified. |
| dataColumnsStartCell | String | The name of the starting cell of the data-driven columns. If your table started at A1 with row 1 being the headers, and the sales periods starting from column B, you would specify "B1" as the start cell. |
| dataColumnsEndCellTitle | String | If you have columns after the data-driven columns, this specifies the title of the first column that should be pulled in normally vs. being transposed into rows. For example, if after your Month columns you had a column titled 'Sales Manager' containing data about the sales managers for each region, you could specify that title so the tool pulls in columns from that column onward normally. |
| dataColumnsFieldName | String | The field name that the header columns should be stored as. In the example, it would be "SalesPeriod". |
| dataColumnValueFieldName | String | The field name into which the values of the cells would be stored. In the example, this would be "Sales". |
| *Other Arcane Options* ||| 
| treatDatesAsGMT | Boolean | If an Excel file was created in a different country and this is specified as true, the dates will remain as they were entered. If this is omitted or specified as false, datetimes will be converted to the local timezone. |
| copyDataFromMergedCells | Boolean | If cells are merged and this is set to true, the cell contents will be copied to each row. This is useful when Excel files have summary columns that are merged for readability purposes. |

#### Database Import

This source will run a query against the database, with parameters if required.

Simple case:

````
{"type": "dbQuery", "dataSource": "TradesDB", "sql": "select * from trades"}
````

With parameters:

````
{"type": "dbQuery", "dataSource": "TradesDB", "sql": "select * from trades where trader = ?",
 "parameters": [
   {"name": "TraderId", "type": "string", "value": "jsmith"}
]}
````

| Parameter | Type | Description |
| --- | --- | --- |
| dataSource | String | The defined database connection in the settings area. Required. |
| sql | String | The SQL to run to get data. This can contain parameters based on `?` placeholders in the sql. Either the sql or the sqlFile is required. |
| sqlFile | String | The path to the file containing the SQL to run to get data. Either the sql or the sqlFile is required. |
| parameters | Collection | List of parameters that map to question marks in the SQL. This is useful if you are running this process/step based on another dataset as part of [Invoke Process](#invoke-process). |
| parameters[]/name | String | Name of the parameter. If this is a sub-process, this is the field name of the dataset in the calling process |
| parameters[]/type | [Field Type (string)](#field-types) | Type of the parameter, used to bind the value in the prepared statement SQL JDBC call. |
| parameters[]/value | String | Default value of the parameter, in case this is not invoked as a sub-process. |
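When the query is long, keeping it in a file can be more convenient. A sketch using `sqlFile` instead of inline `sql` (the path is illustrative):

````
{"type": "dbQuery", "dataSource": "TradesDB", "sqlFile": "${!basedir}\\sql\\open-trades.sql",
 "parameters": [
   {"name": "TraderId", "type": "string", "value": "jsmith"}
]}
````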
#### Web Service

This source will call a web service and parse out the response, which is expected to be in XML or JSON format. Basic, digest, and OAuth2 authentication mechanisms are supported. It relies on the webConnection being set up correctly, and identifies the webConnection by matching the baseUrl specified in the webConnection settings against the URL supplied in the web service config.

Example:

````
{"type": "webService", "url": "https://qvikly.com/lists/api/v1/datasets", "method": "POST", "importNodePath": "/datasets",
 "headers": [
   {"name": "Authorization", "value": "Bearer ${authsecret}"},
   {"name": "clientid", "value": "${clientid}"},
   {"name": "email", "value": "${email}"}
]}
````

| Parameter | Type | Description |
| --- | --- | --- |
| url | String | The url for the web service. This URL is matched against the list of web connections to see if the base URL on a web connection matches this url. Required. |
| method | String | Can be one of the following: _GET_, _POST_, _PUT_, _DELETE_, _HEAD_, and _OPTIONS_. If not specified, assumed to be a GET call. |
| headers | Collection | List of header parameters to be passed. This is not required. |
| headers[]/name | String | Name of the header parameter. |
| headers[]/value | String | Value of the header parameter. |
| form | Collection | List of form parameters to be passed. This is not required. |
| form[]/name | String | Name of the form parameter. |
| form[]/value | String | Value of the form parameter. |
| importNodePath | String | The xpath to the node(s) that you want to have in the resulting dataset. |

#### Import XML Files

This source will import an XML file based on the specified node(s) to import the data from.

Example:

````
{"type": "importXMLFile", "file": "${!basedir}\\inputs\\Trades.xml", "nodePath": "/Trades", "includeAttributes": true}
````

| Parameter | Type | Description |
| --- | --- | --- |
| file | String | The path to the file to be imported. This can be a local file or an http/ftp based URL, if a corresponding web connection is set up for authentication settings. |
| nodePath | String | The xpath to the node(s) from which the data will be flattened in the resulting dataset. |
| includeAttributes | Boolean | Whether to include attributes on the XML nodes or not. |

#### Import JSON Files

This source will import a JSON file based on the specified node(s) to import the data from.

Example:

````
{"type": "importJSON", "file": "${!basedir}\\inputs\\Trades.json", "nodePath": "/Trades"}
````

| Parameter | Type | Description |
| --- | --- | --- |
| file | String | The path to the file to be imported. This can be a local file or an http/ftp based URL, if a corresponding web connection is set up for authentication settings. |
| nodePath | String | The xpath to the node(s) from which the data will be flattened in the resulting dataset. |

#### Directory Scanner

This source can be used to get a list of files in a local directory. This is useful if you want to run the same file import process for multiple files and possibly combine the output.

Example:

````
{"type": "scanDirectory", "baseDirectory": "${!basedir}\\inputs\\AppLogs",
 "includes": "**/*.log, **/*.log.bak",
 "excludes": "admin*.log"
}
````

| Parameter | Type | Description |
| --- | --- | --- |
| baseDir | String | The local directory to be scanned to get the list of files. |
| includes | String | A comma separated list of expressions to match the files to |
| excludes | String | A comma separated list of expressions that exclude matching files; if a file matches the exclude criteria, it is excluded even if it also matches the includes |
| includeDirs | Boolean | Whether to include directories in the result or just files, defaults to false. |
| includeFiles | Boolean | Whether to include files in the result or just directories, defaults to true. |
| detailed | Boolean | Whether to include file details such as size, last modified, parent directory, and filename. Defaults to false. |
| includeFullPath | Boolean | Whether to add the full file path to the output dataset, defaults to true. |

*Output Fields*

- FilePath - the fully qualified path of the file
- ParentDirectory - the parent directory containing the file, included if 'detailed' is specified
- FileName - the name of the file, included if 'detailed' is specified
- LastModified - the date when the file was last modified, included if 'detailed' is specified
- Size - the size in bytes of the file, included if 'detailed' is specified
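A common pairing, shown in more detail under [Invoke Process](#invoke-process) below, is to scan a directory and then invoke a per-file process for each resulting record. A sketch, with the directory and process name purely illustrative:

````
{"name": "ProcessAllLogs", "steps": [
  {"type": "scanDirectory", "baseDirectory": "${!basedir}\\inputs\\AppLogs", "includes": "**/*.log"},
  {"type": "invokeProcess", "process": "ProcessLogFile", "map": [
    {"sourceField": "FilePath", "step": 0, "targetPath": "@file"}
  ]}
]}
````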
#### Use Existing Dataset

This source allows you to use a dataset from a process that has been run before. For example, if you have sourced some data and need to use it for 2 different purposes, then instead of recalculating it for both, you would create the dataset, which is cached to a data file. You would then reference that dataset and use it in your 2 processes.

Example:

````
{"type": "useData", "src": "Dataset1"}
````

| Parameter | Type | Description |
| --- | --- | --- |
| src | String | The name of the dataset to be used as the source for the remaining steps in the process. |

#### Combine Data

This source allows you to collate multiple datasets so they appear as one to anyone using this source.

Example:

````
{"type": "combineData", "primaryDataset": "Dataset1",
 "additionalDatasets": [
   {"name": "Dataset2", "mapping": [
     {"src": "ds2fld1", "tgt": "ds1fld1"},
     {"src": "ds2fld2", "tgt": "ds1fld2"}
   ]},
   {"name": "Dataset3", "mapping": [
     {"src": "ds3fld1", "tgt": "ds1fld1"},
     {"src": "ds3fld2", "tgt": "ds1fld2"}
   ]}
]}
````

| Parameter | Type | Description |
| --- | --- | --- |
| primaryDataset | String | The name of the dataset to be used as the first source and the one whose fields will be the basis to which the other datasets will be mapped. |
| includeSourceDatasetNameInResult | Boolean | If specified, a field to hold the name of the source dataset will be populated. |
| *additionalDatasets* || The list of other datasets to be included in the combined data. For each, a name is specified along with any special field mapping required. |
| additionalDatasets[]/name | String | Name of the additional dataset to be combined |
| additionalDatasets[]/mapping/src | String | Name of the field from the dataset to be combined that needs to be mapped |
| additionalDatasets[]/mapping/tgt | String | Name of the field on the primary dataset that the source field should be renamed to |

### Processing

#### Parsing

This module allows you to parse a field into multiple fields, whether there is a common delimiter, fixed field sizes, or field-specific parsing required.

Common set of delimiters:

````
{"type": "parseField", "field": "Line", "delimiter": "|", "maxFieldsToParse": 5,
 "fieldDefn": [
   {"name": "Severity"}, {"name": "Thread"}, {"name": "SessionId"}, {"name": "LogTime"}, {"name": "Msg"}
]}
````

Field-specific delimiters:

````
{"type": "parseField", "field": "Line", "parseType": "fieldSpecificDelimiters",
 "fieldDefn": [
   {"name": "Thread", "endingDelimiter": "\t"},
   {"name": "Severity", "endingDelimiter": " "},
   {"name": "SessionId", "fixedLength": 10, "suppress": true},
   {"name": "Msg", "endingDelimiter": "\n"}
]}
````

| Parameter | Type | Description |
| --- | --- | --- |
| field | String | The name of the field to be parsed. |
| delimiter | String | The delimiter with which to parse the field. If multiple delimiters are required, specify as `delimiter0`, `delimiter1`, ... |
| maxFieldsToParse | Number | If the number of fields is not known and you want all the remaining text at the end to be part of one field, use this to limit the number of fields that are output |
| parseType | String | The options are: *fieldSpecificDelimiters* - used if the ending delimiter for each field needs to be specified individually; *fixedFieldSizes* - for fixed field lengths. If left blank, common delimiters is the default |
| containsHeaderRow | Boolean | If specified, then the field values from the first row will be used for the field names |
| includeDelimiterInField | Boolean | If specified, the delimiter will be left as part of the field value |
| *fieldDefn* | Collection | List of fields to be parsed to |
| fieldDefn[]/name | String | Name of the resulting field. If not specified, then default names will be generated. |
| fieldDefn[]/endingDelimiter | String | If field specific delimiters is specified as the parsing type, this delimiter will be used to mark the end of the field value |
| fieldDefn[]/fixedLength | Number | If fixed field sizes are specified, this will be used to determine the field value |
| fieldDefn[]/suppress | Boolean | If specified, this field will not be part of the resulting dataset. |
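There is no fixed-width example above, so here is a sketch assuming a file whose records carry a 10-character account id, an 8-character date, and a 40-character description (the field names and widths are made up for illustration):

````
{"type": "parseField", "field": "Line", "parseType": "fixedFieldSizes",
 "fieldDefn": [
   {"name": "AccountId", "fixedLength": 10},
   {"name": "TradeDate", "fixedLength": 8},
   {"name": "Description", "fixedLength": 40}
]}
````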
#### Convert Fields

This module allows you to drop fields, rename them, format them, or convert field types.

Example:

````
{"type": "convertFields", "dropFields": "Column4,Column5,Column6",
 "convert": [
   {"name": "Column1", "newName": "TradeNumber", "type": "number"},
   {"name": "Column2", "newName": "ExecutionDate", "type": "date", "conversionFormat": "MMddyyyy-HHmmss"},
   {"name": "Column3", "newName": "CUSIP", "toUpperCase": true},
   {"name": "Column7", "newName": "SecurityId", "trimSpaces": true},
   {"name": "Column19", "newName": "ExchangeRate", "type": "decimal"}
]}
````

| Parameter | Type | Description |
| --- | --- | --- |
| dropFields | String | A comma delimited list of fields that will be dropped from the output. |
| *convert* | Collection | List of conversions to be applied |
| convert[]/name | String | Name of the field the conversion applies to. Required. |
| convert[]/newName | String | This will rename the field to the new name |
| convert[]/type | [Field Type (string)](#field-types) | The type that the field will be converted to. This can be from a numeric/date field to a string or vice versa. Conversion formats are typically required with these conversions |
| convert[]/conversionFormat | String | The format to use when a type conversion is specified. It is applicable for dates/times. |
| convert[]/toLowerCase | Boolean | If specified as true, the value of the field will be converted to lower case. |
| convert[]/toUpperCase | Boolean | If specified as true, the value of the field will be converted to upper case. |
| convert[]/trimSpaces | Boolean | If specified as true, any leading or trailing spaces in the field value will be removed. |
| convert[]/substringStart | Number | When taking part of the field's value, such as dropping the first 2 letters, this is the 0-based index where the resulting value should start. |
| convert[]/substringLength | Number | In combination with substringStart, this specifies how many characters of the field, from where it starts, should be kept. The rest will be discarded. The string to be cut can be shorter than the specified length. |
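As a small illustration of the substring options, a sketch that keeps only a middle portion of a padded reference code (the field names are made up for illustration):

````
{"type": "convertFields", "convert": [
  {"name": "RefCode", "newName": "ShortRef", "substringStart": 2, "substringLength": 6, "trimSpaces": true}
]}
````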
#### Filtering

This step allows you to create multiple inclusion and exclusion filters.

Simple Example:

````
{"type": "filter", "include": { "field": "Msg", "operator": "contains", "value": "PerfStat"}}
````

Complex Example:

````
{"type": "filter",
 "includes": [
   {"conditions": [
     {"field": "TradeAmount", "operator": "greaterThan", "value": "10000"},
     {"field": "TradeDate", "operator": "greaterThan", "value": "12/08/2016"},
     {"field": "TradeDate", "operator": "lessThan", "value": "12/21/2016"}]},
   {"conditions": [
     {"field": "TradeAmount", "operator": "greaterThan", "value": "25000"},
     {"field": "SecType", "operator": "equals", "value": "EQ"},
     {"field": "TradeDate", "operator": "greaterThan", "value": "1/1/2017"}]}
 ],
 "excludes": [
   {"field": "SecType", "operator": "in", "value": ["FI", "MBS", "CDS"]}
]}
````

| Parameter | Type | Description |
| --- | --- | --- |
| includes | Collection of Rules | A list of single or multiple conditions; if any match, the record will be part of the resulting dataset. |
| excludes | Collection of Rules | A list of single or multiple conditions; if any match, the record is removed from the resulting dataset. |
| includes[]/conditions or excludes[]/conditions | Collection of Criteria | If the rule has multiple criteria, use `conditions` to create the list of criteria; otherwise, you can specify a criterion without placing it in a conditions array |
| condition[]/field | String | Name of the field to be used for the criteria |
| condition[]/operator | [Operator (string)](#operators) | The appropriate operator to match the field value to |
| condition[]/value | String | The string representation of the value to be used in evaluating the criteria. It will be converted to the appropriate datatype of the field based on the default formats for each datatype. An array of values is supported if the operator is one of the following: in, startsWith, contains, endsWith and their negatives (ex. doesNotContain, ...). |

#### Summarize Data

This step allows you to create summarized data from source data. It can also tabulate the data, so certain columns will be driven by the values of a specified field.

Example:

````
{"type": "summarizeData", "groupBy": "TradeDate,SecType",
 "groupByField": [
   {"field": "TradeDate", "dateSummaryType": "byMonth"},
   {"field": "SecType", "tabulate": true}
 ],
 "summary": [
   {"field": "TradeAmount", "function": "SUM"}
]}
````

| Parameter | Type | Description |
| --- | --- | --- |
| groupBy | String | A comma separated list of the fields that you want to group the data by. Required. |
| groupByField | Collection | A list of additional details on how specific group by fields should be handled |
| groupByField[]/field | String | The field the additional detail is about |
| groupByField[]/dateSummaryType | String | If the group by field is a date/time, you can specify how to group, since often the grouping isn't by the specific date/time. Valid values are: *byDate*, *byWeek*, *byMonth*, *byQuarter*, *byMonthOfYear*, *byDayOfWeek*, *byHour*, *byMinute*, *byHourOfDay* |
| groupByField[]/numericBreakpoints | String | A comma-separated list of numbers into which you want to group numeric fields. For example, for a performance summary you might want buckets of < 100ms, 100-250ms, and so on; for this your setting would be 100,250,... |
| groupByField[]/tabulate | Boolean | Can be specified for one group by field, in which case the various values of that field become the fields in the output and you get the summary results underneath, much as in a pivot table |
| summary | Collection | A list of the fields that you want to summarize and how each should be summarized. |
| summary[]/field | String | The field to be summarized |
| summary[]/function | String | The name of the function to use to summarize. The options are: *sum*, *avg*, *count*, *min*, *max*, *firstNonEmptyValue*, *listOfValues* |
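For example, a sketch of a response-time summary that uses `numericBreakpoints` to bucket durations (the field name and thresholds are illustrative):

````
{"type": "summarizeData", "groupBy": "Duration",
 "groupByField": [
   {"field": "Duration", "numericBreakpoints": "100,250,1000"}
 ],
 "summary": [
   {"field": "Duration", "function": "count"}
]}
````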
#### Calculated Fields

This allows you to specify formulas for new fields or to change existing fields. The language for the formula expression is [Groovy](http://groovy-lang.org/single-page-documentation.html).

Single Calc Example:

````
{ "type": "calculateFields", "field": "MyCalcedField", "resulttype": "string",
  "expr": "if (Region == 'Northeast') return 'NE'; if (Region == 'Northwest') return 'NW'; return 'UNKNOWN';"}
````

Multiple Calculations:

````
{ "type": "calculateFields",
  "calc": [
    {"field": "MyCalcedField", "resulttype": "string", "expr": "if (Region == 'Northeast') return 'NE'; if (Region == 'Northwest') return 'NW'; return 'UNKNOWN';"},
    {"field": "Sales", "resulttype": "number", "expr": "if (Sales < 1000) Sales = 0; return Sales; "}
]}
````

| Parameter | Type | Description |
| --- | --- | --- |
| calc | Collection | A list of calculated fields and associated configuration. This is only required as a collection if you have multiple. A single calculated field can be defined as shown in the example above without the calc wrapper. |
| calc[]/field | String | Name of the field that will be calculated. If this is named the same as an existing input field, it will be overwritten. |
| calc[]/resulttype | [Field Type (string)](#field-types) | It may not always be clear what the resulting type of the field will be. Required. |
| calc[]/expr | String | The Groovy expression script that evaluates the logic. Note that all input field values are accessible in the expression script. |
| calc[]/scriptFile | String | The path to a file containing the Groovy script/expression. This can be specified instead of specifying the expression in the config file itself. |

#### Compare Data

This is an extremely powerful utility that allows you to find matching rows, differing rows, or differences in columns between 2 datasets. The main assumption is that you need matching columns as the basis of comparing the 2 datasets, without regard to the order of the data. When comparing data, only the fields that don't match will show up in the resulting dataset - matching field values will be blanked out.

Example:

````
{ "type": "compare", "method": "combineData",
  "sourceDataset": "CampaignSignups", "compareToDataset": "CustList",
  "prefixForSource2Fields": "CustDB-",
  "sourceMatchFields": "cust_email", "source2MatchFields": "emailaddr",
  "sourceCompareFields": "first_name,last_name,company", "source2CompareFields": "firstName,lastName,accountName"}
````

| Parameter | Type | Description |
| --- | --- | --- |
| sourceDataset | String | Name of the dataset that will be the basis of your comparison |
| compareToDataset | String | Name of the dataset that the source dataset will be compared with |
| method | String | The valid options are: *findDifferences*, *findMatches*, *combineData*. Find differences will find rows missing in the compared dataset that exist in the original list. Find matches will find rows that exist in both datasets. For field level compares, use combine data. The default is *combineData*, and this is not required to be specified. |
| sourceMatchFields | String | Comma separated list of the fields that will be combined to form the match key of the source dataset. Required. |
| source2MatchFields | String | Comma separated list of the fields that will be combined to form the match key of the compare-to dataset. The order needs to match the match fields on the source dataset. If not specified, it is assumed to be the same as `sourceMatchFields`. |
| sourceCompareFields | String | Only used where method = combineData. This is the list of the fields whose values will be compared. If not specified, all fields that have the same name in the source dataset and the compare-to dataset are compared. |
| source2CompareFields | String | Only used where method = combineData. This is the list of the fields whose values will be compared from the compare-to dataset. The order needs to match the compare fields on the source dataset. If not specified, it is assumed to be the same as `sourceCompareFields`. |
| prefixForSource2Fields | String | Only used where method = combineData. This is used to prefix the field names from the compare-to dataset to distinguish them from the source dataset. Defaults to 'Compare-'. |
| includeMatchedRecords | Boolean | If true, records where there is no difference will still be output. Defaults to false. |
| includeMatchingFields | Boolean | Indicates whether the field values for all compared fields should be present, or only values where there is a difference. Defaults to false. |

*Output Fields*

- CompareCode - code indicating the result - 0 means all matching, 1 means the record is only in the source dataset, 2 means the record is only in the compared dataset, 3 indicates field differences were found
- DiffFields - a comma separated list of the fields that were found to be different
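The example above shows a field-level compare; a sketch of the *findDifferences* method, which only needs the match keys, might look like this (dataset and field names reuse the example above):

````
{ "type": "compare", "method": "findDifferences",
  "sourceDataset": "CampaignSignups", "compareToDataset": "CustList",
  "sourceMatchFields": "cust_email", "source2MatchFields": "emailaddr"}
````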
#### Integrate Data

This processing module joins data between the current source and a referenced dataset. For each source record, the output can contain a single record containing fields from the matched dataset, or multiple resulting records. This would be used to pull data from a reference dataset, such as pulling customer names and address information onto a list of invoices to be sent.

Single Match Criteria Example:

````
{ "type": "join", "mergeToDataset": "CustList", "prefixForSource2Fields": "CustDB-",
  "includeSource2Fields": "firstname,lastname,mailingaddr-line1,mailingaddr-line2,mailingaddr-city,mailingaddr-state,mailingaddr-zip",
  "matchCondition": {"source1Field": "email", "operator": "equals", "source2Field": "emailaddr"}}
````

Multiple Match Criteria Example:

````
{ "type": "join", "mergeToDataset": "CustList", "prefixForSource2Fields": "CustDB-",
  "includeSource2Fields": "firstname,lastname,mailingaddr-line1,mailingaddr-line2,mailingaddr-city,mailingaddr-state,mailingaddr-zip",
  "matchCondition": { "criteria": [
    {"source1Field": "email", "operator": "equals", "source2Field": "emailaddr"},
    {"source1Field": "lastname", "operator": "equals", "source2Field": "lastname"}
]}}
````

| Parameter | Type | Description |
| --- | --- | --- |
| mergeToDataset | String | Name of the dataset that the current source will be joined with. Required. |
| stopAtFirstMatch | Boolean | If the dataset being merged will only have a single matching record, this tells the processor to stop as soon as a match is found. Sometimes you want to have multiple matches. Defaults to true. |
| prefixForSource2Fields | String | This is used to prefix the field names from the merged dataset that have the same name as a field on the source dataset. Optional. |
| excludeSource1Fields | String | If the goal is just to find the matches rather than augment the current source with additional information, this is an optimization so that only the fields you want end up in the result and the rest are excluded. This is a comma separated list of the fields to exclude from your source |
| includeSource2Fields | String | A comma separated list of the fields to pull in from the dataset being merged. If not specified, all fields will be pulled in. |
| excludeSource2Fields | String | If *includeSource2Fields* is not specified, a comma separated list of fields to exclude from the dataset can be specified instead. |
| *matchCondition* || The match condition that will be used. Either a single match criterion or multiple criteria can be specified. |
| matchCondition/criteria[]/source1Field | String | Name of the field in the current source to be used for matching with the merged dataset |
| matchCondition/criteria[]/source2Field | String | Name of the field in the dataset to be merged that will be used for matching |
| matchCondition/criteria[]/operator | [Operator (string)](#operators) | The operator to be used to match. At least one match criterion must have an equals operator. |

#### Mapping Data

This is used if you want to map a certain set of field values to another value. An example would be to map 'Northeast' to 'NE', or to map 'Mexico', 'USA' and 'Canada' to 'North America' as the continent.
Single Field Example:

````
{ "type": "mapValues", "targetField": "Continent", "src": "Country", "matchtype": "contains",
  "rule": [
    {"result": "North America", "f1": ["Mexico", "USA", "United States", "Canada"]},
    {"result": "Asia", "f1": ["Vietnam", "India", "China", "Cambo"]}
]}
````

Multiple Field Example:

````
{ "type": "mapValues", "targetField": "CustSegment",
  "src1": "AnnualMileage", "matchtype1": "greaterThan",
  "src2": "PreviousCustSegment", "matchtype2": "equals",
  "rule": [
    {"result": "Platinum", "f1": [100000], "f2": ["Platinum"]},
    {"result": "Platinum", "f1": [200000], "f2": ["Gold"]},
    {"result": "Platinum", "f1": [250000], "f2": ["Silver"]},
    {"result": "Gold", "f1": [10000], "f2": ["Platinum"]},
    {"result": "Gold", "f1": [50000], "f2": ["Gold"]},
    {"result": "Gold", "f1": [100000], "f2": ["Silver"]},
    {"result": "Silver", "f1": [10000], "f2": ["Silver"]}
]}
````

| Parameter | Type | Description |
| --- | --- | --- |
| targetField | String | Name of the output field to which mapped values will be assigned. Required. |
| src | String | The name of the field(s) that will be used in the mapping evaluation process. |
| matchtype | [Operator (string)](#operators) | The operator to be used to evaluate that mapping value. |
| *rule* || Mapping rules, each of which corresponds to a single output value and potentially multiple match scenarios |
| rule[]/result | String | Value that should be the result in the target field if the rule values match the source |
| rule[]/discard | Boolean | Sometimes you may need to drop certain records if they match your criteria, perhaps because the data is unmapped or not desired |
| rule[]/f{1-..} || A container for the list of values to be mapped for field #X. This corresponds with the fields specified in 'src'. In the simple example, only a single field is specified. This is an array of values. |

#### Parse XML

If you have a field containing XML and you need to parse/flatten it into a tabular format, this can be used. Note that the XML in each row needs to be relatively similar in terms of the fields contained.

Example:

````
{ "type": "parseXML", "field": "CustRecXML", "parentNodeName": "customers"}
````

| Parameter | Type | Description |
| --- | --- | --- |
| field | String | Name of the input field containing the XML to be parsed. Required. |
| parentNodePath | String | If this is specified, the values for all child nodes under the node referred to by this XPath will be imported. An example would be `/Customers`. |
| parentNodeName | String | If this is specified, the values for all child nodes under the node with the specified name will be imported. An example would be `Customers`. |
| nodeName | String | All nodes with the specified name will be imported. |
| nodesXPath | String | All nodes matching the specified XPath will be imported. |
| treatChildNodeNameAsValue | Boolean | If specified as true, the node name will be imported as a field value as well. This is sometimes useful if the node name itself contains useful info. |
| childNodeNameField | String | If treatChildNodeNameAsValue is specified as true, this is required and is the field into which the node name will be stored. |
| includeNodePathInResults | Boolean | If specified as true, the path to the node being imported will be included in the results. |
| nodePathField | String | If includeNodePathInResults is specified as true, this is required and is the field into which the node path will be stored. |
| keepSourceFields | Boolean | True by default; if this is specified as false, only the fields parsed from the XML field will remain in the output. Otherwise the other fields from the input are retained. |
| keepParsedField | Boolean | If specified as true, the field containing the XML that was parsed will also remain. |

### Exporting Data

#### Database Inserts/Updates

This is used to store your data in a database. It can match against what you might already have in the database to determine whether to insert data, do an update, or skip the update.

Example:

````
{"type": "storeData", "targetDatasource": "CustDB", "targetTable": "Customer",
 "addIfNoMatchFound": true, "actionOnMatch": "overwrite",
 "match": [{"src": "emailaddr", "tgt": "email"}],
 "fieldMapping": [
   {"src": "firstname", "tgt" : "first_name"},
   {"src": "lastname", "tgt" : "last_name"},
   {"src": "emailaddr", "tgt" : "email"}
]}
````

| Parameter | Type | Description |
| --- | --- | --- |
| targetDatasource | String | The dbConnection to be used. It must be configured in the settings as a dbConnection. Required. |
| targetTable | String | The table into which inserts and updates will be done. |
| addIfNoMatchFound | Boolean | This can be used to only do updates, and not inserts, when a match is not found. If not specified, inserts will be done. |
| actionOnMatch | String | Can be either *overwrite* or *skip*, which determines whether an update should be made when a matching record exists, or no update should be done |
| *match* || A collection of source/target fields that will be used to match records from the source and the database |
| match[]/src | String | Name of the field from the current source that will be matched to the database records |
| match[]/tgt | String | Name of the database column on the target table that will be matched to the source |
| fieldMapping || A collection of the source/target fields as they will be stored in the database |
| fieldMapping[]/src | String | Name of the field from the current source that this field mapping is for |
| fieldMapping[]/tgt | String | Name of the database column on the target table into which the source field will be put. If not specified, it is assumed that the target column name is the same as the source field |

#### Output to Excel

This is used if you want to output one or more datasets to an Excel file. If it is a step in a process, then the current source will be output. Otherwise, the list of specified datasets will be output.

Example:

````
{ "type": "outputToExcel", "file": "${!basedir}\\dump.xlsx" }
````

Example with Formatting:

````
{ "type": "outputToExcel", "file": "${!basedir}\\dump.xlsx",
  "specs": [
    {"name": "", "order": "emailaddr,firstname,lastname,userid,billacctid,modifieddt",
     "fields": [
       {"field": "emailaddr", "width": 130},
       {"field": "modifieddt", "width": 100, "format": "MM/dd/yyyy"}
    ]}
]}
````

Another Example with Multiple Datasets:

````
{ "type": "outputToExcel", "file": "${!basedir}\\SalesReport.xlsx", "datasets": "Sales Summary,Sales Comp" }
````

| Parameter | Type | Description |
| --- | --- | --- |
| file | String | Name of the file to be output. Required. |
| directory | String | The directory where the file should be output. Can be omitted, in which case the file itself can contain the path. |
| datasets | String | A comma separated list of the datasets to be output. Each will be output to a different tab. |
| *specs* | Collection | A list of specs for each dataset, including field ordering as well as widths and output formats |
| specs[]/name | String | Name of the dataset for which this spec applies |
| specs[]/order | String | A comma separated list of fields in the order that they should be output |
| *specs[]/fields* | Collection | A list of field specs in terms of how the field should be output |
| specs[]/fields[]/name | String | Name of the field for which this spec applies |
| specs[]/fields[]/width | Number | The width of the column for this field |
| specs[]/fields[]/format | [Display Format (string)](#field-formats) | The format in which the field will be displayed. This works for date/time and numeric fields only. |

#### Output to CSV

This is used if you want to output the current source to a text file with delimiters.

Example:

````
{ "type": "outputToCSV", "file": "${!basedir}\\dump.csv" }
````

| Parameter | Type | Description |
| --- | --- | --- |
| file | String | Name of the file to be output. Required. |
| directory | String | The directory where the file should be output. Can be omitted, in which case the file itself can contain the path. |
| delimiter | String | The delimiter to be used. The default is a comma. |

## Helpful Utilities

### Apply Template

Templates can be defined to hold a series of steps that can be used in multiple processes. An example might be some parsing logic for a type of file that you want to share more broadly. The defined steps are put in place of the call that applies the template.

Definition of Template:

````
"processTemplates": [
  {"name": "ParseApplicationPerformanceLogFile", "steps": [
    {} // the steps defined are similar to any other process definition
  ]}
]
````

Example of Using Template:

````
"steps": [
  {"type": "importTextFile", "file": "${!basedir}\\inputs\\appperf1.log"},
  {"type": "applyTemplate", "name": "ParseApplicationPerformanceLogFile"} // this is applying the templated steps
]
````

### Local Sub-Process

Typically when a process is run, the output is stored in a data file in the `data` directory under the base directory. This allows processes using that resulting data to avoid re-sourcing and re-calculating it. Sometimes, though, you may need to massage multiple sets of source data in some slight manner, don't need the result stored, and having it calculated on the fly is fine. A local sub-process is essentially a way of creating that data source with some processing steps added to it, and naming it so it can be referenced within your process. An example could be combining important events from various sources where how they are filtered is specific to this process.

Example of Combining Multiple Relevant Logs:

````
{ "name": "FindErrors", "steps": [
  {"type": "subprocess", "name": "websvclogs", "src": "AppServiceCalls", "steps": [
    {"type": "filter", "include": {"field": "Duration", "operator": "greaterThan", "value": "50"}}
  ]},
  {"type": "subprocess", "name": "webresp", "src": "WebLog", "steps": [
    {"type": "filter", "includes": [
      {"field": "Duration", "operator": "greaterThan", "value": "100"},
      {"field": "HttpStatusCode", "operator": "notEqual", "value": "200"}
    ]}
  ]},
  {"type": "combineData", "primaryDataset": "websvclogs", "additionalDatasets": [{"name": "webresp"}]}
]}
````

### Invoke Process

This module will invoke the specified process for each record from the current source.
For example, you could have a directory scan which gives you a list of files and then, for each of those, invoke a process to parse the log file.

Example:

````
Assuming 'ProcessLogFile' is defined as such:

{"name": "ProcessLogFile", "steps": [
  {"type": "importTextFile", "directory": "${!basedir}\\inputs", "file": "app.log", "delimiter": ">>>>"},
  {"type": "parseField", "field": "Line", "delimiter": "|", "fieldDefn": [
    {"name": "Severity"}, {"name": "Thread"}, {"name": "SessionId"}, {"name": "LogTime"}, {"name": "Msg"}
  ]}
]}

{ "type": "invokeProcess", "process": "ProcessLogFile",
  "map": [
    {"sourceField": "FilePath", "step": 0, "targetPath": "@file"}
]}
````

| Parameter | Type | Description |
| --- | --- | --- |
| process | String | Name of the process to be invoked for each record in the current source. Required. |
| *map* || Mapping of the source fields to the steps in the target process |
| map[]/sourceField | String | Name of the field from the source to be copied as config into the target process |
| map[]/step | Number | The zero-based index of the step in the target process whose config setting will be overridden with the record's value |
| map[]/targetPath | String | The xpath of the configuration field containing the setting to be overridden |

### Distinct Values

This module allows you to get the list of distinct values for a combination of fields. This can be useful as part of profiling data.

Example:

````
{ "type": "getDistinctValues", "fields": "TxnType,SecType" }
````

| Parameter | Type | Description |
| --- | --- | --- |
| fields | String | A comma separated list of fields for which the module will get the various combinations of values. Required. |

### Iterative Development

You typically add a step, run the process, and verify the data. Once you've built the process up to its final form, it will typically include a step that outputs the data to a file, database, etc. If you want to ignore some step, such as an output step, you can do so by adding ignore=true to the step, as shown below.

Example:

````
{ "type": "outputToExcel", "ignore": true, ... }
````

## Reference Types

### Operators

Operators appear in many configuration settings, such as filters. This is the list of operators that are available: _equals_, _contains_, _like_, _lessThan_, _lessThanOrEqual_, _greaterThan_, _greaterThanOrEqual_, _in_, _startsWith_, _endsWith_, _doesNotEqual_, _notLike_, _notIn_, _doesNotStartWith_, _doesNotContain_, and _doesNotEndWith_.

### Field Types

Field types are also often required to be specified, such as with conversions. The typical ones are: _string_, _number_, _decimal_, _dateTime_, _date_, _time_, and _Boolean_.

### Field Formats

Formats specify how fields should be parsed or how they should be output. This applies to Boolean, date, and numeric fields.

| Format | Description |
| --- | --- |
| *Dates and Times* | An example of a format that contains date and time is `MM/dd/yyyy HH:mm:ss Z` |
| M | Numeric month (1 for Jan, 11 for November) |
| MM | Padded 2-digit numeric month (01 for Jan) |
| MMM | 3 letter abbreviation for the month (Jan) |
| d | Numeric day of month |
| dd | Padded 2-digit numeric day of month (05 for the 5th day of the month) |
| yyyy | Padded 4-digit numeric year (ex. 2017) |
| yy | 2-digit numeric year (69 for 1969, and 17 for 2017) |
| HH | Padded 2-digit numeric hour of day (00-23) |
| mm | Padded 2-digit numeric minute of hour (00-59) |
| ss | Padded 2-digit numeric seconds (00-59) |
| *Numeric* | An example of a numeric format is `#,###,###.00` |
| .0000 | Specifies the number of decimal places that will be output (ex. 1.234 with format .00 will be output as 1.23) |
| .# | Specifies that the decimal places will be based on the actual number (ex. 1.23 vs. 1.234) |
| #,###,### | Specifies that the numeric will have comma separators for thousands |
| *Boolean* | Booleans (yes/no) can be output in a few formats |
| True/False | This is the default if not specified |
| Yes/No | yes for true, no for false |
| Y/N | Y for true, N for false |
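To tie the reference types back to the steps above, here is a sketch that parses a string field into a date with a conversion format and then writes it out with a display format. The process, file, and field names are made up for illustration:

````
{"name": "FormatDates", "steps": [
  {"type": "importCSVFile", "file": "${!basedir}\\inputs\\orders.csv", "treatFirstRowAsHeader": true},
  {"type": "convertFields", "convert": [
    {"name": "OrderDate", "type": "date", "conversionFormat": "MM/dd/yyyy"}
  ]},
  {"type": "outputToExcel", "file": "${!basedir}\\outputs\\orders.xlsx",
   "specs": [{"name": "", "fields": [{"field": "OrderDate", "format": "MMM dd, yyyy"}]}]}
]}
````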