# RegEx The Regex plugin creates a single “union” table in a database from any files matching a [regularexpression]( https://en.wikipedia.org/wiki/Regular_expression) (RegEx) by leveraging any of the flat file connectors ([CSV](/jobs/csv/), [Excel](/jobs/excel/)) as a base. .. image:: assets/regex-1.png :alt: union flat files together See the [Union plugin](/jobs/union/) if you are trying to "union" multiple database tables together into single database table. ## Create a RegEx job Create a job using any of the flat file connectors: - [CSV](/jobs/csv/) - [Excel](/jobs/excel/) - [JSON](/jobs/json/) Edit the job. ### Input changes Rename the input’s `use` to `base`. `base` is how the Regex plugin knows which underlying flat file connector to use. Add a new input `use` key with the value `flatfile.iov2#RegexInput`. Remove the input `source` and add `regex` with the regular expression value. ### SDL changes If there is an `sdl` section, remove it entirely. ### Steps changes Update the `steps` with the job's steps to these steps: ```json "steps": [ { "column": "__mtime__", "use": "mitto.iov2.steps#MaxTimestamp" }, { "use": "mitto.iov2.steps.upsert#SetUpdatedAt" }, { "transforms": [ { "use": "mitto.iov2.transform#ExtraColumnsTransform" }, { "use": "mitto.iov2.transform#ColumnsTransform" } ], "use": "mitto.iov2.steps#Input" }, { "use": "mitto.iov2.steps#CreateTable" }, { "use": "mitto.iov2.steps.upsert#CreateTempTable" }, { "transforms": [ { "use": "mitto.iov2.transform#FlattenTransform" } ], "use": "mitto.iov2.steps#Output" }, { "key": "__source__", "use": "mitto.iov2.steps.upsert#SyncTempTable" }, { "use": "mitto.iov2.steps#CollectMeta" } ] ``` The Regex connector uses all the input parameters (e.g. delimiter, encoding, skip, etc) from the initial job's plugin. ### Output table The resulting database table has three additional columns: - `__index__` - the index of each row in each separate file - `__source__` - the source file the rows originate from - `__mtime__` - the modified time of the source file ### Upsert Upsert is supported for RegEx jobs by using the `__source__` as the primary key and `__mtime__` as the last modified time. Every time the RegEx job is run, it checks the output table and finds the maximum `__mtime__`. Only files with modified times greater than or equal to this `__mtime__` will be upsert on each run. If you ever need to "reset" or do a full table refresh, you can drop the output table and rerun the RegEx job. ## Example We have 3 CSV files we want to union together into a single database table. Here's what the initial CSV files look like: .. figure:: assets/regex__initial_csv.png Initial CSV Here's the resulting database table for `regex_1.csv` after using the **XSV** plugin: | id | name | __index__ | |----|------|-----------| | 1 | Justin | 1 | And here's the resulting database table after using the **RegEx** plugin: | id | name | __index__ | __source__ | __mtime__ | |----|------|-----------|------------|-----------| | 3 | Birdie | 1 | regex_3.csv | 2020-02-19 17:30:20.680760 | | 2 | Bear | 1 | regex_2.csv | 2020-02-19 17:30:20.628760 | | 1 | Justin | 1 | regex_1.csv | 2020-02-19 17:30:20.572761 | Here's the job config for the original XSV job: ```json { "input": { "delimiter": ",", "encoding": "ASCII", "includes_header": true, "source": "/var/mitto/data/regex_1.csv", "use": "xsv.iov2#XsvInput2" }, "output": { "dbo": "postgresql://dbo/analytics", "schema": "test", "tablename": "regex_1", "use": "call:mitto.iov2.db#todb" }, "sdl": { "columns": [ { "__header__": "id", "__jpath__": "$.'id'", "name": "id", "type": "Boolean" }, { "__header__": "name", "__jpath__": "$.'name'", "length": 30, "name": "name", "type": "String" } ], "name": "regex_1" }, "steps": [ { "transforms": [ { "use": "mitto.iov2.transform#ExtraColumnsTransform" }, { "use": "mitto.iov2.transform#ColumnsTransform" } ], "use": "mitto.iov2.steps#Input" }, { "use": "mitto.iov2.steps#CreateTable" }, { "transforms": [ { "use": "mitto.iov2.transform#FlattenTransform" } ], "use": "mitto.iov2.steps#Output" }, { "use": "mitto.iov2.steps#CollectMeta" } ] } ``` And here's the modified job config using the RegEx plugin: ```json { "input": { "delimiter": ",", "encoding": "ASCII", "includes_header": true, "regex": "^regex_.+\\.csv$", "base": "xsv.iov2#XsvInput2", "use": "flatfile.iov2#RegexInput" }, "output": { "dbo": "postgresql://dbo/analytics", "schema": "test", "tablename": "regex", "use": "call:mitto.iov2.db#todb" }, "steps": [ { "column": "__mtime__", "use": "mitto.iov2.steps#MaxTimestamp" }, { "use": "mitto.iov2.steps.upsert#SetUpdatedAt" }, { "transforms": [ { "use": "mitto.iov2.transform#ExtraColumnsTransform" }, { "use": "mitto.iov2.transform#ColumnsTransform" } ], "use": "mitto.iov2.steps#Input" }, { "use": "mitto.iov2.steps#CreateTable" }, { "use": "mitto.iov2.steps.upsert#CreateTempTable" }, { "transforms": [ { "use": "mitto.iov2.transform#FlattenTransform" } ], "use": "mitto.iov2.steps#Output" }, { "key": "__source__", "use": "mitto.iov2.steps.upsert#SyncTempTable" }, { "use": "mitto.iov2.steps#CollectMeta" } ] } ``` The input's `use` was renamed to `base`. A new `use` (for the RegEx plugin) was added to the input. The input's `source` was removed and `regex` was added. The output `tablename` was updated. The `steps` were completely replaced with the RegEx steps. The input's `regex` for this particular job is `^regex_.+\\.csv$`. We are looking for any files that start with (`^`)  `regex_`, have any characters in the middle (`.+`), and end with (`$`) `.csv`. This `regex` will feed in the 3 files we want to union: - regex_1.csv - regex_2.csv - regex_3.csv ## Regular Expression Syntax One helpful tool for testing regular expressions is [RegExr](https://regexr.com). One important thing to remember, In older versions of Zuar Runner (pre-2.9.x) because a job's config is stored in JSON, when using escape characters (`\`), you need to double escape them (`\\`). ### PRO-TIP: Testing in the Files Page You can test what your regular expression will match by using the search input on the Files Page. .. image:: assets/regex-2.png :alt: files page search input