# RegEx

The Regex plugin creates a single “union” table in a database from any files
matching a [regularexpression](
https://en.wikipedia.org/wiki/Regular_expression) (RegEx) by
leveraging any of the flat file connectors ([CSV](/jobs/csv/),
[Excel](/jobs/excel/)) as a base.

.. image:: assets/regex-1.png
   :alt: union flat files together

See the [Union plugin](/jobs/union/) if you are trying to "union"
multiple database tables together into single database table.

## Create a RegEx job

Create a job using any of the flat file connectors:

- [CSV](/jobs/csv/)
- [Excel](/jobs/excel/)
- [JSON](/jobs/json/)

Edit the job.

### Input changes

Rename the input’s `use` to `base`. `base` is how the Regex plugin knows which
underlying flat file connector to use.

Add a new input `use` key with the value `flatfile.iov2#RegexInput`.

Remove the input `source` and add `regex` with the regular expression value.

### SDL changes

If there is an `sdl` section, remove it entirely.

### Steps changes

Update the `steps` with the job's steps to these steps:

```json
"steps": [
	{
		"column": "__mtime__",
		"use": "mitto.iov2.steps#MaxTimestamp"
	},
	{
		"use": "mitto.iov2.steps.upsert#SetUpdatedAt"
	},
	{
		"transforms": [
			{
				"use": "mitto.iov2.transform#ExtraColumnsTransform"
			},
			{
				"use": "mitto.iov2.transform#ColumnsTransform"
			}
		],
		"use": "mitto.iov2.steps#Input"
	},
	{
		"use": "mitto.iov2.steps#CreateTable"
	},
	{
		"use": "mitto.iov2.steps.upsert#CreateTempTable"
	},
	{
		"transforms": [
			{
				"use": "mitto.iov2.transform#FlattenTransform"
			}
		],
		"use": "mitto.iov2.steps#Output"
	},
	{
		"key": "__source__",
		"use": "mitto.iov2.steps.upsert#SyncTempTable"
	},
	{
		"use": "mitto.iov2.steps#CollectMeta"
	}
]
```

The Regex connector uses all the input parameters (e.g. delimiter, encoding,
skip, etc) from the initial job's plugin.

### Output table

The resulting database table has three additional columns:

- `__index__` - the index of each row in each separate file
- `__source__` - the source file the rows originate from
- `__mtime__` - the modified time of the source file

### Upsert

Upsert is supported for RegEx jobs by using the `__source__` as the primary
key and `__mtime__` as the last modified time.

Every time the RegEx job is run, it checks the output table and finds the
maximum `__mtime__`. Only files with modified times greater than or equal to
this `__mtime__` will be upsert on each run.

If you ever need to "reset" or do a full table refresh, you can drop the
output table and rerun the RegEx job.

## Example

We have 3 CSV files we want to union together into a single database table.

Here's what the initial CSV files look like:

.. figure:: assets/regex__initial_csv.png

   Initial CSV


Here's the resulting database table for `regex_1.csv` after using the **XSV** plugin:

| id | name | __index__ |
|----|------|-----------|
| 1 | Justin | 1 |

And here's the resulting database table after using the **RegEx** plugin:

| id | name | __index__ | __source__ | __mtime__ |
|----|------|-----------|------------|-----------|
| 3 | Birdie | 1 | regex_3.csv | 2020-02-19 17:30:20.680760 |
| 2 | Bear | 1 | regex_2.csv | 2020-02-19 17:30:20.628760 |
| 1 | Justin | 1 | regex_1.csv | 2020-02-19 17:30:20.572761 |

Here's the job config for the original XSV job:

```json
{
	"input": {
		"delimiter": ",",
		"encoding": "ASCII",
		"includes_header": true,
		"source": "/var/mitto/data/regex_1.csv",
		"use": "xsv.iov2#XsvInput2"
	},
	"output": {
		"dbo": "postgresql://dbo/analytics",
		"schema": "test",
		"tablename": "regex_1",
		"use": "call:mitto.iov2.db#todb"
	},
	"sdl": {
		"columns": [
			{
				"__header__": "id",
				"__jpath__": "$.'id'",
				"name": "id",
				"type": "Boolean"
			},
			{
				"__header__": "name",
				"__jpath__": "$.'name'",
				"length": 30,
				"name": "name",
				"type": "String"
			}
		],
		"name": "regex_1"
	},
	"steps": [
		{
			"transforms": [
				{
					"use": "mitto.iov2.transform#ExtraColumnsTransform"
				},
				{
					"use": "mitto.iov2.transform#ColumnsTransform"
				}
			],
			"use": "mitto.iov2.steps#Input"
		},
		{
			"use": "mitto.iov2.steps#CreateTable"
		},
		{
			"transforms": [
				{
					"use": "mitto.iov2.transform#FlattenTransform"
				}
			],
			"use": "mitto.iov2.steps#Output"
		},
		{
			"use": "mitto.iov2.steps#CollectMeta"
		}
	]
}
```

And here's the modified job config using the RegEx plugin:

```json
{
	"input": {
		"delimiter": ",",
		"encoding": "ASCII",
		"includes_header": true,
		"regex": "^regex_.+\\.csv$",
		"base": "xsv.iov2#XsvInput2",
		"use": "flatfile.iov2#RegexInput"
	},
	"output": {
		"dbo": "postgresql://dbo/analytics",
		"schema": "test",
		"tablename": "regex",
		"use": "call:mitto.iov2.db#todb"
	},
	"steps": [
		{
			"column": "__mtime__",
			"use": "mitto.iov2.steps#MaxTimestamp"
		},
		{
			"use": "mitto.iov2.steps.upsert#SetUpdatedAt"
		},
		{
			"transforms": [
				{
					"use": "mitto.iov2.transform#ExtraColumnsTransform"
				},
				{
					"use": "mitto.iov2.transform#ColumnsTransform"
				}
			],
			"use": "mitto.iov2.steps#Input"
		},
		{
			"use": "mitto.iov2.steps#CreateTable"
		},
		{
			"use": "mitto.iov2.steps.upsert#CreateTempTable"
		},
		{
			"transforms": [
				{
					"use": "mitto.iov2.transform#FlattenTransform"
				}
			],
			"use": "mitto.iov2.steps#Output"
		},
		{
			"key": "__source__",
			"use": "mitto.iov2.steps.upsert#SyncTempTable"
		},
		{
			"use": "mitto.iov2.steps#CollectMeta"
		}
	]
}
```

The input's `use` was renamed to `base`. A new `use` (for the RegEx plugin)
was added to the input. The input's `source` was removed and `regex` was
added. The output `tablename` was updated. The `steps` were completely
replaced with the RegEx steps.

The input's `regex` for this particular job is `^regex_.+\\.csv$`. We are
looking for any files that start with (`^`)  `regex_`, have any characters in
the middle (`.+`), and end with (`$`) `.csv`.

This `regex` will feed in the 3 files we want to union:

- regex_1.csv
- regex_2.csv
- regex_3.csv

## Regular Expression Syntax

One helpful tool for testing regular expressions is
[RegExr](https://regexr.com).

One important thing to remember, In older versions of Zuar Runner (pre-2.9.x)
because a job's config is stored in JSON, when using escape characters (`\`),
you need to double escape them (`\\`).


### PRO-TIP: Testing in the Files Page

You can test what your regular expression will match by using the search input
on the Files Page.

.. image:: assets/regex-2.png
   :alt: files page search input