Union

Zuar Runner’s Union plugin combines multiple database tables into a single “union” database table.

../_images/union-1.png

The input tables all need to be from the same source database. However, the output database table can be a completely different database altogether.

See the RegEx plugin if you are trying to “union” multiple flat files together into a single database table.

Example Union Use Case

As an example, we want to take two tables that look like this:

Table A

../_images/union__table_a.png

Table B

../_images/union__table_b.png

And union them together.

../_images/union__union_table.png

Notice the addition of the table_name column to differentiate between the rows of each table after the union.

What if the Tables Have Different Columns or Data Types?

Zuar Runner handles this by learning the input data as it is processed and creates the correct output union table.

Table A

../_images/union__table_a2.png

Notice Table A has 4 columns: id, words, places, and last_update.

Table B:

../_images/union__table_b2.png

Notice Table B is missing the places column.

Union Table

../_images/union__table_union2.png

Notice the addition of the table_name column to differentiate between the rows of each table after the union. Also notice rows associated with Table B have NULL values for the places column.

Create a Zuar Runner Union Job

Zuar Runner Union jobs are created with the Generic plugin. The job type is io (input/output).

Example job configuration:

{
    "input": {
        "dbo": "postgresql://db/analytics",
        "use": "union.iov2#TableSetInput",
        "last_modified_column_name": "last_update",
        "source_column_name": "table_name",
        "table_names": {
            "example.table_a": "a",
            "example.table_b": "b"
        }
    },
    "output": {
        "dbo": "postgresql://db/analytics",
        "schema": "unioned",
        "tablename": "tables_unioned",
        "use": "call:mitto.iov2.db#todb"
    },
    "steps": [
        {
            "transforms": [
                {
                    "use": "mitto.iov2.transform#ExtraColumnsTransform"
                },
                {
                    "use": "mitto.iov2.transform#ColumnsTransform"
                }
            ],
            "use": "mitto.iov2.steps#Input"
        },
        {
            "use": "mitto.iov2.steps#CreateTable"
        },
        {
            "transforms": [
                {
                    "use": "mitto.iov2.transform#FlattenTransform"
                }
            ],
            "use": "mitto.iov2.steps#Output"
        },
        {
            "use": "mitto.iov2.steps#CollectMeta"
        }
    ]
}

Union job config values:

  • Input → source_column_name - This is the new column that will

    appear in the “union” table. In our example above, this is how the new column table_name is created.

  • Input → table_names - This JSON object lists the tables to be

    unioned ({schema}.{table}) and the values created for the source_column_name‘s rows. In our example above, this is how the new table_name column is populated with values from each table (e.g. a and b).

Union with Upsert

Zuar Runner Union jobs can be configured to handle upsert.

Example job configuration:

{
    "input": {
        "dbo": "postgresql://db/analytics",
        "use": "union.iov2#TableSetInput",
        "last_modified_column_name": "last_update",
        "source_column_name": "table_name",
        "table_names": {
            "example.table_a": "a",
            "example.table_b": "b"
        }
    },
    "output": {
        "dbo": "postgresql://db/analytics",
        "schema": "unioned",
        "tablename": "tables_unioned",
        "use": "call:mitto.iov2.db#todb"
    },
    "steps": [
        {
            "use": "union.iov2.steps#TimestampBySource",
            "source_column_name": "table_name",
            "last_modified_column_name": "last_update"
        },
        {
            "use": "mitto.iov2.steps.upsert#SetUpdatedAt"
        },
        {
            "transforms": [
                {
                    "use": "mitto.iov2.transform#ExtraColumnsTransform"
                },
                {
                    "use": "mitto.iov2.transform#ColumnsTransform"
                }
            ],
            "use": "mitto.iov2.steps#Input"
        },
        {
            "use": "mitto.iov2.steps#CreateTable"
        },
        {
            "use": "mitto.iov2.steps.upsert#CreateTempTable"
        },
        {
            "transforms": [
                {
                    "use": "mitto.iov2.transform#FlattenTransform"
                }
            ],
            "use": "mitto.iov2.steps#Output"
        },
        {
            "use": "mitto.iov2.steps.upsert#SyncTempTable",
            "key": [
                "table_name",
                "id"
            ]
        },
        {
            "use": "mitto.iov2.steps#CollectMeta"
        }
    ]
}

Union job with upsert config values:

  • Input → source_column_name - This is the new column that will appear in the “union” table. In our example above, this is how the new column table_name is created.

  • Input → table_names - This JSON object lists the tables to be unioned ({schema}.{table}) and the values created for the source_column_name‘s rows. In our example above, this is how the new table_name column is populated with values from each table (e.g. a and b).

  • Input → last_modified_column_name - This is the timestamp column in the “source” tables that is used for upsert. In our example above, this is the last_update column.

  • Steps → source_column_name - This is the same column from the input. This column is used to determine the last modified column value per section of the union. In our example above, this column is table_name.

  • Steps → last_modified_column_name - This is the timestamp column used for upsert. In our example above, this column is last_update.

  • Steps → key - This is a JSON array listing the column(s) that uniquely identify a row in the union table. In the example above, the primary key from the source tables is id. We also add in the table_name column value from the input. The combination of these two values is how we uniquely identify a row in the union table.