Convert from Alpine's 'ColumnType' to the corresponding Spark SQL type. DateTime behavior: converts all DateTime columns to TimestampType. If a format string is present, it is added to the column metadata; if there is no format string, the ISO format ("yyyy-mm-dd hh:mm:ss") is used.
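A minimal Scala sketch of this mapping. The value of the DATE_METADATA_KEY constant (referenced in the date-format notes below) and the helper's signature are assumptions for illustration, not the SDK's actual API:

```scala
import org.apache.spark.sql.types._

// Sketch only: how a DateTime ColumnType might map to a Spark SQL StructField,
// with the format string preserved as column metadata.
val DATE_METADATA_KEY = "dateFormat" // assumed value, for illustration

def dateTimeToStructField(name: String, formatOpt: Option[String]): StructField = {
  // No format string -> fall back to the ISO format described above.
  val format = formatOpt.getOrElse("yyyy-mm-dd hh:mm:ss")
  val metadata = new MetadataBuilder().putString(DATE_METADATA_KEY, format).build()
  // All DateTime columns become TimestampType regardless of format.
  StructField(name, TimestampType, nullable = true, metadata)
}
```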
Converts from a Spark SQL StructField to an Alpine-specific ColumnDef.
Converts from a Spark SQL schema to the Alpine 'TabularSchema' type. The 'TabularSchema' object this method returns can be used to create any of the tabular Alpine IO types (HdfsTabularDataset, DataTable, etc.).
Date format behavior: if the column definition has no metadata stored at the DATE_METADATA_KEY constant, this will convert DateType objects to ColumnType(DateTime, "yyyy-mm-dd") and TimestampType objects to ColumnType(DateTime, "yyyy-mm-dd hh:mm:ss"); otherwise it will create a column type of ColumnType(DateTime, custom_date_format), where custom_date_format is whatever date format was specified by the column metadata.
- a Spark SQL DataFrame schema
the equivalent Alpine schema for that dataset
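A hedged sketch of the per-column date-format recovery described above, mirroring the stated defaults. The metadata key value is again an assumption:

```scala
import org.apache.spark.sql.types._

// Sketch only: recover the stored date format for a column, or fall back to
// the defaults described above. The metadata key value is assumed.
val DATE_METADATA_KEY = "dateFormat" // assumed value

def dateFormatFor(field: StructField): String = field.dataType match {
  case DateType | TimestampType if field.metadata.contains(DATE_METADATA_KEY) =>
    field.metadata.getString(DATE_METADATA_KEY)
  case DateType      => "yyyy-mm-dd"          // default for DateType
  case TimestampType => "yyyy-mm-dd hh:mm:ss" // default for TimestampType
  case other         => sys.error(s"Not a date/time column: $other")
}
```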
Convert the Alpine 'TabularSchema' with column names and types to the equivalent Spark SQL DataFrame header. Date/Time behavior: the same as convertTabularSchemaToSparkSQLSchema(tabularSchema, false). Will NOT convert special date formats to String; instead, Alpine date formats are rendered as the Spark SQL TimestampType, and the original date format is stored as metadata in the StructField object for that column definition.
An Alpine 'TabularSchemaOutline' object with fixed column definitions, each containing a name and an Alpine-specific type.
Checks whether the given file path already exists (which would cause a 'PathAlreadyExists' exception when we try to write to it) and, if it does, deletes the directory so that no existing results remain at that path.
- the full HDFS path
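A sketch of this check-and-delete using the Hadoop FileSystem API directly; the SDK's actual implementation may differ in details:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch: delete any pre-existing output at the path before writing.
def deleteFilePathIfExists(fullPath: String, conf: Configuration): Unit = {
  val path = new Path(fullPath)
  val fs: FileSystem = path.getFileSystem(conf)
  if (fs.exists(path)) {
    // Recursive delete: removes the "directory of part files" from a prior run.
    fs.delete(path, true)
  }
}
```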
For use with Hive. Returns a Spark DataFrame given a Hive table.
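For example (a sketch, assuming a Spark 1.x HiveContext to match the spark-csv usage elsewhere in these docs):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.hive.HiveContext

// Sketch: look up an existing Hive table by name and return it as a DataFrame.
def getDataFrameFromHiveTable(hiveContext: HiveContext, tableName: String): DataFrame =
  hiveContext.table(tableName)
```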
Returns a DataFrame from an Alpine HdfsTabularDataset. The DataFrame's schema will correspond to the column header of the Alpine dataset. Uses the databricks CSV parser from spark-csv with the following options:
1. withParseMode("DROPMALFORMED"): catch parse errors, such as a number format exception caused by a string value in a numeric column, and remove those rows rather than fail.
2. withTreatEmptyValuesAsNulls(true): the empty string will represent a null value in char columns, as it does in Alpine.
3. If a TSV, the delimiter attributes specified by the TSVAttributes object.
Date format behavior: DateTime columns are parsed as dates and then converted to TimestampType according to the format specified by the Alpine 'ColumnType' format argument. The original format is saved in the schema as metadata for that column. It can be accessed with SparkSqlDateTimeUtils.getDatFormatInfo(structField) for any given column.
Alpine-specific object, usually the input or output of an operator.
Spark SQL DataFrame
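A sketch of the equivalent read expressed through the DataFrameReader API (the SDK uses spark-csv's CsvParser builder; the options are equivalent). Here `schema` stands in for the converted Alpine 'TabularSchema', and `delimiter` for the relevant TSVAttributes field:

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.types.StructType

// Sketch only: read a delimited dataset with the options described above.
def readTabular(sqlContext: SQLContext, path: String, schema: StructType,
                delimiter: String = "\t"): DataFrame =
  sqlContext.read
    .format("com.databricks.spark.csv")
    .schema(schema)
    .option("mode", "DROPMALFORMED")            // drop rows that fail to parse
    .option("treatEmptyValuesAsNulls", "true")  // empty string -> null, as in Alpine
    .option("delimiter", delimiter)             // from the TSVAttributes object
    .load(path)
```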
JAVA TIMESTAMP OBJECT -> STRING. Takes in a DataFrame and a map of the column names to the date formats we want to print, and uses the Spark SQL UDF date_format to convert from the Timestamp type to a string representation of the date or time.
input data where the date columns are represented as Java Timestamp objects
columnName -> dateFormat to convert to
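A sketch of this conversion using the public date_format function; the SDK method's actual signature is not shown here:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, date_format}

// Sketch: for each (columnName -> dateFormat) pair, replace the timestamp
// column with its formatted string representation, preserving column names.
def timestampsToStrings(df: DataFrame, formats: Map[String, String]): DataFrame =
  formats.foldLeft(df) { case (d, (name, fmt)) =>
    d.withColumn(name, date_format(col(name), fmt))
  }
```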
STRING -> JAVA TIMESTAMP OBJECT (based on Unix timestamp). Takes in a DataFrame and a map of the column names to the date formats they contain, and uses the Spark SQL "unix_timestamp" UDF to change the columns with string dates into Unix timestamps in seconds, then a custom UDF to change those into Java dates. Preserves the original naming of the columns. Columns which were originally DateTime columns will now be of TimestampType rather than StringType.
the input DataFrame where the date columns are represented as strings.
columnName -> dateFormat for parsing
"Illegal Date Format" if one of the date formats provided is not a valid Java SimpleDateFormat pattern. And "Could not parse dates correctly. " if the date format is valid, but doesn't correspond to the data that is actually in the column.
Write a DataFrame as an HDFS Avro dataset, and return an instance of the Alpine HDFSAvroDataset type, which contains the 'TabularSchema' definition (created by converting the DataFrame schema) and the path to the saved data.
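A minimal sketch of the underlying write, assuming the databricks spark-avro package (consistent with the spark-csv dependency used elsewhere in this doc):

```scala
import org.apache.spark.sql.DataFrame

// Sketch: write the DataFrame as Avro part files under the given path.
def writeAvro(df: DataFrame, path: String): Unit =
  df.write.format("com.databricks.spark.avro").save(path)
```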
More general version of saveAsTSV. Write a DataFrame to HDFS as a tabular delimited file, and return an instance of the Alpine HDFSDelimitedTabularDataset type, which contains the Alpine 'TabularSchema' definition (created by converting the DataFrame schema) and the path to the saved data.
where the file will be written (this function will create a directory of part files)
- data to write
- an object which specifies how the file should be written
from parameters. Includes name and UUID
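A sketch of the delimited write using spark-csv's writer options; `delimiter` and `nullString` stand in for fields of the TSVAttributes object (the option names here are spark-csv's, not the SDK's):

```scala
import org.apache.spark.sql.DataFrame

// Sketch only: write a DataFrame as a delimited file via spark-csv.
def writeDelimited(df: DataFrame, path: String,
                   delimiter: String = "\t", nullString: String = ""): Unit =
  df.write
    .format("com.databricks.spark.csv")
    .option("delimiter", delimiter)   // e.g. "\t" for a TSV
    .option("nullValue", nullString)  // string that represents null values
    .save(path)
```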
Write a DataFrame to HDFS as a Parquet file, and return an instance of the HDFSParquet IO base type, which contains the Alpine 'TabularSchema' definition (created by converting the DataFrame schema) and the path to the saved data.
Write a DataFrame to HDFS as a tabular delimited file, and return an instance of the Alpine HDFSDelimitedTabularDataset type, which contains the Alpine 'TabularSchema' definition (created by converting the DataFrame schema) and the path to the saved data. Uses the default TSVAttributes object, which specifies that the data be written as a tab-delimited file. See TSVAttributes for more information, and use saveAsCSV to customize CSV options such as the null string and delimiters.
Save a data frame to a path using the given storage format, and return a corresponding HdfsTabularDataset object that points to the path.
The path to which we'll save the data frame.
The data frame that we want to save.
The format that we want to store in.
Whether to overwrite any existing file at the path.
Mandatory source operator information to be included in the output object.
Mandatory addendum information to be included in the output object.
After saving the data frame, returns an HdfsTabularDataset object.
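A hedged sketch of the dispatch this method implies. The sealed trait and its case names are hypothetical stand-ins for HdfsStorageFormatType; the real SDK variants may be named differently:

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical stand-in for HdfsStorageFormatType (names assumed).
sealed trait StorageFormat
case object Parquet extends StorageFormat
case object Avro extends StorageFormat
case object DelimitedTSV extends StorageFormat

// Sketch: save according to the requested format, honoring overwrite by
// deleting any existing output first (see deleteFilePathIfExists above).
def saveDataFrameSketch(df: DataFrame, path: String, format: StorageFormat,
                        overwrite: Boolean): Unit = {
  if (overwrite) { /* deleteFilePathIfExists(path, hadoopConf) */ }
  format match {
    case Parquet      => df.write.parquet(path)
    case Avro         => df.write.format("com.databricks.spark.avro").save(path)
    case DelimitedTSV => df.write.format("com.databricks.spark.csv")
                           .option("delimiter", "\t").save(path)
  }
}
```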
Will not properly handle date formats. Use toStructField instead.
Converts from a Spark SQL data type to an Alpine-specific ColumnType
This doesn't properly handle date formats. Use convertColumnTypeToSparkSQLDataType instead
Save a data frame to a path using the given storage format, and return a corresponding HdfsTabularDataset object that points to the path.
The path to which we'll save the data frame.
The data frame that we want to save.
The format that we want to store in.
Whether to overwrite any existing file at the path.
Mandatory source operator information to be included in the output object.
Mandatory addendum information to be included in the output object.
After saving the data frame, returns an HdfsTabularDataset object.
Use the signature with HdfsStorageFormatType rather than the HdfsStorageFormat enum, or use saveDataFrameDefault.
:: AlpineSdkApi ::