Convert from Alpine's 'ColumnType' to the corresponding Spark SQL type. DateTime behavior: converts all DateTime columns to TimestampType. If a format string is present, it is added to the column metadata; if there is no format string, the ISO format ("yyyy-mm-dd hh:mm:ss") is used.
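A minimal Scala sketch of this mapping. The value of the DATE_METADATA_KEY constant (referenced in the date-format notes below) and the helper's signature are assumptions for illustration, not the SDK's actual API:

```scala
import org.apache.spark.sql.types._

// Sketch only: how a DateTime ColumnType might map to a Spark SQL StructField,
// with the format string preserved as column metadata.
val DATE_METADATA_KEY = "dateFormat" // assumed value, for illustration

def dateTimeToStructField(name: String, formatOpt: Option[String]): StructField = {
  // No format string -> fall back to the ISO format described above.
  val format = formatOpt.getOrElse("yyyy-mm-dd hh:mm:ss")
  val metadata = new MetadataBuilder().putString(DATE_METADATA_KEY, format).build()
  // All DateTime columns become TimestampType regardless of format.
  StructField(name, TimestampType, nullable = true, metadata)
}
```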
Converts from a Spark SQL StructField to an Alpine-specific ColumnDef.
Converts from a Spark SQL schema to the Alpine 'TabularSchema' type. The 'TabularSchema' object this method returns can be used to create any of the tabular Alpine IO types (HdfsTabularDataset, DataTable, etc.).
Date format behavior: if the column definition has no metadata stored at the DATE_METADATA_KEY constant, this will convert DateType objects to ColumnType(DateTime, "yyyy-mm-dd") and TimestampType objects to ColumnType(DateTime, "yyyy-mm-dd hh:mm:ss"); otherwise it will create a column type of ColumnType(DateTime, custom_date_format), where custom_date_format is whatever date format was specified by the column metadata.
- a Spark SQL DataFrame schema
the equivalent Alpine schema for that dataset
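A hedged sketch of the per-column date-format recovery described above, mirroring the stated defaults. The metadata key value is again an assumption:

```scala
import org.apache.spark.sql.types._

// Sketch only: recover the stored date format for a column, or fall back to
// the defaults described above. The metadata key value is assumed.
val DATE_METADATA_KEY = "dateFormat" // assumed value

def dateFormatFor(field: StructField): String = field.dataType match {
  case DateType | TimestampType if field.metadata.contains(DATE_METADATA_KEY) =>
    field.metadata.getString(DATE_METADATA_KEY)
  case DateType      => "yyyy-mm-dd"          // default for DateType
  case TimestampType => "yyyy-mm-dd hh:mm:ss" // default for TimestampType
  case other         => sys.error(s"Not a date/time column: $other")
}
```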
Convert the Alpine 'TabularSchema' with column names and types to the equivalent Spark SQL DataFrame header. Date/Time behavior: the same as convertTabularSchemaToSparkSQLSchema(tabularSchema, false). Will NOT convert special date formats to String; instead, Alpine date formats are rendered as the Spark SQL TimestampType, and the original date format is stored as metadata in the StructField object for that column definition.
An Alpine 'TabularSchemaOutline' object with fixed column definitions, each containing a name and an Alpine-specific type.
Checks whether the given file path already exists (which would cause a 'PathAlreadyExists' exception when we try to write to it) and, if it does, deletes the directory so that no existing results remain at that path.
- the full HDFS path
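A sketch of this check-and-delete using the Hadoop FileSystem API directly; the SDK's actual implementation may differ in details:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch: delete any pre-existing output at the path before writing.
def deleteFilePathIfExists(fullPath: String, conf: Configuration): Unit = {
  val path = new Path(fullPath)
  val fs: FileSystem = path.getFileSystem(conf)
  if (fs.exists(path)) {
    // Recursive delete: removes the "directory of part files" from a prior run.
    fs.delete(path, true)
  }
}
```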
For use with Hive. Returns a Spark DataFrame given a Hive table.
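For example (a sketch, assuming a Spark 1.x HiveContext to match the spark-csv usage elsewhere in these docs):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.hive.HiveContext

// Sketch: look up an existing Hive table by name and return it as a DataFrame.
def getDataFrameFromHiveTable(hiveContext: HiveContext, tableName: String): DataFrame =
  hiveContext.table(tableName)
```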
Returns a DataFrame from an Alpine HdfsTabularDataset. The DataFrame's schema will correspond to the column header of the Alpine dataset. Uses the databricks CSV parser from spark-csv with the following options:
1. withParseMode("DROPMALFORMED"): catch parse errors, such as a number format exception caused by a string value in a numeric column, and remove those rows rather than fail.
2. withTreatEmptyValuesAsNulls(true): the empty string will represent a null value in char columns, as it does in Alpine.
3. If a TSV, the delimiter attributes specified by the TSVAttributes object.
Date format behavior: DateTime columns are parsed as dates and then converted to TimestampType according to the format specified by the Alpine 'ColumnType' format argument. The original format is saved in the schema as metadata for that column. It can be accessed with SparkSqlDateTimeUtils.getDatFormatInfo(structField) for any given column.
Alpine-specific object, usually the input or output of an operator.
Spark SQL DataFrame
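A sketch of the equivalent read expressed through the DataFrameReader API (the SDK uses spark-csv's CsvParser builder; the options are equivalent). Here `schema` stands in for the converted Alpine 'TabularSchema', and `delimiter` for the relevant TSVAttributes field:

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.types.StructType

// Sketch only: read a delimited dataset with the options described above.
def readTabular(sqlContext: SQLContext, path: String, schema: StructType,
                delimiter: String = "\t"): DataFrame =
  sqlContext.read
    .format("com.databricks.spark.csv")
    .schema(schema)
    .option("mode", "DROPMALFORMED")            // drop rows that fail to parse
    .option("treatEmptyValuesAsNulls", "true")  // empty string -> null, as in Alpine
    .option("delimiter", delimiter)             // from the TSVAttributes object
    .load(path)
```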
JAVA TIMESTAMP OBJECT -> STRING. Takes in a DataFrame and a map of the column names to the date formats we want to print, and uses the Spark SQL UDF date_format to convert from the Timestamp type to a string representation of the date or time.
input data where the date columns are represented as Java Timestamp objects
columnName -> dateFormat to convert to
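A sketch of this conversion using the public date_format function; the SDK method's actual signature is not shown here:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, date_format}

// Sketch: for each (columnName -> dateFormat) pair, replace the timestamp
// column with its formatted string representation, preserving column names.
def timestampsToStrings(df: DataFrame, formats: Map[String, String]): DataFrame =
  formats.foldLeft(df) { case (d, (name, fmt)) =>
    d.withColumn(name, date_format(col(name), fmt))
  }
```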
STRING -> JAVA TIMESTAMP OBJECT (based on Unix timestamp). Takes in a DataFrame and a map of the column names to the date formats they contain, and uses the Spark SQL "unix_timestamp" UDF to change the columns with string dates into Unix timestamps in seconds, then a custom UDF to change those into Java dates. Preserves the original naming of the columns. Columns which were originally DateTime columns will now be of TimestampType rather than StringType.
the input DataFrame where the date columns are represented as strings.
columnName -> dateFormat for parsing
"Illegal Date Format" if one of the date formats provided is not a valid Java SimpleDateFormat pattern. And "Could not parse dates correctly. " if the date format is valid, but doesn't correspond to the data that is actually in the column.
Write a DataFrame as an HDFS Avro dataset, and return an instance of the Alpine HDFSAvroDataset type, which contains the 'TabularSchema' definition (created by converting the DataFrame schema) and the path to the saved data.
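A minimal sketch of the underlying write, assuming the databricks spark-avro package (consistent with the spark-csv dependency used elsewhere in this doc):

```scala
import org.apache.spark.sql.DataFrame

// Sketch: write the DataFrame as Avro part files under the given path.
def writeAvro(df: DataFrame, path: String): Unit =
  df.write.format("com.databricks.spark.avro").save(path)
```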
More general version of saveAsTSV. Write a DataFrame to HDFS as a tabular delimited file, and return an instance of the Alpine HDFSDelimitedTabularDataset type, which contains the Alpine 'TabularSchema' definition (created by converting the DataFrame schema) and the path to the saved data.
where the file will be written (this function will create a directory of part files)
- data to write
- an object which specifies how the file should be written
from parameters. Includes name and UUID
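A sketch of the delimited write using spark-csv's writer options; `delimiter` and `nullString` stand in for fields of the TSVAttributes object (the option names here are spark-csv's, not the SDK's):

```scala
import org.apache.spark.sql.DataFrame

// Sketch only: write a DataFrame as a delimited file via spark-csv.
def writeDelimited(df: DataFrame, path: String,
                   delimiter: String = "\t", nullString: String = ""): Unit =
  df.write
    .format("com.databricks.spark.csv")
    .option("delimiter", delimiter)   // e.g. "\t" for a TSV
    .option("nullValue", nullString)  // string that represents null values
    .save(path)
```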
Write a DataFrame to HDFS as a Parquet file, and return an instance of the HDFSParquet IO base type, which contains the Alpine 'TabularSchema' definition (created by converting the DataFrame schema) and the path to the saved data.
Write a DataFrame to HDFS as a tabular delimited file, and return an instance of the Alpine HDFSDelimitedTabularDataset type, which contains the Alpine 'TabularSchema' definition (created by converting the DataFrame schema) and the path to the saved data. Uses the default TSVAttributes object, which specifies that the data be written as a tab-delimited file. See TSVAttributes for more information, and use saveAsCSV to customize CSV options such as the null string and delimiters.
Save a data frame to a path using the given storage format, and return a corresponding HdfsTabularDataset object that points to the path.
The path to which we'll save the data frame.
The data frame that we want to save.
The format that we want to store in.
Whether to overwrite any existing file at the path.
Mandatory source operator information to be included in the output object.
Mandatory addendum information to be included in the output object.
After saving the data frame, returns an HdfsTabularDataset object.
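A hedged sketch of the dispatch this method implies. The sealed trait and its case names are hypothetical stand-ins for HdfsStorageFormatType; the real SDK variants may be named differently:

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical stand-in for HdfsStorageFormatType (names assumed).
sealed trait StorageFormat
case object Parquet extends StorageFormat
case object Avro extends StorageFormat
case object DelimitedTSV extends StorageFormat

// Sketch: save according to the requested format, honoring overwrite by
// deleting any existing output first (see deleteFilePathIfExists above).
def saveDataFrameSketch(df: DataFrame, path: String, format: StorageFormat,
                        overwrite: Boolean): Unit = {
  if (overwrite) { /* deleteFilePathIfExists(path, hadoopConf) */ }
  format match {
    case Parquet      => df.write.parquet(path)
    case Avro         => df.write.format("com.databricks.spark.avro").save(path)
    case DelimitedTSV => df.write.format("com.databricks.spark.csv")
                           .option("delimiter", "\t").save(path)
  }
}
```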
Will not properly handle date formats. Use toStructField instead.
Converts from a Spark SQL data type to an Alpine-specific ColumnType
This doesn't properly handle date formats. Use convertColumnTypeToSparkSQLDataType instead
Save a data frame to a path using the given storage format, and return a corresponding HdfsTabularDataset object that points to the path.
The path to which we'll save the data frame.
The data frame that we want to save.
The format that we want to store in.
Whether to overwrite any existing file at the path.
Mandatory source operator information to be included in the output object.
Mandatory addendum information to be included in the output object.
After saving the data frame, returns an HdfsTabularDataset object.
Use the signature with HdfsStorageFormatType rather than the HdfsStorageFormat enum, or use saveDataFrameDefault.
:: AlpineSdkApi ::