com.alpine.plugin.core.spark.utils

SparkRuntimeUtils

class SparkRuntimeUtils extends SparkSchemaUtils

:: AlpineSdkApi ::

Annotations
@AlpineSdkApi()
Linear Supertypes
SparkSchemaUtils, AnyRef, Any

Instance Constructors

  1. new SparkRuntimeUtils(sc: SparkContext)

Value Members

  1. final def !=(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  2. final def !=(arg0: Any): Boolean

    Definition Classes
    Any
  3. final def ##(): Int

    Definition Classes
    AnyRef → Any
  4. final def ==(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  5. final def ==(arg0: Any): Boolean

    Definition Classes
    Any
  6. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  7. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  8. def convertColumnTypeToSparkSQLDataType(columnType: TypeValue, keepDatesAsStrings: Boolean): (DataType, Option[String])

    Convert from Alpine's 'ColumnType' to the corresponding Spark SQL type.

    DateTime behavior: converts all DateTime columns to TimestampType. If the column has a format string, it is added to the metadata; if there is no format string, the ISO format ("yyyy-mm-dd hh:mm:ss") is used.

    Definition Classes
    SparkSchemaUtils
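
    A minimal sketch of the DateTime conversion described above. The SparkContext and the DateTime TypeValue are placeholders, and Alpine SDK imports are omitted and assumed.

      import org.apache.spark.SparkContext
      import org.apache.spark.sql.types.DataType

      val sc: SparkContext = ???                 // provided by the Spark runtime
      val utils = new SparkRuntimeUtils(sc)

      val dateTimeType: TypeValue = ???          // a DateTime TypeValue from Alpine's ColumnType (assumed)
      val (sparkType, formatOpt): (DataType, Option[String]) =
        utils.convertColumnTypeToSparkSQLDataType(dateTimeType, keepDatesAsStrings = false)
      // sparkType is TimestampType; formatOpt carries the column's format string
      // (or the ISO default) to be stored as StructField metadata.
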
  9. def convertSparkSQLDataTypeToColumnType(structField: StructField): ColumnDef

    Converts from a Spark SQL StructField to an Alpine-specific ColumnDef.

    Definition Classes
    SparkSchemaUtils
  10. def convertSparkSQLSchemaToTabularSchema(schema: StructType): TabularSchema

    Converts from a Spark SQL schema to the Alpine 'TabularSchema' type. The 'TabularSchema' object this method returns can be used to create any of the tabular Alpine IO types (HDFSTabular dataset, dataTable, etc.).

    Date format behavior: if the column definition has no metadata stored at the DATE_METADATA_KEY constant, this method will convert DateType objects to ColumnType(DateTime, "yyyy-mm-dd") and TimestampType objects to ColumnType(DateTime, "yyyy-mm-dd hh:mm:ss"). Otherwise it will create a column type of ColumnType(DateTime, custom_date_format), where custom_date_format is the date format specified by the column metadata.

    schema

    a Spark SQL DataFrame schema

    returns

    the equivalent Alpine schema for that dataset

    Definition Classes
    SparkSchemaUtils
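
    A short sketch of the conversion described above; the DataFrame is an arbitrary placeholder and Alpine SDK imports are omitted.

      import org.apache.spark.SparkContext
      import org.apache.spark.sql.DataFrame

      val sc: SparkContext = ???                 // provided by the Spark runtime
      val utils = new SparkRuntimeUtils(sc)

      val df: DataFrame = ???                    // any DataFrame, e.g. from getDataFrame
      val tabularSchema = utils.convertSparkSQLSchemaToTabularSchema(df.schema)
      // Columns whose StructField metadata carries a date format keep that format;
      // plain DateType / TimestampType columns fall back to the defaults noted above.
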
  11. def convertTabularSchemaToSparkSQLSchema(tabularSchema: TabularSchema, keepDatesAsStrings: Boolean): StructType

    Definition Classes
    SparkSchemaUtils
  12. def convertTabularSchemaToSparkSQLSchema(tabularSchema: TabularSchema): StructType

    Convert the Alpine 'TabularSchema' with column names and types to the equivalent Spark SQL data frame header.

    Date/Time behavior: the same as convertTabularSchemaToSparkSQLSchema(tabularSchema, false). Will NOT convert special date formats to String; instead, Alpine date formats are rendered as Spark SQL TimestampType. The original date format is stored as metadata in the StructField object for that column definition.

    tabularSchema

    An Alpine 'TabularSchemaOutline' object with fixed column definitions, each containing a name and an Alpine-specific type.

    Definition Classes
    SparkSchemaUtils
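
    A sketch of the other direction, from an operator's tabular input to a Spark SQL StructType. The HdfsTabularDataset placeholder and its tabularSchema accessor are assumptions about the surrounding operator code; Alpine SDK imports are omitted.

      import org.apache.spark.SparkContext
      import org.apache.spark.sql.types.StructType

      val sc: SparkContext = ???
      val utils = new SparkRuntimeUtils(sc)

      val input: HdfsTabularDataset = ???        // the operator's tabular input (assumed)
      val header: StructType = utils.convertTabularSchemaToSparkSQLSchema(input.tabularSchema)
      // DateTime columns become TimestampType; their original formats are kept
      // as metadata on the corresponding StructField entries.
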
  13. def deleteFilePathIfExists(outputPathStr: String): AnyVal

    Checks whether the given file path already exists (which would cause a 'PathAlreadyExists' exception when we try to write to it) and, if it does, deletes the directory to remove the existing results at that path.

    outputPathStr

    - the full HDFS path

  14. def deleteOrFailIfExists[T <: HdfsStorageFormatType](path: String, overwrite: Boolean): Unit

    Annotations
    @throws( ... )
  15. lazy val driverHdfs: FileSystem

  16. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  17. def equals(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  18. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  19. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  20. def getDataFrame(dataset: HiveTable): DataFrame

    For use with Hive. Returns a Spark DataFrame given a Hive table.

  21. def getDataFrame(dataset: HdfsTabularDataset): DataFrame

    Returns a DataFrame from an Alpine HdfsTabularDataset. The DataFrame's schema will correspond to the column header of the Alpine dataset. Uses the Databricks CSV parser from spark-csv with the following options:

    1. withParseMode("DROPMALFORMED"): catch parse errors, such as a number format exception caused by a string value in a numeric column, and drop those rows rather than fail.
    2. withTreatEmptyValuesAsNulls(true): the empty string represents a null value in char columns, as it does in Alpine.
    3. If the dataset is a CSV, the delimiter attributes specified by the CSV attributes object.

    Date format behavior: DateTime columns are parsed as dates and then converted to TimestampType according to the format specified by the Alpine 'ColumnType' format argument. The original format is saved in the schema as metadata for that column. It can be accessed with SparkSqlDateTimeUtils.getDatFormatInfo(structField) for any given column.

    dataset

    Alpine specific object. Usually input or output of operator.

    returns

    Spark SQL DataFrame
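
    A typical usage sketch inside an operator's Spark job; the input dataset is a placeholder for the operator's HdfsTabularDataset input, and Alpine SDK imports are omitted.

      import org.apache.spark.SparkContext
      import org.apache.spark.sql.DataFrame

      val sc: SparkContext = ???                 // provided by the Spark runtime
      val utils = new SparkRuntimeUtils(sc)

      val input: HdfsTabularDataset = ???        // the operator's input dataset (assumed)
      val df: DataFrame = utils.getDataFrame(input)
      df.printSchema()                           // DateTime columns appear as timestamps, with
                                                 // their original formats attached as metadata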

  22. def getDataFrameGeneral(dataset: TabularDataset): DataFrame

    Returns the DataFrame corresponding to a given tabular dataset (which can be a Hive table or an HDFS path).

    dataset

    The dataset containing a reference to the data on disk.

    returns

    A DataFrame representation of the dataset.

  23. def getDataFrameHiveContext(dataset: HdfsTabularDataset): DataFrame

    Returns a DataFrame from an Alpine HdfsTabularDataset, created using a HiveContext (vs. the SQLContext used in getDataFrame). Using a HiveContext allows window function calls on the DataFrame and access to Hive UDFs. It does not require a prior Hive setup, but creating a HiveContext pulls in large dependencies, so this method should not be used if there is no need to leverage Hive UDFs.

    (NOTE: loading the Hive dependencies might require modifying "spark.driver.extraJavaOptions" to increase MaxPermSize and enable CMSClassUnloadingEnabled and UseConcMarkSweepGC. These parameters can be added in the "Advanced Spark Settings" dialog window of the operator.)

    dataset

    Alpine specific object. Usually input or output of operator.

    returns

    Spark SQL DataFrame that allows window function calls and Hive UDFs access.
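
    A sketch of the window-function use case that motivates the HiveContext variant. Column names are illustrative placeholders, and Alpine SDK imports are omitted.

      import org.apache.spark.SparkContext
      import org.apache.spark.sql.expressions.Window
      import org.apache.spark.sql.functions.rank

      val sc: SparkContext = ???
      val utils = new SparkRuntimeUtils(sc)
      val input: HdfsTabularDataset = ???        // the operator's input dataset (assumed)

      val df = utils.getDataFrameHiveContext(input)
      val w = Window.partitionBy("group_col").orderBy("value_col")
      val ranked = df.withColumn("rank", rank().over(w))   // window functions work on this DataFrame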

  24. def getDateMap(tabularSchema: TabularSchema): Map[String, String]

    Definition Classes
    SparkSchemaUtils
  25. def getSparkDataset(dataset: TabularDataset): Dataset[Row]

  26. def hashCode(): Int

    Definition Classes
    AnyRef → Any
  27. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  28. def mapDFtoCustomDateTimeFormat(dataFrame: DataFrame, map: Map[String, String]): DataFrame

    JAVA TIMESTAMP OBJECT --> STRING (via Joda). Takes in a DataFrame and a map from column names to the date formats we want to print. Converts to a DataFrame in which the date columns are strings formatted according to the map. Because the formats are Joda formats, we have to round-trip via Unix time: a UDF converts each column of Sql.Timestamp type to Unix time, and Joda then formats the Unix timestamp according to that column's format string from the map.

    dataFrame

    input data where the date columns are represented as Java Timestamp objects

    map

    columnName -> dateFormat to convert to
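
    A sketch of formatting timestamp columns back to strings; the column names and patterns are examples only, and Alpine SDK imports are omitted.

      import org.apache.spark.SparkContext
      import org.apache.spark.sql.DataFrame

      val sc: SparkContext = ???
      val utils = new SparkRuntimeUtils(sc)
      val df: DataFrame = ???                    // date columns held as java.sql.Timestamp

      val formats = Map(
        "order_date"   -> "dd/MM/yyyy",          // Joda patterns, one per column to format
        "shipped_time" -> "yyyy-MM-dd HH:mm:ss"
      )
      val withStringDates = utils.mapDFtoCustomDateTimeFormat(df, formats)
      // The two columns are now strings in the requested formats; other columns are untouched.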

  29. def mapDFtoUnixDateTime(dataFrame: DataFrame, map: Map[String, String]): DataFrame

    STRING --> JAVA TIMESTAMP OBJECT (based on Unix timestamps). Takes in a DataFrame and a map from column names to the date formats they contain, and changes the columns with string dates into Java Timestamp objects by parsing them according to the format defined in the map. Because Alpine uses Joda date formats, we use Joda to parse the date strings into Unix timestamps and then convert the Unix timestamps to java.sql objects. Preserves the original column names. Columns which were originally DateTime columns will now be of TimestampType rather than StringType.

    dataFrame

    the input DataFrame, where the date columns are represented as strings.

    map

    columnName -> dateFormat for parsing

    Exceptions thrown
    Exception

    "Illegal Date Format" if one of the date formats provided is not a valid Java SimpleDateFormat pattern. And "Could not parse dates correctly. " if the date format is valid, but doesn't correspond to the data that is actually in the column.

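    A sketch of the string-to-timestamp direction; the column name and pattern are examples only, and Alpine SDK imports are omitted.

      import org.apache.spark.SparkContext
      import org.apache.spark.sql.DataFrame

      val sc: SparkContext = ???
      val utils = new SparkRuntimeUtils(sc)
      val rawDf: DataFrame = ???                 // date columns held as strings

      val parseFormats = Map("order_date" -> "dd/MM/yyyy")
      val withTimestamps = utils.mapDFtoUnixDateTime(rawDf, parseFormats)
      // "order_date" is now TimestampType; a bad pattern or mismatched data raises
      // the exceptions listed above.
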
  30. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  31. final def notify(): Unit

    Definition Classes
    AnyRef
  32. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  33. def saveAsAvro(path: String, dataFrame: DataFrame, sourceOperatorInfo: Option[OperatorInfo], addendum: Map[String, AnyRef] = Map[String, AnyRef]()): HdfsAvroDataset

    Write a DataFrame as an HDFSAvro dataset, and return an instance of the Alpine HDFSAvroDataset type, which contains the 'TabularSchema' definition (created by converting the DataFrame schema) and the path to the saved data.

  34. def saveAsCSV(path: String, dataFrame: DataFrame, tSVAttributes: TSVAttributes, sourceOperatorInfo: Option[OperatorInfo], addendum: Map[String, AnyRef] = Map[String, AnyRef]()): HdfsDelimitedTabularDatasetDefault

    More general version of saveAsCSV. Write a DataFrame to HDFS as a tabular delimited file, and return an instance of the Alpine HDFSDelimitedTabularDataset type, which contains the Alpine 'TabularSchema' definition (created by converting the DataFrame schema) and the path to the saved data. Also writes the ".alpine_metadata" file to the result directory so that the user can drag and drop the result output and use it without configuring the dataset.

    path

    where file will be written (this function will create a directory of part files)

    dataFrame

    - data to write

    tSVAttributes

    - an object which specifies how the file should be written

    sourceOperatorInfo

    from parameters. Includes name and UUID.

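    A sketch of writing an operator's result. The output path, TSVAttributes value, and OperatorInfo are placeholders supplied by the surrounding operator code; Alpine SDK imports are omitted.

      import org.apache.spark.SparkContext
      import org.apache.spark.sql.DataFrame

      val sc: SparkContext = ???
      val utils = new SparkRuntimeUtils(sc)
      val df: DataFrame = ???                    // the result to persist
      val attrs: TSVAttributes = ???             // e.g. comma-delimited attributes
      val operatorInfo: OperatorInfo = ???       // from the operator parameters
      val outputPath: String = ???               // HDFS path; a directory of part files is created here

      val output = utils.saveAsCSV(
        path = outputPath,
        dataFrame = df,
        tSVAttributes = attrs,
        sourceOperatorInfo = Some(operatorInfo)
      )
      // `output` carries the TabularSchema and the path; ".alpine_metadata" is
      // written alongside the part files.
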
  35. def saveAsCSVoMetadata(path: String, dataFrame: DataFrame, tSVAttributes: TSVAttributes, sourceOperatorInfo: Option[OperatorInfo], addendum: Map[String, AnyRef] = Map[String, AnyRef]()): HdfsDelimitedTabularDatasetDefault

    More general version of saveAsTSV. Write a DataFrame to HDFS as a tabular delimited file, and return an instance of the Alpine HDFSDelimitedTabularDataset type, which contains the Alpine 'TabularSchema' definition (created by converting the DataFrame schema) and the path to the saved data.

    path

    where file will be written (this function will create a directory of part files)

    dataFrame

    - data to write

    tSVAttributes

    - an object which specifies how the file should be written

    sourceOperatorInfo

    from parameters. Includes name and UUID

  36. def saveAsParquet(path: String, dataFrame: DataFrame, sourceOperatorInfo: Option[OperatorInfo], addendum: Map[String, AnyRef] = Map[String, AnyRef]()): HdfsParquetDataset

    Write a DataFrame to HDFS as a Parquet file, and return an instance of the HDFSParquet IO base type which contains the Alpine 'TabularSchema' definition (created by converting the DataFrame schema) and the path to the saved data.

  37. def saveAsTSV(path: String, dataFrame: DataFrame, sourceOperatorInfo: Option[OperatorInfo], addendum: Map[String, AnyRef] = Map[String, AnyRef]()): HdfsDelimitedTabularDataset

    Write a DataFrame to HDFS as a tab-delimited file, and return an instance of the Alpine HDFSDelimitedTabularDataset type, which contains the Alpine 'TabularSchema' definition (created by converting the DataFrame schema) and the path to the saved data. Uses the default TSVAttributes object, which specifies that the data be written as a tab-delimited file. See TSVAttributes for more information, and use saveAsCSV to customize CSV options such as the null string and delimiters.

    Also writes the ".alpine_metadata" file to the result directory so that the user can drag and drop the result output and use it without configuring the dataset.

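    A short sketch, assuming the same placeholders (sc, df, outputPath, operatorInfo) as in the saveAsCSV example above.

      val utils = new SparkRuntimeUtils(sc)
      val tsvOutput = utils.saveAsTSV(outputPath, df, Some(operatorInfo))
      // Tab-delimited part files plus ".alpine_metadata" are written under outputPath.
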
  38. def saveDataFrame[T <: HdfsStorageFormatType](path: String, dataFrame: DataFrame, storageFormat: T, overwrite: Boolean, sourceOperatorInfo: Option[OperatorInfo], addendum: Map[String, AnyRef], tSVAttributes: TSVAttributes): HdfsTabularDataset

    Save a data frame to a path using the given storage format, and return a corresponding HdfsTabularDataset object that points to the path.

    path

    The path to which we'll save the data frame.

    dataFrame

    The data frame that we want to save.

    storageFormat

    The format that we want to store in. Defaults to CSV

    overwrite

    Whether to overwrite any existing file at the path.

    sourceOperatorInfo

    Mandatory source operator information to be included in the output object. Note: Only set to None in testing. This information is required for Touchpoints and if this is an output to an operator with multiple inputs.

    addendum

    Mandatory addendum information to be included in the output object.

    returns

    After saving the data frame, returns an HdfsTabularDataset object.
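
    A sketch of the generic save path. The HdfsStorageFormatType value and the other inputs are placeholders, and Alpine SDK imports are omitted.

      import org.apache.spark.SparkContext
      import org.apache.spark.sql.DataFrame

      val sc: SparkContext = ???
      val utils = new SparkRuntimeUtils(sc)
      val df: DataFrame = ???
      val format: HdfsStorageFormatType = ???    // e.g. a CSV or Parquet format value (assumed)
      val operatorInfo: OperatorInfo = ???
      val attrs: TSVAttributes = ???             // only relevant for delimited output
      val outputPath: String = ???

      val saved = utils.saveDataFrame(
        path = outputPath,
        dataFrame = df,
        storageFormat = format,
        overwrite = true,                        // clear any existing results first
        sourceOperatorInfo = Some(operatorInfo),
        addendum = Map[String, AnyRef](),
        tSVAttributes = attrs
      )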

  39. def saveDataFrameDefault[T <: HdfsStorageFormatType](path: String, dataFrame: DataFrame, sourceOperatorInfo: Option[OperatorInfo]): HdfsTabularDataset

    Calls the above method, but defaults to saving as a comma-separated file with overwrite = true and no addendum.

    path

    The path to which we'll save the data frame.

    dataFrame

    The data frame that we want to save.

    sourceOperatorInfo

    Mandatory source operator information to be included in the output object. Note: Only set to None in testing. This information is required for Touchpoints and if this is an output to an operator with multiple inputs.

    returns

    the HdfsTabularData object corresponding to the output
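
    The simplest save path, using the defaults described above and the same placeholders (sc, df, outputPath, operatorInfo) as in the earlier save examples.

      val utils = new SparkRuntimeUtils(sc)
      val saved = utils.saveDataFrameDefault(outputPath, df, Some(operatorInfo))
      // Comma-separated output, overwrite = true, no addendum.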

  40. def saveDatasetAsParquet(path: String, dataset: Dataset[Row], addendum: Map[String, AnyRef] = Map[String, AnyRef]()): HdfsParquetDataset

  41. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  42. def toString(): String

    Definition Classes
    AnyRef → Any
  43. def toStructField(columnDef: ColumnDef, nullable: Boolean = true): StructField

    Definition Classes
    SparkSchemaUtils
  44. def validateDateFormatMap(map: Map[String, String]): Unit

  45. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  46. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  47. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )

Deprecated Value Members

  1. def convertColumnTypeToSparkSQLDataType(columnType: TypeValue): DataType

    Definition Classes
    SparkSchemaUtils
    Annotations
    @deprecated
    Deprecated

    Will not properly handle date formats. Use toStructField instead.

  2. def convertSparkSQLDataTypeToColumnType(dataType: DataType): TypeValue

    Converts from a Spark SQL data type to an Alpine-specific ColumnType.

    Definition Classes
    SparkSchemaUtils
    Annotations
    @deprecated
    Deprecated

    This doesn't properly handle date formats. Use convertColumnTypeToSparkSQLDataType instead

  3. def saveDataFrame(path: String, dataFrame: DataFrame, storageFormat: HdfsStorageFormat, overwrite: Boolean, sourceOperatorInfo: Option[OperatorInfo], addendum: Map[String, AnyRef] = Map[String, AnyRef]()): HdfsTabularDataset

    Save a data frame to a path using the given storage format, and return a corresponding HdfsTabularDataset object that points to the path.

    path

    The path to which we'll save the data frame.

    dataFrame

    The data frame that we want to save.

    storageFormat

    The format that we want to store in.

    overwrite

    Whether to overwrite any existing file at the path.

    sourceOperatorInfo

    Mandatory source operator information to be included in the output object. Only set to None in testing, this is required for Touchpoints and if this is an output to an operator with multiple inputs.

    addendum

    Mandatory addendum information to be included in the output object.

    returns

    After saving the data frame, returns an HdfsTabularDataset object.

    Annotations
    @deprecated
    Deprecated

    Use the signature with HdfsStorageFormatType rather than the HdfsStorageFormat enum, or use saveDataFrameDefault.
