com.alpine.plugin.core.spark.utils

SparkRuntimeUtils

class SparkRuntimeUtils extends SparkSchemaUtils

:: AlpineSdkApi ::

Annotations
@AlpineSdkApi()
Linear Supertypes
SparkSchemaUtils, AnyRef, Any

Instance Constructors

  1. new SparkRuntimeUtils(sc: SparkContext)

Value Members

  1. final def !=(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  2. final def !=(arg0: Any): Boolean

    Definition Classes
    Any
  3. final def ##(): Int

    Definition Classes
    AnyRef → Any
  4. final def ==(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  5. final def ==(arg0: Any): Boolean

    Definition Classes
    Any
  6. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  7. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  8. def convertColumnTypeToSparkSQLDataType(columnType: TypeValue, keepDatesAsStrings: Boolean): (DataType, Option[String])

    Convert from Alpine's 'ColumnType' to the corresponding Spark SQL type.

    DateTime behavior: converts all DateTime columns to TimestampType. If the column has a format string, it is added to the metadata; if there is no format string, the ISO format ("yyyy-mm-dd hh:mm:ss") is used.

    Definition Classes
    SparkSchemaUtils
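
    A minimal sketch of the DateTime conversion described above. The SparkContext and the DateTime TypeValue are placeholders, and Alpine SDK imports are omitted and assumed.

      import org.apache.spark.SparkContext
      import org.apache.spark.sql.types.DataType

      val sc: SparkContext = ???                 // provided by the Spark runtime
      val utils = new SparkRuntimeUtils(sc)

      val dateTimeType: TypeValue = ???          // a DateTime TypeValue from Alpine's ColumnType (assumed)
      val (sparkType, formatOpt): (DataType, Option[String]) =
        utils.convertColumnTypeToSparkSQLDataType(dateTimeType, keepDatesAsStrings = false)
      // sparkType is TimestampType; formatOpt carries the column's format string
      // (or the ISO default) to be stored as StructField metadata.
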
  9. def convertSparkSQLDataTypeToColumnType(structField: StructField): ColumnDef

    Converts from a Spark SQL StructField to an Alpine-specific ColumnDef.

    Definition Classes
    SparkSchemaUtils
  10. def convertSparkSQLSchemaToTabularSchema(schema: StructType): TabularSchema

    Converts from a Spark SQL schema to the Alpine 'TabularSchema' type. The 'TabularSchema' object this method returns can be used to create any of the tabular Alpine IO types (HDFSTabular dataset, dataTable, etc.).

    Date format behavior: if the column definition has no metadata stored at the DATE_METADATA_KEY constant, this method will convert DateType objects to ColumnType(DateTime, "yyyy-mm-dd") and TimestampType objects to ColumnType(DateTime, "yyyy-mm-dd hh:mm:ss"). Otherwise it will create a column type of ColumnType(DateTime, custom_date_format), where custom_date_format is the date format specified by the column metadata.

    schema

    a Spark SQL DataFrame schema

    returns

    the equivalent Alpine schema for that dataset

    Definition Classes
    SparkSchemaUtils
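
    A short sketch of the conversion described above; the DataFrame is an arbitrary placeholder and Alpine SDK imports are omitted.

      import org.apache.spark.SparkContext
      import org.apache.spark.sql.DataFrame

      val sc: SparkContext = ???                 // provided by the Spark runtime
      val utils = new SparkRuntimeUtils(sc)

      val df: DataFrame = ???                    // any DataFrame, e.g. from getDataFrame
      val tabularSchema = utils.convertSparkSQLSchemaToTabularSchema(df.schema)
      // Columns whose StructField metadata carries a date format keep that format;
      // plain DateType / TimestampType columns fall back to the defaults noted above.
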
  11. def convertTabularSchemaToSparkSQLSchema(tabularSchema: TabularSchema, keepDatesAsStrings: Boolean): StructType

    Definition Classes
    SparkSchemaUtils
  12. def convertTabularSchemaToSparkSQLSchema(tabularSchema: TabularSchema): StructType

    Convert the Alpine 'TabularSchema' with column names and types to the equivalent Spark SQL data frame header.

    Date/Time behavior: the same as convertTabularSchemaToSparkSQLSchema(tabularSchema, false). Will NOT convert special date formats to String; instead, Alpine date formats are rendered as Spark SQL TimestampType. The original date format is stored as metadata in the StructField object for that column definition.

    tabularSchema

    An Alpine 'TabularSchemaOutline' object with fixed column definitions, each containing a name and an Alpine-specific type.

    Definition Classes
    SparkSchemaUtils
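
    A sketch of the other direction, from an operator's tabular input to a Spark SQL StructType. The HdfsTabularDataset placeholder and its tabularSchema accessor are assumptions about the surrounding operator code; Alpine SDK imports are omitted.

      import org.apache.spark.SparkContext
      import org.apache.spark.sql.types.StructType

      val sc: SparkContext = ???
      val utils = new SparkRuntimeUtils(sc)

      val input: HdfsTabularDataset = ???        // the operator's tabular input (assumed)
      val header: StructType = utils.convertTabularSchemaToSparkSQLSchema(input.tabularSchema)
      // DateTime columns become TimestampType; their original formats are kept
      // as metadata on the corresponding StructField entries.
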
  13. def deleteFilePathIfExists(outputPathStr: String): AnyVal

    Checks whether the given file path already exists (which would cause a 'PathAlreadyExists' exception when we try to write to it) and, if it does, deletes the directory to remove the existing results at that path.

    outputPathStr

    - the full HDFS path

  14. def deleteOrFailIfExists[T <: HdfsStorageFormatType](path: String, overwrite: Boolean): Unit

    Annotations
    @throws( ... )
  15. lazy val driverHdfs: FileSystem

  16. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  17. def equals(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  18. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  19. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  20. def getDataFrame(dataset: HiveTable): DataFrame

    For use with Hive. Returns a Spark DataFrame given a Hive table.

  21. def getDataFrame(dataset: HdfsTabularDataset): DataFrame

    Returns a DataFrame from an Alpine HdfsTabularDataset. The DataFrame's schema will correspond to the column header of the Alpine dataset. Uses the Databricks CSV parser from spark-csv with the following options:

    1. withParseMode("DROPMALFORMED"): catch parse errors, such as a number format exception caused by a string value in a numeric column, and drop those rows rather than fail.
    2. withTreatEmptyValuesAsNulls(true): the empty string represents a null value in char columns, as it does in Alpine.
    3. If the dataset is a CSV, the delimiter attributes specified by the CSV attributes object.

    Date format behavior: DateTime columns are parsed as dates and then converted to TimestampType according to the format specified by the Alpine 'ColumnType' format argument. The original format is saved in the schema as metadata for that column. It can be accessed with SparkSqlDateTimeUtils.getDatFormatInfo(structField) for any given column.

    dataset

    Alpine specific object. Usually input or output of operator.

    returns

    Spark SQL DataFrame
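
    A typical usage sketch inside an operator's Spark job; the input dataset is a placeholder for the operator's HdfsTabularDataset input, and Alpine SDK imports are omitted.

      import org.apache.spark.SparkContext
      import org.apache.spark.sql.DataFrame

      val sc: SparkContext = ???                 // provided by the Spark runtime
      val utils = new SparkRuntimeUtils(sc)

      val input: HdfsTabularDataset = ???        // the operator's input dataset (assumed)
      val df: DataFrame = utils.getDataFrame(input)
      df.printSchema()                           // DateTime columns appear as timestamps, with
                                                 // their original formats attached as metadata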

  22. def getDataFrameGeneral(dataset: TabularDataset): DataFrame

    Returns the DataFrame corresponding to a given tabular dataset (which can be a Hive table or an HDFS path).

    dataset

    The dataset containing a reference to the data on disk.

    returns

    A DataFrame representation of the dataset.

  23. def getDataFrameHiveContext(dataset: HdfsTabularDataset): DataFrame

    Returns a DataFrame from an Alpine HdfsTabularDataset, created using a HiveContext (vs. the SQLContext used in getDataFrame). Using a HiveContext allows window function calls on the DataFrame and access to Hive UDFs. It does not require a prior Hive setup, but creating a HiveContext pulls in large dependencies, so this method should not be used if there is no need to leverage Hive UDFs.

    (NOTE: loading the Hive dependencies might require modifying "spark.driver.extraJavaOptions" to increase MaxPermSize and enable CMSClassUnloadingEnabled and UseConcMarkSweepGC. These parameters can be added in the "Advanced Spark Settings" dialog window of the operator.)

    dataset

    Alpine specific object. Usually input or output of operator.

    returns

    Spark SQL DataFrame that allows window function calls and Hive UDFs access.
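
    A sketch of the window-function use case that motivates the HiveContext variant. Column names are illustrative placeholders, and Alpine SDK imports are omitted.

      import org.apache.spark.SparkContext
      import org.apache.spark.sql.expressions.Window
      import org.apache.spark.sql.functions.rank

      val sc: SparkContext = ???
      val utils = new SparkRuntimeUtils(sc)
      val input: HdfsTabularDataset = ???        // the operator's input dataset (assumed)

      val df = utils.getDataFrameHiveContext(input)
      val w = Window.partitionBy("group_col").orderBy("value_col")
      val ranked = df.withColumn("rank", rank().over(w))   // window functions work on this DataFrame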

  24. def getDateMap(tabularSchema: TabularSchema): Map[String, String]

    Definition Classes
    SparkSchemaUtils
  25. def getSparkDataset(dataset: TabularDataset): Dataset[Row]

  26. def hashCode(): Int

    Definition Classes
    AnyRef → Any
  27. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  28. def mapDFtoCustomDateTimeFormat(dataFrame: DataFrame, map: Map[String, String]): DataFrame

    JAVA TIMESTAMP OBJECT --> STRING (via Joda). Takes in a DataFrame and a map from column names to the date formats we want to print. Converts to a DataFrame in which the date columns are strings formatted according to the map. Because the formats are Joda formats, we have to round-trip via Unix time: a UDF converts each column of Sql.Timestamp type to Unix time, and Joda then formats the Unix timestamp according to that column's format string from the map.

    dataFrame

    input data where the date columns are represented as Java Timestamp objects

    map

    columnName -> dateFormat to convert to
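
    A sketch of formatting timestamp columns back to strings; the column names and patterns are examples only, and Alpine SDK imports are omitted.

      import org.apache.spark.SparkContext
      import org.apache.spark.sql.DataFrame

      val sc: SparkContext = ???
      val utils = new SparkRuntimeUtils(sc)
      val df: DataFrame = ???                    // date columns held as java.sql.Timestamp

      val formats = Map(
        "order_date"   -> "dd/MM/yyyy",          // Joda patterns, one per column to format
        "shipped_time" -> "yyyy-MM-dd HH:mm:ss"
      )
      val withStringDates = utils.mapDFtoCustomDateTimeFormat(df, formats)
      // The two columns are now strings in the requested formats; other columns are untouched.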

  29. def mapDFtoUnixDateTime(dataFrame: DataFrame, map: Map[String, String]): DataFrame

    STRING --> JAVA TIMESTAMP OBJECT (based on Unix timestamps). Takes in a DataFrame and a map from column names to the date formats they contain, and changes the columns with string dates into Java Timestamp objects by parsing them according to the format defined in the map. Because Alpine uses Joda date formats, we use Joda to parse the date strings into Unix timestamps and then convert the Unix timestamps to java.sql objects. Preserves the original column names. Columns which were originally DateTime columns will now be of TimestampType rather than StringType.

    dataFrame

    the input DataFrame, where the date columns are represented as strings.

    map

    columnName -> dateFormat for parsing

    Exceptions thrown
    Exception

    "Illegal Date Format" if one of the date formats provided is not a valid Java SimpleDateFormat pattern. And "Could not parse dates correctly. " if the date format is valid, but doesn't correspond to the data that is actually in the column.

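    A sketch of the string-to-timestamp direction; the column name and pattern are examples only, and Alpine SDK imports are omitted.

      import org.apache.spark.SparkContext
      import org.apache.spark.sql.DataFrame

      val sc: SparkContext = ???
      val utils = new SparkRuntimeUtils(sc)
      val rawDf: DataFrame = ???                 // date columns held as strings

      val parseFormats = Map("order_date" -> "dd/MM/yyyy")
      val withTimestamps = utils.mapDFtoUnixDateTime(rawDf, parseFormats)
      // "order_date" is now TimestampType; a bad pattern or mismatched data raises
      // the exceptions listed above.
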
  30. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  31. final def notify(): Unit

    Definition Classes
    AnyRef
  32. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  33. def saveAsAvro(path: String, dataFrame: DataFrame, sourceOperatorInfo: Option[OperatorInfo], addendum: Map[String, AnyRef] = Map[String, AnyRef]()): HdfsAvroDataset

    Write a DataFrame as an HDFSAvro dataset, and return an instance of the Alpine HDFSAvroDataset type, which contains the 'TabularSchema' definition (created by converting the DataFrame schema) and the path to the saved data.

  34. def saveAsCSV(path: String, dataFrame: DataFrame, tSVAttributes: TSVAttributes, sourceOperatorInfo: Option[OperatorInfo], addendum: Map[String, AnyRef] = Map[String, AnyRef]()): HdfsDelimitedTabularDatasetDefault

    More general version of saveAsCSV. Write a DataFrame to HDFS as a tabular delimited file, and return an instance of the Alpine HDFSDelimitedTabularDataset type, which contains the Alpine 'TabularSchema' definition (created by converting the DataFrame schema) and the path to the saved data. Also writes the ".alpine_metadata" file to the result directory so that the user can drag and drop the result output and use it without configuring the dataset.

    path

    where file will be written (this function will create a directory of part files)

    dataFrame

    - data to write

    tSVAttributes

    - an object which specifies how the file should be written

    sourceOperatorInfo

    from parameters. Includes name and UUID.

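    A sketch of writing an operator's result. The output path, TSVAttributes value, and OperatorInfo are placeholders supplied by the surrounding operator code; Alpine SDK imports are omitted.

      import org.apache.spark.SparkContext
      import org.apache.spark.sql.DataFrame

      val sc: SparkContext = ???
      val utils = new SparkRuntimeUtils(sc)
      val df: DataFrame = ???                    // the result to persist
      val attrs: TSVAttributes = ???             // e.g. comma-delimited attributes
      val operatorInfo: OperatorInfo = ???       // from the operator parameters
      val outputPath: String = ???               // HDFS path; a directory of part files is created here

      val output = utils.saveAsCSV(
        path = outputPath,
        dataFrame = df,
        tSVAttributes = attrs,
        sourceOperatorInfo = Some(operatorInfo)
      )
      // `output` carries the TabularSchema and the path; ".alpine_metadata" is
      // written alongside the part files.
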
  35. def saveAsCSVoMetadata(path: String, dataFrame: DataFrame, tSVAttributes: TSVAttributes, sourceOperatorInfo: Option[OperatorInfo], addendum: Map[String, AnyRef] = Map[String, AnyRef]()): HdfsDelimitedTabularDatasetDefault

    More general version of saveAsTSV. Write a DataFrame to HDFS as a tabular delimited file, and return an instance of the Alpine HDFSDelimitedTabularDataset type, which contains the Alpine 'TabularSchema' definition (created by converting the DataFrame schema) and the path to the saved data.

    path

    where file will be written (this function will create a directory of part files)

    dataFrame

    - data to write

    tSVAttributes

    - an object which specifies how the file should be written

    sourceOperatorInfo

    from parameters. Includes name and UUID

  36. def saveAsParquet(path: String, dataFrame: DataFrame, sourceOperatorInfo: Option[OperatorInfo], addendum: Map[String, AnyRef] = Map[String, AnyRef]()): HdfsParquetDataset

    Write a DataFrame to HDFS as a Parquet file, and return an instance of the HDFSParquet IO base type which contains the Alpine 'TabularSchema' definition (created by converting the DataFrame schema) and the path to the saved data.

  37. def saveAsTSV(path: String, dataFrame: DataFrame, sourceOperatorInfo: Option[OperatorInfo], addendum: Map[String, AnyRef] = Map[String, AnyRef]()): HdfsDelimitedTabularDataset

    Write a DataFrame to HDFS as a tab-delimited file, and return an instance of the Alpine HDFSDelimitedTabularDataset type, which contains the Alpine 'TabularSchema' definition (created by converting the DataFrame schema) and the path to the saved data. Uses the default TSVAttributes object, which specifies that the data be written as a tab-delimited file. See TSVAttributes for more information, and use saveAsCSV to customize CSV options such as the null string and delimiters.

    Also writes the ".alpine_metadata" file to the result directory so that the user can drag and drop the result output and use it without configuring the dataset.

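    A short sketch, assuming the same placeholders (sc, df, outputPath, operatorInfo) as in the saveAsCSV example above.

      val utils = new SparkRuntimeUtils(sc)
      val tsvOutput = utils.saveAsTSV(outputPath, df, Some(operatorInfo))
      // Tab-delimited part files plus ".alpine_metadata" are written under outputPath.
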
  38. def saveDataFrame[T <: HdfsStorageFormatType](path: String, dataFrame: DataFrame, storageFormat: T, overwrite: Boolean, sourceOperatorInfo: Option[OperatorInfo], addendum: Map[String, AnyRef], tSVAttributes: TSVAttributes): HdfsTabularDataset

    Save a data frame to a path using the given storage format, and return a corresponding HdfsTabularDataset object that points to the path.

    path

    The path to which we'll save the data frame.

    dataFrame

    The data frame that we want to save.

    storageFormat

    The format that we want to store in. Defaults to CSV

    overwrite

    Whether to overwrite any existing file at the path.

    sourceOperatorInfo

    Mandatory source operator information to be included in the output object. Note: Only set to None in testing. This information is required for Touchpoints and if this is an output to an operator with multiple inputs.

    addendum

    Mandatory addendum information to be included in the output object.

    returns

    After saving the data frame, returns an HdfsTabularDataset object.
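
    A sketch of the generic save path. The HdfsStorageFormatType value and the other inputs are placeholders, and Alpine SDK imports are omitted.

      import org.apache.spark.SparkContext
      import org.apache.spark.sql.DataFrame

      val sc: SparkContext = ???
      val utils = new SparkRuntimeUtils(sc)
      val df: DataFrame = ???
      val format: HdfsStorageFormatType = ???    // e.g. a CSV or Parquet format value (assumed)
      val operatorInfo: OperatorInfo = ???
      val attrs: TSVAttributes = ???             // only relevant for delimited output
      val outputPath: String = ???

      val saved = utils.saveDataFrame(
        path = outputPath,
        dataFrame = df,
        storageFormat = format,
        overwrite = true,                        // clear any existing results first
        sourceOperatorInfo = Some(operatorInfo),
        addendum = Map[String, AnyRef](),
        tSVAttributes = attrs
      )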

  39. def saveDataFrameDefault[T <: HdfsStorageFormatType](path: String, dataFrame: DataFrame, sourceOperatorInfo: Option[OperatorInfo]): HdfsTabularDataset

    Calls the above method, but defaults to saving as a comma-separated file with overwrite = true and no addendum.

    path

    The path to which we'll save the data frame.

    dataFrame

    The data frame that we want to save.

    sourceOperatorInfo

    Mandatory source operator information to be included in the output object. Note: Only set to None in testing. This information is required for Touchpoints and if this is an output to an operator with multiple inputs.

    returns

    the HdfsTabularData object corresponding to the output
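
    The simplest save path, using the defaults described above and the same placeholders (sc, df, outputPath, operatorInfo) as in the earlier save examples.

      val utils = new SparkRuntimeUtils(sc)
      val saved = utils.saveDataFrameDefault(outputPath, df, Some(operatorInfo))
      // Comma-separated output, overwrite = true, no addendum.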

  40. def saveDatasetAsParquet(path: String, dataset: Dataset[Row], addendum: Map[String, AnyRef] = Map[String, AnyRef]()): HdfsParquetDataset

  41. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  42. def toString(): String

    Definition Classes
    AnyRef → Any
  43. def toStructField(columnDef: ColumnDef, nullable: Boolean = true): StructField

    Definition Classes
    SparkSchemaUtils
  44. def validateDateFormatMap(map: Map[String, String]): Unit

  45. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  46. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  47. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )

Deprecated Value Members

  1. def convertColumnTypeToSparkSQLDataType(columnType: TypeValue): DataType

    Definition Classes
    SparkSchemaUtils
    Annotations
    @deprecated
    Deprecated

    Will not properly handle date formats. Use toStructField instead.

  2. def convertSparkSQLDataTypeToColumnType(dataType: DataType): TypeValue

    Converts from a Spark SQL data type to an Alpine-specific ColumnType.

    Definition Classes
    SparkSchemaUtils
    Annotations
    @deprecated
    Deprecated

    This doesn't properly handle date formats. Use convertColumnTypeToSparkSQLDataType instead

  3. def saveDataFrame(path: String, dataFrame: DataFrame, storageFormat: HdfsStorageFormat, overwrite: Boolean, sourceOperatorInfo: Option[OperatorInfo], addendum: Map[String, AnyRef] = Map[String, AnyRef]()): HdfsTabularDataset

    Save a data frame to a path using the given storage format, and return a corresponding HdfsTabularDataset object that points to the path.

    path

    The path to which we'll save the data frame.

    dataFrame

    The data frame that we want to save.

    storageFormat

    The format that we want to store in.

    overwrite

    Whether to overwrite any existing file at the path.

    sourceOperatorInfo

    Mandatory source operator information to be included in the output object. Only set to None in testing, this is required for Touchpoints and if this is an output to an operator with multiple inputs.

    addendum

    Mandatory addendum information to be included in the output object.

    returns

    After saving the data frame, returns an HdfsTabularDataset object.

    Annotations
    @deprecated
    Deprecated

    Use the signature with HdfsStorageFormatType rather than the HdfsStorageFormat enum, or use saveDataFrameDefault.
