Package com.alpine.plugin.core.spark.utils

package utils

Type Members

  1. final class FileTooLargeException extends Exception

    Exception thrown when the user tries to read a file whose disk space usage on HDFS (without replication) exceeds the configured limit. If the operator developer reads the file size limit parameter (fileSizeLimitMB) from SparkExecutionContext.config, the exception message should indicate to the end user that this threshold is configurable in the "custom_operator" section of the Alpine configuration file.
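
    A minimal sketch of surfacing this limit to the end user; the stand-in value and the call being wrapped are assumptions, not the verbatim SDK API:

        // Hypothetical sketch: fileSizeLimitMB is the parameter named above; how it
        // is read from SparkExecutionContext.config is assumed, so a stand-in is used.
        val fileSizeLimitMB: Int = 1000 // stand-in for the value read from the config

        try {
          // call one of the HDFSFileReaderUtils read methods here
        } catch {
          case e: FileTooLargeException =>
            throw new RuntimeException(
              s"Input file exceeds the ${fileSizeLimitMB}MB limit. This threshold " +
                "(fileSizeLimitMB) is configurable in the \"custom_operator\" " +
                "section of the Alpine configuration file.", e)
        }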

  2. class HDFSFileReaderUtils extends AnyRef

    Class for reading HDFS files locally on the Alpine Server. Use either:

    - the general method openFileGeneralAndProcessLines, which processes rows of data in a single part file of an HdfsTabularDataset input (of any storage format)
    - or the method for reading a single HDFS file in the relevant storage format (CSV, Avro, Parquet).

    Compressed files are supported. A maximum size limit can be set for the input file (disk space usage without cluster replication, default 1000MB) to avoid memory issues when the file is very large.

    For each storage format there is:

    - a generic method that takes as argument a function applied to the file reader (e.g. openCSVFileAndDoAction, expecting the function argument action: InputStreamReader => Unit)
    - a method that loops through the rows of data in the file, taking as argument a function applied to an Iterator[RowWithIndex] (e.g. openCSVFileAndProcessLines, expecting the function argument resultHandler: Iterator[RowWithIndex] => Unit); see the sketch after this description.

    The second kind of method checks Java memory usage while processing the lines and stops the process if usage exceeds the limit (default 90%).
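
    A minimal sketch of the row-processing style. The handler type Iterator[RowWithIndex] => Unit comes from the description above; the receiver instance and the exact openCSVFileAndProcessLines signature are assumptions:

        // Count rows that have at least one non-empty cell.
        val countNonEmpty: Iterator[RowWithIndex] => Unit = { rows =>
          var count = 0L
          rows.foreach { row =>
            // row.values holds the string cells; row.rowNum is the 0-based row number
            if (row.values.exists(_.nonEmpty)) count += 1
          }
          println(s"Non-empty rows: $count")
        }

        // hdfsFileReaderUtils.openCSVFileAndProcessLines(..., resultHandler = countNonEmpty)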

  3. case class HadoopMetadata(columnNames: List[String], columnTypes: List[String], delimiter: String, escape: String, quote: String, isFirstLineHeader: Boolean = false, totalNumberOfRows: Long = 1) extends Product with Serializable

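    A hedged construction example for the signature above; the column-type strings and delimiter values are illustrative only:

        val meta = HadoopMetadata(
          columnNames = List("id", "name"),
          columnTypes = List("long", "string"), // the type-name strings are assumptions
          delimiter = ",",
          escape = "\\",
          quote = "\"",
          isFirstLineHeader = true,
          totalNumberOfRows = 100L
        )
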
  4. case class RowWithIndex(values: Array[String], rowNum: Int) extends Product with Serializable

    Class representing a row of data.

    values
        the string values of the row of data
    rowNum
        the row number in the part file (starts at 0 for the first row)
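
    A small usage example (the values are illustrative):

        val row = RowWithIndex(Array("42", "alice"), rowNum = 0)
        val line = row.values.mkString("|")  // "42|alice"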

  5. class SparkRuntimeUtils extends SparkSchemaUtils

    :: AlpineSdkApi ::

    Annotations
    @AlpineSdkApi()
  6. trait SparkSchemaUtils extends AnyRef

    Created by rachelwarren on 4/4/16.

Value Members

  1. object BadDataReportingUtils

  2. object DateTimeUdfs extends Serializable

  3. object HDFSFileFilter extends PathFilter

    Utils to retrieve all relevant part files for analysis in an HDFS input directory.

  4. object HDFSFileReaderUtils

    Created by emiliedelongueau on 7/26/17.

  5. object HadoopDataType

    Created by rachelwarren on 5/23/16.

  6. object HdfsCSVFileCompressUtils

    Utils to open and uncompress a CSV (or delimited) file as an InputStream. Supported compressions: none, GZIP, and Deflate. Snappy compression is not supported.
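
    As context, a sketch of the usual Hadoop pattern for this task, using CompressionCodecFactory; this is a common technique, not necessarily this object's internal implementation:

        import java.io.{InputStream, InputStreamReader}
        import java.nio.charset.StandardCharsets
        import org.apache.hadoop.conf.Configuration
        import org.apache.hadoop.fs.{FileSystem, Path}
        import org.apache.hadoop.io.compress.CompressionCodecFactory

        // Open an HDFS file and wrap it in a decompressing stream when the
        // path extension (.gz, .deflate) maps to a known codec.
        def openMaybeCompressed(fs: FileSystem, path: Path, conf: Configuration): InputStreamReader = {
          val raw: InputStream = fs.open(path)
          val codec = new CompressionCodecFactory(conf).getCodec(path) // null if uncompressed
          val in = if (codec == null) raw else codec.createInputStream(raw)
          new InputStreamReader(in, StandardCharsets.UTF_8)
        }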

  7. object MLlibUtils

    Helper functions that are useful for working with MLlib.

  8. object SQLContextSingleton

    Utility functions to create a singleton SQLContext instance, or to get the last created SQLContext instance.
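
    A sketch of the standard singleton pattern such an object typically wraps; the object and method names here are illustrative, not this object's documented API:

        import org.apache.spark.SparkContext
        import org.apache.spark.sql.SQLContext

        object SQLContextSingletonSketch {
          @transient private var instance: SQLContext = _
          // Lazily create one SQLContext per JVM and reuse it afterwards.
          def getOrCreate(sc: SparkContext): SQLContext = synchronized {
            if (instance == null) instance = new SQLContext(sc)
            instance
          }
        }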

  9. object SparkContextSingleton

    The SparkContext object contains a number of implicit conversions and parameters for use with various Spark features.

  10. object SparkMetadataWriter

  11. object SparkRuntimeUtils

  12. object SparkSchemaUtils extends SparkSchemaUtils with Product with Serializable

  13. object SparkSqlDateTimeUtils
