Exception thrown if the user tries to read a file whose disk space usage on HDFS (w/o replication) is above the configured limit.
Class for reading HDFS files locally on the Alpine Server.
Use either:
- the general method openFileGeneralAndProcessLines, which processes the rows of data in a single part file of an HdfsTabularDataset input (of any storage format);
- or the method specific to the file's storage format (CSV, Avro, Parquet) for reading a single HDFS file. Compressed files are supported.
A maximum size limit can be set for the input file (disk space usage w/o cluster replication, default 1000 MB) to avoid memory issues when the file is very large.
For each storage format there are two methods:
- a generic method that takes as argument a function applied to the file reader (e.g. openCSVFileAndDoAction, expecting action: InputStreamReader => Unit);
- a method that loops through the rows of data in the file and takes as argument a function applied to an Iterator[RowWithIndex] (e.g. openCSVFileAndProcessLines, expecting resultHandler: Iterator[RowWithIndex] => Unit). This second method checks Java memory usage while processing the lines and stops the process if usage exceeds the limit (default 90%). A usage sketch follows.
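For illustration, here is a minimal usage sketch of the row-processing method. Only the resultHandler signature (Iterator[RowWithIndex] => Unit) comes from the description above; the trait name, parameter list, and helper are assumptions made for the example.

  // Stand-in trait mirroring only the method described above; the real SDK
  // reader class has more methods and may take additional parameters.
  trait LocalHdfsCsvReader {
    def openCSVFileAndProcessLines(path: String,
                                   resultHandler: Iterator[RowWithIndex] => Unit): Unit
  }

  // Count the rows of a single part file while streaming, so the whole file
  // is never held in memory at once.
  def countRows(reader: LocalHdfsCsvReader, path: String): Long = {
    var count = 0L
    reader.openCSVFileAndProcessLines(path, rows => rows.foreach(_ => count += 1))
    count
  }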
Class representing a row of data
string values of the row of data
row number in the part file (starts at 0 for the 1st row)
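A minimal sketch of what such a row container could look like; the field names are assumptions, and the actual SDK class may differ.

  case class RowWithIndex(values: Seq[String], // string values of the row of data
                          index: Long)         // row number in the part file, starting at 0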
:: AlpineSdkApi ::
Created by rachelwarren on 4/4/16.
Utils to retrieve all relevant part files for analysis in an HDFS input directory.
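As an illustration only (the SDK utility's actual names and filtering rules are not shown here), the part files of an HDFS output directory can be listed with the Hadoop FileSystem API:

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

  // List the data part files ("part-*") of an HDFS output directory,
  // skipping _SUCCESS and other metadata files.
  def listPartFiles(dir: String, conf: Configuration = new Configuration()): Seq[FileStatus] = {
    val fs = FileSystem.get(conf)
    fs.listStatus(new Path(dir))
      .filter(status => status.isFile && status.getPath.getName.startsWith("part-"))
      .toSeq
  }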
Created by emiliedelongueau on 7/26/17.
Created by rachelwarren on 5/23/16.
Utils to open and uncompress a CSV (or delimited) file as an InputStream. Supported compressions: none, GZIP, and Deflate; Snappy compression is not supported.
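A hedged sketch of the underlying idea, choosing the wrapper stream from the file extension; the real utility's method names and compression-detection logic may differ.

  import java.io.{BufferedInputStream, InputStream, InputStreamReader}
  import java.util.zip.{GZIPInputStream, InflaterInputStream}
  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FileSystem, Path}

  // Open an HDFS file and uncompress it based on its extension
  // (illustrative only; Snappy is deliberately not handled).
  def openDelimitedFile(path: String, conf: Configuration = new Configuration()): InputStreamReader = {
    val raw: InputStream = new BufferedInputStream(FileSystem.get(conf).open(new Path(path)))
    val lower = path.toLowerCase
    val uncompressed: InputStream =
      if (lower.endsWith(".gz")) new GZIPInputStream(raw)             // GZIP
      else if (lower.endsWith(".deflate")) new InflaterInputStream(raw) // Deflate
      else raw                                                          // no compression
    new InputStreamReader(uncompressed, "UTF-8")
  }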
Helper functions that are useful for working with MLlib.
This SQLContext object contains utility functions to create a singleton SQLContext instance, or to get the last created SQLContext instance.
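The idea resembles the lazily instantiated singleton pattern used in Spark examples; the object and method names below are illustrative, not the SDK's actual API.

  import org.apache.spark.SparkContext
  import org.apache.spark.sql.SQLContext

  // Create the SQLContext once per JVM and reuse it on subsequent calls.
  object SQLContextSingleton {
    @transient private var instance: SQLContext = _

    def getOrCreate(sc: SparkContext): SQLContext = synchronized {
      if (instance == null) instance = new SQLContext(sc)
      instance
    }
  }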
The SparkContext object contains a number of implicit conversions and parameters for use with various Spark features.
Exception thrown if the user tries to read a file whose disk space usage on HDFS (w/o replication) is above the configured limit. If the operator developer gets the file size limit parameter (fileSizeLimitMB) from the SparkExecutionContext config, the message should indicate to the end user that this threshold is configurable in the "custom_operator" section of the Alpine configuration file.
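For example, such an exception could carry a message along the following lines (the class name and exact wording are illustrative; fileSizeLimitMB and the "custom_operator" section come from the description above).

  // Illustrative only: the SDK's actual exception class and message differ in detail.
  class FileTooLargeException(fileSizeMB: Long, fileSizeLimitMB: Long)
    extends Exception(
      s"The file uses $fileSizeMB MB on HDFS (without replication), above the configured limit of " +
      s"$fileSizeLimitMB MB. This limit (fileSizeLimitMB) can be changed in the \"custom_operator\" " +
      "section of the Alpine configuration file."
    )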