Class

com.alpine.plugin.core.spark.utils

HDFSFileReaderUtils

Related Docs: object HDFSFileReaderUtils | package utils

class HDFSFileReaderUtils extends AnyRef

Class for reading HDFS files locally on the Alpine server. Use either:

- the general method openFileGeneralAndProcessLines, which processes the rows of data in a single part file of an HdfsTabularDataset input (of any storage format), or
- the method specific to the storage format of the file to read (CSV, Avro, Parquet).

Compressed files are supported. A maximum size limit for the input file (disk space usage without cluster replication) can be set to avoid memory issues when the file is very large; the default is 1000 MB.

For each storage format there are two methods:

- a generic method that takes a function applied to the file reader (e.g. openCSVFileAndDoAction, which expects the function argument action: BufferedReader => Unit), and
- a method that loops through the rows of data in the file and takes a function applied to an Iterator[RowWithIndex] (e.g. openCSVFileAndProcessLines, which expects the function argument resultHandler: Iterator[RowWithIndex] => Unit).

The second kind of method checks Java memory usage while processing the rows and stops the process if usage exceeds the limit (default 90%).
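
As an orientation, the sketch below shows how the class might be used from an operator's runtime code: it builds the reader utility from the SparkExecutionContext and counts the rows of one part file with the format-agnostic openFileGeneralAndProcessLines. The helper name countRowsOfPartFile and the elided Alpine SDK imports are illustrative assumptions, not part of this API.

    import org.apache.hadoop.fs.Path
    import com.alpine.plugin.core.spark.utils.HDFSFileReaderUtils
    // Imports for SparkExecutionContext, HdfsTabularDataset and RowWithIndex are omitted;
    // their package paths depend on your Alpine SDK version.

    // Illustrative helper: count the rows of one part file of a tabular dataset,
    // whatever its storage format (CSV, Avro or Parquet).
    def countRowsOfPartFile(context: SparkExecutionContext,
                            dataset: HdfsTabularDataset,
                            partFilePath: Path): Long = {
      val readerUtils = new HDFSFileReaderUtils(context)
      var rowCount = 0L
      readerUtils.openFileGeneralAndProcessLines(
        partFilePath,
        dataset,
        rows => rows.foreach(_ => rowCount += 1) // Iterator[RowWithIndex] => Unit
        // fileSizeLimitMB is left at its default (1000 MB per the class description)
      )
      rowCount
    }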

Linear Supertypes
AnyRef, Any

Instance Constructors

  1. new HDFSFileReaderUtils(context: SparkExecutionContext)


Value Members

  1. final def !=(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  2. final def ##(): Int

    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  4. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  5. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  6. val defaultFileSizeLimitMB: Double

  7. lazy val defaultFileSizeLimitMBFromConfig: Double

  8. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  9. def equals(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  10. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  11. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  12. def hashCode(): Int

    Definition Classes
    AnyRef → Any
  13. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  14. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  15. final def notify(): Unit

    Definition Classes
    AnyRef
  16. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  17. def openAvroFileAndDoAction(hdfsPath: Path, action: (FileReader[GenericRecord], Schema) ⇒ Unit, fileSizeLimitMB: Option[Double] = ...): Unit

    Generic method to process a single HDFS file in Avro storage format. Compression types supported: No Compression, Deflate, Snappy.

    hdfsPath
    HDFS path of the input file.

    action
    Function ((reader: FileReader[GenericRecord], schema: Schema) => Unit) to apply, which defines how to process the file.

    fileSizeLimitMB
    Optional file size limit in MB (disk space usage without cluster replication).

    Annotations
    @throws( classOf[FileTooLargeException] )
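
    Example (illustrative sketch): print the Avro schema and the first few records of a file. The helper name previewAvro and its logic are assumptions for illustration; only openAvroFileAndDoAction and its parameters come from this page.

      import org.apache.avro.Schema
      import org.apache.avro.file.FileReader
      import org.apache.avro.generic.GenericRecord
      import org.apache.hadoop.fs.Path

      def previewAvro(readerUtils: HDFSFileReaderUtils, path: Path): Unit = {
        readerUtils.openAvroFileAndDoAction(
          path,
          (reader: FileReader[GenericRecord], schema: Schema) => {
            println(schema)                        // Avro schema of the file
            var shown = 0
            while (reader.hasNext && shown < 5) {  // FileReader iterates over GenericRecords
              println(reader.next())
              shown += 1
            }
          },
          fileSizeLimitMB = Some(500.0)            // fail with FileTooLargeException above 500 MB
        )
      }
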
  18. def openAvroFileAndProcessLines(hdfsPath: Path, resultHandler: (Iterator[RowWithIndex]) ⇒ Unit, fileSizeLimitMB: Option[Double] = ...): Unit

    Method to process a single HDFS file in Avro storage format by looping through the rows of data and applying a user-specified function on each row. Compression types supported: No Compression, Deflate, Snappy.

    hdfsPath
    HDFS path of the input file.

    resultHandler
    Function (Iterator[RowWithIndex] => Unit) to apply on the row iterator of the file, which defines how to process it. Note: the rowNum parameter of RowWithIndex starts at 0 for the first iterator value.

    fileSizeLimitMB
    Optional file size limit in MB (disk space usage without cluster replication).
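
    Example (illustrative sketch): count the rows of a single Avro file with the row-iterator variant. The helper name countAvroRows is an assumption; only rowNum is accessed on RowWithIndex, since its other members are not documented on this page.

      import org.apache.hadoop.fs.Path

      def countAvroRows(readerUtils: HDFSFileReaderUtils, path: Path): Long = {
        var count = 0L
        readerUtils.openAvroFileAndProcessLines(
          path,
          rows => rows.foreach(row => count = row.rowNum.toLong + 1) // rowNum is 0-based
        )
        count
      }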

  19. def openCSVFileAndDoAction(hdfsPath: Path, action: (BufferedReader) ⇒ Unit, fileSizeLimitMB: Option[Double] = ...): Unit

    Generic method to process a single HDFS file in CSV storage format. Compression types supported: No Compression, Deflate, GZIP.

    hdfsPath
    HDFS path of the input file.

    action
    Function (BufferedReader => Unit) to apply on the file reader, which defines how to process the file.

    fileSizeLimitMB
    Optional file size limit in MB (disk space usage without cluster replication).

    Annotations
    @throws( classOf[FileTooLargeException] )
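
    Example (illustrative sketch): read only the first line of a possibly compressed CSV file. The helper name readCsvHeader is an assumption.

      import java.io.BufferedReader
      import org.apache.hadoop.fs.Path

      def readCsvHeader(readerUtils: HDFSFileReaderUtils, path: Path): Option[String] = {
        var header: Option[String] = None
        readerUtils.openCSVFileAndDoAction(
          path,
          (reader: BufferedReader) => header = Option(reader.readLine())
        )
        header
      }
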
  20. def openCSVFileAndProcessLines(hdfsPath: Path, tSVAttributes: TSVAttributes, resultHandler: (Iterator[RowWithIndex]) ⇒ Unit, fileSizeLimitMB: Option[Double] = ...): Unit

    Method to process a single HDFS file in CSV storage format by looping through the rows of data and applying a user-specified function on each row. Compression types supported: No Compression, Deflate, GZIP.

    hdfsPath
    HDFS path of the input file.

    tSVAttributes
    TSVAttributes of the delimited input.

    resultHandler
    Function (Iterator[RowWithIndex] => Unit) to apply on the row iterator of the file, which defines how to process it. Note: the rowNum parameter of RowWithIndex starts at 0 for the first iterator value.

    fileSizeLimitMB
    Optional file size limit in MB (disk space usage without cluster replication).
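
    Example (illustrative sketch): count the data rows of a delimited file. The TSVAttributes value is taken as a parameter because its construction is not covered on this page; the helper name countCsvRows is an assumption.

      import org.apache.hadoop.fs.Path

      def countCsvRows(readerUtils: HDFSFileReaderUtils,
                       path: Path,
                       tsvAttributes: TSVAttributes): Long = {
        var count = 0L
        readerUtils.openCSVFileAndProcessLines(
          path,
          tsvAttributes,
          rows => rows.foreach(_ => count += 1),
          fileSizeLimitMB = Some(250.0) // optional override of the default limit
        )
        count
      }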

  21. def openFileGeneralAndProcessLines(partFilePath: Path, parentDataset: HdfsTabularDataset, partFileResultHandler: (Iterator[RowWithIndex]) ⇒ Unit, fileSizeLimitMB: Option[Double] = ...): Unit

    Method to process a single part file in the directory path of an HdfsTabularDataset by applying a user-specified function on the row iterator of the part file. Compression types supported depend on the storage format (see the format-specific methods in this class).

    partFilePath
    Path of the part file to process.

    parentDataset
    Parent HdfsTabularDataset that contains the part file in its directory path.

    partFileResultHandler
    Function (Iterator[RowWithIndex] => Unit) to apply on the row iterator of the part file, which defines how to process it. Note: the rowNum parameter of RowWithIndex starts at 0 for the first iterator value.

    fileSizeLimitMB
    Optional file size limit in MB, checked before opening each part file (disk space usage without cluster replication).
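
    Example (illustrative sketch): collect the first ten rows of one part file for a preview, whatever the storage format of the parent dataset. The helper name previewPartFile is an assumption; the import of RowWithIndex is elided because its package path is not shown on this page.

      import org.apache.hadoop.fs.Path

      def previewPartFile(readerUtils: HDFSFileReaderUtils,
                          partFilePath: Path,
                          dataset: HdfsTabularDataset): Seq[RowWithIndex] = {
        var preview: Seq[RowWithIndex] = Seq.empty
        readerUtils.openFileGeneralAndProcessLines(
          partFilePath,
          dataset,
          rows => preview = rows.take(10).toList // keep only the first ten rows
        )
        preview
      }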

  22. def openParquetFileAndDoAction(hdfsPath: Path, action: (ParquetReader[Group], MessageType) ⇒ Unit, fileSizeLimitMB: Option[Double] = ...): Unit

    Generic method to process a single HDFS file in Parquet storage format. Compression types supported: No Compression, GZIP, Snappy.

    hdfsPath
    HDFS path of the input file.

    action
    Function ((reader: ParquetReader[Group], schema: MessageType) => Unit) to apply, which defines how to process the file.

    fileSizeLimitMB
    Optional file size limit in MB (disk space usage without cluster replication).

    Annotations
    @throws( classOf[FileTooLargeException] )
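
    Example (illustrative sketch): print the Parquet schema and the first few records. The helper name previewParquet is an assumption; only openParquetFileAndDoAction and its parameters come from this page.

      import org.apache.hadoop.fs.Path
      import org.apache.parquet.example.data.Group
      import org.apache.parquet.hadoop.ParquetReader
      import org.apache.parquet.schema.MessageType

      def previewParquet(readerUtils: HDFSFileReaderUtils, path: Path): Unit = {
        readerUtils.openParquetFileAndDoAction(
          path,
          (reader: ParquetReader[Group], schema: MessageType) => {
            println(schema)              // Parquet message type (column definitions)
            var group = reader.read()    // read() returns null once the file is exhausted
            var shown = 0
            while (group != null && shown < 5) {
              println(group)
              shown += 1
              group = reader.read()
            }
          }
        )
      }
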
  23. def openParquetFileAndProcessLines(hdfsPath: Path, resultHandler: (Iterator[RowWithIndex]) ⇒ Unit, fileSizeLimitMB: Option[Double] = ...): Unit

    Method to process a single HDFS file in Parquet storage format by looping through the rows of data and applying a user-specified function on each row. Compression types supported: No Compression, GZIP, Snappy.

    hdfsPath
    HDFS path of the input file.

    resultHandler
    Function (Iterator[RowWithIndex] => Unit) to apply on the row iterator of the file, which defines how to process it. Note: the rowNum parameter of RowWithIndex starts at 0 for the first iterator value.

    fileSizeLimitMB
    Optional file size limit in MB (disk space usage without cluster replication).
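
    Example (illustrative sketch): find the 0-based index of the last row of a Parquet file. The helper name lastParquetRowIndex is an assumption; only rowNum is accessed on RowWithIndex.

      import org.apache.hadoop.fs.Path

      def lastParquetRowIndex(readerUtils: HDFSFileReaderUtils, path: Path): Option[Long] = {
        var last: Option[Long] = None
        readerUtils.openParquetFileAndProcessLines(
          path,
          rows => rows.foreach(row => last = Some(row.rowNum.toLong))
        )
        last
      }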

  24. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  25. def toString(): String

    Definition Classes
    AnyRef → Any
  26. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  27. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  28. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
