com.alpine.plugin.core.spark.utils
Generic method to process a single HDFS file in Avro storage format. Compression types supported: No Compression, Deflate, Snappy.
HDFS path of the input file
function((reader: FileReader[GenericRecord], schema: Schema) => Unit) to apply, which defines how to process the file.
Optional file size limit (disk space usage w/o replication)
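As a rough illustration of the function argument for the generic Avro method above, a minimal action might iterate the records and inspect the schema. The sketch below uses only the standard Avro API; the call into the utility method itself is not shown, since its exact signature is not reproduced on this page.

```scala
import org.apache.avro.Schema
import org.apache.avro.file.FileReader
import org.apache.avro.generic.GenericRecord

// Action of the shape (FileReader[GenericRecord], Schema) => Unit expected by the generic Avro method.
val avroAction: (FileReader[GenericRecord], Schema) => Unit = { (reader, schema) =>
  println(s"Avro fields: ${schema.getFields}")
  while (reader.hasNext) {
    val record: GenericRecord = reader.next()
    // inspect one record here, e.g. record.get("someColumn")
  }
}
```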
Method to process a single HDFS file in Avro storage format by looping through the lines of data and applying a user-specified function on each row. Compression types supported: No Compression, Deflate, Snappy.
HDFS path of the input file
function(Iterator[RowWithIndex] => Unit) to apply on the row iterator of the part file, which defines how to process it. Note: the rowNum parameter of RowWithIndex starts at 0 for the first iterator value.
Optional file size limit (disk space usage w/o replication)
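For example, a row handler of the required shape could print the first few rows with their indices. Only the rowNum accessor mentioned above is relied on; the import path of RowWithIndex is an assumption.

```scala
import com.alpine.plugin.core.spark.utils.RowWithIndex // assumed package for RowWithIndex

// Handler of the shape Iterator[RowWithIndex] => Unit; rowNum starts at 0 for the first value.
val previewRows: Iterator[RowWithIndex] => Unit = { rows =>
  rows.take(5).foreach { row =>
    println(s"row #${row.rowNum}: $row")
  }
}
```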
Generic method to process a single HDFS file in CSV storage format. Compression types supported: No Compression, Deflate, GZIP.
HDFS path of the input file
function(InputStreamReader => Unit) to apply on the InputStreamReader, which defines how to process the file.
Optional file size limit (disk space usage w/o replication)
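A sketch of an InputStreamReader action for the generic CSV method above, reading the file line by line; the comma delimiter and the assumption that the utility manages the life cycle of the underlying stream are not confirmed by this page.

```scala
import java.io.{BufferedReader, InputStreamReader}

// Action of the shape InputStreamReader => Unit expected by the generic CSV method.
// The underlying stream is assumed to be opened and closed by the utility.
val csvAction: InputStreamReader => Unit = { streamReader =>
  val buffered = new BufferedReader(streamReader)
  Iterator
    .continually(buffered.readLine())
    .takeWhile(_ != null)
    .foreach { line =>
      val fields = line.split(',') // assumes a comma-delimited file
      // process the fields of one line here
    }
}
```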
Method to process a single HDFS file in CSV storage format by looping through the lines of data and applying a user-specified function on each row. Compression types supported: No Compression, Deflate, GZIP.
HDFS path of the input file
TSVAttributes of the delimited input.
function(Iterator[RowWithIndex] => Unit) to apply on the row iterator of the part file, which defines how to process it. Note: the rowNum parameter of RowWithIndex starts at 0 for the first iterator value.
Optional file size limit (disk space usage w/o replication)
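A handler for this variant has the same Iterator[RowWithIndex] => Unit shape as above; the TSVAttributes argument is consumed by the utility itself, so a minimal handler that only counts rows can be written without touching the row type at all.

```scala
// Generic counting handler; passing it where Iterator[RowWithIndex] => Unit is
// expected instantiates A as RowWithIndex.
def countRows[A](rows: Iterator[A]): Unit = {
  var count = 0L
  rows.foreach(_ => count += 1)
  println(s"Processed $count delimited rows")
}
```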
Method to process a single part file in the directory path of an HdfsTabularDataset, by applying a user-specified function on the row iterator of the part file. Compression types supported depend on the storage format (see the format-specific methods below).
Path of the part file to process.
Parent HdfsTabularDataset that contains the part file in its directory path.
function(Iterator[RowWithIndex] => Unit) to apply on the row iterator of the part file, which defines how to process it. Note: the rowNum parameter of RowWithIndex starts at 0 for the first iterator value.
Optional file size limit checked before opening each part file (disk space usage w/o replication)
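A hedged sketch of how this might be wired up; the receiver instance, the parameter order, and the optional size-limit argument are assumptions, since only the method name and the handler shape are documented here.

```scala
import com.alpine.plugin.core.spark.utils.RowWithIndex // assumed package for RowWithIndex

// Handler that logs progress every 100,000 rows using the documented rowNum accessor.
val progressHandler: Iterator[RowWithIndex] => Unit = { rows =>
  rows.foreach { row =>
    if (row.rowNum % 100000 == 0) println(s"...reached row ${row.rowNum}")
  }
}

// Assumed call shape (not confirmed by this page):
// utils.openFileGeneralAndProcessLines(partFilePath, dataset, progressHandler)
```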
Generic method to process a single HDFS file in Parquet storage format. Compression types supported: No Compression, GZIP, Snappy.
HDFS path of the input file
function((reader: ParquetReader[Group], schema: MessageType) => Unit) to apply, which defines how to process the file.
Optional file size limit (disk space usage w/o replication)
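A minimal action of the documented shape for the generic Parquet method, using the parquet-mr example API; the package names below are those of recent Apache Parquet releases (older parquet.* packages may apply instead).

```scala
import org.apache.parquet.example.data.Group
import org.apache.parquet.hadoop.ParquetReader
import org.apache.parquet.schema.MessageType

// Action of the shape (ParquetReader[Group], MessageType) => Unit expected by the generic Parquet method.
val parquetAction: (ParquetReader[Group], MessageType) => Unit = { (reader, schema) =>
  println(s"Parquet columns: ${schema.getFieldCount}")
  var group: Group = reader.read() // read() returns null once the file is exhausted
  while (group != null) {
    // process one record here, e.g. group.getString(0, 0)
    group = reader.read()
  }
}
```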
Method to process a single HDFS file in Parquet storage format by looping through the lines of data and applying a user-specified function on each row. Compression types supported: No Compression, GZIP, Snappy.
HDFS path of the input file
function(Iterator[RowWithIndex] => Unit) to apply on the row iterator of the part file, which defines how to process it. Note: the rowNum parameter of RowWithIndex starts at 0 for the first iterator value.
Optional file size limit (disk space usage w/o replication)
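As with the other row-based methods, any Iterator[RowWithIndex] => Unit function can be supplied. For instance, a handler that keeps a small preview of the rows for inspection after the call returns, written generically so it does not depend on RowWithIndex's members:

```scala
import scala.collection.mutable.ArrayBuffer

// Returns a handler that materializes at most the first `limit` rows into `buffer`.
def collectPreview[A](buffer: ArrayBuffer[A], limit: Int = 100): Iterator[A] => Unit =
  rows => rows.take(limit).foreach(buffer += _)
```

One could then create an empty buffer, pass collectPreview(buffer) as the handler, and inspect the buffer once the method returns.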
Class for reading HDFS files locally on the Alpine Server. Use either:
- the general method openFileGeneralAndProcessLines, which processes rows of data in a single part file of an HdfsTabularDataset input (of any storage format),
- or the relevant method for reading a single HDFS file, depending on the file storage format (CSV, Avro, Parquet).

It supports compressed files. It allows setting a maximum size limit for the input file (disk space usage w/o cluster replication, default 1000 MB) to avoid memory issues if the file is very large.

For each storage format, there is:
- a generic method which takes as argument a function applied to the file reader (e.g. openCSVFileAndDoAction, expecting the function argument action: InputStreamReader => Unit),
- a method which loops through the rows of data in the file and requires as argument a function to apply on an Iterator[RowWithIndex] (e.g. openCSVFileAndProcessLines, expecting the function argument resultHandler: Iterator[RowWithIndex] => Unit).

The second kind of method checks Java memory usage while processing the lines and stops the process if the usage exceeds the limit (default 90%).
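Putting the two styles together for the CSV case; only the method names and the shapes of their function arguments are taken from the description above, while the receiver instance and the full parameter lists are assumptions.

```scala
import java.io.{BufferedReader, InputStreamReader}

// Style 1: a generic action over the raw reader, for openCSVFileAndDoAction.
val headOnly: InputStreamReader => Unit = { in =>
  val firstLine = new BufferedReader(in).readLine()
  println(s"first line: $firstLine")
}

// Style 2: a row handler, for openCSVFileAndProcessLines (memory usage is checked by the utility).
def printEachRow[A](rows: Iterator[A]): Unit = rows.foreach(println)

// Assumed call shapes (not confirmed by this page):
// utils.openCSVFileAndDoAction(hdfsPath, headOnly)
// utils.openCSVFileAndProcessLines(hdfsPath, tsvAttributes, printEachRow)
```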