com.alpine.plugin.core.spark.utils
Generic method to process a single HDFS file in Avro storage format. Compression types supported: No Compression, Deflate, Snappy.
HDFS path of the input file
function((reader: FileReader[GenericRecord], schema: Schema) => Unit) to apply, which defines how to process the file.
Optional file size limit (disk space usage w/o replication)
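As a rough illustration of the function argument for the generic Avro method above, a minimal action might iterate the records and inspect the schema. The sketch below uses only the standard Avro API; the call into the utility method itself is not shown, since its exact signature is not reproduced on this page.

```scala
import org.apache.avro.Schema
import org.apache.avro.file.FileReader
import org.apache.avro.generic.GenericRecord

// Action of the shape (FileReader[GenericRecord], Schema) => Unit expected by the generic Avro method.
val avroAction: (FileReader[GenericRecord], Schema) => Unit = { (reader, schema) =>
  println(s"Avro fields: ${schema.getFields}")
  while (reader.hasNext) {
    val record: GenericRecord = reader.next()
    // inspect one record here, e.g. record.get("someColumn")
  }
}
```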
Method to process a single HDFS file in Avro storage format by looping through the lines of data and applying a user-specified function on each row. Compression types supported: No Compression, Deflate, Snappy.
HDFS path of the input file
function(Iterator[RowWithIndex] => Unit) to apply on the row iterator of the part file, which defines how to process it. Note: the rowNum parameter of RowWithIndex starts at 0 for the first iterator value.
Optional file size limit (disk space usage w/o replication)
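For example, a row handler of the required shape could print the first few rows with their indices. Only the rowNum accessor mentioned above is relied on; the import path of RowWithIndex is an assumption.

```scala
import com.alpine.plugin.core.spark.utils.RowWithIndex // assumed package for RowWithIndex

// Handler of the shape Iterator[RowWithIndex] => Unit; rowNum starts at 0 for the first value.
val previewRows: Iterator[RowWithIndex] => Unit = { rows =>
  rows.take(5).foreach { row =>
    println(s"row #${row.rowNum}: $row")
  }
}
```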
Generic method to process a single HDFS file in CSV storage format. Compression types supported: No Compression, Deflate, GZIP.
HDFS path of the input file
function(InputStreamReader => Unit) to apply on the InputStreamReader, which defines how to process the file.
Optional file size limit (disk space usage w/o replication)
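A sketch of an InputStreamReader action for the generic CSV method above, reading the file line by line; the comma delimiter and the assumption that the utility manages the life cycle of the underlying stream are not confirmed by this page.

```scala
import java.io.{BufferedReader, InputStreamReader}

// Action of the shape InputStreamReader => Unit expected by the generic CSV method.
// The underlying stream is assumed to be opened and closed by the utility.
val csvAction: InputStreamReader => Unit = { streamReader =>
  val buffered = new BufferedReader(streamReader)
  Iterator
    .continually(buffered.readLine())
    .takeWhile(_ != null)
    .foreach { line =>
      val fields = line.split(',') // assumes a comma-delimited file
      // process the fields of one line here
    }
}
```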
Method to process a single HDFS file in CSV storage format by looping through the lines of data and applying a user-specified function on each row. Compression types supported: No Compression, Deflate, GZIP.
HDFS path of the input file
TSVAttributes of the delimited input.
function(Iterator[RowWithIndex] => Unit) to apply on the row iterator of the part file, which defines how to process it. Note: the rowNum parameter of RowWithIndex starts at 0 for the first iterator value.
Optional file size limit (disk space usage w/o replication)
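A handler for this variant has the same Iterator[RowWithIndex] => Unit shape as above; the TSVAttributes argument is consumed by the utility itself, so a minimal handler that only counts rows can be written without touching the row type at all.

```scala
// Generic counting handler; passing it where Iterator[RowWithIndex] => Unit is
// expected instantiates A as RowWithIndex.
def countRows[A](rows: Iterator[A]): Unit = {
  var count = 0L
  rows.foreach(_ => count += 1)
  println(s"Processed $count delimited rows")
}
```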
Method to process a single part file in the directory path of an HdfsTabularDataset, by applying a user-specified function on the row iterator of the part file. Compression types supported depend on the storage format (see the format-specific methods below).
Path of the part file to process.
Parent HdfsTabularDataset that contains the part file in its directory path.
function(Iterator[RowWithIndex] => Unit) to apply on the row iterator of the part file, which defines how to process it. Note: the rowNum parameter of RowWithIndex starts at 0 for the first iterator value.
Optional file size limit checked before opening each part file (disk space usage w/o replication)
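A hedged sketch of how this might be wired up; the receiver instance, the parameter order, and the optional size-limit argument are assumptions, since only the method name and the handler shape are documented here.

```scala
import com.alpine.plugin.core.spark.utils.RowWithIndex // assumed package for RowWithIndex

// Handler that logs progress every 100,000 rows using the documented rowNum accessor.
val progressHandler: Iterator[RowWithIndex] => Unit = { rows =>
  rows.foreach { row =>
    if (row.rowNum % 100000 == 0) println(s"...reached row ${row.rowNum}")
  }
}

// Assumed call shape (not confirmed by this page):
// utils.openFileGeneralAndProcessLines(partFilePath, dataset, progressHandler)
```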
Generic method to process a single HDFS file in Parquet storage format. Compression types supported: No Compression, GZIP, Snappy.
HDFS path of the input file
function((reader: ParquetReader[Group], schema: MessageType) => Unit) to apply, which defines how to process the file.
Optional file size limit (disk space usage w/o replication)
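A minimal action of the documented shape for the generic Parquet method, using the parquet-mr example API; the package names below are those of recent Apache Parquet releases (older parquet.* packages may apply instead).

```scala
import org.apache.parquet.example.data.Group
import org.apache.parquet.hadoop.ParquetReader
import org.apache.parquet.schema.MessageType

// Action of the shape (ParquetReader[Group], MessageType) => Unit expected by the generic Parquet method.
val parquetAction: (ParquetReader[Group], MessageType) => Unit = { (reader, schema) =>
  println(s"Parquet columns: ${schema.getFieldCount}")
  var group: Group = reader.read() // read() returns null once the file is exhausted
  while (group != null) {
    // process one record here, e.g. group.getString(0, 0)
    group = reader.read()
  }
}
```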
Method to process a single HDFS file in Parquet storage format by looping through the lines of data and applying a user-specified function on each row. Compression types supported: No Compression, GZIP, Snappy.
HDFS path of the input file
function(Iterator[RowWithIndex] => Unit) to apply on the row iterator of the part file, which defines how to process it. Note: the rowNum parameter of RowWithIndex starts at 0 for the first iterator value.
Optional file size limit (disk space usage w/o replication)
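As with the other row-based methods, any Iterator[RowWithIndex] => Unit function can be supplied. For instance, a handler that keeps a small preview of the rows for inspection after the call returns, written generically so it does not depend on RowWithIndex's members:

```scala
import scala.collection.mutable.ArrayBuffer

// Returns a handler that materializes at most the first `limit` rows into `buffer`.
def collectPreview[A](buffer: ArrayBuffer[A], limit: Int = 100): Iterator[A] => Unit =
  rows => rows.take(limit).foreach(buffer += _)
```

One could then create an empty buffer, pass collectPreview(buffer) as the handler, and inspect the buffer once the method returns.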
Class for reading HDFS files locally on the Alpine Server. Use either:
- the general method openFileGeneralAndProcessLines, which processes rows of data in a single part file of an HdfsTabularDataset input (of any storage format),
- or the relevant method for reading a single HDFS file, depending on the file storage format (CSV, Avro, Parquet).

It supports compressed files. It allows setting a maximum size limit for the input file (disk space usage w/o cluster replication, default 1000 MB) to avoid memory issues if the file is very large.

For each storage format, there is:
- a generic method which takes as argument a function applied to the file reader (e.g. openCSVFileAndDoAction, expecting the function argument action: InputStreamReader => Unit),
- a method which loops through the rows of data in the file and requires as argument a function to apply on an Iterator[RowWithIndex] (e.g. openCSVFileAndProcessLines, expecting the function argument resultHandler: Iterator[RowWithIndex] => Unit).

The second kind of method checks Java memory usage while processing the lines and stops the process if the usage exceeds the limit (default 90%).
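Putting the two styles together for the CSV case; only the method names and the shapes of their function arguments are taken from the description above, while the receiver instance and the full parameter lists are assumptions.

```scala
import java.io.{BufferedReader, InputStreamReader}

// Style 1: a generic action over the raw reader, for openCSVFileAndDoAction.
val headOnly: InputStreamReader => Unit = { in =>
  val firstLine = new BufferedReader(in).readLine()
  println(s"first line: $firstLine")
}

// Style 2: a row handler, for openCSVFileAndProcessLines (memory usage is checked by the utility).
def printEachRow[A](rows: Iterator[A]): Unit = rows.foreach(println)

// Assumed call shapes (not confirmed by this page):
// utils.openCSVFileAndDoAction(hdfsPath, headOnly)
// utils.openCSVFileAndProcessLines(hdfsPath, tsvAttributes, printEachRow)
```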