Exception thrown if the user tries to read a file whose disk space usage on HDFS (w/o replication) is above the configured limit.
Class for reading HDFS files locally on the Alpine Server.
Use either:
- the general method openFileGeneralAndProcessLines, which processes the rows of data in a single part file of an HdfsTabularDataset input (of any storage format);
- or the method specific to the file's storage format (CSV, Avro, Parquet) for reading a single HDFS file. Compressed files are supported.
A maximum size limit can be set for the input file (disk space usage w/o cluster replication, default 1000 MB) to avoid memory issues when the file is very large.
For each storage format there are two methods:
- a generic method that takes as argument a function applied to the file reader (e.g. openCSVFileAndDoAction, expecting action: InputStreamReader => Unit);
- a method that loops through the rows of data in the file and takes as argument a function applied to an Iterator[RowWithIndex] (e.g. openCSVFileAndProcessLines, expecting resultHandler: Iterator[RowWithIndex] => Unit). This second method checks Java memory usage while processing the lines and stops the process if usage exceeds the limit (default 90%). A usage sketch follows.
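For illustration, here is a minimal usage sketch of the row-processing method. Only the resultHandler signature (Iterator[RowWithIndex] => Unit) comes from the description above; the trait name, parameter list, and helper are assumptions made for the example.

  // Stand-in trait mirroring only the method described above; the real SDK
  // reader class has more methods and may take additional parameters.
  trait LocalHdfsCsvReader {
    def openCSVFileAndProcessLines(path: String,
                                   resultHandler: Iterator[RowWithIndex] => Unit): Unit
  }

  // Count the rows of a single part file while streaming, so the whole file
  // is never held in memory at once.
  def countRows(reader: LocalHdfsCsvReader, path: String): Long = {
    var count = 0L
    reader.openCSVFileAndProcessLines(path, rows => rows.foreach(_ => count += 1))
    count
  }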
Class representing a row of data
string values of the row of data
row number in the part file (starts at 0 for the 1st row)
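A minimal sketch of what such a row container could look like; the field names are assumptions, and the actual SDK class may differ.

  case class RowWithIndex(values: Seq[String], // string values of the row of data
                          index: Long)         // row number in the part file, starting at 0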
:: AlpineSdkApi ::
Created by rachelwarren on 4/4/16.
Utils to retrieve all relevant part files for analysis in an HDFS input directory.
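As an illustration only (the SDK utility's actual names and filtering rules are not shown here), the part files of an HDFS output directory can be listed with the Hadoop FileSystem API:

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

  // List the data part files ("part-*") of an HDFS output directory,
  // skipping _SUCCESS and other metadata files.
  def listPartFiles(dir: String, conf: Configuration = new Configuration()): Seq[FileStatus] = {
    val fs = FileSystem.get(conf)
    fs.listStatus(new Path(dir))
      .filter(status => status.isFile && status.getPath.getName.startsWith("part-"))
      .toSeq
  }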
Created by emiliedelongueau on 7/26/17.
Created by rachelwarren on 5/23/16.
Utils to open and uncompress a CSV (or delimited) file as an InputStream. Supported compressions: none, GZIP, and Deflate; Snappy compression is not supported.
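A hedged sketch of the underlying idea, choosing the wrapper stream from the file extension; the real utility's method names and compression-detection logic may differ.

  import java.io.{BufferedInputStream, InputStream, InputStreamReader}
  import java.util.zip.{GZIPInputStream, InflaterInputStream}
  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FileSystem, Path}

  // Open an HDFS file and uncompress it based on its extension
  // (illustrative only; Snappy is deliberately not handled).
  def openDelimitedFile(path: String, conf: Configuration = new Configuration()): InputStreamReader = {
    val raw: InputStream = new BufferedInputStream(FileSystem.get(conf).open(new Path(path)))
    val lower = path.toLowerCase
    val uncompressed: InputStream =
      if (lower.endsWith(".gz")) new GZIPInputStream(raw)             // GZIP
      else if (lower.endsWith(".deflate")) new InflaterInputStream(raw) // Deflate
      else raw                                                          // no compression
    new InputStreamReader(uncompressed, "UTF-8")
  }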
Helper functions that are useful for working with MLlib.
This SQLContext object contains utility functions to create a singleton SQLContext instance, or to get the last created SQLContext instance.
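The idea resembles the lazily instantiated singleton pattern used in Spark examples; the object and method names below are illustrative, not the SDK's actual API.

  import org.apache.spark.SparkContext
  import org.apache.spark.sql.SQLContext

  // Create the SQLContext once per JVM and reuse it on subsequent calls.
  object SQLContextSingleton {
    @transient private var instance: SQLContext = _

    def getOrCreate(sc: SparkContext): SQLContext = synchronized {
      if (instance == null) instance = new SQLContext(sc)
      instance
    }
  }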
The SparkContext object contains a number of implicit conversions and parameters for use with various Spark features.
Exception thrown if the user tries to read a file whose disk space usage on HDFS (w/o replication) is above the configured limit. If the operator developer gets the file size limit parameter (fileSizeLimitMB) from the SparkExecutionContext config, the message should indicate to the end user that this threshold is configurable in the "custom_operator" section of the Alpine configuration file.
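For example, such an exception could carry a message along the following lines (the class name and exact wording are illustrative; fileSizeLimitMB and the "custom_operator" section come from the description above).

  // Illustrative only: the SDK's actual exception class and message differ in detail.
  class FileTooLargeException(fileSizeMB: Long, fileSizeLimitMB: Long)
    extends Exception(
      s"The file uses $fileSizeMB MB on HDFS (without replication), above the configured limit of " +
      s"$fileSizeLimitMB MB. This limit (fileSizeLimitMB) can be changed in the \"custom_operator\" " +
      "section of the Alpine configuration file."
    )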