Given a DataFrame, the parameters, and an instance of SparkRuntimeUtils, filters out all the rows containing null values. Writes those rows to a file according to the values of the 'dataToWriteParam' and the 'badDataPathParam' (provided in the HdfsParameterUtils class). Returns the DataFrame which does not contain nulls, as well as a String containing an HTML-formatted table with information about what data was removed and if/where it was stored. The message is generated using the 'AddendumWriter' object in the Plugin Core module.
Dirty Data: Spark SQL cannot process CSV files with dirty data (e.g. String values in numeric columns). We use the Drop Malformed option, so in the case of dirty data the operator will not fail, but will silently remove those rows.
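For reference, a minimal sketch of a Drop Malformed read using the plain Spark 2.x CSV API (the schema and path here are illustrative, not part of the SDK):

{{{
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StructType, StructField, IntegerType, DoubleType}

val spark = SparkSession.builder().appName("DropMalformedExample").getOrCreate()

// With mode=DROPMALFORMED, rows that cannot be parsed into the schema
// (e.g. a String value in a numeric column) are silently dropped
// instead of failing the read.
val schema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("amount", DoubleType)
))

val df = spark.read
  .option("header", "true")
  .option("mode", "DROPMALFORMED")
  .schema(schema)
  .csv("hdfs:///path/to/input.csv") // illustrative path
}}}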
Same as 'filterNullDataAndReport', but rather than using the .anyNull method of the Row class, allows the user to define a function which returns a Boolean for each row indicating whether it contains data which should be removed.
Dirty Data: Spark SQL cannot process CSV files with dirty data (e.g. String values in numeric columns). We use the Drop Malformed option, so in the case of dirty data the operator will not fail, but will silently remove those rows.
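For instance, the default null check and a user-defined alternative might look like the following sketch (the custom predicate is illustrative only, not part of the SDK):

{{{
import org.apache.spark.sql.Row

// The default check, expressed directly: anyNull is defined on the Row
// class and returns true if any field of the row is null.
val defaultRemoveRow: Row => Boolean = _.anyNull

// A user-defined predicate for the general variant: also remove rows
// whose first column is an empty string (illustrative only).
val customRemoveRow: Row => Boolean =
  row => row.anyNull || row.getString(0).isEmpty
}}}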
Helper function which uses the AddendumWriter object to generate a message about the bad data and get the data, if any, to write to the bad data file. The dataRemovedDueTo parameter gives the reason the bad data was removed; the message will be of the form "Data removed " + dataRemovedDueTo. E.g. if you pass "due to zero values", the message will read "Data removed due to zero values".
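In other words, the message construction amounts to the following (a hypothetical sketch of the concatenation described above):

{{{
// Hypothetical sketch of how the removal message is assembled.
def removalMessage(dataRemovedDueTo: String): String =
  "Data removed " + dataRemovedDueTo

val msg = removalMessage("due to zero values") // "Data removed due to zero values"
}}}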
If specified by the parameters, writes the data containing null values to a file. Regardless, returns a message about how much data was removed.
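A hypothetical sketch of that contract (the name and signature are illustrative, not the SDK's):

{{{
import org.apache.spark.sql.DataFrame

// The write step is optional, but a report message is always produced.
def reportAndMaybeWrite(badData: DataFrame, badDataPath: Option[String]): String = {
  val count = badData.count()
  badDataPath.foreach(path => badData.write.csv(path))
  s"$count rows containing null values were removed"
}
}}}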
If applicable, writes the bad data as a TSV with default attributes.
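An illustrative version of that write, assuming "default attributes" means a tab delimiter and no header row:

{{{
import org.apache.spark.sql.DataFrame

// Illustrative TSV write; the delimiter and header settings are assumptions.
def writeBadDataAsTsv(badData: DataFrame, path: String): Unit =
  badData.write
    .option("sep", "\t")
    .option("header", "false")
    .csv(path)
}}}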
Split a DataFrame according to the value of the removeRow parameter.
Dirty Data: Spark SQL cannot process CSV files with dirty data (e.g. String values in numeric columns). We use the Drop Malformed option, so in the case of dirty data the operator will not fail, but will silently remove those rows.
A function from spark.sql.Row to Boolean. Should return true if the row should be removed.
The input data as read, before null or bad data removal.
None if no data should be written. Some(n) if the parameter value specifies that n rows should be written.
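Putting the pieces together, the split can be sketched as follows (names are illustrative, not the SDK's):

{{{
import org.apache.spark.sql.{DataFrame, Row}

// removeRow returns true for rows that should be removed, so the input
// divides into the rows to keep and the rows to report/write.
def splitByRemoveRow(inputData: DataFrame, removeRow: Row => Boolean): (DataFrame, DataFrame) = {
  val kept    = inputData.filter((row: Row) => !removeRow(row))
  val removed = inputData.filter((row: Row) => removeRow(row))
  (kept, removed)
}
}}}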
Rather than filtering the data, just provides an RDD of Strings that contains the null data, and writes the data and the report according to the values of the other parameters.
Dirty Data: Spark SQL cannot process CSV files with dirty data (e.g. String values in numeric columns). We use the Drop Malformed option, so in the case of dirty data the operator will not fail, but will silently remove those rows.
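A minimal sketch of that variant (the signature is illustrative):

{{{
import org.apache.spark.rdd.RDD

// The null data arrives pre-formatted as an RDD of Strings (one line per
// removed row), so it can be written out directly when a path is set.
def writeNullData(nullDataAsStrings: RDD[String], badDataPath: Option[String]): Unit =
  badDataPath.foreach(path => nullDataAsStrings.saveAsTextFile(path))
}}}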
Use filterNullDataAndReport
Use filterNullDataAndReportGeneral
Use getNullDataToWriteMessage
Use the signature with HdfsStorageFormatType, or handelNullDataAsDataFrame
Use reportNullDataAsStringRDD