Commit 07549b20 authored by WeichenXu, committed by Yanbo Liang

[SPARK-19634][ML] Multivariate summarizer - dataframes API

## What changes were proposed in this pull request?

This patch adds the DataFrames API to the multivariate summarizer (mean, variance, etc.). In addition to all the features of MultivariateOnlineSummarizer, it also allows the user to select a subset of the metrics.
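
For illustration, the new API can be used like this (a minimal sketch in the style of the Spark examples, assuming a `SparkSession` named `spark`; the data and column names are made up):

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.stat.Summarizer
import spark.implicits._

val df = Seq(
  (Vectors.dense(2.0, 3.0, 5.0), 1.0),
  (Vectors.dense(4.0, 6.0, 7.0), 2.0)
).toDF("features", "weight")

// Request only the metrics needed; the weight column is optional.
val (mean, variance) = df
  .select(Summarizer.metrics("mean", "variance")
    .summary($"features", $"weight").as("summary"))
  .select("summary.mean", "summary.variance")
  .as[(Vector, Vector)]
  .first()
```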

## How was this patch tested?

Test cases added.

## Performance
Resolves several performance issues found in #17419; further optimization is pending on the SQL team's work. One of the SQL-layer performance issues related to this feature has been resolved in #18712; thanks to liancheng and cloud-fan.

### Performance data

(Tested on my laptop with 2 partitions; trials = 20, warm-up runs = 10.)

Results are in records per millisecond (higher is better).

Vector size / record count | 1/10000000 | 10/1000000 | 100/1000000 | 1000/100000 | 10000/10000
----|----|----|----|----|----
DataFrame | 15149 | 7441 | 2118 | 224 | 21
RDD from DataFrame | 4992 | 4440 | 2328 | 320 | 33
Raw RDD | 53931 | 20683 | 3966 | 528 | 53
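
For context, the first two rows correspond roughly to code paths like the sketch below (illustrative only, not the actual benchmark harness; it assumes a DataFrame `df` with an ml `Vector` column named "features"). The "Raw RDD" row runs the same old-summarizer aggregation over an RDD built without going through a DataFrame.

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.ml.stat.Summarizer
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer
import org.apache.spark.sql.functions.col

// "DataFrame": the new API added by this patch.
df.select(Summarizer.metrics("mean").summary(col("features"))).first()

// "RDD from DataFrame": convert to an RDD and use the old summarizer.
df.select("features").rdd
  .map(row => OldVectors.fromML(row.getAs[Vector](0)))
  .treeAggregate(new MultivariateOnlineSummarizer)(
    (s, v) => s.add(v), (s1, s2) => s1.merge(s2))
```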

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #18798 from WeichenXu123/SPARK-19634-dataframe-summarizer.
parent 96608310
@@ -27,17 +27,7 @@ import org.apache.spark.sql.types._
  */
 private[spark] class VectorUDT extends UserDefinedType[Vector] {
-  override def sqlType: StructType = {
-    // type: 0 = sparse, 1 = dense
-    // We only use "values" for dense vectors, and "size", "indices", and "values" for sparse
-    // vectors. The "values" field is nullable because we might want to add binary vectors later,
-    // which uses "size" and "indices", but not "values".
-    StructType(Seq(
-      StructField("type", ByteType, nullable = false),
-      StructField("size", IntegerType, nullable = true),
-      StructField("indices", ArrayType(IntegerType, containsNull = false), nullable = true),
-      StructField("values", ArrayType(DoubleType, containsNull = false), nullable = true)))
-  }
+  override final def sqlType: StructType = _sqlType
 
   override def serialize(obj: Vector): InternalRow = {
     obj match {
@@ -94,4 +84,16 @@ private[spark] class VectorUDT extends UserDefinedType[Vector] {
   override def typeName: String = "vector"
 
   private[spark] override def asNullable: VectorUDT = this
+
+  private[this] val _sqlType = {
+    // type: 0 = sparse, 1 = dense
+    // We only use "values" for dense vectors, and "size", "indices", and "values" for sparse
+    // vectors. The "values" field is nullable because we might want to add binary vectors later,
+    // which uses "size" and "indices", but not "values".
+    StructType(Seq(
+      StructField("type", ByteType, nullable = false),
+      StructField("size", IntegerType, nullable = true),
+      StructField("indices", ArrayType(IntegerType, containsNull = false), nullable = true),
+      StructField("values", ArrayType(DoubleType, containsNull = false), nullable = true)))
+  }
 }
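
To make the schema above concrete, here is how the two vector kinds map onto those fields (an illustrative sketch derived from the comments in `_sqlType`, not code from this patch):

```scala
import org.apache.spark.ml.linalg.Vectors

// Dense vectors populate only "values".
val dense = Vectors.dense(1.0, 0.0, 3.0)
// serialized as: type = 1, size = null, indices = null, values = [1.0, 0.0, 3.0]

// Sparse vectors populate "size", "indices", and "values".
val sparse = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))
// serialized as: type = 0, size = 3, indices = [0, 2], values = [1.0, 3.0]
```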
@@ -101,6 +101,8 @@ case class InterpretedMutableProjection(expressions: Seq[Expression]) extends Mu
 /**
  * A projection that returns UnsafeRow.
+ *
+ * CAUTION: the returned projection object should *not* be assumed to be thread-safe.
  */
 abstract class UnsafeProjection extends Projection {
 
   override def apply(row: InternalRow): UnsafeRow
@@ -110,11 +112,15 @@ object UnsafeProjection {
   /**
    * Returns an UnsafeProjection for given StructType.
+   *
+   * CAUTION: the returned projection object is *not* thread-safe.
    */
   def create(schema: StructType): UnsafeProjection = create(schema.fields.map(_.dataType))
 
   /**
    * Returns an UnsafeProjection for given Array of DataTypes.
+   *
+   * CAUTION: the returned projection object is *not* thread-safe.
    */
   def create(fields: Array[DataType]): UnsafeProjection = {
     create(fields.zipWithIndex.map(x => BoundReference(x._2, x._1, true)))
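
A minimal sketch of the usage pattern these caveats imply (identifiers are illustrative; `UnsafeProjection` is a Spark-internal API):

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.UnsafeProjection
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val schema = StructType(Seq(StructField("x", IntegerType, nullable = false)))

// Create one projection per thread or task rather than sharing an instance,
// since a projection may carry mutable state across calls.
def projectRows(rows: Iterator[InternalRow]): Iterator[InternalRow] = {
  val proj = UnsafeProjection.create(schema)
  // The returned UnsafeRow may be reused between calls, so copy() it
  // before buffering results.
  rows.map(row => proj(row).copy())
}
```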
@@ -511,6 +511,12 @@ abstract class TypedImperativeAggregate[T] extends ImperativeAggregate {
    * Generates the final aggregation result value for current key group with the aggregation buffer
    * object.
    *
+   * Developer note: the only return types accepted by Spark are:
+   *   - primitive types
+   *   - InternalRow and subclasses
+   *   - ArrayData
+   *   - MapData
+   *
    * @param buffer aggregation buffer object.
    * @return The aggregation result of current key group
    */
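
To illustrate the note above, a result held as a Scala collection must be converted to one of the accepted Catalyst types before `eval` returns it (a hedged sketch; the names below are hypothetical, not from this patch):

```scala
import org.apache.spark.sql.catalyst.util.{ArrayData, GenericArrayData}

// Hypothetical buffer type for an aggregate that accumulates doubles.
case class DoublesBuffer(values: Array[Double])

// eval must return a Catalyst-accepted type; here the Scala Array is
// wrapped as ArrayData before being returned.
def evalResult(buffer: DoublesBuffer): ArrayData =
  new GenericArrayData(buffer.values.toSeq)
```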