  1. Aug 31, 2016
    • Michael Gummelt's avatar
      [SPARK-17320] add build_profile_flags entry to mesos build module · 0611b3a2
      Michael Gummelt authored
      ## What changes were proposed in this pull request?
      
      add build_profile_flags entry to mesos build module
      
      ## How was this patch tested?
      
      unit tests
      
      Author: Michael Gummelt <mgummelt@mesosphere.io>
      
      Closes #14885 from mgummelt/mesos-profile.
      0611b3a2
    • hyukjinkwon's avatar
      [MINOR][SPARKR] Verbose build comment in WINDOWS.md rather than promoting... · 9953442a
      hyukjinkwon authored
      [MINOR][SPARKR] Verbose build comment in WINDOWS.md rather than promoting default build without Hive
      
      ## What changes were proposed in this pull request?
      
      This PR fixes `WINDOWS.md` to point readers at the other build profiles in http://spark.apache.org/docs/latest/building-spark.html#building-with-buildmvn rather than directly telling them to run `mvn -DskipTests -Psparkr package`, which builds without Hive support.
      
      ## How was this patch tested?
      
      Manually,
      
      <img width="626" alt="2016-08-31 6 01 08" src="https://cloud.githubusercontent.com/assets/6477701/18122549/f6297b2c-6fa4-11e6-9b5e-fd4347355d87.png">
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #14890 from HyukjinKwon/minor-build-r.
      9953442a
    • Wenchen Fan's avatar
      [SPARK-17180][SPARK-17309][SPARK-17323][SQL] create AlterViewAsCommand to handle ALTER VIEW AS · 12fd0cd6
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      Currently we use `CreateViewCommand` to implement ALTER VIEW AS, which has 3 bugs:
      
      1. SPARK-17180: ALTER VIEW AS should alter temp view if view name has no database part and temp view exists
      2. SPARK-17309: ALTER VIEW AS should issue exception if view does not exist.
      3. SPARK-17323: ALTER VIEW AS should keep the previous table properties, comment, create_time, etc.
      
      The root cause is that ALTER VIEW AS is quite different from CREATE VIEW, so we need a different code path to handle them. However, in `CreateViewCommand` there is no way to distinguish ALTER VIEW AS from CREATE VIEW without introducing an extra flag. Instead of doing that, I think a more natural way is to separate the ALTER VIEW AS logic into a new command.
      
      ## How was this patch tested?
      
      new tests in SQLViewSuite
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #14874 from cloud-fan/minor4.
      12fd0cd6
    • Jeff Zhang's avatar
      [SPARK-17178][SPARKR][SPARKSUBMIT] Allow to set sparkr shell command through --conf · fa634793
      Jeff Zhang authored
      ## What changes were proposed in this pull request?
      
      Allow users to set the sparkr shell command through `--conf spark.r.shell.command`
      
      ## How was this patch tested?
      
      A unit test is added, and it was also verified manually through
      ```
      bin/sparkr --master yarn-client --conf spark.r.shell.command=/usr/local/bin/R
      ```
      
      Author: Jeff Zhang <zjffdu@apache.org>
      
      Closes #14744 from zjffdu/SPARK-17178.
      fa634793
  2. Aug 30, 2016
    • Kazuaki Ishizaki's avatar
      [SPARK-15985][SQL] Eliminate redundant cast from an array without null or a map without null · d92cd227
      Kazuaki Ishizaki authored
      ## What changes were proposed in this pull request?
      
      This PR eliminates redundant cast from an `ArrayType` with `containsNull = false` or a `MapType` with `containsNull = false`.
      
      For example, in the `ArrayType` case, the current implementation leaves a cast `cast(value#63 as array<double>).toDoubleArray`. However, we can eliminate `cast(value#63 as array<double>)` if we know `value#63` does not contain `null`. This PR applies this elimination for `ArrayType` and `MapType` in `SimplifyCasts` during the plan optimization phase.
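
      A minimal sketch of the elimination idea (the rule name comes from this PR, but the body below is an assumption for illustration, not the actual Spark code):

      ```scala
      import org.apache.spark.sql.catalyst.expressions.Cast
      import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
      import org.apache.spark.sql.catalyst.rules.Rule
      import org.apache.spark.sql.types.{ArrayType, MapType}

      object SimplifyCastsSketch extends Rule[LogicalPlan] {
        def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
          // Casting a null-free array/map to the same element type that merely allows
          // nulls cannot fail and changes nothing, so the Cast node can be dropped.
          case c @ Cast(child, target) => (child.dataType, target) match {
            case (ArrayType(f, false), ArrayType(t, true)) if f == t => child
            case (MapType(fk, fv, false), MapType(tk, tv, true)) if fk == tk && fv == tv => child
            case _ => c
          }
        }
      }
      ```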
      
      In summary, we got 1.2-1.3x performance improvements over the code before applying this PR.
      Here are performance results of benchmark programs:
      ```
        test("Read array in Dataset") {
          import sparkSession.implicits._
      
          val iters = 5
          val n = 1024 * 1024
          val rows = 15
      
          val benchmark = new Benchmark("Read primitive array", n)
      
          val rand = new Random(511)
          val intDS = sparkSession.sparkContext.parallelize(0 until rows, 1)
            .map(i => Array.tabulate(n)(i => i)).toDS()
          intDS.count() // force to create ds
          val lastElement = n - 1
          val randElement = rand.nextInt(lastElement)
      
          benchmark.addCase(s"Read int array in Dataset", numIters = iters)(iter => {
            val idx0 = randElement
            val idx1 = lastElement
            intDS.map(a => a(0) + a(idx0) + a(idx1)).collect
          })
      
          val doubleDS = sparkSession.sparkContext.parallelize(0 until rows, 1)
            .map(i => Array.tabulate(n)(i => i.toDouble)).toDS()
          doubleDS.count() // force to create ds
      
          benchmark.addCase(s"Read double array in Dataset", numIters = iters)(iter => {
            val idx0 = randElement
            val idx1 = lastElement
            doubleDS.map(a => a(0) + a(idx0) + a(idx1)).collect
          })
      
          benchmark.run()
        }
      
      Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 on Mac OS X 10.10.4
      Intel(R) Core(TM) i5-5257U CPU  2.70GHz
      
      without this PR
      Read primitive array:                    Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      ------------------------------------------------------------------------------------------------
      Read int array in Dataset                      525 /  690          2.0         500.9       1.0X
      Read double array in Dataset                   947 / 1209          1.1         902.7       0.6X
      
      with this PR
      Read primitive array:                    Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      ------------------------------------------------------------------------------------------------
      Read int array in Dataset                      400 /  492          2.6         381.5       1.0X
      Read double array in Dataset                   788 /  870          1.3         751.4       0.5X
      ```
      
      An example program that originally caused this performance issue.
      ```
      val ds = Seq(Array(1.0, 2.0, 3.0), Array(4.0, 5.0, 6.0)).toDS()
      val ds2 = ds.map(p => {
           var s = 0.0
           for (i <- 0 to 2) { s += p(i) }
           s
         })
      ds2.show
      ds2.explain(true)
      ```
      
      Plans before this PR
      ```
      == Parsed Logical Plan ==
      'SerializeFromObject [input[0, double, true] AS value#68]
      +- 'MapElements <function1>, obj#67: double
         +- 'DeserializeToObject unresolveddeserializer(upcast(getcolumnbyordinal(0, ArrayType(DoubleType,false)), ArrayType(DoubleType,false), - root class: "scala.Array").toDoubleArray), obj#66: [D
            +- LocalRelation [value#63]
      
      == Analyzed Logical Plan ==
      value: double
      SerializeFromObject [input[0, double, true] AS value#68]
      +- MapElements <function1>, obj#67: double
         +- DeserializeToObject cast(value#63 as array<double>).toDoubleArray, obj#66: [D
            +- LocalRelation [value#63]
      
      == Optimized Logical Plan ==
      SerializeFromObject [input[0, double, true] AS value#68]
      +- MapElements <function1>, obj#67: double
         +- DeserializeToObject cast(value#63 as array<double>).toDoubleArray, obj#66: [D
            +- LocalRelation [value#63]
      
      == Physical Plan ==
      *SerializeFromObject [input[0, double, true] AS value#68]
      +- *MapElements <function1>, obj#67: double
         +- *DeserializeToObject cast(value#63 as array<double>).toDoubleArray, obj#66: [D
            +- LocalTableScan [value#63]
      ```
      
      Plans after this PR
      ```
      == Parsed Logical Plan ==
      'SerializeFromObject [input[0, double, true] AS value#6]
      +- 'MapElements <function1>, obj#5: double
         +- 'DeserializeToObject unresolveddeserializer(upcast(getcolumnbyordinal(0, ArrayType(DoubleType,false)), ArrayType(DoubleType,false), - root class: "scala.Array").toDoubleArray), obj#4: [D
            +- LocalRelation [value#1]
      
      == Analyzed Logical Plan ==
      value: double
      SerializeFromObject [input[0, double, true] AS value#6]
      +- MapElements <function1>, obj#5: double
         +- DeserializeToObject cast(value#1 as array<double>).toDoubleArray, obj#4: [D
            +- LocalRelation [value#1]
      
      == Optimized Logical Plan ==
      SerializeFromObject [input[0, double, true] AS value#6]
      +- MapElements <function1>, obj#5: double
         +- DeserializeToObject value#1.toDoubleArray, obj#4: [D
            +- LocalRelation [value#1]
      
      == Physical Plan ==
      *SerializeFromObject [input[0, double, true] AS value#6]
      +- *MapElements <function1>, obj#5: double
         +- *DeserializeToObject value#1.toDoubleArray, obj#4: [D
            +- LocalTableScan [value#1]
      ```
      
      ## How was this patch tested?
      
      Tested by new test cases in `SimplifyCastsSuite`
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #13704 from kiszk/SPARK-15985.
      d92cd227
    • Shixiong Zhu's avatar
      [SPARK-17318][TESTS] Fix ReplSuite replicating blocks of object with class defined in repl · 231f9732
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      There have been a lot of failures recently: http://spark-tests.appspot.com/tests/org.apache.spark.repl.ReplSuite/replicating%20blocks%20of%20object%20with%20class%20defined%20in%20repl
      
      This PR just changed the persist level to `MEMORY_AND_DISK_2` to avoid blocks being evicted from memory.
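
      For reference, a minimal sketch of the storage-level change (an illustrative spark-shell snippet, not the actual test code in `ReplSuite`):

      ```scala
      import org.apache.spark.storage.StorageLevel

      // With MEMORY_ONLY_2, replicated blocks can simply be evicted under memory pressure;
      // MEMORY_AND_DISK_2 lets them spill to disk instead, so the replication check is stable.
      val rdd = sc.parallelize(1 to 1000).persist(StorageLevel.MEMORY_AND_DISK_2)
      rdd.count()
      ```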
      
      ## How was this patch tested?
      
      Jenkins unit tests.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #14884 from zsxwing/SPARK-17318.
      231f9732
    • Alex Bozarth's avatar
      [SPARK-17243][WEB UI] Spark 2.0 History Server won't load with very large application history · f7beae6d
      Alex Bozarth authored
      ## What changes were proposed in this pull request?
      
      With the new History Server, the summary page loads the application list via the REST API, which makes it very slow, or impossible, to load with a large (10K+) application history. This PR fixes this by adding the `spark.history.ui.maxApplications` conf to limit the number of applications the History Server displays. This is accomplished using a new optional `limit` param for the `applications` API. (Note this only applies to what the summary page displays; all of the Application UIs are still accessible if the user knows the App ID and goes to the Application UI directly.)
      
      I've also added a new test for the `limit` param in `HistoryServerSuite.scala`
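
      For illustration, a hypothetical way to use the new conf and `limit` param (host and value are examples only):

      ```
      # In conf/spark-defaults.conf on the History Server:
      spark.history.ui.maxApplications 500

      # The summary page then fetches at most that many applications via the REST API, e.g.:
      # GET http://<history-server>:18080/api/v1/applications?limit=500
      ```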
      
      ## How was this patch tested?
      
      Manual testing and dev/run-tests
      
      Author: Alex Bozarth <ajbozart@us.ibm.com>
      
      Closes #14835 from ajbozarth/spark17243.
      f7beae6d
    • Shixiong Zhu's avatar
      [SPARK-17314][CORE] Use Netty's DefaultThreadFactory to enable its fast ThreadLocal impl · 02ac379e
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      When a thread is a Netty FastThreadLocalThread, Netty will use its fast ThreadLocal implementation, which performs better than the JDK's (see the benchmark results in https://github.com/netty/netty/pull/4417; note that PR is not a fix to Netty's FastThreadLocal, it just fixed an issue in Netty's benchmark code).
      
      This PR just changed the ThreadFactory to Netty's DefaultThreadFactory which will use FastThreadLocalThread. There is also a minor change to the thread names. See https://github.com/netty/netty/blob/netty-4.0.22.Final/common/src/main/java/io/netty/util/concurrent/DefaultThreadFactory.java#L94
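
      A minimal sketch of the idea, assuming a plain Java executor (the actual patch touches Spark's transport/RPC thread pools; the pool name here is made up):

      ```scala
      import java.util.concurrent.Executors

      import io.netty.util.concurrent.DefaultThreadFactory

      // Threads built by DefaultThreadFactory are FastThreadLocalThreads, so code running on
      // them gets Netty's fast ThreadLocal implementation instead of the JDK's.
      val threadFactory = new DefaultThreadFactory("example-rpc-pool", /* daemon = */ true)
      val pool = Executors.newCachedThreadPool(threadFactory)
      ```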
      
      ## How was this patch tested?
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #14879 from zsxwing/netty-thread.
      02ac379e
    • Josh Rosen's avatar
      [SPARK-17304] Fix perf. issue caused by TaskSetManager.abortIfCompletelyBlacklisted · fb200843
      Josh Rosen authored
      This patch addresses a minor scheduler performance issue that was introduced in #13603. If you run
      
      ```
      sc.parallelize(1 to 100000, 100000).map(identity).count()
      ```
      
      then most of the time ends up being spent in `TaskSetManager.abortIfCompletelyBlacklisted()`:
      
      ![image](https://cloud.githubusercontent.com/assets/50748/18071032/428732b0-6e07-11e6-88b2-c9423cd61f53.png)
      
      When processing resource offers, the scheduler uses a nested loop which considers every task set at multiple locality levels:
      
      ```scala
         for (taskSet <- sortedTaskSets; maxLocality <- taskSet.myLocalityLevels) {
            do {
              launchedTask = resourceOfferSingleTaskSet(
                  taskSet, maxLocality, shuffledOffers, availableCpus, tasks)
            } while (launchedTask)
          }
      ```
      
      In order to prevent jobs with globally blacklisted tasks from hanging, #13603 added a `taskSet.abortIfCompletelyBlacklisted` call inside of `resourceOfferSingleTaskSet`; if a call to `resourceOfferSingleTaskSet` fails to schedule any tasks, then `abortIfCompletelyBlacklisted` checks whether the tasks are completely blacklisted in order to figure out whether they will ever be schedulable. The problem with this placement is that the last call to `resourceOfferSingleTaskSet` in the `while` loop always returns `false` (that is how the loop terminates), which means `resourceOfferSingleTaskSet` will call `abortIfCompletelyBlacklisted`, so almost every call to `resourceOffers` triggers the `abortIfCompletelyBlacklisted` check for every task set.
      
      Instead, I think that this call should be moved out of the innermost loop and should be called _at most_ once per task set in case none of the task set's tasks can be scheduled at any locality level.
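
      A rough sketch of the restructured loop (variable and helper names follow the snippet above and are illustrative, not the exact patch):

      ```scala
      for (taskSet <- sortedTaskSets) {
        var launchedAnyTask = false
        for (maxLocality <- taskSet.myLocalityLevels) {
          var launchedTask = false
          do {
            launchedTask = resourceOfferSingleTaskSet(
                taskSet, maxLocality, shuffledOffers, availableCpus, tasks)
            launchedAnyTask = launchedAnyTask || launchedTask
          } while (launchedTask)
        }
        // At most one blacklist check per task set, and only if nothing could be
        // scheduled at any locality level.
        if (!launchedAnyTask) {
          taskSet.abortIfCompletelyBlacklisted(hostToExecutors)
        }
      }
      ```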
      
      Before this patch's changes, the microbenchmark example that I posted above took 35 seconds to run, but it now only takes 15 seconds after this change.
      
      /cc squito and kayousterhout for review.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #14871 from JoshRosen/bail-early-if-no-cpus.
      fb200843
    • Ferdinand Xu's avatar
      [SPARK-5682][CORE] Add encrypted shuffle in spark · 4b4e329e
      Ferdinand Xu authored
      This patch uses the Apache Commons Crypto library to enable shuffle encryption support.
      
      Author: Ferdinand Xu <cheng.a.xu@intel.com>
      Author: kellyzly <kellyzly@126.com>
      
      Closes #8880 from winningsix/SPARK-10771.
      4b4e329e
    • Xin Ren's avatar
      [MINOR][MLLIB][SQL] Clean up unused variables and unused import · 27209252
      Xin Ren authored
      ## What changes were proposed in this pull request?
      
      Clean up unused variables and unused import statements, remove unnecessary `return` and `toArray` calls, and make some other style improvements that I noticed while walking through the code examples.
      
      ## How was this patch tested?
      
      Tested manually on a local laptop.
      
      Author: Xin Ren <iamshrek@126.com>
      
      Closes #14836 from keypointt/codeWalkThroughML.
      27209252
    • Dmitriy Sokolov's avatar
      [MINOR][DOCS] Fix minor typos in python example code · d4eee993
      Dmitriy Sokolov authored
      ## What changes were proposed in this pull request?
      
      Fix minor typos in the Python example code in the streaming programming guide
      
      ## How was this patch tested?
      
      N/A
      
      Author: Dmitriy Sokolov <silentsokolov@gmail.com>
      
      Closes #14805 from silentsokolov/fix-typos.
      d4eee993
    • Sean Owen's avatar
      [SPARK-17264][SQL] DataStreamWriter should document that it only supports Parquet for now · befab9c1
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Clarify that only parquet files are supported by DataStreamWriter now
      
      ## How was this patch tested?
      
      (Doc build -- no functional changes to test)
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #14860 from srowen/SPARK-17264.
      befab9c1
    • Xin Ren's avatar
      [SPARK-17276][CORE][TEST] Stop env params output on Jenkins job page · 2d76cb11
      Xin Ren authored
      https://issues.apache.org/jira/browse/SPARK-17276
      
      ## What changes were proposed in this pull request?
      
      When trying to find the error message in a failed Jenkins build job, the huge dump of environment parameters gets in the way.
      The env parameter output should be muted.
      
      ![screen shot 2016-08-26 at 10 52 07 pm](https://cloud.githubusercontent.com/assets/3925641/18025581/b8d567ba-6be2-11e6-9eeb-6aec223f1730.png)
      
      ## How was this patch tested?
      
      Tested manually on local laptop.
      
      Author: Xin Ren <iamshrek@126.com>
      
      Closes #14848 from keypointt/SPARK-17276.
      2d76cb11
    • gatorsmile's avatar
      [SPARK-17234][SQL] Table Existence Checking when Index Table with the Same Name Exists · bca79c82
      gatorsmile authored
      ### What changes were proposed in this pull request?
      Hive Index tables are not supported by Spark SQL. Thus, we issue an exception when users try to access Hive Index tables. When the internal function `tableExists` tries to access Hive Index tables, it always gets the same error message: ```Hive index table is not supported```. This message could be confusing to users, since their SQL operations could be completely unrelated to Hive Index tables. For example, when users try to alter a table to a new name and there exists an index table with the same name, the expected exception should be a `TableAlreadyExistsException`.
      
      This PR made the following changes:
      - Introduced a new `AnalysisException` type: `SQLFeatureNotSupportedException`. When users try to access an `Index Table`, we will issue a `SQLFeatureNotSupportedException`.
      - `tableExists` returns `true` when it hits a `SQLFeatureNotSupportedException` whose feature is `Hive index table` (see the sketch below).
      - Added a `requireTableNotExists` check to `SessionCatalog`'s `createTable` API; without it, the current implementation relies on Hive's internal checking.
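
      A rough sketch of the `tableExists` behaviour from the second bullet (the helper name and exception handling below are assumptions for illustration, not the actual implementation):

      ```scala
      def tableExists(db: String, table: String): Boolean = {
        try {
          getRawTable(db, table)   // hypothetical helper that calls into the Hive client
          true
        } catch {
          // The "Hive index table is not supported" error still proves the name is taken,
          // so report the table as existing rather than surfacing the confusing message.
          case e: SQLFeatureNotSupportedException if e.getMessage.contains("Hive index table") =>
            true
          case _: NoSuchTableException =>
            false
        }
      }
      ```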
      
      ### How was this patch tested?
      Added a test case
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #14801 from gatorsmile/tableExists.
      bca79c82
    • Takeshi YAMAMURO's avatar
      [SPARK-17289][SQL] Fix a bug to satisfy sort requirements in partial aggregations · 94922d79
      Takeshi YAMAMURO authored
      ## What changes were proposed in this pull request?
      Partial aggregations are generated in `EnsureRequirements`, but the planner fails to
      check if partial aggregation satisfies sort requirements.
      For the following query:
      ```
      val df2 = (0 to 1000).map(x => (x % 2, x.toString)).toDF("a", "b").createOrReplaceTempView("t2")
      spark.sql("select max(b) from t2 group by a").explain(true)
      ```
      Currently, no Sort operator is inserted before the partial aggregation, which breaks sort-based partial aggregation:
      ```
      == Physical Plan ==
      SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17])
      +- *Sort [a#5 ASC], false, 0
         +- Exchange hashpartitioning(a#5, 200)
            +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, max#19])
               +- LocalTableScan [a#5, b#6]
      ```
      Actually, a correct plan is:
      ```
      == Physical Plan ==
      SortAggregate(key=[a#5], functions=[max(b#6)], output=[max(b)#17])
      +- *Sort [a#5 ASC], false, 0
         +- Exchange hashpartitioning(a#5, 200)
            +- SortAggregate(key=[a#5], functions=[partial_max(b#6)], output=[a#5, max#19])
               +- *Sort [a#5 ASC], false, 0
                  +- LocalTableScan [a#5, b#6]
      ```
      
      ## How was this patch tested?
      Added tests in `PlannerSuite`.
      
      Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
      
      Closes #14865 from maropu/SPARK-17289.
      94922d79
    • frreiss's avatar
      [SPARK-17303] Added spark-warehouse to dev/.rat-excludes · 8fb445d9
      frreiss authored
      ## What changes were proposed in this pull request?
      
      Excludes the `spark-warehouse` directory from the Apache RAT checks that src/run-tests performs. `spark-warehouse` is created by some of the Spark SQL tests, as well as by `bin/spark-sql`.
      
      ## How was this patch tested?
      
      Ran src/run-tests twice. The second time, the script failed because the first run had created the `spark-warehouse` directory, which the RAT checks then flagged.
      Made the change in this PR.
      Ran src/run-tests a third time; RAT checks succeeded.
      
      Author: frreiss <frreiss@us.ibm.com>
      
      Closes #14870 from frreiss/fred-17303.
      8fb445d9
  3. Aug 29, 2016
    • Josh Rosen's avatar
      [SPARK-17301][SQL] Remove unused classTag field from AtomicType base class · 48b459dd
      Josh Rosen authored
      There's an unused `classTag` val in the AtomicType base class which is causing unnecessary slowness in deserialization because it needs to grab ScalaReflectionLock and create a new runtime reflection mirror. Removing this unused code gives a small but measurable performance boost in SQL task deserialization.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #14869 from JoshRosen/remove-unused-classtag.
      48b459dd
    • Shivaram Venkataraman's avatar
      [SPARK-16581][SPARKR] Make JVM backend calling functions public · 736a7911
      Shivaram Venkataraman authored
      ## What changes were proposed in this pull request?
      
      This change exposes a public API in SparkR to create objects and call methods on the Spark driver JVM.
      
      ## How was this patch tested?
      
      Unit tests, CRAN checks
      
      Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      
      Closes #14775 from shivaram/sparkr-java-api.
      736a7911
    • Davies Liu's avatar
      [SPARK-17063] [SQL] Improve performance of MSCK REPAIR TABLE with Hive metastore · 48caec25
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      This PR splits the single `createPartitions()` call into smaller batches, which can prevent the Hive metastore from running out of memory (caused by millions of partitions).
      
      It will also try to gather all the fast stats (number of files and total size of all files) in parallel, to avoid the bottleneck of listing the files in the metastore sequentially; this is controlled by `spark.sql.gatherFastStats` (enabled by default).
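
      A minimal sketch of the batching idea (batch size, client, and variable names are illustrative, not the actual code or defaults):

      ```scala
      // Add partitions in fixed-size batches instead of one huge createPartitions() call,
      // so a single metastore RPC never has to carry millions of partitions.
      val batchSize = 100
      partitionSpecs.grouped(batchSize).foreach { batch =>
        hiveClient.createPartitions(db, table, batch, ignoreIfExists = true)
      }
      ```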
      
      ## How was this patch tested?
      
      Tested locally with 10000 partitions and 100 files, using the embedded metastore. Without gathering fast stats in parallel, adding the partitions took 153 seconds. After enabling it, gathering the fast stats took about 34 seconds and adding the partitions took 25 seconds (most of the time spent in the object store), 59 seconds in total, which is 2.5X faster (on a larger cluster, gathering will be much faster).
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #14607 from davies/repair_batch.
      48caec25
    • Junyang Qian's avatar
      [SPARKR][MINOR] Fix LDA doc · 6a0fda2c
      Junyang Qian authored
      ## What changes were proposed in this pull request?
      
      This PR fixes the name of the `SparkDataFrame` used in the example. It also gives a reference URL for an example data file that users can play with.
      
      ## How was this patch tested?
      
      Manual test.
      
      Author: Junyang Qian <junyangq@databricks.com>
      
      Closes #14853 from junyangq/SPARKR-FixLDADoc.
      6a0fda2c
    • Seigneurin, Alexis (CONT)'s avatar
      fixed a typo · 08913ce0
      Seigneurin, Alexis (CONT) authored
      idempotant -> idempotent
      
      Author: Seigneurin, Alexis (CONT) <Alexis.Seigneurin@capitalone.com>
      
      Closes #14833 from aseigneurin/fix-typo.
      08913ce0
    • Sean Owen's avatar
      [BUILD] Closes some stale PRs. · 1a48c004
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Closes #10995
      Closes #13658
      Closes #14505
      Closes #14536
      Closes #12753
      Closes #14449
      Closes #12694
      Closes #12695
      Closes #14810
      Closes #10572
      
      ## How was this patch tested?
      
      N/A
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #14849 from srowen/CloseStalePRs.
      1a48c004
  4. Aug 28, 2016
    • Tejas Patil's avatar
      [SPARK-17271][SQL] Planner adds un-necessary Sort even if child ordering is... · 095862a3
      Tejas Patil authored
      [SPARK-17271][SQL] Planner adds un-necessary Sort even if child ordering is semantically same as required ordering
      
      ## What changes were proposed in this pull request?
      
      Jira : https://issues.apache.org/jira/browse/SPARK-17271
      
      The planner adds an unneeded SORT operation because of a bug in the way `SortOrder` is compared at https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala#L253
      `SortOrder` needs to be compared semantically because the `Expression`s within two `SortOrder`s can be "semantically equal" without being literally equal objects.
      
      e.g. in the case of `sql("SELECT * FROM table1 a JOIN table2 b ON a.col1=b.col1")`
      
      Expression in required SortOrder:
      ```
            AttributeReference(
              name = "col1",
              dataType = LongType,
              nullable = false
            ) (exprId = exprId,
              qualifier = Some("a")
            )
      ```
      
      Expression in child SortOrder:
      ```
            AttributeReference(
              name = "col1",
              dataType = LongType,
              nullable = false
            ) (exprId = exprId)
      ```
      
      Notice that the required column has a qualifier while the child attribute does not, but the underlying expression is the same, and hence in this case we can say that the child satisfies the required sort order.
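
      A rough sketch of the semantic comparison described above (the method name comes from the PR; the body is an assumption):

      ```scala
      // Two SortOrders are interchangeable when they sort in the same direction and their
      // child expressions are semantically equal, even if the Expression objects differ
      // (e.g. one carries a qualifier and the other does not).
      def sortOrderSemanticEquals(required: SortOrder, child: SortOrder): Boolean =
        required.direction == child.direction &&
          required.child.semanticEquals(child.child)
      ```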
      
      This PR includes following changes:
      - Added a `semanticEquals` method to `SortOrder` so that it can compare underlying child expressions semantically (and not using default Object.equals)
      - Fixed `EnsureRequirements` to use semantic comparison of SortOrder
      
      ## How was this patch tested?
      
      - Added a test case to `PlannerSuite` and ran the rest of the tests in `PlannerSuite`
      
      Author: Tejas Patil <tejasp@fb.com>
      
      Closes #14841 from tejasapatil/SPARK-17271_sort_order_equals_bug.
      095862a3
  5. Aug 27, 2016
  6. Aug 26, 2016
    • Reynold Xin's avatar
      [SPARK-17270][SQL] Move object optimization rules into its own file · cc0caa69
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      As part of breaking Optimizer.scala apart, this patch moves various Dataset object optimization rules into a single file. I'm submitting separate pull requests so we can more easily merge this in branch-2.0 to simplify optimizer backports.
      
      ## How was this patch tested?
      This should be covered by existing tests.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #14839 from rxin/SPARK-17270.
      cc0caa69
    • Yin Huai's avatar
      [SPARK-17266][TEST] Add empty strings to the regressionTests of PrefixComparatorsSuite · a6bca3ad
      Yin Huai authored
      ## What changes were proposed in this pull request?
      This PR adds a regression test to PrefixComparatorsSuite's "String prefix comparator" because this test failed on jenkins once (https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.4/1620/testReport/junit/org.apache.spark.util.collection.unsafe.sort/PrefixComparatorsSuite/String_prefix_comparator/).
      
      I could not reproduce it locally, but let's add this test case to the regressionTests.
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #14837 from yhuai/SPARK-17266.
      a6bca3ad
    • Sameer Agarwal's avatar
      [SPARK-17244] Catalyst should not pushdown non-deterministic join conditions · 540e9128
      Sameer Agarwal authored
      ## What changes were proposed in this pull request?
      
      Given that non-deterministic expressions can be stateful, pushing them down the query plan during the optimization phase can cause incorrect behavior. This patch fixes that issue by explicitly disabling that.
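
      For illustration (the query is made up, not from the PR), a non-deterministic join condition that must not be pushed down:

      ```scala
      import org.apache.spark.sql.functions.rand

      // rand() is non-deterministic and stateful: evaluated per-row of one input before the
      // join, it sees a different row stream than when evaluated per joined row, so pushing
      // it below the join could change the result.
      val joined = orders.join(customers,
        orders("customer_id") === customers("id") && rand() > 0.5)
      ```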
      
      ## How was this patch tested?
      
      A new test in `FilterPushdownSuite` that checks catalyst behavior for both deterministic and non-deterministic join conditions.
      
      Author: Sameer Agarwal <sameerag@cs.berkeley.edu>
      
      Closes #14815 from sameeragarwal/constraint-inputfile.
      540e9128
    • petermaxlee's avatar
      [SPARK-17235][SQL] Support purging of old logs in MetadataLog · f64a1ddd
      petermaxlee authored
      ## What changes were proposed in this pull request?
      This patch adds a purge interface to MetadataLog, and an implementation in HDFSMetadataLog. The purge function is currently unused, but I will use it to purge old execution and file source logs in follow-up patches. These changes are required in a production structured streaming job that runs for a long period of time.
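
      A rough sketch of the purge interface (the `purge` name comes from the PR; the exact signatures are assumptions):

      ```scala
      trait MetadataLog[T] {
        def add(batchId: Long, metadata: T): Boolean
        def get(batchId: Long): Option[T]
        def getLatest(): Option[(Long, T)]
        /** Remove all log entries whose batch id is earlier than `thresholdBatchId`. */
        def purge(thresholdBatchId: Long): Unit
      }
      ```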
      
      ## How was this patch tested?
      Added a unit test case in HDFSMetadataLogSuite.
      
      Author: petermaxlee <petermaxlee@gmail.com>
      
      Closes #14802 from petermaxlee/SPARK-17235.
      f64a1ddd
    • Herman van Hovell's avatar
      [SPARK-17246][SQL] Add BigDecimal literal · a11d10f1
      Herman van Hovell authored
      ## What changes were proposed in this pull request?
      This PR adds parser support for `BigDecimal` literals. If you append the suffix `BD` to a valid number then it will be interpreted as a `BigDecimal`; for example, `12.0E10BD` will be interpreted as a BigDecimal with scale -9 and precision 3. This is useful in situations where you need exact values.
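
      A small usage example, assuming the syntax described above (run in spark-shell, where `spark` is predefined):

      ```scala
      // 12.0E10BD parses to a BigDecimal with unscaled value 120 and scale -9,
      // i.e. exactly 120000000000.
      spark.sql("SELECT 12.0E10BD AS exact_big, 123.45BD AS exact_small").show()
      ```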
      
      ## How was this patch tested?
      Added tests to `ExpressionParserSuite`, `ExpressionSQLBuilderSuite` and `SQLQueryTestSuite`.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #14819 from hvanhovell/SPARK-17246.
      a11d10f1
    • Michael Gummelt's avatar
      [SPARK-16967] move mesos to module · 8e5475be
      Michael Gummelt authored
      ## What changes were proposed in this pull request?
      
      Move Mesos code into a mvn module
      
      ## How was this patch tested?
      
      unit tests
      manually submitting a client mode and cluster mode job
      spark/mesos integration test suite
      
      Author: Michael Gummelt <mgummelt@mesosphere.io>
      
      Closes #14637 from mgummelt/mesos-module.
      8e5475be
    • Peng, Meng's avatar
      [SPARK-17207][MLLIB] fix comparing Vector bug in TestingUtils · c0949dc9
      Peng, Meng authored
      ## What changes were proposed in this pull request?
      
      Fix a bug when comparing Vectors in TestingUtils.
      The same bug exists for Matrix comparison; how to check the size of a Matrix should be discussed first.
      
      ## How was this patch tested?
      
      Author: Peng, Meng <peng.meng@intel.com>
      
      Closes #14785 from mpjlu/testUtils.
      c0949dc9
    • petermaxlee's avatar
      [SPARK-17165][SQL] FileStreamSource should not track the list of seen files indefinitely · 9812f7d5
      petermaxlee authored
      ## What changes were proposed in this pull request?
      Before this change, FileStreamSource uses an in-memory hash set to track the list of files processed by the engine. The list can grow indefinitely, leading to OOM or overflow of the hash set.
      
      This patch introduces a new user-defined option called "maxFileAge", defaulting to 24 hours. If a file is older than this age, FileStreamSource will purge it from the in-memory map that was used to track the list of files that have been processed.
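
      A hypothetical usage of the new option (source format, path, and the duration string are examples; the accepted value format is assumed):

      ```scala
      // Files older than maxFileAge are dropped from the in-memory "seen files" map,
      // so the map stays bounded even for streams that run for months.
      val lines = spark.readStream
        .format("text")
        .option("maxFileAge", "24h")
        .load("/data/incoming")
      ```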
      
      ## How was this patch tested?
      Added unit tests for the underlying utility, and also added an end-to-end test to validate the purge in FileStreamSourceSuite. Also verified the new test cases would fail when the timeout was set to a very large number.
      
      Author: petermaxlee <petermaxlee@gmail.com>
      
      Closes #14728 from petermaxlee/SPARK-17165.
      9812f7d5