Skip to content
Snippets Groups Projects
  1. Apr 08, 2016
    • Josh Rosen's avatar
      [SPARK-11416][BUILD] Update to Chill 0.8.0 & Kryo 3.0.3 · 906eef4c
      Josh Rosen authored
      This patch upgrades Chill to 0.8.0 and Kryo to 3.0.3. While we'll likely need to bump these dependencies again before Spark 2.0 (due to SPARK-14221 / https://github.com/twitter/chill/issues/252), I wanted to get the bulk of the Kryo 2 -> Kryo 3 migration done now in order to figure out whether there are any unexpected surprises.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #12076 from JoshRosen/kryo3.
      906eef4c
    • Josh Rosen's avatar
      [SPARK-14435][BUILD] Shade Kryo in our custom Hive 1.2.1 fork · 464a3c1e
      Josh Rosen authored
      This patch updates our custom Hive 1.2.1 fork in order to shade Kryo in Hive. This is a blocker for upgrading Spark to use Kryo 3 (see #12076).
      
      The source for this new fork of Hive can be found at https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2
      
      Here's the complete diff from the official Hive 1.2.1 release: https://github.com/apache/hive/compare/release-1.2.1...JoshRosen:release-1.2.1-spark2
      
      Here's the diff from the sources that pwendell used to publish the current `1.2.1.spark` release of Hive: https://github.com/pwendell/hive/compare/release-1.2.1-spark...JoshRosen:release-1.2.1-spark2. This diff looks large because his branch used a shell script to rewrite the groupId, whereas I had to commit the groupId changes in order to prevent the find-and-replace from affecting the package names in our relocated Kryo classes: https://github.com/pwendell/hive/compare/release-1.2.1-spark...JoshRosen:release-1.2.1-spark2#diff-6ada9aaec70e069df8f2c34c5519dd1e
      
      Using these changes, I was able to publish a local version of Hive and verify that this change fixes the test failures which are blocking #12076. Note that this PR will not compile until we complete the review of the Hive POM changes and stage and publish a release.
      
      /cc vanzin, steveloughran, and pwendell for review.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #12215 from JoshRosen/shade-kryo-in-hive.
      464a3c1e
    • Sameer Agarwal's avatar
      [SPARK-14394][SQL] Generate AggregateHashMap class for LongTypes during TungstenAggregate codegen · f8c9beca
      Sameer Agarwal authored
      ## What changes were proposed in this pull request?
      
      This PR adds support for generating the `AggregateHashMap` class in `TungstenAggregate` if the aggregate group by keys/value are of `LongType`. Note that currently this generate aggregate is not actually used.
      
      NB: This currently only supports `LongType` keys/values (please see `isAggregateHashMapSupported` in `TungstenAggregate`) and will be generalized to other data types in a subsequent PR.
      
      ## How was this patch tested?
      
      Manually inspected the generated code. This is what the generated map looks like for 2 keys:
      
      ```java
      /* 068 */   public class agg_GeneratedAggregateHashMap {
      /* 069 */     private org.apache.spark.sql.execution.vectorized.ColumnarBatch batch;
      /* 070 */     private int[] buckets;
      /* 071 */     private int numBuckets;
      /* 072 */     private int maxSteps;
      /* 073 */     private int numRows = 0;
      /* 074 */     private org.apache.spark.sql.types.StructType schema =
      /* 075 */     new org.apache.spark.sql.types.StructType()
      /* 076 */     .add("k1", org.apache.spark.sql.types.DataTypes.LongType)
      /* 077 */     .add("k2", org.apache.spark.sql.types.DataTypes.LongType)
      /* 078 */     .add("sum", org.apache.spark.sql.types.DataTypes.LongType);
      /* 079 */
      /* 080 */     public agg_GeneratedAggregateHashMap(int capacity, double loadFactor, int maxSteps) {
      /* 081 */       assert (capacity > 0 && ((capacity & (capacity - 1)) == 0));
      /* 082 */       this.maxSteps = maxSteps;
      /* 083 */       numBuckets = (int) (capacity / loadFactor);
      /* 084 */       batch = org.apache.spark.sql.execution.vectorized.ColumnarBatch.allocate(schema,
      /* 085 */         org.apache.spark.memory.MemoryMode.ON_HEAP, capacity);
      /* 086 */       buckets = new int[numBuckets];
      /* 087 */       java.util.Arrays.fill(buckets, -1);
      /* 088 */     }
      /* 089 */
      /* 090 */     public agg_GeneratedAggregateHashMap() {
      /* 091 */       new agg_GeneratedAggregateHashMap(1 << 16, 0.25, 5);
      /* 092 */     }
      /* 093 */
      /* 094 */     public org.apache.spark.sql.execution.vectorized.ColumnarBatch.Row findOrInsert(long agg_key, long agg_key1) {
      /* 095 */       long h = hash(agg_key, agg_key1);
      /* 096 */       int step = 0;
      /* 097 */       int idx = (int) h & (numBuckets - 1);
      /* 098 */       while (step < maxSteps) {
      /* 099 */         // Return bucket index if it's either an empty slot or already contains the key
      /* 100 */         if (buckets[idx] == -1) {
      /* 101 */           batch.column(0).putLong(numRows, agg_key);
      /* 102 */           batch.column(1).putLong(numRows, agg_key1);
      /* 103 */           batch.column(2).putLong(numRows, 0);
      /* 104 */           buckets[idx] = numRows++;
      /* 105 */           return batch.getRow(buckets[idx]);
      /* 106 */         } else if (equals(idx, agg_key, agg_key1)) {
      /* 107 */           return batch.getRow(buckets[idx]);
      /* 108 */         }
      /* 109 */         idx = (idx + 1) & (numBuckets - 1);
      /* 110 */         step++;
      /* 111 */       }
      /* 112 */       // Didn't find it
      /* 113 */       return null;
      /* 114 */     }
      /* 115 */
      /* 116 */     private boolean equals(int idx, long agg_key, long agg_key1) {
      /* 117 */       return batch.column(0).getLong(buckets[idx]) == agg_key && batch.column(1).getLong(buckets[idx]) == agg_key1;
      /* 118 */     }
      /* 119 */
      /* 120 */     // TODO: Improve this Hash Function
      /* 121 */     private long hash(long agg_key, long agg_key1) {
      /* 122 */       return agg_key ^ agg_key1;
      /* 123 */     }
      /* 124 */
      /* 125 */   }
      ```
      
      Author: Sameer Agarwal <sameer@databricks.com>
      
      Closes #12161 from sameeragarwal/tungsten-aggregate.
      f8c9beca
    • tedyu's avatar
      [SPARK-14448] Improvements to ColumnVector · 02757535
      tedyu authored
      ## What changes were proposed in this pull request?
      
      In this PR, two changes are proposed for ColumnVector :
      1. ColumnVector should be declared as implementing AutoCloseable - it already has close() method
      2. In OnHeapColumnVector#reserveInternal(), we only need to allocate new array when existing array is null or the length of existing array is shorter than the newCapacity.
      
      ## How was this patch tested?
      
      Existing unit tests.
      
      Author: tedyu <yuzhihong@gmail.com>
      
      Closes #12225 from tedyu/master.
      02757535
    • Yanbo Liang's avatar
      [SPARK-14298][ML][MLLIB] LDA should support disable checkpoint · 56af8e85
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      In the doc of [```checkpointInterval```](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala#L241), we told users that they can disable checkpoint by setting ```checkpointInterval = -1```. But we did not handle this situation for LDA actually, we should fix this bug.
      ## How was this patch tested?
      Existing tests.
      
      cc jkbradley
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #12089 from yanboliang/spark-14298.
      56af8e85
    • Josh Rosen's avatar
      [BUILD][HOTFIX] Download Maven from regular mirror network rather than archive.apache.org · 94ac58b2
      Josh Rosen authored
      [archive.apache.org](https://archive.apache.org/) is undergoing maintenance, breaking our `build/mvn` script:
      
      > We are in the process of relocating this service. To save on the immense bandwidth that this service outputs, we have put it in maintenance mode, disabling all downloads for the next few days. We expect the maintenance to be complete no later than the morning of Monday the 11th of April, 2016.
      
      This patch fixes this issue by updating the script to use the regular mirror network to download Maven.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #12262 from JoshRosen/fix-mvn-download.
      94ac58b2
    • wm624@hotmail.com's avatar
      [SPARK-12569][PYSPARK][ML] DecisionTreeRegressor: provide variance of prediction: Python API · e0ad75f2
      wm624@hotmail.com authored
      ## What changes were proposed in this pull request?
      
      A new column VarianceCol has been added to DecisionTreeRegressor in ML scala code.
      
      This patch adds the corresponding Python API, HasVarianceCol, to class DecisionTreeRegressor.
      
      ## How was this patch tested?
      ./dev/lint-python
      PEP8 checks passed.
      rm -rf _build/*
      pydoc checks passed.
      
      ./python/run-tests --python-executables=python2.7 --modules=pyspark-ml
      Running PySpark tests. Output is in /Users/mwang/spark_ws_0904/python/unit-tests.log
      Will test against the following Python executables: ['python2.7']
      Will test the following Python modules: ['pyspark-ml']
      Finished test(python2.7): pyspark.ml.evaluation (12s)
      Finished test(python2.7): pyspark.ml.clustering (18s)
      Finished test(python2.7): pyspark.ml.classification (30s)
      Finished test(python2.7): pyspark.ml.recommendation (28s)
      Finished test(python2.7): pyspark.ml.feature (43s)
      Finished test(python2.7): pyspark.ml.regression (31s)
      Finished test(python2.7): pyspark.ml.tuning (19s)
      Finished test(python2.7): pyspark.ml.tests (34s)
      
      (If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #12116 from wangmiao1981/fix_api.
      e0ad75f2
    • Kai Jiang's avatar
      [SPARK-14373][PYSPARK] PySpark RandomForestClassifier, Regressor support export/import · e5d8d6e0
      Kai Jiang authored
      ## What changes were proposed in this pull request?
      supporting `RandomForest{Classifier, Regressor}` save/load for Python API.
      [JIRA](https://issues.apache.org/jira/browse/SPARK-14373)
      ## How was this patch tested?
      doctest
      
      Author: Kai Jiang <jiangkai@gmail.com>
      
      Closes #12238 from vectorijk/spark-14373.
      e5d8d6e0
    • Mark Grover's avatar
      [SPARK-14477][BUILD] Allow custom mirrors for downloading artifacts in build/mvn · a9b630f4
      Mark Grover authored
      ## What changes were proposed in this pull request?
      
      Allows to override locations for downloading Apache and Typesafe artifacts in build/mvn script.
      
      ## How was this patch tested?
      By running script like
      ````
      # Remove all previously downloaded artifacts
      rm -rf build/apache-maven*
      rm -rf build/zinc-*
      rm -rf build/scala-*
      
      # Make sure path is clean and doesn't contain mvn, for example.
      ...
      
      # Run a command without setting anything and make sure it succeeds
      build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 package
      
      # Run a command setting the default location as mirror and make sure it succeeds
      APACHE_MIRROR=http://mirror.infra.cloudera.com/apache/ build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 package
      
      # Do the same without the trailing slash this time and make sure it succeeds
      APACHE_MIRROR=http://mirror.infra.cloudera.com/apache build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 package
      
      # Do it with a bad URL and make sure it fails
      APACHE_MIRROR=xyz build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 package
      ````
      
      Author: Mark Grover <mark@apache.org>
      
      Closes #12250 from markgrover/spark-14477.
      a9b630f4
    • Aaron Tokhy's avatar
      [SPARK-14470] Allow for overriding both httpclient and httpcore versions · 583b5e05
      Aaron Tokhy authored
      ## What changes were proposed in this pull request?
      
      This splits commons.httpclient.version from commons.httpcore.version, since these two versions do not necessarily have to be the same.  This change may follow up with an up-to-date version of the httpclient/httpcore libraries.
      
      The latest 4.3.x httpclient version as of writing is 4.3.6 and the latest 4.3.x httpcore version as of writing is 4.3.3.  This change would be a prerequisite for potentially moving to this new bugfix version.
      
      ## How was this patch tested?
      no version change was made for httpclient/httpcore versions
      mvn package
      
      Author: Aaron Tokhy <tokaaron@amazon.com>
      
      Closes #12245 from atokhy/pull-request.
      583b5e05
    • Jacek Laskowski's avatar
      [SPARK-14402][HOTFIX] Fix ExpressionDescription annotation · 64470980
      Jacek Laskowski authored
      ## What changes were proposed in this pull request?
      
      Fix for the error introduced in https://github.com/apache/spark/commit/c59abad052b7beec4ef550049413e95578e545be:
      
      ```
      /Users/jacek/dev/oss/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:626: error: annotation argument needs to be a constant; found: "_FUNC_(str) - ".+("Returns str, with the first letter of each word in uppercase, all other letters in ").+("lowercase. Words are delimited by white space.")
          "Returns str, with the first letter of each word in uppercase, all other letters in " +
                                                                                                ^
      ```
      
      ## How was this patch tested?
      
      Local build
      
      Author: Jacek Laskowski <jacek@japila.pl>
      
      Closes #12192 from jaceklaskowski/SPARK-14402-HOTFIX.
      64470980
    • hyukjinkwon's avatar
      [SPARK-14189][SQL] JSON data sources find compatible types even if inferred... · 73b56a3c
      hyukjinkwon authored
      [SPARK-14189][SQL] JSON data sources find compatible types even if inferred decimal type is not capable of the others
      
      ## What changes were proposed in this pull request?
      
      https://issues.apache.org/jira/browse/SPARK-14189
      
      When inferred types in the same field during finding compatible `DataType`, are `IntegralType` and `DecimalType` but `DecimalType` is not capable of the given `IntegralType`, JSON data source simply fails to find a compatible type resulting in `StringType`.
      
      This can be observed when `prefersDecimal` is enabled.
      
      ```scala
      def mixedIntegerAndDoubleRecords: RDD[String] =
        sqlContext.sparkContext.parallelize(
          """{"a": 3, "b": 1.1}""" ::
          """{"a": 3.1, "b": 1}""" :: Nil)
      
      val jsonDF = sqlContext.read
        .option("prefersDecimal", "true")
        .json(mixedIntegerAndDoubleRecords)
        .printSchema()
      ```
      
      - **Before**
      
      ```
      root
       |-- a: string (nullable = true)
       |-- b: string (nullable = true)
      ```
      
      - **After**
      
      ```
      root
       |-- a: decimal(21, 1) (nullable = true)
       |-- b: decimal(21, 1) (nullable = true)
      ```
      (Note that integer is inferred as `LongType` which becomes `DecimalType(20, 0)`)
      
      ## How was this patch tested?
      
      unit tests were used and style tests by `dev/run_tests`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #11993 from HyukjinKwon/SPARK-14189.
      73b56a3c
    • hyukjinkwon's avatar
      [SPARK-14103][SQL] Parse unescaped quotes in CSV data source. · 725b860e
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR resolves the problem during parsing unescaped quotes in input data. For example, currently the data below:
      
      ```
      "a"b,ccc,ddd
      e,f,g
      ```
      
      produces a data below:
      
      - **Before**
      
      ```bash
      ["a"b,ccc,ddd[\n]e,f,g]  <- as a value.
      ```
      
      - **After**
      
      ```bash
      ["a"b], [ccc], [ddd]
      [e], [f], [g]
      ```
      
      This PR bumps up the Univocity parser's version. This was fixed in `2.0.2`, https://github.com/uniVocity/univocity-parsers/issues/60.
      
      ## How was this patch tested?
      
      Unit tests in `CSVSuite` and `sbt/sbt scalastyle`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #12226 from HyukjinKwon/SPARK-14103-quote.
      725b860e
  2. Apr 07, 2016
    • Reynold Xin's avatar
    • Joseph K. Bradley's avatar
      [SPARK-13048][ML][MLLIB] keepLastCheckpoint option for LDA EM optimizer · 953ff897
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      The EMLDAOptimizer should generally not delete its last checkpoint since that can cause failures when DistributedLDAModel methods are called (if any partitions need to be recovered from the checkpoint).
      
      This PR adds a "deleteLastCheckpoint" option which defaults to false.  This is a change in behavior from Spark 1.6, in that the last checkpoint will not be removed by default.
      
      This involves adding the deleteLastCheckpoint option to both spark.ml and spark.mllib, and modifying PeriodicCheckpointer to support the option.
      
      This also:
      * Makes MLlibTestSparkContext extend TempDirectory and set the checkpointDir to tempDir
      * Updates LibSVMRelationSuite because of a name conflict with "tempDir" (and fixes a bug where it failed to delete a temp directory)
      * Adds a MIMA exclude for DistributedLDAModel constructor, which is already ```private[clustering]```
      
      ## How was this patch tested?
      
      Added 2 new unit tests to spark.ml LDASuite, which calls into spark.mllib.
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #12166 from jkbradley/emlda-save-checkpoint.
      953ff897
    • Michael Armbrust's avatar
      [SPARK-14449][SQL] SparkContext should use SparkListenerInterface · 692c7484
      Michael Armbrust authored
      Currently all `SparkFirehoseListener` implementations are broken since we expect listeners to extend `SparkListener`, while the fire hose only extends `SparkListenerInterface`.  This changes the addListener function and the config based injection to use the interface instead.
      
      The existing tests in SparkListenerSuite are improved such that they would have caught this.
      
      Follow-up to #12142
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #12227 from marmbrus/fixListener.
      692c7484
    • Andrew Or's avatar
      [SPARK-14468] Always enable OutputCommitCoordinator · 3e29e372
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      `OutputCommitCoordinator` was introduced to deal with concurrent task attempts racing to write output, leading to data loss or corruption. For more detail, read the [JIRA description](https://issues.apache.org/jira/browse/SPARK-14468).
      
      Before: `OutputCommitCoordinator` is enabled only if speculation is enabled.
      After: `OutputCommitCoordinator` is always enabled.
      
      Users may still disable this through `spark.hadoop.outputCommitCoordination.enabled`, but they really shouldn't...
      
      ## How was this patch tested?
      
      `OutputCommitCoordinator*Suite`
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #12244 from andrewor14/always-occ.
      3e29e372
    • Michael Gummelt's avatar
      [DOCS][MINOR] Remove sentence about Mesos not supporting cluster mode. · 30e980ad
      Michael Gummelt authored
      Docs change to remove the sentence about Mesos not supporting cluster mode.
      
      It was not.
      
      Author: Michael Gummelt <mgummelt@mesosphere.io>
      
      Closes #12249 from mgummelt/fix-mesos-cluster-docs.
      30e980ad
    • Wenchen Fan's avatar
      [SPARK-14270][SQL] whole stage codegen support for typed filter · 49fb2370
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      We implement typed filter by `MapPartitions`, which doesn't work well with whole stage codegen. This PR use `Filter` to implement typed filter and we can get the whole stage codegen support for free.
      
      This PR also introduced `DeserializeToObject` and `SerializeFromObject`, to seperate serialization logic from object operator, so that it's eaiser to write optimization rules for adjacent object operators.
      
      ## How was this patch tested?
      
      existing tests.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #12061 from cloud-fan/whole-stage-codegen.
      49fb2370
    • Andrew Or's avatar
      [SPARK-14410][SQL] Push functions existence check into catalog · ae1db91d
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      This is a followup to #12117 and addresses some of the TODOs introduced there. In particular, the resolution of database is now pushed into session catalog, which knows about the current database. Further, the logic for checking whether a function exists is pushed into the external catalog.
      
      No change in functionality is expected.
      
      ## How was this patch tested?
      
      `SessionCatalogSuite`, `DDLSuite`
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #12198 from andrewor14/function-exists.
      ae1db91d
    • Davies Liu's avatar
      [SPARK-12740] [SPARK-13932] support grouping()/grouping_id() in having/order clause · aa852215
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      This PR brings the support of using grouping()/grouping_id() in HAVING/ORDER BY clause.
      
      The resolved grouping()/grouping_id() will be replaced by unresolved "spark_gropuing_id" virtual attribute, then resolved by ResolveMissingAttribute.
      
      This PR also fix the HAVING clause that access a grouping column that is not presented in SELECT clause, for example:
      ```sql
      select count(1) from (select 1 as a) t group by a having a > 0
      ```
      ## How was this patch tested?
      
      Add new tests.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #12235 from davies/grouping_having.
      aa852215
    • Kousuke Saruta's avatar
      [SPARK-14456][SQL][MINOR] Remove unused variables and logics in DataSource · 8dcb0c7c
      Kousuke Saruta authored
      ## What changes were proposed in this pull request?
      
      In DataSource#write method, the variables `dataSchema` and `equality`, and related logics are no longer used. Let's remove them.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #12237 from sarutak/SPARK-14456.
      8dcb0c7c
    • Tathagata Das's avatar
      [SQL][TESTS] Fix for flaky test in ContinuousQueryManagerSuite · 3aa7d763
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      
      The timeouts were lower the other timeouts in the test. Other tests were stable over the last month.
      
      ## How was this patch tested?
      
      Jenkins tests.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #12219 from tdas/flaky-test-fix.
      3aa7d763
    • Dhruve Ashar's avatar
      [SPARK-12384] Enables spark-clients to set the min(-Xms) and max(*.memory config) j… · 033d8081
      Dhruve Ashar authored
      ## What changes were proposed in this pull request?
      
      Currently Spark clients are started with the same memory setting for Xms and Xms leading to reserving unnecessary higher amounts of memory.
      This behavior is changed and the clients can now specify an initial heap size using the extraJavaOptions in the config for driver,executor and am individually.
       Note, that only -Xms can be provided through this config option, if the client wants to set the max size(-Xmx), this has to be done via the *.memory configuration knobs which are currently supported.
      
      ## How was this patch tested?
      
      Monitored executor and yarn logs in debug mode to verify the commands through which they are being launched in client and cluster mode. The driver memory was verified locally using jps -v. Setting up -Xmx parameter in the javaExtraOptions raises exception with the info provided.
      
      Author: Dhruve Ashar <dhruveashar@gmail.com>
      
      Closes #12115 from dhruve/impr/SPARK-12384.
      033d8081
    • Alex Bozarth's avatar
      [SPARK-14245][WEB UI] Display the user in the application view · 35e0db2d
      Alex Bozarth authored
      ## What changes were proposed in this pull request?
      
      The Spark UI (both active and history) should show the user who ran the application somewhere when you are in the application view. This was added under the Jobs view by total uptime and scheduler mode.
      
      ## How was this patch tested?
      
      Manual testing
      
      <img width="191" alt="username" src="https://cloud.githubusercontent.com/assets/13952758/14222830/6d1fe542-f82a-11e5-885f-c05ee2cdf857.png">
      
      Author: Alex Bozarth <ajbozart@us.ibm.com>
      
      Closes #12123 from ajbozarth/spark14245.
      35e0db2d
    • Malte's avatar
      Better host description for multi-master mesos · db75ccb5
      Malte authored
      ## What changes were proposed in this pull request?
      
      Since not having the correct zk url causes job failure, the documentation should include all parameters
      
      ## How was this patch tested?
      
      no tests necessary
      
      Author: Malte <elmalto@users.noreply.github.com>
      
      Closes #12218 from elmalto/patch-1.
      db75ccb5
    • Reynold Xin's avatar
      [SPARK-10063][SQL] Remove DirectParquetOutputCommitter · 9ca0760d
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch removes DirectParquetOutputCommitter. This was initially created by Databricks as a faster way to write Parquet data to S3. However, given how the underlying S3 Hadoop implementation works, this committer only works when there are no failures. If there are multiple attempts of the same task (e.g. speculation or task failures or node failures), the output data can be corrupted. I don't think this performance optimization outweighs the correctness issue.
      
      ## How was this patch tested?
      Removed the related tests also.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #12229 from rxin/SPARK-10063.
      9ca0760d
    • Reynold Xin's avatar
      [SPARK-14452][SQL] Explicit APIs in Scala for specifying encoders · e11aa9ec
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      The Scala Dataset public API currently only allows users to specify encoders through SQLContext.implicits. This is OK but sometimes people want to explicitly get encoders without a SQLContext (e.g. Aggregator implementations). This patch adds public APIs to Encoders class for getting Scala encoders.
      
      ## How was this patch tested?
      None - I will update test cases once https://github.com/apache/spark/pull/12231 is merged.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #12232 from rxin/SPARK-14452.
      e11aa9ec
  3. Apr 06, 2016
    • Marcelo Vanzin's avatar
      [SPARK-14134][CORE] Change the package name used for shading classes. · 21d5ca12
      Marcelo Vanzin authored
      The current package name uses a dash, which is a little weird but seemed
      to work. That is, until a new test tried to mock a class that references
      one of those shaded types, and then things started failing.
      
      Most changes are just noise to fix the logging configs.
      
      For reference, SPARK-8815 also raised this issue, although at the time it
      did not cause any issues in Spark, so it was not addressed.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #11941 from vanzin/SPARK-14134.
      21d5ca12
    • Herman van Hovell's avatar
      [SPARK-12610][SQL] Left Anti Join · d7659227
      Herman van Hovell authored
      ### What changes were proposed in this pull request?
      
      This PR adds support for `LEFT ANTI JOIN` to Spark SQL. A `LEFT ANTI JOIN` is the exact opposite of a `LEFT SEMI JOIN` and can be used to identify rows in one dataset that are not in another dataset. Note that `nulls` on the left side of the join cannot match a row on the right hand side of the join; the result is that left anti join will always select a row with a `null` in one or more of its keys.
      
      We currently add support for the following SQL join syntax:
      
          SELECT   *
          FROM      tbl1 A
                    LEFT ANTI JOIN tbl2 B
                     ON A.Id = B.Id
      
      Or using a dataframe:
      
          tbl1.as("a").join(tbl2.as("b"), $"a.id" === $"b.id", "left_anti)
      
      This PR provides serves as the basis for implementing `NOT EXISTS` and `NOT IN (...)` correlated sub-queries. It would also serve as good basis for implementing an more efficient `EXCEPT` operator.
      
      The PR has been (losely) based on PR's by both davies (https://github.com/apache/spark/pull/10706) and chenghao-intel (https://github.com/apache/spark/pull/10563); credit should be given where credit is due.
      
      This PR adds supports for `LEFT ANTI JOIN` to `BroadcastHashJoin` (including codegeneration), `ShuffledHashJoin` and `BroadcastNestedLoopJoin`.
      
      ### How was this patch tested?
      
      Added tests to `JoinSuite` and ported `ExistenceJoinSuite` from https://github.com/apache/spark/pull/10563.
      
      cc davies chenghao-intel rxin
      
      Author: Herman van Hovell <hvanhovell@questtec.nl>
      
      Closes #12214 from hvanhovell/SPARK-12610.
      d7659227
    • Marcelo Vanzin's avatar
      [SPARK-14446][TESTS] Fix ReplSuite for Scala 2.10. · 4901086f
      Marcelo Vanzin authored
      Just use the same test code as the 2.11 version, which seems to pass.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #12223 from vanzin/SPARK-14446.
      4901086f
    • Luciano Resende's avatar
      [SPARK-12555][SQL] Result should not be corrupted after input columns are reordered · 611dbce4
      Luciano Resende authored
      This PR add test case described in SPARK-12555 to validate that correct data is returned when input data is reordered and to avoid future regressions.
      
      Author: Luciano Resende <lresende@apache.org>
      
      Closes #11623 from lresende/SPARK-12555.
      611dbce4
    • sethah's avatar
      [SPARK-12382][ML] Remove mllib GBT implementation and wrap ml · bb873754
      sethah authored
      ## What changes were proposed in this pull request?
      
      This patch removes the implementation of gradient boosted trees in mllib/tree/GradientBoostedTrees.scala and changes mllib GBTs to call the implementation in spark.ML.
      
      Primary changes:
      * Removed `boost` method in mllib GradientBoostedTrees.scala
      * Created new test suite GradientBoostedTreesSuite in ML, which contains unit tests that were specific to GBT internals from mllib
      
      Other changes:
      * Added an `updatePrediction` method in GradientBoostedTrees package. This method is added to provide consistency for methods that build predictions from boosted models. There are several methods that hard code the method of predicting as: sum_{i=1}^{numTrees} (treePrediction*treeWeight). Calling this function ensures that test methods that check accuracy use the same prediction method that the algorithm uses during training
      * Added methods that were previously only used in testing, but were public methods, to GradientBoostedTrees. This includes `computeError` (previously part  of `Loss` trait) and `evaluateEachIteration`. These are used in the new spark.ML unit tests. They are left in mllib as well so as to not break the API.
      
      ## How was this patch tested?
      
      Existing unit tests which compare ML and MLlib ensure that mllib GBTs have not changed. Only a single unit test was moved to ML, which verifies that `runWithValidation` performs as expected.
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #12050 from sethah/SPARK-12382.
      bb873754
    • Marcelo Vanzin's avatar
      [SPARK-14436][SQL] Make JavaDatasetAggregatorSuiteBase public. · 864d1b4d
      Marcelo Vanzin authored
      Without this, unit tests that extend that class fail for me locally
      on maven, because JUnit tries to run methods in that class and gets
      an IllegalAccessError.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #12212 from vanzin/SPARK-14436.
      864d1b4d
    • Shixiong Zhu's avatar
      [SPARK-13112][CORE] Make sure RegisterExecutorResponse arrive before LaunchTask · f1def573
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Send `RegisterExecutorResponse` using `executorRef` in order to make sure RegisterExecutorResponse and LaunchTask are both sent using the same channel. Then RegisterExecutorResponse will always arrive before LaunchTask
      
      ## How was this patch tested?
      
      Existing unit tests
      
      Closes #12078
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #12211 from zsxwing/SPARK-13112.
      f1def573
    • Zhang, Liye's avatar
      [SPARK-14290][CORE][NETWORK] avoid significant memory copy in netty's transferTo · c4bb02ab
      Zhang, Liye authored
      ## What changes were proposed in this pull request?
      When netty transfer data that is not `FileRegion`, data will be in format of `ByteBuf`, If the data is large, there will occur significant performance issue because there is memory copy underlying in `sun.nio.ch.IOUtil.write`, the CPU is 100% used, and network is very low.
      
      In this PR, if data size is large, we will split it into small chunks to call `WritableByteChannel.write()`, so that avoid wasting of memory copy. Because the data can't be written within a single write, and it will call `transferTo` multiple times.
      
      ## How was this patch tested?
      Spark unit test and manual test.
      Manual test:
      `sc.parallelize(Array(1,2,3),3).mapPartitions(a=>Array(new Array[Double](1024 * 1024 * 50)).iterator).reduce((a,b)=> a).length`
      
      For more details, please refer to [SPARK-14290](https://issues.apache.org/jira/browse/SPARK-14290)
      
      Author: Zhang, Liye <liye.zhang@intel.com>
      
      Closes #12083 from liyezhang556520/spark-14290.
      c4bb02ab
    • Dongjoon Hyun's avatar
      [SPARK-14444][BUILD] Add a new scalastyle `NoScalaDoc` to prevent ScalaDoc-style multiline comments · d717ae1f
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      According to the [Spark Code Style Guide](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide#SparkCodeStyleGuide-Indentation), this PR adds a new scalastyle rule to prevent the followings.
      ```
      /** In Spark, we don't use the ScalaDoc style so this
        * is not correct.
        */
      ```
      
      ## How was this patch tested?
      
      Pass the Jenkins tests (including `lint-scala`).
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #12221 from dongjoon-hyun/SPARK-14444.
      d717ae1f
    • Holden Karau's avatar
      [SPARK-14424][BUILD][DOCS] Update the build docs to switch from assembly to package and add a no… · 457e58be
      Holden Karau authored
      ## What changes were proposed in this pull request?
      
      Change our build docs & shell scripts to that developers are aware of the change from "assembly" to "package"
      
      ## How was this patch tested?
      
      Manually ran ./bin/spark-shell after ./build/sbt assembly and verified error message printed, ran new suggested build target and verified ./bin/spark-shell runs after this.
      
      Author: Holden Karau <holden@pigscanfly.ca>
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #12197 from holdenk/SPARK-1424-spark-class-broken-fix-build-docs.
      457e58be
    • Tathagata Das's avatar
      [SPARK-12133][STREAMING] Streaming dynamic allocation · 9af5423e
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      
      Added a new Executor Allocation Manager for the Streaming scheduler for doing Streaming Dynamic Allocation.
      
      ## How was this patch tested
      Unit tests, and cluster tests.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #12154 from tdas/streaming-dynamic-allocation.
      9af5423e
    • Marcelo Vanzin's avatar
      [SPARK-14391][LAUNCHER] Increase test timeouts. · de479260
      Marcelo Vanzin authored
      Most of the time tests should still pass really quickly; it's just
      when machines are overloaded that the tests may take a little time,
      but that's still preferable over just failing the test.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #12210 from vanzin/SPARK-14391.
      de479260
Loading