  1. Apr 09, 2016
    • DB Tsai's avatar
      [SPARK-14462][ML][MLLIB] add the mllib-local build to maven pom · 1598d11b
      DB Tsai authored
      ## What changes were proposed in this pull request?
      
      In order to separate the linear algebra and vector/matrix classes into a standalone jar, we need to set up the build first. This PR creates a new jar called mllib-local with minimal dependencies. The test scope will still depend on spark-core and spark-core-test in order to use the common utilities, but the runtime will avoid any platform dependency. A couple of platform-independent classes will be moved to this package to demonstrate how this works.
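
      For downstream builds, the new module would be consumed like any other Spark artifact; a hypothetical sbt coordinate (the artifact name and version here are assumptions, not taken from this PR) might look like:

      ```scala
      // Hypothetical sbt dependency on the new module; "spark-mllib-local" and
      // the version string are illustrative assumptions only.
      libraryDependencies += "org.apache.spark" %% "spark-mllib-local" % "2.0.0-SNAPSHOT"
      ```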
      
      ## How was this patch tested?
      
      Unit tests
      
      Author: DB Tsai <dbt@netflix.com>
      
      Closes #12241 from dbtsai/dbtsai-mllib-local-build.
      1598d11b
    • bomeng's avatar
      [SPARK-14496][SQL] fix some javadoc typos · 10a95781
      bomeng authored
      ## What changes were proposed in this pull request?
      
      Minor issues. Found 2 typos while browsing the code.
      
      ## How was this patch tested?
      None.
      
      Author: bomeng <bmeng@us.ibm.com>
      
      Closes #12264 from bomeng/SPARK-14496.
      10a95781
    • wm624@hotmail.com's avatar
      [SPARK-14392][ML] CountVectorizer Estimator should include binary toggle Param · a9b8b655
      wm624@hotmail.com authored
      ## What changes were proposed in this pull request?
      
      CountVectorizerModel has a binary toggle param. This PR adds the binary toggle param to the estimator CountVectorizer. As discussed in the JIRA, instead of adding a param into CountVectorizer, I moved the binary param to CountVectorizerParams. Therefore, the estimator inherits the binary param.
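
      A minimal sketch of how the estimator-level toggle could be used after this change (the column names are made up):

      ```scala
      import org.apache.spark.ml.feature.CountVectorizer

      // The binary toggle is now set on the estimator; the fitted model inherits it.
      val cv = new CountVectorizer()
        .setInputCol("words")      // hypothetical column names
        .setOutputCol("features")
        .setBinary(true)           // all non-zero counts become 1.0
      ```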
      
      ## How was this patch tested?
      
      Added a new test case, which fits the model with the binary flag set to true and then checks that all of the trained model's non-zero counts are set to 1.0.

      All tests in CountVectorizerSuite.scala pass.
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #12200 from wangmiao1981/binary_param.
      a9b8b655
    • Davies Liu's avatar
      [SPARK-14419] [SQL] Improve HashedRelation for key fit within Long · 90c0a045
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      Currently, we use a java HashMap for HashedRelation if the key can fit within a Long. The java HashMap and CompactBuffer are not memory efficient, and the memory they use is not accounted for accurately.

      This PR introduces a LongToUnsafeRowMap (similar to BytesToBytesMap) for better memory efficiency and performance.
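
      As a rough illustration of the idea only (not the actual LongToUnsafeRowMap), a long-keyed open-addressing map keeps keys and values in primitive arrays, avoiding per-entry objects and boxing:

      ```scala
      // Toy sketch only: open addressing keyed by Long; capacity must be a power
      // of two and the map is assumed to never fill up completely.
      class ToyLongMap(capacity: Int) {
        private val keys   = new Array[Long](capacity)
        private val values = new Array[Long](capacity)
        private val used   = new Array[Boolean](capacity)

        private def slot(key: Long): Int = (key ^ (key >>> 32)).toInt & (capacity - 1)

        def put(key: Long, value: Long): Unit = {
          var i = slot(key)
          while (used(i) && keys(i) != key) i = (i + 1) & (capacity - 1)
          keys(i) = key; values(i) = value; used(i) = true
        }

        def get(key: Long): Option[Long] = {
          var i = slot(key)
          while (used(i)) {
            if (keys(i) == key) return Some(values(i))
            i = (i + 1) & (capacity - 1)
          }
          None
        }
      }
      ```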
      
      ## How was this patch tested?
      
      Updated existing tests.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #12190 from davies/long_map2.
      90c0a045
    • Reynold Xin's avatar
      [SPARK-14451][SQL] Move encoder definition into Aggregator interface · 520dde48
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      When we first introduced Aggregators, we required the user of Aggregators to (implicitly) specify the encoders. It would actually make more sense to have the encoders specified by the implementation of Aggregators, since each implementation knows best how to encode its own data type.
      
      Note that this simplifies the Java API because Java users no longer need to explicitly specify encoders for aggregators.
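
      A minimal sketch of an Aggregator under the new interface (the aggregator itself is made up for illustration):

      ```scala
      import org.apache.spark.sql.{Encoder, Encoders}
      import org.apache.spark.sql.expressions.Aggregator

      // The implementation declares its own buffer/output encoders, so callers
      // (including Java users) no longer need to supply them.
      object SumLong extends Aggregator[Long, Long, Long] {
        def zero: Long = 0L
        def reduce(b: Long, a: Long): Long = b + a
        def merge(b1: Long, b2: Long): Long = b1 + b2
        def finish(b: Long): Long = b
        def bufferEncoder: Encoder[Long] = Encoders.scalaLong
        def outputEncoder: Encoder[Long] = Encoders.scalaLong
      }
      ```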
      
      ## How was this patch tested?
      Updated unit tests.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #12231 from rxin/SPARK-14451.
      520dde48
    • Reynold Xin's avatar
      [SPARK-14482][SQL] Change default Parquet codec from gzip to snappy · 2f0b882e
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      Based on our tests, gzip decompression is very slow (< 100MB/s), making queries decompression bound. Snappy can decompress at ~ 500MB/s on a single core.
      
      This patch changes the default compression codec for Parquet output from gzip to snappy, and also introduces a ParquetOptions class to be more consistent with other data sources (e.g. CSV, JSON).
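
      For users who still prefer gzip, the codec remains configurable; an illustrative spark-shell snippet (assumes the shell-provided `sqlContext` and some DataFrame `df`):

      ```scala
      // Illustrative only: override the new snappy default back to gzip.
      sqlContext.setConf("spark.sql.parquet.compression.codec", "gzip")
      df.write.parquet("/tmp/parquet-out")
      ```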
      
      ## How was this patch tested?
      Should be covered by existing unit tests.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #12256 from rxin/SPARK-14482.
      2f0b882e
  2. Apr 08, 2016
    • Joseph K. Bradley's avatar
      [SPARK-14498][ML][PYTHON][SQL] Many cleanups to ML and ML-related docs · d7af736b
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      Cleanups to documentation.  No changes to code.
      * GBT docs: Move Scala doc for private object GradientBoostedTrees to public docs for GBTClassifier,Regressor
      * GLM regParam: needs doc saying it is for L2 only
      * TrainValidationSplitModel: add .. versionadded:: 2.0.0
      * Rename “_transformer_params_from_java” to “_transfer_params_from_java”
      * LogReg Summary classes: “probability” col should not say “calibrated”
      * LR summaries: coefficientStandardErrors —> document that intercept stderr comes last.  Same for t,p-values
      * approxCountDistinct: Document meaning of “rsd" argument.
      * LDA: note which params are for online LDA only
      
      ## How was this patch tested?
      
      Doc build
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #12266 from jkbradley/ml-doc-cleanups.
      d7af736b
    • Sameer Agarwal's avatar
      [SPARK-14454] Better exception handling while marking tasks as failed · 813e96e6
      Sameer Agarwal authored
      ## What changes were proposed in this pull request?
      
      This patch adds support for better handling of exceptions inside catch blocks if the code within the block throws an exception. For instance here is the code in a catch block before this change in `WriterContainer.scala`:
      
      ```scala
      logError("Aborting task.", cause)
      // call failure callbacks first, so we could have a chance to cleanup the writer.
      TaskContext.get().asInstanceOf[TaskContextImpl].markTaskFailed(cause)
      if (currentWriter != null) {
        currentWriter.close()
      }
      abortTask()
      throw new SparkException("Task failed while writing rows.", cause)
      ```
      
      If `markTaskFailed` or `currentWriter.close` throws an exception, we currently lose the original cause. This PR fixes the problem by implementing a utility function `Utils.tryWithSafeCatch` that suppresses (via `Throwable.addSuppressed`) any exceptions thrown within the catch block and rethrows the original exception.
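
      A minimal sketch of such a utility based on the description above; the actual signature in `Utils` may differ:

      ```scala
      // Run `block`; if it fails, run `catchBlock` but suppress anything it throws
      // (via addSuppressed) so the original cause is the exception that propagates.
      def tryWithSafeCatch[T](block: => T)(catchBlock: => Unit): T = {
        try {
          block
        } catch {
          case cause: Throwable =>
            try {
              catchBlock
            } catch {
              case suppressed: Throwable => cause.addSuppressed(suppressed)
            }
            throw cause
        }
      }
      ```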
      
      ## How was this patch tested?
      
      No new functionality added
      
      Author: Sameer Agarwal <sameer@databricks.com>
      
      Closes #12234 from sameeragarwal/fix-exception.
      813e96e6
    • Shixiong Zhu's avatar
      [SPARK-14437][CORE] Use the address that NettyBlockTransferService listens to create BlockManagerId · 4d7c3592
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Here is why SPARK-14437 happens:
      BlockManagerId is created using NettyBlockTransferService.hostName, which comes from `customHostname`, and `Executor` will set `customHostname` to the hostname detected by the driver. However, the driver may not be able to detect the correct address in some complicated networks (Netty's Channel.remoteAddress doesn't always return a connectable address). In such a case, `BlockManagerId` will be created using a wrong hostname.
      
      To fix this issue, this PR uses the `hostname` provided by `SparkEnv.create` to create `NettyBlockTransferService` and sets `NettyBlockTransferService.hostname` to it directly. A bonus of this approach is that NettyBlockTransferService won't bind to `0.0.0.0`, which is much safer.
      
      ## How was this patch tested?
      
      Manually checked the bound address using local-cluster.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #12240 from zsxwing/SPARK-14437.
      4d7c3592
    • Josh Rosen's avatar
      [SPARK-11416][BUILD] Update to Chill 0.8.0 & Kryo 3.0.3 · 906eef4c
      Josh Rosen authored
      This patch upgrades Chill to 0.8.0 and Kryo to 3.0.3. While we'll likely need to bump these dependencies again before Spark 2.0 (due to SPARK-14221 / https://github.com/twitter/chill/issues/252), I wanted to get the bulk of the Kryo 2 -> Kryo 3 migration done now in order to figure out whether there are any unexpected surprises.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #12076 from JoshRosen/kryo3.
      906eef4c
    • Josh Rosen's avatar
      [SPARK-14435][BUILD] Shade Kryo in our custom Hive 1.2.1 fork · 464a3c1e
      Josh Rosen authored
      This patch updates our custom Hive 1.2.1 fork in order to shade Kryo in Hive. This is a blocker for upgrading Spark to use Kryo 3 (see #12076).
      
      The source for this new fork of Hive can be found at https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2
      
      Here's the complete diff from the official Hive 1.2.1 release: https://github.com/apache/hive/compare/release-1.2.1...JoshRosen:release-1.2.1-spark2
      
      Here's the diff from the sources that pwendell used to publish the current `1.2.1.spark` release of Hive: https://github.com/pwendell/hive/compare/release-1.2.1-spark...JoshRosen:release-1.2.1-spark2. This diff looks large because his branch used a shell script to rewrite the groupId, whereas I had to commit the groupId changes in order to prevent the find-and-replace from affecting the package names in our relocated Kryo classes: https://github.com/pwendell/hive/compare/release-1.2.1-spark...JoshRosen:release-1.2.1-spark2#diff-6ada9aaec70e069df8f2c34c5519dd1e
      
      Using these changes, I was able to publish a local version of Hive and verify that this change fixes the test failures which are blocking #12076. Note that this PR will not compile until we complete the review of the Hive POM changes and stage and publish a release.
      
      /cc vanzin, steveloughran, and pwendell for review.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #12215 from JoshRosen/shade-kryo-in-hive.
      464a3c1e
    • Sameer Agarwal's avatar
      [SPARK-14394][SQL] Generate AggregateHashMap class for LongTypes during TungstenAggregate codegen · f8c9beca
      Sameer Agarwal authored
      ## What changes were proposed in this pull request?
      
      This PR adds support for generating the `AggregateHashMap` class in `TungstenAggregate` if the aggregate group-by keys/values are of `LongType`. Note that this generated aggregate map is currently not actually used.
      
      NB: This currently only supports `LongType` keys/values (please see `isAggregateHashMapSupported` in `TungstenAggregate`) and will be generalized to other data types in a subsequent PR.
      
      ## How was this patch tested?
      
      Manually inspected the generated code. This is what the generated map looks like for 2 keys:
      
      ```java
      /* 068 */   public class agg_GeneratedAggregateHashMap {
      /* 069 */     private org.apache.spark.sql.execution.vectorized.ColumnarBatch batch;
      /* 070 */     private int[] buckets;
      /* 071 */     private int numBuckets;
      /* 072 */     private int maxSteps;
      /* 073 */     private int numRows = 0;
      /* 074 */     private org.apache.spark.sql.types.StructType schema =
      /* 075 */     new org.apache.spark.sql.types.StructType()
      /* 076 */     .add("k1", org.apache.spark.sql.types.DataTypes.LongType)
      /* 077 */     .add("k2", org.apache.spark.sql.types.DataTypes.LongType)
      /* 078 */     .add("sum", org.apache.spark.sql.types.DataTypes.LongType);
      /* 079 */
      /* 080 */     public agg_GeneratedAggregateHashMap(int capacity, double loadFactor, int maxSteps) {
      /* 081 */       assert (capacity > 0 && ((capacity & (capacity - 1)) == 0));
      /* 082 */       this.maxSteps = maxSteps;
      /* 083 */       numBuckets = (int) (capacity / loadFactor);
      /* 084 */       batch = org.apache.spark.sql.execution.vectorized.ColumnarBatch.allocate(schema,
      /* 085 */         org.apache.spark.memory.MemoryMode.ON_HEAP, capacity);
      /* 086 */       buckets = new int[numBuckets];
      /* 087 */       java.util.Arrays.fill(buckets, -1);
      /* 088 */     }
      /* 089 */
      /* 090 */     public agg_GeneratedAggregateHashMap() {
      /* 091 */       new agg_GeneratedAggregateHashMap(1 << 16, 0.25, 5);
      /* 092 */     }
      /* 093 */
      /* 094 */     public org.apache.spark.sql.execution.vectorized.ColumnarBatch.Row findOrInsert(long agg_key, long agg_key1) {
      /* 095 */       long h = hash(agg_key, agg_key1);
      /* 096 */       int step = 0;
      /* 097 */       int idx = (int) h & (numBuckets - 1);
      /* 098 */       while (step < maxSteps) {
      /* 099 */         // Return bucket index if it's either an empty slot or already contains the key
      /* 100 */         if (buckets[idx] == -1) {
      /* 101 */           batch.column(0).putLong(numRows, agg_key);
      /* 102 */           batch.column(1).putLong(numRows, agg_key1);
      /* 103 */           batch.column(2).putLong(numRows, 0);
      /* 104 */           buckets[idx] = numRows++;
      /* 105 */           return batch.getRow(buckets[idx]);
      /* 106 */         } else if (equals(idx, agg_key, agg_key1)) {
      /* 107 */           return batch.getRow(buckets[idx]);
      /* 108 */         }
      /* 109 */         idx = (idx + 1) & (numBuckets - 1);
      /* 110 */         step++;
      /* 111 */       }
      /* 112 */       // Didn't find it
      /* 113 */       return null;
      /* 114 */     }
      /* 115 */
      /* 116 */     private boolean equals(int idx, long agg_key, long agg_key1) {
      /* 117 */       return batch.column(0).getLong(buckets[idx]) == agg_key && batch.column(1).getLong(buckets[idx]) == agg_key1;
      /* 118 */     }
      /* 119 */
      /* 120 */     // TODO: Improve this Hash Function
      /* 121 */     private long hash(long agg_key, long agg_key1) {
      /* 122 */       return agg_key ^ agg_key1;
      /* 123 */     }
      /* 124 */
      /* 125 */   }
      ```
      
      Author: Sameer Agarwal <sameer@databricks.com>
      
      Closes #12161 from sameeragarwal/tungsten-aggregate.
      f8c9beca
    • tedyu's avatar
      [SPARK-14448] Improvements to ColumnVector · 02757535
      tedyu authored
      ## What changes were proposed in this pull request?
      
      In this PR, two changes are proposed for ColumnVector:
      1. ColumnVector should be declared as implementing AutoCloseable, since it already has a close() method.
      2. In OnHeapColumnVector#reserveInternal(), we only need to allocate a new array when the existing array is null or its length is shorter than the newCapacity (see the sketch below).
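
      A sketch of the second change, Scala-fied and with assumed names (the real code is Java inside OnHeapColumnVector):

      ```scala
      // Only reallocate when the current array is missing or too small; otherwise
      // reuse it and skip an unnecessary allocation and copy.
      def reserveLongs(current: Array[Long], newCapacity: Int): Array[Long] = {
        if (current == null || current.length < newCapacity) {
          val grown = new Array[Long](newCapacity)
          if (current != null) System.arraycopy(current, 0, grown, 0, current.length)
          grown
        } else {
          current
        }
      }
      ```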
      
      ## How was this patch tested?
      
      Existing unit tests.
      
      Author: tedyu <yuzhihong@gmail.com>
      
      Closes #12225 from tedyu/master.
      02757535
    • Yanbo Liang's avatar
      [SPARK-14298][ML][MLLIB] LDA should support disable checkpoint · 56af8e85
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      In the doc of [```checkpointInterval```](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala#L241), we told users that they can disable checkpointing by setting ```checkpointInterval = -1```. However, we did not actually handle this situation for LDA; this PR fixes that bug.
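
      After the fix, disabling checkpointing works as documented; a minimal sketch:

      ```scala
      import org.apache.spark.ml.clustering.LDA

      // checkpointInterval = -1 now disables checkpointing, as documented.
      val lda = new LDA()
        .setK(10)
        .setCheckpointInterval(-1)
      ```
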
      ## How was this patch tested?
      Existing tests.
      
      cc jkbradley
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #12089 from yanboliang/spark-14298.
      56af8e85
    • Josh Rosen's avatar
      [BUILD][HOTFIX] Download Maven from regular mirror network rather than archive.apache.org · 94ac58b2
      Josh Rosen authored
      [archive.apache.org](https://archive.apache.org/) is undergoing maintenance, breaking our `build/mvn` script:
      
      > We are in the process of relocating this service. To save on the immense bandwidth that this service outputs, we have put it in maintenance mode, disabling all downloads for the next few days. We expect the maintenance to be complete no later than the morning of Monday the 11th of April, 2016.
      
      This patch fixes this issue by updating the script to use the regular mirror network to download Maven.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #12262 from JoshRosen/fix-mvn-download.
      94ac58b2
    • wm624@hotmail.com's avatar
      [SPARK-12569][PYSPARK][ML] DecisionTreeRegressor: provide variance of prediction: Python API · e0ad75f2
      wm624@hotmail.com authored
      ## What changes were proposed in this pull request?
      
      A new column, varianceCol, has been added to DecisionTreeRegressor in the ML Scala code.
      
      This patch adds the corresponding Python API, HasVarianceCol, to class DecisionTreeRegressor.
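
      For reference, the Scala-side usage that the new Python API mirrors (the column name is an example):

      ```scala
      import org.apache.spark.ml.regression.DecisionTreeRegressor

      // Request the per-prediction variance column on the Scala estimator.
      val dtr = new DecisionTreeRegressor()
        .setVarianceCol("variance")
      ```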
      
      ## How was this patch tested?
      ./dev/lint-python
      PEP8 checks passed.
      rm -rf _build/*
      pydoc checks passed.
      
      ./python/run-tests --python-executables=python2.7 --modules=pyspark-ml
      Running PySpark tests. Output is in /Users/mwang/spark_ws_0904/python/unit-tests.log
      Will test against the following Python executables: ['python2.7']
      Will test the following Python modules: ['pyspark-ml']
      Finished test(python2.7): pyspark.ml.evaluation (12s)
      Finished test(python2.7): pyspark.ml.clustering (18s)
      Finished test(python2.7): pyspark.ml.classification (30s)
      Finished test(python2.7): pyspark.ml.recommendation (28s)
      Finished test(python2.7): pyspark.ml.feature (43s)
      Finished test(python2.7): pyspark.ml.regression (31s)
      Finished test(python2.7): pyspark.ml.tuning (19s)
      Finished test(python2.7): pyspark.ml.tests (34s)
      
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #12116 from wangmiao1981/fix_api.
      e0ad75f2
    • Kai Jiang's avatar
      [SPARK-14373][PYSPARK] PySpark RandomForestClassifier, Regressor support export/import · e5d8d6e0
      Kai Jiang authored
      ## What changes were proposed in this pull request?
      Supports `RandomForest{Classifier, Regressor}` save/load for the Python API.
      [JIRA](https://issues.apache.org/jira/browse/SPARK-14373)
      ## How was this patch tested?
      doctest
      
      Author: Kai Jiang <jiangkai@gmail.com>
      
      Closes #12238 from vectorijk/spark-14373.
      e5d8d6e0
    • Mark Grover's avatar
      [SPARK-14477][BUILD] Allow custom mirrors for downloading artifacts in build/mvn · a9b630f4
      Mark Grover authored
      ## What changes were proposed in this pull request?
      
      Allows overriding the locations used for downloading Apache and Typesafe artifacts in the build/mvn script.
      
      ## How was this patch tested?
      By running the script like this:
      ````
      # Remove all previously downloaded artifacts
      rm -rf build/apache-maven*
      rm -rf build/zinc-*
      rm -rf build/scala-*
      
      # Make sure path is clean and doesn't contain mvn, for example.
      ...
      
      # Run a command without setting anything and make sure it succeeds
      build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 package
      
      # Run a command setting the default location as mirror and make sure it succeeds
      APACHE_MIRROR=http://mirror.infra.cloudera.com/apache/ build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 package
      
      # Do the same without the trailing slash this time and make sure it succeeds
      APACHE_MIRROR=http://mirror.infra.cloudera.com/apache build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 package
      
      # Do it with a bad URL and make sure it fails
      APACHE_MIRROR=xyz build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 package
      ````
      
      Author: Mark Grover <mark@apache.org>
      
      Closes #12250 from markgrover/spark-14477.
      a9b630f4
    • Aaron Tokhy's avatar
      [SPARK-14470] Allow for overriding both httpclient and httpcore versions · 583b5e05
      Aaron Tokhy authored
      ## What changes were proposed in this pull request?
      
      This splits commons.httpclient.version from commons.httpcore.version, since these two versions do not necessarily have to be the same. A follow-up change may move to up-to-date versions of the httpclient/httpcore libraries.

      The latest 4.3.x httpclient version as of writing is 4.3.6 and the latest 4.3.x httpcore version as of writing is 4.3.3. This change would be a prerequisite for potentially moving to these newer bugfix versions.
      
      ## How was this patch tested?
      No version change was made for the httpclient/httpcore versions; ran `mvn package`.
      
      Author: Aaron Tokhy <tokaaron@amazon.com>
      
      Closes #12245 from atokhy/pull-request.
      583b5e05
    • Jacek Laskowski's avatar
      [SPARK-14402][HOTFIX] Fix ExpressionDescription annotation · 64470980
      Jacek Laskowski authored
      ## What changes were proposed in this pull request?
      
      Fix for the error introduced in https://github.com/apache/spark/commit/c59abad052b7beec4ef550049413e95578e545be:
      
      ```
      /Users/jacek/dev/oss/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:626: error: annotation argument needs to be a constant; found: "_FUNC_(str) - ".+("Returns str, with the first letter of each word in uppercase, all other letters in ").+("lowercase. Words are delimited by white space.")
          "Returns str, with the first letter of each word in uppercase, all other letters in " +
                                                                                                ^
      ```
      
      ## How was this patch tested?
      
      Local build
      
      Author: Jacek Laskowski <jacek@japila.pl>
      
      Closes #12192 from jaceklaskowski/SPARK-14402-HOTFIX.
      64470980
    • hyukjinkwon's avatar
      [SPARK-14189][SQL] JSON data sources find compatible types even if inferred... · 73b56a3c
      hyukjinkwon authored
      [SPARK-14189][SQL] JSON data sources find compatible types even if inferred decimal type is not capable of the others
      
      ## What changes were proposed in this pull request?
      
      https://issues.apache.org/jira/browse/SPARK-14189
      
      When the types inferred for the same field while finding a compatible `DataType` are `IntegralType` and `DecimalType`, but the `DecimalType` cannot hold the given `IntegralType`, the JSON data source simply fails to find a compatible type, resulting in `StringType`.
      
      This can be observed when `prefersDecimal` is enabled.
      
      ```scala
      def mixedIntegerAndDoubleRecords: RDD[String] =
        sqlContext.sparkContext.parallelize(
          """{"a": 3, "b": 1.1}""" ::
          """{"a": 3.1, "b": 1}""" :: Nil)
      
      val jsonDF = sqlContext.read
        .option("prefersDecimal", "true")
        .json(mixedIntegerAndDoubleRecords)
        .printSchema()
      ```
      
      - **Before**
      
      ```
      root
       |-- a: string (nullable = true)
       |-- b: string (nullable = true)
      ```
      
      - **After**
      
      ```
      root
       |-- a: decimal(21, 1) (nullable = true)
       |-- b: decimal(21, 1) (nullable = true)
      ```
      (Note that integer is inferred as `LongType` which becomes `DecimalType(20, 0)`)
      
      ## How was this patch tested?
      
      unit tests were used and style tests by `dev/run_tests`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #11993 from HyukjinKwon/SPARK-14189.
      73b56a3c
    • hyukjinkwon's avatar
      [SPARK-14103][SQL] Parse unescaped quotes in CSV data source. · 725b860e
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR resolves a problem with parsing unescaped quotes in input data. For example, currently the data below:
      
      ```
      "a"b,ccc,ddd
      e,f,g
      ```
      
      produces the data below:
      
      - **Before**
      
      ```bash
      ["a"b,ccc,ddd[\n]e,f,g]  <- as a value.
      ```
      
      - **After**
      
      ```bash
      ["a"b], [ccc], [ddd]
      [e], [f], [g]
      ```
      
      This PR bumps up the Univocity parser's version. This was fixed in `2.0.2`, https://github.com/uniVocity/univocity-parsers/issues/60.
      
      ## How was this patch tested?
      
      Unit tests in `CSVSuite` and `sbt/sbt scalastyle`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #12226 from HyukjinKwon/SPARK-14103-quote.
      725b860e
  3. Apr 07, 2016
    • Joseph K. Bradley's avatar
      [SPARK-13048][ML][MLLIB] keepLastCheckpoint option for LDA EM optimizer · 953ff897
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      The EMLDAOptimizer should generally not delete its last checkpoint since that can cause failures when DistributedLDAModel methods are called (if any partitions need to be recovered from the checkpoint).
      
      This PR adds a "deleteLastCheckpoint" option which defaults to false.  This is a change in behavior from Spark 1.6, in that the last checkpoint will not be removed by default.
      
      This involves adding the deleteLastCheckpoint option to both spark.ml and spark.mllib, and modifying PeriodicCheckpointer to support the option.
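
      A sketch of the spark.ml side, using the param name from the title (keepLastCheckpoint); the body above describes the same default expressed as deleteLastCheckpoint = false:

      ```scala
      import org.apache.spark.ml.clustering.LDA

      // EM optimizer with checkpointing; after this change the last checkpoint is
      // kept by default so DistributedLDAModel methods can still recover from it.
      val lda = new LDA()
        .setOptimizer("em")
        .setCheckpointInterval(10)
        .setKeepLastCheckpoint(true)
      ```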
      
      This also:
      * Makes MLlibTestSparkContext extend TempDirectory and set the checkpointDir to tempDir
      * Updates LibSVMRelationSuite because of a name conflict with "tempDir" (and fixes a bug where it failed to delete a temp directory)
      * Adds a MIMA exclude for DistributedLDAModel constructor, which is already ```private[clustering]```
      
      ## How was this patch tested?
      
      Added 2 new unit tests to spark.ml LDASuite, which calls into spark.mllib.
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #12166 from jkbradley/emlda-save-checkpoint.
      953ff897
    • Michael Armbrust's avatar
      [SPARK-14449][SQL] SparkContext should use SparkListenerInterface · 692c7484
      Michael Armbrust authored
      Currently all `SparkFirehoseListener` implementations are broken since we expect listeners to extend `SparkListener`, while the firehose only extends `SparkListenerInterface`. This changes the addListener function and the config-based injection to use the interface instead.
      
      The existing tests in SparkListenerSuite are improved such that they would have caught this.
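
      A short sketch of the firehose usage this unblocks (the event-handling body is illustrative; assumes an existing SparkContext `sc`):

      ```scala
      import org.apache.spark.SparkFirehoseListener
      import org.apache.spark.scheduler.SparkListenerEvent

      // SparkFirehoseListener implements SparkListenerInterface (not SparkListener),
      // so registering it relies on the interface-based addSparkListener.
      val firehose = new SparkFirehoseListener {
        override def onEvent(event: SparkListenerEvent): Unit = {
          // inspect every scheduler event here
        }
      }
      sc.addSparkListener(firehose)
      ```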
      
      Follow-up to #12142
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #12227 from marmbrus/fixListener.
      692c7484
    • Andrew Or's avatar
      [SPARK-14468] Always enable OutputCommitCoordinator · 3e29e372
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      `OutputCommitCoordinator` was introduced to deal with concurrent task attempts racing to write output, leading to data loss or corruption. For more detail, read the [JIRA description](https://issues.apache.org/jira/browse/SPARK-14468).
      
      Before: `OutputCommitCoordinator` is enabled only if speculation is enabled.
      After: `OutputCommitCoordinator` is always enabled.
      
      Users may still disable this through `spark.hadoop.outputCommitCoordination.enabled`, but they really shouldn't...
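
      Illustrative opt-out only, using the key named above (not recommended):

      ```scala
      import org.apache.spark.SparkConf

      // Explicitly disable the coordinator, reverting to the pre-patch behavior.
      val conf = new SparkConf()
        .set("spark.hadoop.outputCommitCoordination.enabled", "false")
      ```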
      
      ## How was this patch tested?
      
      `OutputCommitCoordinator*Suite`
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #12244 from andrewor14/always-occ.
      3e29e372
    • Michael Gummelt's avatar
      [DOCS][MINOR] Remove sentence about Mesos not supporting cluster mode. · 30e980ad
      Michael Gummelt authored
      Docs change to remove the sentence about Mesos not supporting cluster mode.

      That sentence was no longer accurate.
      
      Author: Michael Gummelt <mgummelt@mesosphere.io>
      
      Closes #12249 from mgummelt/fix-mesos-cluster-docs.
      30e980ad
    • Wenchen Fan's avatar
      [SPARK-14270][SQL] whole stage codegen support for typed filter · 49fb2370
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      We currently implement typed filter with `MapPartitions`, which doesn't work well with whole-stage codegen. This PR uses `Filter` to implement typed filter, so we get whole-stage codegen support for free.

      This PR also introduces `DeserializeToObject` and `SerializeFromObject` to separate the serialization logic from the object operator, so that it's easier to write optimization rules for adjacent object operators.
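
      A minimal sketch of a typed filter that now benefits (assumes a spark-shell style `sqlContext` and its implicits):

      ```scala
      case class Person(name: String, age: Int)
      import sqlContext.implicits._

      // A typed (lambda) filter on a Dataset; after this PR it is planned as a
      // Filter operator instead of MapPartitions, so whole-stage codegen applies.
      val people = Seq(Person("alice", 30), Person("bob", 12)).toDS()
      val adults = people.filter(_.age >= 18)
      ```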
      
      ## How was this patch tested?
      
      existing tests.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #12061 from cloud-fan/whole-stage-codegen.
      49fb2370
    • Andrew Or's avatar
      [SPARK-14410][SQL] Push functions existence check into catalog · ae1db91d
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      This is a followup to #12117 and addresses some of the TODOs introduced there. In particular, the resolution of the database is now pushed into the session catalog, which knows about the current database. Further, the logic for checking whether a function exists is pushed into the external catalog.
      
      No change in functionality is expected.
      
      ## How was this patch tested?
      
      `SessionCatalogSuite`, `DDLSuite`
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #12198 from andrewor14/function-exists.
      ae1db91d
    • Davies Liu's avatar
      [SPARK-12740] [SPARK-13932] support grouping()/grouping_id() in having/order clause · aa852215
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      This PR brings support for using grouping()/grouping_id() in the HAVING/ORDER BY clauses.

      The resolved grouping()/grouping_id() will be replaced by an unresolved "spark_grouping_id" virtual attribute, which is then resolved by ResolveMissingAttribute.

      This PR also fixes the case where the HAVING clause accesses a grouping column that is not present in the SELECT clause, for example:
      ```sql
      select count(1) from (select 1 as a) t group by a having a > 0
      ```
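
      An illustrative query exercising the new HAVING/ORDER BY support (table and column names are made up), wrapped in Scala for consistency:

      ```scala
      // grouping()/grouping_id() used in HAVING and ORDER BY over a ROLLUP.
      sqlContext.sql("""
        SELECT dept, city, grouping_id(), sum(salary)
        FROM employees
        GROUP BY dept, city WITH ROLLUP
        HAVING grouping(city) = 1
        ORDER BY grouping_id(), dept
      """)
      ```
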
      ## How was this patch tested?
      
      Add new tests.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #12235 from davies/grouping_having.
      aa852215
    • Kousuke Saruta's avatar
      [SPARK-14456][SQL][MINOR] Remove unused variables and logics in DataSource · 8dcb0c7c
      Kousuke Saruta authored
      ## What changes were proposed in this pull request?
      
      In the DataSource#write method, the variables `dataSchema` and `equality` and their related logic are no longer used. Let's remove them.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #12237 from sarutak/SPARK-14456.
      8dcb0c7c
    • Tathagata Das's avatar
      [SQL][TESTS] Fix for flaky test in ContinuousQueryManagerSuite · 3aa7d763
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      
      The timeouts were lower than the other timeouts in the test. The other tests have been stable over the last month.
      
      ## How was this patch tested?
      
      Jenkins tests.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #12219 from tdas/flaky-test-fix.
      3aa7d763
    • Dhruve Ashar's avatar
      [SPARK-12384] Enables spark-clients to set the min(-Xms) and max(*.memory config) j… · 033d8081
      Dhruve Ashar authored
      ## What changes were proposed in this pull request?
      
      Currently Spark clients are started with the same memory setting for -Xms and -Xmx, leading to reserving unnecessarily large amounts of memory.
      This behavior is changed so that clients can now specify an initial heap size using extraJavaOptions in the config for the driver, executor and AM individually.
      Note that only -Xms can be provided through this config option; if the client wants to set the max size (-Xmx), this has to be done via the *.memory configuration knobs which are currently supported.
      
      ## How was this patch tested?
      
      Monitored executor and YARN logs in debug mode to verify the commands through which they are launched in client and cluster mode. The driver memory was verified locally using jps -v. Setting the -Xmx parameter in extraJavaOptions raises an exception with the info provided.
      
      Author: Dhruve Ashar <dhruveashar@gmail.com>
      
      Closes #12115 from dhruve/impr/SPARK-12384.
      033d8081
    • Alex Bozarth's avatar
      [SPARK-14245][WEB UI] Display the user in the application view · 35e0db2d
      Alex Bozarth authored
      ## What changes were proposed in this pull request?
      
      The Spark UI (both active and history) should show the user who ran the application somewhere in the application view. This was added to the Jobs view, next to total uptime and scheduler mode.
      
      ## How was this patch tested?
      
      Manual testing
      
      <img width="191" alt="username" src="https://cloud.githubusercontent.com/assets/13952758/14222830/6d1fe542-f82a-11e5-885f-c05ee2cdf857.png">
      
      Author: Alex Bozarth <ajbozart@us.ibm.com>
      
      Closes #12123 from ajbozarth/spark14245.
      35e0db2d
    • Malte's avatar
      Better host description for multi-master mesos · db75ccb5
      Malte authored
      ## What changes were proposed in this pull request?
      
      Since not having the correct ZooKeeper URL causes job failures, the documentation should include all parameters.
      
      ## How was this patch tested?
      
      no tests necessary
      
      Author: Malte <elmalto@users.noreply.github.com>
      
      Closes #12218 from elmalto/patch-1.
      db75ccb5
    • Reynold Xin's avatar
      [SPARK-10063][SQL] Remove DirectParquetOutputCommitter · 9ca0760d
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch removes DirectParquetOutputCommitter. This was initially created by Databricks as a faster way to write Parquet data to S3. However, given how the underlying S3 Hadoop implementation works, this committer only works when there are no failures. If there are multiple attempts of the same task (e.g. speculation or task failures or node failures), the output data can be corrupted. I don't think this performance optimization outweighs the correctness issue.
      
      ## How was this patch tested?
      Removed the related tests also.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #12229 from rxin/SPARK-10063.
      9ca0760d
    • Reynold Xin's avatar
      [SPARK-14452][SQL] Explicit APIs in Scala for specifying encoders · e11aa9ec
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      The Scala Dataset public API currently only allows users to specify encoders through SQLContext.implicits. This is OK, but sometimes people want to get encoders explicitly without a SQLContext (e.g. Aggregator implementations). This patch adds public APIs to the Encoders class for getting Scala encoders.
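
      A short sketch of getting encoders explicitly without a SQLContext (the case class is made up):

      ```scala
      import org.apache.spark.sql.{Encoder, Encoders}

      // Explicit encoders, e.g. for use inside an Aggregator implementation.
      case class Point(x: Double, y: Double)
      val pointEncoder: Encoder[Point] = Encoders.product[Point]
      val longEncoder: Encoder[Long] = Encoders.scalaLong
      ```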
      
      ## How was this patch tested?
      None - I will update test cases once https://github.com/apache/spark/pull/12231 is merged.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #12232 from rxin/SPARK-14452.
      e11aa9ec
  4. Apr 06, 2016
    • Marcelo Vanzin's avatar
      [SPARK-14134][CORE] Change the package name used for shading classes. · 21d5ca12
      Marcelo Vanzin authored
      The current package name uses a dash, which is a little weird but seemed
      to work. That is, until a new test tried to mock a class that references
      one of those shaded types, and then things started failing.
      
      Most changes are just noise to fix the logging configs.
      
      For reference, SPARK-8815 also raised this issue, although at the time it
      did not cause any issues in Spark, so it was not addressed.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #11941 from vanzin/SPARK-14134.
      21d5ca12
    • Herman van Hovell's avatar
      [SPARK-12610][SQL] Left Anti Join · d7659227
      Herman van Hovell authored
      ### What changes were proposed in this pull request?
      
      This PR adds support for `LEFT ANTI JOIN` to Spark SQL. A `LEFT ANTI JOIN` is the exact opposite of a `LEFT SEMI JOIN` and can be used to identify rows in one dataset that are not in another dataset. Note that `nulls` on the left side of the join cannot match a row on the right hand side of the join; the result is that left anti join will always select a row with a `null` in one or more of its keys.
      
      We currently add support for the following SQL join syntax:
      
          SELECT   *
          FROM      tbl1 A
                    LEFT ANTI JOIN tbl2 B
                     ON A.Id = B.Id
      
      Or using a dataframe:
      
          tbl1.as("a").join(tbl2.as("b"), $"a.id" === $"b.id", "left_anti")
      
      This PR serves as the basis for implementing `NOT EXISTS` and `NOT IN (...)` correlated sub-queries. It would also serve as a good basis for implementing a more efficient `EXCEPT` operator.
      
      The PR has been (loosely) based on PRs by both davies (https://github.com/apache/spark/pull/10706) and chenghao-intel (https://github.com/apache/spark/pull/10563); credit should be given where credit is due.
      
      This PR adds support for `LEFT ANTI JOIN` to `BroadcastHashJoin` (including code generation), `ShuffledHashJoin` and `BroadcastNestedLoopJoin`.
      
      ### How was this patch tested?
      
      Added tests to `JoinSuite` and ported `ExistenceJoinSuite` from https://github.com/apache/spark/pull/10563.
      
      cc davies chenghao-intel rxin
      
      Author: Herman van Hovell <hvanhovell@questtec.nl>
      
      Closes #12214 from hvanhovell/SPARK-12610.
      d7659227
    • Marcelo Vanzin's avatar
      [SPARK-14446][TESTS] Fix ReplSuite for Scala 2.10. · 4901086f
      Marcelo Vanzin authored
      Just use the same test code as the 2.11 version, which seems to pass.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #12223 from vanzin/SPARK-14446.
      4901086f