  1. Apr 11, 2016
• [SPARK-14462][ML][MLLIB] Add the mllib-local build to maven pom · efaf7d18
      DB Tsai authored
      ## What changes were proposed in this pull request?
      
In order to separate the linear algebra and vector/matrix classes into a standalone jar, we need to set up the build first. This PR will create a new jar called mllib-local with minimal dependencies.
      
The previous PR was failing the build because of the `spark-core:test` dependency, and it was reverted. In this PR, `FunSuite` with `// scalastyle:ignore funsuite` is used in the mllib-local tests instead, similar to the `sketch` module.
      
      Thanks.
      
      ## How was this patch tested?
      
      Unit tests
      
      mengxr tedyu holdenk
      
      Author: DB Tsai <dbt@netflix.com>
      
      Closes #12298 from dbtsai/dbtsai-mllib-local-build-fix.
• [SPARK-14510][MLLIB] Add args-checking for LDA and StreamingKMeans · 643b4e22
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
Add argument checking for LDA and StreamingKMeans.
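The kind of argument check added here can be sketched as follows (an illustrative helper in plain Python, not the actual MLlib code):

```python
def require_positive(name, value):
    # Illustrative validation, similar in spirit to the checks added for
    # LDA and StreamingKMeans (e.g. the number of clusters k must be > 0).
    if value <= 0:
        raise ValueError(f"{name} must be > 0, but got {value}")

require_positive("k", 3)  # passes silently for a valid argument
```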
      
      ## How was this patch tested?
      manual tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #12062 from zhengruifeng/initmodel.
• [SPARK-14500] [ML] Accept Dataset[_] instead of DataFrame in MLlib APIs · 1c751fcf
      Xiangrui Meng authored
      ## What changes were proposed in this pull request?
      
      This PR updates MLlib APIs to accept `Dataset[_]` as input where `DataFrame` was the input type. This PR doesn't change the output type. In Java, `Dataset[_]` maps to `Dataset<?>`, which includes `Dataset<Row>`. Some implementations were changed in order to return `DataFrame`. Tests and examples were updated. Note that this is a breaking change for subclasses of Transformer/Estimator.
      
      Lol, we don't have to rename the input argument, which has been `dataset` since Spark 1.2.
      
      TODOs:
      - [x] update MiMaExcludes (seems all covered by explicit filters from SPARK-13920)
      - [x] Python
      - [x] add a new test to accept Dataset[LabeledPoint]
      - [x] remove unused imports of Dataset
      
      ## How was this patch tested?
      
Existing unit tests with some modifications.
      
      cc: rxin jkbradley
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #12274 from mengxr/SPARK-14500.
• [SPARK-14372][SQL] Dataset.randomSplit() needs a Java version · e82d95bf
      Rekha Joshi authored
      ## What changes were proposed in this pull request?
      
1. Added method `randomSplitAsList()` in Dataset for Java
(https://issues.apache.org/jira/browse/SPARK-14372).
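The idea behind a list-returning random split can be sketched in plain Python (illustrative names and logic; not the actual Dataset implementation, which splits by sampling fractions):

```python
import random

def random_split_as_list(rows, weights, seed=None):
    """Assign each row to one of len(weights) partitions with probability
    proportional to the normalized weights, returning the partitions as a
    list -- the list shape is what a Java-friendly API needs."""
    rng = random.Random(seed)
    total = float(sum(weights))
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    parts = [[] for _ in weights]
    for row in rows:
        x = rng.random()
        for i, b in enumerate(bounds):
            if x < b:
                parts[i].append(row)
                break
        else:
            # Guard against floating-point rounding of the last boundary
            parts[-1].append(row)
    return parts
```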
      
      ## How was this patch tested?
      
      TestSuite
      
      Author: Rekha Joshi <rekhajoshm@gmail.com>
      Author: Joshi <rekhajoshm@gmail.com>
      
      Closes #12184 from rekhajoshm/SPARK-14372.
• [MINOR][DOCS] Fix wrong data types in JSON Datasets example. · 1a0cca1f
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR fixes the `age` data types from `integer` to `long` in `SQL Programming Guide: JSON Datasets`.
      
      ## How was this patch tested?
      
      Manual.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #12290 from dongjoon-hyun/minor_fix_type_in_json_example.
  2. Apr 10, 2016
• [SPARK-14362][SPARK-14406][SQL][FOLLOW-UP] DDL Native Support: Drop View and Drop Table · 9f838bd2
      gatorsmile authored
      #### What changes were proposed in this pull request?
This PR is to address the comment: https://github.com/apache/spark/pull/12146#discussion-diff-59092238. It removes the function `isViewSupported` from `SessionCatalog`. After the removal, we can still capture the user errors if users try to drop a table using `DROP VIEW`.
      
      #### How was this patch tested?
      Modified the existing test cases
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #12284 from gatorsmile/followupDropTable.
• [SPARK-14419] [MINOR] coding style cleanup · fbf8d008
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      Making them more consistent.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #12289 from davies/cleanup_style.
• [SPARK-14415][SQL] All functions should show usages by command `DESC FUNCTION` · a7ce473b
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
Currently, many functions do not show usages, as in the following example.
      ```
      scala> sql("desc function extended `sin`").collect().foreach(println)
      [Function: sin]
      [Class: org.apache.spark.sql.catalyst.expressions.Sin]
      [Usage: To be added.]
      [Extended Usage:
      To be added.]
      ```
      
This PR adds descriptions for those functions and adds a testcase to prevent adding new functions without usage.
      ```
      scala>  sql("desc function extended `sin`").collect().foreach(println);
      [Function: sin]
      [Class: org.apache.spark.sql.catalyst.expressions.Sin]
      [Usage: sin(x) - Returns the sine of x.]
      [Extended Usage:
      > SELECT sin(0);
       0.0]
      ```
      
      The only exceptions are `cube`, `grouping`, `grouping_id`, `rollup`, `window`.
      
      ## How was this patch tested?
      
Pass the Jenkins tests (including new testcases).
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #12185 from dongjoon-hyun/SPARK-14415.
• Update KMeansExample.scala · b5c78562
      Örjan Lundberg authored
      ## What changes were proposed in this pull request?
The example does not work without the DataFrame import.

## How was this patch tested?

Example doc only.
      
      Author: Örjan Lundberg <orjan.lundberg@gmail.com>
      
      Closes #12277 from oluies/patch-1.
• [SPARK-14497][ML] Use top instead of sortBy() to get top N frequent words as dict in CountVectorizer · f4344582
fwang1 authored
      
      ## What changes were proposed in this pull request?
      
      Replace sortBy() with top() to calculate the top N frequent words as dictionary.
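The performance idea can be mimicked in plain Python with `heapq.nlargest`, which keeps only N candidates instead of sorting the whole vocabulary (an illustrative sketch, not CountVectorizer itself):

```python
import heapq
from collections import Counter

def top_n_terms(term_counts, n):
    # O(V log n) with a bounded heap rather than O(V log V) for a full
    # sort of a vocabulary of size V -- the same win as replacing
    # RDD.sortBy() with RDD.top().
    return heapq.nlargest(n, term_counts.items(), key=lambda kv: kv[1])

counts = Counter("to be or not to be".split())
top2 = top_n_terms(counts, 2)
```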
      
      ## How was this patch tested?
Existing unit tests. Terms with the same TF are sorted in descending order, so the test would fail if terms with the same TF were hardcoded into the dictionary (e.g., "c", "d", ...).
      
      Author: fwang1 <desperado.wf@gmail.com>
      
      Closes #12265 from lionelfeng/master.
• [SPARK-14357][CORE] Properly handle the root cause being a commit denied exception · 22014e6f
      Jason Moore authored
      ## What changes were proposed in this pull request?
      
      When deciding whether a CommitDeniedException caused a task to fail, consider the root cause of the Exception.
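In plain Python terms, walking to the root cause looks like this (a sketch of the idea, not the Scala code in this PR):

```python
def root_cause(exc):
    # Follow the chain of causes to the innermost exception, so a wrapped
    # CommitDenied-style error is still recognized as the real reason the
    # task failed.
    while exc.__cause__ is not None:
        exc = exc.__cause__
    return exc

class CommitDeniedError(Exception):
    """Stand-in for Spark's CommitDeniedException (illustrative)."""
```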
      
      ## How was this patch tested?
      
      Added a test suite for the component that extracts the root cause of the error.
Made a distribution after cherry-picking this commit to branch-1.6 and used it to run our Spark application, which would quite often fail due to the CommitDeniedException.
      
      Author: Jason Moore <jasonmoore2k@outlook.com>
      
      Closes #12228 from jasonmoore2k/SPARK-14357.
• [SPARK-14455][STREAMING] Fix NPE in allocatedExecutors when calling in receiver-less scenario · 2c95e4e9
      jerryshao authored
      ## What changes were proposed in this pull request?
      
When calling `ReceiverTracker#allocatedExecutors` in a receiver-less scenario, an NPE will be thrown, since the `ReceiverTracker` is not actually started and its `endpoint` is not created.

This happens when using streaming dynamic allocation with direct Kafka.
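The shape of the fix is a simple guard on the never-started case (illustrative Python with hypothetical names, not the actual ReceiverTracker code):

```python
class ReceiverTracker:
    def __init__(self):
        self.endpoint = None  # only created once the tracker is started
        self._executors = {}

    def start(self):
        self.endpoint = object()  # stands in for the RPC endpoint

    def allocated_executors(self):
        # Guard against the receiver-less case where start() was never
        # called: return an empty mapping instead of dereferencing None.
        if self.endpoint is None:
            return {}
        return self._executors
```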
      
      ## How was this patch tested?
      
A local integration test was done.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #12236 from jerryshao/SPARK-14455.
• [SPARK-14506][SQL] HiveClientImpl's toHiveTable misses a table property for external tables · 3fb09afd
      Yin Huai authored
      ## What changes were proposed in this pull request?
      
For an external table's metadata (in Hive's representation), the table type needs to be EXTERNAL_TABLE. Also, a table property named EXTERNAL needs to be set to TRUE (for a MANAGED_TABLE it is FALSE), based on https://github.com/apache/hive/blob/release-1.2.1/metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java#L1095-L1105. HiveClientImpl's toHiveTable fails to set this table property.
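The rule cited from Hive's ObjectStore can be written down as a tiny helper (illustrative only, not HiveClientImpl itself):

```python
def hive_table_properties(table_type):
    # EXTERNAL must be TRUE for EXTERNAL_TABLE and FALSE for
    # MANAGED_TABLE; setting the table type alone is not enough.
    if table_type not in ("EXTERNAL_TABLE", "MANAGED_TABLE"):
        raise ValueError(f"unexpected table type: {table_type}")
    return {"EXTERNAL": "TRUE" if table_type == "EXTERNAL_TABLE" else "FALSE"}
```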
      
      ## How was this patch tested?
      
      Added a new test.
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #12275 from yhuai/SPARK-14506.
  3. Apr 09, 2016
• [SPARK-14465][BUILD] Checkstyle should check all Java files · aea30a1a
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Currently, `checkstyle` is configured to check the files under `src/main/java`. However, Spark has Java files in `src/main/scala`, too. This PR fixes the following configuration in `pom.xml` and the unchecked-so-far violations on those files.
      ```xml
      -<sourceDirectory>${basedir}/src/main/java</sourceDirectory>
      +<sourceDirectories>${basedir}/src/main/java,${basedir}/src/main/scala</sourceDirectories>
      ```
      
      ## How was this patch tested?
      
After passing the Jenkins build and manually running `dev/lint-java`. (Note that Jenkins does not run `lint-java`.)
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #12242 from dongjoon-hyun/SPARK-14465.
• [SPARK-14301][EXAMPLES] Java examples code merge and clean up. · 72e66bb2
      Yong Tang authored
      ## What changes were proposed in this pull request?
      
      This fix tries to remove duplicate Java code in examples/mllib and examples/ml. The following changes have been made:
      
      ```
      deleted: ml/JavaCrossValidatorExample.java (duplicate of JavaModelSelectionViaCrossValidationExample.java)
      deleted: ml/JavaTrainValidationSplitExample.java (duplicated of JavaModelSelectionViaTrainValidationSplitExample.java)
      deleted: mllib/JavaFPGrowthExample.java (duplicate of JavaSimpleFPGrowth.java)
      deleted: mllib/JavaLDAExample.java (duplicate of JavaLatentDirichletAllocationExample.java)
      deleted: mllib/JavaKMeans.java (merged with JavaKMeansExample.java)
      deleted: mllib/JavaLR.java (duplicate of JavaLinearRegressionWithSGDExample.java)
      updated: mllib/JavaKMeansExample.java (merged with mllib/JavaKMeans.java)
      ```
      
      ## How was this patch tested?
      Existing tests passed.
      
      Author: Yong Tang <yong.tang.github@outlook.com>
      
      Closes #12143 from yongtang/SPARK-14301.
• [SPARK-13687][PYTHON] Cleanup PySpark parallelize temporary files · 00288ea2
      Holden Karau authored
      ## What changes were proposed in this pull request?
      
      Eagerly cleanup PySpark's temporary parallelize cleanup files rather than waiting for shut down.
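The eager-cleanup pattern can be sketched with the standard `tempfile` module (an illustrative stand-in for what PySpark's `parallelize` does with its serialized batches, not the actual implementation):

```python
import os
import tempfile

def parallelize_via_temp_file(payload):
    # Write the serialized data to a temp file, hand it to the consumer,
    # then delete the file eagerly in a finally block instead of waiting
    # for interpreter shutdown.
    fd, path = tempfile.mkstemp(prefix="pyspark-parallelize-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.unlink(path)  # eager cleanup
```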
      
      ## How was this patch tested?
      
      Unit tests
      
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #12233 from holdenk/SPARK-13687-cleanup-pyspark-temporary-files.
• [SPARK-14217] [SQL] Fix bug if parquet data has columns that use dictionary encoding for some of the data · 5989c85b
Nong Li authored
      
      ## What changes were proposed in this pull request?
      
      This PR is based on #12017
      
Currently, a batch can contain some values that are dictionary encoded and some
that are not. The non-dictionary-encoded values cause us to remove the dictionary
from the batch, causing the first values to return garbage.
      
      This patch fixes the issue by first decoding the dictionary for the values that are
      already dictionary encoded before switching. A similar thing is done for the reverse
      case where the initial values are not dictionary encoded.
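A toy model of the fix (illustrative Python, not the actual vectorized Parquet reader):

```python
def decode_column(pages):
    # Each page is ("dict", dictionary, indices) or ("plain", values).
    # When dictionary-encoded pages are followed by plain pages, the
    # dictionary-encoded values must be materialized *before* the
    # dictionary is dropped; otherwise the earlier rows read garbage.
    out = []
    for page in pages:
        if page[0] == "dict":
            _, dictionary, indices = page
            out.extend(dictionary[i] for i in indices)  # decode eagerly
        else:
            _, values = page
            out.extend(values)
    return out
```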
      
      ## How was this patch tested?
      
      This is difficult to test but replicated on a test cluster using a large tpcds data set.
      
      Author: Nong Li <nong@databricks.com>
      Author: Davies Liu <davies@databricks.com>
      
      Closes #12279 from davies/fix_dict.
• [SPARK-14419] [SQL] Improve HashedRelation for key fit within Long · 5cb5edaf
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
Currently, we use a java HashMap for HashedRelation if the key could fit within a Long. The java HashMap and CompactBuffer are not memory efficient, and the memory used by them is also not accounted accurately.

This PR introduces a LongToUnsafeRowMap (similar to BytesToBytesMap) for better memory efficiency and performance.

This PR reopens #12190 to fix its bugs.
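The core idea of a flat, open-addressing map keyed by a long can be sketched as follows (a minimal illustration without resizing or off-heap storage, not the actual LongToUnsafeRowMap):

```python
class LongKeyMap:
    """Open-addressing map for integer keys: parallel flat arrays and
    linear probing instead of a boxed java.util.HashMap. Sketch only,
    with no growth logic."""

    def __init__(self, capacity=16):
        assert capacity & (capacity - 1) == 0  # power of two for masking
        self.keys = [None] * capacity
        self.vals = [None] * capacity
        self.mask = capacity - 1

    def put(self, key, value):
        idx = key & self.mask
        while self.keys[idx] is not None and self.keys[idx] != key:
            idx = (idx + 1) & self.mask  # linear probing
        self.keys[idx] = key
        self.vals[idx] = value

    def get(self, key):
        idx = key & self.mask
        while self.keys[idx] is not None:
            if self.keys[idx] == key:
                return self.vals[idx]
            idx = (idx + 1) & self.mask
        return None
```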
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #12278 from davies/long_map3.
• [SPARK-14362][SPARK-14406][SQL] DDL Native Support: Drop View and Drop Table · dfce9665
      gatorsmile authored
      #### What changes were proposed in this pull request?
      
      This PR is to provide a native support for DDL `DROP VIEW` and `DROP TABLE`. The PR includes native parsing and native analysis.
      
Based on the Hive DDL document for [DROP VIEW](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-DropView), `DROP VIEW` is defined as,
      **Syntax:**
      ```SQL
      DROP VIEW [IF EXISTS] [db_name.]view_name;
      ```
       - to remove metadata for the specified view.
       - illegal to use DROP TABLE on a view.
       - illegal to use DROP VIEW on a table.
       - this command only works in `HiveContext`. In `SQLContext`, we will get an exception.
      
      This PR also handles `DROP TABLE`.
      **Syntax:**
      ```SQL
      DROP TABLE [IF EXISTS] table_name [PURGE];
      ```
- Previously, the `DROP TABLE` command could only drop Hive tables in `HiveContext`. Now, after this PR, this command can also drop temporary tables, external tables, and external data source tables in `SQLContext`.
      - In `HiveContext`, we will not issue an exception if the to-be-dropped table does not exist and users did not specify `IF EXISTS`. Instead, we just log an error message. If `IF EXISTS` is specified, we will not issue any error message/exception.
      - In `SQLContext`, we will issue an exception if the to-be-dropped table does not exist, unless `IF EXISTS` is specified.
      - Data will not be deleted if the tables are `external`, unless table type is `managed_table`.
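The `IF EXISTS` semantics described above can be sketched as a small helper (illustrative Python over a plain dict catalog; `hive_mode` stands in for the HiveContext log-only behavior):

```python
def drop_table(catalog, name, if_exists=False, hive_mode=False):
    # SQLContext-like mode: a missing table raises unless IF EXISTS is
    # given. HiveContext-like mode: the error is only logged.
    if name not in catalog:
        if if_exists:
            return
        if hive_mode:
            print(f"ERROR: table {name} does not exist")
            return
        raise KeyError(f"table {name} does not exist")
    del catalog[name]
```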
      
      #### How was this patch tested?
      For verifying command parsing, added test cases in `spark/sql/hive/HiveDDLCommandSuite.scala`
      For verifying command analysis, added test cases in `spark/sql/hive/execution/HiveDDLSuite.scala`
      
      Author: gatorsmile <gatorsmile@gmail.com>
      Author: xiaoli <lixiao1983@gmail.com>
      Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
      
      Closes #12146 from gatorsmile/dropView.
• [SPARK-14481][SQL] Issue Exceptions for All Unsupported Options during Parsing · 9be5558e
      gatorsmile authored
      #### What changes were proposed in this pull request?
      "Not good to slightly ignore all the un-supported options/clauses. We should either support it or throw an exception." A comment from yhuai in another PR https://github.com/apache/spark/pull/12146
      
      - Can `Explain` be an exception? The `Formatted` clause is used in `HiveCompatibilitySuite`.
      - Two unsupported clauses in `Drop Table` are handled in a separate PR: https://github.com/apache/spark/pull/12146
      
      #### How was this patch tested?
      Test cases are added to verify all the cases.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #12255 from gatorsmile/warningToException.
• Xiangrui Meng · 415446cc
• [SPARK-14335][SQL] Describe function command returns wrong output · cd2fed70
      Yong Tang authored
      ## What changes were proposed in this pull request?
      
This fix tries to fix issues in the `describe function` command where some of the outputs still show Hive's functions, because some built-in functions are not in FunctionRegistry.
      
      The following built-in functions have been added to FunctionRegistry:
      ```
      -
      !
      *
      /
      &
      %
      ^
      +
      <
      <=
      <=>
      =
      ==
      >
      >=
      |
      ~
      and
      in
      like
      not
      or
      rlike
      when
      ```
      
      The following listed functions are not added, but hard coded in `commands.scala` (hvanhovell):
      ```
      !=
      <>
      between
      case
      ```
      Below are the existing result of the above functions that have not been added:
      ```
      spark-sql> describe function `!=`;
      Function: <>
      Class: org.apache.hadoop.hive.ql.udf.generic.GenericUDFOPNotEqual
      Usage: a <> b - Returns TRUE if a is not equal to b
      ```
      ```
      spark-sql> describe function `<>`;
      Function: <>
      Class: org.apache.hadoop.hive.ql.udf.generic.GenericUDFOPNotEqual
      Usage: a <> b - Returns TRUE if a is not equal to b
      ```
      ```
      spark-sql> describe function `between`;
      Function: between
      Class: org.apache.hadoop.hive.ql.udf.generic.GenericUDFBetween
      Usage: between a [NOT] BETWEEN b AND c - evaluate if a is [not] in between b and c
      ```
      ```
      spark-sql> describe function `case`;
      Function: case
      Class: org.apache.hadoop.hive.ql.udf.generic.GenericUDFCase
      Usage: CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END - When a = b, returns c; when a = d, return e; else return f
      ```
      
      ## How was this patch tested?
      
      Existing tests passed. Additional test cases added.
      
      Author: Yong Tang <yong.tang.github@outlook.com>
      
      Closes #12128 from yongtang/SPARK-14335.
• Davies Liu · f7ec854f
• [SPARK-14339][DOC] Add python examples for DCT,MinMaxScaler,MaxAbsScaler · adb9d73c
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      add three python examples
      
      ## How was this patch tested?
      manual tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #12063 from zhengruifeng/dct_pe.
• [SPARK-14462][ML][MLLIB] add the mllib-local build to maven pom · 1598d11b
      DB Tsai authored
      ## What changes were proposed in this pull request?
      
In order to separate the linear algebra and vector/matrix classes into a standalone jar, we need to set up the build first. This PR will create a new jar called mllib-local with minimal dependencies. The test scope will still depend on spark-core and spark-core-test in order to use the common utilities, but the runtime will avoid any platform dependency. A couple of platform-independent classes will be moved to this package to demonstrate how this works.
      
      ## How was this patch tested?
      
      Unit tests
      
      Author: DB Tsai <dbt@netflix.com>
      
      Closes #12241 from dbtsai/dbtsai-mllib-local-build.
• [SPARK-14496][SQL] fix some javadoc typos · 10a95781
      bomeng authored
      ## What changes were proposed in this pull request?
      
      Minor issues. Found 2 typos while browsing the code.
      
      ## How was this patch tested?
      None.
      
      Author: bomeng <bmeng@us.ibm.com>
      
      Closes #12264 from bomeng/SPARK-14496.
• [SPARK-14392][ML] CountVectorizer Estimator should include binary toggle Param · a9b8b655
      wm624@hotmail.com authored
      ## What changes were proposed in this pull request?
      
CountVectorizerModel has a binary toggle param. This PR adds the binary toggle param to the estimator CountVectorizer. As discussed in the JIRA, instead of adding a param into CountVectorizer, I moved the binary param to CountVectorizerParams, so the estimator inherits the binary param.
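The effect of the binary param can be sketched in plain Python (an illustrative helper showing what the flag does to term counts, not the CountVectorizer implementation):

```python
def apply_binary_toggle(counts, binary):
    # With binary=True every non-zero term count becomes 1.0, which is
    # useful for models that expect binary term features.
    if not binary:
        return counts
    return [1.0 if c > 0 else 0.0 for c in counts]
```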
      
      ## How was this patch tested?
      
Added a new test case, which fits the model with the binary flag set to true and then checks that all of the trained model's non-zero counts are set to 1.0.

All tests in CountVectorizerSuite.scala pass.
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #12200 from wangmiao1981/binary_param.
• [SPARK-14419] [SQL] Improve HashedRelation for key fit within Long · 90c0a045
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
Currently, we use a java HashMap for HashedRelation if the key could fit within a Long. The java HashMap and CompactBuffer are not memory efficient, and the memory used by them is also not accounted accurately.

This PR introduces a LongToUnsafeRowMap (similar to BytesToBytesMap) for better memory efficiency and performance.
      
      ## How was this patch tested?
      
      Updated existing tests.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #12190 from davies/long_map2.
• [SPARK-14451][SQL] Move encoder definition into Aggregator interface · 520dde48
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      When we first introduced Aggregators, we required the user of Aggregators to (implicitly) specify the encoders. It would actually make more sense to have the encoders be specified by the implementation of Aggregators, since each implementation should have the most state about how to encode its own data type.
      
      Note that this simplifies the Java API because Java users no longer need to explicitly specify encoders for aggregators.
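The revised contract can be sketched with an abstract base class (illustrative Python with hypothetical method names; the actual Scala API returns `Encoder` instances):

```python
from abc import ABC, abstractmethod

class Aggregator(ABC):
    # The aggregator itself declares how its buffer is encoded, so
    # callers no longer pass encoders alongside the aggregator.
    @abstractmethod
    def zero(self): ...
    @abstractmethod
    def reduce(self, buffer, value): ...
    @abstractmethod
    def finish(self, buffer): ...
    @abstractmethod
    def buffer_encoder(self): ...

class SumAggregator(Aggregator):
    def zero(self): return 0
    def reduce(self, buffer, value): return buffer + value
    def finish(self, buffer): return buffer
    def buffer_encoder(self): return "long"  # placeholder for an Encoder
```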
      
      ## How was this patch tested?
      Updated unit tests.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #12231 from rxin/SPARK-14451.
• [SPARK-14482][SQL] Change default Parquet codec from gzip to snappy · 2f0b882e
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      Based on our tests, gzip decompression is very slow (< 100MB/s), making queries decompression bound. Snappy can decompress at ~ 500MB/s on a single core.
      
      This patch changes the default compression codec for Parquet output from gzip to snappy, and also introduces a ParquetOptions class to be more consistent with other data sources (e.g. CSV, JSON).
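The options-resolution pattern described here can be sketched as follows (an illustrative ParquetOptions-style lookup; the function name and codec set are assumptions for the sketch):

```python
def resolve_parquet_codec(options):
    # An explicit user setting wins; otherwise default to snappy, the
    # new default that replaces gzip.
    codec = options.get("compression", "snappy").lower()
    allowed = {"uncompressed", "snappy", "gzip", "lzo"}
    if codec not in allowed:
        raise ValueError(f"unsupported compression codec: {codec}")
    return codec
```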
      
      ## How was this patch tested?
      Should be covered by existing unit tests.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #12256 from rxin/SPARK-14482.
  4. Apr 08, 2016
• [SPARK-14498][ML][PYTHON][SQL] Many cleanups to ML and ML-related docs · d7af736b
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      Cleanups to documentation.  No changes to code.
      * GBT docs: Move Scala doc for private object GradientBoostedTrees to public docs for GBTClassifier,Regressor
      * GLM regParam: needs doc saying it is for L2 only
      * TrainValidationSplitModel: add .. versionadded:: 2.0.0
* Rename "_transformer_params_from_java" to "_transfer_params_from_java"
* LogReg Summary classes: "probability" column should not say "calibrated"
* LR summaries: coefficientStandardErrors -> document that intercept stderr comes last; same for t-values and p-values
* approxCountDistinct: Document meaning of the "rsd" argument.
      * LDA: note which params are for online LDA only
      
      ## How was this patch tested?
      
      Doc build
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #12266 from jkbradley/ml-doc-cleanups.
• [SPARK-14454] Better exception handling while marking tasks as failed · 813e96e6
      Sameer Agarwal authored
      ## What changes were proposed in this pull request?
      
      This patch adds support for better handling of exceptions inside catch blocks if the code within the block throws an exception. For instance here is the code in a catch block before this change in `WriterContainer.scala`:
      
      ```scala
      logError("Aborting task.", cause)
      // call failure callbacks first, so we could have a chance to cleanup the writer.
      TaskContext.get().asInstanceOf[TaskContextImpl].markTaskFailed(cause)
      if (currentWriter != null) {
        currentWriter.close()
      }
      abortTask()
      throw new SparkException("Task failed while writing rows.", cause)
      ```
      
If `markTaskFailed` or `currentWriter.close` throws an exception, we currently lose the original cause. This PR fixes this problem by implementing a utility function `Utils.tryWithSafeCatch` that suppresses (`Throwable.addSuppressed`) the exceptions thrown within the catch block and rethrows the original exception.
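The pattern can be mimicked in Python, where `Throwable.addSuppressed` has no direct equivalent (a sketch with a hypothetical helper name; suppressed errors are collected on an attribute):

```python
def try_with_safe_catch(original, cleanups):
    # Run each cleanup action; if one throws, record its error on the
    # original exception instead of losing the original cause, then
    # re-raise the *original* exception.
    suppressed = []
    for cleanup in cleanups:
        try:
            cleanup()
        except Exception as e:
            suppressed.append(e)
    original.suppressed = suppressed
    raise original
```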
      
      ## How was this patch tested?
      
      No new functionality added
      
      Author: Sameer Agarwal <sameer@databricks.com>
      
      Closes #12234 from sameeragarwal/fix-exception.
• [SPARK-14437][CORE] Use the address that NettyBlockTransferService listens to create BlockManagerId · 4d7c3592
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Here is why SPARK-14437 happens:
      BlockManagerId is created using NettyBlockTransferService.hostName which comes from `customHostname`. And `Executor` will set `customHostname` to the hostname which is detected by the driver. However, the driver may not be able to detect the correct address in some complicated network (Netty's Channel.remoteAddress doesn't always return a connectable address). In such case, `BlockManagerId` will be created using a wrong hostname.
      
      To fix this issue, this PR uses `hostname` provided by `SparkEnv.create` to create `NettyBlockTransferService` and set `NettyBlockTransferService.hostname` to this one directly. A bonus of this approach is NettyBlockTransferService won't bound to `0.0.0.0` which is much safer.
      
      ## How was this patch tested?
      
      Manually checked the bound address using local-cluster.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #12240 from zsxwing/SPARK-14437.
• [SPARK-11416][BUILD] Update to Chill 0.8.0 & Kryo 3.0.3 · 906eef4c
      Josh Rosen authored
      This patch upgrades Chill to 0.8.0 and Kryo to 3.0.3. While we'll likely need to bump these dependencies again before Spark 2.0 (due to SPARK-14221 / https://github.com/twitter/chill/issues/252), I wanted to get the bulk of the Kryo 2 -> Kryo 3 migration done now in order to figure out whether there are any unexpected surprises.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #12076 from JoshRosen/kryo3.
• [SPARK-14435][BUILD] Shade Kryo in our custom Hive 1.2.1 fork · 464a3c1e
      Josh Rosen authored
      This patch updates our custom Hive 1.2.1 fork in order to shade Kryo in Hive. This is a blocker for upgrading Spark to use Kryo 3 (see #12076).
      
      The source for this new fork of Hive can be found at https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2
      
      Here's the complete diff from the official Hive 1.2.1 release: https://github.com/apache/hive/compare/release-1.2.1...JoshRosen:release-1.2.1-spark2
      
      Here's the diff from the sources that pwendell used to publish the current `1.2.1.spark` release of Hive: https://github.com/pwendell/hive/compare/release-1.2.1-spark...JoshRosen:release-1.2.1-spark2. This diff looks large because his branch used a shell script to rewrite the groupId, whereas I had to commit the groupId changes in order to prevent the find-and-replace from affecting the package names in our relocated Kryo classes: https://github.com/pwendell/hive/compare/release-1.2.1-spark...JoshRosen:release-1.2.1-spark2#diff-6ada9aaec70e069df8f2c34c5519dd1e
      
      Using these changes, I was able to publish a local version of Hive and verify that this change fixes the test failures which are blocking #12076. Note that this PR will not compile until we complete the review of the Hive POM changes and stage and publish a release.
      
      /cc vanzin, steveloughran, and pwendell for review.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #12215 from JoshRosen/shade-kryo-in-hive.
• [SPARK-14394][SQL] Generate AggregateHashMap class for LongTypes during TungstenAggregate codegen · f8c9beca
      Sameer Agarwal authored
      ## What changes were proposed in this pull request?
      
This PR adds support for generating the `AggregateHashMap` class in `TungstenAggregate` if the aggregate group-by keys/values are of `LongType`. Note that this generated aggregate map is currently not actually used.
      
      NB: This currently only supports `LongType` keys/values (please see `isAggregateHashMapSupported` in `TungstenAggregate`) and will be generalized to other data types in a subsequent PR.
      
      ## How was this patch tested?
      
      Manually inspected the generated code. This is what the generated map looks like for 2 keys:
      
      ```java
      /* 068 */   public class agg_GeneratedAggregateHashMap {
      /* 069 */     private org.apache.spark.sql.execution.vectorized.ColumnarBatch batch;
      /* 070 */     private int[] buckets;
      /* 071 */     private int numBuckets;
      /* 072 */     private int maxSteps;
      /* 073 */     private int numRows = 0;
      /* 074 */     private org.apache.spark.sql.types.StructType schema =
      /* 075 */     new org.apache.spark.sql.types.StructType()
      /* 076 */     .add("k1", org.apache.spark.sql.types.DataTypes.LongType)
      /* 077 */     .add("k2", org.apache.spark.sql.types.DataTypes.LongType)
      /* 078 */     .add("sum", org.apache.spark.sql.types.DataTypes.LongType);
      /* 079 */
      /* 080 */     public agg_GeneratedAggregateHashMap(int capacity, double loadFactor, int maxSteps) {
      /* 081 */       assert (capacity > 0 && ((capacity & (capacity - 1)) == 0));
      /* 082 */       this.maxSteps = maxSteps;
      /* 083 */       numBuckets = (int) (capacity / loadFactor);
      /* 084 */       batch = org.apache.spark.sql.execution.vectorized.ColumnarBatch.allocate(schema,
      /* 085 */         org.apache.spark.memory.MemoryMode.ON_HEAP, capacity);
      /* 086 */       buckets = new int[numBuckets];
      /* 087 */       java.util.Arrays.fill(buckets, -1);
      /* 088 */     }
      /* 089 */
      /* 090 */     public agg_GeneratedAggregateHashMap() {
      /* 091 */       new agg_GeneratedAggregateHashMap(1 << 16, 0.25, 5);
      /* 092 */     }
      /* 093 */
      /* 094 */     public org.apache.spark.sql.execution.vectorized.ColumnarBatch.Row findOrInsert(long agg_key, long agg_key1) {
      /* 095 */       long h = hash(agg_key, agg_key1);
      /* 096 */       int step = 0;
      /* 097 */       int idx = (int) h & (numBuckets - 1);
      /* 098 */       while (step < maxSteps) {
      /* 099 */         // Return bucket index if it's either an empty slot or already contains the key
      /* 100 */         if (buckets[idx] == -1) {
      /* 101 */           batch.column(0).putLong(numRows, agg_key);
      /* 102 */           batch.column(1).putLong(numRows, agg_key1);
      /* 103 */           batch.column(2).putLong(numRows, 0);
      /* 104 */           buckets[idx] = numRows++;
      /* 105 */           return batch.getRow(buckets[idx]);
      /* 106 */         } else if (equals(idx, agg_key, agg_key1)) {
      /* 107 */           return batch.getRow(buckets[idx]);
      /* 108 */         }
      /* 109 */         idx = (idx + 1) & (numBuckets - 1);
      /* 110 */         step++;
      /* 111 */       }
      /* 112 */       // Didn't find it
      /* 113 */       return null;
      /* 114 */     }
      /* 115 */
      /* 116 */     private boolean equals(int idx, long agg_key, long agg_key1) {
      /* 117 */       return batch.column(0).getLong(buckets[idx]) == agg_key && batch.column(1).getLong(buckets[idx]) == agg_key1;
      /* 118 */     }
      /* 119 */
      /* 120 */     // TODO: Improve this Hash Function
      /* 121 */     private long hash(long agg_key, long agg_key1) {
      /* 122 */       return agg_key ^ agg_key1;
      /* 123 */     }
      /* 124 */
      /* 125 */   }
      ```
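To make the probing scheme in the generated `findOrInsert` above concrete, here is a self-contained sketch of the same open-addressing logic with plain `long[]` columns standing in for `ColumnarBatch`. All names here (`LongLongMapSketch`, `add`, `sumAt`) are illustrative, not the actual generated class:

```java
import java.util.Arrays;

// Simplified stand-in for the generated hash map: linear probing with a
// bounded number of steps, keys/values stored columnarly in long arrays.
public class LongLongMapSketch {
    private final long[] k1, k2, sum;
    private final int[] buckets;
    private final int numBuckets, maxSteps;
    private int numRows = 0;

    public LongLongMapSketch(int capacity, double loadFactor, int maxSteps) {
        assert capacity > 0 && (capacity & (capacity - 1)) == 0;
        this.maxSteps = maxSteps;
        this.numBuckets = (int) (capacity / loadFactor);
        this.k1 = new long[capacity];
        this.k2 = new long[capacity];
        this.sum = new long[capacity];
        this.buckets = new int[numBuckets];
        Arrays.fill(buckets, -1);
    }

    /** Returns the row index for (key1, key2), inserting if absent; -1 if probing gives up. */
    public int findOrInsert(long key1, long key2) {
        long h = key1 ^ key2;                    // same toy hash as the snippet's TODO
        int idx = (int) h & (numBuckets - 1);
        for (int step = 0; step < maxSteps; step++) {
            if (buckets[idx] == -1) {            // empty slot: insert a fresh row
                k1[numRows] = key1;
                k2[numRows] = key2;
                sum[numRows] = 0;
                buckets[idx] = numRows++;
                return buckets[idx];
            } else if (k1[buckets[idx]] == key1 && k2[buckets[idx]] == key2) {
                return buckets[idx];             // key already present
            }
            idx = (idx + 1) & (numBuckets - 1);  // linear probe to the next bucket
        }
        return -1;                               // didn't find a slot within maxSteps
    }

    public void add(int row, long v) { sum[row] += v; }
    public long sumAt(int row) { return sum[row]; }
}
```

A usage sketch: with the snippet's defaults (`1 << 16`, `0.25`, `5`), the first `findOrInsert(k1, k2)` call for a key pair appends a zeroed row and later calls return that same row, so the caller can accumulate into the `sum` column in place.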
      
      Author: Sameer Agarwal <sameer@databricks.com>
      
      Closes #12161 from sameeragarwal/tungsten-aggregate.
      f8c9beca
    • tedyu's avatar
      [SPARK-14448] Improvements to ColumnVector · 02757535
      tedyu authored
      ## What changes were proposed in this pull request?
      
In this PR, two changes are proposed for ColumnVector:
1. ColumnVector should be declared as implementing AutoCloseable, since it already has a close() method.
2. In OnHeapColumnVector#reserveInternal(), we only need to allocate a new array when the existing array is null or shorter than newCapacity.
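A minimal sketch of the grow-only guard described in change #2, using hypothetical names rather than the actual OnHeapColumnVector internals:

```java
// Illustrative stand-in for the reserveInternal() fix: reallocate the
// backing array only when it is missing or too small; otherwise reuse it.
public class GrowOnlyLongColumn {
    private long[] data;

    public void reserve(int newCapacity) {
        if (data == null || data.length < newCapacity) {
            long[] grown = new long[newCapacity];
            if (data != null) {
                // Preserve already-written values across the reallocation.
                System.arraycopy(data, 0, grown, 0, data.length);
            }
            data = grown;
        }
        // If the current array is already large enough, do nothing.
    }

    public int capacity() {
        return data == null ? 0 : data.length;
    }
}
```

With this guard, a reserve call with a smaller capacity than the current array is a no-op, avoiding a pointless allocation and copy.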
      
      ## How was this patch tested?
      
      Existing unit tests.
      
      Author: tedyu <yuzhihong@gmail.com>
      
      Closes #12225 from tedyu/master.
      02757535
    • Yanbo Liang's avatar
      [SPARK-14298][ML][MLLIB] LDA should support disable checkpoint · 56af8e85
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
In the doc for [```checkpointInterval```](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala#L241), we tell users that they can disable checkpointing by setting ```checkpointInterval = -1```, but LDA does not actually handle this case. This PR fixes that bug.
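The shape of the fix can be sketched as a single predicate (hypothetical helper, not Spark's actual internals): checkpointing fires only for a positive interval, so `-1` disables it entirely.

```java
// Illustrative checkpoint-gating logic: an interval of -1 (or any
// non-positive value) disables checkpointing; otherwise checkpoint
// every checkpointInterval iterations.
public class CheckpointPolicy {
    public static boolean shouldCheckpoint(int checkpointInterval, int iteration) {
        return checkpointInterval > 0
            && iteration > 0
            && iteration % checkpointInterval == 0;
    }
}
```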
      ## How was this patch tested?
      Existing tests.
      
      cc jkbradley
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #12089 from yanboliang/spark-14298.
      56af8e85
    • Josh Rosen's avatar
      [BUILD][HOTFIX] Download Maven from regular mirror network rather than archive.apache.org · 94ac58b2
      Josh Rosen authored
      [archive.apache.org](https://archive.apache.org/) is undergoing maintenance, breaking our `build/mvn` script:
      
      > We are in the process of relocating this service. To save on the immense bandwidth that this service outputs, we have put it in maintenance mode, disabling all downloads for the next few days. We expect the maintenance to be complete no later than the morning of Monday the 11th of April, 2016.
      
      This patch fixes this issue by updating the script to use the regular mirror network to download Maven.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #12262 from JoshRosen/fix-mvn-download.
      94ac58b2
    • wm624@hotmail.com's avatar
      [SPARK-12569][PYSPARK][ML] DecisionTreeRegressor: provide variance of prediction: Python API · e0ad75f2
      wm624@hotmail.com authored
      ## What changes were proposed in this pull request?
      
A new param, varianceCol, has been added to DecisionTreeRegressor in the ML Scala code.

This patch adds the corresponding Python API, HasVarianceCol, to the DecisionTreeRegressor class.
      
      ## How was this patch tested?
      ./dev/lint-python
      PEP8 checks passed.
      rm -rf _build/*
      pydoc checks passed.
      
      ./python/run-tests --python-executables=python2.7 --modules=pyspark-ml
      Running PySpark tests. Output is in /Users/mwang/spark_ws_0904/python/unit-tests.log
      Will test against the following Python executables: ['python2.7']
      Will test the following Python modules: ['pyspark-ml']
      Finished test(python2.7): pyspark.ml.evaluation (12s)
      Finished test(python2.7): pyspark.ml.clustering (18s)
      Finished test(python2.7): pyspark.ml.classification (30s)
      Finished test(python2.7): pyspark.ml.recommendation (28s)
      Finished test(python2.7): pyspark.ml.feature (43s)
      Finished test(python2.7): pyspark.ml.regression (31s)
      Finished test(python2.7): pyspark.ml.tuning (19s)
      Finished test(python2.7): pyspark.ml.tests (34s)
      
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #12116 from wangmiao1981/fix_api.
      e0ad75f2