  1. Aug 10, 2015
    • Hao Zhu's avatar
      [SPARK-9801] [STREAMING] Check if file exists before deleting temporary files. · 3c9802d9
      Hao Zhu authored
      Spark Streaming deletes the temp file and backup files without checking whether they exist.
      
      Author: Hao Zhu <viadeazhu@gmail.com>
      
      Closes #8082 from viadea/master and squashes the following commits:
      
      242d05f [Hao Zhu] [SPARK-9801][Streaming]No need to check the existence of those files
      fd143f2 [Hao Zhu] [SPARK-9801][Streaming]Check if backupFile exists before deleting backupFile files.
      087daf0 [Hao Zhu] SPARK-9801
      3c9802d9
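      The fix boils down to the standard Hadoop FileSystem idiom of checking that a file exists before deleting it. A minimal sketch, assuming hypothetical checkpoint paths rather than the actual WriteAheadLog code:

      ```scala
      import org.apache.hadoop.conf.Configuration
      import org.apache.hadoop.fs.{FileSystem, Path}

      object SafeDelete {
        def main(args: Array[String]): Unit = {
          val conf = new Configuration()
          // Hypothetical temp/backup paths standing in for Spark Streaming's checkpoint files.
          val tempFile = new Path("/tmp/checkpoint-temp")
          val backupFile = new Path("/tmp/checkpoint-backup")
          val fs: FileSystem = tempFile.getFileSystem(conf)

          // Guard each delete with an existence check so a missing file is not treated as an error.
          Seq(tempFile, backupFile).foreach { p =>
            if (fs.exists(p)) {
              fs.delete(p, false) // non-recursive delete of a single file
            }
          }
        }
      }
      ```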
    • Prabeesh K's avatar
      [SPARK-5155] [PYSPARK] [STREAMING] Mqtt streaming support in Python · 853809e9
      Prabeesh K authored
      This PR is based on #4229, thanks prabeesh.
      
      Closes #4229
      
      Author: Prabeesh K <prabsmails@gmail.com>
      Author: zsxwing <zsxwing@gmail.com>
      Author: prabs <prabsmails@gmail.com>
      Author: Prabeesh K <prabeesh.k@namshi.com>
      
      Closes #7833 from zsxwing/pr4229 and squashes the following commits:
      
      9570bec [zsxwing] Fix the variable name and check null in finally
      4a9c79e [zsxwing] Fix pom.xml indentation
      abf5f18 [zsxwing] Merge branch 'master' into pr4229
      935615c [zsxwing] Fix the flaky MQTT tests
      47278c5 [zsxwing] Include the project class files
      478f844 [zsxwing] Add unpack
      5f8a1d4 [zsxwing] Make the maven build generate the test jar for Python MQTT tests
      734db99 [zsxwing] Merge branch 'master' into pr4229
      126608a [Prabeesh K] address the comments
      b90b709 [Prabeesh K] Merge pull request #1 from zsxwing/pr4229
      d07f454 [zsxwing] Register StreamingListerner before starting StreamingContext; Revert unncessary changes; fix the python unit test
      a6747cb [Prabeesh K] wait for starting the receiver before publishing data
      87fc677 [Prabeesh K] address the comments:
      97244ec [zsxwing] Make sbt build the assembly test jar for streaming mqtt
      80474d1 [Prabeesh K] fix
      1f0cfe9 [Prabeesh K] python style fix
      e1ee016 [Prabeesh K] scala style fix
      a5a8f9f [Prabeesh K] added Python test
      9767d82 [Prabeesh K] implemented Python-friendly class
      a11968b [Prabeesh K] fixed python style
      795ec27 [Prabeesh K] address comments
      ee387ae [Prabeesh K] Fix assembly jar location of mqtt-assembly
      3f4df12 [Prabeesh K] updated version
      b34c3c1 [prabs] adress comments
      3aa7fff [prabs] Added Python streaming mqtt word count example
      b7d42ff [prabs] Mqtt streaming support in Python
      853809e9
    • Davies Liu's avatar
      [SPARK-9759] [SQL] improve decimal.times() and cast(int, decimalType) · c4fd2a24
      Davies Liu authored
      This patch optimizes two things:
      
      1. passing MathContext to JavaBigDecimal.multiply/divide/remainder to do the rounding correctly, because java.math.BigDecimal.apply(MathContext) is expensive
      
      2. Cast integer/short/byte to decimal directly (without double)
      
      These two optimizations speed up the end-to-end time of an aggregation (SUM(short * decimal(5, 2))) by 75% (from 19s to 10.8s).
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #8052 from davies/optimize_decimal and squashes the following commits:
      
      225efad [Davies Liu] improve decimal.times() and cast(int, decimalType)
      c4fd2a24
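      The first optimization corresponds to supplying a MathContext to the java.math.BigDecimal operation itself, so rounding happens as part of the multiply rather than in a separate step; the second avoids routing integral values through a double. A standard-library sketch (precision and rounding mode are illustrative, not Spark's Decimal internals):

      ```scala
      import java.math.{BigDecimal => JBigDecimal, MathContext, RoundingMode}

      object DecimalTimes {
        def main(args: Array[String]): Unit = {
          val a = new JBigDecimal("12345.67")
          val b = new JBigDecimal("0.125")

          // Round as part of the multiply itself rather than constructing a rounded copy afterwards.
          val mc = new MathContext(38, RoundingMode.HALF_UP)
          println(a.multiply(b, mc)) // 1543.20875

          // Build a decimal from an int directly, with no intermediate double.
          println(new JBigDecimal(42))
        }
      }
      ```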
    • Davies Liu's avatar
      [SPARK-9620] [SQL] generated UnsafeProjection should support many columns or large exressions · fe2fb7fb
      Davies Liu authored
      Currently, a generated UnsafeProjection can reach the 64KB bytecode limit of Java. This patch splits the generated expressions into multiple functions to avoid the limitation.
      
      After this patch, we can work well with tables that have up to 64k columns (hitting the limit on the number of constants in Java), which should be enough in practice.
      
      cc rxin
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #8044 from davies/wider_table and squashes the following commits:
      
      9192e6c [Davies Liu] fix generated safe projection
      d1ef81a [Davies Liu] fix failed tests
      737b3d3 [Davies Liu] Merge branch 'master' of github.com:apache/spark into wider_table
      ffcd132 [Davies Liu] address comments
      1b95be4 [Davies Liu] put the generated class into sql package
      77ed72d [Davies Liu] address comments
      4518e17 [Davies Liu] Merge branch 'master' of github.com:apache/spark into wider_table
      75ccd01 [Davies Liu] Merge branch 'master' of github.com:apache/spark into wider_table
      495e932 [Davies Liu] support wider table with more than 1k columns for generated projections
      fe2fb7fb
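      The splitting technique is sketched below with plain string templating rather than Spark's actual code generator: emit the per-column code in fixed-size chunks, wrap each chunk in its own helper method, and have apply() call the helpers in order so no single method exceeds the JVM's 64KB-per-method limit. Names and the chunk size are illustrative.

      ```scala
      object SplitGeneratedCode {
        def main(args: Array[String]): Unit = {
          // Pretend each element is the generated code for writing one column of the projection.
          val perColumnCode: Seq[String] = (0 until 5000).map(i => s"writeColumn($i, row);")

          // Group the snippets into chunks and wrap each chunk in its own method,
          // so no generated method grows past the JVM's per-method bytecode limit.
          val chunkSize = 100
          val helperMethods = perColumnCode.grouped(chunkSize).zipWithIndex.map { case (chunk, idx) =>
            s"""private void apply_$idx(InternalRow row) {
               |  ${chunk.mkString("\n  ")}
               |}""".stripMargin
          }.toSeq

          // The entry point just delegates to the helpers in order.
          val applyMethod =
            s"""public void apply(InternalRow row) {
               |  ${helperMethods.indices.map(i => s"apply_$i(row);").mkString("\n  ")}
               |}""".stripMargin

          println(helperMethods.size) // 50 helpers instead of one giant method
          println(applyMethod.split("\n").take(3).mkString("\n"))
        }
      }
      ```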
    • Reynold Xin's avatar
      [SPARK-9763][SQL] Minimize exposure of internal SQL classes. · 40ed2af5
      Reynold Xin authored
      There are a few changes in this pull request:
      
      1. Moved all data sources to execution.datasources, except the public JDBC APIs.
      2. In order to maintain backward compatibility with 1, added a backward-compatibility translation map in data source resolution.
      3. Moved ui and metric package into execution.
      4. Added more documentation on some internal classes.
      5. Renamed DataSourceRegister.format -> shortName.
      6. Added "override" modifier on shortName.
      7. Removed IntSQLMetric.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #8056 from rxin/SPARK-9763 and squashes the following commits:
      
      9df4801 [Reynold Xin] Removed hardcoded name in test cases.
      d9babc6 [Reynold Xin] Shorten.
      e484419 [Reynold Xin] Removed VisibleForTesting.
      171b812 [Reynold Xin] MimaExcludes.
      2041389 [Reynold Xin] Compile ...
      79dda42 [Reynold Xin] Compile.
      0818ba3 [Reynold Xin] Removed IntSQLMetric.
      c46884f [Reynold Xin] Two more fixes.
      f9aa88d [Reynold Xin] [SPARK-9763][SQL] Minimize exposure of internal SQL classes.
      40ed2af5
    • Josh Rosen's avatar
      [SPARK-9784] [SQL] Exchange.isUnsafe should check whether codegen and unsafe are enabled · 0fe66744
      Josh Rosen authored
      Exchange.isUnsafe should check whether codegen and unsafe are enabled.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #8073 from JoshRosen/SPARK-9784 and squashes the following commits:
      
      7a1019f [Josh Rosen] [SPARK-9784] Exchange.isUnsafe should check whether codegen and unsafe are enabled
      0fe66744
    • Mahmoud Lababidi's avatar
      Fixed AtmoicReference<> Example · d2852127
      Mahmoud Lababidi authored
      Author: Mahmoud Lababidi <lababidi@gmail.com>
      
      Closes #8076 from lababidi/master and squashes the following commits:
      
      af4553b [Mahmoud Lababidi] Fixed AtmoicReference<> Example
      d2852127
    • Feynman Liang's avatar
      [SPARK-9755] [MLLIB] Add docs to MultivariateOnlineSummarizer methods · 00b655cc
      Feynman Liang authored
      Adds method documentation back to `MultivariateOnlineSummarizer`; it was present in 1.4 but disappeared somewhere along the way to 1.5.
      
      jkbradley
      
      Author: Feynman Liang <fliang@databricks.com>
      
      Closes #8045 from feynmanliang/SPARK-9755 and squashes the following commits:
      
      af67fde [Feynman Liang] Add MultivariateOnlineSummarizer docs
      00b655cc
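      For reference, a short usage sketch of the class whose method docs were restored; this relies on the public MLlib API, with made-up data:

      ```scala
      import org.apache.spark.mllib.linalg.Vectors
      import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer

      object SummarizerExample {
        def main(args: Array[String]): Unit = {
          val summarizer = new MultivariateOnlineSummarizer()

          // Samples are added one at a time; statistics are maintained online.
          Seq(
            Vectors.dense(1.0, 10.0),
            Vectors.dense(2.0, 20.0),
            Vectors.dense(3.0, 30.0)
          ).foreach(summarizer.add)

          println(summarizer.count)    // 3
          println(summarizer.mean)     // [2.0,20.0]
          println(summarizer.variance) // [1.0,100.0]
          println(summarizer.max)      // [3.0,30.0]
        }
      }
      ```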
    • Marcelo Vanzin's avatar
      [SPARK-9710] [TEST] Fix RPackageUtilsSuite when R is not available. · 0f3366a4
      Marcelo Vanzin authored
      RUtils.isRInstalled throws an exception if R is not installed,
      instead of returning false. Fix that.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #8008 from vanzin/SPARK-9710 and squashes the following commits:
      
      df72d8c [Marcelo Vanzin] [SPARK-9710] [test] Fix RPackageUtilsSuite when R is not available.
      0f3366a4
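      The general shape of such a fix is to treat "the probe command failed to run" as "not installed" instead of letting the exception escape. A hedged, generic sketch, not the actual RUtils code:

      ```scala
      import scala.sys.process._
      import scala.util.Try

      object RCheck {
        /** Returns false, rather than throwing, when the `R` binary cannot be executed. */
        def isRInstalled: Boolean =
          Try(Seq("R", "--version").! == 0).getOrElse(false)

        def main(args: Array[String]): Unit = {
          println(s"R installed: $isRInstalled")
        }
      }
      ```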
    • Cheng Lian's avatar
      [SPARK-9743] [SQL] Fixes JSONRelation refreshing · e3fef0f9
      Cheng Lian authored
      PR #7696 added two `HadoopFsRelation.refresh()` calls ([this] [1], and [this] [2]) in `DataSourceStrategy` to make test case `InsertSuite.save directly to the path of a JSON table` pass. However, this forces every `HadoopFsRelation` table scan to do a refresh, which can be super expensive for tables with a large number of partitions.
      
      The reason why the original test case fails without the `refresh()` calls is that the old JSON relation builds the base RDD with the input paths, while `HadoopFsRelation` provides `FileStatus`es of leaf files. With the old JSON relation, we can create a temporary table based on a path, write data to it, and then read the newly written data without refreshing the table. This is no longer true for `HadoopFsRelation`.
      
      This PR removes those two expensive refresh calls, and moves the refresh into `JSONRelation` to fix this issue. We might want to update `HadoopFsRelation` interface to provide better support for this use case.
      
      [1]: https://github.com/apache/spark/blob/ebfd91c542aaead343cb154277fcf9114382fee7/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L63
      [2]: https://github.com/apache/spark/blob/ebfd91c542aaead343cb154277fcf9114382fee7/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L91
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #8035 from liancheng/spark-9743/fix-json-relation-refreshing and squashes the following commits:
      
      ec1957d [Cheng Lian] Fixes JSONRelation refreshing
      e3fef0f9
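      The use case the test exercises, re-stated against the public 1.5-era DataFrame API (paths and table names are illustrative; after this patch the refresh happens inside JSONRelation rather than on every HadoopFsRelation scan):

      ```scala
      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.sql.{SQLContext, SaveMode}

      object JsonRefreshUseCase {
        def main(args: Array[String]): Unit = {
          val sc = new SparkContext(new SparkConf().setAppName("json-refresh").setMaster("local[2]"))
          val sqlContext = new SQLContext(sc)
          import sqlContext.implicits._

          val path = "/tmp/json-refresh-demo"

          // Create a JSON dataset at a path and register a temporary table on top of it.
          Seq((1, "a"), (2, "b")).toDF("id", "value").write.mode(SaveMode.Overwrite).json(path)
          sqlContext.read.json(path).registerTempTable("json_table")

          // Write directly to the same path, behind the table's back...
          Seq((3, "c")).toDF("id", "value").write.mode(SaveMode.Append).json(path)

          // ...and expect the newly written rows to be visible without a full-table refresh on every scan.
          sqlContext.sql("SELECT COUNT(*) FROM json_table").show()

          sc.stop()
        }
      }
      ```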
    • Yin Huai's avatar
      [SPARK-9777] [SQL] Window operator can accept UnsafeRows · be80def0
      Yin Huai authored
      https://issues.apache.org/jira/browse/SPARK-9777
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #8064 from yhuai/windowUnsafe and squashes the following commits:
      
      8fb3537 [Yin Huai] Set canProcessUnsafeRows to true.
      be80def0
  2. Aug 09, 2015
    • Shivaram Venkataraman's avatar
      [CORE] [SPARK-9760] Use Option instead of Some for Ivy repos · 46025616
      Shivaram Venkataraman authored
      This was introduced in #7599
      
      cc rxin brkyvz
      
      Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      
      Closes #8055 from shivaram/spark-packages-repo-fix and squashes the following commits:
      
      890f306 [Shivaram Venkataraman] Remove test case
      51d69ee [Shivaram Venkataraman] Add test case for --packages without --repository
      c02e0b4 [Shivaram Venkataraman] Use Option instead of Some for Ivy repos
      46025616
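      The underlying Scala distinction is a general one: Option(x) collapses a null into None, while Some(x) wraps it as-is. A tiny illustration with a made-up repository variable:

      ```scala
      object OptionVsSome {
        def main(args: Array[String]): Unit = {
          val maybeRepo: String = null // e.g. no extra Ivy repository was supplied

          println(Some(maybeRepo))   // Some(null) -- a "present" value that can NPE later
          println(Option(maybeRepo)) // None       -- absence represented correctly
        }
      }
      ```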
    • Josh Rosen's avatar
      [SPARK-9703] [SQL] Refactor EnsureRequirements to avoid certain unnecessary shuffles · 23cf5af0
      Josh Rosen authored
      This pull request refactors the `EnsureRequirements` planning rule in order to avoid the addition of certain unnecessary shuffles.
      
      As an example of how unnecessary shuffles can occur, consider SortMergeJoin, which requires clustered distribution and sorted ordering of its children's input rows. Say that both of SMJ's children produce unsorted output and each is SinglePartition. In this case, we will need to inject sort operators but should not need to inject Exchanges. Unfortunately, it looks like EnsureRequirements unnecessarily repartitions using a hash partitioning.
      
      This patch solves this problem by refactoring `EnsureRequirements` to properly implement the `compatibleWith` checks that were broken in earlier implementations. See the significant inline comments for a better description of how this works. The majority of this PR is new comments and test cases, with few actual changes to the code.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #7988 from JoshRosen/exchange-fixes and squashes the following commits:
      
      38006e7 [Josh Rosen] Rewrite EnsureRequirements _yet again_ to make things even simpler
      0983f75 [Josh Rosen] More guarantees vs. compatibleWith cleanup; delete BroadcastPartitioning.
      8784bd9 [Josh Rosen] Giant comment explaining compatibleWith vs. guarantees
      1307c50 [Josh Rosen] Update conditions for requiring child compatibility.
      18cddeb [Josh Rosen] Rename DummyPlan to DummySparkPlan.
      2c7e126 [Josh Rosen] Merge remote-tracking branch 'origin/master' into exchange-fixes
      fee65c4 [Josh Rosen] Further refinement to comments / reasoning
      642b0bb [Josh Rosen] Further expand comment / reasoning
      06aba0c [Josh Rosen] Add more comments
      8dbc845 [Josh Rosen] Add even more tests.
      4f08278 [Josh Rosen] Fix the test by adding the compatibility check to EnsureRequirements
      a1c12b9 [Josh Rosen] Add failing test to demonstrate allCompatible bug
      0725a34 [Josh Rosen] Small assertion cleanup.
      5172ac5 [Josh Rosen] Add test for requiresChildrenToProduceSameNumberOfPartitions.
      2e0f33a [Josh Rosen] Write a more generic test for EnsureRequirements.
      752b8de [Josh Rosen] style fix
      c628daf [Josh Rosen] Revert accidental ExchangeSuite change.
      c9fb231 [Josh Rosen] Rewrite exchange to fix better handle this case.
      adcc742 [Josh Rosen] Move test to PlannerSuite.
      0675956 [Josh Rosen] Preserving ordering and partitioning in row format converters also does not help.
      cc5669c [Josh Rosen] Adding outputPartitioning to Repartition does not fix the test.
      2dfc648 [Josh Rosen] Add failing test illustrating bad exchange planning.
      23cf5af0
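      A toy model of the decision described above, using hypothetical case classes rather than Spark's planner types: an exchange is only needed when a child's partitioning fails to satisfy the required distribution, and a sort is only needed when the required ordering is missing; the two checks are independent, so two SinglePartition children need sorts but no exchange.

      ```scala
      object EnsureRequirementsToy {
        sealed trait Distribution
        case object UnspecifiedDistribution extends Distribution
        case class ClusteredDistribution(keys: Seq[String]) extends Distribution

        sealed trait Partitioning { def satisfies(d: Distribution): Boolean }
        case object SinglePartition extends Partitioning {
          // A single partition trivially keeps every clustering key in one place.
          def satisfies(d: Distribution): Boolean = true
        }
        case class HashPartitioning(keys: Seq[String]) extends Partitioning {
          def satisfies(d: Distribution): Boolean = d match {
            case UnspecifiedDistribution         => true
            case ClusteredDistribution(required) => required == keys
          }
        }

        def needsExchange(child: Partitioning, required: Distribution): Boolean =
          !child.satisfies(required)

        def needsSort(childOrdering: Seq[String], requiredOrdering: Seq[String]): Boolean =
          requiredOrdering.nonEmpty && childOrdering != requiredOrdering

        def main(args: Array[String]): Unit = {
          // The SortMergeJoin case from the description: both children are SinglePartition and unsorted.
          println(needsExchange(SinglePartition, ClusteredDistribution(Seq("k")))) // false: no shuffle needed
          println(needsSort(Nil, Seq("k")))                                        // true: sort still needed
        }
      }
      ```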
    • Yadong Qi's avatar
      [SPARK-9737] [YARN] Add the suggested configuration when required executor... · 86fa4ba6
      Yadong Qi authored
      [SPARK-9737] [YARN] Add the suggested configuration when required executor memory is above the max threshold of this cluster on YARN mode
      
      Author: Yadong Qi <qiyadong2010@gmail.com>
      
      Closes #8028 from watermen/SPARK-9737 and squashes the following commits:
      
      48bdf3d [Yadong Qi] Add suggested configuration.
      86fa4ba6
    • Yijie Shen's avatar
      [SPARK-8930] [SQL] Throw a AnalysisException with meaningful messages if... · 68ccc6e1
      Yijie Shen authored
      [SPARK-8930] [SQL] Throw a AnalysisException with meaningful messages if DataFrame#explode takes a star in expressions
      
      Author: Yijie Shen <henry.yijieshen@gmail.com>
      
      Closes #8057 from yjshen/explode_star and squashes the following commits:
      
      eae181d [Yijie Shen] change explaination message
      54c9d11 [Yijie Shen] meaning message for * in explode
      68ccc6e1
    • Reynold Xin's avatar
      [SPARK-9752][SQL] Support UnsafeRow in Sample operator. · e9c36938
      Reynold Xin authored
      In order for this to work, I had to disable gap sampling.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #8040 from rxin/SPARK-9752 and squashes the following commits:
      
      f9e248c [Reynold Xin] Fix the test case for real this time.
      adbccb3 [Reynold Xin] Fixed test case.
      589fb23 [Reynold Xin] Merge branch 'SPARK-9752' of github.com:rxin/spark into SPARK-9752
      55ccddc [Reynold Xin] Fixed core test.
      78fa895 [Reynold Xin] [SPARK-9752][SQL] Support UnsafeRow in Sample operator.
      c9e7112 [Reynold Xin] [SPARK-9752][SQL] Support UnsafeRow in Sample operator.
      e9c36938
  3. Aug 08, 2015
    • Yijie Shen's avatar
      [SPARK-6212] [SQL] The EXPLAIN output of CTAS only shows the analyzed plan · 3ca995b7
      Yijie Shen authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-6212
      
      Author: Yijie Shen <henry.yijieshen@gmail.com>
      
      Closes #7986 from yjshen/ctas_explain and squashes the following commits:
      
      bb6fee5 [Yijie Shen] refine test
      f731041 [Yijie Shen] address comment
      b2cf8ab [Yijie Shen] bug fix
      bd7eb20 [Yijie Shen] ctas explain
      3ca995b7
    • CodingCat's avatar
      [MINOR] inaccurate comments for showString() · 25c363e9
      CodingCat authored
      Author: CodingCat <zhunansjtu@gmail.com>
      
      Closes #8050 from CodingCat/minor and squashes the following commits:
      
      5bc4b89 [CodingCat] inaccurate comments
      25c363e9
    • Joseph Batchik's avatar
      [SPARK-9486][SQL] Add data source aliasing for external packages · a3aec918
      Joseph Batchik authored
      Users currently have to provide the full class name for external data sources, like:
      
      `sqlContext.read.format("com.databricks.spark.avro").load(path)`
      
      This allows external data source packages to register themselves using a Service Loader so that they can add a custom alias, like:
      
      `sqlContext.read.format("avro").load(path)`
      
      This makes it so that using external data source packages uses the same format as the internal data sources like parquet, json, etc.
      
      Author: Joseph Batchik <joseph.batchik@cloudera.com>
      Author: Joseph Batchik <josephbatchik@gmail.com>
      
      Closes #7802 from JDrit/service_loader and squashes the following commits:
      
      49a01ec [Joseph Batchik] fixed a couple of format / error bugs
      e5e93b2 [Joseph Batchik] modified rat file to only excluded added services
      72b349a [Joseph Batchik] fixed error with orc data source actually
      9f93ea7 [Joseph Batchik] fixed error with orc data source
      87b7f1c [Joseph Batchik] fixed typo
      101cd22 [Joseph Batchik] removing unneeded changes
      8f3cf43 [Joseph Batchik] merged in changes
      b63d337 [Joseph Batchik] merged in master
      95ae030 [Joseph Batchik] changed the new trait to be used as a mixin for data source to register themselves
      74db85e [Joseph Batchik] reformatted class loader
      ac2270d [Joseph Batchik] removing some added test
      a6926db [Joseph Batchik] added test cases for data source loader
      208a2a8 [Joseph Batchik] changes to do error catching if there are multiple data sources
      946186e [Joseph Batchik] started working on service loader
      a3aec918
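      A minimal sketch of what an external package can do after this change, assuming the 1.5 `DataSourceRegister` trait; the package name, alias, and relation stub are hypothetical:

      ```scala
      package com.example.myformat

      import org.apache.spark.sql.SQLContext
      import org.apache.spark.sql.sources.{BaseRelation, DataSourceRegister, RelationProvider}

      /**
       * Discovered through the Java ServiceLoader: the jar ships a resource file
       *   META-INF/services/org.apache.spark.sql.sources.DataSourceRegister
       * containing the single line "com.example.myformat.DefaultSource".
       */
      class DefaultSource extends RelationProvider with DataSourceRegister {

        // The short alias users can pass to sqlContext.read.format("myformat").
        override def shortName(): String = "myformat"

        override def createRelation(
            sqlContext: SQLContext,
            parameters: Map[String, String]): BaseRelation = {
          // A real implementation would construct its BaseRelation here.
          throw new UnsupportedOperationException("stub for illustration only")
        }
      }
      ```

      With the service file on the classpath, `sqlContext.read.format("myformat").load(path)` resolves to this provider instead of requiring the fully qualified class name.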
    • Yijie Shen's avatar
      [SPARK-9728][SQL]Support CalendarIntervalType in HiveQL · 23695f1d
      Yijie Shen authored
      This PR enables converting an interval term in HiveQL into a CalendarInterval literal.
      
      JIRA: https://issues.apache.org/jira/browse/SPARK-9728
      
      Author: Yijie Shen <henry.yijieshen@gmail.com>
      
      Closes #8034 from yjshen/interval_hiveql and squashes the following commits:
      
      7fe9a5e [Yijie Shen] declare throw exception and add unit test
      fce7795 [Yijie Shen] convert hiveql interval term into CalendarInterval literal
      23695f1d
    • Davies Liu's avatar
      [SPARK-6902] [SQL] [PYSPARK] Row should be read-only · ac507a03
      Davies Liu authored
      Raise a read-only exception when a user tries to mutate a Row.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #8009 from davies/readonly_row and squashes the following commits:
      
      8722f3f [Davies Liu] add tests
      05a3d36 [Davies Liu] Row should be read-only
      ac507a03
    • Davies Liu's avatar
      [SPARK-4561] [PYSPARK] [SQL] turn Row into dict recursively · 74a6541a
      Davies Liu authored
      Add an option `recursive` to `Row.asDict()`; when it is True (the default is False), nested Rows are converted into dicts as well.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #8006 from davies/as_dict and squashes the following commits:
      
      922cc5a [Davies Liu] turn Row into dict recursively
      74a6541a
    • Wenchen Fan's avatar
      [SPARK-9738] [SQL] remove FromUnsafe and add its codegen version to GenerateSafe · 106c0789
      Wenchen Fan authored
      In https://github.com/apache/spark/pull/7752 we added `FromUnsafe` to convert nested unsafe data like array/map/struct to safe versions. It was a quick solution, and we already have `GenerateSafe`, which is code-generated, to do the conversion. So we should remove `FromUnsafe` and implement its codegen version in `GenerateSafe`.
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #8029 from cloud-fan/from-unsafe and squashes the following commits:
      
      ed40d8f [Wenchen Fan] add the copy back
      a93fd4b [Wenchen Fan] cogengen FromUnsafe
      106c0789
    • Cheng Lian's avatar
      [SPARK-4176] [SQL] [MINOR] Should use unscaled Long to write decimals for... · 11caf1ce
      Cheng Lian authored
      [SPARK-4176] [SQL] [MINOR] Should use unscaled Long to write decimals for precision <= 18 rather than 8
      
      This PR fixes a minor bug introduced in #7455: when writing decimals, we should use the unscaled Long for better performance when the precision is <= 18, rather than <= 8 (which appears to be a typo). This bug doesn't affect correctness, but it hurts Parquet decimal writing performance.
      
      This PR also replaced similar magic numbers with newly defined constants.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #8031 from liancheng/spark-4176/minor-fix-for-writing-decimals and squashes the following commits:
      
      10d4ea3 [Cheng Lian] Should use unscaled Long to write decimals for precision <= 18 rather than 8
      11caf1ce
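      The reason 18 is the right cutoff: a decimal with at most 18 digits of precision always has an unscaled value that fits in a signed 64-bit Long, so it can be written as a plain long rather than a binary-encoded BigDecimal. A standard-library check with illustrative values:

      ```scala
      import java.math.BigDecimal

      object UnscaledLongDemo {
        def main(args: Array[String]): Unit = {
          // 18 significant digits: the unscaled value fits comfortably in a Long.
          val d = new BigDecimal("1234567890.12345678") // precision 18, scale 8
          val unscaled: Long = d.unscaledValue().longValueExact()

          println(s"precision=${d.precision()} scale=${d.scale()} unscaled=$unscaled")

          // Long.MaxValue has 19 digits, so a precision-19 unscaled value can already overflow:
          // new BigDecimal("9999999999999999999").unscaledValue().longValueExact()
          // throws ArithmeticException.
        }
      }
      ```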
    • Carson Wang's avatar
      [SPARK-9731] Standalone scheduling incorrect cores if spark.executor.cores is not set · ef062c15
      Carson Wang authored
      The issue only happens if `spark.executor.cores` is not set and executor memory is set to a high value.
      For example, if we have a worker with 4G and 10 cores and we set `spark.executor.memory` to 3G, then only 1 core is assigned to the executor. The correct number should be 10 cores.
      I've added a unit test to illustrate the issue.
      
      Author: Carson Wang <carson.wang@intel.com>
      
      Closes #8017 from carsonwang/SPARK-9731 and squashes the following commits:
      
      d09ec48 [Carson Wang] Fix code style
      86b651f [Carson Wang] Simplify the code
      943cc4c [Carson Wang] fix scheduling correct cores to executors
      ef062c15
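      The arithmetic behind the fix, as a hedged standalone sketch rather than the Master's actual scheduling code (variable names are made up): when spark.executor.cores is unset, the single executor that fits in memory should receive all of the worker's free cores, not one core per scheduling pass.

      ```scala
      object StandaloneCoresSketch {
        def main(args: Array[String]): Unit = {
          val workerCores = 10
          val workerMemoryMb = 4096
          val executorMemoryMb = 3072
          val coresPerExecutor: Option[Int] = None // spark.executor.cores not set

          // How many executors fit on this worker by memory alone.
          val executorsByMemory = workerMemoryMb / executorMemoryMb // 1

          val assignedCores = coresPerExecutor match {
            // Explicit setting: each executor gets exactly that many cores.
            case Some(c) => math.min(workerCores, executorsByMemory * c)
            // Unset: the executor we can afford should get all free cores,
            // not just the one core per round that the bug produced.
            case None if executorsByMemory >= 1 => workerCores
            case None => 0
          }

          println(s"cores assigned on this worker: $assignedCores") // 10
        }
      }
      ```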
  4. Aug 07, 2015
    • Yin Huai's avatar
      [SPARK-9753] [SQL] TungstenAggregate should also accept InternalRow instead of just UnsafeRow · c564b274
      Yin Huai authored
      https://issues.apache.org/jira/browse/SPARK-9753
      
      This PR makes TungstenAggregate accept `InternalRow` instead of just `UnsafeRow`. Also, it adds a `getAggregationBufferFromUnsafeRow` method to `UnsafeFixedWidthAggregationMap`, which is useful when we already have grouping keys stored in `UnsafeRow`s. Finally, it wraps `InputStream` and `OutputStream` in `UnsafeRowSerializer` with `BufferedInputStream` and `BufferedOutputStream`, respectively.
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #8041 from yhuai/joinedRowForProjection and squashes the following commits:
      
      7753e34 [Yin Huai] Use BufferedInputStream and BufferedOutputStream.
      d68b74e [Yin Huai] Use joinedRow instead of UnsafeRowJoiner.
      e93c009 [Yin Huai] Add getAggregationBufferFromUnsafeRow for cases that the given groupingKeyRow is already an UnsafeRow.
      c564b274
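      The stream wrapping mentioned last is the usual java.io buffering idiom; a generic sketch with in-memory streams, not the UnsafeRowSerializer code itself:

      ```scala
      import java.io.{BufferedInputStream, BufferedOutputStream, ByteArrayInputStream, ByteArrayOutputStream}

      object BufferedStreams {
        def main(args: Array[String]): Unit = {
          val rawOut = new ByteArrayOutputStream()
          // Buffer the raw stream so many small row writes coalesce into fewer underlying writes.
          val out = new BufferedOutputStream(rawOut, 64 * 1024)
          out.write(Array[Byte](1, 2, 3))
          out.flush()
          out.close()

          val in = new BufferedInputStream(new ByteArrayInputStream(rawOut.toByteArray), 64 * 1024)
          println(in.read()) // 1
          in.close()
        }
      }
      ```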
    • Reynold Xin's avatar
      [SPARK-9754][SQL] Remove TypeCheck in debug package. · 998f4ff9
      Reynold Xin authored
      TypeCheck no longer applies in the new "Tungsten" world.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #8043 from rxin/SPARK-9754 and squashes the following commits:
      
      4ec471e [Reynold Xin] [SPARK-9754][SQL] Remove TypeCheck in debug package.
      998f4ff9
    • Feynman Liang's avatar
      [SPARK-9719] [ML] Clean up Naive Bayes doc · 85be65b3
      Feynman Liang authored
      Small documentation cleanups, including:
       * Adds documentation for `pi` and `theta`
       * setParam to `setModelType`
      
      Author: Feynman Liang <fliang@databricks.com>
      
      Closes #8047 from feynmanliang/SPARK-9719 and squashes the following commits:
      
      b372438 [Feynman Liang] Clean up naive bayes doc
      85be65b3
    • Feynman Liang's avatar
      [SPARK-9756] [ML] Make constructors in ML decision trees private · cd540c1e
      Feynman Liang authored
      These should be made private until there is a public constructor for providing `rootNode: Node` to use these constructors.
      
      jkbradley
      
      Author: Feynman Liang <fliang@databricks.com>
      
      Closes #8046 from feynmanliang/SPARK-9756 and squashes the following commits:
      
      2cbdf08 [Feynman Liang] Make RFRegressionModel aux constructor private
      a06f596 [Feynman Liang] Make constructors in ML decision trees private
      cd540c1e
    • Michael Armbrust's avatar
      [SPARK-8890] [SQL] Fallback on sorting when writing many dynamic partitions · 49702bd7
      Michael Armbrust authored
      Previously, we would open a new file for each new dynamic partition written out using `HadoopFsRelation`.  For formats like parquet this is very costly due to the buffers required to get good compression.  In this PR I refactor the code, allowing us to fall back on an external sort when many partitions are seen.  As such each task will open no more than `spark.sql.sources.maxFiles` files.  I also did the following cleanup:
      
       - Instead of keying the file HashMap on an expensive to compute string representation of the partition, we now use a fairly cheap UnsafeProjection that avoids heap allocations.
       - The control flow for instantiating and invoking a writer container has been simplified.  Now instead of switching in two places based on the use of partitioning, the specific writer container must implement a single method `writeRows` that is invoked using `runJob`.
       - `InternalOutputWriter` has been removed.  Instead we have a `private[sql]` method `writeInternal` that converts and calls the public method.  This method can be overridden by internal data sources to avoid the conversion.  This change removes a lot of code duplication and per-row `asInstanceOf` checks.
       - `commands.scala` has been split up.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #8010 from marmbrus/fsWriting and squashes the following commits:
      
      00804fe [Michael Armbrust] use shuffleMemoryManager.pageSizeBytes
      775cc49 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into fsWriting
      17b690e [Michael Armbrust] remove comment
      40f0372 [Michael Armbrust] address comments
      f5675bd [Michael Armbrust] char -> string
      7e2d0a4 [Michael Armbrust] make sure we close current writer
      8100100 [Michael Armbrust] delete empty commands.scala
      71cc717 [Michael Armbrust] update comment
      8ec75ac [Michael Armbrust] [SPARK-8890][SQL] Fallback on sorting when writing many dynamic partitions
      49702bd7
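      A schematic of the fallback strategy in plain Scala, with made-up row and writer types rather than Spark's writer containers: keep one writer per partition key until a cap is hit, then defer the remaining rows and sort them by key so they can be written with only one file open at a time.

      ```scala
      import scala.collection.mutable

      object DynamicPartitionFallback {
        type PartitionKey = String
        case class Row(key: PartitionKey, value: Int)

        class Writer(val key: PartitionKey) {
          def write(r: Row): Unit = () // stand-in for a real output writer
          def close(): Unit = ()
        }

        def writeAll(rows: Iterator[Row], maxOpenFiles: Int): Unit = {
          val writers = mutable.Map.empty[PartitionKey, Writer]
          var deferred = List.empty[Row]

          rows.foreach { row =>
            writers.get(row.key) match {
              case Some(w) => w.write(row)
              case None if writers.size < maxOpenFiles =>
                val w = new Writer(row.key); writers(row.key) = w; w.write(row)
              case None =>
                deferred = row :: deferred // too many open files: defer to the sort-based path
            }
          }
          writers.values.foreach(_.close())

          // Fallback: sort deferred rows by partition key so each key's rows are contiguous,
          // then stream them out with a single writer open at any moment.
          var current: Option[Writer] = None
          deferred.sortBy(_.key).foreach { row =>
            if (!current.exists(_.key == row.key)) {
              current.foreach(_.close())
              current = Some(new Writer(row.key))
            }
            current.foreach(_.write(row))
          }
          current.foreach(_.close())
        }

        def main(args: Array[String]): Unit = {
          writeAll(Iterator(Row("a", 1), Row("b", 2), Row("c", 3), Row("a", 4)), maxOpenFiles = 2)
          println("done")
        }
      }
      ```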
    • Bertrand Dechoux's avatar
      [SPARK-9748] [MLLIB] Centriod typo in KMeansModel · 902334fd
      Bertrand Dechoux authored
      A minor typo (centriod -> centroid). Readable variable names help all users.
      
      Author: Bertrand Dechoux <BertrandDechoux@users.noreply.github.com>
      
      Closes #8037 from BertrandDechoux/kmeans-typo and squashes the following commits:
      
      47632fe [Bertrand Dechoux] centriod typo
      902334fd
    • Dariusz Kobylarz's avatar
      [SPARK-8481] [MLLIB] GaussianMixtureModel predict accepting single vector · e2fbbe73
      Dariusz Kobylarz authored
      Resubmit of [https://github.com/apache/spark/pull/6906] for adding single-vec predict to GMMs
      
      CC: dkobylarz  mengxr
      
      To be merged with master and branch-1.5
      Primary author: dkobylarz
      
      Author: Dariusz Kobylarz <darek.kobylarz@gmail.com>
      
      Closes #8039 from jkbradley/gmm-predict-vec and squashes the following commits:
      
      bfbedc4 [Dariusz Kobylarz] [SPARK-8481] [MLlib] GaussianMixtureModel predict accepting single vector
      e2fbbe73
    • Andrew Or's avatar
      [SPARK-9674] Re-enable ignored test in SQLQuerySuite · 881548ab
      Andrew Or authored
      The original code that this test tests is removed in https://github.com/apache/spark/commit/9270bd06fd0b16892e3f37213b5bc7813ea11fdd. It was ignored shortly before that so we never caught it. This patch re-enables the test and adds the code necessary to make it pass.
      
      JoshRosen yhuai
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #8015 from andrewor14/SPARK-9674 and squashes the following commits:
      
      225eac2 [Andrew Or] Merge branch 'master' of github.com:apache/spark into SPARK-9674
      8c24209 [Andrew Or] Fix NPE
      e541d64 [Andrew Or] Track aggregation memory for both sort and hash
      0be3a42 [Andrew Or] Fix test
      881548ab
    • Reynold Xin's avatar
      [SPARK-9733][SQL] Improve physical plan explain for data sources · 05d04e10
      Reynold Xin authored
      All data sources show up as "PhysicalRDD" in physical plan explain. It'd be better if we can show the name of the data source.
      
      Without this patch:
      ```
      == Physical Plan ==
      NewAggregate with UnsafeHybridAggregationIterator ArrayBuffer(date#0, cat#1) ArrayBuffer((sum(CAST((CAST(count#2, IntegerType) + 1), LongType))2,mode=Final,isDistinct=false))
       Exchange hashpartitioning(date#0,cat#1)
        NewAggregate with UnsafeHybridAggregationIterator ArrayBuffer(date#0, cat#1) ArrayBuffer((sum(CAST((CAST(count#2, IntegerType) + 1), LongType))2,mode=Partial,isDistinct=false))
         PhysicalRDD [date#0,cat#1,count#2], MapPartitionsRDD[3] at
      ```
      
      With this patch:
      ```
      == Physical Plan ==
      TungstenAggregate(key=[date#0,cat#1], value=[(sum(CAST((CAST(count#2, IntegerType) + 1), LongType)),mode=Final,isDistinct=false)]
       Exchange hashpartitioning(date#0,cat#1)
        TungstenAggregate(key=[date#0,cat#1], value=[(sum(CAST((CAST(count#2, IntegerType) + 1), LongType)),mode=Partial,isDistinct=false)]
         ConvertToUnsafe
          Scan ParquetRelation[file:/scratch/rxin/spark/sales4][date#0,cat#1,count#2]
      ```
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #8024 from rxin/SPARK-9733 and squashes the following commits:
      
      811b90e [Reynold Xin] Fixed Python test case.
      52cab77 [Reynold Xin] Cast.
      eea9ccc [Reynold Xin] Fix test case.
      fcecb22 [Reynold Xin] [SPARK-9733][SQL] Improve explain message for data source scan node.
      05d04e10
    • Reynold Xin's avatar
      [SPARK-9667][SQL] followup: Use GenerateUnsafeProjection.canSupport to test... · aeddeafc
      Reynold Xin authored
      [SPARK-9667][SQL] followup: Use GenerateUnsafeProjection.canSupport to test Exchange supported data types.
      
      This way we recursively test the data types.
      
      cc chenghao-intel
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #8036 from rxin/cansupport and squashes the following commits:
      
      f7302ff [Reynold Xin] Can GenerateUnsafeProjection.canSupport to test Exchange supported data types.
      aeddeafc
    • Reynold Xin's avatar
      [SPARK-9736] [SQL] JoinedRow.anyNull should delegate to the underlying rows. · 9897cc5e
      Reynold Xin authored
      JoinedRow.anyNull currently loops through every field to check for null, which is inefficient if the underlying rows are UnsafeRows. It should just delegate to the underlying implementation.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #8027 from rxin/SPARK-9736 and squashes the following commits:
      
      03a2e92 [Reynold Xin] Include all files.
      90f1add [Reynold Xin] [SPARK-9736][SQL] JoinedRow.anyNull should delegate to the underlying rows.
      9897cc5e
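      The idea, as a tiny sketch over a hypothetical joined-row wrapper rather than Spark's JoinedRow:

      ```scala
      object AnyNullDelegation {
        trait RowLike {
          def numFields: Int
          def isNullAt(i: Int): Boolean
          // Field-by-field fallback; an UnsafeRow can answer this from its null bitset instead.
          def anyNull: Boolean = (0 until numFields).exists(isNullAt)
        }

        class Joined(left: RowLike, right: RowLike) extends RowLike {
          def numFields: Int = left.numFields + right.numFields
          def isNullAt(i: Int): Boolean =
            if (i < left.numFields) left.isNullAt(i) else right.isNullAt(i - left.numFields)
          // Delegate instead of re-scanning every field of the combined row.
          override def anyNull: Boolean = left.anyNull || right.anyNull
        }

        def main(args: Array[String]): Unit = {
          val joined = new Joined(
            new RowLike { def numFields = 2; def isNullAt(i: Int) = false },
            new RowLike { def numFields = 1; def isNullAt(i: Int) = i == 0 })
          println(joined.anyNull) // true
        }
      }
      ```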
    • Wenchen Fan's avatar
      [SPARK-8382] [SQL] Improve Analysis Unit test framework · 2432c2e2
      Wenchen Fan authored
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #8025 from cloud-fan/analysis and squashes the following commits:
      
      51461b1 [Wenchen Fan] move test file to test folder
      ec88ace [Wenchen Fan] Improve Analysis Unit test framework
      2432c2e2
    • Reynold Xin's avatar
      [SPARK-9674][SPARK-9667] Remove SparkSqlSerializer2 · 76eaa701
      Reynold Xin authored
      It is now subsumed by various Tungsten operators.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7981 from rxin/SPARK-9674 and squashes the following commits:
      
      144f96e [Reynold Xin] Re-enable test
      58b7332 [Reynold Xin] Disable failing list.
      fb797e3 [Reynold Xin] Match all UDTs.
      be9f243 [Reynold Xin] Updated if.
      71fc99c [Reynold Xin] [SPARK-9674][SPARK-9667] Remove GeneratedAggregate & SparkSqlSerializer2.
      76eaa701
    • zsxwing's avatar
      [SPARK-9467][SQL]Add SQLMetric to specialize accumulators to avoid boxing · ebfd91c5
      zsxwing authored
      This PR adds SQLMetric/SQLMetricParam/SQLMetricValue to specialize accumulators to avoid boxing. All SQL metrics should use these classes rather than `Accumulator`.
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #7996 from zsxwing/sql-accu and squashes the following commits:
      
      14a5f0a [zsxwing] Address comments
      367ca23 [zsxwing] Use localValue directly to avoid changing Accumulable
      42f50c3 [zsxwing] Add SQLMetric to specialize accumulators to avoid boxing
      ebfd91c5
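      The boxing issue the new classes avoid can be shown with a generic versus a specialized counter in plain Scala (not the actual SQLMetric API):

      ```scala
      object MetricBoxing {
        // Generic version: because T erases to Object, every add allocates a boxed java.lang.Long.
        class GenericMetric[T](var value: T)(implicit num: Numeric[T]) {
          def add(v: T): Unit = { value = num.plus(value, v) }
        }

        // Specialized version: the counter stays an unboxed primitive long.
        class LongMetric(var value: Long) {
          def add(v: Long): Unit = { value += v }
        }

        def main(args: Array[String]): Unit = {
          val boxed = new GenericMetric[Long](0L)
          val unboxed = new LongMetric(0L)
          (1 to 1000000).foreach { _ => boxed.add(1L); unboxed.add(1L) }
          println(s"${boxed.value} ${unboxed.value}") // 1000000 1000000
        }
      }
      ```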