  1. Jun 20, 2016
    • [SPARK-14995][R] Add `since` tag in Roxygen documentation for SparkR API methods · d0eddb80
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR adds `since` tags to Roxygen documentation according to the previous documentation archive.
      
      https://home.apache.org/~dongjoon/spark-2.0.0-docs/api/R/
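
      For readers unfamiliar with the tag, the sketch below shows roughly what such an annotation looks like. The function `double_values` is hypothetical, and the exact `@note <name> since <version>` form is an assumption about the SparkR roxygen convention rather than something shown in this commit.

      ```r
      #' Double the values of a numeric vector (hypothetical helper, for illustration only).
      #'
      #' @param x a numeric vector.
      #' @return the input with every element doubled.
      #' @rdname double_values
      #' @export
      #' @note double_values since 2.0.0
      double_values <- function(x) {
        x * 2
      }
      ```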
      
      ## How was this patch tested?
      
      Manual.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #13734 from dongjoon-hyun/SPARK-14995.
    • [MINOR] Closing stale pull requests. · 92514232
      Sean Owen authored
      Closes #13114
      Closes #10187
      Closes #13432
      Closes #13550
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #13781 from srowen/CloseStalePR.
    • [SPARK-15159][SPARKR] SparkSession roxygen2 doc, programming guide, example updates · 359c2e82
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      roxygen2 doc, programming guide, example updates
      
      ## How was this patch tested?
      
      manual checks
      shivaram
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #13751 from felixcheung/rsparksessiondoc.
    • [SPARK-16053][R] Add `spark_partition_id` in SparkR · b0f2fb5b
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR adds the `spark_partition_id` virtual column function to SparkR for API parity.

      The following example illustrates its use in SparkR on a partitioned Parquet table created by `spark.range(10).write.mode("overwrite").parquet("/tmp/t1")`.
      ```r
      > collect(select(read.parquet('/tmp/t1'), c('id', spark_partition_id())))
         id SPARK_PARTITION_ID()
      1   3                    0
      2   4                    0
      3   8                    1
      4   9                    1
      5   0                    2
      6   1                    3
      7   2                    4
      8   5                    5
      9   6                    6
      10  7                    7
      ```
      
      ## How was this patch tested?
      
      Pass the Jenkins tests (including new testcase).
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #13768 from dongjoon-hyun/SPARK-16053.
    • [SPARKR] fix R roxygen2 doc for count on GroupedData · aee1420e
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      fix code doc
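
      For context, the method whose roxygen2 doc is being fixed is used as below: a minimal sketch, assuming a local Spark >= 2.0 session and a made-up data frame.

      ```r
      library(SparkR)
      sparkR.session()  # assumes the Spark >= 2.0 entry point

      df <- createDataFrame(data.frame(name = c("a", "a", "b"), value = 1:3))

      # count() on GroupedData returns the number of rows in each group
      head(count(groupBy(df, "name")))
      ```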
      
      ## How was this patch tested?
      
      manual
      
      shivaram
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #13782 from felixcheung/rcountdoc.
    • [SPARK-16028][SPARKR] spark.lapply can work with active context · 46d98e0a
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Make `spark.lapply` and `setLogLevel` work with the active SparkContext, instead of requiring a context to be passed in explicitly.
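
      A minimal usage sketch, assuming a local Spark >= 2.0 session; exact argument defaults may differ from the final API.

      ```r
      library(SparkR)
      sparkR.session()  # assumes the Spark >= 2.0 entry point

      # setLogLevel now targets the active context; no SparkContext argument is needed
      setLogLevel("WARN")

      # run an R function over a list of elements, distributed across the cluster
      squares <- spark.lapply(1:4, function(x) x * x)
      print(squares)
      ```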
      
      ## How was this patch tested?
      
      unit test
      
      shivaram thunterdb
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #13752 from felixcheung/rlapply.
    • [SPARK-16051][R] Add `read.orc/write.orc` to SparkR · c44bf137
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This issue adds `read.orc/write.orc` to SparkR for API parity.
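
      A minimal usage sketch, assuming a local Spark >= 2.0 session built with ORC support; the path is illustrative.

      ```r
      library(SparkR)
      sparkR.session()  # assumes the Spark >= 2.0 entry point

      df <- createDataFrame(data.frame(id = 1:3, name = c("a", "b", "c")))

      # round-trip a SparkDataFrame through ORC files
      write.orc(df, "/tmp/people.orc")
      people <- read.orc("/tmp/people.orc")
      head(people)
      ```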
      
      ## How was this patch tested?
      
      Pass the Jenkins tests (with new testcases).
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #13763 from dongjoon-hyun/SPARK-16051.
    • [SPARK-16029][SPARKR] SparkR add dropTempView and deprecate dropTempTable · 36e812d4
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Add dropTempView and deprecate dropTempTable
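
      A minimal sketch of the new API next to the deprecated one, assuming a local Spark >= 2.0 session.

      ```r
      library(SparkR)
      sparkR.session()  # assumes the Spark >= 2.0 entry point

      df <- createDataFrame(data.frame(id = 1:3))
      createOrReplaceTempView(df, "people")

      dropTempView("people")    # new API
      # dropTempTable("people") # still works, but emits a deprecation warning
      ```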
      
      ## How was this patch tested?
      
      unit tests
      
      shivaram liancheng
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #13753 from felixcheung/rdroptempview.
    • [SPARK-16059][R] Add `monotonically_increasing_id` function in SparkR · 96134248
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR adds the `monotonically_increasing_id` column function to SparkR for API parity.
      After this PR, SparkR supports the following.
      
      ```r
      > df <- read.json("examples/src/main/resources/people.json")
      > collect(select(df, monotonically_increasing_id(), df$name, df$age))
        monotonically_increasing_id()    name age
      1                             0 Michael  NA
      2                             1    Andy  30
      3                             2  Justin  19
      ```
      
      ## How was this patch tested?
      
      Pass the Jenkins tests (with added testcase).
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #13774 from dongjoon-hyun/SPARK-16059.
    • [SPARK-16050][TESTS] Remove the flaky test: ConsoleSinkSuite · 5cfabec8
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      ConsoleSinkSuite just collects content from stdout and compares it with the expected string. However, because Spark may not stop some background threads immediately, there is a race condition: other threads may still be writing logs to **stdout** while ConsoleSinkSuite is running, which makes the suite fail.

      Therefore, I deleted `ConsoleSinkSuite`. If we want to test the console sink in the future, we should refactor ConsoleSink to make it testable instead of depending on stdout; as it stands, the test is unreliable, so I removed it.
      
      ## How was this patch tested?
      
      Just removed a flaky test.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #13776 from zsxwing/SPARK-16050.
    • [SPARK-16030][SQL] Allow specifying static partitions when inserting to data source tables · 905f774b
      Yin Huai authored
      ## What changes were proposed in this pull request?
      This PR adds the static partition support to INSERT statement when the target table is a data source table.
      
      ## How was this patch tested?
      New tests in InsertIntoHiveTableSuite and DataSourceAnalysisSuite.
      
      **Note: This PR is based on https://github.com/apache/spark/pull/13766. The last commit is the actual change.**
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #13769 from yhuai/SPARK-16030-1.
  2. Jun 19, 2016
    • [SPARK-16036][SPARK-16037][SPARK-16034][SQL] Follow up code clean up and improvement · 6d0f921a
      Yin Huai authored
      ## What changes were proposed in this pull request?
      This PR is the follow-up PR for https://github.com/apache/spark/pull/13754/files and https://github.com/apache/spark/pull/13749. I will comment inline to explain my changes.
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #13766 from yhuai/caseSensitivity.
    • [SPARK-16031] Add debug-only socket source in Structured Streaming · 4f17fddc
      Matei Zaharia authored
      ## What changes were proposed in this pull request?
      
      This patch adds a text-based socket source similar to the one in Spark Streaming for debugging and tutorials. The source is clearly marked as debug-only so that users don't try to run it in production applications, because this type of source cannot provide HA without storing a lot of state in Spark.
      
      ## How was this patch tested?
      
      Unit tests and manual tests in spark-shell.
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #13748 from mateiz/socket-source.
    • [SPARK-16040][MLLIB][DOC] spark.mllib PIC document extra line of reference · 5930d7a2
      wm624@hotmail.com authored
      ## What changes were proposed in this pull request?
      
      In the 2.0 documentation, the line "A full example that produces the experiment described in the PIC paper can be found under examples/." is redundant.

      The page already says: Find full example code at "examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala" in the Spark repo.

      We should remove the first line, which keeps this page consistent with the other documents.
      
      ## How was this patch tested?
      
      Manual test
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #13755 from wangmiao1981/doc.
    • [SPARK-15942][REPL] Unblock `:reset` command in REPL. · 1b3a9b96
      Prashant Sharma authored
      ## What changes were proposed in this pull request?
      (Pasted from the JIRA issue.)
      As a follow-up for SPARK-15697, I propose the following semantics for the `:reset` command.
      On `:reset` we forget everything the user has done, but not the initialization of Spark. To make this clear, we show a message that `spark` and `sc` are not erased; in fact, they remain in the same state the user's previous operations left them in.
      While doing the above, I felt that this is not what reset usually means. But an accidental shutdown of a cluster can be very costly, so in that sense this behavior is less surprising and still useful.
      
      ## How was this patch tested?
      
      Manually, by calling `:reset` command, by both altering the state of SparkContext and creating some local variables.
      
      Author: Prashant Sharma <prashant@apache.org>
      Author: Prashant Sharma <prashsh1@in.ibm.com>
      
      Closes #13661 from ScrapCodes/repl-reset-command.
    • [SPARK-15613] [SQL] Fix incorrect days to millis conversion due to Daylight Saving Time · 001a5896
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      Internally, we use an Int to represent a date (the days since 1970-01-01). When we convert it into a Unix timestamp (milliseconds since the epoch in UTC), we look up the timezone offset using local millis (milliseconds since 1970-01-01 in that timezone), but TimeZone.getOffset() expects a Unix timestamp, so the result can be off by one hour depending on whether Daylight Saving Time (DST) is in effect.

      This PR changes the conversion to use a best-effort approximation of the POSIX timestamp to look up the offset. Around a DST transition, some local times are not defined (for example, 2016-03-13 02:00:00 PST) or map to multiple valid instants in UTC (for example, 2016-11-06 01:00:00); the best-effort approximation should be enough in practice.
      
      ## How was this patch tested?
      
      Added regression tests.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #13652 from davies/fix_timezone.
  3. Jun 18, 2016
    • [SPARK-16034][SQL] Checks the partition columns when calling... · ce3b98ba
      Sean Zhong authored
      [SPARK-16034][SQL] Checks the partition columns when calling dataFrame.write.mode("append").saveAsTable
      
      ## What changes were proposed in this pull request?
      
      `DataFrameWriter` can be used to append data to existing data source tables. It becomes tricky when partition columns used in `DataFrameWriter.partitionBy(columns)` don't match the actual partition columns of the underlying table. This pull request enforces the check so that the partition columns of these two always match.
      
      ## How was this patch tested?
      
      Unit test.
      
      Author: Sean Zhong <seanzhong@databricks.com>
      
      Closes #13749 from clockfly/SPARK-16034.
    • [SPARK-16036][SPARK-16037][SQL] fix various table insertion problems · 3d010c83
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      The current table insertion has some weird behaviours:
      
      1. inserting into a partitioned table with mismatched columns produces a confusing error message for Hive tables and wrong results for data source tables
      2. inserting into a partitioned table without a partition list produces wrong results for Hive tables.
      
      This PR fixes these 2 problems.
      
      ## How was this patch tested?
      
      new test in hive `SQLQuerySuite`
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #13754 from cloud-fan/insert2.
    • [SPARK-15973][PYSPARK] Fix GroupedData Documentation · e574c997
      Josh Howes authored
      *This contribution is my original work and I license the work to the project under the project's open source license.*
      
      ## What changes were proposed in this pull request?
      
      Documentation updates to PySpark's GroupedData
      
      ## How was this patch tested?
      
      Manual Tests
      
      Author: Josh Howes <josh.howes@gmail.com>
      Author: Josh Howes <josh.howes@maxpoint.com>
      
      Closes #13724 from josh-howes/bugfix/SPARK-15973.
    • [SPARK-16023][SQL] Move InMemoryRelation to its own file · 35a2f3c0
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      Improve readability of `InMemoryTableScanExec.scala`, which has too much stuff in it.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #13742 from andrewor14/move-inmemory-relation.
    • [SPARK-15803] [PYSPARK] Support with statement syntax for SparkSession · 898cb652
      Jeff Zhang authored
      ## What changes were proposed in this pull request?
      
      Support with statement syntax for SparkSession in pyspark
      
      ## How was this patch tested?
      
      Manually verified it. Although I could add a unit test for it, it would affect other unit tests because the SparkContext is stopped after the with statement.
      
      Author: Jeff Zhang <zjffdu@apache.org>
      
      Closes #13541 from zjffdu/SPARK-15803.
    • [SPARK-16035][PYSPARK] Fix SparseVector parser assertion for end parenthesis · 4c64e88d
      andreapasqua authored
      ## What changes were proposed in this pull request?
      The check on the end parenthesis of the expression to parse was using the wrong variable. I corrected that.
      ## How was this patch tested?
      Manual test
      
      Author: andreapasqua <andrea@radius.com>
      
      Closes #13750 from andreapasqua/sparse-vector-parser-assertion-fix.
  4. Jun 17, 2016
    • [SPARK-16020][SQL] Fix complete mode aggregation with console sink · d0ac0e6f
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      We cannot use `limit` on a DataFrame in ConsoleSink because it would use the wrong planner. This PR just collects the `DataFrame` and calls `show` on a batch DataFrame built from the result. This is fine since ConsoleSink is only for debugging.
      
      ## How was this patch tested?
      
      Manually confirmed ConsoleSink now works with complete mode aggregation.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #13740 from zsxwing/complete-console.
    • [SPARK-15159][SPARKR] SparkR SparkSession API · 8c198e24
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      This PR introduces the new SparkSession API for SparkR.
      `sparkR.session.getOrCreate()` and `sparkR.session.stop()`
      
      "getOrCreate" is a bit unusual in R but it's important to name this clearly.
      
      SparkR implementation should
      - SparkSession is the main entrypoint (vs SparkContext; due to limited functionality supported with SparkContext in SparkR)
      - SparkSession replaces SQLContext and HiveContext (both a wrapper around SparkSession, and because of API changes, supporting all 3 would be a lot more work)
      - Changes to SparkSession are mostly transparent to users due to SPARK-10903
      - Full backward compatibility is expected - users should be able to initialize everything just as in Spark 1.6.1 (`sparkR.init()`), but with a deprecation warning
      - Mostly cosmetic changes to parameter list - users should be able to move to `sparkR.session.getOrCreate()` easily
      - An advanced syntax with named parameters (aka varargs, aka "...") is supported; that should be closer to the builder syntax in Scala/Python (which unfortunately does not work in R because it would look like this: `enableHiveSupport(config(config(master(appName(builder(), "foo"), "local"), "first", "value"), "next", "value"))`)
      - Updating config on an existing SparkSession is supported, the behavior is the same as Python, in which config is applied to both SparkContext and SparkSession
      - Some SparkSession changes are not matched in SparkR, mostly because it would be a breaking API change: `catalog` object, `createOrReplaceTempView`
      - Other SQLContext workarounds are replicated in SparkR, eg. `tables`, `tableNames`
      - `sparkR` shell is updated to use the SparkSession entrypoint (`sqlContext` is removed, just like with Scala/Python)
      - All tests are updated to use the SparkSession entrypoint
      - A bug in `read.jdbc` is fixed
      
      TODO
      - [x] Add more tests
      - [ ] Separate PR - update all roxygen2 doc coding example
      - [ ] Separate PR - update SparkR programming guide
      
      ## How was this patch tested?
      
      unit tests, manual tests
      
      shivaram sun-rui rxin
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      Author: felixcheung <felixcheung_m@hotmail.com>
      
      Closes #13635 from felixcheung/rsparksession.
    • [SPARK-15946][MLLIB] Conversion between old/new vector columns in a DataFrame (Python) · edb23f9e
      Xiangrui Meng authored
      ## What changes were proposed in this pull request?
      
      This PR implements python wrappers for #13662 to convert old/new vector columns in a DataFrame.
      
      ## How was this patch tested?
      
      doctest in Python
      
      cc: yanboliang
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #13731 from mengxr/SPARK-15946.
    • [SPARK-15129][R][DOC] R API changes in ML · af2a4b08
      GayathriMurali authored
      ## What changes were proposed in this pull request?
      
      Make user guide changes to the SparkR documentation for all Machine Learning API changes that happened in 2.0.
      
      Author: GayathriMurali <gayathri.m@intel.com>
      
      Closes #13285 from GayathriMurali/SPARK-15129.
    • [SPARK-16033][SQL] insertInto() can't be used together with partitionBy() · 10b67144
      Cheng Lian authored
      ## What changes were proposed in this pull request?
      
      When inserting into an existing partitioned table, partitioning columns should always be determined by the catalog metadata of the table being inserted into. Extra `partitionBy()` calls don't make sense, and they mess up existing data because newly inserted data may be written with the wrong partition directory layout.
      
      ## How was this patch tested?
      
      New test case added in `InsertIntoHiveTableSuite`.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #13747 from liancheng/spark-16033-insert-into-without-partition-by.
    • [SPARK-15916][SQL] JDBC filter push down should respect operator precedence · ebb9a3b6
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR fixes the problem that operator precedence is mishandled when pushing a where-clause expression down to the JDBC layer.
      
      **Case 1:**
      
      For the SQL `select * from table where (a or b) and c`, the where-clause is wrongly converted to the JDBC where-clause `a or (b and c)` after filter push-down. The consequence is that JDBC may return fewer or more rows than expected.
      
      **Case 2:**
      
      For the SQL `select * from table where always_false_condition`, the result table may not be empty if the JDBC RDD is partitioned using where-clauses:
      ```
      spark.read.jdbc(url, table, predicates = Array("partition 1 where clause", "partition 2 where clause"...))
      ```
      
      ## How was this patch tested?
      
      Unit test.
      
      This PR also closes #13640.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      Author: Sean Zhong <seanzhong@databricks.com>
      
      Closes #13743 from clockfly/SPARK-15916.
    • [SPARK-16005][R] Add `randomSplit` to SparkR · 7d65a0db
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR adds `randomSplit` to SparkR for API parity.
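
      A minimal usage sketch, assuming a local Spark >= 2.0 session.

      ```r
      library(SparkR)
      sparkR.session()  # assumes the Spark >= 2.0 entry point

      df <- createDataFrame(data.frame(id = 1:100))

      # split into roughly 70%/30% pieces; the seed makes the split reproducible
      splits <- randomSplit(df, weights = c(0.7, 0.3), seed = 42)
      c(count(splits[[1]]), count(splits[[2]]))
      ```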
      
      ## How was this patch tested?
      
      Pass the Jenkins tests (with new testcase.)
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #13721 from dongjoon-hyun/SPARK-16005.
    • [SPARK-15925][SPARKR] R DataFrame add back registerTempTable, add tests · ef3cc4fc
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Add registerTempTable back to the DataFrame API, marked as deprecated.
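
      A minimal sketch contrasting the re-added method with its replacement, assuming a local Spark >= 2.0 session.

      ```r
      library(SparkR)
      sparkR.session()  # assumes the Spark >= 2.0 entry point

      df <- createDataFrame(data.frame(id = 1:3))

      registerTempTable(df, "people")        # re-added, but emits a deprecation warning
      createOrReplaceTempView(df, "people")  # preferred replacement

      head(sql("SELECT * FROM people"))
      ```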
      
      ## How was this patch tested?
      
      unit tests
      shivaram liancheng
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #13722 from felixcheung/rregistertemptable.
    • [SPARK-16014][SQL] Rename optimizer rules to be more consistent · 1a65e62a
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This small patch renames a few optimizer rules to make the naming more consistent, e.g., class names start with a verb. The most important "fix" is probably SamplePushDown -> PushProjectThroughSample. SamplePushDown is actually the wrong name, since the rule is not about pushing Sample down.
      
      ## How was this patch tested?
      Updated test cases.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #13732 from rxin/SPARK-16014.
    • [SPARK-16017][CORE] Send hostname from CoarseGrainedExecutorBackend to driver · 62d8fe20
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      [SPARK-15395](https://issues.apache.org/jira/browse/SPARK-15395) changed how the driver gets the executor host: the driver now gets the executor IP address instead of the host name. This PR just sends the hostname from the executors to the driver so that the driver can pass it to the TaskScheduler.
      
      ## How was this patch tested?
      
      Existing unit tests.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #13741 from zsxwing/SPARK-16017.
    • [SPARK-16018][SHUFFLE] Shade netty to load shuffle jar in NodeManager · 298c4ae8
      Dhruve Ashar authored
      ## What changes were proposed in this pull request?
      Shade the netty.io namespace so that we can use it in the shuffle service independently of the dependencies pulled in by the Hadoop jars.
      
      ## How was this patch tested?
      Ran a decent job involving shuffle writes/reads and tested the new spark-x-yarn-shuffle jar. After shading the netty.io namespace, the NodeManager loads the jar and the shuffle job completes successfully.
      
      Author: Dhruve Ashar <dhruveashar@gmail.com>
      
      Closes #13739 from dhruve/bug/SPARK-16018.
    • [SPARK-15926] Improve readability of DAGScheduler stage creation methods · c8809db5
      Kay Ousterhout authored
      ## What changes were proposed in this pull request?
      
      This pull request refactors parts of the DAGScheduler to improve readability, focusing on the code around stage creation. One goal of this change is to make it clearer which functions may create new stages (as opposed to looking up stages that already exist). There are no functionality changes in this pull request. In more detail:
      
      * shuffleToMapStage was renamed to shuffleIdToMapStage (when reading the existing code I have sometimes struggled to remember what the key is -- is it a stage? A stage id? This change is intended to avoid that confusion)
      * Cleaned up the code to create shuffle map stages.  Previously, creating a shuffle map stage involved 3 different functions (newOrUsedShuffleStage, newShuffleMapStage, and getShuffleMapStage), and it wasn't clear what the purpose of each function was.  With the new code, a single function (getOrCreateShuffleMapStage) is responsible for getting a stage (if it already exists) or creating new shuffle map stages and any missing ancestor stages, and it delegates to createShuffleMapStage when new stages need to be created.  There's some remaining confusion here because the getOrCreateParentStages call in createShuffleMapStage may recursively create ancestor stages; this is an issue I plan to fix in a future pull request, because it's trickier to fix and involves a slight functionality change.
      * newResultStage was renamed to createResultStage, for consistency with naming around shuffle map stages.
      * getParentStages has been renamed to getOrCreateParentStages, to make it clear that this function will sometimes create missing ancestor stages.
      * The only *slight* functionality change is that on line 478, updateJobIdStageIdMaps now uses a stage's parents instance variable rather than re-calculating them (I couldn't see any reason why they'd need to be re-calculated, and suspect this is just leftover from older code).
      * getAncestorShuffleDependencies was renamed to getMissingAncestorShuffleDependencies, to make it clear that this only returns dependencies that have not yet been run.
      
      cc squito markhamstra JoshRosen (who requested more DAG scheduler commenting long ago -- an issue this pull request tries, in part, to address)
      
      FYI rxin
      
      Author: Kay Ousterhout <kayousterhout@gmail.com>
      
      Closes #13677 from kayousterhout/SPARK-15926.
    • [SPARK-16008][ML] Remove unnecessary serialization in logistic regression · 1f0a4695
      sethah authored
      JIRA: [SPARK-16008](https://issues.apache.org/jira/browse/SPARK-16008)
      
      ## What changes were proposed in this pull request?
      `LogisticAggregator` stores references to two arrays of dimension `numFeatures`, which are serialized before the combine op unnecessarily. This results in the shuffle write being ~3x larger than it should be (for multiclass logistic regression, this factor will go up); in MLlib, for instance, the shuffle write is 3x smaller.
      
      This patch modifies `LogisticAggregator.add` to accept the two arrays as method parameters which avoids the serialization.
      
      ## How was this patch tested?
      
      I tested this locally and verified the serialization reduction.
      
      ![image](https://cloud.githubusercontent.com/assets/7275795/16140387/d2974bac-3404-11e6-94f9-268860c931a2.png)
      
      Additionally, I ran some tests of a 4 node cluster (4x48 cores, 4x128 GB RAM). Data set size of 2M rows and 10k features showed >2x iteration speedup.
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #13729 from sethah/lr_improvement.
    • Remove non-obvious conf settings from TPCDS benchmark · 34d6c4cd
      Sameer Agarwal authored
      ## What changes were proposed in this pull request?
      
      My fault -- these 2 conf entries are mysteriously hidden inside the benchmark code and make it non-obvious how to disable whole-stage codegen and/or the vectorized Parquet reader.

      PS: Didn't attach a JIRA as this change should otherwise be a no-op (both of these confs are enabled by default in Spark).
      
      ## How was this patch tested?
      
      N/A
      
      Author: Sameer Agarwal <sameer@databricks.com>
      
      Closes #13726 from sameeragarwal/tpcds-conf.
    • [SPARK-15811][SQL] fix the Python UDF in Scala 2.10 · ef43b4ed
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      An Iterator can't be serialized in Scala 2.10; we should force it into an array to make sure it can be serialized.
      
      ## How was this patch tested?
      
      Build with Scala 2.10 and ran all the Python unit tests manually (will be covered by a jenkins build).
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #13717 from davies/fix_udf_210.
    • [SPARK-15706][SQL] Fix Wrong Answer when using IF NOT EXISTS in INSERT... · e5d703bc
      gatorsmile authored
      [SPARK-15706][SQL] Fix Wrong Answer when using IF NOT EXISTS in INSERT OVERWRITE for DYNAMIC PARTITION
      
      #### What changes were proposed in this pull request?
      `IF NOT EXISTS` in `INSERT OVERWRITE` should not support dynamic partitions. If we specify `IF NOT EXISTS` in that case, the inserted data does not show up in the table.

      This PR issues an exception in this case, just like Hive does. It also issues an exception if users specify `IF NOT EXISTS` without any `PARTITION` specification.
      
      #### How was this patch tested?
      Added test cases into `PlanParserSuite` and `InsertIntoHiveTableSuite`
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #13447 from gatorsmile/insertIfNotExist.
    • [SPARK-15822] [SQL] Prevent byte array backed classes from referencing freed memory · 5ada6061
      Pete Robbins authored
      ## What changes were proposed in this pull request?
      `UTF8String` and all `Unsafe*` classes are backed by either on-heap or off-heap byte arrays. The code-generated version of `SortMergeJoin` buffers the left-hand side join keys during iteration. This was actually problematic in off-heap mode when one of the keys is a `UTF8String` (or any other `Unsafe*` object) and the left-hand side iterator was exhausted (and released its memory); the buffered keys would reference freed memory. This causes segfaults and all kinds of other undefined behavior when we use one of these buffered keys.
      
      This PR fixes this problem by creating copies of the buffered variables. I have added a general method to the `CodeGenerator` for this. I have checked all places in which this could happen, and only `SortMergeJoin` had this problem.
      
      This PR is largely based on the work of robbinspg and he should be credited for this.
      
      closes https://github.com/apache/spark/pull/13707
      
      ## How was this patch tested?
      Manually tested on problematic workloads.
      
      Author: Pete Robbins <robbinspg@gmail.com>
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #13723 from hvanhovell/SPARK-15822-2.