  1. Jan 15, 2016
    • Hossein's avatar
      [SPARK-12833][SQL] Initial import of spark-csv · 5f83c699
      Hossein authored
      CSV is the most common data format in the "small data" world. It is often the first format people want to try when they see Spark on a single node. Having to rely on a 3rd party component for this leads to poor user experience for new users. This PR merges the popular spark-csv data source package (https://github.com/databricks/spark-csv) with SparkSQL.
      
      This is a first PR to bring the functionality to Spark 2.0 master. We will complete the items outlined in the design document (see JIRA attachment) in follow-up pull requests.
      
      Author: Hossein <hossein@databricks.com>
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #10766 from rxin/csv.
      5f83c699
    • Davies Liu's avatar
      [MINOR] [SQL] GeneratedExpressionCode -> ExprCode · c5e7076d
      Davies Liu authored
      GeneratedExpressionCode is too long
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #10767 from davies/renaming.
      c5e7076d
    • Oscar D. Lara Yejas's avatar
      [SPARK-11031][SPARKR] Method str() on a DataFrame · ba4a6419
      Oscar D. Lara Yejas authored
      Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.usca.ibm.com>
      Author: Oscar D. Lara Yejas <olarayej@mail.usf.edu>
      Author: Oscar D. Lara Yejas <oscar.lara.yejas@us.ibm.com>
      Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.attlocal.net>
      
      Closes #9613 from olarayej/SPARK-11031.
      ba4a6419
    • Tom Graves's avatar
      [SPARK-2930] clarify docs on using webhdfs with spark.yarn.access.namenodes · 96fb894d
      Tom Graves authored
      
      Author: Tom Graves <tgraves@yahoo-inc.com>
      
      Closes #10699 from tgravescs/SPARK-2930.
      96fb894d
    • Jason Lee's avatar
      [SPARK-12655][GRAPHX] GraphX does not unpersist RDDs · d0a5c32b
      Jason Lee authored
      Some VertexRDD and EdgeRDD are created during the intermediate step of g.connectedComponents() but unnecessarily left cached after the method is done. The fix is to unpersist these RDDs once they are no longer in use.
      
      A test case is added to confirm the fix for the reported bug.
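
The unpersist pattern of the fix can be sketched with a stand-in cacheable object (illustrative Python; the real change unpersists the intermediate VertexRDD/EdgeRDD inside GraphX's Scala code):

```python
class CacheableRDD:
    """Stand-in for an RDD that tracks whether it is cached."""

    def __init__(self, data):
        self.data = data
        self.cached = False

    def cache(self):
        self.cached = True
        return self

    def unpersist(self):
        self.cached = False
        return self


def connected_components(data):
    # Intermediate RDDs are cached for reuse across iterations...
    intermediate = CacheableRDD(data).cache()
    result = CacheableRDD(intermediate.data)
    # ...and explicitly unpersisted once the result is built, instead of
    # being left cached after the method returns (the reported leak).
    intermediate.unpersist()
    return result
```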
      
      Author: Jason Lee <cjlee@us.ibm.com>
      
      Closes #10713 from jasoncl/SPARK-12655.
      d0a5c32b
    • Reynold Xin's avatar
      [SPARK-12830] Java style: disallow trailing whitespaces. · fe7246fe
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #10764 from rxin/SPARK-12830.
      fe7246fe
  2. Jan 14, 2016
    • Reynold Xin's avatar
      [SPARK-12829] Turn Java style checker on · 591c88c9
      Reynold Xin authored
      It was previously turned off because there was a problem with a pull request. We should turn it on now.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #10763 from rxin/SPARK-12829.
      591c88c9
    • Koyo Yoshida's avatar
      [SPARK-12708][UI] Sorting task error in Stages Page when yarn mode. · 32cca933
      Koyo Yoshida authored
      If the sort column contains a slash (e.g. "Executor ID / Host") in yarn mode, sorting fails with the following message.
      
      ![spark-12708](https://cloud.githubusercontent.com/assets/6679275/12193320/80814f8c-b62a-11e5-9914-7bf3907029df.png)
      
      It's similar to SPARK-4313.
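
A sketch of the kind of fix such sorting bugs typically need (an assumption on my part, not necessarily the actual patch): the column name must be URL-encoded before being embedded in the sort link, since an unescaped slash breaks the request path.

```python
from urllib.parse import quote, unquote

column = "Executor ID / Host"

# Escape the slash (and spaces) so the sort parameter survives the
# round trip through the URL.
encoded = quote(column, safe="")
decoded = unquote(encoded)
```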
      
      Author: root <root@R520T1.(none)>
      Author: Koyo Yoshida <koyo0615@gmail.com>
      
      Closes #10663 from yoshidakuy/SPARK-12708.
      32cca933
    • Michael Armbrust's avatar
      [SPARK-12813][SQL] Eliminate serialization for back to back operations · cc7af86a
      Michael Armbrust authored
      The goal of this PR is to eliminate unnecessary translations when there are back-to-back `MapPartitions` operations.  In order to achieve this I also made the following simplifications:
      
       - Operators no longer hold encoders; instead they have only the expressions that they need. The benefits here are twofold: the expressions are visible to transformations, so they go through the normal resolution/binding process, and now that they are visible we can change them on a case-by-case basis.
       - Operators no longer have type parameters. Since the engine is responsible for its own type checking, having the types visible to the compiler was an unnecessary complication. We still leverage the Scala compiler in the companion factory when constructing a new operator, but after that the types are discarded.
      
      Deferred to a follow up PR:
       - Remove as much of the resolution/binding from Dataset/GroupedDataset as possible. We should still eagerly check resolution and throw an error though in the case of mismatches for an `as` operation.
       - Eliminate serializations in more cases by adding more cases to `EliminateSerialization`
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #10747 from marmbrus/encoderExpressions.
      cc7af86a
    • Josh Rosen's avatar
      [SPARK-12174] Speed up BlockManagerSuite getRemoteBytes() test · 25782981
      Josh Rosen authored
      This patch significantly speeds up the BlockManagerSuite's "SPARK-9591: getRemoteBytes from another location when Exception throw" test, reducing the test time from 45s to ~250ms. The key change was to set `spark.shuffle.io.maxRetries` to 0 (the code previously set `spark.network.timeout` to `2s`, but this didn't make a difference because the slowdown was not due to this timeout).
      
      Along the way, I also cleaned up the way that we handle SparkConf in BlockManagerSuite: previously, each test would mutate a shared SparkConf instance, while now each test gets a fresh SparkConf.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #10759 from JoshRosen/SPARK-12174.
      25782981
    • Kousuke Saruta's avatar
      [SPARK-12821][BUILD] Style checker should run when configuration files for style are modified but no source files are · bcc7373f
      Kousuke Saruta authored

      When running the `run-tests` script, the style checkers run only when source files are modified; they should also run when configuration files related to style are modified.
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #10754 from sarutak/SPARK-12821.
      bcc7373f
    • Reynold Xin's avatar
      [SPARK-12771][SQL] Simplify CaseWhen code generation · 902667fd
      Reynold Xin authored
      The generated code for CaseWhen uses a control variable "got" to make sure we do not evaluate more branches once a branch is true. Changing that to generate just simple "if / else" would be slightly more efficient.
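
A toy generator makes the difference concrete (hypothetical, not Spark's actual codegen): chaining plain `else` skips later branches once one matches, with no `got` flag.

```python
def gen_case_when(branches, else_value="null"):
    """branches: list of (condition_code, value_code) pairs."""
    parts = []
    for cond, val in branches:
        parts.append(f"if ({cond}) {{ result = {val}; }}")
    parts.append(f"{{ result = {else_value}; }}")
    # Joining with "else" means later branches are never evaluated once
    # one condition holds, without any extra control variable.
    return " else ".join(parts)
```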
      
      This closes #10737.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #10755 from rxin/SPARK-12771.
      902667fd
    • Shixiong Zhu's avatar
      [SPARK-12784][UI] Fix Spark UI IndexOutOfBoundsException with dynamic allocation · 501e99ef
      Shixiong Zhu authored
      Add `listener.synchronized` to get `storageStatusList` and `execInfo` atomically.
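
The idea, sketched in Python with a `threading.Lock` standing in for `listener.synchronized` (class and field names are illustrative):

```python
import threading


class ExecutorsListener:
    def __init__(self):
        self._lock = threading.Lock()
        self.storage_status_list = []
        self.exec_info = []

    def update(self, status, info):
        with self._lock:
            self.storage_status_list.append(status)
            self.exec_info.append(info)

    def snapshot(self):
        # Read both collections under the same lock so they stay mutually
        # consistent; reading them separately can race with dynamic
        # allocation removing executors and trigger an index error.
        with self._lock:
            return list(self.storage_status_list), list(self.exec_info)
```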
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #10728 from zsxwing/SPARK-12784.
      501e99ef
    • Bryan Cutler's avatar
      [SPARK-9844][CORE] File appender race condition during shutdown · 56cdbd65
      Bryan Cutler authored
      When an Executor process is destroyed, the FileAppender that is asynchronously reading the stderr stream of the process can throw an IOException during read because the stream is closed.  Before the ExecutorRunner destroys the process, the FileAppender thread is flagged to stop.  This PR wraps the inputStream.read call of the FileAppender in a try/catch block so that if an IOException is thrown and the thread has been flagged to stop, it will safely ignore the exception.  Additionally, the FileAppender thread was changed to use Utils.tryWithSafeFinally to better log any exceptions that do occur.  Added unit tests to verify that an IOException is thrown and logged if the FileAppender is not flagged to stop, and that no IOException is raised when the flag is set.
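
The described pattern, sketched in Python (the real code is Scala; names here are illustrative):

```python
class FileAppender:
    def __init__(self, stream):
        self.stream = stream
        self.marked_for_stop = False

    def append_loop(self):
        while True:
            try:
                chunk = self.stream.read(8192)
            except IOError:
                if self.marked_for_stop:
                    # The stream was closed as part of shutdown:
                    # safe to ignore and exit quietly.
                    return
                raise  # a genuine error: propagate so it gets logged
            if not chunk:
                return
            # ... append chunk to the log file ...
```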
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #10714 from BryanCutler/file-appender-read-ioexception-SPARK-9844.
      56cdbd65
    • Jeff Zhang's avatar
      [SPARK-12707][SPARK SUBMIT] Remove submit python/R scripts through pyspark/sparkR · 8f13cd4c
      Jeff Zhang authored
      
      Author: Jeff Zhang <zjffdu@apache.org>
      
      Closes #10658 from zjffdu/SPARK-12707.
      8f13cd4c
    • Wenchen Fan's avatar
      [SPARK-12756][SQL] use hash expression in Exchange · 962e9bcf
      Wenchen Fan authored
      This PR makes bucketing and exchange share one common hash algorithm, so that we can guarantee the data distribution is the same between shuffle and bucketed data source, which enables us to shuffle only one side when joining a bucketed table with a normal one.
      
      This PR also fixes the tests that are broken by the new hash behaviour in shuffle.
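
The invariant can be sketched like this (toy hash function, hypothetical names): as long as the bucketed writer and the shuffle exchange compute partition ids with the same function, a bucketed table does not need re-shuffling when joined.

```python
def partition_id(key, num_partitions):
    # Stand-in for the shared hash expression (Spark's is Murmur3-based);
    # the point is that both sides call the *same* function.
    return hash(key) % num_partitions


def bucket_for_write(key, n):
    # Used when writing a bucketed table.
    return partition_id(key, n)


def shuffle_target(key, n):
    # Used by the exchange when shuffling for a join.
    return partition_id(key, n)
```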
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #10703 from cloud-fan/use-hash-expr-in-shuffle.
      962e9bcf
  3. Jan 13, 2016
    • Josh Rosen's avatar
      [SPARK-12819] Deprecate TaskContext.isRunningLocally() · e2ae7bd0
      Josh Rosen authored
      We've already removed local execution but didn't deprecate `TaskContext.isRunningLocally()`; we should deprecate it for 2.0.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #10751 from JoshRosen/remove-local-exec-from-taskcontext.
      e2ae7bd0
    • Joseph K. Bradley's avatar
      [SPARK-12703][MLLIB][DOC][PYTHON] Fixed pyspark.mllib.clustering.KMeans user guide example · 20d8ef85
      Joseph K. Bradley authored
      Fixed WSSSE computeCost in Python mllib KMeans user guide example by using new computeCost method API in Python.
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #10707 from jkbradley/kmeans-doc-fix.
      20d8ef85
    • Yuhao Yang's avatar
      [SPARK-12026][MLLIB] ChiSqTest gets slower and slower over time when number of features is large · 021dafc6
      Yuhao Yang authored
      jira: https://issues.apache.org/jira/browse/SPARK-12026
      
      The issue is valid as `features.toArray.view.zipWithIndex.slice(startCol, endCol)` becomes slower as startCol gets larger.
      
      I tested locally; the change improves the performance and keeps the running time stable.
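
The performance trap and its fix generalize beyond ChiSqTest (illustrative Python): repeatedly slicing from the front costs O(startCol) per call as startCol grows, while a single forward pass over an iterator does not.

```python
from itertools import islice


def column_chunks(features, chunk_size):
    # One forward pass: each chunk resumes where the previous one stopped,
    # instead of re-traversing the prefix via slice(startCol, endCol).
    it = iter(enumerate(features))
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk
```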
      
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #10146 from hhbyyh/chiSq.
      021dafc6
    • jerryshao's avatar
      [SPARK-12400][SHUFFLE] Avoid generating temp shuffle files for empty partitions · cd81fc9e
      jerryshao authored
      This problem lies in `BypassMergeSortShuffleWriter`: an empty partition also generates a temp shuffle file of several bytes. The change here is to create the file only when the partition is not empty.
      
      The problem is confined to this writer; there is no such issue in `HashShuffleWriter`.
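
A sketch of the guard (illustrative Python, not the Scala writer):

```python
def write_shuffle_files(partitions, open_file):
    """partitions: mapping of partition id -> list of records.
    open_file: callable that creates a temp shuffle file for a partition."""
    files = {}
    for pid, records in partitions.items():
        if not records:
            # Skip empty partitions entirely: previously even these
            # produced a temp shuffle file of several bytes.
            continue
        f = open_file(pid)
        f.extend(records)
        files[pid] = f
    return files
```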
      
      Please help to review, thanks a lot.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #10376 from jerryshao/SPARK-12400.
      cd81fc9e
    • Carson Wang's avatar
      [SPARK-12690][CORE] Fix NPE in UnsafeInMemorySorter.free() · eabc7b8e
      Carson Wang authored
      I hit the exception below. The `UnsafeKVExternalSorter` does pass `null` as the consumer when creating an `UnsafeInMemorySorter`. Normally the NPE doesn't occur because the `inMemSorter` is set to null later and the `free()` method is not called. It happens when there is another exception like OOM thrown before setting `inMemSorter` to null. Anyway, we can add the null check to avoid it.
      
      ```
      ERROR spark.TaskContextImpl: Error in TaskCompletionListener
      java.lang.NullPointerException
              at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.free(UnsafeInMemorySorter.java:110)
              at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.cleanupResources(UnsafeExternalSorter.java:288)
              at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter$1.onTaskCompletion(UnsafeExternalSorter.java:141)
              at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:79)
              at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:77)
              at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
              at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
              at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:77)
              at org.apache.spark.scheduler.Task.run(Task.scala:91)
              at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
              at java.lang.Thread.run(Thread.java:722)
      ```
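
The guard itself is small; a Python sketch mirroring the Java (names illustrative):

```python
class UnsafeInMemorySorter:
    def __init__(self, consumer, array):
        # consumer may legitimately be None: UnsafeKVExternalSorter
        # passes null when creating the sorter.
        self.consumer = consumer
        self.array = array

    def free(self):
        # Null check added by the fix: without it, free() dereferences a
        # null consumer when an earlier exception (e.g. OOM) prevented
        # inMemSorter from being nulled out.
        if self.consumer is not None and self.array is not None:
            self.consumer.free_array(self.array)
        self.array = None
```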
      
      Author: Carson Wang <carson.wang@intel.com>
      
      Closes #10637 from carsonwang/FixNPE.
      eabc7b8e
    • Reynold Xin's avatar
      [SPARK-12791][SQL] Simplify CaseWhen by breaking "branches" into "conditions" and "values" · cbbcd8e4
      Reynold Xin authored
      This pull request rewrites CaseWhen expression to break the single, monolithic "branches" field into a sequence of tuples (Seq[(condition, value)]) and an explicit optional elseValue field.
      
      Prior to this pull request, each even position in "branches" represented the condition for a branch, and each odd position the corresponding value. Using them has been pretty confusing, requiring a lot of sliding-window or grouped(2) calls.
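
The restructuring can be sketched as a small conversion from the old flat layout to the new one:

```python
def split_branches(flat):
    """Convert the old flat layout [c1, v1, c2, v2, ..., (else)] into
    (list of (condition, value) pairs, optional else value)."""
    # zip truncates the unmatched trailing element, if any.
    pairs = list(zip(flat[0::2], flat[1::2]))
    else_value = flat[-1] if len(flat) % 2 == 1 else None
    return pairs, else_value
```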
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #10734 from rxin/simplify-case.
      cbbcd8e4
    • Wenchen Fan's avatar
      [SPARK-12642][SQL] improve the hash expression to be decoupled from unsafe row · c2ea79f9
      Wenchen Fan authored
      https://issues.apache.org/jira/browse/SPARK-12642
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #10694 from cloud-fan/hash-expr.
      c2ea79f9
    • Erik Selin's avatar
      [SPARK-12268][PYSPARK] Make pyspark shell pythonstartup work under python3 · e4e0b3f7
      Erik Selin authored
      This replaces the `execfile` used for running custom python shell scripts
      with explicit open, compile and exec (as recommended by 2to3). The reason
      for this change is to make the pythonstartup option compatible with python3.
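
The replacement pattern looks like this (the `run_startup_file` name is illustrative; the shell passes whatever PYTHONSTARTUP points at):

```python
def run_startup_file(path, globals_dict):
    # Python 3 has no execfile(); open/compile/exec is the equivalent
    # recommended by 2to3, and it also works on Python 2.
    with open(path) as f:
        code = compile(f.read(), path, "exec")
    exec(code, globals_dict)
```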
      
      Author: Erik Selin <erik.selin@gmail.com>
      
      Closes #10255 from tyro89/pythonstartup-python3.
      e4e0b3f7
    • Josh Rosen's avatar
      [SPARK-9383][PROJECT-INFRA] PR merge script should reset back to previous branch when possible · 97e0c7c5
      Josh Rosen authored
      This patch modifies our PR merge script to reset back to a named branch when restoring the original checkout upon exit. When the committer is originally checked out to a detached head, then they will be restored back to that same ref (the same as today's behavior).
      
      This is a slightly updated version of #7569, with an extra fix to handle the detached head corner-case.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #10709 from JoshRosen/SPARK-9383.
      97e0c7c5
    • Jakob Odersky's avatar
      [SPARK-12761][CORE] Remove duplicated code · 38148f73
      Jakob Odersky authored
      Removes some duplicated code that was reintroduced during a merge.
      
      Author: Jakob Odersky <jodersky@gmail.com>
      
      Closes #10711 from jodersky/repl-2.11-duplicate.
      38148f73
    • Luc Bourlier's avatar
      [SPARK-12805][MESOS] Fixes documentation on Mesos run modes · cc91e218
      Luc Bourlier authored
      The default run mode has changed, but the documentation didn't fully reflect the change.
      
      Author: Luc Bourlier <luc.bourlier@typesafe.com>
      
      Closes #10740 from skyluc/issue/mesos-modes-doc.
      cc91e218
    • Liang-Chi Hsieh's avatar
      [SPARK-9297] [SQL] Add covar_pop and covar_samp · 63eee86c
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-9297
      
      Add two aggregation functions: covar_pop and covar_samp.
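
For reference, the two functions compute the standard covariance formulas (plain-Python sketch, not the Catalyst implementation):

```python
def covar_pop(xs, ys):
    # Population covariance: divide by n.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n


def covar_samp(xs, ys):
    # Sample covariance: divide by n - 1 (Bessel's correction).
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
```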
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      Author: Liang-Chi Hsieh <viirya@appier.com>
      
      Closes #10029 from viirya/covar-funcs.
      63eee86c
    • Yin Huai's avatar
      [SPARK-12692][BUILD][HOT-FIX] Fix the scala style of KinesisBackedBlockRDDSuite.scala. · d6fd9b37
      Yin Huai authored
      https://github.com/apache/spark/pull/10736 was merged yesterday and caused the master build to start failing because of a style issue.
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #10742 from yhuai/fixStyle.
      d6fd9b37
    • Kousuke Saruta's avatar
      [SPARK-12692][BUILD] Enforce style checking about white space before comma · 3d81d63f
      Kousuke Saruta authored
      This is the final PR about SPARK-12692.
      We have removed all whitespace before commas from the code, so let's enforce the style check.
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #10736 from sarutak/SPARK-12692-followup-enforce-checking.
      3d81d63f
    • Kousuke Saruta's avatar
      [SPARK-12692][BUILD][SQL] Scala style: Fix the style violation (Space before ",") · cb7b864a
      Kousuke Saruta authored
      Fix the style violation (space before , and :).
      This PR is a followup to #10643 and a rework of #10685.
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #10732 from sarutak/SPARK-12692-followup-sql.
      cb7b864a
  4. Jan 12, 2016
    • Dilip Biswal's avatar
      [SPARK-12558][SQL] AnalysisException when multiple functions applied in GROUP BY clause · dc7b3870
      Dilip Biswal authored
      cloud-fan, can you please take a look?
      
      In this case, we are failing during check analysis while validating the aggregation expression. I have added a semanticEquals for HiveGenericUDF to fix this. Please let me know if this is the right way to address this issue.
      
      Author: Dilip Biswal <dbiswal@us.ibm.com>
      
      Closes #10520 from dilipbiswal/spark-12558.
      dc7b3870
    • Kousuke Saruta's avatar
      [SPARK-12692][BUILD][CORE] Scala style: Fix the style violation (Space before ",") · f14922cf
      Kousuke Saruta authored
      Fix the style violation (space before , and :).
      This PR is a followup for #10643
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #10719 from sarutak/SPARK-12692-followup-core.
      f14922cf
    • Reynold Xin's avatar
      [SPARK-12788][SQL] Simplify BooleanEquality by using casts. · b3b9ad23
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #10730 from rxin/SPARK-12788.
      b3b9ad23
    • Nong Li's avatar
      [SPARK-12785][SQL] Add ColumnarBatch, an in memory columnar format for execution. · 92470849
      Nong Li authored
      There are many potential benefits of having an efficient in memory columnar format as an alternate
      to UnsafeRow. This patch introduces ColumnarBatch/ColumnarVector which starts this effort. The
      remaining implementation can be done as follow up patches.
      
      As stated in the JIRA, there are useful external components that operate on memory in a
      simple columnar format. ColumnarBatch would serve that purpose and could act as a
      zero-serialization/zero-copy exchange for this use case.
      
      This patch supports running the underlying data either on heap or off heap. On heap runs a bit
      faster but we would need offheap for zero-copy exchanges. Currently, this mode is hidden behind one
      interface (ColumnVector).
      
      This differs from Parquet or the existing columnar cache because this is *not* intended to be used
      as a storage format. The focus is entirely on CPU efficiency as we expect to only have 1 of these
      batches in memory per task. The layout of the values is just dense arrays of the value type.
      
      Author: Nong Li <nong@databricks.com>
      Author: Nong <nongli@gmail.com>
      
      Closes #10628 from nongli/spark-12635.
      92470849
    • Shixiong Zhu's avatar
      [SPARK-12652][PYSPARK] Upgrade Py4J to 0.9.1 · 4f60651c
      Shixiong Zhu authored
      - [x] Upgrade Py4J to 0.9.1
      - [x] SPARK-12657: Revert SPARK-12617
      - [x] SPARK-12658: Revert SPARK-12511
        - Still keeps the change that reads the checkpoint only once. This is a manual change and worth a careful look. https://github.com/zsxwing/spark/commit/bfd4b5c040eb29394c3132af3c670b1a7272457c
      - [x] Verify no leak any more after reverting our workarounds
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #10692 from zsxwing/py4j-0.9.1.
      4f60651c
    • Cheng Lian's avatar
      [SPARK-12724] SQL generation support for persisted data source tables · 8ed5f12d
      Cheng Lian authored
      This PR implements SQL generation support for persisted data source tables.  A new field `metastoreTableIdentifier: Option[TableIdentifier]` is added to `LogicalRelation`.  When a `LogicalRelation` representing a persisted data source relation is created, this field holds the database name and table name of the relation.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #10712 from liancheng/spark-12724-datasources-sql-gen.
      8ed5f12d
    • Reynold Xin's avatar
      [SPARK-12768][SQL] Remove CaseKeyWhen expression · 0ed430e3
      Reynold Xin authored
      This patch removes CaseKeyWhen expression and replaces it with a factory method that generates the equivalent CaseWhen. This reduces the amount of code we'd need to maintain in the future for both code generation and optimizer.
      
      Note that we introduced CaseKeyWhen to avoid duplicate evaluations of the key. This is no longer a problem because we now have common subexpression elimination.
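
The factory rewrite can be sketched as follows (hypothetical tuple representation; Spark builds Catalyst expressions):

```python
def case_key_when(key, branches, else_value=None):
    """Rewrite CASE key WHEN k1 THEN v1 ... END into the equivalent
    CASE WHEN key = k1 THEN v1 ... END representation."""
    # Repeated evaluation of `key` is no longer a concern because common
    # subexpression elimination de-duplicates it.
    conditions = [(("=", key, k), v) for k, v in branches]
    return conditions, else_value
```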
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #10722 from rxin/SPARK-12768.
      0ed430e3
    • Robert Kruszewski's avatar
      [SPARK-9843][SQL] Make catalyst optimizer pass pluggable at runtime · 508592b1
      Robert Kruszewski authored
      Let me know whether you'd like to see it in another place.
      
      Author: Robert Kruszewski <robertk@palantir.com>
      
      Closes #10210 from robert3005/feature/pluggable-optimizer.
      508592b1