  1. Oct 16, 2015
    • [SPARK-11084] [ML] [PYTHON] Check if index can contain non-zero value before binary search · 8ac71d62
      zero323 authored
At the moment `SparseVector.__getitem__` executes `np.searchsorted` first and checks whether the result is in the expected range afterwards. It is possible to check whether the index can contain a non-zero value at all before executing `np.searchsorted`.
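A hedged sketch of the idea, written in Scala for illustration (the actual change is in PySpark's `SparseVector.__getitem__`): when the queried index falls outside the range of stored indices, the value is known to be zero and the binary search can be skipped.

```scala
// Illustrative sketch only (the real fix lives in PySpark, using np.searchsorted):
// `indices` is the sorted array of positions that hold non-zero values.
def getValue(indices: Array[Int], values: Array[Double], i: Int): Double = {
  if (indices.isEmpty || i < indices.head || i > indices.last) {
    0.0 // index cannot hold a non-zero value, so skip the O(log n) search
  } else {
    val j = java.util.Arrays.binarySearch(indices, i)
    if (j >= 0) values(j) else 0.0
  }
}
```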
      
      Author: zero323 <matthew.szymkiewicz@gmail.com>
      
      Closes #9098 from zero323/sparse_vector_getitem_improved.
    • [SPARK-10599] [MLLIB] Lower communication for block matrix multiplication · 10046ea7
      Burak Yavuz authored
This PR aims to decrease communication costs in BlockMatrix multiplication in two ways (see the sketch after this list):
 - Simulate the multiplication on the driver, and figure out which blocks actually need to be shuffled
 - Send each block to a given partition only once, and join inside the partition, rather than sending multiple copies to the same partition
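A hedged sketch of the first idea (simplified stand-in code, not the actual BlockMatrix implementation): given only the block coordinates present on each side, the driver can compute which blocks participate in the product at all, and where each one must be sent.

```scala
// Simulate A * B on the driver using block coordinates alone.
// Left block (i, k) must meet every right block (k, j); the result lands in (i, j).
val leftBlocks  = Seq((0, 0), (0, 1), (1, 1))  // (rowBlock, colBlock) of A
val rightBlocks = Seq((0, 0), (1, 0))          // (rowBlock, colBlock) of B

val rightByRow = rightBlocks.groupBy { case (k, _) => k }
val destinations = leftBlocks.map { case (i, k) =>
  // Output blocks this left block contributes to; a real implementation would
  // map each (i, j) through the result partitioner to get a partition id.
  (i, k) -> rightByRow.getOrElse(k, Seq.empty).map { case (_, j) => (i, j) }
}.toMap
// Blocks whose destination list is empty never need to be shuffled at all.
```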
      
**NOTE**: One important caveat is that, right now, the old behavior of checking for multiple blocks with the same index is lost. It is not hard to add back, but would be a little more expensive than before.
      
Initial benchmarking showed promising results (see below); however, I did hit some `FileNotFound` exceptions with the new implementation after the shuffle.
      
      Size A: 1e5 x 1e5
      Size B: 1e5 x 1e5
      Block Sizes: 1024 x 1024
      Sparsity: 0.01
      Old implementation: 1m 13s
      New implementation: 9s
      
cc avulanov, would you be interested in helping me benchmark this? I used your code from the mailing list (which you sent about 3 months ago?); the old implementation didn't even run, but the new implementation completed in 268s on a 120 GB / 16-core cluster.
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #8757 from brkyvz/opt-bmm.
    • [SPARK-11050] [MLLIB] PySpark SparseVector can return wrong index in error message · 1ec0a0dc
      Bhargav Mangipudi authored
      
For negative indices in the SparseVector, we update the index value. If the index is invalid at that point, the error message reports the incorrect *updated* index instead of the original one. This change fixes that.
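A hedged Scala sketch of the fix (the actual change is in PySpark; names here are simplified): keep the caller-supplied index around and report it, rather than the normalized one.

```scala
// Simplified sketch: report the original index, not the normalized one.
def getItem(indices: Array[Int], values: Array[Double], size: Int, i: Int): Double = {
  val original = i                      // what the caller actually passed
  val idx = if (i < 0) i + size else i  // normalize negative indices
  require(idx >= 0 && idx < size,
    s"index $original out of range [-$size, $size)") // original, not idx
  val j = java.util.Arrays.binarySearch(indices, idx)
  if (j >= 0) values(j) else 0.0
}
```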
      
      Author: Bhargav Mangipudi <bhargav.mangipudi@gmail.com>
      
      Closes #9069 from bhargav/spark-10759.
    • [SPARK-11109] [CORE] Move FsHistoryProvider off deprecated AccessControlException · ac09a3a4
      gweidner authored
Switched from the deprecated org.apache.hadoop.fs.permission.AccessControlException to org.apache.hadoop.security.AccessControlException.
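The change amounts to an import swap; both classes exist in Hadoop:

```scala
// Before (deprecated):
// import org.apache.hadoop.fs.permission.AccessControlException
// After:
import org.apache.hadoop.security.AccessControlException
```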
      
      Author: gweidner <gweidner@us.ibm.com>
      
      Closes #9144 from gweidner/SPARK-11109.
    • [SPARK-11104] [STREAMING] Fix a deadlock in StreamingContext.stop · e1eef248
      zsxwing authored
The following deadlock may happen if the shutdown hook and `StreamingContext.stop` run at the same time.
      ```
      Java stack information for the threads listed above:
      ===================================================
      "Thread-2":
      	at org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:699)
      	- waiting to lock <0x00000005405a1680> (a org.apache.spark.streaming.StreamingContext)
      	at org.apache.spark.streaming.StreamingContext.org$apache$spark$streaming$StreamingContext$$stopOnShutdown(StreamingContext.scala:729)
      	at org.apache.spark.streaming.StreamingContext$$anonfun$start$1.apply$mcV$sp(StreamingContext.scala:625)
      	at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:266)
      	at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:236)
      	at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:236)
      	at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:236)
      	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1697)
      	at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:236)
      	at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:236)
      	at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:236)
      	at scala.util.Try$.apply(Try.scala:161)
      	at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:236)
      	- locked <0x00000005405b6a00> (a org.apache.spark.util.SparkShutdownHookManager)
      	at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:216)
      	at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
      "main":
      	at org.apache.spark.util.SparkShutdownHookManager.remove(ShutdownHookManager.scala:248)
      	- waiting to lock <0x00000005405b6a00> (a org.apache.spark.util.SparkShutdownHookManager)
      	at org.apache.spark.util.ShutdownHookManager$.removeShutdownHook(ShutdownHookManager.scala:199)
      	at org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:712)
      	- locked <0x00000005405a1680> (a org.apache.spark.streaming.StreamingContext)
      	at org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:684)
      	- locked <0x00000005405a1680> (a org.apache.spark.streaming.StreamingContext)
      	at org.apache.spark.streaming.SessionByKeyBenchmark$.main(SessionByKeyBenchmark.scala:108)
      	at org.apache.spark.streaming.SessionByKeyBenchmark.main(SessionByKeyBenchmark.scala)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:497)
      	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:680)
      	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
      	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
      	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
      	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
      ```
      
This PR moves `ShutdownHookManager.removeShutdownHook` out of the `synchronized` block to avoid the deadlock.
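A hedged sketch of the fix (hypothetical simplified names, not the actual Spark code): never hold the StreamingContext monitor while taking the hook manager's monitor, so the two locks can no longer be acquired in opposite orders.

```scala
// Simplified stand-ins for the two lock holders involved in the deadlock.
object HookManager {
  def removeShutdownHook(h: Runnable): Unit = synchronized { /* unregister h */ }
}

class Context {
  private var hook: Runnable = _

  def stop(): Unit = {
    // Hold our own monitor only for the state change...
    val h = synchronized { val tmp = hook; hook = null; tmp }
    // ...and call into the hook manager after releasing it (the actual fix).
    if (h != null) HookManager.removeShutdownHook(h)
  }
}
```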
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #9116 from zsxwing/stop-deadlock.
    • [SPARK-10974] [STREAMING] Add progress bar for output operation column and use red dots for failed batches · 369d786f
      zsxwing authored
      
      Screenshot:
      <img width="1363" alt="1" src="https://cloud.githubusercontent.com/assets/1000778/10342571/385d9340-6d4c-11e5-8e79-1fa4c3c98f81.png">
      
Also fixed the description and duration for output operations that don't have Spark jobs.
      <img width="1354" alt="2" src="https://cloud.githubusercontent.com/assets/1000778/10342775/4bd52a0e-6d4d-11e5-99bc-26265a9fc792.png">
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #9010 from zsxwing/output-op-progress-bar.
    • [SPARK-10581] [DOCS] Groups are not resolved in scaladoc in sql classes · 3d683a13
      Pravin Gadakh authored
Groups are not resolved properly in scaladoc in the following classes:
      
      sql/core/src/main/scala/org/apache/spark/sql/Column.scala
      sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala
      sql/core/src/main/scala/org/apache/spark/sql/functions.scala
      
      Author: Pravin Gadakh <pravingadakh177@gmail.com>
      
      Closes #9148 from pravingadakh/master.
    • [SPARK-11124] JsonParser/Generator should be closed for resource recycling · b9c5e5d4
      navis.ryu authored
Some JSON parsers are not closed; the parser in `JacksonParser#parseJson`, for example.
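A hedged sketch of the pattern (simplified; not the actual `JacksonParser` code): close the parser in a `finally` so its buffers are recycled even when parsing throws.

```scala
import com.fasterxml.jackson.core.JsonFactory

def parseRecord(factory: JsonFactory, record: String): Unit = {
  val parser = factory.createParser(record)
  try {
    // ... walk tokens and build rows ...
  } finally {
    parser.close() // releases and recycles the parser's internal buffers
  }
}
```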
      
      Author: navis.ryu <navis@apache.org>
      
      Closes #9130 from navis/SPARK-11124.
    • [SPARK-11122] [BUILD] [WARN] Add tag to fatal warnings · 4ee2cea2
      Jakob Odersky authored
      Shows that an error is actually due to a fatal warning.
      
      Author: Jakob Odersky <jodersky@gmail.com>
      
      Closes #9128 from jodersky/fatalwarnings.
    • [SPARK-11094] Strip extra strings from Java version in test runner · 08698ee1
      Jakob Odersky authored
Removes any extra strings from the Java version, fixing subsequent integer parsing.
This is required since some OpenJDK versions (specifically in Debian testing) append an extra "-internal" string to the version field.
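A hedged sketch of the stripping (illustrative only; names are hypothetical): keep only the leading version characters before parsing numbers.

```scala
// Illustrative sketch: "1.8.0_66-internal" -> "1.8.0_66" -> minor version 8.
def javaMinorVersion(raw: String): Int = {
  val cleaned = raw.takeWhile(c => c.isDigit || c == '.' || c == '_')
  cleaned.split('.')(1).toInt // "1.<minor>.x" scheme used by Java 8 and earlier
}
```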
      
      Author: Jakob Odersky <jodersky@gmail.com>
      
      Closes #9111 from jodersky/fixtestrunner.
    • [SPARK-11092] [DOCS] Add source links to scaladoc generation · ed775042
      Jakob Odersky authored
      Modify the SBT build script to include GitHub source links for generated Scaladocs, on releases only (no snapshots).
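A hedged sbt sketch of the approach (simplified; not the actual Spark build definition): `-doc-source-url` is Scaladoc's hook for linking each page to its source, and `isSnapshot` gates it to releases.

```scala
// In the sbt build: add GitHub source links to Scaladoc for release builds only.
scalacOptions in (Compile, doc) ++= {
  if (isSnapshot.value) Seq.empty
  else Seq(
    "-doc-source-url",
    s"https://github.com/apache/spark/tree/v${version.value}€{FILE_PATH}.scala",
    "-sourcepath", (baseDirectory in ThisBuild).value.getAbsolutePath
  )
}
```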
      
      Author: Jakob Odersky <jodersky@gmail.com>
      
      Closes #9110 from jodersky/unidoc.
    • [SPARK-11060] [STREAMING] Fix some potential NPEs in DStream transformation · 43f5d1f3
      jerryshao authored
      This patch fixes:
      
1. Guard against NPEs in `TransformedDStream` when a parent DStream returns None instead of an empty RDD (see the sketch after this list).
2. Verify some input streams which can potentially return None.
3. Add unit tests to verify the behavior when an input stream returns None.
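A hedged sketch of the guard in item 1 (generic stand-in types, not the actual `TransformedDStream` code): substitute an empty collection when the parent produced nothing, so the user-supplied function never sees a missing input.

```scala
// Simplified stand-in: Seq plays the role of an RDD here.
def computeTransformed[A, B](parent: Option[Seq[A]],
                             transform: Seq[A] => Seq[B]): Option[Seq[B]] = {
  // Previously a None parent could flow into `transform` and NPE downstream.
  Some(transform(parent.getOrElse(Seq.empty[A])))
}
```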
      
cc tdas, please help review. Thanks a lot :).
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #9070 from jerryshao/SPARK-11060.
  2. Oct 14, 2015
    • [SPARK-11076] [SQL] Add decimal support for floor and ceil · 9808052b
      Cheng Hao authored
Currently none of the `UnaryMathExpression` implementations support Decimal; I will create follow-ups for supporting it. This is the first PR, which should be a good place to review the approach I am taking.
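A hedged sketch of the intended semantics using plain `BigDecimal` (not the Catalyst expression code):

```scala
import scala.math.BigDecimal.RoundingMode

// Floor/ceil on decimals: round toward -infinity / +infinity at scale 0.
def decimalFloor(d: BigDecimal): BigDecimal = d.setScale(0, RoundingMode.FLOOR)
def decimalCeil(d: BigDecimal): BigDecimal  = d.setScale(0, RoundingMode.CEILING)

// decimalFloor(BigDecimal("3.7"))  == BigDecimal(3)
// decimalCeil(BigDecimal("-3.7")) == BigDecimal(-3)
```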
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #9086 from chenghao-intel/ceiling.
    • [SPARK-11017] [SQL] Support ImperativeAggregates in TungstenAggregate · 4ace4f8a
      Josh Rosen authored
      This patch extends TungstenAggregate to support ImperativeAggregate functions. The existing TungstenAggregate operator only supported DeclarativeAggregate functions, which are defined in terms of Catalyst expressions and can be evaluated via generated projections. ImperativeAggregate functions, on the other hand, are evaluated by calling their `initialize`, `update`, `merge`, and `eval` methods.
      
      The basic strategy here is similar to how SortBasedAggregate evaluates both types of aggregate functions: use a generated projection to evaluate the expression-based declarative aggregates with dummy placeholder expressions inserted in place of the imperative aggregate function output, then invoke the imperative aggregate functions and target them against the aggregation buffer. The bulk of the diff here consists of code that was copied and adapted from SortBasedAggregate, with some key changes to handle TungstenAggregate's sort fallback path.
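A hedged sketch of the imperative contract being supported (simplified stand-in, not the actual Catalyst `ImperativeAggregate` class):

```scala
// The operator drives these callbacks directly against the aggregation buffer,
// while declarative aggregates are still evaluated by a generated projection
// with placeholder expressions where the imperative results will land.
trait ImperativeAggregateLike {
  def initialize(buffer: Array[Any]): Unit               // zero the buffer slots
  def update(buffer: Array[Any], inputRow: Any): Unit    // fold in one input row
  def merge(buffer: Array[Any], other: Array[Any]): Unit // combine partial buffers
  def eval(buffer: Array[Any]): Any                      // produce the final value
}
```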
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #9038 from JoshRosen/support-interpreted-in-tungsten-agg-final.
    • [SPARK-10829] [SQL] Filter combining partition key and attribute doesn't work in DataSource scan · 1baaf2b9
      Cheng Hao authored
      ```scala
withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> "true") {
  withTempPath { dir =>
    val path = s"${dir.getCanonicalPath}/part=1"
    (1 to 3).map(i => (i, i.toString)).toDF("a", "b").write.parquet(path)

    // If the "part = 1" filter gets pushed down, this query will throw an exception
    // since "part" is not a valid column in the actual Parquet file
    checkAnswer(
      sqlContext.read.parquet(path).filter("a > 0 and (part = 0 or a > 1)"),
      (2 to 3).map(i => Row(i, i.toString, 1)))
  }
}
      ```
      
      We expect the result to be:
      ```
      2,1
      3,1
      ```
      But got
      ```
      1,1
      2,1
      3,1
      ```
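A hedged sketch of the rule at fault (simplified): a predicate may be pushed into the Parquet reader only if every column it references actually exists in the file, so predicates mentioning partition columns such as `part` must be evaluated by Spark instead.

```scala
// Simplified: decide push-down per predicate by column membership.
def canPushDown(referencedColumns: Set[String], fileColumns: Set[String]): Boolean =
  referencedColumns.subsetOf(fileColumns)

// canPushDown(Set("a"), Set("a", "b"))         == true   // safe to push
// canPushDown(Set("a", "part"), Set("a", "b")) == false  // keep in Spark
```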
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #8916 from chenghao-intel/partition_filter.
    • [SPARK-11113] [SQL] Remove DeveloperApi annotation from private classes. · 2b5e31c7
      Reynold Xin authored
      o.a.s.sql.catalyst and o.a.s.sql.execution are supposed to be private.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #9121 from rxin/SPARK-11113.
    • [SPARK-10104] [SQL] Consolidate different forms of table identifiers · 56d7da14
      Wenchen Fan authored
Right now, we have QualifiedTableName, TableIdentifier, and Seq[String] to represent table identifiers. We should have only one form, and TableIdentifier is the best one because it provides methods to get the table name and database name, and to return the unquoted and quoted strings.
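A hedged sketch of the consolidated form (simplified; the rendering methods illustrate what the message describes):

```scala
case class TableIdentifier(table: String, database: Option[String] = None) {
  // One canonical type that knows how to render itself both ways.
  def unquotedString: String = (database.toSeq :+ table).mkString(".")
  def quotedString: String   = (database.toSeq :+ table).map(p => s"`$p`").mkString(".")
}

// TableIdentifier("t", Some("db")).quotedString == "`db`.`t`"
```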
      
      Author: Wenchen Fan <wenchen@databricks.com>
      Author: Wenchen Fan <cloud0fan@163.com>
      
      Closes #8453 from cloud-fan/table-name.
    • [SPARK-11068] [SQL] [FOLLOW-UP] move execution listener to util · 9a430a02
      Wenchen Fan authored
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #9119 from cloud-fan/callback.
    • [SPARK-11096] Post-hoc review of the Netty-based RPC implementation - round 2 · cf2e0ae7
      Reynold Xin authored
      A few more changes:
      
      1. Renamed IDVerifier -> RpcEndpointVerifier
      2. Renamed NettyRpcAddress -> RpcEndpointAddress
      3. Simplified NettyRpcHandler a bit by removing the connection count tracking. This is OK because I now force spark.shuffle.io.numConnectionsPerPeer to 1
      4. Reduced spark.rpc.connect.threads to 64. It would be great to eventually remove this extra thread pool.
      5. Minor cleanup & documentation.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #9112 from rxin/SPARK-11096.
    • [SPARK-10973] · 615cc858
      Reynold Xin authored
      Close #9064
      Close #9063
      Close #9062
      
      These pull requests were merged into branch-1.5, branch-1.4, and branch-1.3.
    • [SPARK-8386] [SQL] Add write.mode for insertIntoJDBC when the param overwrite is false · 7e1308d3
      Huaxin Gao authored
The fix is for JIRA https://issues.apache.org/jira/browse/SPARK-8386.
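A hedged usage sketch (assuming `df` is an existing DataFrame and `url` a JDBC URL): the `DataFrameWriter` path with an explicit save mode replaces `insertIntoJDBC(url, table, overwrite = false)`.

```scala
// Append instead of overwrite, via an explicit write mode.
df.write.mode("append").jdbc(url, "target_table", new java.util.Properties())
```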
      
      Author: Huaxin Gao <huaxing@us.ibm.com>
      
      Closes #9042 from huaxingao/spark8386.
    • [SPARK-11040] [NETWORK] Make sure SASL handler delegates all events. · 31f31598
      Marcelo Vanzin authored
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #9053 from vanzin/SPARK-11040.
    • [SPARK-10619] Can't sort columns on Executor Page · 135a2ce5
      Tom Graves authored
This should also be picked into Spark 1.5.2.
      
      https://issues.apache.org/jira/browse/SPARK-10619
      
It looks like this was broken by commit https://github.com/apache/spark/commit/fb1d06fc242ec00320f1a3049673fbb03c4a6eb9#diff-b8adb646ef90f616c34eb5c98d1ebd16: some tables were changed to use `UIUtils.listingTable`, but the executor page wasn't converted, so when `sortable` was removed from `UIUtils.TABLE_CLASS_NOT_STRIPED` this page broke.
      
Simply adding the `sortable` tag back fixes both the active UI and the history server UI.
      
      Author: Tom Graves <tgraves@yahoo-inc.com>
      
      Closes #9101 from tgravescs/SPARK-10619.
    • [SPARK-10996] [SPARKR] Implement sampleBy() in DataFrameStatFunctions. · 390b22fa
      Sun Rui authored
      Author: Sun Rui <rui.sun@intel.com>
      
      Closes #9023 from sun-rui/SPARK-10996.
    • [SPARK-10981] [SPARKR] SparkR Join improvements · 8b328857
      Monica Liu authored
I was having issues with collect() and orderBy() in Spark 1.5.0, so I used the DataFrame.R and test_sparkSQL.R files from the Spark 1.5.1 download. I only modified the join() function in DataFrame.R to include "full", "fullouter", "left", "right", and "leftsemi", and added corresponding test cases for join() and merge() in the test_sparkSQL.R file.
This pull request accompanies the JIRA bug report I filed:
https://issues.apache.org/jira/browse/SPARK-10981
      
      Author: Monica Liu <liu.monica.f@gmail.com>
      
      Closes #9029 from mfliu/master.