  1. Oct 14, 2014
    • [SPARK-3943] Some scripts bin\*.cmd pollutes environment variables in Windows · 66af8e25
      Masayoshi TSUZUKI authored
      Modified the scripts so that they no longer pollute environment variables.
      The main logic is moved from `XXX.cmd` into `XXX2.cmd`, and `XXX.cmd` now calls `XXX2.cmd` via the `cmd` command.
      `pyspark.cmd` and `spark-class.cmd` already use this approach, but `spark-shell.cmd`, `spark-submit.cmd` and `/python/docs/make.bat` do not.
      
      Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp>
      
      Closes #2797 from tsudukim/feature/SPARK-3943 and squashes the following commits:
      
      b397a7d [Masayoshi TSUZUKI] [SPARK-3943] Some scripts bin\*.cmd pollutes environment variables in Windows
      66af8e25
    • [SPARK-3869] ./bin/spark-class miss Java version with _JAVA_OPTIONS set · 7b4f39f6
      cocoatomo authored
      When the _JAVA_OPTIONS environment variable is set, the command "java -version" prints an extra message like "Picked up _JAVA_OPTIONS: -Dfile.encoding=UTF-8".
      ./bin/spark-class determines the Java version from the first line of the "java -version" output, so with _JAVA_OPTIONS set it misreads the Java version (the pitfall is sketched after this entry).
      
      Author: cocoatomo <cocoatomo77@gmail.com>
      
      Closes #2725 from cocoatomo/issues/3869-mistake-java-version and squashes the following commits:
      
      f894ebd [cocoatomo] [SPARK-3869] ./bin/spark-class miss Java version with _JAVA_OPTIONS set
      7b4f39f6
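      A minimal sketch of the parsing pitfall (illustrative only; the actual fix is in the `bin/spark-class` shell script, and the object name and regex below are made up for the example): `java -version` writes to stderr, and with _JAVA_OPTIONS set an extra "Picked up _JAVA_OPTIONS: ..." line comes first, so the version must not be read blindly from the first line.
      ```
      import scala.sys.process._

      object JavaVersionProbe {
        def main(args: Array[String]): Unit = {
          // `java -version` prints to stderr, so collect the stderr lines.
          val stderrLines = scala.collection.mutable.ArrayBuffer.empty[String]
          Seq("java", "-version") ! ProcessLogger(_ => (), line => stderrLines += line)

          // Skip the "Picked up _JAVA_OPTIONS: ..." banner and find the real version line.
          val versionRegex = """version "(\d+\.\d+)""".r
          val version = stderrLines
            .find(_.contains(" version "))
            .flatMap(line => versionRegex.findFirstMatchIn(line).map(_.group(1)))

          println(s"Detected Java version: $version")
        }
      }
      ```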
    • SPARK-3803 [MLLIB] ArrayIndexOutOfBoundsException found in executing computePrincipalComponents · 56096dba
      Sean Owen authored
      Avoid overflow in computing n*(n+1)/2 as much as possible; throw an explicit error when the Gramian computation would fail due to a negative array size; also warn about the large result when computing the Gramian (see the sketch after this entry).
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #2801 from srowen/SPARK-3803 and squashes the following commits:
      
      b4e6d92 [Sean Owen] Avoid overflow in computing n*(n+1)/2 as much as possible; throw explicit error when Gramian computation will fail due to negative array size; warn about large result when computing Gramian too
      56096dba
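      A minimal sketch of the two guards described above (a hypothetical helper, not the MLlib code itself): compute n*(n+1)/2 in Long arithmetic so the intermediate product cannot overflow an Int, and fail fast with a clear message when the result would exceed the maximum array size.
      ```
      object TriangularSize {
        /** Number of entries in the upper-triangular part of an n x n Gramian. */
        def triSize(n: Int): Int = {
          val size = n.toLong * (n.toLong + 1L) / 2L   // Long math avoids Int overflow
          require(size <= Int.MaxValue,
            s"Cannot compute the Gramian of a matrix with $n columns: $size entries exceed Int.MaxValue")
          if (size > 10000000L) {   // threshold chosen arbitrarily for this sketch
            // Mirrors the "warn about large result" part of the change.
            println(s"Warning: allocating $size doubles (~${size * 8L / (1L << 20)} MB) for the Gramian")
          }
          size.toInt
        }
      }
      ```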
    • [SPARK-3944][Core] Using Option[String] where value of String can be null · 24b818b9
      shitis authored
      Author: shitis <ssaxena.ece@gmail.com>
      
      Closes #2795 from Shiti/master and squashes the following commits:
      
      46897d7 [shitis] Using Option Wrapper to convert String with value null to None
      24b818b9
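      A minimal sketch of the pattern named in the squashed commit above (the `describe` helper and its argument are hypothetical): wrapping a possibly-null String in `Option` turns `null` into `None` instead of letting it leak through the API.
      ```
      def describe(rawValue: String): String = {
        // Option(null) == None, Option("yarn") == Some("yarn")
        Option(rawValue).getOrElse("<not set>")
      }

      // describe(null)   == "<not set>"
      // describe("yarn") == "yarn"
      ```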
    • [SPARK-3946] gitignore in /python includes wrong directory · 7ced88b0
      Masayoshi TSUZUKI authored
      Modified so that not the whole docs/ directory is ignored, but only docs/_build/, the output directory of the Sphinx build.
      
      Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp>
      
      Closes #2796 from tsudukim/feature/SPARK-3946 and squashes the following commits:
      
      2bea6a9 [Masayoshi TSUZUKI] [SPARK-3946] gitignore in /python includes wrong directory
      7ced88b0
    • SPARK-3178 setting SPARK_WORKER_MEMORY to a value without a label (m or g)... · 9b6de6fb
      Bill Bejeck authored
      SPARK-3178  setting SPARK_WORKER_MEMORY to a value without a label (m or g) sets the worker memory limit to zero
      
      Validate that the memory is greater than zero when it is set from the SPARK_WORKER_MEMORY environment variable or from the command line without a g or m label; if the memory is 0, an IllegalStateException is thrown (sketched after this entry). Added unit tests, which mock environment variables by subclassing SparkConf (tip provided by Josh Rosen). Updated WorkerArguments to use SparkConf.getenv instead of System.getenv when reading the SPARK_WORKER_MEMORY environment variable.
      
      Author: Bill Bejeck <bbejeck@gmail.com>
      
      Closes #2309 from bbejeck/spark-memory-worker and squashes the following commits:
      
      51cf915 [Bill Bejeck] SPARK-3178 - Validate the memory is greater than zero when set from the SPARK_WORKER_MEMORY environment variable or command line without a g or m label.  Added unit tests. If memory is 0 an IllegalStateException is thrown. Updated unit tests to mock environment variables by subclassing SparkConf (tip provided by Josh Rosen).   Updated WorkerArguments to use SparkConf.getenv instead of System.getenv for reading the SPARK_WORKER_MEMORY environment variable.
      9b6de6fb
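      A hedged sketch of the validation described above (the function name is illustrative, not the actual WorkerArguments code): memory parsed from SPARK_WORKER_MEMORY or the command line must be strictly positive, otherwise the worker fails loudly instead of silently starting with a 0 MB limit.
      ```
      def checkWorkerMemory(memoryMb: Int): Unit = {
        if (memoryMb <= 0) {
          // A value like "1024" without an 'm' or 'g' label previously ended up as 0 after
          // unit conversion; this turns that into an explicit error instead.
          throw new IllegalStateException(s"Worker memory must be greater than 0, got ${memoryMb} MB")
        }
      }
      ```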
    • [SPARK-3921] Fix CoarseGrainedExecutorBackend's arguments for Standalone mode · 186b497c
      Aaron Davidson authored
      The goal of this patch is to fix the swapped arguments in standalone mode, which was caused by  https://github.com/apache/spark/commit/79e45c9323455a51f25ed9acd0edd8682b4bbb88#diff-79391110e9f26657e415aa169a004998R153.
      
      More details can be found in the JIRA: [SPARK-3921](https://issues.apache.org/jira/browse/SPARK-3921)
      
      Tested in Standalone mode, but not in Mesos.
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #2779 from aarondav/fix-standalone and squashes the following commits:
      
      725227a [Aaron Davidson] Fix ExecutorRunnerTest
      9d703fe [Aaron Davidson] [SPARK-3921] Fix CoarseGrainedExecutorBackend's arguments for Standalone mode
      186b497c
    • [SPARK-3912][Streaming] Fixed flaky FlumeStreamSuite · 4d26aca7
      Tathagata Das authored
      @harishreedharan @pwendell
      See JIRA for diagnosis of the problem
      https://issues.apache.org/jira/browse/SPARK-3912
      
      The solution was to reimplement it:
      1. Find a free port (by binding and releasing a server socket, as sketched after this entry), and then use that port.
      2. Remove Thread.sleep()s; instead, repeatedly try to create a sender, send data, and check whether the data was sent. Use eventually() to minimize waiting time.
      3. Check whether all the data was received, without caring about batches.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #2773 from tdas/flume-test-fix and squashes the following commits:
      
      93cd7f6 [Tathagata Das] Reimplimented FlumeStreamSuite to be more robust.
      4d26aca7
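      A minimal sketch of step 1 above: bind a server socket to port 0 so the OS picks a free ephemeral port, remember it, then release it for the Flume sink to use. (There is an inherent race with other processes, which is one reason the suite also retries creating the sender and sending data.)
      ```
      import java.net.ServerSocket

      def findFreePort(): Int = {
        val socket = new ServerSocket(0)   // port 0 = let the OS choose a free port
        try socket.getLocalPort
        finally socket.close()
      }
      ```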
  2. Oct 13, 2014
    • [SPARK-3809][SQL] Fixes test suites in hive-thriftserver · 9eb49d41
      Cheng Lian authored
      As scwf pointed out, `HiveThriftServer2Suite` isn't effective anymore after the Thrift server was made a daemon. On the other hand, these test suites were known to be flaky; PR #2214 tried to fix them but failed because of an unknown Jenkins build error. This PR fixes both sets of issues.
      
      In this PR, instead of watching the `start-thriftserver.sh` output, the test code starts a `tail` process to watch the log file. A `Thread.sleep` has to be introduced because the `kill` command used in `stop-thriftserver.sh` is not synchronous.
      
      As for the root cause of the mysterious Jenkins build failure, please refer to [this comment](https://github.com/apache/spark/pull/2675#issuecomment-58464189) for details.
      
      ----
      
      (Copied from PR description of #2214)
      
      This PR fixes two issues of `HiveThriftServer2Suite` and brings 1 enhancement:
      
      1. Although the metastore, warehouse directories and listening port are chosen randomly, all test cases share the same configuration. Due to parallel test execution, one of the two test cases is doomed to fail.
      2. We caught any exception thrown from a test case and printed diagnosis information, but forgot to re-throw the exception...
      3. When the forked server process ends prematurely (e.g., fails to start), the `serverRunning` promise is completed with a failure, so the test code no longer keeps waiting until the timeout.
      
      So, embarrassingly, this test suite had been failing continuously for several days but no one had noticed... Fortunately, no bugs in the production code were being hidden by it.
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      Author: wangfei <wangfei1@huawei.com>
      
      Closes #2675 from liancheng/fix-thriftserver-tests and squashes the following commits:
      
      1c384b7 [Cheng Lian] Minor code cleanup, restore the logging level hack in TestHive.scala
      7805c33 [wangfei]  reset SPARK_TESTING to avoid loading Log4J configurations in testing class paths
      af2b5a9 [Cheng Lian] Removes log level hacks from TestHiveContext
      d116405 [wangfei] make sure that log4j level is INFO
      ee92a82 [Cheng Lian] Relaxes timeout
      7fd6757 [Cheng Lian] Fixes test suites in hive-thriftserver
      9eb49d41
    • [SQL]Small bug in unresolved.scala · 9d9ca91f
      Liquan Pei authored
      `name` should throw an exception containing the name instead of the exprId.
      
      Author: Liquan Pei <liquanpei@gmail.com>
      
      Closes #2758 from Ishiihara/SparkSQL-bug and squashes the following commits:
      
      aa36a3b [Liquan Pei] small bug
      9d9ca91f
    • SPARK-3807: SparkSql does not work for tables created using custom serde · e6e37701
      chirag authored
      
      SparkSql crashes on selecting tables using custom serde.
      
      Example:
      ----------------
      
      CREATE EXTERNAL TABLE table_name PARTITIONED BY (a int) ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.thrift.ThriftDeserializer" WITH SERDEPROPERTIES ("serialization.format"="org.apache.thrift.protocol.TBinaryProtocol","serialization.class"="ser_class") STORED AS SEQUENCEFILE;
      
      The following exception is seen on running a query like 'select * from table_name limit 1':
      
      ERROR CliDriver: org.apache.hadoop.hive.serde2.SerDeException: java.lang.NullPointerException
      at org.apache.hadoop.hive.serde2.thrift.ThriftDeserializer.initialize(ThriftDeserializer.java:68)
      at org.apache.hadoop.hive.ql.plan.TableDesc.getDeserializer(TableDesc.java:80)
      at org.apache.spark.sql.hive.execution.HiveTableScan.addColumnMetadataToConf(HiveTableScan.scala:86)
      at org.apache.spark.sql.hive.execution.HiveTableScan.<init>(HiveTableScan.scala:100)
      at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:188)
      at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:188)
      at org.apache.spark.sql.SQLContext$SparkPlanner.pruneFilterProject(SQLContext.scala:364)
      at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:184)
      at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
      at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
      at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
      at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
      at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
      at org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:280)
      at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
      at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
      at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
      at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
      at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:402)
      at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:400)
      at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:406)
      at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:406)
      at org.apache.spark.sql.hive.HiveContext$QueryExecution.stringResult(HiveContext.scala:406)
      at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:59)
      at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:291)
      at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413)
      at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:226)
      at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
      at java.lang.reflect.Method.invoke(Unknown Source)
      at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
      at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
      at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
      Caused by: java.lang.NullPointerException
      
      Author: chirag <chirag.aggarwal@guavus.com>
      
      Closes #2674 from chiragaggarwal/branch-1.1 and squashes the following commits:
      
      370c31b [chirag] SPARK-3807: Add a test case to validate the fix.
      1f26805 [chirag] SPARK-3807: SparkSql does not work for tables created using custom serde (Incorporated Review Comments)
      ba4bc0c [chirag] SPARK-3807: SparkSql does not work for tables created using custom serde
      5c73b72 [chirag] SPARK-3807: SparkSql does not work for tables created using custom serde
      
      (cherry picked from commit 925e22d3)
      Signed-off-by: Michael Armbrust <michael@databricks.com>
      e6e37701
    • [SQL] Add type checking debugging functions · 371321ca
      Michael Armbrust authored
      Adds some functions that were very useful when trying to track down the bug from #2656. This change also updates the tree output for query plans to include a `'` prefix for unresolved nodes and a `!` prefix for nodes that refer to non-existent attributes.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #2657 from marmbrus/debugging and squashes the following commits:
      
      654b926 [Michael Armbrust] Clean-up, add tests
      763af15 [Michael Armbrust] Add typeChecking debugging functions
      8c69303 [Michael Armbrust] Add inputSet, references to QueryPlan. Improve tree string with a prefix to denote invalid or unresolved nodes.
      fbeab54 [Michael Armbrust] Better toString, factories for AttributeSet.
      371321ca
    • [SPARK-3559][SQL] Remove unnecessary columns from List of needed Column Ids in Hive Conf · e10d71e7
      Venkata Ramana Gollamudi authored
      Author: Venkata Ramana G <ramana.gollamudi@huawei.com>
      
      Author: Venkata Ramana Gollamudi <ramana.gollamudi@huawei.com>
      
      Closes #2713 from gvramana/remove_unnecessary_columns and squashes the following commits:
      
      b7ba768 [Venkata Ramana Gollamudi] Added comment and checkstyle fix
      6a93459 [Venkata Ramana Gollamudi] cloned hiveconf for each TableScanOperators so that only required columns are added
      e10d71e7
    • [SPARK-3771][SQL] AppendingParquetOutputFormat should use reflection to... · 73da9c26
      Takuya UESHIN authored
      [SPARK-3771][SQL] AppendingParquetOutputFormat should use reflection to prevent from breaking binary-compatibility.
      
      Original problem is [SPARK-3764](https://issues.apache.org/jira/browse/SPARK-3764).
      
      `AppendingParquetOutputFormat` uses the binary-incompatible method `context.getTaskAttemptID`.
      This makes Spark itself binary-incompatible: if Spark is built against hadoop-1, the resulting artifact works only with hadoop-1, and vice versa (the reflection workaround is sketched after this entry).
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #2638 from ueshin/issues/SPARK-3771 and squashes the following commits:
      
      efd3784 [Takuya UESHIN] Add a comment to explain the reason to use reflection.
      ec213c1 [Takuya UESHIN] Use reflection to prevent breaking binary-compatibility.
      73da9c26
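      A hedged sketch of the reflection workaround described above (illustrative, not the actual AppendingParquetOutputFormat code): a direct call to `context.getTaskAttemptID` produces bytecode tied to the Hadoop line Spark was compiled against, while looking the method up by name at runtime keeps a single artifact usable with both.
      ```
      import org.apache.hadoop.mapreduce.TaskAttemptContext

      def taskAttemptId(context: TaskAttemptContext): AnyRef = {
        // Resolve getTaskAttemptID reflectively instead of linking against it at compile time.
        val method = context.getClass.getMethod("getTaskAttemptID")
        method.invoke(context)
      }
      ```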
    • [SPARK-3529] [SQL] Delete the temp files after test exit · d3cdf912
      Cheng Hao authored
      TestHive creates lots of temporary files under /tmp by default, which may cause performance issues for testing. This PR deletes them automatically when the tests exit (sketched after this entry).
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #2393 from chenghao-intel/delete_temp_on_exit and squashes the following commits:
      
      3a6511f [Cheng Hao] Remove the temp dir after text exit
      d3cdf912
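      A minimal sketch of the idea above (illustrative helper names, not the TestHive change itself): register a JVM shutdown hook that recursively deletes the temporary directories the tests created once the test JVM exits.
      ```
      import java.io.File

      def deleteOnExit(dir: File): Unit = {
        Runtime.getRuntime.addShutdownHook(new Thread {
          override def run(): Unit = deleteRecursively(dir)
        })
      }

      def deleteRecursively(f: File): Unit = {
        if (f.isDirectory) {
          Option(f.listFiles()).getOrElse(Array.empty[File]).foreach(deleteRecursively)
        }
        f.delete()   // delete children first, then the directory itself
      }
      ```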
    • [SPARK-2066][SQL] Adds checks for non-aggregate attributes with aggregation · 56102dc2
      Cheng Lian authored
      This PR adds a new rule `CheckAggregation` to the analyzer to provide better error message for non-aggregate attributes with aggregation.
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #2774 from liancheng/non-aggregate-attr and squashes the following commits:
      
      5246004 [Cheng Lian] Passes test suites
      bf1878d [Cheng Lian] Adds checks for non-aggregate attributes with aggregation
      56102dc2
    • [SPARK-3407][SQL]Add Date type support · 2ac40da3
      Daoyuan Wang authored
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #2344 from adrian-wang/date and squashes the following commits:
      
      f15074a [Daoyuan Wang] remove outdated lines
      2038085 [Daoyuan Wang] update return type
      00fe81f [Daoyuan Wang] address lian cheng's comments
      0df6ea1 [Daoyuan Wang] rebase and remove simple string
      bb1b1ef [Daoyuan Wang] remove failing test
      aa96735 [Daoyuan Wang] not cast for same type compare
      30bf48b [Daoyuan Wang] resolve rebase conflict
      617d1a8 [Daoyuan Wang] add date_udf case to white list
      c37e848 [Daoyuan Wang] comment update
      5429212 [Daoyuan Wang] change to long
      f8f219f [Daoyuan Wang] revise according to Cheng Hao
      0e0a4f5 [Daoyuan Wang] minor format
      4ddcb92 [Daoyuan Wang] add java api for date
      0e3110e [Daoyuan Wang] try to fix timezone issue
      17fda35 [Daoyuan Wang] set test list
      2dfbb5b [Daoyuan Wang] support date type
      2ac40da3
    • [SPARK-3892][SQL] remove redundant type name · 46db277c
      Daoyuan Wang authored
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #2747 from adrian-wang/typename and squashes the following commits:
      
      2824216 [Daoyuan Wang] remove redundant typeName
      fbaf340 [Daoyuan Wang] typename
      46db277c
    • [Spark] RDD take() method: overestimate too much · 49bbdcb6
      yingjieMiao authored
      The comment (Line 1083) says: "Otherwise, interpolate the number of partitions we need to try, but overestimate it by 50%."
      
      `(1.5 * num * partsScanned / buf.size).toInt` is the guess of the total number of partitions needed, so in every iteration we should use the increment `(1.5 * num * partsScanned / buf.size).toInt - partsScanned`.
      The existing implementation instead grows `partsScanned` 'exponentially' (roughly: `x_{n+1} >= (1.5 + 1) x_n`).
      
      This could be a performance problem (unless this is the intended behavior); see the sketch after this entry.
      
      Author: yingjieMiao <yingjie@42go.com>
      
      Closes #2648 from yingjieMiao/rdd_take and squashes the following commits:
      
      d758218 [yingjieMiao] scala style fix
      a8e74bb [yingjieMiao] python style fix
      4b6e777 [yingjieMiao] infix operator style fix
      4391d3b [yingjieMiao] typo fix.
      692f4e6 [yingjieMiao] cap numPartsToTry
      c4483dc [yingjieMiao] style fix
      1d2c410 [yingjieMiao] also change in rdd.py and AsyncRDD
      d31ff7e [yingjieMiao] handle the edge case after 1 iteration
      a2aa36b [yingjieMiao] RDD take method: overestimate too much
      49bbdcb6
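      A worked sketch of the estimate discussed above (simplified; not the actual RDD.take code): the 1.5x formula guesses the total number of partitions needed, so the per-iteration increment should be that guess minus the partitions already scanned.
      ```
      def nextPartsToTry(num: Int, partsScanned: Int, bufSize: Int): Int = {
        // Guess of the TOTAL number of partitions needed to collect `num` rows.
        val estimatedTotal = (1.5 * num * partsScanned / bufSize).toInt
        // Treating the guess as an ADDITIONAL count makes partsScanned grow roughly
        // geometrically (x_{n+1} >= 2.5 * x_n); subtracting partsScanned keeps it an increment.
        math.max(estimatedTotal - partsScanned, 1)
      }
      ```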
    • [SPARK-3861][SQL] Avoid rebuilding hash tables for broadcast joins on each partition · 39ccabac
      Reynold Xin authored
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #2727 from rxin/SPARK-3861-broadcast-hash-2 and squashes the following commits:
      
      9c7b1a2 [Reynold Xin] Revert "Reuse CompactBuffer in UniqueKeyHashedRelation."
      97626a1 [Reynold Xin] Reuse CompactBuffer in UniqueKeyHashedRelation.
      7fcffb5 [Reynold Xin] Make UniqueKeyHashedRelation private[joins].
      18eb214 [Reynold Xin] Merge branch 'SPARK-3861-broadcast-hash' into SPARK-3861-broadcast-hash-1
      4b9d0c9 [Reynold Xin] UniqueKeyHashedRelation.get should return null if the value is null.
      e0ebdd1 [Reynold Xin] Added a test case.
      90b58c0 [Reynold Xin] [SPARK-3861] Avoid rebuilding hash tables on each partition
      0c0082b [Reynold Xin] Fix line length.
      cbc664c [Reynold Xin] Rename join -> joins package.
      a070d44 [Reynold Xin] Fix line length in HashJoin
      a39be8c [Reynold Xin] [SPARK-3857] Create a join package for various join operators.
      39ccabac
    • Bug Fix: without unpersist method in RandomForest.scala · 942847fd
      omgteam authored
      While training Gradient Boosted Decision Trees on large-scale sparse data, Spark spills large amounts of data onto disk, which exposed the bug below:
          In version 1.1.0, DecisionTree.scala's train method persists treeInput in memory but never unpersists it, which caused heavy disk usage.
          In the GitHub version (1.2.0, probably), RandomForest.scala's train method persists baggedInput but likewise never unpersists it.
      
      After adding the unpersist call (the pattern is sketched after this entry), it works correctly.
      https://issues.apache.org/jira/browse/SPARK-3918
      
      Author: omgteam <Kimlong.Liu@gmail.com>
      
      Closes #2775 from omgteam/master and squashes the following commits:
      
      815d543 [omgteam] adjust tab to spaces
      1a36f83 [omgteam] Bug: fix without unpersist baggedInput in RandomForest.scala
      942847fd
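      A generic sketch of the persist/unpersist pairing the report asks for (illustrative names and input path; the actual change is to `baggedInput` in RandomForest.train): cache the derived input while it is reused, then release it so the cached blocks do not stay on disk for the rest of the job.
      ```
      import org.apache.spark.SparkContext
      import org.apache.spark.storage.StorageLevel

      def trainWithCachedInput(sc: SparkContext): Unit = {
        val data = sc.textFile("hdfs:///path/to/training/data")   // hypothetical input
        val baggedInput = data.map(_.split(',')).persist(StorageLevel.MEMORY_AND_DISK)
        try {
          // ... build all the trees from baggedInput ...
        } finally {
          baggedInput.unpersist()   // release the cached blocks when training is done
        }
      }
      ```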
    • [SPARK-3899][Doc]fix wrong links in streaming doc · 92e017fb
      w00228970 authored
      There are three [Custom Receiver Guide] links in the streaming doc; the first one is wrong.
      
      Author: w00228970 <wangfei1@huawei.com>
      Author: wangfei <wangfei1@huawei.com>
      
      Closes #2749 from scwf/streaming-doc and squashes the following commits:
      
      0cd76b7 [wangfei] update link tojump to the Akka-specific section
      45b0646 [w00228970] wrong link in streaming doc
      92e017fb
    • Add echo "Run streaming tests ..." · d8b8c210
      Ken Takagiwa authored
      Author: Ken Takagiwa <ugw.gi.world@gmail.com>
      
      Closes #2778 from giwa/patch-2 and squashes the following commits:
      
      a59f9a1 [Ken Takagiwa] Add echo "Run streaming tests ..."
      d8b8c210
    • [SPARK-3905][Web UI]The keys for sorting the columns of Executor page ,Stage... · b4a7fa7a
      GuoQiang Li authored
      [SPARK-3905][Web UI] The keys for sorting the columns of the Executor page, Stage page and Storage page are incorrect
      
      Author: GuoQiang Li <witgo@qq.com>
      
      Closes #2763 from witgo/SPARK-3905 and squashes the following commits:
      
      17d7990 [GuoQiang Li] The keys for sorting the columns of Executor page ,Stage page Storage page are incorrect
      b4a7fa7a
    • [SPARK-3121] Wrong implementation of implicit bytesWritableConverter · fc616d51
      Jakub Dubovský authored
      val path = ... //path to seq file with BytesWritable as type of both key and value
      val file = sc.sequenceFile[Array[Byte],Array[Byte]](path)
      file.take(1)(0)._1
      
      This prints incorrect content for the byte array: the actual content starts out correct, but some "random" bytes and zeros are appended. BytesWritable has two methods:
      
      getBytes() - returns the whole internal array, which is often longer than the value actually stored; it usually still contains the tail of previous, longer values
      
      copyBytes() - returns just the beginning of the internal array, as determined by the internal length property
      
      It looks like the implicit conversion between BytesWritable and Array[Byte] uses getBytes instead of the correct copyBytes (see the sketch after this entry).
      
      dbtsai
      
      Author: Jakub Dubovský <james64@inMail.sk>
      Author: Dubovsky Jakub <dubovsky@avast.com>
      
      Closes #2712 from james64/3121-bugfix and squashes the following commits:
      
      f85d24c [Jakub Dubovský] Test name changed, comments added
      1b20d51 [Jakub Dubovský] Import placed correctly
      406e26c [Jakub Dubovský] Scala style fixed
      f92ffa6 [Dubovsky Jakub] performance tuning
      480f9cd [Dubovsky Jakub] Bug 3121 fixed
      fc616d51
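      A minimal sketch of the conversion issue described above: `getBytes()` returns the whole backing array (often padded with stale bytes from earlier, longer values), so a correct conversion must respect `getLength()`, either via `copyBytes()` or an explicit copy like the one below.
      ```
      import org.apache.hadoop.io.BytesWritable

      def toByteArray(bw: BytesWritable): Array[Byte] = {
        // Copy only the valid prefix of the backing array, not the stale padding after it.
        java.util.Arrays.copyOfRange(bw.getBytes, 0, bw.getLength)
      }
      ```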
  3. Oct 12, 2014
    • [HOTFIX] Fix compilation error for Yarn 2.0.*-alpha · c86c9760
      Andrew Or authored
      This was reported in https://issues.apache.org/jira/browse/SPARK-3445. There are API differences between the 0.23.* and the 2.0.*-alpha branches that were not accounted for when this code was introduced.
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #2776 from andrewor14/fix-yarn-alpha and squashes the following commits:
      
      ec94752 [Andrew Or] Fix compilation error for 2.0.*-alpha
      c86c9760
    • SPARK-3716 [GraphX] Update Analytics.scala for partitionStrategy assignment · e5be4de7
      NamelessAnalyst authored
      
      Previously, when the val partitionStrategy was created, it called a function in the Analytics object that was a copy of the PartitionStrategy.fromString() method. That function has been removed, and the assignment of partitionStrategy now uses the PartitionStrategy.fromString method instead (see the sketch after this entry). In this way, it better matches the declarations of the edge/vertex StorageLevel variables.
      
      Author: NamelessAnalyst <NamelessAnalyst@users.noreply.github.com>
      
      Closes #2569 from NamelessAnalyst/branch-1.1 and squashes the following commits:
      
      c24ff51 [NamelessAnalyst] Update Analytics.scala
      
      (cherry picked from commit 5a21e3e7)
      Signed-off-by: Ankur Dave <ankurdave@gmail.com>
      e5be4de7
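      Roughly what the assignment looks like after the change (a sketch; the strategy name is just an example):
      ```
      import org.apache.spark.graphx.PartitionStrategy

      // Use GraphX's own parser instead of the duplicated helper that used to live in Analytics.
      val partitionStrategy: PartitionStrategy = PartitionStrategy.fromString("EdgePartition2D")
      ```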
    • [SPARK-3887] Send stacktrace in ConnectionManager error replies · 18bd67c2
      Josh Rosen authored
      When reporting that a remote error occurred, the ConnectionManager should also log the stacktrace of the remote exception. This PR accomplishes this by sending the remote exception's stacktrace as the payload in the "negative ACK / error message."
      
      Author: Josh Rosen <joshrosen@apache.org>
      
      Closes #2741 from JoshRosen/propagate-cm-exceptions-to-sender and squashes the following commits:
      
      b5366cc [Josh Rosen] Explicitly encode error messages using UTF-8.
      cef18b3 [Josh Rosen] [SPARK-3887] Send stracktrace in ConnectionManager error messages.
      18bd67c2
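      A minimal sketch of the payload construction described above (illustrative, not the ConnectionManager code): render the remote exception's stack trace to a string and encode it explicitly as UTF-8 bytes for the error reply.
      ```
      import java.io.{PrintWriter, StringWriter}
      import java.nio.charset.StandardCharsets

      def errorPayload(e: Throwable): Array[Byte] = {
        val sw = new StringWriter()
        e.printStackTrace(new PrintWriter(sw, true))   // render the full stack trace to a string
        sw.toString.getBytes(StandardCharsets.UTF_8)   // encode explicitly as UTF-8
      }
      ```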
    • [SPARK-2377] Python API for Streaming · 69c67aba
      giwa authored
      This patch brings a Python API for Streaming.
      
      It is based on work from @giwa.
      
      Author: giwa <ugw.gi.world@gmail.com>
      Author: Ken Takagiwa <ken@Kens-MacBook-Pro.local>
      Author: Davies Liu <davies.liu@gmail.com>
      Author: Ken Takagiwa <ken@kens-mbp.gateway.sonic.net>
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      Author: Ken <ugw.gi.world@gmail.com>
      Author: Ken Takagiwa <ugw.gi.world@gmail.com>
      Author: Matthew Farrellee <matt@redhat.com>
      
      Closes #2538 from davies/streaming and squashes the following commits:
      
      64561e4 [Davies Liu] fix tests
      331ecce [Davies Liu] fix example
      3e2492b [Davies Liu] change updateStateByKey() to easy API
      182be73 [Davies Liu] Merge branch 'master' of github.com:apache/spark into streaming
      02d0575 [Davies Liu] add wrapper for foreachRDD()
      bebeb4a [Davies Liu] address all comments
      6db00da [Davies Liu] Merge branch 'master' of github.com:apache/spark into streaming
      8380064 [Davies Liu] Merge branch 'master' of github.com:apache/spark into streaming
      52c535b [Davies Liu] remove fix for sum()
      e108ec1 [Davies Liu]  address comments
      37fe06f [Davies Liu] use random port for callback server
      d05871e [Davies Liu] remove reuse of PythonRDD
      be5e5ff [Davies Liu] merge branch of env, make tests stable.
      8071541 [Davies Liu] Merge branch 'env' into streaming
      c7bbbce [Davies Liu] fix sphinx docs
      6bb9d91 [Davies Liu] Merge branch 'master' of github.com:apache/spark into streaming
      4d0ea8b [Davies Liu] clear reference of SparkEnv after stop
      54bd92b [Davies Liu] improve tests
      c2b31cb [Davies Liu] Merge branch 'master' of github.com:apache/spark into streaming
      7a88f9f [Davies Liu] rollback RDD.setContext(), use textFileStream() to test checkpointing
      bd8a4c2 [Davies Liu] fix scala style
      7797c70 [Davies Liu] refactor
      ff88bec [Davies Liu] rename RDDFunction to TransformFunction
      d328aca [Davies Liu] fix serializer in queueStream
      6f0da2f [Davies Liu] recover from checkpoint
      fa7261b [Davies Liu] refactor
      a13ff34 [Davies Liu] address comments
      8466916 [Davies Liu] support checkpoint
      9a16bd1 [Davies Liu] change number of partitions during tests
      b98d63f [Davies Liu] change private[spark] to private[python]
      eed6e2a [Davies Liu] rollback not needed changes
      e00136b [Davies Liu] address comments
      069a94c [Davies Liu] fix the number of partitions during window()
      338580a [Davies Liu] change _first(), _take(), _collect() as private API
      19797f9 [Davies Liu] clean up
      6ebceca [Davies Liu] add more tests
      c40c52d [Davies Liu] change first(), take(n) to has the same behavior as RDD
      98ac6c2 [Davies Liu] support ssc.transform()
      b983f0f [Davies Liu] address comments
      847f9b9 [Davies Liu] add more docs, add first(), take()
      e059ca2 [Davies Liu] move check of window into Python
      fce0ef5 [Davies Liu] rafactor of foreachRDD()
      7001b51 [Davies Liu] refactor of queueStream()
      26ea396 [Davies Liu] refactor
      74df565 [Davies Liu] fix print and docs
      b32774c [Davies Liu] move java_import into streaming
      604323f [Davies Liu] enable streaming tests
      c499ba0 [Davies Liu] remove Time and Duration
      3f0fb4b [Davies Liu] refactor fix tests
      c28f520 [Davies Liu] support updateStateByKey
      d357b70 [Davies Liu] support windowed dstream
      bd13026 [Davies Liu] fix examples
      eec401e [Davies Liu] refactor, combine TransformedRDD, fix reuse PythonRDD, fix union
      9a57685 [Davies Liu] fix python style
      bd27874 [Davies Liu] fix scala style
      7339be0 [Davies Liu] delete tests
      7f53086 [Davies Liu] support transform(), refactor and cleanup
      df098fc [Davies Liu] Merge branch 'master' into giwa
      550dfd9 [giwa] WIP fixing 1.1 merge
      5cdb6fa [giwa] changed for SCCallSiteSync
      e685853 [giwa] meged with rebased 1.1 branch
      2d32a74 [giwa] added some StreamingContextTestSuite
      4a59e1e [giwa] WIP:added more test for StreamingContext
      8ffdbf1 [giwa] added atexit to handle callback server
      d5f5fcb [giwa] added comment for StreamingContext.sparkContext
      63c881a [giwa] added StreamingContext.sparkContext
      d39f102 [giwa] added StreamingContext.remember
      d542743 [giwa] clean up code
      2fdf0de [Matthew Farrellee] Fix scalastyle errors
      c0a06bc [giwa] delete not implemented functions
      f385976 [giwa] delete inproper comments
      b0f2015 [giwa] added comment in dstream._test_output
      bebb3f3 [giwa] remove the last brank line
      fbed8da [giwa] revert pom.xml
      8ed93af [giwa] fixed explanaiton
      066ba90 [giwa] revert pom.xml
      fa4af88 [giwa] remove duplicated import
      6ae3caa [giwa] revert pom.xml
      7dc7391 [giwa] fixed typo
      62dc7a3 [giwa] clean up exmples
      f04882c [giwa] clen up examples
      b171ec3 [giwa] fixed pep8 violation
      f198d14 [giwa] clean up code
      3166d31 [giwa] clean up
      c00e091 [giwa] change test case not to use awaitTermination
      e80647e [giwa] adopted the latest compression way of python command
      58e41ff [giwa] merge with master
      455e5af [giwa] removed wasted print in DStream
      af336b7 [giwa] add comments
      ddd4ee1 [giwa] added TODO coments
      99ce042 [giwa] added saveAsTextFiles and saveAsPickledFiles
      2a06cdb [giwa] remove waste duplicated code
      c5ecfc1 [giwa] basic function test cases are passed
      8dcda84 [giwa] all tests are passed if numSlice is 2 and the numver of each input is over 4
      795b2cd [giwa] broke something
      1e126bf [giwa] WIP: solved partitioned and None is not recognized
      f67cf57 [giwa] added mapValues and flatMapVaules WIP for glom and mapPartitions test
      953deb0 [giwa] edited the comment to add more precise description
      af610d3 [giwa] removed unnesessary changes
      c1d546e [giwa] fixed PEP-008 violation
      99410be [giwa] delete waste file
      b3b0362 [giwa] added basic operation test cases
      9cde7c9 [giwa] WIP added test case
      bd3ba53 [giwa] WIP
      5c04a5f [giwa] WIP: added PythonTestInputStream
      019ef38 [giwa] WIP
      1934726 [giwa] update comment
      376e3ac [giwa] WIP
      932372a [giwa] clean up dstream.py
      0b09cff [giwa] added stop in StreamingContext
      92e333e [giwa] implemented reduce and count function in Dstream
      1b83354 [giwa] Removed the waste line
      88f7506 [Ken Takagiwa] Kill py4j callback server properly
      54b5358 [Ken Takagiwa] tried to restart callback server
      4f07163 [Tathagata Das] Implemented DStream.foreachRDD in the Python API using Py4J callback server.
      fe02547 [Ken Takagiwa] remove waste file
      2ad7bd3 [Ken Takagiwa] clean up codes
      6197a11 [Ken Takagiwa] clean up code
      eb4bf48 [Ken Takagiwa] fix map function
      98c2a00 [Ken Takagiwa] added count operation but this implementation need double check
      58591d2 [Ken Takagiwa] reduceByKey is working
      0df7111 [Ken Takagiwa] delete old file
      f485b1d [Ken Takagiwa] fied input of socketTextDStream
      dd6de81 [Ken Takagiwa] initial commit for socketTextStream
      247fd74 [Ken Takagiwa] modified the code base on comment in https://github.com/tdas/spark/pull/10
      4bcb318 [Ken Takagiwa] implementing transform function in Python
      38adf95 [Ken Takagiwa] added reducedByKey not working yet
      66fcfff [Ken Takagiwa] modify dstream.py to fix indent error
      41886c2 [Ken Takagiwa] comment PythonDStream.PairwiseDStream
      0b99bec [Ken] initial commit for pySparkStreaming
      c214199 [giwa] added testcase for combineByKey
      5625bdc [giwa] added gorupByKey testcase
      10ab87b [giwa] added sparkContext as input parameter in StreamingContext
      10b5b04 [giwa] removed wasted print in DStream
      e54f986 [giwa] add comments
      16aa64f [giwa] added TODO coments
      74535d4 [giwa] added saveAsTextFiles and saveAsPickledFiles
      f76c182 [giwa] remove waste duplicated code
      18c8723 [giwa] modified streaming test case to add coment
      13fb44c [giwa] basic function test cases are passed
      3000b2b [giwa] all tests are passed if numSlice is 2 and the numver of each input is over 4
      ff14070 [giwa] broke something
      bcdec33 [giwa] WIP: solved partitioned and None is not recognized
      270a9e1 [giwa] added mapValues and flatMapVaules WIP for glom and mapPartitions test
      bb10956 [giwa] edited the comment to add more precise description
      253a863 [giwa] removed unnesessary changes
      3d37822 [giwa] fixed PEP-008 violation
      f21cab3 [giwa] delete waste file
      878bad7 [giwa] added basic operation test cases
      ce2acd2 [giwa] WIP added test case
      9ad6855 [giwa] WIP
      1df77f5 [giwa] WIP: added PythonTestInputStream
      1523b66 [giwa] WIP
      8a0fbbc [giwa] update comment
      fe648e3 [giwa] WIP
      29c2bc5 [giwa] initial commit for testcase
      4d40d63 [giwa] clean up dstream.py
      c462bb3 [giwa] added stop in StreamingContext
      d2c01ba [giwa] clean up examples
      3c45cd2 [giwa] implemented reduce and count function in Dstream
      b349649 [giwa] Removed the waste line
      3b498e1 [Ken Takagiwa] Kill py4j callback server properly
      84a9668 [Ken Takagiwa] tried to restart callback server
      9ab8952 [Tathagata Das] Added extra line.
      05e991b [Tathagata Das] Added missing file
      b1d2a30 [Tathagata Das] Implemented DStream.foreachRDD in the Python API using Py4J callback server.
      678e854 [Ken Takagiwa] remove waste file
      0a8bbbb [Ken Takagiwa] clean up codes
      bab31c1 [Ken Takagiwa] clean up code
      72b9738 [Ken Takagiwa] fix map function
      d3ee86a [Ken Takagiwa] added count operation but this implementation need double check
      15feea9 [Ken Takagiwa] edit python sparkstreaming example
      6f98e50 [Ken Takagiwa] reduceByKey is working
      c455c8d [Ken Takagiwa] added reducedByKey not working yet
      dc6995d [Ken Takagiwa] delete old file
      b31446a [Ken Takagiwa] fixed typo of network_workdcount.py
      ccfd214 [Ken Takagiwa] added doctest for pyspark.streaming.duration
      0d1b954 [Ken Takagiwa] fied input of socketTextDStream
      f746109 [Ken Takagiwa] initial commit for socketTextStream
      bb7ccf3 [Ken Takagiwa] remove unused import in python
      224fc5e [Ken Takagiwa] add empty line
      d2099d8 [Ken Takagiwa] sorted the import following Spark coding convention
      5bac7ec [Ken Takagiwa] revert streaming/pom.xml
      e1df940 [Ken Takagiwa] revert pom.xml
      494cae5 [Ken Takagiwa] remove not implemented DStream functions in python
      17a74c6 [Ken Takagiwa] modified the code base on comment in https://github.com/tdas/spark/pull/10
      1a0f065 [Ken Takagiwa] implementing transform function in Python
      d7b4d6f [Ken Takagiwa] added reducedByKey not working yet
      87438e2 [Ken Takagiwa] modify dstream.py to fix indent error
      b406252 [Ken Takagiwa] comment PythonDStream.PairwiseDStream
      454981d [Ken] initial commit for pySparkStreaming
      150b94c [giwa] added some StreamingContextTestSuite
      f7bc8f9 [giwa] WIP:added more test for StreamingContext
      ee50c5a [giwa] added atexit to handle callback server
      fdc9125 [giwa] added comment for StreamingContext.sparkContext
      f5bfb70 [giwa] added StreamingContext.sparkContext
      da09768 [giwa] added StreamingContext.remember
      d68b568 [giwa] clean up code
      4afa390 [giwa] clean up code
      1fd6bc7 [Ken Takagiwa] Merge pull request #2 from mattf/giwa-master
      d9d59fe [Matthew Farrellee] Fix scalastyle errors
      67473a9 [giwa] delete not implemented functions
      c97377c [giwa] delete inproper comments
      2ea769e [giwa] added comment in dstream._test_output
      3b27bd4 [giwa] remove the last brank line
      acfcaeb [giwa] revert pom.xml
      93f7637 [giwa] fixed explanaiton
      50fd6f9 [giwa] revert pom.xml
      4f82c89 [giwa] remove duplicated import
      9d1de23 [giwa] revert pom.xml
      7339df2 [giwa] fixed typo
      9c85e48 [giwa] clean up exmples
      24f95db [giwa] clen up examples
      0d30109 [giwa] fixed pep8 violation
      b7dab85 [giwa] improve test case
      583e66d [giwa] move tests for streaming inside streaming directory
      1d84142 [giwa] remove unimplement test
      f0ea311 [giwa] clean up code
      171edeb [giwa] clean up
      4dedd2d [giwa] change test case not to use awaitTermination
      268a6a5 [giwa] Changed awaitTermination not to call awaitTermincation in Scala. Just use time.sleep instread
      09a28bf [giwa] improve testcases
      58150f5 [giwa] Changed the test case to focus the test operation
      199e37f [giwa] adopted the latest compression way of python command
      185fdbf [giwa] merge with master
      f1798c4 [giwa] merge with master
      e70f706 [giwa] added testcase for combineByKey
      e162822 [giwa] added gorupByKey testcase
      97742fe [giwa] added sparkContext as input parameter in StreamingContext
      14d4c0e [giwa] removed wasted print in DStream
      6d8190a [giwa] add comments
      4aa99e4 [giwa] added TODO coments
      e9fab72 [giwa] added saveAsTextFiles and saveAsPickledFiles
      94f2b65 [giwa] remove waste duplicated code
      580fbc2 [giwa] modified streaming test case to add coment
      99e4bb3 [giwa] basic function test cases are passed
      7051a84 [giwa] all tests are passed if numSlice is 2 and the numver of each input is over 4
      35933e1 [giwa] broke something
      9767712 [giwa] WIP: solved partitioned and None is not recognized
      4f2d7e6 [giwa] added mapValues and flatMapVaules WIP for glom and mapPartitions test
      33c0f94d [giwa] edited the comment to add more precise description
      774f18d [giwa] removed unnesessary changes
      3a671cc [giwa] remove export PYSPARK_PYTHON in spark submit
      8efa266 [giwa] fixed PEP-008 violation
      fa75d71 [giwa] delete waste file
      7f96294 [giwa] added basic operation test cases
      3dda31a [giwa] WIP added test case
      1f68b78 [giwa] WIP
      c05922c [giwa] WIP: added PythonTestInputStream
      1fd12ae [giwa] WIP
      c880a33 [giwa] update comment
      5d22c92 [giwa] WIP
      ea4b06b [giwa] initial commit for testcase
      5a9b525 [giwa] clean up dstream.py
      79c5809 [giwa] added stop in StreamingContext
      189dcea [giwa] clean up examples
      b8d7d24 [giwa] implemented reduce and count function in Dstream
      b6468e6 [giwa] Removed the waste line
      b47b5fd [Ken Takagiwa] Kill py4j callback server properly
      19ddcdd [Ken Takagiwa] tried to restart callback server
      c9fc124 [Tathagata Das] Added extra line.
      4caae3f [Tathagata Das] Added missing file
      4eff053 [Tathagata Das] Implemented DStream.foreachRDD in the Python API using Py4J callback server.
      5e822d4 [Ken Takagiwa] remove waste file
      aeaf8a5 [Ken Takagiwa] clean up codes
      9fa249b [Ken Takagiwa] clean up code
      05459c6 [Ken Takagiwa] fix map function
      a9f4ecb [Ken Takagiwa] added count operation but this implementation need double check
      d1ee6ca [Ken Takagiwa] edit python sparkstreaming example
      0b8b7d0 [Ken Takagiwa] reduceByKey is working
      d25d5cf [Ken Takagiwa] added reducedByKey not working yet
      7f7c5d1 [Ken Takagiwa] delete old file
      967dc26 [Ken Takagiwa] fixed typo of network_workdcount.py
      57fb740 [Ken Takagiwa] added doctest for pyspark.streaming.duration
      4b69fb1 [Ken Takagiwa] fied input of socketTextDStream
      02f618a [Ken Takagiwa] initial commit for socketTextStream
      4ce4058 [Ken Takagiwa] remove unused import in python
      856d98e [Ken Takagiwa] add empty line
      490e338 [Ken Takagiwa] sorted the import following Spark coding convention
      5594bd4 [Ken Takagiwa] revert pom.xml
      2adca84 [Ken Takagiwa] remove not implemented DStream functions in python
      e551e13 [Ken Takagiwa] add coment for hack why PYSPARK_PYTHON is needed in spark-submit
      3758175 [Ken Takagiwa] add coment for hack why PYSPARK_PYTHON is needed in spark-submit
      c5518b4 [Ken Takagiwa] modified the code base on comment in https://github.com/tdas/spark/pull/10
      dcf243f [Ken Takagiwa] implementing transform function in Python
      9af03f4 [Ken Takagiwa] added reducedByKey not working yet
      6e0d9c7 [Ken Takagiwa] modify dstream.py to fix indent error
      e497b9b [Ken Takagiwa] comment PythonDStream.PairwiseDStream
      5c3a683 [Ken] initial commit for pySparkStreaming
      665bfdb [giwa] added testcase for combineByKey
      a3d2379 [giwa] added gorupByKey testcase
      636090a [giwa] added sparkContext as input parameter in StreamingContext
      e7ebb08 [giwa] removed wasted print in DStream
      d8b593b [giwa] add comments
      ea9c873 [giwa] added TODO coments
      89ae38a [giwa] added saveAsTextFiles and saveAsPickledFiles
      e3033fc [giwa] remove waste duplicated code
      a14c7e1 [giwa] modified streaming test case to add coment
      536def4 [giwa] basic function test cases are passed
      2112638 [giwa] all tests are passed if numSlice is 2 and the numver of each input is over 4
      080541a [giwa] broke something
      0704b86 [giwa] WIP: solved partitioned and None is not recognized
      90a6484 [giwa] added mapValues and flatMapVaules WIP for glom and mapPartitions test
      a65f302 [giwa] edited the comment to add more precise description
      bdde697 [giwa] removed unnesessary changes
      e8c7bfc [giwa] remove export PYSPARK_PYTHON in spark submit
      3334169 [giwa] fixed PEP-008 violation
      db0a303 [giwa] delete waste file
      2cfd3a0 [giwa] added basic operation test cases
      90ae568 [giwa] WIP added test case
      a120d07 [giwa] WIP
      f671cdb [giwa] WIP: added PythonTestInputStream
      56fae45 [giwa] WIP
      e35e101 [giwa] Merge branch 'master' into testcase
      ba5112d [giwa] update comment
      28aa56d [giwa] WIP
      fb08559 [giwa] initial commit for testcase
      a613b85 [giwa] clean up dstream.py
      c40c0ef [giwa] added stop in StreamingContext
      31e4260 [giwa] clean up examples
      d2127d6 [giwa] implemented reduce and count function in Dstream
      48f7746 [giwa] Removed the waste line
      0f83eaa [Ken Takagiwa] delete py4j 0.8.1
      1679808 [Ken Takagiwa] Kill py4j callback server properly
      f96cd4e [Ken Takagiwa] tried to restart callback server
      fe86198 [Ken Takagiwa] add py4j 0.8.2.1 but server is not launched
      1064fe0 [Ken Takagiwa] Merge branch 'master' of https://github.com/giwa/spark
      28c6620 [Ken Takagiwa] Implemented DStream.foreachRDD in the Python API using Py4J callback server
      85b0fe1 [Ken Takagiwa] Merge pull request #1 from tdas/python-foreach
      54e2e8c [Tathagata Das] Added extra line.
      e185338 [Tathagata Das] Added missing file
      a778d4b [Tathagata Das] Implemented DStream.foreachRDD in the Python API using Py4J callback server.
      cc2092b [Ken Takagiwa] remove waste file
      d042ac6 [Ken Takagiwa] clean up codes
      84a021f [Ken Takagiwa] clean up code
      bd20e17 [Ken Takagiwa] fix map function
      d01a125 [Ken Takagiwa] added count operation but this implementation need double check
      7d05109 [Ken Takagiwa] merge with remote branch
      ae464e0 [Ken Takagiwa] edit python sparkstreaming example
      04af046 [Ken Takagiwa] reduceByKey is working
      3b6d7b0 [Ken Takagiwa] implementing transform function in Python
      571d52d [Ken Takagiwa] added reducedByKey not working yet
      5720979 [Ken Takagiwa] delete old file
      e604fcb [Ken Takagiwa] fixed typo of network_workdcount.py
      4b7c08b [Ken Takagiwa] Merge branch 'master' of https://github.com/giwa/spark
      ce7d426 [Ken Takagiwa] added doctest for pyspark.streaming.duration
      a8c9fd5 [Ken Takagiwa] fixed for socketTextStream
      a61fa9e [Ken Takagiwa] fied input of socketTextDStream
      1e84f41 [Ken Takagiwa] initial commit for socketTextStream
      6d012f7 [Ken Takagiwa] remove unused import in python
      25d30d5 [Ken Takagiwa] add empty line
      6e0a64a [Ken Takagiwa] sorted the import following Spark coding convention
      fa4a7fc [Ken Takagiwa] revert streaming/pom.xml
      8f8202b [Ken Takagiwa] revert streaming pom.xml
      c9d79dd [Ken Takagiwa] revert pom.xml
      57e3e52 [Ken Takagiwa] remove not implemented DStream functions in python
      0a516f5 [Ken Takagiwa] add coment for hack why PYSPARK_PYTHON is needed in spark-submit
      a7a0b5c [Ken Takagiwa] add coment for hack why PYSPARK_PYTHON is needed in spark-submit
      72bfc66 [Ken Takagiwa] modified the code base on comment in https://github.com/tdas/spark/pull/10
      69e9cd3 [Ken Takagiwa] implementing transform function in Python
      94a0787 [Ken Takagiwa] added reducedByKey not working yet
      88068cf [Ken Takagiwa] modify dstream.py to fix indent error
      1367be5 [Ken Takagiwa] comment PythonDStream.PairwiseDStream
      eb2b3ba [Ken] Merge remote-tracking branch 'upstream/master'
      d8e51f9 [Ken] initial commit for pySparkStreaming
      69c67aba
  4. Oct 11, 2014
    • [SPARK-3909][PySpark][Doc] A corrupted format in Sphinx documents and building warnings · 7a3f589e
      cocoatomo authored
      The Sphinx documents contain corrupted ReST formatting and produce some warnings.
      
      The purpose of this issue is the same as https://issues.apache.org/jira/browse/SPARK-3773.
      
      commit: 0e8203f4
      
      output
      ```
      $ cd ./python/docs
      $ make clean html
      rm -rf _build/*
      sphinx-build -b html -d _build/doctrees   . _build/html
      Making output directory...
      Running Sphinx v1.2.3
      loading pickled environment... not yet created
      building [html]: targets for 4 source files that are out of date
      updating environment: 4 added, 0 changed, 0 removed
      reading sources... [100%] pyspark.sql
      /Users/<user>/MyRepos/Scala/spark/python/pyspark/mllib/feature.py:docstring of pyspark.mllib.feature.Word2VecModel.findSynonyms:4: WARNING: Field list ends without a blank line; unexpected unindent.
      /Users/<user>/MyRepos/Scala/spark/python/pyspark/mllib/feature.py:docstring of pyspark.mllib.feature.Word2VecModel.transform:3: WARNING: Field list ends without a blank line; unexpected unindent.
      /Users/<user>/MyRepos/Scala/spark/python/pyspark/sql.py:docstring of pyspark.sql:4: WARNING: Bullet list ends without a blank line; unexpected unindent.
      looking for now-outdated files... none found
      pickling environment... done
      checking consistency... done
      preparing documents... done
      writing output... [100%] pyspark.sql
      writing additional files... (12 module code pages) _modules/index search
      copying static files... WARNING: html_static_path entry u'/Users/<user>/MyRepos/Scala/spark/python/docs/_static' does not exist
      done
      copying extra files... done
      dumping search index... done
      dumping object inventory... done
      build succeeded, 4 warnings.
      
      Build finished. The HTML pages are in _build/html.
      ```
      
      Author: cocoatomo <cocoatomo77@gmail.com>
      
      Closes #2766 from cocoatomo/issues/3909-sphinx-build-warnings and squashes the following commits:
      
      2c7faa8 [cocoatomo] [SPARK-3909][PySpark][Doc] A corrupted format in Sphinx documents and building warnings
      7a3f589e
    • [SPARK-3867][PySpark] ./python/run-tests failed when it run with Python 2.6... · 81015a2b
      cocoatomo authored
      [SPARK-3867][PySpark] ./python/run-tests failed when it run with Python 2.6 and unittest2 is not installed
      
      ./python/run-tests searches for a Python 2.6 executable on PATH and uses it if available.
      When running with Python 2.6, it tries to import the unittest2 module, which is not part of the Python 2.6 standard library, so it fails with an ImportError.
      
      Author: cocoatomo <cocoatomo77@gmail.com>
      
      Closes #2759 from cocoatomo/issues/3867-unittest2-import-error and squashes the following commits:
      
      f068eb5 [cocoatomo] [SPARK-3867] ./python/run-tests failed when it run with Python 2.6 and unittest2 is not installed
      81015a2b
  5. Oct 10, 2014
    • [SPARK-2924] Required by scala 2.11, only one fun/ctor amongst overriden... · 0e8203f4
      Prashant Sharma authored
      [SPARK-2924] Required by Scala 2.11: only one function/constructor amongst overridden alternatives can have default argument(s) (illustrated after this entry).
      
      Author: Prashant Sharma <prashant.s@imaginea.com>
      
      Closes #2750 from ScrapCodes/SPARK-2924/default-args-removed and squashes the following commits:
      
      d9785c3 [Prashant Sharma] [SPARK-2924] Required by scala 2.11, only one function/ctor amongst overriden alternatives, can have default argument.
      0e8203f4
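      A small illustration of the Scala restriction referenced above (hypothetical methods): among a set of alternatives with the same name, only one may declare default arguments, so the extra defaults had to be removed from the other alternatives.
      ```
      object Example {
        def run(n: Int, verbose: Boolean = false): Unit = ()   // defaults are allowed here...
        def run(name: String): Unit = run(name.length)         // ...but not on this overload as well
      }
      ```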
    • HOTFIX: Fix build issue with Akka 2.3.4 upgrade. · 1d72a308
      Patrick Wendell authored
      We had to upgrade our Hive 0.12 version as well to deal with a protobuf
      conflict (both hive and akka have been using a shaded protobuf version).
      This is testing a correctly patched version of Hive 0.12.
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #2756 from pwendell/hotfix and squashes the following commits:
      
      cc979d0 [Patrick Wendell] HOTFIX: Fix build issue with Akka 2.3.4 upgrade.
      1d72a308
    • [SPARK-3886] [PySpark] use AutoBatchedSerializer by default · 72f36ee5
      Davies Liu authored
      Use AutoBatchedSerializer by default, which chooses the proper batch size based on the size of the serialized objects, keeping the size of a serialized batch in the range [64k - 640k].
      
      In the JVM, the serializer also tracks the objects in a batch in order to detect duplicated objects, so a larger batch may cause an OOM in the JVM.
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2740 from davies/batchsize and squashes the following commits:
      
      52cdb88 [Davies Liu] update docs
      185f2b9 [Davies Liu] use AutoBatchedSerializer by default
      72f36ee5
    • [SPARK-3889] Attempt to avoid SIGBUS by not mmapping files in ConnectionManager · 90f73fcc
      Aaron Davidson authored
      In general, individual shuffle blocks are frequently small, so mmapping them often creates a lot of waste. It may not be bad to mmap the larger ones, but it is pretty inconvenient to get configuration into ManagedBuffer, and besides it is unlikely to help all that much.
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #2742 from aarondav/mmap and squashes the following commits:
      
      a152065 [Aaron Davidson] Add other pathway back
      52b6cd2 [Aaron Davidson] [SPARK-3889] Attempt to avoid SIGBUS by not mmapping files in ConnectionManager
      90f73fcc
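      A hedged sketch of the trade-off described above (illustrative, not the ConnectionManager code): read small blocks straight into a heap buffer, and only memory-map blocks above some threshold.
      ```
      import java.io.{File, RandomAccessFile}
      import java.nio.ByteBuffer
      import java.nio.channels.FileChannel

      def readBlock(file: File, offset: Long, length: Int, mmapThreshold: Int): ByteBuffer = {
        val channel = new RandomAccessFile(file, "r").getChannel
        try {
          if (length >= mmapThreshold) {
            channel.map(FileChannel.MapMode.READ_ONLY, offset, length)   // mmap only the larger blocks
          } else {
            val buf = ByteBuffer.allocate(length)                        // plain read for the common small case
            channel.position(offset)
            while (buf.hasRemaining && channel.read(buf) >= 0) {}
            buf.flip()
            buf
          }
        } finally {
          channel.close()
        }
      }
      ```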
    • [SPARK-2805] Upgrade Akka to 2.3.4 · 411cf29f
      Anand Avati authored
      This is a second rev of the Akka upgrade (merged earlier, but reverted). I made a slight modification: I also upgraded Hive to deal with a compatibility issue related to the protocol buffers library.
      
      Author: Anand Avati <avati@redhat.com>
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #2752 from pwendell/akka-upgrade and squashes the following commits:
      
      4c7ca3f [Patrick Wendell] Upgrading to new hive->protobuf version
      57a2315 [Anand Avati] SPARK-1812: streaming - remove tests which depend on akka.actor.IO
      2a551d3 [Anand Avati] SPARK-1812: core - upgrade to akka 2.3.4
      411cf29f
  6. Oct 09, 2014
    • [SPARK-3834][SQL] Backticks not correctly handled in subquery aliases · 6f98902a
      ravipesala authored
      Queries like SELECT a.key FROM (SELECT key FROM src) \`a\` do not work, because backticks in subquery aliases are not handled properly. This PR fixes that (see the sketch after this entry).
      
      Author: ravipesala <ravindra.pesala@huawei.com>
      
      Closes #2737 from ravipesala/SPARK-3834 and squashes the following commits:
      
      0e0ab98 [ravipesala] Fixing issue in backtick handling for subquery aliases
      6f98902a
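      The query shape from the description, as it would be issued through Spark SQL (a sketch; `sqlContext` and the `src` table are assumed to already exist):
      ```
      val result = sqlContext.sql("""SELECT a.key FROM (SELECT key FROM src) `a`""")
      result.collect()
      ```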
    • [SPARK-3824][SQL] Sets in-memory table default storage level to MEMORY_AND_DISK · 421382d0
      Cheng Lian authored
      Using `MEMORY_AND_DISK` as the default storage level for in-memory table caching. Due to the in-memory columnar representation, recomputing an in-memory cached table's partitions can be very expensive.
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #2686 from liancheng/spark-3824 and squashes the following commits:
      
      35d2ed0 [Cheng Lian] Removes extra space
      1ab7967 [Cheng Lian] Reduces test data size to fit DiskStore.getBytes()
      ba565f0 [Cheng Lian] Maks CachedBatch serializable
      07f0204 [Cheng Lian] Sets in-memory table default storage level to MEMORY_AND_DISK
      421382d0
    • [SPARK-3654][SQL] Unifies SQL and HiveQL parsers · edf02da3
      Cheng Lian authored
      This PR is a follow up of #2590, and tries to introduce a top level SQL parser entry point for all SQL dialects supported by Spark SQL.
      
      A top level parser `SparkSQLParser` is introduced to handle the syntaxes that all SQL dialects should recognize (e.g. `CACHE TABLE`, `UNCACHE TABLE` and `SET`, etc.). For any syntax this parser doesn't recognize directly, it falls back to a specified function that tries to parse arbitrary input into a `LogicalPlan`. This function is typically another parser combinator like `SqlParser`. The DDL syntaxes introduced in #2475 can be moved here.
      
      The `ExtendedHiveQlParser` now only handles Hive-specific extensions.
      
      Also took the chance to refactor/reformat `SqlParser` for better readability.
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #2698 from liancheng/gen-sql-parser and squashes the following commits:
      
      ceada76 [Cheng Lian] Minor styling fixes
      9738934 [Cheng Lian] Minor refactoring, removes optional trailing ";" in the parser
      bb2ab12 [Cheng Lian] SET property value can be empty string
      ce8860b [Cheng Lian] Passes test suites
      e86968e [Cheng Lian] Removes debugging code
      8bcace5 [Cheng Lian] Replaces digit.+ to rep1(digit) (Scala style checking doesn't like it)
      d15d54f [Cheng Lian] Unifies SQL and HiveQL parsers
      edf02da3
    • SPARK-3811 [CORE] More robust / standard Utils.deleteRecursively, Utils.createTempDir · 363baaca
      Sean Owen authored
      I noticed a few issues with how temp directories are created and deleted:
      
      *Minor*
      
      * Guava's `Files.createTempDir()` plus `File.deleteOnExit()` is used in many tests to make a temp dir, but `Utils.createTempDir()` seems to be the standard Spark mechanism
      * Call to `File.deleteOnExit()` could be pushed into `Utils.createTempDir()` as well, along with this replacement
      * _I messed up the message in an exception in `Utils` in SPARK-3794; fixed here_
      
      *Bit Less Minor*
      
      * `Utils.deleteRecursively()` fails immediately if any `IOException` occurs, instead of trying to delete the remaining files and subdirectories. I've observed this leaving temp dirs around. I suggest changing it to continue in the face of an exception and to throw one of the (possibly several) exceptions that occurred at the end (sketched after this entry).
      * `Utils.createTempDir()` adds a new JVM shutdown hook every time the method is called, even when the new dir sits under a dir that is already registered, because that check only happens inside the hook. However, `Utils` already manages a set of all dirs to delete on shutdown, called `shutdownDeletePaths`; a single hook can be registered to delete all of these on exit. This is how Tachyon temp paths are cleaned up in `TachyonBlockManager`.
      
      I noticed a few other things that might be changed but wanted to ask first:
      
      * Shouldn't the set of dirs to delete be `File`, not just `String` paths?
      * `Utils` manages the set of `TachyonFile` that have been registered for deletion, but the shutdown hook is managed in `TachyonBlockManager`. Should this logic not live together, and not in `Utils`? it's more specific to Tachyon, and looks a slight bit odd to import in such a generic place.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #2670 from srowen/SPARK-3811 and squashes the following commits:
      
      071ae60 [Sean Owen] Update per @vanzin's review
      da0146d [Sean Owen] Make Utils.deleteRecursively try to delete all paths even when an exception occurs; use one shutdown hook instead of one per method call to delete temp dirs
      3a0faa4 [Sean Owen] Standardize on Utils.createTempDir instead of Files.createTempDir
      363baaca
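      A hedged sketch of the "keep going, throw at the end" behaviour proposed above (names are illustrative, not the actual Utils code): continue deleting siblings when one deletion fails, and surface one of the accumulated exceptions once the traversal is finished.
      ```
      import java.io.{File, IOException}

      def deleteRecursively(file: File): Unit = {
        var savedException: Option[IOException] = None
        if (file.isDirectory) {
          Option(file.listFiles()).getOrElse(Array.empty[File]).foreach { child =>
            try {
              deleteRecursively(child)
            } catch {
              case e: IOException => if (savedException.isEmpty) savedException = Some(e)
            }
          }
        }
        if (!file.delete() && file.exists() && savedException.isEmpty) {
          savedException = Some(new IOException(s"Failed to delete: ${file.getAbsolutePath}"))
        }
        savedException.foreach(e => throw e)   // report one failure after trying everything
      }
      ```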