  1. Oct 02, 2014
    • [SPARK-3766][Doc] Snappy is also the default compression codec for broadcast variables · c6469a02
      scwf authored
      Author: scwf <wangfei1@huawei.com>
      
      Closes #2632 from scwf/compress-doc and squashes the following commits:
      
      7983a1a [scwf] snappy is the default compression codec for broadcast
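      A minimal sketch of how these settings can be tweaked from user code, assuming the standard configuration keys `spark.broadcast.compress` and `spark.io.compression.codec` (the codec class below is only an example alternative to Snappy):
      
      ```scala
      import org.apache.spark.{SparkConf, SparkContext}
      
      // Broadcast compression is on by default and uses the codec configured by
      // spark.io.compression.codec (Snappy by default).
      val conf = new SparkConf()
        .setAppName("BroadcastCompressionSketch")
        .setMaster("local")
        .set("spark.broadcast.compress", "true") // default value, shown for clarity
        .set("spark.io.compression.codec", "org.apache.spark.io.LZFCompressionCodec") // override the default
      
      val sc = new SparkContext(conf)
      val lookup = sc.broadcast((1 to 1000).map(i => i -> i.toString).toMap)
      println(lookup.value(42))
      sc.stop()
      ```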
    • Modify default YARN memory_overhead -- from an additive constant to a multiplier · b4fb7b80
      Nishkam Ravi authored
      Redone against the recent master branch (https://github.com/apache/spark/pull/1391)
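      
      A minimal sketch of the idea, assuming a 0.07 multiplier and a 384 MB floor purely as illustrative values (the actual defaults are governed by `spark.yarn.executor.memoryOverhead` and may differ by release):
      
      ```scala
      // Sketch only: the overhead scales with the requested executor memory
      // instead of being a fixed additive constant.
      val memoryOverheadFactor = 0.07 // assumed multiplier
      val memoryOverheadMin = 384     // assumed floor, in MB
      
      def defaultMemoryOverhead(executorMemoryMb: Int): Int =
        math.max((memoryOverheadFactor * executorMemoryMb).toInt, memoryOverheadMin)
      
      // A 20 GB executor gets ~1433 MB of overhead rather than a flat 384 MB.
      println(defaultMemoryOverhead(20 * 1024))
      ```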
      
      Author: Nishkam Ravi <nravi@cloudera.com>
      Author: nravi <nravi@c1704.halxg.cloudera.com>
      Author: nishkamravi2 <nishkamravi@gmail.com>
      
      Closes #2485 from nishkamravi2/master_nravi and squashes the following commits:
      
      636a9ff [nishkamravi2] Update YarnAllocator.scala
      8f76c8b [Nishkam Ravi] Doc change for yarn memory overhead
      35daa64 [Nishkam Ravi] Slight change in the doc for yarn memory overhead
      5ac2ec1 [Nishkam Ravi] Remove out
      dac1047 [Nishkam Ravi] Additional documentation for yarn memory overhead issue
      42c2c3d [Nishkam Ravi] Additional changes for yarn memory overhead issue
      362da5e [Nishkam Ravi] Additional changes for yarn memory overhead
      c726bd9 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
      f00fa31 [Nishkam Ravi] Improving logging for AM memoryOverhead
      1cf2d1e [nishkamravi2] Update YarnAllocator.scala
      ebcde10 [Nishkam Ravi] Modify default YARN memory_overhead-- from an additive constant to a multiplier (redone to resolve merge conflicts)
      2e69f11 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
      efd688a [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark
      2b630f9 [nravi] Accept memory input as "30g", "512M" instead of an int value, to be consistent with rest of Spark
      3bf8fad [nravi] Merge branch 'master' of https://github.com/apache/spark
      5423a03 [nravi] Merge branch 'master' of https://github.com/apache/spark
      eb663ca [nravi] Merge branch 'master' of https://github.com/apache/spark
      df2aeb1 [nravi] Improved fix for ConcurrentModificationIssue (Spark-1097, Hadoop-10456)
      6b840f0 [nravi] Undo the fix for SPARK-1758 (the problem is fixed)
      5108700 [nravi] Fix in Spark for the Concurrent thread modification issue (SPARK-1097, HADOOP-10456)
      681b36f [nravi] Fix for SPARK-1758: failing test org.apache.spark.JavaAPISuite.wholeTextFiles
    • [SQL][Docs] Update the output of printSchema and fix a typo in SQL programming guide. · 82a6a083
      Yin Huai authored
      We have changed the output format of `printSchema`. This PR updates our SQL programming guide to show the new format. It also fixes a typo (the value type of `StructType` in the Java API).
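      
      For context, a small hedged sketch of the tree-style output `printSchema` produces (the schema and the nullability flags below are made up and not taken from the guide):
      
      ```scala
      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.sql.SQLContext
      
      case class Person(name: String, age: Int) // made-up example schema
      
      val sc = new SparkContext(new SparkConf().setAppName("PrintSchemaSketch").setMaster("local"))
      val sqlContext = new SQLContext(sc)
      
      val people = sqlContext.createSchemaRDD(sc.parallelize(Seq(Person("Alice", 30))))
      people.printSchema()
      // Illustrative tree-style output:
      // root
      //  |-- name: string (nullable = true)
      //  |-- age: integer (nullable = false)
      sc.stop()
      ```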
      
      Author: Yin Huai <huai@cse.ohio-state.edu>
      
      Closes #2630 from yhuai/sqlDoc and squashes the following commits:
      
      267d63e [Yin Huai] Update the output of printSchema and fix a typo.
    • [SPARK-3706][PySpark] Cannot run IPython REPL with IPYTHON set to "1" and PYSPARK_PYTHON unset · 5b4a5b1a
      cocoatomo authored
      ### Problem
      
      The section "Using the shell" in the Spark Programming Guide (https://spark.apache.org/docs/latest/programming-guide.html#using-the-shell) says that we can run the PySpark REPL through IPython.
      But the following command does not run IPython; it runs the default Python executable.
      
      ```
      $ IPYTHON=1 ./bin/pyspark
      Python 2.7.8 (default, Jul  2 2014, 10:14:46)
      ...
      ```
      
      The spark/bin/pyspark script at commit b235e013 decides which executable and options to use in the following way.
      
      1. if PYSPARK_PYTHON unset
         * → defaulting to "python"
      2. if IPYTHON_OPTS set
         * → set IPYTHON "1"
      3. if a Python script is passed to ./bin/pyspark → run it with ./bin/spark-submit
         * out of the scope of this issue
      4. if IPYTHON set as "1"
         * → execute $PYSPARK_PYTHON (default: ipython) with arguments $IPYTHON_OPTS
         * otherwise execute $PYSPARK_PYTHON
      
      Therefore, when PYSPARK_PYTHON is unset, python is executed even though IPYTHON is "1".
      In other words, when PYSPARK_PYTHON is unset, IPYTHON_OPTS and IPYTHON have no effect on deciding which command to use.
      
      PYSPARK_PYTHON | IPYTHON_OPTS | IPYTHON | resulting command | expected command
      ---- | ---- | ----- | ----- | -----
      (unset → defaults to python) | (unset) | (unset) | python | (same)
      (unset → defaults to python) | (unset) | 1 | python | ipython
      (unset → defaults to python) | an_option | (unset → set to 1) | python an_option | ipython an_option
      (unset → defaults to python) | an_option | 1 | python an_option | ipython an_option
      ipython | (unset) | (unset) | ipython | (same)
      ipython | (unset) | 1 | ipython | (same)
      ipython | an_option | (unset → set to 1) | ipython an_option | (same)
      ipython | an_option | 1 | ipython an_option | (same)
      
      ### Suggestion
      
      The pyspark script should first determine whether the user wants to run IPython or another executable.
      
      1. if IPYTHON_OPTS set
         * set IPYTHON "1"
      2.  if IPYTHON has a value "1"
         * PYSPARK_PYTHON defaults to "ipython" if not set
      3. PYSPARK_PYTHON defaults to "python" if not set
      
      See the pull request for more detailed modification.
      
      Author: cocoatomo <cocoatomo77@gmail.com>
      
      Closes #2554 from cocoatomo/issues/cannot-run-ipython-without-options and squashes the following commits:
      
      d2a9b06 [cocoatomo] [SPARK-3706][PySpark] Use PYTHONUNBUFFERED environment variable instead of -u option
      264114c [cocoatomo] [SPARK-3706][PySpark] Remove the sentence about deprecated environment variables
      42e02d5 [cocoatomo] [SPARK-3706][PySpark] Replace environment variables used to customize execution of PySpark REPL
      10d56fb [cocoatomo] [SPARK-3706][PySpark] Cannot run IPython REPL with IPYTHON set to "1" and PYSPARK_PYTHON unset
    • SPARK-1767: Prefer HDFS-cached replicas when scheduling data-local tasks · 6e27cb63
      Colin Patrick Mccabe authored
      This change reorders the replicas returned by
      HadoopRDD#getPreferredLocations so that replicas cached by HDFS are at
      the start of the list.  This requires Hadoop 2.5 or higher; previous
      versions of Hadoop do not expose the information needed to determine
      whether a replica is cached.
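      
      A rough sketch of the reordering idea (not the actual HadoopRDD code): keep all replica hosts, but move the ones HDFS reports as cached to the front.
      
      ```scala
      // Stable partition: cached replica hosts first, everything else after,
      // so the scheduler prefers cached replicas for data-local tasks.
      def preferCached(replicaHosts: Seq[String], cachedHosts: Set[String]): Seq[String] = {
        val (cached, uncached) = replicaHosts.partition(cachedHosts.contains)
        cached ++ uncached
      }
      
      // host2 holds the block in the HDFS cache, so it moves to the front.
      println(preferCached(Seq("host1", "host2", "host3"), Set("host2")))
      // List(host2, host1, host3)
      ```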
      
      Author: Colin Patrick Mccabe <cmccabe@cloudera.com>
      
      Closes #1486 from cmccabe/SPARK-1767 and squashes the following commits:
      
      338d4f8 [Colin Patrick Mccabe] SPARK-1767: Prefer HDFS-cached replicas when scheduling data-local tasks
    • [SPARK-3371][SQL] Renaming a function expression with group by gives error · bbdf1de8
      ravipesala authored
      The following code gives an error.
      ```
      sqlContext.registerFunction("len", (s: String) => s.length)
      sqlContext.sql("select len(foo) as a, count(1) from t1 group by len(foo)").collect()
      ```
      This happens because the SQL parser creates aliases for the functions in grouping expressions using generated alias names. So if the user gives an alias to the function in the projection, it does not match the generated alias name in the grouping expression.
      These kinds of queries work in Hive.
      The fix: if the user provides an alias for the function in the projection, don't generate an alias in the grouping expression; reuse the user's alias.
      
      Author: ravipesala <ravindra.pesala@huawei.com>
      
      Closes #2511 from ravipesala/SPARK-3371 and squashes the following commits:
      
      9fb973f [ravipesala] Removed aliases to grouping expressions.
      f8ace79 [ravipesala] Fixed the testcase issue
      bad2fd0 [ravipesala] SPARK-3371 : Fixed Renaming a function expression with group by gives error
    • MAINTENANCE: Automated closing of pull requests. · f341e1c8
      Patrick Wendell authored
      This commit exists to close the following pull requests on Github:
      
      Closes #1375 (close requested by 'pwendell')
      Closes #476 (close requested by 'mengxr')
      Closes #2502 (close requested by 'pwendell')
      Closes #2391 (close requested by 'andrewor14')
  2. Oct 01, 2014
    • [SPARK-3446] Expose underlying job ids in FutureAction. · 29c35132
      Marcelo Vanzin authored
      FutureAction is the only type exposed through the async APIs, so
      for job IDs to be useful they need to be exposed there. The complication
      is that some async jobs run more than one job (e.g. takeAsync),
      so the exposed ID actually has to be a list of IDs that can
      change over time. So the interface doesn't look very nice, but...
      
      The change is actually small; I just added a basic test to make sure
      it works.
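      
      A hedged usage sketch, assuming the job IDs end up exposed as a collection-valued accessor (called `jobIds` below) on the returned FutureAction:
      
      ```scala
      import scala.concurrent.Await
      import scala.concurrent.duration.Duration
      
      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.SparkContext._ // brings the async action implicits into scope
      
      val sc = new SparkContext(new SparkConf().setAppName("FutureActionJobIdsSketch").setMaster("local"))
      
      // countAsync runs a single job; takeAsync may run several, which is why the
      // IDs are exposed as a list that can grow over time.
      val action = sc.parallelize(1 to 1000, 4).countAsync()
      println(action.jobIds) // assumed accessor name for the underlying job id(s)
      
      println(Await.result(action, Duration.Inf)) // block until the count finishes
      sc.stop()
      ```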
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #2337 from vanzin/SPARK-3446 and squashes the following commits:
      
      e166a68 [Marcelo Vanzin] Fix comment.
      1fed2bc [Marcelo Vanzin] [SPARK-3446] Expose underlying job ids in FutureAction.
    • SPARK-3638 | Forced a compatible version of http client in kinesis-asl profile · 93861a5e
      aniketbhatnagar authored
      This patch forces the use of Commons HttpClient 4.2 in the kinesis-asl profile so that the AWS SDK does not run into dependency conflicts.
      
      Author: aniketbhatnagar <aniket.bhatnagar@gmail.com>
      
      Closes #2535 from aniketbhatnagar/Kinesis-HttpClient-Dep-Fix and squashes the following commits:
      
      aa2079f [aniketbhatnagar] Merge branch 'Kinesis-HttpClient-Dep-Fix' of https://github.com/aniketbhatnagar/spark into Kinesis-HttpClient-Dep-Fix
      73f55f6 [aniketbhatnagar] SPARK-3638 | Forced a compatible version of http client in kinesis-asl profile
      70cc75b [aniketbhatnagar] deleted merge files
      725dbc9 [aniketbhatnagar] Merge remote-tracking branch 'origin/Kinesis-HttpClient-Dep-Fix' into Kinesis-HttpClient-Dep-Fix
      4ed61d8 [aniketbhatnagar] SPARK-3638 | Forced a compatible version of http client in kinesis-asl profile
      9cd6103 [aniketbhatnagar] SPARK-3638 | Forced a compatible version of http client in kinesis-asl profile
    • [SPARK-3704][SQL] Fix ColumnValue type for Short values in thrift server · 1b9f0d67
      scwf authored
      In the ```ShortType``` case, we should add a short value to the Hive row; an int value may lead to problems.
      
      Author: scwf <wangfei1@huawei.com>
      
      Closes #2551 from scwf/fix-addColumnValue and squashes the following commits:
      
      08bcc59 [scwf] ColumnValue.shortValue for short type
    • [SPARK-3729][SQL] Do all hive session state initialization in lazy val · 45e058ca
      Michael Armbrust authored
      This change avoids an NPE during context initialization when settings are present.
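      
      A generic sketch of the pattern (names made up, not the actual HiveContext code): building the state inside a lazy val defers it until first use, after any settings have been applied.
      
      ```scala
      // Illustration only: nothing below runs at construction time, so applying
      // settings first can no longer hit half-initialized (null) state.
      class SessionHolder(settings: Map[String, String]) {
        lazy val sessionState: Map[String, String] = {
          println("initializing session state")
          settings.map { case (k, v) => k -> v.trim }
        }
      }
      
      val holder = new SessionHolder(Map("hive.metastore.uris" -> " thrift://localhost:9083 "))
      // Initialization happens here, on first access:
      println(holder.sessionState)
      ```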
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #2583 from marmbrus/configNPE and squashes the following commits:
      
      da2ec57 [Michael Armbrust] Do all hive session state initilialization in lazy val
    • 4e79970d
      Patrick Wendell authored
    • [SQL] Made Command.sideEffectResult protected · a31f4ff2
      Cheng Lian authored
      Considering `Command.executeCollect()` simply delegates to `Command.sideEffectResult`, we no longer need to leave the latter `protected[sql]`.
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #2431 from liancheng/narrow-scope and squashes the following commits:
      
      1bfc16a [Cheng Lian] Made Command.sideEffectResult protected
    • [SPARK-3593][SQL] Add support for sorting BinaryType · f84b228c
      Venkata Ramana Gollamudi authored
      BinaryType is derived from NativeType, and Ordering support has been added for it.
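      
      A hedged sketch of a lexicographic ordering over byte arrays in plain Scala (illustrative only, not the Catalyst implementation):
      
      ```scala
      // Compare byte by byte (unsigned); a shorter array that is a prefix of a
      // longer one sorts first.
      val binaryOrdering: Ordering[Array[Byte]] = new Ordering[Array[Byte]] {
        def compare(x: Array[Byte], y: Array[Byte]): Int = {
          val len = math.min(x.length, y.length)
          var i = 0
          while (i < len) {
            val cmp = (x(i) & 0xff) - (y(i) & 0xff)
            if (cmp != 0) return cmp
            i += 1
          }
          x.length - y.length
        }
      }
      
      val values = Seq(Array[Byte](1, 2), Array[Byte](1), Array[Byte](0, 5))
      println(values.sorted(binaryOrdering).map(_.mkString("[", ",", "]")))
      // List([0,5], [1], [1,2])
      ```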
      
      Author: Venkata Ramana G <ramana.gollamudi@huawei.com>
      
      Author: Venkata Ramana Gollamudi <ramana.gollamudi@huawei.com>
      
      Closes #2617 from gvramana/binarytype_sort and squashes the following commits:
      
      1cf26f3 [Venkata Ramana Gollamudi] Supported Sorting of BinaryType
    • [SPARK-3705][SQL] Add case for VoidObjectInspector to cover NullType · f315fb7e
      scwf authored
      add case for VoidObjectInspector in ```inspectorToDataType```
      
      Author: scwf <wangfei1@huawei.com>
      
      Closes #2552 from scwf/inspectorToDataType and squashes the following commits:
      
      453d892 [scwf] add case for VoidObjectInspector
    • [SPARK-3708][SQL] Backticks aren't handled correctly in aliases · 3508ce8a
      ravipesala authored
      The query below gives an error:
      sql("SELECT k FROM (SELECT \`key\` AS \`k\` FROM src) a")
      It fails because the aliases are not cleaned, so they cannot be resolved in further processing.
      
      Author: ravipesala <ravindra.pesala@huawei.com>
      
      Closes #2594 from ravipesala/SPARK-3708 and squashes the following commits:
      
      d55db54 [ravipesala] Fixed SPARK-3708 (Backticks aren't handled correctly is aliases)
    • [SPARK-3658][SQL] Start thrift server as a daemon · d61f2c15
      WangTaoTheTonic authored
      https://issues.apache.org/jira/browse/SPARK-3658
      
      And keep the `CLASS_NOT_FOUND_EXIT_STATUS` and exit message in `SparkSubmit.scala`.
      
      Author: WangTaoTheTonic <barneystinson@aliyun.com>
      Author: WangTao <barneystinson@aliyun.com>
      
      Closes #2509 from WangTaoTheTonic/thriftserver and squashes the following commits:
      
      5dcaab2 [WangTaoTheTonic] issue about coupling
      8ad9f95 [WangTaoTheTonic] generalization
      598e21e [WangTao] take thrift server as a daemon
    • [SPARK-3746][SQL] Lock hive client when creating tables · fcad3fae
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #2598 from marmbrus/hiveClientLock and squashes the following commits:
      
      ca89fe8 [Michael Armbrust] Lock hive client when creating tables
    • Python SQL Example Code · 17333c7a
      jyotiska authored
      SQL example code for Python, as shown on [SQL Programming Guide](https://spark.apache.org/docs/1.0.2/sql-programming-guide.html)
      
      Author: jyotiska <jyotiska123@gmail.com>
      
      Closes #2521 from jyotiska/sql_example and squashes the following commits:
      
      1471dcb [jyotiska] added imports for sql
      b25e436 [jyotiska] pep 8 compliance
      43fd10a [jyotiska] lines broken to maintain 80 char limit
      b4fdf4e [jyotiska] removed blank lines
      83d5ab7 [jyotiska] added inferschema and applyschema to the demo
      306667e [jyotiska] replaced blank line with end line
      c90502a [jyotiska] fixed new line
      4939a70 [jyotiska] added new line at end for python style
      0b46148 [jyotiska] fixed appname for python sql example
      8f67b5b [jyotiska] added python sql example
    • Typo error in KafkaWordCount example · b81ee0b4
      Gaspar Munoz authored
      topicpMap to topicMap
      
      Author: Gaspar Munoz <munozs.88@gmail.com>
      
      Closes #2614 from gasparms/patch-1 and squashes the following commits:
      
      00aab2c [Gaspar Munoz] Typo error in KafkaWordCount example
    • [SQL] Kill dangerous trailing space in query string · 8cc70e7e
      Cheng Lian authored
      MD5 hashes of query strings in `createQueryTest` calls are used to generate golden files, so leaving trailing spaces there can be really dangerous. I got bitten by this while working on #2616: my "smart" IDE automatically removed a trailing space and made Jenkins fail.
      
      (Really should add "no trailing space" to our coding style guidelines!)
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #2619 from liancheng/kill-trailing-space and squashes the following commits:
      
      034f119 [Cheng Lian] Kill dangerous trailing space in query string
    • [SPARK-3756] [Core] Properly check whether an exception is caused by an address-port collision · 2fedb5dd
      scwf authored
      The Jetty server uses MultiException to handle exceptions raised while starting the server;
      see https://github.com/eclipse/jetty.project/blob/jetty-8.1.14.v20131031/jetty-server/src/main/java/org/eclipse/jetty/server/Server.java
      
      So ```isBindCollision``` is extended to cover MultiException as well.
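      
      A hedged sketch of the shape of the check (simplified; assumes Jetty 8's MultiException with its getThrowables accessor):
      
      ```scala
      import java.net.BindException
      import scala.collection.JavaConverters._
      import org.eclipse.jetty.util.MultiException
      
      // A bind failure can show up as a plain BindException, buried in a cause
      // chain, or aggregated inside Jetty's MultiException.
      def isBindCollision(exception: Throwable): Boolean = exception match {
        case e: BindException =>
          e.getMessage != null && e.getMessage.contains("Address already in use")
        case e: MultiException =>
          e.getThrowables.asScala.exists(isBindCollision)
        case e if e.getCause != null =>
          isBindCollision(e.getCause)
        case _ => false
      }
      ```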
      
      Author: scwf <wangfei1@huawei.com>
      
      Closes #2611 from scwf/fix-isBindCollision and squashes the following commits:
      
      984cb12 [scwf] optimize the fix
      3a6c849 [scwf] fix bug in isBindCollision
    • [SPARK-3755][Core] Do not bind ports 1 - 1024 to the server in Spark · 6390aae4
      scwf authored
      A non-root user who uses a port in the range 1 - 1024 to start the Jetty server will get the exception "java.net.SocketException: Permission denied", so do not use these ports.
      
      Author: scwf <wangfei1@huawei.com>
      
      Closes #2610 from scwf/1-1024 and squashes the following commits:
      
      cb8cc76 [scwf] do not use port 1 - 1024
    • SPARK-2626 [DOCS] Stop SparkContext in all examples · dcb2f73f
      Sean Owen authored
      Call SparkContext.stop() in all examples (and touch up minor nearby code style issues while at it)
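      
      For illustration, the shape each example ends up with (a trivial made-up example, not one of the files touched by the PR):
      
      ```scala
      import org.apache.spark.{SparkConf, SparkContext}
      
      object SimpleExample {
        def main(args: Array[String]): Unit = {
          val sc = new SparkContext(new SparkConf().setAppName("SimpleExample"))
          val evens = sc.parallelize(1 to 100).filter(_ % 2 == 0).count()
          println(s"even numbers: $evens")
          sc.stop() // the point of SPARK-2626: always stop the context when done
        }
      }
      ```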
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #2575 from srowen/SPARK-2626 and squashes the following commits:
      
      5b2baae [Sean Owen] Call SparkContext.stop() in all examples (and touch up minor nearby code style issues while at it)
    • [SPARK-3749] [PySpark] fix bugs in broadcast large closure of RDD · abf588f4
      Davies Liu authored
      1. broadcast is triggered unexpectedly
      2. a file descriptor is leaked in the JVM (also leaked in parallelize())
      3. the broadcast is not unpersisted in the JVM after the RDD is no longer used.
      
      cc JoshRosen, sorry for these stupid bugs.
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2603 from davies/fix_broadcast and squashes the following commits:
      
      080a743 [Davies Liu] fix bugs in broadcast large closure of RDD
    • [SPARK-3757] mvn clean doesn't delete some files · 0bfd3afb
      Masayoshi TSUZUKI authored
      Added the directories to be deleted to the maven-clean-plugin configuration in pom.xml.
      
      Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp>
      
      Closes #2613 from tsudukim/feature/SPARK-3757 and squashes the following commits:
      
      8804bfc [Masayoshi TSUZUKI] Modified indent.
      67c7171 [Masayoshi TSUZUKI] [SPARK-3757] mvn clean doesn't delete some files
    • [SPARK-3748] Log thread name in unit test logs · 3888ee2f
      Reynold Xin authored
      Thread names are useful for correlating failures.
      
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #2600 from rxin/log4j and squashes the following commits:
      
      83ffe88 [Reynold Xin] [SPARK-3748] Log thread name in unit test logs
    • [SPARK-3751] [mllib] DecisionTree: example update + print options · 7bf6cc97
      Joseph K. Bradley authored
      DecisionTreeRunner functionality additions:
      * Allow user to pass in a test dataset
      * Do not print full model if the model is too large.
      
      As part of this, modify DecisionTreeModel and RandomForestModel to allow printing less info.  Proposed updates:
      * toString: prints model summary
      * toDebugString: prints full model (named after RDD.toDebugString)
      
      Similar update to Python API:
      * __repr__() now prints a model summary
      * toDebugString() now prints the full model
      
      CC: mengxr  chouqin manishamde codedeft  Small update (whomever can take a look).  Thanks!
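      
      A small usage sketch of the two levels of detail described above, assuming MLlib's DecisionTree API of this era (the data path is just a placeholder):
      
      ```scala
      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.mllib.tree.DecisionTree
      import org.apache.spark.mllib.util.MLUtils
      
      val sc = new SparkContext(new SparkConf().setAppName("DecisionTreePrintSketch").setMaster("local"))
      
      // Any LIBSVM-format file works here; the path is a placeholder.
      val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
      // 2 classes, no categorical features, gini impurity, depth 5, 32 bins.
      val model = DecisionTree.trainClassifier(data, 2, Map[Int, Int](), "gini", 5, 32)
      
      println(model.toString)      // short summary (depth, number of nodes)
      println(model.toDebugString) // full tree, one line per split and leaf
      sc.stop()
      ```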
      
      Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
      
      Closes #2604 from jkbradley/dtrunner-update and squashes the following commits:
      
      b2b3c60 [Joseph K. Bradley] re-added python sql doc test, temporarily removed before
      07b1fae [Joseph K. Bradley] repr() now prints a model summary toDebugString() now prints the full model
      1d0d93d [Joseph K. Bradley] Updated DT and RF to print less when toString is called. Added toDebugString for verbose printing.
      22eac8c [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dtrunner-update
      e007a95 [Joseph K. Bradley] Updated DecisionTreeRunner to accept a test dataset.
    • [SPARK-3747] TaskResultGetter could incorrectly abort a stage if it cannot get... · eb43043f
      Reynold Xin authored
      [SPARK-3747] TaskResultGetter could incorrectly abort a stage if it cannot get result for a specific task
      
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #2599 from rxin/SPARK-3747 and squashes the following commits:
      
      a74c04d [Reynold Xin] Added a line of comment explaining NonFatal
      0e8d44c [Reynold Xin] [SPARK-3747] TaskResultGetter could incorrectly abort a stage if it cannot get result for a specific task
  3. Sep 30, 2014
    • [SPARK-3478] [PySpark] Profile the Python tasks · c5414b68
      Davies Liu authored
      This patch adds profiling support for PySpark. It will show the profiling results
      before the driver exits; here is one example:
      
      ```
      ============================================================
      Profile of RDD<id=3>
      ============================================================
               5146507 function calls (5146487 primitive calls) in 71.094 seconds
      
         Ordered by: internal time, cumulative time
      
         ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        5144576   68.331    0.000   68.331    0.000 statcounter.py:44(merge)
             20    2.735    0.137   71.071    3.554 statcounter.py:33(__init__)
             20    0.017    0.001    0.017    0.001 {cPickle.dumps}
           1024    0.003    0.000    0.003    0.000 t.py:16(<lambda>)
             20    0.001    0.000    0.001    0.000 {reduce}
             21    0.001    0.000    0.001    0.000 {cPickle.loads}
             20    0.001    0.000    0.001    0.000 copy_reg.py:95(_slotnames)
             41    0.001    0.000    0.001    0.000 serializers.py:461(read_int)
             40    0.001    0.000    0.002    0.000 serializers.py:179(_batched)
             62    0.000    0.000    0.000    0.000 {method 'read' of 'file' objects}
             20    0.000    0.000   71.072    3.554 rdd.py:863(<lambda>)
             20    0.000    0.000    0.001    0.000 serializers.py:198(load_stream)
          40/20    0.000    0.000   71.072    3.554 rdd.py:2093(pipeline_func)
             41    0.000    0.000    0.002    0.000 serializers.py:130(load_stream)
             40    0.000    0.000   71.072    1.777 rdd.py:304(func)
             20    0.000    0.000   71.094    3.555 worker.py:82(process)
      ```
      
      Also, users can show the profile results manually with `sc.show_profiles()` or dump them to disk
      with `sc.dump_profiles(path)`, for example:
      
      ```python
      >>> sc._conf.set("spark.python.profile", "true")
      >>> rdd = sc.parallelize(range(100)).map(str)
      >>> rdd.count()
      100
      >>> sc.show_profiles()
      ============================================================
      Profile of RDD<id=1>
      ============================================================
               284 function calls (276 primitive calls) in 0.001 seconds
      
         Ordered by: internal time, cumulative time
      
         ncalls  tottime  percall  cumtime  percall filename:lineno(function)
              4    0.000    0.000    0.000    0.000 serializers.py:198(load_stream)
              4    0.000    0.000    0.000    0.000 {reduce}
           12/4    0.000    0.000    0.001    0.000 rdd.py:2092(pipeline_func)
              4    0.000    0.000    0.000    0.000 {cPickle.loads}
              4    0.000    0.000    0.000    0.000 {cPickle.dumps}
            104    0.000    0.000    0.000    0.000 rdd.py:852(<genexpr>)
              8    0.000    0.000    0.000    0.000 serializers.py:461(read_int)
             12    0.000    0.000    0.000    0.000 rdd.py:303(func)
      ```
      Profiling is disabled by default; it can be enabled by setting "spark.python.profile=true".
      
      Also, users can dump the results to disk automatically for future analysis by setting "spark.python.profile.dump=path_to_dump".
      
      This is a bugfix of #2351. cc JoshRosen
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2556 from davies/profiler and squashes the following commits:
      
      e68df5a [Davies Liu] Merge branch 'master' of github.com:apache/spark into profiler
      858e74c [Davies Liu] compatitable with python 2.6
      7ef2aa0 [Davies Liu] bugfix, add tests for show_profiles and dump_profiles()
      2b0daf2 [Davies Liu] fix docs
      7a56c24 [Davies Liu] bugfix
      cba9463 [Davies Liu] move show_profiles and dump_profiles to SparkContext
      fb9565b [Davies Liu] Merge branch 'master' of github.com:apache/spark into profiler
      116d52a [Davies Liu] Merge branch 'master' of github.com:apache/spark into profiler
      09d02c3 [Davies Liu] Merge branch 'master' into profiler
      c23865c [Davies Liu] Merge branch 'master' into profiler
      15d6f18 [Davies Liu] add docs for two configs
      dadee1a [Davies Liu] add docs string and clear profiles after show or dump
      4f8309d [Davies Liu] address comment, add tests
      0a5b6eb [Davies Liu] fix Python UDF
      4b20494 [Davies Liu] add profile for python
    • [SPARK-3701][MLLIB] update python linalg api and small fixes · d75496b1
      Xiangrui Meng authored
      1. doc updates
      2. simple checks on vector dimensions
      3. use column major for matrices
      
      davies jkbradley
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #2548 from mengxr/mllib-py-clean and squashes the following commits:
      
      6dce2df [Xiangrui Meng] address comments
      116b5db [Xiangrui Meng] use np.dot instead of array.dot
      75f2fcc [Xiangrui Meng] fix python style
      fefce00 [Xiangrui Meng] better check of vector size with more tests
      067ef71 [Xiangrui Meng] majored -> major
      ef853f9 [Xiangrui Meng] update python linalg api and small fixes
    • Remove compiler warning from TaskContext change. · 6c696d7d
      Reynold Xin authored
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #2602 from rxin/warning and squashes the following commits:
      
      130186b [Reynold Xin] Remove compiler warning from TaskContext change.
    • SPARK-3744 [STREAMING] FlumeStreamSuite will fail during port contention · 8764fe36
      Sean Owen authored
      Since it looked quite easy, I took the liberty of making a quick PR that just uses `Utils.startServiceOnPort` to fix this. It works locally for me.
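      
      A rough sketch of the retry idea behind `Utils.startServiceOnPort` (simplified; the name and retry count below are illustrative):
      
      ```scala
      import java.net.{BindException, ServerSocket}
      
      // Try a starting port and, on a bind failure, fall back to the next ports
      // instead of failing outright when the first one happens to be taken.
      def startOnFreePort(startPort: Int, maxRetries: Int = 10): (ServerSocket, Int) = {
        def attempt(offset: Int): (ServerSocket, Int) = {
          val port = startPort + offset
          try {
            (new ServerSocket(port), port)
          } catch {
            case e: BindException if offset < maxRetries => attempt(offset + 1)
          }
        }
        attempt(0)
      }
      
      val (socket, port) = startOnFreePort(9999)
      println(s"bound to port $port")
      socket.close()
      ```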
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #2601 from srowen/SPARK-3744 and squashes the following commits:
      
      ddc9319 [Sean Owen] Avoid port contention in tests by retrying several ports for Flume stream
    • [Build] Post commit hash with timeout messages · d3a3840e
      Nicholas Chammas authored
      [By request](https://github.com/apache/spark/pull/2588#issuecomment-57266871), and because it also makes sense.
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      
      Closes #2597 from nchammas/timeout-commit-hash and squashes the following commits:
      
      3d90714 [Nicholas Chammas] Revert "testing: making timeout 1 minute"
      2353c95 [Nicholas Chammas] testing: making timeout 1 minute
      e3a477e [Nicholas Chammas] post commit hash with timeout
    • SPARK-3745 - fix check-license to properly download and check jar · a01a3092
      shane knapp authored
      for details, see: https://issues.apache.org/jira/browse/SPARK-3745
      
      Author: shane knapp <incomplete@gmail.com>
      
      Closes #2596 from shaneknapp/SPARK-3745 and squashes the following commits:
      
      c95eea9 [shane knapp] SPARK-3745 - fix check-license to properly download and check jar
    • [SPARK-3356] [DOCS] Document when RDD elements' ordering within partitions is nondeterministic · ab6dd80b
      Sean Owen authored
      As suggested by mateiz , and because it came up on the mailing list again last week, this attempts to document that ordering of elements is not guaranteed across RDD evaluations in groupBy, zip, and partition-wise RDD methods. Suggestions welcome about the wording, or other methods that need a note.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #2508 from srowen/SPARK-3356 and squashes the following commits:
      
      b7c96fd [Sean Owen] Undo change to programming guide
      ad4aeec [Sean Owen] Don't mention ordering in partition-wise methods, reword description of ordering for zip methods per review, and add similar note to programming guide, which mentions groupByKey (but not zip methods)
      fce943b [Sean Owen] Note that ordering of elements is not guaranteed across RDD evaluations in groupBy, zip, and partition-wise RDD methods
    • HOTFIX: Ignore flaky tests in YARN · 157e7d0f
      Patrick Wendell authored
    • b64fcbd2
      Patrick Wendell authored
    • [SPARK-3734] DriverRunner should not read SPARK_HOME from submitter's environment · b167a8c7
      Josh Rosen authored
      When using spark-submit in `cluster` mode to submit a job to a Spark Standalone
      cluster, if the JAVA_HOME environment variable was set on the submitting
      machine then DriverRunner would attempt to use the submitter's JAVA_HOME to
      launch the driver process (instead of the worker's JAVA_HOME), causing the
      driver to fail unless the submitter and worker had the same Java location.
      
      This commit fixes this by reading JAVA_HOME from sys.env instead of
      command.environment.
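      
      A generic illustration of the distinction (the names and the SPARK_HOME example are illustrative, not the DriverRunner code): `sys.env` is the environment of the process doing the launching, while a per-command environment map carries whatever the submitter had set.
      
      ```scala
      // Sketch: a launch command that ships an environment map from the submitter.
      case class LaunchCommand(mainClass: String, environment: Map[String, String])
      
      val cmd = LaunchCommand("org.example.Driver", Map("SPARK_HOME" -> "/submitter/spark"))
      
      // Problematic: picks up whatever the submitting machine had set.
      val fromSubmitter: Option[String] = cmd.environment.get("SPARK_HOME")
      
      // Fix idea: read from the launching process' (the worker's) own environment.
      val fromWorker: Option[String] = sys.env.get("SPARK_HOME")
      
      println(s"submitter: $fromSubmitter, worker: $fromWorker")
      ```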
      
      Author: Josh Rosen <joshrosen@apache.org>
      
      Closes #2586 from JoshRosen/SPARK-3734 and squashes the following commits:
      
      e9513d9 [Josh Rosen] [SPARK-3734] DriverRunner should not read SPARK_HOME from submitter's environment.
    • [SPARK-3709] Executors don't always report broadcast block removal properly back to the driver · de700d31
      Reynold Xin authored
      The problem was that the 2nd argument in RemoveBroadcast is not tellMaster! It is "removeFromDriver". Basically when removeFromDriver is not true, we don't report broadcast block removal back to the driver, and then other executors mistakenly think that the executor would still have the block, and try to fetch from it.
      
      cc @tdas
      
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #2588 from rxin/debug and squashes the following commits:
      
      6dab2e3 [Reynold Xin] Don't log random messages.
      f430686 [Reynold Xin] Always report broadcast removal back to master.
      2a13f70 [Reynold Xin] iii