  1. Dec 18, 2014
    • [SPARK-4880] remove spark.locality.wait in Analytics · a7ed6f3c
      Ernest authored
      spark.locality.wait was set to 100000 in examples/graphx/Analytics.scala;
      this setting should be left to the user.
      
      Author: Ernest <earneyzxl@gmail.com>
      
      Closes #3730 from Earne/SPARK-4880 and squashes the following commits:
      
      d79ed04 [Ernest] remove spark.locality.wait in Analytics
    • [SPARK-4887][MLlib] Fix a bad unittest in LogisticRegressionSuite · 59a49db5
      DB Tsai authored
      The original test doesn't make sense: if you step in, the lossSum is already NaN
      and the coefficients are diverging, because the step size is too large for SGD
      to work.
      
      The correct behavior is that you should get smaller coefficients than in the model
      without regularization. Comparing the values using a 20000.0 relative error doesn't
      make sense either.
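      
      A hedged Scala sketch of the corrected expectation (the data set, step sizes, and thresholds are illustrative, not the suite's actual values):
      
      ```scala
      import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
      import org.apache.spark.mllib.regression.LabeledPoint
      import org.apache.spark.rdd.RDD

      // Sketch only: with a workable step size, the L2-regularized model
      // should learn weights smaller in magnitude than the unregularized
      // model's, rather than matching hard-coded values from a diverging run.
      def checkRegularizationShrinksWeights(testRDD: RDD[LabeledPoint]): Unit = {
        val plain = new LogisticRegressionWithSGD()
        plain.optimizer.setStepSize(1.0).setRegParam(0.0).setNumIterations(10)

        val regularized = new LogisticRegressionWithSGD()
        regularized.optimizer.setStepSize(1.0).setRegParam(1.0).setNumIterations(10)

        val wPlain = plain.run(testRDD.cache()).weights(0)
        val wReg = regularized.run(testRDD).weights(0)
        assert(math.abs(wReg) < math.abs(wPlain))
      }
      ```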
      
      Author: DB Tsai <dbtsai@alpinenow.com>
      
      Closes #3735 from dbtsai/mlortestfix and squashes the following commits:
      
      b1a3c42 [DB Tsai] first commit
    • [SPARK-3607] ConnectionManager threads.max configs on the thread pools don't work · 3720057b
      Ilya Ganelin authored
      Hi all - cleaned up the code to get rid of the unused parameter and added some discussion of the ThreadPoolExecutor parameters to explain why we can use a single threadCount instead of providing a min/max.
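      
      The gist of that discussion, as a hedged sketch (illustrative values, not the actual ConnectionManager code): with an unbounded work queue, a `ThreadPoolExecutor` never grows beyond its core size, so a separate min/max pair is redundant.
      
      ```scala
      import java.util.concurrent.{LinkedBlockingQueue, ThreadPoolExecutor, TimeUnit}

      val threadCount = 4 // illustrative value
      // With an unbounded LinkedBlockingQueue, maximumPoolSize is never
      // consulted: the pool creates at most corePoolSize threads and queues
      // everything else, so a single threadCount is the honest configuration.
      val pool = new ThreadPoolExecutor(
        threadCount, threadCount, 60L, TimeUnit.SECONDS,
        new LinkedBlockingQueue[Runnable]())
      pool.allowCoreThreadTimeOut(true) // let idle threads die after 60s
      ```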
      
      Author: Ilya Ganelin <ilya.ganelin@capitalone.com>
      
      Closes #3664 from ilganeli/SPARK-3607C and squashes the following commits:
      
      3c05690 [Ilya Ganelin] Updated documentation and refactored code to extract shared variables
    • Add mesos specific configurations into doc · d9956f86
      Timothy Chen authored
      Author: Timothy Chen <tnachen@gmail.com>
      
      Closes #3349 from tnachen/mesos_doc and squashes the following commits:
      
      737ef49 [Timothy Chen] Add TOC
      5ca546a [Timothy Chen] Update description around cores requested.
      26283a5 [Timothy Chen] Add mesos specific configurations into doc
    • SPARK-3779. yarn spark.yarn.applicationMaster.waitTries config should be... · 253b72b5
      Sandy Ryza authored
      ... changed to a time period
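      
      A hedged example of the change in practice (to the best of my knowledge the replacement key is `spark.yarn.am.waitTime`, in milliseconds, with the old key deprecated rather than removed):
      
      ```scala
      import org.apache.spark.SparkConf

      val conf = new SparkConf()
        // New: wait for the SparkContext as a time period, in milliseconds.
        .set("spark.yarn.am.waitTime", "100000")
      // Old, deprecated: .set("spark.yarn.applicationMaster.waitTries", "10")
      ```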
      
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #3471 from sryza/sandy-spark-3779 and squashes the following commits:
      
      20b9887 [Sandy Ryza] Deprecate old property
      42b5df7 [Sandy Ryza] Review feedback
      9a959a1 [Sandy Ryza] SPARK-3779. yarn spark.yarn.applicationMaster.waitTries config should be changed to a time period
    • [SPARK-4461][YARN] pass extra java options to yarn application master · 3b764699
      Zhan Zhang authored
      Currently, there is no way to pass YARN AM specific Java options, which causes potential issues when reading the classpath from a Hadoop configuration file. Hadoop replaces variables in its property values with the system properties passed in via Java options, and how the value should be specified depends on the Hadoop distribution.
      
      The new options are SPARK_YARN_JAVA_OPTS or spark.yarn.extraJavaOptions. I make it a Spark global-level option because, once it is set up in spark-defaults.conf, users typically don't want to specify it on the command line each time they submit a Spark job.
      
      In addition, enabling these extra options to be passed to the AM provides more flexibility.
      
      For example, the following valid mapred-site.xml file specifies the classpath using a system property, and Hadoop handles it correctly because the property is passed in via Java options.
      
      Currently Spark breaks on this example because hadoop.version is not passed in:
        <property>
          <name>mapreduce.application.classpath</name>
          <value>/etc/hadoop/${hadoop.version}/mapreduce/*</value>
        </property>
      
      In the meantime, we cannot rely on mapreduce.admin.map.child.java.opts in mapred-site.xml, because it specifies its own extra Java options, which do not apply to Spark.
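      
      A hedged usage sketch (per the commit log below, the option key ended up as `spark.yarn.am.extraJavaOptions`; the version value is illustrative):
      
      ```scala
      import org.apache.spark.SparkConf

      // Pass a system property to the YARN AM so Hadoop can substitute
      // ${hadoop.version} in mapreduce.application.classpath.
      val conf = new SparkConf()
        .set("spark.yarn.am.extraJavaOptions", "-Dhadoop.version=2.4.0")
      ```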
      
      Author: Zhan Zhang <zhazhan@gmail.com>
      
      Closes #3409 from zhzhan/Spark-4461 and squashes the following commits:
      
      daec3d0 [Zhan Zhang] solve review comments
      08f44a7 [Zhan Zhang] add warning in driver mode if spark.yarn.am.extraJavaOptions is configured
      5a505d3 [Zhan Zhang] solve review comments
      4ed43ad [Zhan Zhang] solve review comments
      ad777ed [Zhan Zhang] Merge branch 'master' into Spark-4461
      3e9e574 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      e3f9abe [Zhan Zhang] solve review comments
      8963552 [Zhan Zhang] rebase
      f8f6700 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      dea1692 [Zhan Zhang] change the option key name to client mode specific
      90d5dff [Zhan Zhang] rebase
      8ac9254 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      092a25f [Zhan Zhang] solve review comments
      bc5a9ae [Zhan Zhang] solve review comments
      782b014 [Zhan Zhang] add new configuration to docs/running-on-yarn.md and remove it from spark-defaults.conf.template
      6faaa97 [Zhan Zhang] solve review comments
      369863f [Zhan Zhang] clean up unnecessary var
      733de9c [Zhan Zhang] Merge branch 'master' into Spark-4461
      a68e7f0 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      864505a [Zhan Zhang] Add extra java options to be passed to Yarn application master
      15830fc [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      685d911 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      03ebad3 [Zhan Zhang] Merge branch 'master' of https://github.com/zhzhan/spark
      46d9e3d [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      ebb213a [Zhan Zhang] revert
      b983ef3 [Zhan Zhang] test
      c4efb9b [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      779d67b [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      4daae6d [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      12e1be5 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      ce0ca7b [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      93f3081 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      3764505 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      a9d372b [Zhan Zhang] Merge branch 'master' of https://github.com/zhzhan/spark
      a00f60f [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      497b0f4 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      4a2e36d [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      a72c0d4 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      301eb4a [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      cedcc6f [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      adf4924 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      d10bf00 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      7e0cc36 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      68deb11 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      3ee3b2b [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      2b0d513 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      1ccd7cc [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      af9feb9 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      e4c1982 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      921e914 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      789ea21 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      cb53a2c [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      f6a8a40 [Zhan Zhang] revert
      ba14f28 [Zhan Zhang] test
  2. Dec 17, 2014
    • [SPARK-4822] Use sphinx tags for Python doc annotations · 3cd51619
      lewuathe authored
      Modify Python annotations for Sphinx. There is no change to the build process described in
      https://github.com/apache/spark/blob/master/docs/README.md
      
      Author: lewuathe <lewuathe@me.com>
      
      Closes #3685 from Lewuathe/sphinx-tag-for-pydoc and squashes the following commits:
      
      88a0fd9 [lewuathe] [SPARK-4822] Fix DevelopApi and WARN tags
      3d7a398 [lewuathe] [SPARK-4822] Use sphinx tags for Python doc annotations
    • MAINTENANCE: Automated closing of pull requests. · ca126089
      Patrick Wendell authored
      This commit exists to close the following pull requests on Github:
      
      Closes #3137 (close requested by 'marmbrus')
      Closes #3362 (close requested by 'marmbrus')
      Closes #2979 (close requested by 'JoshRosen')
      Closes #2223 (close requested by 'JoshRosen')
      Closes #2998 (close requested by 'marmbrus')
      Closes #3202 (close requested by 'marmbrus')
      Closes #3079 (close requested by 'marmbrus')
      Closes #3210 (close requested by 'marmbrus')
      Closes #2764 (close requested by 'marmbrus')
      Closes #3618 (close requested by 'marmbrus')
      Closes #3501 (close requested by 'marmbrus')
      Closes #2768 (close requested by 'marmbrus')
      Closes #3381 (close requested by 'marmbrus')
      Closes #3510 (close requested by 'marmbrus')
      Closes #3703 (close requested by 'marmbrus')
      Closes #2543 (close requested by 'marmbrus')
      Closes #2876 (close requested by 'marmbrus')
      Closes #1281 (close requested by 'JoshRosen')
    • [SPARK-3891][SQL] Add array support to percentile, percentile_approx and... · f33d5504
      Venkata Ramana Gollamudi authored
      [SPARK-3891][SQL] Add array support to percentile, percentile_approx and constant inspectors support
      
      Supported passing arrays to the percentile and percentile_approx UDAFs.
      To support percentile_approx, constant object inspectors are supported for GenericUDAF.
      Constant folding support was added to the CreateArray expression.
      Re-evaluation of constant UDF expressions is avoided.
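      
      Hedged examples of the newly supported call shape (assuming a `HiveContext` in scope as `hiveContext` and Hive's usual `src` test table with an integral `key` column):
      
      ```scala
      // percentile and percentile_approx now accept an array of percentiles.
      hiveContext.sql(
        "SELECT percentile(cast(key AS bigint), array(0.25, 0.5, 0.75)) FROM src")
      hiveContext.sql(
        "SELECT percentile_approx(cast(key AS double), array(0.9, 0.95, 0.99)) FROM src")
      ```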
      
      Author: Venkata Ramana G <ramana.gollamudi@huawei.com>
      
      Author: Venkata Ramana Gollamudi <ramana.gollamudi@huawei.com>
      
      Closes #2802 from gvramana/percentile_array_support and squashes the following commits:
      
      a0182e5 [Venkata Ramana Gollamudi] fixed review comment
      a18f917 [Venkata Ramana Gollamudi] avoid constant udf expression re-evaluation - fixes failure due to return iterator and value type mismatch
      c46db0f [Venkata Ramana Gollamudi] Removed TestHive reset
      4d39105 [Venkata Ramana Gollamudi] Unified inspector creation, style check fixes
      f37fd69 [Venkata Ramana Gollamudi] Fixed review comments
      47f6365 [Venkata Ramana Gollamudi] fixed test
      cb7c61e [Venkata Ramana Gollamudi] Supported ConstantInspector for UDAF Fixed HiveUdaf wrap object issue.
      7f94aff [Venkata Ramana Gollamudi] Added foldable support to CreateArray
    • [SPARK-4856] [SQL] NullType instead of StringType when sampling against empty string or nul... · 8d0d2a65
      Cheng Hao authored
      ```
      TestSQLContext.sparkContext.parallelize(
        """{"ip":"27.31.100.29","headers":{"Host":"1.abc.com","Charset":"UTF-8"}}""" ::
        """{"ip":"27.31.100.29","headers":{}}""" ::
        """{"ip":"27.31.100.29","headers":""}""" :: Nil)
      ```
      Because the empty string value of "headers" in lines 2 and 3 is initially inferred as String, schema inference ignores the real nested data type (the struct type of "headers" in line 1) and also treats "headers" in line 1 as StringType, which is not what we expect.
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #3708 from chenghao-intel/json and squashes the following commits:
      
      e7a72e9 [Cheng Hao] add more concise unit test
      853de51 [Cheng Hao] NullType instead of StringType when sampling against empty string or null value
    • [HOTFIX][SQL] Fix parquet filter suite · 19c0faad
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #3727 from marmbrus/parquetNotEq and squashes the following commits:
      
      2157bfc [Michael Armbrust] Fix parquet filter suite
    • [SPARK-4821] [mllib] [python] [docs] Fix for pyspark.mllib.rand doc · affc3f46
      Joseph K. Bradley authored
      + small doc edit
      + include edit to make IntelliJ happy
      
      CC: davies  mengxr
      
      Note to davies  -- this does not fix the "WARNING: Literal block expected; none found." warnings since that seems to involve spacing which IntelliJ does not like.  (Those warnings occur when generating the Python docs.)
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #3669 from jkbradley/python-warnings and squashes the following commits:
      
      4587868 [Joseph K. Bradley] fixed warning
      8cb073c [Joseph K. Bradley] Updated based on davies recommendation
      c51eca4 [Joseph K. Bradley] Updated rst file for pyspark.mllib.rand doc.  Small doc edit.  Small include edit to make IntelliJ happy.
    • [SPARK-3739] [SQL] Update the split num base on block size for table scanning · 636d9fc4
      Cheng Hao authored
      In local mode, Hadoop/Hive ignores "mapred.map.tasks", so a small table file always yields a single input split. SparkSQL, however, doesn't honor that in table scanning, so we get different results in the Hive compatibility tests. This PR fixes that.
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #2589 from chenghao-intel/source_split and squashes the following commits:
      
      dff38e7 [Cheng Hao] Remove the extra blank line
      160a2b6 [Cheng Hao] fix the compiling bug
      04d67f7 [Cheng Hao] Keep 1 split for small file in table scanning
    • [SPARK-4755] [SQL] sqrt(negative value) should return null · 902e4d54
      Daoyuan Wang authored
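      A hedged illustration of the fixed behavior (`sqlContext` assumed in scope):
      
      ```scala
      // After this fix, a negative input yields NULL instead of NaN.
      sqlContext.sql("SELECT SQRT(-1)").collect() // => Array([null])
      ```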
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #3616 from adrian-wang/sqrt and squashes the following commits:
      
      d877439 [Daoyuan Wang] fix NULLTYPE
      3effa2c [Daoyuan Wang] sqrt(negative value) should return null
    • [SPARK-4493][SQL] Don't pushdown Eq, NotEq, Lt, LtEq, Gt and GtEq predicates with nulls for Parquet · 62771353
      Cheng Lian authored
      Predicates like `a = NULL` and `a < NULL` can't be pushed down, since Parquet's `Lt`, `LtEq`, `Gt` and `GtEq` don't accept null values. Note that `Eq` and `NotEq` can only be used with `null` to represent predicates like `a IS NULL` and `a IS NOT NULL`.
      
      However, this issue normally doesn't cause an NPE, because any value compared to `NULL` evaluates to `NULL`, and Spark SQL automatically optimizes out `NULL` predicates in the `SimplifyFilters` rule. Only testing code that intentionally disables the optimizer may trigger this issue. (That's why this issue is not marked as a blocker, and I do **NOT** think we need to backport this to branch-1.1.)
      
      This PR restricts `Lt`, `LtEq`, `Gt` and `GtEq` to non-null values only, and only uses `Eq` with a null value to push down `IsNull` and `IsNotNull`. It also adds support for the Parquet `NotEq` filter, for completeness and a (tiny) performance gain; it is also used to push down `IsNotNull`.
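      
      A hedged illustration (table and column names invented, `sqlContext` assumed in scope):
      
      ```scala
      // Both of these normally collapse to a constant NULL filter in the
      // SimplifyFilters rule, so Parquet never sees a null-valued Lt filter;
      // only IS NULL / IS NOT NULL reach Parquet, via Eq/NotEq with null.
      sqlContext.sql("SELECT * FROM t WHERE a < NULL")      // optimized away
      sqlContext.sql("SELECT * FROM t WHERE a IS NOT NULL") // pushed down as NotEq(a, null)
      ```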
      
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #3367 from liancheng/filters-with-null and squashes the following commits:
      
      cc41281 [Cheng Lian] Fixes several styling issues
      de7de28 [Cheng Lian] Adds stricter rules for Parquet filters with null
    • [SPARK-3698][SQL] Fix case insensitive resolution of GetField. · 7ad579ee
      Michael Armbrust authored
      Based on #2543.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #3724 from marmbrus/resolveGetField and squashes the following commits:
      
      0a47aae [Michael Armbrust] Fix case insensitive resolution of GetField.
    • [SPARK-4694]Fix HiveThriftServer2 cann't stop In Yarn HA mode. · 4782def0
      carlmartin authored
      HiveThriftServer2 cannot exit automatically when the standby resource manager takes over in YARN HA mode.
      The scheduler backend is aware that the AM has exited and calls sc.stop to exit the driver process, but there is a user thread (HiveThriftServer2) that is still alive, which causes this problem.
      To fix it, add a daemon thread that detects whether the SparkContext has been stopped; if it has, call HiveThriftServer2.stop to stop the user thread.
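      
      A hedged sketch of the listener approach the commits below settled on (class name and wiring simplified):
      
      ```scala
      import org.apache.hive.service.server.HiveServer2
      import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationEnd}

      // Stop the thrift server's user thread once the application ends, so
      // the driver process can exit after a YARN RM failover stops the context.
      class StopThriftServerListener(server: HiveServer2) extends SparkListener {
        override def onApplicationEnd(end: SparkListenerApplicationEnd): Unit = {
          server.stop()
        }
      }
      // Wiring (assumed): sc.addSparkListener(new StopThriftServerListener(server))
      ```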
      
      Author: carlmartin <carlmartinmax@gmail.com>
      
      Closes #3576 from SaintBacchus/ThriftServer2ExitBug and squashes the following commits:
      
      2890b4a [carlmartin] Use SparkListener instead of the demo thread to stop the hive server.
      c15da0e [carlmartin] HiveThriftServer2 can not exit automactic when changing the standy resource manager in Yarn HA mode
    • [SPARK-4625] [SQL] Add sort by for DSL & SimpleSqlParser · 5fdcbdc0
      Cheng Hao authored
      Add `sort by` support for both DSL & SqlParser.
      
      This PR is related to #3386; whichever one is merged first, the other will need to be rebased.
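      
      Hedged examples of both entry points (names illustrative; the DSL shape is assumed, and the symbol syntax needs `import sqlContext._`):
      
      ```scala
      // SQL parser: SORT BY orders rows within each partition, unlike the
      // globally ordering ORDER BY.
      sqlContext.sql("SELECT key, value FROM records SORT BY key")

      // DSL on SchemaRDD (assumed shape):
      records.sortBy('key.asc)
      ```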
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #3481 from chenghao-intel/sortby and squashes the following commits:
      
      041004f [Cheng Hao] Add sort by for DSL & SimpleSqlParser
    • [SPARK-4595][Core] Fix MetricsServlet not work issue · cf50631a
      Saisai Shao authored
      The `MetricsServlet` handler should be added to the web UI after it is initialized by `MetricsSystem`; otherwise the servlet handler cannot be attached.
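      
      A hedged sketch of the required ordering (method names as in Spark's `MetricsSystem`, with `metricsSystem` and `webUi` assumed in scope):
      
      ```scala
      // Start the metrics system first; only then do the servlet handlers
      // exist, so only then can they be attached to the web UI.
      metricsSystem.start()
      metricsSystem.getServletHandlers.foreach(webUi.attachHandler)
      ```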
      
      Author: Saisai Shao <saisai.shao@intel.com>
      Author: Josh Rosen <joshrosen@databricks.com>
      Author: jerryshao <saisai.shao@intel.com>
      
      Closes #3444 from jerryshao/SPARK-4595 and squashes the following commits:
      
      434d17e [Saisai Shao] Merge pull request #10 from JoshRosen/metrics-system-cleanup
      87a2292 [Josh Rosen] Guard against misuse of MetricsSystem methods.
      f779fe0 [jerryshao] Fix MetricsServlet not work issue
    • [HOTFIX] Fix RAT exclusion for known_translations file · 3d0c37b8
      Josh Rosen authored
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #3719 from JoshRosen/rat-fix and squashes the following commits:
      
      1542886 [Josh Rosen] [HOTFIX] Fix RAT exclusion for known_translations file
    • [Release] Update contributors list format and sort it · 4e1112e7
      Andrew Or authored
      Additionally, we now warn the user when a duplicate author name
      arises, in which case he/she needs to resolve it manually.
  3. Dec 16, 2014
    • [SPARK-4618][SQL] Make foreign DDL commands options case-insensitive · 60698801
      scwf authored
      Use lowercase for the `options` keys to make them case-insensitive, and accordingly use lowercase to get values from the parameters.
      With this change, the following command works:
      ```
            create temporary table normal_parquet
            USING org.apache.spark.sql.parquet
            OPTIONS (
              PATH '/xxx/data'
            )
      ```
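      
      A hedged sketch of the `CaseInsensitiveMap` idea introduced in the commits below (simplified, not the actual implementation):
      
      ```scala
      // Lowercase keys on construction and on lookup, so OPTIONS keys match
      // regardless of how the user capitalized them (PATH, path, Path, ...).
      class CaseInsensitiveMap(base: Map[String, String]) {
        private val lowered = base.map { case (k, v) => (k.toLowerCase, v) }
        def get(key: String): Option[String] = lowered.get(key.toLowerCase)
      }

      // new CaseInsensitiveMap(Map("PATH" -> "/xxx/data")).get("path")
      // returns Some("/xxx/data")
      ```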
      
      Author: scwf <wangfei1@huawei.com>
      Author: wangfei <wangfei1@huawei.com>
      
      Closes #3470 from scwf/ddl-ulcase and squashes the following commits:
      
      ae78509 [scwf] address comments
      8f4f585 [wangfei] address comments
      3c132ef [scwf] minor fix
      a0fc20b [scwf] Merge branch 'master' of https://github.com/apache/spark into ddl-ulcase
      4f86401 [scwf] adding CaseInsensitiveMap
      e244e8d [wangfei] using lower case in json
      e0cb017 [wangfei] make options in-casesensitive
    • [SPARK-4866] support StructType as key in MapType · ec5c4279
      Davies Liu authored
      This PR brings support for using StructType (and other hashable types) as a key in MapType.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #3714 from davies/fix_struct_in_map and squashes the following commits:
      
      68585d7 [Davies Liu] fix primitive types in MapType
      9601534 [Davies Liu] support StructType as key in MapType
    • [SPARK-4375] [SQL] Add 0 argument support for udf · 770d8153
      Cheng Hao authored
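      A hedged usage sketch (function name and body invented; `registerFunction` was the 1.x registration API on SQLContext):
      
      ```scala
      // Register and call a zero-argument UDF.
      sqlContext.registerFunction("answer", () => 42)
      sqlContext.sql("SELECT answer()").collect()
      ```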
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #3595 from chenghao-intel/udf0 and squashes the following commits:
      
      a858973 [Cheng Hao] Add 0 arguments support for udf
    • [SPARK-4720][SQL] Remainder should also return null if the divider is 0. · ddc7ba31
      Takuya UESHIN authored
      This is a follow-up of SPARK-4593 (#3443).
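      
      A hedged illustration of the fixed behavior (`sqlContext` assumed in scope):
      
      ```scala
      // After this fix, a zero divider yields NULL, consistent with division.
      sqlContext.sql("SELECT 5 % 0").collect() // => Array([null])
      ```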
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #3581 from ueshin/issues/SPARK-4720 and squashes the following commits:
      
      c3959d4 [Takuya UESHIN] Make Remainder return null if the divider is 0.
    • [SPARK-4744] [SQL] Short circuit evaluation for AND & OR in CodeGen · 0aa834ad
      Cheng Hao authored
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #3606 from chenghao-intel/codegen_short_circuit and squashes the following commits:
      
      f466303 [Cheng Hao] short circuit for AND & OR
    • [SPARK-4798][SQL] A new set of Parquet testing API and test suites · 3b395e10
      Cheng Lian authored
      This PR provides a set of Parquet testing APIs (see trait `ParquetTest`) that enables developers to write more concise test cases; a sketch of the resulting style is shown after the list below. A new set of Parquet test suites built upon this API is added, aiming to replace the old `ParquetQuerySuite`. To avoid potential merge conflicts, the old testing code is not removed yet. The following classes can be safely removed after most Parquet related PRs are handled:
      
      - `ParquetQuerySuite`
      - `ParquetTestData`
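      
      A hedged sketch of the more concise style this enables (helper names such as `withParquetTable` are assumed from the trait's purpose, and `sql`/`checkAnswer` come from the surrounding test infrastructure):
      
      ```scala
      // The helper is assumed to write the tuples to a temporary Parquet
      // file, register it as table "t", run the body, then clean up.
      test("simple projection with filter") {
        withParquetTable((0 until 10).map(i => (i, i.toString)), "t") {
          checkAnswer(
            sql("SELECT _1 FROM t WHERE _1 < 3"),
            Seq(Seq(0), Seq(1), Seq(2)))
        }
      }
      ```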
      
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #3644 from liancheng/parquet-tests and squashes the following commits:
      
      800e745 [Cheng Lian] Enforces ordering of test output
      3bb8731 [Cheng Lian] Refactors HiveParquetSuite
      aa2cb2e [Cheng Lian] Decouples ParquetTest and TestSQLContext
      7b43a68 [Cheng Lian] Updates ParquetTest Scaladoc
      7f07af0 [Cheng Lian] Adds a new set of Parquet test suites
    • [Release] Cache known author translations locally · b85044ec
      Andrew Or authored
      This bypasses unnecessary calls to the Github and JIRA API.
      Additionally, having a local cache allows us to remember names
      that we had to manually discover ourselves.
    • [Release] Major improvements to generate contributors script · 6f80b749
      Andrew Or authored
      This commit introduces several major improvements to the script
      that generates the contributors list for release notes, notably:
      
      (1) Use release tags instead of a range of commits. Across branches,
      commits are not actually strictly two-dimensional, and so it is not
      sufficient to specify a start hash and an end hash. Otherwise, we
      end up counting commits that were already merged in an older branch.
      
      (2) Match PR numbers in addition to commit hashes. This is related
      to the first point in that if a PR is already merged in an older
      minor release tag, it should be filtered out here. This requires us
      to do some intelligent regex parsing on the commit description in
      addition to just relying on the GitHub API.
      
      (3) Relax author validity check. The old code fails on a name that
      has many middle names, for instance. The test was just too strict.
      
      (4) Use GitHub authentication. This allows us to make far more
      requests through the GitHub API than before (5000 as opposed to 60
      per hour).
      
      (5) Translate from Github username, not commit author name. This is
      important because the commit author name is not always configured
      correctly by the user. For instance, the username "falaki" used to
      resolve to just "Hossein", which was treated as a github username
      and translated to something else that is completely arbitrary.
      
      (6) Add an option to use the untranslated name. If there is not
      a satisfactory candidate to replace the untranslated name with,
      at least allow the user to not translate it.
    • [SPARK-4269][SQL] make wait time configurable in BroadcastHashJoin · fa66ef6c
      Jacky Li authored
      In BroadcastHashJoin, a hard-coded value (5 minutes) is currently used as the time to wait for the execution and broadcast of the small table.
      In my opinion, it should be a configurable value, since a broadcast may exceed 5 minutes in some cases, such as a busy or congested network environment.
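      
      With this change the wait becomes configurable, e.g. (value illustrative; per the commits below the key is `spark.sql.broadcastTimeout`):
      
      ```scala
      // Raise the broadcast wait from the 5-minute default (value in seconds).
      sqlContext.setConf("spark.sql.broadcastTimeout", "600")
      ```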
      
      Author: Jacky Li <jacky.likun@huawei.com>
      
      Closes #3133 from jackylk/timeout-config and squashes the following commits:
      
      733ac08 [Jacky Li] add spark.sql.broadcastTimeout in SQLConf.scala
      557acd4 [Jacky Li] switch to sqlContext.getConf
      81a5e20 [Jacky Li] make wait time configurable in BroadcastHashJoin
    • [SPARK-4827][SQL] Fix resolution of deeply nested Project(attr, Project(Star,...)). · a66c23e1
      Michael Armbrust authored
      Since `AttributeReference` resolution and `*` expansion are currently in separate rules, each pair requires a full iteration instead of being able to resolve in a single pass. Since it's pretty easy to construct queries that have many of these in a row, I combine them into a single rule in this PR.
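      
      An illustrative query of the shape that used to cost one analyzer iteration per nesting level (names invented, `sqlContext` assumed in scope):
      
      ```scala
      // Each nested `SELECT *` previously required its own fixed-point
      // iteration before the outermost attribute reference could resolve.
      sqlContext.sql(
        "SELECT a FROM (SELECT * FROM (SELECT * FROM (SELECT a FROM t) s1) s2) s3")
      ```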
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #3674 from marmbrus/projectStars and squashes the following commits:
      
      d83d6a1 [Michael Armbrust] Fix resolution of deeply nested Project(attr, Project(Star,...)).
    • [SPARK-4483][SQL]Optimization about reduce memory costs during the HashOuterJoin · 30f6b85c
      tianyi authored
      In `HashOuterJoin.scala`, Spark reads data from both sides of the join before zipping them together, which wastes memory. This patch reads data from only one side, puts it into a hashmap, and then generates the `JoinedRow`s with data from the other side one by one.
      Currently, we can only do this optimization for `left outer join` and `right outer join`. For `full outer join`, we will do something in another issue.
      
      For a benchmark where table test_csv contains 1 million records and table dim_csv contains 10 thousand records, running the SQL:
      
      `select * from test_csv a left outer join dim_csv b on a.key = b.key`
      
      the result is:
      master:
      ```
      CSV: 12671 ms
      CSV: 9021 ms
      CSV: 9200 ms
      Current Mem Usage:787788984
      ```
      after patch:
      ```
      CSV: 10382 ms
      CSV: 7543 ms
      CSV: 7469 ms
      Current Mem Usage:208145728
      ```
      
      Author: tianyi <tianyi@asiainfo-linkage.com>
      Author: tianyi <tianyi.asiainfo@gmail.com>
      
      Closes #3375 from tianyi/SPARK-4483 and squashes the following commits:
      
      72a8aec [tianyi] avoid having mutable state stored inside of the task
      99c5c97 [tianyi] performance optimization
      d2f94d7 [tianyi] fix bug: missing output when the join-key is null.
      2be45d1 [tianyi] fix spell bug
      1f2c6f1 [tianyi] remove commented codes
      a676de6 [tianyi] optimize some codes
      9e7d5b5 [tianyi] remove commented old codes
      838707d [tianyi] Optimization about reduce memory costs during the HashOuterJoin
    • [SPARK-4527][SQl]Add BroadcastNestedLoopJoin operator selection testsuite · ea1315e3
      wangxiaojing authored
      Add a BroadcastNestedLoopJoin operator selection test suite to `JoinSuite`.
      
      Author: wangxiaojing <u9jing@gmail.com>
      
      Closes #3395 from wangxiaojing/SPARK-4527 and squashes the following commits:
      
      ea0e495 [wangxiaojing] change style
      53c3952 [wangxiaojing] Add BroadcastNestedLoopJoin operator selection testsuite
    • SPARK-4767: Add support for launching in a specified placement group to spark_ec2 · b0dfdbdd
      Holden Karau authored
      Placement groups are cool and all the cool kids are using them. Let's add support for them to spark_ec2.py because I'm lazy
      
      Author: Holden Karau <holden@pigscanfly.ca>
      
      Closes #3623 from holdenk/SPARK-4767-add-support-for-launching-in-a-specified-placement-group-to-spark-ec2-scripts and squashes the following commits:
      
      111a5fd [Holden Karau] merge in master
      70ace25 [Holden Karau] Placement groups are cool and all the cool kids are using them. Lets add support for them to spark_ec2.py because I'm lazy
    • [SPARK-4812][SQL] Fix the initialization issue of 'codegenEnabled' · 6530243a
      zsxwing authored
      The problem is that `codegenEnabled` is a `val`, but it uses the `val` `sqlContext`, which can be overridden by subclasses. Here is a simple example that shows the issue.
      
      ```Scala
      scala> :paste
      // Entering paste mode (ctrl-D to finish)
      
      abstract class Foo {
      
        protected val sqlContext = "Foo"
      
        val codegenEnabled: Boolean = {
          println(sqlContext) // it will call subclass's `sqlContext` which has not yet been initialized.
          if (sqlContext != null) {
            true
          } else {
            false
          }
        }
      }
      
      class Bar extends Foo {
        override val sqlContext = "Bar"
      }
      
      println(new Bar().codegenEnabled)
      
      // Exiting paste mode, now interpreting.
      
      null
      false
      defined class Foo
      defined class Bar
      ```
      
      We should make `sqlContext` `final` to prevent subclasses from overriding it incorrectly.
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #3660 from zsxwing/SPARK-4812 and squashes the following commits:
      
      1cbb623 [zsxwing] Make `sqlContext` final to prevent subclasses from overriding it incorrectly
    • [SPARK-4847][SQL]Fix "extraStrategies cannot take effect in SQLContext" issue · dc8280dc
      jerryshao authored
      Author: jerryshao <saisai.shao@intel.com>
      
      Closes #3698 from jerryshao/SPARK-4847 and squashes the following commits:
      
      4741130 [jerryshao] Make later added extraStrategies effect when calling strategies
    • [DOCS][SQL] Add a Note on jsonFile having separate JSON objects per line · 1a9e35e5
      Peter Vandenabeele authored
      * This commit hopes to avoid the confusion I faced when trying
        to submit a regular, valid multi-line JSON file; see also
      
        http://apache-spark-user-list.1001560.n3.nabble.com/Loading-JSON-Dataset-fails-with-com-fasterxml-jackson-databind-JsonMappingException-td20041.html
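      
      A hedged illustration of the constraint this note documents (`sqlContext` assumed in scope):
      
      ```scala
      // Works: each line of people.json is one complete JSON object, e.g.
      //   {"name":"Michael"}
      //   {"name":"Andy", "age":30}
      val people = sqlContext.jsonFile("people.json")

      // Does not work as intended: one object pretty-printed across several
      // lines is parsed line by line, not as a single JSON document.
      ```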
      
      Author: Peter Vandenabeele <peter@vandenabeele.com>
      
      Closes #3517 from petervandenabeele/pv-docs-note-on-jsonFile-format/01 and squashes the following commits:
      
      1f98e52 [Peter Vandenabeele] Revert to people.json and simple Note text
      6b6e062 [Peter Vandenabeele] Change the "JSON" connotation to "txt"
      fca7dfb [Peter Vandenabeele] Add a Note on jsonFile having separate JSON objects per line
    • [SQL] SPARK-4700: Add HTTP protocol spark thrift server · 17688d14
      Judy Nash authored
      Add HTTP protocol support and test cases to the Spark thrift server, so users can deploy the thrift server in both TCP and HTTP modes.
      
      Author: Judy Nash <judynash@microsoft.com>
      Author: judynash <judynash@microsoft.com>
      
      Closes #3672 from judynash/master and squashes the following commits:
      
      526315d [Judy Nash] correct spacing on startThriftServer method
      31a6520 [Judy Nash] fix code style issues and update sql programming guide format issue
      47bf87e [Judy Nash] modify withJdbcStatement method definition to meet less than 100 line length
      2e9c11c [Judy Nash] add thrift server in http mode documentation on sql programming guide
      1cbd305 [Judy Nash] Merge remote-tracking branch 'upstream/master'
      2b1d312 [Judy Nash] updated http thrift server support based on feedback
      377532c [judynash] add HTTP protocol spark thrift server
    • [SPARK-3405] add subnet-id and vpc-id options to spark_ec2.py · d12c0711
      Mike Jennings authored
      Based on this gist:
      https://gist.github.com/amar-analytx/0b62543621e1f246c0a2
      
      We use security group ids instead of security group to get around this issue:
      https://github.com/boto/boto/issues/350
      
      Author: Mike Jennings <mvj101@gmail.com>
      Author: Mike Jennings <mvj@google.com>
      
      Closes #2872 from mvj101/SPARK-3405 and squashes the following commits:
      
      be9cb43 [Mike Jennings] `pep8 spark_ec2.py` runs cleanly.
      4dc6756 [Mike Jennings] Remove duplicate comment
      731d94c [Mike Jennings] Update for code review.
      ad90a36 [Mike Jennings] Merge branch 'master' of https://github.com/apache/spark into SPARK-3405
      1ebffa1 [Mike Jennings] Merge branch 'master' into SPARK-3405
      52aaeec [Mike Jennings] [SPARK-3405] add subnet-id and vpc-id options to spark_ec2.py
    • [SPARK-4855][mllib] testing the Chi-squared hypothesis test · cb484474
      jbencook authored
      This PR tests the pyspark Chi-squared hypothesis test from commit c8abddc5 and moves some of the error messaging into Python.
      
      It is a port of the Scala tests here: [HypothesisTestSuite.scala](https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/mllib/stat/HypothesisTestSuite.scala)
      
      Hopefully, SPARK-2980 can be closed.
      
      Author: jbencook <jbenjamincook@gmail.com>
      
      Closes #3679 from jbencook/master and squashes the following commits:
      
      44078e0 [jbencook] checking that bad input throws the correct exceptions
      f12ee10 [jbencook] removing checks for ValueError since input tests are on the Scala side
      7536cf1 [jbencook] removing python checks for invalid input
      a17ee84 [jbencook] [SPARK-2980][mllib] adding unit tests for the pyspark chi-squared test
      3aeb0d9 [jbencook] [SPARK-2980][mllib] bringing Chi-squared error messages to the python side