  1. Jan 28, 2016
    • Tejas Patil's avatar
      [SPARK-12926][SQL] SQLContext to display warning message when non-sql configs are being set · 67680396
      Tejas Patil authored
      Users unknowingly try to set core Spark configs in SQLContext but later realise that it didn't work, e.g. sqlContext.sql("SET spark.shuffle.memoryFraction=0.4"). This PR adds a warning message when such operations are done.
      
      Author: Tejas Patil <tejasp@fb.com>
      
      Closes #10849 from tejasapatil/SPARK-12926.
      67680396
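      For illustration, a minimal sketch of the scenario this warning targets (assuming a Spark 1.x `sqlContext`; the config values are just examples):

      ```scala
      // Core (non-SQL) Spark configs set through SQL "SET" only update the SQLContext's
      // property map; they do not reconfigure the running SparkContext, so a warning is
      // now logged to make that clear.
      sqlContext.sql("SET spark.shuffle.memoryFraction=0.4")   // core config: warning logged

      // SQL-specific configs remain the intended use of SET and behave as before.
      sqlContext.sql("SET spark.sql.shuffle.partitions=200")
      ```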
    • Cheng Lian's avatar
      [SPARK-12818][SQL] Specialized integral and string types for Count-min Sketch · 415d0a85
      Cheng Lian authored
      This PR is a follow-up of #10911. It adds specialized update methods for `CountMinSketch` so that we can avoid doing internal/external row format conversion in `DataFrame.countMinSketch()`.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #10968 from liancheng/cms-specialized.
      415d0a85
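      A hedged sketch of the spark-sketch API this builds on; the specialized method names (`addLong`, `addString`) are assumptions based on the PR description:

      ```scala
      import org.apache.spark.util.sketch.CountMinSketch

      // Relative error, confidence, and seed; the specialized add methods avoid
      // row-format conversion and boxing for integral and string items.
      val cms = CountMinSketch.create(0.001, 0.99, 42)
      cms.addLong(12L)
      cms.addString("spark")

      println(cms.estimateCount(12L))
      println(cms.estimateCount("spark"))
      ```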
    • James Lohse's avatar
      Provide same info as in spark-submit --help · c2204436
      James Lohse authored
      This is already stated for --packages and --repositories. Without stating it for --jars, people expect a standard Java classpath to work, with wildcard expansion and a delimiter other than a comma. Currently this is only stated in the --help output for spark-submit: "Comma-separated list of local jars to include on the driver and executor classpaths."
      
      Author: James Lohse <jimlohse@users.noreply.github.com>
      
      Closes #10890 from jimlohse/patch-1.
      c2204436
  2. Jan 27, 2016
    • Nong Li's avatar
      [SPARK-13045] [SQL] Remove ColumnVector.Struct in favor of ColumnarBatch.Row · 4a091232
      Nong Li authored
      These two classes became identical as the implementation progressed.
      
      Author: Nong Li <nong@databricks.com>
      
      Closes #10952 from nongli/spark-13045.
      4a091232
    • Andrew Or's avatar
      [HOTFIX] Fix Scala 2.11 compilation · d702f0c1
      Andrew Or authored
      by explicitly marking annotated parameters as vals (SI-8813).
      
      Caused by #10835.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #10955 from andrewor14/fix-scala211.
      d702f0c1
    • Herman van Hovell's avatar
      [SPARK-12865][SPARK-12866][SQL] Migrate SparkSQLParser/ExtendedHiveQlParser commands to new Parser · ef96cd3c
      Herman van Hovell authored
      This PR moves all the functionality provided by the SparkSQLParser/ExtendedHiveQlParser to the new Parser hierarchy (SparkQl/HiveQl). This also improves the current SET command parsing: the current implementation swallows ```set role ...``` and ```set autocommit ...``` commands; this PR respects these commands (and passes them on to Hive).
      
      This PR and https://github.com/apache/spark/pull/10723 end the use of Parser-Combinator parsers for SQL parsing. As a result we can also remove the ```AbstractSQLParser``` in Catalyst.
      
      The PR is marked WIP as long as it doesn't pass all tests.
      
      cc rxin viirya winningsix (this touches https://github.com/apache/spark/pull/10144)
      
      Author: Herman van Hovell <hvanhovell@questtec.nl>
      
      Closes #10905 from hvanhovell/SPARK-12866.
      ef96cd3c
    • Wenchen Fan's avatar
      [SPARK-12938][SQL] DataFrame API for Bloom filter · 680afabe
      Wenchen Fan authored
      This PR integrates Bloom filter from spark-sketch into DataFrame. This version resorts to RDD.aggregate for building the filter. A more performant UDAF version can be built in future follow-up PRs.
      
      This PR also adds two specialized `put` versions (`putBinary` and `putLong`) to `BloomFilter`, which makes it easier to build a Bloom filter over a `DataFrame`.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #10937 from cloud-fan/bloom-filter.
      680afabe
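      A minimal sketch of the DataFrame-level API described above, assuming a Spark 1.x-style `sqlContext` (the column name and parameters are placeholders):

      ```scala
      // Build a Bloom filter over the "id" column: expected number of items and
      // target false-positive probability.
      val df = sqlContext.range(0, 10000)
      val bf = df.stat.bloomFilter("id", 10000L, 0.03)

      println(bf.mightContain(42L))   // true (no false negatives)
      println(bf.mightContain(-1L))   // false with high probability
      ```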
    • Josh Rosen's avatar
      [SPARK-13021][CORE] Fail fast when custom RDDs violate RDD.partition's API contract · 32f74111
      Josh Rosen authored
      Spark's `Partition` and `RDD.partitions` APIs have a contract which requires custom implementations of `RDD.partitions` to ensure that for all `x`, `rdd.partitions(x).index == x`; in other words, the `index` reported by a partition needs to match its position in the partitions array.
      
      If a custom RDD implementation violates this contract, then Spark has the potential to become stuck in an infinite recomputation loop when recomputing a subset of an RDD's partitions, since the tasks that are actually run will not correspond to the missing output partitions that triggered the recomputation. Here's a link to a notebook which demonstrates this problem: https://rawgit.com/JoshRosen/e520fb9a64c1c97ec985/raw/5e8a5aa8d2a18910a1607f0aa4190104adda3424/Violating%2520RDD.partitions%2520contract.html
      
      In order to guard against this infinite loop behavior, this patch modifies Spark so that it fails fast and refuses to compute RDDs whose `partitions` violate the API contract.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #10932 from JoshRosen/SPARK-13021.
      32f74111
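      For reference, a minimal sketch of a custom RDD that honors the contract (class names here are hypothetical):

      ```scala
      import org.apache.spark.{Partition, SparkContext, TaskContext}
      import org.apache.spark.rdd.RDD

      // Each partition's index must equal its position in the partitions array,
      // i.e. rdd.partitions(x).index == x for all x.
      class IdentityPartition(override val index: Int) extends Partition

      class OnePerPartitionRDD(sc: SparkContext, numParts: Int) extends RDD[Int](sc, Nil) {
        override protected def getPartitions: Array[Partition] =
          Array.tabulate(numParts)(i => new IdentityPartition(i))  // position i gets index i

        override def compute(split: Partition, context: TaskContext): Iterator[Int] =
          Iterator(split.index)
      }
      ```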
    • Andrew Or's avatar
      [SPARK-12895][SPARK-12896] Migrate TaskMetrics to accumulators · 87abcf7d
      Andrew Or authored
      The high level idea is that instead of having the executors send both accumulator updates and TaskMetrics, we should have them send only accumulator updates. This eliminates the need to maintain both code paths since one can be implemented in terms of the other. This effort is split into two parts:
      
      **SPARK-12895: Implement TaskMetrics using accumulators.** TaskMetrics is basically just a bunch of accumulable fields. This patch makes TaskMetrics a syntactic wrapper around a collection of accumulators so we don't need to send TaskMetrics from the executors to the driver.
      
      **SPARK-12896: Send only accumulator updates to the driver.** Now that TaskMetrics are expressed in terms of accumulators, we can capture all TaskMetrics values if we just send accumulator updates from the executors to the driver. This completes the parent issue SPARK-10620.
      
      While an effort has been made to preserve as much of the public API as possible, there were a few known breaking DeveloperApi changes that would be very awkward to maintain. I will gather the full list shortly and post it here.
      
      Note: This was once part of #10717. This patch is split out into its own patch from there to make it easier for others to review. Other smaller pieces have already been merged into master.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #10835 from andrewor14/task-metrics-use-accums.
      87abcf7d
    • Jason Lee's avatar
      [SPARK-10847][SQL][PYSPARK] Pyspark - DataFrame - Optional Metadata with... · edd47375
      Jason Lee authored
      [SPARK-10847][SQL][PYSPARK] Pyspark - DataFrame - Optional Metadata with `None` triggers cryptic failure
      
      The error message is now changed from "Do not support type class scala.Tuple2." to "Do not support type class org.json4s.JsonAST$JNull$" to be more informative about what is not supported. Also, StructType metadata now handles JNull correctly, i.e., {'a': None}. test_metadata_null is added to tests.py to show the fix works.
      
      Author: Jason Lee <cjlee@us.ibm.com>
      
      Closes #8969 from jasoncl/SPARK-10847.
      edd47375
    • Josh Rosen's avatar
      [SPARK-13023][PROJECT INFRA] Fix handling of root module in modules_to_test() · 41f0c85f
      Josh Rosen authored
      There's a minor bug in how we handle the `root` module in the `modules_to_test()` function in `dev/run-tests.py`: since `root` now depends on `build` (since every test needs to run on any build test), we now need to check for the presence of root in `modules_to_test` instead of `changed_modules`.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #10933 from JoshRosen/build-module-fix.
      41f0c85f
    • Andrew's avatar
      [SPARK-1680][DOCS] Explain environment variables for running on YARN in cluster mode · 093291cf
      Andrew authored
      JIRA SPARK-1680 added a property called spark.yarn.appMasterEnv.[EnvironmentVariableName]. This PR draws users' attention to this special case by adding an explanation in configuration.html#environment-variables.
      
      Author: Andrew <weiner.andrew.j@gmail.com>
      
      Closes #10869 from weineran/branch-yarn-docs.
      093291cf
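      A hedged configuration sketch of the special case being documented (the variable name and value below are placeholders, not part of the change):

      ```scala
      import org.apache.spark.SparkConf

      // In YARN cluster mode the driver runs inside the Application Master, so its
      // environment variables are set via the spark.yarn.appMasterEnv.[Name] prefix;
      // executor environments use spark.executorEnv.[Name] as usual.
      val conf = new SparkConf()
        .set("spark.yarn.appMasterEnv.PYSPARK_PYTHON", "/usr/bin/python2.7")
        .set("spark.executorEnv.PYSPARK_PYTHON", "/usr/bin/python2.7")
      ```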
    • BenFradet's avatar
      [SPARK-12983][CORE][DOC] Correct metrics.properties.template · 90b0e562
      BenFradet authored
      There are some typos or plain unintelligible sentences in the metrics template.
      
      Author: BenFradet <benjamin.fradet@gmail.com>
      
      Closes #10902 from BenFradet/SPARK-12983.
      90b0e562
  3. Jan 26, 2016
    • Xusen Yin's avatar
      [SPARK-12780] Inconsistency returning value of ML python models' properties · 4db255c7
      Xusen Yin authored
      https://issues.apache.org/jira/browse/SPARK-12780
      
      Author: Xusen Yin <yinxusen@gmail.com>
      
      Closes #10724 from yinxusen/SPARK-12780.
      4db255c7
    • Nishkam Ravi's avatar
      [SPARK-12967][NETTY] Avoid NettyRpc error message during sparkContext shutdown · bae3c9a4
      Nishkam Ravi authored
      If there's an RPC issue while the SparkContext is alive but being stopped (which would happen only while executing SparkContext.stop), log a warning instead. This is a common occurrence.
      
      vanzin
      
      Author: Nishkam Ravi <nishkamravi@gmail.com>
      Author: nishkamravi2 <nishkamravi@gmail.com>
      
      Closes #10881 from nishkamravi2/master_netty.
      bae3c9a4
    • Cheng Lian's avatar
      [SPARK-12728][SQL] Integrates SQL generation with native view · 58f5d8c1
      Cheng Lian authored
      This PR is a follow-up of PR #10541. It integrates the newly introduced SQL generation feature with native view to make native view canonical.
      
      In this PR, a new SQL option `spark.sql.nativeView.canonical` is added.  When this option and `spark.sql.nativeView` are both `true`, Spark SQL tries to handle `CREATE VIEW` DDL statements using SQL query strings generated from view definition logical plans. If we failed to map the plan to SQL, we fallback to the original native view approach.
      
      One important issue this PR fixes is that, now we can use CTE when defining a view.  Originally, when native view is turned on, we wrap the view definition text with an extra `SELECT`.  However, HiveQL parser doesn't allow CTE appearing as a subquery.  Namely, something like this is disallowed:
      
      ```sql
      SELECT n
      FROM (
        WITH w AS (SELECT 1 AS n)
        SELECT * FROM w
      ) v
      ```
      
      This PR fixes this issue because the extra `SELECT` is no longer needed (also, CTE expressions are inlined as subqueries during analysis phase, thus there won't be CTE expressions in the generated SQL query string).
      
      Author: Cheng Lian <lian@databricks.com>
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #10733 from liancheng/spark-12728.integrate-sql-gen-with-native-view.
      58f5d8c1
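      A rough sketch of the behavior described above, assuming a Hive-backed `sqlContext` and the two options named in the PR; with canonical native views enabled, a view defined with a CTE no longer needs the problematic extra `SELECT` wrapper:

      ```scala
      sqlContext.setConf("spark.sql.nativeView", "true")
      sqlContext.setConf("spark.sql.nativeView.canonical", "true")

      // The view definition is stored as a SQL string generated from the analyzed plan;
      // if generation fails, Spark falls back to the original native view approach.
      sqlContext.sql("""
        CREATE VIEW ones AS
        WITH w AS (SELECT 1 AS n)
        SELECT * FROM w
      """)
      ```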
    • Cheng Lian's avatar
      [SPARK-12935][SQL] DataFrame API for Count-Min Sketch · ce38a35b
      Cheng Lian authored
      This PR integrates Count-Min Sketch from spark-sketch into DataFrame. This version resorts to `RDD.aggregate` for building the sketch. A more performant UDAF version can be built in future follow-up PRs.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #10911 from liancheng/cms-df-api.
      ce38a35b
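      A minimal sketch of the DataFrame-level API added here, assuming a Spark 1.x-style `sqlContext` (the depth/width/seed values are arbitrary examples):

      ```scala
      val df = sqlContext.range(0, 1000)

      // Column name, depth, width, seed.
      val sketch = df.stat.countMinSketch("id", 10, 1000, 42)
      println(sketch.estimateCount(1L))   // approximate count for item 1
      ```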
    • Yanbo Liang's avatar
      [SPARK-12903][SPARKR] Add covar_samp and covar_pop for SparkR · e7f9199e
      Yanbo Liang authored
      Add ```covar_samp``` and ```covar_pop``` for SparkR.
      Should we also provide a ```cov``` alias for ```covar_samp```? There is already a ```cov``` implementation in stats.R which masks ```stats::cov```, but adding the alias may bring a breaking API change.
      
      cc sun-rui felixcheung shivaram
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #10829 from yanboliang/spark-12903.
      e7f9199e
    • Holden Karau's avatar
      [SPARK-7780][MLLIB] intercept in logisticregressionwith lbfgs should not be regularized · b72611f2
      Holden Karau authored
      The intercept in Logistic Regression represents a prior on categories which should not be regularized. In MLlib, the regularization is handled through Updater, and the Updater penalizes all the components without excluding the intercept, which results in poor training accuracy with regularization.
      The new implementation in the ML framework handles this properly, and we should call the implementation in ML from MLlib since the majority of users are still using the MLlib API.
      Note that both of them do feature scaling to improve the convergence, and the only difference is that the ML version doesn't regularize the intercept. As a result, when lambda is zero, they will converge to the same solution.
      
      Previously partially reviewed at https://github.com/apache/spark/pull/6386#issuecomment-168781424 re-opening for dbtsai to review.
      
      Author: Holden Karau <holden@us.ibm.com>
      Author: Holden Karau <holden@pigscanfly.ca>
      
      Closes #10788 from holdenk/SPARK-7780-intercept-in-logisticregressionwithLBFGS-should-not-be-regularized.
      b72611f2
    • Nong Li's avatar
      [SPARK-12854][SQL] Implement complex types support in ColumnarBatch · 55512738
      Nong Li authored
      This patch adds support for complex types in ColumnarBatch. ColumnarBatch supports structs
      and arrays. There is a simple mapping from the richer Catalyst types to these two. Strings
      are treated as an array of bytes.
      
      ColumnarBatch will contain a column for each node of the schema. Non-complex schemas consist
      of just leaf nodes. Structs represent an internal node with one child for each field. Arrays
      are internal nodes with one child. Structs just contain nullability. Arrays contain offsets
      and lengths into the child array. This structure is able to handle arbitrary nesting. It has
      the key property that we maintain a columnar layout throughout and that primitive types are
      stored only in the leaf nodes, contiguous across rows. For example, if the schema is
      ```
      array<array<int>>
      ```
      There are three columns in the schema. The internal nodes each have one child. The leaf node contains all the int data stored consecutively.
      
      As part of this, this patch adds append APIs in addition to the Put APIs (e.g. putLong(rowid, v)
      vs appendLong(v)). These APIs are necessary when the batch contains variable length elements.
      The vectors are not fixed length and will grow as necessary. This should make the usage a lot
      simpler for the writer.
      
      Author: Nong Li <nong@databricks.com>
      
      Closes #10820 from nongli/spark-12854.
      55512738
    • Jeff Zhang's avatar
      [SPARK-11622][MLLIB] Make LibSVMRelation extends HadoopFsRelation and… · 1dac964c
      Jeff Zhang authored
      … Add LibSVMOutputWriter
      
      The behavior of LibSVMRelation is not changed except for adding LibSVMOutputWriter:
      * Partitioning is still not supported
      * Multiple input paths are not supported
      
      Author: Jeff Zhang <zjffdu@apache.org>
      
      Closes #9595 from zjffdu/SPARK-11622.
      1dac964c
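      An illustrative sketch of using the `libsvm` data source once it is a HadoopFsRelation with an output writer; the paths are placeholders and the write path depends on the LibSVMOutputWriter added here:

      ```scala
      // Reading yields a DataFrame with "label" and "features" columns.
      val data = sqlContext.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

      // Writing the same schema back out in LibSVM text format.
      data.write.format("libsvm").save("/tmp/libsvm-out")
      ```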
    • Shixiong Zhu's avatar
      [SPARK-12614][CORE] Don't throw non fatal exception from ask · 22662b24
      Shixiong Zhu authored
      Right now RpcEndpointRef.ask may throw an exception in some corner cases, such as calling ask after stopping RpcEnv. It's better to avoid throwing exceptions from RpcEndpointRef.ask. We can send the exception to the future returned by `ask` instead.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #10568 from zsxwing/send-ask-fail.
      22662b24
    • Holden Karau's avatar
      [SPARK-10509][PYSPARK] Reduce excessive param boiler plate code · eb917291
      Holden Karau authored
      The current Python ML params require cut-and-pasting the param setup and description between the class and ```__init__``` methods. Remove this possible source of errors and simplify the use of custom params by adding a ```_copy_new_parent``` method to Param, so as to avoid cut-and-pasting (and cut-and-pasting at different indentation levels, urgh).
      
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #10216 from holdenk/SPARK-10509-excessive-param-boiler-plate-code.
      eb917291
    • Jeff Zhang's avatar
      [SPARK-12993][PYSPARK] Remove usage of ADD_FILES in pyspark · 19fdb21a
      Jeff Zhang authored
      The environment variable ADD_FILES was created for adding Python files to the SparkContext to be distributed to executors (SPARK-865); it is deprecated now. Users are encouraged to use --py-files for adding Python files.
      
      Author: Jeff Zhang <zjffdu@apache.org>
      
      Closes #10913 from zjffdu/SPARK-12993.
      19fdb21a
    • Cheng Lian's avatar
      [SQL] Minor Scaladoc format fix · 83507fea
      Cheng Lian authored
      Otherwise the `^` character is always marked as an error in IntelliJ, since it represents an unclosed superscript markup tag.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #10926 from liancheng/agg-doc-fix.
      83507fea
    • Josh Rosen's avatar
      [SPARK-8725][PROJECT-INFRA] Test modules in topologically-sorted order in dev/run-tests · ee74498d
      Josh Rosen authored
      This patch improves our `dev/run-tests` script to test modules in a topologically-sorted order based on modules' dependencies.  This will help to ensure that bugs in upstream projects are not misattributed to downstream projects because those projects' tests were the first ones to exhibit the failure.
      
      Topological sorting is also useful for shortening the feedback loop when testing pull requests: if I make a change in SQL then the SQL tests should run before MLlib, not after.
      
      In addition, this patch also updates our test module definitions to split `sql` into `catalyst`, `sql`, and `hive` in order to allow more tests to be skipped when changing only `hive/` files.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #10885 from JoshRosen/SPARK-8725.
      ee74498d
    • Xusen Yin's avatar
      [SPARK-12952] EMLDAOptimizer initialize() should return EMLDAOptimizer other than its parent class · fbf7623d
      Xusen Yin authored
      https://issues.apache.org/jira/browse/SPARK-12952
      
      Author: Xusen Yin <yinxusen@gmail.com>
      
      Closes #10863 from yinxusen/SPARK-12952.
      fbf7623d
    • Xusen Yin's avatar
      [SPARK-11923][ML] Python API for ml.feature.ChiSqSelector · 8beab681
      Xusen Yin authored
      https://issues.apache.org/jira/browse/SPARK-11923
      
      Author: Xusen Yin <yinxusen@gmail.com>
      
      Closes #10186 from yinxusen/SPARK-11923.
      8beab681
    • Shixiong Zhu's avatar
      [SPARK-7799][STREAMING][DOCUMENT] Add the linking and deploying instructions... · cbd507d6
      Shixiong Zhu authored
      [SPARK-7799][STREAMING][DOCUMENT] Add the linking and deploying instructions for streaming-akka project
      
      Since `actorStream` is an external project, we should add the linking and deploying instructions for it.
      
      A follow up PR of #10744
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #10856 from zsxwing/akka-link-instruction.
      cbd507d6
    • Sameer Agarwal's avatar
      [SPARK-12682][SQL] Add support for (optionally) not storing tables in hive metadata format · 08c781ca
      Sameer Agarwal authored
      This PR adds a new table option (`skip_hive_metadata`) that'd allow the user to skip storing the table metadata in hive metadata format. While this could be useful in general, the specific use-case for this change is that Hive doesn't handle wide schemas well (see https://issues.apache.org/jira/browse/SPARK-12682 and https://issues.apache.org/jira/browse/SPARK-6024) which in turn prevents such tables from being queried in SparkSQL.
      
      Author: Sameer Agarwal <sameer@databricks.com>
      
      Closes #10826 from sameeragarwal/skip-hive-metadata.
      08c781ca
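      A hedged sketch of how the new option might be supplied, assuming it is passed through the data source table OPTIONS clause as the PR describes (table name and path are placeholders):

      ```scala
      sqlContext.sql("""
        CREATE TABLE wide_table
        USING parquet
        OPTIONS (path '/tmp/wide_table', skip_hive_metadata 'true')
      """)
      // With skip_hive_metadata set, the (potentially very wide) schema is not stored in
      // Hive-compatible metadata, but the table remains queryable from Spark SQL.
      ```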
    • zhuol's avatar
      [SPARK-10911] Executors should System.exit on clean shutdown. · ae0309a8
      zhuol authored
      Call System.exit explicitly to make sure non-daemon user threads terminate. Without this, user applications might live forever if the cluster manager does not appropriately kill them. E.g., YARN had this bug: HADOOP-12441.
      
      Author: zhuol <zhuol@yahoo-inc.com>
      
      Closes #9946 from zhuoliu/10911.
      ae0309a8
    • Sean Owen's avatar
      [SPARK-3369][CORE][STREAMING] Java mapPartitions Iterator->Iterable is... · 649e9d0f
      Sean Owen authored
      [SPARK-3369][CORE][STREAMING] Java mapPartitions Iterator->Iterable is inconsistent with Scala's Iterator->Iterator
      
      Fix Java function API methods for flatMap and mapPartitions to require producing only an Iterator, not Iterable. Also fix DStream.flatMap to require a function producing TraversableOnce only, not Traversable.
      
      CC rxin pwendell for API change; tdas since it also touches streaming.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #10413 from srowen/SPARK-3369.
      649e9d0f
    • Liang-Chi Hsieh's avatar
      [SPARK-12961][CORE] Prevent snappy-java memory leak · 5936bf9f
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-12961
      
      To prevent a memory leak in snappy-java, just call the method once and cache the result. After the library releases a new version, we can remove this object.
      
      JoshRosen
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #10875 from viirya/prevent-snappy-memory-leak.
      5936bf9f
    • Wenchen Fan's avatar
      [SPARK-12937][SQL] bloom filter serialization · 6743de3a
      Wenchen Fan authored
      This PR adds serialization support for BloomFilter.
      
      A version number is added to version the serialized binary format.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #10920 from cloud-fan/bloom-filter.
      6743de3a
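      A small round-trip sketch of the serialization support described above (method names as in the spark-sketch `BloomFilter` API; the stream begins with a format version number):

      ```scala
      import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
      import org.apache.spark.util.sketch.BloomFilter

      val bf = BloomFilter.create(10000L, 0.03)   // expected items, false-positive probability
      bf.putLong(42L)

      val out = new ByteArrayOutputStream()
      bf.writeTo(out)                              // versioned binary format

      val restored = BloomFilter.readFrom(new ByteArrayInputStream(out.toByteArray))
      println(restored.mightContain(42L))          // true
      ```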
    • Reynold Xin's avatar
      [SQL][MINOR] A few minor tweaks to CSV reader. · d54cfed5
      Reynold Xin authored
      This pull request simply fixes a few minor coding style issues in csv, as I was reviewing the change post-hoc.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #10919 from rxin/csv-minor.
      d54cfed5
    • Xiangrui Meng's avatar
      [SPARK-10086][MLLIB][STREAMING][PYSPARK] ignore StreamingKMeans test in PySpark for now · 27c910f7
      Xiangrui Meng authored
      I saw several failures from recent PR builds, e.g., https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50015/consoleFull. This PR marks the test as ignored and we will fix the flakiness in SPARK-10086.
      
      gliptak Do you know why the test failure didn't show up in the Jenkins "Test Result"?
      
      cc: jkbradley
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #10909 from mengxr/SPARK-10086.
      27c910f7
    • Xusen Yin's avatar
      [SPARK-12834] Change ser/de of JavaArray and JavaList · ae47ba71
      Xusen Yin authored
      https://issues.apache.org/jira/browse/SPARK-12834
      
      We use `SerDe.dumps()` to serialize `JavaArray` and `JavaList` in `PythonMLLibAPI`, then deserialize them with `PickleSerializer` on the Python side. However, there is no need to transform them in such an inefficient way. Instead, we can use type conversion to convert them, e.g. `list(JavaArray)` or `list(JavaList)`. What's more, there is an issue with ser/de of Scala Array, as I noted in https://issues.apache.org/jira/browse/SPARK-12780
      
      Author: Xusen Yin <yinxusen@gmail.com>
      
      Closes #10772 from yinxusen/SPARK-12834.
      ae47ba71
    • Holden Karau's avatar
      [SPARK-11922][PYSPARK][ML] Python api for ml.feature.quantile discretizer · b66afdeb
      Holden Karau authored
      Add Python API for ml.feature.QuantileDiscretizer.
      
      One open question: do we want to re-use the Java model, create a new model, or use a different wrapper around the Java model?
      cc brkyvz & mengxr
      
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #10085 from holdenk/SPARK-11937-SPARK-11922-Python-API-for-ml.feature.QuantileDiscretizer.
      b66afdeb