  1. Jun 23, 2015
      [DOC] [SQL] Addes Hive metastore Parquet table conversion section · d96d7b55
      Cheng Lian authored
      This PR adds a section about Hive metastore Parquet table conversion. It documents:
      
      1. Schema reconciliation rules introduced in #5214 (see [this comment] [1] in #5188)
      2. Metadata refreshing requirement introduced in #5339
      
      [1]: https://github.com/apache/spark/pull/5188#issuecomment-86531248
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #5348 from liancheng/sql-doc-parquet-conversion and squashes the following commits:
      
      42ae0d0 [Cheng Lian] Adds Python `refreshTable` snippet
      4c9847d [Cheng Lian] Resorts to SQL for Python metadata refreshing snippet
      756e660 [Cheng Lian] Adds Python snippet for metadata refreshing
      50675db [Cheng Lian] Addes Hive metastore Parquet table conversion section
      [SPARK-8525] [MLLIB] fix LabeledPoint parser when there is a whitespace... · a8031183
      Oleksiy Dyagilev authored
      [SPARK-8525] [MLLIB] fix LabeledPoint parser when there is a whitespace between label and features vector
      
      Fix the LabeledPoint parser when there is whitespace between the label and the features vector, e.g.
      (y, [x1, x2, x3])
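      The fix amounts to tolerating optional whitespace between the label and the opening bracket of the features vector. A minimal pure-Python sketch of such a tolerant parser (an illustration only, not MLlib's actual NumericParser):

      ```python
      import re

      def parse_labeled_point(s):
          """Parse "(label, [x1, x2, ...])", tolerating whitespace after the comma.

          A simplified stand-in for the MLlib LabeledPoint parser, for illustration.
          """
          m = re.match(r"\(\s*([^,]+?)\s*,\s*\[(.*)\]\s*\)\s*$", s.strip())
          if m is None:
              raise ValueError("malformed LabeledPoint: %r" % s)
          label = float(m.group(1))
          body = m.group(2).strip()
          features = [float(x) for x in body.split(",")] if body else []
          return label, features

      # Both spellings parse identically, with or without the whitespace:
      assert parse_labeled_point("(1.0,[1.0,0.0,3.0])") == parse_labeled_point("(1.0, [1.0, 0.0, 3.0])")
      ```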
      
      Author: Oleksiy Dyagilev <oleksiy_dyagilev@epam.com>
      
      Closes #6954 from fe2s/SPARK-8525 and squashes the following commits:
      
      0755b9d [Oleksiy Dyagilev] [SPARK-8525][MLLIB] addressing comment, removing dep on commons-lang
      c1abc2b [Oleksiy Dyagilev] [SPARK-8525][MLLIB] fix LabeledPoint parser when there is a whitespace on specific position
      [SPARK-8111] [SPARKR] SparkR shell should display Spark logo and version banner on startup. · f2fb0285
      Alok Singh authored
      The Spark version is taken from the environment variable SPARK_VERSION.
      
      Author: Alok  Singh <singhal@Aloks-MacBook-Pro.local>
      Author: Alok  Singh <singhal@aloks-mbp.usca.ibm.com>
      
      Closes #6944 from aloknsingh/aloknsingh_spark_jiras and squashes the following commits:
      
      ed607bd [Alok  Singh] [SPARK-8111][SparkR] As per suggestion, 1) using the version from sparkContext rather than the Sys.env. 2) change "Welcome to SparkR!" to "Welcome to" followed by Spark logo and version
      acd5b85 [Alok  Singh] fix the jira SPARK-8111 to add the spark version and logo. Currently spark version is taken from the environment variable SPARK_VERSION
      [SPARK-8265] [MLLIB] [PYSPARK] Add LinearDataGenerator to pyspark.mllib.utils · f2022fa0
      MechCoder authored
      It is useful to generate linear data for easy testing of linear models and in general. Scala already has it. This is just a wrapper around the Scala code.
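      What such a generator does can be sketched in a few lines of Python; the function name, signature, and defaults below are illustrative, not the actual pyspark.mllib.util API:

      ```python
      import random

      def generate_linear_input(intercept, weights, n_points, seed=42, eps=0.1):
          """Generate (y, x) pairs with y = intercept + w.x + N(0, eps) noise.

          A rough sketch of what a linear data generator does; useful for
          quick sanity tests of linear models.
          """
          rng = random.Random(seed)
          data = []
          for _ in range(n_points):
              x = [rng.uniform(-1.0, 1.0) for _ in weights]
              y = intercept + sum(w * xi for w, xi in zip(weights, x)) + rng.gauss(0.0, eps)
              data.append((y, x))
          return data
      ```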
      
      Author: MechCoder <manojkumarsivaraj334@gmail.com>
      
      Closes #6715 from MechCoder/generate_linear_input and squashes the following commits:
      
      6182884 [MechCoder] Minor changes
      8bda047 [MechCoder] Minor style fixes
      0f1053c [MechCoder] [SPARK-8265] Add LinearDataGenerator to pyspark.mllib.utils
      [SPARK-7888] Be able to disable intercept in linear regression in ml package · 2b1111dd
      Holden Karau authored
      Author: Holden Karau <holden@pigscanfly.ca>
      
      Closes #6927 from holdenk/SPARK-7888-Be-able-to-disable-intercept-in-Linear-Regression-in-ML-package and squashes the following commits:
      
      0ad384c [Holden Karau] Add MiMa excludes
      4016fac [Holden Karau] Switch to wild card import, remove extra blank lines
      ae5baa8 [Holden Karau] CR feedback, move the fitIntercept down rather than changing ymean and etc above
      f34971c [Holden Karau] Fix some more long lines
      319bd3f [Holden Karau] Fix long lines
      3bb9ee1 [Holden Karau] Update the regression suite tests
      7015b9f [Holden Karau] Our code performs the same with R, except we need more than one data point but that seems reasonable
      0b0c8c0 [Holden Karau] fix the issue with the sample R code
      e2140ba [Holden Karau] Add a test, it fails!
      5e84a0b [Holden Karau] Write out thoughts and use the correct trait
      91ffc0a [Holden Karau] more murh
      006246c [Holden Karau] murp?
      [SPARK-8432] [SQL] fix hashCode() and equals() of BinaryType in Row · 6f4cadf5
      Davies Liu authored
      Also added more tests in LiteralExpressionSuite
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #6876 from davies/fix_hashcode and squashes the following commits:
      
      429c2c0 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_hashcode
      32d9811 [Davies Liu] fix test
      a0626ed [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_hashcode
      89c2432 [Davies Liu] fix style
      bd20780 [Davies Liu] check with catalyst types
      41caec6 [Davies Liu] change for to while
      d96929b [Davies Liu] address comment
      6ad2a90 [Davies Liu] fix style
      5819d33 [Davies Liu] unify equals() and hashCode()
      0fff25d [Davies Liu] fix style
      53c38b1 [Davies Liu] fix hashCode() and equals() of BinaryType in Row
      [SPARK-7235] [SQL] Refactor the grouping sets · 7b1450b6
      Cheng Hao authored
      The logical plan `Expand` takes the `output` as a constructor argument, which breaks the reference chain. We need to refactor the code, as well as the column pruning.
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #5780 from chenghao-intel/expand and squashes the following commits:
      
      76e4aa4 [Cheng Hao] revert the change for case insenstive
      7c10a83 [Cheng Hao] refactor the grouping sets
      [SQL] [DOCS] updated the documentation for explode · 4f7fbefb
      lockwobr authored
      The syntax was incorrect in the example for `explode`.
      
      Author: lockwobr <lockwobr@gmail.com>
      
      Closes #6943 from lockwobr/master and squashes the following commits:
      
      3d864d1 [lockwobr] updated the documentation for explode
      [SPARK-8498] [TUNGSTEN] fix npe in errorhandling path in unsafeshuffle writer · 0f92be5b
      Holden Karau authored
      Author: Holden Karau <holden@pigscanfly.ca>
      
      Closes #6918 from holdenk/SPARK-8498-fix-npe-in-errorhandling-path-in-unsafeshuffle-writer and squashes the following commits:
      
      f807832 [Holden Karau] Log error if we can't throw it
      855f9aa [Holden Karau] Spelling - not my strongest suite. Fix Propegates to Propagates.
      039d620 [Holden Karau] Add missing closeandwriteoutput
      30e558d [Holden Karau] go back to try/finally
      e503b8c [Holden Karau] Improve the test to ensure we aren't masking the underlying exception
      ae0b7a7 [Holden Karau] Fix the test
      2e6abf7 [Holden Karau] Be more cautious when cleaning up during failed write and re-throw user exceptions
      [SPARK-8300] DataFrame hint for broadcast join. · 6ceb1696
      Reynold Xin authored
      Users can now do
      ```scala
      left.join(broadcast(right), "joinKey")
      ```
      to give the query planner a hint that "right" DataFrame is small and should be broadcasted.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #6751 from rxin/broadcastjoin-hint and squashes the following commits:
      
      953eec2 [Reynold Xin] Code review feedback.
      88752d8 [Reynold Xin] Fixed import.
      8187b88 [Reynold Xin] [SPARK-8300] DataFrame hint for broadcast join.
      [SPARK-8541] [PYSPARK] test the absolute error in approx doctests · f0dcbe8a
      Scott Taylor authored
      A minor change but one which is (presumably) visible on the public api docs webpage.
      
      Author: Scott Taylor <github@megatron.me.uk>
      
      Closes #6942 from megatron-me-uk/patch-3 and squashes the following commits:
      
      fbed000 [Scott Taylor] test the absolute error in approx doctests
      [SPARK-8483] [STREAMING] Remove commons-lang3 dependency from Flume Si… · 9b618fb0
      Hari Shreedharan authored
      …nk. Also bump Flume version to 1.6.0
      
      Author: Hari Shreedharan <hshreedharan@apache.org>
      
      Closes #6910 from harishreedharan/remove-commons-lang3 and squashes the following commits:
      
      9875f7d [Hari Shreedharan] Revert back to Flume 1.4.0
      ca35eb0 [Hari Shreedharan] [SPARK-8483][Streaming] Remove commons-lang3 dependency from Flume Sink. Also bump Flume version to 1.6.0
      [SPARK-8359] [SQL] Fix incorrect decimal precision after multiplication · 31bd3068
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-8359
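      The squashed commits below switch the multiplication to `MathContext.UNLIMITED` so the product is not silently rounded. Python's decimal module exhibits the same bounded-vs-unbounded behavior, shown here as an analogy rather than Spark's actual code:

      ```python
      from decimal import Decimal, getcontext

      # With a small bounded context, the exact 13-digit product is rounded away:
      getcontext().prec = 6
      lossy = Decimal("1.234567") * Decimal("1.000001")

      # With enough precision (the analogue of Java's MathContext.UNLIMITED),
      # the exact product 1.234568234567 survives intact:
      getcontext().prec = 50
      exact = Decimal("1.234567") * Decimal("1.000001")

      assert exact == Decimal("1.234568234567")
      assert lossy != exact
      ```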
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #6814 from viirya/fix_decimal2 and squashes the following commits:
      
      071a757 [Liang-Chi Hsieh] Remove maximum precision and use MathContext.UNLIMITED.
      df217d4 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into fix_decimal2
      a43bfc3 [Liang-Chi Hsieh] Add MathContext with maximum supported precision.
      72eeb3f [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into fix_decimal2
      44c9348 [Liang-Chi Hsieh] Fix incorrect decimal precision after multiplication.
      [SPARK-8431] [SPARKR] Add in operator to DataFrame Column in SparkR · d4f63351
      Yu ISHIKAWA authored
      [[SPARK-8431] Add in operator to DataFrame Column in SparkR - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-8431)
      
      Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
      
      Closes #6941 from yu-iskw/SPARK-8431 and squashes the following commits:
      
      1f64423 [Yu ISHIKAWA] Modify the comment
      f4309a7 [Yu ISHIKAWA] Make a `setMethod` for `%in%` be independent
      6e37936 [Yu ISHIKAWA] Modify a variable name
      c196173 [Yu ISHIKAWA] [SPARK-8431][SparkR] Add in operator to DataFrame Column in SparkR
      [SPARK-7781] [MLLIB] gradient boosted trees.train regressor missing max bins · 164fe2aa
      Holden Karau authored
      Author: Holden Karau <holden@pigscanfly.ca>
      
      Closes #6331 from holdenk/SPARK-7781-GradientBoostedTrees.trainRegressor-missing-max-bins and squashes the following commits:
      
      2894695 [Holden Karau] remove extra blank line
      2573e8d [Holden Karau] Update the scala side of the pythonmllibapi and make the test a bit nicer too
      3a09170 [Holden Karau] add maxBins to to the train method as well
      af7f274 [Holden Karau] Add maxBins to GradientBoostedTrees.trainRegressor and correctly mention the default of 32 in other places where it mentioned 100
  2. Jun 22, 2015
      [SPARK-8548] [SPARKR] Remove the trailing whitespaces from the SparkR files · 44fa7df6
      Yu ISHIKAWA authored
      [[SPARK-8548] Remove the trailing whitespaces from the SparkR files - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-8548)
      
      - This is the result of `lint-r`
          https://gist.github.com/yu-iskw/0019b37a2c1167f33986
      
      Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
      
      Closes #6945 from yu-iskw/SPARK-8548 and squashes the following commits:
      
      0bd567a [Yu ISHIKAWA] [SPARK-8548][SparkR] Remove the trailing whitespaces from the SparkR files
      MAINTENANCE: Automated closing of pull requests. · c4d23439
      Patrick Wendell authored
      This commit exists to close the following pull requests on Github:
      
      Closes #2849 (close requested by 'srowen')
      Closes #2786 (close requested by 'andrewor14')
      Closes #4678 (close requested by 'JoshRosen')
      Closes #5457 (close requested by 'andrewor14')
      Closes #3346 (close requested by 'andrewor14')
      Closes #6518 (close requested by 'andrewor14')
      Closes #5403 (close requested by 'pwendell')
      Closes #2110 (close requested by 'srowen')
      [SPARK-7859] [SQL] Collect_set() behavior differences which fails the unit test under jdk8 · 13321e65
      Cheng Hao authored
      To reproduce that:
      ```
      JAVA_HOME=/home/hcheng/Java/jdk1.8.0_45 | build/sbt -Phadoop-2.3 -Phive  'test-only org.apache.spark.sql.hive.execution.HiveWindowFunctionQueryWithoutCodeGenSuite'
      ```
      
      A simple workaround is to update the original query to check the output size instead of the exact elements of the array produced by collect_set().
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #6402 from chenghao-intel/windowing and squashes the following commits:
      
      99312ad [Cheng Hao] add order by for the select clause
      edf8ce3 [Cheng Hao] update the code as suggested
      7062da7 [Cheng Hao] fix the collect_set() behaviour differences under different versions of JDK
      [SPARK-8307] [SQL] improve timestamp from parquet · 6b7f2cea
      Davies Liu authored
      This PR changes the code to convert the Julian day to a Unix timestamp directly (without going through Calendar and Timestamp).
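      The shortcut is plain arithmetic once you know the Julian day number of the Unix epoch (2440588 for 1970-01-01). A sketch, with an illustrative function name:

      ```python
      JULIAN_DAY_OF_EPOCH = 2440588  # Julian day number of 1970-01-01
      SECONDS_PER_DAY = 86400

      def julian_day_to_unix_seconds(julian_day, nanos_of_day=0):
          """Convert the (Julian day, nanoseconds-within-day) pair stored in a
          Parquet INT96 timestamp straight to Unix seconds, with no intermediate
          Calendar or Timestamp objects."""
          days = julian_day - JULIAN_DAY_OF_EPOCH
          return days * SECONDS_PER_DAY + nanos_of_day // 1_000_000_000

      assert julian_day_to_unix_seconds(2440588) == 0  # 1970-01-01T00:00:00Z
      ```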
      
      cc adrian-wang rxin
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #6759 from davies/improve_ts and squashes the following commits:
      
      849e301 [Davies Liu] Merge branch 'master' of github.com:apache/spark into improve_ts
      b0e4cad [Davies Liu] Merge branch 'master' of github.com:apache/spark into improve_ts
      8e2d56f [Davies Liu] address comments
      634b9f5 [Davies Liu] fix mima
      4891efb [Davies Liu] address comment
      bfc437c [Davies Liu] fix build
      ae5979c [Davies Liu] Merge branch 'master' of github.com:apache/spark into improve_ts
      602b969 [Davies Liu] remove jodd
      2f2e48c [Davies Liu] fix test
      8ace611 [Davies Liu] fix mima
      212143b [Davies Liu] fix mina
      c834108 [Davies Liu] Merge branch 'master' of github.com:apache/spark into improve_ts
      a3171b8 [Davies Liu] Merge branch 'master' of github.com:apache/spark into improve_ts
      5233974 [Davies Liu] fix scala style
      361fd62 [Davies Liu] address comments
      ea196d4 [Davies Liu] improve timestamp from parquet
      [SPARK-7153] [SQL] support all integral type ordinal in GetArrayItem · 860a49ef
      Wenchen Fan authored
      First convert `ordinal` to `Number`, then convert it to an int.
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #5706 from cloud-fan/7153 and squashes the following commits:
      
      915db79 [Wenchen Fan] fix 7153
      [HOTFIX] [TESTS] Typo mqqt -> mqtt · 1dfb0f7b
      Andrew Or authored
      This was introduced in #6866.
      [SPARK-8492] [SQL] support binaryType in UnsafeRow · 96aa0137
      Davies Liu authored
      Support BinaryType in UnsafeRow, just like StringType.
      
      Also change the layout of StringType and BinaryType in UnsafeRow by combining offset and size into a single Long, which limits the size of a Row to under 2G (given that no single buffer can be bigger than 2G in the JVM).
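      The combined offset-and-size word can be sketched with plain bit operations; the exact bit layout below (offset in the high half, size in the low half) is an illustrative assumption, not necessarily UnsafeRow's actual layout:

      ```python
      def pack_offset_and_size(offset, size):
          """Pack a field's offset and size into one 64-bit word.
          Both halves must fit in 32 bits, which is where the under-2G limit
          on row/buffer size comes from."""
          assert 0 <= offset < (1 << 31) and 0 <= size < (1 << 31), "limited to under 2G"
          return (offset << 32) | size

      def unpack_offset_and_size(packed):
          return packed >> 32, packed & 0xFFFFFFFF

      assert unpack_offset_and_size(pack_offset_and_size(128, 42)) == (128, 42)
      ```

      Since each half must fit in 32 bits, any single value stays under 2G, matching the limitation described above.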
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #6911 from davies/unsafe_bin and squashes the following commits:
      
      d68706f [Davies Liu] update comment
      519f698 [Davies Liu] address comment
      98a964b [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_bin
      180b49d [Davies Liu] fix zero-out
      22e4c0a [Davies Liu] zero-out padding bytes
      6abfe93 [Davies Liu] fix style
      447dea0 [Davies Liu] support binaryType in UnsafeRow
      [SPARK-8356] [SQL] Reconcile callUDF and callUdf · 50d3242d
      BenFradet authored
      Deprecates ```callUdf``` in favor of ```callUDF```.
      
      Author: BenFradet <benjamin.fradet@gmail.com>
      
      Closes #6902 from BenFradet/SPARK-8356 and squashes the following commits:
      
      ef4e9d8 [BenFradet] deprecated callUDF, use udf instead
      9b1de4d [BenFradet] reinstated unit test for the deprecated callUdf
      cbd80a5 [BenFradet] deprecated callUdf in favor of callUDF
      [SPARK-8537] [SPARKR] Add a validation rule about the curly braces in SparkR to `.lintr` · b1f3a489
      Yu ISHIKAWA authored
      [[SPARK-8537] Add a validation rule about the curly braces in SparkR to `.lintr` - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-8537)
      
      Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
      
      Closes #6940 from yu-iskw/SPARK-8537 and squashes the following commits:
      
      7eec1a0 [Yu ISHIKAWA] [SPARK-8537][SparkR] Add a validation rule about the curly braces in SparkR to `.lintr`
      [SPARK-8455] [ML] Implement n-gram feature transformer · afe35f05
      Feynman Liang authored
      Implementation of n-gram feature transformer for ML.
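      The transformer's behavior, including the n-greater-than-input corner case covered in the commits below, can be sketched in a few lines of Python (illustrative, not the ML API):

      ```python
      def ngrams(tokens, n):
          """Slide a window of n tokens over the input, joining each window
          with a space. When n exceeds the input length, the output is empty."""
          return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

      assert ngrams(["a", "b", "c", "d"], 2) == ["a b", "b c", "c d"]
      assert ngrams(["a", "b"], 3) == []  # n > input length yields empty output
      ```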
      
      Author: Feynman Liang <fliang@databricks.com>
      
      Closes #6887 from feynmanliang/ngram-featurizer and squashes the following commits:
      
      d2c839f [Feynman Liang] Make n > input length yield empty output
      9fadd36 [Feynman Liang] Add empty and corner test cases, fix names and spaces
      fe93873 [Feynman Liang] Implement n-gram feature transformer
      [SPARK-8532] [SQL] In Python's DataFrameWriter,... · 5ab9fcfb
      Yin Huai authored
      [SPARK-8532] [SQL] In Python's DataFrameWriter, save/saveAsTable/json/parquet/jdbc always override mode
      
      https://issues.apache.org/jira/browse/SPARK-8532
      
      This PR has two changes. First, it fixes the bug that save actions (i.e. `save/saveAsTable/json/parquet/jdbc`) always override mode. Second, it adds input argument `partitionBy` to `save/saveAsTable/parquet`.
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #6937 from yhuai/SPARK-8532 and squashes the following commits:
      
      f972d5d [Yin Huai] davies's comment.
      d37abd2 [Yin Huai] style.
      d21290a [Yin Huai] Python doc.
      889eb25 [Yin Huai] Minor refactoring and add partitionBy to save, saveAsTable, and parquet.
      7fbc24b [Yin Huai] Use None instead of "error" as the default value of mode since JVM-side already uses "error" as the default value.
      d696dff [Yin Huai] Python style.
      88eb6c4 [Yin Huai] If mode is "error", do not call mode method.
      c40c461 [Yin Huai] Regression test.
      [SPARK-8104] [SQL] auto alias expressions in analyzer · da7bbb94
      Wenchen Fan authored
      Currently we auto-alias expressions in the parser. However, during the parser phase we don't have enough information to choose the right alias. For example, a Generator that produces more than one kind of element needs MultiAlias, and ExtractValue doesn't need an Alias if it sits in the middle of an ExtractValue chain.
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #6647 from cloud-fan/alias and squashes the following commits:
      
      552eba4 [Wenchen Fan] fix python
      5b5786d [Wenchen Fan] fix agg
      73a90cb [Wenchen Fan] fix case-preserve of ExtractValue
      4cfd23c [Wenchen Fan] fix order by
      d18f401 [Wenchen Fan] refine
      9f07359 [Wenchen Fan] address comments
      39c1aef [Wenchen Fan] small fix
      33640ec [Wenchen Fan] auto alias expressions in analyzer
      [SPARK-8511] [PYSPARK] Modify a test to remove a saved model in `regression.py` · 5d89d9f0
      Yu ISHIKAWA authored
      [[SPARK-8511] Modify a test to remove a saved model in `regression.py` - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-8511)
      
      Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
      
      Closes #6926 from yu-iskw/SPARK-8511 and squashes the following commits:
      
      7cd0948 [Yu ISHIKAWA] Use `shutil.rmtree()` to temporary directories for saving model testings, instead of `os.removedirs()`
      4a01c9e [Yu ISHIKAWA] [SPARK-8511][pyspark] Modify a test to remove a saved model in `regression.py`
      [SPARK-8482] Added M4 instances to the list. · ba8a4537
      Pradeep Chhetri authored
      AWS recently added M4 instances (https://aws.amazon.com/blogs/aws/the-new-m4-instance-type-bonus-price-reduction-on-m3-c4/).
      
      Author: Pradeep Chhetri <pradeep.chhetri89@gmail.com>
      
      Closes #6899 from pradeepchhetri/master and squashes the following commits:
      
      4f4ea79 [Pradeep Chhetri] Added t2.large instance
      3d2bb6c [Pradeep Chhetri] Added M4 instances to the list
      [SPARK-8429] [EC2] Add ability to set additional tags · 42a1f716
      Stefano Parmesan authored
      Add the `--additional-tags` parameter, which allows setting additional tags on all the created instances (masters and slaves).
      
      The user can specify multiple tags by separating them with a comma (`,`), while each tag name and value should be separated by a colon (`:`); for example, `Task:MySparkProject,Env:production` would add two tags, `Task` and `Env`, with the given values.
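      A sketch of the parsing this format implies (illustrative Python, not spark_ec2.py's actual implementation):

      ```python
      def parse_additional_tags(spec):
          """Split a value like "Task:MySparkProject,Env:production" into a
          dict of tag name -> tag value. Tags are comma-separated; each
          name/value pair is colon-separated."""
          if not spec:
              return {}
          return dict(pair.split(":", 1) for pair in spec.split(","))

      assert parse_additional_tags("Task:MySparkProject,Env:production") == {
          "Task": "MySparkProject",
          "Env": "production",
      }
      ```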
      
      Author: Stefano Parmesan <s.parmesan@gmail.com>
      
      Closes #6857 from armisael/patch-1 and squashes the following commits:
      
      c5ac92c [Stefano Parmesan] python style (pep8)
      8e614f1 [Stefano Parmesan] Set multiple tags in a single request
      bfc56af [Stefano Parmesan] Address SPARK-7900 by inceasing sleep time
      daf8615 [Stefano Parmesan] Add ability to set additional tags
      [SPARK-8406] [SQL] Adding UUID to output file name to avoid accidental overwriting · 0818fdec
      Cheng Lian authored
      This PR fixes a Parquet output file name collision bug which may cause data loss.  Changes made:
      
      1.  Identify each write job issued by `InsertIntoHadoopFsRelation` with a UUID
      
          All concrete data sources which extend `HadoopFsRelation` (Parquet and ORC for now) must use this UUID to generate task output file path to avoid name collision.
      
      2.  Make `TestHive` use a local mode `SparkContext` with 32 threads to increase parallelism
      
          The major reason for this is that the original parallelism of 2 is too low to reproduce the data loss issue.  Also, higher concurrency may catch more concurrency bugs during the testing phase. (It did help us spot SPARK-8501.)
      
      3. `OrcSourceSuite` was updated to work around SPARK-8501, which we detected along the way.
      
      NOTE: This PR turned out a little more complicated than expected because we hit two other bugs along the way and had to work around them. See [SPARK-8501] [1] and [SPARK-8513] [2].
      
      [1]: https://github.com/liancheng/spark/tree/spark-8501
      [2]: https://github.com/liancheng/spark/tree/spark-8513
      
      ----
      
      Some background and a summary of offline discussion with yhuai about this issue for better understanding:
      
      In 1.4.0, we added `HadoopFsRelation` to abstract partition support of all data sources that are based on Hadoop `FileSystem` interface.  Specifically, this makes partition discovery, partition pruning, and writing dynamic partitions for data sources much easier.
      
      To support appending, the Parquet data source tries to find out the max part number of the part-files in the destination directory (i.e., `<id>` in output file name `part-r-<id>.gz.parquet`) at the beginning of the write job.  In 1.3.0, this step happened on the driver side before any files were written.  However, in 1.4.0, it was moved to the task side.  Unfortunately, tasks scheduled later may see a wrong max part number because of files newly written by other tasks that finished within the same job.  This is a race condition.  In most cases, it only causes nonconsecutive part numbers in output file names.  But when the DataFrame contains thousands of RDD partitions, two tasks are likely to choose the same part number, and then one of them gets overwritten by the other.
      
      Before `HadoopFsRelation`, Spark SQL already supports appending data to Hive tables.  From a user's perspective, these two look similar.  However, they differ a lot internally.  When data are inserted into Hive tables via Spark SQL, `InsertIntoHiveTable` simulates Hive's behaviors:
      
      1.  Write data to a temporary location
      
      2.  Move data in the temporary location to the final destination location using
      
          -   `Hive.loadTable()` for non-partitioned table
          -   `Hive.loadPartition()` for static partitions
          -   `Hive.loadDynamicPartitions()` for dynamic partitions
      
      The important part is that, `Hive.copyFiles()` is invoked in step 2 to move the data to the destination directory (I found the name is kinda confusing since no "copying" occurs here, we are just moving and renaming stuff).  If a file in the source directory and another file in the destination directory happen to have the same name, say `part-r-00001.parquet`, the former is moved to the destination directory and renamed with a `_copy_N` postfix (`part-r-00001_copy_1.parquet`).  That's how Hive handles appending and avoids name collision between different write jobs.
      
      Some alternatives fixes considered for this issue:
      
      1.  Use a similar approach as Hive
      
          This approach is not preferred in Spark 1.4.0, mainly because file metadata operations in S3 tend to be slow, especially for tables with lots of files and/or partitions.  That's why `InsertIntoHadoopFsRelation` just inserts into the destination directory directly, and is often used together with `DirectParquetOutputCommitter` to reduce latency when working with S3.  This means we don't have the chance to do renaming, and must avoid name collisions from the very beginning.
      
      2.  Same as 1.3, just move max part number detection back to driver side
      
          This isn't doable because unlike 1.3, 1.4 also takes dynamic partitioning into account.  When inserting into dynamic partitions, we don't know which partition directories will be touched on driver side before issuing the write job.  Checking all partition directories is simply too expensive for tables with thousands of partitions.
      
      3.  Add extra component to output file names to avoid name collision
      
          This seems to be the only reasonable solution for now.  To be more specific, we need a JOB-level unique identifier to identify all write jobs issued by `InsertIntoHadoopFile`.  Notice that TASK-level unique identifiers can NOT be used, because then a speculative task would write to a different output file than the original task, and if both tasks succeed, duplicate output would be left behind.  Currently, the ORC data source adds `System.currentTimeMillis` to the output file name for uniqueness; this doesn't work for exactly the same reason.
      
          That's why this PR adds a job-level random UUID in `BaseWriterContainer` (which is used by `InsertIntoHadoopFsRelation` to issue write jobs).  The drawback is that record order is no longer preserved (output files of a later job may be listed before those of an earlier job).  However, we never promised to preserve record order when writing data, and Hive doesn't promise this either, because the `_copy_N` trick breaks the order.
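      The naming scheme can be sketched as follows; the exact file-name format here is illustrative:

      ```python
      import uuid

      def output_file_name(task_partition_id, job_uuid, extension="gz.parquet"):
          """Embed a job-level UUID in the part-file name so that distinct
          write jobs can never collide, while speculative attempts of the
          same task still target the same file."""
          return "part-r-%05d-%s.%s" % (task_partition_id, job_uuid, extension)

      job_a, job_b = uuid.uuid4(), uuid.uuid4()
      # Same task id in different jobs: distinct names, so appends cannot overwrite.
      assert output_file_name(1, job_a) != output_file_name(1, job_b)
      # A speculative attempt of the same task in the same job reuses the name.
      assert output_file_name(1, job_a) == output_file_name(1, job_a)
      ```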
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #6864 from liancheng/spark-8406 and squashes the following commits:
      
      db7a46a [Cheng Lian] More comments
      f5c1133 [Cheng Lian] Addresses comments
      85c478e [Cheng Lian] Workarounds SPARK-8513
      088c76c [Cheng Lian] Adds comment about SPARK-8501
      99a5e7e [Cheng Lian] Uses job level UUID in SimpleTextRelation and avoids double task abortion
      4088226 [Cheng Lian] Works around SPARK-8501
      1d7d206 [Cheng Lian] Adds more logs
      8966bbb [Cheng Lian] Fixes Scala style issue
      18b7003 [Cheng Lian] Uses job level UUID to take speculative tasks into account
      3806190 [Cheng Lian] Lets TestHive use all cores by default
      748dbd7 [Cheng Lian] Adding UUID to output file name to avoid accidental overwriting
  3. Jun 21, 2015
      [SPARK-7426] [MLLIB] [ML] Updated Attribute.fromStructField to allow any NumericType. · 47c1d562
      Mike Dusenberry authored
      Updated `Attribute.fromStructField` to allow any `NumericType`, rather than just `DoubleType`, and added unit tests for a few of the other NumericTypes.
      
      Author: Mike Dusenberry <dusenberrymw@gmail.com>
      
      Closes #6540 from dusenberrymw/SPARK-7426_AttributeFactory.fromStructField_Should_Allow_NumericTypes and squashes the following commits:
      
      87fecb3 [Mike Dusenberry] Updated Attribute.fromStructField to allow any NumericType, rather than just DoubleType, and added unit tests for a few of the other NumericTypes.
      [SPARK-7715] [MLLIB] [ML] [DOC] Updated MLlib programming guide for release 1.4 · a1894422
      Joseph K. Bradley authored
      Reorganized docs a bit.  Added migration guides.
      
      **Q**: Do we want to say more for the 1.3 -> 1.4 migration guide for ```spark.ml```?  It would be a lot.
      
      CC: mengxr
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #6897 from jkbradley/ml-guide-1.4 and squashes the following commits:
      
      4bf26d6 [Joseph K. Bradley] tiny fix
      8085067 [Joseph K. Bradley] fixed spacing/layout issues in ml guide from previous commit in this PR
      6cd5c78 [Joseph K. Bradley] Updated MLlib programming guide for release 1.4
      [SPARK-8508] [SQL] Ignores a test case to cleanup unnecessary testing output until #6882 is merged · 83cdfd84
      Cheng Lian authored
      Currently [the test case for SPARK-7862] [1] writes 100,000 lines of integer triples to stderr, which makes the Jenkins build output unnecessarily large and makes it hard to debug other build errors. A proper fix is on the way in #6882. This PR ignores the test case temporarily until #6882 is merged.
      
      [1]: https://github.com/apache/spark/pull/6404/files#diff-1ea02a6fab84e938582f7f87cc4d9ea1R641
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #6925 from liancheng/spark-8508 and squashes the following commits:
      
      41e5b47 [Cheng Lian] Ignores the test case until #6882 is merged
      [SPARK-7604] [MLLIB] Python API for PCA and PCAModel · 32e3cdaa
      Yanbo Liang authored
      Python API for PCA and PCAModel
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #6315 from yanboliang/spark-7604 and squashes the following commits:
      
      1d58734 [Yanbo Liang] remove transform() in PCAModel, use default behavior
      4d9d121 [Yanbo Liang] Python API for PCA and PCAModel
      [SPARK-8379] [SQL] avoid speculative tasks write to the same file · a1e3649c
      jeanlyn authored
      The issue link [SPARK-8379](https://issues.apache.org/jira/browse/SPARK-8379)
      Currently, when we insert data into a dynamic partition with speculative tasks enabled, we can get the following exception:
      ```
      org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
      Lease mismatch on /tmp/hive-jeanlyn/hive_2015-06-15_15-20-44_734_8801220787219172413-1/-ext-10000/ds=2015-06-15/type=2/part-00301.lzo
      owned by DFSClient_attempt_201506031520_0011_m_000189_0_-1513487243_53
      but is accessed by DFSClient_attempt_201506031520_0011_m_000042_0_-1275047721_57
      ```
      This PR writes the data to a temporary directory when using dynamic partitioning, to avoid speculative tasks writing to the same file.
      
      Author: jeanlyn <jeanlyn92@gmail.com>
      
      Closes #6833 from jeanlyn/speculation and squashes the following commits:
      
      64bbfab [jeanlyn] use FileOutputFormat.getTaskOutputPath to get the path
      8860af0 [jeanlyn] remove the never using code
      e19a3bd [jeanlyn] avoid speculative tasks write same file
      a1e3649c
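The idea behind the fix is that each task attempt writes to its own attempt-specific path, and only the committed attempt's output ends up at the final location. The actual PR works at the Hadoop level via `FileOutputFormat.getTaskOutputPath`; the following is a simplified local-filesystem sketch of the same "write to a private temp file, commit atomically, first attempt wins" pattern (all names are illustrative):

```python
import os
import tempfile

def run_attempt(output_dir, partition, attempt_id, rows):
    """Each speculative attempt writes to its own temp file, so attempts
    never contend for a lease on the same path; the commit is atomic."""
    fd, tmp_path = tempfile.mkstemp(
        prefix="attempt_%s_" % attempt_id, dir=output_dir)
    with os.fdopen(fd, "w") as f:
        for row in rows:
            f.write(row + "\n")
    final_path = os.path.join(output_dir, "part-%05d" % partition)
    try:
        # The first attempt to commit wins; losers see the failure
        # and discard their temp output instead of clobbering the file.
        os.link(tmp_path, final_path)
        return True
    except FileExistsError:
        return False
    finally:
        os.remove(tmp_path)
```

On HDFS the commit step is a rename performed by the output committer, but the structure is the same: speculative duplicates never touch the final file directly.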
  4. Jun 20, 2015
    • Tarek Auel's avatar
      [SPARK-8301] [SQL] Improve UTF8String substring/startsWith/endsWith/contains performance · 41ab2853
      Tarek Auel authored
      Jira: https://issues.apache.org/jira/browse/SPARK-8301
      
      Added a private method startsWith(prefix, offset) to implement startsWith, endsWith and contains without copying the underlying byte array.
      
      I hope the SQL component tag is still correct; I copied it from the JIRA ticket.
      
      Author: Tarek Auel <tarek.auel@googlemail.com>
      Author: Tarek Auel <tarek.auel@gmail.com>
      
      Closes #6804 from tarekauel/SPARK-8301 and squashes the following commits:
      
      f5d6b9a [Tarek Auel] fixed parentheses and annotation
      6d7b068 [Tarek Auel] [SPARK-8301] removed null checks
      9ca0473 [Tarek Auel] [SPARK-8301] removed null checks
      1c327eb [Tarek Auel] [SPARK-8301] removed new
      9f17cc8 [Tarek Auel] [SPARK-8301] fixed conversion byte to string in codegen
      3a0040f [Tarek Auel] [SPARK-8301] changed call of UTF8String.set to UTF8String.from
      e4530d2 [Tarek Auel] [SPARK-8301] changed call of UTF8String.set to UTF8String.from
      a5f853a [Tarek Auel] [SPARK-8301] changed visibility of set to protected. Changed annotation of bytes from Nullable to Nonnull
      d2fb05f [Tarek Auel] [SPARK-8301] added additional null checks
      79cb55b [Tarek Auel] [SPARK-8301] null check. Added test cases for null check.
      b17909e [Tarek Auel] [SPARK-8301] removed unnecessary copying of UTF8String. Added a private function startsWith(prefix, offset) to implement the check for startsWith, endsWith and contains.
      41ab2853
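The trick behind the speedup is that `startsWith`, `endsWith`, and `contains` can all be expressed as "does the prefix match at this offset?", comparing bytes in place instead of materializing a sub-array for each check. A pure-Python sketch of the idea (the real code is Java operating on `UTF8String`'s byte array; these names are illustrative):

```python
def starts_with_at(data, prefix, offset):
    """Compare prefix against data at the given offset without slicing
    (and therefore without copying) the underlying bytes."""
    if offset < 0 or offset + len(prefix) > len(data):
        return False
    for i, b in enumerate(prefix):
        if data[offset + i] != b:
            return False
    return True

def starts_with(data, prefix):
    return starts_with_at(data, prefix, 0)

def ends_with(data, suffix):
    # An endsWith check is just a startsWith check anchored at the tail.
    return starts_with_at(data, suffix, len(data) - len(suffix))

def contains(data, sub):
    return any(starts_with_at(data, sub, i)
               for i in range(len(data) - len(sub) + 1))
```

Because UTF-8 is byte-oriented, these comparisons are correct on the raw bytes without decoding, which is what makes the copy-free approach viable.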
    • Yu ISHIKAWA's avatar
      [SPARK-8495] [SPARKR] Add a `.lintr` file to validate the SparkR files and the `lint-r` script · 004f5737
      Yu ISHIKAWA authored
      Thanks to Shivaram Venkataraman for the support. This is a prototype script to validate the R files.
      
      Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
      
      Closes #6922 from yu-iskw/SPARK-6813 and squashes the following commits:
      
      c1ffe6b [Yu ISHIKAWA] Modify to save result to a log file and add a rule to validate
      5520806 [Yu ISHIKAWA] Exclude the .lintr file from the Apache licence check
      8f94680 [Yu ISHIKAWA] [SPARK-8495][SparkR] Add a `.lintr` file to validate the SparkR files and the `lint-r` script
      004f5737
    • Josh Rosen's avatar
      [SPARK-8422] [BUILD] [PROJECT INFRA] Add a module abstraction to dev/run-tests · 7a3c424e
      Josh Rosen authored
      This patch builds upon #5694 to add a 'module' abstraction to the `dev/run-tests` script which groups together the per-module test logic, including the mapping from file paths to modules, the mapping from modules to test goals and build profiles, and the dependencies / relationships between modules.
      
      This refactoring makes it much easier to increase the granularity of test modules, which will let us skip even more tests.  It's also a prerequisite for other changes that will reduce test time, such as running subsets of the Python tests based on which files / modules have changed.
      
      This patch also adds doctests for the new graph traversal / change mapping code.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #6866 from JoshRosen/more-dev-run-tests-refactoring and squashes the following commits:
      
      75de450 [Josh Rosen] Use module system to determine which build profiles to enable.
      4224da5 [Josh Rosen] Add documentation to Module.
      a86a953 [Josh Rosen] Clean up modules; add new modules for streaming external projects
      e46539f [Josh Rosen] Fix camel-cased endswith()
      35a3052 [Josh Rosen] Enable Hive tests when running all tests
      df10e23 [Josh Rosen] update to reflect fact that no module depends on root
      3670d50 [Josh Rosen] mllib should depend on streaming
      dc6f1c6 [Josh Rosen] Use changed files' extensions to decide whether to run style checks
      7092d3e [Josh Rosen] Skip SBT tests if no test goals are specified
      43a0ced [Josh Rosen] Minor fixes
      3371441 [Josh Rosen] Test everything if nothing has changed (needed for non-PRB builds)
      37f3fb3 [Josh Rosen] Remove doc profiles option, since it's not actually needed (see #6865)
      f53864b [Josh Rosen] Finish integrating module changes
      f0249bd [Josh Rosen] WIP
      7a3c424e
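The core of the module abstraction is a mapping from changed file paths to modules, plus a dependency graph used to pull in every downstream module that could be affected. A simplified sketch of that selection logic (the real `dev/run-tests` modules also carry sbt test goals and build profiles; these names are illustrative):

```python
class Module:
    """Groups a source tree with its upstream dependencies, loosely
    mirroring the abstraction added to dev/run-tests."""
    def __init__(self, name, source_prefixes, dependencies=()):
        self.name = name
        self.source_prefixes = source_prefixes
        self.dependencies = list(dependencies)

def modules_to_test(changed_files, modules):
    """Return the names of every module whose sources changed, plus every
    module that transitively depends on one of them."""
    changed = {m.name for m in modules
               if any(f.startswith(p) for f in changed_files
                      for p in m.source_prefixes)}
    # Fixed point: keep adding downstream modules until nothing changes.
    while True:
        extra = {m.name for m in modules
                 if m.name not in changed
                 and any(d in changed for d in m.dependencies)}
        if not extra:
            return changed
        changed |= extra
```

A change in `core/` therefore triggers the tests for everything built on core, while a change in a leaf module triggers only that module's tests, which is what enables skipping tests safely.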
    • Liang-Chi Hsieh's avatar
      [SPARK-8468] [ML] Take the negative of some metrics in RegressionEvaluator to... · 0b899516
      Liang-Chi Hsieh authored
      [SPARK-8468] [ML] Take the negative of some metrics in RegressionEvaluator to get correct cross validation
      
      JIRA: https://issues.apache.org/jira/browse/SPARK-8468
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #6905 from viirya/cv_min and squashes the following commits:
      
      930d3db [Liang-Chi Hsieh] Fix python unit test and add document.
      d632135 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into cv_min
      16e3b2c [Liang-Chi Hsieh] Take the negative instead of reciprocal.
      c3dd8d9 [Liang-Chi Hsieh] For comments.
      b5f52c1 [Liang-Chi Hsieh] Add param to CrossValidator for choosing whether to maximize evaluation value.
      0b899516
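The problem being fixed is that a generic cross-validator always maximizes its evaluation metric, while error metrics like RMSE should be minimized; negating the metric makes "larger is better" hold uniformly. A tiny pure-Python sketch of the idea (names are illustrative, not the actual spark.ml API):

```python
import math

def neg_rmse(predictions, labels):
    """Return the negated RMSE so that 'larger is better' holds, letting a
    generic maximizing model selector pick the lowest-error model."""
    mse = sum((p - y) ** 2 for p, y in zip(predictions, labels)) / len(labels)
    return -math.sqrt(mse)

def select_best(candidate_predictions, labels):
    """Generic model selection: always maximizes the evaluation metric."""
    return max(candidate_predictions, key=lambda preds: neg_rmse(preds, labels))
```

Negation (rather than the reciprocal, which an earlier commit in this PR tried) preserves ordering for all metric values, including zero, and keeps the metric on an interpretable scale.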