  1. Jul 30, 2015
    • [SPARK-9199] [CORE] Update Tachyon dependency from 0.6.4 -> 0.7.0 · 04c84091
      Calvin Jia authored
      No new dependencies are added. The exclusion changes are due to the change in tachyon-client 0.7.0's project structure.
      
      There is no client side API change in Tachyon 0.7.0 so no code changes are required.
      
      Author: Calvin Jia <jia.calvin@gmail.com>
      
      Closes #7577 from calvinjia/SPARK-9199 and squashes the following commits:
      
      4e81e40 [Calvin Jia] Update Tachyon dependency from 0.6.4 -> 0.7.0
      04c84091
    • [SPARK-8742] [SPARKR] Improve SparkR error messages for DataFrame API · 157840d1
      Hossein authored
      This patch improves SparkR error message reporting, especially with the DataFrame API. When there is a user error (e.g., a malformed SQL query), the message of the cause is sent back through the RPC; the R client reads it and returns it to the user.
      
      cc shivaram
      
      Author: Hossein <hossein@databricks.com>
      
      Closes #7742 from falaki/SPARK-8742 and squashes the following commits:
      
      4f643c9 [Hossein] Not logging exceptions in RBackendHandler
      4a8005c [Hossein] Returning stack trace of causing exception from RBackendHandler
      5cf17f0 [Hossein] Adding unit test for error messages from SQLContext
      2af75d5 [Hossein] Reading error message in case of failure and stopping with that message
      f479c99 [Hossein] Writing exception cause message in JVM
      157840d1
    • [SPARK-9463] [ML] Expose model coefficients with names in SparkR RFormula · e7905a93
      Eric Liang authored
      Preview:
      
      ```
      > summary(m)
                  features coefficients
      1        (Intercept)    1.6765001
      2       Sepal_Length    0.3498801
      3 Species.versicolor   -0.9833885
      4  Species.virginica   -1.0075104
      
      ```
      
      Design doc from umbrella task: https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit
      
      cc mengxr
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #7771 from ericl/summary and squashes the following commits:
      
      ccd54c3 [Eric Liang] second pass
      a5ca93b [Eric Liang] comments
      2772111 [Eric Liang] clean up
      70483ef [Eric Liang] fix test
      7c247d4 [Eric Liang] Merge branch 'master' into summary
      3c55024 [Eric Liang] working
      8c539aa [Eric Liang] first pass
      e7905a93
    • [SPARK-6684] [MLLIB] [ML] Add checkpointing to GBTs · be7be6d4
      Joseph K. Bradley authored
      Add checkpointing to GradientBoostedTrees, GBTClassifier, GBTRegressor
      
      CC: mengxr
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #7804 from jkbradley/gbt-checkpoint3 and squashes the following commits:
      
      3fbd7ba [Joseph K. Bradley] tiny fix
      b3e160c [Joseph K. Bradley] unset checkpoint dir after test
      9cc3a04 [Joseph K. Bradley] added checkpointing to GBTs
      be7be6d4
    • [SPARK-8671] [ML] Added isotonic regression to the pipeline API. · 7f7a319c
      martinzapletal authored
      Author: martinzapletal <zapletal-martin@email.cz>
      
      Closes #7517 from zapletal-martin/SPARK-8671-isotonic-regression-api and squashes the following commits:
      
      8c435c1 [martinzapletal] Review https://github.com/apache/spark/pull/7517 feedback update.
      bebbb86 [martinzapletal] Merge remote-tracking branch 'upstream/master' into SPARK-8671-isotonic-regression-api
      b68efc0 [martinzapletal] Added tests for param validation.
      07c12bd [martinzapletal] Comments and refactoring.
      834fcf7 [martinzapletal] Merge remote-tracking branch 'upstream/master' into SPARK-8671-isotonic-regression-api
      b611fee [martinzapletal] SPARK-8671. Added first version of isotonic regression to pipeline API
      7f7a319c
    • [SPARK-9479] [STREAMING] [TESTS] Fix ReceiverTrackerSuite failure for maven... · 0dbd6963
      zsxwing authored
      [SPARK-9479] [STREAMING] [TESTS] Fix ReceiverTrackerSuite failure for maven build and other potential test failures in Streaming
      
      See https://issues.apache.org/jira/browse/SPARK-9479 for the failure cause.
      
      The PR includes the following changes:
      1. Make ReceiverTrackerSuite create StreamingContext in the test body.
      2. Fix places that don't stop StreamingContext. I verified no SparkContext was stopped in the shutdown hook locally after this fix.
      3. Fix an issue that `ReceiverTracker.endpoint` may be null.
      4. Make sure stopping SparkContext in non-main thread won't fail other tests.
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #7797 from zsxwing/fix-ReceiverTrackerSuite and squashes the following commits:
      
      3a4bb98 [zsxwing] Fix another potential NPE
      d7497df [zsxwing] Fix ReceiverTrackerSuite; make sure StreamingContext in tests is closed
      0dbd6963
    • [SPARK-9454] Change LDASuite tests to use vector comparisons · 89cda69e
      Feynman Liang authored
      jkbradley Replaces the current hacky string comparison with vector comparisons.
      
      Author: Feynman Liang <fliang@databricks.com>
      
      Closes #7775 from feynmanliang/SPARK-9454-ldasuite-vector-compare and squashes the following commits:
      
      bd91a82 [Feynman Liang] Remove println
      905c76e [Feynman Liang] Fix string compare in distributed EM
      2f24c13 [Feynman Liang] Improve LDASuite tests
      89cda69e
    • [SPARK-8186] [SPARK-8187] [SPARK-8194] [SPARK-8198] [SPARK-9133] [SPARK-9290]... · 1abf7dc1
      Daoyuan Wang authored
      [SPARK-8186] [SPARK-8187] [SPARK-8194] [SPARK-8198] [SPARK-9133] [SPARK-9290] [SQL] functions: date_add, date_sub, add_months, months_between, time-interval calculation
      
      This PR is based on #7589 , thanks to adrian-wang
      
      Added SQL functions date_add, date_sub, add_months, and months_between; also added a rule for
      adding/subtracting a date/timestamp and an interval.
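To make the intended semantics concrete, here is a minimal sketch in plain Python (not Spark code). The function names mirror the SQL functions listed above, but the month-end clamping rule in `add_months` is an assumption made for illustration and is not necessarily the rule the patch implements.

```python
import calendar
from datetime import date, timedelta


def date_add(start, days):
    """Return the date `days` days after `start`."""
    return start + timedelta(days=days)


def date_sub(start, days):
    """Return the date `days` days before `start`."""
    return start - timedelta(days=days)


def add_months(start, months):
    """Shift `start` by `months` calendar months.

    Assumption: when the target month is shorter, the day is clamped to
    the last valid day of that month.
    """
    index = start.year * 12 + (start.month - 1) + months
    year, month0 = divmod(index, 12)
    last_day = calendar.monthrange(year, month0 + 1)[1]
    return date(year, month0 + 1, min(start.day, last_day))
```

With this convention, adding one month to 2015-01-31 yields 2015-02-28 rather than an invalid date.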
      
      Closes #7589
      
      cc rxin
      
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7754 from davies/date_add and squashes the following commits:
      
      e8c633a [Davies Liu] Merge branch 'master' of github.com:apache/spark into date_add
      9e8e085 [Davies Liu] Merge branch 'master' of github.com:apache/spark into date_add
      6224ce4 [Davies Liu] fix conflict
      bd18cd4 [Davies Liu] Merge branch 'master' of github.com:apache/spark into date_add
      e47ff2c [Davies Liu] add python api, fix date functions
      01943d0 [Davies Liu] Merge branch 'master' into date_add
      522e91a [Daoyuan Wang] fix
      e8a639a [Daoyuan Wang] fix
      42df486 [Daoyuan Wang] fix style
      87c4b77 [Daoyuan Wang] function add_months, months_between and some fixes
      1a68e03 [Daoyuan Wang] poc of time interval calculation
      c506661 [Daoyuan Wang] function date_add , date_sub
      1abf7dc1
    • [SPARK-5567] [MLLIB] Add predict method to LocalLDAModel · d8cfd531
      Feynman Liang authored
      jkbradley hhbyyh
      
      Adds `topicDistributions` to LocalLDAModel. Please review after #7757 is merged.
      
      Author: Feynman Liang <fliang@databricks.com>
      
      Closes #7760 from feynmanliang/SPARK-5567-predict-in-LDA and squashes the following commits:
      
      0ad1134 [Feynman Liang] Remove println
      27b3877 [Feynman Liang] Code review fixes
      6bfb87c [Feynman Liang] Remove extra newline
      476f788 [Feynman Liang] Fix checks and doc for variationalInference
      061780c [Feynman Liang] Code review cleanup
      3be2947 [Feynman Liang] Rename topicDistribution -> topicDistributions
      2a821a6 [Feynman Liang] Add predict methods to LocalLDAModel
      d8cfd531
    • [SPARK-9460] Fix prefix generation for UTF8String. · a20e743f
      Reynold Xin authored
      Previously we could get garbage data if the number of bytes was 0, on JVMs that are 4-byte aligned, or when compressed oops were enabled.
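The idea of masking out data beyond the valid range can be sketched in plain Python. This is only an illustration of the principle, not the patch itself: the function name `compute_prefix` is hypothetical, and the sketch operates on a `bytes` value rather than on raw memory words.

```python
def compute_prefix(data: bytes) -> int:
    """Pack up to the first 8 bytes of `data` into an unsigned 64-bit integer.

    Positions past the end of the string are zero-filled, so no byte outside
    the valid range can influence the result. An empty string gives 0.
    """
    return int.from_bytes(data[:8].ljust(8, b"\x00"), "big")
```

Because the bytes are packed big-endian, comparing two prefixes as integers agrees with a byte-wise comparison of the first eight bytes.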
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7789 from rxin/utf8string and squashes the following commits:
      
      86ffa3e [Reynold Xin] Mask out data outside of valid range.
      4d647ed [Reynold Xin] Mask out data.
      c6e8794 [Reynold Xin] [SPARK-9460] Fix prefix generation for UTF8String.
      a20e743f
    • [SPARK-8174] [SPARK-8175] [SQL] function unix_timestamp, from_unixtime · 6d94bf6a
      Daoyuan Wang authored
      unix_timestamp(): long
      Gets current Unix timestamp in seconds.
      
      unix_timestamp(string|date): long
      Converts a time string in the format yyyy-MM-dd HH:mm:ss to a Unix timestamp (in seconds), using the default timezone and the default locale; returns null on failure: unix_timestamp('2009-03-20 11:30:01') = 1237573801
      
      unix_timestamp(string date, string pattern): long
      Converts a time string with the given pattern (see [http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html]) to a Unix timestamp (in seconds); returns null on failure: unix_timestamp('2009-03-20', 'yyyy-MM-dd') = 1237532400.
      
      from_unixtime(bigint unixtime[, string format]): string
      Converts the number of seconds since the Unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone, in the format "1970-01-01 00:00:00".
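The behaviour described above can be approximated in plain Python as a rough sketch. Two assumptions are made here: the two sample values quoted in this message are consistent with a default zone of UTC-7, so a fixed UTC-7 offset stands in for "the default timezone"; and Python `strptime` patterns are used in place of Java `SimpleDateFormat` patterns. `None` plays the role of SQL null.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC-7 offset stands in for the default time zone.
ASSUMED_ZONE = timezone(timedelta(hours=-7))
DEFAULT_PATTERN = "%Y-%m-%d %H:%M:%S"  # Python spelling of yyyy-MM-dd HH:mm:ss


def unix_timestamp(text, pattern=DEFAULT_PATTERN, tz=ASSUMED_ZONE):
    """Parse `text` and return seconds since the epoch, or None on failure."""
    try:
        parsed = datetime.strptime(text, pattern)
    except ValueError:
        return None
    return int(parsed.replace(tzinfo=tz).timestamp())


def from_unixtime(seconds, pattern=DEFAULT_PATTERN, tz=ASSUMED_ZONE):
    """Format `seconds` since the epoch as a string in the zone `tz`."""
    return datetime.fromtimestamp(seconds, tz).strftime(pattern)
```

Under these assumptions, `unix_timestamp("2009-03-20 11:30:01")` reproduces the value 1237573801 quoted above, and an unparseable string yields `None` instead of raising.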
      
      Jira:
      https://issues.apache.org/jira/browse/SPARK-8174
      https://issues.apache.org/jira/browse/SPARK-8175
      
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #7644 from adrian-wang/udfunixtime and squashes the following commits:
      
      2fe20c4 [Daoyuan Wang] util.Date
      ea2ec16 [Daoyuan Wang] use util.Date for better performance
      a2cf929 [Daoyuan Wang] doc return null instead of 0
      f6f070a [Daoyuan Wang] address comments from davies
      6a4cbb3 [Daoyuan Wang] temp
      56ded53 [Daoyuan Wang] rebase and address comments
      14a8b37 [Daoyuan Wang] function unix_timestamp, from_unixtime
      6d94bf6a
    • [SPARK-9437] [CORE] avoid overflow in SizeEstimator · 06b6a074
      Imran Rashid authored
      https://issues.apache.org/jira/browse/SPARK-9437
      
      Author: Imran Rashid <irashid@cloudera.com>
      
      Closes #7750 from squito/SPARK-9437_size_estimator_overflow and squashes the following commits:
      
      29493f1 [Imran Rashid] prevent another potential overflow
      bc1cb82 [Imran Rashid] avoid overflow
      06b6a074
    • [SPARK-8850] [SQL] Enable Unsafe mode by default · 520ec0ff
      Josh Rosen authored
      This pull request enables Unsafe mode by default in Spark SQL. In order to do this, we had to fix a number of small issues:
      
      **List of fixed blockers**:
      
      - [x] Make some default buffer sizes configurable so that HiveCompatibilitySuite can run properly (#7741).
      - [x] Memory leak on grouped aggregation of empty input (fixed by #7560)
      - [x] Update planner to also check whether codegen is enabled before planning unsafe operators.
      - [x] Investigate failing HiveThriftBinaryServerSuite test.  This turns out to be caused by a ClassCastException that occurs when Exchange tries to apply an interpreted RowOrdering to an UnsafeRow when range partitioning an RDD.  This could be fixed by #7408, but a shorter-term fix is to just skip the Unsafe exchange path when RangePartitioner is used.
      - [x] Memory leak exceptions masking exceptions that actually caused tasks to fail (will be fixed by #7603).
      - [x]  ~~https://issues.apache.org/jira/browse/SPARK-9162, to implement code generation for ScalaUDF.  This is necessary for `UDFSuite` to pass.  For now, I've just ignored this test in order to try to find other problems while we wait for a fix.~~ This is no longer necessary as of #7682.
      - [x] Memory leaks from Limit after UnsafeExternalSort cause the memory leak detector to fail tests. This is a huge problem in the HiveCompatibilitySuite (fixed by f4ac642a4e5b2a7931c5e04e086bb10e263b1db6).
      - [x] Tests in `AggregationQuerySuite` are failing due to NaN-handling issues in UnsafeRow, which were fixed in #7736.
      - [x] `org.apache.spark.sql.ColumnExpressionSuite.rand` needs to be updated so that the planner check also matches `TungstenProject`.
      - [x] After having lowered the buffer sizes to 4MB so that most of HiveCompatibilitySuite runs:
        - [x] Wrong answer in `join_1to1` (fixed by #7680)
        - [x] Wrong answer in `join_nulls` (fixed by #7680)
        - [x] Managed memory OOM / leak in `lateral_view`
        - [x] Seems to hang indefinitely in `partcols1`.  This might be a deadlock in script transformation or a bug in error-handling code? The hang was fixed by #7710.
        - [x] Error while freeing memory in `partcols1`: will be fixed by #7734.
      - [x] After fixing the `partcols1` hang, it appears that a number of later tests have issues as well.
      - [x] Fix thread-safety bug in codegen fallback expression evaluation (#7759).
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #7564 from JoshRosen/unsafe-by-default and squashes the following commits:
      
      83c0c56 [Josh Rosen] Merge remote-tracking branch 'origin/master' into unsafe-by-default
      f4cc859 [Josh Rosen] Merge remote-tracking branch 'origin/master' into unsafe-by-default
      963f567 [Josh Rosen] Reduce buffer size for R tests
      d6986de [Josh Rosen] Lower page size in PySpark tests
      013b9da [Josh Rosen] Also match TungstenProject in checkNumProjects
      5d0b2d3 [Josh Rosen] Add task completion callback to avoid leak in limit after sort
      ea250da [Josh Rosen] Disable unsafe Exchange path when RangePartitioning is used
      715517b [Josh Rosen] Enable Unsafe by default
      520ec0ff
    • [SPARK-9388] [YARN] Make executor info log messages easier to read. · ab78b1d2
      Marcelo Vanzin authored
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #7706 from vanzin/SPARK-9388 and squashes the following commits:
      
      028b990 [Marcelo Vanzin] Single log statement.
      3c5fb6a [Marcelo Vanzin] YARN not Yarn.
      5bcd7a0 [Marcelo Vanzin] [SPARK-9388] [yarn] Make executor info log messages easier to read.
      ab78b1d2
    • [SPARK-8297] [YARN] Scheduler backend is not notified in case node fails in YARN · e5353465
      Mridul Muralidharan authored
      This change adds code to notify the scheduler backend when a container dies in YARN.
      
      Author: Mridul Muralidharan <mridulm@yahoo-inc.com>
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #7431 from vanzin/SPARK-8297 and squashes the following commits:
      
      471e4a0 [Marcelo Vanzin] Fix unit test after merge.
      d4adf4e [Marcelo Vanzin] Merge branch 'master' into SPARK-8297
      3b262e8 [Marcelo Vanzin] Merge branch 'master' into SPARK-8297
      537da6f [Marcelo Vanzin] Make an expected log less scary.
      04dc112 [Marcelo Vanzin] Use driver <-> AM communication to send "remove executor" request.
      8855b97 [Marcelo Vanzin] Merge remote-tracking branch 'mridul/fix_yarn_scheduler_bug' into SPARK-8297
      687790f [Mridul Muralidharan] Merge branch 'fix_yarn_scheduler_bug' of github.com:mridulm/spark into fix_yarn_scheduler_bug
      e1b0067 [Mridul Muralidharan] Fix failing testcase, fix merge issue from our 1.3 -> master
      9218fcc [Mridul Muralidharan] Fix failing testcase
      362d64a [Mridul Muralidharan] Merge branch 'fix_yarn_scheduler_bug' of github.com:mridulm/spark into fix_yarn_scheduler_bug
      62ad0cc [Mridul Muralidharan] Merge branch 'fix_yarn_scheduler_bug' of github.com:mridulm/spark into fix_yarn_scheduler_bug
      bbf8811 [Mridul Muralidharan] Merge branch 'fix_yarn_scheduler_bug' of github.com:mridulm/spark into fix_yarn_scheduler_bug
      9ee1307 [Mridul Muralidharan] Fix SPARK-8297
      a3a0f01 [Mridul Muralidharan] Fix SPARK-8297
      e5353465
    • [SPARK-9361] [SQL] Refactor new aggregation code to reduce the times of checking compatibility · 5363ed71
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-9361
      
      Currently, we call `aggregate.Utils.tryConvert` in many places to check whether the `logical.Aggregate` can be run with the new aggregation code path. However, `aggregate.Utils.tryConvert` appears to take considerable time to run. We should call `tryConvert` only once, keep its value in `logical.Aggregate`, and reuse it.
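The "compute once, keep the value, reuse it" pattern can be sketched in plain Python as follows. Everything in this sketch is hypothetical: the class `LogicalAggregate`, its `_try_convert` method, and the set of supported function names are stand-ins, not Spark's actual classes or logic.

```python
from functools import cached_property


class LogicalAggregate:
    """Hypothetical stand-in for a logical plan node (not Spark's class)."""

    def __init__(self, expressions):
        self.expressions = expressions
        self.conversion_calls = 0  # instrumentation for this example only

    def _try_convert(self):
        """Pretend to be the expensive compatibility check."""
        self.conversion_calls += 1
        supported = all(e in {"sum", "avg", "count"} for e in self.expressions)
        return tuple(self.expressions) if supported else None

    @cached_property
    def new_aggregation(self):
        """Computed on first access, then reused by every later caller."""
        return self._try_convert()
```

Every caller reads `new_aggregation`; the expensive check runs on the first read only, and a `None` result (meaning "not convertible") is cached as well.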
      
      In `org.apache.spark.sql.execution.aggregate.Utils`, the code involving `tryConvert` should be moved to catalyst because it doesn't actually deal with execution details.
      
      Author: Liang-Chi Hsieh <viirya@appier.com>
      
      Closes #7677 from viirya/refactor_aggregate and squashes the following commits:
      
      babea30 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into refactor_aggregate
      9a589d7 [Liang-Chi Hsieh] Fix scala style.
      0a91329 [Liang-Chi Hsieh] Refactor new aggregation code to reduce the times to call tryConvert.
      5363ed71
    • [SPARK-9267] [CORE] Retire stringify(Partial)?Value from Accumulators · 7bbf02f0
      François Garillot authored
      cc srowen
      
      Author: François Garillot <francois@garillot.net>
      
      Closes #7678 from huitseeker/master and squashes the following commits:
      
      5e99f57 [François Garillot] [SPARK-9267][Core] Retire stringify(Partial)?Value from Accumulators
      7bbf02f0
    • [SPARK-9390][SQL] create a wrapper for array type · c0cc0eae
      Wenchen Fan authored
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #7724 from cloud-fan/array-data and squashes the following commits:
      
      d0408a1 [Wenchen Fan] fix python
      661e608 [Wenchen Fan] rebase
      f39256c [Wenchen Fan] fix hive...
      6dbfa6f [Wenchen Fan] fix hive again...
      8cb8842 [Wenchen Fan] remove element type parameter from getArray
      43e9816 [Wenchen Fan] fix mllib
      e719afc [Wenchen Fan] fix hive
      4346290 [Wenchen Fan] address comment
      d4a38da [Wenchen Fan] remove sizeInBytes and add license
      7e283e2 [Wenchen Fan] create a wrapper for array type
      c0cc0eae
    • [SPARK-9248] [SPARKR] Closing curly-braces should always be on their own line · 7492a33f
      Yuu ISHIKAWA authored
      ### JIRA
      [[SPARK-9248] Closing curly-braces should always be on their own line - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-9248)
      
      ## The result of `dev/lint-r`
      [The result of `dev/lint-r` for SPARK-9248 at the revision:6175d6cf](https://gist.github.com/yu-iskw/96cadcea4ce664c41f81)
      
      Author: Yuu ISHIKAWA <yuu.ishikawa@gmail.com>
      
      Closes #7795 from yu-iskw/SPARK-9248 and squashes the following commits:
      
      c8eccd3 [Yuu ISHIKAWA] [SPARK-9248][SparkR] Closing curly-braces should always be on their own line
      7492a33f
    • [MINOR] [MLLIB] fix doc for RegexTokenizer · 81464f2a
      Xiangrui Meng authored
      This is #7791 for Python. hhbyyh
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #7798 from mengxr/regex-tok-py and squashes the following commits:
      
      baa2dcd [Xiangrui Meng] fix doc for RegexTokenizer
      81464f2a
    • [SPARK-9277] [MLLIB] SparseVector constructor must throw an error when... · ed3cb1d2
      Sean Owen authored
      [SPARK-9277] [MLLIB] SparseVector constructor must throw an error when declared number of elements less than array length
      
      Check that SparseVector size is at least as big as the number of indices/values provided. And add tests for constructor checks.
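A constructor check of this kind might look like the following plain-Python sketch. The class `SimpleSparseVector` is hypothetical and is neither the Scala `SparseVector` nor the PySpark one; the equal-length check on indices and values is an extra assumption added for the example.

```python
class SimpleSparseVector:
    """Hypothetical sparse vector used only to illustrate the checks."""

    def __init__(self, size, indices, values):
        # Assumed extra check: each index needs exactly one value.
        if len(indices) != len(values):
            raise ValueError(
                f"got {len(indices)} indices but {len(values)} values"
            )
        # The check described above: the declared size must be at least
        # as large as the number of indices/values provided.
        if len(indices) > size:
            raise ValueError(
                f"{len(indices)} indices were provided, "
                f"but the declared size is only {size}"
            )
        self.size = size
        self.indices = list(indices)
        self.values = list(values)
```

Failing in the constructor surfaces the inconsistency where the vector is built rather than later, when an operation reads past the declared size.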
      
      CC MechCoder jkbradley -- I am not sure if a change needs to also happen in the Python API? I didn't see it had any similar checks to begin with, but I don't know it well.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #7794 from srowen/SPARK-9277 and squashes the following commits:
      
      e8dc31e [Sean Owen] Fix scalastyle
      6ffe34a [Sean Owen] Check that SparseVector size is at least as big as the number of indices/values provided. And add tests for constructor checks.
      ed3cb1d2
    • [SPARK-9225] [MLLIB] LDASuite needs unit tests for empty documents · a6e53a9c
      Meihua Wu authored
      Add unit tests for running LDA with empty documents.
      Both EMLDAOptimizer and OnlineLDAOptimizer are tested.
      
      feynmanliang
      
      Author: Meihua Wu <meihuawu@umich.edu>
      
      Closes #7620 from rotationsymmetry/SPARK-9225 and squashes the following commits:
      
      3ed7c88 [Meihua Wu] Incorporate reviewer's further comments
      f9432e8 [Meihua Wu] Incorporate reviewer's comments
      8e1b9ec [Meihua Wu] Merge remote-tracking branch 'upstream/master' into SPARK-9225
      ad55665 [Meihua Wu] Add unit tests for running LDA with empty documents
      a6e53a9c
    • [SPARK-] [MLLIB] minor fix on tokenizer doc · 9c0501c5
      Yuhao Yang authored
      A trivial fix for the comments of RegexTokenizer.
      
      Maybe this is too small, yet I just noticed it and think it can be quite misleading. I can create a jira if necessary.
      
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #7791 from hhbyyh/docFix and squashes the following commits:
      
      cdf2542 [Yuhao Yang] minor fix on tokenizer doc
      9c0501c5
    • [SPARK-8998] [MLLIB] Distribute PrefixSpan computation for large projected databases · d212a314
      zhangjiajin authored
      Continuation of work by zhangjiajin
      
      Closes #7412
      
      Author: zhangjiajin <zhangjiajin@huawei.com>
      Author: Feynman Liang <fliang@databricks.com>
      Author: zhang jiajin <zhangjiajin@huawei.com>
      
      Closes #7783 from feynmanliang/SPARK-8998-improve-distributed and squashes the following commits:
      
      a61943d [Feynman Liang] Collect small patterns to local
      4ddf479 [Feynman Liang] Parallelize freqItemCounts
      ad23aa9 [zhang jiajin] Merge pull request #1 from feynmanliang/SPARK-8998-collectBeforeLocal
      87fa021 [Feynman Liang] Improve extend prefix readability
      c2caa5c [Feynman Liang] Readability improvements and comments
      1235cfc [Feynman Liang] Use Iterable[Array[_]] over Array[Array[_]] for database
      da0091b [Feynman Liang] Use lists for prefixes to reuse data
      cb2a4fc [Feynman Liang] Inline code for readability
      01c9ae9 [Feynman Liang] Add getters
      6e149fa [Feynman Liang] Fix splitPrefixSuffixPairs
      64271b3 [zhangjiajin] Modified codes according to comments.
      d2250b7 [zhangjiajin] remove minPatternsBeforeLocalProcessing, add maxSuffixesBeforeLocalProcessing.
      b07e20c [zhangjiajin] Merge branch 'master' of https://github.com/apache/spark into CollectEnoughPrefixes
      095aa3a [zhangjiajin] Modified the code according to the review comments.
      baa2885 [zhangjiajin] Modified the code according to the review comments.
      6560c69 [zhangjiajin] Add feature: Collect enough frequent prefixes before projection in PrefixSpan
      a8fde87 [zhangjiajin] Merge branch 'master' of https://github.com/apache/spark
      4dd1c8a [zhangjiajin] initialize file before rebase.
      078d410 [zhangjiajin] fix a scala style error.
      22b0ef4 [zhangjiajin] Add feature: Collect enough frequent prefixes before projection in PrefixSpan.
      ca9c4c8 [zhangjiajin] Modified the code according to the review comments.
      574e56c [zhangjiajin] Add new object LocalPrefixSpan, and do some optimization.
      ba5df34 [zhangjiajin] Fix a Scala style error.
      4c60fb3 [zhangjiajin] Fix some Scala style errors.
      1dd33ad [zhangjiajin] Modified the code according to the review comments.
      89bc368 [zhangjiajin] Fixed a Scala style error.
      a2eb14c [zhang jiajin] Delete PrefixspanSuite.scala
      951fd42 [zhang jiajin] Delete Prefixspan.scala
      575995f [zhangjiajin] Modified the code according to the review comments.
      91fd7e6 [zhangjiajin] Add new algorithm PrefixSpan and test file.
      d212a314
    • [SPARK-5561] [MLLIB] Generalized PeriodicCheckpointer for RDDs and Graphs · c5815930
      Joseph K. Bradley authored
      PeriodicGraphCheckpointer was introduced for Latent Dirichlet Allocation (LDA), but it was meant to be generalized to work with Graphs, RDDs, and other data structures based on RDDs.  This PR generalizes it.
      
      For those who are not familiar with the periodic checkpointer, it tries to automatically handle persisting/unpersisting and checkpointing/removing checkpoint files in a lineage of RDD-based objects.
      
      I need it generalized to use with GradientBoostedTrees [https://issues.apache.org/jira/browse/SPARK-6684].  It should be useful for other iterative algorithms as well.
      
      Changes I made:
      * Copied PeriodicGraphCheckpointer to PeriodicCheckpointer.
      * Within PeriodicCheckpointer, I created abstract methods for the basic operations (checkpoint, persist, etc.).
      * The subclasses for Graphs and RDDs implement those abstract methods.
      * I copied the test suite for the graph checkpointer and made tiny modifications to make it work for RDDs.
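The structure described in the list above (a base class with abstract methods for the basic operations, and subclasses that implement them) can be sketched in plain Python. This is an illustration only: the method names, the `max_persisted` retention policy, and the `RecordingCheckpointer` subclass are assumptions, and removal of old checkpoint files is left out.

```python
from abc import ABC, abstractmethod
from collections import deque


class PeriodicCheckpointer(ABC):
    """Sketch of a generic periodic checkpointer (names are hypothetical)."""

    def __init__(self, checkpoint_interval, max_persisted=2):
        self.checkpoint_interval = checkpoint_interval
        self.max_persisted = max_persisted  # assumed retention policy
        self.update_count = 0
        self.persisted = deque()  # oldest item first

    @abstractmethod
    def persist(self, data): ...

    @abstractmethod
    def unpersist(self, data): ...

    @abstractmethod
    def checkpoint(self, data): ...

    def update(self, new_data):
        """Persist the new item, drop old ones, checkpoint every Nth update."""
        self.persist(new_data)
        self.persisted.append(new_data)
        while len(self.persisted) > self.max_persisted:
            self.unpersist(self.persisted.popleft())
        self.update_count += 1
        if self.update_count % self.checkpoint_interval == 0:
            self.checkpoint(new_data)


class RecordingCheckpointer(PeriodicCheckpointer):
    """Toy subclass that records each operation instead of touching storage."""

    def __init__(self, checkpoint_interval):
        super().__init__(checkpoint_interval)
        self.log = []

    def persist(self, data):
        self.log.append(("persist", data))

    def unpersist(self, data):
        self.log.append(("unpersist", data))

    def checkpoint(self, data):
        self.log.append(("checkpoint", data))
```

A subclass for RDDs or Graphs would replace the three recording methods with calls on the underlying data structure, while `update` stays shared in the base class.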
      
      To review this PR, I recommend doing 2 diffs:
      (1) diff between the old PeriodicGraphCheckpointer.scala and the new PeriodicCheckpointer.scala
      (2) diff between the 2 test suites
      
      CCing andrewor14 in case there are relevant changes to checkpointing.
      CCing feynmanliang in case you're interested in learning about checkpointing.
      CCing mengxr for final OK.
      Thanks all!
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #7728 from jkbradley/gbt-checkpoint and squashes the following commits:
      
      d41902c [Joseph K. Bradley] Oops, forgot to update an extra time in the checkpointer tests, after the last commit. I'll fix that. I'll also make some of the checkpointer methods protected, which I should have done before.
      32b23b8 [Joseph K. Bradley] fixed usage of checkpointer in lda
      0b3dbc0 [Joseph K. Bradley] Changed checkpointer constructor not to take initial data.
      568918c [Joseph K. Bradley] Generalized PeriodicGraphCheckpointer to PeriodicCheckpointer, with subclasses for RDDs and Graphs.
      c5815930
    • [SPARK-7368] [MLLIB] Add QR decomposition for RowMatrix · d31c618e
      Yuhao Yang authored
      jira: https://issues.apache.org/jira/browse/SPARK-7368
      Add QR decomposition for RowMatrix.
      
      I'm not sure what the community's roadmap for distributed matrices is, or whether this would be a desirable feature, so I'm sending a prototype for discussion. I'll go on to polish the code and provide unit tests and performance statistics if it's acceptable.
      
      The implementation refers to the [paper: https://www.cs.purdue.edu/homes/dgleich/publications/Benson%202013%20-%20direct-tsqr.pdf]
      Austin R. Benson, David F. Gleich, James Demmel. "Direct QR factorizations for tall-and-skinny matrices in MapReduce architectures", 2013 IEEE International Conference on Big Data, which is a stable algorithm with good scalability.
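For readers unfamiliar with tall-and-skinny QR, the core reduction step can be sketched in plain Python. This is a simplified, single-level illustration that computes only the R factor, not the patch's implementation and not the full algorithm from the paper. The helper names `qr_r` and `tsqr_r` are hypothetical, and the sketch assumes every row block has full column rank and at least as many rows as columns.

```python
import math


def qr_r(rows):
    """Return the upper-triangular R of a thin QR (modified Gram-Schmidt)."""
    n = len(rows[0])
    cols = [[float(row[j]) for row in rows] for j in range(n)]
    r = [[0.0] * n for _ in range(n)]
    for j in range(n):
        r[j][j] = math.sqrt(sum(x * x for x in cols[j]))
        q = [x / r[j][j] for x in cols[j]]
        # Remove the component along q from every remaining column.
        for k in range(j + 1, n):
            r[j][k] = sum(a * b for a, b in zip(q, cols[k]))
            cols[k] = [b - r[j][k] * a for a, b in zip(q, cols[k])]
    return r


def tsqr_r(rows, block_size):
    """Compute R by factoring row blocks separately, then the stacked Rs."""
    blocks = [rows[i:i + block_size] for i in range(0, len(rows), block_size)]
    stacked = [r_row for block in blocks for r_row in qr_r(block)]
    return qr_r(stacked)
```

Each block plays the role of one partition: only the small n-by-n R factors are combined, so the second factorization works on a matrix far smaller than the original.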
      
      Currently I tried it on a 400000 * 500 rowMatrix (16 partitions) and it can bring down the computation time from 8.8 mins (using breeze.linalg.qr.reduced)  to 2.6 mins on a 4 worker cluster. I think there will still be some room for performance improvement.
      
      Trials and suggestions are welcome.
      
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #5909 from hhbyyh/qrDecomposition and squashes the following commits:
      
      cec797b [Yuhao Yang] remove unnecessary qr
      0fb1012 [Yuhao Yang] hierarchy R computing
      3fbdb61 [Yuhao Yang] update qr to indirect and add ut
      0d913d3 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into qrDecomposition
      39213c3 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into qrDecomposition
      c0fc0c7 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into qrDecomposition
      39b0b22 [Yuhao Yang] initial draft for discussion
      d31c618e
    • [SPARK-8838] [SQL] Add config to enable/disable merging part-files when merging parquet schema · 6175d6cf
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-8838
      
      Currently, all part-files are merged when merging the Parquet schema. However, when there are many part-files and we can be sure that all of them have the same schema as their summary file, this is unnecessary. For that case, we provide a configuration to disable merging part-files when merging the Parquet schema.
      
      In short, we need to merge Parquet schemas because different summary files may contain different schemas, but the part-files are known to have the same schema as their summary files.
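The decision this configuration controls can be pictured with a small plain-Python sketch. The function `footers_to_read` and its `merge_part_files` flag are hypothetical names chosen for illustration, not the actual configuration key or Spark code. The fallback to part-files when no summary files exist follows the squashed commit "Merge part-files if there are no summary files".

```python
def footers_to_read(summary_files, part_files, merge_part_files):
    """Choose which files' schemas take part in schema merging.

    Hypothetical helper: `merge_part_files` stands in for the new option.
    """
    if merge_part_files or not summary_files:
        # Default behaviour, and the fallback when there is nothing
        # to trust in place of the part-files.
        return list(summary_files) + list(part_files)
    # Part-files are assumed to match their summary files, so skip them.
    return list(summary_files)
```

With many part-files, skipping them reduces the number of schemas to merge to the number of summary files.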
      
      Author: Liang-Chi Hsieh <viirya@appier.com>
      
      Closes #7238 from viirya/option_partfile_merge and squashes the following commits:
      
      71d5b5f [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into option_partfile_merge
      8816f44 [Liang-Chi Hsieh] For comments.
      dbc8e6b [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into option_partfile_merge
      afc2fa1 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into option_partfile_merge
      d4ed7e6 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into option_partfile_merge
      df43027 [Liang-Chi Hsieh] Get dataStatuses' partitions based on all paths.
      4eb2f00 [Liang-Chi Hsieh] Use given parameter.
      ea8f6e5 [Liang-Chi Hsieh] Correct the code comments.
      a57be0e [Liang-Chi Hsieh] Merge part-files if there are no summary files.
      47df981 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into option_partfile_merge
      4caf293 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into option_partfile_merge
      0e734e0 [Liang-Chi Hsieh] Use correct API.
      3b6be5b [Liang-Chi Hsieh] Fix key not found.
      4bdd7e0 [Liang-Chi Hsieh] Don't read footer files if we can skip them.
      8bbebcb [Liang-Chi Hsieh] Figure out how to test the config.
      bbd4ce7 [Liang-Chi Hsieh] Add config to enable/disable merging part-files when merging parquet schema.
      6175d6cf
    • Fix flaky HashedRelationSuite · 5ba2d440
      Reynold Xin authored
      SparkEnv might not have been set in local unit tests.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7784 from rxin/HashedRelationSuite and squashes the following commits:
      
      435d64b [Reynold Xin] Fix flaky HashedRelationSuite
      5ba2d440
    • Revert "[SPARK-9458] Avoid object allocation in prefix generation." · 4a8bb9d0
      Reynold Xin authored
      This reverts commit 9514d874.
      4a8bb9d0
    • [SPARK-9335] [TESTS] Enable Kinesis tests only when files in extras/kinesis-asl are changed · 76f2e393
      zsxwing authored
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #7711 from zsxwing/SPARK-9335-test and squashes the following commits:
      
      c13ec2f [zsxwing] environs -> environ
      69c2865 [zsxwing] Merge remote-tracking branch 'origin/master' into SPARK-9335-test
      ef84a08 [zsxwing] Revert "Modify the Kinesis project to trigger ENABLE_KINESIS_TESTS"
      f691028 [zsxwing] Modify the Kinesis project to trigger ENABLE_KINESIS_TESTS
      7618205 [zsxwing] Enable Kinesis tests only when files in extras/kinesis-asl are changed
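
The gating idea in the title can be sketched roughly as follows. This is only an illustration, not the project's actual test-runner code: the helper names `touches_kinesis` and `build_test_environ` are hypothetical, and the value `"1"` for `ENABLE_KINESIS_TESTS` is an assumption.

```python
import os

KINESIS_DIR = "extras/kinesis-asl"  # directory named in the commit title


def touches_kinesis(changed_files):
    """True if any changed path lies inside the Kinesis module directory."""
    return any(
        path == KINESIS_DIR or path.startswith(KINESIS_DIR + "/")
        for path in changed_files
    )


def build_test_environ(changed_files, base_environ=None):
    """Return a copy of the environment, flagging Kinesis tests when needed."""
    environ = dict(os.environ if base_environ is None else base_environ)
    if touches_kinesis(changed_files):
        # Assumed flag value; the source only names the variable.
        environ["ENABLE_KINESIS_TESTS"] = "1"
    return environ
```

Matching on the directory plus a trailing slash keeps sibling directories with a similar prefix from enabling the tests by accident.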
      76f2e393
    • Joseph Batchik's avatar
      [SPARK-8005][SQL] Input file name · 1221849f
      Joseph Batchik authored
      Users can now get the name of the file backing the partition being read. The name is held in a thread-local variable in `SQLNewHadoopRDD`, which is set when the partition is computed. `SQLNewHadoopRDD` is moved to core so that the catalyst package can reach it.
      
      This supports:
      
      `df.select(inputFileName())`
      
      and
      
      `sqlContext.sql("select input_file_name() from table")`
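
The thread-local mechanism described above can be sketched without Spark. This is a minimal illustration under stated assumptions, not the `SQLNewHadoopRDD` implementation: `_input_file`, `compute_partition`, and `run_in_threads` are hypothetical names, and the empty-string default follows the squashed commit that mentions it.

```python
import threading

# Hypothetical stand-in for the thread-local holder described above.
_input_file = threading.local()


def input_file_name() -> str:
    """Return the file being read on this thread, or "" if none is set."""
    return getattr(_input_file, "name", "")


def compute_partition(path, rows):
    """Pretend to read one partition, tagging each row with its source file."""
    _input_file.name = path  # set when the partition is computed
    try:
        return [(row, input_file_name()) for row in rows]
    finally:
        _input_file.name = ""  # reset so later reads see the default


def run_in_threads(partitions):
    """Compute each (path, rows) partition on its own thread."""
    results = {}

    def work(path, rows):
        results[path] = compute_partition(path, rows)

    threads = [threading.Thread(target=work, args=p) for p in partitions]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because the value lives in thread-local storage, two partitions computed concurrently on different threads each see only their own file name.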
      
      Author: Joseph Batchik <josephbatchik@gmail.com>
      
      Closes #7743 from JDrit/input_file_name and squashes the following commits:
      
      abb8609 [Joseph Batchik] fixed failing test and changed the default value to be an empty string
      d2f323d [Joseph Batchik] updates per review
      102061f [Joseph Batchik] updates per review
      75313f5 [Joseph Batchik] small fixes
      c7f7b5a [Joseph Batchik] addeding input file name to Spark SQL
      1221849f
    • Yijie Shen's avatar
      [SPARK-9428] [SQL] Add test cases for null inputs for expression unit tests · e127ec34
      Yijie Shen authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-9428
      
      Author: Yijie Shen <henry.yijieshen@gmail.com>
      
      Closes #7748 from yjshen/string_cleanup and squashes the following commits:
      
      e0c2b3d [Yijie Shen] update codegen in RegExpExtract and RegExpReplace
      26614d2 [Yijie Shen] MathFunctionSuite
      a402859 [Yijie Shen] complex_create, conditional and cast
      6e4e608 [Yijie Shen] arithmetic and cast
      52593c1 [Yijie Shen] null input test cases for StringExpressionSuite
      e127ec34
    • Reynold Xin's avatar
      HOTFIX: disable HashedRelationSuite. · 712465b6
      Reynold Xin authored
      712465b6
    • Davies Liu's avatar
      [SPARK-9116] [SQL] [PYSPARK] support Python only UDT in __main__ · e044705b
      Davies Liu authored
      This also makes it possible to create a Python UDT without a Scala counterpart, which is important for Python users.
      
      cc mengxr JoshRosen
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7453 from davies/class_in_main and squashes the following commits:
      
      4dfd5e1 [Davies Liu] add tests for Python and Scala UDT
      793d9b2 [Davies Liu] Merge branch 'master' of github.com:apache/spark into class_in_main
      dc65f19 [Davies Liu] address comment
      a9a3c40 [Davies Liu] Merge branch 'master' of github.com:apache/spark into class_in_main
      a86e1fc [Davies Liu] fix serialization
      ad528ba [Davies Liu] Merge branch 'master' of github.com:apache/spark into class_in_main
      63f52ef [Davies Liu] fix pylint check
      655b8a9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into class_in_main
      316a394 [Davies Liu] support Python UDT with UTF
      0bcb3ef [Davies Liu] fix bug in mllib
      de986d6 [Davies Liu] fix test
      83d65ac [Davies Liu] fix bug in StructType
      55bb86e [Davies Liu] support Python UDT in __main__ (without Scala one)
      e044705b
    • Alex Angelini's avatar
      Fix reference to self.names in StructType · f5dd1133
      Alex Angelini authored
      `names` is not defined in this context; I think you meant `self.names`.
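
The class of bug being fixed can be reproduced in a few lines. `Record` below is a hypothetical class used only for illustration; it is not the actual `StructType` code.

```python
class Record:
    """Hypothetical class; it only mirrors the attribute name in the fix."""

    def __init__(self, names):
        self.names = list(names)

    def broken_index(self, name):
        # Bug: bare `names` is looked up as a local or global, not an attribute.
        return names.index(name)  # noqa: F821

    def fixed_index(self, name):
        # Fix: reach the attribute through the instance.
        return self.names.index(name)
```

Calling `broken_index` raises `NameError` at call time (as long as no global called `names` exists), which is why this kind of mistake can go unnoticed until the code path is exercised.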
      
      davies
      
      Author: Alex Angelini <alex.louis.angelini@gmail.com>
      
      Closes #7766 from angelini/fix_struct_type_names and squashes the following commits:
      
      01543a1 [Alex Angelini] Fix reference to self.names in StructType
      f5dd1133
  2. Jul 29, 2015
    • Reynold Xin's avatar
      [SPARK-9462][SQL] Initialize nondeterministic expressions in code gen fallback mode. · 27850af5
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7767 from rxin/SPARK-9462 and squashes the following commits:
      
      ef3e2d9 [Reynold Xin] Removed println
      713ac3a [Reynold Xin] More unit tests.
      bb5c334 [Reynold Xin] [SPARK-9462][SQL] Initialize nondeterministic expressions in code gen fallback mode.
      27850af5
    • Reynold Xin's avatar
      [SPARK-9460] Avoid byte array allocation in StringPrefixComparator. · 07fd7d36
      Reynold Xin authored
      As of today, StringPrefixComparator converts the long values back to byte arrays and compares them byte by byte (unsigned). This patch compares the longs directly instead, which avoids allocating the byte arrays.
      
      This only works on little-endian architectures right now.
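
The equivalence the patch relies on can be checked with a small sketch. This is an illustration of the idea only, not Spark's code: `string_prefix`, `compare_prefixes`, and `compare_bytewise` are hypothetical names, and the prefix here is built by packing bytes most-significant-first, which sidesteps the endianness concern mentioned above.

```python
def string_prefix(s: str) -> int:
    """Pack the first 8 UTF-8 bytes of s into one unsigned 64-bit integer."""
    head = s.encode("utf-8")[:8].ljust(8, b"\x00")  # zero-pad short strings
    return int.from_bytes(head, "big")  # most significant byte first


def compare_prefixes(a: int, b: int) -> int:
    """Compare two prefixes directly as unsigned integers."""
    return (a > b) - (a < b)


def compare_bytewise(a: int, b: int) -> int:
    """Reference: turn the prefixes back into bytes and compare byte by byte."""
    x, y = a.to_bytes(8, "big"), b.to_bytes(8, "big")
    return (x > y) - (x < y)
```

Python integers are unbounded, so the direct comparison is already unsigned; in a language with signed 64-bit integers the comparison would need to be an explicitly unsigned one to give the same ordering.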
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7765 from rxin/SPARK-9460 and squashes the following commits:
      
      e4908cc [Reynold Xin] Stricter randomized tests.
      4c8d094 [Reynold Xin] [SPARK-9460] Avoid byte array allocation in StringPrefixComparator.
      07fd7d36
    • Reynold Xin's avatar
      [SPARK-9458] Avoid object allocation in prefix generation. · 9514d874
      Reynold Xin authored
      Our existing sort prefix generation code uses the expression's eval method to generate the prefix, which allocates an object for every prefix. We can use the specialized getters available on InternalRow directly to avoid that allocation.
      
      I also removed the FLOAT prefix, opting for converting float directly to double.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7763 from rxin/sort-prefix and squashes the following commits:
      
      5dc2f06 [Reynold Xin] [SPARK-9458] Avoid object allocation in prefix generation.
      9514d874
    • Feynman Liang's avatar
      [SPARK-9440] [MLLIB] Add hyperparameters to LocalLDAModel save/load · a200e645
      Feynman Liang authored
      jkbradley MechCoder
      
      Resolves blocking issue for SPARK-6793. Please review after #7705 is merged.
      
      Author: Feynman Liang <fliang@databricks.com>
      
      Closes #7757 from feynmanliang/SPARK-9940-localSaveLoad and squashes the following commits:
      
      d0d8cf4 [Feynman Liang] Fix thisClassName
      0f30109 [Feynman Liang] Fix tests after changing LDAModel public API
      dc61981 [Feynman Liang] Add hyperparams to LocalLDAModel save/load
      a200e645
    • sethah's avatar
      [SPARK-6129] [MLLIB] [DOCS] Added user guide for evaluation metrics · 2a9fe4a4
      sethah authored
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #7655 from sethah/Working_on_6129 and squashes the following commits:
      
      253db2d [sethah] removed number formatting from example code
      b769cab [sethah] rewording threshold section
      d5dad4d [sethah] adding some explanations of concepts to the eval metrics user guide
      3a61ff9 [sethah] Removing unnecessary latex commands from metrics guide
      c9dd058 [sethah] Cleaning up and formatting metrics user guide section
      6f31c21 [sethah] All example code for metrics section done
      98813fe [sethah] Most java and python example code added. Further latex formatting
      53a24fc [sethah] Adding documentations of metrics for ML algorithms to user guide
      2a9fe4a4